Re: [intel-sgx-kernel-dev] [PATCH v5 08/11] intel_sgx: in-kernel launch enclave
On Mon, 2017-11-13 at 21:45 +0200, Jarkko Sakkinen wrote: > This commits implements the in-kernel launch enclave. It is wrapped into > a user space program that reads SIGSTRUCT instances from stdin and > outputs launch tokens to stdout. > > The commit also adds enclave signing tool that is used by kbuild to > measure and sign the launch enclave. > > CONFIG_INTEL_SGX_SIGNING_KEY points to a PEM-file for the 3072-bit RSA > key that is used as the LE public key pair. The default location is: > > drivers/platform/x86/intel_sgx/intel_sgx_signing_key.pem Unless there is some conflict you are worried about, "signing_key.pem" is preferable as the default name so that the key is ignored via the top-level .gitignore. The intel_sgx dir should have also a .gitignore to exclude the other LE related output files: drivers/platform/x86/intel_sgx/le/enclave/sgx_le.ss drivers/platform/x86/intel_sgx/le/enclave/sgxsign drivers/platform/x86/intel_sgx/le/sgx_le_proxy > If the default key does not exist kbuild will generate a random key and > place it to this location. KBUILD_SGX_SIGN_PIN can be used to specify > the passphrase for the LE public key. > > TinyCrypt (https://github.com/01org/tinycrypt) is used as AES > implementation, which is not timing resistant. Eventually this needs to > be replaced with AES-NI based implementation that could be either > > - re-use existing AES-NI code in the kernel > - have its own hand written code > > Signed-off-by: Jarkko Sakkinen > ---
Re: [intel-sgx-kernel-dev] [PATCH v5 06/11] intel_sgx: driver for Intel Software Guard Extensions
On Mon, 2017-11-13 at 21:45 +0200, Jarkko Sakkinen wrote: > Intel SGX is a set of CPU instructions that can be used by applications > to set aside private regions of code and data. The code outside the > enclave is disallowed to access the memory inside the enclave by the CPU > access control. > > SGX driver provides a ioctl API for loading and initializing enclaves. > Address range for enclaves is reserved with mmap() and they are > destroyed with munmap(). Enclave construction, measurement and > initialization is done with the provided the ioctl API. > > The driver implements also a swapper thread ksgxswapd for EPC pages > backed by a private shmem file. Currently it has a limitation of not > swapping VA pages but there is nothing preventing to implement it later > on. Now it was scoped out in order to keep the implementation simple. > > The parameter struct for SGX_IOC_ENCLAVE_INIT does not contain a > parameter to supply a launch token. Generating and using tokens is best > to be kept in the control of the kernel because it has direct binding to > the IA32_SGXPUBKEYHASHx MSRs (a core must have MSRs set to the same > value as the signer of token). > > By giving user space any role in the launch process is a risk for > introducing bottlenecks as kernel must exhibit behavior that user space > launch daemon depends on What do you mean by bottlenecks? Assuming you're referring to performance bottlenecks, this statement is flat out false. Moving the launch enclave into the kernel introduces performance bottlenecks, e.g. as implemented, a single LE services all EINIT requests and is protected by a mutex. That is the very definition of a bottleneck. The kernel will never be as performant as userspace when it comes to EINIT tokens because userspace can make informed decisions based on its usage model, e.g. number of LEs (or LE threads) to spawn, LE and token lifecycles, LE and token thread safety, etc... > , properietary risks (closed launch daemons on > closed platforms) This justifies the need for the kernel to be able to generate launch tokens, but I don't think allowing userspace to also provide its own tokens adds any proprietary risks. > and stability risks as there would be division of > semantics between user space and kernel. > What exactly are the stability risks? The token itself is architecturally defined and isn't fundamentally different than e.g. the sigstruct. Writing the LE hash MSRs as needed, e.g. for userspace LEs, isn't difficult. > Signed-off-by: Jarkko Sakkinen > ---
Re: [intel-sgx-kernel-dev] [PATCH v5 10/11] intel_sgx: glue code for in-kernel LE
On Mon, 2017-11-13 at 21:45 +0200, Jarkko Sakkinen wrote: > --- a/drivers/platform/x86/intel_sgx/sgx_main.c > +++ b/drivers/platform/x86/intel_sgx/sgx_main.c > @@ -88,6 +88,37 @@ u64 sgx_encl_size_max_64; > u64 sgx_xfrm_mask = 0x3; > u32 sgx_misc_reserved; > u32 sgx_xsave_size_tbl[64]; > +bool sgx_unlocked_msrs; > +u64 sgx_le_pubkeyhash[4]; > + > +static DECLARE_RWSEM(sgx_file_sem); > + > +static int sgx_open(struct inode *inode, struct file *file) > +{ > + int ret; > + > + down_read(&sgx_file_sem); > + > + ret = sgx_le_start(&sgx_le_ctx); > + if (ret) { > + up_read(&sgx_file_sem); > + return ret; > + } > + > + return 0; > +} > + > +static int sgx_release(struct inode *inode, struct file *file) > +{ > + up_read(&sgx_file_sem); > + > + if (down_write_trylock(&sgx_file_sem)) { > + sgx_le_stop(&sgx_le_ctx); > + up_write(&sgx_file_sem); > + } > + > + return 0; > +} This semaphore approach is broken due to the LE process using an anon inode for /dev/sgx, which results in sgx_release being called without an accompanying call to sgx_open. This causes deadlocks due to a semaphore underrun. https://lists.01.org/pipermail/intel-sgx-kernel-dev/2017-November/000901.html [ 242.659272] INFO: task lsdt:9425 blocked for more than 120 seconds. [ 242.659783] Not tainted 4.14.0-rc4+ #18 [ 242.660063] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this [ 242.660558] lsdtD0 9425 1 0x0004 [ 242.660559] Call Trace: [ 242.660564] __schedule+0x3c2/0x8b0 [ 242.660567] schedule+0x36/0x80 [ 242.660568] rwsem_down_read_failed+0x10a/0x170 [ 242.660569] call_rwsem_down_read_failed+0x18/0x30 [ 242.660570] ? call_rwsem_down_read_failed+0x18/0x30 [ 242.660571] down_read+0x20/0x40 [ 242.660572] sgx_open+0x19/0x40 [intel_sgx] [ 242.660574] chrdev_open+0xbf/0x1b0 [ 242.660576] do_dentry_open+0x1f8/0x300 [ 242.660577] ? cdev_put+0x30/0x30 [ 242.660578] vfs_open+0x4f/0x70 [ 242.660579] path_openat+0x2ae/0x13a0 [ 242.660581] ? mem_cgroup_uncharge_swap+0x60/0x90 [ 242.660582] do_filp_open+0x99/0x110 [ 242.660583] ? __check_object_size+0xfc/0x1a0 [ 242.660585] ? __alloc_fd+0xb0/0x170 [ 242.660586] do_sys_open+0x124/0x210 [ 242.660587] ? do_sys_open+0x124/0x210 [ 242.660588] SyS_open+0x1e/0x20 [ 242.660589] entry_SYSCALL_64_fastpath+0x1e/0xa9 [ 242.660590] RIP: 0033:0x7f426cf9ec7d [ 242.660591] RSP: 002b:7f426b31ea60 EFLAGS: 0293 ORIG_RAX: [ 242.660592] RAX: ffda RBX: 00c4200ba000 RCX: 7f426cf9ec7d [ 242.660592] RDX: RSI: 0002 RDI: 0068cca7 [ 242.660593] RBP: 7f426b31ec10 R08: 00f6bc30 R09: [ 242.660593] R10: 7f426478 R11: 0293 R12: 0001 [ 242.660594] R13: R14: 7f426d31b13d R15: 7f42640008c0 > #ifdef CONFIG_COMPAT > long sgx_compat_ioctl(struct file *filep, unsigned int cmd, unsigned long > arg) > @@ -141,8 +172,10 @@ static unsigned long sgx_get_unmapped_area(struct file > *file, > return addr; > } > > -static const struct file_operations sgx_fops = { > +const struct file_operations sgx_fops = { > .owner = THIS_MODULE, > + .open = sgx_open, > + .release= sgx_release, > .unlocked_ioctl = sgx_ioctl, > #ifdef CONFIG_COMPAT > .compat_ioctl = sgx_compat_ioctl,
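One way to avoid the underrun: count users explicitly and treat only files that actually went through sgx_open() as users, e.g. via file->private_data. Rough, untested sketch only (reusing sgx_le_start()/sgx_le_stop()/sgx_le_ctx from the patch above):

static DEFINE_MUTEX(sgx_file_lock);
static int sgx_nr_users;

static int sgx_open(struct inode *inode, struct file *file)
{
	int ret = 0;

	mutex_lock(&sgx_file_lock);
	if (!sgx_nr_users)
		ret = sgx_le_start(&sgx_le_ctx);
	if (!ret) {
		sgx_nr_users++;
		/* mark the file as having gone through sgx_open() */
		file->private_data = &sgx_nr_users;
	}
	mutex_unlock(&sgx_file_lock);

	return ret;
}

static int sgx_release(struct inode *inode, struct file *file)
{
	/* e.g. the LE's anon inode, which never called sgx_open() */
	if (!file->private_data)
		return 0;

	mutex_lock(&sgx_file_lock);
	if (!--sgx_nr_users)
		sgx_le_stop(&sgx_le_ctx);
	mutex_unlock(&sgx_file_lock);

	return 0;
}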
[PATCH] x86/pkeys: Explicitly treat PK #PF on kernel address as a bad area
Kernel addresses are always mapped with _PAGE_USER=0, i.e. PKRU isn't enforced, and so we should never see X86_PF_PK set on a kernel address fault. WARN once to capture the issue in case we somehow don't die, e.g. the access originated in userspace. Remove a similar check and its comment from spurious_fault_check(). The intent of the comment (and later code[1]) was simply to document that spurious faults due to protection keys should be impossible, but that's irrelevant and a bit of a red herring since we should never get a protection keys fault on a kernel address regardless of the kernel's TLB flushing behavior. [1] http://lists-archives.com/linux-kernel/28407455-x86-pkeys-new-page-fault-error-code-bit-pf_pk.html Signed-off-by: Sean Christopherson Cc: Dave Hansen --- There's no indication that this condition has ever been encountered. I came across the code in spurious_fault_check() and was confused as to why we would unconditionally treat a protection keys fault as spurious when the comment explicitly stated that such a case should be impossible. Dave Hansen suggested adding a WARN_ON_ONCE in spurious_fault_check(), but it seemed more appropriate to freak out on any protection keys fault on a kernel address since that would imply a hardware issue or kernel bug. I omitted a Suggested-by since this isn't necessarily what Dave had in mind. arch/x86/mm/fault.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2aafa6ab6103..f19a55972136 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1040,12 +1040,6 @@ static int spurious_fault_check(unsigned long error_code, pte_t *pte) if ((error_code & X86_PF_INSTR) && !pte_exec(*pte)) return 0; - /* -* Note: We do not do lazy flushing on protection key -* changes, so no spurious fault will ever set X86_PF_PK. -*/ - if ((error_code & X86_PF_PK)) - return 1; return 1; } @@ -1241,6 +1235,14 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, * protection error (error_code & 9) == 0. */ if (unlikely(fault_in_kernel_space(address))) { + /* +* We should never encounter a protection keys fault on a +* kernel address as kernel address are always mapped with +* _PAGE_USER=0, i.e. PKRU isn't enforced. +*/ + if (WARN_ON_ONCE(error_code & X86_PF_PK)) + goto bad_kernel_address; + if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) { if (vmalloc_fault(address) >= 0) return; @@ -1253,6 +1255,8 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code, /* kprobes don't want to hook the spurious faults: */ if (kprobes_fault(regs)) return; + +bad_kernel_address: /* * Don't take the mm semaphore here. If we fixup a prefetch * fault we could otherwise deadlock: -- 2.18.0
Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache
On Fri, 2018-06-08 at 19:09 +0200, Jarkko Sakkinen wrote: > SGX has a set of data structures to maintain information about the enclaves > and their security properties. BIOS reserves a fixed size region of > physical memory for these structures by setting Processor Reserved Memory > Range Registers (PRMRR). This memory area is called Enclave Page Cache > (EPC). > > This commit implements the basic routines to allocate and free pages from > different EPC banks. There is also a swapper thread ksgxswapd for EPC pages > that gets woken up by sgx_alloc_page() when we run below the low watermark. > The swapper thread continues swapping pages up until it reaches the high > watermark. > > Each subsystem that uses SGX must provide a set of callbacks for EPC > pages that are used to reclaim, block and write an EPC page. Kernel > takes the responsibility of maintaining LRU cache for them. > > Signed-off-by: Jarkko Sakkinen > --- > arch/x86/include/asm/sgx.h | 67 + > arch/x86/include/asm/sgx_arch.h | 224 > arch/x86/kernel/cpu/intel_sgx.c | 443 +++- > 3 files changed, 732 insertions(+), 2 deletions(-) > create mode 100644 arch/x86/include/asm/sgx_arch.h ... > diff --git a/arch/x86/kernel/cpu/intel_sgx.c b/arch/x86/kernel/cpu/intel_sgx.c > index db6b315334f4..ae2b5c5b455f 100644 > --- a/arch/x86/kernel/cpu/intel_sgx.c > +++ b/arch/x86/kernel/cpu/intel_sgx.c > @@ -14,14 +14,439 @@ > #include > #include > #include > +#include > #include > #include > +#include > #include > > +#define SGX_NR_TO_SCAN 16 > +#define SGX_NR_LOW_PAGES 32 > +#define SGX_NR_HIGH_PAGES 64 > + > bool sgx_enabled __ro_after_init = false; > EXPORT_SYMBOL(sgx_enabled); > +bool sgx_lc_enabled __ro_after_init; > +EXPORT_SYMBOL(sgx_lc_enabled); > +atomic_t sgx_nr_free_pages = ATOMIC_INIT(0); > +EXPORT_SYMBOL(sgx_nr_free_pages); > +struct sgx_epc_bank sgx_epc_banks[SGX_MAX_EPC_BANKS]; > +EXPORT_SYMBOL(sgx_epc_banks); > +int sgx_nr_epc_banks; > +EXPORT_SYMBOL(sgx_nr_epc_banks); > +LIST_HEAD(sgx_active_page_list); > +EXPORT_SYMBOL(sgx_active_page_list); > +DEFINE_SPINLOCK(sgx_active_page_list_lock); > +EXPORT_SYMBOL(sgx_active_page_list_lock); I don't think we should be exporting anything other than sgx_enabled and sgx_lc_enabled. The only external use of a symbol that can't be trivially (re)moved is in the driver's sgx_pm_suspend() in sgx_main.c, which uses the sgx_active_page_list to invalidate enclaves. And that behavior seems unsafe, e.g. an enclave could theoretically have zero pages on the active list and so could be missed in the suspend flow. > +static struct task_struct *ksgxswapd_tsk; > +static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
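For reference, the watermark behavior described in the changelog boils down to a loop along these lines (simplified sketch, not the exact patch code; sgx_swap_pages() stands in for the batched reclaim routine):

static int ksgxswapd(void *p)
{
	while (!kthread_should_stop()) {
		/* Woken by sgx_alloc_page() when we drop below the low watermark. */
		wait_event_freezable(ksgxswapd_waitq,
				     kthread_should_stop() ||
				     atomic_read(&sgx_nr_free_pages) < SGX_NR_LOW_PAGES);

		/* Reclaim in batches of SGX_NR_TO_SCAN until the high watermark is restored. */
		while (!kthread_should_stop() &&
		       atomic_read(&sgx_nr_free_pages) < SGX_NR_HIGH_PAGES)
			sgx_swap_pages(SGX_NR_TO_SCAN);
	}

	return 0;
}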
Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave
On Wed, Jun 20, 2018 at 11:39:00AM -0700, Jethro Beekman wrote: > On 2018-06-20 11:16, Jethro Beekman wrote: > > > This last bit is also repeated in different words in Table 35-2 and > > > Section 42.2.2. The MSRs are *not writable* before the write-lock bit > > > itself is locked. Meaning the MSRs are either locked with Intel's key > > > hash, or not locked at all. > > Actually, this might be a documentation bug. I have some test hardware and I > was able to configure the MSRs in the BIOS and then read the MSRs after boot > like this: > > MSR 0x3a 0x00040005 > MSR 0x8c 0x20180620 > MSR 0x8d 0x20180620 > MSR 0x8e 0x20180620 > MSR 0x8f 0x20180620 > > Since this is not production hardware, it could also be a CPU bug of course. > > If it is indeed possible to configure AND lock the MSR values to non-Intel > values, I'm very much in favor of Nathaniels proposal to treat the launch > enclave like any other firmware blob. It's not a CPU or documentation bug (though the latter is arguable). SGX has an activation step that is triggered by doing a WRMSR(0x7a) with bit 0 set. Until SGX is activated, the SGX related bits in IA32_FEATURE_CONTROL cannot be set, i.e. SGX can't be enabled. But, the LE hash MSRs are fully writable prior to activation, e.g. to allow firmware to lock down the LE key with a non-Intel value. So yes, it's possible to lock the MSRs to a non-Intel value. The obvious caveat is that whatever blob is used to write the MSRs would need be executed prior to activation. As for the SDM, it's a documentation... omission? SGX activation is intentionally omitted from the SDM. The intended usage model is that firmware will always do the activation (if it wants SGX enabled), i.e. post-firmware software will only ever "see" SGX as disabled or in the fully activated state, and so the SDM doesn't describe SGX behavior prior to activation. I believe the activation process, or at least what is required from firmware, is documented in the BIOS writer's guide. > Jethro Beekman | Fortanix >
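To make the ordering concrete, the pre-activation flow described above looks roughly like this from firmware's perspective (illustrative sketch only, not real firmware code; the macro names are placeholders, the MSR indices match Jethro's dump):

/* Placeholder names, for illustration only. */
#define MSR_SGX_LE_PUBKEYHASH0	0x8c	/* ...through 0x8f */
#define MSR_SGX_ACTIVATE	0x7a

static void pre_boot_setup_le_hash(const u64 hash[4])
{
	int i;

	/* The hash MSRs are writable only before activation, regardless of any lock bit. */
	for (i = 0; i < 4; i++)
		wrmsrl(MSR_SGX_LE_PUBKEYHASH0 + i, hash[i]);

	/*
	 * Activate SGX (bit 0).  Only after this can the SGX bits in
	 * IA32_FEATURE_CONTROL be set, and only now do the hash MSRs
	 * honor the write-lock policy.
	 */
	wrmsrl(MSR_SGX_ACTIVATE, BIT(0));
}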
Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave
On Thu, Jun 21, 2018 at 03:11:18PM -0400, Nathaniel McCallum wrote: > If this is acceptable for everyone, my hope is the following: > > 1. Intel would split the existing code into one of the following > schemas (I don't care which): > A. three parts: UEFI module, FLC-only kernel driver and user-space > launch enclave > B. two parts: UEFI module (including launch enclave) and FLC-only > kernel driver To make sure I understand correctly... The UEFI module would lock the LE MSRs with a public key hardcoded into both the UEFI module and the kernel at build time? And for the kernel, it would only load its SGX driver if FLC is supported and the MSRs are locked to the expected key? IIUC, this approach will cause problems for virtualization. Running VMs with different LE keys would require the bare metal firmware to configure the LE MSRs to be unlocked, which would effectively make using SGX in the host OS mutually exlusive with exposing SGX to KVM guests. Theoretically it would be possible for KVM to emulate the guest's LE and use the host's LE to generate EINIT tokens, but emulating an enclave would likely require a massive amount of code and/or complexity. > 2. Intel would release a reproducible build of the GPL UEFI module > sources signed with a SecureBoot trusted key and provide an > acceptable[0] binary redistribution license. > > 3. The kernel community would agree to merge the kernel driver given > the above criteria (and, obviously, acceptable kernel code). > > The question of how to distribute the UEFI module and possible launch > enclave remains open. I see two options: independent distribution and > bundling it in linux-firmware. The former may be a better > technological fit since the UEFI module will likely need to be run > before the kernel (and the boot loader; and shim). However, the latter > has the benefit of already being a well-known entity to our downstream > distributors. I could go either way on this. Writing and locks the LE MSRs effectively needs to be done before running the bootloader/kernel/etc... Delaying activation would require, at a minimum, leaving IA32_FEATURE_CONTROL unlocked since IA32_FEATURE_CONTROL's SGX bits can't be set until SGX is activated. > I know this plan is more work for everyone involved, but I think it > manages to actually maximize both security and freedom. > > [0]: details here - > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/README#n19 > On Thu, Jun 21, 2018 at 11:29 AM Neil Horman wrote: > > > > On Thu, Jun 21, 2018 at 08:32:25AM -0400, Nathaniel McCallum wrote: > > > On Wed, Jun 20, 2018 at 5:02 PM Sean Christopherson > > > wrote: > > > > > > > > On Wed, Jun 20, 2018 at 11:39:00AM -0700, Jethro Beekman wrote: > > > > > On 2018-06-20 11:16, Jethro Beekman wrote: > > > > > > > This last bit is also repeated in different words in Table 35-2 > > > > > > > and > > > > > > > Section 42.2.2. The MSRs are *not writable* before the write-lock > > > > > > > bit > > > > > > > itself is locked. Meaning the MSRs are either locked with Intel's > > > > > > > key > > > > > > > hash, or not locked at all. > > > > > > > > > > Actually, this might be a documentation bug. 
I have some test > > > > > hardware and I > > > > > was able to configure the MSRs in the BIOS and then read the MSRs > > > > > after boot > > > > > like this: > > > > > > > > > > MSR 0x3a 0x00040005 > > > > > MSR 0x8c 0x20180620 > > > > > MSR 0x8d 0x20180620 > > > > > MSR 0x8e 0x20180620 > > > > > MSR 0x8f 0x20180620 > > > > > > > > > > Since this is not production hardware, it could also be a CPU bug of > > > > > course. > > > > > > > > > > If it is indeed possible to configure AND lock the MSR values to > > > > > non-Intel > > > > > values, I'm very much in favor of Nathaniels proposal to treat the > > > > > launch > > > > > enclave like any other firmware blob. > > > > > > > > It's not a CPU or documentation bug (though the latter is arguable). > > > > SGX has an activation step that is triggered by doing a WRMSR(0x7a) > > > > with bit 0 set. Until SGX is activated, the SGX related bits in > > > > IA32_FEATURE_CONTROL cannot be set, i.e. SGX can't be enabled. But, > > > > the L
Re: [PATCH v13 09/13] x86/sgx: Enclave Page Cache (EPC) memory manager
On Tue, Aug 28, 2018 at 07:07:33AM -0700, Dave Hansen wrote: > On 08/28/2018 01:35 AM, Jarkko Sakkinen wrote: > > On Mon, Aug 27, 2018 at 02:15:34PM -0700, Dave Hansen wrote: > >> On 08/27/2018 11:53 AM, Jarkko Sakkinen wrote: > >>> +struct sgx_epc_page_ops { > >>> + bool (*get)(struct sgx_epc_page *epc_page); > >>> + void (*put)(struct sgx_epc_page *epc_page); > >>> + bool (*reclaim)(struct sgx_epc_page *epc_page); > >>> + void (*block)(struct sgx_epc_page *epc_page); > >>> + void (*write)(struct sgx_epc_page *epc_page); > >>> +}; > >> Why do we need a fancy, slow (retpoline'd) set of function pointers when > >> we only have one user of these (the SGX driver)? > > KVM has its own implementation for these operations. > > That belongs in the changelog. > > Also, where is the implementation? How can we assess this code that was > built to create an abstraction without both of the users? I can provide an early preview of the KVM reclaim code, but honestly I think that would do more harm than good. The VMX architecture for EPC reclaim is complex, even for SGX standards. Opening that can of worms would likely derail this discussion. That being said, this abstraction isn't exactly what KVM will need, but it's pretty close and gives us something to build on. Regardless, this layer of indirection is justifiable even with a single implementation. Reclaiming an EPC page is not a simple matter of copying the data somewhere else and marking the page not present. Actual eviction requires a reference to the Secure Enclave Control Structure (SECS) of the enclave that owns the page. Software needs to ensure it doesn't violate the hardware-enforced access rules, e.g. most reclaim activities need to be done under a per-enclave lock. Not all pages have the same reclaim rules, e.g. an SECS can only be reclaimed after all its child pages have been reclaimed, a VA page doesn't need to be blocked, etc... And the list goes on... All of the tracking metadata and logic about what can be evicted when resides in the driver component[1], e.g. the core SGX code doesn't even have a software representation of an enclave. Alternatively the driver could provide a single "do_reclaim" function if we wanted to avoid a fancy abstraction layer, and in fact that's how I initially implemented the abstraction, but that approach has its own warts and in a lot of ways makes the end result more complex. Ultimately, moving pages in/out of the EPC is so abysmally slow that the raw performance of software is almost completely irrelevant. The algorithms certainly matter, but optimizing away things like function pointers definitely takes a back seat to just about everything else. [1] Early versions of SGX support tried to put all EPC management in the driver, but that approach caused major problems for KVM (even without reclaim support) and received a less than enthusiastic response from the community.
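To illustrate why the callbacks exist: the core reclaimer conceptually just walks the active list and defers the enclave-specific work to the page's owner via the ops. Rough sketch only (the real code batches pages, and the assumption that each EPC page carries a pointer to its owner's ops is illustrative):

static void sgx_reclaim_page(struct sgx_epc_page *page)
{
	const struct sgx_epc_page_ops *ops = page->impl->ops;

	if (!ops->get(page))
		return;			/* owner is tearing down, skip */

	if (ops->reclaim(page)) {	/* owner agrees the page can be evicted */
		ops->block(page);	/* EBLOCK + TLB shootdown, owner's locking rules */
		ops->write(page);	/* EWB to the owner's backing store */
	}

	ops->put(page);
}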
Re: [PATCH v13 07/13] x86/sgx: Add data structures for tracking the EPC pages
On Tue, Aug 28, 2018 at 09:53:11AM -0700, Dave Hansen wrote: > >>> + sgx_nr_epc_banks++; > >>> + } > >>> + > >>> + if (!sgx_nr_epc_banks) { > >>> + pr_err("There are zero EPC banks.\n"); > >>> + return -ENODEV; > >>> + } > >>> + > >>> + return 0; > >>> +} > >> > >> Does this support hot-addition of a bank? If not, why not? > ... > > I'm not aware that we would have an ACPI specification for SGX so this > > is all I have at the moment (does not show any ACPI event for > > hotplugging). > > So you're saying the one platform you looked at don't support hotplug. > I was looking for a more broad statement about SGX. Hardware doesn't support hotplug of EPC as the EPC size and location are locked during activation of SGX. And IIRC, activation of SGX must be synchronized across all CPUs in a multi-socket platform, e.g. you can't late-enable SGX on a socket and do hotplugging that way. In a virtualized environment there are no such restrictions. I am not aware of any explicit requirements or use cases for supporting hotplug of EPC, but that's probably only because virtualization of SGX is fairly nascent.
Re: [PATCH v13 09/13] x86/sgx: Enclave Page Cache (EPC) memory manager
On Tue, Aug 28, 2018 at 02:26:36PM -0700, Dave Hansen wrote: > On 08/28/2018 02:22 PM, Sean Christopherson wrote: > > On Tue, Aug 28, 2018 at 07:07:33AM -0700, Dave Hansen wrote: > >> On 08/28/2018 01:35 AM, Jarkko Sakkinen wrote: > >>> On Mon, Aug 27, 2018 at 02:15:34PM -0700, Dave Hansen wrote: > >>>> On 08/27/2018 11:53 AM, Jarkko Sakkinen wrote: > >>>>> +struct sgx_epc_page_ops { > >>>>> + bool (*get)(struct sgx_epc_page *epc_page); > >>>>> + void (*put)(struct sgx_epc_page *epc_page); > >>>>> + bool (*reclaim)(struct sgx_epc_page *epc_page); > >>>>> + void (*block)(struct sgx_epc_page *epc_page); > >>>>> + void (*write)(struct sgx_epc_page *epc_page); > >>>>> +}; > >>>> Why do we need a fancy, slow (retpoline'd) set of function pointers when > >>>> we only have one user of these (the SGX driver)? > >>> KVM has its own implementation for these operations. > >> > >> That belongs in the changelog. > >> > >> Also, where is the implementation? How can we assess this code that was > >> built to create an abstraction without both of the users? > > > > I can provide an early preview of the KVM reclaim code, but honestly > > I think that would do more harm than good. The VMX architecture for > > EPC reclaim is complex, even for SGX standards. Opening that can of > > worms would likely derail this discussion. That being said, this > > abstraction isn't exactly what KVM will need, but it's pretty close > > and gives us something to build on. > > Please remove the abstraction code. We don't introduce infrastructure > which no one will use. The infrastructure is used in the sense that it allows us to split the userspace-facing code, i.e. the driver, into a separate module. This in turn allows virtualization of SGX without having to load the driver or building it in the first place, e.g. to virtualize SGX on a system that doesn't meet the driver's requirements. We could eliminate the abstraction by moving the EPC management code into the driver, but that would directly conflict with past feedback and would need to be completely undone to enable KVM. The abstraction could be dumbed down to a single function, but as mentioned earlier, that comes with its own costs. I can dive into exactly what we lose with a single function approach if this is a sticking point.
Re: [PATCH v2 2/3] x86/mm: add .data..decrypted section to hold shared variables
On Tue, Aug 28, 2018 at 05:12:56PM -0500, Brijesh Singh wrote: > kvmclock defines few static variables which are shared with hypervisor > during the kvmclock initialization. > > When SEV is active, memory is encrypted with a guest-specific key, and > if guest OS wants to share the memory region with hypervisor then it must > clear the C-bit before sharing it. Currently, we use > kernel_physical_mapping_init() to split large pages before clearing the > C-bit on shared pages. But the kernel_physical_mapping_init fails when > called from the kvmclock initialization (mainly because memblock allocator > was not ready). > > The '__decrypted' can be used to define a shared variable; the variables > will be put in the .data.decryption section. This section is mapped with > C=0 early in the boot, we also ensure that the initialized values are > updated to match with C=0 (i.e perform an in-place decryption). The > .data..decrypted section is PMD aligned and sized so that we avoid the > need to split the large pages when mapping this section. What about naming the attribute (and section) '__unencrypted' instead of '__decrypted'? The attribute should be a property describing how the data must be accessed, it shouldn't imply anything regarding the history of the data. Decrypted implies that data was once encrypted, whereas unencrypted simply states that the data is stored in plain text. All data that has been decrypted is also unencrypted, but the reverse does not hold true.
Re: [PATCH v3 4/4] x86/kvm: use __decrypted attribute in shared variables
On Wed, Aug 29, 2018 at 01:24:00PM -0500, Brijesh Singh wrote: > The following commit: > > 368a540e0232 (x86/kvmclock: Remove memblock dependency) Checkpatch prefers: Commit 368a540e0232 ("x86/kvmclock: Remove memblock dependency") That'll also save three lines in the commit message. > caused SEV guest regression. When SEV is active, we map the shared > variables (wall_clock and hv_clock_boot) with C=0 to ensure that both > the guest and the hypervisor is able to access the data. To map the Nit: s/is/are > variables we use kernel_physical_mapping_init() to split the large pages, > but this routine fails to allocate a new page. Before the above commit, > kvmclock initialization was called after memory allocator was available > but now its called early during boot. What about something like this to make the issue a bit clearer: variables we use kernel_physical_mapping_init() to split the large pages, but splitting large pages requires allocating a new PMD, which fails now that kvmclock initialization is called early during boot. > Recently we added a special .data..decrypted section to hold the shared > variables. This section is mapped with C=0 early during boot. Use > __decrypted attribute to put the wall_clock and hv_clock_boot in > .data..decrypted section so that they are mapped with C=0. > > Signed-off-by: Brijesh Singh > Fixes: 368a540e0232 ("x86/kvmclock: Remove memblock dependency") > Cc: Tom Lendacky > Cc: k...@vger.kernel.org > Cc: Thomas Gleixner > Cc: Borislav Petkov > Cc: "H. Peter Anvin" > Cc: linux-kernel@vger.kernel.org > Cc: Paolo Bonzini > Cc: Sean Christopherson > Cc: k...@vger.kernel.org > Cc: "Radim Krčmář" > --- > arch/x86/kernel/kvmclock.c | 30 +- > 1 file changed, 25 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c > index 1e67646..08f5f8a 100644 > --- a/arch/x86/kernel/kvmclock.c > +++ b/arch/x86/kernel/kvmclock.c > @@ -28,6 +28,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -61,8 +62,8 @@ early_param("no-kvmclock-vsyscall", > parse_no_kvmclock_vsyscall); > (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info)) > > static struct pvclock_vsyscall_time_info > - hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __aligned(PAGE_SIZE); > -static struct pvclock_wall_clock wall_clock; > + hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __decrypted > __aligned(PAGE_SIZE); > +static struct pvclock_wall_clock wall_clock __decrypted; > static DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu); > > static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void) > @@ -267,10 +268,29 @@ static int kvmclock_setup_percpu(unsigned int cpu) > return 0; > > /* Use the static page for the first CPUs, allocate otherwise */ > - if (cpu < HVC_BOOT_ARRAY_SIZE) > + if (cpu < HVC_BOOT_ARRAY_SIZE) { > p = &hv_clock_boot[cpu]; > - else > - p = kzalloc(sizeof(*p), GFP_KERNEL); > + } else { > + int rc; > + unsigned int sz = sizeof(*p); > + > + if (sev_active()) > + sz = PAGE_ALIGN(sz); This is a definite downside to the section approach. Unless I missed something, the section padding goes to waste since we don't have a mechanism in place to allocate into that section, e.g. as is we're burning nearly 2mb of data since we're only using 4k of the 2mb page. And every decrypted allocation can potentially fracture a large page since the allocator is unaware of the decrypted requirement. 
Might not be an issue for kvmclock since it's a one-time allocation, but we could suffer death by a thousand cuts if there are scenarios where a decrypted allocation isn't persistent (VirtIO queues maybe?). Duplicating the full kernel tables for C=0 accesses doesn't suffer from these issues. And I think potential corruption issues due to mis-{aligned,sized} objects can be detected through static analysis, build assertions and/or runtime checks. > + p = kzalloc(sz, GFP_KERNEL); For the SEV case, can't we do a straight kmalloc() since we zero out the page after decrypting it? > + > + /* > + * The physical address of per-cpu variable will be shared with > + * the hypervisor. Let's clear the C-bit before we assign the > + * memory to per_cpu variable. > + */ > + if (p && sev_active()) { > + rc = set_memory_decrypted((unsigned long)p, sz >> > PAGE_SHIFT); > + if (rc) > + return rc; > + memset(p, 0, sz); > + } > + } > > per_cpu(hv_clock_per_cpu, cpu) = p; > return p ? 0 : -ENOMEM; > -- > 2.7.4 >
Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
On Wed, Aug 29, 2018 at 07:36:22PM +, Nadav Amit wrote: > at 10:11 AM, Nadav Amit wrote: > > > at 1:59 AM, Masami Hiramatsu wrote: > > > >> On Wed, 29 Aug 2018 01:11:42 -0700 > >> Nadav Amit wrote: > >> > >>> Use lockdep to ensure that text_mutex is taken when text_poke() is > >>> called. > >>> > >>> Actually it is not always taken, specifically when it is called by kgdb, > >>> so take the lock in these cases. > >> > >> Can we really take a mutex in kgdb context? > >> > >> kgdb_arch_remove_breakpoint > >> <- dbg_deactivate_sw_breakpoints > >> <- kgdb_reenter_check > >> <- kgdb_handle_exception > >> <- __kgdb_notify > >> <- kgdb_ll_trap > >> <- do_int3 > >> <- kgdb_notify > >> <- die notifier > >> > >> kgdb_arch_set_breakpoint > >> <- dbg_activate_sw_breakpoints > >> <- kgdb_reenter_check > >> <- kgdb_handle_exception > >> ... > >> > >> Both seems called in exception context, so we can not take a mutex lock. > >> I think kgdb needs a special path. > > > > You are correct, but I don’t want a special path. Presumably text_mutex is > > guaranteed not to be taken according to the code. > > > > So I guess the only concern is lockdep. Do you see any problem if I change > > mutex_lock() into mutex_trylock()? It should always succeed, and I can add a > > warning and a failure path if it fails for some reason. > > Err.. This will not work. I think I will drop this patch, since I cannot > find a proper yet simple assertion. Creating special path just for the > assertion seems wrong. It's probably worth expanding the comment for text_poke() to call out the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose code and comments make it explicitly clear why its safe for them to call text_poke() without acquiring the lock. Might prevent someone from going down this path again in the future.
Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing enclaves
On Wed, Aug 29, 2018 at 12:33:54AM -0700, Huang, Kai wrote: > [snip..] > > > > > > > > > @@ -38,6 +39,18 @@ static LIST_HEAD(sgx_active_page_list); static > > > > DEFINE_SPINLOCK(sgx_active_page_list_lock); > > > > static struct task_struct *ksgxswapd_tsk; static > > > > DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq); > > > > +static struct notifier_block sgx_pm_notifier; static u64 > > > > +sgx_pm_cnt; > > > > + > > > > +/* The cache for the last known values of IA32_SGXLEPUBKEYHASHx > > > > +MSRs > > > > for each > > > > + * CPU. The entries are initialized when they are first used by > > > > sgx_einit(). > > > > + */ > > > > +struct sgx_lepubkeyhash { > > > > + u64 msrs[4]; > > > > + u64 pm_cnt; > > > > > > May I ask why do we need pm_cnt here? In fact why do we need suspend > > > staff (namely, sgx_pm_cnt above, and related code in this patch) here > > > in this patch? From the patch commit message I don't see why we need > > > PM staff here. Please give comment why you need PM staff, or you may > > > consider to split the PM staff to another patch. > > > > Refining the commit message probably makes more sense because without PM > > code sgx_einit() would be broken. The MSRs have been reset after waking up. > > > > Some kind of counter is required to keep track of the power cycle. When > > going > > to sleep the sgx_pm_cnt is increased. sgx_einit() compares the current > > value of > > the global count to the value in the cache entry to see whether we are in a > > new > > power cycle. > > You mean reset to Intel default? I think we can also just reset the cached MSR > values on each power cycle, which would be simpler, IMHO? Refresh my brain, does hardware reset the MSRs on a transition to S3 or lower? > I think we definitely need some code to handle S3-S5, but should be in > separate patches, since I think the major impact of S3-S5 is entire EPC > being destroyed. I think keeping pm_cnt is not sufficient enough to handle > such case? > > > > This brings up one question though: how do we deal with VM host going to > > sleep? > > VM guest would not be aware of this. > > IMO VM just gets "sudden loss of EPC" after suspend & resume in host. SGX > driver > and SDK should be able to handle "sudden loss of EPC", ie, co-working > together to > re-establish the missing enclaves. > > Actually supporting "sudden loss of EPC" is a requirement to support live > migration > of VM w/ SGX. Internally long time ago we had a discussion and the decision > was we > should support SGX live migration given two facts: > > 1) losing platform-dependent is not important. For example, losing sealing > key is > not a problem, as we could get secrets provisioned again from remote. 2) Both > windows > & linux driver commit to support "sudden loss of EPC". > > I don't think we have to support in very first upstream driver, but I think > we need > to support someday. Actually, we can easily support this in the driver, at least for SGX1 hardware. SGX2 isn't difficult to handle, but we've intentionally postponed those patches until SGX1 support is in mainline[1]. Accesses to the EPC after it is lost will cause faults. Userspace EPC accesses, e.g. ERESUME, will get a SIGSEGV that the process should interpret as an "I should restart my enclave" event. The SDK already does this. In the driver, we just need to be aware of this potential behavior and not freak out. Specifically, SGX_INVD needs to not WARN on faults that may have been due to a the EPC being nuked. 
I think we can even remove the sgx_encl_pm_notifier() code altogether. [1] SGX1 hardware signals a #GP on an access to an invalid EPC page. SGX2 signals a #PF with the PF_SGX error code bit set. This is problematic because the kernel looks at the PTEs for CR2 and sees nothing wrong, so it thinks it should just restart the instruction, leading to an infinite fault loop. Resolving this is fairly straightforward, but a complete fix requires propagating PF_SGX down to the ENCLS fixup handler, which means plumbing the error code through the fixup handlers or smushing PF_SGX into trapnr. Since there is a parallel effort to plumb the error code through the handlers, https://lkml.org/lkml/2018/8/6/924, we opted to do this in a separate series. > Sean, > > Would you be able to comment here? > > > > > I think the best measure would be to add a new parameter to sgx_einit() that > > enforces update of the MSRs. The driver can then set this parameter in the > > case > > when sgx_einit() returns SGX_INVALID_LICENSE. This is coherent because the > > driver requires writable MSRs. It would not be coherent to do it directly > > in the > > core because KVM does not require writable MSRs. > > IMHO this is not required, as I mentioned above. > > And > [snip...] > > Thanks, > -Kai
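For reference, the MSR caching that the pm_cnt discussion revolves around is conceptually just the following (simplified sketch; helper and per-CPU variable names only approximate the patch, and the power-cycle counter handling is omitted):

static int sgx_einit(struct sgx_sigstruct *sigstruct, struct sgx_einittoken *token,
		     struct sgx_epc_page *secs_page, u64 lepubkeyhash[4])
{
	struct sgx_lepubkeyhash *cache = this_cpu_ptr(&sgx_lepubkeyhash_cache);
	int i, ret;

	preempt_disable();

	/* Only pay for the (slow) WRMSRs when the hash actually changes. */
	for (i = 0; i < 4; i++) {
		if (cache->msrs[i] != lepubkeyhash[i]) {
			wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, lepubkeyhash[i]);
			cache->msrs[i] = lepubkeyhash[i];
		}
	}

	ret = __einit(sigstruct, token, sgx_epc_addr(secs_page));

	preempt_enable();

	return ret;
}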
Re: [RFC PATCH 1/6] x86/alternative: assert text_mutex is taken
On Wed, Aug 29, 2018 at 08:44:47PM +, Nadav Amit wrote: > at 1:13 PM, Sean Christopherson wrote: > > > On Wed, Aug 29, 2018 at 07:36:22PM +, Nadav Amit wrote: > >> at 10:11 AM, Nadav Amit wrote: > >> > >>> at 1:59 AM, Masami Hiramatsu wrote: > >>> > >>>> On Wed, 29 Aug 2018 01:11:42 -0700 > >>>> Nadav Amit wrote: > >>>> > >>>>> Use lockdep to ensure that text_mutex is taken when text_poke() is > >>>>> called. > >>>>> > >>>>> Actually it is not always taken, specifically when it is called by kgdb, > >>>>> so take the lock in these cases. > >>>> > >>>> Can we really take a mutex in kgdb context? > >>>> > >>>> kgdb_arch_remove_breakpoint > >>>> <- dbg_deactivate_sw_breakpoints > >>>> <- kgdb_reenter_check > >>>> <- kgdb_handle_exception > >>>><- __kgdb_notify > >>>> <- kgdb_ll_trap > >>>><- do_int3 > >>>> <- kgdb_notify > >>>><- die notifier > >>>> > >>>> kgdb_arch_set_breakpoint > >>>> <- dbg_activate_sw_breakpoints > >>>> <- kgdb_reenter_check > >>>> <- kgdb_handle_exception > >>>> ... > >>>> > >>>> Both seems called in exception context, so we can not take a mutex lock. > >>>> I think kgdb needs a special path. > >>> > >>> You are correct, but I don’t want a special path. Presumably text_mutex is > >>> guaranteed not to be taken according to the code. > >>> > >>> So I guess the only concern is lockdep. Do you see any problem if I change > >>> mutex_lock() into mutex_trylock()? It should always succeed, and I can > >>> add a > >>> warning and a failure path if it fails for some reason. > >> > >> Err.. This will not work. I think I will drop this patch, since I cannot > >> find a proper yet simple assertion. Creating special path just for the > >> assertion seems wrong. > > > > It's probably worth expanding the comment for text_poke() to call out > > the kgdb case and reference kgdb_arch_{set,remove}_breakpoint(), whose > > code and comments make it explicitly clear why its safe for them to > > call text_poke() without acquiring the lock. Might prevent someone > > from going down this path again in the future. > > I thought that the whole point of the patch was to avoid comments, and > instead enforce the right behavior. I don’t understand well enough kgdb > code, so I cannot attest it does the right thing. What happens if > kgdb_do_roundup==0? As is, the comment is wrong because there are obviously cases where text_poke() is called without text_mutex being held. I can't attest to the kgdb code either. My thought was to document the exception so that if someone does want to try and enforce the right behavior they can dive right into the problem instead of having to learn of the kgdb gotcha the hard way. Maybe a FIXME is the right approach?
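Something along these lines is what I had in mind, wording purely illustrative:

/*
 * text_poke() expects text_mutex to be held, with one exception: kgdb.
 * kgdb_arch_{set,remove}_breakpoint() call text_poke() from the debug
 * exception path, where taking a mutex is not an option; see the code
 * and comments in arch/x86/kernel/kgdb.c for why that is safe there.
 *
 * FIXME: because of the kgdb exception we cannot simply add
 *	lockdep_assert_held(&text_mutex);
 * here for all callers.
 */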
Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing enclaves
On Wed, Aug 29, 2018 at 01:58:09PM -0700, Huang, Kai wrote: > > -Original Message- > > From: Christopherson, Sean J > > Sent: Thursday, August 30, 2018 8:34 AM > > To: Huang, Kai > > Cc: Jarkko Sakkinen ; platform-driver- > > x...@vger.kernel.org; x...@kernel.org; nhor...@redhat.com; linux- > > ker...@vger.kernel.org; t...@linutronix.de; suresh.b.sid...@intel.com; > > Ayoun, > > Serge ; h...@zytor.com; npmccal...@redhat.com; > > mi...@redhat.com; linux-...@vger.kernel.org; Hansen, Dave > > > > Subject: Re: [PATCH v13 10/13] x86/sgx: Add sgx_einit() for initializing > > enclaves > > > > On Wed, Aug 29, 2018 at 12:33:54AM -0700, Huang, Kai wrote: > > > [snip..] > > > > > > > > > > > > > > > @@ -38,6 +39,18 @@ static LIST_HEAD(sgx_active_page_list); > > > > > > static DEFINE_SPINLOCK(sgx_active_page_list_lock); > > > > > > static struct task_struct *ksgxswapd_tsk; static > > > > > > DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq); > > > > > > +static struct notifier_block sgx_pm_notifier; static u64 > > > > > > +sgx_pm_cnt; > > > > > > + > > > > > > +/* The cache for the last known values of IA32_SGXLEPUBKEYHASHx > > > > > > +MSRs > > > > > > for each > > > > > > + * CPU. The entries are initialized when they are first used by > > > > > > sgx_einit(). > > > > > > + */ > > > > > > +struct sgx_lepubkeyhash { > > > > > > + u64 msrs[4]; > > > > > > + u64 pm_cnt; > > > > > > > > > > May I ask why do we need pm_cnt here? In fact why do we need > > > > > suspend staff (namely, sgx_pm_cnt above, and related code in this > > > > > patch) here in this patch? From the patch commit message I don't > > > > > see why we need PM staff here. Please give comment why you need PM > > > > > staff, or you may consider to split the PM staff to another patch. > > > > > > > > Refining the commit message probably makes more sense because > > > > without PM code sgx_einit() would be broken. The MSRs have been reset > > after waking up. > > > > > > > > Some kind of counter is required to keep track of the power cycle. > > > > When going to sleep the sgx_pm_cnt is increased. sgx_einit() > > > > compares the current value of the global count to the value in the > > > > cache entry to see whether we are in a new power cycle. > > > > > > You mean reset to Intel default? I think we can also just reset the > > > cached MSR values on each power cycle, which would be simpler, IMHO? > > > > Refresh my brain, does hardware reset the MSRs on a transition to S3 or > > lower? > > > > > I think we definitely need some code to handle S3-S5, but should be in > > > separate patches, since I think the major impact of S3-S5 is entire > > > EPC being destroyed. I think keeping pm_cnt is not sufficient enough > > > to handle such case? > > > > > > > > This brings up one question though: how do we deal with VM host going to > > sleep? > > > > VM guest would not be aware of this. > > > > > > IMO VM just gets "sudden loss of EPC" after suspend & resume in host. > > > SGX driver and SDK should be able to handle "sudden loss of EPC", ie, > > > co-working together to re-establish the missing enclaves. > > > > > > Actually supporting "sudden loss of EPC" is a requirement to support > > > live migration of VM w/ SGX. Internally long time ago we had a > > > discussion and the decision was we should support SGX live migration given > > two facts: > > > > > > 1) losing platform-dependent is not important. For example, losing > > > sealing key is not a problem, as we could get secrets provisioned > > > again from remote. 
2) Both windows & linux driver commit to support > > > "sudden > > loss of EPC". > > > > > > I don't think we have to support in very first upstream driver, but I > > > think we need to support someday. > > > > Actually, we can easily support this in the driver, at least for SGX1 > > hardware. > > That's my guess too. Just want to check whether we are still on the same page > :) > > > SGX2 isn't difficult to handle, but we've intentionally postponed those > > patches > > until SGX1 support is in mainline[1]. > > Accesses to the EPC after it is lost will cause faults. Userspace EPC > > accesses, e.g. > > ERESUME, will get a SIGSEGV that the process should interpret as an "I > > should > > restart my enclave" event. The SDK already does this. In the driver, we > > just need > > to be aware of this potential behavior and not freak out. Specifically, > > SGX_INVD > > needs to not WARN on faults that may have been due to a the EPC being nuked. > > I think we can even remove the sgx_encl_pm_notifier() code altogether. > > Possibly we still need to do some cleanup, ie, all structures of enclaves, > upon resume? Not for functional reasons. The driver will automatically do the cleanup via SGX_INVD when it next accesses the enclave's pages and takes a fault, e.g. during reclaim. Proactively reclaiming the EPC pages would probably affect performance, though not necessarily in a good way. And I think it would be a beneficial t
[PATCH 1/2] KVM: vmx: Add defines for SGX ENCLS exiting
Hardware support for basic SGX virtualization adds a new execution control (ENCLS_EXITING), VMCS field (ENCLS_EXITING_BITMAP) and exit reason (ENCLS), that enables a VMM to intercept specific ENCLS leaf functions, e.g. to inject faults when the VMM isn't exposing SGX to a VM. When ENCLS_EXITING is enabled, the VMM can set/clear bits in the bitmap to intercept/allow ENCLS leaf functions in non-root, e.g. setting bit 2 in the ENCLS_EXITING_BITMAP will cause ENCLS[EINIT] to VMExit(ENCLS). Note: EXIT_REASON_ENCLS was previously added by commit 1f5199927034 ("KVM: VMX: add missing exit reasons"). Signed-off-by: Sean Christopherson --- arch/x86/include/asm/vmx.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 6aa8499e1f62..2665c10ece4c 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -74,6 +74,7 @@ #define SECONDARY_EXEC_ENABLE_INVPCID 0x1000 #define SECONDARY_EXEC_ENABLE_VMFUNC0x2000 #define SECONDARY_EXEC_SHADOW_VMCS 0x4000 +#define SECONDARY_EXEC_ENCLS_EXITING 0x8000 #define SECONDARY_EXEC_RDSEED_EXITING 0x0001 #define SECONDARY_EXEC_ENABLE_PML 0x0002 #define SECONDARY_EXEC_XSAVES 0x0010 @@ -213,6 +214,8 @@ enum vmcs_field { VMWRITE_BITMAP_HIGH = 0x2029, XSS_EXIT_BITMAP = 0x202C, XSS_EXIT_BITMAP_HIGH= 0x202D, + ENCLS_EXITING_BITMAP= 0x202E, + ENCLS_EXITING_BITMAP_HIGH = 0x202F, TSC_MULTIPLIER = 0x2032, TSC_MULTIPLIER_HIGH = 0x2033, GUEST_PHYSICAL_ADDRESS = 0x2400, -- 2.18.0
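For context, a VMM would consume these defines roughly as follows (illustrative sketch, not the follow-up KVM patch; enabling SECONDARY_EXEC_ENCLS_EXITING in the secondary execution controls is omitted):

static void vmx_setup_encls_exiting(bool expose_sgx)
{
	if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENCLS_EXITING))
		return;

	/*
	 * Trap every leaf when SGX isn't exposed to the guest so faults can
	 * be injected, otherwise trap only ENCLS[EINIT], i.e. bit 2.
	 */
	vmcs_write64(ENCLS_EXITING_BITMAP, expose_sgx ? BIT_ULL(2) : ~0ull);
}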
Re: SEV guest regression in 4.18
On Wed, Aug 22, 2018 at 10:14:17AM +0200, Borislav Petkov wrote: > Dropping Pavel as it bounces. > > On Tue, Aug 21, 2018 at 11:07:38AM -0500, Brijesh Singh wrote: > > The tsc_early_init() is called before setup_arch() -> init_mem_mapping. > > Ok, I see it, thanks for explaining. > > So back to your original ideas - I'm wondering whether we should define > a chunk of memory which the hypervisor and guest can share and thus > communicate over... Something ala SEV-ES also with strictly defined > layout and put all those variables there. And then the guest can map > decrypted. What about creating a data section specifically for shared memory? The section would be PMD aligned and sized so that it could be mapped appropriately without having to fracture the page. Then define a macro to easily declare data in the new section, a la __read_mostly. > There might be something similar though, I dunno. > > Maybe Paolo has a better idea... > > -- > Regards/Gruss, > Boris. > > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB > 21284 (AG Nürnberg) > --
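Concretely, something like the below is what I have in mind (sketch only; the attribute name and section are made up for illustration, and vmlinux.lds.S would need to align and pad the section to PMD_SIZE):

/* Illustrative; not an existing kernel attribute. */
#define __hv_shared __attribute__((__section__(".data..hv_shared")))

/* e.g. in kvmclock.c: */
static struct pvclock_wall_clock wall_clock __hv_shared;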
Re: SEV guest regression in 4.18
On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote: > On 22/08/2018 22:11, Brijesh Singh wrote: > > > > Yes, this is one of approach I have in mind. It will avoid splitting > > the larger pages; I am thinking that early in boot code we can lookup > > for this special section and decrypt it in-place and probably maps with > > C=0. Only downside, it will increase data section footprint a bit > > because we need to align this section to PM_SIZE. > > If you can ensure it doesn't span a PMD, maybe it does not need to be > aligned; you could establish a C=0 mapping of the whole 2M around it. Wouldn't that result in exposing/leaking whatever code/data happened to reside on the same 2M page (or corrupting it if the entire page isn't decrypted)? Or are you suggesting that we'd also leave the encrypted mapping intact? If it's the latter... Does hardware include the C-bit in the cache tag? I.e are the C=0 and C=1 variations of the same PA treated as different cache lines? If so, we could also treat the unencrypted variation as a separate PA by defining it to be (ACTUAL_PA | (1 << x86_phys_bits)), (re)adjusting x86_phys_bits if necessary to get the kernel to allow the address. init_memory_mapping() could then alias every PA with an unencrypted VA mapping, which would allow the kernel to access any PA unencrypted by using virt_to_phys() and phys_to_virt() to translate an encrypted VA to an unencrypted VA. It would mean doubling INIT_PGD_PAGE_COUNT, but that'd be a one-time cost regardless of how many pages needed to be accessed with C=0. > Paolo
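Purely to illustrate the aliasing idea (and assuming the answer to the cache tag question is "yes", plus the extra C=0 direct-map alias described above being established by init_memory_mapping()):

/* Illustration only, not proposed code. */
#define SEV_C0_ALIAS_BIT	BIT_ULL(boot_cpu_data.x86_phys_bits)

static inline void *phys_to_virt_decrypted(phys_addr_t pa)
{
	/* The aliased PA is mapped with C=0, so accesses bypass encryption. */
	return phys_to_virt(pa | SEV_C0_ALIAS_BIT);
}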
Re: SEV guest regression in 4.18
On Fri, Aug 24, 2018 at 10:41:27AM -0500, Brijesh Singh wrote: > > > On 08/23/2018 11:16 AM, Paolo Bonzini wrote: > >On 23/08/2018 17:29, Sean Christopherson wrote: > >>On Thu, Aug 23, 2018 at 01:26:55PM +0200, Paolo Bonzini wrote: > >>>On 22/08/2018 22:11, Brijesh Singh wrote: > >>>> > >>>>Yes, this is one of approach I have in mind. It will avoid splitting > >>>>the larger pages; I am thinking that early in boot code we can lookup > >>>>for this special section and decrypt it in-place and probably maps with > >>>>C=0. Only downside, it will increase data section footprint a bit > >>>>because we need to align this section to PM_SIZE. > >>> > >>>If you can ensure it doesn't span a PMD, maybe it does not need to be > >>>aligned; you could establish a C=0 mapping of the whole 2M around it. > >> > >>Wouldn't that result in exposing/leaking whatever code/data happened > >>to reside on the same 2M page (or corrupting it if the entire page > >>isn't decrypted)? Or are you suggesting that we'd also leave the > >>encrypted mapping intact? > > > >Yes, exactly the latter, because... > > > Hardware does not enforce coherency between the encrypted and > unencrypted mapping for the same physical page. So, creating a > two mapping of same physical address will lead a possible data > corruption. But couldn't we avoid corruption by ensuring data accessed via the unencrypted mapping is cache line aligned and sized? The CPU could speculatively bring the encrypted version into the cache but it should never get into a modified state (barring a software bug, but that would be a problem regardless of encryption). > Note, SME creates two mapping of the same physical address to perform > in-place encryption of kernel and initrd images; this is a special case > and APM documents steps on how to do this. > > > > > >>Does hardware include the C-bit in the cache tag? > > > >... the C-bit is effectively part of the physical address and hence of > >the cache tag. The kernel is already relying on this to properly > >encrypt/decrypt pages, if I remember correctly. > > > >Paolo > >
Re: [PATCH] x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks
On Wed, Jul 11, 2018 at 07:37:18PM +0200, Vitaly Kuznetsov wrote: > When we switched from doing rdmsr() to reading FS/GS base values from > current->thread we completely forgot about legacy 32-bit userspaces which > we still support in KVM (why?). task->thread.{fsbase,gsbase} are only > synced for 64-bit processes, calling save_fsgs_for_kvm() and using > its result from current is illegal for legacy processes. > > There's no ARCH_SET_FS/GS prctls for legacy applications. Base MSRs are, > however, not always equal to zero. Intel's manual says (3.4.4 Segment > Loading Instructions in IA-32e Mode): > > "In order to set up compatibility mode for an application, segment-load > instructions (MOV to Sreg, POP Sreg) work normally in 64-bit mode. An > entry is read from the system descriptor table (GDT or LDT) and is loaded > in the hidden portion of the segment register. > ... > The hidden descriptor register fields for FS.base and GS.base are > physically mapped to MSRs in order to load all address bits supported by > a 64-bit implementation. > " > > The issue was found by strace test suite where 32-bit ioctl_kvm_run test > started segfaulting. > > Reported-by: Dmitry V. Levin > Bisected-by: Masatake YAMATO > Fixes: 42b933b59721 ("x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from > current->thread") > Signed-off-by: Vitaly Kuznetsov > --- > arch/x86/kvm/vmx.c | 25 + > 1 file changed, 17 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 559a12b6184d..65968649b365 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -2560,6 +2560,7 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu) > struct vcpu_vmx *vmx = to_vmx(vcpu); > #ifdef CONFIG_X86_64 > int cpu = raw_smp_processor_id(); > + unsigned long fsbase, kernel_gsbase; Because bikeshedding is fun, what do you think about using fs_base and kernel_gs_base for these names? I have a series that touches this code and also adds local variables for {FS,GS}.base and {FS,GS}.sel. I used {fs,gs}_base and {fs,gs}_sel to be consistent with the vmx->host_state nomenclature (the local variables are used to update the associated vmx->host_state variables), but I'll change my patches if you have a strong preference for omitting the underscore. 
> #endif > int i; > > @@ -2575,12 +2576,20 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu) > vmx->host_state.gs_ldt_reload_needed = vmx->host_state.ldt_sel; > > #ifdef CONFIG_X86_64 > - save_fsgs_for_kvm(); > - vmx->host_state.fs_sel = current->thread.fsindex; > - vmx->host_state.gs_sel = current->thread.gsindex; > -#else > - savesegment(fs, vmx->host_state.fs_sel); > - savesegment(gs, vmx->host_state.gs_sel); > + if (likely(is_64bit_mm(current->mm))) { > + save_fsgs_for_kvm(); > + vmx->host_state.fs_sel = current->thread.fsindex; > + vmx->host_state.gs_sel = current->thread.gsindex; > + fsbase = current->thread.fsbase; > + kernel_gsbase = current->thread.gsbase; > + } else { > +#endif > + savesegment(fs, vmx->host_state.fs_sel); > + savesegment(gs, vmx->host_state.gs_sel); > +#ifdef CONFIG_X86_64 > + fsbase = read_msr(MSR_FS_BASE); > + kernel_gsbase = read_msr(MSR_KERNEL_GS_BASE); > + } > #endif > if (!(vmx->host_state.fs_sel & 7)) { > vmcs_write16(HOST_FS_SELECTOR, vmx->host_state.fs_sel); > @@ -2600,10 +2609,10 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu) > savesegment(ds, vmx->host_state.ds_sel); > savesegment(es, vmx->host_state.es_sel); > > - vmcs_writel(HOST_FS_BASE, current->thread.fsbase); > + vmcs_writel(HOST_FS_BASE, fsbase); > vmcs_writel(HOST_GS_BASE, cpu_kernelmode_gs_base(cpu)); > > - vmx->msr_host_kernel_gs_base = current->thread.gsbase; > + vmx->msr_host_kernel_gs_base = kernel_gsbase; > if (is_long_mode(&vmx->vcpu)) > wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base); > #else > -- > 2.14.4 >
Re: [PATCH] X86/KVM: Do not allow DISABLE_EXITS_MWAIT when LAPIC ARAT is not available
On Wed, 2018-04-11 at 11:16 +0200, KarimAllah Ahmed wrote: > If the processor does not have an "Always Running APIC Timer" (aka ARAT), > we should not give guests direct access to MWAIT. The LAPIC timer would > stop ticking in deep C-states, so any host deadlines would not wakeup the > host kernel. > > The host kernel intel_idle driver handles this by switching to broadcast > mode when ARAT is not available and MWAIT is issued with a deep C-state > that would stop the LAPIC timer. When MWAIT is passed through, we can not > tell when MWAIT is issued. > > So just disable this capability when LAPIC ARAT is not available. I am not > even sure if there are any CPUs with VMX support but no LAPIC ARAT or not. ARAT was added on WSM, so NHM, the Core 2 family and a few PSC SKUs support VMX+MWAIT but not ARAT. > Cc: Paolo Bonzini > Cc: Radim Krčmář > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: H. Peter Anvin > Cc: x...@kernel.org > Cc: k...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Reported-by: Wanpeng Li > Signed-off-by: KarimAllah Ahmed > --- > arch/x86/kvm/x86.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index b2ff74b..0334b25 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -2819,7 +2819,8 @@ static int msr_io(struct kvm_vcpu *vcpu, struct > kvm_msrs __user *user_msrs, > static inline bool kvm_can_mwait_in_guest(void) > { > return boot_cpu_has(X86_FEATURE_MWAIT) && > - !boot_cpu_has_bug(X86_BUG_MONITOR); > + !boot_cpu_has_bug(X86_BUG_MONITOR) && > + boot_cpu_has(X86_FEATURE_ARAT); > } > > int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
Re: [PATCH 06/10] KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
On Thu, Apr 12, 2018 at 04:38:39PM +0200, Paolo Bonzini wrote: > On 21/02/2018 18:47, KarimAllah Ahmed wrote: > > + > > + if (kvm_vcpu_map(vcpu, > > gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) > > + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, > > gfn_to_gpa(map->pfn)); > > This should in principle be pfn_to_hpa. However, pfn_to_hpa is unused; > let's just remove it and do "<< PAGE_SHIFT". Unlike gfn_to_gpa, where > in principle the shift could be different, pfn_to_hpa is *by definition* > a shift left by PAGE_SHIFT. Any reason not to use PFN_PHYS instead of the handcoded shift? > Paolo
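For reference, PFN_PHYS() from include/linux/pfn.h is just the cast-and-shift Paolo describes:

	#define PFN_PHYS(x)	((phys_addr_t)(x) << PAGE_SHIFT)

so the write above could also be spelled as follows (sketch, same names as in the quoted hunk):

	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, PFN_PHYS(map->pfn));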
Re: [PATCH v2 4/9] x86/kvm/mmu: introduce guest_mmu
On Tue, Sep 25, 2018 at 07:58:39PM +0200, Vitaly Kuznetsov wrote: > When EPT is used for nested guest we need to re-init MMU as shadow > EPT MMU (nested_ept_init_mmu_context() does that). When we return back > from L2 to L1 kvm_mmu_reset_context() in nested_vmx_load_cr3() resets > MMU back to normal TDP mode. Add a special 'guest_mmu' so we can use > separate root caches; the improved hit rate is not very important for > single vCPU performance, but it avoids contention on the mmu_lock for > many vCPUs. > > On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit > goes from 42k to 26k cycles. > > Signed-off-by: Vitaly Kuznetsov > Signed-off-by: Paolo Bonzini > --- > Changes since v1: > - drop now unneded local vmx variable in vmx_free_vcpu_nested > [Sean Christopherson] > --- > arch/x86/include/asm/kvm_host.h | 3 +++ > arch/x86/kvm/mmu.c | 15 +++ > arch/x86/kvm/vmx.c | 27 ++- > 3 files changed, 32 insertions(+), 13 deletions(-) ... > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 2d55adab52de..93ff08136fc1 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -8468,8 +8468,10 @@ static inline void nested_release_vmcs12(struct > vcpu_vmx *vmx) > * Free whatever needs to be freed from vmx->nested when L1 goes down, or > * just stops using VMX. > */ > -static void free_nested(struct vcpu_vmx *vmx) > +static void free_nested(struct kvm_vcpu *vcpu) > { > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + > if (!vmx->nested.vmxon && !vmx->nested.smm.vmxon) > return; > > @@ -8502,6 +8504,8 @@ static void free_nested(struct vcpu_vmx *vmx) > vmx->nested.pi_desc = NULL; > } > > + kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL); > + > free_loaded_vmcs(&vmx->nested.vmcs02); > } > > @@ -8510,7 +8514,7 @@ static int handle_vmoff(struct kvm_vcpu *vcpu) > { > if (!nested_vmx_check_permission(vcpu)) > return 1; > - free_nested(to_vmx(vcpu)); > + free_nested(vcpu); > nested_vmx_succeed(vcpu); > return kvm_skip_emulated_instruction(vcpu); > } > @@ -8541,6 +8545,8 @@ static int handle_vmclear(struct kvm_vcpu *vcpu) > if (vmptr == vmx->nested.current_vmptr) > nested_release_vmcs12(vmx); > > + kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL); Shouldn't we only free guest_mmu if VMCLEAR is targeting current_vmptr? Assuming that's the case, we could put the call to kvm_mmu_free_roots() in nested_release_vmcs12() instead of calling it from handle_vmclear() and handle_vmptrld(). > + > kvm_vcpu_write_guest(vcpu, > vmptr + offsetof(struct vmcs12, launch_state), > &zero, sizeof(zero)); > @@ -8924,6 +8930,9 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu) > } > > nested_release_vmcs12(vmx); > + > + kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, > +KVM_MMU_ROOTS_ALL); > /* >* Load VMCS12 from guest memory since it is not already >* cached. 
> @@ -10976,12 +10985,10 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, > struct loaded_vmcs *vmcs) > */ > static void vmx_free_vcpu_nested(struct kvm_vcpu *vcpu) > { > - struct vcpu_vmx *vmx = to_vmx(vcpu); > - > - vcpu_load(vcpu); > - vmx_switch_vmcs(vcpu, &vmx->vmcs01); > - free_nested(vmx); > - vcpu_put(vcpu); > + vcpu_load(vcpu); > + vmx_switch_vmcs(vcpu, &to_vmx(vcpu)->vmcs01); > + free_nested(vcpu); > + vcpu_put(vcpu); > } > > static void vmx_free_vcpu(struct kvm_vcpu *vcpu) > @@ -11331,6 +11338,7 @@ static int nested_ept_init_mmu_context(struct > kvm_vcpu *vcpu) > if (!valid_ept_address(vcpu, nested_ept_get_cr3(vcpu))) > return 1; > > + vcpu->arch.mmu = &vcpu->arch.guest_mmu; > kvm_init_shadow_ept_mmu(vcpu, > to_vmx(vcpu)->nested.msrs.ept_caps & > VMX_EPT_EXECUTE_ONLY_BIT, > @@ -11346,6 +11354,7 @@ static int nested_ept_init_mmu_context(struct > kvm_vcpu *vcpu) > > static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu) > { > + vcpu->arch.mmu = &vcpu->arch.root_mmu; > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; > } > > @@ -13421,7 +13430,7 @@ static void vmx_leave_nested(struct kvm_vcpu *vcpu) > to_vmx(vcpu)->nested.nested_run_pending = 0; > nested_vmx_vmexit(vcpu, -1, 0, 0); > } > - free_nested(to_vmx(vcpu)); > + free_nested(vcpu); > } > > /* > -- > 2.17.1 >
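To make the suggestion about nested_release_vmcs12() concrete, a rough sketch follows, assuming the function is changed to take a kvm_vcpu (mirroring the free_nested() change in this patch). Since handle_vmclear() only calls it when VMCLEAR targets current_vmptr, the guest_mmu roots would then be freed exactly in that case, and handle_vmptrld() would pick up the same behavior for free:

	static void nested_release_vmcs12(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		if (vmx->nested.current_vmptr == -1ull)
			return;

		/* ... existing vmcs12 release logic ... */

		/* drop the roots cached for the now-stale shadow EPT MMU */
		kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
	}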
Re: [PATCH v2 2/9] x86/kvm/mmu.c: set get_pdptr hook in kvm_init_shadow_ept_mmu()
On Tue, Sep 25, 2018 at 07:58:37PM +0200, Vitaly Kuznetsov wrote: > kvm_init_shadow_ept_mmu() doesn't set the get_pdptr() hook, and this is > not a problem just because the MMU context is already initialized and this > hook points to kvm_pdptr_read(). As we intend to use a dedicated > MMU for the shadow EPT MMU, set this hook explicitly. > > Signed-off-by: Vitaly Kuznetsov > Signed-off-by: Paolo Bonzini > --- > arch/x86/kvm/mmu.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index ca79ec0d8060..2bdc63f67886 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -4846,6 +4846,8 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, > bool execonly, > context->root_level = PT64_ROOT_4LEVEL; > context->direct_map = false; > context->base_role.word = root_page_role.word & mmu_base_role_mask.word; > + context->get_pdptr = kvm_pdptr_read; Would it make sense to set this in nested_ept_init_mmu_context() along with set_cr3, get_cr3 and inject_page_fault? The other MMU flows set them as a package deal. Either way... Reviewed-by: Sean Christopherson > + > update_permission_bitmask(vcpu, context, true); > update_pkru_bitmask(vcpu, context, true); > update_last_nonleaf_level(vcpu, context);
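For reference, the alternative placement would look roughly like this, i.e. the hook set right after the kvm_init_shadow_ept_mmu() call in nested_ept_init_mmu_context(), next to the other nested EPT callbacks (sketch, assuming the mmu-as-pointer form introduced in patch 1 of the series):

	vcpu->arch.mmu->set_cr3           = vmx_set_cr3;
	vcpu->arch.mmu->get_cr3           = nested_ept_get_cr3;
	vcpu->arch.mmu->get_pdptr         = kvm_pdptr_read;
	vcpu->arch.mmu->inject_page_fault = nested_ept_inject_page_fault;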
Re: [PATCH v2 1/9] x86/kvm/mmu: make vcpu->mmu a pointer to the current MMU
On Tue, Sep 25, 2018 at 07:58:36PM +0200, Vitaly Kuznetsov wrote: > As a preparation for the full MMU split between L1 and L2, make vcpu->arch.mmu > a pointer to the currently used MMU. For now, this is always > vcpu->arch.root_mmu. No functional change. > > Signed-off-by: Vitaly Kuznetsov > Signed-off-by: Paolo Bonzini > --- Reviewed-by: Sean Christopherson
Re: [PATCH v2 3/9] x86/kvm/mmu.c: add kvm_mmu parameter to kvm_mmu_free_roots()
On Tue, Sep 25, 2018 at 07:58:38PM +0200, Vitaly Kuznetsov wrote: > Add an option to specify which MMU root we want to free. This will > be used when nested and non-nested MMUs for L1 are split. > > Signed-off-by: Vitaly Kuznetsov > Signed-off-by: Paolo Bonzini > --- Reviewed-by: Sean Christopherson
Re: [PATCH v2 5/9] x86/kvm/mmu: get rid of redundant kvm_mmu_setup()
On Tue, Sep 25, 2018 at 07:58:40PM +0200, Vitaly Kuznetsov wrote: > From: Paolo Bonzini > > Just inline the contents into the sole caller, kvm_init_mmu is now > public. > > Suggested-by: Vitaly Kuznetsov > Signed-off-by: Paolo Bonzini > --- Reviewed-by: Sean Christopherson
Re: [PATCH v2 6/9] x86/kvm/mmu: make space for source data caching in struct kvm_mmu
On Tue, Sep 25, 2018 at 07:58:41PM +0200, Vitaly Kuznetsov wrote: > In preparation to MMU reconfiguration avoidance we need a space to > cache source data. As this partially intersects with kvm_mmu_page_role, > create 64bit sized union kvm_mmu_role holding both base and extended data. > No functional change. > > Signed-off-by: Vitaly Kuznetsov > --- One nit below, other than that... Reviewed-by: Sean Christopherson > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index e59e5f49c8c2..bb1ef0f68f8e 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -2359,7 +2359,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct > kvm_vcpu *vcpu, > int collisions = 0; > LIST_HEAD(invalid_list); > > - role = vcpu->arch.mmu->base_role; > + role = vcpu->arch.mmu->mmu_role.base; > role.level = level; > role.direct = direct; > if (role.direct) > @@ -4407,7 +4407,8 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu > *vcpu, > void > reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context) > { > - bool uses_nx = context->nx || context->base_role.smep_andnot_wp; > + bool uses_nx = context->nx || > + context->mmu_role.base.smep_andnot_wp; > struct rsvd_bits_validate *shadow_zero_check; > int i; > > @@ -4726,7 +4727,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) > { > struct kvm_mmu *context = vcpu->arch.mmu; > > - context->base_role.word = mmu_base_role_mask.word & > + context->mmu_role.base.word = mmu_base_role_mask.word & > kvm_calc_tdp_mmu_root_page_role(vcpu).word; > context->page_fault = tdp_page_fault; > context->sync_page = nonpaging_sync_page; > @@ -4807,7 +4808,7 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu) > else > paging32_init_context(vcpu, context); > > - context->base_role.word = mmu_base_role_mask.word & > + context->mmu_role.base.word = mmu_base_role_mask.word & > kvm_calc_shadow_mmu_root_page_role(vcpu).word; > reset_shadow_zero_bits_mask(vcpu, context); > } > @@ -4816,7 +4817,7 @@ EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu); > static union kvm_mmu_page_role > kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool > accessed_dirty) > { > - union kvm_mmu_page_role role = vcpu->arch.mmu->base_role; > + union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base; > > role.level = PT64_ROOT_4LEVEL; > role.direct = false; > @@ -4846,7 +4847,8 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, > bool execonly, > context->update_pte = ept_update_pte; > context->root_level = PT64_ROOT_4LEVEL; > context->direct_map = false; > - context->base_role.word = root_page_role.word & mmu_base_role_mask.word; > + context->mmu_role.base.word = > + root_page_role.word & mmu_base_role_mask.word; > context->get_pdptr = kvm_pdptr_read; > > update_permission_bitmask(vcpu, context, true); > @@ -5161,10 +5163,13 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, > gpa_t gpa, > > local_flush = true; > while (npte--) { > + unsigned int base_role = Nit: should this be a u32 to match mmu_role.base.word? > + vcpu->arch.mmu->mmu_role.base.word; > + > entry = *spte; > mmu_page_zap_pte(vcpu->kvm, sp, spte); > if (gentry && > - !((sp->role.word ^ vcpu->arch.mmu->base_role.word) > + !((sp->role.word ^ base_role) > & mmu_base_role_mask.word) && rmap_can_add(vcpu)) > mmu_pte_write_new_pte(vcpu, sp, spte, &gentry); > if (need_remote_flush(entry, *spte)) > @@ -5861,6 +5866,16 @@ int kvm_mmu_module_init(void) > { > int ret = -ENOMEM; > > + /* > + * MMU roles use union aliasing which is, generally speaking, an > + * undefined behavior. 
However, we supposedly know how compilers behave > + * and the current status quo is unlikely to change. Guardians below are > + * supposed to let us know if the assumption becomes false. > + */ > + BUILD_BUG_ON(sizeof(union kvm_mmu_page_role) != sizeof(u32)); > + BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32)); > + BUILD_BUG_ON(sizeof(union kvm_mmu_role) != sizeof(u64));
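As a self-contained illustration (not from the patch) of the union aliasing being relied on and of the size assumptions the BUILD_BUG_ON()s guard, here is a plain C11 version with stand-in field names:

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	/* simplified stand-ins for kvm_mmu_page_role / kvm_mmu_extended_role */
	union base_role {
		uint32_t word;
		struct {
			unsigned level:4;
			unsigned direct:1;
			unsigned access:3;
		};
	};

	union ext_role {
		uint32_t word;
		struct {
			unsigned valid:1;
			unsigned execonly:1;
		};
	};

	/* the combined role: filled via .base/.ext, compared as a single u64 */
	union mmu_role {
		uint64_t as_u64;
		struct {
			union base_role base;
			union ext_role ext;
		};
	};

	static_assert(sizeof(union base_role) == sizeof(uint32_t), "base role must be 32 bits");
	static_assert(sizeof(union ext_role)  == sizeof(uint32_t), "ext role must be 32 bits");
	static_assert(sizeof(union mmu_role)  == sizeof(uint64_t), "combined role must be 64 bits");

	int main(void)
	{
		union mmu_role a = { 0 }, b = { 0 };

		a.base.level = 4;
		a.ext.valid  = 1;
		b.base.level = 4;

		/* one 64-bit compare decides whether MMU reconfiguration is needed */
		printf("same role? %s\n", a.as_u64 == b.as_u64 ? "yes" : "no");
		return 0;
	}

The single compare of as_u64 is what makes the later "skip reconfiguration if the role is unchanged" checks cheap.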
Re: [PATCH v2 7/9] x86/kvm/nVMX: introduce source data cache for kvm_init_shadow_ept_mmu()
On Tue, Sep 25, 2018 at 07:58:42PM +0200, Vitaly Kuznetsov wrote: > MMU re-initialization is expensive, in particular, > update_permission_bitmask() and update_pkru_bitmask() are. > > Cache the data used to setup shadow EPT MMU and avoid full re-init when > it is unchanged. > > Signed-off-by: Vitaly Kuznetsov > --- > arch/x86/include/asm/kvm_host.h | 14 + > arch/x86/kvm/mmu.c | 51 - > 2 files changed, 52 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 1821b0215230..87ddaa1579e7 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -274,7 +274,21 @@ union kvm_mmu_page_role { > }; > > union kvm_mmu_extended_role { > +/* > + * This structure complements kvm_mmu_page_role caching everything needed for > + * MMU configuration. If nothing in both these structures changed, MMU > + * re-configuration can be skipped. @valid bit is set on first usage so we > don't > + * treat all-zero structure as valid data. > + */ > u32 word; > + struct { > + unsigned int valid:1; > + unsigned int execonly:1; > + unsigned int cr4_pse:1; > + unsigned int cr4_pke:1; > + unsigned int cr4_smap:1; > + unsigned int cr4_smep:1; > + }; > }; > > union kvm_mmu_role { > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index bb1ef0f68f8e..d8611914544a 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -4708,6 +4708,24 @@ static void paging32E_init_context(struct kvm_vcpu > *vcpu, > paging64_init_context_common(vcpu, context, PT32E_ROOT_LEVEL); > } > > +static union kvm_mmu_role > +kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu) > +{ > + union kvm_mmu_role role = {0}; > + > + role.base.access = ACC_ALL; > + role.base.cr0_wp = is_write_protection(vcpu); > + > + role.ext.cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0; > + role.ext.cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0; > + role.ext.cr4_pse = !!is_pse(vcpu); > + role.ext.cr4_pke = kvm_read_cr4_bits(vcpu, X86_CR4_PKE) != 0; > + > + role.ext.valid = 1; > + > + return role; > +} > + > static union kvm_mmu_page_role > kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu) > { > @@ -4814,16 +4832,18 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu) > } > EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu); > > -static union kvm_mmu_page_role > -kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool > accessed_dirty) > +static union kvm_mmu_role > +kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool > accessed_dirty, > +bool execonly) > { > - union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base; > + union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu); kvm_calc_mmu_role_common() doesn't preserve the current mmu_role.base and kvm_calc_mmu_role_common() doesn't capture all base fields. Won't @role will be incorrect for base fields that aren't set below, e.g. cr4_pae, smep_andnot_wp, smap_andnot_wp, etc... 
> > - role.level = PT64_ROOT_4LEVEL; > - role.direct = false; > - role.ad_disabled = !accessed_dirty; > - role.guest_mode = true; > - role.access = ACC_ALL; > + role.base.level = PT64_ROOT_4LEVEL; > + role.base.direct = false; > + role.base.ad_disabled = !accessed_dirty; > + role.base.guest_mode = true; > + > + role.ext.execonly = execonly; > > return role; > } > @@ -4832,10 +4852,16 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, > bool execonly, >bool accessed_dirty, gpa_t new_eptp) > { > struct kvm_mmu *context = vcpu->arch.mmu; > - union kvm_mmu_page_role root_page_role = > - kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty); > + union kvm_mmu_role new_role = > + kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty, > +execonly); > + > + __kvm_mmu_new_cr3(vcpu, new_eptp, new_role.base, false); > + > + new_role.base.word &= mmu_base_role_mask.word; > + if (new_role.as_u64 == context->mmu_role.as_u64) > + return; > > - __kvm_mmu_new_cr3(vcpu, new_eptp, root_page_role, false); > context->shadow_root_level = PT64_ROOT_4LEVEL; > > context->nx = true; > @@ -4847,8 +4873,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, > bool execonly, > context->update_pte = ept_update_pte; > context->root_level = PT64_ROOT_4LEVEL; > context->direct_map = false; > - context->mmu_role.base.word = > - root_page_role.word & mmu_base_role_mask.word; > + context->mmu_role.as_u64 = new_role.as_u64; > context->get_pdptr = kvm_pdptr_read; > > update_permission_bitmask(vcpu, context, true); > -- > 2.17.1 >
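If preserving the current base fields is indeed the intent, one possible tweak (sketch only; whether those fields actually matter for the shadow EPT role is exactly the open question raised above) would be to seed the base word from the live role before overriding the EPT-specific bits:

	union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu);

	/* keep whatever the live role has for base fields not set below */
	role.base = vcpu->arch.mmu->mmu_role.base;

	role.base.level       = PT64_ROOT_4LEVEL;
	role.base.direct      = false;
	role.base.ad_disabled = !accessed_dirty;
	role.base.guest_mode  = true;

	role.ext.execonly = execonly;

	return role;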
Re: [PATCH v2 8/9] x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed
On Tue, Sep 25, 2018 at 07:58:43PM +0200, Vitaly Kuznetsov wrote: > MMU reconfiguration in init_kvm_tdp_mmu()/kvm_init_shadow_mmu() can be > avoided if the source data used to configure it didn't change; enhance > kvm_mmu_scache with the required fields and consolidate common code in Nit: kvm_mmu_scache no longer exists, probably say "source cache" instead? > kvm_calc_mmu_role_common(). > > Signed-off-by: Vitaly Kuznetsov > --- > arch/x86/include/asm/kvm_host.h | 2 + > arch/x86/kvm/mmu.c | 86 +++-- > 2 files changed, 52 insertions(+), 36 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 87ddaa1579e7..609811066580 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -284,10 +284,12 @@ union kvm_mmu_extended_role { > struct { > unsigned int valid:1; > unsigned int execonly:1; > + unsigned int cr0_pg:1; > unsigned int cr4_pse:1; > unsigned int cr4_pke:1; > unsigned int cr4_smap:1; > unsigned int cr4_smep:1; > + unsigned int cr4_la57:1; > }; > }; > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index d8611914544a..f676c14d5c62 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -4709,34 +4709,40 @@ static void paging32E_init_context(struct kvm_vcpu > *vcpu, > } > > static union kvm_mmu_role > -kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu) > +kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu, bool mmu_init) > { > union kvm_mmu_role role = {0}; > > role.base.access = ACC_ALL; > + role.base.nxe = !!is_nx(vcpu); > + role.base.cr4_pae = !!is_pae(vcpu); > role.base.cr0_wp = is_write_protection(vcpu); > + role.base.smm = is_smm(vcpu); > + role.base.guest_mode = is_guest_mode(vcpu); > > + if (!mmu_init) > + return role; Can you add a comment explaining why we don't fill in role.ext when !mmu_init? Or maybe just rename mmu_init to something like base_only? >From what I can tell it's false when we only care about the base role, which just happens to be only in the non-init flow. 
> + > + role.ext.cr0_pg = !!is_paging(vcpu); > role.ext.cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0; > role.ext.cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0; > role.ext.cr4_pse = !!is_pse(vcpu); > role.ext.cr4_pke = kvm_read_cr4_bits(vcpu, X86_CR4_PKE) != 0; > + role.ext.cr4_la57 = kvm_read_cr4_bits(vcpu, X86_CR4_LA57) != 0; > > role.ext.valid = 1; > > return role; > } > > -static union kvm_mmu_page_role > -kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu) > +static union kvm_mmu_role > +kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, bool mmu_init) > { > - union kvm_mmu_page_role role = {0}; > + union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu, mmu_init); > > - role.guest_mode = is_guest_mode(vcpu); > - role.smm = is_smm(vcpu); > - role.ad_disabled = (shadow_accessed_mask == 0); > - role.level = kvm_x86_ops->get_tdp_level(vcpu); > - role.direct = true; > - role.access = ACC_ALL; > + role.base.ad_disabled = (shadow_accessed_mask == 0); > + role.base.level = kvm_x86_ops->get_tdp_level(vcpu); > + role.base.direct = true; > > return role; > } > @@ -4744,9 +4750,14 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu) > static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) > { > struct kvm_mmu *context = vcpu->arch.mmu; > + union kvm_mmu_role new_role = > + kvm_calc_tdp_mmu_root_page_role(vcpu, true); > > - context->mmu_role.base.word = mmu_base_role_mask.word & > - kvm_calc_tdp_mmu_root_page_role(vcpu).word; > + new_role.base.word &= mmu_base_role_mask.word; > + if (new_role.as_u64 == context->mmu_role.as_u64) > + return; > + > + context->mmu_role.as_u64 = new_role.as_u64; > context->page_fault = tdp_page_fault; > context->sync_page = nonpaging_sync_page; > context->invlpg = nonpaging_invlpg; > @@ -4786,29 +4797,23 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu) > reset_tdp_shadow_zero_bits_mask(vcpu, context); > } > > -static union kvm_mmu_page_role > -kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu) > -{ > - union kvm_mmu_page_role role = {0}; > - bool smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP); > - bool smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP); > - > - role.nxe = is_nx(vcpu); > - role.cr4_pae = !!is_pae(vcpu); > - role.cr0_wp = is_write_protection(vcpu); > - role.smep_andnot_wp = smep && !is_write_protection(vcpu); > - role.smap_andnot_wp = smap && !is_write_protection(vcpu); > - role.guest_mode = is_guest_mode(vcpu); > - role.smm = is_smm(vcpu); > - role.direct = !is_paging(vcpu);
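A sketch of the suggested rename, with a comment spelling out why the extended bits are skipped (base_only is the proposed parameter name, not what the patch currently uses):

	static union kvm_mmu_role
	kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu, bool base_only)
	{
		union kvm_mmu_role role = {0};

		role.base.access     = ACC_ALL;
		role.base.nxe        = !!is_nx(vcpu);
		role.base.cr4_pae    = !!is_pae(vcpu);
		role.base.cr0_wp     = is_write_protection(vcpu);
		role.base.smm        = is_smm(vcpu);
		role.base.guest_mode = is_guest_mode(vcpu);

		/*
		 * Callers that only need the base role, i.e. the flows that
		 * don't (re)configure the MMU, can skip the extended bits;
		 * role.ext stays zero and is never compared in that case.
		 */
		if (base_only)
			return role;

		role.ext.cr0_pg   = !!is_paging(vcpu);
		role.ext.cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
		role.ext.cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
		role.ext.cr4_pse  = !!is_pse(vcpu);
		role.ext.cr4_pke  = kvm_read_cr4_bits(vcpu, X86_CR4_PKE) != 0;
		role.ext.cr4_la57 = kvm_read_cr4_bits(vcpu, X86_CR4_LA57) != 0;
		role.ext.valid    = 1;

		return role;
	}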
Re: [PATCH v2 9/9] x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()
On Tue, Sep 25, 2018 at 07:58:44PM +0200, Vitaly Kuznetsov wrote: > We don't use root page role for nested_mmu, however, optimizing out > re-initialization in case nothing changed is still valuable as this > is done for every nested vmentry. > > Signed-off-by: Vitaly Kuznetsov > --- Reviewed-by: Sean Christopherson
Re: [PATCH v14 09/19] x86/mm: x86/sgx: Signal SEGV_SGXERR for #PFs w/ PF_SGX
On Tue, Sep 25, 2018 at 03:53:48PM -0700, Andy Lutomirski wrote: > Minor nit: > > On Tue, Sep 25, 2018 at 6:12 AM Jarkko Sakkinen > wrote: > > > > From: Sean Christopherson > > > > > by (c) as the kernel doesn't really have any other reasonable option, > > e.g. we could kill the task or panic, but neither is warranted. > > Not killing the task is quite nice, but... > > > + /* > > +* Access is blocked by the Enclave Page Cache Map (EPCM), > > +* i.e. the access is allowed by the PTE but not the EPCM. > > +* This usually happens when the EPCM is yanked out from > > +* under us, e.g. by hardware after a suspend/resume cycle. > > +* In any case, there is nothing that can be done by the > > +* kernel to resolve the fault (short of killing the task). > > Maybe s/killing the task/sending a signal/? My intent was to document that, unlike all other page faults, the kernel can't fix the source of the fault even if it were omniscient. How about this? With formatting changes since it's long-winded... /* * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the * access is allowed by the PTE but not the EPCM. This usually happens * when the EPCM is yanked out from under us, e.g. by hardware after a * suspend/resume cycle. In any case, software, i.e. the kernel, can't * fix the source of the fault as the EPCM can't be directly modified * by software. Handle the fault as an access error in order to signal * userspace, e.g. so that userspace can rebuild their enclave(s), even * though userspace may not have actually violated access permissions. */
Re: [PATCH v14 08/19] signal: x86/sgx: Add SIGSEGV siginfo code for SGX EPCM fault
On Tue, Sep 25, 2018 at 04:06:45PM +0300, Jarkko Sakkinen wrote: > From: Sean Christopherson > > The SGX Enclave Page Cache Map (EPCM) is a hardware-managed table > that enforces accesses to an enclave's EPC page in addition to the > software-managed kernel page tables, i.e. the effective permissions > for an EPC page are a logical AND of the kernel's page tables and > the corresponding EPCM entry. The primary purpose of the EPCM is > to prevent a malcious or compromised kernel from attacking an enclave > by modifying the enclave's page tables. The EPCM entires for an > enclave are populated when the enclave is built and verified, using > metadata provided by the enclave that is included in the measurement > used to verify the enclave. > > In normal operation of a properly functioning, non-malicious kernel > (and enclave), the EPCM permissions will never trigger a fault, i.e. > the kernel may make the permissions for an EPC page more restrictive, > e.g. mark it not-present to swap out the EPC page, but the kernel will > never make its permissions less restrictive. > > But, there is a legitimate scenario in which the kernel's page tables > can become less restrictive than the EPCM: on current hardware all > enclaves are destroyed (by hardware) on a transition to S3 or lower > sleep states, i.e. all EPCM entries are invalid (not-present) after > the system resumes from its sleep state. > > Unfortunately, on CPUs that support only SGX1, EPCM violations result > in a #GP. The upside of the #GP is that no kernel changes are needed > to deal with the EPCM being blasted away by hardware, e.g. userspace > gets a SIGSEGV, assumes the EPCM was lost and restarts its enclave > and everyone is happy. The downside is that userspace has to assume > the SIGSEGV was because the EPC was lost (or possibly do some leg work > to rule out other causes). > > In SGX2, the oddity of delivering a #GP due to what are inherently > paging related violations is remedied. CPUs that support SGX2 deliver > EPCM violations as #PFs with a new SGX error code bit set. So, now > that hardware provides us with a way to unequivocally determine that > a fault was due to a EPCM violation, define a signfo code for SIGSEGV > so that the information can be passed onto userspace. > > Cc: Dave Hansen > Signed-off-by: Sean Christopherson > Signed-off-by: Jarkko Sakkinen > --- > include/uapi/asm-generic/siginfo.h | 4 > 1 file changed, 4 insertions(+) > > diff --git a/include/uapi/asm-generic/siginfo.h > b/include/uapi/asm-generic/siginfo.h > index 80e2a7227205..fdd898e2325b 100644 > --- a/include/uapi/asm-generic/siginfo.h > +++ b/include/uapi/asm-generic/siginfo.h > @@ -225,7 +225,11 @@ typedef struct siginfo { > #else > # define SEGV_PKUERR 4 /* failed protection key checks */ > #endif > +#ifdef __x86_64__ Argh, this needs to be "#if defined(__i386__) || defined(__x86_64__)" otherwise 32-bit builds break on later patches that use SEGV_SGXERR, e.g. the errors flagged by the 0-DAY bot. > +#define SEGV_SGXERR 5 /* SGX Enclave Page Cache Map fault */ > +#else > #define SEGV_ACCADI 5 /* ADI not enabled for mapped object */ > +#endif > #define SEGV_ADIDERR 6 /* Disrupting MCD error */ > #define SEGV_ADIPERR 7 /* Precise MCD exception */ > #define NSIGSEGV 7 > -- > 2.17.1 >
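For clarity, the guard being asked for would look like this (sketch of the fixed hunk):

	#if defined(__i386__) || defined(__x86_64__)
	#define SEGV_SGXERR	5	/* SGX Enclave Page Cache Map fault */
	#else
	#define SEGV_ACCADI	5	/* ADI not enabled for mapped object */
	#endif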
Re: [PATCH v14 09/19] x86/mm: x86/sgx: Signal SEGV_SGXERR for #PFs w/ PF_SGX
On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote: > On 09/26/2018 11:12 AM, Andy Lutomirski wrote: > >> e omniscient. > >> > >> How about this? With formatting changes since it's long-winded... > >> > >>/* > >> * Access is blocked by the Enclave Page Cache Map (EPCM), i.e. the > >> * access is allowed by the PTE but not the EPCM. This usually > >> happens > >> * when the EPCM is yanked out from under us, e.g. by hardware > >> after a > >> * suspend/resume cycle. In any case, software, i.e. the kernel, > >> can't > >> * fix the source of the fault as the EPCM can't be directly > >> modified > >> * by software. Handle the fault as an access error in order to > >> signal > >> * userspace, e.g. so that userspace can rebuild their enclave(s), > >> even > >> * though userspace may not have actually violated access > >> permissions. > >> */ > >> > > Looks good to me. > > Including the actual architectural definition of the bit might add some > clarity. The SDM explicitly says (Vol 3a section 4.7): > > The fault resulted from violation of SGX-specific access-control > requirements. > > Which totally squares with returning true from access_error(). > > There's also a tidbit that says: > > This flag is 1 if the exception is unrelated to paging and > resulted from violation of SGX-specific access-control > requirements. ... such a violation can occur only if there > is no ordinary page fault... > > This is pretty important. It means that *none* of the other > paging-related stuff that we're doing applies. > > We also need to clarify how this can happen. Is it through something > than an app does, or is it solely when the hardware does something under > the covers, like suspend/resume. Are you looking for something in the changelog, the comment, or just a response? If it's the latter... On bare metal with a bug-free kernel, the only scenario I'm aware of where we'll encounter these faults is when hardware pulls the rug out from under us. In a virtualized environment all bets are off because the architecture allows VMMs to silently "destroy" the EPC at will, e.g. KVM, and I believe Hyper-V, will take advantage of this behavior to support live migration. Post migration, the destination system will generate PF_SGX because the EPC{M} can't be migrated between system, i.e. the destination EPCM sees all EPC pages as invalid.
Re: RFC: userspace exception fixups
On Wed, Nov 21, 2018 at 05:17:34PM +0200, Jarkko Sakkinen wrote: > On Wed, Nov 21, 2018 at 05:17:32AM +, Jethro Beekman wrote: > > Jarkko, can you please explain you solution in detail? The CPU receives an > > exception. This will be handled by the kernel exception handler. What > > information does the kernel exception handler use to determine whether to > > deliver the exception as a regular signal to the process, or whether to set > > the special registers values for userspace and just continue executing the > > process manually? > > Now we throw SIGSEGV when PF_SGX set, right? In my solution that would > be turned just doing iret to AEP with the extra that three registers get > exception data (type, reason, addr). No decoding or RIP adjusting > involved. > > That would mean that you would actually have to implement AEP handler > than just have enclu there. > > I've also proposed that perhaps for SGX also #UD should be propagated > this way because for some instructions you need outside help to emulate > "non-enclave" environment. And how would you determine the #UD is related to SGX? Hardware doesn't provide any indication that a #UD (or any other fault) is related to SGX or occurred in an enclave. The only fault that is special-cased in a non-virtualized environment is #PF signaled by the EPCM, which gets the PF_SGX bit set in the error code. > That is all I have drafted together so far. I'll try to finish v18 this > week with other stuff and refine further next week (unless someone gives > obvious reason why this doesn't work, which might well be because I > haven't went too deep with my analysis yet because of lack of time). > > /Jarkko
Re: [PATCH v4 4/5] lib/ioremap: Ensure phys_addr actually corresponds to a physical address
On Mon, Nov 26, 2018 at 05:07:46PM +, Will Deacon wrote: > The current ioremap() code uses a phys_addr variable at each level of > page table, which is confusingly offset by subtracting the base virtual > address being mapped so that adding the current virtual address back on > when iterating through the page table entries gives back the corresponding > physical address. > > This is fairly confusing and results in all users of phys_addr having to > add the current virtual address back on. Instead, this patch just updates > phys_addr when iterating over the page table entries, ensuring that it's > always up-to-date and doesn't require explicit offsetting. > > Cc: Chintan Pandya > Cc: Toshi Kani > Cc: Thomas Gleixner > Cc: Michal Hocko > Cc: Andrew Morton > Cc: Sean Christopherson > Signed-off-by: Will Deacon Tested-by: Sean Christopherson Reviewed-by: Sean Christopherson
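For illustration, the shape of the change at the PMD level looks roughly like the loop body below (a sketch modeled on lib/ioremap.c's ioremap_pmd_range(); helper details differ from the actual diff): phys_addr is advanced in the loop so it always corresponds to addr, instead of being pre-offset and re-added.

	do {
		next = pmd_addr_end(addr, end);

		/* phys_addr now maps 'addr' directly; no "phys_addr + addr" offsetting */
		if (ioremap_pmd_enabled() &&
		    ((next - addr) == PMD_SIZE) &&
		    IS_ALIGNED(phys_addr, PMD_SIZE) &&
		    pmd_free_pte_page(pmd, addr) &&
		    pmd_set_huge(pmd, phys_addr, prot))
			continue;

		if (ioremap_pte_range(pmd, addr, next, phys_addr, prot))
			return -ENOMEM;
	} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);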
Re: [PATCH 12/13] x86/fault: Decode page fault OOPSes better
On Mon, Nov 19, 2018 at 02:45:36PM -0800, Andy Lutomirski wrote: > One of Linus' favorite hobbies seems to be looking at OOPSes and > decoding the error code in his head. This is not one of my favorite > hobbies :) > > Teach the page fault OOPS hander to decode the error code. If it's > a !USER fault from user mode, print an explicit note to that effect > and print out the addresses of various tables that might cause such > an error. > > With this patch applied, if I intentionally point the LDT at 0x0 and > run the x86 selftests, I get: > > BUG: unable to handle kernel NULL pointer dereference at > HW error: normal kernel read fault > This was a system access from user code > IDT: 0xfe00 (limit=0xfff) GDT: 0xfe001000 (limit=0x7f) > LDTR: 0x50 -- base=0x0 limit=0xfff7 > TR: 0x40 -- base=0xfe003000 limit=0x206f > PGD 8456e067 P4D 8456e067 PUD 4623067 PMD 0 > SMP PTI > CPU: 0 PID: 153 Comm: ldt_gdt_64 Not tainted 4.19.0+ #1317 > Hardware name: ... > RIP: 0033:0x401454 > > Signed-off-by: Andy Lutomirski > --- > arch/x86/mm/fault.c | 84 + > 1 file changed, 84 insertions(+) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index 092ed6b1df8a..f34241fcc633 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -27,6 +27,7 @@ > #include /* struct vm86 > */ > #include /* vma_pkey() */ > #include /* efi_recover_from_page_fault()*/ > +#include /* store_idt(), ... > */ > > #define CREATE_TRACE_POINTS > #include > @@ -571,10 +572,53 @@ static int is_f00f_bug(struct pt_regs *regs, unsigned > long address) > return 0; > } > > +static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 > index) > +{ > + u32 offset = (index >> 3) * sizeof(struct desc_struct); > + unsigned long addr; > + struct ldttss_desc desc; > + > + if (index == 0) { > + pr_alert("%s: NULL\n", name); > + return; > + } > + > + if (offset + sizeof(struct ldttss_desc) >= gdt->size) { > + pr_alert("%s: 0x%hx -- out of bounds\n", name, index); > + return; > + } > + > + if (probe_kernel_read(&desc, (void *)(gdt->address + offset), > + sizeof(struct ldttss_desc))) { > + pr_alert("%s: 0x%hx -- GDT entry is not readable\n", > + name, index); > + return; > + } > + > + addr = desc.base0 | (desc.base1 << 16) | (desc.base2 << 24); > +#ifdef CONFIG_X86_64 > + addr |= ((u64)desc.base3 << 32); > +#endif > + pr_alert("%s: 0x%hx -- base=0x%lx limit=0x%x\n", > + name, index, addr, (desc.limit0 | (desc.limit1 << 16))); > +} > + > +static void errstr(unsigned long ec, char *buf, unsigned long mask, > +const char *txt) > +{ > + if (ec & mask) { > + if (buf[0]) > + strcat(buf, " "); > + strcat(buf, txt); > + } > +} > + > static void > show_fault_oops(struct pt_regs *regs, unsigned long error_code, > unsigned long address) > { > + char errtxt[64]; > + > if (!oops_may_print()) > return; > > @@ -602,6 +646,46 @@ show_fault_oops(struct pt_regs *regs, unsigned long > error_code, >address < PAGE_SIZE ? "NULL pointer dereference" : "paging > request", >(void *)address); > > + errtxt[0] = 0; > + errstr(error_code, errtxt, X86_PF_PROT, "PROT"); > + errstr(error_code, errtxt, X86_PF_WRITE, "WRITE"); > + errstr(error_code, errtxt, X86_PF_USER, "USER"); > + errstr(error_code, errtxt, X86_PF_RSVD, "RSVD"); > + errstr(error_code, errtxt, X86_PF_INSTR, "INSTR"); > + errstr(error_code, errtxt, X86_PF_PK, "PK"); > + pr_alert("HW error: %s\n", error_code ? errtxt : > + "normal kernel read fault"); What about something like this instead of manually handling the case where error_code==0 so that we get e.g. 
"!PROT KERNEL READ" instead of "normal kernel read fault"? Not sure !PROT and/or KERNEL are needed, but getting at least "PROT READ" seems useful. errstr(!error_code, errtxt, X86_PF_PROT, "!PROT"); errstr(!error_code, errtxt, X86_PF_USER, "KERNEL"); errstr(!error_code, errtxt, X86_PF_WRITE | X86_PF_INSTR, "READ"); And change the pr_alert to "HW error code:"? The original is confusing (to me) because "HW error: normal kernel read fault" obfuscates the fact that we're printing the #PF error code, i.e. it looks like an arbitrary kernel message. This: HW error code: !PROT KERNEL READ This was a system access from user code or: HW error code: !PROT READ This was a system access from user code or: HW error code: KERNEL READ This was a system access from user code or: HW error code
Re: [PATCH 6/5] x86/fault: Clean up the page fault oops decoder a bit
On Thu, Nov 22, 2018 at 09:41:19AM +0100, Ingo Molnar wrote: > > * Andy Lutomirski wrote: > > > One of Linus' favorite hobbies seems to be looking at OOPSes and > > decoding the error code in his head. This is not one of my favorite > > hobbies :) > > > > Teach the page fault OOPS hander to decode the error code. If it's > > a !USER fault from user mode, print an explicit note to that effect > > and print out the addresses of various tables that might cause such > > an error. > > > > With this patch applied, if I intentionally point the LDT at 0x0 and > > run the x86 selftests, I get: > > > > BUG: unable to handle kernel NULL pointer dereference at > > HW error: normal kernel read fault > > This was a system access from user code > > IDT: 0xfe00 (limit=0xfff) GDT: 0xfe001000 (limit=0x7f) > > LDTR: 0x50 -- base=0x0 limit=0xfff7 > > TR: 0x40 -- base=0xfe003000 limit=0x206f > > PGD 8456e067 P4D 8456e067 PUD 4623067 PMD 0 > > SMP PTI > > CPU: 0 PID: 153 Comm: ldt_gdt_64 Not tainted 4.19.0+ #1317 > > Hardware name: ... > > RIP: 0033:0x401454 > > I've applied your series, with one small edit, the following message: > > > HW error: normal kernel read fault > > will IMHO confuse the heck out of users, thinking that their hardware is > broken... > > Yes, the message is accurate, in MM pagefault language it's indeed the HW > error code, but it's a language very few people speak. > > So I edited it over to say '#PF error code'. I also applied a few other > minor cleanups - see the changelog below. I responded to the original thread a hair too late... What about something like this instead of manually handling the case where error_code==0 so that we get e.g. "[KERNEL] [READ]" instead of "normal kernel read fault"? Getting "[PROT] [KERNEL] [READ]" seems useful. IMO "[normal kernel read fault]" followed by "This was a system access from user code" is still confusing. --- 8b29ee4351d5c625aa9ca2765f8da5e Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 27 Nov 2018 07:09:57 -0800 Subject: [PATCH] x86/fault: Print "KERNEL" and "READ" for #PF error codes ...and explicitly state that it's a "code" that's being printed. Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Yu-cheng Yu Cc: linux-kernel@vger.kernel.org Cc: Ingo Molnar Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2ff25ad33233..510e263c256b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -660,8 +660,10 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); - - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal kernel read fault]"); + err_str_append(~error_code, err_txt, X86_PF_USER, "[KERNEL]"); + err_str_append(~error_code, err_txt, X86_PF_WRITE | X86_PF_INSTR, + "[READ]"); + pr_alert("#PF error code: %s\n", err_txt); if (!(error_code & X86_PF_USER) && user_mode(regs)) { struct desc_ptr idt, gdt; -- 2.19.2 > > Let me know if you have any objections. 
> > Thanks, > > Ingo > > ===> > From a2aa52ab16efbee40ad118ebac4a5e438f5b43ee Mon Sep 17 00:00:00 2001 > From: Ingo Molnar > Date: Thu, 22 Nov 2018 09:34:03 +0100 > Subject: [PATCH] x86/fault: Clean up the page fault oops decoder a bit > > - Make the oops messages a bit less scary (don't mention 'HW errors') > > - Turn 'PROT USER' (which is visually easily confused with PROT_USER) >into individual bit descriptors: "[PROT] [USER]". >This also makes "[normal kernel read fault]" more apparent. > > - De-abbreviate variables to make the code easier to read > > - Use vertical alignment where appropriate. > > - Add comment about string size limits and the helper function. > > - Remove unnecessary line breaks. > > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Dave Hansen > Cc: H. Peter Anvin > Cc: Linus Torvalds > Cc: Pet
Re: [PATCH] mm: warn only once if page table misaccounting is detected
On Tue, Nov 27, 2018 at 09:36:03AM +0100, Heiko Carstens wrote: > Use pr_alert_once() instead of pr_alert() if page table misaccounting > has been detected. > > If this happens once it is very likely that there will be numerous > other occurrences as well, which would flood dmesg and the console with > hardly any added information. Therefore print the warning only once. > > Cc: Kirill A. Shutemov > Cc: Martin Schwidefsky > Signed-off-by: Heiko Carstens > --- > kernel/fork.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/fork.c b/kernel/fork.c > index 07cddff89c7b..c887e9eba89f 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -647,8 +647,8 @@ static void check_mm(struct mm_struct *mm) > } > > if (mm_pgtables_bytes(mm)) > - pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n", > - mm_pgtables_bytes(mm)); > + pr_alert_once("BUG: non-zero pgtables_bytes on freeing mm: > %ld\n", > + mm_pgtables_bytes(mm)); I found the print-always behavior to be useful when developing a driver that mucked with PTEs directly via vmf_insert_pfn() and had issues with racing against exit_mmap(). It was nice to be able to recompile only the driver and rely on dmesg to let me know when I messed up yet again. Would pr_alert_ratelimited() suffice? > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS > VM_BUG_ON_MM(mm->pmd_huge_pte, mm); > -- > 2.16.4 >
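For reference, the middle ground being suggested would be (sketch); pr_alert_ratelimited() is the stock printk rate-limit wrapper, so occasional repeats still reach dmesg while floods are suppressed:

	if (mm_pgtables_bytes(mm))
		pr_alert_ratelimited("BUG: non-zero pgtables_bytes on freeing mm: %ld\n",
				     mm_pgtables_bytes(mm));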
Re: BUG: corrupted list in freeary
On Tue, Nov 27, 2018 at 10:07:53AM -0600, Eric W. Biederman wrote: > > > syzbot writes: > > > Hello, > > > > syzbot found the following crash on: > > > > HEAD commit:e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel... > > git tree: upstream > > console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a340 > > kernel config: https://syzkaller.appspot.com/x/.config?x=73e2bc0cb6463446 > > dashboard link: https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909 > > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > I don't have this commit. I don't have a clue where to find this > commit. I can't possibly look into this let alone debug this. It's in v4.20-rc4. > > That is not a useful bug report. Please fix your bot. > > Eric > >
Re: [PATCH v2] x86: modernize sync_bitops.h
On Wed, Nov 21, 2018 at 03:53:05PM +, David Laight wrote: > > > > -Original Message- > > From: Jan Beulich [mailto:jbeul...@suse.com] > > Sent: 21 November 2018 14:42 > > To: David Laight > > Cc: mi...@elte.hu; t...@linutronix.de; Boris Ostrovsky; Juergen Gross; > > linux-kernel@vger.kernel.org; > > h...@zytor.com > > Subject: RE: [PATCH v2] x86: modernize sync_bitops.h > > > > >>> On 21.11.18 at 14:49, wrote: > > > From: Jan Beulich > > >> Sent: 21 November 2018 13:03 > > >> > > >> >>> On 21.11.18 at 12:55, wrote: > > >> > From: Jan Beulich > > >> >> Sent: 21 November 2018 10:11 > > >> >> > > >> >> Add missing insn suffixes and use rmwcc.h just like was (more or less) > > >> >> recently done for bitops.h as well. > > >> > > > >> > Why? bts (etc) on memory don't really have an 'operand size'. > > >> > > >> Of course they do - depending on operand size they operate on > > >> 2-, 4-, or 8-byte quantities. When the second operand is a > > >> register, the suffix is redundant (but doesn't hurt), but when > > >> the second operand is an immediate, the assembler (in AT&T > > >> syntax) has no way of knowing what operand size you mean. > > > > > > You need to RTFM. > > > > Excuse me? How about you look at this table from the SDM > > (format of course comes out better in the .pdf): > > > > 0F AB /r BTS r/m16, r16 MR Valid Valid Store selected bit in CF flag and > > set. > > 0F AB /r BTS r/m32, r32 MR Valid Valid Store selected bit in CF flag and > > set. > > REX.W + 0F AB /r BTS r/m64, r64 MR Valid N.E. Store selected bit in CF flag > > and set. > > 0F BA /5 ib BTS r/m16, imm8 MI Valid Valid Store selected bit in CF flag > > and set. > > 0F BA /5 ib BTS r/m32, imm8 MI Valid Valid Store selected bit in CF flag > > and set. > > REX.W + 0F BA /5 ib BTS r/m64, imm8 MI Valid N.E. Store selected bit in CF > > flag and set. > > > > Please read manuals yourself before making such statements. > > > > > Regardless of the 'operand size' the 'bit' instructions do a 32 bit > > > aligned > > > 32 bit wide read/modify/write cycle. > > > > > > The 'operand size' does affect whether the bit number (which is signed) > > > comes from %cl (8 bits), %cx (16 bits), %rcx (32 bits) or (%ecx) 64 bits. > > > But that is implicit in the register name used. > > > > There is no form with %cl as operand. Instead there are forms with > > an immediate operand. > > See also section 3.1.1.9 (in the copy of 325462.pdf I have): > (edited out references to the figures) > > Bit(BitBase, BitOffset) — Returns the value of a bit within a bit string. >The bit string is a sequence of bits in memory or a register. Bits are >numbered from low-order to high-order within registers and within memory >bytes. If the BitBase is a register, the BitOffset can be in the range 0 >to [15, 31, 63] depending on the mode and register size. > >If BitBase is a memory address, the BitOffset can range has different > ranges >depending on the operand size (see Table 3-2). > >The addressed bit is numbered (Offset MOD 8) within the byte at address >(BitBase + (BitOffset DIV 8)) where DIV is signed division with rounding >towards negative infinity and MOD returns a positive number. > > Table 3-2. 
Range of Bit Positions Specified by Bit Offset Operands > Operand Size Immediate BitOffset Register BitOffset > 16 0 to 15 2**15 to 2**15 − 1 > 32 0 to 31 2**31 to 2**31 − 1 > 64 0 to 63 2**63 to 2**63 − 1 > > For Bit test it says: > > When accessing a bit in memory, the processor may access 4 bytes starting from > the memory address for a 32-bit operand size, using by the following > relationship: > Effective Address + (4 ∗ (BitOffset DIV 32)) > Or, it may access 2 bytes starting from the memory address for a 16-bit > operand, using this relationship: > Effective Address + (2 ∗ (BitOffset DIV 16)) > It may do so even when only a single byte needs to be accessed to reach the > given bit. This agrees with what Jan is saying, e.g. "4 bytes ... for a 32-bit op size" and "2 bytes ... for a 16-bit op size". For whatever reason the 64-bit op size is omitted, but the normal behavior applies. > > (That is the text I remember being in the old docs, but I don't think it > used to refer to the address being aligned. > My 8086, 286 and 386 books are all at home.) > > The other instructions (eg btc) say: > > If the bit base operand specifies a memory location, the operand represents > the > address of the byte in memory that contains the bit base (bit 0 of the > specified byte) > of the bit string. The range of the bit position that can be > referenced by the offset operand depends on the operand size. > > So the 'operand size' gives the size of the bit offset, not the size of > the memory cycle. No, it determines both. Here's rough pseudocode for how the op size affects what memory is read. Note that there isn't alignment per se, b
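A self-contained sketch of the addressing described in the SDM excerpts above, for a memory operand with a register bit offset (the immediate-offset case is narrower and not shown):

	#include <stdint.h>
	#include <stdio.h>

	/*
	 * Sketch of how BTS with a memory operand and a register bit offset
	 * picks the bytes it may touch: the operand size sets both the signed
	 * range of the bit offset and the width of the access.
	 */
	static uint8_t *bts_effective_addr(uint8_t *base, int64_t bitoff,
					   unsigned opsize_bits, unsigned *bit)
	{
		/* signed division rounding toward negative infinity */
		int64_t word = bitoff / (int64_t)opsize_bits;
		int64_t rem  = bitoff % (int64_t)opsize_bits;

		if (rem < 0) {
			word -= 1;
			rem  += opsize_bits;
		}

		*bit = (unsigned)rem;
		return base + word * (opsize_bits / 8);
	}

	int main(void)
	{
		uint8_t buf[16] = { 0 };
		unsigned bit;

		/* "btsl" style: 32-bit op size, bit 37 lands in bytes 4..7 */
		uint8_t *p = bts_effective_addr(&buf[8], 37, 32, &bit);
		printf("offset from base: %td, bit within word: %u\n", p - &buf[8], bit);

		/* a negative register offset reaches below the base address */
		p = bts_effective_addr(&buf[8], -1, 32, &bit);
		printf("offset from base: %td, bit within word: %u\n", p - &buf[8], bit);
		return 0;
	}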
Re: RFC: userspace exception fixups
On Tue, Nov 20, 2018 at 12:11:33PM +0200, Jarkko Sakkinen wrote: > On Mon, Nov 19, 2018 at 09:00:08AM -0800, Andy Lutomirski wrote: > > On Mon, Nov 19, 2018 at 8:02 AM Jarkko Sakkinen > > wrote: > > > > > > On Mon, Nov 19, 2018 at 07:29:36AM -0800, Andy Lutomirski wrote: > > > > 1. The kernel needs some way to know *when* to apply this fixup. > > > > Decoding the instruction stream and doing it to all exceptions that > > > > hit an ENCLU instruction seems like a poor design. > > > > > > I'm not sure why you would ever need to do any type of fixup as the idea > > > is to just return to AEP i.e. from chosen exceptions (EPCM, #UD) the AEP > > > would work the same way as for exceptions that the kernel can deal with > > > except filling the exception information to registers. > > > > Sure, but how does the kernel know when to do that and when to send a > > signal? I don't really like decoding the instruction stream to figure > > it out. > > Hmm... why you have to decode instruction stream to find that out? Would > just depend on exception type (#GP with EPCM, #UD). #PF w/ PFEC_SGX is the only exception that indicates a fault is related to SGX. Theoretically we could avoid decoding by using a magic value for the AEP itself and doing even more magic fixup, but that wouldn't help for faults that occur on EENTER, which can be generic #GPs due to loss of EPC on SGX1 systems. > Or are you saying > that kernel should need to SIGSEGV if there is in fact ENCLU so that > there is no infinite trap loop? Sorry, I'm a bit lost here that where > does this decoding requirement comes from in the first place. I > understand how it is used in Sean's proposal... > > Anyway, this option can be probably discarded without further > consideration because apparently single stepping can cause #DB SS fault > if AEP handler is anything else than a single instruction. Not that it matters, but we could satisfy the "one instruction" requirement if the fixup changed RIP to point at an ENCLU for #DBs. > For me it seems that by ruling out options, vDSO option is what is > left. I don't like it but at least it works... > > /Jarkko
Re: [PATCH v17 03/23] x86/cpufeatures: Add SGX sub-features (as Linux-defined bits)
On Fri, Nov 16, 2018 at 03:37:15PM +0100, Borislav Petkov wrote: > On Fri, Nov 16, 2018 at 03:01:10AM +0200, Jarkko Sakkinen wrote: > > From: Sean Christopherson > > > > CPUID_12_EAX is an Intel-defined feature bits leaf dedicated for SGX > > that enumerates the SGX instruction sets that are supported by the > > CPU, e.g. SGX1, SGX2, etc... Because Linux currently only cares about > > two bits (SGX1 and SGX2) and there are currently only four documented > > bits in total, relocate the bits to Linux-defined word 8 to conserve > > space. > > > > But, keep the bit positions identical between the Intel-defined value > > and the Linux-defined value, e.g. keep SGX1 at bit 0. This allows KVM > > to use its existing code for probing guest CPUID bits using Linux's > > X86_FEATURE_* definitions. To do so, shift around some existing bits > > to effectively reserve bits 0-7 of word 8 for SGX sub-features. > > > > Signed-off-by: Sean Christopherson > > Signed-off-by: Jarkko Sakkinen > > --- > > arch/x86/include/asm/cpufeatures.h | 21 +++-- > > arch/x86/kernel/cpu/scattered.c | 2 ++ > > tools/arch/x86/include/asm/cpufeatures.h | 21 +++-- > > 3 files changed, 32 insertions(+), 12 deletions(-) > > > > diff --git a/arch/x86/include/asm/cpufeatures.h > > b/arch/x86/include/asm/cpufeatures.h > > index da7fed4939a3..afdf5f2e13b5 100644 > > --- a/arch/x86/include/asm/cpufeatures.h > > +++ b/arch/x86/include/asm/cpufeatures.h > > @@ -222,12 +222,21 @@ > > #define X86_FEATURE_L1TF_PTEINV( 7*32+29) /* "" L1TF > > workaround PTE inversion */ > > #define X86_FEATURE_IBRS_ENHANCED ( 7*32+30) /* Enhanced IBRS */ > > > > -/* Virtualization flags: Linux defined, word 8 */ > > -#define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow > > */ > > -#define X86_FEATURE_VNMI ( 8*32+ 1) /* Intel Virtual NMI */ > > -#define X86_FEATURE_FLEXPRIORITY ( 8*32+ 2) /* Intel FlexPriority */ > > -#define X86_FEATURE_EPT( 8*32+ 3) /* Intel Extended > > Page Table */ > > -#define X86_FEATURE_VPID ( 8*32+ 4) /* Intel Virtual Processor > > ID */ > > +/* > > + * Scattered Intel features: Linux defined, word 8. > > + * > > + * Note that the bit location of the SGX features is meaningful as KVM > > expects > > + * the Linux defined bit to match the Intel defined bit, e.g. > > X86_FEATURE_SGX1 > > + * must remain at bit 0, SGX2 at bit 1, etc... > > + */ > > +#define X86_FEATURE_SGX1 ( 8*32+ 0) /* SGX1 leaf functions */ > > +#define X86_FEATURE_SGX2 ( 8*32+ 1) /* SGX2 leaf functions */ > > + > > Yeah, add here > > /* Bits [0:7] are reserved for SGX */ > > or so, so that people don't use those. Once CPUID(12) gets more bits > added to it, I don't see anything wrong with allocating a separate leaf > for that. > > BUT(!), the question then is whether kvm would still be ok with that? > I'm thinking yes, as it will simply use the new definitions, or? Yep, wouldn't be a problem for KVM.
Re: [PATCH v17 13/23] x86/msr: Add SGX Launch Control MSR definitions
On Fri, Nov 16, 2018 at 03:01:20AM +0200, Jarkko Sakkinen wrote: > From: Sean Christopherson > > Add a new IA32_FEATURE_CONTROL bit, SGX_LE_WR. Introducing SGX_LE_WR needs to land before patch 06/23, which references the flag when updating feature bits.
Re: [PATCH RFC] selftests/x86: Add a selftest for SGX
On Tue, Nov 13, 2018 at 11:40:09PM +0200, Jarkko Sakkinen wrote: > Add a selftest for SGX. It is a trivial test where a simple enclave > copies one 64-bit word of memory between two memory locations given to > the enclave as arguments. > > Signed-off-by: Jarkko Sakkinen > --- > +SUBDIRS_64 := sgx > +ASSERT(!DEFINED(.altinstructions), "ALTERNATIVES are not supported in the > SGX LE") > +ASSERT(!DEFINED(.altinstr_replacement), "ALTERNATIVES are not supported in > the SGX LE") > +ASSERT(!DEFINED(.discard.retpoline_safe), "RETPOLINE ALTERNATIVES are not > supported in the SGX LE") > +ASSERT(!DEFINED(.discard.nospec), "RETPOLINE ALTERNATIVES are not supported > in the SGX LE") Maybe this? s/LE/Test Enclave > diff --git a/tools/testing/selftests/x86/sgx/encl_bootstrap.S > b/tools/testing/selftests/x86/sgx/encl_bootstrap.S > new file mode 100644 > index ..62251c7d9927 > --- /dev/null > +++ b/tools/testing/selftests/x86/sgx/encl_bootstrap.S > @@ -0,0 +1,94 @@ > +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */ > +/* > + * Copyright(c) 2016-18 Intel Corporation. > + */ > + > + .macro ENCLU > + .byte 0x0f, 0x01, 0xd7 > + .endm > + > + .section ".tcs", "a" > + .balign 4096 > + > + .fill 1, 8, 0 # STATE (set by CPU) > + .fill 1, 8, 0 # FLAGS > + .long encl_ssa# OSSA Any reason not to do .quad for OSSA and OENTRY? > + .fill 1, 4, 0 > + .fill 1, 4, 0 # CSSA (set by CPU) > + .fill 1, 4, 1 # NSSA > + .long encl_entry # OENTRY > + .fill 1, 4, 0 > + .fill 1, 8, 0 # AEP (set by EENTER and ERESUME) > + .fill 1, 8, 0 # OFSBASE > + .fill 1, 8, 0 # OGSBASE > + .fill 1, 4, 0x# FSLIMIT > + .fill 1, 4, 0x# GSLIMIT > + .fill 503, 8, 0 # Reserved I'd prefer to do 1-byte fill with a size of 4024 to match the SDM. > + > + .text > + > +encl_entry: > + # %rbx contains the base address for TCS, which is also the first > + # address inside the enclave. By adding $le_stack_end to it, we get the > + # absolute address for the stack. > + lea (encl_stack)(%rbx), %rax > + xchg%rsp, %rax > + push%rax > + > + push%rcx # push the address after EENTER > + push%rbx # push the enclave base address > + > + callencl_body > + > + pop %rbx # pop the enclave base address > + > + # Restore XSAVE registers to a synthetic state. > + mov $0x, %rax > + mov $0x, %rdx > + lea (xsave_area)(%rbx), %rdi > + fxrstor (%rdi) > + > + # Clear GPRs > + xor %rcx, %rcx > + xor %rdx, %rdx > + xor %rdi, %rdi > + xor %rsi, %rsi > + xor %r8, %r8 > + xor %r9, %r9 > + xor %r10, %r10 > + xor %r11, %r11 > + xor %r12, %r12 > + xor %r13, %r13 > + xor %r14, %r14 > + xor %r15, %r15 > + > + # Reset status flags > + add %rdx, %rdx # OF = SF = AF = CF = 0; ZF = PF = 1 > + > + pop %rbx # pop the address after EENTER Probably worth expanding the comment to explain that ENCLU[EEXIT] takes the target address via %rbx, i.e. we're "returning" from the EENTER "call". > + > + # Restore the caller stack. > + pop %rax > + mov %rax, %rsp > + > + # EEXIT > + mov $4, %rax > + enclu > + > + .section ".data", "aw" > + > +encl_ssa: > + .space 4096 > + > +xsave_area: > + .fill 1, 4, 0x037F# FCW > + .fill 5, 4, 0 > + .fill 1, 4, 0x1F80# MXCSR > + .fill 1, 4, 0x# MXCSR_MASK > + .fill 123, 4, 0 > + .fill 1, 4, 0x8000# XCOMP_BV[63] = 1, compaction mode > + .fill 12, 4, 0 > + > + .balign 4096 > + .space 8192 > +encl_stack:
[PATCH 1/2] x86/vdso: Remove obsolete "fake section table" reservation
At one point the vDSO image was manually stripped down by vdso2c in an attempt to minimize the size of the image mapped into userspace. Part of that stripping process involved building a fake section table so as not to break userspace processes that parse the section table. Memory for the fake section table was reserved in the .rodata section so that vdso2c could simply copy the entire PT_LOAD segment into the userspace image after building the fake table. Eventually, the entire fake section table approach was dropped in favor of stripping the vdso "the old fashioned way", i.e. via objdump -S. But, the reservation in .rodata for the fake table was left behind. Remove the reserveration along with a few other related defines and section entries. Removing the fake section table placeholder zaps a whopping 0x340 bytes from the 64-bit vDSO image, which drops the current image's size to under 4k, i.e. reduces the effective size of the userspace vDSO mapping by a full page. Fixes: da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") Cc: Andy Lutomirski Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/vdso-layout.lds.S | 22 -- arch/x86/entry/vdso/vdso2c.c | 6 -- 2 files changed, 28 deletions(-) diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index acfd5ba7d943..0cedc905c8d6 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -7,16 +7,6 @@ * This script controls its layout. */ -#if defined(BUILD_VDSO64) -# define SHDR_SIZE 64 -#elif defined(BUILD_VDSO32) || defined(BUILD_VDSOX32) -# define SHDR_SIZE 40 -#else -# error unknown VDSO target -#endif - -#define NUM_FAKE_SHDRS 13 - SECTIONS { /* @@ -60,20 +50,8 @@ SECTIONS *(.bss*) *(.dynbss*) *(.gnu.linkonce.b.*) - - /* -* Ideally this would live in a C file, but that won't -* work cleanly for x32 until we start building the x32 -* C code using an x32 toolchain. -*/ - VDSO_FAKE_SECTION_TABLE_START = .; - . = . + NUM_FAKE_SHDRS * SHDR_SIZE; - VDSO_FAKE_SECTION_TABLE_END = .; } :text - .fake_shstrtab : { *(.fake_shstrtab) } :text - - .note : { *(.note.*) }:text :note .eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 4674f58581a1..2479a454b15c 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -98,12 +98,6 @@ struct vdso_sym required_syms[] = { [sym_hpet_page] = {"hpet_page", true}, [sym_pvclock_page] = {"pvclock_page", true}, [sym_hvclock_page] = {"hvclock_page", true}, - [sym_VDSO_FAKE_SECTION_TABLE_START] = { - "VDSO_FAKE_SECTION_TABLE_START", false - }, - [sym_VDSO_FAKE_SECTION_TABLE_END] = { - "VDSO_FAKE_SECTION_TABLE_END", false - }, {"VDSO32_NOTE_MASK", true}, {"__kernel_vsyscall", true}, {"__kernel_sigreturn", true}, -- 2.19.2
[PATCH 2/2] x86/vdso: Remove a stale/misleading comment from the linker script
Once upon a time, vdso2c aggressively stripped data from the vDSO image when generating the final userspace image. This included stripping the .altinstructions and .altinstr_replacement sections. Eventually, the stripping process reverted to "objdump -S" and no longer removed the aforementioned sections, but the comment remained. Keeping the .alt* sections at the end of the PT_LOAD segment is no longer necessary, but there's no harm in doing so and it's a helpful reminder that they don't need to be included in the final vDSO image, i.e. someone may want to take another stab at zapping/stripping the unneeded sections. Fixes: da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") Cc: Andy Lutomirski Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/vdso-layout.lds.S | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index 0cedc905c8d6..93c6dc7812d0 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -65,11 +65,6 @@ SECTIONS .text : { *(.text*) } :text =0x90909090, - /* -* At the end so that eu-elflint stays happy when vdso2c strips -* these. A better implementation would avoid allocating space -* for these. -*/ .altinstructions: { *(.altinstructions) } :text .altinstr_replacement : { *(.altinstr_replacement) } :text -- 2.19.2
[PATCH 0/2] x86/vdso: Remove remnants of the fake section table
Remove pieces of the vDSO's fake section table mechanism that were left behind when the vDSO build process reverted to using "objdump -S" to strip the userspace image. One of the removed pieces is a 0x340 byte reservation (for 64-bit) in the .rodata section. Trimming that fat drops the current image's size to under 4k, i.e. reduces the effective size of the userspace vDSO mapping by a full page. Sean Christopherson (2): x86/vdso: Remove obsolete "fake section table" reservation x86/vdso: Remove a stale/misleading comment from the linker script arch/x86/entry/vdso/vdso-layout.lds.S | 27 --- arch/x86/entry/vdso/vdso2c.c | 6 -- 2 files changed, 33 deletions(-) -- 2.19.2
Re: [PATCH 1/2] x86/vdso: Remove obsolete "fake section table" reservation
On Tue, Dec 04, 2018 at 08:17:40AM -0800, Sean Christopherson wrote: > At one point the vDSO image was manually stripped down by vdso2c in an > attempt to minimize the size of the image mapped into userspace. Part > of that stripping process involved building a fake section table so as > not to break userspace processes that parse the section table. Memory > for the fake section table was reserved in the .rodata section so that > vdso2c could simply copy the entire PT_LOAD segment into the userspace > image after building the fake table. > > Eventually, the entire fake section table approach was dropped in favor > of stripping the vdso "the old fashioned way", i.e. via objdump -S. > But, the reservation in .rodata for the fake table was left behind. > Remove the reserveration along with a few other related defines and > section entries. > > Removing the fake section table placeholder zaps a whopping 0x340 bytes > from the 64-bit vDSO image, which drops the current image's size to > under 4k, i.e. reduces the effective size of the userspace vDSO mapping > by a full page. > > Fixes: da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") > Cc: Andy Lutomirski > Signed-off-by: Sean Christopherson > --- > diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c > index 4674f58581a1..2479a454b15c 100644 > --- a/arch/x86/entry/vdso/vdso2c.c > +++ b/arch/x86/entry/vdso/vdso2c.c > @@ -98,12 +98,6 @@ struct vdso_sym required_syms[] = { > [sym_hpet_page] = {"hpet_page", true}, > [sym_pvclock_page] = {"pvclock_page", true}, > [sym_hvclock_page] = {"hvclock_page", true}, > - [sym_VDSO_FAKE_SECTION_TABLE_START] = { > - "VDSO_FAKE_SECTION_TABLE_START", false > - }, > - [sym_VDSO_FAKE_SECTION_TABLE_END] = { > - "VDSO_FAKE_SECTION_TABLE_END", false > - }, Doh, I missed removing the definitions for sym_VDSO_FAKE_SECTION_TABLE_*. > {"VDSO32_NOTE_MASK", true}, > {"__kernel_vsyscall", true}, > {"__kernel_sigreturn", true}, > -- > 2.19.2 >
Re: [PATCH 1/2] x86/vdso: Remove obsolete "fake section table" reservation
On Tue, Dec 04, 2018 at 10:22:39AM -0800, Sean Christopherson wrote: > On Tue, Dec 04, 2018 at 08:17:40AM -0800, Sean Christopherson wrote: > > At one point the vDSO image was manually stripped down by vdso2c in an > > attempt to minimize the size of the image mapped into userspace. Part > > of that stripping process involved building a fake section table so as > > not to break userspace processes that parse the section table. Memory > > for the fake section table was reserved in the .rodata section so that > > vdso2c could simply copy the entire PT_LOAD segment into the userspace > > image after building the fake table. > > > > Eventually, the entire fake section table approach was dropped in favor > > of stripping the vdso "the old fashioned way", i.e. via objdump -S. > > But, the reservation in .rodata for the fake table was left behind. > > Remove the reserveration along with a few other related defines and > > section entries. > > > > Removing the fake section table placeholder zaps a whopping 0x340 bytes > > from the 64-bit vDSO image, which drops the current image's size to > > under 4k, i.e. reduces the effective size of the userspace vDSO mapping > > by a full page. > > > > Fixes: da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") > > Cc: Andy Lutomirski > > Signed-off-by: Sean Christopherson > > --- > > diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c > > index 4674f58581a1..2479a454b15c 100644 > > --- a/arch/x86/entry/vdso/vdso2c.c > > +++ b/arch/x86/entry/vdso/vdso2c.c > > @@ -98,12 +98,6 @@ struct vdso_sym required_syms[] = { > > [sym_hpet_page] = {"hpet_page", true}, > > [sym_pvclock_page] = {"pvclock_page", true}, > > [sym_hvclock_page] = {"hvclock_page", true}, > > - [sym_VDSO_FAKE_SECTION_TABLE_START] = { > > - "VDSO_FAKE_SECTION_TABLE_START", false > > - }, > > - [sym_VDSO_FAKE_SECTION_TABLE_END] = { > > - "VDSO_FAKE_SECTION_TABLE_END", false > > - }, > > Doh, I missed removing the definitions for sym_VDSO_FAKE_SECTION_TABLE_*. And with sym_VDSO_FAKE_SECTION_TABLE_* gone all symbols are exported, meaning required_syms can be a char* array and struct vdso_sym can be removed. > > {"VDSO32_NOTE_MASK", true}, > > {"__kernel_vsyscall", true}, > > {"__kernel_sigreturn", true}, > > -- > > 2.19.2 > >
Re: [PATCH 1/2] x86/vdso: Remove obsolete "fake section table" reservation
On Tue, Dec 04, 2018 at 10:58:51AM -0800, Andy Lutomirski wrote: > On Tue, Dec 4, 2018 at 10:29 AM Sean Christopherson > wrote: > > > > On Tue, Dec 04, 2018 at 10:22:39AM -0800, Sean Christopherson wrote: > > > On Tue, Dec 04, 2018 at 08:17:40AM -0800, Sean Christopherson wrote: > > > > At one point the vDSO image was manually stripped down by vdso2c in an > > > > attempt to minimize the size of the image mapped into userspace. Part > > > > of that stripping process involved building a fake section table so as > > > > not to break userspace processes that parse the section table. Memory > > > > for the fake section table was reserved in the .rodata section so that > > > > vdso2c could simply copy the entire PT_LOAD segment into the userspace > > > > image after building the fake table. > > > > > > > > Eventually, the entire fake section table approach was dropped in favor > > > > of stripping the vdso "the old fashioned way", i.e. via objdump -S. > > > > But, the reservation in .rodata for the fake table was left behind. > > > > Remove the reserveration along with a few other related defines and > > > > section entries. > > > > > > > > Removing the fake section table placeholder zaps a whopping 0x340 bytes > > > > from the 64-bit vDSO image, which drops the current image's size to > > > > under 4k, i.e. reduces the effective size of the userspace vDSO mapping > > > > by a full page. > > > > > > > > Fixes: da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") > > > > Cc: Andy Lutomirski > > > > Signed-off-by: Sean Christopherson > > > > --- > > > > diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c > > > > index 4674f58581a1..2479a454b15c 100644 > > > > --- a/arch/x86/entry/vdso/vdso2c.c > > > > +++ b/arch/x86/entry/vdso/vdso2c.c > > > > @@ -98,12 +98,6 @@ struct vdso_sym required_syms[] = { > > > > [sym_hpet_page] = {"hpet_page", true}, > > > > [sym_pvclock_page] = {"pvclock_page", true}, > > > > [sym_hvclock_page] = {"hvclock_page", true}, > > > > - [sym_VDSO_FAKE_SECTION_TABLE_START] = { > > > > - "VDSO_FAKE_SECTION_TABLE_START", false > > > > - }, > > > > - [sym_VDSO_FAKE_SECTION_TABLE_END] = { > > > > - "VDSO_FAKE_SECTION_TABLE_END", false > > > > - }, > > > > > > Doh, I missed removing the definitions for sym_VDSO_FAKE_SECTION_TABLE_*. > > > > And with sym_VDSO_FAKE_SECTION_TABLE_* gone all symbols are exported, > > meaning required_syms can be a char* array and struct vdso_sym can be > > removed. > > I bet that we'll want that field to parse out the extable entries once > all the dust settles, though. Hmm, the extable stuff will be sections, a la .altinstructions, I don't think we'll need separate symbols. What if I send out a RFC for the extable stuff in parallel to this series and to the SGX series? It'd also be nice to iterate on that code without having to spin a full SGX series.
Re: [PATCH 6/5] x86/fault: Clean up the page fault oops decoder a bit
On Tue, Dec 04, 2018 at 11:22:25AM -0800, Andy Lutomirski wrote: > On Tue, Nov 27, 2018 at 7:32 AM Sean Christopherson > wrote: > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 2ff25ad33233..510e263c256b 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -660,8 +660,10 @@ show_fault_oops(struct pt_regs *regs, unsigned long > > error_code, unsigned long ad > > err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); > > err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); > > err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); > > - > > - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal kernel > > read fault]"); > > + err_str_append(~error_code, err_txt, X86_PF_USER, "[KERNEL]"); > > + err_str_append(~error_code, err_txt, X86_PF_WRITE | X86_PF_INSTR, > > + "[READ]"); > > + pr_alert("#PF error code: %s\n", err_txt); > > > > Seems generally nice, but I would suggest making the bit-not-set name > be another parameter to err_str_append(). I'm also slightly uneasy > about making "KERNEL" look like a bit, but I guess it doesn't bother > me too much. What about "SUPERVISOR" instead of "KERNEL"? It'd be consistent with the SDM and hopefully less likely to be misconstrued as something else. > Want to send a real patch? Will do.
Re: [PATCH 6/5] x86/fault: Clean up the page fault oops decoder a bit
On Tue, Dec 04, 2018 at 11:47:10AM -0800, Andy Lutomirski wrote: > On Tue, Dec 4, 2018 at 11:34 AM Sean Christopherson > wrote: > > > > On Tue, Dec 04, 2018 at 11:22:25AM -0800, Andy Lutomirski wrote: > > > On Tue, Nov 27, 2018 at 7:32 AM Sean Christopherson > > > wrote: > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > > > index 2ff25ad33233..510e263c256b 100644 > > > > --- a/arch/x86/mm/fault.c > > > > +++ b/arch/x86/mm/fault.c > > > > @@ -660,8 +660,10 @@ show_fault_oops(struct pt_regs *regs, unsigned > > > > long error_code, unsigned long ad > > > > err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); > > > > err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); > > > > err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); > > > > - > > > > - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal > > > > kernel read fault]"); > > > > + err_str_append(~error_code, err_txt, X86_PF_USER, "[KERNEL]"); > > > > + err_str_append(~error_code, err_txt, X86_PF_WRITE | > > > > X86_PF_INSTR, > > > > + "[READ]"); > > > > + pr_alert("#PF error code: %s\n", err_txt); > > > > > > > > > > Seems generally nice, but I would suggest making the bit-not-set name > > > be another parameter to err_str_append(). I'm also slightly uneasy > > > about making "KERNEL" look like a bit, but I guess it doesn't bother > > > me too much. > > > > What about "SUPERVISOR" instead of "KERNEL"? It'd be consistent with > > the SDM and hopefully less likely to be misconstrued as something else. > > Or even just [!USER], perhaps. I thought about that too, but the pedant in me didn't like the inconsistency of doing "READ" instead of "[!WRITE] [!INSTR]", and IMO "READ" is a lot more readable (no pun intended). I also like having completely different text, makes it harder to miss a single "!" and go down the wrong path.
[PATCH v2 2/4] x86/vdso: Remove a stale/misleading comment from the linker script
Once upon a time, vdso2c aggressively stripped data from the vDSO image when generating the final userspace image. This included stripping the .altinstructions and .altinstr_replacement sections. Eventually, the stripping process reverted to "objdump -S" and no longer removed the aforementioned sections, but the comment remained. Keeping the .alt* sections at the end of the PT_LOAD segment is no longer necessary, but there's no harm in doing so and it's a helpful reminder that they don't need to be included in the final vDSO image, i.e. someone may want to take another stab at zapping/stripping the unneeded sections. Fixes: da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") Cc: Andy Lutomirski Signed-off-by: Sean Christopherson Acked-by: Andy Lutomirski --- arch/x86/entry/vdso/vdso-layout.lds.S | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index 0cedc905c8d6..93c6dc7812d0 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -65,11 +65,6 @@ SECTIONS .text : { *(.text*) } :text =0x90909090, - /* -* At the end so that eu-elflint stays happy when vdso2c strips -* these. A better implementation would avoid allocating space -* for these. -*/ .altinstructions: { *(.altinstructions) } :text .altinstr_replacement : { *(.altinstr_replacement) } :text -- 2.19.2
[PATCH v2 4/4] x86/vdso: Rename "required_syms" to "requested_syms"
The "required" moniker implies that vdso2c will fail if one of the defined symbols is not found, which is simply not true, e.g. VDSO32_NOTE_MASK is quite obviously 32-bit only and not required for the 64-bit image. Cc: Andy Lutomirski Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/vdso2c.c | 4 ++-- arch/x86/entry/vdso/vdso2c.h | 12 ++-- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index a175cd2016c9..b2acdaffc66d 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -85,7 +85,7 @@ const int special_pages[] = { sym_hvclock_page, }; -const char *required_syms[] = { +const char *requested_syms[] = { [sym_vvar_start] = "vvar_start", [sym_vvar_page] = "vvar_page", [sym_hpet_page] = "hpet_page", @@ -139,7 +139,7 @@ extern void bad_put_le(void); PLE(x, val, 64, PLE(x, val, 32, PLE(x, val, 16, LAST_PLE(x, val -#define NSYMS ARRAY_SIZE(required_syms) +#define NSYMS ARRAY_SIZE(requested_syms) #define BITSFUNC3(name, bits, suffix) name##bits##suffix #define BITSFUNC2(name, bits, suffix) BITSFUNC3(name, bits, suffix) diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h index 14003d311298..6c5a290ce468 100644 --- a/arch/x86/entry/vdso/vdso2c.h +++ b/arch/x86/entry/vdso/vdso2c.h @@ -97,10 +97,10 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, GET_LE(&sym->st_name); for (k = 0; k < NSYMS; k++) { - if (!strcmp(name, required_syms[k])) { + if (!strcmp(name, requested_syms[k])) { if (syms[k]) { fail("duplicate symbol %s\n", -required_syms[k]); +requested_syms[k]); } /* @@ -123,12 +123,12 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, if (symval % 4096) fail("%s must be a multiple of 4096\n", -required_syms[i]); +requested_syms[i]); if (symval + 4096 < syms[sym_vvar_start]) - fail("%s underruns vvar_start\n", required_syms[i]); + fail("%s underruns vvar_start\n", requested_syms[i]); if (symval + 4096 > 0) fail("%s is on the wrong side of the vdso text\n", -required_syms[i]); +requested_syms[i]); } if (syms[sym_vvar_start] % 4096) fail("vvar_begin must be a multiple of 4096\n"); @@ -168,7 +168,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, for (i = 0; i < NSYMS; i++) { if (syms[i]) fprintf(outfile, "\t.sym_%s = %" PRIi64 ",\n", - required_syms[i], (int64_t)syms[i]); + requested_syms[i], (int64_t)syms[i]); } fprintf(outfile, "};\n"); } -- 2.19.2
[PATCH v2 1/4] x86/vdso: Remove obsolete "fake section table" reservation
At one point the vDSO image was manually stripped down by vdso2c in an attempt to minimize the size of the image mapped into userspace. Part of that stripping process involved building a fake section table so as not to break userspace processes that parse the section table. Memory for the fake section table was reserved in the .rodata section so that vdso2c could simply copy the entire PT_LOAD segment into the userspace image after building the fake table. Eventually, the entire fake section table approach was dropped in favor of stripping the vdso "the old fashioned way", i.e. via objdump -S. But, the reservation in .rodata for the fake table was left behind. Remove the reserveration along with a few other related defines and section entries. Removing the fake section table placeholder zaps a whopping 0x340 bytes from the 64-bit vDSO image, which drops the current image's size to under 4k, i.e. reduces the effective size of the userspace vDSO mapping by a full page. Fixes: da861e18eccc ("x86, vdso: Get rid of the fake section mechanism") Cc: Andy Lutomirski Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/vdso-layout.lds.S | 22 -- arch/x86/entry/vdso/vdso2c.c | 8 2 files changed, 30 deletions(-) diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index acfd5ba7d943..0cedc905c8d6 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -7,16 +7,6 @@ * This script controls its layout. */ -#if defined(BUILD_VDSO64) -# define SHDR_SIZE 64 -#elif defined(BUILD_VDSO32) || defined(BUILD_VDSOX32) -# define SHDR_SIZE 40 -#else -# error unknown VDSO target -#endif - -#define NUM_FAKE_SHDRS 13 - SECTIONS { /* @@ -60,20 +50,8 @@ SECTIONS *(.bss*) *(.dynbss*) *(.gnu.linkonce.b.*) - - /* -* Ideally this would live in a C file, but that won't -* work cleanly for x32 until we start building the x32 -* C code using an x32 toolchain. -*/ - VDSO_FAKE_SECTION_TABLE_START = .; - . = . + NUM_FAKE_SHDRS * SHDR_SIZE; - VDSO_FAKE_SECTION_TABLE_END = .; } :text - .fake_shstrtab : { *(.fake_shstrtab) } :text - - .note : { *(.note.*) }:text :note .eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 4674f58581a1..8e470b018512 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -76,8 +76,6 @@ enum { sym_hpet_page, sym_pvclock_page, sym_hvclock_page, - sym_VDSO_FAKE_SECTION_TABLE_START, - sym_VDSO_FAKE_SECTION_TABLE_END, }; const int special_pages[] = { @@ -98,12 +96,6 @@ struct vdso_sym required_syms[] = { [sym_hpet_page] = {"hpet_page", true}, [sym_pvclock_page] = {"pvclock_page", true}, [sym_hvclock_page] = {"hvclock_page", true}, - [sym_VDSO_FAKE_SECTION_TABLE_START] = { - "VDSO_FAKE_SECTION_TABLE_START", false - }, - [sym_VDSO_FAKE_SECTION_TABLE_END] = { - "VDSO_FAKE_SECTION_TABLE_END", false - }, {"VDSO32_NOTE_MASK", true}, {"__kernel_vsyscall", true}, {"__kernel_sigreturn", true}, -- 2.19.2
[PATCH v2 3/4] x86/vdso: Remove struct vdso_sym and its associated export option
...now that all required symbols are exported by vdso2c. Cc: Andy Lutomirski Signed-off-by: Sean Christopherson --- Regarding Andy's concern that we might want the exported flag in the future for exception fixup, I prototyped a few approaches and in the end we always need to know at least the base of the original extable so that we have an anchor point for the relative IPs. Technically it's possible to omit the length/end, but doing so requires making either vdso2c or the kernel code more complex than it needs to be. So, regardless of whether we use symbols or sections for extable, I'm fairly confident we'll want to export the result to the .c file. arch/x86/entry/vdso/vdso2c.c | 27 +++ arch/x86/entry/vdso/vdso2c.h | 15 +++ 2 files changed, 18 insertions(+), 24 deletions(-) diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 8e470b018512..a175cd2016c9 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -85,22 +85,17 @@ const int special_pages[] = { sym_hvclock_page, }; -struct vdso_sym { - const char *name; - bool export; -}; - -struct vdso_sym required_syms[] = { - [sym_vvar_start] = {"vvar_start", true}, - [sym_vvar_page] = {"vvar_page", true}, - [sym_hpet_page] = {"hpet_page", true}, - [sym_pvclock_page] = {"pvclock_page", true}, - [sym_hvclock_page] = {"hvclock_page", true}, - {"VDSO32_NOTE_MASK", true}, - {"__kernel_vsyscall", true}, - {"__kernel_sigreturn", true}, - {"__kernel_rt_sigreturn", true}, - {"int80_landing_pad", true}, +const char *required_syms[] = { + [sym_vvar_start] = "vvar_start", + [sym_vvar_page] = "vvar_page", + [sym_hpet_page] = "hpet_page", + [sym_pvclock_page] = "pvclock_page", + [sym_hvclock_page] = "hvclock_page", + "VDSO32_NOTE_MASK", + "__kernel_vsyscall", + "__kernel_sigreturn", + "__kernel_rt_sigreturn", + "int80_landing_pad", }; __attribute__((format(printf, 1, 2))) __attribute__((noreturn)) diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h index fa847a620f40..14003d311298 100644 --- a/arch/x86/entry/vdso/vdso2c.h +++ b/arch/x86/entry/vdso/vdso2c.h @@ -97,10 +97,10 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, GET_LE(&sym->st_name); for (k = 0; k < NSYMS; k++) { - if (!strcmp(name, required_syms[k].name)) { + if (!strcmp(name, required_syms[k])) { if (syms[k]) { fail("duplicate symbol %s\n", -required_syms[k].name); +required_syms[k]); } /* @@ -123,13 +123,12 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, if (symval % 4096) fail("%s must be a multiple of 4096\n", -required_syms[i].name); +required_syms[i]); if (symval + 4096 < syms[sym_vvar_start]) - fail("%s underruns vvar_start\n", -required_syms[i].name); + fail("%s underruns vvar_start\n", required_syms[i]); if (symval + 4096 > 0) fail("%s is on the wrong side of the vdso text\n", -required_syms[i].name); +required_syms[i]); } if (syms[sym_vvar_start] % 4096) fail("vvar_begin must be a multiple of 4096\n"); @@ -167,9 +166,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len, (unsigned long)GET_LE(&alt_sec->sh_size)); } for (i = 0; i < NSYMS; i++) { - if (required_syms[i].export && syms[i]) + if (syms[i]) fprintf(outfile, "\t.sym_%s = %" PRIi64 ",\n", - required_syms[i].name, (int64_t)syms[i]); + required_syms[i], (int64_t)syms[i]); } fprintf(outfile, "};\n"); } -- 2.19.2
[PATCH v2 0/4] x86/vdso: Remove remnants of the fake section table
Remove pieces of the vDSO's fake section table mechanism that were left behind when the vDSO build process reverted to using "objdump -S" to strip the userspace image. One of the removed pieces is a 0x340 byte reservation (for 64-bit) in the .rodata section. Trimming that fat drops the current image's size to under 4k, i.e. reduces the effective size of the userspace vDSO mapping by a full page. v1->v2: - Remove the definition of sym_VDSO_FAKE_SECTION_TABLE_* - Add patches 3/4 and 4/4 for additional cleanup Sean Christopherson (4): x86/vdso: Remove obsolete "fake section table" reservation x86/vdso: Remove a stale/misleading comment from the linker script x86/vdso: Remove struct vdso_sym and its associated export option x86/vdso: Rename "required_syms" to "requested_syms" arch/x86/entry/vdso/vdso-layout.lds.S | 27 --- arch/x86/entry/vdso/vdso2c.c | 37 +-- arch/x86/entry/vdso/vdso2c.h | 15 +-- 3 files changed, 19 insertions(+), 60 deletions(-) -- 2.19.2
Re: [PATCH 6/5] x86/fault: Clean up the page fault oops decoder a bit
On Tue, Dec 04, 2018 at 11:22:25AM -0800, Andy Lutomirski wrote: > On Tue, Nov 27, 2018 at 7:32 AM Sean Christopherson > wrote: > > arch/x86/mm/fault.c | 6 -- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index 2ff25ad33233..510e263c256b 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -660,8 +660,10 @@ show_fault_oops(struct pt_regs *regs, unsigned long > > error_code, unsigned long ad > > err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); > > err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); > > err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); > > - > > - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal kernel > > read fault]"); > > + err_str_append(~error_code, err_txt, X86_PF_USER, "[KERNEL]"); > > + err_str_append(~error_code, err_txt, X86_PF_WRITE | X86_PF_INSTR, > > + "[READ]"); > > + pr_alert("#PF error code: %s\n", err_txt); > > > > Seems generally nice, but I would suggest making the bit-not-set name > be another parameter to err_str_append(). I didn't recall why I chose to negate error_code until I revisited the actual code. The "READ" case is a combination of !WRITE && !USER, i.e. doesn't fit into an existing err_str_append() call. So we'd end up with an extra err_str_append() call that would also have a null message for the positive test, which seemed unnecessarily complex and more convoluted than simply negating error_code. E.g.: diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2ff25ad33233..48b420621825 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -607,12 +607,17 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index) * This helper function transforms the #PF error_code bits into * "[PROT] [USER]" type of descriptive, almost human-readable error strings: */ -static void err_str_append(unsigned long error_code, char *buf, unsigned long mask, const char *txt) +static void err_str_append(unsigned long error_code, char *buf, unsigned long mask, + const char *pos, const char *neg) { - if (error_code & mask) { + if ((error_code & mask) == mask && pos) { if (buf[0]) strcat(buf, " "); - strcat(buf, txt); + strcat(buf, pos); + } else if (!(error_code & mask) && neg) { + if (buf[0]) + strcat(buf, " "); + strcat(buf, neg); } } @@ -654,14 +659,15 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad * Note: length of these appended strings including the separation space and the * zero delimiter must fit into err_txt[]. */ - err_str_append(error_code, err_txt, X86_PF_PROT, "[PROT]" ); - err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]"); - err_str_append(error_code, err_txt, X86_PF_USER, "[USER]" ); - err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); - err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); - err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); - - pr_alert("#PF error: %s\n", error_code ? 
err_txt : "[normal kernel read fault]"); + err_str_append(error_code, err_txt, X86_PF_PROT, "[PROT]" , NULL); + err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]", NULL); + err_str_append(error_code, err_txt, X86_PF_USER, "[USER]" , "[SUPERVISOR]"); + err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" , NULL); + err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]", NULL); + err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" , NULL); + err_str_append(error_code, err_txt, X86_PF_WRITE | X86_PF_INSTR, NULL, + "[READ]"); + pr_alert("#PF error code: %s\n", err_txt); if (!(error_code & X86_PF_USER) && user_mode(regs)) { struct desc_ptr idt, gdt;
[PATCH] x86/fault: Print "SUPERVISOR" and "READ" when decoding #PF oops
...instead of manually handling the case where error_code=0, e.g. to display "[SUPERVISOR] [READ]" instead of "normal kernel read fault". This makes the zero case consistent with all other messages and also provides additional information for other error code combinations, e.g. error_code==1 will display "[PROT] [SUPERVISOR] [READ]" instead of simply "[PROT]". Print unique names for the negative cases as opposed to e.g. "[!USER]" to avoid mixups due to users missing a single "!" character, and to be more concise for the !INSTR && !WRITE case. Print "SUPERVISOR" in favor of "KERNEL" to reduce the likelihood that the message is misinterpreted as a generic kernel/software error and to be consistent with the SDM's nomenclature. An alternative to passing a negated error code to err_str_append() would be to expand err_str_append() to take a second string for the negative test, but that approach complicates handling the "[READ]" case, which looks for !INSTR && !WRITE, e.g. it would require an extra call to err_str_append() and logic in err_str_append() to allow null messages for both the positive and negative tests. Printing "[INSTR] [READ]" wouldn't be the end of the world, but a little bit of trickery in the kernel is a relatively small price to pay in exchange for the ability to unequivocally know the access type by reading a single word. Now that all components of the message use the [] format, explicitly state that it's the error *code* that's being printed and group the err_str_append() calls by type so that the resulting print messages are consistent, e.g. the deciphered codes will always be: [PROT] [USER|SUPERVISOR] [WRITE|INSTR|READ] [RSDV] [PK] Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Yu-cheng Yu Cc: linux-kernel@vger.kernel.org Cc: Ingo Molnar Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2ff25ad33233..0b4ce5d2b461 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -609,7 +609,7 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index) */ static void err_str_append(unsigned long error_code, char *buf, unsigned long mask, const char *txt) { - if (error_code & mask) { + if ((error_code & mask) == mask) { if (buf[0]) strcat(buf, " "); strcat(buf, txt); @@ -655,13 +655,16 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad * zero delimiter must fit into err_txt[]. */ err_str_append(error_code, err_txt, X86_PF_PROT, "[PROT]" ); - err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]"); err_str_append(error_code, err_txt, X86_PF_USER, "[USER]" ); - err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); + err_str_append(~error_code, err_txt, X86_PF_USER, "[SUPERVISOR]"); + err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]"); err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); + err_str_append(~error_code, err_txt, X86_PF_WRITE | X86_PF_INSTR, + "[READ]"); + err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal kernel read fault]"); + pr_alert("#PF error code: %s\n", err_txt); if (!(error_code & X86_PF_USER) && user_mode(regs)) { struct desc_ptr idt, gdt; -- 2.19.2
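To see what the decoded strings end up looking like, below is a small
standalone userspace harness that mirrors the err_str_append() logic from the
patch above. The X86_PF_* values are the architectural error code bit
positions; the harness itself is only an illustration and is not part of the
patch.

#include <stdio.h>
#include <string.h>

#define X86_PF_PROT	0x1
#define X86_PF_WRITE	0x2
#define X86_PF_USER	0x4
#define X86_PF_RSVD	0x8
#define X86_PF_INSTR	0x10
#define X86_PF_PK	0x20

/* Mirrors err_str_append(): append txt iff all bits in mask are set. */
static void err_str_append(unsigned long error_code, char *buf,
			   unsigned long mask, const char *txt)
{
	if ((error_code & mask) == mask) {
		if (buf[0])
			strcat(buf, " ");
		strcat(buf, txt);
	}
}

static void decode(unsigned long error_code)
{
	char err_txt[64] = "";

	err_str_append(error_code, err_txt, X86_PF_PROT,  "[PROT]");
	err_str_append(error_code, err_txt, X86_PF_USER,  "[USER]");
	err_str_append(~error_code, err_txt, X86_PF_USER, "[SUPERVISOR]");
	err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]");
	err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]");
	err_str_append(~error_code, err_txt, X86_PF_WRITE | X86_PF_INSTR,
		       "[READ]");
	err_str_append(error_code, err_txt, X86_PF_RSVD,  "[RSVD]");
	err_str_append(error_code, err_txt, X86_PF_PK,    "[PK]");

	printf("#PF error code(%04lx): %s\n", error_code, err_txt);
}

int main(void)
{
	decode(0x00);	/* [SUPERVISOR] [READ] */
	decode(0x01);	/* [PROT] [SUPERVISOR] [READ] */
	decode(0x07);	/* [PROT] [USER] [WRITE] */
	decode(0x15);	/* [PROT] [USER] [INSTR] */
	return 0;
}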
[RFC PATCH 4/4] x86/vdso: Add __vdso_sgx_eenter() to wrap SGX enclave transitions
Intel Software Guard Extensions (SGX) introduces a new CPL3-only enclave mode that runs as a sort of black box shared object that is hosted by an untrusted normal CPL3 process.

Enclave transitions have semantics that are a lovely blend of SYSCALL, SYSRET and VM-Exit. In a non-faulting scenario, entering and exiting an enclave can only be done through SGX-specific instructions, EENTER and EEXIT respectively. EENTER+EEXIT is analogous to SYSCALL+SYSRET, e.g. EENTER/SYSCALL load RCX with the next RIP and EEXIT/SYSRET load RIP from R{B,C}X.

But in a faulting/interrupting scenario, enclave transitions act more like VM-Exit and VMRESUME. Maintaining the black box nature of the enclave means that hardware must automatically switch CPU context when an Asynchronous Exiting Event (AEE) occurs, an AEE being any interrupt or exception (exceptions are AEEs because asynchronous in this context is relative to the enclave and not CPU execution, e.g. the enclave doesn't get an opportunity to save/fuzz CPU state).

Like VM-Exits, all AEEs jump to a common location, referred to as the Asynchronous Exiting Point (AEP). The AEP is specified at enclave entry via a register passed to EENTER/ERESUME, similar to how the hypervisor specifies the VM-Exit point (via VMCS.HOST_RIP at VMLAUNCH/VMRESUME). Resuming the enclave/VM after the exiting event is handled is done via ERESUME/VMRESUME respectively. In SGX, for AEEs that are handled by the kernel, e.g. INTR, NMI and most page faults, IRET will journey back to the AEP, which then ERESUMEs the enclave.

Enclaves also behave a bit like VMs in the sense that they can generate exceptions as part of their normal operation that for all intents and purposes need to be handled in the enclave/VM. However, unlike VMX, SGX doesn't allow the host to modify its guest's, a.k.a. enclave's, state, as doing so would circumvent the enclave's security. So to handle an exception, the enclave must first be re-entered through the normal EENTER flow (SYSCALL/SYSRET behavior), and then resumed via ERESUME (VMRESUME behavior) after the source of the exception is resolved.

All of the above is just the tip of the iceberg when it comes to running an enclave. But, SGX was designed in such a way that the host process can utilize a library to build, launch and run an enclave. This is roughly analogous to how e.g. libc implementations are used by most applications so that the application can focus on its business logic.

The big gotcha is that because enclaves can generate *and* handle exceptions, any SGX library must be prepared to handle nearly any exception at any time (well, any time a thread is executing in an enclave). In Linux, this means the SGX library must register a signal handler in order to intercept relevant exceptions and forward them to the enclave (or in some cases, take action on behalf of the enclave). Unfortunately, Linux's signal mechanism doesn't mesh well with libraries, e.g. signal handlers are process wide, are difficult to chain, etc... This becomes particularly nasty when using multiple levels of libraries that register signal handlers, e.g. running an enclave via cgo inside of the Go runtime.

In comes vDSO to save the day. Now that vDSO can fixup exceptions, add a function to wrap enclave transitions and intercept any exceptions that occur in the enclave or on EENTER/ERESUME. The actual code is blissfully short (especially compared to this changelog).

In addition to the obvious trapnr, error_code and address, propagate the leaf number, i.e.
RAX, back to userspace so that the caller can know whether the fault occurred in the enclave or if it occurred on EENTER. A fault on EENTER generally means the enclave has died and needs to be restarted. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/Makefile | 1 + arch/x86/entry/vdso/vdso.lds.S| 1 + arch/x86/entry/vdso/vsgx_eenter.c | 108 ++ 3 files changed, 110 insertions(+) create mode 100644 arch/x86/entry/vdso/vsgx_eenter.c diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index eb543ee1bcec..ba46673076bd 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -18,6 +18,7 @@ VDSO32-$(CONFIG_IA32_EMULATION) := y # files to link into the vdso vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o +vobjs-$(VDSO64-y) += vsgx_eenter.o # files to link into kernel obj-y += vma.o extable.o diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce4cfa9..e422c4454f34 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,7 @@ VERSION { __vdso_getcpu; time; __vdso_time; + __vdso_sgx_eenter; local: *;
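As a rough illustration of the signal-free flow from the caller's side, the
sketch below resolves __vdso_sgx_eenter() via the parse_vdso.c helpers from the
kernel's vDSO selftests and reports a fault inline. The signature matches the
prototype quoted later in this thread; the sgx_eenter_fault_info layout, the
non-zero-return-on-fault convention and the version string are assumptions made
for illustration, not the proposed ABI.

#include <stdint.h>
#include <stdio.h>
#include <sys/auxv.h>

/* Helpers from the kernel's tools/testing/selftests/vDSO/parse_vdso.c. */
extern void vdso_init_from_sysinfo_ehdr(uintptr_t base);
extern void *vdso_sym(const char *version, const char *name);

/* Assumed layout, based on the changelog: leaf, trapnr, error_code, address. */
struct sgx_eenter_fault_info {
	uint32_t leaf;
	uint32_t trapnr;
	uint32_t error_code;
	uint64_t address;
};

typedef long (*sgx_eenter_t)(void *tcs, void *priv,
			     struct sgx_eenter_fault_info *fault_info);

static int run_enclave(void *tcs, void *priv)
{
	struct sgx_eenter_fault_info fault_info = { 0 };
	sgx_eenter_t eenter;
	long ret;

	vdso_init_from_sysinfo_ehdr(getauxval(AT_SYSINFO_EHDR));
	eenter = (sgx_eenter_t)vdso_sym("LINUX_2.6", "__vdso_sgx_eenter");
	if (!eenter)
		return -1;

	ret = eenter(tcs, priv, &fault_info);
	if (ret) {
		/*
		 * Exception fixup fired: no signal was delivered, the fault
		 * details came back inline instead (assumed convention).
		 */
		fprintf(stderr, "enclave fault: leaf=%u trap=%u err=%#x addr=%#llx\n",
			fault_info.leaf, fault_info.trapnr, fault_info.error_code,
			(unsigned long long)fault_info.address);
		return -1;
	}

	return 0;
}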
[RFC PATCH 0/4] x86: Add vDSO exception fixup for SGX
First things first, this RFC is not intended to address whether or not vDSO is the appropriate method for supporting SGX enclaves, but rather its purpose is to hammer out the vDSO implementation *if* we decide it is the best approach. Though the code technically stands alone, the intent is to roll it into the main SGX series once it has been beat on for a bit. For that reason there are no dependencies on CONFIG_SGX (which doesn't exist) or any other bits that one might expect. The quick and dirty is that for all intents and purposes, the SGX architecture requires userspace to gracefully handle exceptions. The vast majority of enclaves are expected to leverage a library to handle much of the plumbing. Unfortunately for us (them?), putting two and two together means SGX libraries would need to handle most signals on behalf of the enclave's application, which is far from ideal. One proposed solution for supporting SGX without requiring signals is to wrap enclave transitions in a vDSO function so that SGX exceptions can be intercepted via exception fixup and returned inline to the caller. This RFC series adds exception fixup and SGX support to the vDSO. Sean Christopherson (4): x86/vdso: Add support for exception fixup in vDSO functions x86/fault: Attempt to fixup unhandled #PF in vDSO before signaling x86/traps: Attempt to fixup exceptions in vDSO before signaling x86/vdso: Add __vdso_sgx_eenter() to wrap SGX enclave transitions arch/x86/entry/vdso/Makefile | 5 +- arch/x86/entry/vdso/extable.c | 37 + arch/x86/entry/vdso/extable.h | 17 arch/x86/entry/vdso/vdso-layout.lds.S | 9 ++- arch/x86/entry/vdso/vdso.lds.S| 1 + arch/x86/entry/vdso/vdso2c.h | 58 -- arch/x86/entry/vdso/vsgx_eenter.c | 108 ++ arch/x86/include/asm/vdso.h | 5 ++ arch/x86/kernel/traps.c | 11 +++ arch/x86/mm/fault.c | 7 ++ 10 files changed, 247 insertions(+), 11 deletions(-) create mode 100644 arch/x86/entry/vdso/extable.c create mode 100644 arch/x86/entry/vdso/extable.h create mode 100644 arch/x86/entry/vdso/vsgx_eenter.c -- 2.19.2
[RFC PATCH 1/4] x86/vdso: Add support for exception fixup in vDSO functions
The basic concept and implementation is very similar to the kernel's exception fixup mechanism. The key differences are that the kernel handler is hardcoded and the fixup entry addresses are relative to the overall table as opposed to individual entries. Hardcoding the kernel handler avoids the need to figure out how to get userspace code to point at a kernel function. Given that the expected usage is to propagate information to userspace, dumping all fault information into registers is likely the desired behavior for the vast majority of yet-to-be-created functions. Use registers DI, SI and DX to communicate fault information, which follows Linux's ABI for register consumption and hopefully avoids conflict with hardware features that might leverage the fixup capabilities, e.g. register usage for SGX instructions was at least partially designed with calling conventions in mind. Making fixup addresses relative to the overall table allows the table to be stripped from the final vDSO image (it's a kernel construct) without complicating the offset logic, e.g. entry-relative addressing would also need to account for the table's location relative to the image. Regarding stripping the table, modify vdso2c to extract the table from the raw, a.k.a. unstripped, data and dump it as a standalone byte array in the resulting .c file. The original base of the table, its length and a pointer to the byte array are captured in struct vdso_image. Alternatively, the table could be dumped directly into the struct, but because the number of entries can vary per image, that would require either hardcoding a max sized table into the struct definition or defining the table as a flexible length array. The flexible length array approach has zero benefits, e.g. the base/size are still needed, and prevents reusing the extraction code, while hardcoding the max size adds ongoing maintenance just to avoid exporting the explicit size. The immediate use case is for Intel Software Guard Extensions (SGX). SGX introduces a new CPL3-only "enclave" mode that runs as a sort of black box shared object that is hosted by an untrusted "normal" CPl3 process. Entering an enclave can only be done through SGX-specific instructions, EENTER and ERESUME, and is a non-trivial process. Because of the complexity of transitioning to/from an enclave, the vast majority of enclaves are expected to utilize a library to handle the actual transitions. This is roughly analogous to how e.g. libc implementations are used by most applications. Another crucial characteristic of SGX enclaves is that they can generate exceptions as part of their normal (at least as "normal" as SGX can be) operation that need to be handled *in* the enclave and/or are unique to SGX. And because they are essentially fancy shared objects, a process can host any number of enclaves, each of which can execute multiple threads simultaneously. Putting everything together, userspace enclaves will utilize a library that must be prepared to handle any and (almost) all exceptions any time at least one thread may be executing in an enclave. Leveraging signals to handle the enclave exceptions is unpleasant, to put it mildly, e.g. the SGX library must constantly (un)register its signal handler based on whether or not at least one thread is executing in an enclave, and filter and forward exceptions that aren't related to its enclaves. This becomes particularly nasty when using multiple levels of libraries that register signal handlers, e.g. 
running an enclave via cgo inside of the Go runtime. Enabling exception fixup in vDSO allows the kernel to provide a vDSO function that wraps the low-level transitions to/from the enclave, i.e. the EENTER and ERESUME instructions. The vDSO function can intercept exceptions that would otherwise generate a signal and return the fault information directly to its caller, thus avoiding the need to juggle signal handlers. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/Makefile | 4 +- arch/x86/entry/vdso/extable.c | 37 + arch/x86/entry/vdso/extable.h | 17 arch/x86/entry/vdso/vdso-layout.lds.S | 9 - arch/x86/entry/vdso/vdso2c.h | 58 +++ arch/x86/include/asm/vdso.h | 5 +++ 6 files changed, 119 insertions(+), 11 deletions(-) create mode 100644 arch/x86/entry/vdso/extable.c create mode 100644 arch/x86/entry/vdso/extable.h diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index 141d415a8c80..eb543ee1bcec 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -20,7 +20,7 @@ VDSO32-$(CONFIG_IA32_EMULATION) := y vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o # files to link into kernel -obj-y += v
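For readers who want to picture the fixup path described above, below is a
sketch of how such a handler could look: walk the table captured in struct
vdso_image, match the faulting RIP against table-relative offsets, redirect RIP
to the fixup target and stuff the fault information into DI, SI and DX. The
fixup_vdso_exception() signature matches the later patches in this series; the
exception table entry layout and the vdso_image field names (extable,
extable_base, extable_len) are assumptions made for illustration, not quoted
from the patch.

#include <linux/sched.h>
#include <linux/mm.h>
#include <asm/ptrace.h>
#include <asm/vdso.h>

/* Assumed entry layout: both offsets relative to the table itself. */
struct vdso_exception_table_entry {
	int insn, fixup;
};

bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
			  unsigned long error_code, unsigned long fault_addr)
{
	const struct vdso_image *image = current->mm->context.vdso_image;
	const struct vdso_exception_table_entry *extable;
	unsigned int nr_entries, i;
	unsigned long base;

	if (!current->mm->context.vdso)
		return false;

	/* Rebase the stripped table onto the live vDSO mapping. */
	base = (unsigned long)current->mm->context.vdso + image->extable_base;
	nr_entries = image->extable_len / sizeof(*extable);
	extable = image->extable;

	for (i = 0; i < nr_entries; i++) {
		if (regs->ip == base + extable[i].insn) {
			regs->ip = base + extable[i].fixup;
			regs->di = trapnr;
			regs->si = error_code;
			regs->dx = fault_addr;
			return true;
		}
	}

	return false;
}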
[RFC PATCH 2/4] x86/fault: Attempt to fixup unhandled #PF in vDSO before signaling
Call fixup_vdso_exception() in the SIGSEGV and SIGBUS paths to attempt to fixup page faults in vDSO. This allows vDSO functions to utilize exception fixup to report unhandled page faults directly to userspace for specific flows in lieu of generating a signal. In the SIGSEGV flow, make sure to call fixup_vdso_exception() after the error code has been sanitized. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2ff25ad33233..ff1cb10858d5 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -28,6 +28,7 @@ #include/* vma_pkey() */ #include/* efi_recover_from_page_fault()*/ #include /* store_idt(), ... */ +#include /* fixup_vdso_exception() */ #define CREATE_TRACE_POINTS #include @@ -928,6 +929,9 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, if (address >= TASK_SIZE_MAX) error_code |= X86_PF_PROT; + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address)) + return; + if (likely(show_unhandled_signals)) show_signal_msg(regs, error_code, address, tsk); @@ -1045,6 +1049,9 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, if (is_prefetch(regs, error_code, address)) return; + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address)) + return; + set_signal_archinfo(address, error_code); #ifdef CONFIG_MEMORY_FAILURE -- 2.19.2
[RFC PATCH 3/4] x86/traps: Attempt to fixup exceptions in vDSO before signaling
Call fixup_vdso_exception() to attempt to fixup exceptions in vDSO code before generating a signal for userspace faults. This allows vDSO functions to utilize exception fixup to report unhandled exceptions directly to userspace for specific flows in lieu of generating a signal. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/kernel/traps.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 9b7c4ca8f0a7..106a6cc20ad6 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -61,6 +61,7 @@ #include #include #include +#include #ifdef CONFIG_X86_64 #include @@ -223,6 +224,10 @@ do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str, tsk->thread.error_code = error_code; tsk->thread.trap_nr = trapnr; + if (user_mode(regs) && + fixup_vdso_exception(regs, trapnr, error_code, 0)) + return 0; + return -1; } @@ -563,6 +568,9 @@ do_general_protection(struct pt_regs *regs, long error_code) tsk->thread.error_code = error_code; tsk->thread.trap_nr = X86_TRAP_GP; + if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0)) + return; + show_signal(tsk, SIGSEGV, "", desc, regs, error_code); force_sig(SIGSEGV, tsk); @@ -854,6 +862,9 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr) if (!si_code) return; + if (fixup_vdso_exception(regs, trapnr, error_code, 0)) + return; + force_sig_fault(SIGFPE, si_code, (void __user *)uprobe_get_trap_addr(regs), task); } -- 2.19.2
Re: [RFC PATCH 4/4] x86/vdso: Add __vdso_sgx_eenter() to wrap SGX enclave transitions
On Wed, Dec 05, 2018 at 03:40:48PM -0800, Andy Lutomirski wrote: > On Wed, Dec 5, 2018 at 3:20 PM Sean Christopherson > wrote: > > +notrace long __vdso_sgx_eenter(void *tcs, void *priv, > > + struct sgx_eenter_fault_info *fault_info) > > +{ > > + u32 trapnr, error_code; > > + long leaf; > > + u64 addr; > > + > > + /* > > +* %eax = EENTER > > +* %rbx = tcs > > +* %rcx = do_eresume > > +* %rdi = priv > > +* do_eenter: > > +* enclu > > +* jmp out > > +* > > +* do_eresume: > > +* enclu > > +* ud2 > > Is the only reason for do_eresume to be different from do_eenter so > that you can do the ud2? No, it was a holdover from doing fixup via a magic prefix in user code. The fixup could only skip the ENCLU and so a second ENCLU was needed to differentiate between EENTER and ERESUME. The need for two ENCLUs got ingrained in my head. I can't think of anything that will break if we use a single ENCLU. > > +* > > +* out: > > +* > > +* > > +* fault_fixup: > > +* > > +* jmp out > > +*/ > > This has the IMO excellent property that it's extremely awkward to use > it for a model where the enclave is reentrant. I think it's excellent > because reentrancy on the same enclave thread is just asking for > severe bugs. Of course, I fully expect the SDK to emulate reentrancy, > but then it's 100% their problem :) On the fiip side, it means that > you can't really recover from a reported fault, even if you want to, > because there's no way to ask for ERESUME. So maybe the API should > allow that after all. Doh. The ability to do ERESUME is an explicit requirement from the SDK folks. More code that I pulled from my userspace implementation and didn't revisit. > I think it might be polite to at least give some out regs, maybe RSI and RDI? For the outbound path? I was thinking @priv would be used for passing data out as well as in.
Re: [RFC PATCH 4/4] x86/vdso: Add __vdso_sgx_eenter() to wrap SGX enclave transitions
On Thu, Dec 06, 2018 at 05:55:47AM -0800, Sean Christopherson wrote: > On Wed, Dec 05, 2018 at 03:40:48PM -0800, Andy Lutomirski wrote: > > On Wed, Dec 5, 2018 at 3:20 PM Sean Christopherson > > wrote: > > > +notrace long __vdso_sgx_eenter(void *tcs, void *priv, > > > + struct sgx_eenter_fault_info *fault_info) > > > +{ > > > + u32 trapnr, error_code; > > > + long leaf; > > > + u64 addr; > > > + > > > + /* > > > +* %eax = EENTER > > > +* %rbx = tcs > > > +* %rcx = do_eresume > > > +* %rdi = priv > > > +* do_eenter: > > > +* enclu > > > +* jmp out > > > +* > > > +* do_eresume: > > > +* enclu > > > +* ud2 > > > > Is the only reason for do_eresume to be different from do_eenter so > > that you can do the ud2? > > No, it was a holdover from doing fixup via a magic prefix in user code. > The fixup could only skip the ENCLU and so a second ENCLU was needed to > differentiate between EENTER and ERESUME. The need for two ENCLUs got > ingrained in my head. I can't think of anything that will break if we > use a single ENCLU. > > > > +* > > > +* out: > > > +* > > > +* > > > +* fault_fixup: > > > +* > > > +* jmp out > > > +*/ > > > > This has the IMO excellent property that it's extremely awkward to use > > it for a model where the enclave is reentrant. I think it's excellent > > because reentrancy on the same enclave thread is just asking for > > severe bugs. Of course, I fully expect the SDK to emulate reentrancy, > > but then it's 100% their problem :) On the fiip side, it means that > > you can't really recover from a reported fault, even if you want to, > > because there's no way to ask for ERESUME. So maybe the API should > > allow that after all. > > Doh. The ability to do ERESUME is an explicit requirement from the SDK > folks. More code that I pulled from my userspace implementation and > didn't revisit. Is it ok to add a separate exported function for ERESUME? ERESUME can't explicitly pass anything to the enclave, i.e. doesn't need a @priv param. A separate function is a little prettier, e.g.: static inline long vdso_enter_enclave(enum sgx_enclu_leaf op, void *tcs, void *priv, struct sgx_eenter_fault_info *fault_info) { ... } notrace long __vdso_sgx_eenter(void *tcs, void *priv, struct sgx_eenter_fault_info *fault_info) { return vdso_enter_enclave(SGX_EENTER, tcs, priv, fault_info); } notrace long __vdso_sgx_eresume(void *tcs, struct sgx_eenter_fault_info *fault_info) { return vdso_enter_enclave(SGX_ERESUME, tcs, NULL, fault_info); }
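Building on the two entry points proposed above, a possible caller-side loop
might look like the sketch below. It assumes a non-zero return means exception
fixup fired and the fault info is valid; handle_enclave_fault() is a
hypothetical, runtime-specific hook standing in for whatever the SDK does to
emulate in-enclave exception handling.

#include <stdbool.h>

struct sgx_eenter_fault_info;

/* Prototypes as proposed above. */
long __vdso_sgx_eenter(void *tcs, void *priv,
		       struct sgx_eenter_fault_info *fault_info);
long __vdso_sgx_eresume(void *tcs, struct sgx_eenter_fault_info *fault_info);

/*
 * Hypothetical runtime hook: returns true if the fault was resolved and the
 * enclave can be resumed, false if the enclave should be torn down.
 */
bool handle_enclave_fault(struct sgx_eenter_fault_info *fi);

static int run_enclave_thread(void *tcs, void *priv,
			      struct sgx_eenter_fault_info *fi)
{
	long ret = __vdso_sgx_eenter(tcs, priv, fi);

	while (ret) {
		if (!handle_enclave_fault(fi))
			return -1;	/* e.g. a fault on EENTER itself */

		ret = __vdso_sgx_eresume(tcs, fi);
	}

	return 0;
}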
Re: [PATCH] x86/mm/fault: Streamline the fault error_code decoder some more
error code: +P -W -U -S -I -K (01)
[0.144411] #PF error code: +P +W -U -S -I -K (03)
[0.144826] #PF error code: +P +W +U -S -I -K (07)
[0.145252] #PF error code: +P -W +U -S -I +K (25)
[0.145706] #PF error code: -P +W -U -S -I -K (02)
[0.146111] #PF error code: -P -W +U -S -I -K (04)
[0.146521] #PF error code: -P +W +U -S -I -K (06)
[0.146934] #PF error code: -P -W +U -S +I -K (14)
[0.147348] #PF error code: +P -W -U -S +I -K (11)
[0.147767] #PF error code: -P -W -U -S -I -K (00)

vs. (with SGX added as 'G' for testing purposes)

[0.158849] #PF error code(0001): +P !W !U !S !I !K !G
[0.159292] #PF error code(0003): +P +W !U !S !I !K !G
[0.159742] #PF error code(0007): +P +W +U !S !I !K !G
[0.160190] #PF error code(0025): +P !W +U !S !I +K !G
[0.160638] #PF error code(0002): !P +W !U !S !I !K !G
[0.161087] #PF error code(0004): !P !W +U !S !I !K !G
[0.161538] #PF error code(0006): !P +W +U !S !I !K !G
[0.161992] #PF error code(0014): !P !W +U !S +I !K !G
[0.162450] #PF error code(0011): +P !W !U !S +I !K !G
[0.162667] #PF error code(8001): +P !W !U !S !I !K +G
[0.162667] #PF error code(8003): +P +W !U !S !I !K +G
[0.162667] #PF error code(8007): +P +W +U !S !I !K +G
[0.162667] #PF error code(8025): +P !W +U !S !I +K +G
[0.162667] #PF error code(8002): !P +W !U !S !I !K +G
[0.162667] #PF error code(8004): !P !W +U !S !I !K +G
[0.162667] #PF error code(8006): !P +W +U !S !I !K +G
[0.162667] #PF error code(8014): !P !W +U !S +I !K +G
[0.162667] #PF error code(8011): +P !W !U !S +I !K +G
[0.162667] #PF error code(0000): !P !W !U !S !I !K !G

vs. (with consistent ordering between raw and decoded)

[0.153004] #PF error code(0001): !K !I !S !U !W +P
[0.153004] #PF error code(0003): !K !I !S !U +W +P
[0.153004] #PF error code(0007): !K !I !S +U +W +P
[0.153004] #PF error code(0025): +K !I !S +U !W +P
[0.153004] #PF error code(0002): !K !I !S !U +W !P
[0.153004] #PF error code(0004): !K !I !S +U !W !P
[0.153004] #PF error code(0006): !K !I !S +U +W !P
[0.153362] #PF error code(0014): !K +I !S +U !W !P
[0.153788] #PF error code(0011): !K +I !S !U !W +P
[0.154104] #PF error code(0000): !K !I !S !U !W !P

A patch with the kitchen sink...

>From 22e6d40e52b4341a424f065a138be54bc79d4db4 Mon Sep 17 00:00:00 2001
From: Sean Christopherson
Date: Thu, 6 Dec 2018 07:25:13 -0800
Subject: [PATCH] x86/fault: Make show_error_code() more extensible and tweak its formatting

- Initialize each bit individually in the error_code_chars array.
  This allows for non-contiguous bits and is self-documenting. Define a
  macro to make the initialization code somewhat more readable.
- Reverse the decode order so it's consistent with the raw display.
- Use ARRAY_SIZE instead of hardcoding '6' in multiple locations.
- Use '!' for the negative case to better contrast against '+'.
- Display four digits (was two) when printing the raw error code.
- Remove @regs from show_error_code().
Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 47 - 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 23dc7163f6ac..cd28d058ed39 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -605,45 +605,48 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index) /* * This maps the somewhat obscure error_code number to symbolic text: - * - * P = protection fault (X86_PF_PROT) - * W = write access (X86_PF_WRITE) - * U = user-mode access (X86_PF_USER) - * S = supervisor mode(X86_PF_RSVD) - * I = instruction fault (X86_PF_INSTR) - * K = keys fault (X86_PF_PK) */ -static const char error_code_chars[] = "PWUSIK"; +#define EC(x) ilog2(X86_PF_##x) +static const char error_code_chars[] = { + [EC(PROT)] = 'P', + [EC(WRITE)] = 'W', + [EC(USER)] = 'U', + [EC(RSVD)] = 'S', + [EC(INSTR)] = 'I', + [EC(PK)]= 'K', +}; /* - * This helper function transforms the #PF error_code bits into " +P -W +U -R -I -K" + * This helper function transforms the #PF error_code bits into " +P !W +U !R !I !K" * type of descriptive, almost human-readable error strings: */ -static void show_error_code(struct pt_regs *regs, unsigned long error_code) +static void show_error_code(unsigned long error_code) { - unsigned int bit, mask; - char err_txt[6*3+1]; /* Fixed length of 6 bits decoded plus zero at the end */ + char err_txt[ARRAY_SIZE(error_code_chars)*3+1]; /* 3 chars per bit plus zero at the end */ + unsigned offset = 0; +
Re: [RFC PATCH 2/4] x86/fault: Attempt to fixup unhandled #PF in vDSO before signaling
On Thu, Dec 06, 2018 at 10:17:34AM -0800, Dave Hansen wrote: > > #define CREATE_TRACE_POINTS > > #include > > @@ -928,6 +929,9 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned > > long error_code, > > if (address >= TASK_SIZE_MAX) > > error_code |= X86_PF_PROT; > > > > + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, > > address)) > > + return; > > + > > if (likely(show_unhandled_signals)) > > show_signal_msg(regs, error_code, address, tsk); > > I'd preferably like to get this plugged into the page fault code before > we get to the "bad_area" handling. This plugs it in near the erratum > handling which seems really late to me. > > > @@ -1045,6 +1049,9 @@ do_sigbus(struct pt_regs *regs, unsigned long > > error_code, unsigned long address, > > if (is_prefetch(regs, error_code, address)) > > return; > > > > + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address)) > > + return; > > + > > set_signal_archinfo(address, error_code); > > > > #ifdef CONFIG_MEMORY_FAILURE > > This *seems* really late to me. We've already called into the mm fault > handling code to try and handle the fault and they told us it was > VM_FAULT_SIGBUS. Shouldn't we have just detected that it was in the > vDSO first and not even called the handling code? Not sure if Andy had other use cases in mind, but for SGX we only want to invoke the fixup in lieu of a signal. E.g. #PFs to fault in an EPC page need to be handled in the kernel.
[RFC PATCH v2 1/4] x86/vdso: Add support for exception fixup in vDSO functions
The basic concept and implementation is very similar to the kernel's exception fixup mechanism. The key differences are that the kernel handler is hardcoded and the fixup entry addresses are relative to the overall table as opposed to individual entries. Hardcoding the kernel handler avoids the need to figure out how to get userspace code to point at a kernel function. Given that the expected usage is to propagate information to userspace, dumping all fault information into registers is likely the desired behavior for the vast majority of yet-to-be-created functions. Use registers DI, SI and DX to communicate fault information, which follows Linux's ABI for register consumption and hopefully avoids conflict with hardware features that might leverage the fixup capabilities, e.g. register usage for SGX instructions was at least partially designed with calling conventions in mind. Making fixup addresses relative to the overall table allows the table to be stripped from the final vDSO image (it's a kernel construct) without complicating the offset logic, e.g. entry-relative addressing would also need to account for the table's location relative to the image. Regarding stripping the table, modify vdso2c to extract the table from the raw, a.k.a. unstripped, data and dump it as a standalone byte array in the resulting .c file. The original base of the table, its length and a pointer to the byte array are captured in struct vdso_image. Alternatively, the table could be dumped directly into the struct, but because the number of entries can vary per image, that would require either hardcoding a max sized table into the struct definition or defining the table as a flexible length array. The flexible length array approach has zero benefits, e.g. the base/size are still needed, and prevents reusing the extraction code, while hardcoding the max size adds ongoing maintenance just to avoid exporting the explicit size. The immediate use case is for Intel Software Guard Extensions (SGX). SGX introduces a new CPL3-only "enclave" mode that runs as a sort of black box shared object that is hosted by an untrusted "normal" CPl3 process. Entering an enclave can only be done through SGX-specific instructions, EENTER and ERESUME, and is a non-trivial process. Because of the complexity of transitioning to/from an enclave, the vast majority of enclaves are expected to utilize a library to handle the actual transitions. This is roughly analogous to how e.g. libc implementations are used by most applications. Another crucial characteristic of SGX enclaves is that they can generate exceptions as part of their normal (at least as "normal" as SGX can be) operation that need to be handled *in* the enclave and/or are unique to SGX. And because they are essentially fancy shared objects, a process can host any number of enclaves, each of which can execute multiple threads simultaneously. Putting everything together, userspace enclaves will utilize a library that must be prepared to handle any and (almost) all exceptions any time at least one thread may be executing in an enclave. Leveraging signals to handle the enclave exceptions is unpleasant, to put it mildly, e.g. the SGX library must constantly (un)register its signal handler based on whether or not at least one thread is executing in an enclave, and filter and forward exceptions that aren't related to its enclaves. This becomes particularly nasty when using multiple levels of libraries that register signal handlers, e.g. 
running an enclave via cgo inside of the Go runtime. Enabling exception fixup in vDSO allows the kernel to provide a vDSO function that wraps the low-level transitions to/from the enclave, i.e. the EENTER and ERESUME instructions. The vDSO function can intercept exceptions that would otherwise generate a signal and return the fault information directly to its caller, thus avoiding the need to juggle signal handlers. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/Makefile | 4 +- arch/x86/entry/vdso/extable.c | 37 + arch/x86/entry/vdso/extable.h | 17 arch/x86/entry/vdso/vdso-layout.lds.S | 9 - arch/x86/entry/vdso/vdso2c.h | 58 +++ arch/x86/include/asm/vdso.h | 5 +++ 6 files changed, 119 insertions(+), 11 deletions(-) create mode 100644 arch/x86/entry/vdso/extable.c create mode 100644 arch/x86/entry/vdso/extable.h diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index 141d415a8c80..eb543ee1bcec 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -20,7 +20,7 @@ VDSO32-$(CONFIG_IA32_EMULATION) := y vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o # files to link into kernel -obj-y += v
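To make the fixup mechanism concrete, a rough sketch of the kernel-side handler is shown below. It is illustrative only: the entry layout and the vdso_image fields (extable, extable_base, extable_len) are assumptions based on the description above, not a quote of the patch.

/* one entry per fixup-able instruction; offsets are relative to the table */
struct vdso_exception_table_entry {
	int insn, fixup;
};

bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
			  unsigned long error_code, unsigned long fault_addr)
{
	const struct vdso_image *image = current->mm->context.vdso_image;
	const struct vdso_exception_table_entry *extable;
	unsigned int nr_entries, i;
	unsigned long base;

	if (!current->mm->context.vdso || !image || !image->extable_len)
		return false;

	/* original base of the (stripped) table within the mapped vDSO */
	base = (unsigned long)current->mm->context.vdso + image->extable_base;
	nr_entries = image->extable_len / sizeof(*extable);
	extable = image->extable;

	for (i = 0; i < nr_entries; i++) {
		if (regs->ip == base + extable[i].insn) {
			/* hardcoded handler: resume at the fixup address... */
			regs->ip = base + extable[i].fixup;
			/* ...and dump the fault information into DI, SI, DX */
			regs->di = trapnr;
			regs->si = error_code;
			regs->dx = fault_addr;
			return true;
		}
	}

	return false;
}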
[RFC PATCH v2 3/4] x86/traps: Attempt to fixup exceptions in vDSO before signaling
Call fixup_vdso_exception() in all trap flows that generate signals to userspace immediately prior to generating any such signal. If the exception is fixed, return cleanly and do not generate a signal. The goal of vDSO fixup is not to fixup all faults, nor is it to avoid all signals, but rather to report faults directly to userspace when the fault would otherwise directly result in a signal being sent to the process. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/kernel/traps.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 9b7c4ca8f0a7..f813481a85ff 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -61,6 +61,7 @@ #include #include #include +#include #ifdef CONFIG_X86_64 #include @@ -223,6 +224,10 @@ do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str, tsk->thread.error_code = error_code; tsk->thread.trap_nr = trapnr; + if (user_mode(regs) && + fixup_vdso_exception(regs, trapnr, error_code, 0)) + return 0; + return -1; } @@ -563,6 +568,9 @@ do_general_protection(struct pt_regs *regs, long error_code) tsk->thread.error_code = error_code; tsk->thread.trap_nr = X86_TRAP_GP; + if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0)) + return; + show_signal(tsk, SIGSEGV, "", desc, regs, error_code); force_sig(SIGSEGV, tsk); @@ -791,6 +799,10 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) goto exit; } + if (user_mode(regs) && + fixup_vdso_exception(regs, X86_TRAP_DB, error_code, 0)) + goto exit; + if (WARN_ON_ONCE((dr6 & DR_STEP) && !user_mode(regs))) { /* * Historical junk that used to handle SYSENTER single-stepping. @@ -854,6 +866,9 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr) if (!si_code) return; + if (fixup_vdso_exception(regs, trapnr, error_code, 0)) + return; + force_sig_fault(SIGFPE, si_code, (void __user *)uprobe_get_trap_addr(regs), task); } -- 2.19.2
[RFC PATCH v2 4/4] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions
ithout losing its stack context. In addition to allowing the runtime to do silly shenanigans with its stack, e.g. pushing data onto the stack from within the enclave, the handler approach preserves the full call stack for debugging purposes. For example, to accept a new EPC page into the enclave, an enclave with a single TCS must re-EENTER the enclave using the same TCS to EACCEPT the new page prior to executing ERESUME to restart at the fault context. The handler approach allows reentry without undoing the original call stack, i.e. preserves the view of the original interrupted call. Note that this effectively requires userspace to implement an exit handler if they want to support correctable enclave faults, as there is no other way to request ERESUME. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/Makefile | 1 + arch/x86/entry/vdso/vdso.lds.S | 1 + arch/x86/entry/vdso/vsgx_enter_enclave.c | 119 +++ 3 files changed, 121 insertions(+) create mode 100644 arch/x86/entry/vdso/vsgx_enter_enclave.c diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index eb543ee1bcec..8b530e20e8be 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -18,6 +18,7 @@ VDSO32-$(CONFIG_IA32_EMULATION) := y # files to link into the vdso vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o +vobjs-$(VDSO64-y) += vsgx_enter_enclave.o # files to link into kernel obj-y += vma.o extable.o diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce4cfa9..50952a995a6c 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -25,6 +25,7 @@ VERSION { __vdso_getcpu; time; __vdso_time; + __vdso_sgx_enter_enclave; local: *; }; } diff --git a/arch/x86/entry/vdso/vsgx_enter_enclave.c b/arch/x86/entry/vdso/vsgx_enter_enclave.c new file mode 100644 index ..896c2eb079bb --- /dev/null +++ b/arch/x86/entry/vdso/vsgx_enter_enclave.c @@ -0,0 +1,119 @@ +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) +// Copyright(c) 2018 Intel Corporation. + +#include +#include + +#include "extable.h" + +/* + * The exit info/handler definitions will live elsewhere in the actual + * implementation, e.g. arch/x86/include/uapi/asm/sgx.h. + */ +struct sgx_enclave_exit_info { + __u32 leaf; + union { + struct { + __u16 trapnr; + __u16 error_code; + __u64 address; + } fault; + struct { + __u64 rdi; + __u64 rsi; + __u64 rdx; + } eexit; + }; +}; + +typedef long (sgx_enclave_exit_handler)(struct sgx_enclave_exit_info *exit_info, + void *tcs, void *priv); + +/* + * ENCLU (ENCLave User) is an umbrella instruction for a variety of CPL3 + * SGX functions, The ENCLU function that is executed is specified in EAX, + * with each function potentially having more leaf-specific operands beyond + * EAX. In the vDSO we're only concerned with the leafs that are used to + * transition to/from the enclave. + */ +enum sgx_enclu_leaf { + SGX_EENTER = 2, + SGX_ERESUME = 3, + SGX_EEXIT = 4, +}; + +notrace long __vdso_sgx_enter_enclave(void *tcs, void *priv, + struct sgx_enclave_exit_info *exit_info, + sgx_enclave_exit_handler *exit_handler) +{ + u64 rdi, rsi, rdx; + u32 leaf; + + if (!tcs || !exit_info) + return -EINVAL; + + leaf = SGX_EENTER; + +enter_enclave: + asm volatile( + /* +* When an event occurs in an enclave, hardware first exits the +* enclave to the AEP, switching CPU context along the way, and +* *then* delivers the event as usual. 
As part of the context +* switching, registers are loaded with synthetic state (except +* BP and SP, which are saved/restored). The defined synthetic +* state loads registers so that simply executing ENCLU will do +* ERESUME, e.g. RAX=4, RBX=TCS and RCX=AEP after an AEE. So, +* we only need to load RAX, RBX and RCX for the initial entry. +* The AEP can point at that same ENCLU, fixup will jump us out +* if an exception was unhandled. +*/ + " lea 1f(%%rip), %%rcx\n" + "1: enclu\n" + "2:\n"
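For illustration, a hypothetical userspace exit handler built on this API might look like the sketch below, assuming the handler's return value selects the next action (SGX_EENTER/SGX_ERESUME re-enter the enclave, SGX_EEXIT unwinds to the caller) and that runtime_fix_fault() is a made-up helper, e.g. one that EACCEPTs a pending EPC page via a dedicated TCS:

static long exit_handler(struct sgx_enclave_exit_info *exit_info,
			 void *tcs, void *priv)
{
	if (exit_info->leaf == SGX_EEXIT)
		return SGX_EEXIT;		/* normal exit, unwind to the caller */

	if (exit_info->fault.trapnr == 14 &&	/* #PF */
	    runtime_fix_fault(priv, exit_info->fault.address))
		return SGX_ERESUME;		/* restart at the faulting point */

	return -1;				/* unhandled, report an error */
}

Because the handler is invoked on the same call stack as the original __vdso_sgx_enter_enclave() call, the interrupted call frame described above is preserved across the re-entry.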
[RFC PATCH v2 0/4] x86: Add vDSO exception fixup for SGX
This version is almost entirely about the vDSO function itself, i.e. patch 4/4. Feel free to ignore patches 2/4 and 3/4, I need to do (a lot) more legwork to address feedback and improve their changelogs. I'm expecting that to take a fair amount of time and wanted to get the alternative exit handler idea out there ASAP. The new vDSO function builds but is otherwise completely untested. v2: - For all intents and purposes, rewrite the SGX vDSO function. This version is quite a bit different than anything discussed in the past. Rather than provide a separate function or an explicit parameter to request ERESUME, e.g. to recover after a fault, take an optional "exit handler" that provides the caller the opportunity to specify if and how the enclave should be resumed. More details in the changelog. - Rename it to __vdso_sgx_enter_enclave() to abstract the details of EENTER and ERESUME to some degree. - Give the enclave RDI, RSI and RDX to pass data out of the enclave. - Call fixup_vdso_exception() in do_int3(). v1: https://lkml.kernel.org/r/20181205232012.28920-1-sean.j.christopher...@intel.com Sean Christopherson (4): x86/vdso: Add support for exception fixup in vDSO functions x86/fault: Attempt to fixup unhandled #PF in vDSO before signaling x86/traps: Attempt to fixup exceptions in vDSO before signaling x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions arch/x86/entry/vdso/Makefile | 5 +- arch/x86/entry/vdso/extable.c| 37 +++ arch/x86/entry/vdso/extable.h| 17 arch/x86/entry/vdso/vdso-layout.lds.S| 9 +- arch/x86/entry/vdso/vdso.lds.S | 1 + arch/x86/entry/vdso/vdso2c.h | 58 +-- arch/x86/entry/vdso/vsgx_enter_enclave.c | 119 +++ arch/x86/include/asm/vdso.h | 5 + arch/x86/kernel/traps.c | 15 +++ arch/x86/mm/fault.c | 7 ++ 10 files changed, 262 insertions(+), 11 deletions(-) create mode 100644 arch/x86/entry/vdso/extable.c create mode 100644 arch/x86/entry/vdso/extable.h create mode 100644 arch/x86/entry/vdso/vsgx_enter_enclave.c -- 2.19.2
[RFC PATCH v2 2/4] x86/fault: Attempt to fixup unhandled #PF in vDSO before signaling
Call fixup_vdso_exception() in the SIGSEGV and SIGBUS paths of the page fault handler immediately prior to signaling. If the fault is fixed, return cleanly and do not generate a signal. The goal of vDSO fixup is not to fixup all faults, nor is it to avoid all signals, but rather to report faults directly to userspace when the fault would otherwise directly result in a signal being sent to the process. For example, a page fault that points to valid user memory that happened to be swapped out should not trigger fixup, i.e. should not be reported to userspace. In the SIGSEGV flow, make sure to call fixup_vdso_exception() after the error code has been sanitized. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2ff25ad33233..ff1cb10858d5 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -28,6 +28,7 @@ #include/* vma_pkey() */ #include/* efi_recover_from_page_fault()*/ #include /* store_idt(), ... */ +#include /* fixup_vdso_exception() */ #define CREATE_TRACE_POINTS #include @@ -928,6 +929,9 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, if (address >= TASK_SIZE_MAX) error_code |= X86_PF_PROT; + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address)) + return; + if (likely(show_unhandled_signals)) show_signal_msg(regs, error_code, address, tsk); @@ -1045,6 +1049,9 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, if (is_prefetch(regs, error_code, address)) return; + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address)) + return; + set_signal_archinfo(address, error_code); #ifdef CONFIG_MEMORY_FAILURE -- 2.19.2
Re: [RFC PATCH v2 4/4] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions
+Cc: linux-sgx, Haitao, Greg and Jethro My apologies for neglecting to cc the SGX folks, original thread is here: https://lkml.kernel.org/r/20181206221922.31012-1-sean.j.christopher...@intel.com On Thu, Dec 06, 2018 at 02:50:01PM -0800, Andy Lutomirski wrote: > On Thu, Dec 6, 2018 at 2:19 PM Sean Christopherson > wrote: > > > + > > + /* > > +* Invoke the caller's exit handler if one was provided. The return > > +* value tells us whether to re-enter the enclave (EENTER or > > ERESUME) > > +* or to return (EEXIT). > > +*/ > > + if (exit_handler) { > > + leaf = exit_handler(exit_info, tcs, priv); > > + if (leaf == SGX_EENTER || leaf == SGX_ERESUME) > > + goto enter_enclave; > > + if (leaf == SGX_EEXIT) > > + return 0; > > + return -EINVAL; > > + } else if (leaf != SGX_EEXIT) { > > + return -EFAULT; > > + } > > This still seems overcomplicated to me. How about letting the > requested leaf (EENTER or ERESUME) be a parameter to the function and > then just returning here? As it stands, you're requiring any ERESUME > that gets issued (other than the implicit ones) to be issued in the > same call stack, which is very awkward if you're doing something like > forwarding the fault to a different task over a socket and then > waiting in epoll_wait() or similar before resuming the enclave. Ah, yeah, wasn't thinking about usage models where the enclave could get passed off to a different thread. What about supporting both, i.e. keep the exit handler but make it 100% optional? And simplify the exit_handler to effectively return a boolean, i.e. "exit or continue". Something like this: notrace long __vdso_sgx_enter_enclave(u32 op, void *tcs, void *priv, struct sgx_enclave_exit_info *exit_info, sgx_enclave_exit_handler *exit_handler) { u64 rdi, rsi, rdx; u32 leaf; long ret; if (!tcs || !exit_info) return -EINVAL; enter_enclave: if (op != SGX_EENTER && op != SGX_ERESUME) return -EINVAL; /* * Invoke the caller's exit handler if one was provided. The return * value tells us whether to re-enter the enclave (EENTER or ERESUME) * or to return (EEXIT). */ if (exit_handler) { if (exit_handler(exit_info, tcs, priv)) { op = exit_info->leaf; goto enter_enclave; } } if (exit_info->leaf == SGX_EEXIT) return -EFAULT; return 0; } I like that the exit handler allows userspace to trap/panic with the full call stack in place, and in a dedicated path, i.e. outside of the basic enter/exit code. An exit handler probably doesn't fundamentally change what userspace can do with respect to debugging/reporting, but I think it would actually simplify some userspace implementations, e.g. I'd use it in my tests like so: long fault_handler(struct sgx_enclave_exit_info *exit_info, void *tcs, void *priv) { if (exit_info->leaf == SGX_EEXIT) return 0; }
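For the no-handler case under this proposal, the caller-side loop might look something like the sketch below; vdso_sgx_enter_enclave() stands in for however the runtime resolves the vDSO symbol, runtime_handle_fault() is an assumed helper (e.g. one that forwards the fault to another task and waits for it to be resolved), and the error handling of the vDSO return value is deliberately simplified. Since the leaf is an explicit parameter, the final ERESUME could just as easily be issued later from a completely different call path.

static int run_enclave(void *tcs, void *priv)
{
	struct sgx_enclave_exit_info exit_info = { 0 };
	u32 op = SGX_EENTER;

	for (;;) {
		/* return value handling simplified; see the discussion above */
		vdso_sgx_enter_enclave(op, tcs, priv, &exit_info, NULL);

		if (exit_info.leaf == SGX_EEXIT)
			return 0;		/* clean exit from the enclave */

		/* Fault: hand it off, e.g. over a socket, and wait. */
		if (!runtime_handle_fault(&exit_info))
			return -1;

		op = SGX_ERESUME;		/* retry at the faulting point */
	}
}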
[PATCH] x86/fault: Decode and print #PF oops in human readable form
Linus pointed out[1] that deciphering the raw #PF error code and printing a more human readable message are two different things, e.g. not-executable, not-writable, protection keys and even SGX faults are all some form of permission violation faults, and a #PF with the USER bit cleared in the error code does not indicate that the fault originated in kernel code. Remove the per-bit decoding of the error code and instead print the raw error code followed by a brief description of what caused the fault, the effective privilege level of the faulting access, and whether the fault originated in user code or kernel code. This provides the user with the information needed to do basic triage on 99.9% oopses without polluting the message with information that is useless the vast majority of the time, e.g. an oops on a protection keys violation is beyond rare, stating that an oops *wasn't* due to a keys violation is pointless noise. [1] https://lkml.kernel.org/r/CAHk-=whk_fsnxvmvf1t2ffcap2wrvsybabrlqcwljycvhw6...@mail.gmail.com Suggested-by: Linus Torvalds Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Yu-cheng Yu Cc: linux-kernel@vger.kernel.org Cc: Ingo Molnar Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 41 ++--- 1 file changed, 10 insertions(+), 31 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2ff25ad33233..85de05d41f58 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -603,24 +603,9 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index) name, index, addr, (desc.limit0 | (desc.limit1 << 16))); } -/* - * This helper function transforms the #PF error_code bits into - * "[PROT] [USER]" type of descriptive, almost human-readable error strings: - */ -static void err_str_append(unsigned long error_code, char *buf, unsigned long mask, const char *txt) -{ - if (error_code & mask) { - if (buf[0]) - strcat(buf, " "); - strcat(buf, txt); - } -} - static void show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address) { - char err_txt[64]; - if (!oops_may_print()) return; @@ -648,27 +633,21 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad address < PAGE_SIZE ? "NULL pointer dereference" : "paging request", (void *)address); - err_txt[0] = 0; - - /* -* Note: length of these appended strings including the separation space and the -* zero delimiter must fit into err_txt[]. -*/ - err_str_append(error_code, err_txt, X86_PF_PROT, "[PROT]" ); - err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]"); - err_str_append(error_code, err_txt, X86_PF_USER, "[USER]" ); - err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); - err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); - err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); - - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal kernel read fault]"); + pr_alert("#PF: error_code(0x%04lx) - %s\n", error_code, + !(error_code & X86_PF_PROT) ? "not-present page" : + (error_code & X86_PF_RSVD) ? "reserved bit violation" : + "permissions violation"); + pr_alert("#PF: %s-privileged %s from %s code\n", + (error_code & X86_PF_USER) ? "user" : "supervisor", + (error_code & X86_PF_INSTR) ? "instruction fetch" : + (error_code & X86_PF_WRITE) ? "write access" : + "read access", + user_mode(regs) ? 
"user" : "kernel"); if (!(error_code & X86_PF_USER) && user_mode(regs)) { struct desc_ptr idt, gdt; u16 ldtr, tr; - pr_alert("This was a system access from user code\n"); - /* * This can happen for quite a few reasons. The more obvious * ones are faults accessing the GDT, or LDT. Perhaps -- 2.19.2
Re: [RFC PATCH v2 4/4] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions
On Fri, Dec 07, 2018 at 09:56:09AM -0800, Andy Lutomirski wrote: > On Fri, Dec 7, 2018 at 8:51 AM Sean Christopherson > wrote: > > I like that the exit handler allows userspace to trap/panic with the full > > call stack in place, and in a dedicated path, i.e. outside of the basic > > enter/exit code. An exit handler probably doesn't fundamentally change > > what userspace can do with respect to debugging/reporting, but I think > > it would actually simplify some userspace implementations, e.g. I'd use > > it in my tests like so: > > > > long fault_handler(struct sgx_enclave_exit_info *exit_info, void *tcs, void > > *priv) > > { > > if (exit_info->leaf == SGX_EEXIT) > > return 0; > > > > > > } > > > > Hmm. That't not totally silly, although you could accomplish almost > the same thing by wrapping the vDSO helper in another function. True, but I think there's value in having the option to intercept an exception at the exact(ish) point of failure, without having to return up the stack. The enclave has full access to the process' memory space, including the untrusted stack. It's not too far-fetched to envision a scenario where the enclave corrupts the stack even if the enclave isn't intentionally using the stack, e.g. the host passes in a variable that points at the stack instead of whatever memory is supposed to be shared between the enclave and host. It'd be nice to have the ability to triage something like that without having to e.g. configure breakpoints on the stack.
Re: [PATCH] x86/fault: Decode and print #PF oops in human readable form
On Fri, Dec 07, 2018 at 10:52:49AM -0800, Linus Torvalds wrote: > On Fri, Dec 7, 2018 at 10:44 AM Sean Christopherson > wrote: > > > > Remove the per-bit decoding of the error code and instead print the raw > > error code followed by a brief description of what caused the fault, the > > effective privilege level of the faulting access, and whether the fault > > originated in user code or kernel code. > > This doesn't quite work as-is, though. > > For example, at least the PK bit is independent of the other bits and > would be interesting in the human-legible version, but doesn't show up > there at all. Heh, I actually intentionally omitted protection keys thinking it'd be superfluous, i.e. "go look at the error code bits if you care that much". > That said, I think the end result might be more legible than the > previous version, so this approach may well be good, it just needs at > least that "permissions violation" part to be extended with whether > it was PK or not. > > Also, shouldn't we show the SGX bit too as some kind of "during SGX" > extension on the "in user/kernel space" part? The SGX bit isn't defined in mainline yet. But yeah, I can see how printing e.g. "SGX EPCM violation" would be a lot more helpful than a vanilla "permissions violation". I'll send a v2 with the PK bit added and a slightly reworded changelog.
[PATCH v2] x86/fault: Decode and print #PF oops in human readable form
Linus pointed out that deciphering the raw #PF error code and printing a more human readable message are two different things, and also that printing the negative cases is mostly just noise[1]. For example, the USER bit doesn't mean the fault originated in user code and stating that an oops wasn't due to a protection keys violation isn't interesting since an oops on a keys violation is a one-in-a-million scenario. Remove the per-bit decoding of the error code and instead print: - the raw error code - why the fault occurred - the effective privilege level of the access - the type of access - whether the fault originated in user code or kernel code This provides the user with the information needed to triage 99.9% of oopses without polluting the log with useless information or conflating the error_code with the CPL. [1] https://lkml.kernel.org/r/CAHk-=whk_fsnxvmvf1t2ffcap2wrvsybabrlqcwljycvhw6...@mail.gmail.com Suggested-by: Linus Torvalds Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: Thomas Gleixner Cc: Yu-cheng Yu Cc: linux-kernel@vger.kernel.org Cc: Ingo Molnar Signed-off-by: Sean Christopherson --- v2: - Explicitly call out protection keys violations - "Slightly" reword the changelog arch/x86/mm/fault.c | 42 +++--- 1 file changed, 11 insertions(+), 31 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 2ff25ad33233..26feb8c525c1 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -603,24 +603,9 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index) name, index, addr, (desc.limit0 | (desc.limit1 << 16))); } -/* - * This helper function transforms the #PF error_code bits into - * "[PROT] [USER]" type of descriptive, almost human-readable error strings: - */ -static void err_str_append(unsigned long error_code, char *buf, unsigned long mask, const char *txt) -{ - if (error_code & mask) { - if (buf[0]) - strcat(buf, " "); - strcat(buf, txt); - } -} - static void show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address) { - char err_txt[64]; - if (!oops_may_print()) return; @@ -648,27 +633,22 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad address < PAGE_SIZE ? "NULL pointer dereference" : "paging request", (void *)address); - err_txt[0] = 0; - - /* -* Note: length of these appended strings including the separation space and the -* zero delimiter must fit into err_txt[]. -*/ - err_str_append(error_code, err_txt, X86_PF_PROT, "[PROT]" ); - err_str_append(error_code, err_txt, X86_PF_WRITE, "[WRITE]"); - err_str_append(error_code, err_txt, X86_PF_USER, "[USER]" ); - err_str_append(error_code, err_txt, X86_PF_RSVD, "[RSVD]" ); - err_str_append(error_code, err_txt, X86_PF_INSTR, "[INSTR]"); - err_str_append(error_code, err_txt, X86_PF_PK,"[PK]" ); - - pr_alert("#PF error: %s\n", error_code ? err_txt : "[normal kernel read fault]"); + pr_alert("#PF: error_code(0x%04lx) - %s\n", error_code, + !(error_code & X86_PF_PROT) ? "not-present page" : + (error_code & X86_PF_RSVD) ? "reserved bit violation" : + (error_code & X86_PF_PK)? "protection keys violation" : + "permissions violation"); + pr_alert("#PF: %s-privileged %s from %s code\n", + (error_code & X86_PF_USER) ? "user" : "supervisor", + (error_code & X86_PF_INSTR) ? "instruction fetch" : + (error_code & X86_PF_WRITE) ? "write access" : + "read access", + user_mode(regs) ? 
"user" : "kernel"); if (!(error_code & X86_PF_USER) && user_mode(regs)) { struct desc_ptr idt, gdt; u16 ldtr, tr; - pr_alert("This was a system access from user code\n"); - /* * This can happen for quite a few reasons. The more obvious * ones are faults accessing the GDT, or LDT. Perhaps -- 2.19.2
Re: [RFC PATCH v2 4/4] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions
On Fri, Dec 07, 2018 at 11:23:10AM -0800, Andy Lutomirski wrote: > > > On Dec 7, 2018, at 11:02 AM, Sean Christopherson > > wrote: > > > >> On Fri, Dec 07, 2018 at 09:56:09AM -0800, Andy Lutomirski wrote: > >> On Fri, Dec 7, 2018 at 8:51 AM Sean Christopherson > >> wrote: > >>> I like that the exit handler allows userspace to trap/panic with the full > >>> call stack in place, and in a dedicated path, i.e. outside of the basic > >>> enter/exit code. An exit handler probably doesn't fundamentally change > >>> what userspace can do with respect to debugging/reporting, but I think > >>> it would actually simplify some userspace implementations, e.g. I'd use > >>> it in my tests like so: > >>> > >>> long fault_handler(struct sgx_enclave_exit_info *exit_info, void *tcs, > >>> void *priv) > >>> { > >>>if (exit_info->leaf == SGX_EEXIT) > >>>return 0; > >>> > >>> > >>> } > >>> > >> > >> Hmm. That't not totally silly, although you could accomplish almost > >> the same thing by wrapping the vDSO helper in another function. > > > > True, but I think there's value in having the option to intercept an > > exception at the exact(ish) point of failure, without having to return > > up the stack. > > > > The enclave has full access to the process' memory space, including the > > untrsuted stack. It's not too far fetched to envision a scenario where > > the enclave corrupts the stack even if the enclave isn't intentionally > > using the stack, e.g. the host passes in variable that a points at the > > stack instead of whatever memory is supposed to be shared between the > > enclave and host. It'd be nice to have the ability to triage something > > like that without having to e.g. configure breakpoints on the stack. > > Ah, I see. You’re saying that, if the non-enclave stare is corrupted such > that RIP is okay and RSP still points somewhere reasonable but the return > address is garbage, then we can at least get to the fault handler and print > something? Yep. Even for something more subtle like GPR corruption it could dump the entire call stack before attempting to return back up. > This only works if the fault handler pointer itself is okay, though, which > somewhat limits the usefulness, given that its pointer is quite likely to > be on the stack very close to the return address. Yeah, it's not a silver bullet by any means, but it does seem useful for at least some scenarios. Even exploding when invoking the handler instead of at a random point might prove useful, e.g. "calling my exit handler exploded, maybe my enclave corrupted the stack!". > I really wish the ENCLU instruction had seen fit to preserve some registers. Speaking of preserving registers, the asm blob needs to mark RBX as clobbered since it's modified for EEXIT.
Re: [RFC PATCH v2 4/4] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions
On Fri, Dec 07, 2018 at 12:16:59PM -0800, Andy Lutomirski wrote: > > > On Dec 7, 2018, at 12:09 PM, Sean Christopherson > > wrote: > > > > Speaking of preserving registers, the asm blob needs to mark RBX as > > clobbered since it's modified for EEXIT. > > Have fun with that. The x86_32 compiler seems to really like having its > PIC register preserved, and you may get some lovely compiler errors. Tangentially related, as-is the SGX vDSO is only compiled for x86_64 since CONFIG_SGX depends on CONFIG_X86_64. Mapping the EPC in 32-bit mode complicates things and no one is asking for SGX support on 32-bit builds, so...
Re: [RFC PATCH v2 4/4] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions
On Fri, Dec 07, 2018 at 12:16:59PM -0800, Andy Lutomirski wrote: > > > > On Dec 7, 2018, at 12:09 PM, Sean Christopherson > > wrote: > > > >> On Fri, Dec 07, 2018 at 11:23:10AM -0800, Andy Lutomirski wrote: > >> > >> Ah, I see. You’re saying that, if the non-enclave stare is corrupted such > >> that RIP is okay and RSP still points somewhere reasonable but the return > >> address is garbage, then we can at least get to the fault handler and print > >> something? > > > > Yep. Even for something more subtle like GPR corruption it could dump the > > entire call stack before attempting to return back up. > > > >> This only works if the fault handler pointer itself is okay, though, which > >> somewhat limits the usefulness, given that its pointer is quite likely to > >> be on the stack very close to the return address. > > > > Yeah, it's not a silver bullet by any means, but it does seem useful for at > > least some scenarios. Even exploding when invoking the handler instead of > > at a random point might prove useful, e.g. "calling my exit handler > > exploded, > > maybe my enclave corrupted the stack!". > > Here’s another idea: calculate some little hash or other checksum of > RSP, RBP, and perhaps a couple words on the stack, and do: Corrupting RSP and RBP as opposed to the stack memory seems much less likely since the enclave would have to poke into the save state area. And as much as I dislike the practice of intentionally manipulating SSA.RSP, preventing the user from doing something because we're "helping" doesn't seem right. > call __vdso_enclave_corrupted_state > > If you get a mismatch after return. That function could be: > > call __vdso_enclave_corrupted_state: > ud2 > > And now the debug trace makes it very clear what happened. > > This may or may not be worth the effort. Running a checksum on the stack for every exit doesn't seem like it'd be worth the effort, especially since this type of bug should be quite rare, at least in production environments. If we want to pursue the checksum idea I think the easiest approach would be to combine it with an exit_handler and do a simple check on the handler. It'd be minimal overhead in the fast path and would flag cases where invoking exit_handle() would explode, while deferring all other checks to the user. E.g. something like this: diff --git a/arch/x86/entry/vdso/vsgx_enter_enclave.c b/arch/x86/entry/vdso/vsgx_enter_enclave.c index d5145e5c5a54..c89dd3cd8da9 100644 --- a/arch/x86/entry/vdso/vsgx_enter_enclave.c +++ b/arch/x86/entry/vdso/vsgx_enter_enclave.c @@ -42,10 +42,13 @@ enum sgx_enclu_leaf { SGX_EEXIT = 4, }; +#define VDSO_MAGIC 0xa5a5a5a5a5a5a5a5UL + notrace long __vdso_sgx_enter_enclave(u32 op, void *tcs, void *priv, struct sgx_enclave_exit_info *exit_info, sgx_enclave_exit_handler *exit_handler) { + volatile unsigned long hash; u64 rdi, rsi, rdx; u32 leaf; long ret; @@ -53,6 +56,9 @@ notrace long __vdso_sgx_enter_enclave(u32 op, void *tcs, void *priv, if (!tcs || !exit_info) return -EINVAL; + /* Always hash the handler. XOR is much cheaper than Jcc. */ + hash = (unsigned long)exit_handler ^ VDSO_MAGIC; + enter_enclave: if (op != SGX_EENTER && op != SGX_ERESUME) return -EINVAL; @@ -107,6 +113,8 @@ notrace long __vdso_sgx_enter_enclave(u32 op, void *tcs, void *priv, * or to return (EEXIT). 
*/ if (exit_handler) { + if (hash != ((unsigned long)exit_handler ^ VDSO_MAGIC)) + asm volatile("ud2\n"); if (exit_handler(exit_info, tcs, priv)) { op = exit_info->leaf; goto enter_enclave; > But ISTM the enclave is almost as likely to corrupt the host state and > the. EEXIT as it is to corrupt the host state and then fault. Agreed, I would say even more likely. But the idea is that the exit_handler is called on any exit, not just exceptions.
Re: [PATCH v2] x86/fault: Decode and print #PF oops in human readable form
On Fri, Dec 07, 2018 at 12:46:30PM -0800, Linus Torvalds wrote: > On Fri, Dec 7, 2018 at 11:52 AM Sean Christopherson > wrote: > > > > Remove the per-bit decoding of the error code and instead print: > > The patch looks fine to me, so feel free to add an acked-by, but: > > (a) I'm not the one who wanted the human-legible version in the first > place, since I'm also perfectly ok with just the error code, so my > judgement is obviously garbage wrt this whole thing > > (b) it would be good to have a couple of actual examples of the > printout to judge. > > I can certainly imagine how it looks just from the patch, but maybe if > I actually see reality I'd go "eww". Or somebody else goes "eww". Here's what it looks like in the full oops context: [ 14.109977] BUG: unable to handle kernel paging request at beef [ 14.110604] #PF: error_code(0x) - not-present page [ 14.111048] #PF: supervisor-privileged read access from kernel code [ 14.111594] PGD 0 P4D 0 [ 14.111820] Oops: [#1] SMP [ 14.112098] CPU: 4 PID: 1007 Comm: modprobe Not tainted 4.20.0-rc2+ #91 [ 14.112668] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 14.113344] RIP: 0010:hardware_setup+0xab/0x6f1 [kvm_intel] [ 14.113825] Code: 05 30 8c ff ff 0f 84 56 05 00 00 0f 31 83 e0 03 75 13 48 b8 00 00 00 00 ef be ff ff 48 c7 00 00 00 00 00 eb 2b 48 ff c8 75 0c <48> a1 00 00 00 00 ef be ff ff eb 1a 48 8b 15 50 ff f6 ff 48 b8 00 [ 14.115421] RSP: 0018:c958bbb8 EFLAGS: 00010246 [ 14.115878] RAX: RBX: 0006 RCX: ea0009d42148 [ 14.116497] RDX: 000a RSI: dead0200 RDI: 006000c0 [ 14.117115] RBP: c958bc50 R08: 0006 R09: 0016a6d0 [ 14.117731] R10: c958bc68 R11: 8882750ef5e8 R12: 5e40 [ 14.118346] R13: a014dd00 R14: 888270273480 R15: c958be98 [ 14.118961] FS: 7ffac1c70700() GS:88827780() knlGS: [ 14.119661] CS: 0010 DS: ES: CR0: 80050033 [ 14.120160] CR2: beef CR3: 000270352004 CR4: 00360ee0 [ 14.120778] DR0: DR1: DR2: [ 14.121393] DR3: DR6: fffe0ff0 DR7: 0400 [ 14.122006] Call Trace: [ 14.122235] kvm_arch_hardware_setup+0x42/0x200 [kvm] [ 14.122685] ? vmx_check_processor_compat+0x8d/0x8d [kvm_intel] [ 14.123201] ? kvm_init+0x79/0x280 [kvm] [ 14.123552] kvm_init+0x79/0x280 [kvm] [ 14.123883] ? vmx_check_processor_compat+0x8d/0x8d [kvm_intel] [ 14.124394] vmx_init+0x9d/0x400 [kvm_intel] [ 14.124766] ? vmx_check_processor_compat+0x8d/0x8d [kvm_intel] [ 14.125277] do_one_initcall+0x4d/0x1c4 [ 14.125613] ? kmem_cache_alloc_trace+0x30/0x1a0 [ 14.126014] do_init_module+0x5b/0x20d [ 14.126342] load_module+0x2389/0x2980 [ 14.126670] ? vfs_read+0x117/0x130 [ 14.126976] ? __do_sys_finit_module+0xd2/0x100 [ 14.127369] __do_sys_finit_module+0xd2/0x100 [ 14.127751] do_syscall_64+0x4f/0x100 [ 14.128078] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 14.128504] RIP: 0033:0x7ffac178e4d9 [ 14.128810] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 29 2c 00 f7 d8 64 89 01 48 [ 14.130358] RSP: 002b:7ffc5b68d0c8 EFLAGS: 0246 ORIG_RAX: 0139 [ 14.131023] RAX: ffda RBX: 5581920445c0 RCX: 7ffac178e4d9 [ 14.131658] RDX: RSI: 558192048850 RDI: 0005 [ 14.132256] RBP: 558192048850 R08: R09: 0008 [ 14.132856] R10: 0005 R11: 0246 R12: [ 14.133454] R13: 558192044640 R14: 0004 R15: 0008 [ 14.134055] Modules linked in: kvm_intel(+) kvm irqbypass bridge stp llc [ 14.134649] CR2: beef [ 14.134933] ---[ end trace 8dce06ca17fa1e39 ]--- Looking at it again, my own personal preference would be to swap the order of the #PF lines. 
For most cases the three main lines show up in reverse fir-tree ordering and the cause of the fault is easy to pick out since it's the last thing highlighted by pr_alert (excepting when we dump the IDT, GDT, etc...). [ 160.246820] BUG: unable to handle kernel paging request at beef [ 160.247517] #PF: supervisor-privileged instruction fetch from kernel code [ 160.248085] #PF: error_code(0x0010) - not-present page [ 160.248517] PGD 0 P4D 0 [ 160.248738] Oops: 0010 [#1] SMP [ 160.249012] CPU: 4 PID: 1003 Comm: modprobe Not tainted 4.20.0-rc2+ #93 [ 160.249566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 160.250221] RIP: 0010:0xbeef00
Re: x86_64 INIT/SIPI Bug
On Thu, Nov 08, 2018 at 03:23:59PM -0700, Rian Quinn wrote: > I apologize upfront if this is the wrong place to post this, pretty new to > this. > > We are working on the Bareflank Hypervisor (www.bareflank.org), and we > are passing through the INIT/SIPI process (similar to how a VMX > rootkit from EFI might boot the OS) and we noticed that on Arch Linux, > the INIT/SIPI process stalls, something we are not seeing on Ubuntu. I'm confused, INIT is blocked post-VMXON, what are you passing through?
Re: x86_64 INIT/SIPI Bug
On Fri, Nov 09, 2018 at 11:04:59AM -0700, Rian Quinn wrote: > >> I apologize upfront if this is the wrong place to post this, pretty new to > >> this. > >> > >> We are working on the Bareflank Hypervisor (www.bareflank.org), and we > >> are passing through the INIT/SIPI process (similar to how a VMX > >> rootkit from EFI might boot the OS) and we noticed that on Arch Linux, > >> the INIT/SIPI process stalls, something we are not seeing on Ubuntu. > > > > I'm confused, INIT is blocked post-VMXON, what are you passing through? > > You are correct that INIT will track unconditionally, but all we do is set the > activity state to wait-for-sipi and return back, allowing Linux to continue > its boot process. That's not pass-through, maybe call it reflection? I realize I'm being a bit pedantic, but differentiating between the two matters since true pass-through gives you bare metal performance whereas reflection obviously requires a round-trip VMX transition. Most hypervisors don't need a delay because they don't pass-through the local APIC and instead emulate INIT/SIPI/SIPI. In other words, forcing a delay for all hypervisors is unwarranted. The correct fix is probably to add a new hook to struct x86_hyper_init to provide a custom init delay, and add Bareflank as a new hypervisor.
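A very rough sketch of that suggestion is below, purely for illustration: the hook name, its signature, the return-value convention (delay in microseconds) and the Bareflank detection stub are all made up here, and the existing x86_hyper_init/hypervisor_x86 members are elided.

/* hypothetical new hook in struct x86_hyper_init (asm/x86_init.h) */
struct x86_hyper_init {
	/* ... existing hooks elided ... */
	unsigned int (*init_sipi_delay)(void);	/* usecs to wait after INIT */
};

/* hypothetical arch/x86/kernel/cpu/bareflank.c */
static unsigned int bareflank_init_sipi_delay(void)
{
	return 10000;	/* e.g. keep the legacy 10ms INIT delay */
}

static uint32_t __init bareflank_detect(void)
{
	/* e.g. probe a Bareflank-specific CPUID leaf */
	return 0;
}

const struct hypervisor_x86 x86_hyper_bareflank __initconst = {
	.name			= "Bareflank",
	.detect			= bareflank_detect,
	.init.init_sipi_delay	= bareflank_init_sipi_delay,
};

The SMP boot code would then consult the hook, when present, in place of its default INIT/SIPI delay heuristics.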
Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave
On Mon, Jun 25, 2018 at 05:00:05PM -0400, Nathaniel McCallum wrote: > On Thu, Jun 21, 2018 at 5:21 PM Sean Christopherson > wrote: > > > > On Thu, Jun 21, 2018 at 03:11:18PM -0400, Nathaniel McCallum wrote: > > > If this is acceptable for everyone, my hope is the following: > > > > > > 1. Intel would split the existing code into one of the following > > > schemas (I don't care which): > > > A. three parts: UEFI module, FLC-only kernel driver and user-space > > > launch enclave > > > B. two parts: UEFI module (including launch enclave) and FLC-only > > > kernel driver > > > > To make sure I understand correctly... > > > > The UEFI module would lock the LE MSRs with a public key hardcoded > > into both the UEFI module and the kernel at build time? > > > > And for the kernel, it would only load its SGX driver if FLC is > > supported and the MSRs are locked to the expected key? > > > > IIUC, this approach will cause problems for virtualization. Running > > VMs with different LE keys would require the bare metal firmware to > > configure the LE MSRs to be unlocked, which would effectively make > > using SGX in the host OS mutually exlusive with exposing SGX to KVM > > guests. Theoretically it would be possible for KVM to emulate the > > guest's LE and use the host's LE to generate EINIT tokens, but > > emulating an enclave would likely require a massive amount of code > > and/or complexity. > > How is this different from any other scenario where you lock the LE > MSRs? Unless Intel provides hardware support between the LE MSRs and > the VMX instructions, I don't see any way around this besides letting > any launch enclave run. It's not. All prior discussions have effectively required unlocked MSRs, so that's my baseline. > > > 2. Intel would release a reproducible build of the GPL UEFI module > > > sources signed with a SecureBoot trusted key and provide an > > > acceptable[0] binary redistribution license. > > > > > > 3. The kernel community would agree to merge the kernel driver given > > > the above criteria (and, obviously, acceptable kernel code). > > > > > > The question of how to distribute the UEFI module and possible launch > > > enclave remains open. I see two options: independent distribution and > > > bundling it in linux-firmware. The former may be a better > > > technological fit since the UEFI module will likely need to be run > > > before the kernel (and the boot loader; and shim). However, the latter > > > has the benefit of already being a well-known entity to our downstream > > > distributors. I could go either way on this. > > > > Writing and locks the LE MSRs effectively needs to be done before > > running the bootloader/kernel/etc... Delaying activation would > > require, at a minimum, leaving IA32_FEATURE_CONTROL unlocked since > > IA32_FEATURE_CONTROL's SGX bits can't be set until SGX is activated. > > > > > I know this plan is more work for everyone involved, but I think it > > > manages to actually maximize both security and freedom. 
> > > > > > [0]: details here - > > > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/README#n19 > > > On Thu, Jun 21, 2018 at 11:29 AM Neil Horman wrote: > > > > > > > > On Thu, Jun 21, 2018 at 08:32:25AM -0400, Nathaniel McCallum wrote: > > > > > On Wed, Jun 20, 2018 at 5:02 PM Sean Christopherson > > > > > wrote: > > > > > > > > > > > > On Wed, Jun 20, 2018 at 11:39:00AM -0700, Jethro Beekman wrote: > > > > > > > On 2018-06-20 11:16, Jethro Beekman wrote: > > > > > > > > > This last bit is also repeated in different words in Table > > > > > > > > > 35-2 and > > > > > > > > > Section 42.2.2. The MSRs are *not writable* before the > > > > > > > > > write-lock bit > > > > > > > > > itself is locked. Meaning the MSRs are either locked with > > > > > > > > > Intel's key > > > > > > > > > hash, or not locked at all. > > > > > > > > > > > > > > Actually, this might be a documentation bug. I have some test > > > > > > > hardware and I > > > > > > > was able to configure the MSRs in the BIOS and then read the MSRs > > > > > > > after boot > > >
Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache
On Tue, Jun 19, 2018 at 05:57:53PM +0300, Jarkko Sakkinen wrote: > On Fri, Jun 08, 2018 at 11:24:12AM -0700, Dave Hansen wrote: > > On 06/08/2018 10:09 AM, Jarkko Sakkinen wrote: > > > +static __init bool sgx_is_enabled(bool *lc_enabled) > > > { > > > unsigned long fc; > > > > > > @@ -41,12 +466,26 @@ static __init bool sgx_is_enabled(void) > > > if (!(fc & FEATURE_CONTROL_SGX_ENABLE)) > > > return false; > > > > > > + *lc_enabled = !!(fc & FEATURE_CONTROL_SGX_LE_WR); > > > + > > > return true; > > > } > > > > I'm baffled why lc_enabled is connected to the enclave page cache. > > KVM works only with writable MSRs. Driver works both with writable > and read-only MSRs. That's not true, KVM can/will support SGX regardless of whether or not Launch Control (LC) is available and/or enabled. KVM does need to know whether or not LC is enabled. Back to Dave's question, LC isn't connected to the EPC management, the LC code should be split into a separate patch. > Thanks, I'll try my best to deal with all this :-) > > /Jarkko
Re: [PATCH v11 08/13] x86, sgx: added ENCLS wrappers
On Fri, 2018-06-08 at 19:09 +0200, Jarkko Sakkinen wrote: > This commit adds wrappers for Intel(R) SGX ENCLS opcode functionality. > > Signed-off-by: Jarkko Sakkinen > --- > arch/x86/include/asm/sgx.h | 198 + > 1 file changed, 198 insertions(+) > > diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h > index fa3e6e0eb8af..a2f727f85b91 100644 > --- a/arch/x86/include/asm/sgx.h > +++ b/arch/x86/include/asm/sgx.h > @@ -10,6 +10,10 @@ > #ifndef _ASM_X86_SGX_H > #define _ASM_X86_SGX_H > > +#include > +#include > +#include > +#include > #include > > #define SGX_CPUID 0x12 > @@ -20,6 +24,200 @@ enum sgx_cpuid { > SGX_CPUID_EPC_BANKS = 2, > }; > > +enum sgx_commands { This should be something like "sgx_encls_leafs" and probably moved to sgx_arch.h (as Dave alluded to, these are architectural values). "sgx_commands" is not accurate (they're only the cpl0 "commands") and will have collision issues in the future, e.g. with the ENCLV instruction and its leafs. > + ECREATE = 0x0, > + EADD= 0x1, > + EINIT = 0x2, > + EREMOVE = 0x3, > + EDGBRD = 0x4, > + EDGBWR = 0x5, > + EEXTEND = 0x6, Even though it's not used in the code (yet...), I think ELDB, leaf 0x7, should be defined here for completeness. > + ELDU= 0x8, > + EBLOCK = 0x9, > + EPA = 0xA, > + EWB = 0xB, > + ETRACK = 0xC, > + EAUG= 0xD, > + EMODPR = 0xE, > + EMODT = 0xF, > +};
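Putting those two suggestions together, the end result might look something like the sketch below (naming is obviously up for discussion; the leaf values themselves are architectural):

/* e.g. in sgx_arch.h */
enum sgx_encls_leafs {
	ECREATE	= 0x0,
	EADD	= 0x1,
	EINIT	= 0x2,
	EREMOVE	= 0x3,
	EDGBRD	= 0x4,
	EDGBWR	= 0x5,
	EEXTEND	= 0x6,
	ELDB	= 0x7,	/* defined for completeness, not (yet) used */
	ELDU	= 0x8,
	EBLOCK	= 0x9,
	EPA	= 0xA,
	EWB	= 0xB,
	ETRACK	= 0xC,
	EAUG	= 0xD,
	EMODPR	= 0xE,
	EMODT	= 0xF,
};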
Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache
On Fri, 2018-06-08 at 19:09 +0200, Jarkko Sakkinen wrote: > SGX has a set of data structures to maintain information about the enclaves > and their security properties. BIOS reserves a fixed size region of > physical memory for these structures by setting Processor Reserved Memory > Range Registers (PRMRR). This memory area is called Enclave Page Cache > (EPC). > > This commit implements the basic routines to allocate and free pages from > different EPC banks. There is also a swapper thread ksgxswapd for EPC pages > that gets woken up by sgx_alloc_page() when we run below the low watermark. > The swapper thread continues swapping pages up until it reaches the high > watermark. > > Each subsystem that uses SGX must provide a set of callbacks for EPC > pages that are used to reclaim, block and write an EPC page. Kernel > takes the responsibility of maintaining LRU cache for them. > > Signed-off-by: Jarkko Sakkinen > --- > arch/x86/include/asm/sgx.h | 67 + > arch/x86/include/asm/sgx_arch.h | 224 > arch/x86/kernel/cpu/intel_sgx.c | 443 +++- > 3 files changed, 732 insertions(+), 2 deletions(-) > create mode 100644 arch/x86/include/asm/sgx_arch.h ... > +struct sgx_pcmd { > + struct sgx_secinfo secinfo; > + uint64_t enclave_id; > + uint8_t reserved[40]; > + uint8_t mac[16]; > +}; sgx_pcmd has a 128-byte alignment requirement. I think it's worth specifying here as sgx_pcmd is small enough that it could be put on the stack, e.g. by KVM when trapping and executing ELD* on behalf of a guest VM. In fact, it probably makes sense to add alightment attributes to all SGX structs for self-documentation purposes, even though many of them will never be allocated statically or on the stack. > + > +#define SGX_MODULUS_SIZE 384 > + > +struct sgx_sigstruct_header { > + uint64_t header1[2]; > + uint32_t vendor; > + uint32_t date; > + uint64_t header2[2]; > + uint32_t swdefined; > + uint8_t reserved1[84]; > +}; > + > +struct sgx_sigstruct_body { > + uint32_t miscselect; > + uint32_t miscmask; > + uint8_t reserved2[20]; > + uint64_t attributes; > + uint64_t xfrm; > + uint8_t attributemask[16]; > + uint8_t mrenclave[32]; > + uint8_t reserved3[32]; > + uint16_t isvprodid; > + uint16_t isvsvn; > +} __attribute__((__packed__)); > + > +struct sgx_sigstruct { > + struct sgx_sigstruct_header header; > + uint8_t modulus[SGX_MODULUS_SIZE]; > + uint32_t exponent; > + uint8_t signature[SGX_MODULUS_SIZE]; > + struct sgx_sigstruct_body body; > + uint8_t reserved4[12]; > + uint8_t q1[SGX_MODULUS_SIZE]; > + uint8_t q2[SGX_MODULUS_SIZE]; > +}; > + > +struct sgx_sigstruct_payload { > + struct sgx_sigstruct_header header; > + struct sgx_sigstruct_body body; > +}; > + > +struct sgx_einittoken_payload { > + uint32_t valid; > + uint32_t reserved1[11]; > + uint64_t attributes; > + uint64_t xfrm; > + uint8_t mrenclave[32]; > + uint8_t reserved2[32]; > + uint8_t mrsigner[32]; > + uint8_t reserved3[32]; > +}; > + > +struct sgx_einittoken { > + struct sgx_einittoken_payload payload; > + uint8_t cpusvnle[16]; > + uint16_t isvprodidle; > + uint16_t isvsvnle; > + uint8_t reserved2[24]; > + uint32_t maskedmiscselectle; > + uint64_t maskedattributesle; > + uint64_t maskedxfrmle; > + uint8_t keyid[32]; > + uint8_t mac[16]; > +}; > + > +struct sgx_report { > + uint8_t cpusvn[16]; > + uint32_t miscselect; > + uint8_t reserved1[28]; > + uint64_t attributes; > + uint64_t xfrm; > + uint8_t mrenclave[32]; > + uint8_t reserved2[32]; > + uint8_t mrsigner[32]; > + uint8_t reserved3[96]; > + uint16_t isvprodid; > + uint16_t isvsvn; > + uint8_t 
reserved4[60]; > + uint8_t reportdata[64]; > + uint8_t keyid[32]; > + uint8_t mac[16]; > +}; > + > +struct sgx_targetinfo { > + uint8_t mrenclave[32]; > + uint64_t attributes; > + uint64_t xfrm; > + uint8_t reserved1[4]; > + uint32_t miscselect; > + uint8_t reserved2[456]; > +}; > + > +struct sgx_keyrequest { > + uint16_t keyname; > + uint16_t keypolicy; > + uint16_t isvsvn; > + uint16_t reserved1; > + uint8_t cpusvn[16]; > + uint64_t attributemask; > + uint64_t xfrmmask; > + uint8_t keyid[32]; > + uint32_t miscmask; > + uint8_t reserved2[436]; > +};
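For reference, the PCMD alignment suggestion earlier in this reply would amount to something like this (sketch only; 128 bytes is the architectural alignment requirement noted above):

struct sgx_pcmd {
	struct sgx_secinfo secinfo;
	uint64_t enclave_id;
	uint8_t reserved[40];
	uint8_t mac[16];
} __attribute__((__aligned__(128)));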
[PATCH] x86/speculation/l1tf: Exempt zeroed PTEs from XOR conversion
clear_page() does not undergo the XOR logic to invert the address bits, i.e. PTE, PMD and PUD entries that have not been individually written will have val=0 and so will trigger __pte_needs_invert(). As a result, {pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones (adjusted by the max PFN mask) instead of zero. A zeroed entry is ok because the page at physical address 0 is reserved early in boot specifically to mitigate L1TF, so explicitly exempt them from the inversion when reading the PFN. Manifested as an unexpected mprotect(..., PROT_NONE) failure when called on a VMA that has VM_PFNMAP and was mmap'd to as something other than PROT_NONE but never used. mprotect() sends the PROT_NONE request down prot_none_walk(), which walks the PTEs to check the PFNs. prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns -EACCES because it thinks mprotect() is trying to adjust a high MMIO address. Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation") Signed-off-by: Sean Christopherson Cc: Andi Kleen Cc: Thomas Gleixner Cc: Josh Poimboeuf Cc: Michal Hocko Cc: Vlastimil Babka Cc: Dave Hansen Cc: Greg Kroah-Hartman --- arch/x86/include/asm/pgtable.h | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e4ffa565a69f..f21a1df4ca89 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -195,21 +195,24 @@ static inline u64 protnone_mask(u64 val); static inline unsigned long pte_pfn(pte_t pte) { phys_addr_t pfn = pte_val(pte); - pfn ^= protnone_mask(pfn); + if (pfn) + pfn ^= protnone_mask(pfn); return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT; } static inline unsigned long pmd_pfn(pmd_t pmd) { phys_addr_t pfn = pmd_val(pmd); - pfn ^= protnone_mask(pfn); + if (pfn) + pfn ^= protnone_mask(pfn); return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT; } static inline unsigned long pud_pfn(pud_t pud) { phys_addr_t pfn = pud_val(pud); - pfn ^= protnone_mask(pfn); + if (pfn) + pfn ^= protnone_mask(pfn); return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT; } -- 2.18.0
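The effect described above is easy to demonstrate with a quick userspace approximation of the relevant helpers (constants simplified, 4-level paging assumed; this mirrors the pre-fix logic and is not the kernel code):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PTE_PFN_MASK	0x000ffffffffff000ull	/* simplified */
#define _PAGE_PRESENT	0x1ull

/* mirrors __pte_needs_invert() prior to the fix: invert anything not present */
static int pte_needs_invert(uint64_t val)
{
	return !(val & _PAGE_PRESENT);
}

static uint64_t protnone_mask(uint64_t val)
{
	return pte_needs_invert(val) ? ~0ull : 0;
}

static uint64_t pte_pfn(uint64_t pte)
{
	uint64_t pfn = pte;

	pfn ^= protnone_mask(pfn);
	return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
}

int main(void)
{
	/* A cleared PTE should decode to PFN 0 but comes back as the max PFN. */
	printf("pte_pfn(0) = %#llx\n", (unsigned long long)pte_pfn(0));
	return 0;
}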
Re: [PATCH] x86/speculation/l1tf: Exempt zeroed PTEs from XOR conversion
On Fri, Aug 17, 2018 at 09:13:51AM -0700, Linus Torvalds wrote: > On Thu, Aug 16, 2018 at 1:47 PM Sean Christopherson > wrote: > > > > Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against > > speculation") > > This seems wrong. > > That commit doesn't invert a cleared page table entry, because that > commit still required _PAGE_PROTNONE being set for a pte to be > inverted. > > I'm assuming the real culprit is commit f22cc87f6c1f > ("x86/speculation/l1tf: Invert all not present mappings") which made > it look at _just_ the present bit. > > And yeah, that was wrong. > > So I really think a much better patch would be the appended one-liner. > > Note - it's whitespace-damaged by cut-and-paste, but it should be > obvious enough to apply by hand. > > Can you test this one instead? Checking for a non-zero val in __pte_needs_invert() also resolves the issue. I shied away from that change because prot_none_walk() doesn't pass the full PTE to __pte_needs_invert(), it only passes the pgprot_t bits. This works because PAGE_NONE sets the global and accessed bits, but it made me nervous nonetheless. > Linus > --- > > arch/x86/include/asm/pgtable-invert.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/pgtable-invert.h > b/arch/x86/include/asm/pgtable-invert.h > index 44b1203ece12..821438e91b77 100644 > --- a/arch/x86/include/asm/pgtable-invert.h > +++ b/arch/x86/include/asm/pgtable-invert.h > @@ -6,7 +6,7 @@ > > static inline bool __pte_needs_invert(u64 val) > { > - return !(val & _PAGE_PRESENT); > + return val && !(val & _PAGE_PRESENT); > } > > /* Get a mask to xor with the page table entry to get the correct pfn. */
[RFC PATCH v4 1/5] x86/vdso: Add support for exception fixup in vDSO functions
The basic concept and implementation is very similar to the kernel's exception fixup mechanism. The key differences are that the kernel handler is hardcoded and the fixup entry addresses are relative to the overall table as opposed to individual entries.

Hardcoding the kernel handler avoids the need to figure out how to get userspace code to point at a kernel function. Given that the expected usage is to propagate information to userspace, dumping all fault information into registers is likely the desired behavior for the vast majority of yet-to-be-created functions. Use registers DI, SI and DX to communicate fault information, which follows Linux's ABI for register consumption and hopefully avoids conflict with hardware features that might leverage the fixup capabilities, e.g. register usage for SGX instructions was at least partially designed with calling conventions in mind.

Making fixup addresses relative to the overall table allows the table to be stripped from the final vDSO image (it's a kernel construct) without complicating the offset logic, e.g. entry-relative addressing would also need to account for the table's location relative to the image.

Regarding stripping the table, modify vdso2c to extract the table from the raw, a.k.a. unstripped, data and dump it as a standalone byte array in the resulting .c file. The original base of the table, its length and a pointer to the byte array are captured in struct vdso_image. Alternatively, the table could be dumped directly into the struct, but because the number of entries can vary per image, that would require either hardcoding a max sized table into the struct definition or defining the table as a flexible length array. The flexible length array approach has zero benefits, e.g. the base/size are still needed, and prevents reusing the extraction code, while hardcoding the max size adds ongoing maintenance just to avoid exporting the explicit size.

The immediate use case is for Intel Software Guard Extensions (SGX). SGX introduces a new CPL3-only "enclave" mode that runs as a sort of black box shared object that is hosted by an untrusted "normal" CPL3 process. Entering an enclave can only be done through SGX-specific instructions, EENTER and ERESUME, and is a non-trivial process. Because of the complexity of transitioning to/from an enclave, the vast majority of enclaves are expected to utilize a library to handle the actual transitions. This is roughly analogous to how e.g. libc implementations are used by most applications.

Another crucial characteristic of SGX enclaves is that they can generate exceptions as part of their normal (at least as "normal" as SGX can be) operation that need to be handled *in* the enclave and/or are unique to SGX. And because they are essentially fancy shared objects, a process can host any number of enclaves, each of which can execute multiple threads simultaneously.

Putting everything together, userspace enclaves will utilize a library that must be prepared to handle any and (almost) all exceptions any time at least one thread may be executing in an enclave. Leveraging signals to handle the enclave exceptions is unpleasant, to put it mildly, e.g. the SGX library must constantly (un)register its signal handler based on whether or not at least one thread is executing in an enclave, and filter and forward exceptions that aren't related to its enclaves. This becomes particularly nasty when using multiple levels of libraries that register signal handlers, e.g.
running an enclave via cgo inside of the Go runtime. Enabling exception fixup in vDSO allows the kernel to provide a vDSO function that wraps the low-level transitions to/from the enclave, i.e. the EENTER and ERESUME instructions. The vDSO function can intercept exceptions that would otherwise generate a signal and return the fault information directly to its caller, thus avoiding the need to juggle signal handlers. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/Makefile | 4 +- arch/x86/entry/vdso/extable.c | 37 + arch/x86/entry/vdso/extable.h | 29 ++ arch/x86/entry/vdso/vdso-layout.lds.S | 9 - arch/x86/entry/vdso/vdso2c.h | 58 +++ arch/x86/include/asm/vdso.h | 5 +++ 6 files changed, 131 insertions(+), 11 deletions(-) create mode 100644 arch/x86/entry/vdso/extable.c create mode 100644 arch/x86/entry/vdso/extable.h diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index 0624bf2266fd..b8f7c301b88f 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -20,7 +20,7 @@ VDSO32-$(CONFIG_IA32_EMULATION) := y vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o # files to link into kernel -obj-y
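To make the relative-addressing scheme above concrete, the kernel-side lookup could look something like the sketch below. The entry layout and the extable/extable_base/extable_len fields on struct vdso_image are assumptions for illustration, derived from the description above rather than copied from the patch:

struct vdso_exception_table_entry {
	int insn, fixup;	/* both relative to the table's original base */
};

bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
			  unsigned long error_code, unsigned long fault_addr)
{
	const struct vdso_image *image = current->mm->context.vdso_image;
	const struct vdso_exception_table_entry *extable;
	unsigned int nr_entries, i;
	unsigned long base;

	if (!current->mm->context.vdso || !image || !image->extable_len)
		return false;

	base = (unsigned long)current->mm->context.vdso + image->extable_base;
	nr_entries = image->extable_len / sizeof(*extable);
	extable = image->extable;

	for (i = 0; i < nr_entries; i++) {
		if (regs->ip != base + extable[i].insn)
			continue;

		/*
		 * Redirect to the fixup handler and dump the fault info into
		 * DI, SI and DX, per the register convention described above.
		 */
		regs->ip = base + extable[i].fixup;
		regs->di = trapnr;
		regs->si = error_code;
		regs->dx = fault_addr;
		return true;
	}

	return false;
}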
[RFC PATCH v4 4/5] x86/traps: Attempt to fixup exceptions in vDSO before signaling
Call fixup_vdso_exception() in all trap flows that generate signals to userspace immediately prior to generating any such signal. If the exception is fixed, return cleanly and do not generate a signal. The goal of vDSO fixup is not to fixup all faults, nor is it to avoid all signals, but rather to report faults directly to userspace when the fault would otherwise directly result in a signal being sent to the process. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/kernel/traps.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 9b7c4ca8f0a7..b1ca05efb15e 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -61,6 +61,7 @@ #include #include #include +#include #ifdef CONFIG_X86_64 #include @@ -209,6 +210,9 @@ do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str, tsk->thread.error_code = error_code; tsk->thread.trap_nr = trapnr; die(str, regs, error_code); + } else { + if (fixup_vdso_exception(regs, trapnr, error_code, 0)) + return 0; } /* @@ -560,6 +564,9 @@ do_general_protection(struct pt_regs *regs, long error_code) return; } + if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0)) + return; + tsk->thread.error_code = error_code; tsk->thread.trap_nr = X86_TRAP_GP; @@ -774,6 +781,10 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) SIGTRAP) == NOTIFY_STOP) goto exit; + if (user_mode(regs) && + fixup_vdso_exception(regs, X86_TRAP_DB, error_code, 0)) + goto exit; + /* * Let others (NMI) know that the debug stack is in use * as we may switch to the interrupt stack. @@ -854,6 +865,9 @@ static void math_error(struct pt_regs *regs, int error_code, int trapnr) if (!si_code) return; + if (fixup_vdso_exception(regs, trapnr, error_code, 0)) + return; + force_sig_fault(SIGFPE, si_code, (void __user *)uprobe_get_trap_addr(regs), task); } -- 2.19.2
[RFC PATCH v4 3/5] x86/fault: Attempt to fixup unhandled #PF on ENCLU before signaling
Call fixup_sgx_enclu_exception() in the SIGSEGV and SIGBUS paths of the page fault handler immediately prior to signaling. If the fault is fixed, return cleanly and do not generate a signal. In the SIGSEGV flow, make sure the error code passed to userspace has been sanitized. Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index fefeb745d21d..c6f5f77ffabd 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -28,6 +28,7 @@ #include/* vma_pkey() */ #include/* efi_recover_from_page_fault()*/ #include /* store_idt(), ... */ +#include /* fixup_vdso_exception() */ #define CREATE_TRACE_POINTS #include @@ -928,6 +929,9 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, sanitize_error_code(address, &error_code); + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address)) + return; + if (likely(show_unhandled_signals)) show_signal_msg(regs, error_code, address, tsk); @@ -1047,6 +1051,9 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, sanitize_error_code(address, &error_code); + if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address)) + return; + set_signal_archinfo(address, error_code); #ifdef CONFIG_MEMORY_FAILURE -- 2.19.2
[RFC PATCH v4 5/5] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions
Intel Software Guard Extensions (SGX) introduces a new CPL3-only enclave mode that runs as a sort of black box shared object that is hosted by an untrusted normal CPL3 process.

Enclave transitions have semantics that are a lovely blend of SYSCALL, SYSRET and VM-Exit. In a non-faulting scenario, entering and exiting an enclave can only be done through SGX-specific instructions, EENTER and EEXIT respectively. EENTER+EEXIT is analogous to SYSCALL+SYSRET, e.g. EENTER/SYSCALL load RCX with the next RIP and EEXIT/SYSRET load RIP from R{B,C}X.

But in a faulting/interrupting scenario, enclave transitions act more like VM-Exit and VMRESUME. Maintaining the black box nature of the enclave means that hardware must automatically switch CPU context when an Asynchronous Exiting Event (AEE) occurs, an AEE being any interrupt or exception (exceptions are AEEs because asynchronous in this context is relative to the enclave and not CPU execution, e.g. the enclave doesn't get an opportunity to save/fuzz CPU state). Like VM-Exits, all AEEs jump to a common location, referred to as the Asynchronous Exiting Point (AEP). The AEP is specified at enclave entry via a register passed to EENTER/ERESUME, similar to how the hypervisor specifies the VM-Exit point (via VMCS.HOST_RIP at VMLAUNCH/VMRESUME). Resuming the enclave/VM after the exiting event is handled is done via ERESUME/VMRESUME respectively. In SGX, for AEEs that are handled by the kernel, e.g. INTR, NMI and most page faults, IRET journeys back to the AEP, which then ERESUMEs the enclave.

Enclaves also behave a bit like VMs in the sense that they can generate exceptions as part of their normal operation that for all intents and purposes need to be handled in the enclave/VM. However, unlike VMX, SGX doesn't allow the host to modify its guest's, a.k.a. enclave's, state, as doing so would circumvent the enclave's security. So to handle an exception, the enclave must first be re-entered through the normal EENTER flow (SYSCALL/SYSRET behavior), and then resumed via ERESUME (VMRESUME behavior) after the source of the exception is resolved.

All of the above is just the tip of the iceberg when it comes to running an enclave. But, SGX was designed in such a way that the host process can utilize a library to build, launch and run an enclave. This is roughly analogous to how e.g. libc implementations are used by most applications so that the application can focus on its business logic. The big gotcha is that because enclaves can generate *and* handle exceptions, any SGX library must be prepared to handle nearly any exception at any time (well, any time a thread is executing in an enclave). In Linux, this means the SGX library must register a signal handler in order to intercept relevant exceptions and forward them to the enclave (or in some cases, take action on behalf of the enclave). Unfortunately, Linux's signal mechanism doesn't mesh well with libraries, e.g. signal handlers are process wide, are difficult to chain, etc... This becomes particularly nasty when using multiple levels of libraries that register signal handlers, e.g. running an enclave via cgo inside of the Go runtime.

In comes vDSO to save the day. Now that vDSO can fixup exceptions, add a function, __vdso_sgx_enter_enclave(), to wrap enclave transitions and intercept any exceptions that occur when running the enclave.

__vdso_sgx_enter_enclave() accepts four parameters:

- The ENCLU leaf to execute (must be EENTER or ERESUME).
- A pointer to a Thread Control Structure (TCS).
  A TCS is a page within the enclave that defines/tracks the context of an enclave thread.
- An optional 'struct sgx_enclave_regs' pointer. If defined, the corresponding registers are loaded prior to entering the enclave and saved after (cleanly) exiting the enclave. The effective enclave register ABI follows the kernel x86-64 ABI. The x86-64 userspace ABI is not used due to RCX being usurped by hardware to pass the return RIP to the enclave.
- An optional 'struct sgx_enclave_exception' pointer. If provided, the struct is filled with the faulting ENCLU leaf, trapnr, error code and address if an unhandled exception occurs on ENCLU or in the enclave. An unhandled exception is an exception that would normally be delivered to userspace via a signal, e.g. SIGSEGV.

Note that this means that not all enclave exits are reported to the caller, e.g. interrupts and faults that are handled by the kernel do not trigger fixup; the kernel simply IRETs back to ENCLU[ERESUME], i.e. unconditionally resumes the enclave.

Suggested-by: Andy Lutomirski Cc: Andy Lutomirski Cc: Jarkko Sakkinen Cc: Dave Hansen Cc: Josh Triplett Cc: Haitao Huang Cc: Jethro Beekman Cc: Dr. Greg Wettstein Signed-off-by: Sean Christopherson --- arch/x86/entry/vdso/Makefile | 2 + arch/x86/entry/vdso/vdso.lds.S
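As a usage illustration of the interface described above, a library might wrap the vDSO function along these lines. The prototype, struct layouts and return-value convention are inferred from the parameter description and are assumptions rather than the actual uAPI; locating the symbol in the vDSO (e.g. via AT_SYSINFO_EHDR) is omitted:

#include <stdint.h>

#define EENTER	2	/* ENCLU leaf numbers */
#define ERESUME	3

struct sgx_enclave_regs {		/* assumed layout, for illustration */
	uint64_t rdi, rsi, rdx, r8, r9, r10;
};

struct sgx_enclave_exception {		/* assumed layout, for illustration */
	uint32_t leaf;			/* ENCLU leaf that faulted */
	uint16_t trapnr;		/* e.g. 14 for #PF */
	uint16_t error_code;
	uint64_t address;		/* fault address, if applicable */
};

typedef int (*sgx_enter_enclave_fn)(uint32_t leaf, void *tcs,
				    struct sgx_enclave_regs *regs,
				    struct sgx_enclave_exception *e);

/* Run one enclave thread; assumes a zero return means a clean EEXIT. */
static int run_enclave(sgx_enter_enclave_fn enter_enclave, void *tcs,
		       struct sgx_enclave_regs *regs)
{
	struct sgx_enclave_exception e = { 0 };
	uint32_t leaf = EENTER;

	for (;;) {
		int ret = enter_enclave(leaf, tcs, regs, &e);

		if (!ret)
			return 0;

		/*
		 * The exception was reported here instead of via a signal.  A
		 * real library would decide whether it can resolve the fault
		 * (possibly by EENTERing an in-enclave handler) and ERESUME,
		 * or give up and propagate the error.
		 */
		if (e.trapnr != 14 /* #PF */)
			return ret;

		leaf = ERESUME;		/* retry once the fault is resolved */
	}
}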
[RFC PATCH v4 2/5] x86/fault: Add helper function to sanitize error code
...to prepare for vDSO exception fixup, which will expose the error code to userspace and runs before set_signal_archinfo(), i.e. squashes the signal when fixup is successful. Signed-off-by: Sean Christopherson --- arch/x86/mm/fault.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 7e8a7558ca07..fefeb745d21d 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -719,18 +719,22 @@ pgtable_bad(struct pt_regs *regs, unsigned long error_code, oops_end(flags, regs, sig); } -static void set_signal_archinfo(unsigned long address, - unsigned long error_code) +static void sanitize_error_code(unsigned long address, + unsigned long *error_code) { - struct task_struct *tsk = current; - /* * To avoid leaking information about the kernel page * table layout, pretend that user-mode accesses to * kernel addresses are always protection faults. */ if (address >= TASK_SIZE_MAX) - error_code |= X86_PF_PROT; + *error_code |= X86_PF_PROT; +} + +static void set_signal_archinfo(unsigned long address, + unsigned long error_code) +{ + struct task_struct *tsk = current; tsk->thread.trap_nr = X86_TRAP_PF; tsk->thread.error_code = error_code | X86_PF_USER; @@ -771,6 +775,8 @@ no_context(struct pt_regs *regs, unsigned long error_code, * faulting through the emulate_vsyscall() logic. */ if (current->thread.sig_on_uaccess_err && signal) { + sanitize_error_code(address, &error_code); + set_signal_archinfo(address, error_code); /* XXX: hwpoison faults will set the wrong code. */ @@ -920,13 +926,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, if (is_errata100(regs, address)) return; - /* -* To avoid leaking information about the kernel page table -* layout, pretend that user-mode accesses to kernel addresses -* are always protection faults. -*/ - if (address >= TASK_SIZE_MAX) - error_code |= X86_PF_PROT; + sanitize_error_code(address, &error_code); if (likely(show_unhandled_signals)) show_signal_msg(regs, error_code, address, tsk); @@ -1045,6 +1045,8 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, if (is_prefetch(regs, error_code, address)) return; + sanitize_error_code(address, &error_code); + set_signal_archinfo(address, error_code); #ifdef CONFIG_MEMORY_FAILURE -- 2.19.2
[RFC PATCH v4 0/5] x86: Add vDSO exception fixup for SGX
Episode IV: The vDSO Strikes Back

After a brief detour into ioctl-based fixup, the vDSO implementation is back. Relative to v2 (the previous vDSO RFC), patch 4/4 once again contains the vast majority of changes.

__vdso_sgx_enter_enclave() is now written entirely in straight assembly. Implementing the expanded enclave ABI in inline assembly was an absolute train wreck as GCC's constraints don't play nice with the full spectrum of registers. And I suspect the previous vDSO RFCs were likely broken due to incorrect register clobbering (I never tested them end-to-end).

The exit_handler() concept proposed in v2 is gone. I expect most SGX libraries will directly call __vdso_sgx_enter_enclave(), at which point the overhead of returning effectively boils down to restoring the non-volatile registers, which is likely outweighed by the cost of the retpoline call (to the handler) alone. In other words, I doubt anyone will actually use the exit_handler()...

...except for libraries that want to manipulate the untrusted stack, i.e. force __vdso_sgx_enter_enclave() to treat RSP as volatile. At that point we're effectively maintaining two separate ABIs since the normal ABI would be e.g. "RBP, RSP and the red zone must be preserved" vs. the exit_handler() ABI's "RBP must be preserved". And *if* we want to deal with maintaining two ABIs, supporting the kernel's existing signal ABI is a lot cleaner (from a kernel perspective) than polluting the vDSO.

v1: https://lkml.kernel.org/r/20181205232012.28920-1-sean.j.christopher...@intel.com
v2: https://lkml.kernel.org/r/20181206221922.31012-1-sean.j.christopher...@intel.com
v3: https://lkml.kernel.org/r/20181210232141.5425-1-sean.j.christopher...@intel.com

v4:
- Back to vDSO
- Implement __vdso_sgx_enter_enclave() directly in assembly
- Modify effective enclave register ABI to follow x86-64 kernel ABI
- Split __vdso_sgx_enter_enclave input into separate non-union params
- Drop the exit_handler() concept

Sean Christopherson (5):
x86/vdso: Add support for exception fixup in vDSO functions
x86/fault: Add helper function to sanitize error code
x86/fault: Attempt to fixup unhandled #PF on ENCLU before signaling
x86/traps: Attempt to fixup exceptions in vDSO before signaling
x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions

arch/x86/entry/vdso/Makefile | 6 +- arch/x86/entry/vdso/extable.c| 37 ++ arch/x86/entry/vdso/extable.h| 29 + arch/x86/entry/vdso/vdso-layout.lds.S| 9 +- arch/x86/entry/vdso/vdso.lds.S | 1 + arch/x86/entry/vdso/vdso2c.h | 58 -- arch/x86/entry/vdso/vsgx_enter_enclave.S | 136 +++ arch/x86/include/asm/vdso.h | 5 + arch/x86/include/uapi/asm/sgx.h | 44 arch/x86/kernel/traps.c | 14 +++ arch/x86/mm/fault.c | 33 -- 11 files changed, 349 insertions(+), 23 deletions(-) create mode 100644 arch/x86/entry/vdso/extable.c create mode 100644 arch/x86/entry/vdso/extable.h create mode 100644 arch/x86/entry/vdso/vsgx_enter_enclave.S -- 2.19.2
Re: [PATCH v17 18/23] platform/x86: Intel SGX driver
On Mon, Dec 17, 2018 at 08:01:02PM +0200, Jarkko Sakkinen wrote: > On Mon, Dec 17, 2018 at 09:45:40AM -0800, Dave Hansen wrote: > > > +struct sgx_encl *sgx_encl_alloc(struct sgx_secs *secs) > > > +{ > > ... > > > + kref_init(&encl->refcount); > > > + INIT_LIST_HEAD(&encl->add_page_reqs); > > > + INIT_RADIX_TREE(&encl->page_tree, GFP_KERNEL); > > > + mutex_init(&encl->lock); > > > + INIT_WORK(&encl->add_page_work, sgx_add_page_worker); > > > + > > > + encl->mm = current->mm; <-> + > > > encl->base = secs->base; > > > + encl->size = secs->size; > > > + encl->ssaframesize = secs->ssa_frame_size; > > > + encl->backing = backing; > > > + > > > + return encl; > > > +} > > > > How is this OK without taking a reference on the mm?

It's subtle and the ordering is all kinds of weird, but technically we are taking a reference on mm when the mmu_notifier is registered in sgx_encl_create(). sgx_encl_alloc() and sgx_encl_create() are always called in tandem and with mm->mm_users > 0, so we'll never use encl->mm without holding a reference to mm. We need to comment the weirdness or maybe register the notifier before

> > I have a feeling a bunch of your bugs with the mmu notifiers and so > > forth are because the refcounting is wrong here.

Eh, not really. Maybe the mmu_notifier is more subtle, e.g. calling do_unmap() after mmput() would be quite obvious, but there's no fundamental bug, we just haven't needed to touch VMAs during release prior to moving away from shmem.

> > Sean's SGX_ENCL_MM_RELEASED would, I think be unnecessary if you just > > take a refcount here and release it when the enclave is destroyed. > > Right, atomic_inc(encl->mm->count) here and once when releasing. > > The we would not even need the whole mmu notifier in the first place.

I'm pretty sure doing mmget() would result in circular dependencies and a zombie enclave. In the do_exit() case where a task is abruptly killed:

- __mmput() is never called because the enclave holds a ref
- sgx_encl_release() is never called because its VMAs hold refs
- sgx_vma_close() is never called because __mmput()->exit_mmap() is blocked and the process itself is dead, i.e. won't unmap anything.
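A clarifying note on the refcounting options discussed above: pinning mm_count via mmgrab()/mmdrop() only keeps the mm_struct allocated, whereas mmget()/mmput() pins mm_users and keeps exit_mmap() from running, which is what creates the circular dependency described here. A rough sketch of the former follows; it is not the driver's actual code:

#include <linux/sched.h>
#include <linux/sched/mm.h>

struct sgx_encl_sketch {		/* stand-in, not the driver's struct */
	struct mm_struct *mm;
};

static void sgx_encl_sketch_bind_mm(struct sgx_encl_sketch *encl)
{
	encl->mm = current->mm;
	mmgrab(encl->mm);		/* pins mm_count, does not block exit_mmap() */
}

static void sgx_encl_sketch_release(struct sgx_encl_sketch *encl)
{
	/* The address space may already be torn down; only the struct is pinned. */
	mmdrop(encl->mm);
}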
Re: [PATCH v17 18/23] platform/x86: Intel SGX driver
On Mon, Dec 17, 2018 at 08:23:19PM +0200, Jarkko Sakkinen wrote: > On Mon, Dec 17, 2018 at 10:09:57AM -0800, Sean Christopherson wrote: > > No, EREMOVE should never fail if the enclave is being released, i.e. all > > references to the enclave are gone. And failure during sgx_encl_release() > > means we leaked an EPC page, which warrants a WARN. > > Right that what I was suspecting as swapper should hold a ref to the > enclave while it is working on it. It is a programming error when this > happens. > > Maybe change the boolean parameter to flags parameter have a flag to > use sgx_free_page()? I tried that approach when I first split it to __sgx_free_page() and sgx_free_page(), but IMO the code is more difficult to read and harder to maintain since sgx_free_page() should be used except under special circumstances, e.g. race with reclaim or the freeing is "untrusted", i.e. requested by userspace via sgx_ioc_enclave_remove_pages(). > > > That makes sense. > > What do you think of Dave's proposal? > > /Jarkko
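For readers following along, the __sgx_free_page()/sgx_free_page() split mentioned above presumably looks something like the sketch below; the helper names and bodies are illustrative guesses, not the actual driver code. Failures propagate out of the double-underscore variant for callers that can legitimately hit them (e.g. racing with reclaim, or freeing pages at userspace's request), while the default variant WARNs because a failure there means a leaked EPC page:

#include <linux/bug.h>

struct sgx_epc_page;

/* Placeholders for illustration only */
int __eremove(void *epc_addr);				/* wraps ENCLS[EREMOVE] */
void *sgx_epc_addr(struct sgx_epc_page *page);
void sgx_put_epc_page(struct sgx_epc_page *page);	/* return page to the free list */

/* Caller handles EREMOVE failure, e.g. race with reclaim or untrusted free. */
static int __sgx_free_page(struct sgx_epc_page *page)
{
	int ret = __eremove(sgx_epc_addr(page));

	if (ret)
		return ret;

	sgx_put_epc_page(page);
	return 0;
}

/* Default API: failure here means an EPC page was leaked, so WARN. */
static void sgx_free_page(struct sgx_epc_page *page)
{
	int ret = __sgx_free_page(page);

	WARN(ret, "sgx: EREMOVE returned %d, EPC page leaked\n", ret);
}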