Hi all,

This series contains the RFC design for Xen SGX virtualization support, together
with RFC draft patches.

Intel SGX (Software Guard Extensions) is a new set of instructions and memory
access mechanisms targeted at application developers seeking to protect select
code and data from disclosure or modification.

The SGX specification can be found in the latest Intel SDM as Volume 3D:

    https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf

The SGX specification is relatively complicated (it takes up the entire Volume
3D), so it is unrealistic to list all hardware details here. The first part of
the design is a brief SGX introduction, which I think is mandatory for
introducing SGX virtualization support. Part 2 is the design itself, and
references are listed at the end.

In the first part I only introduce the information related to virtualization
support, although this is definitely not the most important part of SGX. Other
parts of SGX (mostly related to cryptography), i.e., enclave measurement, SGX key
architecture, Sealing & Attestation (which is actually a critical feature), are
omitted. Please refer to the SGX specification for detailed info.

In the design there are some particular points where I don't know which
implementation is better. For those I added a question mark (?) at the right of
the corresponding entry in the table of contents below. Your comments on those
parts (and other comments as well, of course) are highly appreciated.

Because SGX has lots of details, the design itself can only be high level, so I
also included the RFC patches, which contain lots of details. Your comments on
the patches are also highly appreciated.

The code can also be found at the github repo below:

    # git clone https://github.com/01org/xen-sgx -b rfc-v1

There is another branch named 4.6-sgx, which is another implementation based on
Xen 4.6. It is old, but it differs from the rfc-v1 patches in some design
choices (e.g., it adds a dedicated hypercall).

Please help to review and give comments. Thanks in advance.

==============================================================================

1. SGX Introduction
    1.1 Overview
        1.1.1 Enclave
        1.1.2 EPC (Enclave Page Cache)
        1.1.3 ENCLS and ENCLU
    1.2 Discovering SGX Capability
        1.2.1 Enumerate SGX via CPUID
        1.2.2 Intel SGX Opt-in Configuration
    1.3 Enclave Life Cycle
        1.3.1 Constructing & Destroying Enclave
        1.3.2 Enclave Entry and Exit
            1.3.2.1 Synchronous Entry and Exit
            1.3.2.2 Asynchronous Enclave Exit
        1.3.3 EPC Eviction and Reload
    1.4 SGX Launch Control
    1.5 SGX Interaction with IA32 and Intel 64 Architecture
2. SGX Virtualization Design
    2.1 High Level Toolstack Changes
        2.1.1 New 'epc' parameter
        2.1.2 New XL commands (?)
        2.1.3 Notify domain's virtual EPC base and size to Xen
        2.1.4 Launch Control Support (?)
    2.2 High Level Hypervisor Changes
        2.2.1 EPC Management (?)
        2.2.2 EPC Virtualization (?)
        2.2.3 Populate EPC for Guest
        2.2.4 New Dedicated Hypercall (?)
        2.2.5 Launch Control Support
        2.2.6 CPUID Emulation
        2.2.7 MSR Emulation
        2.2.8 EPT Violation & ENCLS Trapping Handling
        2.2.9 Guest Suspend & Resume
        2.2.10 Destroying Domain
    2.3 Additional Point: Live Migration, Snapshot Support (?)
3. Reference

1. SGX Introduction

1.1 Overview

1.1.1 Enclave

Intel Software Guard Extensions (SGX) is a set of instructions and memory access
mechanisms that provide secure access for sensitive applications and data. SGX
allows an application to use a particular part of its address space as an
*enclave*, which is a protected area that provides confidentiality and integrity
even in the presence of privileged malware. Accesses to the enclave memory area
from any software not resident in the enclave are prevented, including those
from privileged software. The diagram below illustrates the presence of an
enclave in an application.

        |-----------------------|
        |                       |
        |   |---------------|   |
        |   |   OS kernel   |   |       |-----------------------|
        |   |---------------|   |       |                       |
        |   |               |   |       |   |---------------|   |
        |   |---------------|   |       |   | Entry table   |   |
        |   |   Enclave     |---|-----> |   |---------------|   |
        |   |---------------|   |       |   | Enclave stack |   |
        |   |   App code    |   |       |   |---------------|   |
        |   |---------------|   |       |   | Enclave heap  |   |
        |   |   Enclave     |   |       |   |---------------|   |
        |   |---------------|   |       |   | Enclave code  |   |
        |   |   App code    |   |       |   |---------------|   |
        |   |---------------|   |       |                       |
        |           |           |       |-----------------------|
        |-----------------------|

SGX supports SGX1 and SGX2 extensions. SGX1 provides basic enclave support,
and SGX2 allows additional flexibility in runtime management of enclave
resources and thread execution within an enclave.

1.1.2 EPC (Enclave Page Cache)

Just like normal application memory management, enclave memory management can be
divided into two parts: address space allocation and memory commitment. Address
space allocation is allocating a particular range of linear address space for
the enclave. Memory commitment is assigning actual resources to the enclave.

Enclave Page Cache (EPC) is the physical resource that is committed to enclaves.
EPC is divided into 4K pages, and each EPC page is always aligned to a 4K
boundary. Hardware performs additional access control checks to restrict access
to EPC pages. The Enclave Page Cache Map (EPCM) is a secure structure which
holds one entry for each EPC page, and is used by hardware to track the status
of each EPC page (invisible to software). Typically EPC and EPCM are reserved by
BIOS as Processor Reserved Memory, but the actual amount, size, and layout of
EPC are model-specific and dependent on BIOS settings. EPC is enumerated via the
new SGX CPUID leaf, and is reported as reserved memory.

EPC pages can either be invalid or valid. There are 4 valid EPC types in SGX1:
regular EPC page, SGX Enclave Control Structure (SECS) page, Thread Control
Structure (TCS) page, and Version Array (VA) page.
SGX2 adds the Trimmed EPC page type. Each enclave is associated with one SECS
page. Each thread in an enclave is associated with one TCS page. VA pages are
used in EPC page eviction and reload. The Trimmed EPC page type is introduced in
SGX2 for the case where a particular 4K page in an enclave is going to be freed
(trimmed) at runtime after the enclave is initialized.

1.1.3 ENCLS and ENCLU

Two new instructions, ENCLS and ENCLU, are introduced to manage enclaves and EPC.
ENCLS can only run in ring 0, while ENCLU can only run in ring 3. Both ENCLS and
ENCLU have multiple leaf functions, with EAX indicating the specific leaf
function.

SGX1 supports the ENCLS and ENCLU leaves below:

    ENCLS:
    - ECREATE, EADD, EEXTEND, EINIT, EREMOVE (Enclave build and destroy)
    - EPA, EBLOCK, ETRACK, EWB, ELDU/ELDB (EPC eviction & reload)

    ENCLU:
    - EENTER, EEXIT, ERESUME (Enclave entry, exit, re-enter)
    - EGETKEY, EREPORT (SGX key derivation, attestation)

Additionally, SGX2 supports the ENCLS and ENCLU leaves below, which add and
remove EPC pages at runtime after the enclave is initialized, and change page
permissions:

    ENCLS:
    - EAUG, EMODT, EMODPR

    ENCLU:
    - EACCEPT, EACCEPTCOPY, EMODPE

The VMM is able to interfere with ENCLS running in guest (see 1.5.1 VMX Changes
for Supporting SGX Virtualization) but is unable to interfere with ENCLU.

1.2 Discovering SGX Capability

1.2.1 Enumerate SGX via CPUID

If CPUID.0x7.0:EBX.SGX (bit 2) is 1, then the processor supports SGX, and SGX
capability and resources can be enumerated via the new SGX CPUID leaf (0x12).
CPUID.0x12.0x0 reports SGX capability, such as the presence of SGX1 and SGX2,
and the enclave's maximum size for both 32-bit and 64-bit applications.
CPUID.0x12.0x1 reports the availability of bits that can be set for
SECS.ATTRIBUTES. CPUID.0x12.0x2 reports the EPC resource's base and size. A
platform may support multiple EPC sections, and CPUID.0x12.0x3 and further
sub-leaves can be used to detect the existence of multiple EPC sections (until
CPUID reports an invalid EPC section).
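
To make the EPC enumeration above concrete, here is a minimal sketch that
decodes one CPUID.0x12 sub-leaf (0x2 and above) into an EPC base and size. The
register layout follows my reading of the SDM; the struct and function names
(cpuid_regs, epc_section, decode_epc_section, epc_section_pages) are
hypothetical and are not taken from the RFC patches. The caller is assumed to
have executed CPUID already and to pass in the raw register values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Raw register values returned by CPUID for one sub-leaf of leaf 0x12. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* Decoded EPC section (hypothetical helper type, not from the patches). */
struct epc_section {
    uint64_t base;  /* physical base address, 4K aligned */
    uint64_t size;  /* size in bytes, 4K aligned */
};

/*
 * Decode CPUID.0x12 sub-leaf >= 2. EAX[3:0] is the sub-leaf type:
 * 0 means "invalid" (end of the list), 1 means "EPC section".
 * Returns false when the sub-leaf does not describe an EPC section.
 */
static bool decode_epc_section(const struct cpuid_regs *r,
                               struct epc_section *out)
{
    if ((r->eax & 0xf) != 1)
        return false;

    /* EAX[31:12] = base bits 31:12, EBX[19:0] = base bits 51:32. */
    out->base = ((uint64_t)(r->ebx & 0xfffff) << 32) |
                (r->eax & 0xfffff000u);
    /* ECX[31:12] = size bits 31:12, EDX[19:0] = size bits 51:32. */
    out->size = ((uint64_t)(r->edx & 0xfffff) << 32) |
                (r->ecx & 0xfffff000u);
    return true;
}

/* Number of 4K EPC pages in a decoded section. */
static uint64_t epc_section_pages(const struct epc_section *s)
{
    assert((s->size & 0xfff) == 0);
    return s->size >> 12;
}
```

A caller would loop over sub-leaves starting at 0x2 and stop at the first one
for which decode_epc_section() returns false.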

Refer to SDM chapter 37.7.2 Intel SGX Resource Enumeration Leaves for a full
description of SGX CPUID leaf 0x12.

1.2.2 Intel SGX Opt-in Configuration

On processors that support Intel SGX, IA32_FEATURE_CONTROL also provides the
SGX_ENABLE bit (bit 18) to turn SGX on or off. Before system software can enable
and use SGX, BIOS is required to set IA32_FEATURE_CONTROL.SGX_ENABLE = 1 to
opt in to SGX.

Setting SGX_ENABLE follows the rules of IA32_FEATURE_CONTROL.LOCK (bit 0).
Software is considered to have opted into Intel SGX if and only if
IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK are both set to 1.

The setting of IA32_FEATURE_CONTROL.SGX_ENABLE (bit 18) is not reflected by SGX
CPUID. Enclave instructions will behave differently according to the value of
CPUID.0x7.0x0:EBX.SGX and whether BIOS has opted in to SGX.

Refer to SDM chapter 37.7.1 Intel SGX Opt-in Configuration for more information.

1.3 Enclave Life Cycle

1.3.1 Constructing & Destroying Enclave

An enclave is created via the ENCLS[ECREATE] leaf by privileged software.
Basically ECREATE converts an invalid EPC page into an SECS page, according to a
source SECS structure that resides in normal memory. The source SECS contains
the enclave's info such as base (linear) address, size, enclave attributes,
enclave's measurement, etc.

After ECREATE, for each 4K page of the enclave's linear address space,
privileged software uses EADD and EEXTEND to add one EPC page to it. Enclave
code/data (which resides in normal memory) is loaded into the enclave during
EADD, one 4K page at a time. After all EPC pages are added to the enclave,
privileged software calls EINIT to initialize the enclave, and then the enclave
is ready to run.

While the enclave is being constructed, the enclave measurement, which is a
SHA256 hash value, is also built according to the enclave's size, the code/data
itself and its location in the enclave, etc. The measurement can be used to
uniquely identify the enclave. SIGSTRUCT in the EINIT leaf also contains the
measurement specified by untrusted software, via MRENCLAVE.
EINIT will check the two measurements and will only succeed when the two match.

An enclave is destroyed by running EREMOVE on all of the enclave's EPC pages,
and then on the enclave's SECS. EREMOVE will report the SGX_CHILD_PRESENT error
if it is called on the SECS while there are still regular EPC pages that haven't
been removed from the enclave.

Please refer to SDM chapter 39.1 Constructing an Enclave for more information.

1.3.2 Enclave Entry and Exit

1.3.2.1 Synchronous Entry and Exit

After the enclave is constructed, non-privileged software uses ENCLU[EENTER] to
enter the enclave to run. While the process runs in the enclave, non-privileged
software can use ENCLU[EEXIT] to exit from the enclave and return to normal
mode.

1.3.2.2 Asynchronous Enclave Exit

Asynchronous and synchronous events, such as exceptions, interrupts, traps,
SMIs, and VM exits may occur while executing inside an enclave. These events are
referred to as Enclave Exiting Events (EEE). Upon an EEE, the processor state is
securely saved inside the enclave and then replaced by a synthetic state to
prevent leakage of secrets. The process of securely saving state and
establishing the synthetic state is called an Asynchronous Enclave Exit (AEX).

After AEX, non-privileged software uses ENCLU[ERESUME] to re-enter the enclave.
The SGX userspace software maintains a small piece of code (which resides in
normal memory) that basically calls ERESUME to re-enter the enclave. The address
of this piece of code is called the Asynchronous Exit Pointer (AEP). AEP is
specified as a parameter of EENTER and is kept internally in the enclave. Upon
AEX, AEP will be pushed to the stack, and upon returning from EEE handling
(such as IRET), AEP will be loaded into RIP and ERESUME will subsequently be
called to re-enter the enclave.

During AEX the processor saves and restores context automatically, therefore no
change to the interrupt handling of the OS kernel and VMM is required. It is the
SGX userspace software's responsibility to set up AEP correctly.

Please refer to SDM chapter 39.2 Enclave Entry and Exit for more information.

1.3.3 EPC Eviction and Reload

SGX also allows privileged software to evict any EPC pages that are used by an
enclave. The idea is the same as normal memory swapping. The details of how to
evict EPC pages are given below.

Below is the sequence to evict a regular EPC page:

    1) Select one or multiple regular EPC pages from one enclave
    2) Remove EPT/PT mapping for selected EPC pages
    3) Send IPIs to remote CPUs to flush TLB of selected EPC pages
    4) EBLOCK on selected EPC pages
    5) ETRACK on enclave's SECS page
    6) Allocate one available slot (8-byte) in VA page
    7) EWB on selected EPC pages

With EWB taking:

    - VA slot, to store eviction version info.
    - one normal 4K page in memory, to store encrypted content of EPC page.
    - one struct PCMD in memory, to store meta data.

(A VA slot is an 8-byte slot in a VA page, which is a particular EPC page.)

And below is the sequence to evict an SECS page or VA page:

    1) Locate SECS (or VA) page
    2) Remove EPT/PT mapping for SECS (or VA) page
    3) Send IPIs to remote CPUs
    4) Allocate one available slot (8-byte) in VA page
    5) EWB on SECS (or VA) page

For evicting an SECS page, all regular EPC pages that belong to that SECS must
be evicted beforehand, otherwise EWB returns the SGX_CHILD_PRESENT error.

And to reload an EPC page:

    1) ELDU/ELDB on EPC page
    2) Set up EPT/PT mapping

With ELDU/ELDB taking:

    - location of SECS page
    - linear address of enclave's 4K page (that we are going to reload to)
    - VA slot (used in EWB)
    - 4K page in memory (used in EWB)
    - struct PCMD in memory (used in EWB)

Please refer to SDM chapter 39.5 EPC and Management of EPC pages for more
information.

***********
Instruction Behavior changes in Enclave
    - Illegal instructions inside enclave
        Instruction     Result      Comment
        CPUID,GETSEC,RDPMC,SGDT,SIDT,SLDT,STR,VMCALL,

1.4 SGX Launch Control

SGX requires running a "Launch Enclave" (LE) before running any other enclaves.
This is because LE is the only enclave that does not require EINITTOKEN in
EINIT. Running any other enclave requires a valid EINITTOKEN, which contains a
MAC of the EINITTOKEN (its first 192 bytes) calculated with the EINITTOKEN key.
EINIT will verify the MAC by internally deriving the EINITTOKEN key, and only an
EINITTOKEN with a matching MAC will be accepted by EINIT. The EINITTOKEN key
derivation depends on some info from LE. The typical process is that LE
generates the EINITTOKEN for another enclave according to LE itself and the
target enclave, and calculates the MAC by using ENCLU[EGETKEY] to get the
EINITTOKEN key. Only LE is able to get the EINITTOKEN key.

Running LE requires the SHA256 hash of the LE signer's RSA public key (SHA256 of
sigstruct->modulus) to equal the IA32_SGXLEPUBKEYHASH[0-3] MSRs (the 4 MSRs
together make up the 256-bit SHA256 hash value).

If CPUID.0x7.0x0:EBX.SGX is set, then IA32_SGXLEPUBKEYHASHn are readable. If
CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL[bit 30] is set, then the
IA32_FEATURE_CONTROL MSR has the SGX_LAUNCH_CONTROL_ENABLE bit (bit 17)
available. Setting the SGX_LAUNCH_CONTROL_ENABLE bit to 1 enables runtime change
of IA32_SGXLEPUBKEYHASHn after IA32_FEATURE_CONTROL is locked. Otherwise,
IA32_SGXLEPUBKEYHASHn are read-only after IA32_FEATURE_CONTROL is locked. By
default, IA32_SGXLEPUBKEYHASHn are set to the SHA256 hash of Intel's default RSA
public key. The above mechanism allows a 3rd party to run their own LE.

On a physical machine, typically BIOS will provide an option to *lock* or
*unlock* IA32_SGXLEPUBKEYHASHn before transferring to the OS. BIOS may also
provide an interface for the user to change the default value of
IA32_SGXLEPUBKEYHASHn, but which interfaces are provided is BIOS implementation
dependent.

1.5 SGX Interaction with IA32 and Intel 64 Architecture

SDM Chapter 42 describes SGX interaction with various features in the IA32 and
Intel 64 architecture. The major ones are outlined below. Refer to Chapter 42
for a full description of SGX interaction with various IA32 and Intel 64
features.

1.5.1 VMX Changes for Supporting SGX Virtualization

A new 64-bit ENCLS-exiting bitmap control field is added to the VMCS (encoding
0202EH) to control VMEXIT on ENCLS leaf functions. And a new "Enable ENCLS
exiting" control bit (bit 15) is defined in the secondary processor-based VM
execution controls. Setting "Enable ENCLS exiting" to 1 enables the
ENCLS-exiting bitmap control. The ENCLS-exiting bitmap controls which ENCLS
leaves will trigger VMEXIT.

Additionally, two new bits are added to indicate whether a VMEXIT (of any kind)
is from an enclave. The two bits below will be set if the VMEXIT is from an
enclave:

    - Bit 27 in the Exit reason field of Basic VM-exit information.
    - Bit 4 in the Interruptibility State of Guest Non-Register State of VMCS.

Refer to 42.5 Interactions with VMX, 27.2.1 Basic VM-Exit Information, and
27.3.4 Saving Non-Register State.

1.5.2 Interaction with XSAVE

SGX defines a sub-field called X-Feature Request Mask (XFRM) in the attributes
field of SECS. On enclave entry, SGX HW verifies that the features in
SECS.ATTRIBUTES.XFRM are already enabled in XCR0.

Upon AEX, SGX saves the processor extended state and miscellaneous state to the
enclave's state-save area (SSA), and clears the secrets from the processor
extended state that is used by the enclave (to avoid leaking secrets).

Refer to 42.7 Interaction with Processor Extended State and Miscellaneous State.

1.5.3 Interaction with S state

When the processor goes into S3-S5 state, EPC is destroyed, and consequently all
enclaves are destroyed as well.

Refer to 42.14 Interaction with S States.

2. SGX Virtualization Design

2.1 High Level Toolstack Changes:

2.1.1 New 'epc' parameter

EPC is a limited resource. In order to use EPC efficiently among all domains,
the administrator should be able to specify a domain's virtual EPC size when
creating the guest. The admin should also be able to get the virtual EPC size of
every domain.

For this purpose, a new 'epc = <size>' parameter is added to the XL
configuration file. This parameter specifies the guest's virtual EPC size.
The EPC base address will be calculated by the toolstack internally, according
to the guest's memory size, MMIO size, etc. 'epc' is in units of MB, and any 1MB
aligned value will be accepted.

2.1.2 New XL commands (?)

The administrator should be able to get the physical EPC size, and every
domain's virtual EPC size. For this purpose, we can introduce 2 additional
commands:

    # xl sgxinfo

Which will print out the physical EPC size, and other SGX info (such as SGX1,
SGX2, etc) if necessary.

    # xl sgxlist <did>

Which will print out a particular domain's virtual EPC size, or list the virtual
EPC sizes of all supported domains.

Alternatively, we can also extend existing XL commands by adding a new option:

    # xl info -sgx

Which will print out the physical EPC size along with other physinfo. And

    # xl list <did> -sgx

Which will print out the domain's virtual EPC size.

Comments?

In my RFC patches I didn't implement the commands as I don't know which is
better. In the github repo I mentioned at the beginning, there's an old branch
in which I implemented 'xl sgxinfo' and 'xl sgxlist', but they are implemented
via a dedicated hypercall for SGX. I am not sure whether that is a good option,
so I didn't include it in my RFC patches.

2.1.3 Notify domain's virtual EPC base and size to Xen

Xen needs to know the guest's EPC base and size in order to populate EPC pages
for it. The toolstack notifies Xen of the EPC base and size via
XEN_DOMCTL_set_cpuid.

2.1.4 Launch Control Support (?)

Xen Launch Control Support is about supporting multiple domains that each run
their own LE signed by a different owner (if HW allows, explained below). As
explained in 1.4 SGX Launch Control, EINIT for LE (Launch Enclave) only succeeds
when SHA256(SIGSTRUCT.modulus) matches IA32_SGXLEPUBKEYHASHn, and EINIT for
other enclaves will derive the EINITTOKEN key according to
IA32_SGXLEPUBKEYHASHn.
Therefore, to support this, the guest's virtual IA32_SGXLEPUBKEYHASHn must be
written to the physical MSRs before EINIT (which also means the physical
IA32_SGXLEPUBKEYHASHn need to be *unlocked* in BIOS before booting to the OS).

For a physical machine, it is the BIOS writer's decision whether BIOS provides
an interface for the user to specify a customized IA32_SGXLEPUBKEYHASHn (it
defaults to the digest of Intel's signing key after reset). In reality, the OS's
SGX driver may require BIOS to leave the MSRs *unlocked* and actively write the
hash value to the MSRs in order to run EINIT successfully, as in this case the
driver does not depend on BIOS's capability (whether it allows the user to
customize the IA32_SGXLEPUBKEYHASHn value).

The question for Xen is: do we need a new parameter, such as
'lehash=<SHA256>', to specify the default value of the guest's virtual
IA32_SGXLEPUBKEYHASHn? And do we need a new parameter, such as 'lewr', to
specify whether the guest's virtual MSRs are locked or not before handing over
to the guest's OS?

I tend not to introduce 'lehash', as it seems the SGX driver would actively
update the MSRs, and a new parameter would require additional changes in upper
layer software (such as openstack). And 'lewr' is not needed either, as Xen can
always *unlock* the MSRs for the guest.

Please give comments?

Currently in my RFC patches the above two parameters are not implemented. The
Xen hypervisor will always *unlock* the MSRs. Whether there is a 'lehash'
parameter or not doesn't impact the Xen hypervisor's emulation of
IA32_SGXLEPUBKEYHASHn. See the Xen hypervisor changes below for details.

2.2 High Level Xen Hypervisor Changes:

2.2.1 EPC Management (?)

The Xen hypervisor needs to detect SGX, discover EPC, and manage EPC before
exposing SGX to guests. EPC is detected via SGX CPUID 0x12.0x2. It's possible
that there are multiple EPC sections (enumerated via sub-leaves 0x3 and so on,
until an invalid EPC section is reported), but this is only true on
multiple-socket server machines.
For server machines there are additional things that also need to be done, such
as NUMA EPC, scheduling, etc. We will support server machines in the future, but
currently we only support one EPC section.

EPC is reported as reserved memory (so it is not reported as normal memory). EPC
must be managed in 4K pages. CPU hardware uses EPCM to track the status of each
EPC page. Xen needs to manage EPC and provide functions to, e.g., alloc and free
EPC pages for guests.

There are two ways to manage EPC: manage EPC separately, or integrate it into
the existing memory management framework.

It is easy to manage EPC separately, as currently EPC is pretty small (~100MB),
and we can even put all pages in a single list. However it is not flexible: for
example, you will have to write new algorithms when EPC becomes larger (e.g.,
GBs), and you have to write new code to support NUMA EPC (although this will not
come in the short term).

Integrating EPC into the existing memory management framework seems more
reasonable, as in this way we can reuse the memory management data structures
and algorithms, and it will be more flexible to support larger EPC and
potentially NUMA EPC. But modifying the MM framework has a higher risk of
breaking existing memory management code (potentially more bugs).

In my RFC patches we currently choose to manage EPC separately. A new structure,
epc_page, is added to represent a single 4K EPC page. A whole array of struct
epc_page is allocated during EPC initialization, so that the PFN of an EPC page
and its 'struct epc_page' can each be derived from the other by adding an
offset.

But maybe integrating EPC into the MM framework is more reasonable. Comments?

2.2.2 EPC Virtualization (?)

This part is about how to populate EPC for guests. We have 3 choices:

    - Static Partitioning
    - Oversubscription
    - Ballooning

Static Partitioning means all EPC pages are allocated and mapped to the guest
when it is created, and there's no runtime change of page table mappings for EPC
pages.
Oversubscription means the Xen hypervisor supports EPC page swapping between
domains, meaning Xen is able to evict an EPC page from another domain and assign
it to the domain that needs the EPC. With oversubscription, EPC can be assigned
to a domain on demand, when an EPT violation happens. Ballooning is similar to
memory ballooning. It is basically "Static Partitioning" + "Balloon driver" in
guest.

Static Partitioning is the easiest in terms of implementation, and there is no
hypervisor overhead (except EPT overhead of course), because with "Static
Partitioning" there is no EPT violation for EPC, and Xen doesn't need to turn on
ENCLS VMEXIT for the guest, as ENCLS runs perfectly well in non-root mode.

Like "Static Partitioning", ballooning doesn't need to turn on ENCLS VMEXIT, and
doesn't have EPT violations for EPC either. To support ballooning, we need a
ballooning driver in the guest to issue hypercalls to give up or reclaim EPC
pages. In terms of the hypercall, we have two choices: 1) add a new hypercall
for EPC ballooning; 2) use the existing XENMEM_{increase/decrease}_reservation
with a new memory flag, i.e., XENMEMF_epc. I'll discuss whether to add a
dedicated hypercall later.

Oversubscription looks nice but requires a more complicated implementation.
Firstly, as explained in 1.3.3 EPC Eviction and Reload, we need to follow
specific steps to evict EPC pages, and in order to do that, basically Xen needs
to trap ENCLS from the guest and keep track of EPC page status and enclave info
for all guests. This is because:

    - To evict a regular EPC page, Xen needs to know the SECS location.
    - Xen needs to know the EPC page type: evicting a regular EPC page and
      evicting an SECS or VA page have different steps.
    - Xen needs to know the EPC page status: whether the page is blocked or not.

This info can only be obtained by trapping ENCLS from the guest and parsing its
parameters (to identify the SECS page, etc).
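
As a rough illustration of the bookkeeping listed above, the sketch below tracks
page type, blocked status and owning SECS per EPC page, and updates that state
after a trapped ENCLS leaf has succeeded. It is not from the RFC patches (which
implement Static Partitioning only); struct epc_track and all function names are
hypothetical, only a few leaves are modelled, and treating TCS pages like
regular pages for eviction is an assumption of this sketch. The leaf numbers are
the EAX values defined by the SDM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ENCLS leaf numbers (value of EAX); only the leaves modelled here. */
enum encls_leaf {
    ENCLS_ECREATE = 0x0,
    ENCLS_EADD    = 0x1,
    ENCLS_EREMOVE = 0x3,
    ENCLS_EBLOCK  = 0x9,
    ENCLS_EPA     = 0xa,
};

enum epc_type { EPC_INVALID = 0, EPC_SECS, EPC_TCS, EPC_REG, EPC_VA };

/* Hypothetical per-EPC-page tracking state kept by the hypervisor. */
struct epc_track {
    enum epc_type type;
    bool blocked;             /* EBLOCK has been run on this page */
    struct epc_track *secs;   /* owning SECS (regular/TCS pages only) */
    unsigned int children;    /* pages still attached (SECS only) */
};

/* Update tracking state after the trapped leaf ran successfully. */
static void epc_track_on_success(enum encls_leaf leaf, struct epc_track *page,
                                 struct epc_track *secs, enum epc_type type)
{
    switch (leaf) {
    case ENCLS_ECREATE:
        page->type = EPC_SECS;
        page->children = 0;
        break;
    case ENCLS_EADD:
        assert(secs != NULL && secs->type == EPC_SECS);
        page->type = type;    /* EPC_REG or EPC_TCS, taken from SECINFO */
        page->secs = secs;
        secs->children++;
        break;
    case ENCLS_EPA:
        page->type = EPC_VA;
        break;
    case ENCLS_EBLOCK:
        page->blocked = true;
        break;
    case ENCLS_EREMOVE:
        if (page->secs != NULL)
            page->secs->children--;
        page->type = EPC_INVALID;
        page->blocked = false;
        page->secs = NULL;
        page->children = 0;
        break;
    }
}

/* Regular (and, by assumption, TCS) pages need EBLOCK/ETRACK before EWB. */
static bool epc_needs_eblock(const struct epc_track *page)
{
    return page->type == EPC_REG || page->type == EPC_TCS;
}

/* An SECS can only be evicted once all of its child pages are gone. */
static bool epc_secs_evictable(const struct epc_track *page)
{
    return page->type == EPC_SECS && page->children == 0;
}
```

Even this reduced model needs one record per EPC page plus a link to the owning
SECS, which gives a feel for the state Xen would have to keep consistent with
the guest under Oversubscription.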

Parsing ENCLS parameters means we need to know which ENCLS leaf is being
trapped, and we need to translate the guest's virtual address to a physical
address in order to locate the EPC page. And once ENCLS is trapped, we have to
emulate ENCLS in Xen, which means we need to reconstruct the ENCLS parameters by
remapping all of the guest's virtual addresses to Xen's virtual addresses
(gva->gpa->pa->xen_va), as ENCLS always uses an *effective address* which must
be translatable by the processor when running ENCLS.

        --------------------------------------------------------------
                |   ENCLS   |
        --------------------------------------------------------------
                |          /|\
    ENCLS VMEXIT|           | VMENTRY
                |           |
               \|/          |

            1) parse ENCLS parameters
            2) reconstruct(remap) guest's ENCLS parameters
            3) run ENCLS on behalf of guest (and skip ENCLS)
            4) on success, update EPC/enclave info, or inject error

Xen also needs to maintain each EPC page's status (type, blocked or not, in
enclave or not, etc), and to maintain info on all enclaves from all guests, in
order to find the correct SECS for a regular EPC page, and the enclave's linear
address as well.

So in general, "Static Partitioning" has the simplest implementation, but is
obviously not the best way to use EPC efficiently; "Ballooning" has all the pros
of Static Partitioning but requires a guest balloon driver; "Oversubscription"
is best in terms of flexibility but requires a complicated hypervisor
implementation.

We have implemented "Static Partitioning" in the RFC patches, but need your
feedback on whether it is enough. If not, which one should we do at the next
stage: Ballooning or Oversubscription? IMO Ballooning may be good enough, given
the fact that currently memory is also "Static Partitioning" + "Ballooning".

Comments?

2.2.3 Populate EPC for Guest

The toolstack notifies Xen of the domain's EPC base and size via
XEN_DOMCTL_set_cpuid, so currently Xen populates all EPC pages for the guest in
XEN_DOMCTL_set_cpuid, specifically when handling XEN_DOMCTL_set_cpuid for
CPUID.0x12.0x2.
Once Xen has checked that the values passed from the toolstack are valid, Xen
will allocate all EPC pages and set up EPT mappings for the guest.

2.2.4 New Dedicated Hypercall (?)

Without a dedicated new hypercall, the changes mentioned so far have to be
implemented as follows:

    - xl sgxinfo (or xl info -sgx)

    The toolstack can do this by running SGX CPUID directly, along with
    checking the host cpu featureset.

    - xl sgxlist (or xl list -sgx)

    This is not quite straightforward. It looks like we have to extend
    xen_domctl_getdomaininfo. However SGX is an Intel specific feature, so I am
    not sure it's a good idea to extend xen_domctl_getdomaininfo.

    - Populate EPC for guest

    In XEN_DOMCTL_set_cpuid, Xen populates EPC pages for the guest after
    receiving the EPC base and size from the toolstack.

    - Potential EPC Ballooning

    Need to add a new XENMEMF_epc flag and use the existing
    XENMEM_{increase/decrease}_reservation.

With a new hypercall for SGX (i.e., XEN_sgx_op), all of the above can be
consolidated into the hypercall. We can also extend it to a more generic
hypercall for Intel platforms generally (i.e., XEN_intel_op).
For example, the new hypercall would look like:

    #define XEN_INTEL_SGX_physinfo      0x1
    struct xen_sgx_physinfo {
        /* OUT */
        unsigned long total_epc_pages;
        unsigned long free_epc_pages;
    };
    typedef struct xen_sgx_physinfo xen_sgx_physinfo_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_physinfo_t);

    #define XEN_INTEL_SGX_setup_epc     0x2
    struct xen_sgx_setup_epc {
        /* IN */
        domid_t domid;
        unsigned long epc_base_gfn;
        unsigned long total_epc_pages;
    };
    typedef struct xen_sgx_setup_epc xen_sgx_setup_epc_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_setup_epc_t);

    #define XEN_INTEL_SGX_dominfo       0x3
    struct xen_sgx_dominfo {
        /* IN */
        domid_t domid;
        /* OUT */
        unsigned long epc_base_gfn;
        unsigned long total_epc_pages;
    };
    typedef struct xen_sgx_dominfo xen_sgx_dominfo_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_dominfo_t);

    struct xen_sgx_op {
        /* XEN_INTEL_SGX_* */
        int cmd;
        union {
            struct xen_sgx_physinfo physinfo;
            struct xen_sgx_setup_epc setup_epc;
            struct xen_sgx_dominfo dominfo;
        } u;
    };
    typedef struct xen_sgx_op xen_sgx_op_t;
    DEFINE_XEN_GUEST_HANDLE(xen_sgx_op_t);

    /* New arch specific hypercall for Intel platform specific operations,
     * __HYPERVISOR_arch_0 is used by Xen x86 machine check... */
    #define __HYPERVISOR_intel_op   __HYPERVISOR_arch_1

    /* Currently only SGX uses this */
    #define XEN_INTEL_OP_sgx        (0x1 << 1)
    struct xen_intel_op {
        int cmd;    /* XEN_INTEL_OP_*** */
        union {
            struct xen_sgx_op sgx_op;
        } u;
    };
    typedef struct xen_intel_op xen_intel_op_t;
    DEFINE_XEN_GUEST_HANDLE(xen_intel_op_t);

In my RFC patches, the new hypercall is not implemented as I am not sure whether
it is a good idea.

Comments?

2.2.5 Launch Control Support

To support running multiple domains that each run their own LE signed by a
different owner, the physical machine's BIOS must leave IA32_SGXLEPUBKEYHASHn
*unlocked* before handing over to Xen. Xen will trap the domain's writes to
IA32_SGXLEPUBKEYHASHn and keep the value in the vcpu internally, and write the
value to the physical MSRs when the vcpu is scheduled in.
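
The trap-and-shadow scheme can be sketched as below. This is only an
illustration, not code from the RFC patches: struct vcpu_sgx, vsgx_wrmsr() and
vsgx_ctxt_switch_to() are hypothetical names, and the physical MSRs are modelled
by a plain array so that the logic can be exercised outside the hypervisor
(real code would use wrmsr). 0x8c is the address of IA32_SGXLEPUBKEYHASH0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_SGXLEPUBKEYHASH0  0x8c
#define NR_LEHASH_MSRS             4

/* Stand-in for one CPU's physical IA32_SGXLEPUBKEYHASHn MSRs. */
static uint64_t phys_lehash[NR_LEHASH_MSRS];
/* False when BIOS left the MSRs locked (read-only). */
static bool phys_lehash_writable = true;

/* Per-vcpu shadow copy of the guest's virtual IA32_SGXLEPUBKEYHASHn. */
struct vcpu_sgx {
    uint64_t lehash[NR_LEHASH_MSRS];
};

/*
 * Guest WRMSR intercept. Only the shadow copy is updated here.
 * Returns 0 on success, -1 if an error should be injected into the guest.
 */
static int vsgx_wrmsr(struct vcpu_sgx *v, uint32_t msr, uint64_t val)
{
    assert(v != NULL);
    if (msr < MSR_IA32_SGXLEPUBKEYHASH0 ||
        msr >= MSR_IA32_SGXLEPUBKEYHASH0 + NR_LEHASH_MSRS)
        return -1;
    if (!phys_lehash_writable)
        return -1;
    v->lehash[msr - MSR_IA32_SGXLEPUBKEYHASH0] = val;
    return 0;
}

/* Called when the vcpu is scheduled in: load its hash into the "hardware". */
static void vsgx_ctxt_switch_to(const struct vcpu_sgx *v)
{
    assert(v != NULL);
    if (!phys_lehash_writable)
        return;
    for (size_t i = 0; i < NR_LEHASH_MSRS; i++)
        if (phys_lehash[i] != v->lehash[i])
            phys_lehash[i] = v->lehash[i];  /* skip redundant writes */
}
```

Comparing before writing is just one possible way to avoid unnecessary MSR
writes when consecutive vcpus on a CPU use the same hash.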
Updating the physical MSRs at schedule-in guarantees that when EINIT runs in the
guest, the guest's virtual IA32_SGXLEPUBKEYHASHn have been written to the
physical MSRs.

The SGX_LAUNCH_CONTROL_ENABLE bit will always be set in the guest's
IA32_FEATURE_CONTROL MSR (see 2.1.4 Launch Control Support).

If the physical IA32_SGXLEPUBKEYHASHn are *locked* in the machine's BIOS, then
only MSR reads are allowed from the guest, and Xen will inject an error for the
guest's MSR writes.

If CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL is not present, then this feature will
not be exposed to the guest either, and the SGX_LAUNCH_CONTROL_ENABLE bit is set
to 0 (as it is invalid).

2.2.6 CPUID Emulation

Most of the native SGX CPUID info can be exposed to the guest, except the two
parts below:

    - Sub-leaf 0x2 needs to report the domain's virtual EPC base and size,
      instead of the physical EPC info.
    - Sub-leaf 0x1 needs to be consistent with the guest's XCR0. For the reason,
      please refer to 1.5.2 Interaction with XSAVE.

2.2.7 MSR Emulation

The SGX_ENABLE bit in IA32_FEATURE_CONTROL is always set if SGX is exposed to
the guest. The SGX_LAUNCH_CONTROL_ENABLE bit is handled as described in 2.2.5.
Any write from the guest to IA32_FEATURE_CONTROL is ignored.

IA32_SGXLEPUBKEYHASHn emulation is described in 2.2.5.

2.2.8 EPT Violation & ENCLS Trapping Handling

Only needed when Xen supports EPC Oversubscription, as explained above.

2.2.9 Guest Suspend & Resume

On hardware, EPC is destroyed when power goes to S3-S5. So Xen will destroy the
guest's EPC when the guest's power goes into S3-S5. Currently Xen is notified by
Qemu of S state changes via HVM_PARAM_ACPI_S_STATE, at which point Xen will
destroy EPC if the S state is S3-S5.

Specifically, Xen will run EREMOVE on each of the guest's EPC pages. The guest
may not handle EPC suspend & resume correctly, in which case the guest's EPC
pages may physically still be valid, so Xen needs to run EREMOVE to make sure
all EPC pages become invalid. Otherwise further operations on EPC in the guest
may fault, as the guest assumes all EPC pages are invalid after it is resumed.
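
The EPC reset pass, including the retry needed for SECS pages that report
SGX_CHILD_PRESENT, can be sketched with a small simulation. This is not the code
from the RFC patches: sim_eremove() is a mock that only models the
"SECS with children" rule of EREMOVE, and all names (sim_epc_page,
reset_domain_epc, MAX_SIM_PAGES) are hypothetical. The real code would execute
ENCLS[EREMOVE] on each page instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SGX_CHILD_PRESENT  13   /* EREMOVE error code for a SECS with children */
#define MAX_SIM_PAGES      64

enum page_type { PT_INVALID = 0, PT_SECS, PT_REG };

/* Simulated EPC page: type plus index of the owning SECS (-1 if none). */
struct sim_epc_page {
    enum page_type type;
    int secs;
};

/* Mock EREMOVE: fails on a SECS page that still has child pages. */
static int sim_eremove(struct sim_epc_page *pages, size_t n, size_t idx)
{
    if (pages[idx].type == PT_SECS)
        for (size_t i = 0; i < n; i++)
            if (pages[i].type == PT_REG && pages[i].secs == (int)idx)
                return SGX_CHILD_PRESENT;

    pages[idx].type = PT_INVALID;
    pages[idx].secs = -1;
    return 0;
}

/*
 * Invalidate every EPC page of a domain. Pass 1 runs EREMOVE on all pages
 * and remembers the SECS pages that failed with SGX_CHILD_PRESENT; pass 2
 * retries those once their children are gone.
 * Returns the number of pages that are still valid (expected to be 0).
 */
static size_t reset_domain_epc(struct sim_epc_page *pages, size_t n)
{
    bool deferred[MAX_SIM_PAGES] = { false };
    size_t i, still_valid = 0;

    assert(n <= MAX_SIM_PAGES);

    for (i = 0; i < n; i++)
        if (sim_eremove(pages, n, i) == SGX_CHILD_PRESENT)
            deferred[i] = true;

    for (i = 0; i < n; i++)
        if (deferred[i] && sim_eremove(pages, n, i) != 0)
            still_valid++;

    return still_valid;
}
```

The same two-pass routine could be shared between the suspend path and domain
destruction, since both need every EPC page of the domain to end up invalid.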

For an SECS page, EREMOVE may fail with SGX_CHILD_PRESENT, in which case Xen
will put this SECS page on a list, and call EREMOVE on the listed pages again
after EREMOVE has been called on all EPC pages. This time the EREMOVE on the
SECS will succeed, as all children (regular EPC pages) have already been
removed.

2.2.10 Destroying Domain

Normally Xen just frees all EPC pages of a domain when it is destroyed. But Xen
will also run EREMOVE on all of the guest's EPC pages (as described in 2.2.9
above) before freeing them, as the guest may shut down unexpectedly (e.g., the
user kills the guest), and in this case the guest's EPC may still be valid.

2.3 Additional Point: Live Migration, Snapshot Support (?)

Actually, from the hardware's point of view, SGX is not migratable. There are
two reasons:

    - The SGX key architecture cannot be virtualized. Some keys are bound to
      the CPU, for example the Sealing key, EREPORT key, etc. If a VM is
      migrated to another machine, the same enclave will derive different keys.
      Taking the Sealing key as an example, the Sealing key is typically used
      by an enclave (the enclave can get the sealing key via EGETKEY) to *seal*
      its secrets to the outside (e.g., persistent storage) for further use. If
      the Sealing key changes after VM migration, then the enclave can never
      get the sealed secrets back by using the sealing key, as it has changed,
      and the old sealing key cannot be recovered.

    - There's no ENCLS leaf that evicts an EPC page to normal memory while at
      the same time keeping its content in EPC. Currently once an EPC page is
      evicted, the EPC page becomes invalid. So technically, we are unable to
      implement live migration (or check pointing, or snapshot) for enclaves.

But, with some workarounds, and given some facts about the existing SGX drivers,
technically we are able to support live migration (or even check pointing and
snapshot). This is because:

    - Changing a key (which is bound to the CPU) is not a problem in reality

    Take the Sealing key as an example.
  Losing sealed data is not a problem, because the Sealing key is only
  supposed to encrypt secrets that can be provisioned again. The typical
  working model is that the enclave gets secrets provisioned from a remote
  party (the service provider) and uses the Sealing key to store them for
  later use. When the enclave tries to *unseal* them with the Sealing key
  and the key has changed, the enclave will find the data corrupted
  (integrity check failure), so it will ask for the secrets to be
  provisioned again from the remote party. Another reason is that in a data
  center VMs typically share lots of data, and as the Sealing key is bound
  to the CPU, data encrypted by one enclave on one machine cannot be shared
  with another enclave on another machine. So from the SGX app writer's
  point of view, the developer should treat the Sealing key as a changeable
  key and should handle loss of sealed data anyway. The Sealing key should
  only be used to seal secrets that can easily be provisioned again.

  For other keys such as the EREPORT key and the provisioning key, which
  are used for local attestation and remote attestation, losing them is not
  a problem either, due to the second reason below.

- Sudden loss of EPC is not a problem.

  On hardware, EPC is lost if the system goes to S3-S5, resets, or shuts
  down, and the SGX driver needs to handle loss of EPC due to power
  transitions. This is done by cooperation between the SGX driver and the
  userspace SGX SDK/apps. However, during live migration there may be no
  power transition in the guest, so there may be no EPC loss during live
  migration. And technically we cannot *really* live migrate an enclave
  (as explained above), so it looks infeasible. But the fact is that both
  the Linux SGX driver and the Windows SGX driver already support *sudden*
  loss of EPC (not just EPC loss during a power transition), which means
  both drivers are able to recover if EPC is lost at any point at runtime.
  With this, we are technically able to support live migration by simply
  ignoring EPC.
  After the VM is migrated, the destination VM will only suffer a *sudden*
  loss of EPC, which both the Windows SGX driver and the Linux SGX driver
  are already able to handle.

  But we must point out that such *sudden* loss of EPC is not hardware
  behavior, and SGX drivers for other OSes (such as FreeBSD) may not
  implement this, so for those guests the destination VM will behave in an
  unexpected manner. But I am not sure we need to care about other OSes.

For the same reason, we are able to support checkpointing for SGX guests
(only Linux and Windows).

For snapshot, we can support snapshotting an SGX guest by either:

- Suspending the guest before the snapshot (S3-S5). This works for all
  guests but requires the user to manually suspend the guest.
- Issuing a hypercall to destroy the guest's EPC in save_vm. This only
  works for Linux and Windows but doesn't require user intervention.

What are your comments?

3. Reference

- Intel SGX Homepage
  https://software.intel.com/en-us/sgx
- Linux SGX SDK
  https://01.org/intel-software-guard-extensions
- Linux SGX driver for upstreaming
  https://github.com/01org/linux-sgx
- Intel SGX Specification (SDM Vol 3D)
  https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf
- Paper: Intel SGX Explained
  https://eprint.iacr.org/2016/086.pdf
- ISCA 2015 tutorial slides for Intel® SGX - Intel® Software
  https://software.intel.com/sites/default/files/332680-002.pdf

Kai Huang (15):
  xen: x86: expose SGX to HVM domain in CPU featureset
  xen: vmx: detect ENCLS VMEXIT
  xen: x86: add early stage SGX feature detection
  xen: mm: add ioremap_cache
  xen: p2m: new 'p2m_epc' type for EPC mapping
  xen: x86: add SGX basic EPC management
  xen: x86: add functions to populate and destroy EPC for domain
  xen: x86: add SGX cpuid handling support.
  xen: vmx: handle SGX related MSRs
  xen: vmx: handle ENCLS VMEXIT
  xen: vmx: handle VMEXIT from SGX enclave
  xen: x86: reset EPC when guest got suspended.
  xen: tools: add new 'epc' parameter support
  xen: tools: add SGX to applying CPUID policy
  xen: tools: expose EPC in ACPI table

 tools/firmware/hvmloader/util.c             |  23 +
 tools/firmware/hvmloader/util.h             |   3 +
 tools/libacpi/build.c                       |   3 +
 tools/libacpi/dsdt.asl                      |  49 ++
 tools/libacpi/dsdt_acpi_info.asl            |   6 +-
 tools/libacpi/libacpi.h                     |   1 +
 tools/libxc/include/xc_dom.h                |   4 +
 tools/libxc/include/xenctrl.h               |  10 +
 tools/libxc/xc_cpuid_x86.c                  |  68 ++-
 tools/libxl/libxl.h                         |   3 +-
 tools/libxl/libxl_cpuid.c                   |  15 +-
 tools/libxl/libxl_create.c                  |   9 +
 tools/libxl/libxl_dom.c                     |  36 +-
 tools/libxl/libxl_internal.h                |   2 +
 tools/libxl/libxl_nocpuid.c                 |   4 +-
 tools/libxl/libxl_types.idl                 |   6 +
 tools/libxl/libxl_x86.c                     |  12 +
 tools/libxl/libxl_x86_acpi.c                |   3 +
 tools/ocaml/libs/xc/xenctrl_stubs.c         |  11 +-
 tools/python/xen/lowlevel/xc/xc.c           |  11 +-
 tools/xl/xl_parse.c                         |   5 +
 xen/arch/x86/cpuid.c                        |  87 ++-
 xen/arch/x86/domctl.c                       |  47 +-
 xen/arch/x86/hvm/hvm.c                      |   3 +
 xen/arch/x86/hvm/vmx/Makefile               |   1 +
 xen/arch/x86/hvm/vmx/sgx.c                  | 871 ++++++++++++++++++++++++++++
 xen/arch/x86/hvm/vmx/vmcs.c                 |  21 +
 xen/arch/x86/hvm/vmx/vmx.c                  |  73 +++
 xen/arch/x86/hvm/vmx/vvmx.c                 |  11 +
 xen/arch/x86/mm.c                           |  15 +-
 xen/arch/x86/mm/p2m-ept.c                   |   3 +
 xen/arch/x86/mm/p2m.c                       |  41 ++
 xen/include/asm-x86/cpufeature.h            |   4 +
 xen/include/asm-x86/cpuid.h                 |  26 +-
 xen/include/asm-x86/hvm/hvm.h               |   3 +
 xen/include/asm-x86/hvm/vmx/sgx.h           | 100 ++++
 xen/include/asm-x86/hvm/vmx/vmcs.h          |  10 +
 xen/include/asm-x86/hvm/vmx/vmx.h           |   3 +
 xen/include/asm-x86/msr-index.h             |   6 +
 xen/include/asm-x86/p2m.h                   |  12 +-
 xen/include/public/arch-x86/cpufeatureset.h |   3 +-
 xen/include/xen/vmap.h                      |   1 +
 xen/tools/gen-cpuid.py                      |   3 +
 43 files changed, 1607 insertions(+), 21 deletions(-)
 create mode 100644 xen/arch/x86/hvm/vmx/sgx.c
 create mode 100644 xen/include/asm-x86/hvm/vmx/sgx.h

--
2.11.0

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel