On Tue, May 28, 2024 at 09:23:13AM -0700, Dave Hansen wrote:
> On 5/17/24 04:06, Dmitrii Kuvaiskii wrote:
> ...
>
> First, why is SGX so special here? How is the SGX problem different
> than what the core mm code does?
Here is my understanding why SGX is so special and why I have to introduce
a new bit, SGX_ENCL_PAGE_BEING_REMOVED.

In SGX's removal of an enclave page, two operations must happen atomically:
the PTE entry must be removed and the page must be EREMOVE'd. Generally, to
guarantee atomicity, encl->lock is acquired. Ideally, if this encl->lock
could be acquired at the beginning of sgx_encl_remove_pages() and released
at the very end of this function, there would be no EREMOVE page vs. EAUG
page data race, and my bug fix (with the SGX_ENCL_PAGE_BEING_REMOVED bit)
wouldn't be needed.

However, the current implementation of sgx_encl_remove_pages() has to
release encl->lock before removing the PTE entry. Releasing the lock is
required because the function that removes the PTE entry --
sgx_zap_enclave_ptes() -- acquires another, enclave-MM lock:
mmap_read_lock(encl_mm->mm).

The two locks must be taken in this order:

  1. mmap_read_lock(encl_mm->mm)
  2. mutex_lock(&encl->lock)

This lock order is apparent from e.g. sgx_encl_add_page(). It also makes
intuitive sense: VMA callbacks are called with the MM lock held, so the MM
lock must come first in the lock order.

So, if sgx_encl_remove_pages() did _not_ release encl->lock before calling
sgx_zap_enclave_ptes(), it would violate the lock order and might lead to
deadlocks. At the same time, releasing encl->lock in the middle of the
two-operation flow leads to the data race that I found in this patch
series.

Quick summary:

- Removing an enclave page requires two operations: removing the PTE and
  performing EREMOVE.
- The complete flow of removing an enclave page cannot be protected by a
  single encl->lock critical section, because that would violate the lock
  order and could lead to deadlocks.
- The current upstream implementation thus breaks the flow into two
  critical sections, releasing encl->lock before sgx_zap_enclave_ptes()
  and re-acquiring it afterwards. This opens the window for the data race.
- My patch restores "atomicity" of the flow by introducing a new flag,
  SGX_ENCL_PAGE_BEING_REMOVED (see the sketch at the end of this mail).

> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -25,6 +25,9 @@
> >  /* 'desc' bit marking that the page is being reclaimed. */
> >  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
> >
> > +/* 'desc' bit marking that the page is being removed. */
> > +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
>
> Second, convince me that this _needs_ a new bit. Why can't we just have
> a bit that effectively means "return EBUSY if you see this bit when
> handling a fault".

As Haitao mentioned in his reply, the SGX_ENCL_PAGE_BEING_RECLAIMED bit is
also used in reclaimer_writing_to_pcmd(). If we re-used this bit to mark a
page being removed, reclaimer_writing_to_pcmd() would incorrectly return 1,
meaning that the reclaimer is about to write to the PCMD page, which is not
true.

> >  struct sgx_encl_page {
> >  	unsigned long desc;
> >  	unsigned long vm_max_prot_bits:8;
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index 5d390df21440..de59219ae794 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> >  		 * Do not keep encl->lock because of dependency on
> >  		 * mmap_lock acquired in sgx_zap_enclave_ptes().
> >  		 */
> > +		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>
> This also needs a comment, no matter what.
Ok, I will write something along the lines that we want to prevent a data
race with the EAUG flow, and since we have to release encl->lock (which
would otherwise prevent the data race), we instead set a bit that marks
this enclave page as being in the process of removal, so that the EAUG
flow backs off and retries later.
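To illustrate what the comment will describe, here is a heavily condensed
sketch of the removal flow with the new bit in place (the page-type and
EACCEPT checks, error handling, and xarray bookkeeping are elided; see
sgx_encl_remove_pages() for the real code):

	mutex_lock(&encl->lock);

	entry = sgx_encl_load_page(encl, addr);
	/* ... page-type and EACCEPT checks elided ... */

	/*
	 * Do not keep encl->lock because of dependency on mmap_lock
	 * acquired in sgx_zap_enclave_ptes().  Mark the page as being
	 * removed so that a concurrent page fault handler backs off
	 * instead of EAUG'ing the page while it is only half removed.
	 */
	entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
	mutex_unlock(&encl->lock);

	sgx_zap_enclave_ptes(encl, addr);	/* takes mmap_read_lock() */

	mutex_lock(&encl->lock);
	sgx_encl_free_epc_page(entry->epc_page);	/* EREMOVE happens here */
	/* ... bookkeeping cleanup elided ... */
	mutex_unlock(&encl->lock);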
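The other half of the fix is the back-off on the #PF side. Roughly --
paraphrasing my patch rather than quoting it verbatim --
__sgx_encl_load_page() starts treating a being-removed page the same way
it already treats a being-reclaimed one:

	/* __sgx_encl_load_page(), called with encl->lock held: */
	if (entry->epc_page) {
		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
				   SGX_ENCL_PAGE_BEING_REMOVED))
			return ERR_PTR(-EBUSY);

		return entry;
	}

sgx_vma_fault() already translates -EBUSY into VM_FAULT_NOPAGE, so the
faulting thread simply retries the access and observes the final state of
the page once the removal has finished.

--
Dmitrii Kuvaiskii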