[PATCH v4 0/3] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-07-05 Thread Dmitrii Kuvaiskii
SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

EDMM-based lazy allocation and MADV_DONTNEED semantics provide
significant performance improvement for some workloads that run on
Gramine. For example, a Java workload with a 16GB enclave size has
approx. 57x improvement in total runtime. Thus, we consider it important
to permit these optimizations in Gramine. However, we observed hangs of
applications (Node.js, PyTorch, R, iperf, Blender, Nginx) when run on
Gramine with EDMM, lazy allocation and MADV_DONTNEED features enabled.

We wrote a trivial stress test to reproduce the hangs observed in
real-world applications. The test stresses #PF-based page allocation and
SGX_IOC_ENCLAVE_REMOVE_PAGES flows in the SGX driver:

/* repeatedly touch different enclave pages at random and mix with
 * madvise(MADV_DONTNEED) to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
    size_t num_pages = 0xA000 / page_size;
    for (int i = 0; i < 5000; i++) {
        size_t page = get_random_ulong() % num_pages;
        char data = READ_ONCE(((char*)arg)[page * page_size]);

        page = get_random_ulong() % num_pages;
        madvise(arg + page * page_size, page_size, MADV_DONTNEED);
    }
    return NULL;
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
    pthread_create(&threads[i], NULL, thread_func, addr);

This test uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRM, without madvise(). This stresses
  the first data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRM, with madvise(). This stresses the
  second data race. A Gramine SGX test suite running in the background
  for additional stressing. Result: 1,000 runs without hangs (result
  with the first bug fix but without the second bug fix: hangs approx.
  once in 50 runs).
- On Icelake server with 4GB of PRM, with madvise(). This additionally
  stresses the enclave page swapping flows. Two Gramine SGX test suites
  running in the background for additional stressing of swapping (I
  observe 100% CPU utilization from ksgxd which confirms that swapping
  happens). Result: 1,000 runs without hangs.

v3 -> v4:
- Added a preparatory patch to split the SGX_ENCL_PAGE_BEING_RECLAIMED
  flag into two: SGX_ENCL_PAGE_BUSY and SGX_ENCL_PAGE_PCMD_BUSY
  (split suggested by Dave Hansen [2])
- No changes in the second patch (that fixes the first bug)
- Trivial changes in the third patch (that fixes the second bug), now
  that we have a preparatory patch; plus expanded a comment (as
  suggested by Dave Hansen)

v2 -> v3:
- No changes in code itself
- Improved commit message of the first patch (text suggested by Dave
  Hansen); kept the CPU1 vs CPU2 diagram (as all reviewers liked it)
- No changes in the commit message of the second patch

v1 -> v2:
- No changes in code itself
- Expanded cover letter
- Added CPU1 vs CPU2 race scenarios in commit messages

[1] https://github.com/gramineproject/gramine/pull/1513
[2] https://lore.kernel.org/all/1d405428-3847-4862-b146-dd57711c8...@intel.com/

v1: https://lore.kernel.org/all/20240429104330.3636113-3-dmitrii.kuvais...@intel.com/
v2: https://lore.kernel.org/all/20240515131240.1304824-1-dmitrii.kuvais...@intel.com/
v3: https://lore.kernel.org/all/20240517110631.3441817-1-dmitrii.kuvais...@intel.com/

Dmitrii Kuvaiskii (3):
  x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags
  x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  x86/sgx: Resolve EREMOVE page vs EAUG page data race

 arch/x86/kernel/cpu/sgx/encl.c  | 23 ++++++++++++-----------
 arch/x86/kernel/cpu/sgx/encl.h  | 10 ++++++++--
 arch/x86/kernel/cpu/sgx/ioctl.c |  7 +++++++
 arch/x86/kernel/cpu/sgx/main.c  |  4 ++--
 4 files changed, 29 insertions(+), 15 deletions(-)

-- 
2.43.0




[PATCH v4 3/3] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-07-05 Thread Dmitrii Kuvaiskii
Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and MADV_DONTNEED semantics). Consider some enclave page added to the
enclave. User space decides to temporarily remove this page (e.g.,
emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
space performs a memory access on the same page on CPU2, which results
in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
follows:

/*
 * CPU1: User space performs
 * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
 * on enclave page X
 */
sgx_encl_remove_pages() {

  mutex_lock(&encl->lock);

  entry = sgx_encl_load_page(encl);
  /*
   * verify that page is
   * trimmed and accepted
   */

  mutex_unlock(&encl->lock);

  /*
   * remove PTE entry; cannot
   * be performed under lock
   */
  sgx_zap_enclave_ptes(encl);
 /*
  * Fault on CPU2 on same page X
  */
 sgx_vma_fault() {
   /*
    * PTE entry was removed, but the
    * page is still in enclave's xarray
    */
   xa_load(&encl->page_array) != NULL ->
   /*
    * SGX driver thinks that this page
    * was swapped out and loads it
    */
   mutex_lock(&encl->lock);
   /*
    * this is effectively a no-op
    */
   entry = sgx_encl_load_page_in_vma();
   /*
    * add PTE entry
    *
    * *BUG*: a PTE is installed for a
    * page in process of being removed
    */
   vmf_insert_pfn(...);

   mutex_unlock(&encl->lock);
   return VM_FAULT_NOPAGE;
 }
  /*
   * continue with page removal
   */
  mutex_lock(&encl->lock);

  sgx_encl_free_epc_page(epc_page) {
/*
 * remove page via EREMOVE
 */
/*
 * free EPC page
 */
sgx_free_epc_page(epc_page);
  }

  xa_erase(&encl->page_array);

  mutex_unlock(&encl->lock);
}

Here, CPU1 removed the page. However, CPU2 installed a PTE entry for the
same page. This enclave page becomes perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed, and any subsequent
access to this page raises a fault: with the kernel believing there to
be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
path do_user_addr_fault() -> access_error() causes the SGX driver's
sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
The userspace SIGSEGV handler cannot perform EACCEPT because the page
was not EAUGed. Thus, user space is stuck with the inaccessible
page.

Fix this race by forcing the fault handler on CPU2 to back off if the
page is currently being removed (on CPU1). This is achieved by setting
the SGX_ENCL_PAGE_BUSY flag right before the first mutex_unlock() in
sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
page is busy, and if so, backs off and waits until the page is
completely removed. After that, any memory access to this page results
in a normal "allocate and EAUG a page on #PF" flow.

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: sta...@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/ioctl.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 5d390df21440..02441883401d 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -1141,7 +1141,14 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 	/*
 	 * Do not keep encl->lock because of dependency on
 	 * mmap_lock acquired in sgx_zap_enclave_ptes().
+	 *
+	 * Releasing encl->lock leads to a data race: while CPU1
+	 * performs sgx_zap_enclave_ptes() and removes the PTE entry
+	 * for the enclave page, CPU2 may attempt to load this page
+	 * (because the page is still in enclave's xarray). To prevent
+	 * CPU2 from loading the page, mark the page as busy.
 	 */
+	entry->desc |= SGX_ENCL_PAGE_BUSY;
 	mutex_unlock(&encl->lock);
 
sgx_zap_enclave_ptes(encl, addr);
-- 
2.43.0




[PATCH v4 1/3] x86/sgx: Split SGX_ENCL_PAGE_BEING_RECLAIMED into two flags

2024-07-05 Thread Dmitrii Kuvaiskii
The SGX_ENCL_PAGE_BEING_RECLAIMED flag is set when the enclave page is being
reclaimed (moved to the backing store). This flag however has two
logical meanings:

1. Don't attempt to load the enclave page (the page is busy).
2. Don't attempt to remove the PCMD page corresponding to this enclave
   page (the PCMD page is busy).

To reflect these two meanings, split SGX_ENCL_PAGE_BEING_RECLAIMED into
two flags: SGX_ENCL_PAGE_BUSY and SGX_ENCL_PAGE_PCMD_BUSY. Currently,
both flags are set only when the enclave page is being reclaimed. A
future commit will introduce a new case when the enclave page is being
removed; this new case will set only the SGX_ENCL_PAGE_BUSY flag.

Cc: sta...@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/encl.c | 16 +++++++---------
 arch/x86/kernel/cpu/sgx/encl.h | 10 ++++++++--
 arch/x86/kernel/cpu/sgx/main.c |  4 ++--
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..c0a3c00284c8 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -46,10 +46,10 @@ static int sgx_encl_lookup_backing(struct sgx_encl *encl, unsigned long page_ind
  * a check if an enclave page sharing the PCMD page is in the process of being
  * reclaimed.
  *
- * The reclaimer sets the SGX_ENCL_PAGE_BEING_RECLAIMED flag when it
- * intends to reclaim that enclave page - it means that the PCMD page
- * associated with that enclave page is about to get some data and thus
- * even if the PCMD page is empty, it should not be truncated.
+ * The reclaimer sets the SGX_ENCL_PAGE_PCMD_BUSY flag when it intends to
+ * reclaim that enclave page - it means that the PCMD page associated with that
+ * enclave page is about to get some data and thus even if the PCMD page is
+ * empty, it should not be truncated.
  *
  * Context: Enclave mutex (_encl->lock) must be held.
  * Return: 1 if the reclaimer is about to write to the PCMD page
@@ -77,8 +77,7 @@ static int reclaimer_writing_to_pcmd(struct sgx_encl *encl,
 * Stop when reaching the SECS page - it does not
 * have a page_array entry and its reclaim is
 * started and completed with enclave mutex held so
-* it does not use the SGX_ENCL_PAGE_BEING_RECLAIMED
-* flag.
+* it does not use the SGX_ENCL_PAGE_PCMD_BUSY flag.
 */
if (addr == encl->base + encl->size)
break;
@@ -91,8 +90,7 @@ static int reclaimer_writing_to_pcmd(struct sgx_encl *encl,
 * VA page slot ID uses same bit as the flag so it is important
 * to ensure that the page is not already in backing store.
 */
-   if (entry->epc_page &&
-   (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)) {
+   if (entry->epc_page && (entry->desc & SGX_ENCL_PAGE_PCMD_BUSY)) {
reclaimed = 1;
break;
}
@@ -257,7 +255,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
/* Entry successfully located. */
if (entry->epc_page) {
-   if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+   if (entry->desc & SGX_ENCL_PAGE_BUSY)
return ERR_PTR(-EBUSY);
 
return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..11b09899cd92 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -22,8 +22,14 @@
 /* 'desc' bits holding the offset in the VA (version array) page. */
 #define SGX_ENCL_PAGE_VA_OFFSET_MASK   GENMASK_ULL(11, 3)
 
-/* 'desc' bit marking that the page is being reclaimed. */
-#define SGX_ENCL_PAGE_BEING_RECLAIMED  BIT(3)
+/* 'desc' bit indicating that the page is busy (e.g. being reclaimed). */
+#define SGX_ENCL_PAGE_BUSY BIT(2)
+
+/*
+ * 'desc' bit indicating that PCMD page associated with the enclave page is
+ * busy (e.g. because the enclave page is being reclaimed).
+ */
+#define SGX_ENCL_PAGE_PCMD_BUSY	BIT(3)
 
 struct sgx_encl_page {
unsigned long desc;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..e94b09c43673 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -204,7 +204,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
void *va_slot;
int ret;
 
-   encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
+   encl_page->desc &= ~(SGX_ENCL_PAGE_BUSY | SGX_ENCL_PAGE_PCMD_BUSY);
 
 	va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
   list);
@@ -340,7 +340,7 @@ static void sgx_reclaim_pages(void)
goto skip;
 

[PATCH v4 2/3] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-07-05 Thread Dmitrii Kuvaiskii
Imagine an mmap()'d file. Two threads touch the same address at the same
time and fault. Both allocate a physical page and race to install a PTE
for that page. Only one will win the race. The loser frees its page, but
still continues handling the fault as a success and returns
VM_FAULT_NOPAGE from the fault handler.

The same race can happen with SGX. But there's a bug: the loser in SGX
steers into a failure path. The loser EREMOVE's the winner's EPC page,
then returns SIGBUS, likely killing the app.

Fix the SGX loser's behavior. Change the return code to VM_FAULT_NOPAGE
to avoid SIGBUS and call sgx_free_epc_page() which avoids EREMOVE'ing
the winner's page and only frees the page that the loser allocated.

The race can be illustrated as follows:

/*                                   /*
 * Fault on CPU1                      * Fault on CPU2
 * on enclave page X                  * on enclave page X
 */                                   */
sgx_vma_fault() {                    sgx_vma_fault() {

  xa_load(&encl->page_array)           xa_load(&encl->page_array)
  == NULL -->                          == NULL -->

  sgx_encl_eaug_page() {               sgx_encl_eaug_page() {

    ...                                  ...

    /*                                   /*
     * alloc encl_page                    * alloc encl_page
     */                                   */
   mutex_lock(&encl->lock);
   /*
    * alloc EPC page
    */
   epc_page = sgx_alloc_epc_page(...);
   /*
    * add page to enclave's xarray
    */
   xa_insert(&encl->page_array, ...);
   /*
    * add page to enclave via EAUG
    * (page is in pending state)
    */
   /*
    * add PTE entry
    */
   vmf_insert_pfn(...);

   mutex_unlock(&encl->lock);
   return VM_FAULT_NOPAGE;
 }
   }
   /*
    * All good up to here: enclave page
    * successfully added to enclave,
    * ready for EACCEPT from user space
    */
mutex_lock(&encl->lock);
/*
 * alloc EPC page
 */
epc_page = sgx_alloc_epc_page(...);
/*
 * add page to enclave's xarray,
 * this fails with -EBUSY as this
 * page was already added by CPU2
 */
xa_insert(&encl->page_array, ...);

  err_out_shrink:
sgx_encl_free_epc_page(epc_page) {
  /*
   * remove page via EREMOVE
   *
   * *BUG*: page added by CPU2 is
   * yanked from enclave while it
   * remains accessible from OS
   * perspective (PTE installed)
   */
  /*
   * free EPC page
   */
  sgx_free_epc_page(epc_page);
}

mutex_unlock(&encl->lock);
/*
 * *BUG*: SIGBUS is returned
 * for a valid enclave page
 */
return VM_FAULT_SIGBUS;
  }
}

Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Cc: sta...@vger.kernel.org
Reported-by: Marcelina Kościelnicka 
Suggested-by: Reinette Chatre 
Signed-off-by: Dmitrii Kuvaiskii 
Reviewed-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Reinette Chatre 
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index c0a3c00284c8..9f7f9e57cdeb 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -380,8 +380,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 * If ret == -EBUSY then page was created in another flow while
 * running without encl->lock
 */
-   if (ret)
+   if (ret) {
+   if (ret == -EBUSY)
+   vmret = VM_FAULT_NOPAGE;
goto err_out_shrink;
+   }
 
pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -417,7 +420,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 err_out_shrink:
sgx_encl_shrink(encl, va_page);
 err_out_epc:
-   sgx_encl_free_epc_page(epc_page);
+   sgx_free_epc_page(epc_page);
 err_out_unlock:
 	mutex_unlock(&encl->lock);
kfree(encl_page);
-- 
2.43.0




Re: [PATCH v3 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-06-07 Thread Dmitrii Kuvaiskii
On Tue, May 28, 2024 at 09:01:10AM -0700, Dave Hansen wrote:
> On 5/17/24 04:06, Dmitrii Kuvaiskii wrote:
> > We wrote a trivial stress test to reproduce the hangs observed in
> > real-world applications. The test stresses #PF-based page allocation and
> > SGX_IOC_ENCLAVE_REMOVE_PAGES flows in the SGX driver:
>
> This seems like something we'd want in the kernel SGX selftests.

I looked at tools/testing/selftests/sgx/ and I observe several
complications:

1. The stress test requires creation of several threads (at least two,
   ideally more). However, current SGX selftests are single-threaded.
   Adding the scaffolding to add multi-threading support to SGX selftests
   seems like a non-trivial task.

2. Catching the data race would require a for loop with some threshold.
   - First, there are no such looping tests in current SGX selftests. Is
     it normal to add such a test?
   - Second, what would be the threshold to loop for? I.e., after how many
     iterations should we consider the data race not manifesting, and
     report success?
   - Third, the data race may hang the test. Is this something that is
     allowed in selftests? (I mean the test can have only two outcomes --
     either it hangs, meaning the data race was not fixed, or it runs to
     completion. There is no result that we could EXPECT or ASSERT on.)

Do we still want to add such a selftest? Or could we maybe piggy-back on
Gramine CI (that will include the test I mentioned in the cover letter)?

--
Dmitrii Kuvaiskii



Re: [PATCH v3 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-06-07 Thread Dmitrii Kuvaiskii
On Tue, May 28, 2024 at 09:23:13AM -0700, Dave Hansen wrote:
> On 5/17/24 04:06, Dmitrii Kuvaiskii wrote:
> ...
>
> First, why is SGX so special here?  How is the SGX problem different
> than what the core mm code does?

Here is my understanding of why SGX is so special and why I have to
introduce a new bit, SGX_ENCL_PAGE_BEING_REMOVED.

In SGX's removal of the enclave page, two operations must happen
atomically: the PTE entry must be removed and the page must be EREMOVE'd.

Generally, to guarantee atomicity, encl->lock is acquired. Ideally, if
this encl->lock could be acquired at the beginning of
sgx_encl_remove_pages() and be released at the very end of this function,
there would be no EREMOVE page vs EAUG page data race, and my bug fix
(with SGX_ENCL_PAGE_BEING_REMOVED bit) wouldn't be needed.

However, the current implementation of sgx_encl_remove_pages() has to
release encl->lock before removing the PTE entry. Releasing the lock is
required because the function that removes the PTE entry --
sgx_zap_enclave_ptes() -- acquires another, enclave-MM lock:
mmap_read_lock(encl_mm->mm).

The two locks must be taken in this order:
1. mmap_read_lock(encl_mm->mm)
2. mutex_lock(&encl->lock)

This lock order is apparent from e.g. sgx_encl_add_page(). This order also
seems to make intuitive sense: VMA callbacks are called with the MM lock
being held, so the MM lock should be the first in lock order.

So, if sgx_encl_remove_pages() did _not_ release encl->lock before
calling sgx_zap_enclave_ptes(), this would violate the lock order and
might lead to deadlocks. At the same time, releasing encl->lock in the
middle of the two-operation flow leads to a data race that I found in
this patch series.

Quick summary:
- Removing the enclave page requires two operations: removing the PTE and
  performing EREMOVE.
- The complete flow of removing the enclave page cannot be protected by a
  single encl->lock, because it would violate the lock order and would
  lead to deadlocks.
- The current upstream implementation thus breaks the flow into two
  critical sections, releasing encl->lock before sgx_zap_enclave_ptes()
  and re-acquiring this lock afterwards. This leads to a data race.
- My patch restores "atomicity" of the flow by introducing a new flag
  SGX_ENCL_PAGE_BEING_REMOVED.

> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -25,6 +25,9 @@
> >  /* 'desc' bit marking that the page is being reclaimed. */
> >  #define SGX_ENCL_PAGE_BEING_RECLAIMED  BIT(3)
> >
> > +/* 'desc' bit marking that the page is being removed. */
> > +#define SGX_ENCL_PAGE_BEING_REMOVEDBIT(2)
>
> Second, convince me that this _needs_ a new bit.  Why can't we just have
> a bit that effectively means "return EBUSY if you see this bit when
> handling a fault".

As Haitao mentioned in his reply, the bit SGX_ENCL_PAGE_BEING_RECLAIMED is
also used in reclaimer_writing_to_pcmd(). If we re-used this bit to mark a
page being removed, reclaimer_writing_to_pcmd() would incorrectly return 1,
meaning that the reclaimer is about to write to the PCMD page, which is not
true.

> >  struct sgx_encl_page {
> > unsigned long desc;
> > unsigned long vm_max_prot_bits:8;
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index 5d390df21440..de59219ae794 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> >  * Do not keep encl->lock because of dependency on
> >  * mmap_lock acquired in sgx_zap_enclave_ptes().
> >  */
> > +   entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>
> This also needs a comment, no matter what.

Ok, I will write something along the lines that we want to prevent a data
race with an EAUG flow, and since we have to release encl->lock (which
would otherwise prevent the data race) we instead set a bit to mark this
enclave page as being in the process of removal, so that the EAUG flow
backs off and retries later.

--
Dmitrii Kuvaiskii



[PATCH v3 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-05-17 Thread Dmitrii Kuvaiskii
Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and MADV_DONTNEED semantics). Consider some enclave page added to the
enclave. User space decides to temporarily remove this page (e.g.,
emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
space performs a memory access on the same page on CPU2, which results
in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
follows:

/*
 * CPU1: User space performs
 * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
 * on enclave page X
 */
sgx_encl_remove_pages() {

  mutex_lock(&encl->lock);

  entry = sgx_encl_load_page(encl);
  /*
   * verify that page is
   * trimmed and accepted
   */

  mutex_unlock(&encl->lock);

  /*
   * remove PTE entry; cannot
   * be performed under lock
   */
  sgx_zap_enclave_ptes(encl);
 /*
  * Fault on CPU2 on same page X
  */
 sgx_vma_fault() {
   /*
    * PTE entry was removed, but the
    * page is still in enclave's xarray
    */
   xa_load(&encl->page_array) != NULL ->
   /*
    * SGX driver thinks that this page
    * was swapped out and loads it
    */
   mutex_lock(&encl->lock);
   /*
    * this is effectively a no-op
    */
   entry = sgx_encl_load_page_in_vma();
   /*
    * add PTE entry
    *
    * *BUG*: a PTE is installed for a
    * page in process of being removed
    */
   vmf_insert_pfn(...);

   mutex_unlock(&encl->lock);
   return VM_FAULT_NOPAGE;
 }
  /*
   * continue with page removal
   */
  mutex_lock(&encl->lock);

  sgx_encl_free_epc_page(epc_page) {
/*
 * remove page via EREMOVE
 */
/*
 * free EPC page
 */
sgx_free_epc_page(epc_page);
  }

  xa_erase(&encl->page_array);

  mutex_unlock(&encl->lock);
}

Here, CPU1 removed the page. However, CPU2 installed a PTE entry for the
same page. This enclave page becomes perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed, and any subsequent
access to this page raises a fault: with the kernel believing there to
be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
path do_user_addr_fault() -> access_error() causes the SGX driver's
sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
The userspace SIGSEGV handler cannot perform EACCEPT because the page
was not EAUGed. Thus, user space is stuck with the inaccessible
page.

Fix this race by forcing the fault handler on CPU2 to back off if the
page is currently being removed (on CPU1). This is achieved by
introducing a new flag, SGX_ENCL_PAGE_BEING_REMOVED, which is unset by
default and set only right before the first mutex_unlock() in
sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
page is being removed, and if so, backs off and waits until the page is
completely removed. After that, any memory access to this page results
in a normal "allocate and EAUG a page on #PF" flow.

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: sta...@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii 
Reviewed-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Acked-by: Reinette Chatre 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
 arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 41f14b1a3025..7ccd8b2fce5f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
/* Entry successfully located. */
if (entry->epc_page) {
-   if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+   if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
+  SGX_ENCL_PAGE_BEING_REMOVED))
return ERR_PTR(-EBUSY);
 
return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.

[PATCH v3 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-05-17 Thread Dmitrii Kuvaiskii
Imagine an mmap()'d file. Two threads touch the same address at the same
time and fault. Both allocate a physical page and race to install a PTE
for that page. Only one will win the race. The loser frees its page, but
still continues handling the fault as a success and returns
VM_FAULT_NOPAGE from the fault handler.

The same race can happen with SGX. But there's a bug: the loser in SGX
steers into a failure path. The loser EREMOVE's the winner's EPC page,
then returns SIGBUS, likely killing the app.

Fix the SGX loser's behavior. Change the return code to VM_FAULT_NOPAGE
to avoid SIGBUS and call sgx_free_epc_page() which avoids EREMOVE'ing
the winner's page and only frees the page that the loser allocated.

The race can be illustrated as follows:

/*                                   /*
 * Fault on CPU1                      * Fault on CPU2
 * on enclave page X                  * on enclave page X
 */                                   */
sgx_vma_fault() {                    sgx_vma_fault() {

  xa_load(&encl->page_array)           xa_load(&encl->page_array)
  == NULL -->                          == NULL -->

  sgx_encl_eaug_page() {               sgx_encl_eaug_page() {

    ...                                  ...

    /*                                   /*
     * alloc encl_page                    * alloc encl_page
     */                                   */
   mutex_lock(&encl->lock);
   /*
    * alloc EPC page
    */
   epc_page = sgx_alloc_epc_page(...);
   /*
    * add page to enclave's xarray
    */
   xa_insert(&encl->page_array, ...);
   /*
    * add page to enclave via EAUG
    * (page is in pending state)
    */
   /*
    * add PTE entry
    */
   vmf_insert_pfn(...);

   mutex_unlock(&encl->lock);
   return VM_FAULT_NOPAGE;
 }
   }
   /*
    * All good up to here: enclave page
    * successfully added to enclave,
    * ready for EACCEPT from user space
    */
mutex_lock(&encl->lock);
/*
 * alloc EPC page
 */
epc_page = sgx_alloc_epc_page(...);
/*
 * add page to enclave's xarray,
 * this fails with -EBUSY as this
 * page was already added by CPU2
 */
xa_insert(&encl->page_array, ...);

  err_out_shrink:
sgx_encl_free_epc_page(epc_page) {
  /*
   * remove page via EREMOVE
   *
   * *BUG*: page added by CPU2 is
   * yanked from enclave while it
   * remains accessible from OS
   * perspective (PTE installed)
   */
  /*
   * free EPC page
   */
  sgx_free_epc_page(epc_page);
}

mutex_unlock(&encl->lock);
/*
 * *BUG*: SIGBUS is returned
 * for a valid enclave page
 */
return VM_FAULT_SIGBUS;
  }
}

Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Cc: sta...@vger.kernel.org
Reported-by: Marcelina Kościelnicka 
Suggested-by: Reinette Chatre 
Signed-off-by: Dmitrii Kuvaiskii 
Reviewed-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Reinette Chatre 
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..41f14b1a3025 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 * If ret == -EBUSY then page was created in another flow while
 * running without encl->lock
 */
-   if (ret)
+   if (ret) {
+   if (ret == -EBUSY)
+   vmret = VM_FAULT_NOPAGE;
goto err_out_shrink;
+   }
 
pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 err_out_shrink:
sgx_encl_shrink(encl, va_page);
 err_out_epc:
-   sgx_encl_free_epc_page(epc_page);
+   sgx_free_epc_page(epc_page);
 err_out_unlock:
 	mutex_unlock(&encl->lock);
kfree(encl_page);
-- 
2.34.1




[PATCH v3 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-05-17 Thread Dmitrii Kuvaiskii
SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

EDMM-based lazy allocation and MADV_DONTNEED semantics provide
significant performance improvement for some workloads that run on
Gramine. For example, a Java workload with a 16GB enclave size has
approx. 57x improvement in total runtime. Thus, we consider it important
to permit these optimizations in Gramine. However, we observed hangs of
applications (Node.js, PyTorch, R, iperf, Blender, Nginx) when run on
Gramine with EDMM, lazy allocation and MADV_DONTNEED features enabled.

We wrote a trivial stress test to reproduce the hangs observed in
real-world applications. The test stresses #PF-based page allocation and
SGX_IOC_ENCLAVE_REMOVE_PAGES flows in the SGX driver:

/* repeatedly touch different enclave pages at random and mix with
 * madvise(MADV_DONTNEED) to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
    size_t num_pages = 0xA000 / page_size;
    for (int i = 0; i < 5000; i++) {
        size_t page = get_random_ulong() % num_pages;
        char data = READ_ONCE(((char*)arg)[page * page_size]);

        page = get_random_ulong() % num_pages;
        madvise(arg + page * page_size, page_size, MADV_DONTNEED);
    }
    return NULL;
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
    pthread_create(&threads[i], NULL, thread_func, addr);

This test uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRM, without madvise(). This stresses
  the first data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRM, with madvise(). This stresses the
  second data race. A Gramine SGX test suite running in the background
  for additional stressing. Result: 1,000 runs without hangs (result
  with the first bug fix but without the second bug fix: hangs approx.
  once in 50 runs).
- On Icelake server with 4GB of PRM, with madvise(). This additionally
  stresses the enclave page swapping flows. Two Gramine SGX test suites
  running in the background for additional stressing of swapping (I
  observe 100% CPU utilization from ksgxd which confirms that swapping
  happens). Result: 1,000 runs without hangs.

[1] https://github.com/gramineproject/gramine/pull/1513

v2 -> v3:
- No changes in code itself
- Improved commit message of the first patch (text suggested by Dave
  Hansen); kept the CPU1 vs CPU2 diagram (as all reviewers liked it)
- No changes in the commit message of the second patch

v1 -> v2:
- No changes in code itself
- Expanded cover letter
- Added CPU1 vs CPU2 race scenarios in commit messages

v1: https://lore.kernel.org/all/20240429104330.3636113-3-dmitrii.kuvais...@intel.com/
v2: https://lore.kernel.org/all/20240515131240.1304824-1-dmitrii.kuvais...@intel.com/

Dmitrii Kuvaiskii (2):
  x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  x86/sgx: Resolve EREMOVE page vs EAUG page data race

 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++---
 arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.34.1




[PATCH v2 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-05-15 Thread Dmitrii Kuvaiskii
Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
fault handler so that no signal is sent to userspace, and (2) by
replacing sgx_encl_free_epc_page() with sgx_free_epc_page().

Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Cc: sta...@vger.kernel.org
Reported-by: Marcelina Kościelnicka 
Suggested-by: Reinette Chatre 
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..41f14b1a3025 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 * If ret == -EBUSY then page was created in another flow while
 * running without encl->lock
 */
-   if (ret)
+   if (ret) {
+   if (ret == -EBUSY)
+   vmret = VM_FAULT_NOPAGE;
goto err_out_shrink;
+   }
 
pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 err_out_shrink:
sgx_encl_shrink(encl, va_page);
 err_out_epc:
-   sgx_encl_free_epc_page(epc_page);
+   sgx_free_epc_page(epc_page);
 err_out_unlock:
	mutex_unlock(&encl->lock);
kfree(encl_page);
-- 
2.34.1




[PATCH v2 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-05-15 Thread Dmitrii Kuvaiskii
SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

EDMM-based lazy allocation and MADV_DONTNEED semantics provide
significant performance improvement for some workloads that run on
Gramine. For example, a Java workload with a 16GB enclave size has
approx. 57x improvement in total runtime. Thus, we consider it important
to permit these optimizations in Gramine. However, we observed hangs of
applications (Node.js, PyTorch, R, iperf, Blender, Nginx) when run on
Gramine with EDMM, lazy allocation and MADV_DONTNEED features enabled.

We wrote a trivial stress test to reproduce the hangs observed in
real-world applications. The test stresses #PF-based page allocation and
SGX_IOC_ENCLAVE_REMOVE_PAGES flows in the SGX driver:

/* repeatedly touch different enclave pages at random and mix with
 * madvise(MADV_DONTNEED) to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
size_t num_pages = 0xA000 / page_size;
for (int i = 0; i < 5000; i++) {
size_t page = get_random_ulong() % num_pages;
char data = READ_ONCE(((char*)arg)[page * page_size]);

page = get_random_ulong() % num_pages;
madvise(arg + page * page_size, page_size, MADV_DONTNEED);
}
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
pthread_create(&threads[i], NULL, thread_func, addr);

This test uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRM, without madvise(). This stresses
  the first data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRM, with madvise(). This stresses the
  second data race. A Gramine SGX test suite running in the background
  for additional stressing. Result: 1,000 runs without hangs (result
  with the first bug fix but without the second bug fix: hangs approx.
  once in 50 runs).
- On Icelake server with 4GB of PRM, with madvise(). This additionally
  stresses the enclave page swapping flows. Two Gramine SGX test suites
  running in the background for additional stressing of swapping (I
  observe 100% CPU utilization from ksgxd which confirms that swapping
  happens). Result: 1,000 runs without hangs.

[1] https://github.com/gramineproject/gramine/pull/1513

v1 -> v2:
- No changes in code itself
- Expanded cover letter
- Added CPU1 vs CPU2 race scenarios in commit messages

v1: https://lore.kernel.org/all/20240429104330.3636113-3-dmitrii.kuvais...@intel.com/

Dmitrii Kuvaiskii (2):
  x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  x86/sgx: Resolve EREMOVE page vs EAUG page data race

 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++---
 arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.34.1




[PATCH v2 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-05-15 Thread Dmitrii Kuvaiskii
Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and MADV_DONTNEED semantics). Consider some enclave page added to the
enclave. User space decides to temporarily remove this page (e.g.,
emulating the MADV_DONTNEED semantics) on CPU1. At the same time, user
space performs a memory access on the same page on CPU2, which results
in a #PF and ultimately in sgx_vma_fault(). Scenario proceeds as
follows:

/*
 * CPU1: User space performs
 * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
 * on enclave page X
 */
sgx_encl_remove_pages() {

  mutex_lock(&encl->lock);

  entry = sgx_encl_load_page(encl);
  /*
   * verify that page is
   * trimmed and accepted
   */

  mutex_unlock(&encl->lock);

  /*
   * remove PTE entry; cannot
   * be performed under lock
   */
  sgx_zap_enclave_ptes(encl, addr);
 /*
  * Fault on CPU2 on same page X
  */
 sgx_vma_fault() {
   /*
* PTE entry was removed, but the
* page is still in enclave's xarray
*/
   xa_load(&encl->page_array) != NULL ->
   /*
* SGX driver thinks that this page
* was swapped out and loads it
*/
   mutex_lock(&encl->lock);
   /*
* this is effectively a no-op
*/
   entry = sgx_encl_load_page_in_vma();
   /*
* add PTE entry
*
* *BUG*: a PTE is installed for a
* page in process of being removed
*/
   vmf_insert_pfn(...);

   mutex_unlock(&encl->lock);
   return VM_FAULT_NOPAGE;
 }
  /*
   * continue with page removal
   */
  mutex_lock(&encl->lock);

  sgx_encl_free_epc_page(epc_page) {
/*
 * remove page via EREMOVE
 */
/*
 * free EPC page
 */
sgx_free_epc_page(epc_page);
  }

  xa_erase(&encl->page_array);

  mutex_unlock(&encl->lock);
}

Here, CPU1 removed the page. However CPU2 installed the PTE entry on the
same page. This enclave page becomes perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed, and any subsequent
access to this page raises a fault: with the kernel believing there to
be a valid VMA, the unlikely error code X86_PF_SGX encountered by code
path do_user_addr_fault() -> access_error() causes the SGX driver's
sgx_vma_fault() to be skipped and user space receives a SIGSEGV instead.
The userspace SIGSEGV handler cannot perform EACCEPT because the page
was not EAUGed. Thus, the user space is stuck with the inaccessible
page.

Fix this race by forcing the fault handler on CPU2 to back off if the
page is currently being removed (on CPU1). This is achieved by
introducing a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by
default and set only right-before the first mutex_unlock() in
sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
page is being removed, and if yes then CPU2 backs off and waits until
the page is completely removed. After that, any memory access to this
page results in a normal "allocate and EAUG a page on #PF" flow.

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: sta...@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
 arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 41f14b1a3025..7ccd8b2fce5f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
/* Entry successfully located. */
if (entry->epc_page) {
-   if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+   if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
+  SGX_ENCL_PAGE_BEING_REMOVED))
return ERR_PTR(-EBUSY);
 
return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fff5f2293ae7 100644
--- a/

Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-04-30 Thread Dmitrii Kuvaiskii
On Mon, Apr 29, 2024 at 04:11:03PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to add and remove the same enclave page
> > simultaneously (e.g., if the SGX runtime supports both lazy allocation
> > and `MADV_DONTNEED` semantics). Consider this race:
> >
> > 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
> >after removing the page table entry and right before re-acquiring the
> >    enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
> > 2. T2 tries to access the page, and #PF[not_present] is raised. The
> >condition to EAUG in sgx_vma_fault() is not satisfied because the
> >page is still present in encl->page_array, thus the SGX driver
> >assumes that the fault happened because the page was swapped out. The
> >driver continues on a code path that installs a page table entry
> >*without* performing EAUG.
> > 3. The enclave page metadata is in inconsistent state: the PTE is
> >installed but there was no EAUG. Thus, T2 in userspace infinitely
> >receives SIGSEGV on this page (and EACCEPT always fails).
> >
> > Fix this by making sure that T1 (the page-removing thread) always wins
> > this data race. In particular, the page-being-removed is marked as such,
> > and T2 retries until the page is fully removed.
> >
> > Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Dmitrii Kuvaiskii 
> > ---
> >  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
> >  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
> >  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
> >  3 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 41f14b1a3025..7ccd8b2fce5f 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
> >  
> > /* Entry successfully located. */
> > if (entry->epc_page) {
> > -   if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> > +   if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> > +  SGX_ENCL_PAGE_BEING_REMOVED))
> > return ERR_PTR(-EBUSY);
> >  
> > return entry;
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> > index f94ff14c9486..fff5f2293ae7 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -25,6 +25,9 @@
> >  /* 'desc' bit marking that the page is being reclaimed. */
> >  #define SGX_ENCL_PAGE_BEING_RECLAIMED  BIT(3)
> >  
> > +/* 'desc' bit marking that the page is being removed. */
> > +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
> > +
> >  struct sgx_encl_page {
> > unsigned long desc;
> > unsigned long vm_max_prot_bits:8;
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index b65ab214bdf5..c542d4dd3e64 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> >  * Do not keep encl->lock because of dependency on
> >  * mmap_lock acquired in sgx_zap_enclave_ptes().
> >  */
> > +   entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
> > 	mutex_unlock(&encl->lock);
> >  
> > sgx_zap_enclave_ptes(encl, addr);
> 
> It is somewhat trivial to NAK this as the commit message does
> not do any effort describing the new flag. By default at least
> I have strong opposition against any new flags related to
> reclaiming even if it needs a bit of extra synchronization
> work in the user space.
> 
> One way to describe concurrency scenarios would be to take
> example from https://www.kernel.org/doc/Documentation/memory-barriers.txt
> 
> I.e. see the examples with CPU 1 and CPU 2.

Thank you for the suggestion. Here is my new attempt at describing the racy
scenario:

Consider some enclave page added to the enclave. User space decides to
temporarily remove this page (e.g., emulating the MADV_DONTNEED semantics)
on CPU1. At the same time, user space performs a memory access on the same
page on CPU2, which results in a #PF and ultimately in sgx_vma_fault().
Scenario proceeds as follows:

/*
 * CPU1: User space performs
 * i

Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-30 Thread Dmitrii Kuvaiskii
On Mon, Apr 29, 2024 at 04:04:24PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to access the same non-present enclave page
> > simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> > threads will end up in sgx_encl_eaug_page(), racing to acquire the
> > enclave lock. The winning thread will perform EAUG, set up the page
> > table entry, and insert the page into encl->page_array. The losing
thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> > to error handling path.
> 
> And that path removes page. Not sure I got gist of this tbh.

Well, this is not about a redundant EREMOVE performed. This is about the
enclave page becoming inaccessible due to a bug triggered with a data race.

Consider some enclave page not yet added to the enclave. The enclave
performs a memory access to it at the same time on CPU1 and CPU2. Since the
page does not yet have a corresponding PTE, the #PF handler on both CPUs
calls sgx_vma_fault(). Scenario proceeds as follows:

/*
 * Fault on CPU1
 */
sgx_vma_fault() {

  xa_load(&encl->page_array) == NULL ->

  sgx_encl_eaug_page() {

...
    /*
     * Fault on CPU2
     */
   sgx_vma_fault() {

  xa_load(&encl->page_array) == NULL ->

 sgx_encl_eaug_page() {

   ...

   mutex_lock(&encl->lock);
   /*
* alloc encl_page
*/
   /*
* alloc EPC page
*/
   epc_page = sgx_alloc_epc_page(...);
   /*
* add page_to enclave's xarray
*/
   xa_insert(&encl->page_array, ...);
   /*
* add page to enclave via EAUG
* (page is in pending state)
*/
   /*
* add PTE entry
*/
   vmf_insert_pfn(...);

   mutex_unlock(&encl->lock);
   return VM_FAULT_NOPAGE;
 }
   }
  mutex_lock(&encl->lock);
 /*
  * alloc encl_page
  */
 /*
  * alloc EPC page
  */
 epc_page = sgx_alloc_epc_page(...);
 /*
  * add page_to enclave's xarray,
  * this fails with -EBUSY
  */
  xa_insert(&encl->page_array, ...);

   err_out_shrink:
 sgx_encl_free_epc_page(epc_page) {
   /*
* remove page via EREMOVE
*/
   /*
* free EPC page
*/
   sgx_free_epc_page(epc_page);
 }

  mutex_unlock(&encl->lock);
  return VM_FAULT_SIGBUS;
}
  }

CPU2 added the enclave page (in pending state) to the enclave and installed
the PTE. The kernel gives control back to the user space, without raising a
signal. The user space on CPU2 retries the memory access and induces a page
fault, but now with the SGX bit set in the #PF error code. The #PF handler
calls do_user_addr_fault(), which calls access_error() and ultimately
raises a SIGSEGV. The userspace SIGSEGV handler is supposed to perform
EACCEPT, after which point the enclave page becomes accessible.

CPU1 however jumps to the error handling path because the page was already
inserted into the enclave's xarray. This error handling path EREMOVEs the
page and also raises a SIGBUS signal to user space. The PTE entry is not
removed.

After CPU1 performs EREMOVE, this enclave page becomes perpetually
inaccessible (until an SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because
the page is marked accessible in the PTE entry but is not EAUGed. Because
of this combination, the #PF handler sees the SGX bit set in the #PF error
code and does not call sgx_vma_fault() but instead raises a SIGSEGV. The
userspace SIGSEGV handler cannot perform EACCEPT because the page was not
EAUGed. Thus, the user space is stuck with the inaccessible page.

Also note that in the scenario, CPU1 raises a SIGBUS signal to user space
unnecessarily. This signal is spurious because a page-access retry on CPU2
will also raise the SIGBUS signal. That said, this side effect is less
severe because it affects only user space. Therefore, it could be
circumvented in user space.

Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-04-30 Thread Dmitrii Kuvaiskii
On Mon, Apr 29, 2024 at 04:06:39PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> > enclave pages and may support MADV_DONTNEED semantics [1]. The former
> > implies #PF-based page allocation, and the latter implies the usage of
> > SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
> >
> > A trivial program like below (run under Gramine and with EDMM enabled)
> > stresses these two flows in the SGX driver and hangs:
> >
> > /* repeatedly touch different enclave pages at random and mix with
> >  * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> > static void* thread_func(void* arg) {
> > size_t num_pages = 0xA000 / page_size;
> > for (int i = 0; i < 5000; i++) {
> > size_t page = get_random_ulong() % num_pages;
> > char data = READ_ONCE(((char*)arg)[page * page_size]);
> >
> > page = get_random_ulong() % num_pages;
> > madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> > }
> > }
> >
> > addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> > pthread_t threads[16];
> > for (int i = 0; i < 16; i++)
> > pthread_create(&threads[i], NULL, thread_func, addr);
> 
> I'm not convinced that kernel is the problem here but it could be also
> how Gramine is implemented.
> 
> So maybe you could make a better case of that. The example looks a bit
> artificial to me.

I believe that these are the bugs in the kernel (in the SGX driver). I
provided more detailed descriptions of the races and ensuing bugs in the
other two replies, please check them.

The example is a stress test written to debug very infrequent hangs of
real-world applications that are run with Gramine, EDMM, and two
optimizations (lazy allocation and MADV_DONTNEED semantics). We observed
hangs of Node.js, PyTorch, R, iperf, Blender, Nginx. To root cause these
hangs, we wrote this artificial stress test. This test succeeds on vanilla
Linux, so ideally it should also pass on Gramine.

Please also note that the optimizations of lazy allocation and
MADV_DONTNEED provide significant performance improvement for some
workloads that run on Gramine. For example, a Java workload with a 16GB
enclave size has approx. 57x improvement in total runtime. Thus, we
consider it important to permit these optimizations in Gramine, which IIUC
requires bug fixes in the SGX driver.

You can find more info at
https://github.com/gramineproject/gramine/pull/1513.

Which parts do you consider artificial, and how could I modify the stress
test?

--
Dmitrii Kuvaiskii



[PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-04-29 Thread Dmitrii Kuvaiskii
Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and `MADV_DONTNEED` semantics). Consider this race:

1. T1 performs page removal in sgx_encl_remove_pages() and stops right
   after removing the page table entry and right before re-acquiring the
   enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
2. T2 tries to access the page, and #PF[not_present] is raised. The
   condition to EAUG in sgx_vma_fault() is not satisfied because the
   page is still present in encl->page_array, thus the SGX driver
   assumes that the fault happened because the page was swapped out. The
   driver continues on a code path that installs a page table entry
   *without* performing EAUG.
3. The enclave page metadata is in inconsistent state: the PTE is
   installed but there was no EAUG. Thus, T2 in userspace infinitely
   receives SIGSEGV on this page (and EACCEPT always fails).

Fix this by making sure that T1 (the page-removing thread) always wins
this data race. In particular, the page-being-removed is marked as such,
and T2 retries until the page is fully removed.

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: sta...@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
 arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 41f14b1a3025..7ccd8b2fce5f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
 
/* Entry successfully located. */
if (entry->epc_page) {
-   if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+   if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
+  SGX_ENCL_PAGE_BEING_REMOVED))
return ERR_PTR(-EBUSY);
 
return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fff5f2293ae7 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -25,6 +25,9 @@
 /* 'desc' bit marking that the page is being reclaimed. */
 #define SGX_ENCL_PAGE_BEING_RECLAIMED  BIT(3)
 
+/* 'desc' bit marking that the page is being removed. */
+#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
+
 struct sgx_encl_page {
unsigned long desc;
unsigned long vm_max_prot_bits:8;
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..c542d4dd3e64 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 * Do not keep encl->lock because of dependency on
 * mmap_lock acquired in sgx_zap_enclave_ptes().
 */
+   entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
	mutex_unlock(&encl->lock);
 
sgx_zap_enclave_ptes(encl, addr);
-- 
2.34.1




[PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Dmitrii Kuvaiskii
Two enclave threads may try to access the same non-present enclave page
simultaneously (e.g., if the SGX runtime supports lazy allocation). The
threads will end up in sgx_encl_eaug_page(), racing to acquire the
enclave lock. The winning thread will perform EAUG, set up the page
table entry, and insert the page into encl->page_array. The losing
thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
to error handling path.

This error handling path contains two bugs: (1) SIGBUS is sent to
userspace even though the enclave page is correctly installed by another
thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE
even though the enclave page was never intended to be removed. The first
bug is less severe because it impacts only the user space; the second
bug is more severe because it also impacts the OS state by ripping the
page (added by the winning thread) from the enclave.

Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
fault handler so that no signal is sent to userspace, and (2) by
replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
EREMOVE is performed.

Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Cc: sta...@vger.kernel.org
Reported-by: Marcelina Kościelnicka 
Suggested-by: Reinette Chatre 
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..41f14b1a3025 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 * If ret == -EBUSY then page was created in another flow while
 * running without encl->lock
 */
-   if (ret)
+   if (ret) {
+   if (ret == -EBUSY)
+   vmret = VM_FAULT_NOPAGE;
goto err_out_shrink;
+   }
 
pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
 err_out_shrink:
sgx_encl_shrink(encl, va_page);
 err_out_epc:
-   sgx_encl_free_epc_page(epc_page);
+   sgx_free_epc_page(epc_page);
 err_out_unlock:
	mutex_unlock(&encl->lock);
kfree(encl_page);
-- 
2.34.1




[PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-04-29 Thread Dmitrii Kuvaiskii
SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

A trivial program like below (run under Gramine and with EDMM enabled)
stresses these two flows in the SGX driver and hangs:

/* repeatedly touch different enclave pages at random and mix with
 * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
size_t num_pages = 0xA000 / page_size;
for (int i = 0; i < 5000; i++) {
size_t page = get_random_ulong() % num_pages;
char data = READ_ONCE(((char*)arg)[page * page_size]);

page = get_random_ulong() % num_pages;
madvise(arg + page * page_size, page_size, MADV_DONTNEED);
}
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
pthread_create(&threads[i], NULL, thread_func, addr);

This program uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
  stresses the first data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
  stresses the second data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result with the first bug fix but without the second bug fix: hangs
  approx. once in 50 runs).
- On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
  additionally stresses the enclave page swapping flows. Two Gramine SGX
  test suites running in the background for additional stressing of
  swapping (I observe 100% CPU utilization from ksgxd which confirms that
  swapping happens). Result: 1,000 runs without hangs.

(Sorry for the previous copy of this email, accidentally sent to
sta...@vger.kernel.org. Failed to use `--suppress-cc` during a test send.)

Dmitrii Kuvaiskii (2):
  x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  x86/sgx: Resolve EREMOVE page vs EAUG page data race

 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++---
 arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.34.1