Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-26 Thread Haitao Huang
On Mon, 27 Nov 2023 00:01:56 +0800, Haitao Huang  
 wrote:

> > > > Then here we can have something like:
> > > >
> > > > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > > > {
> > > > struct sgx_epc_lru_list *lru =  epc_cg ? 
_cg->lru :
> > > > _global_lru;
> > > >
> > > > sgx_reclaim_pages_lru(lru);
> > > > }
> > > >
> > > > Makes sense?
> > > >


The reason we 'isolate' first then do real 'reclaim' is that the actual  
reclaim is expensive and especially for eblock, etrack, etc.
Sorry this is out of context. It was meant to be in the other response for  
patch 9/12.


Also FYI I'm traveling for a vacation and email access may be sporadic.

BR
Haitao



Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-26 Thread Haitao Huang

On Mon, 20 Nov 2023 11:16:42 +0800, Huang, Kai  wrote:


> >
>
> That's true. I was thinking no need to have them done in separate  
calls.
> The caller has to check the return value for epc_cg instance first,  
then

> check result of try_charge. But there is really only one caller,
> sgx_alloc_epc_page() below, so I don't have strong opinions now.
>
> With them separate, the checks will look like this:
> if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,
> should continue for allocation
> {
>if (ret =  sgx_epc_cgroup_try_charge())
>return ret
> }
> // continue...
>
> I can go either way.


Let's keep this aligned with other _try_charge() variants: return 'int'  
to

indicate whether the charge is done or not.



Fine with me if no objections from maintainers.


>
> >
> > > > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool  
reclaim)

> > > > >  {
> > > > >struct sgx_epc_page *page;
> > > > > +  struct sgx_epc_cgroup *epc_cg;
> > > > > +
> > > > > +  epc_cg = sgx_epc_cgroup_try_charge();
> > > > > +  if (IS_ERR(epc_cg))
> > > > > +  return ERR_CAST(epc_cg);
> > > > >
> > > > >for ( ; ; ) {
> > > > >page = __sgx_alloc_epc_page();
> > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page  
*sgx_alloc_epc_page(void

> > > > > *owner, bool reclaim)
> > > > >break;
> > > > >}
> > > > >
> > > > > +  /*
> > > > > +		 * Need to do a global reclamation if cgroup was not full  
but

> > > free
> > > > > +		 * physical pages run out, causing __sgx_alloc_epc_page()  
to

> > > fail.
> > > > > +   */
> > > > >sgx_reclaim_pages();
> > > >
> > > > What's the final behaviour?  IIUC it should be reclaiming from  
the

> > > > *current* EPC
> > > > cgroup?  If so shouldn't we just pass the @epc_cg to it here?
> > > >
> > > > I think we can make this patch as "structure" patch w/o actually
> > > having
> > > > EPC
> > > > cgroup enabled, i.e., sgx_get_current_epc_cg() always return  
NULL.

> > > >
> > > > And we can have one patch to change sgx_reclaim_pages() to take  
the

> > > > 'struct
> > > > sgx_epc_lru_list *' as argument:
> > > >
> > > > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> > > > {
> > > > ...
> > > > }
> > > >
> > > > Then here we can have something like:
> > > >
> > > > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > > > {
> > > > struct sgx_epc_lru_list *lru =  epc_cg ? 
_cg->lru :
> > > > _global_lru;
> > > >
> > > > sgx_reclaim_pages_lru(lru);
> > > > }
> > > >
> > > > Makes sense?
> > > >


The reason we 'isolate' first then do real 'reclaim' is that the actual  
reclaim is expensive and especially for eblock, etrack, etc.



> > >
> > > This is purely global reclamation. No cgroup involved.
> >
> > Again why?  Here you are allocating one EPC page for enclave in a
> > particular EPC
> > cgroup.  When that fails, shouldn't you try only to reclaim from the
> > *current*
> > EPC cgroup?  Or at least you should try to reclaim from the  
*current*

> > EPC cgroup
> > first?
> >
>
> Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true,  
cgroup

> reclaims synchronously, otherwise in background and returns -EBUSY in
> that case. This function also returns if no valid epc_cg pointer
> returned.
>
> All reclamation for *current* cgroup is done in  
sgx_epc_cg_try_charge().


This is fine, but I believe my question above is about where to reclaim  
when

"allocation" fails,  but not "try charge" fails.


I mean "will be done" :-) Currently no reclaim in try_charge.

And for "reclaim for current cgroup when charge fails", I don't think  
its even
necessary in this initial implementation of EPC cgroup.  You can just  
fail the
allocation when charge fails (reaching the limit).  Trying to reclaim  
when limit

is hit can be done later.



Yes. It is done later.


Please see Dave and Michal's replies here:

https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df...@intel.com/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/


>
> So, by reaching to this point,  a valid epc_cg pointer was returned,
> that means allocation is allowed for the cgroup (it has reclaimed if
> necessary, and its usage is not above limit after charging).


I found memory cgroup uses different logic -- allocation first and then  
charge:


For instance:

static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
..

folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
if (!folio)
goto oom;
   if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
goto oom_free_page;

..}

Why EPC needs to "charge first" and "then allocate"?



EPC allocation can involve reclaiming which is more expensive than regular  
RAM 

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-19 Thread Huang, Kai
> > > 
> > 
> > That's true. I was thinking no need to have them done in separate calls.  
> > The caller has to check the return value for epc_cg instance first, then  
> > check result of try_charge. But there is really only one caller,  
> > sgx_alloc_epc_page() below, so I don't have strong opinions now.
> > 
> > With them separate, the checks will look like this:
> > if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,  
> > should continue for allocation
> > {
> > if (ret =  sgx_epc_cgroup_try_charge())
> > return ret
> > }
> > // continue...
> > 
> > I can go either way.

Let's keep this aligned with other _try_charge() variants: return 'int' to
indicate whether the charge is done or not.

> > 
> > > 
> > > > > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > > > > >  {
> > > > > > struct sgx_epc_page *page;
> > > > > > +   struct sgx_epc_cgroup *epc_cg;
> > > > > > +
> > > > > > +   epc_cg = sgx_epc_cgroup_try_charge();
> > > > > > +   if (IS_ERR(epc_cg))
> > > > > > +   return ERR_CAST(epc_cg);
> > > > > > 
> > > > > > for ( ; ; ) {
> > > > > > page = __sgx_alloc_epc_page();
> > > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> > > > > > *owner, bool reclaim)
> > > > > > break;
> > > > > > }
> > > > > > 
> > > > > > +   /*
> > > > > > +* Need to do a global reclamation if cgroup was not 
> > > > > > full but  
> > > > free
> > > > > > +* physical pages run out, causing 
> > > > > > __sgx_alloc_epc_page() to  
> > > > fail.
> > > > > > +*/
> > > > > > sgx_reclaim_pages();
> > > > > 
> > > > > What's the final behaviour?  IIUC it should be reclaiming from the
> > > > > *current* EPC
> > > > > cgroup?  If so shouldn't we just pass the @epc_cg to it here?
> > > > > 
> > > > > I think we can make this patch as "structure" patch w/o actually  
> > > > having
> > > > > EPC
> > > > > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
> > > > > 
> > > > > And we can have one patch to change sgx_reclaim_pages() to take the
> > > > > 'struct
> > > > > sgx_epc_lru_list *' as argument:
> > > > > 
> > > > >   void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> > > > >   {
> > > > >   ...
> > > > >   }
> > > > > 
> > > > > Then here we can have something like:
> > > > > 
> > > > >   void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > > > >   {
> > > > >   struct sgx_epc_lru_list *lru =  epc_cg 
> > > > > ? _cg->lru :
> > > > > _global_lru;
> > > > > 
> > > > >   sgx_reclaim_pages_lru(lru);
> > > > >   }
> > > > > 
> > > > > Makes sense?
> > > > > 
> > > > 
> > > > This is purely global reclamation. No cgroup involved.
> > > 
> > > Again why?  Here you are allocating one EPC page for enclave in a  
> > > particular EPC
> > > cgroup.  When that fails, shouldn't you try only to reclaim from the  
> > > *current*
> > > EPC cgroup?  Or at least you should try to reclaim from the *current*  
> > > EPC cgroup
> > > first?
> > > 
> > 
> > Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup  
> > reclaims synchronously, otherwise in background and returns -EBUSY in  
> > that case. This function also returns if no valid epc_cg pointer  
> > returned.
> > 
> > All reclamation for *current* cgroup is done in sgx_epc_cg_try_charge().

This is fine, but I believe my question above is about where to reclaim when
"allocation" fails,  but not "try charge" fails.

And for "reclaim for current cgroup when charge fails", I don't think its even
necessary in this initial implementation of EPC cgroup.  You can just fail the
allocation when charge fails (reaching the limit).  Trying to reclaim when limit
is hit can be done later.

Please see Dave and Michal's replies here:

https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df...@intel.com/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/

> > 
> > So, by reaching to this point,  a valid epc_cg pointer was returned,  
> > that means allocation is allowed for the cgroup (it has reclaimed if  
> > necessary, and its usage is not above limit after charging).

I found memory cgroup uses different logic -- allocation first and then charge:

For instance:

static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
..

folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); 
if (!folio)
goto oom;

if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))  
goto oom_free_page;

.. 
}

Why EPC needs to "charge first" and "then allocate"?
 
> > 
> > But the system level free count may be low 

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-15 Thread Jarkko Sakkinen
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
>
> This patch was modified from the previous version to only add basic EPC
> cgroup structure, accounting allocations for cgroup usage
> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> globale reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller"  and restructure it (Kai)
> ---
>  arch/x86/Kconfig |  13 
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++
>  arch/x86/kernel/cpu/sgx/main.c   |  28 
>  arch/x86/kernel/cpu/sgx/sgx.h|   3 +
>  6 files changed, 184 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
> If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for 
> Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> +   Provides control over the EPC footprint of tasks in a cgroup via
> +   the Miscellaneous cgroup controller.
> +
> +   EPC is a subset of regular memory that is usable only by SGX
> +   enclaves and is very limited in quantity, e.g. less than 1%
> +   of total DRAM.
> +
> +   Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>   bool "X86 userspace shadow stack"
>   depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile 
> b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>   ioctl.o \
>   main.o
>  obj-$(CONFIG_X86_SGX_KVM)+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)+= epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index ..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include 
> +#include 
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct 
> misc_cg *cg)
> +{
> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> + return !cgroup_subsys_enabled(misc_cgrp_subsys);
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC 
> page
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.

Should have a description explaining what "charging hierarchically" is
all about. This is too cryptic like this.

E.g. consider wahat non-hierarchically charging means. There must be
opposite end in order to have a meaning (for anything expressed with
a language).

> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> + int ret;
> +
> + if 

RE: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-07 Thread Huang, Kai
> I should have sticked to the orignial comment added in code. Actually
> __sgx_alloc_epc_page() can fail if system runs out of EPC. That's the really 
> reason
> for global reclaim. The free count enforcement is near the end of this method
> after should_reclaim() check.

Hi Haitao,

Sorry I have something else to do at this moment and will continue this series 
next week.


Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Haitao Huang
On Mon, 06 Nov 2023 19:16:30 -0600, Haitao Huang  
 wrote:


On Mon, 06 Nov 2023 16:18:30 -0600, Huang, Kai   
wrote:




> > +/**
> > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a  
single

> > EPC page
> > + *
> > + * Returns EPC cgroup or NULL on success, -errno on failure.
> > + */
> > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> > +{
> > + struct sgx_epc_cgroup *epc_cg;
> > + int ret;
> > +
> > + if (sgx_epc_cgroup_disabled())
> > + return NULL;
> > +
> > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > +	ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,  
PAGE_SIZE);

> > +
> > + if (!ret) {
> > + /* No epc_cg returned, release ref from get_current_misc_cg() 
*/
> > + put_misc_cg(epc_cg->cg);
> > + return ERR_PTR(-ENOMEM);
>
> misc_cg_try_charge() returns 0 when successfully charged, no?

Right. I really made some mess in rebasing :-(

>
> > + }
> > +
> > + /* Ref released in sgx_epc_cgroup_uncharge() */
> > + return epc_cg;
> > +}
>
> IMHO the above _try_charge() returning a pointer of EPC cgroup is a
> little bit
> odd, because it doesn't match the existing misc_cg_try_charge() which
> returns
> whether the charge is successful or not.  sev_misc_cg_try_charge()
> matches
> misc_cg_try_charge() too.
>
> I think it's better to split "getting EPC cgroup" part out as a  
separate

> helper,
> and make this _try_charge() match existing pattern:
>
>struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
>{
>if (sgx_epc_cgroup_disabled())
>return NULL;
>
>return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>}
>
>int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>{
>if (!epc_cg)
>return -EINVAL;
>
>return misc_cg_try_charge(epc_cg->cg);
>}
>
> Having sgx_get_current_epc_cg() also makes the caller easier to read,
> because we
> can immediately know we are going to charge the *current* EPC cgroup,
> but not
> some cgroup hidden within sgx_epc_cgroup_try_charge().
>

Actually, unlike other misc controllers, we need charge and get the  
epc_cg

reference at the same time.


Can you elaborate?

And in practice you always call sgx_epc_cgroup_try_charge() right after
sgx_get_current_epc_cg() anyway.  The only difference is the whole  
thing is done

in one function or in separate functions.

[...]



That's true. I was thinking no need to have them done in separate calls.  
The caller has to check the return value for epc_cg instance first, then  
check result of try_charge. But there is really only one caller,  
sgx_alloc_epc_page() below, so I don't have strong opinions now.


With them separate, the checks will look like this:
if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,  
should continue for allocation

{
if (ret =  sgx_epc_cgroup_try_charge())
return ret
}
// continue...

I can go either way.




> >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> >  {
> >   struct sgx_epc_page *page;
> > + struct sgx_epc_cgroup *epc_cg;
> > +
> > + epc_cg = sgx_epc_cgroup_try_charge();
> > + if (IS_ERR(epc_cg))
> > + return ERR_CAST(epc_cg);
> >
> >   for ( ; ; ) {
> >   page = __sgx_alloc_epc_page();
> > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> > *owner, bool reclaim)
> >   break;
> >   }
> >
> > + /*
> > +		 * Need to do a global reclamation if cgroup was not full but  
free
> > +		 * physical pages run out, causing __sgx_alloc_epc_page() to  
fail.

> > +  */
> >   sgx_reclaim_pages();
>
> What's the final behaviour?  IIUC it should be reclaiming from the
> *current* EPC
> cgroup?  If so shouldn't we just pass the @epc_cg to it here?
>
> I think we can make this patch as "structure" patch w/o actually  
having

> EPC
> cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
>
> And we can have one patch to change sgx_reclaim_pages() to take the
> 'struct
> sgx_epc_lru_list *' as argument:
>
>void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
>{
>...
>}
>
> Then here we can have something like:
>
>void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>{
>struct sgx_epc_lru_list *lru =  epc_cg ? 
_cg->lru :
> _global_lru;
>
>sgx_reclaim_pages_lru(lru);
>}
>
> Makes sense?
>

This is purely global reclamation. No cgroup involved.


Again why?  Here you are allocating one EPC page for enclave in a  
particular EPC
cgroup.  When that fails, shouldn't you try only to reclaim from the  
*current*
EPC cgroup?  Or at least you should try to reclaim from the *current*  
EPC cgroup

first?



Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup  

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Haitao Huang

On Mon, 06 Nov 2023 16:18:30 -0600, Huang, Kai  wrote:



> > +/**
> > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a  
single

> > EPC page
> > + *
> > + * Returns EPC cgroup or NULL on success, -errno on failure.
> > + */
> > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> > +{
> > + struct sgx_epc_cgroup *epc_cg;
> > + int ret;
> > +
> > + if (sgx_epc_cgroup_disabled())
> > + return NULL;
> > +
> > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > +	ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,  
PAGE_SIZE);

> > +
> > + if (!ret) {
> > + /* No epc_cg returned, release ref from get_current_misc_cg() 
*/
> > + put_misc_cg(epc_cg->cg);
> > + return ERR_PTR(-ENOMEM);
>
> misc_cg_try_charge() returns 0 when successfully charged, no?

Right. I really made some mess in rebasing :-(

>
> > + }
> > +
> > + /* Ref released in sgx_epc_cgroup_uncharge() */
> > + return epc_cg;
> > +}
>
> IMHO the above _try_charge() returning a pointer of EPC cgroup is a
> little bit
> odd, because it doesn't match the existing misc_cg_try_charge() which
> returns
> whether the charge is successful or not.  sev_misc_cg_try_charge()
> matches
> misc_cg_try_charge() too.
>
> I think it's better to split "getting EPC cgroup" part out as a  
separate

> helper,
> and make this _try_charge() match existing pattern:
>
>struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
>{
>if (sgx_epc_cgroup_disabled())
>return NULL;
>
>return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>}
>
>int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>{
>if (!epc_cg)
>return -EINVAL;
>
>return misc_cg_try_charge(epc_cg->cg);
>}
>
> Having sgx_get_current_epc_cg() also makes the caller easier to read,
> because we
> can immediately know we are going to charge the *current* EPC cgroup,
> but not
> some cgroup hidden within sgx_epc_cgroup_try_charge().
>

Actually, unlike other misc controllers, we need charge and get the  
epc_cg

reference at the same time.


Can you elaborate?

And in practice you always call sgx_epc_cgroup_try_charge() right after
sgx_get_current_epc_cg() anyway.  The only difference is the whole thing  
is done

in one function or in separate functions.

[...]



That's true. I was thinking no need to have them done in separate calls.  
The caller has to check the return value for epc_cg instance first, then  
check result of try_charge. But there is really only one caller,  
sgx_alloc_epc_page() below, so I don't have strong opinions now.


With them separate, the checks will look like this:
if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,  
should continue for allocation

{
if (ret =  sgx_epc_cgroup_try_charge())
return ret
}
// continue...

I can go either way.




> >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> >  {
> >   struct sgx_epc_page *page;
> > + struct sgx_epc_cgroup *epc_cg;
> > +
> > + epc_cg = sgx_epc_cgroup_try_charge();
> > + if (IS_ERR(epc_cg))
> > + return ERR_CAST(epc_cg);
> >
> >   for ( ; ; ) {
> >   page = __sgx_alloc_epc_page();
> > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> > *owner, bool reclaim)
> >   break;
> >   }
> >
> > + /*
> > +		 * Need to do a global reclamation if cgroup was not full but  
free
> > +		 * physical pages run out, causing __sgx_alloc_epc_page() to  
fail.

> > +  */
> >   sgx_reclaim_pages();
>
> What's the final behaviour?  IIUC it should be reclaiming from the
> *current* EPC
> cgroup?  If so shouldn't we just pass the @epc_cg to it here?
>
> I think we can make this patch as "structure" patch w/o actually  
having

> EPC
> cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
>
> And we can have one patch to change sgx_reclaim_pages() to take the
> 'struct
> sgx_epc_lru_list *' as argument:
>
>void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
>{
>...
>}
>
> Then here we can have something like:
>
>void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>{
>struct sgx_epc_lru_list *lru =  epc_cg ? 
_cg->lru :
> _global_lru;
>
>sgx_reclaim_pages_lru(lru);
>}
>
> Makes sense?
>

This is purely global reclamation. No cgroup involved.


Again why?  Here you are allocating one EPC page for enclave in a  
particular EPC
cgroup.  When that fails, shouldn't you try only to reclaim from the  
*current*
EPC cgroup?  Or at least you should try to reclaim from the *current*  
EPC cgroup

first?



Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup  
reclaims synchronously, otherwise in background and returns -EBUSY 

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Huang, Kai
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> +static int __init sgx_epc_cgroup_init(void)
> +{
> + struct misc_cg *cg;
> +
> + if (!boot_cpu_has(X86_FEATURE_SGX))
> + return 0;
> +
> + cg = misc_cg_root();
> + BUG_ON(!cg);
> +
> + return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);

This should be called from sgx_init(), which is the place to init SGX related
staff.  In case you didn't notice, sgx_init() is actually device_initcall(),
which is actually called _after_ the subsys_initcall() you used above.


Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Huang, Kai
> 
> > > +/**
> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single  
> > > EPC page
> > > + *
> > > + * Returns EPC cgroup or NULL on success, -errno on failure.
> > > + */
> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> > > +{
> > > + struct sgx_epc_cgroup *epc_cg;
> > > + int ret;
> > > +
> > > + if (sgx_epc_cgroup_disabled())
> > > + return NULL;
> > > +
> > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> > > +
> > > + if (!ret) {
> > > + /* No epc_cg returned, release ref from get_current_misc_cg() */
> > > + put_misc_cg(epc_cg->cg);
> > > + return ERR_PTR(-ENOMEM);
> > 
> > misc_cg_try_charge() returns 0 when successfully charged, no?
> 
> Right. I really made some mess in rebasing :-(
> 
> > 
> > > + }
> > > +
> > > + /* Ref released in sgx_epc_cgroup_uncharge() */
> > > + return epc_cg;
> > > +}
> > 
> > IMHO the above _try_charge() returning a pointer of EPC cgroup is a  
> > little bit
> > odd, because it doesn't match the existing misc_cg_try_charge() which  
> > returns
> > whether the charge is successful or not.  sev_misc_cg_try_charge()  
> > matches
> > misc_cg_try_charge() too.
> > 
> > I think it's better to split "getting EPC cgroup" part out as a separate  
> > helper,
> > and make this _try_charge() match existing pattern:
> > 
> > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
> > {
> > if (sgx_epc_cgroup_disabled())
> > return NULL;
> > 
> > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > }
> > 
> > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> > {
> > if (!epc_cg)
> > return -EINVAL;
> > 
> > return misc_cg_try_charge(epc_cg->cg);
> > }
> > 
> > Having sgx_get_current_epc_cg() also makes the caller easier to read,  
> > because we
> > can immediately know we are going to charge the *current* EPC cgroup,  
> > but not
> > some cgroup hidden within sgx_epc_cgroup_try_charge().
> > 
> 
> Actually, unlike other misc controllers, we need charge and get the epc_cg  
> reference at the same time. 
> 

Can you elaborate?

And in practice you always call sgx_epc_cgroup_try_charge() right after
sgx_get_current_epc_cg() anyway.  The only difference is the whole thing is done
in one function or in separate functions.

[...]


> > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > >  {
> > >   struct sgx_epc_page *page;
> > > + struct sgx_epc_cgroup *epc_cg;
> > > +
> > > + epc_cg = sgx_epc_cgroup_try_charge();
> > > + if (IS_ERR(epc_cg))
> > > + return ERR_CAST(epc_cg);
> > > 
> > >   for ( ; ; ) {
> > >   page = __sgx_alloc_epc_page();
> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void  
> > > *owner, bool reclaim)
> > >   break;
> > >   }
> > > 
> > > + /*
> > > +  * Need to do a global reclamation if cgroup was not full but 
> > > free
> > > +  * physical pages run out, causing __sgx_alloc_epc_page() to 
> > > fail.
> > > +  */
> > >   sgx_reclaim_pages();
> > 
> > What's the final behaviour?  IIUC it should be reclaiming from the  
> > *current* EPC
> > cgroup?  If so shouldn't we just pass the @epc_cg to it here?
> > 
> > I think we can make this patch as "structure" patch w/o actually having  
> > EPC
> > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
> > 
> > And we can have one patch to change sgx_reclaim_pages() to take the  
> > 'struct
> > sgx_epc_lru_list *' as argument:
> > 
> > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> > {
> > ...
> > }
> > 
> > Then here we can have something like:
> > 
> > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > {
> > struct sgx_epc_lru_list *lru =  epc_cg ? 
> > _cg->lru :  
> > _global_lru;
> > 
> > sgx_reclaim_pages_lru(lru);
> > }
> > 
> > Makes sense?
> > 
> 
> This is purely global reclamation. No cgroup involved. 
> 

Again why?  Here you are allocating one EPC page for enclave in a particular EPC
cgroup.  When that fails, shouldn't you try only to reclaim from the *current*
EPC cgroup?  Or at least you should try to reclaim from the *current* EPC cgroup
first?

> You can see it  
> later in changes in patch 10/12. For now I just make a comment there but  
> no real changes. Cgroup reclamation will be done as part of _try_charge  
> call.
> 
> > >   cond_resched();
> > >   }
> > > 
> > > + if (!IS_ERR(page)) {
> > > + WARN_ON_ONCE(page->epc_cg);
> > > + page->epc_cg = epc_cg;
> > > + } else {
> > > + sgx_epc_cgroup_uncharge(epc_cg);
> > > + }
> > > +
> > >   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> > >   wake_up(_waitq);
> > > 
> > > @@ -604,6 

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Haitao Huang

On Mon, 06 Nov 2023 06:09:45 -0600, Huang, Kai  wrote:


On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:

From: Kristen Carlson Accardi 

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem).  The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC  
controller

is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.


Nit:

The above two paragraphs talk about what is EPC and EPC resource control  
needs
to be done separately, etc, but IMHO it lacks some background about  
"why" EPC

resource control is needed, e.g, from use case's perspective.



The misc controller provides a mechanism to set a hard limit of EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".


Please separate what the current misc cgroup provides, and how this  
patch is

going to utilize.

Please describe the changes in imperative mood. E.g, "report total EPC  
memory

via ...", instead of "... is reported via ...".



Will update



This patch was modified from the previous version to only add basic EPC
cgroup structure, accounting allocations for cgroup usage
(charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.


This isn't changelog material.  Please focus on describing the high  
level design

and why you chose such design.



For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
globale reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/Kconfig |  13 
 arch/x86/kernel/cpu/sgx/Makefile |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++
 arch/x86/kernel/cpu/sgx/main.c   |  28 
 arch/x86/kernel/cpu/sgx/sgx.h|   3 +
 6 files changed, 184 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..e17c5dc3aea4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1921,6 +1921,19 @@ config X86_SGX

  If unsure, say N.

+config CGROUP_SGX_EPC
+	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC)  
for Intel SGX"

+   depends on X86_SGX && CGROUP_MISC
+   help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
 config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile  
b/arch/x86/kernel/cpu/sgx/Makefile

index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
 obj-$(CONFIG_X86_SGX_KVM)  += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)  += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c  
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c

new file mode 100644
index ..500627d0563f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include 
+#include 
+#include "epc_cgroup.h"
+
+static inline struct sgx_epc_cgroup  
*sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)

+{
+   return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+   return !cgroup_subsys_enabled(misc_cgrp_subsys);


From below, the root EPC cgroup is dynamically allocated.  Shouldn't it  
also

check whether the 

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Huang, Kai
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
> 
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.

Nit:

The above two paragraphs talk about what is EPC and EPC resource control needs
to be done separately, etc, but IMHO it lacks some background about "why" EPC
resource control is needed, e.g, from use case's perspective.

> 
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".

Please separate what the current misc cgroup provides, and how this patch is
going to utilize.

Please describe the changes in imperative mood. E.g, "report total EPC memory
via ...", instead of "... is reported via ...".

> 
> This patch was modified from the previous version to only add basic EPC
> cgroup structure, accounting allocations for cgroup usage
> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.

This isn't changelog material.  Please focus on describing the high level design
and why you chose such design.

> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> globale reclaimer and implement per-cgroup tracking and reclaiming.
> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller"  and restructure it (Kai)
> ---
>  arch/x86/Kconfig |  13 
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++
>  arch/x86/kernel/cpu/sgx/main.c   |  28 
>  arch/x86/kernel/cpu/sgx/sgx.h|   3 +
>  6 files changed, 184 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
> If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for 
> Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> +   Provides control over the EPC footprint of tasks in a cgroup via
> +   the Miscellaneous cgroup controller.
> +
> +   EPC is a subset of regular memory that is usable only by SGX
> +   enclaves and is very limited in quantity, e.g. less than 1%
> +   of total DRAM.
> +
> +   Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>   bool "X86 userspace shadow stack"
>   depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile 
> b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>   ioctl.o \
>   main.o
>  obj-$(CONFIG_X86_SGX_KVM)+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)+= epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index ..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include 
> +#include 
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct 
> misc_cg *cg)
> +{
> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> + return !cgroup_subsys_enabled(misc_cgrp_subsys);

From below,