Re: [PATCH v15 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-06-20 Thread Huang, Kai
On Thu, 2024-06-20 at 10:06 -0500, Haitao Huang wrote:
> Hi Kai
> 
> On Thu, 20 Jun 2024 05:30:16 -0500, Huang, Kai  wrote:
> 
> > 
> > On 18/06/2024 12:53 am, Huang, Haitao wrote:
> > > From: Kristen Carlson Accardi 
> > > 
> > > Previous patches have implemented all infrastructure needed for
> > > per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
> > > pages are still tracked in the global LRU as sgx_epc_page_lru() always
> > > returns reference to the global LRU.
> > > 
> > > Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
> > > given EPC page is allocated.
> > > 
> > > This makes all EPC pages tracked in per-cgroup LRUs and the global
> > > reclaimer (ksgxd) will not be able to reclaim any pages from the global
> > > LRU. However, in cases of over-committing, i.e., the sum of cgroup
> > > limits greater than the total capacity, cgroups may never reclaim but
> > > the total usage can still be near the capacity. Therefore a global
> > > reclamation is still needed in those cases and it should be performed
> > > from the root cgroup.
> > > 
> > > Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
> > > when cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(), return
> > > the next cgroup so callers can use it as the new starting node for next
> > > round of reclamation if needed.
> > > 
> > > Also update sgx_can_reclaim_global(), to check emptiness of LRUs of all
> > > cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
> > > 
> > > Finally, change sgx_reclaim_direct(), to check and ensure there are free
> > > pages at cgroup level so forward progress can be made by the caller.
> > 
> > Reading above, it's not clear how the _new_ global reclaim works with
> > multiple LRUs.
> > 
> > E.g., the current global reclaim essentially treats all EPC pages equally
> > when scanning those pages.  From the above, I don't see how this is
> > achieved in the new global reclaim.
> > 
> > The changelog should:
> > 
> > 1) describe how the existing global reclaim works, and then describe
> > how to achieve the same behaviour in the new global reclaim which works
> > with multiple LRUs;
> > 
> > 2) If there's any behaviour difference between the "existing" vs the "new"
> > global reclaim, the changelog should point out the difference, and explain
> > why such difference is OK.
> 
> Sure I can explain. here is what I plan to add for v16:
> 
> Note the original global reclaimer has
> only one LRU and always scans and reclaims from the head of this global
> LRU. The new global reclaimer always starts the scanning from the root
> node, and only moves down to its descendants if more reclamation is needed
> or the root node does not have SGX_NR_TO_SCAN (16) pages in the LRU.
> This makes the enclave pages in the root node more likely to be
> reclaimed if they are not frequently used (not 'young'). Unless we track
> pages in one LRU again, we cannot really match exactly the same
> behavior of the original global reclaimer. And a one-LRU approach would
> make per-cgroup reclamation scanning and reclaiming too complex.  On the
> other hand, this design is acceptable for the following reasons:
> 
> 1) For all purposes of using cgroups, enclaves will need to live in
>   non-root (leaf for cgroup v2) nodes where limits can be enforced
>   per-cgroup.

I don't see how it matters.  If ROOT is empty, then it moves to the first
descendant.

> 2) Global reclamation now only happens in the situation mentioned above, when
>   a lower-level cgroup not at its limit can't allocate due to over-commit
>   at the global level.

Really?  In sgx_reclaim_direct() the code says:

/*
 * Make sure there are some free pages at both cgroup and global levels.
 * In both cases, only make one attempt of reclamation to avoid lengthy
 * block on the caller.
 */

Yeah, only one attempt will be made at the global level, but it is still global
reclaim.

> 3) The pages in the root that get slightly penalized are not heavily used
>   anyway.

The 1) says in practice the root will have no enclaves, thus no EPC at
all.

In other words, in practice the global reclaim will always skip the root
and move to the first descendant.

> 4) In cases where multiple rounds of reclamation are needed, the caller of
>   sgx_reclaim_pages_global() can call it again to reclaim from the 'next'
>   descendant in a round-robin way; each round scans SGX_NR_TO_SCAN pages
>   from the head of the 'next' cgroup's LRU, as sketched below.
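> 
>   A rough sketch of that caller pattern (not a literal excerpt from the
>   patch; SGX_NR_HIGH_PAGES as the watermark and a NULL charge_mm are only
>   assumptions here, mirroring what ksgxd would do):
> 
>   	struct misc_cg *next_cg = NULL;
> 
>   	while (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)) {
>   		/* Scan up to SGX_NR_TO_SCAN pages, resume the walk at next_cg. */
>   		next_cg = sgx_reclaim_pages_global(next_cg, NULL);
>   		cond_resched();
>   	}
> 
>   	if (next_cg != misc_cg_root())
>   		put_misc_cg(next_cg);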

Re: [PATCH v15 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-06-20 Thread Huang, Kai
> > 
> > > In other cases, the caller may invoke this function in a
> > > loop to ensure enough pages are reclaimed for its usage. To ensure all
> > > descendant groups are scanned in a round-robin fashion in those cases,
> > > sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the
> > > next cgroup that the caller can pass in as the new starting cgroup for a
> > > subsequent call.
> > 
> > 
> > AFAICT this part is new, and I believe this "round-robin" thing is just
> > for the "global reclaim"?  Or is it also for per-cgroup reclaim where  
> > more
> > than SGX_NR_TO_SCAN pages needs to be reclaimed?
> > 
> > I wish the changelog would just point out which consumers will use this
> > new sgx_cgroup_reclaim_pages(), like:
> > 
> > The sgx_cgroup_reclaim_pages() will be used in three cases:
> > 
> >  1) direct/sync per-cgroup reclaim in try_charge()
> >  2) indirect/async per-cgroup reclaim triggered in try_charge()
> >  3) global reclaim
> > 
> > And then describe how they will use sgx_cgroup_reclaim_pages():
> > 
> > Both 1) and 2) can result in needing to reclaim more than SGX_NR_TO_SCAN
> > pages, in which case we should .
> > 
> > For 3), the new global reclaim should try to match the existing global
> > reclaim behaviour, that is to try to treat all EPC pages equally.
> > 
> > 
> > With above context, we can justify why to make sgx_cgroup_reclaim_pages()
> > in this form.
> > 
> This new part is only to address the issue you raised in this thread:
> https://lore.kernel.org/lkml/op.2ndsydgywjv...@hhuan26-mobl.amr.corp.intel.com/
> 
> Really it has nothing to do with whether it's a global, direct/async, or
> per-cgroup context. They all should use the function the same way. This
> paragraph describes the design, and I thought the above new statements justify
> the reason we return 'next' so it can reclaim into descendants in a
> round-robin fashion?  Not sure we need to get into details of the different
> usages of the function, which are in the code anyway?

Please clearly define the behaviour of "per-cgroup reclaim" first.

I can understand "global reclaim" means we essentially want to treats all
EPC pages equally.  But it's not obvious to me what is the desired
behaviour of "per-cgroup reclaim", especially when the behaviour is
different between this version and the previous versions (see below).
> 

[...]

> > And when there are more than SGX_NR_TO_SCAN pages that need to be reclaimed,
> > the above ...
> 
> Note, all sgx_cgroup_reclaim_pages() guarantees is scanning SGX_NR_TO_SCAN  
> pages.
> > 
> > for (;;) {
> > cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next);
> > }
> > 
> > ... actually tries to reclaim those pages from @sgx_cg _AND_ its
> > descendants, and tries to do it _EQUALLY_.
> > 
> > Is this desired, or should we always try to reclaim from the @sgx_cg
> > first, but only move to the descendants when the @sgx_cg shouldn't be
> > reclaimed anymore?
> > 
> 
> We still reclaim from sgx_cg in the first scan and attempt to reclaim
> SGX_NR_TO_SCAN pages, but if it turns out that did not satisfy the caller's
> needs, then the caller goes on to reclaim from descendants by passing in
> 'next' as the starting point.

But why?
 
> 
> > Anyway, it's different from the previous behaviour.
> > 
> Again, this is to fix the issue you raised. I consider it improved  
> behavior :-)

Please clearly define the _EXPECTED_ behaviour of "per-cgroup reclaim"
first.
> 

We have two choices: 

1) Always try to reclaim the desired number of pages from the given cgroup,
but only move to reclaim from descendants when there are fewer than
SGX_NR_TO_SCAN pages left;

2) Always try to reclaim the desired number of pages _EQUALLY_ from the given
cgroup _AND_ its descendants (in the granularity of reclaiming SGX_NR_TO_SCAN
pages each time).

The 1) was the old behaviour in the previous versions, 2) is the new
behaviour in this version.

I am not against any option, but the patch needs to be clear on which
option to choose and why it is desired/better.
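To make the difference concrete, here is a stand-alone user-space toy (nothing
below is kernel code or from the patch; the tree, page counts and names are
invented, and SGX_NR_TO_SCAN just mirrors the kernel's 16) that only shows the
walk order each option produces:

	#include <stdio.h>

	#define SGX_NR_TO_SCAN	16
	#define NR_NODES	3

	static const char *name[NR_NODES] = { "cg", "cg/child-A", "cg/child-B" };
	static int lru[NR_NODES] = { 64, 64, 64 };	/* reclaimable pages per LRU */

	/*
	 * One call: scan up to SGX_NR_TO_SCAN pages in pre-order starting at @start.
	 * Returns where the next call should start ("next" in the discussion above).
	 */
	static int scan_pass(int start)
	{
		int budget = SGX_NR_TO_SCAN, i = start;

		while (budget > 0 && i < NR_NODES) {
			int take = lru[i] < budget ? lru[i] : budget;

			if (take) {
				printf("  scanned %2d from %s\n", take, name[i]);
				lru[i] -= take;
				budget -= take;
			}
			if (budget)
				i++;	/* this LRU ran dry, walk down the tree */
		}
		return (i + 1 < NR_NODES) ? i + 1 : 0;	/* wrap back to the top */
	}

	int main(void)
	{
		int round, next = 0;

		puts("Option 1: every call restarts at the given cgroup");
		for (round = 0; round < 4; round++)
			scan_pass(0);

		lru[0] = lru[1] = lru[2] = 64;		/* reset the toy LRUs */

		puts("Option 2: every call resumes at the returned 'next' cgroup");
		for (round = 0; round < 4; round++)
			next = scan_pass(next);

		return 0;
	}

With plenty of pages in the first node, option 1 keeps hitting that node only,
while option 2 spreads the scanning across the tree one batch at a time.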


Re: [PATCH v15 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-06-20 Thread Huang, Kai

On 18/06/2024 12:53 am, Huang, Haitao wrote:
> From: Kristen Carlson Accardi 
> 
> Currently in the EPC page allocation, the kernel simply fails the
> allocation when the current EPC cgroup fails to charge due to its usage
> reaching limit.  This is not ideal. When that happens, a better way is
> to reclaim EPC page(s) from the current EPC cgroup (and/or its
> descendants) to reduce its usage so the new allocation can succeed.
> 
> Add the basic building blocks to support per-cgroup reclamation.
> 
> Currently the kernel only has one place to reclaim EPC pages: the global
> EPC LRU list.  To support the "per-cgroup" EPC reclaim, maintain an LRU
> list for each EPC cgroup, and introduce a "cgroup" variant function to
> reclaim EPC pages from a given EPC cgroup and its descendants.
> 
> Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
> It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
> pages.  Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
> pages from the global LRU, and then tries to reclaim these pages at once
> for better performance.
> 
> Implement the "cgroup" variant EPC reclaim in a similar way, but keep
> the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
> as input, and return the pages that are "scanned" and attempted for
> reclamation (but not necessarily reclaimed successfully); 2) loop the
> given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
> until SGX_NR_TO_SCAN pages are "scanned". 
> 
> This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
> tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
> cgroup, and only moves to its descendants when there are not enough
> reclaimable EPC pages to "scan" in its LRU.  It should be enough for
> most cases. 

[...]

> In other cases, the caller may invoke this function in a
> loop to ensure enough pages are reclaimed for its usage. To ensure all
> descendant groups are scanned in a round-robin fashion in those cases,
> sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the
> next cgroup that the caller can pass in as the new starting cgroup for a
> subsequent call.


AFAICT this part is new, and I believe this "round-robin" thing is just 
for the "global reclaim"?  Or is it also for per-cgroup reclaim where more
than SGX_NR_TO_SCAN pages need to be reclaimed?

I wish the changelog would just point out which consumers will use this
new sgx_cgroup_reclaim_pages(), like:

The sgx_cgroup_reclaim_pages() will be used in three cases:

 1) direct/sync per-cgroup reclaim in try_charge()
 2) indirect/async per-cgroup reclaim triggered in try_charge()
 3) global reclaim

And then describe how they will use sgx_cgroup_reclaim_pages():

Both 1) and 2) can result in needing to reclaim more than SGX_NR_TO_SCAN
pages, in which case we should .

For 3), the new global reclaim should try to match the existing global
reclaim behaviour, that is to try to treat all EPC pages equally. 


With above context, we can justify why to make sgx_cgroup_reclaim_pages()
in this form.

> 
> Note, this simple implementation doesn't _exactly_ mimic the current
> global EPC reclaim (which always tries to do the actual reclaim in batch
> of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
> reclaimable pages, the actual reclaim of EPC pages will be split into
> smaller batches _across_ multiple LRUs with each being smaller than
> SGX_NR_TO_SCAN pages.
> 
> A more precise way to mimic the current global EPC reclaim would be to
> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
> _across_ the given EPC cgroup _AND_ its descendants, and then do the
> actual reclaim in one batch.  But this is unnecessarily complicated at
> this stage.
> 
> Alternatively, the current sgx_reclaim_pages() could be changed to
> return the actual "reclaimed" pages, but not "scanned" pages. However,
> the reclamation is a lengthy process, forcing a successful reclamation
> of a predetermined number of pages may block the caller for too long. And
> that may not be acceptable in some synchronous contexts, e.g., in
> serving an ioctl().
> 
> With this building block in place, add synchronous reclamation support
> in sgx_cgroup_try_charge(): trigger a call to
> sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
> caller allows synchronous reclaim as indicated by a newly added
> parameter.
> 
> A later patch will add support for asynchronous reclamation reusing
> sgx_cgroup_reclaim_pages().

It seems you should also mention that the new global reclaim will also use 
this sgx_cgroup_reclaim_pages()?

[...]

> +/**
> + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
> + * @root:   The root of the cgroup tree to reclaim from.
> + * @start:   The descendant cgroup from which to start the tree walking.
> + *
> + * This function performs a pre-order walk in the cgroup tree under the given
> + * root, starting from the node %start, 

Re: [PATCH v15 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-06-20 Thread Huang, Kai

On 18/06/2024 12:53 am, Huang, Haitao wrote:
> From: Kristen Carlson Accardi 
> 
> Previous patches have implemented all infrastructure needed for
> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
> pages are still tracked in the global LRU as sgx_epc_page_lru() always
> returns reference to the global LRU.
> 
> Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
> given EPC page is allocated.
> 
> This makes all EPC pages tracked in per-cgroup LRUs and the global
> reclaimer (ksgxd) will not be able to reclaim any pages from the global
> LRU. However, in cases of over-committing, i.e., the sum of cgroup
> limits greater than the total capacity, cgroups may never reclaim but
> the total usage can still be near the capacity. Therefore a global
> reclamation is still needed in those cases and it should be performed
> from the root cgroup.
> 
> Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
> when cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(), return
> the next cgroup so callers can use it as the new starting node for next
> round of reclamation if needed.
> 
> Also update sgx_can_reclaim_global(), to check emptiness of LRUs of all
> cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
> 
> Finally, change sgx_reclaim_direct(), to check and ensure there are free
> pages at cgroup level so forward progress can be made by the caller.

Reading above, it's not clear how the _new_ global reclaim works with
multiple LRUs.

E.g., the current global reclaim essentially treats all EPC pages equally
when scanning those pages.  From the above, I don't see how this is
achieved in the new global reclaim.

The changelog should:

1) describe how the existing global reclaim works, and then describe
how to achieve the same behaviour in the new global reclaim which works
with multiple LRUs;

2) If there's any behaviour difference between the "existing" vs the "new"
global reclaim, the changelog should point out the difference, and explain
why such difference is OK.

> 
> With these changes, the global reclamation and per-cgroup reclamation
> both work properly with all pages tracked in per-cgroup LRUs.
> 

[...]

>   
> -static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
> +static struct misc_cg *sgx_reclaim_pages_global(struct misc_cg *next_cg,
> + struct mm_struct *charge_mm)
>   {
> + if (IS_ENABLED(CONFIG_CGROUP_MISC))
> + return sgx_cgroup_reclaim_pages(misc_cg_root(), next_cg, 
> charge_mm);
> +
>   sgx_reclaim_pages(_global_lru, charge_mm);
> + return NULL;
>   }
>   
>   /*
> @@ -414,12 +443,35 @@ static void sgx_reclaim_pages_global(struct mm_struct 
> *charge_mm)
>*/
>   void sgx_reclaim_direct(void)
>   {
> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> + struct misc_cg *cg = misc_from_sgx(sgx_cg);

From the code below, @sgx_cg could be NULL.  It's not immediately clear whether
calling misc_from_sgx(sgx_cg) unconditionally is safe here.

Leave the initialization of @cg to a later phase where @sgx_cg is
guaranteed not to be NULL, or initialize @cg to NULL here and update it later.

> + struct misc_cg *next_cg = NULL;
> +
> + /*
> +  * Make sure there are some free pages at both cgroup and global levels.
> +  * In both cases, only make one attempt of reclamation to avoid lengthy
> +  * block on the caller.
> +  */
> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg))
> + next_cg = sgx_cgroup_reclaim_pages(cg, next_cg, current->mm);

I don't quite follow the logic.

First of all, sgx_cgroup_reclaim_pages() isn't called in a loop, so why
not just do:

next_cg = sgx_cgroup_reclaim_pages(cg, NULL, current->mm);

And what is the point of setting @next_cg here, since ...


> +
> + if (next_cg != cg)
> + put_misc_cg(next_cg);
> +
> + next_cg = NULL;

... here @next_cg is reset to NULL ?

Looks like the only reason is that you need to do ...

put_misc_cg(next_cg);

... above?

This piece of code appears repeatedly in this file.  Is there any way we
can get rid of it?
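One possibility (just a sketch on top of this series, not tested): fold the
pattern into a tiny helper so each call site becomes a one-liner, e.g.:

	/* Drop the returned cgroup unless it is the one the caller must keep. */
	static void sgx_put_next_cg(struct misc_cg *next_cg, struct misc_cg *keep)
	{
		if (next_cg && next_cg != keep)
			put_misc_cg(next_cg);
	}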

>   if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
> - sgx_reclaim_pages_global(current->mm);
> + next_cg = sgx_reclaim_pages_global(next_cg, current->mm);

And this doesn't seem like "global reclaim" at all?

Because it essentially equals to:

next_cg = sgx_reclaim_pages_global(NULL, current->mm);

which always reclaims from the ROOT.

So each call to sgx_reclaim_direct() will always reclaim from the ROOT --
any other LRUs in the hierarchy will unlikely to get any chance to be
reclaimed.

> +
> + if (next_cg != misc_cg_root())
> + put_misc_cg(next_cg);
> +
> + sgx_put_cg(sgx_cg);
>   }
>   
>   static int ksgxd(void *p)
>   {
> + struct misc_cg *next_cg = NULL;
> +
>   set_freezable();
>   
>   /*
> @@ -437,11 +489,15 @@ static int ksgxd(void *p)
>   

Re: [PATCH v15 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-06-18 Thread Huang, Kai
On Tue, 2024-06-18 at 18:23 -0500, Haitao Huang wrote:
> On Tue, 18 Jun 2024 18:15:37 -0500, Huang, Kai  wrote:
> 
> > On Tue, 2024-06-18 at 07:56 -0500, Haitao Huang wrote:
> > > On Tue, 18 Jun 2024 06:31:09 -0500, Huang, Kai   
> > > wrote:
> > > 
> > > > 
> > > > > @@ -921,7 +956,8 @@ static int __init sgx_init(void)
> > > > >   if (!sgx_page_cache_init())
> > > > >   return -ENOMEM;
> > > > > 
> > > > > - if (!sgx_page_reclaimer_init()) {
> > > > > + if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
> > > > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > > > >   ret = -ENOMEM;
> > > > >   goto err_page_cache;
> > > > >   }
> > > > 
> > > > This code change is wrong due to two reasons:
> > > > 
> > > > 1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
> > > > failed, you actually need to 'goto err_kthread' because the ksgxd() kernel
> > > > thread is already created and is running.
> > > > 
> > > > 2) There are other cases after here that can also result in sgx_init() to
> > > > fail completely, e.g., registering the sgx_dev_provision misc device.  We
> > > > need to reset the capacity to 0 for those cases as well.
> > > > 
> > > > AFAICT, you need something like:
> > > > 
> > > > diff --git a/arch/x86/kernel/cpu/sgx/main.c
> > > > b/arch/x86/kernel/cpu/sgx/main.c
> > > > index 27892e57c4ef..46f9c26992a7 100644
> > > > --- a/arch/x86/kernel/cpu/sgx/main.c
> > > > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > > > @@ -930,6 +930,10 @@ static int __init sgx_init(void)
> > > > if (ret)
> > > > goto err_kthread;
> > > > +   ret = sgx_cgroup_init();
> > > > +   if (ret)
> > > > +   goto err_provision;
> > > > +
> > > > /*
> > > >  * Always try to initialize the native *and* KVM drivers.
> > > >  * The KVM driver is less picky than the native one and
> > > > @@ -941,10 +945,12 @@ static int __init sgx_init(void)
> > > > ret = sgx_drv_init();
> > > >if (sgx_vepc_init() && ret)
> > > > -   goto err_provision;
> > > > +   goto err_cgroup;
> > > >return 0;
> > > > +err_cgroup:
> > > > +   /* SGX EPC cgroup cleanup */
> > > >  err_provision:
> > > > misc_deregister(&sgx_dev_provision);
> > > > @@ -952,6 +958,8 @@ static int __init sgx_init(void)
> > > > kthread_stop(ksgxd_tsk);
> > > > err_page_cache:
> > > > +   misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > > > +
> > > > for (i = 0; i < sgx_nr_epc_sections; i++) {
> > > > vfree(sgx_epc_sections[i].pages);
> > > > memunmap(sgx_epc_sections[i].virt_addr);
> > > > 
> > > > 
> > > > I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
> > > > otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
> > > > respectively when sgx_cgroup_init() fails.
> > > > 
> > > 
> > > Yes, good catch.
> > > 
> > > > This looks a little bit weird too, though:
> > > > 
> > > > Calling misc_cg_set_capacity() to reset capacity to 0 is done at the end
> > > > of the sgx_init() error path, because the "set capacity" part is done in
> > > > sgx_page_cache_init().
> > > > But logically, both "set capacity" and "reset capacity to 0" should be SGX
> > > > EPC cgroup operations, so it's more reasonable to do "set capacity" in
> > > > sgx_cgroup_init() and do "reset to 0" in the
> > > > 
> > > > /* SGX EPC cgroup cleanup */
> > > > 
> > > > as shown above.
> > > > 
> > > > Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to free
> > > > the workqueue, so it's odd to have two places to handle EPC cgroup
> > > > cleanup.
> > > > 
> > > > I understand the reason "

Re: [PATCH v15 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-06-18 Thread Huang, Kai
On Tue, 2024-06-18 at 07:56 -0500, Haitao Huang wrote:
> On Tue, 18 Jun 2024 06:31:09 -0500, Huang, Kai  wrote:
> 
> > 
> > > @@ -921,7 +956,8 @@ static int __init sgx_init(void)
> > >   if (!sgx_page_cache_init())
> > >   return -ENOMEM;
> > > 
> > > - if (!sgx_page_reclaimer_init()) {
> > > + if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
> > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > >   ret = -ENOMEM;
> > >   goto err_page_cache;
> > >   }
> > 
> > This code change is wrong due to two reasons:
> > 
> > 1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
> > failed, you actually need to 'goto err_kthread' because the ksgxd() kernel
> > thread is already created and is running.
> > 
> > 2) There are other cases after here that can also result in sgx_init() to
> > fail completely, e.g., registering the sgx_dev_provision misc device.  We
> > need to reset the capacity to 0 for those cases as well.
> > 
> > AFAICT, you need something like:
> > 
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c
> > b/arch/x86/kernel/cpu/sgx/main.c
> > index 27892e57c4ef..46f9c26992a7 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -930,6 +930,10 @@ static int __init sgx_init(void)
> > if (ret)
> > goto err_kthread;
> > +   ret = sgx_cgroup_init();
> > +   if (ret)
> > +   goto err_provision;
> > +
> > /*
> >  * Always try to initialize the native *and* KVM drivers.
> >  * The KVM driver is less picky than the native one and
> > @@ -941,10 +945,12 @@ static int __init sgx_init(void)
> > ret = sgx_drv_init();
> >if (sgx_vepc_init() && ret)
> > -   goto err_provision;
> > +   goto err_cgroup;
> >return 0;
> > +err_cgroup:
> > +   /* SGX EPC cgroup cleanup */
> >  err_provision:
> > misc_deregister(&sgx_dev_provision);
> > @@ -952,6 +958,8 @@ static int __init sgx_init(void)
> > kthread_stop(ksgxd_tsk);
> > err_page_cache:
> > +   misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
> > +
> > for (i = 0; i < sgx_nr_epc_sections; i++) {
> > vfree(sgx_epc_sections[i].pages);
> > memunmap(sgx_epc_sections[i].virt_addr);
> > 
> > 
> > I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
> > otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
> > respectively when sgx_cgroup_init() fails.
> > 
> 
> Yes, good catch.
> 
> > This looks a little bit weird too, though:
> > 
> > Calling misc_cg_set_capacity() to reset capacity to 0 is done at the end
> > of the sgx_init() error path, because the "set capacity" part is done in
> > sgx_page_cache_init().
> > But logically, both "set capacity" and "reset capacity to 0" should be SGX
> > EPC cgroup operations, so it's more reasonable to do "set capacity" in
> > sgx_cgroup_init() and do "reset to 0" in the
> > 
> > /* SGX EPC cgroup cleanup */
> > 
> > as shown above.
> > 
> > Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to free
> > the workqueue, so it's odd to have two places to handle EPC cgroup
> > cleanup.
> > 
> > I understand the reason "set capacity" part is done in
> > sgx_page_cache_init() now is because in that function you can easily get
> > the capacity.  But the fact is @sgx_numa_nodes also tracks EPC size for
> > each node, so you can also get the total EPC size from @sgx_numa_nodes in
> > sgx_cgroup_init() and set capacity there.
> > 
> > In this case, you can put "reset capacity to 0" and "free workqueue"
> > together as the "SGX EPC cgroup cleanup", which is way more clear IMHO.
> > 
> Okay, will  expose @sgx_numa_nodes to epc_cgroup.c and do the calculations  
> in sgx_cgroup_init().
> 

Looks like you will also need to expose @sgx_numa_mask, which seems overkill.

Other options:

1) Expose a function to return total EPC pages/size in "sgx.h".

2) Move out the new 'capacity' variable in this patch as a global variable
and expose it in "sgx.h" (perhaps rename to 'sgx_total_epc_pages/size').

3) Make sgx_cgroup_init() take an argument of total EPC pages/size, and
pass it in sgx_init().

For 3) there are also options to get total EPC pages/size:

a) Move out the new 'capacity' variable in this patch as a static.

b) Add a function to calculate total EPC pages/size from sgx_numa_nodes.

Hmm.. I think we can just use option 2)?
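
Roughly, 2) would look like this -- a sketch only, with the exact variable name
and placement just a suggestion:

	/* sgx.h: expose the total as a plain global */
	extern unsigned long sgx_total_epc_pages;

	/* main.c: accumulate it while sgx_page_cache_init() walks the sections */
	unsigned long sgx_total_epc_pages;

	/* epc_cgroup.c: sgx_cgroup_init() then sets the capacity itself */
	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, sgx_total_epc_pages);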





Re: [PATCH v15 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-06-18 Thread Huang, Kai

> @@ -921,7 +956,8 @@ static int __init sgx_init(void)
>   if (!sgx_page_cache_init())
>   return -ENOMEM;
>  
> - if (!sgx_page_reclaimer_init()) {
> + if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
> + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
>   ret = -ENOMEM;
>   goto err_page_cache;
>   }

This code change is wrong due to two reasons:

1) If sgx_page_reclaimer_init() was successful, but sgx_cgroup_init()
failed, you actually need to 'goto err_kthread' because the ksgxd() kernel
thread is already created and is running.

2) There are other cases after here that can also result in sgx_init() to
fail completely, e.g., registering the sgx_dev_provision misc device.  We need
to reset the capacity to 0 for those cases as well.

AFAICT, you need something like:

diff --git a/arch/x86/kernel/cpu/sgx/main.c
b/arch/x86/kernel/cpu/sgx/main.c
index 27892e57c4ef..46f9c26992a7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -930,6 +930,10 @@ static int __init sgx_init(void)
if (ret)
goto err_kthread;
 
+   ret = sgx_cgroup_init();
+   if (ret)
+   goto err_provision;
+
/*
 * Always try to initialize the native *and* KVM drivers.
 * The KVM driver is less picky than the native one and
@@ -941,10 +945,12 @@ static int __init sgx_init(void)
ret = sgx_drv_init();
 
if (sgx_vepc_init() && ret)
-   goto err_provision;
+   goto err_cgroup;
 
return 0;
 
+err_cgroup:
+   /* SGX EPC cgroup cleanup */
 err_provision:
misc_deregister(&sgx_dev_provision);
 
@@ -952,6 +958,8 @@ static int __init sgx_init(void)
kthread_stop(ksgxd_tsk);
 
 err_page_cache:
+   misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
+
for (i = 0; i < sgx_nr_epc_sections; i++) {
vfree(sgx_epc_sections[i].pages);
memunmap(sgx_epc_sections[i].virt_addr);


I put the sgx_cgroup_init() before sgx_drv_init() and sgx_vepc_init(),
otherwise you will need sgx_drv_cleanup() and sgx_vepc_cleanup()
respectively when sgx_cgroup_init() fails.

This looks a little bit weird too, though:

Calling misc_cg_set_capacity() to reset capacity to 0 is done at the end
of the sgx_init() error path, because the "set capacity" part is done in
sgx_page_cache_init().

But logically, both "set capacity" and "reset capacity to 0" should be SGX
EPC cgroup operations, so it's more reasonable to do "set capacity" in
sgx_cgroup_init() and do "reset to 0" in the

/* SGX EPC cgroup cleanup */

as shown above.

Eventually, you will need to do EPC cgroup cleanup anyway, e.g., to free
the workqueue, so it's odd to have two places to handle EPC cgroup
cleanup.

I understand the reason "set capacity" part is done in
sgx_page_cache_init() now is because in that function you can easily get
the capacity.  But the fact is @sgx_numa_nodes also tracks EPC size for
each node, so you can also get the total EPC size from @sgx_numa_node in
sgx_cgroup_init() and set capacity there.

In this case, you can put "reset capacity to 0" and "free workqueue"
together as the "SGX EPC cgroup cleanup", which is way more clear IMHO.



Re: [PATCH v14 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-06-11 Thread Huang, Kai
On Tue, 2024-06-11 at 07:57 -0500, Haitao Huang wrote:
> On Mon, 10 Jun 2024 17:39:53 -0500, Huang, Kai  wrote:
> 
> > 
> > > --- a/arch/x86/kernel/cpu/sgx/main.c
> > > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > > @@ -1045,7 +1045,7 @@ static int __init sgx_init(void)
> > >if (!sgx_page_cache_init())
> > >return -ENOMEM;
> > >  -if (!sgx_page_reclaimer_init()) {
> > > +if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
> > >ret = -ENOMEM;
> > >goto err_page_cache;
> > >}
> > 
> > Does it make more sense to move the sgx_cgroup_init() to the  
> > sgx_drv_init()?  The SGX cgroup only works for the driver side anyway.  
> > In this case, if something went wrong in sgx_cgroup_init(), the  
> > sgx_vepc_init() could still have a chance to work.
> > 
> 
> vepc reclamation is not done by cgroup/ksgxd, but try_charge() won't work
> if the user expects the cgroup to limit vepc allocation.
> 

Oh ok.

> Would it be more consistent to just disable vepc, i.e., on a system with MISC,
> sgx/vepc always go with cgroup enabled?
> 

Yes fine to me.
> 


Re: [PATCH v14 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-06-10 Thread Huang, Kai




--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -1045,7 +1045,7 @@ static int __init sgx_init(void)
   if (!sgx_page_cache_init())
   return -ENOMEM;

-if (!sgx_page_reclaimer_init()) {
+if (!sgx_page_reclaimer_init() || !sgx_cgroup_init()) {
   ret = -ENOMEM;
   goto err_page_cache;
   }


Does it make more sense to move the sgx_cgroup_init() to the 
sgx_drv_init()?  The SGX cgroup only works for the driver side anyway. 
In this case, if something went wrong in sgx_cgroup_init(), the 
sgx_vepc_init() could still have a chance to work.


And IIUC we need to reset the "capacity" to 0 if sgx_cgroup_init() 
fails, no matter it is called inside sgx_drv_init() or sgx_init(), 
otherwise the "epc" would appear in the cgroup hierarchy as a misc 
cgroup resource.


Another option is to defer setting the capacity to the point where we 
have made sure sgx_drv_init() and sgx_cgroup_init() cannot fail.


Btw, I plan to review the rest from late of this week or next week 
because this week I have some other staff needs to be finished first.




Re: [PATCH v14 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-06-05 Thread Huang, Kai




Reorg:

void sgx_cgroup_init(void)
{
struct workqueue_struct *wq;

/* eagerly allocate the workqueue: */
wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE,
		     WQ_UNBOUND_MAX_ACTIVE);

if (!wq) {
    pr_warn("sgx_cg_wq creation failed\n");
    return;


sgx_cgroup_try_charge() expects sgx_cg_wq, so it would break unless we 
check and return 0, which was what v12 initially implemented. But then 
Kai had some concern that we would expose all the interface files to allow 
the user to set limits but not enforce them. To keep it simple we settled 
back on BUG_ON().
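For reference, that v12-style fallback in sgx_cgroup_try_charge() was roughly
the following (a sketch; the exact shape in v12 may have differed):

	/* cgroup infra failed to initialize: don't enforce any limit */
	if (!sgx_cg_wq)
		return 0;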


[...]

This would only happen rarely, and the user can add a kernel 
command-line option to disable SGX if they really want to boot the kernel 
in this case; they just can't do SGX.


Just to be clear, I don't like BUG_ON() either.  It's inevitable that you 
will get attention for using it.


This is a compromise because you don't want to reset "capacity" to 0 when 
alloc_workqueue() fails.


There are existing code where BUG_ON() is used during the kernel early 
boot code when memory allocation fails (e.g., see cgroup_init_subsys()), 
so it might be acceptable to use BUG_ON() here, but it's up to 
maintainers to decide whether it is OK.


[...]


With "--strict" flag I also catched these:

CHECK: spinlock_t definition without comment
#1308: FILE: arch/x86/kernel/cpu/sgx/sgx.h:122:
+    spinlock_t lock;

Yes, I had a comment but Kai thought it was too obvious, and I can't think 
of a better one that's not obvious, so I removed it:


https://lore.kernel.org/linux-sgx/9ffb02a3344807f2c173fe8c7cb000cd6c7843b6.ca...@intel.com/

To be clear, my reply was really about the comment itself not being useful; 
I didn't say we shouldn't use a comment here.





CHECK: multiple assignments should be avoided
#444: FILE: kernel/cgroup/misc.c:450:
+    parent_cg = cg = &root_cg;



This was also suggested by Kai a few versions back:
https://lore.kernel.org/linux-sgx/8f08f0b0f2b04b90d7cdb7b628f16f9080687c43.ca...@intel.com/



I didn't know checkpatch complains about this.  Feel free to revert it as it 
is trivial to me.




Re: [PATCH v13 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-05-06 Thread Huang, Kai




On 1/05/2024 7:51 am, Haitao Huang wrote:
  
  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)

  {
-   sgx_reclaim_pages(&sgx_global_lru, charge_mm);
+   if (IS_ENABLED(CONFIG_CGROUP_MISC))
+   sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
+   else
+   sgx_reclaim_pages(&sgx_global_lru, charge_mm);
  }
  


I think we have a problem here when we do global reclaim starting from 
the ROOT cgroup:


This function will mostly just try to reclaim from the ROOT cgroup, 
but won't reclaim from the descendants.


The reason is the sgx_cgroup_reclaim_pages() will simply return after 
"scanning" SGX_NR_TO_SCAN (16) pages w/o going to the descendants, and 
the "scanning" here simply means "removing the EPC page from the 
cgroup's LRU list".


So as long as the ROOT cgroup LRU contains more than SGX_NR_TO_SCAN (16) 
pages, effectively sgx_cgroup_reclaim_pages() will just scan and return 
w/o going into the descendants.  Having 16 EPC pages should be an "almost
always true" case I suppose.


When the sgx_reclaim_pages_global() is called again, we will start from 
the ROOT again.


That means this doesn't truly reclaim "from global" at all.

IMHO the behaviour of sgx_cgroup_reclaim_pages() is OK for per-cgroup 
reclaim because I think in this case our intention is that we should try our best
to reclaim from the cgroup, i.e., whether we can reclaim from 
descendants doesn't matter.


But for global reclaim this doesn't work.

Am I missing anything?



Re: [PATCH v13 11/14] x86/sgx: Abstract check for global reclaimable pages

2024-05-02 Thread Huang, Kai




On 1/05/2024 7:51 am, Haitao Huang wrote:

From: Kristen Carlson Accardi 

For the global reclaimer to determine if any page is available for
reclamation at the global level, it currently only checks for emptiness
of the global LRU. That will be inadequate when pages are tracked in
multiple LRUs, one per cgroup. For this purpose, create a new helper,
sgx_can_reclaim_global(), to abstract this check. Currently it only
checks the global LRU; later it will check emptiness of the LRUs of all cgroups
when per-cgroup tracking is turned on.

Replace all the checks for emptiness of the global LRU,
list_empty(&sgx_global_lru.reclaimable), with calls to
sgx_can_reclaim_global().

Rename sgx_should_reclaim() to sgx_should_reclaim_global() as it is used
to check if a global reclamation should be performed.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Jarkko Sakkinen 
---


Feel free to add:

Reviewed-by: Kai Huang 

One thing below:

[...]


-static bool sgx_should_reclaim(unsigned long watermark)
+static bool sgx_should_reclaim_global(unsigned long watermark)
  {
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_global_lru.reclaimable);
+   sgx_can_reclaim_global();
  }
  
  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)

@@ -405,7 +413,7 @@ static void sgx_reclaim_pages_global(struct mm_struct 
*charge_mm)
   */
  void sgx_reclaim_direct(void)
  {
-   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+   if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
sgx_reclaim_pages_global(current->mm);
  }
  


Hmm.. Sorry for not pointing this out in the previous version.

Perhaps it makes more sense to do the rename in the patch ...

  x86/sgx: Add basic EPC reclamation flow for cgroup

... where we have actually introduced the concept of per-cgroup reclaim, 
and we literally have the below change in that patch:


 void sgx_reclaim_direct(void)
 {
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-   sgx_reclaim_pages();
+   sgx_reclaim_pages_global();
 }

So in that patch, the sgx_should_reclaim() literally just means we 
should do global reclaim, but not the per-cgroup reclaim.  Thus, perhaps 
we just do the renaming here together with the new 
sgx_reclaim_pages_global().


If a new version is needed, please move the renaming to that patch?



Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-29 Thread Huang, Kai




 /*
@@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)

  */
 static inline bool sgx_can_reclaim(void)
 {
-    return !list_empty(&sgx_global_lru.reclaimable);
+    return !sgx_cgroup_lru_empty(misc_cg_root()) ||
+   !list_empty(&sgx_global_lru.reclaimable);
 }


Shouldn't this be:

if (IS_ENABLED(CONFIG_CGROUP_MISC))
    return !sgx_cgroup_lru_empty(misc_cg_root());
else
    return !list_empty(&sgx_global_lru.reclaimable);
?

In this way, it is consistent with the sgx_reclaim_pages_global() below.



I changed it to this way because sgx_cgroup_lru_empty() is now defined in 
both Kconfig cases.
And it seems better to minimize use of the Kconfig variables based on 
earlier feedback (some of it yours).
Don't really have a strong preference here, so let me know one way or the 
other.




But IMHO your code could be confusing, e.g., it can be interpreted as:

  The EPC pages can be managed by both the cgroup LRUs and the
  sgx_global_lru simultaneously at runtime when CONFIG_CGROUP_MISC
  is on.

Which is not true.

So we should make the code clearly reflect the true behaviour.



Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-29 Thread Huang, Kai

> +/*
> + * Get the per-cgroup or global LRU list that tracks the given reclaimable 
> page.
> + */
>  static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page 
> *epc_page)
>  {
> +#ifdef CONFIG_CGROUP_MISC
> + /*
> +  * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
> +  * life between sgx_alloc_epc_page() and sgx_free_epc_page():
> +  *
> +  * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
> +  * sgx_get_current_cg() which is the misc cgroup of the current task, or
> +  * the root by default even if the misc cgroup is disabled by kernel
> +  * command line.
> +  *
> +  * epc_page->sgx_cg is only unset by sgx_free_epc_page().
> +  *
> +  * This function is never used before sgx_alloc_epc_page() or after
> +  * sgx_free_epc_page().
> +  */
> + return &epc_page->sgx_cg->lru;
> +#else
>   return &sgx_global_lru;
> +#endif
>  }
>  
>  /*
> @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct 
> sgx_epc_page *epc_pag
>   */
>  static inline bool sgx_can_reclaim(void)
>  {
> - return !list_empty(&sgx_global_lru.reclaimable);
> + return !sgx_cgroup_lru_empty(misc_cg_root()) ||
> +!list_empty(&sgx_global_lru.reclaimable);
>  }

Shouldn't this be:

if (IS_ENABLED(CONFIG_CGROUP_MISC))
return !sgx_cgroup_lru_empty(misc_cg_root());
else
return !list_empty(&sgx_global_lru.reclaimable);
?

In this way, it is consistent with the sgx_reclaim_pages_global() below.

>  
>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
> @@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  
>  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>  {
> - sgx_reclaim_pages(&sgx_global_lru, charge_mm);
> + if (IS_ENABLED(CONFIG_CGROUP_MISC))
> + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
> + else
> + sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>  }
>  
>  /*
> @@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct mm_struct 
> *charge_mm)
>   */
>  void sgx_reclaim_direct(void)
>  {
> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +
> + /* Make sure there are some free pages at cgroup level */
> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
> + sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm);
> + sgx_put_cg(sgx_cg);
> + }

Empty line.

> + /* Make sure there are some free pages at global level */
>   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))

Looking at the code, to me sgx_should_reclaim() is a little bit vague
because from the name we don't know whether it internally checks the
current cgroup or the global.  

It's better to rename to sgx_should_reclaim_global().

Ditto for sgx_can_reclaim() -> sgx_can_reclaim_global().

And I think you can do the renaming in the previous patch, because in the
changelog of your previous patch, it seems you have called out that the two
functions are for global reclaim.

>   sgx_reclaim_pages_global(current->mm);
>  }
> @@ -616,6 +649,12 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, 
> enum sgx_reclaim reclaim)
>   break;
>   }
>  
> + /*
> +  * At this point, the usage within this cgroup is under its
> +  * limit but there is no physical page left for allocation.
> +  * Perform a global reclaim to get some pages released from any
> +  * cgroup with reclaimable pages.
> +  */
>   sgx_reclaim_pages_global(current->mm);
>   cond_resched();
>   }



Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-23 Thread Huang, Kai
On Tue, 2024-04-23 at 19:26 -0500, Haitao Huang wrote:
> On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai  wrote:
> 
> > On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
> > > > > It's a workaround because you use the capacity==0 but it does not really
> > > > > mean to disable the misc cgroup for a specific resource IIUC.
> > > > 
> > > > Please read the comment around @misc_res_capacity again:
> > > > 
> > > >   * Miscellaneous resources capacity for the entire machine. 0 capacity
> > > >   * means resource is not initialized or not present in the host.
> > > > 
> > > 
> > > I mentioned this in an earlier email. I think this means no SGX EPC. It does
> > > not mean the sgx epc cgroup is not enabled. That's also consistent with the
> > > behavior that try_charge() fails if capacity is zero.
> > 
> > OK. To me the "capacity" is purely the concept of cgroup, so it must be
> > interpreted within the scope of "cgroup".  If cgroup, in our case, SGX
> > cgroup, is disabled, then whether "leaving the capacity to reflect the
> > presence of hardware resource" doesn't really matter.
> > So what you are saying is that the kernel must initialize the capacity of
> > some MISC resource once it is added as a valid type.
> > And you must initialize the "capacity" even if the MISC cgroup is disabled
> > entirely by the kernel command line, in which case, IIUC, misc.capacity is not
> > even going to show in the /fs.
> > 
> > If this is your point, then your patch:
> > 
> > cgroup/misc: Add SGX EPC resource type
> > 
> > is already broken, because you added the new type w/o initializing the
> > capacity.
> > 
> > Please fix that up.
> > 
> > > 
> > > > > 
> > > > > There is an explicit way for the user to disable misc without setting
> > > > > capacity to zero.
> > > > 
> > > > Which way are you talking about?
> > > 
> > > Echo "-misc" to cgroup.subtree_control at root level for example still 
> > > shows non-zero sgx_epc capacity.
> > 
> > I guess "having to disable all MISC resources just in order to disable  
> > SGX
> > EPC cgroup" is a brilliant idea.
> > 
> > You can easily disable the entire MISC cgroup by commandline for that
> > purpose if that's acceptable.
> > 
> > And I have no idea why "still showing non-zero EPC capacity" is important
> > if SGX cgroup cannot be supported at all.
> > 
> 
> Okay, all I'm trying to say is we should care about consistency in the code
> and don't want SGX to do something different. Mixing "disable" with
> "capacity==0" causes inconsistencies AFAICS:
> 
> 1) The try_charge() API currently returns an error when capacity is zero. So
> it appears not to mean that the cgroup is disabled; otherwise it should
> return success.

I agree this isn't ideal.  My view is we can fix it in MISC code if
needed.

> 
> 2) The current explicit way ("-misc") to disable misc still shows non-zero  
> entries in misc.capacity. (At least for v2 cgroup, it does when I tested).  
> Maybe this is not important but I just don't feel good about this  
> inconsistency.

This belongs to "MISC resource cgroup was initially enabled by the kernel
at boot time, but later was disabled *somewhere in hierarchy* by the
user".

In fact, if you only do "-misc" for "some subtree", it's quite reasonable
to still report the resource in max.capacity.  In the case above, the
"some subtree" happens to be the root.

So to me it's reasonable to still show max.capacity in this case.  And you
can actually argue that the kernel still supports the cgroup for the
resource.  E.g., you can at runtime do "+misc" to re-enable.

However, if the kernel isn't able to support certain MISC resource cgroup
at boot time, it's quite reasonable to just set the "capacity" to 0 so it
isn't visible to userspace.

Note:

My key point is, when userspace sees 0 "capacity", it shouldn't need to
care about whether it is because of "hardware resource is not available",
or "hardware resource is available but kernel cannot support cgroup for
it".  The resource cgroup is simply unavailable.

That means the kernel has full right to just hide that resource from the
cgroup at boot time.

But this should be just within "cgroup's scope", i.e., that resource can
still be available if kernel can provide it.  If some app wants to
additionally check whether such resource is indeed available but only
cgroup is not available, it should check resource specific interface but
not take advantage of the MISC cgroup interface.

> 
> For now I'll just do BUG_ON() unless there are more strong opinions one  
> way or the other.
> 

Fine to me.



Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-23 Thread Huang, Kai
On Wed, 2024-04-24 at 00:27 +0300, Jarkko Sakkinen wrote:
> On Tue Apr 23, 2024 at 8:08 PM EEST, Reinette Chatre wrote:
> > Hi Kai,
> > 
> > On 4/23/2024 4:50 AM, Huang, Kai wrote:
> > > > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c 
> > > > b/arch/x86/kernel/cpu/sgx/ioctl.c
> > > > index b65ab214bdf5..2340a82fa796 100644
> > > > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > > > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > > > @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl 
> > > > *encl,
> > > > }
> > > >  
> > > > mutex_unlock(&encl->lock);
> > > > +
> > > > +   if (need_resched())
> > > > +   cond_resched();
> > > > }
> > > >  
> > > > ret = 0;
> > > > @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct 
> > > > sgx_encl *encl,
> > > > entry->type = page_type;
> > > >  
> > > > mutex_unlock(&encl->lock);
> > > > +
> > > > +   if (need_resched())
> > > > +   cond_resched();
> > > > }
> > > >  
> > > > ret = 0;
> > > > @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl 
> > > > *encl,
> > > > kfree(entry);
> > > >  
> > > > mutex_unlock(&encl->lock);
> > > > +
> > > > +   if (need_resched())
> > > > +   cond_resched();
> > > > }
> > > > 
> > > 
> > > You can remove the need_resched() in all 3 places above but just call
> > > cond_resched() directly.
> > > 
> > 
> > This change will call cond_resched() after dealing with each page in a
> > potentially large page range (cover mentions 30GB but we have also had to
> > make optimizations for enclaves larger than this). Adding a cond_resched()
> > here will surely placate the soft lockup detector, but we need to take care
> > how changes like this impact the performance of the system and having 
> > actions
> > on these page ranges take much longer than necessary.
> > For reference, please see 7b72c823ddf8 ("x86/sgx: Reduce delay and 
> > interference
> > of enclave release") that turned frequent cond_resched() into batches
> > to address performance issues.

Ah I didn't know this.  Thanks for the info.

> > 
> > It looks to me like the need_resched() may be a quick check that can be used
> > to improve performance? 
> > 

Perhaps?  My assumption is that eventually cond_resched() will do a similar check
of need_resched(), but I am not entirely sure about that.

Reading the code, it seems cond_resched() eventually does
should_resched().  The generic version indeed does similar check of
need_resched() but it seems the x86 version has a different
implementation.

> > I am not familiar with all use cases that need to be
> > considered to determine if a batching solution may be needed.

It looks like at least the REMOVE_PAGES ioctl() could have the same impact as the
performance degradation problem mentioned in commit 7b72c823ddf8 ("x86/sgx:
Reduce delay and interference of enclave release"), but I guess it's
acceptable to fix the softlockup first and then improve performance if
someone hits a real issue.
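If that ever shows up as a real problem, a batched variant in the spirit of
commit 7b72c823ddf8 would keep the overhead down -- just a sketch, with 'count'
being a local counter added to the loop and the batch size picked arbitrarily:

		/* Yield every 16 pages rather than after each one. */
		if (!(++count & 15))
			cond_resched();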

> 
> Ya, well, no matter what it is, the reasoning will need to be documented
> because this should have symmetry with sgx_ioc_enclave_add_pages()
> (see my response to Kai).

Yeah I was actually going to mention this, but somehow I didn't choose to.

> 
> Because this makes dealing with need_resched() a change in the code
> even if it is left out as a side-effect, I'd support not removing it,
> which means keeping need_resched() as a side-effect.

I am fine with keeping the need_resched().


Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-23 Thread Huang, Kai
On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
> > > It's a workaround because you use the capacity==0 but it does not really
> > > mean to disable the misc cgroup for a specific resource IIUC.
> > 
> > Please read the comment around @misc_res_capacity again:
> > 
> >   * Miscellaneous resources capacity for the entire machine. 0 capacity
> >   * means resource is not initialized or not present in the host.
> > 
> 
> I mentioned this in an earlier email. I think this means no SGX EPC. It does
> not mean the sgx epc cgroup is not enabled. That's also consistent with the
> behavior that try_charge() fails if capacity is zero.

OK. To me the "capacity" is purely the concept of cgroup, so it must be
interpreted within the scope of "cgroup".  If cgroup, in our case, SGX
cgroup, is disabled, then whether "leaving the capacity to reflect the
presence of hardware resource" doesn't really matter. 

So what you are saying is that the kernel must initialize the capacity of
some MISC resource once it is added as a valid type.

And you must initialize the "capacity" even if the MISC cgroup is disabled
entirely by the kernel command line, in which case, IIUC, misc.capacity is not
even going to show in the /fs.

If this is your point, then your patch:

cgroup/misc: Add SGX EPC resource type

is already broken, because you added the new type w/o initializing the
capacity.

Please fix that up.

> 
> > > 
> > > There is an explicit way for the user to disable misc without setting
> > > capacity to zero.
> > 
> > Which way are you talking about?
> 
> Echo "-misc" to cgroup.subtree_control at root level for example still  
> shows non-zero sgx_epc capacity.

I guess "having to disable all MISC resources just in order to disable SGX
EPC cgroup" is a brilliant idea.

You can easily disable the entire MISC cgroup by commandline for that
purpose if that's acceptable.

And I have no idea why "still showing non-zero EPC capacity" is important
if SGX cgroup cannot be supported at all. 



Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-23 Thread Huang, Kai
On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote:
> On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai  wrote:
> 
> > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
> > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai   
> > > wrote:
> > > 
> > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > > > > > I think we can add support for "sgx_cgroup=disabled" in future if
> > > > > > > indeed needed. But just for init failure, no?
> > > > > > > 
> > > > > > 
> > > > > > It's not about the commandline, which we can add in the future when
> > > > > > needed.  It's about we need to have a way to handle SGX cgroup being
> > > > > > disabled at boot time nicely, because we already have a case where we
> > > > > > need to do so.
> > > > > > 
> > > > > > Your approach looks half-way to me, and is not future extendible.  If we
> > > > > > choose to do it, do it right -- that is, we need a way to disable it
> > > > > > completely in both kernel and userspace so that userspace won't be able to
> > > > > > see it.
> > > > > 
> > > > > That would need more changes in the misc cgroup implementation to support
> > > > > sgx-disable. Right now misc does not have separate files for different
> > > > > resource types. So we can only block echo "sgx_epc..." to those
> > > > > interface files, can't really make files not visible.
> > > > 
> > > > "won't be able to see" I mean "only for SGX EPC resource", but not the
> > > > control files for the entire MISC cgroup.
> > > > 
> > > > I replied at the beginning of the previous reply:
> > > > 
> > > > "
> > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot just
> > > > disable MISC cgroup as a whole.
> > > > "
> > > > 
> > > Sorry I missed this point. below.
> > > 
> > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.  See
> > > > the comment of @misc_res_capacity:
> > > > 
> > > >  * Miscellaneous resources capacity for the entire machine. 0 capacity
> > > >  * means resource is not initialized or not present in the host.
> > > > 
> > > 
> > > IIUC I don't think the situation we have is either of those cases. For our
> > > case, the resource is initialized and present on the host but we have an
> > > allocation error for the sgx cgroup infra.
> > 
> > You have calculated the "capacity", but later you failed something and
> > then reset the "capacity" to 0, i.e., cleanup.  What's wrong with that?
> > 
> > > 
> > > > And "blocking echo sgx_epc ... to those control files" is already
> > > > sufficient for the purpose of not exposing SGX EPC to userspace,  
> > > correct?
> > > > 
> > > > E.g., if SGX cgroup is enabled, you can see below when you read "max":
> > > > 
> > > >  # cat /sys/fs/cgroup/my_group/misc.max
> > > >  #  
> > > >sgx_epc ...
> > > >...
> > > > 
> > > > Otherwise you won't be able to see "sgx_epc":
> > > > 
> > > >  # cat /sys/fs/cgroup/my_group/misc.max
> > > >  #  
> > > >...
> > > > 
> > > > And when you try to write the "max" for "sgx_epc", you will hit error:
> > > > 
> > > >  # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
> > > >  # ... echo: write error: Invalid argument
> > > > 
> > > > The above applies to all the control files.  To me this is pretty much
> > > > means "SGX EPC is disabled" or "not supported" for userspace.
> > > > 
> > > You are right, capacity == 0 does block echoing max and users see an  
> > > error
> > > > > if they do that. But 1) I doubt you literally wanted "SGX EPC is  
> > > disabled"
> > > > > and make it unsupported in this case,

Re: [RFC PATCH 1/1] x86/sgx: Explicitly give up the CPU in EDMM's ioctl() to avoid softlockup

2024-04-23 Thread Huang, Kai
On Tue, 2024-04-23 at 17:25 +0800, 朱伯君(杰铭) wrote:
> EDMM's ioctl()s support batch operations, which may be
> time-consuming. Try to explicitly give up the CPU at
> the end of each iteration of the "for loop" in
> sgx_enclave_{ modify_types | restrict_permissions | remove_pages}
> to give other tasks a chance to run, and avoid a softlockup warning.
> 
> The following has been observed on Linux v6.9-rc5 with kernel
> preemption disabled (by configuring "PREEMPT_NONE=y"), when the kernel
> is requested to restrict page permissions of a large number of EPC pages.
> 
> [ cut here ]
> watchdog: BUG: soft lockup - CPU#45 stuck for 22s! [occlum-run:3905]
> ...
> CPU: 45 PID: 3905 Comm: occlum-run Not tainted 6.9.0-rc5 #7
> ...
> RIP: 0010:sgx_enclave_restrict_permissions+0xba/0x1f0
> Code: 48 c1 e6 05 48 89 d1 48 8d 5c 24 40 b8 0e 00 00 00 48 2b 8e 70 8e 
> f5 93 48 c1 e9 05 48 c1 e1 0c 48 03 8e 68 8e f5 93 0f 01 cf  00 00 00 40 
> 0f 85 b2 00 00 00 85 c0 0f 85 db 00 00 00 4c 89 ef
> RSP: 0018:b55a6591fa80 EFLAGS: 0202
> RAX:  RBX: b55a6591fac0 RCX: b581e7384000
> RDX: b59a9e4e8080 RSI: 0020 RDI: 91d69e8cc000
> RBP: b55a6591fb70 R08: 0002 R09: 91d646e12be0
> R10: 006e R11: 0002 R12: 00072052d000
> R13: 91d69e8cc000 R14: b55a6591fbd8 R15: 91d69e8cc020
> FS:  7fe10dbda740() GS:92163e48() 
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fc041811000 CR3: 0040d95c8005 CR4: 00770ef0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe07f0 DR7: 0400
> PKRU: 5554
> Call Trace:
>  
>  ? show_regs+0x67/0x70
>  ? watchdog_timer_fn+0x1f3/0x280
>  ? __pfx_watchdog_timer_fn+0x10/0x10
>  ? __hrtimer_run_queues+0xc8/0x220
>  ? hrtimer_interrupt+0x10c/0x250
>  ? __sysvec_apic_timer_interrupt+0x53/0x130
>  ? sysvec_apic_timer_interrupt+0x7b/0x90
>  
>  
>  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
>  ? sgx_enclave_restrict_permissions+0xba/0x1f0
>  ? __pte_offset_map_lock+0x94/0x110
>  ? sgx_encl_test_and_clear_young_cb+0x40/0x60
>  sgx_ioctl+0x1ab/0x900
>  ? do_syscall_64+0x79/0x110
>  ? apply_to_page_range+0x14/0x20
>  ? sgx_encl_test_and_clear_young+0x6c/0x80
>  ? sgx_vma_fault+0x132/0x4f0
>  __x64_sys_ioctl+0x95/0xd0
>  x64_sys_call+0x1209/0x20c0
>  do_syscall_64+0x6d/0x110
>  ? do_syscall_64+0x79/0x110
>  ? do_pte_missing+0x2e8/0xcc0
>  ? __pte_offset_map+0x1c/0x190
>  ? __handle_mm_fault+0x7b9/0xe60
>  ? __count_memcg_events+0x70/0x100
>  ? handle_mm_fault+0x256/0x360
>  ? do_user_addr_fault+0x3c1/0x860
>  ? irqentry_exit_to_user_mode+0x67/0x190
>  ? irqentry_exit+0x3b/0x50
>  ? exc_page_fault+0x89/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fe10e2ee5cb
> Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff 
> ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff 
> ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48
> RSP: 002b:7fffb2c75518 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 00078000 RCX: 7fe10e2ee5cb
> RDX: 7fffb2c75520 RSI: c028a405 RDI: 0005
> RBP: 0005 R08:  R09: 7fffb2c75594
> R10: 7fffb2c755c8 R11: 0246 R12: c028a405
> R13: 7fffb2c75520 R14: 00078000 R15: 7fe10e1a7980
>  
>  [ end trace ]

Could you trim down the trace to only include the relevant part?

E.g., please at least remove the two register dumps at the beginning and
end of the trace.

Please refer to "Backtraces in commit messages" section in
Documentation/process/submitting-patches.rst.

> 
> Signed-off-by: Bojun Zhu 
> ---
>  arch/x86/kernel/cpu/sgx/ioctl.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..2340a82fa796 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -806,6 +806,9 @@ sgx_enclave_restrict_permissions(struct sgx_encl *encl,
>   }
>  
>   mutex_unlock(&encl->lock);
> +
> + if (need_resched())
> + cond_resched();
>   }
>  
>   ret = 0;
> @@ -1010,6 +1013,9 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> *encl,
>   entry->type = page_type;
>  
>   mutex_unlock(&encl->lock);
> +
> + if (need_resched())
> + cond_resched();
>   }
>  
>   ret = 0;
> @@ -1156,6 +1162,9 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>   

Re: [PATCH v12 13/14] Docs/x86/sgx: Add description for cgroup support

2024-04-23 Thread Huang, Kai
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> Add initial documentation of how to regulate the distribution of
> SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
> controller.
> 
> 

Acked-by: Kai Huang 


Re: [PATCH v12 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-04-23 Thread Huang, Kai
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> Enclave Page Cache (EPC) memory can be swapped out to regular system
> memory, and the consumed memory should be charged to a proper
> mem_cgroup. Currently the selection of mem_cgroup to charge is done in
> sgx_encl_get_mem_cgroup(). But it considers all contexts other than the
> ksgxd thread are user processes. With the new EPC cgroup implementation,
> the swapping can also happen in EPC cgroup work-queue threads. In those
> cases, it improperly selects the root mem_cgroup to charge for the RAM
> usage.
> 
> Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
> an additional argument to explicitly specify the mm struct to charge for
> allocations. Callers from background kthreads not associated with a
> charging mm struct would set it to NULL, while callers in user process
> contexts set it to current->mm.
> 
> Internally, it handles the case when the charging mm given is NULL, by
> searching for an mm struct from enclave's mm_list.
> 
> Signed-off-by: Haitao Huang 
> Reported-by: Mikko Ylinen 
> Tested-by: Mikko Ylinen 
> Tested-by: Jarkko Sakkinen 
> 

Reviewed-by: Kai Huang 


Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-22 Thread Huang, Kai
On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
> On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai  wrote:
> 
> > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > > > I think we can add support for "sgx_cgroup=disabled" in future if  
> > > indeed
> > > > > needed. But just for init failure, no?
> > > > > 
> > > > 
> > > > It's not about the commandline, which we can add in the future when
> > > > needed.  It's about we need to have a way to handle SGX cgroup being
> > > > disabled at boot time nicely, because we already have a case where we 
> > > > need
> > > > to do so.
> > > > 
> > > > Your approach looks half-way to me, and is not future extendible.  If  
> > > we
> > > > choose to do it, do it right -- that is, we need a way to disable it
> > > > completely in both kernel and userspace so that userspace won't be  
> > > able> to
> > > > see it.
> > > 
> > > That would need more changes in misc cgroup implementation to support 
> > > sgx-disable. Right now misc does not have separate files for different 
> > > resource types. So we can only block echo "sgx_epc..." to those  
> > > interface files, can't really make files not visible.
> > 
> > "won't be able to see" I mean "only for SGX EPC resource", but not the
> > control files for the entire MISC cgroup.
> > 
> > I replied at the beginning of the previous reply:
> > 
> > "
> > Given SGX EPC is just one type of MISC cgroup resources, we cannot just
> > disable MISC cgroup as a whole.
> > "
> > 
> Sorry I missed this point. below.
> 
> > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.  See
> > the comment of @misc_res_capacity:
> > 
> >  * Miscellaneous resources capacity for the entire machine. 0 capacity
> >  * means resource is not initialized or not present in the host.
> > 
> 
> IIUC I don't think the situation we have is either of those cases. For our  
> case, resource is inited and present on the host but we have allocation  
> error for sgx cgroup infra.

You have calculated the "capacity", but later you failed something and
then reset the "capacity" to 0, i.e., cleanup.  What's wrong with that?
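For concreteness, a minimal sketch of that kind of cleanup -- misc_cg_set_capacity()
is the existing API from this series, but the wrapper name and the exact place it
is called from are assumptions here, not code from the series:

static void sgx_cgroup_disable(void)
{
	/*
	 * Resetting the capacity to 0 hides "sgx_epc" from the misc cgroup
	 * control files and makes further "max" writes fail with -EINVAL.
	 */
	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
}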

> 
> > And "blocking echo sgx_epc ... to those control files" is already
> > sufficient for the purpose of not exposing SGX EPC to userspace, correct?
> > 
> > E.g., if SGX cgroup is enabled, you can see below when you read "max":
> > 
> >  # cat /sys/fs/cgroup/my_group/misc.max
> >  #  
> >sgx_epc ...
> >...
> > 
> > Otherwise you won't be able to see "sgx_epc":
> > 
> >  # cat /sys/fs/cgroup/my_group/misc.max
> >  #  
> >...
> > 
> > And when you try to write the "max" for "sgx_epc", you will hit error:
> > 
> >  # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
> >  # ... echo: write error: Invalid argument
> > 
> > The above applies to all the control files.  To me this is pretty much
> > means "SGX EPC is disabled" or "not supported" for userspace.
> > 
> You are right, capacity == 0 does block echoing max and users see an error  
> if they do that. But 1) I doubt you literally wanted "SGX EPC is disabled"  
> and make it unsupported in this case, 
> 

I don't understand.  Something failed during SGX cgroup initialization,
you _literally_ cannot continue to support it.


> 2) even if we accept this as "sgx  
> cgroup disabled", I don't see how it is a much better user experience than  
> the current solution or really helps users.

In your way, the userspace is still able to see "sgx_epc" in control files
and is able to update them.  So from userspace's perspective SGX cgroup is
enabled, but obviously updating to "max" doesn't have any impact.  This
will confuse userspace.

> 
> Also, to implement this approach, as you mentioned, we need to work around the  
> fact that misc_try_charge() fails when capacity is set to zero, and add  
> code to always return the root? 
> 

Why is this a problem?

> So it seems like more workaround code just to  
> make it work for a failing case no one really cares much about, and the end result is  
> not really much better IMHO.

It's not workaround, it's the right thing to do.

The result is userspace will see it being disabled when kernel disables
it.




Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-21 Thread Huang, Kai
On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > I think we can add support for "sgx_cgroup=disabled" in future if indeed
> > > needed. But just for init failure, no?
> > > 
> > 
> > It's not about the commandline, which we can add in the future when
> > needed.  It's about we need to have a way to handle SGX cgroup being
> > disabled at boot time nicely, because we already have a case where we  
> > need
> > to do so.
> > 
> > Your approach looks half-way to me, and is not future extendible.  If we
> > choose to do it, do it right -- that is, we need a way to disable it
> > completely in both kernel and userspace so that userspace won't be able  
> > to
> > see it.
> 
> That would need more changes in misc cgroup implementation to support  
> sgx-disable. Right now misc does not have separate files for different  
> resource types. So we can only block echo "sgx_epc..." to those interface  
> files, can't really make files not visible.

"won't be able to see" I mean "only for SGX EPC resource", but not the
control files for the entire MISC cgroup.

I replied at the beginning of the previous reply:

"
Given SGX EPC is just one type of MISC cgroup resources, we cannot just 
disable MISC cgroup as a whole.
"

You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.  See
the comment of @misc_res_capacity:

 * Miscellaneous resources capacity for the entire machine. 0 capacity
 * means resource is not initialized or not present in the host.

And "blocking echo sgx_epc ... to those control files" is already
sufficient for the purpose of not exposing SGX EPC to userspace, correct?

E.g., if SGX cgroup is enabled, you can see below when you read "max":

 # cat /sys/fs/cgroup/my_group/misc.max
 #  
   sgx_epc ...
   ...

Otherwise you won't be able to see "sgx_epc":

 # cat /sys/fs/cgroup/my_group/misc.max
 #  
   ...

And when you try to write the "max" for "sgx_epc", you will hit error:

 # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
 # ... echo: write error: Invalid argument

The above applies to all the control files.  To me this is pretty much
means "SGX EPC is disabled" or "not supported" for userspace. 

Am I missing anything?


Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-19 Thread Huang, Kai
On Fri, 2024-04-19 at 13:55 -0500, Haitao Huang wrote:
> On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai  wrote:
> 
> > 
> > 
> > On 16/04/2024 3:20 pm, Haitao Huang wrote:
> > > From: Kristen Carlson Accardi 
> > >  In cases EPC pages need be allocated during a page fault and the cgroup
> > > usage is near its limit, an asynchronous reclamation needs be triggered
> > > to avoid blocking the page fault handling.
> > >  Create a workqueue, corresponding work item and function definitions
> > > for EPC cgroup to support the asynchronous reclamation.
> > >  In case the workqueue allocation is failed during init, disable cgroup.
> > 
> > It's fine and reasonable to disable (SGX EPC) cgroup.  The problem is  
> > "exactly what does this mean" isn't quite clear.
> > 
> First, this is really some corner case most people don't care about: during  
> init, the kernel can't even allocate a workqueue object. So I don't think we  
> should write extra code to implement some sophisticated solution. Any  
> solution we come up with may just not work the way users want or solve  
> the real issue, given that such an allocation failure even happens at  
> init time.

I think for such a boot time failure we can either choose to BUG_ON() directly,
or we try to handle it _nicely_, but not half-way.  My experience is
adding BUG_ON() should be avoided in general, but it might be acceptable
during kernel boot.  I will leave it to others.


[...]

> > 
> > ..., IIUC you choose a (third) solution that is even one more step back:
> > 
> > It just makes try_charge() always succeed, but EPC pages are still  
> > managed in the "per-cgroup" list.
> > 
> > But this solution, AFAICT, doesn't work.  The reason is when you fail to  
> > allocate EPC page you will do the global reclaim, but now the global  
> > list is empty.
> > 
> > Am I missing anything?
> 
> But when cgroups are enabled in the config, global reclamation starts from the root  
> and reclaims from the whole hierarchy the user may still be able to create.  
> It's just that we don't have async/sync per-cgroup reclaim triggered.

OK.  I missed this as it is in a later patch.

> 
> > 
> > So my thinking is, we have two options:
> > 
> > 1) Modify the MISC cgroup core code to allow the kernel to disable one  
> > particular resource.  It shouldn't be hard, e.g., we can add a  
> > 'disabled' flag to the 'struct misc_res'.
> > 
> > Hmm.. wait, after checking, the MISC cgroup won't show any control files  
> > if the "capacity" of the resource is 0:
> > 
> > "
> >   * Miscellaneous resources capacity for the entire machine. 0 capacity
> >   * means resource is not initialized or not present in the host.
> > "
> > 
> > So I really suppose we should go with this route, i.e., by just setting  
> > the EPC capacity to 0?
> > 
> > Note misc_cg_try_charge() will fail if capacity is 0, but we can make it  
> > return success by explicitly check whether SGX cgroup is disabled by  
> > using a helper, e.g., sgx_cgroup_disabled().
> > 
> > And you always return the root SGX cgroup in sgx_get_current_cg() when  
> > sgx_cgroup_disabled() is true.
> > 
> > And in sgx_reclaim_pages_global(), you do something like:
> > 
> > static void sgx_reclaim_pages_global(..)
> > {
> > #ifdef CONFIG_CGROUP_MISC
> > if (sgx_cgroup_disabled())
> > sgx_reclaim_pages(&sgx_root_cg.lru);
> > else
> > sgx_cgroup_reclaim_pages(misc_cg_root());
> > #else
> > sgx_reclaim_pages(&sgx_global_list);
> > #endif
> > }
> > 
> > I am perhaps missing some other spots too but you got the idea.
> > 
> > At last, after typing those, I believe we should have a separate patch  
> > to handle disable SGX cgroup at initialization time.  And you can even  
> > put this patch _somewhere_ after the patch
> > 
> > "x86/sgx: Implement basic EPC misc cgroup functionality"
> > 
> > and before this patch.
> > 
> > It makes sense to have such a patch anyway, because with it we can easily  
> > add a kernel command line "sgx_cgroup=disabled" if the user wants it  
> > disabled (when someone has such requirement in the future).
> > 
> 
> I think we can add support for "sgx_cgroup=disabled" in future if indeed  
> needed. But just for init failure, no?
> 

It's not about the commandline, which we can add in the future when
needed.  It's about we need to have a way to handle SGX cgroup being
disabled at boot time nicely, because we already have a case where we need
to do so.

Your approach looks half-way to me, and is not future extendible.  If we
choose to do it, do it right -- that is, we need a way to disable it
completely in both kernel and userspace so that userspace won't be able to
see it.


Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-19 Thread Huang, Kai


> Documentation of task_get_css() says it always  
> returns a valid css. This function is used by get_current_misc_cg() to get  
> the css reference.
> 
> 
> /**
>   * task_get_css - find and get the css for (task, subsys)
>   * @task: the target task
>   * @subsys_id: the target subsystem ID
>   *
>   * Find the css for the (@task, @subsys_id) combination, increment a
>   * reference on and return it.  This function is guaranteed to return a
>   * valid css.  The returned css may already have been offlined.
>   */
> static inline struct cgroup_subsys_state *
> task_get_css(struct task_struct *task, int subsys_id)

Ah, I missed this comment.

This confirms my code reading too.

> 
> 
> If you look at the code of this function, you will see it does not check  
> NULL either for task_css().
> 
> So I think we are pretty sure here it's confirmed by this documentation  
> and testing.

Yeah agreed.  Thanks.



Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-18 Thread Huang, Kai




On 16/04/2024 3:20 pm, Haitao Huang wrote:

From: Kristen Carlson Accardi 

In cases where EPC pages need to be allocated during a page fault and the cgroup
usage is near its limit, an asynchronous reclamation needs to be triggered
to avoid blocking the page fault handling.

Create a workqueue, corresponding work item and function definitions
for EPC cgroup to support the asynchronous reclamation.

In case the workqueue allocation fails during init, disable the cgroup.


It's fine and reasonable to disable (SGX EPC) cgroup.  The problem is 
"exactly what does this mean" isn't quite clear.


Given SGX EPC is just one type of MISC cgroup resources, we cannot just 
disable MISC cgroup as a whole.


So, the first interpretation is we treat the entire MISC_CG_RES_SGX 
resource type doesn't exist, that is, we just don't show control files 
in the file system, and all EPC pages are tracked in the global list.


But it might not be straightforward to implement in the SGX driver, 
i.e., we might need to do more MISC cgroup core code changes to make it 
able to support disabling a particular resource at runtime -- I need 
to double check.


So if that is not something worth doing, we will still need to live with 
the fact that the user is still able to create SGX cgroups in the 
hierarchy, see those control files, and read/write them.


The second interpretation I suppose is, although the SGX cgroup is still 
seen as supported in userspace, in kernel we just treat it doesn't exist.


Specifically, that means: 1) we always return the root SGX cgroup for 
any EPC page when allocating a new one; 2) as a result, we still track 
all EPC pages in a single global list.


But from the code below ...



  static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
  {
if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
@@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum 
sgx_reclaim reclaim)
  {
int ret;
  
+	/* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */

+   if (!sgx_cg_wq)
+   return 0;
+


..., IIUC you choose a (third) solution that is even one more step back:

It just makes try_charge() always succeed, but EPC pages are still 
managed in the "per-cgroup" list.


But this solution, AFAICT, doesn't work.  The reason is when you fail to 
allocate EPC page you will do the global reclaim, but now the global 
list is empty.


Am I missing anything?

So my thinking is, we have two options:

1) Modify the MISC cgroup core code to allow the kernel to disable one 
particular resource.  It shouldn't be hard, e.g., we can add a 
'disabled' flag to the 'struct misc_res'.


Hmm.. wait, after checking, the MISC cgroup won't show any control files 
if the "capacity" of the resource is 0:


"
 * Miscellaneous resources capacity for the entire machine. 0 capacity
 * means resource is not initialized or not present in the host.
"

So I really suppose we should go with this route, i.e., by just setting 
the EPC capacity to 0?


Note misc_cg_try_charge() will fail if capacity is 0, but we can make it 
return success by explicitly checking whether the SGX cgroup is disabled, 
using a helper, e.g., sgx_cgroup_disabled().


And you always return the root SGX cgroup in sgx_get_current_cg() when 
sgx_cgroup_disabled() is true.


And in sgx_reclaim_pages_global(), you do something like:

static void sgx_reclaim_pages_global(..)
{
#ifdef CONFIG_CGROUP_MISC
if (sgx_cgroup_disabled())
sgx_reclaim_pages(&sgx_root_cg.lru);
else
sgx_cgroup_reclaim_pages(misc_cg_root());
#else
sgx_reclaim_pages(&sgx_global_list);
#endif
}

I am perhaps missing some other spots too but you got the idea.
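To illustrate (a sketch only -- sgx_cgroup_disabled() and the flag behind it
are assumed names, not code from this series):

/* Set when SGX cgroup init fails and the EPC capacity is reset to 0. */
static bool sgx_cg_inactive;

static inline bool sgx_cgroup_disabled(void)
{
	return sgx_cg_inactive;
}

int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
{
	/* With the capacity forced to 0, misc_cg_try_charge() would fail. */
	if (sgx_cgroup_disabled())
		return 0;

	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
}

sgx_get_current_cg() would similarly return the root SGX cgroup (&sgx_cg_root)
when sgx_cgroup_disabled() is true; the reference handling for that case is
elided in this sketch.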

At last, after typing those, I believe we should have a separate patch 
to handle disabling the SGX cgroup at initialization time.  And you can even 
put this patch _somewhere_ after the patch


"x86/sgx: Implement basic EPC misc cgroup functionality"

and before this patch.

It makes sense to have such a patch anyway, because with it we can easily 
add a kernel command line "sgx_cgroup=disabled" if the user wants it 
disabled (when someone has such a requirement in the future).







Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-18 Thread Huang, Kai




Was requested by Jarkko:
https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t


[...]


Ah I missed that.  No problem to me.





--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include 


I don't see why you need  here.  Also, ...


+#include 
+#include 
+
+#include "sgx.h"


... "sgx.h" already includes 

[...]


right



+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+    /* get_current_misc_cg() never returns NULL when Kconfig enabled */
+    return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}


I spent some time looking into this.  And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]

Thanks for checking this. I did a similar analysis and agree with the 
conclusion. I think this is also confirmed by Michal's description AFAICT:

"
The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist."


After looking I believe we can even disable MISC cgroup at runtime for a 
particular cgroup (haven't actually verified on real machine, though):


 # echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control

And if you look at the MISC cgroup core code, many functions actually 
handle a NULL css, e.g., misc_cg_try_charge():


int misc_cg_try_charge(enum misc_res_type type,
struct misc_cg *cg, u64 amount)
{
...

if (!(valid_type(type) && cg &&
READ_ONCE(misc_res_capacity[type])))
return -EINVAL;

...
}

That's why I am still a little bit worried about this.  And it's better 
to have cgroup expert(s) to confirm here.


Btw, AMD SEV doesn't need to worry because it doesn't dereference @css 
but just passes it to MISC cgroup core functions like 
misc_cg_try_charge().  But for SGX, we actually dereference it directly.




Re: [PATCH v12 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-17 Thread Huang, Kai




On 16/04/2024 3:20 pm, Haitao Huang wrote:

From: Kristen Carlson Accardi 

Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching limit.  This is not ideal. When that happens, a better way is
to reclaim EPC page(s) from the current EPC cgroup (and/or its
descendants) to reduce its usage so the new allocation can succeed.

Add the basic building blocks to support per-cgroup reclamation.

Currently the kernel only has one place to reclaim EPC pages: the global
EPC LRU list.  To support the "per-cgroup" EPC reclaim, maintain an LRU
list for each EPC cgroup, and introduce a "cgroup" variant function to
reclaim EPC pages from a given EPC cgroup and its descendants.

Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
pages.  Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
pages from the global LRU, and then tries to reclaim these pages at once
for better performance.

Implement the "cgroup" variant EPC reclaim in a similar way, but keep
the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
as input, and return the pages that are "scanned" and attempted for
reclamation (but not necessarily reclaimed successfully); 2) loop the
given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".

This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
cgroup, and only moves to its descendants when there are not enough
reclaimable EPC pages to "scan" in its LRU.  It should be enough for
most cases.

Note, this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in batch
of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
reclaimable pages, the actual reclaim of EPC pages will be split into
smaller batches _across_ multiple LRUs with each being smaller than
SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch.  But this is unnecessarily complicated at
this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to
return the actual "reclaimed" pages, but not "scanned" pages. However,
the reclamation is a lengthy process, and forcing a successful reclamation
of a predetermined number of pages may block the caller for too long. And
that may not be acceptable in some synchronous contexts, e.g., in
serving an ioctl().

With this building block in place, add synchronous reclamation support
in sgx_cgroup_try_charge(): trigger a call to
sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
caller allows synchronous reclaim as indicated by a newly added
parameter.

A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().

Note all reclaimable EPC pages are still tracked in the global LRU thus
no per-cgroup reclamation is actually active at the moment. Per-cgroup
tracking and reclamation will be turned on in the end after all
necessary infrastructure is in place.


Nit:

"all necessary infrastructures are in place", or, "all necessary 
building blocks are in place".


?
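For reference, a minimal sketch of the per-cgroup walk described in the
changelog above.  css_misc(), sgx_cgroup_from_misc_cg() and the per-cgroup
'lru' field follow this series; the refcounting details here are simplified
assumptions:

static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
{
	struct cgroup_subsys_state *css_root = &root->css;
	struct cgroup_subsys_state *pos;
	unsigned int cnt = 0;

	rcu_read_lock();
	css_for_each_descendant_pre(pos, css_root) {
		struct sgx_cgroup *sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));

		/* Pin the position so iteration can continue after sleeping. */
		if (!css_tryget(pos))
			break;
		rcu_read_unlock();

		/* Returns the number of pages attempted ("scanned"). */
		cnt += sgx_reclaim_pages(&sgx_cg->lru);

		rcu_read_lock();
		css_put(pos);

		if (cnt >= SGX_NR_TO_SCAN)
			break;
	}
	rcu_read_unlock();
}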



Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Jarkko Sakkinen 
---


Reviewed-by: Kai Huang 

More nitpickings below:

[...]


-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum 
sgx_reclaim reclaim)


Let's still wrap the text on 80-character basis.

I guess most people are more used to that.

[...]


-	epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-					    struct sgx_epc_page, list);
+	epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);


Ditto.



Re: [PATCH v12 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-04-16 Thread Huang, Kai
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
> of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
> allocated page into the global LRU list while
> sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
> global LRU references in these functions to make them reusable when
> pages are tracked in per-cgroup LRUs.
> 
> Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
> EPC page. It simply returns the global LRU now, and will later return
> the LRU of the cgroup within which the EPC page was allocated. Replace
> the hard coded global LRU with a call to this helper.
> 
> Next patches will first get the cgroup reclamation flow ready while
> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
> make the switch in the end for sgx_lru_list() to return per-cgroup
> LRU.

I found the first paragraph hard to read.  Provide my version below for
your reference:

"
The SGX driver tracks reclaimable EPC pages via
sgx_mark_page_reclaimable(), which adds the newly allocated page into the
global LRU list.  sgx_unmark_page_reclaimable() does the opposite.

To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and the new allocated EPC page will need to be added
to the LRU of associated cgroup, but not always the global LRU list.

When sgx_mark_page_reclaimable() is called, the cgroup that the new
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.

Add a helper, sgx_lru_list(), to return the LRU that the EPC page should
be/is added to for the given EPC page.  Currently it just returns the
global LRU.  Change sgx_{mark|unmark}_page_reclaimable() to use the helper
function to get the LRU from the EPC page instead of referring to the
global LRU directly.

This allows EPC pages to be tracked in a "per-cgroup" LRU when
that becomes ready.
"

Nit:

That being said, is sgx_epc_page_lru() better than sgx_lru_list()?
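For reference, the helper at this point in the series is essentially just
(a sketch; the per-cgroup variant comes later):

static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
{
	/* For now, every reclaimable page lives on the global LRU. */
	return &sgx_global_lru;
}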

> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Reviewed-by: Jarkko Sakkinen 
> Tested-by: Jarkko Sakkinen 
> ---
> 

Feel free to add:

Reviewed-by: Kai Huang 


Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-16 Thread Huang, Kai
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
> 
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
> 
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
> 
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Reviewed-by: Jarkko Sakkinen 
> Reviewed-by: Tejun Heo 
> Tested-by: Jarkko Sakkinen 

I don't see any big issue, so feel free to add:

Reviewed-by: Kai Huang 

Nitpickings below:

[...]


> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include 
> +#include 

It doesn't seem you need the above two here.

Probably they are needed in later patches, in that case we can move to the
relevant patch(es) that they got used.

However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.

Btw, I am not sure whether you want to use  because it looks like
it contains a lot of unrelated stuff.  Anyway I guess nobody cares.

> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;

The comment isn't necessary (sorry didn't notice before), because the code
is pretty clear saying that IMHO.

[...]

> 
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include 

I don't see why you need  here.  Also, ...

> +#include 
> +#include 
> +
> +#include "sgx.h"

... "sgx.h" already includes 

[...]

> 
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + /* get_current_misc_cg() never returns NULL when Kconfig enabled */
> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}

I spent some time looking into this.  And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]


> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

Is this needed?  I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?

[...]


[*] IIUC get_current_misc_cg() should never return NULL when the Kconfig is on
(code indent slightly adjusted for text wrap).

Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless whether it is disabled in kernel command line.

int __init cgroup_init(void)
{
...

	for_each_subsys(ss, ssid) {
		if (ss->early_init) {
			...
		} else {
			cgroup_init_subsys(ss, false);
		}

	...


Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-15 Thread Huang, Kai
> 
> I'll send a fixup for this patch or another version of the series if more  
> changes are needed.

Hi Haitao,

I don't like to say this, but in general I think you are sending too frequently.  The
last version was sent April 11th (my time), so considering the weekend it has
only been 3 or at most 4 days.

Please slow down a little bit to give people more time.

More information please also see:

https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders


Re: [PATCH v11 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2024-04-15 Thread Huang, Kai
On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> Introduce a data structure to wrap the existing reclaimable list and its
> spinlock. Each cgroup later will have one instance of this structure to
> track EPC pages allocated for processes associated with the same cgroup.
> Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
> from the reclaimable list in this structure when its usage reaches near
> its limit.
> 
> Use this structure to encapsulate the LRU list and its lock used by the
> global reclaimer.
> 
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Cc: Sean Christopherson 
> Reviewed-by: Jarkko Sakkinen 
> 

Reviewed-by: Kai Huang 


Re: [PATCH v11 04/14] cgroup/misc: Add SGX EPC resource type

2024-04-15 Thread Huang, Kai
On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
> for the misc controller.
> 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Reviewed-by: Jarkko Sakkinen 
> 

Reviewed-by: Kai Huang 


Re: [PATCH v11 03/14] cgroup/misc: Export APIs for SGX driver

2024-04-15 Thread Huang, Kai
On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
> its or ancestor's limit. This requires a walk from the current cgroup up
> to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to
> enable this walk.
> 
> The SGX driver also needs to start a global-level reclamation from the
> root. Export misc_cg_root() for the SGX driver to access.
> 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Reviewed-by: Jarkko Sakkinen 
> Reviewed-by: Tejun Heo 
> 

Reviewed-by: Kai Huang 


Re: [PATCH v11 02/14] cgroup/misc: Add per resource callbacks for CSS events

2024-04-15 Thread Huang, Kai
On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed.
> 
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.
> 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Reviewed-by: Jarkko Sakkinen 
> Reviewed-by: Tejun Heo 
> 

Reviewed-by: Kai Huang 

Nitpickings below:

>  
> +/**
> + * struct misc_res_ops: per resource type callback ops.
> + * @alloc: invoked for resource specific initialization when cgroup is 
> allocated.
> + * @free: invoked for resource specific cleanup when cgroup is deallocated.
> + */
> +struct misc_res_ops {
> + int (*alloc)(struct misc_cg *cg);
> + void (*free)(struct misc_cg *cg);
> +};
> +

Perhaps you can mention in the changelog why you take 'struct misc_cg *cg' as the
parameter, and not 'struct misc_res *res'.

It's not very clear in this patch.


[...]

>  static struct cgroup_subsys_state *
>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>  {
> - enum misc_res_type i;
> - struct misc_cg *cg;
> + struct misc_cg *parent_cg, *cg;
> + int ret;
>  
> - if (!parent_css) {
> - cg = _cg;
> + if (unlikely(!parent_css)) {
> + parent_cg = cg = _cg;

Seems the 'unlikely' is new.

I think you can remove it because it's not something that is related to this
patch.



Re: [PATCH v11 01/14] x86/sgx: Replace boolean parameters with enums

2024-04-15 Thread Huang, Kai
On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote:
> Replace boolean parameters for 'reclaim' in the function
> sgx_alloc_epc_page() and its callers with an enum.
> 
> Also opportunistically remove non-static declaration of
> __sgx_alloc_epc_page() and a typo
> 
> Signed-off-by: Haitao Huang 
> Suggested-by: Jarkko Sakkinen 
> Suggested-by: Dave Hansen 
> 

Reviewed-by: Kai Huang 


Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-08 Thread Huang, Kai




On 9/04/2024 6:03 am, Haitao Huang wrote:




The misc root cgroup is a static object, similar to sgx_cg_root. So 
misc_cg_root() won't be NULL.
However, based on how css_misc() was checked for NULL, I suppose 
sgx_get_current_cg() may be NULL when the cgroup is disabled (again not 100% 
sure, but we handle it anyway).


Could you help to check?  Sorry I am busy on something else thus won't 
be able to do any actual check.




Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-08 Thread Huang, Kai

> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -28,6 +28,10 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup 
> *sgx_cg, enum sgx_recl
>  static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
>  
>  static inline void sgx_cgroup_init(void) { }
> +
> +static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct 
> mm_struct *charge_mm)
> +{
> +}
>  #else
>  struct sgx_cgroup {
>   struct misc_cg *cg;
> @@ -65,6 +69,9 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
>  
>  int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim 
> reclaim);
>  void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
> +bool sgx_cgroup_lru_empty(struct misc_cg *root);
> +bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg);
> +void sgx_cgroup_reclaim_pages(struct misc_cg *root,  struct mm_struct 
> *charge_mm);

The decision of whether to implement a stub function for a given function
seems quite random to me.

My impression is people in general don't like #ifdefs in the C file but prefer
implementing them in the header in some helper function.

I guess you might want to just implement a stub function for each of the 3
functions exposed, so that we can eliminate some #ifdefs in the sgx/main.c (see
below).

>  void sgx_cgroup_init(void);
>  
>  #endif
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 7f92455d957d..68f28ff2d5ef 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -34,6 +34,16 @@ static struct sgx_epc_lru_list sgx_global_lru;
>  
>  static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page 
> *epc_page)
>  {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (epc_page->sgx_cg)
> + return &epc_page->sgx_cg->lru;
> +
> + /*
> +  * This should not happen when cgroup is enabled: Every page belongs
> +  * to a cgroup, or the root by default.
> +  */
> + WARN_ON_ONCE(1);

In the case MISC cgroup is enabled in Kconfig but disabled by command line, I
think this becomes legal now?

> +#endif
>   return &sgx_global_lru;
>  }
>  
> @@ -42,7 +52,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct 
> sgx_epc_page *epc_pag
>   */
>  static inline bool sgx_can_reclaim(void)
>  {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + return !sgx_cgroup_lru_empty(misc_cg_root());
> +#else
>   return !list_empty(&sgx_global_lru.reclaimable);
> +#endif
>  }
>  

Here you are using #ifdef  CONFIG_CGRUP_SGX_EPC, but ...

>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
> @@ -404,7 +418,10 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  
>  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>  {
> - sgx_reclaim_pages(&sgx_global_lru, charge_mm);
> + if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
> + else
> + sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>  }

... here you are using IS_ENABLED(CONFIG_CGROUP_SGX_EPC).

Any reason they are not consistent?

Also, in the case where MISC cgroup is disabled via commandline, I think it
won't work, because misc_cg_root() should be NULL in this case while
IS_ENABLED(CONFIG_CGROUP_SGX_EPC) is true.

>  
>  /*
> @@ -414,6 +431,16 @@ static void sgx_reclaim_pages_global(struct mm_struct 
> *charge_mm)
>   */
>  void sgx_reclaim_direct(void)
>  {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +
> + /* Make sure there are some free pages at cgroup level */
> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
> + sgx_cgroup_reclaim_pages(sgx_cg->cg, current->mm);
> + sgx_put_cg(sgx_cg);
> + }
> +#endif

This #ifdef CONFIG_CGROUP_SGX_EPC can be removed if we implement a stub function
for sgx_cgroup_should_reclaim().

> + /* Make sure there are some free pages at global level */
>   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>   sgx_reclaim_pages_global(current->mm);
>  }
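For illustration, the stub mentioned above could look like this in
epc_cgroup.h for the !CONFIG_CGROUP_SGX_EPC case (a sketch; the defaults are
assumptions):

static inline bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
{
	/* No cgroup support: never trigger per-cgroup reclaim. */
	return false;
}

static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root,
					    struct mm_struct *charge_mm) { }

With these, sgx_reclaim_direct() can call sgx_get_current_cg() and
sgx_cgroup_should_reclaim() unconditionally: in the !CONFIG case
sgx_get_current_cg() returns NULL, the stub returns false, and sgx_put_cg()
is already a no-op.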



Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-04 Thread Huang, Kai
On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > > -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> > > +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,  
> > > enum sgx_reclaim r)
> > 
> > Is the @r here intentional for shorter typing?
> > 
> 
> yes :-)
> Will spell it out to make it consistent if that's the concern.

I kinda prefer the full name to match the CONFIG_CGROUP_SGX_EPC on case.  You
can put the 'enum sgx_reclaim reclaim' parameter into the new line if needed:

static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,
enum sgx_reclaim reclaim)
{
return 0;
}


Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-04 Thread Huang, Kai
On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > Please also mention why "leaving asynchronous reclamation to later  
> > patch(es)" is
> > fine.  E.g., it won't break anything I suppose.
> > 
> 
> Right. Pages are still in the global list at the moment and only global  
> reclaiming is active until the "turn on" patch. Separating out is really  
> just for the purpose of review IMHO.

Sounds good to me.  Thanks.


Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-04 Thread Huang, Kai
On Thu, 2024-04-04 at 20:24 -0500, Haitao Huang wrote:
> > Again, IMHO having CONFIG_CGROUP_SGX_EPC here is ugly, because it  
> > doesn't even
> > match the try_charge() above, which doesn't have the  
> > CONFIG_CGROUP_SGX_EPC.
> > 
> > If you add a wrapper in "epc_cgroup.h"
> > 
> Agree. but in sgx.h so sgx_epc_page struct is not exposed in epc_cgroup.h.

I am fine with any place that suits.


Re: [PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-04 Thread Huang, Kai
On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
>  
>  void sgx_cgroup_init(void)
>  {
> + sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE, 
> WQ_UNBOUND_MAX_ACTIVE);
> +
> + /* All Cgroups functionalities are disabled. */
> + if (WARN_ON(!sgx_cg_wq))
> + return;
> +

I don't think you should WARN(), because it's not a kernel bug or similar.  Just
print a message saying EPC cgroup is disabled and move on.

if (!sgx_cg_wq) {
pr_err("SGX EPC cgroup disabled: alloc_workqueue() failed.\n");
return;
}


Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-03 Thread Huang, Kai
On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> When a cgroup usage reaches its limit, and it is to be charged, i.e.,
> sgx_cgroup_try_charge() called for new allocations, the cgroup needs to
> reclaim pages from its LRU or LRUs of its descendants to make room for
> any new allocations. This patch adds the basic building block for the
> per-cgroup reclamation flow and use it for synchronous reclamation in
> sgx_cgroup_try_charge().

It's better to first mention _why_ we need this:

Currently in the EPC page allocation, the kernel simply fails the allocation
when the current EPC cgroup fails to charge due to its usage reaching limit. 
This is not ideal.  When that happens, a better way is to reclaim EPC page(s)
from the current EPC cgroup (and/or its descendants) to reduce its usage so the
new allocation can succeed.

Add the basic building blocks to support the per-cgroup reclamation flow ...

> 
> First, modify sgx_reclaim_pages() to let callers to pass in the LRU from
> which pages are reclaimed, so it can be reused by both the global and
> cgroup reclaimers. Also return the number of pages attempted, so a
> cgroup reclaimer can use it to track reclamation progress from its
> descendants.

IMHO you are jumping too fast to the implementation details.  Better to have
some more background:

"
Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU
list.  To support the "per-cgroup" EPC reclaim, maintain an LRU list for each
EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC page(s)
from a given EPC cgroup (and its descendants).
"

> 
> For the global reclaimer, replace all call sites of sgx_reclaim_pages()
> with calls to a newly created wrapper, sgx_reclaim_pages_global(), which
> just calls sgx_reclaim_pages() with the global LRU passed in.
> 
> For cgroup reclamation, implement a basic reclamation flow, encapsulated
> in the top-level function, sgx_cgroup_reclaim_pages(). It performs a
> pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages()
> at each node passing in the LRU of that node. It keeps track of total
> attempted pages and stops the walk if desired number of pages are
> attempted.

Then it's time to jump to implementation details:

"
Currently the kernel does the global EPC reclaim in sgx_reclaim_page().  It
always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages. 
Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the
global LRU, and then tries to reclaim these pages at once for better
performance.

Use similar way to implement the "cgroup" variant EPC reclaim, but keep the
implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input,
and return the pages that are "scanned" (but not actually reclaimed); 2) loop
the given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".

This implementation always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of
the given EPC cgroup, and only moves to its descendants when there are not enough
reclaimable EPC pages to "scan" in its LRU.  It should be enough for most cases.
"

Then I think it's better to explain why "alternatives" are not chosen:

"
Note, this simple implementation doesn't _exactly_ mimic the current global EPC
reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN
pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual
reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs
with each being smaller than SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to have a
new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the
given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one
batch.  But this is unnecessarily complicated at this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to return the
actual "reclaimed" pages, but not "scanned" pages.  However this solution also
has cons: 
"

:

I recall you mentioned "unable to control latency of each reclaim" etc, but IIUC
one could be:

This approach may result in a higher chance of "reclaiming EPC pages from
descendants but not the root/given EPC cgroup", e.g., when the EPC pages in the
root EPC cgroup are all young while those in its descendants are not.  This may
not be desired.

Makes sense?

> 
> Finally, pass a parameter to sgx_cgroup_try_charge() to indicate whether
> a synchronous reclamation is allowed. If the caller allows and cgroup
> usage is at its limit, trigger the synchronous reclamation by calling
> sgx_cgroup_reclaim_pages() in a loop with cond_resched() in between
> iterations.

This isn't needed IMHO as you can easily see it in the code, and there are no "design
choices" here.

General rule: focus on explaining the "why" and the "design choices", but not
implementation details, which can be seen in the code itself.

Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-01 Thread Huang, Kai
On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote:
> On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
> > 
> > > --- /dev/null
> > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > > @@ -0,0 +1,74 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +// Copyright(c) 2022 Intel Corporation.
> > 
> > It's 2024 now.
> > 
> > And it looks like you need to use a C-style comment for /* Copyright ... */, after 
> > looking
> > at some other C files.
> 
> To be fair, this happens *all the time* to everyone :-)
> 
> I've proposed this a few times in the SGX context and am going to say it now.
> Given the nature of Git, copyrights would anyway need to be sorted out by
> the Git log, not by possibly incorrect copyright platters in the header
> and source files.
> 

Sure fine to me either way.  Thanks for pointing out.

I have some vague memory that we should update the year but I guess I was wrong.


Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-03-28 Thread Huang, Kai

> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.

It's 2024 now.

And it looks like you need to use a C-style comment for /* Copyright ... */, after looking
at some other C files.

> +
> +#include 
> +#include 
> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;
> +
> +/**
> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + *
> + * @sgx_cg:  The EPC cgroup to be charged for the page.
> + * Return:
> + * * %0 - If successfully charged.
> + * * -errno - for failures.
> + */
> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +{
> + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
> +}
> +
> +/**
> + * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
> + * @sgx_cg:  The charged sgx cgroup
> + */
> +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
> +{
> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
> +}
> +
> +static void sgx_cgroup_free(struct misc_cg *cg)
> +{
> + struct sgx_cgroup *sgx_cg;
> +
> + sgx_cg = sgx_cgroup_from_misc_cg(cg);
> + if (!sgx_cg)
> + return;
> +
> + kfree(sgx_cg);
> +}
> +
> +static int sgx_cgroup_alloc(struct misc_cg *cg);

Again, this declaration can be removed if you move the below structure ...

> +
> +const struct misc_res_ops sgx_cgroup_ops = {
> + .alloc = sgx_cgroup_alloc,
> + .free = sgx_cgroup_free,
> +};
> +
> +static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup 
> *sgx_cg)
> +{
> + cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
> + sgx_cg->cg = cg;
> +}
> +
> +static int sgx_cgroup_alloc(struct misc_cg *cg)
> +{
> + struct sgx_cgroup *sgx_cg;
> +
> + sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
> + if (!sgx_cg)
> + return -ENOMEM;
> +
> + sgx_cgroup_misc_init(cg, sgx_cg);
> +
> + return 0;
> +}

... here.

> +
> +void sgx_cgroup_init(void)
> +{
> + misc_cg_set_ops(MISC_CG_RES_SGX_EPC, _cgroup_ops);
> + sgx_cgroup_misc_init(misc_cg_root(), _cg_root);
> +}
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index ..8f794e23fad6
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,70 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC

Nit: add an empty line to make text more breathable.

> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_cgroup;
> +
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + return NULL;
> +}
> +
> +static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
> +
> +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +{
> + return 0;
> +}
> +
> +static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
> +
> +static inline void sgx_cgroup_init(void) { }
> +#else

Nit: I prefer two empty lines before and after the 'else'.

> +struct sgx_cgroup {
> + struct misc_cg *cg;
> +};
> +
> +static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> + return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +/**
> + * sgx_get_current_cg() - get the EPC cgroup of current process.
> + *
> + * Returned cgroup has its ref count increased by 1. Caller must call
> + * sgx_put_cg() to return the reference.
> + *
> + * Return: EPC cgroup to which the current task belongs to.
> + */
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}

Again, I _think_ you need to check whether get_current_misc_cg() returns NULL?

Misc cgroup can be disabled via the kernel command line even if it is enabled in
the Kconfig.

I am not an expert on cgroups, so could you check on this?
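
For concreteness, a rough sketch of the kind of NULL handling I have in mind
(only an illustration, assuming get_current_misc_cg() can return NULL when the
misc controller is disabled at boot; callers would then also need to tolerate a
NULL sgx_cgroup):

	static inline struct sgx_cgroup *sgx_get_current_cg(void)
	{
		/* May be NULL if the misc cgroup controller is disabled. */
		struct misc_cg *cg = get_current_misc_cg();

		return cg ? sgx_cgroup_from_misc_cg(cg) : NULL;
	}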

> +
> +/**
> + * sgx_put_sgx_cg() - Put the EPC cgroup and reduce its ref count.
> + * @sgx_cg - EPC cgroup to put.
> + */
> +static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
> +{
> + if (sgx_cg)
> + put_misc_cg(sgx_cg->cg);
> +}
> +
> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
> +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
> +void sgx_cgroup_init(void);
> +
> +#endif
> +
> +#endif /* _SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index d219f14365d4..023af54c1beb 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -17,6 +18,7 @@
>  #include "driver.h"
>  #include "encl.h"
>  #include "encls.h"
> +#include "epc_cgroup.h"
>  
>  struct 

Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai




On 27/02/2024 11:38 am, Dave Hansen wrote:

On 2/26/24 14:34, Huang, Kai wrote:

So I am trying to get the actual downside of doing per-cgroup reclaim or
the full reason that we choose global reclaim.


Take the most extreme example:

while (hit_global_sgx_limit())
reclaim_from_this(cgroup);

You eventually end up with all of 'cgroup's pages gone and handed out to
other users on the system who stole them all.  Other users might cause
you to go over the global limit.  *They* should be paying part of the
cost, not just you and your cgroup.


Yeah, likely we will need another layer of logic to decide when to do 
global reclaim.  I agree that is a downside and is unnecessary for this 
patchset.


Thanks for the comments.



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai




On 27/02/2024 11:31 am, Dave Hansen wrote:

On 2/26/24 14:24, Huang, Kai wrote:

What is the downside of doing per-group reclaim when try_charge()
succeeds for the enclave but failed to allocate EPC page?

Could you give a complete answer for why you choose to use global reclaim
for the above case?


There are literally two different limits at play.  There's the limit
that the cgroup imposes and then the actual physical limit.

Hitting the cgroup limit induces cgroup reclaim.

Hitting the physical limit induces global reclaim.

Maybe I'm just being dense, but I fail to understand why you would want
to entangle those two different concepts more than absolutely necessary.


OK.  Yes I agree doing per-cgroup reclaim when hitting physical limit 
would bring another layer of consideration of when to do global reclaim, 
which is not necessary now.




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai





Kai, I think your examples sound a little bit contrived.  Have actual
users expressed a strong intent for doing anything with this series
other than limiting bad actors from eating all the EPC?

I am not sure about this.  I am also trying to get a full picture.

I asked because I didn't quite like the duplicated code change in 
try_charge() in this patch and in sgx_alloc_epc_page().  I think using 
per-group reclaim we can unify the code (I have even started to write 
the code) and I don't see the downside of doing so.


So I am trying to get the actual downside of doing per-cgroup reclaim or 
the full reason that we choose global reclaim.





Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai




On 27/02/2024 10:18 am, Haitao Huang wrote:

On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai  wrote:


On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai  
wrote:


>
>
> On 24/02/2024 6:00 am, Haitao Huang wrote:
> > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai 
> > wrote:
> >
> > > > >
> > > > Right. When code reaches to here, we already passed reclaim per
> > > > cgroup.
> > >
> > > Yes, if try_charge() failed we must do per-cgroup reclaim.
> > >
> > > > The cgroup may not be at or have reached its limit, but the system has
> > > > run out of physical EPC.
> > > >
> > >
> > > But after try_charge() we can still choose to reclaim from the 
current

> > > group,
> > > but not necessarily have to be global, right?  I am not sure 
whether I

> > > am
> > > missing something, but could you elaborate why we should choose to
> > > reclaim from
> > > the global?
> > >
> >  Once try_charge is done and returns zero that means the cgroup 
usage
> > is charged and it's not over usage limit. So you really can't 
reclaim
> > from that cgroup if allocation failed. The only  thing you can do 
is to

> > reclaim globally.
>
> Sorry I still cannot establish the logic here.
>
> Let's say the sum of all cgroup limits is greater than the physical EPC, and
> enclave(s) in each cgroup could potentially fault w/o reaching the cgroup's
> limit.
>
> In this case, when enclave(s) in one cgroup faults, why we cannot
> reclaim from the current cgroup, but have to reclaim from global?
>
> Is there any real downside of the former, or you just want to 
follow the

> reclaim logic w/o cgroup at all?
>
> IIUC, there's at least one advantage of reclaim from the current 
group,

> that faults of enclave(s) in one group won't impact other enclaves in
> other cgroups.  E.g., in this way other enclaves in other groups may
> never need to trigger faults.
>
> Or perhaps I am missing anything?
>
The use case here is that the user knows it's OK for group A to borrow some
pages from group B for some time without impacting performance much, and vice
versa. That's why the user is overcommitting so the system can run more
enclave/groups. Otherwise, if she is concerned about impact of A on 
B, she

could lower the limit for A so it never interferes or interferes less with B
(assume the lower limit is still high enough to run all enclaves in A),
and sacrifice some of A's performance. Or if she does not want any
interference between groups, just don't over-commit. So we don't really
lose anything here.


But if we reclaim from the same group, seems we could enable a user 
case that

allows the admin to ensure certain group won't be impacted at all, while
allowing other groups to over-commit?

E.g., let's say we have 100M physical EPC.  And let's say the admin 
wants to run
some performance-critical enclave(s) which costs 50M EPC w/o being 
impacted.
The admin also wants to run other enclaves which could cost 100M EPC 
in total

but EPC swapping among them is acceptable.

If we choose to reclaim from the current EPC cgroup, then it seems that the
admin can achieve the above by setting up 2 groups with group1 having 
50M limit
and group2 having 100M limit, and then run performance-critical 
enclave(s) in

group1 and others in group2?  Or am I missing anything?



The more important groups should have limits higher than or equal to 
peak usage to ensure no impact.


Yes.  But if you do global reclaim there's no guarantee of this 
regardless of the limit setting.  It depends on setting of limits of 
other groups.


The less important groups should have lower limits than its peak usage 
to avoid impacting higher priority groups.


Yeah, but depending on how low the limit is, the try_charge() can still 
succeed but physical EPC is already running out.


Are you saying we should always expect the admin to set limits of groups 
not exceeding the physical EPC?



The limit is the maximum usage allowed.

By setting group2 limit to 100M, you are allowing it to use 100M. So as 
soon as it gets up and consume 100M, group1 can not even load any 
enclave if we only reclaim per-cgroup and do not do global reclaim.


I kinda forgot, but I think SGX supports swapping out EPC of an enclave 
before EINIT?  Also, with SGX2 the initial enclave can take less EPC to 
be loaded.





If we choose to do global reclaim, then we cannot achieve that.



You can achieve this by setting group 2's limit to 50M. No need to 
overcommit the system.
Group 2 will swap as soon as it hits 50M, which is the maximum it can 
consume so no impact to group 1.


Right.  We can achieve this by doing so.  But as said above, you are 
depending on setting up the limit to do per-cgroup reclaim.


So, back to the question:

What is the downside of doing per-group reclaim when try_charge() 
succeeds for the enclave but failed to allocate EPC page?


Could you give a complete answer for why you choose to use global reclaim 
for the above case?




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-26 Thread Huang, Kai
On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai  wrote:
> 
> > 
> > 
> > On 24/02/2024 6:00 am, Haitao Huang wrote:
> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai   
> > > wrote:
> > > 
> > > > > > 
> > > > > Right. When code reaches to here, we already passed reclaim per  
> > > > > cgroup.
> > > > 
> > > > Yes, if try_charge() failed we must do per-cgroup reclaim.
> > > > 
> > > > > The cgroup may not be at or have reached its limit, but the system has  
> > > > > run out of physical EPC.
> > > > > 
> > > > 
> > > > But after try_charge() we can still choose to reclaim from the current  
> > > > group,
> > > > but not necessarily have to be global, right?  I am not sure whether I  
> > > > am
> > > > missing something, but could you elaborate why we should choose to  
> > > > reclaim from
> > > > the global?
> > > > 
> > >  Once try_charge is done and returns zero that means the cgroup usage  
> > > is charged and it's not over usage limit. So you really can't reclaim  
> > > from that cgroup if allocation failed. The only  thing you can do is to  
> > > reclaim globally.
> > 
> > Sorry I still cannot establish the logic here.
> > 
> > Let's say the sum of all cgroup limits is greater than the physical EPC, and  
> > enclave(s) in each cgroup could potentially fault w/o reaching the cgroup's  
> > limit.
> > 
> > In this case, when enclave(s) in one cgroup faults, why we cannot  
> > reclaim from the current cgroup, but have to reclaim from global?
> > 
> > Is there any real downside of the former, or you just want to follow the  
> > reclaim logic w/o cgroup at all?
> > 
> > IIUC, there's at least one advantage of reclaim from the current group,  
> > that faults of enclave(s) in one group won't impact other enclaves in  
> > other cgroups.  E.g., in this way other enclaves in other groups may  
> > never need to trigger faults.
> > 
> > Or perhaps I am missing anything?
> > 
> The use case here is that the user knows it's OK for group A to borrow some  
> pages from group B for some time without impacting performance much, and vice  
> versa. That's why the user is overcommitting so the system can run more  
> enclave/groups. Otherwise, if she is concerned about impact of A on B, she  
> could lower the limit for A so it never interferes or interferes less with B  
> (assume the lower limit is still high enough to run all enclaves in A),  
> and sacrifice some of A's performance. Or if she does not want any  
> interference between groups, just don't over-commit. So we don't really  
> lose anything here.

But if we reclaim from the same group, seems we could enable a user case that
allows the admin to ensure certain group won't be impacted at all, while
allowing other groups to over-commit?

E.g., let's say we have 100M physical EPC.  And let's say the admin wants to run
some performance-critical enclave(s) which costs 50M EPC w/o being impacted. 
The admin also wants to run other enclaves which could cost 100M EPC in total
but EPC swapping among them is acceptable.

If we choose to reclaim from the current EPC cgroup, then it seems that the
admin can achieve the above by setting up 2 groups with group1 having 50M limit
and group2 having 100M limit, and then run performance-critical enclave(s) in
group1 and others in group2?  Or am I missing anything?

If we choose to do global reclaim, then we cannot achieve that.

> 
> In case of overcommitting, even if we always reclaim from the same cgroup  
> for each fault, one group may still interfere the other: e.g., consider an  
> extreme case in that group A used up almost all EPC at the time group B  
> has a fault, B has to fail allocation and kill enclaves.

If the admin allows group A to use almost all EPC, to me it's fair to say he/she
doesn't want to run anything inside B at all and it is acceptable for enclaves in B
to be killed.




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-25 Thread Huang, Kai




On 24/02/2024 6:00 am, Haitao Huang wrote:

On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai  wrote:


>
Right. When code reaches to here, we already passed reclaim per cgroup.


Yes, if try_charge() failed we must do per-cgroup reclaim.


The cgroup may not be at or have reached its limit, but the system has run out
of physical EPC.



But after try_charge() we can still choose to reclaim from the current 
group,

but not necessarily have to be global, right?  I am not sure whether I am
missing something, but could you elaborate why we should choose to 
reclaim from

the global?



Once try_charge is done and returns zero that means the cgroup usage is 
charged and it's not over usage limit. So you really can't reclaim from 
that cgroup if allocation failed. The only  thing you can do is to 
reclaim globally.


Sorry I still cannot establish the logic here.

Let's say the sum of all cgroup limits is greater than the physical EPC, and 
enclave(s) in each cgroup could potentially fault w/o reaching the cgroup's 
limit.


In this case, when enclave(s) in one cgroup faults, why we cannot 
reclaim from the current cgroup, but have to reclaim from global?


Is there any real downside of the former, or you just want to follow the 
reclaim logic w/o cgroup at all?


IIUC, there's at least one advantage of reclaim from the current group, 
that faults of enclave(s) in one group won't impact other enclaves in 
other cgroups.  E.g., in this way other enclaves in other groups may 
never need to trigger faults.


Or perhaps I am missing anything?



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-23 Thread Huang, Kai
> > 
> Right. When code reaches to here, we already passed reclaim per cgroup.  

Yes, if try_charge() failed we must do per-cgroup reclaim.

> The cgroup may not be at or have reached its limit, but the system has run  
> out of physical EPC.
> 

But after try_charge() we can still choose to reclaim from the current group,
but not necessarily have to be global, right?  I am not sure whether I am
missing something, but could you elaborate why we should choose to reclaim from
the global?



Re: [PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation

2024-02-22 Thread Huang, Kai




On 23/02/2024 5:36 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai  wrote:


On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:

From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns hard
coded reference to the global LRU.

Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., sum of cgroup limits
greater than the total capacity, cgroups may never reclaim but the total
usage can still be near the capacity. Therefore global reclamation is
still needed in those cases and it should reclaim from the root cgroup.

Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled, otherwise from the global LRU.

Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
cgroups when EPC cgroup is enabled, otherwise only check the global LRU.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c 
b/arch/x86/kernel/cpu/sgx/main.c

index 6b0c26cac621..d4265a390ba9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;

 static inline struct sgx_epc_lru_list *sgx_lru_list(struct 
sgx_epc_page *epc_page)

 {
+#ifdef CONFIG_CGROUP_SGX_EPC
+    if (epc_page->epc_cg)
+    return &epc_page->epc_cg->lru;
+
+    /* This should not happen if kernel is configured correctly */
+    WARN_ON_ONCE(1);
+#endif
 return &sgx_global_lru;
 }


How about when EPC cgroup is enabled, but one enclave doesn't belong 
to any EPC
cgroup?  Is it OK to track EPC pages for these enclaves to the root 
EPC cgroup's

LRU list together with other enclaves belongs to the root cgroup?


This should be a valid case, right?


There is no such case. Each page is in the root by default.



Is it guaranteed by the (misc) cgroup design/implementation?  If so, please add 
this information to the changelog and/or comments.  It helps non-cgroup experts 
like me to understand.




Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-22 Thread Huang, Kai




On 23/02/2024 6:20 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai  wrote:


On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:

[...]
>
> Here the @nr_to_scan is reduced by the number of pages that are
> isolated, but
> not actually reclaimed (which is reflected by @cnt).
>
> IIUC, looks you want to make this function do "each cycle" as what you
> mentioned
> in the v8 [1]:
>
> I tested with that approach and found we can only target number of
> pages
> attempted to reclaim not pages actually reclaimed due to the
> uncertainty
> of how long it takes to reclaim pages. Besides targeting number of
> scanned pages for each cycle is also what the ksgxd does.
>
> If we target actual number of pages, sometimes it just takes 
too long.

> I
> saw more timeouts with the default time limit when running 
parallel

> selftests.
>
> I am not sure what does "sometimes it just takes too long" mean, but
> what I am
> thinking is you are trying to do some perfect but yet complicated code
> here.

I think what I observed was that the try_charge() would block too long
before getting chance of schedule() to yield, causing more timeouts than
necessary.
I'll do some re-test to be sure.


Looks this is a valid information that can be used to justify whatever 
you are

implementing in the EPC cgroup reclaiming function(s).

I'll add some comments. Was assuming this is just following the old 
design as ksgxd.
There were some comments at the beginning of 
sgx_epc_cgroup_reclaim_pages().

     /*
      * Attempting to reclaim only a few pages will often fail and is
      * inefficient, while reclaiming a huge number of pages can result in
      * soft lockups due to holding various locks for an extended duration.
      */
     unsigned int nr_to_scan = SGX_NR_TO_SCAN;

I think it can be improved to emphasize we only "attempt" to finish 
scanning fixed number of pages for reclamation, not enforce number of 
pages successfully reclaimed.


Not sure need to be this comment, but at somewhere just state you are 
trying to follow the ksgxd() (the current sgx_reclaim_pages()), but 
trying to do it "_across_ given cgroup and all the descendants".


That's the reason you made @nr_to_scan as a pointer.

And also some text to explain why to follow ksgxd() -- not wanting to 
block longer due to loop over descendants etc -- so we can focus on 
discussing whether such justification is reasonable.





Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-22 Thread Huang, Kai




On 23/02/2024 9:12 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai  wrote:


On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:

Hi Kai
On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai  
wrote:

[...]
>
> So you introduced the work/workqueue here but there's no place which
> actually
> queues the work.  IMHO you can either:
>
> 1) move relevant code change here; or
> 2) focus on introducing core functions to reclaim certain pages from a
> given EPC
> cgroup w/o workqueue and introduce the work/workqueue in later patch.
>
> Makes sense?
>

Starting in v7, I was trying to split the big patch, #10 in v6 as you 
and
others suggested. My thought process was to put infrastructure needed 
for

per-cgroup reclaim in the front, then turn on per-cgroup reclaim in [v9
13/15] in the end.


That's reasonable for sure.



Thanks for the confirmation :-)



Before that, all reclaimables are tracked in the global LRU so really
there is no "reclaim certain pages from a  given EPC cgroup w/o 
workqueue"

or reclaim through workqueue before that point, as suggested in #2. This
patch puts down the implementation for both flows but neither used 
yet, as

stated in the commit message.


I know it's not used yet.  The point is how to split patches to make 
them more

self-contain and easy to review.


I would think this patch already self-contained in that all are 
implementation of cgroup reclamation building blocks utilized later. But 
I'll try to follow your suggestions below to split further (would prefer 
not to merge in general unless there is strong reasons).




For #2, sorry for not being explicit -- I meant it seems it's more 
reasonable to

split in this way:

Patch 1)
  a). change to sgx_reclaim_pages();


I'll still prefer this to be a separate patch. It is self-contained IMHO.
We were splitting the original patch because it was too big. I don't 
want to merge back unless there is a strong reason.



  b). introduce sgx_epc_cgroup_reclaim_pages();


Ok.


If I got you right, I believe you want to have a cgroup variant function 
following the same behaviour of the one for global reclaim, i.e., the 
_current_ sgx_reclaim_pages(), which always tries to scan and reclaim 
SGX_NR_TO_SCAN pages each time.


And this cgroup variant function, sgx_epc_cgroup_reclaim_pages(), tries 
to scan and reclaim SGX_NR_TO_SCAN pages each time "_across_ the cgroup 
and all the descendants".


And you want to implement sgx_epc_cgroup_reclaim_pages() in this way due 
to WHATEVER reasons.


In that case, the change to sgx_reclaim_pages() and the introduce of 
sgx_epc_cgroup_reclaim_pages() should really be together because they 
are completely tied together in terms of implementation.


In this way you can just explain clearly in _ONE_ patch why you choose 
this implementation, and for reviewer it's also easier to review because 
we can just discuss in one patch.


Makes sense?



  c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better 
name), which just takes an EPC cgroup as input w/o involving any 
work/workqueue.


This is for the workqueue use only. So I think it'd be better be with 
patch #2 below?


There are multiple levels of logic here IMHO:

  1. a) and b) above focus on "each reclaim" a given EPC cgroup
  2. c) is about a loop of above to bring given cgroup's usage to limit
  3. workqueue is one (probably best) way to do c) in async way
  4. the logic where 1) (direct reclaim) and 3) (indirect) are triggered

To me, it's clear 1) should be in one patch as stated above.

Also, to me 3) and 4) are better to be together since they give you a 
clear view on how the direct/indirect reclaim are triggered.


2) could be flexible depending on how you see it.  If you prefer viewing 
it from low-level implementation of reclaiming pages from cgroup, then 
it's also OK to be together with 1).  If you want to treat it as a part 
of _async_ way of bring down usage to limit, then _MAYBE_ it's also OK 
to be with 3) and 4).


But to me 2) can be together with 1) or even a separate patch because 
it's still kinda of low-level reclaiming details.  3) and 4) shouldn't 
contain such detail but should focus on how direct/indirect reclaim is done.


[...]



To be honest, the part I'm feeling most confusing is this 
self-contained-ness. It seems depend on how you look at things.


Completely understand.  But I think our discussion should be helpful to 
both of us and others.




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-22 Thread Huang, Kai




On 23/02/2024 6:09 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai  wrote:




-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool 
reclaim)

 {
-    return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, 
PAGE_SIZE);

+    for (;;) {
+    if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+    PAGE_SIZE))
+    break;
+
+    if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+    return -ENOMEM;
+
+    if (signal_pending(current))
+    return -ERESTARTSYS;
+
+    if (!reclaim) {
+    queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+    return -EBUSY;
+    }
+
+    if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+    /* All pages were too young to reclaim, try again a 
little later */

+    schedule();
+    }
+
+    return 0;
 }



Seems this code change is 90% similar to the existing code in the
sgx_alloc_epc_page():

...
for ( ; ; ) {
    page = __sgx_alloc_epc_page();
    if (!IS_ERR(page)) {
    page->owner = owner;
    break;
    }

    if (list_empty(&sgx_active_page_list))
    return ERR_PTR(-ENOMEM);

    if (!reclaim) {
    page = ERR_PTR(-EBUSY);
    break;
    }

    if (signal_pending(current)) {
    page = ERR_PTR(-ERESTARTSYS);
    break;
    }

    sgx_reclaim_pages();
    cond_resched();
    }
...

Is it better to move the logic/code change in try_charge() out to
sgx_alloc_epc_page() to unify them?

IIUC, the logic is quite similar: When you either failed to allocate 
one page,
or failed to charge one page, you try to reclaim EPC page(s) from the 
current

EPC cgroup, either directly or indirectly.

No?


Only these lines are the same:
     if (!reclaim) {
     page = ERR_PTR(-EBUSY);
     break;
     }

     if (signal_pending(current)) {
     page = ERR_PTR(-ERESTARTSYS);
     break;
     }

In sgx_alloc_epc_page() we do global reclamation but here we do 
per-cgroup reclamation. 


But why?  If we failed to allocate, shouldn't we try to reclaim from the 
_current_ EPC cgroup instead of global?  E.g., I thought one enclave in 
one EPC cgroup requesting insane amount of EPC shouldn't impact enclaves 
inside other cgroups?


That's why the logic of the other lines is different
though they look similar due to similar function names. For global 
reclamation we need to consider the case in which cgroup is not enabled. 
Similarly, list_empty(&sgx_active_page_list) would have to be changed to 
check the root cgroup if cgroups are enabled, otherwise check the global LRU.  
The (!reclaim) case is also different.  


W/o getting clear on my above question, so far I am not convinced why 
such difference cannot be hidden inside wrapper function(s).
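
For example (hypothetical helper names, using functions from this series only
for illustration, not actual code), two small wrappers could hide the
cgroup-enabled vs. cgroup-disabled difference:

	/* Is there anything left to reclaim, cgroup-aware or global? */
	static bool sgx_can_reclaim(void)
	{
	#ifdef CONFIG_CGROUP_SGX_EPC
		return !sgx_epc_cgroup_lru_empty(misc_cg_root());
	#else
		return !list_empty(&sgx_active_page_list);
	#endif
	}

	/* Reclaim on behalf of the current task: its cgroup, or globally. */
	static void sgx_reclaim_pages_from_current(void)
	{
	#ifdef CONFIG_CGROUP_SGX_EPC
		struct sgx_epc_cgroup *epc_cg = sgx_get_current_epc_cg();

		/* Direct (synchronous) reclaim in this sketch. */
		sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false);
		sgx_put_epc_cg(epc_cg);
	#else
		sgx_reclaim_pages();
	#endif
	}

With wrappers along these lines, sgx_alloc_epc_page() and try_charge() could
share one reclaim-and-retry loop instead of duplicating it.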


So I don't see an obvious good way

to abstract those to get meaningful savings.

Thanks
Haitao




Re: [PATCH v9 11/15] x86/sgx: Abstract check for global reclaimable pages

2024-02-21 Thread Huang, Kai
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> To determine if any page available for reclamation at the global level,
> only checking for emptiness of the global LRU is not adequate when pages
> are tracked in multiple LRUs, one per cgroup. For this purpose, create a
> new helper, sgx_can_reclaim(), currently only checks the global LRU,
> later will check emptiness of LRUs of all cgroups when per-cgroup
> tracking is turned on. Replace all the checks of the global LRU,
> list_empty(_global_lru.reclaimable), with calls to
> sgx_can_reclaim().
> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> v7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 2279ae967707..6b0c26cac621 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct 
> sgx_epc_page *epc_pag
>   return &sgx_global_lru;
>  }
>  
> +static inline bool sgx_can_reclaim(void)
> +{
> + return !list_empty(&sgx_global_lru.reclaimable);
> +}
> +
>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
>  /* Nodes with one or more EPC sections. */
> @@ -398,7 +403,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list 
> *lru, unsigned int *nr_to
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
>   return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> -!list_empty(&sgx_global_lru.reclaimable);
> + sgx_can_reclaim();
>  }
>  
>  static void sgx_reclaim_pages_global(bool indirect)
> @@ -601,7 +606,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool 
> reclaim)
>   break;
>   }
>  
> - if (list_empty(&sgx_global_lru.reclaimable)) {
> + if (!sgx_can_reclaim()) {
>   page = ERR_PTR(-ENOMEM);
>   break;
>   }

Seems a basic elemental change.  Why did you put this patch almost at the end of
this series and not at an earlier place?

I think one advantage of putting elemental changes at an early place is, if there's
any code change related to these (the code changes sgx_global_lru in this patch)
in any later patch, the updated one can be used.  Otherwise, if you do the elemental
change at a later place, you have to go back and update all the places that were
modified in previous patches.


Re: [PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation

2024-02-21 Thread Huang, Kai
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> Previous patches have implemented all infrastructure needed for
> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
> pages are still tracked in the global LRU as sgx_lru_list() returns hard
> coded reference to the global LRU.
> 
> Change sgx_lru_list() to return the LRU of the cgroup in which the given
> EPC page is allocated.
> 
> This makes all EPC pages tracked in per-cgroup LRUs and the global
> reclaimer (ksgxd) will not be able to reclaim any pages from the global
> LRU. However, in cases of over-committing, i.e., sum of cgroup limits
> greater than the total capacity, cgroups may never reclaim but the total
> usage can still be near the capacity. Therefore global reclamation is
> still needed in those cases and it should reclaim from the root cgroup.
> 
> Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
> when cgroup is enabled, otherwise from the global LRU.
> 
> Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
> cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
> 
> With these changes, the global reclamation and per-cgroup reclamation
> both work properly with all pages tracked in per-cgroup LRUs.
> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 6b0c26cac621..d4265a390ba9 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
>  
>  static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page 
> *epc_page)
>  {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (epc_page->epc_cg)
> + return &epc_page->epc_cg->lru;
> +
> + /* This should not happen if kernel is configured correctly */
> + WARN_ON_ONCE(1);
> +#endif
>   return &sgx_global_lru;
>  }

How about when EPC cgroup is enabled, but one enclave doesn't belong to any EPC
cgroup?  Is it OK to track EPC pages for these enclaves to the root EPC cgroup's
LRU list together with other enclaves belongs to the root cgroup?


This should be a valid case, right?


Re: [PATCH v9 12/15] x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer

2024-02-21 Thread Huang, Kai
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> When cgroup is enabled, all reclaimable pages will be tracked in cgroup
> LRUs. The global reclaimer needs to start reclamation from the root
> cgroup. Expose the top level cgroup reclamation function so the global
> reclaimer can reuse it.
> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V8:
> - Remove unneeded breaks in function declarations. (Jarkko)
> 
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index abf74fdb12b4..6e31f8727b8a 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
>   * @indirect:   In ksgxd or EPC cgroup work queue context.
>   * Return:   Number of pages reclaimed.
>   */
> -static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool 
> indirect)
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool 
> indirect)
>  {
>   /*
>* Attempting to reclaim only a few pages will often fail and is
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index d061cd807b45..5b3e8e1b8630 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct 
> sgx_epc_cgroup *epc_cg, bool
>  static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
>  
>  static inline void sgx_epc_cgroup_init(void) { }
> +
> +static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg 
> *root, bool indirect)
> +{
> + return 0;
> +}
>  #else
>  struct sgx_epc_cgroup {
>   struct misc_cg *cg;
> @@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup 
> *epc_cg)
>  int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
>  void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
>  bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool 
> indirect);
> +
>  void sgx_epc_cgroup_init(void);
>  
>  #endif

I'd just prefer to merge a patch such as this into the one that actually uses the
exposed function.  It's just a couple of LOC and we shouldn't have to read these
repeated changelogs and move back and forth between patches during review.




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-21 Thread Huang, Kai

> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
>  {
> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> + for (;;) {
> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> + PAGE_SIZE))
> + break;
> +
> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> + return -ENOMEM;
> +
> + if (signal_pending(current))
> + return -ERESTARTSYS;
> +
> + if (!reclaim) {
> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> + return -EBUSY;
> + }
> +
> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> + /* All pages were too young to reclaim, try again a 
> little later */
> + schedule();
> + }
> +
> + return 0;
>  }
>  

Seems this code change is 90% similar to the existing code in the
sgx_alloc_epc_page():

...
for ( ; ; ) {
page = __sgx_alloc_epc_page();
if (!IS_ERR(page)) {
page->owner = owner;
break;
}

if (list_empty(&sgx_active_page_list))
return ERR_PTR(-ENOMEM);

if (!reclaim) {
page = ERR_PTR(-EBUSY);
break;
}

if (signal_pending(current)) {
page = ERR_PTR(-ERESTARTSYS);
break;
}

sgx_reclaim_pages();
cond_resched();
}
...

Is it better to move the logic/code change in try_charge() out to
sgx_alloc_epc_page() to unify them?

IIUC, the logic is quite similar: When you either failed to allocate one page,
or failed to charge one page, you try to reclaim EPC page(s) from the current
EPC cgroup, either directly or indirectly.

No?
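
Roughly something like the below (just a sketch to show the shape of the
unification; it assumes the simple one-shot try_charge() without its own loop,
and omits details such as storing page->epc_cg, queueing the async reclaim work
for the !reclaim case, and the sgx_put_epc_cg() on the error paths):

	struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
	{
		struct sgx_epc_cgroup *epc_cg = sgx_get_current_epc_cg();
		struct sgx_epc_page *page;

		for ( ; ; ) {
			if (!sgx_epc_cgroup_try_charge(epc_cg)) {
				page = __sgx_alloc_epc_page();
				if (!IS_ERR(page)) {
					page->owner = owner;
					break;
				}
				/* Charged but no physical page: undo the charge before reclaiming. */
				sgx_epc_cgroup_uncharge(epc_cg);
			}

			if (sgx_epc_cgroup_lru_empty(epc_cg->cg)) {
				page = ERR_PTR(-ENOMEM);
				break;
			}

			if (!reclaim) {
				page = ERR_PTR(-EBUSY);
				break;
			}

			if (signal_pending(current)) {
				page = ERR_PTR(-ERESTARTSYS);
				break;
			}

			/* Reclaim from the current cgroup for both failure cases. */
			sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false);
			cond_resched();
		}

		return page;
	}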


Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-21 Thread Huang, Kai
On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:
> [...]
> > 
> > Here the @nr_to_scan is reduced by the number of pages that are  
> > isolated, but
> > not actually reclaimed (which is reflected by @cnt).
> > 
> > IIUC, looks you want to make this function do "each cycle" as what you  
> > mentioned
> > in the v8 [1]:
> > 
> > I tested with that approach and found we can only target number of
> > pages
> > attempted to reclaim not pages actually reclaimed due to the
> > uncertainty
> > of how long it takes to reclaim pages. Besides targeting number of
> > scanned pages for each cycle is also what the ksgxd does.
> > 
> > If we target actual number of pages, sometimes it just takes too long.
> > I
> > saw more timeouts with the default time limit when running parallel
> > selftests.
> > 
> > I am not sure what does "sometimes it just takes too long" mean, but  
> > what I am
> > thinking is you are trying to do some perfect but yet complicated code  
> > here.
> 
> I think what I observed was that the try_charge() would block too long  
> before getting chance of schedule() to yield, causing more timeouts than  
> necessary.
> I'll do some re-test to be sure.

Looks this is a valid information that can be used to justify whatever you are
implementing in the EPC cgroup reclaiming function(s).

> 
> > 
> > For instance, I don't think selftest reflect the real workload, and I  
> > believe
> > adjusting the limit of a given EPC cgroup shouldn't be a frequent  
> > operation,
> > thus it is acceptable to use some easy-maintain code but less perfect  
> > code.
> > 
> > Here I still think having @nr_to_scan as a pointer is over-complicated.   
> > For
> > example, we can still let sgx_reclaim_pages() to always scan  
> > SGX_NR_TO_SCAN
> > pages, but give up when there's enough pages reclaimed or when the EPC  
> > cgroup
> > and its descendants have been looped:
> > 
> > unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> > {
> > unsigned int cnt = 0;
> > ...
> > 
> > css_for_each_descendant_pre(pos, css_root) {
> > ...
> > epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> > cnt += sgx_reclaim_pages(&epc_cg->lru);
> > 
> > if (cnt >= SGX_NR_TO_SCAN)
> > break;
> > }
> > 
> > ...
> > return cnt;
> > }
> > 
> > Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually  
> > reaches any
> > descendants, but that should be rare and we don't care that much, do we?
> > 
> I assume you meant @cnt here to be number of pages actually reclaimed.  

Yes.

> This could cause  sgx_epc_cgroup_reclaim_pages() block too long as @cnt  
> may always be zero (all pages are too young) and you have to loop all  
> descendants.

I am not sure whether this is a valid point.

For example, your change in patch 10 "x86/sgx: Add EPC reclamation in cgroup
try_charge()" already loops all descendants in below code:

+   if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+   return -ENOMEM;

Anyway, I can see all these can be justification to your design/implementation.
My point is please put these justification in changelog/comments so that we can
actually understand.


Makes sense?


Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-21 Thread Huang, Kai
On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:
> Hi Kai
> On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai  wrote:
> [...]
> > 
> > So you introduced the work/workqueue here but there's no place which  
> > actually
> > queues the work.  IMHO you can either:
> > 
> > 1) move relevant code change here; or
> > 2) focus on introducing core functions to reclaim certain pages from a  
> > given EPC
> > cgroup w/o workqueue and introduce the work/workqueue in later patch.
> > 
> > Makes sense?
> > 
> 
> Starting in v7, I was trying to split the big patch, #10 in v6 as you and  
> others suggested. My thought process was to put infrastructure needed for  
> per-cgroup reclaim in the front, then turn on per-cgroup reclaim in [v9  
> 13/15] in the end.

That's reasonable for sure.

> 
> Before that, all reclaimables are tracked in the global LRU so really  
> there is no "reclaim certain pages from a  given EPC cgroup w/o workqueue"  
> or reclaim through workqueue before that point, as suggested in #2. This  
> patch puts down the implementation for both flows but neither used yet, as  
> stated in the commit message.

I know it's not used yet.  The point is how to split patches to make them more
self-contain and easy to review.

For #2, sorry for not being explicit -- I meant it seems it's more reasonable to
split in this way:

Patch 1)
  a). change to sgx_reclaim_pages();
  b). introduce sgx_epc_cgroup_reclaim_pages();
  c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better name), 
 which just takes an EPC cgroup as input w/o involving any work/workqueue.

These functions are all related to how to implement reclaiming pages from a
given EPC cgroup, and they are logically related in terms of implementation thus
it's easier to be reviewed together.

Then you just need to justify the design/implementation in changelog/comments.

Patch 2) 
  - Introduce work/workqueue, and implement the logic to queue the work.

Now we all know there's a function to reclaim pages for a given EPC cgroup, then
we can focus on when that is called, either directly or indirectly.

> 
> #1 would force me go back and merge the patches again.

I don't think so.  I am not asking to put all things together, but only asking
to split in better way (that I see).

You mentioned some function is "Scheduled by sgx_epc_cgroup_try_charge() to
reclaim pages", but I am not seeing any code doing that in this patch.  This
needs fixing, either by moving relevant code here, or removing these not-done-
yet comments.

For instance (I am just giving an example), if after review we found the
queue_work() shouldn't be done in try_charge(), you will need to go back to this
patch and remove these comments.

That's not the best way.  Each patch needs to be self-contained.

> 
> Sorry I feel kind of lost on this whole thing by now. It seems so random  
> to me. Is there hard rules on this?

See above.  I am only offering my opinion on how to split patches in better way.

> 
> I was hoping these statements would help reviewers on the flow of the  
> patches.
> 
> At the end of [v9 04/15]:
> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
> 
> At the end of [v9 06/15]:
> 
> Next patches will first get the cgroup reclamation flow ready while
> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
> make the switch in the end for sgx_lru_list() to return per-cgroup
> LRU.
> 
> At the end of [v9 08/15]:
> 
> Both synchronous and asynchronous flows invoke the same top level reclaim
> function, and will be triggered later by sgx_epc_cgroup_try_charge()
> when usage of the cgroup is at or near its limit.
> 
> At the end of [v9 10/15]:
> Note at this point, all reclaimable EPC pages are still tracked in the
> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
> is activated yet.

They are useful in the changelog in each patch I suppose, but to me we are
discussing different things.

I found one pain in the review is I have to jump back and forth many times among
multiple patches to see whether one patch is reasonable.  That's why I am asking
whether there's better way to split patches so that each patch can be self-
contained logically in someway and easier to review.




Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Huang, Kai
On Tue, 2024-02-20 at 14:18 +0100, Michal Koutný wrote:
> On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai"  
> wrote:
> > I am not sure, but is it possible or legal for an ancestor to have less 
> > limit
> > than children?
> 
> Why not?
> It is desired for proper hiearchical delegation and the tightest limit
> of ancestors applies (cf memory.max).
> 

OK.  Thanks for the info.


Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Huang, Kai

> +/*
> + * Get the lower bound of limits of a cgroup and its ancestors.  Used in
> + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup 
> is
> + * over its limit or its ancestors' hence reclamation is needed.
> + */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup 
> *epc_cg)
> +{
> + struct misc_cg *i = epc_cg->cg;
> + u64 m = U64_MAX;
> +
> + while (i) {
> + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> + i = misc_cg_parent(i);
> + }
> +
> + return m / PAGE_SIZE;
> +}

I am not sure, but is it possible or legal for an ancestor to have less limit
than children?

> +
>  /**
> - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its 
> LRUs
> + * @root:Root of the tree to check
>   *
> + * Return: %true if all cgroups under the specified root have empty LRU 
> lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> +{
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + bool ret = true;
> +
> + /*
> +  * Caller ensure css_root ref acquired
> +  */
> + css_root = &root->css;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> +
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> + spin_lock(&epc_cg->lru.lock);
> + ret = list_empty(&epc_cg->lru.reclaimable);
> + spin_unlock(&epc_cg->lru.lock);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!ret)
> + break;
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to 
> reclaim pages
> + * @root:Root of the tree to start walking from.
> + * Return:   Number of pages reclaimed.

Just wondering, do you need to return @cnt given this function is called w/o
checking the return value?

> + */
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> +{
> + /*
> +  * Attempting to reclaim only a few pages will often fail and is
> +  * inefficient, while reclaiming a huge number of pages can result in
> +  * soft lockups due to holding various locks for an extended duration.
> +  */

Not sure we need this comment, given it's already implied in
sgx_reclaim_pages().  You cannot pass a value > SGX_NR_TO_SCAN anyway.

> + unsigned int nr_to_scan = SGX_NR_TO_SCAN;
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + unsigned int cnt;
> +
> +  /* Caller ensure css_root ref acquired */
> + css_root = &root->css;
> +
> + cnt = 0;
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> + cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!nr_to_scan)
> + break;
> + }
> +
> + rcu_read_unlock();
> + return cnt;
> +}

Here the @nr_to_scan is reduced by the number of pages that are isolated, but
not actually reclaimed (which is reflected by @cnt).

IIUC, looks you want to make this function do "each cycle" as what you mentioned
in the v8 [1]:

I tested with that approach and found we can only target number of
pages  
attempted to reclaim not pages actually reclaimed due to the
uncertainty  
of how long it takes to reclaim pages. Besides targeting number of
scanned pages for each cycle is also what the ksgxd does.

If we target actual number of pages, sometimes it just takes too long.
I
saw more timeouts with the default time limit when running parallel  
selftests.

I am not sure what "sometimes it just takes too long" means, but what I am
thinking is you are trying to do some perfect yet complicated code here.

For instance, I don't think the selftests reflect the real workload, and I believe
adjusting the limit of a given EPC cgroup shouldn't be a frequent operation,
thus it is acceptable to use easier-to-maintain but less perfect code.

Here I still think having @nr_to_scan as a pointer is over-complicated.  For
example, we can still let sgx_reclaim_pages() always scan SGX_NR_TO_SCAN
pages, but give up when enough pages have been reclaimed or when the EPC cgroup
and its descendants have been looped over.

Re: [PATCH v9 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup

2024-02-20 Thread Huang, Kai
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> Each EPC cgroup will have an LRU structure to track reclaimable EPC pages.
> When a cgroup usage reaches its limit, the cgroup needs to reclaim pages
> from its LRU or LRUs of its descendants to make room for any new
> allocations.
> 
> To prepare for reclamation per cgroup, expose the top level reclamation
> function, sgx_reclaim_pages(), in header file for reuse. Add a parameter
> to the function to pass in an LRU so cgroups can pass in different
> tracking LRUs later.  
> 

[...]

> Add another parameter for passing in the number of
> pages to scan and make the function return the number of pages reclaimed
> as a cgroup reclaimer may need to track reclamation progress from its
> descendants, change number of pages to scan in subsequent calls.

Firstly, sorry for late reply as I was away.  

From the changelog, it's understandable you want to make this function return
pages that are actually reclaimed, and perhaps it's also OK to pass the number
of pages to scan.

But this doesn't explain why you need to make @nr_to_scan as a pointer, while
you are returning the number of pages that are actually reclaimed?

And ...

[...]

> -/*
> - * Take a fixed number of pages from the head of the active page pool and
> - * reclaim them to the enclave's private shmem files. Skip the pages, which 
> have
> - * been accessed since the last scan. Move those pages to the tail of active
> - * page pool so that the pages get scanned in LRU like fashion.
> +/**
> + * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU
> + *
> + * Take a fixed number of pages from the head of a given LRU and reclaim 
> them to
> + * the enclave's private shmem files. Skip the pages, which have been 
> accessed
> + * since the last scan. Move those pages to the tail of the list so that the
> + * pages get scanned in LRU like fashion.
>   *
>   * Batch process a chunk of pages (at the moment 16) in order to degrade 
> amount

... there's no comment to explain such design either (@nr_to_scan as a pointer).

Btw, with this change, seems "Take a fixed number of pages ..." and "at the
moment 16" are not accurate any more.

>   * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a 
> bit
> @@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page 
> *epc_page,
>   * + EWB) but not sufficiently. Reclaiming one page at a time would also be
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
> + *
> + * @lru: The LRU from which pages are reclaimed.
> + * @nr_to_scan: Pointer to the target number of pages to scan, must be less 
> than
> + *   SGX_NR_TO_SCAN.
> + * Return:   Number of pages reclaimed.
>   */
> -static void sgx_reclaim_pages(void)
> +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int 
> *nr_to_scan)
>  {
>   struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>   struct sgx_backing backing[SGX_NR_TO_SCAN];
> @@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void)
>   int ret;
>   int i;
>  
> - spin_lock(&sgx_global_lru.lock);
> - for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
> - struct sgx_epc_page, list);
> + spin_lock(&lru->lock);
> +
> + for (; *nr_to_scan > 0; --(*nr_to_scan)) {
> + epc_page = list_first_entry_or_null(&lru->reclaimable, struct 
> sgx_epc_page, list);
>   if (!epc_page)
>   break;
>  
> @@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void)
>*/
>   epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>   }
> - spin_unlock(&sgx_global_lru.lock);
> +
> + spin_unlock(&lru->lock);
>  
>   for (i = 0; i < cnt; i++) {
>   epc_page = chunk[i];
> @@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void)
>   continue;
>  
>  skip:
> - spin_lock(&sgx_global_lru.lock);
> - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> - spin_unlock(&sgx_global_lru.lock);
> + spin_lock(&lru->lock);
> + list_add_tail(&epc_page->list, &lru->reclaimable);
> + spin_unlock(&lru->lock);
>  
>   kref_put(&epc_page->encl->refcount, sgx_encl_release);
>  
> @@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void)
>   sgx_reclaimer_block(epc_page);
>   }
>  
> + ret = 0;
>   for (i = 0; i < cnt; i++) {
>   epc_page = chunk[i];
>   if (!epc_page)
> @@ -378,7 +387,10 @@ static void sgx_reclaim_pages(void)
>   epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>  
>   sgx_free_epc_page(epc_page);
> + ret++;
>   }
> +
> + return (unsigned int)ret;
>  }
>  

Here basically the @nr_to_scan is reduced by the number of pages that are
scanned (taken off the LRU), not by the number of pages that are actually
reclaimed.

Re: [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

2024-02-19 Thread Huang, Kai





RE: [PATCH v8 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup

2024-01-30 Thread Huang, Kai
> + * @lru: The LRU from which pages are reclaimed.
> + * @nr_to_scan: Pointer to the target number of pages to scan, must be less than
> + *   SGX_NR_TO_SCAN.
> + * Return:   Number of pages reclaimed.
>   */
> -static void sgx_reclaim_pages(void)
> +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan)

Since the function is now returning the number of reclaimed pages, why do you
need to make @nr_to_scan a pointer?

Cannot the caller just adjust @nr_to_scan when calling this function, based on
how many pages have been reclaimed?

I am not even sure whether you need @nr_to_scan at all, because as we discussed
I think it's just extremely rare that you need to pass "< SGX_NR_TO_SCAN" to this
function.

Even if you need, you can always choose to try to reclaim SGX_NR_TO_SCAN pages.
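
For illustration only, a minimal sketch of that alternative (not the actual
patch; the caller fragment below is hypothetical, and whether the caller should
subtract pages scanned or pages actually reclaimed is a policy choice not
settled here):

unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan);

	/* hypothetical caller walking several LRUs toward one overall target */
	unsigned int cnt = sgx_reclaim_pages(lru, nr_to_scan);

	nr_to_scan -= min(cnt, nr_to_scan);	/* adjust the remaining target */
	if (!nr_to_scan)
		break;				/* target met, stop walking LRUs */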

[...]

> 
> +static void sgx_reclaim_pages_global(void)
> +{
> + unsigned int nr_to_scan = SGX_NR_TO_SCAN;
> +
> + sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan);
> +}
> +

I think this function doesn't look sane at all when you have @nr_to_scan being 
a pointer?

I am also not sure whether this function is needed -- if we don't add 
@nr_to_scan to sgx_reclaim_pages(), then this function is basically:

sgx_reclaim_pages(&sgx_global_lru);



RE: [PATCH v8 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

2024-01-30 Thread Huang, Kai
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
> + struct sgx_epc_cgroup *epc_cg;
>   struct sgx_epc_page *page;
> + int ret;
> +
> + epc_cg = sgx_get_current_epc_cg();
> + ret = sgx_epc_cgroup_try_charge(epc_cg);
> + if (ret) {
> + sgx_put_epc_cg(epc_cg);
> + return ERR_PTR(ret);
> + }
> 
>   for ( ; ; ) {
>   page = __sgx_alloc_epc_page();
> @@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> *owner, bool reclaim)
>   break;
>   }
> 
> - if (list_empty(&sgx_active_page_list))
> - return ERR_PTR(-ENOMEM);
> + if (list_empty(&sgx_active_page_list)) {
> + page = ERR_PTR(-ENOMEM);
> + break;
> + }

(Sorry for replying from Outlook because I am in travel for Chinese New Year.)

Perhaps I am missing something but I don't understand this change.

An empty sgx_active_page_list means you cannot reclaim any page from it, so why
do you need to break?

> 
>   if (!reclaim) {
>   page = ERR_PTR(-EBUSY);
> @@ -580,10 +593,25 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> *owner, bool reclaim)
>   break;
>   }
> 
> + /*
> +  * Need to do a global reclamation if cgroup was not full but free
> +  * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> +  */
>   sgx_reclaim_pages();
>   cond_resched();
>   }

And why add this comment, especially in this patch?

I don't see how it brings additional clarity, because there's only global reclaim
now, no matter whether cgroup is enabled or not.



Re: [PATCH v7 09/15] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-01-26 Thread Huang, Kai

> 
> Signed-off-by: Haitao Huang 
> Reported-by: Mikko Ylinen 
> ---

Non-technical stuff:

I believe checkpatch requires you to have a "Closes" tag after "Reported-by"
otherwise it complains something like this:

  WARNING: Reported-by: should be immediately followed by Closes: with a URL
  to the report

Not sure how strict this rule is, but it seems you forgot to run checkpatch, so
just a reminder.
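
For reference, the ordering checkpatch expects looks like this (the address and
report URL are elided placeholders here, not real values):

	Reported-by: Mikko Ylinen <...>
	Closes: https://lore.kernel.org/... (link to the original report)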


Re: [PATCH] x86/sgx: fix kernel-doc comment misuse

2023-12-17 Thread Huang, Kai
On Sat, 2023-12-16 at 09:16 -0800, Randy Dunlap wrote:
> Don't use "/**" for a non-kernel-doc comment. This prevents a warning
> from scripts/kernel-doc:
> 
> main.c:740: warning: expecting prototype for A section metric is concatenated 
> in a way that @low bits 12(). Prototype was for sgx_calc_section_metric() 
> instead
> 
> Signed-off-by: Randy Dunlap 
> Cc: Jarkko Sakkinen 
> Cc: Dave Hansen 
> Cc: linux-...@vger.kernel.org
> Cc: x...@kernel.org

Reviewed-by: Kai Huang 

> ---
>  arch/x86/kernel/cpu/sgx/main.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff -- a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -731,7 +731,7 @@ out:
>   return 0;
>  }
>  
> -/**
> +/*
>   * A section metric is concatenated in a way that @low bits 12-31 define the
>   * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the
>   * metric.
> 



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-17 Thread Huang, Kai

> > 
> > The point is, with or w/o this patch, you can only reclaim 16 EPC pages  
> > in one
> > function call (as you have said you are going to remove  
> > SGX_NR_TO_SCAN_MAX,
> > which is a cipher to both of us).  The only difference I can see is,  
> > with this
> > patch, you can have multiple calls of "isolate" and then call the  
> > "do_reclaim"
> > once.
> > 
> > But what's the good of having the "isolate" if the "do_reclaim" can only  
> > reclaim
> > 16 pages anyway?
> > 
> > Back to my last reply, are you afraid of any LRU has less than 16 pages  
> > to
> > "isolate", therefore you need to loop LRUs of descendants to get 16?   
> > Cause I
> > really cannot think of any other reason why you are doing this.
> > 
> > 
> 
> I think I see your point. By capping pages reclaimed per cycle to 16,
> there is not much difference even if those 16 pages are spread across
> separate LRUs. The difference only becomes significant if we ever raise
> that cap. To keep the current behavior, where the EWB loop is independent
> of the number of LRUs looped through in each reclaiming cycle, regardless
> of the exact value of the page cap, I still think the current approach in
> the patch is a reasonable choice.  What do you think?

To me I won't bother to do that.  Having less than 16 pages in one LRU is
*extremely rare* and should never happen in practice.  It's pointless to make
such a code adjustment at this stage.

Let's focus on enabling functionality first.  When you have some real
performance issue that is related to this, we can come back then.

Btw, I think you need to step back even further.  IIUC the whole multiple LRU
thing isn't mandatory in this initial support.

Please (again) take a look at the comments from Dave and Michal:

https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df...@intel.com/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/


Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-13 Thread Huang, Kai
On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
> Hi Kai
> 
> On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai  wrote:
> 
> > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai   
> > > wrote:
> > > 
> > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > > > From: Sean Christopherson 
> > > > > 
> > > > > To prepare for per-cgroup reclamation, separate the top-level  
> > > reclaim
> > > > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > > > 
> > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from  
> > > a
> > > > > given LRU list.
> > > > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > > > already isolated pages.
> > > > > 
> > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling  
> > > those two
> > > > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > > > above two functions will serve as building blocks for the  
> > > reclamation
> > > > > flows in later EPC cgroup implementation.
> > > > > 
> > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The  
> > > EPC
> > > > > cgroup will use the result to track reclaiming progress.
> > > > > 
> > > > > sgx_isolate_epc_pages() returns the additional number of pages to  
> > > scan
> > > > > for current epoch of reclamation. The EPC cgroup will use the  
> > > result to
> > > > > determine if more scanning to be done in LRUs in its children  
> > > groups.
> > > > 
> > > > This changelog says nothing about "why", but only mentions the
> > > > "implementation".
> > > > 
> > > > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > > > 
> > > > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > > > 	if (npages_to_reclaim <= 0)
> > > > 		return;
> > > > 
> > > > 	npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > > > 						   npages_to_reclaim);
> > > > }
> > > > 
> > > > Is there any difference to have "isolate" + "reclaim"?
> > > > 
> > > 
> > > This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> > > Here we just follow the same design as ksgxd for each reclamation cycle.
> > 
> > I don't see how did you "follow" ksgxd.  If I am guessing correctly, you  
> > are
> > afraid of there might be less than 16 pages in a given EPC cgroup, thus  
> > w/o
> > splitting into "isolate" + "reclaim" you might feed the "reclaim" less  
> > than 16
> > pages, which might cause some performance degrade?
> > 
> > But is this a common case?  Should we even worry about this?
> > 
> > I suppose for such new feature we should bring functionality first and  
> > then
> > optimization if you have real performance data to show.
> > 
> The concern is not about "reclaim less than 16".
> I mean this is just refactoring with exactly the same design of ksgxd  
> preserved, 
> 

I literally have no idea what you are talking about here.  ksgxd() just calls
sgx_reclaim_pages(), which tries to reclaim 16 pages at once.

> in that we first isolate as many candidate EPC pages as possible (up to
> 16; ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then do the EWB in
> one shot without anything else done in between.
> 

Assuming you are referring the implementation of sgx_reclaim_pages(), and
assuming the "isolate" you mean removing EPC pages from the list (which is
exactly what the sgx_isolate_epc_pages() in this patch does), what happens to
the loops of "backing store allocation" and "EBLOCK", before the loop of EWB? 
Eaten by you?


> As described in the original comments for the functions sgx_reclaim_pages
> and sgx_encl_ewb, this is to finish all EWBs quickly while minimizing the
> impact of IPIs.
> 
> The way you proposed will work but alters the current design and behavior
> when cgroups are enabled and the EPC pages of an enclave are tracked across
> multiple LRUs within the descendant cgroups, in that you will have the
> isolation loop, backing store allocation loop, and EBLOCK loop interleaved
> with the EWB loop.
> 

I have no idea what you are talking about.

The point is, with or w/o this patch, you can only reclaim 16 EPC pages in one
function call (as you have said you are going to remove SGX_NR_TO_SCAN_MAX,
which is a cipher to both of us).  The only difference I can see is, with this
patch, you can have multiple calls of "isolate" and then call the "do_reclaim"
once.

But what's the good of having the "isolate" if the "do_reclaim" can only reclaim
16 pages anyway?

Back to my last reply, are you afraid of any LRU has less than 16 pages to
"isolate", therefore you need to loop LRUs of descendants to get 16?  Cause I
really cannot think of any other reason why you are doing this.


> > 



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-11-27 Thread Huang, Kai
On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai  wrote:
> 
> > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson 
> > > 
> > > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > 
> > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a  
> > > given LRU list.
> > > - sgx_do_epc_reclamation() performs the real reclamation for the  
> > > already isolated pages.
> > > 
> > > Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > above two functions will serve as building blocks for the reclamation
> > > flows in later EPC cgroup implementation.
> > > 
> > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> > > cgroup will use the result to track reclaiming progress.
> > > 
> > > sgx_isolate_epc_pages() returns the additional number of pages to scan
> > > for current epoch of reclamation. The EPC cgroup will use the result to
> > > determine if more scanning to be done in LRUs in its children groups.
> > 
> > This changelog says nothing about "why", but only mentions the  
> > "implementation".
> > 
> > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > 
> > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > 	if (npages_to_reclaim <= 0)
> > 		return;
> > 
> > 	npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > 						   npages_to_reclaim);
> > }
> > 
> > Is there any difference to have "isolate" + "reclaim"?
> > 
> 
> This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> Here we just follow the same design as ksgxd for each reclamation cycle.

I don't see how did you "follow" ksgxd.  If I am guessing correctly, you are
afraid of there might be less than 16 pages in a given EPC cgroup, thus w/o
splitting into "isolate" + "reclaim" you might feed the "reclaim" less than 16
pages, which might cause some performance degrade?

But is this a common case?  Should we even worry about this?

I suppose for such new feature we should bring functionality first and then
optimization if you have real performance data to show.

> 
> > > 
> > > Signed-off-by: Sean Christopherson 
> > > Co-developed-by: Kristen Carlson Accardi 
> > > Signed-off-by: Kristen Carlson Accardi 
> > > Co-developed-by: Haitao Huang 
> > > Signed-off-by: Haitao Huang 
> > > Cc: Sean Christopherson 
> > > ---
> > > 
> > 
> > [...]
> > 
> > > +/**
> > > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> > > + * @iso: List of isolated pages for reclamation
> > > + *
> > > + * Take a list of EPC pages and reclaim them to the enclave's private shmem files.  Do not
> > > + * reclaim the pages that have been accessed since the last scan, and move each of those pages
> > > + * to the tail of its tracking LRU list.
> > > + *
> > > + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
> > > + * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
> > > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
> > > + * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
> > > + * the lock contention too much, which would halt forward progress.
> > 
> > This is kinda optimization, correct?  Is there any real performance data  
> > to
> > justify this?
> 
> The above sentences were there originally. This optimization was justified.

I am talking about 16 -> 32.

> 
> > If this optimization is useful, shouldn't we bring this
> > optimization to the current sgx_reclaim_pages() instead, e.g., just  
> > increase
> > SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
> > 
> 
> SGX_NR_TO_SCAN_MAX might have been designed earlier for other reasons I don't
> know. Currently it is really just the buffer size allocated in
> sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages at a time. Maybe we
> should just use SGX_NR_TO_SCAN; no _MAX needed. The point was to batch
> reclamation to a certain number to minimize the impact of the EWB pipeline.
> 16 was the original design.
> 

Please don't leave it to the reviewers to figure out why you are trying to do
this.  If you don't know, then just drop this.



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-11-19 Thread Huang, Kai
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> To prepare for per-cgroup reclamation, separate the top-level reclaim
> function, sgx_reclaim_epc_pages(), into two separate functions:
> 
> - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a given 
> LRU list.
> - sgx_do_epc_reclamation() performs the real reclamation for the already 
> isolated pages.
> 
> Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> in succession, to replace the original sgx_reclaim_epc_pages(). The
> above two functions will serve as building blocks for the reclamation
> flows in later EPC cgroup implementation.
> 
> sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> cgroup will use the result to track reclaiming progress.
> 
> sgx_isolate_epc_pages() returns the additional number of pages to scan
> for current epoch of reclamation. The EPC cgroup will use the result to
> determine if more scanning to be done in LRUs in its children groups.

This changelog says nothing about "why", but only mentions the "implementation".

For instance, assuming we need to reclaim @npages_to_reclaim from the
@epc_cgrp_to_reclaim and its descendants, why cannot we do:

for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
	if (npages_to_reclaim <= 0)
		return;

	npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
						   npages_to_reclaim);
}

Is there any difference to have "isolate" + "reclaim"?

> 
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Cc: Sean Christopherson 
> ---
> 

[...]

> +/**
> + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> + * @iso: List of isolated pages for reclamation
> + *
> + * Take a list of EPC pages and reclaim them to the enclave's private shmem files.  Do not
> + * reclaim the pages that have been accessed since the last scan, and move each of those pages
> + * to the tail of its tracking LRU list.
> + *
> + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
> + * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
> + * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
> + * the lock contention too much, which would halt forward progress.

This is kinda optimization, correct?  Is there any real performance data to
justify this?  If this optimization is useful, shouldn't we bring this
optimization to the current sgx_reclaim_pages() instead, e.g., just increase
SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?




Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-19 Thread Huang, Kai
> > > 
> > 
> > That's true. I was thinking no need to have them done in separate calls.  
> > The caller has to check the return value for epc_cg instance first, then  
> > check result of try_charge. But there is really only one caller,  
> > sgx_alloc_epc_page() below, so I don't have strong opinions now.
> > 
> > With them separate, the checks will look like this:
> > if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled, should continue for allocation
> > {
> > 	if (ret = sgx_epc_cgroup_try_charge())
> > 		return ret
> > }
> > // continue...
> > 
> > I can go either way.

Let's keep this aligned with other _try_charge() variants: return 'int' to
indicate whether the charge is done or not.

> > 
> > > 
> > > > > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > > > > >  {
> > > > > > struct sgx_epc_page *page;
> > > > > > +   struct sgx_epc_cgroup *epc_cg;
> > > > > > +
> > > > > > +   epc_cg = sgx_epc_cgroup_try_charge();
> > > > > > +   if (IS_ERR(epc_cg))
> > > > > > +   return ERR_CAST(epc_cg);
> > > > > > 
> > > > > > for ( ; ; ) {
> > > > > > page = __sgx_alloc_epc_page();
> > > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> > > > > > *owner, bool reclaim)
> > > > > > break;
> > > > > > }
> > > > > > 
> > > > > > +   /*
> > > > > > +* Need to do a global reclamation if cgroup was not 
> > > > > > full but  
> > > > free
> > > > > > +* physical pages run out, causing 
> > > > > > __sgx_alloc_epc_page() to  
> > > > fail.
> > > > > > +*/
> > > > > > sgx_reclaim_pages();
> > > > > 
> > > > > What's the final behaviour?  IIUC it should be reclaiming from the
> > > > > *current* EPC
> > > > > cgroup?  If so shouldn't we just pass the @epc_cg to it here?
> > > > > 
> > > > > I think we can make this patch as "structure" patch w/o actually  
> > > > having
> > > > > EPC
> > > > > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
> > > > > 
> > > > > And we can have one patch to change sgx_reclaim_pages() to take the
> > > > > 'struct
> > > > > sgx_epc_lru_list *' as argument:
> > > > > 
> > > > >   void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> > > > >   {
> > > > >   ...
> > > > >   }
> > > > > 
> > > > > Then here we can have something like:
> > > > > 
> > > > >   void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > > > >   {
> > > > >   struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : &sgx_global_lru;
> > > > > 
> > > > >   sgx_reclaim_pages_lru(lru);
> > > > >   }
> > > > > 
> > > > > Makes sense?
> > > > > 
> > > > 
> > > > This is purely global reclamation. No cgroup involved.
> > > 
> > > Again why?  Here you are allocating one EPC page for enclave in a  
> > > particular EPC
> > > cgroup.  When that fails, shouldn't you try only to reclaim from the  
> > > *current*
> > > EPC cgroup?  Or at least you should try to reclaim from the *current*  
> > > EPC cgroup
> > > first?
> > > 
> > 
> > Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup  
> > reclaims synchronously, otherwise in background and returns -EBUSY in  
> > that case. This function also returns if no valid epc_cg pointer  
> > returned.
> > 
> > All reclamation for *current* cgroup is done in sgx_epc_cg_try_charge().

This is fine, but I believe my question above is about where to reclaim when
"allocation" fails,  but not "try charge" fails.

And for "reclaim for current cgroup when charge fails", I don't think its even
necessary in this initial implementation of EPC cgroup.  You can just fail the
allocation when charge fails (reaching the limit).  Trying to reclaim when limit
is hit can be done later.

Please see Dave and Michal's replies here:

https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df...@intel.com/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/

> > 
> > So, by reaching this point, a valid epc_cg pointer was returned, which
> > means allocation is allowed for the cgroup (it has reclaimed if
> > necessary, and its usage is not above the limit after charging).

I found memory cgroup uses different logic -- allocation first and then charge:

For instance:

static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
..

folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); 
if (!folio)
goto oom;

if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))  
goto oom_free_page;

.. 
}

Why EPC needs to "charge first" and "then allocate"?
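
To make the comparison concrete, below is a minimal sketch (not something from
the series; the helper names follow the code quoted earlier in this thread, and
the existing retry/reclaim loop plus most error handling are elided) of what the
memcg-style "allocate first, then charge" ordering would look like for EPC:

struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
	struct sgx_epc_cgroup *epc_cg;
	struct sgx_epc_page *page;

	/* Allocate first; the existing retry/global-reclaim loop is elided here. */
	page = __sgx_alloc_epc_page();
	if (IS_ERR(page))
		return page;

	/* Then charge, and back the allocation out if the charge fails. */
	epc_cg = sgx_get_current_epc_cg();
	if (epc_cg && sgx_epc_cgroup_try_charge(epc_cg)) {
		put_misc_cg(epc_cg->cg);
		sgx_free_epc_page(page);
		return ERR_PTR(-ENOMEM);
	}

	page->epc_cg = epc_cg;
	return page;
}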
 
> > 
> > But the system level free count may be low 

RE: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-07 Thread Huang, Kai
> I should have stuck to the original comment added in the code. Actually
> __sgx_alloc_epc_page() can fail if the system runs out of EPC. That's the real
> reason for the global reclaim. The free count enforcement is near the end of
> this method, after the should_reclaim() check.

Hi Haitao,

Sorry I have something else to do at this moment and will continue this series 
next week.


Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Huang, Kai
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> +static int __init sgx_epc_cgroup_init(void)
> +{
> + struct misc_cg *cg;
> +
> + if (!boot_cpu_has(X86_FEATURE_SGX))
> + return 0;
> +
> + cg = misc_cg_root();
> + BUG_ON(!cg);
> +
> + return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);

This should be called from sgx_init(), which is the place to init SGX related
stuff.  In case you didn't notice, sgx_init() is actually a device_initcall(),
which is called _after_ the subsys_initcall() you used above.
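
A minimal sketch of the suggested wiring (assuming sgx_epc_cgroup_init() keeps
the prototype quoted above; everything else in sgx_init() is elided):

static int __init sgx_init(void)
{
	int ret;

	/* ... existing EPC section and ksgxd setup ... */

	ret = sgx_epc_cgroup_init();
	if (ret)
		return ret;

	/* ... remaining driver/provisioning setup ... */
	return 0;
}
device_initcall(sgx_init);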


Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Huang, Kai
> 
> > > +/**
> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single  
> > > EPC page
> > > + *
> > > + * Returns EPC cgroup or NULL on success, -errno on failure.
> > > + */
> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> > > +{
> > > + struct sgx_epc_cgroup *epc_cg;
> > > + int ret;
> > > +
> > > + if (sgx_epc_cgroup_disabled())
> > > + return NULL;
> > > +
> > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> > > +
> > > + if (!ret) {
> > > + /* No epc_cg returned, release ref from get_current_misc_cg() */
> > > + put_misc_cg(epc_cg->cg);
> > > + return ERR_PTR(-ENOMEM);
> > 
> > misc_cg_try_charge() returns 0 when successfully charged, no?
> 
> Right. I really made some mess in rebasing :-(
> 
> > 
> > > + }
> > > +
> > > + /* Ref released in sgx_epc_cgroup_uncharge() */
> > > + return epc_cg;
> > > +}
> > 
> > IMHO the above _try_charge() returning a pointer of EPC cgroup is a  
> > little bit
> > odd, because it doesn't match the existing misc_cg_try_charge() which  
> > returns
> > whether the charge is successful or not.  sev_misc_cg_try_charge()  
> > matches
> > misc_cg_try_charge() too.
> > 
> > I think it's better to split "getting EPC cgroup" part out as a separate  
> > helper,
> > and make this _try_charge() match existing pattern:
> > 
> > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
> > {
> > if (sgx_epc_cgroup_disabled())
> > return NULL;
> > 
> > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > }
> > 
> > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> > {
> > if (!epc_cg)
> > return -EINVAL;
> > 
> > return misc_cg_try_charge(epc_cg->cg);
> > }
> > 
> > Having sgx_get_current_epc_cg() also makes the caller easier to read,  
> > because we
> > can immediately know we are going to charge the *current* EPC cgroup,  
> > but not
> > some cgroup hidden within sgx_epc_cgroup_try_charge().
> > 
> 
> Actually, unlike other misc controllers, we need to charge and get the
> epc_cg reference at the same time.
> 

Can you elaborate?

And in practice you always call sgx_epc_cgroup_try_charge() right after
sgx_get_current_epc_cg() anyway.  The only difference is the whole thing is done
in one function or in separate functions.
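
For example (a sketch only, mirroring the split helpers suggested above; error
handling is simplified and the NULL check covers the cgroup-disabled case), the
caller side would read:

	epc_cg = sgx_get_current_epc_cg();
	if (epc_cg) {		/* NULL means the misc cgroup support is disabled */
		ret = sgx_epc_cgroup_try_charge(epc_cg);
		if (ret) {
			put_misc_cg(epc_cg->cg);
			return ERR_PTR(ret);
		}
	}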

[...]


> > >  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > >  {
> > >   struct sgx_epc_page *page;
> > > + struct sgx_epc_cgroup *epc_cg;
> > > +
> > > + epc_cg = sgx_epc_cgroup_try_charge();
> > > + if (IS_ERR(epc_cg))
> > > + return ERR_CAST(epc_cg);
> > > 
> > >   for ( ; ; ) {
> > >   page = __sgx_alloc_epc_page();
> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void  
> > > *owner, bool reclaim)
> > >   break;
> > >   }
> > > 
> > > + /*
> > > +  * Need to do a global reclamation if cgroup was not full but 
> > > free
> > > +  * physical pages run out, causing __sgx_alloc_epc_page() to 
> > > fail.
> > > +  */
> > >   sgx_reclaim_pages();
> > 
> > What's the final behaviour?  IIUC it should be reclaiming from the  
> > *current* EPC
> > cgroup?  If so shouldn't we just pass the @epc_cg to it here?
> > 
> > I think we can make this patch as "structure" patch w/o actually having  
> > EPC
> > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
> > 
> > And we can have one patch to change sgx_reclaim_pages() to take the  
> > 'struct
> > sgx_epc_lru_list *' as argument:
> > 
> > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> > {
> > ...
> > }
> > 
> > Then here we can have something like:
> > 
> > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > {
> > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : &sgx_global_lru;
> > 
> > sgx_reclaim_pages_lru(lru);
> > }
> > 
> > Makes sense?
> > 
> 
> This is purely global reclamation. No cgroup involved. 
> 

Again why?  Here you are allocating one EPC page for enclave in a particular EPC
cgroup.  When that fails, shouldn't you try only to reclaim from the *current*
EPC cgroup?  Or at least you should try to reclaim from the *current* EPC cgroup
first?

> You can see it later in the changes in patch 10/12. For now I just add a
> comment there but make no real changes. Cgroup reclamation will be done as
> part of the _try_charge() call.
> 
> > >   cond_resched();
> > >   }
> > > 
> > > + if (!IS_ERR(page)) {
> > > + WARN_ON_ONCE(page->epc_cg);
> > > + page->epc_cg = epc_cg;
> > > + } else {
> > > + sgx_epc_cgroup_uncharge(epc_cg);
> > > + }
> > > +
> > >   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> > >   wake_up(&ksgxd_waitq);
> > > 
> > > @@ -604,6 

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-06 Thread Huang, Kai
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
> 
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.

Nit:

The above two paragraphs talk about what EPC is and that EPC resource control
needs to be done separately, etc., but IMHO they lack some background about "why"
EPC resource control is needed, e.g., from a use case's perspective.

> 
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".

Please separate what the current misc cgroup provides, and how this patch is
going to utilize it.

Please describe the changes in imperative mood. E.g, "report total EPC memory
via ...", instead of "... is reported via ...".

> 
> This patch was modified from the previous version to only add basic EPC
> cgroup structure, accounting allocations for cgroup usage
> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.

This isn't changelog material.  Please focus on describing the high level design
and why you chose such design.

> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller"  and restructure it (Kai)
> ---
>  arch/x86/Kconfig |  13 
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++
>  arch/x86/kernel/cpu/sgx/main.c   |  28 
>  arch/x86/kernel/cpu/sgx/sgx.h|   3 +
>  6 files changed, 184 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
> If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for 
> Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> +   Provides control over the EPC footprint of tasks in a cgroup via
> +   the Miscellaneous cgroup controller.
> +
> +   EPC is a subset of regular memory that is usable only by SGX
> +   enclaves and is very limited in quantity, e.g. less than 1%
> +   of total DRAM.
> +
> +   Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>   bool "X86 userspace shadow stack"
>   depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile 
> b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>   ioctl.o \
>   main.o
>  obj-$(CONFIG_X86_SGX_KVM)+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)+= epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index ..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include 
> +#include 
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct 
> misc_cg *cg)
> +{
> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> + return !cgroup_subsys_enabled(misc_cgrp_subsys);

From below, 

Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote:
> On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai  wrote:
> [...]
> 
> > still need to fix the bug mentioned above here.
> > 
> > I really think you should just go this simple way:
> > 
> > When you want to take EPC back from VM, kill the VM.
> > 
> 
> My only concern is that this is a compromise due to current limitation (no  
> other sane way to take EPC from VMs). If we define this behavior and it  
> becomes a contract to user space, then we can't change it in the future.

Why do we need to "define such behaviour"?

This isn't some kind of kernel/userspace ABI IMHO, but only kernel-internal
implementation.  Here VM is similar to normal host enclaves.  You limit the
resource, some host enclaves could be killed.  Similarly, VM could also be
killed too.

And supporting VMM EPC oversubscription doesn't mean VM won't be killed.  The VM
can still be a target to kill after VM's all EPC pages have been swapped out.

> 
> On the other hand, my understanding is that the reason you want this
> behavior is to enforce the EPC limit at runtime.
> 

No I just thought this is a bug/issue needs to be fixed.  If anyone believes
this is not a bug/issue then it's a separate discussion.



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
> 
> 
>  From this perspective, I think the current implementation is  
> "well-defined": EPC cgroup limits for VMs are only enforced at VM launch  
> time, not runtime. In practice,  SGX VM can be launched only with fixed  
> EPC size and all those EPCs are fully committed to the VM once launched.  
> Because of that, I imagine people are using VMs to primarily partition the  
> physical EPCs, i.e, the static size itself is the 'limit' for the workload  
> of a single VM and not expecting EPCs taken away at runtime.
> 
> So killing does not really add much value for the existing usages IIUC.

It's not about adding value to the existing usages, it's about fixing the issue
when we lower the EPC limit to a point that is less than total virtual EPC size.

It's a design issue, or simply a bug in the current implementation we need to
fix.

> 
> That said, I don't anticipate adding the enforcement of killing VMs at  
> runtime would break such usages as admin/user can simply choose to set the  
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and get
> some feedback from the community before we decide if that needs to be
> included in the first version or we can skip it until we have EPC
> reclaiming for VMs.

I don't understand what is the "add-on" patch you are talking about.

I mentioned the "typical deployment" thing because it can help us understand which
algorithm is the better way to select the victim.  But no matter what we choose, we
still need to fix the bug mentioned above here.

I really think you should just go this simple way: 

When you want to take EPC back from VM, kill the VM.

Another bad thing about "just removing EPC pages from VM" is that the enclaves in
the VM would suffer a "sudden loss of EPC", or even worse, suffer it at a high
frequency.  Although we depend on that for supporting SGX VM live migration, it
needs to be avoided if possible.



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
On Wed, 2023-10-11 at 01:14 +, Huang, Kai wrote:
> On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > > 
> > > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > > EPC too?
> > > 
> > 
> > That flag is set for enclaves, do you mean we set similar flag in vepc  
> > struct?
> 
> Yes.

I missed the "ENCL" part but only noted the "NO_MEMORY" part, so I guess it
should not be used directly for vEPC.  So if it is needed, SGX_VEPC_NO_MEMORY,
or a simple 'bool dead' or similar in 'struct sgx_vepc' is more suitable.

As I said I was fighting with fever and headache last week :-) 


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote:
> On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai  wrote:
> [...]
> > (btw, even you track VA/SECS pages in unreclaimable list, given they  
> > both have
> > 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
> > SGX_EPC_OWNER_PAGE ?)
> 
> Let me think about it, there might also be a way to just track encl objects,
> not unreclaimable pages.
> 
> I still don't get why we need to kill the VM rather than just remove enough
> pages. Is it because the static allocation cannot be reclaimed?

We can choose to "just remove enough EPC pages".  The VM may or may not be
killed when it wants the EPC pages back, depending on whether the current EPC
cgroup can provide enough EPC pages or not.  And this depends on how we
implement the cgroup algorithm to reclaim EPC pages.

One problem could be: for an EPC cgroup that only has SGX VMs, you may end up
moving EPC pages from one VM to another and then vice versa endlessly, because
you never actually mark any VM as dead, like OOM does for the normal enclaves.

From this point, you still need a way to kill a VM, IIUC.

I think the key point of virtual EPC vs cgroup, as quoted from Sean, should be
"having sane, well-defined behavior".

Does "just remove enough EPC pages" meet this?  If the problem mentioned above
can be avoided, I suppose so?  So if there's an easy way to achieve that, I guess
it can be an option too.

But for the initial support, IMO we are not looking for a perfect but yet
complicated solution.  I would say, from review's point of view, it's preferred
to have a simple implementation to achieve a not-prefect, but consistent, well-
defined behaviour.

So to me it looks like killing the VM when the cgroup cannot reclaim any more EPC
pages is a simple option.

But I might have missed something, especially since middle of last week I have
been having fever and headache :-)

So as mentioned above, you can try other alternatives, but please avoid
complicated ones.

Also, I guess it will be helpful if we can understand the typical SGX app and/or
SGX VM deployment under the EPC cgroup use case.  This may help us justify why
the EPC cgroup algorithm to select the victim is reasonable.




Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Huang, Kai
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?

Yes.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Huang, Kai
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai  wrote:
> 
> > On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> > > On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > > > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > > > +/**
> > > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
> > > LRU
> > > > > > + * @lru:   LRU that is low
> > > > > > + *
> > > > > > + * Return: %true if a victim was found and kicked.
> > > > > > + */
> > > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > > > +{
> > > > > > +   struct sgx_epc_page *victim;
> > > > > > +
> > > > > > +   spin_lock(&lru->lock);
> > > > > > +   victim = sgx_oom_get_victim(lru);
> > > > > > +   spin_unlock(&lru->lock);
> > > > > > +
> > > > > > +   if (!victim)
> > > > > > +   return false;
> > > > > > +
> > > > > > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > > > +   return sgx_oom_encl_page(victim->encl_page);
> > > > > > +
> > > > > > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > > > +   return sgx_oom_encl(victim->encl);
> > > > > 
> > > > > I hate to bring this up, at least at this stage, but I am wondering  
> > > why we need
> > > > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
> > > an
> > > > > "enclave_list" instead?
> > > > 
> > > > The motivation for tracking EPC pages instead of enclaves was so that  
> > > the EPC
> > > > OOM-killer could "kill" VMs as well as host-owned enclaves. >
> > > 
> > > Ah this seems a fair argument. :-)
> > > 
> > > > The virtual EPC code
> > > > didn't actually kill the VM process, it instead just freed all of the  
> > > EPC pages
> > > > and abused the SGX architecture to effectively make the guest  
> > > recreate all its
> > > > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > > 
> > > It returns SIGBUS.  SGX VM live migration also requires enough EPC  
> > > being able to
> > > be allocated on the destination machine to work AFAICT.
> > > 
> > > > 
> > > > Looks like y'all punted on that with:
> > > > 
> > > >   The EPC pages allocated for KVM guests by the virtual EPC driver  
> > > are not
> > > >   reclaimable by the host kernel [5]. Therefore they are not tracked  
> > > by any
> > > >   LRU lists for reclaiming purposes in this implementation, but they  
> > > are
> > > >   charged toward the cgroup of the user processs (e.g., QEMU)  
> > > launching the
> > > >   guest.  And when the cgroup EPC usage reaches its limit, the  
> > > virtual EPC
> > > >   driver will stop allocating more EPC for the VM, and return SIGBUS  
> > > to the
> > > >   user process which would abort the VM launch.
> > > > 
> > > > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > > somehow. >
> > > 
> > > "enforced" do you mean?
> > > 
> > > Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
> > > allocate
> > > EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
> > > hva_to_pfn(),
> > > which eventually results in KVM returning -EFAULT to userspace in  
> > > vcpu_run().
> > > And Qemu then kills the VM with some nonsense message:
> > > 
> > > error: kvm run failed Bad address
> > > 
> > > 
> > > > Relying
> > > > on userspace to be kind enough to kill its VMs kinda defeats the  
> > > purpose of cgroup
> > > > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > > userspace running
> > > > encalves in a VM could continue on and refuse to give up its EPC, and  
> > > thus run above
> > > > its limit in perpetuity.
> > > 
> > > > 
> > > > I can see userspace want

Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Huang, Kai
On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai  wrote:
> 
> > 
> > > > > > 
> > > > > Later the hosting process could migrated/reassigned to another  
> > > cgroup?
> > > > > What to do when the new cgroup is OOM?
> > > > > 
> > > > 
> > > > You addressed in the documentation, no?
> > > > 
> > > > +Migration
> > > > +-
> > > > +
> > > > +Once an EPC page is charged to a cgroup (during allocation), it
> > > > +remains charged to the original cgroup until the page is released
> > > > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > > > +move the EPC charges that it incurred while in the previous cgroup
> > > > +to its new cgroup.
> > > 
> > > Should we kill the enclave though because some VA pages may be in the  
> > > new
> > > group?
> > > 
> > 
> > I guess acceptable?
> > 
> > And is there any difference if you keep VA/SECS on the unreclaimable list?
> 
> Tracking VA/SECS allows all cgroups, in which an enclave has allocation,  
> to identify the enclave following the back pointer and kill it as needed.
> 
> > If you migrate one
> > enclave to another cgroup, the old EPC pages stay in the old cgroup  
> > while the
> > new one is charged to the new group IIUC.
> > 
> > I am not cgroup expert, but by searching some old thread it appears this  
> > isn't a
> > supported model:
> > 
> > https://lore.kernel.org/lkml/yeyr9181qgzt+...@mtj.duckdns.org/
> > 
> 
> IIUC it's a different problem here. If we don't track the allocated VAs in  
> the new group, then the enclave that spans the two groups can't be killed  
> by the new group. If so, some enclave could just hide in some small group  
> and never gets killed but keeps allocating in a different group?
> 

I mean, from the link above, IIUC migrating an enclave among different cgroups
simply isn't a supported model, thus any bad behaviour isn't a big concern in
terms of decision making.


Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-10-10 Thread Huang, Kai
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
> 
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
> 
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
> 
> This patch was modified from its original version to use the misc cgroup
> controller instead of a custom controller.
> 
> 

[...]

> 
> 7) Other minor refactoring:
> - Remove unused params in epc_cgroup APIs
> - centralize uncharge into sgx_free_epc_page()
> ---
>  arch/x86/Kconfig |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 
>  arch/x86/kernel/cpu/sgx/main.c   |  68 -
>  arch/x86/kernel/cpu/sgx/sgx.h|  17 +-
>  6 files changed, 556 insertions(+), 17 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

As mentioned before, this patch is pretty large, thus it's hard to review.  I
think we should try to split it into smaller patches so that they are
reviewable.

I cannot recall how many times I've scrolled up/down just to find some
function.

Any idea to further split this patch?

Also, I am thinking _perhaps_ the way of organizing the patches of this patchset
can be improved.  I had an impression that this patchset is organized in this
way:  

1) There are many small patches to adjust the elemental code pieces to suit EPC
cgroup support, but many of them don't have enough design information to
justify them, and only say "EPC cgroup will use this later" etc.

2) And then the EPC cgroup is implemented in one large patch at the end.

Both 1) and 2) are hard to review.  I need to do a lot of back and forth to
review this series.

I am not finger pointing at anything because it's not easy at all, but just want
to explore options that may make this series easier to review.

For instance, will below make more sense:

Instead of implementing the EPC cgroup in one big patch, we introduce key
structures and elemental helpers in separate patch(es) at an early position so
that it's easy to review the basic logic/conversion.

And then we may move some key logic of handling EPC cgroup, e.g., reclaim logic,
to early patches when we adjust the elemental code pieces for EPC cgroup.

Perhaps it's worth a try, but just my 2 cents.


Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-10-10 Thread Huang, Kai

> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct 
> misc_cg *cg)
> +{
> + if (cg)
> + return (struct sgx_epc_cgroup 
> *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +
> + return NULL;
> +}
> +
> 

Is it a good idea to allow passing a NULL @cg to this basic function?

Why not only call this function when @cg is valid?

> +
> +static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
> +bool reclaim)
> +{
> + struct sgx_epc_reclaim_control rc;
> + unsigned int nr_empty = 0;
> +
> + sgx_epc_reclaim_control_init(&rc, epc_cg);
> +
> + for (;;) {
> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> + PAGE_SIZE))
> + break;
> +
> + if (sgx_epc_cgroup_lru_empty(epc_cg))
> + return -ENOMEM;
> +
> + if (signal_pending(current))
> + return -ERESTARTSYS;
> +
> + if (!reclaim) {
> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> + return -EBUSY;
> + }
> +
> + if (!sgx_epc_cgroup_reclaim_pages(1, &rc)) {
> + if (sgx_epc_cgroup_reclaim_failed(&rc)) {
> + if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
> + return -ENOMEM;
> + schedule();
> + }
> + }
> + }
> + if (epc_cg->cg != misc_cg_root())
> + css_get(&epc_cg->cg->css);

I don't quite understand why root is treated specially.

And I thought get_current_misc_cg() in sgx_epc_cgroup_try_charge() already grabs
the reference before calling this function?  Why do it again?

> +
> + return 0;
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC 
> page
> + * @mm:  the mm_struct of the process to charge
> + * @reclaim: whether or not synchronous reclaim is allowed
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> + int ret;
> +
> + if (sgx_epc_cgroup_disabled())
> + return NULL;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> + ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
> + put_misc_cg(epc_cg->cg);
> +
> + if (ret)
> + return ERR_PTR(ret);
> +
> + return epc_cg;
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
> + * @epc_cg:  the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> + if (sgx_epc_cgroup_disabled())
> + return;
> +
> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> + if (epc_cg->cg != misc_cg_root())
> + put_misc_cg(epc_cg->cg);

Again, why is root special?  And where is the get_misc_cg()?

Oh is it the 

if (epc_cg->cg != misc_cg_root())
	css_get(&epc_cg->cg->css);

in __sgx_epc_cgroup_try_charge()?

That's horrible to follow.  Can this be explicitly done in
sgx_epc_cgroup_try_charge() and sgx_epc_cgroup_uncharge(), that is, grab the
reference in the former and release the reference in the latter?
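
A minimal sketch of the symmetric reference handling being asked for (based on
the functions quoted above, and assuming __sgx_epc_cgroup_try_charge() no longer
does its own css_get(); error handling simplified):

struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
{
	struct sgx_epc_cgroup *epc_cg;
	int ret;

	if (sgx_epc_cgroup_disabled())
		return NULL;

	/* get_current_misc_cg() takes the reference we keep on success */
	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());

	ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
	if (ret) {
		put_misc_cg(epc_cg->cg);	/* drop it on failure */
		return ERR_PTR(ret);
	}

	return epc_cg;
}

void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
{
	if (sgx_epc_cgroup_disabled())
		return;

	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
	put_misc_cg(epc_cg->cg);	/* release the reference taken at charge time */
}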



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson  
>  wrote:
> 
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:   LRU that is low
> > > > + *
> > > > + * Return: %true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +   struct sgx_epc_page *victim;
> > > > +
> > > > +   spin_lock(&lru->lock);
> > > > +   victim = sgx_oom_get_victim(lru);
> > > > +   spin_unlock(&lru->lock);
> > > > +
> > > > +   if (!victim)
> > > > +   return false;
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +   return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +   return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why  
> > > we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that  
> > the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual  
> > EPC code
> > didn't actually kill the VM process, it instead just freed all of the  
> > EPC pages
> > and abused the SGX architecture to effectively make the guest recreate  
> > all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are  
> > not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by  
> > any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user processs (e.g., QEMU) launching  
> > the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual  
> > EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to  
> > the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > somehow.  Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose  
> > of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > userspace running
> > encalves in a VM could continue on and refuse to give up its EPC, and  
> > thus run above
> > its limit in perpetuity.
> > 
> Cgroup would refuse to allocate more when the limit is reached, so VMs
> cannot run above the limit.
> 
> IIRC VMs only support a static EPC size right now; reaching the limit at
> launch means the EPC size given on the QEMU command line is not appropriate.
> So the VM should not launch, hence the current behavior.
> 
> [All EPC pages in the guest are allocated on page faults caused by the
> sanitization process in the guest kernel during init, which is part of the
> VM launch process. So SIGBUS will turn into a failed VM launch.]
> 
> Once it is launched, the guest kernel would have a 'total capacity' given by
> the static value from the QEMU option. And it would start paging when that is
> used up; it would never ask the host for more.
> 
> In a future with dynamic EPC for running guests, QEMU could handle the
> allocation failure and pass SIGBUS to the running guest kernel.  Is that a
> correct understanding?
> 
> 
> > I can see userspace wanting to explicitly terminate the VM instead of  
> > "silently"
> > the VM's enclaves, but that seems like it should be a knob in the  
> > virtual EPC
> > code.
> 
> If my understanding above is correct and I am reading your statement above
> correctly, then I don't see why we really need a separate knob for the vEPC
> code. A running guest reaching a cgroup limit (assuming dynamic allocation is
> implemented) should not automatically translate into killing the VM.
> Instead, it is userspace's job to work with the guest to handle the allocation
> failure. The guest could page out or kill enclaves.
> 

IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:

1) misc.max = 100M
2) Launch VMs with total virtual EPC size = 100M   <- success
3) misc.max = 50M

Step 3) will also succeed, but nothing will happen; the VMs will still be
holding 100M of EPC.

You need to somehow track the virtual EPC and kill the VM instead.

(Or somehow fail step 3), if that is also an acceptable option.)



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai

> > > > 
> > > Later the hosting process could be migrated/reassigned to another cgroup?
> > > What to do when the new cgroup is OOM?
> > > 
> > 
> > You addressed in the documentation, no?
> > 
> > +Migration
> > +---------
> > +
> > +Once an EPC page is charged to a cgroup (during allocation), it
> > +remains charged to the original cgroup until the page is released
> > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > +move the EPC charges that it incurred while in the previous cgroup
> > +to its new cgroup.
> 
> Should we kill the enclave though because some VA pages may be in the new  
> group?
> 

I guess acceptable?

And is there any difference if you keep VA/SECS on the unreclaimable list? If
you migrate one enclave to another cgroup, the old EPC pages stay in the old
cgroup while new allocations are charged to the new cgroup, IIUC.

I am not a cgroup expert, but from searching some old threads it appears this
isn't a supported model:

https://lore.kernel.org/lkml/yeyr9181qgzt+...@mtj.duckdns.org/



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Tue, 2023-10-10 at 00:50 +, Huang, Kai wrote:
> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:   LRU that is low
> > > > + *
> > > > + * Return: %true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +   struct sgx_epc_page *victim;
> > > > +
> > > > +   spin_lock(&lru->lock);
> > > > +   victim = sgx_oom_get_victim(lru);
> > > > +   spin_unlock(&lru->lock);
> > > > +
> > > > +   if (!victim)
> > > > +   return false;
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +   return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +   return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why 
> > > we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that the 
> > EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  
> > 
> 
> Ah this seems a fair argument. :-)
> 
> > The virtual EPC code
> > didn't actually kill the VM process, it instead just freed all of the EPC 
> > pages
> > and abused the SGX architecture to effectively make the guest recreate all 
> > its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> 
> It returns SIGBUS.  SGX VM live migration also requires enough EPC being able 
> to
> be allocated on the destination machine to work AFAICT.
>  
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user process (e.g., QEMU) launching the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> > 
> 
> "enforced" do you mean?
> 
> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in 
> hva_to_pfn(),
> which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
> And Qemu then kills the VM with some nonsense message:
> 
> error: kvm run failed Bad address
> 
> 
> > Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose of 
> > cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace 
> > running
> > enclaves in a VM could continue on and refuse to give up its EPC, and thus 
> > run above
> > its limit in perpetuity.
> 
> > 
> > I can see userspace wanting to explicitly terminate the VM instead of 
> > "silently"
> > the VM's enclaves, but that seems like it should be a knob in the virtual 
> > EPC
> > code.

I guess I slightly misunderstood your words.

You mean we want to kill the VM when the limit is set lower than the virtual
EPC size.

This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual EPC too?

In sgx_vepc_fault(), we could check this flag early and return SIGBUS if it is
set.
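Roughly like the below (a sketch only -- SGX_VEPC_NO_MEMORY and the vepc->flags
field are hypothetical and would need to be added; the rest follows the shape of
the existing fault handler):

static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
{
	struct sgx_vepc *vepc = vmf->vma->vm_private_data;
	vm_fault_t ret;

	/* Hypothetical flag set by the EPC OOM path on the victim vepc. */
	if (test_bit(SGX_VEPC_NO_MEMORY, &vepc->flags))
		return VM_FAULT_SIGBUS;

	mutex_lock(&vepc->lock);
	ret = __sgx_vepc_fault(vepc, vmf->vma, vmf->address);
	mutex_unlock(&vepc->lock);

	return ret;
}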

But this also requires keeping virtual EPC pages in some list, and handling them
in sgx_epc_oom() too.

And for virtual EPC pages, I guess the "young" logic can be applied, thus it's
probably better to keep the actual virtual EPC pages on a (separate?) list
instead of keeping the virtual EPC instance.

struct sgx_epc_lru {
	struct list_head reclaimable;
	struct sgx_encl *enclaves;
	struct list_head vepc_pages;
};

Or still track VA/SECS and virtual EPC pages in a single unreclaimable list?

I don't know :-)


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai  wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson 
> > > 
> > > Introduce the OOM path for killing an enclave with a reclaimer that is  
> > > no
> > > longer able to reclaim enough EPC pages. Find a victim enclave, which
> > > will be an enclave with only "unreclaimable" EPC pages left in the
> > > cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> > > and zap the enclave's entire page range, and drain all mm references in
> > > encl->mm_list. Block allocating any EPC pages in #PF handler, or
> > > reloading any pages in all paths, or creating any new mappings.
> > > 
> > > The OOM killing path may race with the reclaimers: in some cases, the
> > > victim enclave is in the process of reclaiming the last EPC pages when
> > > OOM happens, that is, all pages other than SECS and VA pages are in
> > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> > > the enclave backing, VA pages as well as SECS. So the OOM killer does
> > > not directly release those enclave resources, instead, it lets all
> > > reclaiming in progress to finish, and relies (as currently done) on
> > > kref_put on encl->refcount to trigger sgx_encl_release() to do the
> > > final cleanup.
> > > 
> > > Signed-off-by: Sean Christopherson 
> > > Co-developed-by: Kristen Carlson Accardi 
> > > Signed-off-by: Kristen Carlson Accardi 
> > > Co-developed-by: Haitao Huang 
> > > Signed-off-by: Haitao Huang 
> > > Cc: Sean Christopherson 
> > > ---
> > > V5:
> > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> > > 
> > > V4:
> > > - Updates for patch reordering and typo fixes.
> > > 
> > > V3:
> > > - Rebased to use the new VMA_ITERATOR to zap VMAs.
> > > - Fixed the racing cases by blocking new page allocation/mapping and
> > > reloading when enclave is marked for OOM. And do not release any enclave
> > > resources other than draining mm_list entries, and let pages in
> > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> > > - Due to above changes, also removed the no-longer needed encl->lock in
> > > the OOM path which was causing deadlocks reported by the lock prover.
> > > 
> > 
> > [...]
> > 
> > > +
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru: LRU that is low
> > > + *
> > > + * Return:   %true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > + struct sgx_epc_page *victim;
> > > +
> > > + spin_lock(&lru->lock);
> > > + victim = sgx_oom_get_victim(lru);
> > > + spin_unlock(&lru->lock);
> > > +
> > > + if (!victim)
> > > + return false;
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > + return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > + return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why  
> > we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> > 
> > So by looking the patch (" x86/sgx: Limit process EPC usage with misc  
> > cgroup
> > controller"), if I am not missing anything, the whole "unreclaimable"  
> > list is
> > just used to find the victim enclave when OOM needs to be done.  Thus, I  
> > don't
> > see why "enclave_list" cannot be used to achieve this.
> > 
> > The reason that I am asking is because it seems using "enclave_list" we  
> > can
> > simplify the code.  At least the patches related to track VA/SECS pages,  
> > and the
> > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated  
> > completely.  
> > Using "enclave_list", I guess you just need to put the enclave to the  
> > current
> > EPC cgroup when SECS page is allocated.
> > 
> Later the hosting process could be migrated/reassigned to another cgroup?
> What to do when the new cgroup is OOM?
> 

You addressed in the documentation, no?

+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru: LRU that is low
> > > + *
> > > + * Return:   %true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > + struct sgx_epc_page *victim;
> > > +
> > > + spin_lock(&lru->lock);
> > > + victim = sgx_oom_get_victim(lru);
> > > + spin_unlock(&lru->lock);
> > > +
> > > + if (!victim)
> > > + return false;
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > + return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > + return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why we 
> > need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> 
> The motivation for tracking EPC pages instead of enclaves was so that the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  
> 

Ah this seems a fair argument. :-)

> The virtual EPC code
> didn't actually kill the VM process, it instead just freed all of the EPC 
> pages
> and abused the SGX architecture to effectively make the guest recreate all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).

It returns SIGBUS.  SGX VM live migration also requires enough EPC being able to
be allocated on the destination machine to work AFAICT.
 
> 
> Looks like y'all punted on that with:
> 
>   The EPC pages allocated for KVM guests by the virtual EPC driver are not
>   reclaimable by the host kernel [5]. Therefore they are not tracked by any
>   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user process (e.g., QEMU) launching the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS to the
>   user process which would abort the VM launch.
> 
> which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> 

"enforced" do you mean?

Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
And Qemu then kills the VM with some nonsense message:

error: kvm run failed Bad address


> Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose of 
> cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace 
> running
> enclaves in a VM could continue on and refuse to give up its EPC, and thus 
> run above
> its limit in perpetuity.

Agreed.  But it looks like this cannot be resolved until we can reclaim EPC
pages from the VM.

Or in the EPC cgroup code we can refuse to set a maximum which cannot be
supported, e.g., one that is less than the total virtual EPC size.

I guess the second is acceptable for now?
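Something like the below, purely as a sketch -- the misc cgroup has no
per-resource hook for validating max writes today, so both the hook and the
vepc_bytes counter here are hypothetical:

/* Hypothetical hook invoked when userspace writes misc.max for SGX EPC. */
static int sgx_epc_cgroup_max_write_check(struct sgx_epc_cgroup *epc_cg,
					  u64 new_max)
{
	/* Assumed counter of virtual EPC bytes charged to this cgroup. */
	if (new_max < atomic64_read(&epc_cg->vepc_bytes))
		return -EBUSY;	/* cannot shrink below live guest EPC */

	return 0;
}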

> 
> I can see userspace wanting to explicitly terminate the VM instead of 
> "silently"
> the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> code.

See above for the second option.



Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-10-09 Thread Huang, Kai

> @@ -332,6 +336,7 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, 
> size_t nr_to_scan,
>   * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
>   * @nr_to_scan:   Number of EPC pages to scan for reclaim
>   * @ignore_age:   Reclaim a page even if it is young
> + * @epc_cg:   EPC cgroup from which to reclaim
>   *
>   * Take a fixed number of pages from the head of the active page pool and
>   * reclaim them to the enclave's private shmem files. Skip the pages, which 
> have
> @@ -345,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, 
> size_t nr_to_scan,
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
>   */
> -size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
> +size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age,
> +  struct sgx_epc_cgroup *epc_cg)
>  {
>   struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
>   struct sgx_epc_page *epc_page, *tmp;
> @@ -355,7 +361,15 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool 
> ignore_age)
>   LIST_HEAD(iso);
>   size_t ret, i;
>  
> - sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
> + /*
> +  * If a specific cgroup is not being targeted, take from the global
> +  * list first, even when cgroups are enabled.  If there are
> +  * pages on the global LRU then they should get reclaimed asap.
> +  */
> + if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
> + sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
> +
> + sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);

(I wish such code can be somehow moved to the earlier patches, so that we can
get early idea that how sgx_reclaim_epc_pages() is supposed to be used.)

So here when we are not targeting a specific EPC cgroup, we always reclaim from
the global list first, ...

[...]

>  
>   if (list_empty(&iso))
>   return 0;
> @@ -423,7 +437,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  void sgx_reclaim_direct(void)
>  {
>   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> - sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> + sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);

... and we always try to reclaim the global list first when direct reclaim is
desired, even if the enclave is within some EPC cgroup.  ... 

>  }
>  
>  static int ksgxd(void *p)
> @@ -446,7 +460,7 @@ static int ksgxd(void *p)
>sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>  
>   if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> - sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> + sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);

... and in ksgxd() as well, which I guess is somehow acceptable.  ...

>  
>   cond_resched();
>   }
> @@ -600,6 +614,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
>   struct sgx_epc_page *page;
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_try_charge(reclaim);
> + if (IS_ERR(epc_cg))
> + return ERR_CAST(epc_cg);
>  
>   for ( ; ; ) {
>   page = __sgx_alloc_epc_page();
> @@ -608,8 +627,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, 
> bool reclaim)
>   break;
>   }
>  
> - if (!sgx_can_reclaim())
> - return ERR_PTR(-ENOMEM);
> + if (!sgx_can_reclaim()) {
> + page = ERR_PTR(-ENOMEM);
> + break;
> + }
>  
>   if (!reclaim) {
>   page = ERR_PTR(-EBUSY);
> @@ -621,10 +642,17 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, 
> bool reclaim)
>   break;
>   }
>  
> - sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
> + sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);

... and when an EPC page is allocated, no matter whether the EPC page belongs to
any cgroup or not.

When we are allocating an EPC page for one enclave, if that enclave belongs to
some cgroup, is it more reasonable to reclaim EPC pages from its own cgroup (and
the children under it)? A sketch of what I mean follows the quoted hunk below.

You already got the current EPC cgroup at the beginning of sgx_alloc_epc_page()
when you want to charge the EPC allocation.

>   cond_resched();
>   }
>  
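As a sketch (reusing the epc_cg pointer already obtained for charging at the top
of sgx_alloc_epc_page(); epc_cg may be NULL when cgroups are disabled, in which
case the call behaves as it does today):

		/* Prefer reclaiming from the cgroup this allocation is charged to. */
		sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, epc_cg);
		cond_resched();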


Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-10-09 Thread Huang, Kai

> +static inline struct sgx_epc_lru_lists *epc_cg_lru(struct sgx_epc_cgroup 
> *epc_cg)
> +{
> + if (epc_cg)
> + return &epc_cg->lru;
> + return NULL;
> +}
> 

It's legal to return a NULL EPC cgroup for a given EPC page, i.e., when the
enclave isn't assigned to any cgroup.  But ...

>  
>  static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page 
> *epc_page)
>  {
> + if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> + return epc_cg_lru(epc_page->epc_cg);
> +
>   return &sgx_global_lru;
>  }

... here is it legal to return a NULL LRU list?

It appears you always want to return a valid LRU list.  That is, if the EPC
cgroup is enabled and the EPC page doesn't belong to any cgroup, then you want
to return the sgx_global_lru?
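If that is the intent, a sketch of the fallback could be as simple as:

static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page *epc_page)
{
	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC) && epc_page->epc_cg)
		return &epc_page->epc_cg->lru;

	/* No cgroup assigned (or cgroup support disabled): use the global LRU. */
	return &sgx_global_lru;
}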



Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-10-09 Thread Huang, Kai

> +/**
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its 
> lrus
> + * @root:root of the tree to check
> + *
> + * Return: %true if all cgroups under the specified root have empty LRU 
> lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
> +{
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + bool ret = true;
> +
> + /*
> +  * Caller ensure css_root ref acquired
> +  */
> + css_root = root ? &root->cg->css : &(misc_cg_root()->css);
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> +
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> + spin_lock(&epc_cg->lru.lock);
> + ret = list_empty(&epc_cg->lru.reclaimable);
> + spin_unlock(&epc_cg->lru.lock);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!ret)
> + break;
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> 

[...]

> 
>  static inline bool sgx_can_reclaim(void)
>  {
> + if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> + return !sgx_epc_cgroup_lru_empty(NULL);
> +

Is it better to keep a root sgx_epc_cgroup and pass the root instead of NULL?

>   return !list_empty(&sgx_global_lru.reclaimable);
>  }
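For instance (a sketch; sgx_epc_cg_root is an assumed pointer cached when the
EPC cgroup support is initialized):

static struct sgx_epc_cgroup *sgx_epc_cg_root;	/* assumed, set at init */

static inline bool sgx_can_reclaim(void)
{
	if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
		return !sgx_epc_cgroup_lru_empty(sgx_epc_cg_root);

	return !list_empty(&sgx_global_lru.reclaimable);
}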
>  


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> Introduce the OOM path for killing an enclave with a reclaimer that is no
> longer able to reclaim enough EPC pages. Find a victim enclave, which
> will be an enclave with only "unreclaimable" EPC pages left in the
> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> and zap the enclave's entire page range, and drain all mm references in
> encl->mm_list. Block allocating any EPC pages in #PF handler, or
> reloading any pages in all paths, or creating any new mappings.
> 
> The OOM killing path may race with the reclaimers: in some cases, the
> victim enclave is in the process of reclaiming the last EPC pages when
> OOM happens, that is, all pages other than SECS and VA pages are in
> RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> the enclave backing, VA pages as well as SECS. So the OOM killer does
> not directly release those enclave resources, instead, it lets all
> reclaiming in progress to finish, and relies (as currently done) on
> kref_put on encl->refcount to trigger sgx_encl_release() to do the
> final cleanup.
> 
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Cc: Sean Christopherson 
> ---
> V5:
> - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> 
> V4:
> - Updates for patch reordering and typo fixes.
> 
> V3:
> - Rebased to use the new VMA_ITERATOR to zap VMAs.
> - Fixed the racing cases by blocking new page allocation/mapping and
> reloading when enclave is marked for OOM. And do not release any enclave
> resources other than draining mm_list entries, and let pages in
> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> - Due to above changes, also removed the no-longer needed encl->lock in
> the OOM path which was causing deadlocks reported by the lock prover.
> 

[...]

> +
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru: LRU that is low
> + *
> + * Return:   %true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> + struct sgx_epc_page *victim;
> +
> + spin_lock(&lru->lock);
> + victim = sgx_oom_get_victim(lru);
> + spin_unlock(&lru->lock);
> +
> + if (!victim)
> + return false;
> +
> + if (victim->flags & SGX_EPC_OWNER_PAGE)
> + return sgx_oom_encl_page(victim->encl_page);
> +
> + if (victim->flags & SGX_EPC_OWNER_ENCL)
> + return sgx_oom_encl(victim->encl);

I hate to bring this up, at least at this stage, but I am wondering why we need
to put VA and SECS pages to the unreclaimable list, but cannot keep an
"enclave_list" instead?

So by looking at the patch ("x86/sgx: Limit process EPC usage with misc cgroup
controller"), if I am not missing anything, the whole "unreclaimable" list is
just used to find the victim enclave when OOM needs to be done.  Thus, I don't
see why an "enclave_list" cannot be used to achieve this.

The reason I am asking is that it seems using an "enclave_list" we can simplify
the code.  At least the patches related to tracking VA/SECS pages, and the
SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing, can be eliminated completely.

Using "enclave_list", I guess you just need to put the enclave to the current
EPC cgroup when SECS page is allocated.
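To illustrate (a rough sketch only; the encl_list fields and sgx_epc_cgroup_oom()
below are made-up names, and real code would also need to hold a reference on the
victim enclave):

struct sgx_epc_cgroup {
	struct misc_cg *cg;
	struct sgx_epc_lru_lists lru;
	/* Hypothetical: enclaves whose SECS page is charged to this cgroup. */
	spinlock_t encl_list_lock;
	struct list_head encl_list;
};

static bool sgx_epc_cgroup_oom(struct sgx_epc_cgroup *epc_cg)
{
	struct sgx_encl *encl;

	spin_lock(&epc_cg->encl_list_lock);
	encl = list_first_entry_or_null(&epc_cg->encl_list,
					struct sgx_encl, encl_list);
	spin_unlock(&epc_cg->encl_list_lock);

	if (!encl)
		return false;

	return sgx_oom_encl(encl);
}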

In fact, putting an "unreclaimable" list into the LRU itself is a little bit
confusing because: 1) you cannot really reclaim anything from the list; 2)
VA/SECS pages don't have the concept of "young" at all, thus it makes no sense
to annotate them as LRU.

Thus putting VA/SECS pages on an "unreclaimable" list, instead of keeping an
"enclave_list", seems to have no benefit and will only make the code more
complicated.

Or am I missing anything?


Re: [PATCH v7 00/13] selftests/sgx: Fix compilation errors

2023-10-05 Thread Huang, Kai
Hi Jo,

Just FYI, I won't review the rest of the patches in this series.  One of the
reasons is that I am not that familiar with them.  Jarkko has reviewed them
anyway :-).

On Thu, 2023-10-05 at 17:38 +0200, Jo Van Bulck wrote:
> Hi,
> 
> This patch series ensures that all SGX selftests succeed when compiling with
> optimizations (as tested with -O{0,1,2,3,s} for both gcc 11.3.0 and clang
> 14.0.0). The aim of the patches is to avoid reliance on undefined,
> compiler-specific behavior that can make the test results fragile.
> 
> As far as I see, all commits in this series now have an explicit reviewed-by
> tag, so hopefully this can get merged upstream? Please let me know if any
> concerns remain and I'd happily address them.
> 
> Reference output below:
> 
> .. Testing   gcc   -O0    [OK]
> .. Testing   gcc   -O1    [OK]
> .. Testing   gcc   -O2    [OK]
> .. Testing   gcc   -O3    [OK]
> .. Testing   gcc   -Os    [OK]
> .. Testing   gcc   -Ofast [OK]
> .. Testing   gcc   -Og    [OK]
> .. Testing   clang -O0    [OK]
> .. Testing   clang -O1    [OK]
> .. Testing   clang -O2    [OK]
> .. Testing   clang -O3    [OK]
> .. Testing   clang -Os    [OK]
> .. Testing   clang -Ofast [OK]
> .. Testing   clang -Og    [OK]
> 
> Changelog
> -
> 
> v7
>   - Add reviewed-by tag (Jarkko)
> 
> v6
>   - Collect final ack/reviewed-by tags (Jarkko, Kai)
> 
> v5
>   - Reorder patches (Jarkko, Kai)
>   - Include fixes tag for inline asm memory clobber patch (Kai)
>   - Include linker error in static-pie commit message (Kai)
>   - Include generated assembly in relocations commit (Kai)
> 
> v4
>   - Remove redundant -nostartfiles compiler flag (Jarkko)
>   - Split dynamic symbol table removal in separate commit (Kai)
>   - Split redundant push/pop elimination in separate commit (Kai)
>   - Remove (incomplete) register cleansing on enclave exit
>   - Fix possibly uninitialized pointer dereferences in load.c
> 
> v3
>   - Refactor encl_op_array declaration and indexing (Jarkko)
>   - Annotate encl_buffer with "used" attribute (Kai)
>   - Split encl_buffer size and placement commits (Kai)
> 
> v2
>   - Add additional check for NULL pointer (Kai)
>   - Refine to produce proper static-pie executable
>   - Fix linker script assertions
>   - Specify memory clobber for inline asm instead of volatile (Kai)
>   - Clarify why encl_buffer non-static (Jarkko, Kai)
>   - Clarify -ffreestanding (Jarkko)
> 
> Best,
> Jo
> 
> Jo Van Bulck (13):
>   selftests/sgx: Fix uninitialized pointer dereference in error path
>   selftests/sgx: Fix uninitialized pointer dereferences in
> encl_get_entry
>   selftests/sgx: Include memory clobber for inline asm in test enclave
>   selftests/sgx: Separate linker options
>   selftests/sgx: Specify freestanding environment for enclave
> compilation
>   selftests/sgx: Remove redundant enclave base address save/restore
>   selftests/sgx: Produce static-pie executable for test enclave
>   selftests/sgx: Handle relocations in test enclave
>   selftests/sgx: Fix linker script asserts
>   selftests/sgx: Ensure test enclave buffer is entirely preserved
>   selftests/sgx: Ensure expected location of test enclave buffer
>   selftests/sgx: Discard unsupported ELF sections
>   selftests/sgx: Remove incomplete ABI sanitization code in test enclave
> 
>  tools/testing/selftests/sgx/Makefile  | 12 ++--
>  tools/testing/selftests/sgx/defines.h |  2 +
>  tools/testing/selftests/sgx/load.c|  9 ++-
>  tools/testing/selftests/sgx/sigstruct.c   |  5 +-
>  tools/testing/selftests/sgx/test_encl.c   | 67 +--
>  tools/testing/selftests/sgx/test_encl.lds | 10 +--
>  .../selftests/sgx/test_encl_bootstrap.S   | 28 +++-
>  7 files changed, 77 insertions(+), 56 deletions(-)
> 



Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-10-05 Thread Huang, Kai

> ---
>  arch/x86/Kconfig |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 
>  arch/x86/kernel/cpu/sgx/main.c   |  68 -
>  arch/x86/kernel/cpu/sgx/sgx.h|  17 +-
>  6 files changed, 556 insertions(+), 17 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

Given how large this patch is, it's better to split if we can.

It seems we can at least split ...

[...]

> 
> @@ -970,6 +1005,7 @@ static void __init arch_update_sysfs_visibility(int nid) 
> {}
>  static bool __init sgx_page_cache_init(void)
>  {
>   u32 eax, ebx, ecx, edx, type;
> + u64 capacity = 0;
>   u64 pa, size;
>   int nid;
>   int i;
> @@ -1020,6 +1056,7 @@ static bool __init sgx_page_cache_init(void)
>  
>   sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
>   sgx_numa_nodes[nid].size += size;
> + capacity += size;
>  
>   sgx_nr_epc_sections++;
>   }
> @@ -1029,6 +1066,9 @@ static bool __init sgx_page_cache_init(void)
>   return false;
>   }
>  
> + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> + sgx_epc_total_pages = capacity >> PAGE_SHIFT;
> +
>   return true;
>  }
> 

... setting up the capacity out as a separate patch, as "capacity" is a
top-level-only file showing the maximum available amount of the resource.

I'll review the rest later.


Re: [PATCH v5 15/18] x86/sgx: Prepare for multiple LRUs

2023-10-05 Thread Huang, Kai
On Thu, 2023-10-05 at 14:33 -0500, Haitao Huang wrote:
> On Thu, 05 Oct 2023 07:30:46 -0500, Huang, Kai  wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct  
> > > sgx_epc_page *epc_page)
> > > +{
> > > + return &sgx_global_lru;
> > > +}
> > > +
> > > +static inline bool sgx_can_reclaim(void)
> > > +{
> > > + return !list_empty(&sgx_global_lru.reclaimable);
> > > +}
> > > +
> > 
> > Shouldn't sgx_can_reclaim() also take a 'struct sgx_epc_lru_lists *'?
> > 
> > I thought we also need to check whether a cgroup's LRU lists can be  
> > reclaimed?
> 
> This is only used to check if any pages are reclaimable at the top level in
> this file. Later, sgx_epc_cgroup_lru_empty(NULL) is used in this function
> to recursively check all cgroups starting from the root.
> 
> 

This again falls into the "impossible to review unless you review a later patch
first" category.  This patch says nothing about sgx_can_reclaim() only being
used at the top level.  Even if it is, why can't it take the LRU lists as input?

All this patch says is we need to prepare these functions to suit multiple LRU
lists.

Btw, why doesn't sgx_reclaim_epc_pages() take LRU lists as input either?  Is it
possible that it can be called across multiple LRU lists, or across different
lists in one LRU?

Why do we need to find some particular LRU lists by given EPC page?

+static inline struct sgx_epc_lru_lists *sgx_lru_lists(struct sgx_epc_page
*epc_page)
+{
+   return &sgx_global_lru;
+}
+

Maybe it's clear to other people, but to me it sounds like some necessary design
background is missing, at least.

Please try your best to make the patch self-reviewable by justifying all of those.
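For example, even something as small as this (a sketch) would make the intended
scope explicit:

static inline bool sgx_can_reclaim(struct sgx_epc_lru_lists *lru)
{
	return !list_empty(&lru->reclaimable);
}

and the top-level caller would then pass &sgx_global_lru explicitly.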

