Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-11 Thread Haitao Huang

Hi Kai

On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai  wrote:


> On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai wrote:
> > 
> > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > > From: Sean Christopherson 
> > > >
> > > > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > >
> > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
> > > >   given LRU list.
> > > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > >   already isolated pages.
> > > >
> > > > Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> > > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > > above two functions will serve as building blocks for the reclamation
> > > > flows in later EPC cgroup implementation.
> > > >
> > > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> > > > cgroup will use the result to track reclaiming progress.
> > > >
> > > > sgx_isolate_epc_pages() returns the additional number of pages to scan
> > > > for current epoch of reclamation. The EPC cgroup will use the result to
> > > > determine if more scanning to be done in LRUs in its children groups.
> > >
> > > This changelog says nothing about "why", but only mentions the
> > > "implementation".
> > >
> > > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > >
> > >         for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > >                 if (npages_to_reclaim <= 0)
> > >                         return;
> > >
> > >                 npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > >                                                            npages_to_reclaim);
> > >         }
> > >
> > > Is there any difference to have "isolate" + "reclaim"?
> > >
> > 
> > This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> > Here we just follow the same design as ksgxd for each reclamation cycle.
> 
> I don't see how did you "follow" ksgxd.  If I am guessing correctly, you are
> afraid of there might be less than 16 pages in a given EPC cgroup, thus w/o
> splitting into "isolate" + "reclaim" you might feed the "reclaim" less than
> 16 pages, which might cause some performance degrade?
> 
> But is this a common case?  Should we even worry about this?
> 
> I suppose for such new feature we should bring functionality first and then
> optimization if you have real performance data to show.

The concern is not about "reclaim less than 16".
I mean this is just refactoring with exactly the same design of ksgxd
preserved, in that we first isolate as many candidate EPC pages (up to 16,
ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb in one
shot without anything else done in between. As described in original
comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to
finish all ewb quickly while minimizing impact of IPI.

The way you proposed will work but alters the current design and behavior
if cgroups is enabled and EPCs of an enclave are tracked across multiple
LRUs within the descendant cgroups, in that you will have the isolation
loop, backing store allocation loop and eblock loop interleaved with the
ewb loop.
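
To make this concrete, the flow being preserved is roughly the following
(a simplified sketch of mine with approximate signatures, not the actual
patch code; locking and error handling elided):

        static size_t sgx_reclaim_epc_pages_global(void)
        {
                size_t nr_to_scan = SGX_NR_TO_SCAN;
                LIST_HEAD(iso);

                /* Phase 1 ("isolate"): move up to SGX_NR_TO_SCAN candidate
                 * pages off the LRU onto @iso. */
                sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);

                /* Phase 2 ("reclaim"): backing store allocation, EBLOCK and
                 * the ETRACK/EWB batch all run over the isolated list in one
                 * shot, with nothing else interleaved in between. */
                return sgx_do_epc_reclamation(&iso);
        }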




> > > > Signed-off-by: Sean Christopherson 
> > > > Co-developed-by: Kristen Carlson Accardi 
> > > > Signed-off-by: Kristen Carlson Accardi 
> > > > Co-developed-by: Haitao Huang 
> > > > Signed-off-by: Haitao Huang 
> > > > Cc: Sean Christopherson 
> > > > ---
> > > >
> > >
> > > [...]
> > >
> > > > +/**
> > > > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> > > > + * @iso:	List of isolated pages for reclamation
> > > > + *
> > > > + * Take a list of EPC pages and reclaim them to the enclave's private
> > > > + * shmem files.  Do not reclaim the pages that have been accessed since
> > > > + * the last scan, and move each of those pages to the tail of its
> > > > + * tracking LRU list.
> > > > + *
> > > > + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX
> > > > + * per call in order to degrade amount of IPI's and ETRACK's potentially
> > > > + * required. sgx_encl_ewb() does degrade a bit among the HW threads with
> > > > + * three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
> > > > + * sufficiently. Reclaiming one page at a time would also be problematic
> > > > + * as it would increase the lock contention too much, which would halt
> > > > + * forward progress.
> > >
> > > This is kinda optimization, correct?  Is there any real performance data
> > > to justify this?
> > 
> > The above sentences were there originally. This optimization was justified.
> 
> I am talking about 16 -> 32.
> 
> > > If this optimization is useful, shouldn't we bring this
> > > optimization to the current sgx_reclaim_pages() instead, e.g., just
> > > increase SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
> > >
> > 
> > SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't
> > know. Currently it is really the buffer size allocated in
> > sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages a time. Maybe we
> > should just use SGX_NR_TO_SCAN. No _MAX needed. The point was to batch
> > reclamation to certain number to minimize impact of EWB pipeline. 16 was
> > the original design.

Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-13 Thread Huang, Kai
On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
> Hi Kai
> 
> On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai  wrote:
> 
> > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai wrote:
> > > 
> > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > > > From: Sean Christopherson 
> > > > > 
> > > > > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > > > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > > > 
> > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
> > > > >   given LRU list.
> > > > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > > >   already isolated pages.
> > > > > 
> > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling those
> > > > > two in succession, to replace the original sgx_reclaim_epc_pages().
> > > > > The above two functions will serve as building blocks for the
> > > > > reclamation flows in later EPC cgroup implementation.
> > > > > 
> > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The
> > > > > EPC cgroup will use the result to track reclaiming progress.
> > > > > 
> > > > > sgx_isolate_epc_pages() returns the additional number of pages to
> > > > > scan for current epoch of reclamation. The EPC cgroup will use the
> > > > > result to determine if more scanning to be done in LRUs in its
> > > > > children groups.
> > > > 
> > > > This changelog says nothing about "why", but only mentions the
> > > > "implementation".
> > > > 
> > > > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > > > 
> > > >         for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > > >                 if (npages_to_reclaim <= 0)
> > > >                         return;
> > > > 
> > > >                 npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > > >                                                            npages_to_reclaim);
> > > >         }
> > > > 
> > > > Is there any difference to have "isolate" + "reclaim"?
> > > > 
> > > 
> > > This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> > > Here we just follow the same design as ksgxd for each reclamation cycle.
> > 
> > I don't see how did you "follow" ksgxd.  If I am guessing correctly, you
> > are afraid of there might be less than 16 pages in a given EPC cgroup,
> > thus w/o splitting into "isolate" + "reclaim" you might feed the
> > "reclaim" less than 16 pages, which might cause some performance degrade?
> > 
> > But is this a common case?  Should we even worry about this?
> > 
> > I suppose for such new feature we should bring functionality first and
> > then optimization if you have real performance data to show.
> > 
> The concern is not about "reclaim less than 16".
> I mean this is just refactoring with exactly the same design of ksgxd  
> preserved, 
> 

I literally have no idea what you are talking about here.  ksgxd() just calls
sgx_reclaim_pages(), which tries to reclaim 16 pages at once.

> in that we first isolate as many candidate EPC pages (up to 16, ignore
> the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb in one shot
> without anything else done in between. 
> 

Assuming you are referring the implementation of sgx_reclaim_pages(), and
assuming the "isolate" you mean removing EPC pages from the list (which is
exactly what the sgx_isolate_epc_pages() in this patch does), what happens to
the loops of "backing store allocation" and "EBLOCK", before the loop of EWB? 
Eaten by you?
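
To spell it out, the current sgx_reclaim_pages() is four passes over the
same chunk of pages, roughly (abridged outline of the existing code; the
locking, error paths and owner/backing-index bookkeeping are elided):

        static void sgx_reclaim_pages(void)
        {
                struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
                struct sgx_backing backing[SGX_NR_TO_SCAN];
                int cnt = 0, i;

                /* loop 1: pop up to SGX_NR_TO_SCAN pages off the active list */
                while (cnt < SGX_NR_TO_SCAN && !list_empty(&sgx_active_page_list)) {
                        chunk[cnt] = list_first_entry(&sgx_active_page_list,
                                                      struct sgx_epc_page, list);
                        list_del_init(&chunk[cnt++]->list);
                }

                /* loop 2: skip pages accessed since the last scan, and
                 * allocate backing storage for the rest */
                for (i = 0; i < cnt; i++) {
                        if (!sgx_reclaimer_age(chunk[i]))
                                continue;       /* really: put it back on the LRU */
                        /* ... sgx_encl_alloc_backing() into backing[i] ... */
                }

                /* loop 3: EBLOCK every page being reclaimed */
                for (i = 0; i < cnt; i++)
                        sgx_reclaimer_block(chunk[i]);

                /* loop 4: EWB all pages back-to-back */
                for (i = 0; i < cnt; i++)
                        sgx_reclaimer_write(chunk[i], &backing[i]);
        }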


> As described in original  
> comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to  
> finish all ewb quickly while minimizing impact of IPI.
> 
> The way you proposed will work but alters the current design and behavior  
> if cgroups is enabled and EPCs of an enclave are tracked across multiple  
> LRUs within the descendant cgroups, in that you will have isolation loop,  
> backing store allocation loop, eblock loop interleaved with the ewb loop.
> 

I have no idea what you are talking about.

The point is, with or w/o this patch, you can only reclaim 16 EPC pages in one
function call (as you have said you are going to remove SGX_NR_TO_SCAN_MAX,
which is a cipher to both of us).  The only difference I can see is, with this
patch, you can have multiple calls of "isolate" and then call the "do_reclaim"
once.

But what's the good of having the "isolate" if the "do_reclaim" can only reclaim
16 pages anyway?

Back to my last reply, are you afraid of any LRU has less than 16 pages to
"isolate", therefore you need to loop LRUs of descendants to get 16?  Cause I
really cannot think of any other reason why you are doing this.





Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-15 Thread Haitao Huang

On Wed, 13 Dec 2023 05:17:11 -0600, Huang, Kai  wrote:


> On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
> > Hi Kai
> > 
> > On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai wrote:
> > 
> > > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> > > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai wrote:
> > > > 
> > > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > > > > From: Sean Christopherson 
> > > > > > 
> > > > > > To prepare for per-cgroup reclamation, separate the top-level
> > > > > > reclaim function, sgx_reclaim_epc_pages(), into two separate
> > > > > > functions:
> > > > > > 
> > > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages
> > > > > >   from a given LRU list.
> > > > > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > > > >   already isolated pages.
> > > > > > 
> > > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling
> > > > > > those two in succession, to replace the original
> > > > > > sgx_reclaim_epc_pages(). The above two functions will serve as
> > > > > > building blocks for the reclamation flows in later EPC cgroup
> > > > > > implementation.
> > > > > > 
> > > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages.
> > > > > > The EPC cgroup will use the result to track reclaiming progress.
> > > > > > 
> > > > > > sgx_isolate_epc_pages() returns the additional number of pages to
> > > > > > scan for current epoch of reclamation. The EPC cgroup will use the
> > > > > > result to determine if more scanning to be done in LRUs in its
> > > > > > children groups.
> > > > > 
> > > > > This changelog says nothing about "why", but only mentions the
> > > > > "implementation".
> > > > > 
> > > > > For instance, assuming we need to reclaim @npages_to_reclaim from
> > > > > the @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > > > > 
> > > > >         for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > > > >                 if (npages_to_reclaim <= 0)
> > > > >                         return;
> > > > > 
> > > > >                 npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > > > >                                                            npages_to_reclaim);
> > > > >         }
> > > > > 
> > > > > Is there any difference to have "isolate" + "reclaim"?
> > > > > 
> > > > 
> > > > This is to optimize "reclaim". See how etrack was done in
> > > > sgx_encl_ewb. Here we just follow the same design as ksgxd for each
> > > > reclamation cycle.
> > > 
> > > I don't see how did you "follow" ksgxd.  If I am guessing correctly,
> > > you are afraid of there might be less than 16 pages in a given EPC
> > > cgroup, thus w/o splitting into "isolate" + "reclaim" you might feed
> > > the "reclaim" less than 16 pages, which might cause some performance
> > > degrade?
> > > 
> > > But is this a common case?  Should we even worry about this?
> > > 
> > > I suppose for such new feature we should bring functionality first and
> > > then optimization if you have real performance data to show.
> > > 
> > The concern is not about "reclaim less than 16".
> > I mean this is just refactoring with exactly the same design of ksgxd
> > preserved,
> 
> I literally have no idea what you are talking about here.  ksgxd() just
> calls sgx_reclaim_pages(), which tries to reclaim 16 pages at once.
> 
> > in that we first isolate as many candidate EPC pages (up to 16, ignore
> > the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb in one shot
> > without anything else done in between.
> 
> Assuming you are referring the implementation of sgx_reclaim_pages(), and
> assuming the "isolate" you mean removing EPC pages from the list (which is
> exactly what the sgx_isolate_epc_pages() in this patch does), what happens
> to the loops of "backing store allocation" and "EBLOCK", before the loop
> of EWB?  Eaten by you?

I skipped those as what really matters is to keep the ewb loop separate and
run in one shot for each reclaiming cycle, not dependent on the number of
LRUs.  All those loops in the original sgx_reclaim_pages() except the
"isolate" loop are not dealing with multiple LRUs of cgroups later. That's
the reason to refactor out only the "isolate" part and loop it through
cgroup LRUs in later patches.
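
I.e., in the cgroup patches only the "isolate" step walks multiple LRUs,
roughly like this (simplified sketch; the descendant-walk helper and the
exact signatures are illustrative, not the series code verbatim):

        static size_t sgx_epc_cgroup_reclaim_pages(struct sgx_epc_cgroup *root)
        {
                size_t nr_to_scan = SGX_NR_TO_SCAN;
                struct sgx_epc_cgroup *epc_cg;
                LIST_HEAD(iso);

                /* Only the "isolate" loop iterates over the subtree's LRUs... */
                for_each_epc_cgroup_and_descendants(root, epc_cg) {
                        if (!nr_to_scan)
                                break;
                        sgx_isolate_epc_pages(&epc_cg->lru, &nr_to_scan, &iso);
                }

                /* ...while backing store allocation, EBLOCK and the EWB batch
                 * still run exactly once per reclaiming cycle. */
                return sgx_do_epc_reclamation(&iso);
        }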





> > As described in original comments for the function sgx_reclaim_pages
> > and sgx_encl_ewb, this is to finish all ewb quickly while minimizing
> > impact of IPI.
> > 
> > The way you proposed will work but alters the current design and
> > behavior if cgroups is enabled and EPCs of an enclave are tracked
> > across multiple LRUs within the descendant cgroups, in that you will
> > have the isolation loop, backing store allocation loop and eblock loop
> > interleaved with the ewb loop.
> 
> I have no idea what you are talking about.
> 
> The point is, with or w/o this patch, you can only reclaim 16 EPC pages
> in one function call (as you have said you are going to remove
> SGX_NR_TO_SCAN_MAX, which is a cipher to both of us).  The only
> difference I can see is, with this patch, you can have multiple calls of
> "isolate" and then call the "do_reclaim" once.
> 
> But what's the good of having the "isolate" if the "do_reclaim" can only
> reclaim 16 pages anyway?
> 
> Back to my last reply, are you afraid of any LRU has less than 16 pages
> to "isolate", therefore you need to loop LRUs of descendants to get 16?
> Cause I really cannot think of any other reason why you are doing this.

Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-17 Thread Huang, Kai

> > 
> > The point is, with or w/o this patch, you can only reclaim 16 EPC pages  
> > in one
> > function call (as you have said you are going to remove  
> > SGX_NR_TO_SCAN_MAX,
> > which is a cipher to both of us).  The only difference I can see is,  
> > with this
> > patch, you can have multiple calls of "isolate" and then call the  
> > "do_reclaim"
> > once.
> > 
> > But what's the good of having the "isolate" if the "do_reclaim" can only  
> > reclaim
> > 16 pages anyway?
> > 
> > Back to my last reply, are you afraid of any LRU has less than 16 pages  
> > to
> > "isolate", therefore you need to loop LRUs of descendants to get 16?   
> > Cause I
> > really cannot think of any other reason why you are doing this.
> > 
> > 
> 
> I think I see your point. By capping pages reclaimed per cycle to 16,  
> there is not much difference even if those 16 pages are spread in separate  
> LRUs . The difference is only significant when we ever raise that cap. To  
> preserve the current behavior of ewb loops independent on number of LRUs  
> to loop through for each reclaiming cycle, regardless of the exact value  
> of the page cap, I would still think current approach in the patch is  
> reasonable choice.  What do you think?

To me I won't bother to do that.  Having less than 16 pages in one LRU is
*extremely rare* that should never happen in practice.  It's pointless to make
such code adjustment at this stage.

Let's focus on enabling functionality first.  When you have some real
performance issue that is related to this, we can come back then.

Btw, I think you need to step back even further.  IIUC the whole multiple LRU
thing isn't mandatory in this initial support.

Please (again) take a look at the comments from Dave and Michal:

https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df...@intel.com/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/


Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-18 Thread Mikko Ylinen
On Mon, Dec 18, 2023 at 01:44:56AM +0000, Huang, Kai wrote:
> 
> Let's focus on enabling functionality first.  When you have some real
> performance issue that is related to this, we can come back then.
> 
> Btw, I think you need to step back even further.  IIUC the whole multiple LRU
> thing isn't mandatory in this initial support.
> 
> Please (again) take a look at the comments from Dave and Michal:
> 
> https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df...@intel.com/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/

I don't think setting a hard limit without any reclaiming is preferred.

I'd rather see this similar to what the "sgx_epc.high" was in the RFC
patchset: misc.max for sgx_epc becomes the max value for EPC usage but
enclaves larger than the limit would still run OK. Per-cgroup reclaiming
allows additional controls via memory.high/max in the same cgroup.

If this reclaim flexibily was not there, the sgx_epc limit would always
have to be set based on some "peak" EPC consumption which may not even
be known at the time the limit is set.

From a container runtime perspective (which is what I'm working on for Kubernetes)
the current proposal seems best to me: a container is guaranteed at most
the amount of EPC set as the limit and no other container gets to use it.
Also, each container gets charged for reclaiming independently if a low
max value is used (which might be desirable to get more containers to run on the
same node/system). In this model, the sum of containers' max values would be
the capacity.

-- Mikko



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-18 Thread Haitao Huang

On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai  wrote:




> > > The point is, with or w/o this patch, you can only reclaim 16 EPC pages
> > > in one function call (as you have said you are going to remove
> > > SGX_NR_TO_SCAN_MAX, which is a cipher to both of us).  The only
> > > difference I can see is, with this patch, you can have multiple calls
> > > of "isolate" and then call the "do_reclaim" once.
> > >
> > > But what's the good of having the "isolate" if the "do_reclaim" can
> > > only reclaim 16 pages anyway?
> > >
> > > Back to my last reply, are you afraid of any LRU has less than 16 pages
> > > to "isolate", therefore you need to loop LRUs of descendants to get 16?
> > > Cause I really cannot think of any other reason why you are doing this.
> > 
> > I think I see your point. By capping pages reclaimed per cycle to 16,
> > there is not much difference even if those 16 pages are spread in
> > separate LRUs. The difference is only significant when we ever raise
> > that cap. To preserve the current behavior of ewb loops independent on
> > number of LRUs to loop through for each reclaiming cycle, regardless of
> > the exact value of the page cap, I would still think current approach in
> > the patch is reasonable choice.  What do you think?
> 
> To me I won't bother to do that.  Having less than 16 pages in one LRU is
> *extremely rare* that should never happen in practice.  It's pointless to
> make such code adjustment at this stage.
> 
> Let's focus on enabling functionality first.  When you have some real
> performance issue that is related to this, we can come back then.
> 
> Btw, I think you need to step back even further.  IIUC the whole multiple
> LRU thing isn't mandatory in this initial support.
> 
> Please (again) take a look at the comments from Dave and Michal:
> 
> https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df...@intel.com/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/

Thanks for raising this. Actually my understanding is the above discussion
was mainly about not doing reclaiming by killing enclaves, i.e., I assumed
"reclaiming" within that context only meant for that particular kind.

As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
cgroup needs to accommodate the peak usage of enclaves within that cgroup.
That may be inconvenient for practical usage and limits could be forced to
be set larger than necessary to run enclaves performantly. For example, we
can observe following undesired consequences comparing a system with
current kernel loaded with enclaves whose total peak usage is greater than
the EPC capacity.

1) If a user wants to load the same exact enclaves but in separate
cgroups, then the sum of cgroup limits must be higher than the capacity
and the system will end up doing the same old global reclaiming as it is
currently doing. Cgroup is not useful at all for isolating EPC
consumptions.

2) To isolate impact of usage of each cgroup on other cgroups and yet
still being able to load each enclave, the user basically has to carefully
plan to ensure the sum of cgroup max limits, i.e., the sum of peak usage
of enclaves, is not reaching over the capacity. That means no
over-committing allowed and the same system may not be able to load as
many enclaves as with current kernel.

@Dave and @Michal, Your thoughts? Or could you confirm we should not do
reclaim per cgroup at all?

If confirmed as desired, then this series can stop at patch 4.

Thanks
Haitao



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-03 Thread Dave Hansen
On 12/18/23 13:24, Haitao Huang wrote:
> @Dave and @Michal, Your thoughts? Or could you confirm we should not
> do reclaim per cgroup at all?

What's the benefit of doing reclaim per cgroup?  Is that worth the extra
complexity?

The key question here is whether we want the SGX VM to be complex and
more like the real VM or simple when a cgroup hits its limit.  Right?

If stopping at patch 5 and having less code is even remotely an option,
why not do _that_?



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Michal Koutný
Hello.

On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang wrote:
> Thanks for raising this. Actually my understanding is the above discussion was
> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant for that particular kind.
> 
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe following undesired consequences comparing a system with current
> kernel loaded with enclaves whose total peak usage is greater than the EPC
> capacity.
> 
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumptions.

That is the use of limits to prevent a runaway cgroup smothering the
system. Overcommitted values in such a config are fine because the more
simultaneous runaways, the less likely.
The peak consumption is on the fair expense of others (some efficiency)
and the limit contains the runaway (hence the isolation).

> 2) To isolate impact of usage of each cgroup on other cgroups and yet still
> being able to load each enclave, the user basically has to carefully plan to
> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, is not reaching over the capacity. That means no over-committing
> allowed and the same system may not be able to load as many enclaves as with
> current kernel.

Another "config layout" of limits is to achieve partitioning (sum ==
capacity). That is perfect isolation but it naturally goes against
efficient utilization. The way other controllers approach this trade-off
is with weights (cpu, io) or protections (memory). I'm afraid misc
controller is not ready for this.

My opinion is to start with the simple limits (first patches) and think
of prioritization/guarantee mechanism based on real cases.

HTH,
Michal




Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Haitao Huang

Hi Dave,

On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen  wrote:



> On 12/18/23 13:24, Haitao Huang wrote:
> > @Dave and @Michal, Your thoughts? Or could you confirm we should not
> > do reclaim per cgroup at all?
> 
> What's the benefit of doing reclaim per cgroup?  Is that worth the extra
> complexity?

Without reclaiming per cgroup, then we have to always set the limit to
enclave's peak usage. This may not be efficient utilization as in many
cases each enclave can perform fine with EPC limit set less than peak.
Basically each group can not give up some pages for greater good without
dying :-)

Also with enclaves enabled with EDMM, the peak usage is not static so hard
to determine upfront. Hence it might be an operation/deployment
inconvenience.

In case of over-committing (sum of limits > total capacity), one cgroup at
peak usage may require swapping pages out in a different cgroup if system
is overloaded at that time.

> The key question here is whether we want the SGX VM to be complex and
> more like the real VM or simple when a cgroup hits its limit.  Right?

Although it's fair to say the majority of complexity of this series is in
support for reclaiming per cgroup, I think it's manageable and much less
than real VM after we removed the enclave killing parts: the only extra
effort is to track pages in separate lists and reclaim them separately as
opposed to tracking them in one global list and reclaiming them together.
The main reclaiming loop code is still pretty much the same as before.
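
Illustratively, the structural delta looks like this (sketch only; field
names approximate the series, not verbatim):

        /* Today: a single global LRU tracks all reclaimable EPC pages. */
        static struct sgx_epc_lru_list sgx_global_lru;

        /* With per-cgroup reclaim: each EPC cgroup has its own LRU, and a
         * page is tracked in (and reclaimed from) the LRU of the cgroup it
         * is charged to. */
        struct sgx_epc_cgroup {
                struct misc_cg          *cg;    /* misc controller hookup */
                struct sgx_epc_lru_list lru;    /* this cgroup's pages */
        };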




> If stopping at patch 5 and having less code is even remotely an option,
> why not do _that_?

I hope I described limitations clear enough above.
If those are OK with users and also make it acceptable for merge quickly,
I'm happy to do that :-)

Thanks
Haitao



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Jarkko Sakkinen
On Thu Jan 4, 2024 at 9:11 PM EET, Haitao Huang wrote:
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit.  Right?
> >
>
> Although it's fair to say the majority of complexity of this series is in  
> support for reclaiming per cgroup, I think it's manageable and much less  
> than real VM after we removed the enclave killing parts: the only extra  
> effort is to track pages in separate lists and reclaim them separately
> as opposed to tracking them in one global list and reclaiming them
> together. The main
> reclaiming loop code is still pretty much the same as before.

I'm not seeing any unmanageable complexity on SGX side, and also
cgroups specific changes are somewhat clean to me at least...

BR, Jarkko



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Haitao Huang

Hi Michal,

On Thu, 04 Jan 2024 06:38:41 -0600, Michal Koutný  wrote:


> Hello.
> 
> On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang wrote:
> > Thanks for raising this. Actually my understanding is the above discussion
> > was mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> > "reclaiming" within that context only meant for that particular kind.
> > 
> > As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> > cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> > That may be inconvenient for practical usage and limits could be forced to
> > be set larger than necessary to run enclaves performantly. For example, we
> > can observe following undesired consequences comparing a system with
> > current kernel loaded with enclaves whose total peak usage is greater than
> > the EPC capacity.
> > 
> > 1) If a user wants to load the same exact enclaves but in separate
> > cgroups, then the sum of cgroup limits must be higher than the capacity
> > and the system will end up doing the same old global reclaiming as it is
> > currently doing. Cgroup is not useful at all for isolating EPC
> > consumptions.
> 
> That is the use of limits to prevent a runaway cgroup smothering the
> system. Overcommitted values in such a config are fine because the more
> simultaneous runaways, the less likely.
> The peak consumption is on the fair expense of others (some efficiency)
> and the limit contains the runaway (hence the isolation).

This makes sense to me in theory. Mikko, Chris Y/Bo Z, your thoughts on
whether this is good enough for your intended usages?

> > 2) To isolate impact of usage of each cgroup on other cgroups and yet
> > still being able to load each enclave, the user basically has to
> > carefully plan to ensure the sum of cgroup max limits, i.e., the sum of
> > peak usage of enclaves, is not reaching over the capacity. That means no
> > over-committing allowed and the same system may not be able to load as
> > many enclaves as with current kernel.
> 
> Another "config layout" of limits is to achieve partitioning (sum ==
> capacity). That is perfect isolation but it naturally goes against
> efficient utilization. The way other controllers approach this trade-off
> is with weights (cpu, io) or protections (memory). I'm afraid misc
> controller is not ready for this.
> 
> My opinion is to start with the simple limits (first patches) and think
> of prioritization/guarantee mechanism based on real cases.

We moved away from using mem like custom controller with (low, high, max)
to misc controller. But if we need to add those down the road, then the
interface needs to be changed. So my concern on this route would be whether
misc would allow any of those extensions. On the other hand, it might turn
out less complex just doing the reclamation per cgroup.

Thanks a lot for your comments and they are really helpful!

Haitao




Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Dave Hansen
On 1/4/24 11:11, Haitao Huang wrote:
> If those are OK with users and also make it acceptable for merge
> quickly, I'm happy to do that 🙂

How about we put some actual numbers behind this?  How much complexity
are we talking about here?  What's the diffstat for the utterly
bare-bones implementation and what does it cost on top of that to do the
per-cgroup reclaim?



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-04 Thread Haitao Huang
On Thu, 04 Jan 2024 13:27:07 -0600, Dave Hansen  wrote:

> On 1/4/24 11:11, Haitao Huang wrote:
> > If those are OK with users and also make it acceptable for merge
> > quickly, I'm happy to do that 🙂
> 
> How about we put some actual numbers behind this?  How much complexity
> are we talking about here?  What's the diffstat for the utterly
> bare-bones implementation and what does it cost on top of that to do the
> per-cgroup reclaim?

For bare-bones:

 arch/x86/Kconfig                     |  13
 arch/x86/kernel/cpu/sgx/Makefile     |   1
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 104
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  39
 arch/x86/kernel/cpu/sgx/main.c       |  41
 arch/x86/kernel/cpu/sgx/sgx.h        |   5
 include/linux/misc_cgroup.h          |  42
 kernel/cgroup/misc.c                 |  52
 8 files changed, 281 insertions(+), 16 deletions(-)

Additional for per-cgroup reclaim:

 arch/x86/kernel/cpu/sgx/encl.c       |  41
 arch/x86/kernel/cpu/sgx/encl.h       |   3
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  18
 arch/x86/kernel/cpu/sgx/main.c       | 226
 arch/x86/kernel/cpu/sgx/sgx.h        |  85
 6 files changed, 480 insertions(+), 117 deletions(-)

Thanks
Haitao



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-05 Thread Mikko Ylinen
On Thu, Jan 04, 2024 at 01:11:15PM -0600, Haitao Huang wrote:
> Hi Dave,
> 
> On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen 
> wrote:
> 
> > On 12/18/23 13:24, Haitao Huang wrote:
> > > @Dave and @Michal, Your thoughts? Or could you confirm we should not
> > > do reclaim per cgroup at all?
> > What's the benefit of doing reclaim per cgroup?  Is that worth the extra
> > complexity?
> > 
> 
> Without reclaiming per cgroup, then we have to always set the limit to
> enclave's peak usage. This may not be efficient utilization as in many cases
> each enclave can perform fine with EPC limit set less than peak. Basically
> each group can not give up some pages for greater good without dying :-)

+1. this is exactly my thinking too. The per cgroup reclaiming is
important for the containers use case we are working on. I also think
it makes the limit more meaningful: the per-container pool of EPC pages
to use (which is independent of the enclave size).

> 
> Also with enclaves enabled with EDMM, the peak usage is not static so hard
> to determine upfront. Hence it might be an operation/deployment
> inconvenience.
> 
> In case of over-committing (sum of limits > total capacity), one cgroup at
> peak usage may require swapping pages out in a different cgroup if system is
> overloaded at that time.
> 
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit.  Right?
> > 
> 
> Although it's fair to say the majority of complexity of this series is in
> support for reclaiming per cgroup, I think it's manageable and much less
> than real VM after we removed the enclave killing parts: the only extra
> effort is to track pages in separate lists and reclaim them separately as
> opposed to tracking them in one global list and reclaiming them together. The main reclaiming
> loop code is still pretty much the same as before.
> 
> 
> > If stopping at patch 5 and having less code is even remotely an option,
> > why not do _that_?
> > 
> I hope I described limitations clear enough above.
> If those are OK with users and also make it acceptable for merge quickly,

You explained the gaps very well already. I don't think the simple
version without per-cgroup reclaiming is enough for the container case.

Mikko



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-12 Thread Haitao Huang

On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai  wrote:




> > > The point is, with or w/o this patch, you can only reclaim 16 EPC pages
> > > in one function call (as you have said you are going to remove
> > > SGX_NR_TO_SCAN_MAX, which is a cipher to both of us).  The only
> > > difference I can see is, with this patch, you can have multiple calls
> > > of "isolate" and then call the "do_reclaim" once.
> > >
> > > But what's the good of having the "isolate" if the "do_reclaim" can
> > > only reclaim 16 pages anyway?
> > >
> > > Back to my last reply, are you afraid of any LRU has less than 16 pages
> > > to "isolate", therefore you need to loop LRUs of descendants to get 16?
> > > Cause I really cannot think of any other reason why you are doing this.
> > 
> > I think I see your point. By capping pages reclaimed per cycle to 16,
> > there is not much difference even if those 16 pages are spread in
> > separate LRUs. The difference is only significant when we ever raise
> > that cap. To preserve the current behavior of ewb loops independent on
> > number of LRUs to loop through for each reclaiming cycle, regardless of
> > the exact value of the page cap, I would still think current approach in
> > the patch is reasonable choice.  What do you think?
> 
> To me I won't bother to do that.  Having less than 16 pages in one LRU is
> *extremely rare* that should never happen in practice.  It's pointless to
> make such code adjustment at this stage.
> 
> Let's focus on enabling functionality first.  When you have some real
> performance issue that is related to this, we can come back then.

I have done some rethinking about this and realize this does save quite
some significant work: without breaking out the isolation part from
sgx_reclaim_pages(), I can remove the changes to use a list for isolated
pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. About
1/3 of changes for per-cgroup reclamation will be gone.
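
The simplified route would then just reuse the existing reclaim function
per LRU, roughly (sketch; assumes a sgx_reclaim_pages() variant taking an
LRU, and the descendant-walk helper is illustrative):

        static void sgx_epc_cgroup_reclaim(struct sgx_epc_cgroup *root)
        {
                struct sgx_epc_cgroup *epc_cg;

                /* No isolated-pages list and no RECLAIM_IN_PROGRESS state:
                 * each LRU goes through the unmodified scan/EBLOCK/EWB
                 * sequence, up to 16 pages at a time. */
                for_each_epc_cgroup_and_descendants(root, epc_cg)
                        sgx_reclaim_pages(&epc_cg->lru);
        }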


So I think I'll go this route now. The only downside may be performance if
an enclave spreads its pages in different cgroups and even that is minimum
impact as we limit reclamation to 16 pages a time. Let me know if someone
feels strongly we need to deal with that and sees some other potential
issues I may have missed.


Thanks

Haitao



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-13 Thread Jarkko Sakkinen
On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote:
> On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai wrote:
>
> > > > The point is, with or w/o this patch, you can only reclaim 16 EPC
> > > > pages in one function call (as you have said you are going to remove
> > > > SGX_NR_TO_SCAN_MAX, which is a cipher to both of us).  The only
> > > > difference I can see is, with this patch, you can have multiple calls
> > > > of "isolate" and then call the "do_reclaim" once.
> > > >
> > > > But what's the good of having the "isolate" if the "do_reclaim" can
> > > > only reclaim 16 pages anyway?
> > > >
> > > > Back to my last reply, are you afraid of any LRU has less than 16
> > > > pages to "isolate", therefore you need to loop LRUs of descendants to
> > > > get 16?  Cause I really cannot think of any other reason why you are
> > > > doing this.
> > >
> > > I think I see your point. By capping pages reclaimed per cycle to 16,
> > > there is not much difference even if those 16 pages are spread in
> > > separate LRUs. The difference is only significant when we ever raise
> > > that cap. To preserve the current behavior of ewb loops independent on
> > > number of LRUs to loop through for each reclaiming cycle, regardless of
> > > the exact value of the page cap, I would still think current approach
> > > in the patch is reasonable choice.  What do you think?
> >
> > To me I won't bother to do that.  Having less than 16 pages in one LRU is
> > *extremely rare* that should never happen in practice.  It's pointless to
> > make such code adjustment at this stage.
> >
> > Let's focus on enabling functionality first.  When you have some real
> > performance issue that is related to this, we can come back then.
>
> I have done some rethinking about this and realize this does save quite
> some significant work: without breaking out the isolation part from
> sgx_reclaim_pages(), I can remove the changes to use a list for isolated
> pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. About
> 1/3 of changes for per-cgroup reclamation will be gone.
>
> So I think I'll go this route now. The only downside may be performance if
> an enclave spreads its pages in different cgroups and even that is minimum
> impact as we limit reclamation to 16 pages a time. Let me know if someone
> feels strongly we need to deal with that and sees some other potential
> issues I may have missed.

We could deal with possible performance regression later on (if there
is need). I mean there should be a workload first that would show that
sort of stimulus...

> Thanks
>
> Haitao

BR, Jarkko



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-13 Thread Jarkko Sakkinen
On Sat Jan 13, 2024 at 11:04 PM EET, Jarkko Sakkinen wrote:
> On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote:
> > On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai wrote:
> >
> > > > > The point is, with or w/o this patch, you can only reclaim 16 EPC
> > > > > pages in one function call (as you have said you are going to
> > > > > remove SGX_NR_TO_SCAN_MAX, which is a cipher to both of us).  The
> > > > > only difference I can see is, with this patch, you can have
> > > > > multiple calls of "isolate" and then call the "do_reclaim" once.
> > > > >
> > > > > But what's the good of having the "isolate" if the "do_reclaim"
> > > > > can only reclaim 16 pages anyway?
> > > > >
> > > > > Back to my last reply, are you afraid of any LRU has less than 16
> > > > > pages to "isolate", therefore you need to loop LRUs of descendants
> > > > > to get 16?  Cause I really cannot think of any other reason why
> > > > > you are doing this.
> > > >
> > > > I think I see your point. By capping pages reclaimed per cycle to
> > > > 16, there is not much difference even if those 16 pages are spread
> > > > in separate LRUs. The difference is only significant when we ever
> > > > raise that cap. To preserve the current behavior of ewb loops
> > > > independent on number of LRUs to loop through for each reclaiming
> > > > cycle, regardless of the exact value of the page cap, I would still
> > > > think current approach in the patch is reasonable choice.  What do
> > > > you think?
> > >
> > > To me I won't bother to do that.  Having less than 16 pages in one
> > > LRU is *extremely rare* that should never happen in practice.  It's
> > > pointless to make such code adjustment at this stage.
> > >
> > > Let's focus on enabling functionality first.  When you have some real
> > > performance issue that is related to this, we can come back then.
> >
> > I have done some rethinking about this and realize this does save quite
> > some significant work: without breaking out the isolation part from
> > sgx_reclaim_pages(), I can remove the changes to use a list for
> > isolated pages, and no need to introduce "state" such as
> > RECLAIM_IN_PROGRESS. About 1/3 of changes for per-cgroup reclamation
> > will be gone.
> >
> > So I think I'll go this route now. The only downside may be performance
> > if an enclave spreads its pages in different cgroups and even that is
> > minimum impact as we limit reclamation to 16 pages a time. Let me know
> > if someone feels strongly we need to deal with that and sees some other
> > potential issues I may have missed.
>
> We could deal with possible performance regression later on (if there
> is need). I mean there should be a workload first that would show that
> sort of stimulus...

I.e. no reason to deal with imaginary workload :-) Go ahead and we'll
go through it.

BR, Jarkko



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-11-19 Thread Huang, Kai
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> To prepare for per-cgroup reclamation, separate the top-level reclaim
> function, sgx_reclaim_epc_pages(), into two separate functions:
> 
> - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a given 
> LRU list.
> - sgx_do_epc_reclamation() performs the real reclamation for the already 
> isolated pages.
> 
> Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> in succession, to replace the original sgx_reclaim_epc_pages(). The
> above two functions will serve as building blocks for the reclamation
> flows in later EPC cgroup implementation.
> 
> sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> cgroup will use the result to track reclaiming progress.
> 
> sgx_isolate_epc_pages() returns the additional number of pages to scan
> for current epoch of reclamation. The EPC cgroup will use the result to
> determine if more scanning to be done in LRUs in its children groups.

This changelog says nothing about "why", but only mentions the "implementation".

For instance, assuming we need to reclaim @npages_to_reclaim from the
@epc_cgrp_to_reclaim and its descendants, why cannot we do:

for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
if (npages_to_reclaim <= 0)
return;

npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
npages_to_reclaim);
}

Is there any difference to have "isolate" + "reclaim"?

> 
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Cc: Sean Christopherson 
> ---
> 

[...]

> +/**
> + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> + * @iso: List of isolated pages for reclamation
> + *
> + * Take a list of EPC pages and reclaim them to the enclave's private shmem 
> files.  Do not
> + * reclaim the pages that have been accessed since the last scan, and move 
> each of those pages
> + * to the tail of its tracking LRU list.
> + *
> + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per 
> call in order to
> + * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() 
> does degrade a bit
> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and 
> IPI + EWB) but not
> + * sufficiently. Reclaiming one page at a time would also be problematic as 
> it would increase
> + * the lock contention too much, which would halt forward progress.

This is kinda optimization, correct?  Is there any real performance data to
justify this?  If this optimization is useful, shouldn't we bring this
optimization to the current sgx_reclaim_pages() instead, e.g., just increase
SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?




Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-11-26 Thread Haitao Huang

On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai  wrote:


> On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > From: Sean Christopherson 
> > 
> > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > function, sgx_reclaim_epc_pages(), into two separate functions:
> > 
> > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
> >   given LRU list.
> > - sgx_do_epc_reclamation() performs the real reclamation for the
> >   already isolated pages.
> > 
> > Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > above two functions will serve as building blocks for the reclamation
> > flows in later EPC cgroup implementation.
> > 
> > sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> > cgroup will use the result to track reclaiming progress.
> > 
> > sgx_isolate_epc_pages() returns the additional number of pages to scan
> > for current epoch of reclamation. The EPC cgroup will use the result to
> > determine if more scanning to be done in LRUs in its children groups.
> 
> This changelog says nothing about "why", but only mentions the
> "implementation".
> 
> For instance, assuming we need to reclaim @npages_to_reclaim from the
> @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> 
>         for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
>                 if (npages_to_reclaim <= 0)
>                         return;
> 
>                 npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
>                                                            npages_to_reclaim);
>         }
> 
> Is there any difference to have "isolate" + "reclaim"?

This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
Here we just follow the same design as ksgxd for each reclamation cycle.

> > Signed-off-by: Sean Christopherson 
> > Co-developed-by: Kristen Carlson Accardi 
> > Signed-off-by: Kristen Carlson Accardi 
> > Co-developed-by: Haitao Huang 
> > Signed-off-by: Haitao Huang 
> > Cc: Sean Christopherson 
> > ---
> > 
> 
> [...]
> 
> > +/**
> > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> > + * @iso:	List of isolated pages for reclamation
> > + *
> > + * Take a list of EPC pages and reclaim them to the enclave's private
> > + * shmem files.  Do not reclaim the pages that have been accessed since
> > + * the last scan, and move each of those pages to the tail of its
> > + * tracking LRU list.
> > + *
> > + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX
> > + * per call in order to degrade amount of IPI's and ETRACK's potentially
> > + * required. sgx_encl_ewb() does degrade a bit among the HW threads with
> > + * three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
> > + * sufficiently. Reclaiming one page at a time would also be problematic
> > + * as it would increase the lock contention too much, which would halt
> > + * forward progress.
> 
> This is kinda optimization, correct?  Is there any real performance data
> to justify this?

The above sentences were there originally. This optimization was justified.

> If this optimization is useful, shouldn't we bring this
> optimization to the current sgx_reclaim_pages() instead, e.g., just
> increase SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?

SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't
know. Currently it is really the buffer size allocated in
sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages a time. Maybe we
should just use SGX_NR_TO_SCAN. No _MAX needed. The point was to batch
reclamation to certain number to minimize impact of EWB pipeline. 16 was
the original design.

Thanks
Haitao



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-11-27 Thread Huang, Kai
On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai  wrote:
> 
> > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson 
> > > 
> > > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > 
> > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a  
> > > given LRU list.
> > > - sgx_do_epc_reclamation() performs the real reclamation for the  
> > > already isolated pages.
> > > 
> > > Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > above two functions will serve as building blocks for the reclamation
> > > flows in later EPC cgroup implementation.
> > > 
> > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> > > cgroup will use the result to track reclaiming progress.
> > > 
> > > sgx_isolate_epc_pages() returns the additional number of pages to scan
> > > for current epoch of reclamation. The EPC cgroup will use the result to
> > > determine if more scanning to be done in LRUs in its children groups.
> > 
> > This changelog says nothing about "why", but only mentions the  
> > "implementation".
> > 
> > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > 
> > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > if (npages_to_reclaim <= 0)
> > return;
> > 
> > npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > npages_to_reclaim);
> > }
> > 
> > Is there any difference to have "isolate" + "reclaim"?
> > 
> 
> This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> Here we just follow the same design as ksgxd for each reclamation cycle.

I don't see how did you "follow" ksgxd.  If I am guessing correctly, you are
afraid of there might be less than 16 pages in a given EPC cgroup, thus w/o
splitting into "isolate" + "reclaim" you might feed the "reclaim" less than 16
pages, which might cause some performance degrade?

But is this a common case?  Should we even worry about this?

I suppose for such new feature we should bring functionality first and then
optimization if you have real performance data to show.

> 
> > > 
> > > Signed-off-by: Sean Christopherson 
> > > Co-developed-by: Kristen Carlson Accardi 
> > > Signed-off-by: Kristen Carlson Accardi 
> > > Co-developed-by: Haitao Huang 
> > > Signed-off-by: Haitao Huang 
> > > Cc: Sean Christopherson 
> > > ---
> > > 
> > 
> > [...]
> > 
> > > +/**
> > > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC  
> > > pages.
> > > + * @iso: List of isolated pages for reclamation
> > > + *
> > > + * Take a list of EPC pages and reclaim them to the enclave's private  
> > > shmem files.  Do not
> > > + * reclaim the pages that have been accessed since the last scan, and  
> > > move each of those pages
> > > + * to the tail of its tracking LRU list.
> > > + *
> > > + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX  
> > > per call in order to
> > > + * degrade amount of IPI's and ETRACK's potentially required.  
> > > sgx_encl_ewb() does degrade a bit
> > > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +  
> > > EWB and IPI + EWB) but not
> > > + * sufficiently. Reclaiming one page at a time would also be  
> > > problematic as it would increase
> > > + * the lock contention too much, which would halt forward progress.
> > 
> > This is kinda optimization, correct?  Is there any real performance data  
> > to
> > justify this?
> 
> The above sentences were there originally. This optimization was justified.

I am talking about 16 -> 32.

> 
> > If this optimization is useful, shouldn't we bring this
> > optimization to the current sgx_reclaim_pages() instead, e.g., just  
> > increase
> > SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
> > 
> 
> SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't  
> know. Currently it is really the buffer size allocated in  
> sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages a time. Maybe we  
> should just use SGX_NR_TO_SCAN. No _MAX needed. The point was to batch  
> reclamation to certain number to minimize impact of EWB pipeline. 16 was  
> the original design.
> 

Please don't leave why you are trying to do this to the reviewers.  If you don't
know, then just drop this.