Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Haitao Huang

On Wed, 18 Oct 2023 10:52:23 -0500, Michal Koutný  wrote:

On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen  
 wrote:

1. Admin sets a limit
2. Enclave is created
3. Enclave hits limit, allocation fails


I was actually about to suggest reorganizing the series to a part
implementing this simple limiting and a subsequent part with the reclaim
stuff for easier digestability.


Nothing else matters.


If the latter part is an unncessary overkill, it's even better.



Ok. I'll take out max_write() callback and only implement non-preemptive  
misc.max for EPC.
I can also separate OOEPC_killing enclaves out, which is not needed if we  
only block allocation at limit, no need killing one enclave to make space  
for another. This will simplify a lot.


Thanks to all for your input!

Haitao



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Michal Koutný
On Wed, Oct 18, 2023 at 08:37:25AM -0700, Dave Hansen  
wrote:
> 1. Admin sets a limit
> 2. Enclave is created
> 3. Enclave hits limit, allocation fails

I was actually about to suggest reorganizing the series to a part
implementing this simple limiting and a subsequent part with the reclaim
stuff for easier digestability. 

> Nothing else matters.

If the latter part is an unncessary overkill, it's even better.

Thanks,
Michal


signature.asc
Description: PGP signature


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Dave Hansen
On 10/18/23 08:26, Haitao Huang wrote:
> Maybe not in sense of killing something. My understanding memory.reclaim
> does not necessarily invoke the OOM killer. But what I really intend to
> say is we can have a separate knob for user to express the need for
> reducing the current usage explicitly and keep "misc.max' non-preemptive
> semantics intact. When we implement that new knob, then we can define
> what kind of reclaim for that. Depending on vEPC implementation, it may
> or may not involve killing VMs. But at least that semantics will be
> explicit for user.

I'm really worried that you're going for "perfect" semantics here.  This
is SGX.  It's *NOT* getting heavy use and even fewer folks will ever
apply cgroup controls to it.

Can we please stick to simple, easily-coded rules here?  I honestly
don't think these corner cases matter all that much and there's been
*WAY* too much traffic in this thread for what is ultimately not that
complicated.  Focus on *ONE* thing:

1. Admin sets a limit
2. Enclave is created
3. Enclave hits limit, allocation fails

Nothing else matters.  What if the admin lowers the limit on an
already-created enclave?  Nobody cares.  Seriously.  What about inducing
reclaim?  Nobody cares.  What about vEPC?  Doesn't matter, an enclave
page is an enclave page.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Haitao Huang
On Wed, 18 Oct 2023 08:55:12 -0500, Dave Hansen   
wrote:



On 10/17/23 21:37, Haitao Huang wrote:

Yes we can introduce misc.reclaim to give user a knob to forcefully
reducing usage if that is really needed in real usage. The semantics
would make force-kill VMs explicit to user.


Do any other controllers do something like this?  It seems odd.


Maybe not in sense of killing something. My understanding memory.reclaim  
does not necessarily invoke the OOM killer. But what I really intend to  
say is we can have a separate knob for user to express the need for  
reducing the current usage explicitly and keep "misc.max' non-preemptive  
semantics intact. When we implement that new knob, then we can define what  
kind of reclaim for that. Depending on vEPC implementation, it may or may  
not involve killing VMs. But at least that semantics will be explicit for  
user.


Thanks
Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-18 Thread Dave Hansen
On 10/17/23 21:37, Haitao Huang wrote:
> Yes we can introduce misc.reclaim to give user a knob to forcefully 
> reducing usage if that is really needed in real usage. The semantics
> would make force-kill VMs explicit to user.

Do any other controllers do something like this?  It seems odd.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Haitao Huang

On Tue, 17 Oct 2023 14:13:22 -0500, Michal Koutný  wrote:

On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný  
 wrote:

Is this distinction between preemptability of EPC pages mandated by the
HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
users an option to lock certain pages in memory that yields this
difference?


(After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")

Or would these two types warrant also two types of miscresource? (To
deal with each in own way.)


They are from the same bucket of HW resource so I think it's more suitable  
to be one resource type. Otherwise need to policy to dividing the  
capacity, etc. And it is still possible in future vEPC become reclaimable.


My current thinking is we probably can get away with non-preemptive  
max_write for enclaves too. See my other reply.


Thanks
Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Haitao Huang

Hi Michal,

On Tue, 17 Oct 2023 13:54:46 -0500, Michal Koutný  wrote:


Hello Haitao.

On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang  
 wrote:
AFAIK, before we introducing max_write() callback in this series, no  
misc
controller would possibly enforce the limit when misc.max is reduced.  
e.g. I

don't think CVMs be killed when ASID limit is reduced and the cgroup was
full before limit is reduced.


Yes, misccontroller was meant to be simple, current >= max serves to
prevent new allocations.


Thanks for confirming. Maybe another alternative we just keep max_write
non-preemptive. No need to add max_write() callback.

The EPC controller only triggers reclaiming on new allocations or return
NOMEM if no more to reclaim. Reclaiming here includes normal EPC page  
reclaiming and killing enclaves in out of EPC cases. vEPCs assigned to  
guests are basically carved out and never reclaimable by the host.


As we no longer enforce limits on max_write a lower value, user should not  
expect cgroup to force reclaim pages from enclave or kill VMs/enclaves as  
a result of reducing limits 'in-place'. User should always create cgroups,  
set limits, launch enclave/VM into the groups created.



FTR, at some point in time memory.max was considered for reclaim control
of regular pages but it turned out to be too coarse (and OOM killing
processes if amount was not sensed correctly) and this eventually
evolved into specific mechanism of memory.reclaim.
So I'm mentioning this should that be an interface with better semantic
for your use case (and misc.max writes can remain non-preemptive).



Yes we can introduce misc.reclaim to give user a knob to forcefully  
reducing usage if
that is really needed in real usage. The semantics would make force-kill  
VMs explicit to user.



One more note -- I was quite confused when I read in the rest of the
series about OOM and _kill_ing but then I found no such measure in the
code implementation. So I would suggest two terminological changes:

- the basic premise of the series (00/18) is that EPC pages are a
  different resource than memory, hence choose a better suiting name
  than OOM (out of memory) condition,


I couldn't come up a good name. Out of EPC (OOEPC) maybe? I feel OOEPC  
would be hard to read in code though. OOM was relatable as it is similar  
to normal OOM but special kind of memory :-) I'm open to any better  
suggestions.



- killing -- (unless you have an intention to implement process
  termination later) My current interpretation that it is rather some
  aggressive unmapping within address space, so less confusing name for
  that would be "reclaim".



yes. Killing here refers to killing enclave, analogous to killing process,
not just 'reclaim' though. I can change to always use 'killing enclave'  
explicitly.




I think EPC pages to VMs could have the same behavior, once they are  
given
to a guest, never taken back by the host. For enclaves on host side,  
pages

are reclaimable, that allows us to enforce in a similar way to memcg.


Is this distinction between preemptability of EPC pages mandated by the
HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
users an option to lock certain pages in memory that yields this
difference?



The difference is really a result of current vEPC implementation. Because
enclave pages once in use contains confidential content, they need special
process to reclaim. So it's complex to implement host reclaiming guest EPCs
gracefully.

Thanks
Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Michal Koutný
On Tue, Oct 17, 2023 at 08:54:48PM +0200, Michal Koutný  
wrote:
> Is this distinction between preemptability of EPC pages mandated by the
> HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
> users an option to lock certain pages in memory that yields this
> difference?

(After skimming Documentation/arch/x86/sgx.rst, Section "Virtual EPC")

Or would these two types warrant also two types of miscresource? (To
deal with each in own way.)

Thanks,
Michal


signature.asc
Description: PGP signature


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Michal Koutný
Hello Haitao.

On Tue, Oct 17, 2023 at 07:58:02AM -0500, Haitao Huang 
 wrote:
> AFAIK, before we introducing max_write() callback in this series, no misc
> controller would possibly enforce the limit when misc.max is reduced. e.g. I
> don't think CVMs be killed when ASID limit is reduced and the cgroup was
> full before limit is reduced.

Yes, misccontroller was meant to be simple, current >= max serves to
prevent new allocations.

FTR, at some point in time memory.max was considered for reclaim control
of regular pages but it turned out to be too coarse (and OOM killing
processes if amount was not sensed correctly) and this eventually
evolved into specific mechanism of memory.reclaim.
So I'm mentioning this should that be an interface with better semantic
for your use case (and misc.max writes can remain non-preemptive).

One more note -- I was quite confused when I read in the rest of the
series about OOM and _kill_ing but then I found no such measure in the
code implementation. So I would suggest two terminological changes:

- the basic premise of the series (00/18) is that EPC pages are a
  different resource than memory, hence choose a better suiting name
  than OOM (out of memory) condition,
- killing -- (unless you have an intention to implement process
  termination later) My current interpretation that it is rather some
  aggressive unmapping within address space, so less confusing name for
  that would be "reclaim".


> I think EPC pages to VMs could have the same behavior, once they are given
> to a guest, never taken back by the host. For enclaves on host side, pages
> are reclaimable, that allows us to enforce in a similar way to memcg.

Is this distinction between preemptability of EPC pages mandated by the
HW implementation? (host/"process" enclaves vs VM enclaves) Or do have
users an option to lock certain pages in memory that yields this
difference?

Regards,
Michal


signature.asc
Description: PGP signature


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Sean Christopherson
On Mon, Oct 16, 2023, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson 
> wrote:
> 
> > On Mon, Oct 16, 2023, Haitao Huang wrote:
> > > From this perspective, I think the current implementation is
> > > "well-defined":
> > > EPC cgroup limits for VMs are only enforced at VM launch time, not
> > > runtime.  In practice,  SGX VM can be launched only with fixed EPC size
> > > and all those EPCs are fully committed to the VM once launched.
> > 
> > Fully committed doesn't mean those numbers are reflected in the cgroup.  A
> > VM scheduler can easily "commit" EPC to a guest, but allocate EPC on
> > demand, i.e.  when the guest attempts to actually access a page.
> > Preallocating memory isn't free, e.g. it can slow down guest boot, so it's
> > entirely reasonable to have virtual EPC be allocated on-demand.  Enforcing
> > at launch time doesn't work for such setups, because from the cgroup's
> > perspective, the VM is using 0 pages of EPC at launch.
> > 
> Maybe I understood the current implementation wrong. From what I see, vEPC
> is impossible not fully commit at launch time. The guest would EREMOVE all
> pages during initialization resulting #PF and all pages allocated. This
> essentially makes "prealloc=off" the same as "prealloc=on".
> Unless you are talking about some custom OS or kernel other than upstream
> Linux here?

Yes, a customer could be running an older kernel, something other than Linux, a
custom kernel, an out-of-tree SGX driver, etc.  The host should never assume
anything about the guest kernel when it comes to correctness (unless the guest
kernel is controlled by the host).

Doing EREMOVE on all pages is definitely not mandatory, especially if the kernel
detects a hypervisor, i.e. knows its running as a guest.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Haitao Huang

On Mon, 16 Oct 2023 20:34:57 -0500, Huang, Kai  wrote:


On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote:
On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai   
wrote:

[...]

> still need to fix the bug mentioned above here.
>
> I really think you should just go this simple way:
>
> When you want to take EPC back from VM, kill the VM.
>

My only concern is that this is a compromise due to current limitation  
(no

other sane way to take EPC from VMs). If we define this behavior and it
becomes a contract to user space, then we can't change in future.


Why do we need to "define such behaviour"?

This isn't some kinda of kernel/userspace ABI IMHO, but only kernel  
internal
implementation.  Here VM is similar to normal host enclaves.  You limit  
the
resource, some host enclaves could be killed.  Similarly, VM could also  
be

killed too.

And supporting VMM EPC oversubscription doesn't mean VM won't be  
killed.  The VM
can still be a target to kill after VM's all EPC pages have been swapped  
out.




On the other hand, my understanding the reason you want this behavior is
to enforce EPC limit at runtime.


No I just thought this is a bug/issue needs to be fixed.  If anyone  
believes

this is not a bug/issue then it's a separate discussion.



AFAIK, before we introducing max_write() callback in this series, no misc  
controller would possibly enforce the limit when misc.max is reduced. e.g.  
I don't think CVMs be killed when ASID limit is reduced and the cgroup was  
full before limit is reduced.


I think EPC pages to VMs could have the same behavior, once they are given  
to a guest, never taken back by the host. For enclaves on host side, pages  
are reclaimable, that allows us to enforce in a similar way to memcg.


Thanks
Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Mikko Ylinen
On Mon, Oct 16, 2023 at 02:32:31PM -0700, Sean Christopherson wrote:
> Genuinely curious, who is asking for EPC cgroup support that *isn't* running 
> VMs?

People who work with containers: [1], [2]. 

> AFAIK, these days, SGX is primarily targeted at cloud.  I assume virtual EPC 
> is
> the primary use case for an EPC cgroup.

The common setup is that a cloud VM instance with vEPC is created and then
several SGX enclave containers are run simultaneously on that instance. EPC
cgroups is used to ensure that each container gets their own share of EPC
(and any attempts to go beyond the limit is reclaimed and charged from
the container's memcg). The same containers w/ enclaves use case is
applicable to baremetal also, though.

As far as Kubernetes orchestrated containers are concerned, "in-place" resource
scaling is still in very early stages which means that the cgroups values are
adjusted by *re-creating* the container. The hierarchies are also built
such that there's no mix of VMs w/ vEPC and enclaves in the same tree.

Mikko

[1] 
https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kris...@linux.intel.com/T/#m6d1c895534b4c0636f47c2d1620016b4c362bb9b
[2] 
https://lore.kernel.org/linux-sgx/20221202183655.3767674-1-kris...@linux.intel.com/T/#m37600e457b832feee6e8346aa74dcff8f21965f8


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
On Mon, 2023-10-16 at 19:10 -0500, Haitao Huang wrote:
> On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai  wrote:
> [...]
> 
> > still need to fix the bug mentioned above here.
> > 
> > I really think you should just go this simple way:
> > 
> > When you want to take EPC back from VM, kill the VM.
> > 
> 
> My only concern is that this is a compromise due to current limitation (no  
> other sane way to take EPC from VMs). If we define this behavior and it  
> becomes a contract to user space, then we can't change in future.

Why do we need to "define such behaviour"?

This isn't some kinda of kernel/userspace ABI IMHO, but only kernel internal
implementation.  Here VM is similar to normal host enclaves.  You limit the
resource, some host enclaves could be killed.  Similarly, VM could also be
killed too.

And supporting VMM EPC oversubscription doesn't mean VM won't be killed.  The VM
can still be a target to kill after VM's all EPC pages have been swapped out.

> 
> On the other hand, my understanding the reason you want this behavior is  
> to enforce EPC limit at runtime. 
> 

No I just thought this is a bug/issue needs to be fixed.  If anyone believes
this is not a bug/issue then it's a separate discussion.



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Haitao Huang

On Mon, 16 Oct 2023 16:09:52 -0500, Huang, Kai  wrote:
[...]


still need to fix the bug mentioned above here.

I really think you should just go this simple way:

When you want to take EPC back from VM, kill the VM.



My only concern is that this is a compromise due to current limitation (no  
other sane way to take EPC from VMs). If we define this behavior and it  
becomes a contract to user space, then we can't change in future.


On the other hand, my understanding the reason you want this behavior is  
to enforce EPC limit at runtime. I just not sure how important it is and  
if it is a real usage given all limitations of SGX VMs we have (static EPC  
size, no migration).


Thanks

Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Haitao Huang

Hi Sean

On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson  
 wrote:



On Mon, Oct 16, 2023, Haitao Huang wrote:
From this perspective, I think the current implementation is  
"well-defined":
EPC cgroup limits for VMs are only enforced at VM launch time, not  
runtime.
In practice,  SGX VM can be launched only with fixed EPC size and all  
those

EPCs are fully committed to the VM once launched.


Fully committed doesn't mean those numbers are reflected in the cgroup.   
A VM
scheduler can easily "commit" EPC to a guest, but allocate EPC on  
demand, i.e.
when the guest attempts to actually access a page.  Preallocating memory  
isn't
free, e.g. it can slow down guest boot, so it's entirely reasonable to  
have virtual
EPC be allocated on-demand.  Enforcing at launch time doesn't work for  
such setups,
because from the cgroup's perspective, the VM is using 0 pages of EPC at  
launch.


Maybe I understood the current implementation wrong. From what I see, vEPC  
is impossible not fully commit at launch time. The guest would EREMOVE all  
pages during initialization resulting #PF and all pages allocated. This  
essentially makes "prealloc=off" the same as "prealloc=on".
Unless you are talking about some custom OS or kernel other than upstream  
Linux here?


Thanks
Haitap


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Sean Christopherson
On Mon, Oct 16, 2023, Haitao Huang wrote:
> From this perspective, I think the current implementation is "well-defined":
> EPC cgroup limits for VMs are only enforced at VM launch time, not runtime.
> In practice,  SGX VM can be launched only with fixed EPC size and all those
> EPCs are fully committed to the VM once launched.

Fully committed doesn't mean those numbers are reflected in the cgroup.  A VM
scheduler can easily "commit" EPC to a guest, but allocate EPC on demand, i.e.
when the guest attempts to actually access a page.  Preallocating memory isn't
free, e.g. it can slow down guest boot, so it's entirely reasonable to have 
virtual
EPC be allocated on-demand.  Enforcing at launch time doesn't work for such 
setups,
because from the cgroup's perspective, the VM is using 0 pages of EPC at launch.

> Because of that, I imagine people are using VMs to primarily partition the
> physical EPCs, i.e, the static size itself is the 'limit' for the workload of
> a single VM and not expecting EPCs taken away at runtime.

If everything goes exactly as planned, sure.  But it's not hard to imagine some
configuration change way up the stack resulting in the hard limit for an EPC 
cgroup
being lowered.

> So killing does not really add much value for the existing usages IIUC.

As I said earlier, the behavior doesn't have to result in terminating a VM, e.g.
the virtual EPC code could provide a knob to send a signal/notification if the
owning cgroup has gone above the limit and the VM is targeted for forced 
reclaim.

> That said, I don't anticipate adding the enforcement of killing VMs at
> runtime would break such usages as admin/user can simply choose to set the
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and have some
> feedback from community before we decide if that needs be included in first
> version or we can skip it until we have EPC reclaiming for VMs.

Gracefully *swapping* virtual EPC isn't required for oversubscribing virtual 
EPC.
Think of it like airlines overselling tickets.  The airline sells more tickets
than they have seats, and banks on some passengers canceling.  If too many 
people
show up, the airline doesn't swap passengers to the cargo bay, they just shunt 
them
to a different plane.

The same could be easily be done for hosts and virtual EPC.  E.g. if every VM
*might* use 1GiB, but in practice 99% of VMs only consume 128MiB, then it's not
too crazy to advertise 1GiB to each VM, but only actually carve out 256MiB per 
VM
in order to pack more VMs on a host.  If the host needs to free up EPC, then the
most problematic VMs can be migrated to a different host.

Genuinely curious, who is asking for EPC cgroup support that *isn't* running 
VMs?
AFAIK, these days, SGX is primarily targeted at cloud.  I assume virtual EPC is
the primary use case for an EPC cgroup.

I don't have any skin in the game beyond my name being attached to some of the
patches, i.e. I certainly won't stand in the way.  I just don't understand why
you would go through all the effort of adding an EPC cgroup and then not go the
extra few steps to enforce limits for virtual EPC.  Compared to the complexity
of the rest of the series, that little bit seems quite trivial.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
> 
> 
>  From this perspective, I think the current implementation is  
> "well-defined": EPC cgroup limits for VMs are only enforced at VM launch  
> time, not runtime. In practice,  SGX VM can be launched only with fixed  
> EPC size and all those EPCs are fully committed to the VM once launched.  
> Because of that, I imagine people are using VMs to primarily partition the  
> physical EPCs, i.e, the static size itself is the 'limit' for the workload  
> of a single VM and not expecting EPCs taken away at runtime.
> 
> So killing does not really add much value for the existing usages IIUC.

It's not about adding value to the existing usages, it's about fixing the issue
when we lower the EPC limit to a point that is less than total virtual EPC size.

It's a design issue, or simply a bug in the current implementation we need to
fix.

> 
> That said, I don't anticipate adding the enforcement of killing VMs at  
> runtime would break such usages as admin/user can simply choose to set the  
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and have  
> some feedback from community before we decide if that needs be included in  
> first version or we can skip it until we have EPC reclaiming for VMs.

I don't understand what is the "add-on" patch you are talking about.

I mentioned the "typical deployment thing" is that can help us understand which
algorithm is better way to select the victim.  But no matter what we choose, we
still need to fix the bug mentioned above here.

I really think you should just go this simple way: 

When you want to take EPC back from VM, kill the VM.

Another bad thing about "just removing EPC pages from VM" is the enclaves in the
VM would suffer "sudden lose of EPC", or even worse, suffer it at a high
frequency.  Although we depend on that for supporting SGX VM live migration, but
that needs to avoided if possible.



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Haitao Huang

On Mon, 16 Oct 2023 05:57:36 -0500, Huang, Kai  wrote:


On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote:
On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai   
wrote:

[...]
> (btw, even you track VA/SECS pages in unreclaimable list, given they
> both have
> 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
> SGX_EPC_OWNER_PAGE ?)

Let me think about it, there might be also a way just track encl objects
not unreclaimable pages.

I still not get why we need kill the VM not just remove just enough  
pages.

Is it due to the static allocation not able to reclaim?


We can choose to "just remove enough EPC pages".  The VM may or may not  
be
killed when it wants the EPC pages back, depending on whether the  
current EPC

cgroup can provide enough EPC pages or not.  And this depends on how we
implement the cgroup algorithm to reclaim EPC pages.

One problem could be: for a EPC cgroup only has SGX VMs, you may end up  
with

moving EPC pages from one VM to another and then vice versa endlessly,


This could be a valid use case though if people intend to share EPCs  
between two VMs. Understand no one would be able to use VMs this way  
currently with the static allocation.



because
you never really actually mark any VM to be dead just like OOM does to  
the

normal enclaves.

From this point, you still need a way to kill a VM, IIUC.

I think the key point of virtual EPC vs cgroup, as quoted from Sean,  
should be

"having sane, well-defined behavior".

Does "just remove enough EPC pages" meet this?  If the problem mentioned  
above
can be avoided, I suppose so?  So if there's an easy way to achieve, I  
guess it

can be an option too.

But for the initial support, IMO we are not looking for a perfect but yet
complicated solution.  I would say, from review's point of view, it's  
preferred
to have a simple implementation to achieve a not-prefect, but  
consistent, well-

defined behaviour.

So to me looks killing the VM when cgroup cannot reclaim any more EPC  
pages is a

simple option.

But I might have missed something, especially since middle of last week  
I have

been having fever and headache :-)

So as mentioned above, you can try other alternatives, but please avoid
complicated ones.

Also, I guess it will be helpful if we can understand the typical SGX  
app and/or
SGX VM deployment under EPC cgroup use case.  This may help us on  
justifying why

the EPC cgroup algorithm to select victim is reasonable.




From this perspective, I think the current implementation is  
"well-defined": EPC cgroup limits for VMs are only enforced at VM launch  
time, not runtime. In practice,  SGX VM can be launched only with fixed  
EPC size and all those EPCs are fully committed to the VM once launched.  
Because of that, I imagine people are using VMs to primarily partition the  
physical EPCs, i.e, the static size itself is the 'limit' for the workload  
of a single VM and not expecting EPCs taken away at runtime.


So killing does not really add much value for the existing usages IIUC.

That said, I don't anticipate adding the enforcement of killing VMs at  
runtime would break such usages as admin/user can simply choose to set the  
limit equal to the static size to launch the VM and forget about it.


Given that, I'll propose an add-on patch to this series as RFC and have  
some feedback from community before we decide if that needs be included in  
first version or we can skip it until we have EPC reclaiming for VMs.


Thanks
Haitao



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
On Wed, 2023-10-11 at 01:14 +, Huang, Kai wrote:
> On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > > 
> > > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > > EPC too?
> > > 
> > 
> > That flag is set for enclaves, do you mean we set similar flag in vepc  
> > struct?
> 
> Yes.

I missed the "ENCL" part but only noted the "NO_MEMORY" part, so I guess it
should not be used directly for vEPC.  So if it is needed, SGX_VEPC_NO_MEMORY,
or a simple 'bool dead' or similar in 'struct sgx_vepc' is more suitable.

As I said I was fighting with fever and headache last week :-) 


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Huang, Kai
On Thu, 2023-10-12 at 08:27 -0500, Haitao Huang wrote:
> On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai  wrote:
> [...]
> > (btw, even you track VA/SECS pages in unreclaimable list, given they  
> > both have
> > 'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
> > SGX_EPC_OWNER_PAGE ?)
> 
> Let me think about it, there might be also a way just track encl objects  
> not unreclaimable pages.
> 
> I still not get why we need kill the VM not just remove just enough pages.  
> Is it due to the static allocation not able to reclaim?

We can choose to "just remove enough EPC pages".  The VM may or may not be
killed when it wants the EPC pages back, depending on whether the current EPC
cgroup can provide enough EPC pages or not.  And this depends on how we
implement the cgroup algorithm to reclaim EPC pages.

One problem could be: for a EPC cgroup only has SGX VMs, you may end up with
moving EPC pages from one VM to another and then vice versa endlessly, because
you never really actually mark any VM to be dead just like OOM does to the
normal enclaves.

From this point, you still need a way to kill a VM, IIUC.

I think the key point of virtual EPC vs cgroup, as quoted from Sean, should be
"having sane, well-defined behavior".

Does "just remove enough EPC pages" meet this?  If the problem mentioned above
can be avoided, I suppose so?  So if there's an easy way to achieve, I guess it
can be an option too.

But for the initial support, IMO we are not looking for a perfect but yet
complicated solution.  I would say, from review's point of view, it's preferred
to have a simple implementation to achieve a not-prefect, but consistent, well-
defined behaviour.

So to me looks killing the VM when cgroup cannot reclaim any more EPC pages is a
simple option.

But I might have missed something, especially since middle of last week I have
been having fever and headache :-)

So as mentioned above, you can try other alternatives, but please avoid
complicated ones.

Also, I guess it will be helpful if we can understand the typical SGX app and/or
SGX VM deployment under EPC cgroup use case.  This may help us on justifying why
the EPC cgroup algorithm to select victim is reasonable.




Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-12 Thread Haitao Huang

On Tue, 10 Oct 2023 19:51:17 -0500, Huang, Kai  wrote:
[...]
(btw, even you track VA/SECS pages in unreclaimable list, given they  
both have

'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
SGX_EPC_OWNER_PAGE ?)


Let me think about it, there might be also a way just track encl objects  
not unreclaimable pages.


I still not get why we need kill the VM not just remove just enough pages.  
Is it due to the static allocation not able to reclaim?



If we always remove all vEPC pages/kill VM, then we should not need track  
individual vepc pages.


Thanks

Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-11 Thread Haitao Huang

On Tue, 10 Oct 2023 19:31:19 -0500, Huang, Kai  wrote:


On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote:
On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai   
wrote:


>
> > > > >
> > > > Later the hosting process could migrated/reassigned to another
> > cgroup?
> > > > What to do when the new cgroup is OOM?
> > > >
> > >
> > > You addressed in the documentation, no?
> > >
> > > +Migration
> > > +-
> > > +
> > > +Once an EPC page is charged to a cgroup (during allocation), it
> > > +remains charged to the original cgroup until the page is released
> > > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > > +move the EPC charges that it incurred while in the previous  
cgroup

> > > +to its new cgroup.
> >
> > Should we kill the enclave though because some VA pages may be in  
the

> > new
> > group?
> >
>
> I guess acceptable?
>
> And any difference if you keep VA/SECS to unreclaimabe list?

Tracking VA/SECS allows all cgroups, in which an enclave has allocation,
to identify the enclave following the back pointer and kill it as  
needed.


> If you migrate one
> enclave to another cgroup, the old EPC pages stay in the old cgroup
> while the
> new one is charged to the new group IIUC.
>
> I am not cgroup expert, but by searching some old thread it appears  
this

> isn't a
> supported model:
>
> https://lore.kernel.org/lkml/yeyr9181qgzt+...@mtj.duckdns.org/
>

IIUC it's a different problem here. If we don't track the allocated VAs  
in
the new group, then the enclave that spans the two groups can't be  
killed
by the new group. If so, some enclave could just hide in some small  
group

and never gets killed but keeps allocating in a different group?



I mean from the link above IIUC migrating enclave among different  
cgroups simply
isn't a supported model, thus any bad behaviour isn't a big concern in  
terms of

decision making.


If we leave some pages in a cgroup unkillable, we are in the same  
situation of not able to enforce a cgroup limit as that we are are in if  
we don't kill VMs for lower limits.


I think not supporting migration of pages between cgroups should not leave  
a gap for enforcement just like we don't want to have an enforcement gap  
if we let VMs to hold pages once it is launched.


Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-11 Thread Haitao Huang
On Tue, 10 Oct 2023 19:01:25 -0500, Sean Christopherson  
 wrote:



On Tue, Oct 10, 2023, Haitao Huang wrote:
On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai   
wrote:


> On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> > Hi Sean
> >
> > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
> >  wrote:
> > > I can see userspace wanting to explicitly terminate the VM  
instead of

> > > "silently"
> > > the VM's enclaves, but that seems like it should be a knob in the
> > > virtual EPC
> > > code.
> >
> > If my understanding above is correct and understanding your  
statement
> > above correctly, then don't see we really need separate knob for  
vEPC

> > code. Reaching a cgroup limit by a running guest (assuming dynamic
> > allocation implemented) should not translate automatically killing
> > the VM.
> > Instead, it's user space job to work with guest to handle allocation
> > failure. Guest could page and kill enclaves.
> >
>
> IIUC Sean was talking about changing misc.max _after_ you launch SGX  
VMs:

>
> 1) misc.max = 100M
> 2) Launch VMs with total virtual EPC size = 100M   <- success
> 3) misc.max = 50M
>
> 3) will also succeed, but nothing will happen, the VMs will be still
> holding 100M EPC.
>
> You need to somehow track virtual EPC and kill VM instead.
>
> (or somehow fail to do 3) if it is also an acceptable option.)
>
Thanks for explaining it.

There is an error code to return from max_write. I can add that too to  
the
callback definition and fail it when it can't be enforced for any  
reason.

Would like some community feedback if this is acceptable though.


That likely isn't acceptable.  E.g. create a cgroup with both a host  
enclave and
virtual EPC, set the hard limit to 100MiB.  Virtual EPC consumes 50MiB,  
and the
host enclave consumes 50MiB.  Userspace lowers the limit to 49MiB.  The  
cgroup
code would reclaim all of the enclave's reclaimable EPC, and then kill  
the enclave
because it's still over the limit.  And then fail the max_write because  
the cgroup
is *still* over the limit.  So in addition to burning a lot of cycles,  
from
userspace's perspective its enclave was killed for no reason, as the new  
limit

wasn't actually set.


I was thinking before reclaiming enclave pages, if we know the untracked  
vepc pages (current-tracked) is larger than limit, we just return error  
without enforcing the limit. That way user also knows something is wrong.


But I get that we want to be able to kill VMs to enforce the newer lower  
limit. I assume we can't optimize efficiency/priority of killing: enclave  
pages will be reclaimed first no matter what just because they are  
reclaimable; no good rules to choose victim between enclave and VMs in  
your example so enclaves could be killed still before VMs.


I think to solve it ultimately, we need be able to adjust 'capacity' of  
VMs

not to just kill them, which is basically the same as dynamic allocation
support for VMs (being able to increase/decrease epc size when it is
running). For now, we only have static allocation so max can't be  
enforced

once it is launched.


No, reclaiming virtual EPC is not a requirement.  VMM EPC  
oversubscription is
insanely complex, and I highly doubt any users actually want to  
oversubcribe VMs.


There are use cases for cgroups beyond oversubscribing/swapping, e.g.  
privileged
userspace may set limits on a container to ensure the container doesn't  
*accidentally*
consume more EPC than it was allotted, e.g. due to a configuration bug  
that created

a VM with more EPC than it was supposed to have.

My comments on virtual EPC vs. cgroups is much more about having sane,  
well-defined
behavior, not about saying the kernel actually needs to support  
oversubscribing EPC

for KVM guests.


So here are the steps I see now, let me know if this is aligned with your  
thinking or I'm off track.


0) when all left are enclave VA, SECS pages and VEPC in a cgroup, the OOM  
kill process starts.
1) The cgroup identifies a victim vepc for OOM kill(this could be before  
or after enclaves being killed), sets a new flag VEPC_NO_MEMORY in the  
vepc.
2) call sgx_vepc_remove_all(), ignore failure counts returned. This will  
perform best effort to eremove all normal pages used by the vepc.
3) Guest may trigger fault again, return SIGBUS in sgx_vepc_fault when  
VEPC_NO_MEMORY is set. Do the same for mmap.
4) That would eventually cause sgx_vepc_release() before VM dies or killed  
by user space, which does the final cleanup.


Q: should we reset VEPC_NO_MEMORY flag if #4 never happens and cgroup  
usage is below limit? I suppose we can do it, but not sure it is sane  
because VM would try to load as much pages as configured originally.


I'm still thinking about using one unreclaimable list, justification is  
simplicity and lack of rules to select items from separate lists, but may  
change my mind if I found it's inconvenient.


Not sure how age/youngness can be accounted for guest pages though Kai 

Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Huang, Kai
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?

Yes.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Huang, Kai
On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai  wrote:
> 
> > On Tue, 2023-10-10 at 00:50 +, Huang, Kai wrote:
> > > On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > > > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > > > +/**
> > > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
> > > LRU
> > > > > > + * @lru:   LRU that is low
> > > > > > + *
> > > > > > + * Return: %true if a victim was found and kicked.
> > > > > > + */
> > > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > > > +{
> > > > > > +   struct sgx_epc_page *victim;
> > > > > > +
> > > > > > +   spin_lock(>lock);
> > > > > > +   victim = sgx_oom_get_victim(lru);
> > > > > > +   spin_unlock(>lock);
> > > > > > +
> > > > > > +   if (!victim)
> > > > > > +   return false;
> > > > > > +
> > > > > > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > > > +   return sgx_oom_encl_page(victim->encl_page);
> > > > > > +
> > > > > > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > > > +   return sgx_oom_encl(victim->encl);
> > > > > 
> > > > > I hate to bring this up, at least at this stage, but I am wondering  
> > > why we need
> > > > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
> > > an
> > > > > "enclave_list" instead?
> > > > 
> > > > The motivation for tracking EPC pages instead of enclaves was so that  
> > > the EPC
> > > > OOM-killer could "kill" VMs as well as host-owned enclaves. >
> > > 
> > > Ah this seems a fair argument. :-)
> > > 
> > > > The virtual EPC code
> > > > didn't actually kill the VM process, it instead just freed all of the  
> > > EPC pages
> > > > and abused the SGX architecture to effectively make the guest  
> > > recreate all its
> > > > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > > 
> > > It returns SIGBUS.  SGX VM live migration also requires enough EPC  
> > > being able to
> > > be allocated on the destination machine to work AFAICT.
> > > 
> > > > 
> > > > Looks like y'all punted on that with:
> > > > 
> > > >   The EPC pages allocated for KVM guests by the virtual EPC driver  
> > > are not
> > > >   reclaimable by the host kernel [5]. Therefore they are not tracked  
> > > by any
> > > >   LRU lists for reclaiming purposes in this implementation, but they  
> > > are
> > > >   charged toward the cgroup of the user processs (e.g., QEMU)  
> > > launching the
> > > >   guest.  And when the cgroup EPC usage reaches its limit, the  
> > > virtual EPC
> > > >   driver will stop allocating more EPC for the VM, and return SIGBUS  
> > > to the
> > > >   user process which would abort the VM launch.
> > > > 
> > > > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > > somehow. >
> > > 
> > > "enforced" do you mean?
> > > 
> > > Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
> > > allocate
> > > EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
> > > hva_to_pfn(),
> > > which eventually results in KVM returning -EFAULT to userspace in  
> > > vcpu_run().
> > > And Qemu then kills the VM with some nonsense message:
> > > 
> > > error: kvm run failed Bad address
> > > 
> > > 
> > > > Relying
> > > > on userspace to be kind enough to kill its VMs kinda defeats the  
> > > purpose of cgroup
> > > > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > > userspace running
> > > > encalves in a VM could continue on and refuse to give up its EPC, and  
> > > thus run above
> > > > its limit in perpetuity.
> > > 
> > > > 
> > > > I can see userspace wanting to explicitly terminate the VM instead of  
> > > "silently"
> > > > the VM's enclaves, but that seems like it should be a knob in the  
> > > virtual EPC
> > > > code.
> > 
> > I guess I slightly misunderstood your words.
> > 
> > You mean we want to kill VM when the limit is set to be lower than  
> > virtual EPC
> > size.
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?
> 
> > In the sgx_vepc_fault(), we check this flag at early time and return  
> > SIGBUS if
> > it is set.
> > 
> > But this also requires keeping virtual EPC pages in some list, and  
> > handles them
> > in sgx_epc_oom() too.
> > 
> > And for virtual EPC pages, I guess the "young" logic can be applied thus
> > probably it's better to keep the actual virtual EPC pages to a  
> > (separate?) list
> > instead of keeping the virtual EPC instance.
> > 
> > struct sgx_epc_lru {
> > struct list_head reclaimable;
> > struct sgx_encl *enclaves;
> > struct list_head vepc_pages;
> > }
> > 
> > Or still tracking VA/SECS and virtual EPC pages in a single  
> > unrecliamable list?
> 

Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Huang, Kai
On Tue, 2023-10-10 at 12:05 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai  wrote:
> 
> > 
> > > > > > 
> > > > > Later the hosting process could migrated/reassigned to another  
> > > cgroup?
> > > > > What to do when the new cgroup is OOM?
> > > > > 
> > > > 
> > > > You addressed in the documentation, no?
> > > > 
> > > > +Migration
> > > > +-
> > > > +
> > > > +Once an EPC page is charged to a cgroup (during allocation), it
> > > > +remains charged to the original cgroup until the page is released
> > > > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > > > +move the EPC charges that it incurred while in the previous cgroup
> > > > +to its new cgroup.
> > > 
> > > Should we kill the enclave though because some VA pages may be in the  
> > > new
> > > group?
> > > 
> > 
> > I guess acceptable?
> > 
> > And any difference if you keep VA/SECS to unreclaimabe list?
> 
> Tracking VA/SECS allows all cgroups, in which an enclave has allocation,  
> to identify the enclave following the back pointer and kill it as needed.
> 
> > If you migrate one
> > enclave to another cgroup, the old EPC pages stay in the old cgroup  
> > while the
> > new one is charged to the new group IIUC.
> > 
> > I am not cgroup expert, but by searching some old thread it appears this  
> > isn't a
> > supported model:
> > 
> > https://lore.kernel.org/lkml/yeyr9181qgzt+...@mtj.duckdns.org/
> > 
> 
> IIUC it's a different problem here. If we don't track the allocated VAs in  
> the new group, then the enclave that spans the two groups can't be killed  
> by the new group. If so, some enclave could just hide in some small group  
> and never gets killed but keeps allocating in a different group?
> 

I mean from the link above IIUC migrating enclave among different cgroups simply
isn't a supported model, thus any bad behaviour isn't a big concern in terms of
decision making.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Sean Christopherson
On Tue, Oct 10, 2023, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai  wrote:
> 
> > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> > > Hi Sean
> > > 
> > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
> > >  wrote:
> > > > I can see userspace wanting to explicitly terminate the VM instead of
> > > > "silently"
> > > > the VM's enclaves, but that seems like it should be a knob in the
> > > > virtual EPC
> > > > code.
> > > 
> > > If my understanding above is correct and understanding your statement
> > > above correctly, then don't see we really need separate knob for vEPC
> > > code. Reaching a cgroup limit by a running guest (assuming dynamic
> > > allocation implemented) should not translate automatically killing
> > > the VM.
> > > Instead, it's user space job to work with guest to handle allocation
> > > failure. Guest could page and kill enclaves.
> > > 
> > 
> > IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:
> > 
> > 1) misc.max = 100M
> > 2) Launch VMs with total virtual EPC size = 100M<- success
> > 3) misc.max = 50M
> > 
> > 3) will also succeed, but nothing will happen, the VMs will be still
> > holding 100M EPC.
> > 
> > You need to somehow track virtual EPC and kill VM instead.
> > 
> > (or somehow fail to do 3) if it is also an acceptable option.)
> > 
> Thanks for explaining it.
> 
> There is an error code to return from max_write. I can add that too to the
> callback definition and fail it when it can't be enforced for any reason.
> Would like some community feedback if this is acceptable though.

That likely isn't acceptable.  E.g. create a cgroup with both a host enclave and
virtual EPC, set the hard limit to 100MiB.  Virtual EPC consumes 50MiB, and the
host enclave consumes 50MiB.  Userspace lowers the limit to 49MiB.  The cgroup
code would reclaim all of the enclave's reclaimable EPC, and then kill the 
enclave
because it's still over the limit.  And then fail the max_write because the 
cgroup
is *still* over the limit.  So in addition to burning a lot of cycles, from
userspace's perspective its enclave was killed for no reason, as the new limit
wasn't actually set.

> I think to solve it ultimately, we need be able to adjust 'capacity' of VMs
> not to just kill them, which is basically the same as dynamic allocation
> support for VMs (being able to increase/decrease epc size when it is
> running). For now, we only have static allocation so max can't be enforced
> once it is launched.

No, reclaiming virtual EPC is not a requirement.  VMM EPC oversubscription is
insanely complex, and I highly doubt any users actually want to oversubcribe 
VMs.

There are use cases for cgroups beyond oversubscribing/swapping, e.g. privileged
userspace may set limits on a container to ensure the container doesn't 
*accidentally*
consume more EPC than it was allotted, e.g. due to a configuration bug that 
created
a VM with more EPC than it was supposed to have.  

My comments on virtual EPC vs. cgroups is much more about having sane, 
well-defined
behavior, not about saying the kernel actually needs to support oversubscribing 
EPC
for KVM guests.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Haitao Huang

On Mon, 09 Oct 2023 21:12:27 -0500, Huang, Kai  wrote:




> > >
> > Later the hosting process could migrated/reassigned to another  
cgroup?

> > What to do when the new cgroup is OOM?
> >
>
> You addressed in the documentation, no?
>
> +Migration
> +-
> +
> +Once an EPC page is charged to a cgroup (during allocation), it
> +remains charged to the original cgroup until the page is released
> +or reclaimed.  Migrating a process to a different cgroup doesn't
> +move the EPC charges that it incurred while in the previous cgroup
> +to its new cgroup.

Should we kill the enclave though because some VA pages may be in the  
new

group?



I guess acceptable?

And any difference if you keep VA/SECS to unreclaimabe list?


Tracking VA/SECS allows all cgroups, in which an enclave has allocation,  
to identify the enclave following the back pointer and kill it as needed.



If you migrate one
enclave to another cgroup, the old EPC pages stay in the old cgroup  
while the

new one is charged to the new group IIUC.

I am not cgroup expert, but by searching some old thread it appears this  
isn't a

supported model:

https://lore.kernel.org/lkml/yeyr9181qgzt+...@mtj.duckdns.org/



IIUC it's a different problem here. If we don't track the allocated VAs in  
the new group, then the enclave that spans the two groups can't be killed  
by the new group. If so, some enclave could just hide in some small group  
and never gets killed but keeps allocating in a different group?


Thanks
Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Haitao Huang

On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai  wrote:


On Tue, 2023-10-10 at 00:50 +, Huang, Kai wrote:

On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
LRU

> > > + * @lru:LRU that is low
> > > + *
> > > + * Return:  %true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +struct sgx_epc_page *victim;
> > > +
> > > +spin_lock(>lock);
> > > +victim = sgx_oom_get_victim(lru);
> > > +spin_unlock(>lock);
> > > +
> > > +if (!victim)
> > > +return false;
> > > +
> > > +if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +return sgx_oom_encl(victim->encl);
> >
> > I hate to bring this up, at least at this stage, but I am wondering  
why we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep  
an

> > "enclave_list" instead?
>
> The motivation for tracking EPC pages instead of enclaves was so that  
the EPC

> OOM-killer could "kill" VMs as well as host-owned enclaves. >

Ah this seems a fair argument. :-)

> The virtual EPC code
> didn't actually kill the VM process, it instead just freed all of the  
EPC pages
> and abused the SGX architecture to effectively make the guest  
recreate all its

> enclaves (IIRC, QEMU does the same thing to "support" live migration).

It returns SIGBUS.  SGX VM live migration also requires enough EPC  
being able to

be allocated on the destination machine to work AFAICT.

>
> Looks like y'all punted on that with:
>
>   The EPC pages allocated for KVM guests by the virtual EPC driver  
are not
>   reclaimable by the host kernel [5]. Therefore they are not tracked  
by any
>   LRU lists for reclaiming purposes in this implementation, but they  
are
>   charged toward the cgroup of the user processs (e.g., QEMU)  
launching the
>   guest.  And when the cgroup EPC usage reaches its limit, the  
virtual EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS  
to the

>   user process which would abort the VM launch.
>
> which IMO is a hack, unless returning SIGBUS is actually enforced  
somehow. >


"enforced" do you mean?

Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
allocate
EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
hva_to_pfn(),
which eventually results in KVM returning -EFAULT to userspace in  
vcpu_run().

And Qemu then kills the VM with some nonsense message:

error: kvm run failed Bad address


> Relying
> on userspace to be kind enough to kill its VMs kinda defeats the  
purpose of cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
userspace running
> encalves in a VM could continue on and refuse to give up its EPC, and  
thus run above

> its limit in perpetuity.

>
> I can see userspace wanting to explicitly terminate the VM instead of  
"silently"
> the VM's enclaves, but that seems like it should be a knob in the  
virtual EPC

> code.


I guess I slightly misunderstood your words.

You mean we want to kill VM when the limit is set to be lower than  
virtual EPC

size.

This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
EPC too?




That flag is set for enclaves, do you mean we set similar flag in vepc  
struct?


In the sgx_vepc_fault(), we check this flag at early time and return  
SIGBUS if

it is set.

But this also requires keeping virtual EPC pages in some list, and  
handles them

in sgx_epc_oom() too.

And for virtual EPC pages, I guess the "young" logic can be applied thus
probably it's better to keep the actual virtual EPC pages to a  
(separate?) list

instead of keeping the virtual EPC instance.

struct sgx_epc_lru {
struct list_head reclaimable;
struct sgx_encl *enclaves;
struct list_head vepc_pages;
}

Or still tracking VA/SECS and virtual EPC pages in a single  
unrecliamable list?




One LRU should be OK as we only need relative order in which they are  
loaded?
If an VA page is in front of vEPC, we just kill host side enclave first  
before disrupting VMs in the same group.
As the group is not in a good situation anyway so kernel just pick  
something reasonable to force kill.


Also after rereading the sentences "The virtual EPC code didn't actually  
kill the VM process, it instead just freed all of the  EPC pages and  
abused the SGX architecture to effectively make the guest  recreate all  
its enclaves..."


Maybe by "kill" vm, Sean means EREMOVE the vepc pages in the unreclaimable  
LRU, which effectively make enclaves in guest receiving "EPC lost" 

Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Haitao Huang

On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai  wrote:


On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:

Hi Sean

On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
 wrote:

> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
LRU

> > > + * @lru:LRU that is low
> > > + *
> > > + * Return:  %true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +struct sgx_epc_page *victim;
> > > +
> > > +spin_lock(>lock);
> > > +victim = sgx_oom_get_victim(lru);
> > > +spin_unlock(>lock);
> > > +
> > > +if (!victim)
> > > +return false;
> > > +
> > > +if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +return sgx_oom_encl(victim->encl);
> >
> > I hate to bring this up, at least at this stage, but I am wondering  
why

> > we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep  
an

> > "enclave_list" instead?
>
> The motivation for tracking EPC pages instead of enclaves was so that
> the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  The  
virtual

> EPC code
> didn't actually kill the VM process, it instead just freed all of the
> EPC pages
> and abused the SGX architecture to effectively make the guest recreate
> all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).
>
> Looks like y'all punted on that with:
>
>   The EPC pages allocated for KVM guests by the virtual EPC driver are
> not
>   reclaimable by the host kernel [5]. Therefore they are not tracked  
by

> any
>   LRU lists for reclaiming purposes in this implementation, but they  
are
>   charged toward the cgroup of the user processs (e.g., QEMU)  
launching

> the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual
> EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS  
to

> the
>   user process which would abort the VM launch.
>
> which IMO is a hack, unless returning SIGBUS is actually enforced
> somehow.  Relying
> on userspace to be kind enough to kill its VMs kinda defeats the  
purpose

> of cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,
> userspace running
> encalves in a VM could continue on and refuse to give up its EPC, and
> thus run above
> its limit in perpetuity.
>
Cgroup would refuse to allocate more when limit is reached so VMs can  
not

run above limit.

IIRC VMs only support static EPC size right now, reaching limit at  
launch

means the EPC size given in command line for QEMU is not appropriate. So
VM should not launch, hence the current behavior.

[all EPC pages in guest are allocated on page fault caused by the
sensitization process in guest kernel during init, which is part of the  
VM

Launch process. So SIGNBUS will turn into failed VM launch.]

Once it is launched, guest kernel would have 'total capacity' given by  
the

static value from QEMU option. And it would start paging when it is used
up, never would ask for more from host.

For future with dynamic EPC for running guests, QEMU could handle
allocation failure and pass SIGBUS to the running guest kernel.  Is that
correct understanding?


> I can see userspace wanting to explicitly terminate the VM instead of
> "silently"
> the VM's enclaves, but that seems like it should be a knob in the
> virtual EPC
> code.

If my understanding above is correct and understanding your statement
above correctly, then don't see we really need separate knob for vEPC
code. Reaching a cgroup limit by a running guest (assuming dynamic
allocation implemented) should not translate automatically killing the  
VM.

Instead, it's user space job to work with guest to handle allocation
failure. Guest could page and kill enclaves.



IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:

1) misc.max = 100M
2) Launch VMs with total virtual EPC size = 100M<- success
3) misc.max = 50M

3) will also succeed, but nothing will happen, the VMs will be still  
holding

100M EPC.

You need to somehow track virtual EPC and kill VM instead.

(or somehow fail to do 3) if it is also an acceptable option.)


Thanks for explaining it.

There is an error code to return from max_write. I can add that too to the  
callback definition and fail it when it can't be enforced for any reason.

Would like some community feedback if this is acceptable though.

I think to solve it ultimately, we need be able to adjust 'capacity' of  
VMs not to just kill them, which is basically the same as dynamic  
allocation support for VMs (being able to increase/decrease epc size when  
it is running). For now, we only have static allocation so max can't 

Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson  
>  wrote:
> 
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:   LRU that is low
> > > > + *
> > > > + * Return: %true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +   struct sgx_epc_page *victim;
> > > > +
> > > > +   spin_lock(>lock);
> > > > +   victim = sgx_oom_get_victim(lru);
> > > > +   spin_unlock(>lock);
> > > > +
> > > > +   if (!victim)
> > > > +   return false;
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +   return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +   return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why  
> > > we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that  
> > the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual  
> > EPC code
> > didn't actually kill the VM process, it instead just freed all of the  
> > EPC pages
> > and abused the SGX architecture to effectively make the guest recreate  
> > all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are  
> > not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by  
> > any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user processs (e.g., QEMU) launching  
> > the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual  
> > EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to  
> > the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > somehow.  Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose  
> > of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > userspace running
> > encalves in a VM could continue on and refuse to give up its EPC, and  
> > thus run above
> > its limit in perpetuity.
> > 
> Cgroup would refuse to allocate more when limit is reached so VMs can not  
> run above limit.
> 
> IIRC VMs only support static EPC size right now, reaching limit at launch  
> means the EPC size given in command line for QEMU is not appropriate. So  
> VM should not launch, hence the current behavior.
> 
> [all EPC pages in guest are allocated on page fault caused by the  
> sensitization process in guest kernel during init, which is part of the VM  
> Launch process. So SIGNBUS will turn into failed VM launch.]
> 
> Once it is launched, guest kernel would have 'total capacity' given by the  
> static value from QEMU option. And it would start paging when it is used  
> up, never would ask for more from host.
> 
> For future with dynamic EPC for running guests, QEMU could handle  
> allocation failure and pass SIGBUS to the running guest kernel.  Is that  
> correct understanding?
> 
> 
> > I can see userspace wanting to explicitly terminate the VM instead of  
> > "silently"
> > the VM's enclaves, but that seems like it should be a knob in the  
> > virtual EPC
> > code.
> 
> If my understanding above is correct and understanding your statement  
> above correctly, then don't see we really need separate knob for vEPC  
> code. Reaching a cgroup limit by a running guest (assuming dynamic  
> allocation implemented) should not translate automatically killing the VM.  
> Instead, it's user space job to work with guest to handle allocation  
> failure. Guest could page and kill enclaves.
> 

IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:

1) misc.max = 100M
2) Launch VMs with total virtual EPC size = 100M<- success
3) misc.max = 50M

3) will also succeed, but nothing will happen, the VMs will be still holding
100M EPC.

You need to somehow track virtual EPC and kill VM instead.

(or somehow fail to do 3) if it is also an acceptable option.)



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai

> > > > 
> > > Later the hosting process could migrated/reassigned to another cgroup?
> > > What to do when the new cgroup is OOM?
> > > 
> > 
> > You addressed in the documentation, no?
> > 
> > +Migration
> > +-
> > +
> > +Once an EPC page is charged to a cgroup (during allocation), it
> > +remains charged to the original cgroup until the page is released
> > +or reclaimed.  Migrating a process to a different cgroup doesn't
> > +move the EPC charges that it incurred while in the previous cgroup
> > +to its new cgroup.
> 
> Should we kill the enclave though because some VA pages may be in the new  
> group?
> 

I guess acceptable?

And any difference if you keep VA/SECS to unreclaimabe list? If you migrate one
enclave to another cgroup, the old EPC pages stay in the old cgroup while the
new one is charged to the new group IIUC.

I am not cgroup expert, but by searching some old thread it appears this isn't a
supported model:

https://lore.kernel.org/lkml/yeyr9181qgzt+...@mtj.duckdns.org/



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Haitao Huang

Hi Sean

On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson  
 wrote:



On Mon, Oct 09, 2023, Kai Huang wrote:

On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru:  LRU that is low
> + *
> + * Return:%true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> +  struct sgx_epc_page *victim;
> +
> +  spin_lock(>lock);
> +  victim = sgx_oom_get_victim(lru);
> +  spin_unlock(>lock);
> +
> +  if (!victim)
> +  return false;
> +
> +  if (victim->flags & SGX_EPC_OWNER_PAGE)
> +  return sgx_oom_encl_page(victim->encl_page);
> +
> +  if (victim->flags & SGX_EPC_OWNER_ENCL)
> +  return sgx_oom_encl(victim->encl);

I hate to bring this up, at least at this stage, but I am wondering why  
we need

to put VA and SECS pages to the unreclaimable list, but cannot keep an
"enclave_list" instead?


The motivation for tracking EPC pages instead of enclaves was so that  
the EPC
OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual  
EPC code
didn't actually kill the VM process, it instead just freed all of the  
EPC pages
and abused the SGX architecture to effectively make the guest recreate  
all its

enclaves (IIRC, QEMU does the same thing to "support" live migration).

Looks like y'all punted on that with:

  The EPC pages allocated for KVM guests by the virtual EPC driver are  
not
  reclaimable by the host kernel [5]. Therefore they are not tracked by  
any

  LRU lists for reclaiming purposes in this implementation, but they are
  charged toward the cgroup of the user processs (e.g., QEMU) launching  
the
  guest.  And when the cgroup EPC usage reaches its limit, the virtual  
EPC
  driver will stop allocating more EPC for the VM, and return SIGBUS to  
the

  user process which would abort the VM launch.

which IMO is a hack, unless returning SIGBUS is actually enforced  
somehow.  Relying
on userspace to be kind enough to kill its VMs kinda defeats the purpose  
of cgroup
enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
userspace running
encalves in a VM could continue on and refuse to give up its EPC, and  
thus run above

its limit in perpetuity.

Cgroup would refuse to allocate more when limit is reached so VMs can not  
run above limit.


IIRC VMs only support static EPC size right now, reaching limit at launch  
means the EPC size given in command line for QEMU is not appropriate. So  
VM should not launch, hence the current behavior.


[all EPC pages in guest are allocated on page fault caused by the  
sensitization process in guest kernel during init, which is part of the VM  
Launch process. So SIGNBUS will turn into failed VM launch.]


Once it is launched, guest kernel would have 'total capacity' given by the  
static value from QEMU option. And it would start paging when it is used  
up, never would ask for more from host.


For future with dynamic EPC for running guests, QEMU could handle  
allocation failure and pass SIGBUS to the running guest kernel.  Is that  
correct understanding?



I can see userspace wanting to explicitly terminate the VM instead of  
"silently"
the VM's enclaves, but that seems like it should be a knob in the  
virtual EPC

code.


If my understanding above is correct and understanding your statement  
above correctly, then don't see we really need separate knob for vEPC  
code. Reaching a cgroup limit by a running guest (assuming dynamic  
allocation implemented) should not translate automatically killing the VM.  
Instead, it's user space job to work with guest to handle allocation  
failure. Guest could page and kill enclaves.


Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Haitao Huang

On Mon, 09 Oct 2023 20:18:00 -0500, Huang, Kai  wrote:


On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote:
On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai   
wrote:


> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > From: Sean Christopherson 
> >
> > Introduce the OOM path for killing an enclave with a reclaimer that  
is

> > no
> > longer able to reclaim enough EPC pages. Find a victim enclave,  
which

> > will be an enclave with only "unreclaimable" EPC pages left in the
> > cgroup LRU lists. Once a victim is identified, mark the enclave as  
OOM
> > and zap the enclave's entire page range, and drain all mm  
references in

> > encl->mm_list. Block allocating any EPC pages in #PF handler, or
> > reloading any pages in all paths, or creating any new mappings.
> >
> > The OOM killing path may race with the reclaimers: in some cases,  
the
> > victim enclave is in the process of reclaiming the last EPC pages  
when

> > OOM happens, that is, all pages other than SECS and VA pages are in
> > RECLAIMING_IN_PROGRESS state. The reclaiming process requires  
access to
> > the enclave backing, VA pages as well as SECS. So the OOM killer  
does

> > not directly release those enclave resources, instead, it lets all
> > reclaiming in progress to finish, and relies (as currently done) on
> > kref_put on encl->refcount to trigger sgx_encl_release() to do the
> > final cleanup.
> >
> > Signed-off-by: Sean Christopherson 
> > Co-developed-by: Kristen Carlson Accardi 
> > Signed-off-by: Kristen Carlson Accardi 
> > Co-developed-by: Haitao Huang 
> > Signed-off-by: Haitao Huang 
> > Cc: Sean Christopherson 
> > ---
> > V5:
> > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> >
> > V4:
> > - Updates for patch reordering and typo fixes.
> >
> > V3:
> > - Rebased to use the new VMA_ITERATOR to zap VMAs.
> > - Fixed the racing cases by blocking new page allocation/mapping and
> > reloading when enclave is marked for OOM. And do not release any  
enclave

> > resources other than draining mm_list entries, and let pages in
> > RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> > - Due to above changes, also removed the no-longer needed  
encl->lock in
> > the OOM path which was causing deadlocks reported by the lock  
prover.

> >
>
> [...]
>
> > +
> > +/**
> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > + * @lru: LRU that is low
> > + *
> > + * Return:   %true if a victim was found and kicked.
> > + */
> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > +{
> > + struct sgx_epc_page *victim;
> > +
> > + spin_lock(>lock);
> > + victim = sgx_oom_get_victim(lru);
> > + spin_unlock(>lock);
> > +
> > + if (!victim)
> > + return false;
> > +
> > + if (victim->flags & SGX_EPC_OWNER_PAGE)
> > + return sgx_oom_encl_page(victim->encl_page);
> > +
> > + if (victim->flags & SGX_EPC_OWNER_ENCL)
> > + return sgx_oom_encl(victim->encl);
>
> I hate to bring this up, at least at this stage, but I am wondering  
why

> we need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?
>
> So by looking the patch (" x86/sgx: Limit process EPC usage with misc
> cgroup
> controller"), if I am not missing anything, the whole "unreclaimable"
> list is
> just used to find the victim enclave when OOM needs to be done.   
Thus, I

> don't
> see why "enclave_list" cannot be used to achieve this.
>
> The reason that I am asking is because it seems using "enclave_list"  
we

> can
> simplify the code.  At least the patches related to track VA/SECS  
pages,

> and the
> SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated
> completely.
> Using "enclave_list", I guess you just need to put the enclave to the
> current
> EPC cgroup when SECS page is allocated.
>
Later the hosting process could migrated/reassigned to another cgroup?
What to do when the new cgroup is OOM?



You addressed in the documentation, no?

+Migration
+-
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.


Should we kill the enclave though because some VA pages may be in the new  
group?


Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Tue, 2023-10-10 at 00:50 +, Huang, Kai wrote:
> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:   LRU that is low
> > > > + *
> > > > + * Return: %true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +   struct sgx_epc_page *victim;
> > > > +
> > > > +   spin_lock(>lock);
> > > > +   victim = sgx_oom_get_victim(lru);
> > > > +   spin_unlock(>lock);
> > > > +
> > > > +   if (!victim)
> > > > +   return false;
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +   return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +   return sgx_oom_encl(victim->encl);
> > > 
> > > I hate to bring this up, at least at this stage, but I am wondering why 
> > > we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> > 
> > The motivation for tracking EPC pages instead of enclaves was so that the 
> > EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.  
> > 
> 
> Ah this seems a fair argument. :-)
> 
> > The virtual EPC code
> > didn't actually kill the VM process, it instead just freed all of the EPC 
> > pages
> > and abused the SGX architecture to effectively make the guest recreate all 
> > its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> 
> It returns SIGBUS.  SGX VM live migration also requires enough EPC being able 
> to
> be allocated on the destination machine to work AFAICT.
>  
> > 
> > Looks like y'all punted on that with:
> > 
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user processs (e.g., QEMU) launching the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to the
> >   user process which would abort the VM launch.
> > 
> > which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> > 
> 
> "enforced" do you mean?
> 
> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in 
> hva_to_pfn(),
> which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
> And Qemu then kills the VM with some nonsense message:
> 
> error: kvm run failed Bad address
> 
> 
> > Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose of 
> > cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace 
> > running
> > encalves in a VM could continue on and refuse to give up its EPC, and thus 
> > run above
> > its limit in perpetuity.
> 
> > 
> > I can see userspace wanting to explicitly terminate the VM instead of 
> > "silently"
> > the VM's enclaves, but that seems like it should be a knob in the virtual 
> > EPC
> > code.

I guess I slightly misunderstood your words.

You mean we want to kill VM when the limit is set to be lower than virtual EPC
size.

This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual EPC too?

In the sgx_vepc_fault(), we check this flag at early time and return SIGBUS if
it is set.

But this also requires keeping virtual EPC pages in some list, and handles them
in sgx_epc_oom() too.

And for virtual EPC pages, I guess the "young" logic can be applied thus
probably it's better to keep the actual virtual EPC pages to a (separate?) list
instead of keeping the virtual EPC instance.

struct sgx_epc_lru {
struct list_head reclaimable;
struct sgx_encl *enclaves;
struct list_head vepc_pages;
}

Or still tracking VA/SECS and virtual EPC pages in a single unrecliamable list?

I don't know :-)


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Mon, 2023-10-09 at 20:04 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai  wrote:
> 
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson 
> > > 
> > > Introduce the OOM path for killing an enclave with a reclaimer that is  
> > > no
> > > longer able to reclaim enough EPC pages. Find a victim enclave, which
> > > will be an enclave with only "unreclaimable" EPC pages left in the
> > > cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> > > and zap the enclave's entire page range, and drain all mm references in
> > > encl->mm_list. Block allocating any EPC pages in #PF handler, or
> > > reloading any pages in all paths, or creating any new mappings.
> > > 
> > > The OOM killing path may race with the reclaimers: in some cases, the
> > > victim enclave is in the process of reclaiming the last EPC pages when
> > > OOM happens, that is, all pages other than SECS and VA pages are in
> > > RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> > > the enclave backing, VA pages as well as SECS. So the OOM killer does
> > > not directly release those enclave resources, instead, it lets all
> > > reclaiming in progress to finish, and relies (as currently done) on
> > > kref_put on encl->refcount to trigger sgx_encl_release() to do the
> > > final cleanup.
> > > 
> > > Signed-off-by: Sean Christopherson 
> > > Co-developed-by: Kristen Carlson Accardi 
> > > Signed-off-by: Kristen Carlson Accardi 
> > > Co-developed-by: Haitao Huang 
> > > Signed-off-by: Haitao Huang 
> > > Cc: Sean Christopherson 
> > > ---
> > > V5:
> > > - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> > > 
> > > V4:
> > > - Updates for patch reordering and typo fixes.
> > > 
> > > V3:
> > > - Rebased to use the new VMA_ITERATOR to zap VMAs.
> > > - Fixed the racing cases by blocking new page allocation/mapping and
> > > reloading when enclave is marked for OOM. And do not release any enclave
> > > resources other than draining mm_list entries, and let pages in
> > > RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> > > - Due to above changes, also removed the no-longer needed encl->lock in
> > > the OOM path which was causing deadlocks reported by the lock prover.
> > > 
> > 
> > [...]
> > 
> > > +
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru: LRU that is low
> > > + *
> > > + * Return:   %true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > + struct sgx_epc_page *victim;
> > > +
> > > + spin_lock(>lock);
> > > + victim = sgx_oom_get_victim(lru);
> > > + spin_unlock(>lock);
> > > +
> > > + if (!victim)
> > > + return false;
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > + return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > + return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why  
> > we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> > 
> > So by looking the patch (" x86/sgx: Limit process EPC usage with misc  
> > cgroup
> > controller"), if I am not missing anything, the whole "unreclaimable"  
> > list is
> > just used to find the victim enclave when OOM needs to be done.  Thus, I  
> > don't
> > see why "enclave_list" cannot be used to achieve this.
> > 
> > The reason that I am asking is because it seems using "enclave_list" we  
> > can
> > simplify the code.  At least the patches related to track VA/SECS pages,  
> > and the
> > SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated  
> > completely.  
> > Using "enclave_list", I guess you just need to put the enclave to the  
> > current
> > EPC cgroup when SECS page is allocated.
> > 
> Later the hosting process could migrated/reassigned to another cgroup?
> What to do when the new cgroup is OOM?
> 

You addressed in the documentation, no?

+Migration
+-
+
+Once an EPC page is charged to a cgroup (during allocation), it
+remains charged to the original cgroup until the page is released
+or reclaimed.  Migrating a process to a different cgroup doesn't
+move the EPC charges that it incurred while in the previous cgroup
+to its new cgroup.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Haitao Huang

On Mon, 09 Oct 2023 18:45:06 -0500, Huang, Kai  wrote:


On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:

From: Sean Christopherson 

Introduce the OOM path for killing an enclave with a reclaimer that is  
no

longer able to reclaim enough EPC pages. Find a victim enclave, which
will be an enclave with only "unreclaimable" EPC pages left in the
cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
and zap the enclave's entire page range, and drain all mm references in
encl->mm_list. Block allocating any EPC pages in #PF handler, or
reloading any pages in all paths, or creating any new mappings.

The OOM killing path may race with the reclaimers: in some cases, the
victim enclave is in the process of reclaiming the last EPC pages when
OOM happens, that is, all pages other than SECS and VA pages are in
RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
the enclave backing, VA pages as well as SECS. So the OOM killer does
not directly release those enclave resources, instead, it lets all
reclaiming in progress to finish, and relies (as currently done) on
kref_put on encl->refcount to trigger sgx_encl_release() to do the
final cleanup.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
---
V5:
- Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY

V4:
- Updates for patch reordering and typo fixes.

V3:
- Rebased to use the new VMA_ITERATOR to zap VMAs.
- Fixed the racing cases by blocking new page allocation/mapping and
reloading when enclave is marked for OOM. And do not release any enclave
resources other than draining mm_list entries, and let pages in
RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
- Due to above changes, also removed the no-longer needed encl->lock in
the OOM path which was causing deadlocks reported by the lock prover.



[...]


+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru:   LRU that is low
+ *
+ * Return: %true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+   struct sgx_epc_page *victim;
+
+   spin_lock(>lock);
+   victim = sgx_oom_get_victim(lru);
+   spin_unlock(>lock);
+
+   if (!victim)
+   return false;
+
+   if (victim->flags & SGX_EPC_OWNER_PAGE)
+   return sgx_oom_encl_page(victim->encl_page);
+
+   if (victim->flags & SGX_EPC_OWNER_ENCL)
+   return sgx_oom_encl(victim->encl);


I hate to bring this up, at least at this stage, but I am wondering why  
we need

to put VA and SECS pages to the unreclaimable list, but cannot keep an
"enclave_list" instead?

So by looking the patch (" x86/sgx: Limit process EPC usage with misc  
cgroup
controller"), if I am not missing anything, the whole "unreclaimable"  
list is
just used to find the victim enclave when OOM needs to be done.  Thus, I  
don't

see why "enclave_list" cannot be used to achieve this.

The reason that I am asking is because it seems using "enclave_list" we  
can
simplify the code.  At least the patches related to track VA/SECS pages,  
and the
SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated  
completely.  
Using "enclave_list", I guess you just need to put the enclave to the  
current

EPC cgroup when SECS page is allocated.


Later the hosting process could migrated/reassigned to another cgroup?
What to do when the new cgroup is OOM?

Thanks
Haitao


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru: LRU that is low
> > > + *
> > > + * Return:   %true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > + struct sgx_epc_page *victim;
> > > +
> > > + spin_lock(>lock);
> > > + victim = sgx_oom_get_victim(lru);
> > > + spin_unlock(>lock);
> > > +
> > > + if (!victim)
> > > + return false;
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > + return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > + if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > + return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why we 
> > need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> 
> The motivation for tracking EPC pages instead of enclaves was so that the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  
> 

Ah this seems a fair argument. :-)

> The virtual EPC code
> didn't actually kill the VM process, it instead just freed all of the EPC 
> pages
> and abused the SGX architecture to effectively make the guest recreate all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).

It returns SIGBUS.  SGX VM live migration also requires enough EPC being able to
be allocated on the destination machine to work AFAICT.
 
> 
> Looks like y'all punted on that with:
> 
>   The EPC pages allocated for KVM guests by the virtual EPC driver are not
>   reclaimable by the host kernel [5]. Therefore they are not tracked by any
>   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user processs (e.g., QEMU) launching the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS to the
>   user process which would abort the VM launch.
> 
> which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> 

"enforced" do you mean?

Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
And Qemu then kills the VM with some nonsense message:

error: kvm run failed Bad address


> Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose of 
> cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace 
> running
> encalves in a VM could continue on and refuse to give up its EPC, and thus 
> run above
> its limit in perpetuity.

Agreed.  But this looks cannot resolved until we can reclaim EPC page from VM.

Or in the EPC cgroup code we can refuse to set the maximum which cannot be
supported, e.g., less the total virtual EPC size.

I guess the second is acceptable for now?

> 
> I can see userspace wanting to explicitly terminate the VM instead of 
> "silently"
> the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> code.

See above for the second option.



Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Sean Christopherson
On Mon, Oct 09, 2023, Kai Huang wrote:
> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > +/**
> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > + * @lru:   LRU that is low
> > + *
> > + * Return: %true if a victim was found and kicked.
> > + */
> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > +{
> > +   struct sgx_epc_page *victim;
> > +
> > +   spin_lock(>lock);
> > +   victim = sgx_oom_get_victim(lru);
> > +   spin_unlock(>lock);
> > +
> > +   if (!victim)
> > +   return false;
> > +
> > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > +   return sgx_oom_encl_page(victim->encl_page);
> > +
> > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > +   return sgx_oom_encl(victim->encl);
> 
> I hate to bring this up, at least at this stage, but I am wondering why we 
> need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?

The motivation for tracking EPC pages instead of enclaves was so that the EPC
OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual EPC 
code
didn't actually kill the VM process, it instead just freed all of the EPC pages
and abused the SGX architecture to effectively make the guest recreate all its
enclaves (IIRC, QEMU does the same thing to "support" live migration).

Looks like y'all punted on that with:

  The EPC pages allocated for KVM guests by the virtual EPC driver are not
  reclaimable by the host kernel [5]. Therefore they are not tracked by any
  LRU lists for reclaiming purposes in this implementation, but they are
  charged toward the cgroup of the user processs (e.g., QEMU) launching the
  guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
  driver will stop allocating more EPC for the VM, and return SIGBUS to the
  user process which would abort the VM launch.

which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
Relying
on userspace to be kind enough to kill its VMs kinda defeats the purpose of 
cgroup
enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace 
running
encalves in a VM could continue on and refuse to give up its EPC, and thus run 
above
its limit in perpetuity.

I can see userspace wanting to explicitly terminate the VM instead of "silently"
the VM's enclaves, but that seems like it should be a knob in the virtual EPC
code.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Huang, Kai
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> Introduce the OOM path for killing an enclave with a reclaimer that is no
> longer able to reclaim enough EPC pages. Find a victim enclave, which
> will be an enclave with only "unreclaimable" EPC pages left in the
> cgroup LRU lists. Once a victim is identified, mark the enclave as OOM
> and zap the enclave's entire page range, and drain all mm references in
> encl->mm_list. Block allocating any EPC pages in #PF handler, or
> reloading any pages in all paths, or creating any new mappings.
> 
> The OOM killing path may race with the reclaimers: in some cases, the
> victim enclave is in the process of reclaiming the last EPC pages when
> OOM happens, that is, all pages other than SECS and VA pages are in
> RECLAIMING_IN_PROGRESS state. The reclaiming process requires access to
> the enclave backing, VA pages as well as SECS. So the OOM killer does
> not directly release those enclave resources, instead, it lets all
> reclaiming in progress to finish, and relies (as currently done) on
> kref_put on encl->refcount to trigger sgx_encl_release() to do the
> final cleanup.
> 
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Cc: Sean Christopherson 
> ---
> V5:
> - Rename SGX_ENCL_OOM to SGX_ENCL_NO_MEMORY
> 
> V4:
> - Updates for patch reordering and typo fixes.
> 
> V3:
> - Rebased to use the new VMA_ITERATOR to zap VMAs.
> - Fixed the racing cases by blocking new page allocation/mapping and
> reloading when enclave is marked for OOM. And do not release any enclave
> resources other than draining mm_list entries, and let pages in
> RECLAIMING_IN_PROGRESS to be reaped by reclaimers.
> - Due to above changes, also removed the no-longer needed encl->lock in
> the OOM path which was causing deadlocks reported by the lock prover.
> 

[...]

> +
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru: LRU that is low
> + *
> + * Return:   %true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> + struct sgx_epc_page *victim;
> +
> + spin_lock(>lock);
> + victim = sgx_oom_get_victim(lru);
> + spin_unlock(>lock);
> +
> + if (!victim)
> + return false;
> +
> + if (victim->flags & SGX_EPC_OWNER_PAGE)
> + return sgx_oom_encl_page(victim->encl_page);
> +
> + if (victim->flags & SGX_EPC_OWNER_ENCL)
> + return sgx_oom_encl(victim->encl);

I hate to bring this up, at least at this stage, but I am wondering why we need
to put VA and SECS pages to the unreclaimable list, but cannot keep an
"enclave_list" instead?

So by looking the patch (" x86/sgx: Limit process EPC usage with misc cgroup
controller"), if I am not missing anything, the whole "unreclaimable" list is
just used to find the victim enclave when OOM needs to be done.  Thus, I don't
see why "enclave_list" cannot be used to achieve this.

The reason that I am asking is because it seems using "enclave_list" we can
simplify the code.  At least the patches related to track VA/SECS pages, and the
SGX_EPC_OWNER_PAGE/SGX_EPC_OWNER_ENCL thing can be eliminated completely.  

Using "enclave_list", I guess you just need to put the enclave to the current
EPC cgroup when SECS page is allocated.

In fact, putting "unreclaimable" list to LRU itself is a little bit confusing
because: 1) you cannot really reclaim anything from the list; 2) VA/SECS pages
don't have the concept of "young" at all, thus makes no sense to annotate they
as LRU.

Thus putting VA/SECS to "unreclaimable" list, instead of keeping an
"enclave_list" seems won't have any benefit but will only make the code more
complicated.

Or am I missing anything?