Re: [PATCH v3 0/2] Update mce_record tracepoint

2024-02-08 Thread Steven Rostedt
On Fri, 26 Jan 2024 01:57:58 -0600
Avadhut Naik  wrote:

> This patchset updates the mce_record tracepoint so that the recently added
> fields of struct mce are exported through it to userspace.
> 
> The first patch adds PPIN (Protected Processor Inventory Number) field to
> the tracepoint.
> 
> The second patch adds the microcode field (Microcode Revision) to the
> tracepoint.

From a tracing POV only:

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-08 Thread Steven Rostedt
On Thu, 11 Jan 2024 21:05:30 +
Carlos Galo  wrote:

> The current implementation of the mark_victim tracepoint provides only
> the process ID (pid) of the victim process. This limitation poses
> challenges for userspace tools that need additional information
> about the OOM victim. The association between pid and the additional
> data may be lost after the kill, making it difficult for userspace to
> correlate the OOM event with the specific process.
> 
> In order to mitigate this limitation, add the following fields:
> 
> - UID
>In Android each installed application has a unique UID. Including
>the `uid` assists in correlating OOM events with specific apps.
> 
> - Process Name (comm)
>Enables identification of the affected process.
> 
> - OOM Score
>Allows userspace to get additional insights of the relative kill
>priority of the OOM victim.
> 

From a tracing POV this looks fine to me, but it's up to the mm
maintainers to decide to take it.

> Cc: Steven Rostedt 
> Cc: Andrew Morton 
> Cc: Suren Baghdasaryan 
> Signed-off-by: Carlos Galo 
> ---
> v2: Fixed build error. Added missing comma when printing `__entry->uid`.
> 
>  include/trace/events/oom.h | 19 +++
>  mm/oom_kill.c  |  6 +-
>  2 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
> index 26a11e4a2c36..fb8a5d1b8a0a 100644
> --- a/include/trace/events/oom.h
> +++ b/include/trace/events/oom.h
> @@ -72,19 +72,30 @@ TRACE_EVENT(reclaim_retry_zone,
>  );
>  
>  TRACE_EVENT(mark_victim,
> - TP_PROTO(int pid),
> + TP_PROTO(struct task_struct *task, uid_t uid),
>  
> - TP_ARGS(pid),
> + TP_ARGS(task, uid),
>  
>   TP_STRUCT__entry(
>   __field(int, pid)
> + __field(uid_t, uid)
> + __string(comm, task->comm)
> + __field(short, oom_score_adj)
>   ),
>  
>   TP_fast_assign(
> - __entry->pid = pid;
> + __entry->pid = task->pid;
> + __entry->uid = uid;
> + __assign_str(comm, task->comm);
> + __entry->oom_score_adj = task->signal->oom_score_adj;
>   ),
>  
> - TP_printk("pid=%d", __entry->pid)
> + TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd",
> + __entry->pid,
> + __entry->uid,
> + __get_str(comm),
> + __entry->oom_score_adj
> + )
>  );
>  
>  TRACE_EVENT(wake_reaper,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e6071fde34a..0698c00c5da6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include "internal.h"
> @@ -753,6 +754,7 @@ static inline void queue_oom_reaper(struct task_struct 
> *tsk)
>   */
>  static void mark_oom_victim(struct task_struct *tsk)
>  {
> + const struct cred *cred;
>   struct mm_struct *mm = tsk->mm;
>  
>   WARN_ON(oom_killer_disabled);
> @@ -772,7 +774,9 @@ static void mark_oom_victim(struct task_struct *tsk)
>*/
>   __thaw_task(tsk);
>   atomic_inc(&oom_victims);
> - trace_mark_victim(tsk->pid);
> + cred = get_task_cred(tsk);

get_task_cred() isn't a trivial function and is used only for tracing.
But this isn't a fast path so I guess that is fine.
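For context, get_task_cred() looks roughly like this (paraphrased from kernel/cred.c;
illustration only): it takes the RCU read lock and loops until it can take a reference
on the cred, which is why it is heavier than a plain dereference:

const struct cred *get_task_cred(struct task_struct *task)
{
	const struct cred *cred;

	rcu_read_lock();

	do {
		/* retry until a reference can be taken on the cred */
		cred = __task_cred(task);
		BUG_ON(!cred);
	} while (!get_cred_rcu(cred));

	rcu_read_unlock();
	return cred;
}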

For tracing:

Reviewed-by: Steven Rostedt (Google) 

-- Steve


> + trace_mark_victim(tsk, cred->uid.val);
> + put_cred(cred);
>  }
>  
>  /**
> 
> base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a




[PATCH v2 2/4] hwspinlock: omap: Use devm_pm_runtime_enable() helper

2024-02-08 Thread Andrew Davis
This disables runtime PM on module exit automatically for us; currently
we manually disable runtime PM, which can be error-prone if not done
in the right order or missed in some exit path. This also allows us
to simplify the probe exit path and remove callbacks. Do that here.

While here, as we can now return right after registering our hwspinlock,
simply return directly and remove the extra debug message.

Signed-off-by: Andrew Davis 
---

Changes for v2:
 - Return directly from register as suggested on v1
 - Clarify commit message

 drivers/hwspinlock/omap_hwspinlock.c | 33 +++-
 1 file changed, 8 insertions(+), 25 deletions(-)

diff --git a/drivers/hwspinlock/omap_hwspinlock.c 
b/drivers/hwspinlock/omap_hwspinlock.c
index cca55143d24d4..3bd3ffab92100 100644
--- a/drivers/hwspinlock/omap_hwspinlock.c
+++ b/drivers/hwspinlock/omap_hwspinlock.c
@@ -89,10 +89,10 @@ static int omap_hwspinlock_probe(struct platform_device 
*pdev)
 * make sure the module is enabled and clocked before reading
 * the module SYSSTATUS register
 */
-   pm_runtime_enable(&pdev->dev);
+   devm_pm_runtime_enable(&pdev->dev);
	ret = pm_runtime_resume_and_get(&pdev->dev);
if (ret < 0)
-   goto runtime_err;
+   return ret;
 
/* Determine number of locks */
i = readl(io_base + SYSSTATUS_OFFSET);
@@ -104,41 +104,26 @@ static int omap_hwspinlock_probe(struct platform_device 
*pdev)
 */
	ret = pm_runtime_put(&pdev->dev);
if (ret < 0)
-   goto runtime_err;
+   return ret;
 
/* one of the four lsb's must be set, and nothing else */
-   if (hweight_long(i & 0xf) != 1 || i > 8) {
-   ret = -EINVAL;
-   goto runtime_err;
-   }
+   if (hweight_long(i & 0xf) != 1 || i > 8)
+   return -EINVAL;
 
num_locks = i * 32; /* actual number of locks in this device */
 
	bank = devm_kzalloc(&pdev->dev, struct_size(bank, lock, num_locks),
GFP_KERNEL);
-   if (!bank) {
-   ret = -ENOMEM;
-   goto runtime_err;
-   }
+   if (!bank)
+   return -ENOMEM;
 
platform_set_drvdata(pdev, bank);
 
	for (i = 0, hwlock = &bank->lock[0]; i < num_locks; i++, hwlock++)
hwlock->priv = io_base + LOCK_BASE_OFFSET + sizeof(u32) * i;
 
-   ret = hwspin_lock_register(bank, &pdev->dev, &omap_hwspinlock_ops,
+   return hwspin_lock_register(bank, &pdev->dev, &omap_hwspinlock_ops,
base_id, num_locks);
-   if (ret)
-   goto runtime_err;
-
-   dev_dbg(&pdev->dev, "Registered %d locks with HwSpinlock core\n",
-   num_locks);
-
-   return 0;
-
-runtime_err:
-   pm_runtime_disable(&pdev->dev);
-   return ret;
 }
 
 static void omap_hwspinlock_remove(struct platform_device *pdev)
@@ -151,8 +136,6 @@ static void omap_hwspinlock_remove(struct platform_device 
*pdev)
	dev_err(&pdev->dev, "%s failed: %d\n", __func__, ret);
return;
}
-
-   pm_runtime_disable(&pdev->dev);
 }
 
 static const struct of_device_id omap_hwspinlock_of_match[] = {
-- 
2.39.2




[PATCH v2 3/4] hwspinlock: omap: Use devm_hwspin_lock_register() helper

2024-02-08 Thread Andrew Davis
This will unregister the HW spinlock on module exit automatically for us;
currently we manually unregister, which can be error-prone if not done in
the right order. This also allows us to remove the remove callback.
Do that here.

Signed-off-by: Andrew Davis 
---

Changes for v2:
 - Clarify commit message

 drivers/hwspinlock/omap_hwspinlock.c | 17 +
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/drivers/hwspinlock/omap_hwspinlock.c 
b/drivers/hwspinlock/omap_hwspinlock.c
index 3bd3ffab92100..fe73da80018b1 100644
--- a/drivers/hwspinlock/omap_hwspinlock.c
+++ b/drivers/hwspinlock/omap_hwspinlock.c
@@ -117,27 +117,13 @@ static int omap_hwspinlock_probe(struct platform_device 
*pdev)
if (!bank)
return -ENOMEM;
 
-   platform_set_drvdata(pdev, bank);
-
	for (i = 0, hwlock = &bank->lock[0]; i < num_locks; i++, hwlock++)
hwlock->priv = io_base + LOCK_BASE_OFFSET + sizeof(u32) * i;
 
-   return hwspin_lock_register(bank, &pdev->dev, &omap_hwspinlock_ops,
+   return devm_hwspin_lock_register(&pdev->dev, bank, &omap_hwspinlock_ops,
base_id, num_locks);
 }
 
-static void omap_hwspinlock_remove(struct platform_device *pdev)
-{
-   struct hwspinlock_device *bank = platform_get_drvdata(pdev);
-   int ret;
-
-   ret = hwspin_lock_unregister(bank);
-   if (ret) {
-   dev_err(&pdev->dev, "%s failed: %d\n", __func__, ret);
-   return;
-   }
-}
-
 static const struct of_device_id omap_hwspinlock_of_match[] = {
{ .compatible = "ti,omap4-hwspinlock", },
{ .compatible = "ti,am64-hwspinlock", },
@@ -148,7 +134,6 @@ MODULE_DEVICE_TABLE(of, omap_hwspinlock_of_match);
 
 static struct platform_driver omap_hwspinlock_driver = {
.probe  = omap_hwspinlock_probe,
-   .remove_new = omap_hwspinlock_remove,
.driver = {
.name   = "omap_hwspinlock",
.of_match_table = omap_hwspinlock_of_match,
-- 
2.39.2




[PATCH v2 1/4] hwspinlock: omap: Remove unneeded check for OF node

2024-02-08 Thread Andrew Davis
We do not use the OF node anymore, nor does it matter how
we got to probe, so remove the check for of_node.

Signed-off-by: Andrew Davis 
---
 drivers/hwspinlock/omap_hwspinlock.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/hwspinlock/omap_hwspinlock.c 
b/drivers/hwspinlock/omap_hwspinlock.c
index a9fd9ca45f2a8..cca55143d24d4 100644
--- a/drivers/hwspinlock/omap_hwspinlock.c
+++ b/drivers/hwspinlock/omap_hwspinlock.c
@@ -74,7 +74,6 @@ static const struct hwspinlock_ops omap_hwspinlock_ops = {
 
 static int omap_hwspinlock_probe(struct platform_device *pdev)
 {
-   struct device_node *node = pdev->dev.of_node;
struct hwspinlock_device *bank;
struct hwspinlock *hwlock;
void __iomem *io_base;
@@ -82,9 +81,6 @@ static int omap_hwspinlock_probe(struct platform_device *pdev)
/* Only a single hwspinlock block device is supported */
int base_id = 0;
 
-   if (!node)
-   return -ENODEV;
-
io_base = devm_platform_ioremap_resource(pdev, 0);
if (IS_ERR(io_base))
return PTR_ERR(io_base);
-- 
2.39.2




[PATCH v2 4/4] hwspinlock: omap: Use index to get hwspinlock pointer

2024-02-08 Thread Andrew Davis
For loops with multiple initializers and increments are hard to read
and reason about; simplify this by using the looping index to index
into the hwspinlock array.

Signed-off-by: Andrew Davis 
---
 drivers/hwspinlock/omap_hwspinlock.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/hwspinlock/omap_hwspinlock.c 
b/drivers/hwspinlock/omap_hwspinlock.c
index fe73da80018b1..27b47b8623c09 100644
--- a/drivers/hwspinlock/omap_hwspinlock.c
+++ b/drivers/hwspinlock/omap_hwspinlock.c
@@ -75,7 +75,6 @@ static const struct hwspinlock_ops omap_hwspinlock_ops = {
 static int omap_hwspinlock_probe(struct platform_device *pdev)
 {
struct hwspinlock_device *bank;
-   struct hwspinlock *hwlock;
void __iomem *io_base;
int num_locks, i, ret;
/* Only a single hwspinlock block device is supported */
@@ -117,8 +116,8 @@ static int omap_hwspinlock_probe(struct platform_device 
*pdev)
if (!bank)
return -ENOMEM;
 
-   for (i = 0, hwlock = &bank->lock[0]; i < num_locks; i++, hwlock++)
-   hwlock->priv = io_base + LOCK_BASE_OFFSET + sizeof(u32) * i;
+   for (i = 0; i < num_locks; i++)
+   bank->lock[i].priv = io_base + LOCK_BASE_OFFSET + sizeof(u32) * i;
 
	return devm_hwspin_lock_register(&pdev->dev, bank, &omap_hwspinlock_ops,
base_id, num_locks);
-- 
2.39.2




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-08 Thread Alex Williamson
On Thu, 8 Feb 2024 07:21:40 +
"Tian, Kevin"  wrote:

> > From: Ankit Agrawal 
> > Sent: Thursday, February 8, 2024 3:13 PM  
> > >> > +    * Determine how many bytes to be actually read from the
> > >> > device memory.
> > >> > +    * Read request beyond the actual device memory size is
> > >> > filled with ~0,
> > >> > +    * while those beyond the actual reported size is skipped.
> > >> > +    */
> > >> > +   if (offset >= memregion->memlength)
> > >> > +   mem_count = 0;  
> > >>
> > >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> > >> necessary.  
> > >
> > > Harmless, other than the possibly unnecessary call through to
> > > nvgrace_gpu_map_device_mem().  Maybe both  
> > nvgrace_gpu_map_and_read()  
> > > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > > first operation when !mem_count.  Thanks,
> > >
> > >Alex  
> > 
> > IMO, this seems like adding too much code to reduce the call length for a
> > very specific case. If there aren't any strong opinion on this, I'm 
> > planning to
> > leave this code as it is.  
> 
> a slight difference: if mem_count==0 the result should always succeed
> no matter whether nvgrace_gpu_map_device_mem() succeeds or not. Of course,
> if it fails it's already a big problem; probably nobody cares about the subtle
> difference when reading a non-existent range.
> 
> but regarding readability it's still clearer:
> 
> if (mem_count)
>   nvgrace_gpu_map_and_read();
> 

The below has better flow imo vs conditionalizing the call to
map_and_read/write and subsequent error handling, but I don't think
either adds too much code.  Thanks,

Alex

--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -429,6 +429,9 @@ nvgrace_gpu_map_and_read(struct 
nvgrace_gpu_vfio_pci_core_device *nvdev,
u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
int ret;
 
+   if (!mem_count)
+   return 0;
+
/*
 * Handle read on the BAR regions. Map to the target device memory
 * physical address and copy to the request read buffer.
@@ -547,6 +550,9 @@ nvgrace_gpu_map_and_write(struct 
nvgrace_gpu_vfio_pci_core_device *nvdev,
loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
int ret;
 
+   if (!mem_count)
+   return 0;
+
ret = nvgrace_gpu_map_device_mem(index, nvdev);
if (ret)
return ret;




[PATCH] tracing: Fix wasted memory in saved_cmdlines logic

2024-02-08 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

While looking at improving the saved_cmdlines cache I found a huge amount
of wasted memory that should be used for the cmdlines.

The tracing data saves pids during the trace. At sched switch, if a trace
occurred, it will save the comm of the task that did the trace. This is
saved in a "cache" that maps pids to comms and exposed to user space via
the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by
default 128 comms.

The structure that uses this creates an array to store the pids using
PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure
to be of the size of 131104 bytes on 64 bit machines.

In hex: 131104 = 0x20020, and since the kernel allocates generic memory in
powers of two, the kernel would allocate 0x40000 or 262144 bytes to store
this structure. That leaves 131040 bytes of wasted space.

Worse, the structure points to an allocated array to store the comm names,
which is 16 bytes times the amount of names to save (currently 128), which
is 2048 bytes. Instead of allocating a separate array, make the structure
end with a variable length string and use the extra space for that.

This is similar to a recommendation that Linus had made about eventfs_inode 
names:

  
https://lore.kernel.org/all/20240130190355.11486-5-torva...@linux-foundation.org/

Instead of allocating a separate string array to hold the saved comms,
have the structure end with: char saved_cmdlines[]; and round up to the
next power of two over sizeof(struct saved_cmdlines_buffer) + num_cmdlines *
TASK_COMM_LEN. It will use this extra space for the saved_cmdline portion.

Now, instead of saving only 128 comms by default, by using this wasted
space at the end of the structure it can save over 8000 comms and even
saves space by removing the need for allocating the other array.
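For illustration, the rough arithmetic with the default of 128 cmdlines on a
64-bit machine with 4 KiB pages (approximate figures, not part of the patch):

  sizeof(struct saved_cmdlines_buffer)     ~131 KB  (dominated by map_pid_to_cmdline[])
  + 128 * TASK_COMM_LEN (16 bytes)         +  2 KB
  get_order() of that total                order 6  -> 64 pages = 262144 bytes (0x40000)
  leftover space / TASK_COMM_LEN           ~8190 cmdlines can now be saved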

Cc: sta...@vger.kernel.org
Fixes: 939c7a4f04fcd ("tracing: Introduce saved_cmdlines_size file")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 73 +---
 1 file changed, 34 insertions(+), 39 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..0b3e60b827f7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2320,7 +2320,7 @@ struct saved_cmdlines_buffer {
unsigned *map_cmdline_to_pid;
unsigned cmdline_num;
int cmdline_idx;
-   char *saved_cmdlines;
+   char saved_cmdlines[];
 };
 static struct saved_cmdlines_buffer *savedcmd;
 
@@ -2334,47 +2334,54 @@ static inline void set_cmdline(int idx, const char 
*cmdline)
strncpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN);
 }
 
-static int allocate_cmdlines_buffer(unsigned int val,
-   struct saved_cmdlines_buffer *s)
+static void free_saved_cmdlines_buffer(struct saved_cmdlines_buffer *s)
+{
+   int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN);
+
+   kfree(s->map_cmdline_to_pid);
+   free_pages((unsigned long)s, order);
+}
+
+static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
 {
+   struct saved_cmdlines_buffer *s;
+   struct page *page;
+   int orig_size, size;
+   int order;
+
+   /* Figure out how much is needed to hold the given number of cmdlines */
+   orig_size = sizeof(*s) + val * TASK_COMM_LEN;
+   order = get_order(orig_size);
+   size = 1 << (order + PAGE_SHIFT);
+   page = alloc_pages(GFP_KERNEL, order);
+   if (!page)
+   return NULL;
+
+   s = page_address(page);
+   memset(s, 0, sizeof(*s));
+
+   /* Round up to actual allocation */
+   val = (size - sizeof(*s)) / TASK_COMM_LEN;
+   s->cmdline_num = val;
+
s->map_cmdline_to_pid = kmalloc_array(val,
  sizeof(*s->map_cmdline_to_pid),
  GFP_KERNEL);
-   if (!s->map_cmdline_to_pid)
-   return -ENOMEM;
-
-   s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
-   if (!s->saved_cmdlines) {
-   kfree(s->map_cmdline_to_pid);
-   return -ENOMEM;
-   }
 
s->cmdline_idx = 0;
-   s->cmdline_num = val;
	memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
   sizeof(s->map_pid_to_cmdline));
memset(s->map_cmdline_to_pid, NO_CMDLINE_MAP,
   val * sizeof(*s->map_cmdline_to_pid));
 
-   return 0;
+   return s;
 }
 
 static int trace_create_savedcmd(void)
 {
-   int ret;
-
-   savedcmd = kmalloc(sizeof(*savedcmd), GFP_KERNEL);
-   if (!savedcmd)
-   return -ENOMEM;
-
-   ret = allocate_cmdlines_buffer(SAVED_CMDLINES_DEFAULT, savedcmd);
-   if (ret < 0) {
-   kfree(savedcmd);
-   savedcmd = NULL;
-   return -ENOMEM;
-   }
+   savedcmd = allocate_cmdlines_buffer(SAVED_CMDLINES_DEFAULT);
 
-   return 0;
+ 

Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs

2024-02-08 Thread Andrea Parri
> I did not even think of that, and it actually makes sense so I'll go
> with what you propose: I'll replace atomic_inc() with
> atomic_inc_return_release(). And I'll add the following comment if
> that's ok with you:
> 
> "Make sure the patching store is effective *before* we increment the
> counter which releases all waiting cpus"

Yes, this sounds good to me.


> Honestly, I looked at it one minute, did not understand its purpose
> and said to myself "ok that can't hurt anyway, I may be missing
> something".
> 
> FWIW,  I see that arm64 uses isb() here. If you don't see its purpose,
> I'll remove it (here and where I copied it).

Removing the smp_mb() (and keeping the local_flush_icache_all()) seems
fine to me; thanks for the confirmation.


> > On a last topic, although somehow orthogonal to the scope of this patch,
> > I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
> > correct: I can see why we may want (need to do) the local TLB flush be-
> > fore returning from patch_{map,unmap}(), but does a local flush suffice?
> > For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
> > sequence in their unmapping stage (and apparently relying on "no caching
> > of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
> > (riscv's) TLB invalidations will necessarily introduce some complexity...
> >
> > Thoughts?
> 
> To avoid remote TLBIs, could we simply disable preemption before
> the first patch_map()? arm64 disables irqs, but that seems
> overkill to me; maybe I'm missing something again?

Mmh, I'm afraid this will require more thinking/probing on my end (not
really "the expert" of the codebase at stake...).  Maybe the ftrace
reviewers will provide further ideas/suggestions for us to brainstorm.

  Andrea



Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs

2024-02-08 Thread Alexandre Ghiti
Hi Andrea,

On Thu, Feb 8, 2024 at 12:42 PM Andrea Parri  wrote:
>
> > +static int __ftrace_modify_code(void *data)
> > +{
> > + struct ftrace_modify_param *param = data;
> > +
> > + if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> > + ftrace_modify_all_code(param->command);
> > + atomic_inc(&param->cpu_count);
>
> I stared at ftrace_modify_all_code() for a bit but honestly I don't see
> what prevents the ->cpu_count increment from being reordered before the
> insn write(s) (architecturally) now that you have removed the IPI dance:
> perhaps add an smp_wmb() right before the atomic_inc() (or promote this
> latter to a (void)atomic_inc_return_release()) and/or an inline comment
> saying why such reordering is not possible?

I did not even think of that, and it actually makes sense so I'll go
with what you propose: I'll replace atomic_inc() with
atomic_inc_return_release(). And I'll add the following comment if
that's ok with you:

"Make sure the patching store is effective *before* we increment the
counter which releases all waiting cpus"
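For reference, a sketch of what that could look like on top of the hunk quoted
above (illustrative only, not the final patch):

	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
		ftrace_modify_all_code(param->command);
		/*
		 * Make sure the patching store is effective *before* we
		 * increment the counter which releases all waiting cpus.
		 */
		(void)atomic_inc_return_release(&param->cpu_count);
	} else {
		while (atomic_read(&param->cpu_count) <= num_online_cpus())
			cpu_relax();
	}

	local_flush_icache_all();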

>
>
> > + } else {
> > + while (atomic_read(&param->cpu_count) <= num_online_cpus())
> > + cpu_relax();
> > + smp_mb();
>
> I see that you've lifted/copied the memory barrier from patch_text_cb():
> what's its point?  AFAIU, the barrier has no ordering effect on program
> order later insn fetches; perhaps the code was based on some legacy/old
> version of Zifencei?  IAC, comments, comments, ... or maybe just remove
> that memory barrier?

Honestly, I looked at it one minute, did not understand its purpose
and said to myself "ok that can't hurt anyway, I may be missing
something".

FWIW,  I see that arm64 uses isb() here. If you don't see its purpose,
I'll remove it (here and where I copied it).

>
>
> > + }
> > +
> > + local_flush_icache_all();
> > +
> > + return 0;
> > +}
>
> [...]
>
>
> > @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
> >   if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
> >   for (i = 0; ret == 0 && i < patch->ninsns; i++) {
> >   len = GET_INSN_LENGTH(patch->insns[i]);
> > - ret = patch_text_nosync(patch->addr + i * len,
> > - &patch->insns[i], len);
> > + ret = patch_insn_write(patch->addr + i * len, &patch->insns[i], len);
> >   }
> >   atomic_inc(&patch->cpu_count);
> >   } else {
> > @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
> >   smp_mb();
> >   }
> >
> > + local_flush_icache_all();
> > +
> >   return ret;
> >  }
> >  NOKPROBE_SYMBOL(patch_text_cb);
>
> My above remarks/questions also apply to this function.
>
>
> On a last topic, although somehow orthogonal to the scope of this patch,
> I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
> correct: I can see why we may want (need to do) the local TLB flush be-
> fore returning from patch_{map,unmap}(), but does a local flush suffice?
> For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
> sequence in their unmapping stage (and apparently relying on "no caching
> of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
> (riscv's) TLB invalidations will necessarily introduce some complexity...
>
> Thoughts?

To avoid remote TLBIs, could we simply disable preemption before
the first patch_map()? arm64 disables irqs, but that seems
overkill to me; maybe I'm missing something again?

Thanks for your comments Andrea,

Alex

>
>   Andrea



Re: [PATCH] tools/rtla: Replace setting prio with nice for SCHED_OTHER

2024-02-08 Thread Daniel Bristot de Oliveira
On 2/7/24 07:51, limingming3 wrote:
> Since the sched_priority for SCHED_OTHER is always 0, it makes no
> sense to set it.
> Setting nice for SCHED_OTHER seems more meaningful.

Thanks!

This is actually a fix, I meant to set nice since the beginning.
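For readers unfamiliar with the distinction: for SCHED_OTHER the sched_priority
field must be 0, so the knob that actually changes scheduling behaviour is the
nice value. A minimal userspace sketch (illustration only, not the rtla patch;
the helper name is made up here):

#include <sched.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Put a task in SCHED_OTHER and set its nice value. */
static int set_other_with_nice(pid_t pid, int nice_val)
{
	struct sched_param p = { .sched_priority = 0 };

	if (sched_setscheduler(pid, SCHED_OTHER, &p) == -1)
		return -1;

	return setpriority(PRIO_PROCESS, pid, nice_val);
}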

-- Daniel



Re: (subset) [PATCH v5] leds: qcom-lpg: Add PM660L configuration and compatible

2024-02-08 Thread Lee Jones
On Sun, 04 Feb 2024 18:24:20 +0100, Marijn Suijten wrote:
> Inherit PM660L PMIC LPG/triled block configuration from downstream
> drivers and DT sources, consisting of a triled block with automatic
> trickle charge control and source selection, three colored led channels
> belonging to the synchronized triled block and one loose PWM channel.
> 
> 

Applied, thanks!

[1/1] leds: qcom-lpg: Add PM660L configuration and compatible
  commit: 0e848bb4630e12099636fde050cadad33221045f

--
Lee Jones [李琼斯]




Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs

2024-02-08 Thread Andrea Parri
> +static int __ftrace_modify_code(void *data)
> +{
> + struct ftrace_modify_param *param = data;
> +
> + if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> + ftrace_modify_all_code(param->command);
> + atomic_inc(&param->cpu_count);

I stared at ftrace_modify_all_code() for a bit but honestly I don't see
what prevents the ->cpu_count increment from being reordered before the
insn write(s) (architecturally) now that you have removed the IPI dance:
perhaps add an smp_wmb() right before the atomic_inc() (or promote this
latter to a (void)atomic_inc_return_release()) and/or an inline comment
saying why such reordering is not possible?


> + } else {
> + while (atomic_read(&param->cpu_count) <= num_online_cpus())
> + cpu_relax();
> + smp_mb();

I see that you've lifted/copied the memory barrier from patch_text_cb():
what's its point?  AFAIU, the barrier has no ordering effect on program
order later insn fetches; perhaps the code was based on some legacy/old
version of Zifencei?  IAC, comments, comments, ... or maybe just remove
that memory barrier?


> + }
> +
> + local_flush_icache_all();
> +
> + return 0;
> +}

[...]


> @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
>   if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
>   for (i = 0; ret == 0 && i < patch->ninsns; i++) {
>   len = GET_INSN_LENGTH(patch->insns[i]);
> - ret = patch_text_nosync(patch->addr + i * len,
> - &patch->insns[i], len);
> + ret = patch_insn_write(patch->addr + i * len, &patch->insns[i], len);
>   }
>   atomic_inc(&patch->cpu_count);
>   } else {
> @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
>   smp_mb();
>   }
>  
> + local_flush_icache_all();
> +
>   return ret;
>  }
>  NOKPROBE_SYMBOL(patch_text_cb);

My above remarks/questions also apply to this function.


On a last topic, although somehow orthogonal to the scope of this patch,
I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
correct: I can see why we may want (need to do) the local TLB flush be-
fore returning from patch_{map,unmap}(), but does a local flush suffice?
For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
sequence in their unmapping stage (and apparently relying on "no caching
of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
(riscv's) TLB invalidations will necessarily introduce some complexity...

Thoughts?

  Andrea



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-08 Thread Ankit Agrawal
>> >>
>> >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
>> >> necessary.
>> >
>> > Harmless, other than the possibly unnecessary call through to
>> > nvgrace_gpu_map_device_mem().  Maybe both
>> nvgrace_gpu_map_and_read()
>> > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
>> > first operation when !mem_count.  Thanks,
>> >
>> >Alex
>>
>> IMO, this seems like adding too much code to reduce the call length for a
>> very specific case. If there aren't any strong opinion on this, I'm planning 
>> to
>> leave this code as it is.
>
> a slight difference: if mem_count==0 the result should always succeed
> no matter whether nvgrace_gpu_map_device_mem() succeeds or not. Of course,
> if it fails it's already a big problem; probably nobody cares about the subtle
> difference when reading a non-existent range.
>
> but regarding readability it's still clearer:
>
> if (mem_count)
>    nvgrace_gpu_map_and_read();

Makes sense. I'll change it.


Re: [PATCH 3/3] remoteproc: qcom_q6v5_pas: Unload lite firmware on ADSP

2024-02-08 Thread Sibi Sankar




On 1/31/24 15:00, Abel Vesa wrote:

On 24-01-29 17:17:28, Dmitry Baryshkov wrote:

On Mon, 29 Jan 2024 at 15:35, Abel Vesa  wrote:


From: Sibi Sankar 

The UEFI loads a lite variant of the ADSP firmware to support charging
use cases. The kernel needs to unload and reload it with the firmware
that has full feature support for audio. This patch arbitarily shutsdown
the lite firmware before loading the full firmware.

Signed-off-by: Sibi Sankar 
Signed-off-by: Abel Vesa 
---
  drivers/remoteproc/qcom_q6v5_pas.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 083d71f80e5c..4f6940368eb4 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -39,6 +39,7 @@ struct adsp_data {
 const char *dtb_firmware_name;
 int pas_id;
 int dtb_pas_id;
+   int lite_pas_id;
 unsigned int minidump_id;
 bool auto_boot;
 bool decrypt_shutdown;
@@ -72,6 +73,7 @@ struct qcom_adsp {
 const char *dtb_firmware_name;
 int pas_id;
 int dtb_pas_id;
+   int lite_pas_id;
 unsigned int minidump_id;
 int crash_reason_smem;
 bool decrypt_shutdown;
@@ -210,6 +212,10 @@ static int adsp_load(struct rproc *rproc, const struct 
firmware *fw)
 /* Store firmware handle to be used in adsp_start() */
 adsp->firmware = fw;

+   /* WIP: Shutdown the ADSP if it's running a lite version of the firmware */


Why is it still marked as WIP?


AFAIU, there was more to be done here w.r.t. preloaded lite version
firmware.

Later, it was agreed that that is not the case.

So maybe I just need to drop the comment.

Sibi, can you confirm?


ack, this is the best we can currently do. Please drop the comment when
you re-spin the series. Thanks for sending this out.

-Sibi






+   if (adsp->lite_pas_id)
+   ret = qcom_scm_pas_shutdown(adsp->lite_pas_id);
+
 if (adsp->dtb_pas_id) {
	ret = request_firmware(&adsp->dtb_firmware, adsp->dtb_firmware_name, adsp->dev);
 if (ret) {
@@ -693,6 +699,7 @@ static int adsp_probe(struct platform_device *pdev)
 adsp->rproc = rproc;
 adsp->minidump_id = desc->minidump_id;
 adsp->pas_id = desc->pas_id;
+   adsp->lite_pas_id = desc->lite_pas_id;
 adsp->info_name = desc->sysmon_name;
 adsp->decrypt_shutdown = desc->decrypt_shutdown;
 adsp->region_assign_idx = desc->region_assign_idx;
@@ -990,6 +997,7 @@ static const struct adsp_data x1e80100_adsp_resource = {
 .dtb_firmware_name = "adsp_dtb.mdt",
 .pas_id = 1,
 .dtb_pas_id = 0x24,
+   .lite_pas_id = 0x1f,
 .minidump_id = 5,
 .auto_boot = true,
 .proxy_pd_names = (char*[]){

--
2.34.1





--
With best wishes
Dmitry




Re: [PATCH 1/6] tools/rtla: Fix Makefile compiler options for clang

2024-02-08 Thread Daniel Bristot de Oliveira
On 2/6/24 16:48, Nathan Chancellor wrote:
> On Tue, Feb 06, 2024 at 12:05:29PM +0100, Daniel Bristot de Oliveira wrote:
>> The following errors are showing up when compiling rtla with clang:
>>
>>  $ make HOSTCC=clang CC=clang LLVM_IAS=1
>>  [...]
>>
>>   clang -O -g -DVERSION=\"6.8.0-rc1\" -flto=auto -ffat-lto-objects
>>  -fexceptions -fstack-protector-strong
>>  -fasynchronous-unwind-tables -fstack-clash-protection  -Wall
>>  -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
>>  -Wp,-D_GLIBCXX_ASSERTIONS -Wno-maybe-uninitialized
>>  $(pkg-config --cflags libtracefs)-c -o src/utils.o src/utils.c
>>
>>   clang: warning: optimization flag '-ffat-lto-objects' is not supported 
>> [-Wignored-optimization-argument]
> 
> For what it's worth, this flag is supported in clang 17.0.0 and newer:
> 
> https://github.com/llvm/llvm-project/commit/610fc5cbcc8b68879c562f6458608afe2473ab7f

Good! Still, I am getting this error on Fedora, with this clang version:

bristot@x1:~/src/git/linux/tools/tracing/rtla$ clang --version
clang version 17.0.6 (Fedora 17.0.6-1.fc39)
Target: x86_64-redhat-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

:-(

> But if it is not critical, just dropping the flag like you have done
> here rather than conditionally supporting it is probably easier.

Yeah, I will drop it for now, and keep monitoring.

Thanks, Nathan!
-- Daniel



Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

2024-02-08 Thread Mete Durlu

On 2/7/24 16:47, Steven Rostedt wrote:

On Wed, 07 Feb 2024 14:33:21 +0100
Sven Schnelle  wrote:


My assumption without reading the code is that something like this
happens:

CPU0                                  CPU1
[ringbuffer enabled]
                                      ring_buffer_write()
                                        if (atomic_read(&buffer->record_disabled))
                                                goto out;
echo 0 > tracing_on
record_disabled |= RB_BUFFER_OFF
csum1=`md5sum trace`


Note, the CPU1 is performing with preemption disabled, so for this to
happen, something really bad happened on CPU0 to delay preempt disabled
section so long to allow the trace to be read. Perhaps we should have
the return of the echo 0 > tracing_on require a synchronize_rcu() to
make sure all ring buffers see it disabled before it returns.

But unless your system is doing something really stressed to cause the
preempt disabled section to take so long, I highly doubt this was the
race.



I have been only able to reliably reproduce this issue when the system
is under load from stressors. But I am not sure if it can be considered
as *really stressed*.

system : 8 cpus (4 physical cores)
load   : stress-ng --fanotify 1 (or --context 2)
result : ~5/10 test fails

of course as load increases test starts to fail more often, but a
single stressor doesn't seem like much to me for a 4 core machine.

after adding synchronize_rcu() + patch from Sven, I am no longer seeing
failures with the setup above. So it seems like synchronize_rcu() did
the trick(or at least it helps a lot) for the case described on the
previous mail. I couldn't trigger the failure yet, not even with
increased load(but now the test case takes > 5mins to finish :) ).

Here is the diff:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
@@ -9328,10 +9328,12 @@ rb_simple_write(struct file *filp, const char 
__user *ubuf,

val = 0; /* do nothing */
} else if (val) {
tracer_tracing_on(tr);
+   synchronize_rcu();
if (tr->current_trace->start)
tr->current_trace->start(tr);
} else {
tracer_tracing_off(tr);
+   synchronize_rcu();
if (tr->current_trace->stop)
tr->current_trace->stop(tr);

Not 100% sure if these were the correct places to add them.





Re: [PATCH RFC 0/4] Introduce uts_release

2024-02-08 Thread John Garry

On 05/02/2024 23:10, Masahiro Yamada wrote:

I think what you can contribute are:

   - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance to this. I also expect that in places like
the FW loader it is necessary. And when this is used in sysfs, people will say
that it is part of the ABI now.

How about I send the patch to update to use init_uts_ns, and also mention
that it would be better not to use it at all, if possible? I can cc you.


OK.


As I mentioned in the previous reply, the replacement is safe
for builtin code.

When you touch modular code, please pay a little more care,
because UTS_RELEASE and init_utsname()->release
may differ when CONFIG_MODVERSIONS=y.



Are you saying that we may have a different release version kernel and 
module built with CONFIG_MODVERSIONS=y, and the module was using 
UTS_RELEASE for something? That something may be like setting some info 
in a sysfs file, like in this example:


static ssize_t target_core_item_version_show(struct config_item *item,
char *page)
{
return sprintf(page, "Target Engine Core ConfigFS Infrastructure %s"
" on %s/%s on "UTS_RELEASE"\n", TARGET_CORE_VERSION,
utsname()->sysname, utsname()->machine);
}

And the intention is to use the module codebase release version and not 
the kernel codebase release version. Hence utsname() is used for 
.sysname and .machine, but not .release .
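For reference, the distinction under discussion, as a sketch rather than a
proposed change to the driver above: UTS_RELEASE from <generated/utsrelease.h>
is fixed when the object is built, while init_utsname()->release from
<linux/utsname.h> reflects the kernel the module is actually running on.

#include <generated/utsrelease.h>	/* UTS_RELEASE: baked in at build time */
#include <linux/utsname.h>		/* init_utsname(): the running kernel */
#include <linux/printk.h>

/* Hypothetical helper, for illustration only. */
static void example_print_versions(void)
{
	pr_info("built against: %s\n", UTS_RELEASE);
	pr_info("running on:    %s\n", init_utsname()->release);
}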


Thanks,
John



[PATCH v2 2/2] usb: typec: ucsi: Add qcm6490-pmic-glink as needing PDOS quirk

2024-02-08 Thread Luca Weiss
The QCM6490 Linux Android firmware needs this workaround as well. Add it
to the list.

Acked-by: Heikki Krogerus 
Reviewed-by: Dmitry Baryshkov 
Signed-off-by: Luca Weiss 
---
 drivers/usb/typec/ucsi/ucsi_glink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/usb/typec/ucsi/ucsi_glink.c 
b/drivers/usb/typec/ucsi/ucsi_glink.c
index 53a7ede8556d..0bd3f6dee678 100644
--- a/drivers/usb/typec/ucsi/ucsi_glink.c
+++ b/drivers/usb/typec/ucsi/ucsi_glink.c
@@ -298,6 +298,7 @@ static void pmic_glink_ucsi_destroy(void *data)
 }
 
 static const struct of_device_id pmic_glink_ucsi_of_quirks[] = {
+   { .compatible = "qcom,qcm6490-pmic-glink", .data = (void 
*)UCSI_NO_PARTNER_PDOS, },
{ .compatible = "qcom,sc8180x-pmic-glink", .data = (void 
*)UCSI_NO_PARTNER_PDOS, },
{ .compatible = "qcom,sc8280xp-pmic-glink", .data = (void 
*)UCSI_NO_PARTNER_PDOS, },
{ .compatible = "qcom,sm8350-pmic-glink", .data = (void 
*)UCSI_NO_PARTNER_PDOS, },

-- 
2.43.0




[PATCH v2 1/2] dt-bindings: soc: qcom: qcom,pmic-glink: document QCM6490 compatible

2024-02-08 Thread Luca Weiss
Document the QCM6490 compatible used to describe the pmic glink on this
platform.

Reviewed-by: Krzysztof Kozlowski 
Signed-off-by: Luca Weiss 
---
 Documentation/devicetree/bindings/soc/qcom/qcom,pmic-glink.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/soc/qcom/qcom,pmic-glink.yaml 
b/Documentation/devicetree/bindings/soc/qcom/qcom,pmic-glink.yaml
index 61df97ffe1e4..101c09554b80 100644
--- a/Documentation/devicetree/bindings/soc/qcom/qcom,pmic-glink.yaml
+++ b/Documentation/devicetree/bindings/soc/qcom/qcom,pmic-glink.yaml
@@ -23,6 +23,7 @@ properties:
 oneOf:
   - items:
   - enum:
+  - qcom,qcm6490-pmic-glink
   - qcom,sc8180x-pmic-glink
   - qcom,sc8280xp-pmic-glink
   - qcom,sm8350-pmic-glink

-- 
2.43.0




[PATCH v2 0/2] Fairphone 5 PMIC-GLINK support (USB-C, charger, fuel gauge)

2024-02-08 Thread Luca Weiss
This series adds all the necessary bits to enable USB-C role switching,
charger and fuel gauge (all via pmic-glink) on Fairphone 5.

Signed-off-by: Luca Weiss 
---
Changes in v2:
- Rebase on -next, drop applied patch
- Pick up tags
- Link to v1: 
https://lore.kernel.org/r/20231220-fp5-pmic-glink-v1-0-2a1f8e3c6...@fairphone.com

---
Luca Weiss (2):
  dt-bindings: soc: qcom: qcom,pmic-glink: document QCM6490 compatible
  usb: typec: ucsi: Add qcm6490-pmic-glink as needing PDOS quirk

 Documentation/devicetree/bindings/soc/qcom/qcom,pmic-glink.yaml | 1 +
 drivers/usb/typec/ucsi/ucsi_glink.c | 1 +
 2 files changed, 2 insertions(+)
---
base-commit: d36e89e4c25c15302eb1820d45680f765847dad9
change-id: 20231220-fp5-pmic-glink-b01d4fa1c7ea

Best regards,
-- 
Luca Weiss 




Re: [PATCH 1/1] vhost: Added pad cleanup if vnet_hdr is not present.

2024-02-08 Thread Yuri Benditovich
Just a polite ping.

On Tue, Jan 16, 2024 at 12:32 AM Michael S. Tsirkin  wrote:
>
> On Mon, Jan 15, 2024 at 09:48:40PM +0200, Andrew Melnychenko wrote:
> > When the Qemu launched with vhost but without tap vnet_hdr,
> > vhost tries to copy vnet_hdr from socket iter with size 0
> > to the page that may contain some trash.
> > That trash can be interpreted as unpredictable values for
> > vnet_hdr.
> > That leads to dropping some packets and in some cases to
> > stalling vhost routine when the vhost_net tries to process
> > packets and fails in a loop.
> >
> > Qemu options:
> >   -netdev tap,vhost=on,vnet_hdr=off,...
> >
> > Signed-off-by: Andrew Melnychenko 
> > ---
> >  drivers/vhost/net.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index f2ed7167c848..57411ac2d08b 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct 
> > vhost_net_virtqueue *nvq,
> >   hdr = buf;
> >   gso = &hdr->gso;
> >
> > + if (!sock_hlen)
> > + memset(buf, 0, pad);
> > +
> >   if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> >   vhost16_to_cpu(vq, gso->csum_start) +
> >   vhost16_to_cpu(vq, gso->csum_offset) + 2 >
>
>
> Hmm need to analyse it to make sure there are no cases where we leak
> some data to guest here in case where sock_hlen is set ...
> > --
> > 2.43.0
>



Re: [PATCH v9 00/15] Add Cgroup support for SGX EPC memory

2024-02-08 Thread Mikko Ylinen
On Mon, Feb 05, 2024 at 01:06:23PM -0800, Haitao Huang wrote:
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The existing
> cgroup memory controller cannot be used to limit or account for SGX EPC
> memory, which is a desirable feature in some environments, e.g., support
> for pod level control in a Kubernetes cluster on a VM or bare-metal host
> [1,2].
>  
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. A user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.
>  
> Much like normal system memory, EPC memory can be overcommitted via virtual
> memory techniques and pages can be swapped out of the EPC to their backing
> store, which are normal system memory allocated via shmem and accounted by
> the memory controller. Similar to per-cgroup reclamation done by the memory
> controller, the EPC misc controller needs to implement a per-cgroup EPC
> reclaiming process: when the EPC usage of a cgroup reaches its hard limit
> ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
> some EPC pages within the same cgroup to make room for new allocations.
> 
> For that, this implementation tracks reclaimable EPC pages in a separate
> LRU list in each cgroup, and below are more details and justification of
> this design. 
> 
> Track EPC pages in per-cgroup LRUs (from Dave)
> --
> 
> tl;dr: A cgroup hitting its limit should be as similar as possible to the
> system running out of EPC memory. The only two choices to implement that
> are nasty changes to the existing LRU scanning algorithm, or to add new LRUs.
> The result: Add a new LRU for each cgroup and scan those instead. Replace
> the existing global LRU with the root cgroup's LRU (only when this new
> support is compiled in, obviously).
> 
> The existing EPC memory management aims to be a miniature version of the
> core VM where EPC memory can be overcommitted and reclaimed. EPC
> allocations can wait for reclaim. The alternative to waiting would have
> been to send a signal and let the enclave die.
>  
> This series attempts to implement that same logic for cgroups, for the same
> reasons: it's preferable to wait for memory to become available and let
> reclaim happen than to do things that are fatal to enclaves.
>  
> There is currently a global reclaimable page SGX LRU list. That list (and
> the existing scanning algorithm) is essentially useless for doing reclaim
> when a cgroup hits its limit because the cgroup's pages are scattered
> around that LRU. It is unspeakably inefficient to scan a linked list with
> millions of entries for what could be dozens of pages from a cgroup that
> needs reclaim.
>  
> Even if unspeakably slow reclaim was accepted, the existing scanning
> algorithm only picks a few pages off the head of the global LRU. It would
> either need to hold the list locks for unreasonable amounts of time, or be
> taught to scan the list in pieces, which has its own challenges.
>  
> Unreclaimable Enclave Pages
> ---
> 
> There are a variety of page types for enclaves, each serving different
> purposes [5]. Although the SGX architecture supports swapping for all
> types, some special pages, e.g., Version Array (VA) and Secure Enclave
> Control Structure (SECS) [5], hold metadata of reclaimed pages and
> enclaves. That makes reclamation of such pages more intricate to manage.
> The SGX driver global reclaimer currently does not swap out VA pages. It
> only swaps the SECS page of an enclave when all other associated pages have
> been swapped out. The cgroup reclaimer follows the same approach and does
> not track those in per-cgroup LRUs and considers them as unreclaimable
> pages. The allocation of these pages is counted towards the usage of a
> specific cgroup and is subject to the cgroup's set EPC limits.
> 
> Earlier versions of this series implemented forced enclave-killing to
> reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
> particularly in scenarios where a user or administrator reduces this limit
> post-launch of enclaves. However, subsequent discussions [3, 4] indicated
> that such preemptive enforcement is not necessary for the misc-controllers.
> Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed,
> and the limit is only enforced at the time of new EPC allocation request.
> When a cgroup hits its limit but nothing left in the LRUs of the subtree,
> i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
> within that cgroup will result in an 'ENOMEM'.
> 
> Unreclaimable Guest VM EPC Pages
> 
> 
> The EPC pages allocated for guest VMs by the virtual