Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 08:01:11PM -0500, Mike Christie wrote:
> On 4/30/24 7:15 PM, Hillf Danton wrote:
> > On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> >> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >>>  static int vhost_task_fn(void *data)
> >>>  {
> >>>   struct vhost_task *vtsk = data;
> >>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> >>>   schedule();
> >>>   }
> >>>  
> >>> - mutex_lock(&vtsk->exit_mutex);
> >>> + mutex_lock(&exit_mutex);
> >>>   /*
> >>>* If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >>>* When the vhost layer has called vhost_task_stop it's already stopped
> >>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> >>>   vtsk->handle_sigkill(vtsk->data);
> >>>   }
> >>>   complete(&vtsk->exited);
> >>> - mutex_unlock(&vtsk->exit_mutex);
> >>> + mutex_unlock(&exit_mutex);
> >>>  
> >>
> >> Edward, thanks for the patch. I think though I just needed to swap the
> >> order of the calls above.
> >>
> >> Instead of:
> >>
> >> complete(&vtsk->exited);
> >> mutex_unlock(&vtsk->exit_mutex);
> >>
> >> it should have been:
> >>
> >> mutex_unlock(&vtsk->exit_mutex);
> >> complete(&vtsk->exited);
> > 
> > JFYI Edward did it [1]
> > 
> > [1] 
> > https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> Thanks.
> 
> I tested the code with that change and it no longer triggers the UAF.

Weird, but syzkaller said that yes, it triggers.

Compare
dcc0ca06174e6...@google.com
which tests the order
mutex_unlock(&vtsk->exit_mutex);
complete(&vtsk->exited);
that you like and says it triggers

and
97bda90617521...@google.com
which says it does not trigger.

Whatever you do, please send it to syzkaller in the original
thread, and when you post, please include the syzkaller report.

Given this gets confusing I'm fine with just a fixup patch,
and note in the commit log where I should squash it.


> I've fixed up the original patch that had the bug and am going to
> resubmit the patchset like how Michael requested.
> 




Re: [PATCH V1] soc: qcom: smp2p: Introduce tracepoint support

2024-04-30 Thread Bjorn Andersson
On Tue, Apr 30, 2024 at 04:18:27PM -0700, Chris Lew wrote:
> On 4/29/2024 12:55 AM, Sudeepgoud Patil wrote:
> > Introduce tracepoint support for smp2p providing useful logging
> > for communication between clients.
> > 
> 
> Let's add some more description as to why these tracepoints are useful. Do
> they help us track latency? Provide debugging information for us? For clients?

+1

> > diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
[..]
> > diff --git a/drivers/soc/qcom/smp2p.c b/drivers/soc/qcom/smp2p.c
> > index a21241cbeec7..dde8513641ae 100644
> > --- a/drivers/soc/qcom/smp2p.c
> > +++ b/drivers/soc/qcom/smp2p.c
> > @@ -20,6 +20,9 @@
> >   #include 
> >   #include 
> > +#define CREATE_TRACE_POINTS
> > +#include "trace-smp2p.h"
> > +
> >   /*
> >* The Shared Memory Point to Point (SMP2P) protocol facilitates 
> > communication
> >* of a single 32-bit value between two processors.  Each value has a 
> > single
> > @@ -191,6 +194,7 @@ static void qcom_smp2p_do_ssr_ack(struct qcom_smp2p 
> > *smp2p)
> > struct smp2p_smem_item *out = smp2p->out;
> > u32 val;
> > +   trace_smp2p_ssr_ack(smp2p->remote_pid);
> > smp2p->ssr_ack = !smp2p->ssr_ack;
> > val = out->flags & ~BIT(SMP2P_FLAGS_RESTART_ACK_BIT);
> > @@ -213,6 +217,7 @@ static void qcom_smp2p_negotiate(struct qcom_smp2p 
> > *smp2p)
> > smp2p->ssr_ack_enabled = true;
> > smp2p->negotiation_done = true;
> > +   trace_smp2p_negotiate(smp2p->remote_pid, 
> > smp2p->ssr_ack_enabled);
> 
> since this tracepoint is for negotiating, maybe we should capture all of the
> features (out->features) instead of just the ssr_ack feature.
> 

Perhaps we can use __print_flags() in TP_printk() for that, it will
attempt to resolve the bits and if it fails include the numeric value.
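
For illustration, a sketch of what that could look like (this assumes a
__field(u32, features) in the event and uses SMP2P_FEATURE_SSR_ACK as a
stand-in for the driver's feature bit; it is not the actual patch):

TRACE_EVENT(smp2p_negotiate,
	TP_PROTO(unsigned int remote_pid, u32 features),
	TP_ARGS(remote_pid, features),
	TP_STRUCT__entry(
		__field(u32, remote_pid)
		__field(u32, features)
	),
	TP_fast_assign(
		__entry->remote_pid = remote_pid;
		__entry->features = features;
	),
	TP_printk("%d: state=open features=%s",
		__entry->remote_pid,
		__print_flags(__entry->features, "|",
			{ SMP2P_FEATURE_SSR_ACK, "SSR_ACK" })
	)
);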

> > }
> >   }
[..]
> > diff --git a/drivers/soc/qcom/trace-smp2p.h b/drivers/soc/qcom/trace-smp2p.h
[..]
> > +TRACE_EVENT(smp2p_ssr_ack,
> > +   TP_PROTO(unsigned int remote_pid),
> 
> Is there any way we can map the remote PIDs to a string? I feel like the
> tracepoints would be more useful if they called out modem, adsp, etc. instead
> of an integer value.
> 

And __print_symbolic() for this one.
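
A minimal sketch of that idea (the pid-to-name pairs below are placeholders;
the actual remote processor IDs are platform-specific):

/* Hypothetical helper for TP_printk(); the pid values are assumptions. */
#define show_smp2p_remote_pid(pid)		\
	__print_symbolic(pid,			\
		{ 1, "modem" },			\
		{ 2, "adsp" },			\
		{ 5, "cdsp" })

/* e.g.: TP_printk("%s: SSR detected, doing SSR Handshake",
 *		show_smp2p_remote_pid(__entry->remote_pid))
 */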

Regards,
Bjorn



Re: [PATCH net-next v8] net/ipv4: add tracepoint for icmp_send

2024-04-30 Thread Jakub Kicinski
On Mon, 29 Apr 2024 17:15:57 +0800 (CST) xu.xi...@zte.com.cn wrote:
> Introduce a tracepoint for icmp_send, which can help users get more
> detailed information conveniently when abnormal ICMP events happen.

Rebase please, it doesn't apply. And please put the change log under
the --- delimiter.
-- 
pw-bot: cr



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Mike Christie
On 4/30/24 7:15 PM, Hillf Danton wrote:
> On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
>> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
>>>  static int vhost_task_fn(void *data)
>>>  {
>>> struct vhost_task *vtsk = data;
>>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
>>> schedule();
>>> }
>>>  
>>> -   mutex_lock(&vtsk->exit_mutex);
>>> +   mutex_lock(&exit_mutex);
>>> /*
>>>  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
>>>  * When the vhost layer has called vhost_task_stop it's already stopped
>>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
>>> vtsk->handle_sigkill(vtsk->data);
>>> }
>>> complete(&vtsk->exited);
>>> -   mutex_unlock(&vtsk->exit_mutex);
>>> +   mutex_unlock(&exit_mutex);
>>>  
>>
>> Edward, thanks for the patch. I think though I just needed to swap the
>> order of the calls above.
>>
>> Instead of:
>>
>> complete(&vtsk->exited);
>> mutex_unlock(&vtsk->exit_mutex);
>>
>> it should have been:
>>
>> mutex_unlock(&vtsk->exit_mutex);
>> complete(&vtsk->exited);
> 
> JFYI Edward did it [1]
> 
> [1] 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

Thanks.

I tested the code with that change and it no longer triggers the UAF.

I've fixed up the original patch that had the bug and am going to
resubmit the patchset like how Michael requested.




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Hillf Danton
On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >  static int vhost_task_fn(void *data)
> >  {
> > struct vhost_task *vtsk = data;
> > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > schedule();
> > }
> >  
> > -   mutex_lock(&vtsk->exit_mutex);
> > +   mutex_lock(&exit_mutex);
> > /*
> >  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >  * When the vhost layer has called vhost_task_stop it's already stopped
> > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > vtsk->handle_sigkill(vtsk->data);
> > }
> > complete(&vtsk->exited);
> > -   mutex_unlock(&vtsk->exit_mutex);
> > +   mutex_unlock(&exit_mutex);
> >  
> 
> Edward, thanks for the patch. I think though I just needed to swap the
> order of the calls above.
> 
> Instead of:
> 
> > complete(&vtsk->exited);
> > mutex_unlock(&vtsk->exit_mutex);
> 
> it should have been:
> 
> > mutex_unlock(&vtsk->exit_mutex);
> > complete(&vtsk->exited);

JFYI Edward did it [1]

[1] 
https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> If my analysis is correct, then Michael do you want me to resubmit a
> patch on top of your vhost branch or resubmit the entire patchset?



Re: [PATCH V1] soc: qcom: smp2p: Introduce tracepoint support

2024-04-30 Thread Chris Lew




On 4/29/2024 12:55 AM, Sudeepgoud Patil wrote:

Introduce tracepoint support for smp2p providing useful logging
for communication between clients.



Let's add some more description as to why these tracepoints are useful.
Do they help us track latency? Provide debugging information for us? For clients?



Signed-off-by: Sudeepgoud Patil 
Signed-off-by: Deepak Kumar Singh 


As Elliot mentioned in the internal review, your Signed-off-by should be 
last. I would suggest removing Deepak's Signed-off-by and letting him 
reply with Reviewed-by since he was the main reviewer internally.



---
  drivers/soc/qcom/Makefile  |  1 +
  drivers/soc/qcom/smp2p.c   | 10 
  drivers/soc/qcom/trace-smp2p.h | 99 ++
  3 files changed, 110 insertions(+)
  create mode 100644 drivers/soc/qcom/trace-smp2p.h

diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
index ca0bece0dfff..30c1bf645501 100644
--- a/drivers/soc/qcom/Makefile
+++ b/drivers/soc/qcom/Makefile
@@ -23,6 +23,7 @@ qcom_rpmh-y   += rpmh.o
  obj-$(CONFIG_QCOM_SMD_RPM)+= rpm-proc.o smd-rpm.o
  obj-$(CONFIG_QCOM_SMEM) +=smem.o
  obj-$(CONFIG_QCOM_SMEM_STATE) += smem_state.o
+CFLAGS_smp2p.o := -I$(src)
  obj-$(CONFIG_QCOM_SMP2P)  += smp2p.o
  obj-$(CONFIG_QCOM_SMSM)   += smsm.o
  obj-$(CONFIG_QCOM_SOCINFO)+= socinfo.o
diff --git a/drivers/soc/qcom/smp2p.c b/drivers/soc/qcom/smp2p.c
index a21241cbeec7..dde8513641ae 100644
--- a/drivers/soc/qcom/smp2p.c
+++ b/drivers/soc/qcom/smp2p.c
@@ -20,6 +20,9 @@
  #include 
  #include 
  
+#define CREATE_TRACE_POINTS

+#include "trace-smp2p.h"
+
  /*
   * The Shared Memory Point to Point (SMP2P) protocol facilitates communication
   * of a single 32-bit value between two processors.  Each value has a single
@@ -191,6 +194,7 @@ static void qcom_smp2p_do_ssr_ack(struct qcom_smp2p *smp2p)
struct smp2p_smem_item *out = smp2p->out;
u32 val;
  
+	trace_smp2p_ssr_ack(smp2p->remote_pid);

smp2p->ssr_ack = !smp2p->ssr_ack;
  
  	val = out->flags & ~BIT(SMP2P_FLAGS_RESTART_ACK_BIT);

@@ -213,6 +217,7 @@ static void qcom_smp2p_negotiate(struct qcom_smp2p *smp2p)
smp2p->ssr_ack_enabled = true;
  
  		smp2p->negotiation_done = true;

+   trace_smp2p_negotiate(smp2p->remote_pid, 
smp2p->ssr_ack_enabled);


since this tracepoint is for negotiating, maybe we should capture all of 
the features (out->features) instead of just the ssr_ack feature.



}
  }
  
@@ -251,6 +256,8 @@ static void qcom_smp2p_notify_in(struct qcom_smp2p *smp2p)

status = val ^ entry->last_value;
entry->last_value = val;
  
+		trace_smp2p_notify_in(smp2p->remote_pid, entry->name, status, val);

+
/* No changes of this entry? */
if (!status)
continue;
@@ -406,6 +413,9 @@ static int smp2p_update_bits(void *data, u32 mask, u32 
value)
writel(val, entry->value);
spin_unlock_irqrestore(&entry->lock, flags);
  
+	trace_smp2p_update_bits(entry->smp2p->remote_pid,

+   entry->name, orig, val);
+
if (val != orig)
qcom_smp2p_kick(entry->smp2p);
  
diff --git a/drivers/soc/qcom/trace-smp2p.h b/drivers/soc/qcom/trace-smp2p.h

new file mode 100644
index ..c61afab23f0c
--- /dev/null
+++ b/drivers/soc/qcom/trace-smp2p.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_smp2p
+
+#if !defined(__QCOM_SMP2P_TRACE_H__) || defined(TRACE_HEADER_MULTI_READ)
+#define __QCOM_SMP2P_TRACE_H__
+
+#include 
+
+TRACE_EVENT(smp2p_ssr_ack,
+   TP_PROTO(unsigned int remote_pid),


Is there any way we can map the remote PIDs to a string? I feel like
the tracepoints would be more useful if they called out modem, adsp, etc.
instead of an integer value.



+   TP_ARGS(remote_pid),
+   TP_STRUCT__entry(
+   __field(u32, remote_pid)
+   ),
+   TP_fast_assign(
+   __entry->remote_pid = remote_pid;
+   ),
+   TP_printk("%d: SSR detected, doing SSR Handshake",
+   __entry->remote_pid
+   )
+);
+
+TRACE_EVENT(smp2p_negotiate,
+   TP_PROTO(unsigned int remote_pid, bool ssr_ack_enabled),
+   TP_ARGS(remote_pid, ssr_ack_enabled),
+   TP_STRUCT__entry(
+   __field(u32, remote_pid)
+   __field(bool, ssr_ack_enabled)
+   ),
+   TP_fast_assign(
+   __entry->remote_pid = remote_pid;
+   __entry->ssr_ack_enabled = ssr_ack_enabled;
+   ),
+   TP_printk("%d: state=open ssr_ack=%d",
+   __entry->remote_pid,
+   __entry->ssr_ack_enabled
+   )
+);
+
+TRACE_EVENT(smp2p_notify_in,
+   TP_PROTO(unsigned int remote_pid, const char *name, unsigned long 
status, u32 val),
+   TP_ARGS(remote_pid, name, status, val),
+   

Re: [PATCH] virtio_net: Warn if insufficient queue length for transmitting

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 03:35:09PM -0400, Darius Rad wrote:
> The transmit queue is stopped when the number of free queue entries is less
> than 2+MAX_SKB_FRAGS, in start_xmit().  If the queue length (QUEUE_NUM_MAX)
> is less than this, transmission will immediately trigger a netdev
> watchdog timeout.  Report this condition earlier and more directly.
> 
> Signed-off-by: Darius Rad 
> ---
>  drivers/net/virtio_net.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 115c3c5414f2..72ee8473b61c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4917,6 +4917,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>   set_bit(guest_offloads[i], &vi->guest_offloads);
>   vi->guest_offloads_capable = vi->guest_offloads;
>  
> + if (virtqueue_get_vring_size(vi->sq->vq) < 2 + MAX_SKB_FRAGS)
> + netdev_warn_once(dev, "not enough queue entries, expect xmit 
> timeout\n");
> +

How about actually fixing it though? E.g. by linearizing...

It also bothers me that there's practically
/proc/sys/net/core/max_skb_frags
and if that's low then things could actually work.

Finally, while originally it was typically just 17, now it's
configurable. So it's possible that you change the config to make Big
TCP work better and the device stops working while it worked fine
previously.
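
For illustration, a rough sketch of the linearize idea (a hypothetical
helper, not virtio_net's actual code path):

#include <linux/skbuff.h>

/*
 * Hypothetical helper, not from virtio_net: if the skb would need more
 * descriptors than the ring can ever provide, collapse its fragments into
 * the linear area so it fits. Returns 0 on success or -ENOMEM.
 */
static int xmit_fit_skb_to_ring(struct sk_buff *skb, unsigned int ring_size)
{
	/* 2 covers the virtio header and the linear part; one more per frag */
	if (2 + skb_shinfo(skb)->nr_frags <= ring_size)
		return 0;

	return skb_linearize(skb);
}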


>   pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
>dev->name, max_queue_pairs);
>  
> -- 
> 2.39.2




[PATCH v13 13/14] Docs/x86/sgx: Add description for cgroup support

2024-04-30 Thread Haitao Huang
From: Sean Christopherson 

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Cc: Sean Christopherson 
Reviewed-by: Bagas Sanjaya 
Acked-by: Kai Huang 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call.  It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. 
must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running.  The EPC is managed by the SGX subsystem 
and
+is not accounted by the memory controller.  Note that this is true only for EPC
+memory itself, i.e.  normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing 
store
+(normal system memory allocated via shmem).  The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+---
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+  misc.capacity
+A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+resource will show the total amount of EPC memory available on the
+platform.
+
+  misc.current
+A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+resource will show the current active EPC memory usage of the cgroup 
and
+its descendants. EPC pages that are swapped out to backing RAM are not
+included in the current count.
+
+  misc.max
+A read-write single value file which exists on non-root cgroups. The
+sgx_epc resource will show the EPC usage hard limit. The default is
+"max".
+
+If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+page fault handling, will be blocked until EPC can be reclaimed from 
the
+cgroup. If there are no pages left that are reclaimable within the same
+group, the kernel returns ENOMEM.
+
+The EPC pages allocated for a guest VM by the virtual EPC driver are 
not
+reclaimable by the host kernel. In case the guest cgroup's limit is
+reached and no reclaimable pages left in the same cgroup, the virtual
+EPC driver returns SIGBUS to the user space process to indicate failure
+on new EPC allocation requests.
+
+The misc.max limit is non-preemptive. If a user writes a limit lower
+than the current usage to this file, the cgroup will not preemptively
+deallocate pages currently in use, and will only start blocking the 
next
+allocation and reclaiming EPC at that time.
+
+  misc.events
+A read-only flat-keyed file which exists on non-root cgroups.
+A value 

[PATCH v13 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-30 Thread Haitao Huang
With different cgroups, the script starts one or multiple concurrent SGX
selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
test case, which loads an enclave of EPC size equal to the EPC capacity
available on the platform. The script checks results against the
expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) small - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) large - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) larger - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all processes exit.

The script also includes a test with low mem_cg limit and large sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg. For this test, it turns off swapping before start,
and turns swapping back on afterwards.

Add README to document how to run the tests.

Signed-off-by: Haitao Huang 
---
V13:
- More improvement on handling error cases and style fixes.
- Add settings file for custom timeout

V12:
- Integrate the scripts to the "run_tests" target. (Jarkko)

V11:
- Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko)
- Drop support for cgroup v1 and simplify. (Michal, Jarkko)
- Add documentation for functions. (Jarkko)
- Turn off swapping before memcontrol tests and back on after
- Format and style fixes, name for hard coded values

V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
 tools/testing/selftests/sgx/Makefile  |   3 +-
 tools/testing/selftests/sgx/README| 109 +++
 tools/testing/selftests/sgx/ash_cgexec.sh |  16 +
 tools/testing/selftests/sgx/config|   4 +
 .../selftests/sgx/run_epc_cg_selftests.sh | 295 ++
 tools/testing/selftests/sgx/settings  |   2 +
 .../selftests/sgx/watch_misc_for_tests.sh |  11 +
 7 files changed, 439 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/sgx/README
 create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
 create mode 100644 tools/testing/selftests/sgx/config
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100644 tools/testing/selftests/sgx/settings
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/Makefile 
b/tools/testing/selftests/sgx/Makefile
index 867f88ce2570..739376af9e33 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none
 
 ifeq ($(CAN_BUILD_X86_64), 1)
 TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
-TEST_FILES := $(OUTPUT)/test_encl.elf
+TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh
+TEST_PROGS := run_epc_cg_selftests.sh
 
 all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
 endif
diff --git a/tools/testing/selftests/sgx/README 
b/tools/testing/selftests/sgx/README
new file mode 100644
index ..f84406bf29a4
--- /dev/null
+++ b/tools/testing/selftests/sgx/README
@@ -0,0 +1,109 @@
+SGX selftests
+
+The SGX selftests include a C program (test_sgx) that covers basic user space
+facing APIs and a shell script (run_epc_cg_selftests.sh) testing the SGX misc
+cgroup. The SGX cgroup test script requires root privileges and runs a
+specific test case of test_sgx in different cgroups configured by the
+script. More details about the cgroup test can be found below.
+
+All SGX selftests can run with or without kselftest framework.
+
+WITH KSELFTEST FRAMEWORK
+===
+
+BUILD
+-
+
+Build executable file "test_sgx" from top level directory of the kernel source:
+ $ make -C tools/testing/selftests TARGETS=sgx
+
+RUN
+---
+
+Run all sgx tests as sudo or root since the cgroup tests need to configure 
cgroup
+limits in files under /sys/fs/cgroup.
+
+ $ sudo make -C tools/testing/selftests/sgx run_tests
+
+Without sudo, SGX cgroup tests will be skipped.
+
+On platforms with a large Enclave Page Cache (EPC) and/or fewer CPU cores, you
+may need to adjust the timeout in 'settings' to avoid timeouts.
+
+More details about kselftest framework can be found in
+Documentation/dev-tools/kselftest.rst.
+
+WITHOUT KSELFTEST FRAMEWORK
+===
+
+BUILD
+-
+
+Build executable file "test_sgx" from this
+directory(tools/testing/selftests/sgx/):
+
+  $ make
+
+RUN
+---
+
+Run all non-cgroup tests:
+
+ $ ./test_sgx
+
+To test SGX cgroup:
+
+ $ 

[PATCH v13 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_epc_page_lru() always
returns reference to the global LRU.

Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
given EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., when the sum of cgroup
limits is greater than the total capacity, cgroups may never reclaim but
the total usage can still be near the capacity. Therefore a global
reclamation is still needed in those cases and it should be performed
from the root cgroup.

Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled, otherwise from the global LRU. Export
sgx_cgroup_reclaim_pages() in the header file so it can be reused for
this purpose.

Similarly, modify sgx_can_reclaim_global(), to check emptiness of LRUs
of all cgroups when EPC cgroup is enabled, otherwise only check the
global LRU.  Export sgx_cgroup_lru_empty() so it can be reused for this
purpose.

Finally, change sgx_reclaim_direct(), to check and ensure there are free
pages at cgroup level so forward progress can be made by the caller.
Export sgx_cgroup_should_reclaim() for reuse.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.
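
For reference, a sketch of the shape these changes could take (assumed code,
not the exact hunks; misc_cg_root() stands in for whatever accessor the
series uses for the root misc cgroup):

/* Sketch only: route global reclaim through the root cgroup when the misc
 * controller is enabled, otherwise keep using the global LRU.
 */
static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
{
	if (IS_ENABLED(CONFIG_CGROUP_MISC))
		sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
	else
		sgx_reclaim_pages(&sgx_global_lru, charge_mm);
}

/* Sketch only: emptiness check covering all cgroup LRUs when enabled. */
static bool sgx_can_reclaim_global(void)
{
	if (IS_ENABLED(CONFIG_CGROUP_MISC))
		return !sgx_cgroup_lru_empty(misc_cg_root());

	return !list_empty(&sgx_global_lru.reclaimable);
}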

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V13:
- Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC, conditionally compile SGX cgroup for
CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Reword the comments for global reclamation for allocation failure
after passing cgroup charging. (Kai)
- Add stub functions to remove ifdefs in c file (Kai)
- Add more detailed comments to clarify each page belongs to one cgroup, or the
root. (Kai)

V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  6 ++--
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 27 
 arch/x86/kernel/cpu/sgx/main.c   | 46 ++--
 3 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index c237bc3330ee..fe2fd6034023 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -67,7 +67,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct 
sgx_cgroup *sgx_cg)
  *
  * Return: %true if all cgroups under the specified root have empty LRU lists.
  */
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
 {
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -115,7 +115,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
  * the LRUs are recently accessed, i.e., considered "too young" to reclaim, no
  * page will actually be reclaimed after walking the whole tree.
  */
-static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct 
*charge_mm)
+void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct 
*charge_mm)
 {
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -156,7 +156,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root, 
struct mm_struct *cha
  * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
  * cgroup.
  */
-static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
 {
u64 cur, max;
 
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 2044e0d64076..9d69608eadf6 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -13,6 +13,11 @@
 #define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
 struct sgx_cgroup;
 
+static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg)
+{
+   return NULL;
+}
+
 static inline struct sgx_cgroup *sgx_get_current_cg(void)
 {
return NULL;
@@ -27,8 +32,22 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup 
*sgx_cg, enum sgx_recl
 
 static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
 
+static inline bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+   return true;
+}
+
+static inline bool 

[PATCH v13 11/14] x86/sgx: Abstract check for global reclaimable pages

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

For the global reclaimer to determine if any page is available for
reclamation at the global level, it currently only checks for emptiness
of the global LRU. That will be inadequate when pages are tracked in
multiple LRUs, one per cgroup. For this purpose, create a new helper,
sgx_can_reclaim_global(), to abstract this check. Currently it only
checks the global LRU, later will check emptiness of LRUs of all cgroups
when per-cgroup tracking is turned on.

Replace all the checks for emptiness of the global LRU,
list_empty(_global_lru.reclaimable), with calls to
sgx_can_reclaim_global().

Rename sgx_should_reclaim() to sgx_should_reclaim_global() as it is used
to check if a global reclamation should be performed.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Rename sgx_can_reclaim() to sgx_can_reclaim_global() and
sgx_should_reclaim() to sgx_should_reclaim_global(). (Kai)

V10:
- Add comments for the new function. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f1bd15620b83..92bd3151a589 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -36,6 +36,14 @@ static inline struct sgx_epc_lru_list 
*sgx_epc_page_lru(struct sgx_epc_page *epc
return &sgx_global_lru;
 }
 
+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim_global(void)
+{
+   return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -387,10 +395,10 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list 
*lru, struct mm_struct *c
return cnt;
 }
 
-static bool sgx_should_reclaim(unsigned long watermark)
+static bool sgx_should_reclaim_global(unsigned long watermark)
 {
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_global_lru.reclaimable);
+   sgx_can_reclaim_global();
 }
 
 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
@@ -405,7 +413,7 @@ static void sgx_reclaim_pages_global(struct mm_struct 
*charge_mm)
  */
 void sgx_reclaim_direct(void)
 {
-   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+   if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
sgx_reclaim_pages_global(current->mm);
 }
 
@@ -426,9 +434,9 @@ static int ksgxd(void *p)
 
wait_event_freezable(ksgxd_waitq,
 kthread_should_stop() ||
-sgx_should_reclaim(SGX_NR_HIGH_PAGES));
+
sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));
 
-   if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
+   if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
/* Indirect reclaim, no mm to charge, so NULL: */
sgx_reclaim_pages_global(NULL);
 
@@ -592,7 +600,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum 
sgx_reclaim reclaim)
break;
}
 
-   if (list_empty(&sgx_global_lru.reclaimable)) {
+   if (!sgx_can_reclaim_global()) {
page = ERR_PTR(-ENOMEM);
break;
}
@@ -620,7 +628,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum 
sgx_reclaim reclaim)
sgx_put_cg(sgx_cg);
}
 
-   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+   if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
wake_up(&ksgxd_waitq);
 
return page;
-- 
2.25.1




[PATCH v13 09/14] x86/sgx: Implement async reclamation for cgroup

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

In cases where EPC pages need to be allocated during a page fault and the
cgroup usage is near its limit, an asynchronous reclamation needs to be
triggered to avoid blocking the page fault handling.

Create a workqueue, corresponding work item and function definitions
for EPC cgroup to support the asynchronous reclamation.

In sgx_cgroup_try_charge(), if the caller does not allow synchronous
reclamation, queue an asynchronous work item into the workqueue.

Reclaiming only when the usage is at or very close to the limit would
cause thrashing. To avoid that, before returning from
sgx_cgroup_try_charge(), check the need for reclamation (usage too close
to the limit), queue an async work if needed, similar to how the global
reclaimer wakes up its reclaiming thread after each allocation in
sgx_alloc_epc_pages().
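
A minimal sketch of the async pieces described above (field and function
names such as reclaim_work are assumptions, not necessarily the patch's; the
diff below shows the actual workqueue setup):

/* Sketch only: per-cgroup async reclaim via a dedicated workqueue. */
static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
{
	struct sgx_cgroup *sgx_cg = container_of(work, struct sgx_cgroup,
						 reclaim_work);

	/* Background context: there is no mm to charge for backing store. */
	sgx_cgroup_reclaim_pages(sgx_cg->cg, NULL);
}

/* Called from sgx_cgroup_try_charge() when sync reclaim is not allowed. */
static void sgx_cgroup_queue_reclaim(struct sgx_cgroup *sgx_cg)
{
	if (sgx_cgroup_should_reclaim(sgx_cg))
		queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
}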

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Revert to BUG_ON() in case of workq allocation failure in init and
only alloc if misc is enabled.

V11:
- Print error instead of WARN (Kai)
- Add check for need to queue an async reclamation before returning from
try_charge(), do so if needed. This is to be consistent with global
reclaimer to minimize thrashing during allocation time.

V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fail during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 135 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   1 +
 2 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 3602616726ff..6368611cb29e 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -4,9 +4,63 @@
 #include
 #include "epc_cgroup.h"
 
+/*
+ * The minimal free pages maintained by per-cgroup reclaimer
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE   (SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the 
minimal
+ * free pages would barely leave any page for use, causing excessive 
reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which a cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple the minimum so that at least 75% of
+ * the pages can be used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT   (SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is 
not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+   return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / 
PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+   return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+   struct misc_cg *i = sgx_cg->cg;
+   u64 m = U64_MAX;
+
+   while (i) {
+   m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+   i = misc_cg_parent(i);
+   }
+
+   return m / PAGE_SIZE;
+}
+
 /**
  * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
  * @root:  Root of the tree to check
@@ -89,6 +143,61 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
rcu_read_unlock();
 }
 
+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a 
cgroup
+ * @sgx_cg: The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() 
where
+ * the minimal number of free pages needs to be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if the number of free pages available for the cgroup is below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+   u64 cur, max;
+
+   if (sgx_cgroup_lru_empty(sgx_cg->cg))
+ 

[PATCH v13 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching its limit.  This is not ideal. When that happens, a better way is
to reclaim EPC page(s) from the current EPC cgroup (and/or its
descendants) to reduce its usage so the new allocation can succeed.

Add the basic building blocks to support per-cgroup reclamation.

Currently the kernel only has one place to reclaim EPC pages: the global
EPC LRU list.  To support the "per-cgroup" EPC reclaim, maintain an LRU
list for each EPC cgroup, and introduce a "cgroup" variant function to
reclaim EPC pages from a given EPC cgroup and its descendants.

Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
pages.  Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
pages from the global LRU, and then tries to reclaim these pages at once
for better performance.

Implement the "cgroup" variant EPC reclaim in a similar way, but keep
the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
as input, and return the pages that are "scanned" and attempted for
reclamation (but not necessarily reclaimed successfully); 2) loop the
given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".

This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
cgroup, and only moves to its descendants when there are not enough
reclaimable EPC pages to "scan" in its LRU.  It should be enough for
most cases.

Note, this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in batch
of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
reclaimable pages, the actual reclaim of EPC pages will be split into
smaller batches _across_ multiple LRUs with each being smaller than
SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch.  But this is unnecessarily complicated at
this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to
return the actual "reclaimed" pages, but not "scanned" pages. However,
the reclamation is a lengthy process, forcing a successful reclamation
of predetermined number of pages may block the caller for too long. And
that may not be acceptable in some synchronous contexts, e.g., in
serving an ioctl().

With this building block in place, add synchronous reclamation support
in sgx_cgroup_try_charge(): trigger a call to
sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
caller allows synchronous reclaim as indicated by s newly added
parameter.

A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().

Note all reclaimable EPC pages are still tracked in the global LRU thus
no per-cgroup reclamation is actually active at the moment. Per-cgroup
tracking and reclamation will be turned on in the end after all
necessary infrastructure is in place.
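
To make the loop described above concrete, a hedged sketch (the css
iteration details and the sgx_cgroup_from_css() accessor are assumptions,
not the patch's exact code):

/* Sketch only: scan up to SGX_NR_TO_SCAN pages from @root and descendants. */
static void sgx_cgroup_reclaim_pages_sketch(struct misc_cg *root,
					    struct mm_struct *charge_mm)
{
	struct cgroup_subsys_state *pos;
	unsigned int cnt = 0;

	rcu_read_lock();
	css_for_each_descendant_pre(pos, &root->css) {
		if (!css_tryget(pos))
			break;
		rcu_read_unlock();

		/* sgx_cgroup_from_css(): assumed accessor to the per-cgroup LRU */
		cnt += sgx_reclaim_pages(&sgx_cgroup_from_css(pos)->lru,
					 charge_mm);

		rcu_read_lock();
		css_put(pos);

		if (cnt >= SGX_NR_TO_SCAN)
			break;
	}
	rcu_read_unlock();
}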

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V11:
- Use commit message suggested by Kai
- Remove "usage" comments for functions. (Kai)

V10:
- Simplify the signature by removing a pointer to nr_to_scan (Kai)
- Return pages attempted instead of reclaimed as it is really what the
cgroup caller needs to track progress. This further simplifies the design.
- Merge patch for exposing sgx_reclaim_pages() with basic synchronous
reclamation. (Kai)
- Shorten names for EPC cgroup functions. (Jarkko)
- Fix/add comments to justify the design (Kai)
- Separate out a helper for for addressing single iteration of the loop
in sgx_cgroup_try_charge(). (Jarkko)

V9:
- Add comments for static variables. (Jarkko)

V8:
- Use width of 80 characters in text paragraphs. (Jarkko)
- Remove alignment for substructure variables. (Jarkko)

V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 119 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   5 +-
 arch/x86/kernel/cpu/sgx/main.c   |  45 ++
 arch/x86/kernel/cpu/sgx/sgx.h|   1 +
 4 files changed, 148 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 5c484fd10160..3602616726ff 100644
--- 

[PATCH v13 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX driver tracks reclaimable EPC pages by adding a newly allocated
page into the global LRU list in sgx_mark_page_reclaimable(), and doing
the opposite in sgx_unmark_page_reclaimable().

To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and each newly allocated EPC page will need to be
added to the LRU of associated cgroup, not always the global LRU list.

When sgx_mark_page_reclaimable() is called, the cgroup that the newly
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.

Add a helper, sgx_epc_page_lru(), to return the LRU that the EPC page
should be added to for the given EPC page.  Currently it just returns
the global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the
helper function to get the LRU from the EPC page instead of referring to
the global LRU directly.

This allows EPC page being able to be tracked in "per-cgroup" LRU when
that becomes ready.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Revise commit log (Kai)
- Rename sgx_lru_list() to sgx_epc_page_lru()

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 59736dd02ca7..2bf9cca3be21 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -31,6 +31,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page 
*epc_page)
+{
+   return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -499,25 +504,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a 
LRU.
  * @page:  EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+   spin_lock(&lru->lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-   spin_unlock(&sgx_global_lru.lock);
+   list_add_tail(&page->list, &lru->reclaimable);
+   spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:  EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
@@ -525,18 +529,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+   spin_lock(&lru->lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
return -EBUSY;
}

list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
 
return 0;
 }
-- 
2.25.1




[PATCH v13 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2024-04-30 Thread Haitao Huang
From: Sean Christopherson 

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.

Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +-
 arch/x86/kernel/cpu/sgx/sgx.h  | 15 +
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 1226ea0d5b3c..59736dd02ca7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -27,10 +27,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -305,13 +304,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-   if (list_empty(&sgx_active_page_list))
+   epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+   struct sgx_epc_page, list);
+   if (!epc_page)
break;
 
-   epc_page = list_first_entry(&sgx_active_page_list,
-   struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
 
@@ -323,7 +322,7 @@ static void sgx_reclaim_pages(void)
 */
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_unlock(&sgx_global_lru.lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -346,9 +345,9 @@ static void sgx_reclaim_pages(void)
continue;
 
 skip:
-   spin_lock(&sgx_reclaimer_lock);
-   list_add_tail(&epc_page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
+   list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);

kref_put(&epc_page->encl->refcount, sgx_encl_release);
 
@@ -379,7 +378,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_active_page_list);
+  !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -431,6 +430,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
ksgxd_tsk = tsk;
 
+   sgx_lru_init(&sgx_global_lru);
+
return true;
 }
 
@@ -506,10 +507,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -524,18 +525,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
 
 

[PATCH v13 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-04-30 Thread Haitao Huang
Enclave Page Cache(EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(). But it considers all contexts other than the
ksgxd thread to be user processes. With the new EPC cgroup implementation,
the swapping can also happen in EPC cgroup work-queue threads. In those
cases, it improperly selects the root mem_cgroup to charge for the RAM
usage.

Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
an additional argument to explicitly specify the mm struct to charge for
allocations. Callers from background kthreads not associated with a
charging mm struct would set it to NULL, while callers in user process
contexts set it to current->mm.

Internally, it handles the case when the charging mm given is NULL, by
searching for an mm struct from enclave's mm_list.

Signed-off-by: Haitao Huang 
Reported-by: Mikko Ylinen 
Reviewed-by: Kai Huang 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V10:
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)

V9:
- Reduce number of if statements. (Tim)

V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c   | 29 ++--
 arch/x86/kernel/cpu/sgx/encl.h   |  3 +--
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 10 ++
 arch/x86/kernel/cpu/sgx/main.c   | 29 +---
 arch/x86/kernel/cpu/sgx/sgx.h|  2 +-
 5 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f474179b6f77..7b77dad41daf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, 
unsigned long page_inde
 }
 
 /*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave.
+ *
+ * Used in sgx_encl_alloc_backing() for backing store allocation.
+ *
+ * Return the mem_cgroup of the given charge_mm. Otherwise return the 
mem_cgroup
+ * of a struct mm stored in the enclave's mm_list.
  */
-static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
+ struct mm_struct *charge_mm)
 {
struct mem_cgroup *memcg = NULL;
struct sgx_encl_mm *encl_mm;
int idx;
 
-   /*
-* If called from normal task context, return the mem_cgroup
-* of the current task's mm. The remainder of the handling is for
-* ksgxd.
-*/
-   if (!current_is_ksgxd())
-   return get_mem_cgroup_from_mm(current->mm);
+/* Use the charge_mm if given. */
+   if (charge_mm)
+   return get_mem_cgroup_from_mm(charge_mm);
 
/*
 * Search the enclave's mm_list to find an mm associated with
@@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct 
sgx_encl *encl)
  * @encl:  an enclave pointer
  * @page_index:enclave page index
  * @backing:   data for accessing backing storage for the page
+ * @charge_mm: the mm to charge for the allocation
  *
- * When called from ksgxd, sets the active memcg from one of the
+ * When charge_mm is NULL, sets the active memcg from one of the
  * mms in the enclave's mm_list prior to any backing page allocation,
  * in order to ensure that shmem page allocations are charged to the
  * enclave.  Create a backing page for loading data back into an EPC page with
@@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct 
sgx_encl *encl)
  *   -errno otherwise.
  */
 int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
-  struct sgx_backing *backing)
+  struct sgx_backing *backing, struct mm_struct 
*charge_mm)
 {
-   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, 
charge_mm);
struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
int ret;
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fe15ade02ca1..5ce9d108290f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, 
unsigned long addr,
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 unsigned long end, unsigned long vm_flags);
 
-bool current_is_ksgxd(void);
 void sgx_encl_release(struct kref *ref);
 int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
 const 

[PATCH v13 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.
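
Roughly, the allocation-side hook takes the following shape (a simplified
sketch only: the real hunk lands in sgx_alloc_epc_page() in main.c, which is
truncated in this archive, and sgx_get_current_cg()/sgx_put_cg() are assumed
helper names):

  struct sgx_cgroup *sgx_cg = sgx_get_current_cg();   /* assumed helper */
  int ret = sgx_cgroup_try_charge(sgx_cg);

  if (ret) {
          sgx_put_cg(sgx_cg);                          /* assumed helper */
          return ERR_PTR(ret);
  }
  /* ... then take a page off the free lists as before ... */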

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Remove unneeded includes. (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation
conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Update copyright and format better (Kai)
- Create wrappers to remove #ifdefs in c file. (Kai)
- Remove unneeded comments (Kai)

V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. 
(Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)

V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/kernel/cpu/sgx/Makefile |  1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 71 +++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 72 
 arch/x86/kernel/cpu/sgx/main.c   | 42 +++-
 arch/x86/kernel/cpu/sgx/sgx.h| 21 
 include/linux/misc_cgroup.h  |  2 +
 6 files changed, 207 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..081cb424575e 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
 obj-$(CONFIG_X86_SGX_KVM)  += virt.o
+obj-$(CONFIG_CGROUP_MISC)  += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index ..5c484fd10160
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg:The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+   return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg:The charged sgx cgroup.
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+   misc_cg_uncharge(MISC_CG_RES_SGX_EPC, 

[PATCH v13 04/14] cgroup/misc: Add SGX EPC resource type

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V12:
- Remove CONFIG_CGROUP_SGX_EPC (Jarkko)

V6:
- Split the original patch into this and the preceding one (Kai)
---
 include/linux/misc_cgroup.h | 4 
 kernel/cgroup/misc.c| 4 
 2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 541a5611c597..440ed2bb8053 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_X86_SGX
+   /* SGX EPC memory resource */
+   MISC_CG_RES_SGX_EPC,
 #endif
MISC_CG_RES_TYPES
 };
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 7d852139121a..863f9147602b 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
 #endif
+#ifdef CONFIG_X86_SGX
+   /* Intel SGX EPC memory bytes */
+   "sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
-- 
2.25.1




[PATCH v13 03/14] cgroup/misc: Export APIs for SGX driver

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its or its ancestor's limit. This requires a walk from the current cgroup up
to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to
enable this walk.

The SGX driver also needs to start a global-level reclamation from the
root. Export misc_cg_root() so the SGX driver can access it.
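
A rough sketch of the intended walk, starting from the cgroup that hit a
limit (the sgx_cgroup_* helpers are illustrative and not part of this patch):

  struct misc_cg *i;

  /* Walk from the limited cgroup up to (and including) the root. */
  for (i = cg; i; i = misc_cg_parent(i)) {
          if (sgx_cgroup_over_limit(i))           /* illustrative helper */
                  sgx_cgroup_reclaim_pages(i);    /* illustrative helper */
          if (i == misc_cg_root())
                  break;
  }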

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
 include/linux/misc_cgroup.h | 24 
 kernel/cgroup/misc.c| 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct 
cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+   return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+   return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 4a602d68cf7d..7d852139121a 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+   return _cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!amount)
return 0;
 
-   for (i = cg; i; i = parent_misc(i)) {
+   for (i = cg; i; i = misc_cg_parent(i)) {
res = >res[type];
 
new_usage = atomic64_add_return(amount, >usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
return 0;
 
 err_charge:
-   for (j = i; j; j = parent_misc(j)) {
+   for (j = i; j; j = misc_cg_parent(j)) {
atomic64_inc(>res[type].events);
cgroup_file_notify(>events_file);
}
 
-   for (j = cg; j != i; j = parent_misc(j))
+   for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
 
-   for (i = cg; i; i = parent_misc(i))
+   for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1




[PATCH v13 02/14] cgroup/misc: Add per resource callbacks for CSS events

2024-04-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource-type-specific actions for Cgroup Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.

This design passes a struct misc_cg into the callbacks. An alternative
of passing only a struct misc_res was considered, to encapsulate how
misc_cg stores its misc_res array. However, the SGX driver would still
need to access the misc_res array in the statically defined misc root_cg
during initialization to set resource-specific fields. Therefore, some
extra getter/setter APIs abstracting such access to the misc_res array
within the misc_cg struct would be needed. That seems like overkill at the
moment and is deferred to a later improvement when there are more users of
these callbacks.
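
For example, a resource provider registers its callbacks roughly as follows
(the sgx_cgroup_* names are illustrative, and MISC_CG_RES_SGX_EPC is the
resource type added elsewhere in this series):

  /* illustrative provider-side callbacks */
  static int sgx_cgroup_alloc(struct misc_cg *cg)
  {
          /* allocate per-cgroup state and hang it off @cg */
          return 0;
  }

  static void sgx_cgroup_free(struct misc_cg *cg)
  {
          /* tear the per-cgroup state down again */
  }

  static const struct misc_res_ops sgx_cgroup_ops = {
          .alloc = sgx_cgroup_alloc,
          .free  = sgx_cgroup_free,
  };

  /* from the provider's init path: */
  int err = misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);

  if (err)
          return err;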

Link: 
https://lore.kernel.org/lkml/op.2kdw36otwjv...@hhuan26-mobl.amr.corp.intel.com/

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V12:
- Add comments in commit to clarify reason to pass in misc_cg, not
misc_res. (Kai)
- Remove unlikely (Kai)

V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)

V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of 
required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
 include/linux/misc_cgroup.h | 11 +
 kernel/cgroup/misc.c| 82 +
 2 files changed, 86 insertions(+), 7 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;
 
 #include 
 
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is 
allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+};
+
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {
 
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 
amount);
 void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..4a602d68cf7d 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
  */
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
  * @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 
capacity)
 }
 EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
 
+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required 
callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+   if (!valid_type(type))
+   return -EINVAL;
+
+   if (!ops->alloc) {
+   pr_err("%s: alloc missing\n", __func__);
+   return -EINVAL;
+   }
+
+   if (!ops->free) {
+   pr_err("%s: free missing\n", __func__);
+   return -EINVAL;
+   }
+
+   misc_res_ops[type] = ops;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
 /**
  * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
  * @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
 };
 
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+   enum misc_res_type i;
+   int ret;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+   

[PATCH v13 00/14] Add Cgroup support for SGX EPC memory

2024-04-30 Thread Haitao Huang
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod-level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].
 
This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
 
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which are normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.
 
For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design. 
 
Track EPC pages in per-cgroup LRUs (from Dave)
--
 
tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes the existing LRU scanning algorithm, or to add new LRUs.
The result: Add a new LRU for each cgroup and scans those instead. Replace
the existing global cgroup with the root cgroup's LRU (only when this new
support is compiled in, obviously).
 
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
 
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
 
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
 
Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
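
In code terms, the per-cgroup tracking roughly takes this shape (the names are
only an approximation of the structures added later in the series):

  struct sgx_cgroup {
          struct misc_cg *cg;             /* backing misc cgroup */
          struct sgx_epc_lru_list lru;    /* reclaimable EPC pages of this cgroup */
  };

A cgroup that hits its limit then scans its own LRU (and those of its
descendants) instead of the single global list.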
 
Unreclaimable Enclave Pages
---
 
There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach and does
not track those in per-cgroup LRUs and considers them as unreclaimable
pages. The allocation of these pages is counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.
 
Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for the misc-controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed,
and the limit is only enforced at the time of a new EPC allocation request.
When a cgroup hits its limit but nothing is left in the LRUs of the subtree,
i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
within that cgroup will result in an 'ENOMEM'.
 
Unreclaimable Guest VM EPC Pages

 
The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable left within the cgroup. The virtual EPC driver 

[PATCH v13 01/14] x86/sgx: Replace boolean parameters with enums

2024-04-30 Thread Haitao Huang
Replace boolean parameters for 'reclaim' in the function
sgx_alloc_epc_page() and its callers with an enum.

Also opportunistically remove the non-static declaration of
__sgx_alloc_epc_page() and fix a typo.
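
The enum used in the hunks below looks roughly as follows (a sketch only; the
actual definition is added to sgx.h, whose hunk is truncated in this archive):

  enum sgx_reclaim {
          SGX_NO_RECLAIM,
          SGX_DO_RECLAIM,
  };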

Signed-off-by: Haitao Huang 
Suggested-by: Jarkko Sakkinen 
Suggested-by: Dave Hansen 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 12 ++--
 arch/x86/kernel/cpu/sgx/encl.h  |  4 ++--
 arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
 arch/x86/kernel/cpu/sgx/main.c  | 14 +++---
 arch/x86/kernel/cpu/sgx/sgx.h   | 13 +++--
 arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
 6 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..f474179b6f77 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct 
sgx_encl_page *encl_page,
struct sgx_epc_page *epc_page;
int ret;
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page))
return epc_page;
 
@@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct 
vm_area_struct *vma,
goto err_out_unlock;
}
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page)) {
if (PTR_ERR(epc_page) == -EBUSY)
vmret =  VM_FAULT_NOPAGE;
goto err_out_unlock;
}
 
-   va_page = sgx_encl_grow(encl, false);
+   va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM);
if (IS_ERR(va_page)) {
if (PTR_ERR(va_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
@@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
- * @reclaim: Reclaim EPC pages directly if none available. Enclave
- *   mutex should not be held if this is set.
+ * @reclaim: Whether reclaim EPC pages directly if none available. Enclave
+ *   mutex should not be held for SGX_DO_RECLAIM.
  *
  * Allocate a free EPC page and convert it to a Version Array (VA) page.
  *
@@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim)
 {
struct sgx_epc_page *epc_page;
int ret;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fe15ade02ca1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl 
*encl,
  unsigned long offset,
  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
 void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 unsigned long addr);
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..793a0ba2cb16 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -17,7 +17,7 @@
 #include "encl.h"
 #include "encls.h"
 
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim)
 {
struct sgx_va_page *va_page = NULL;
void *err;
@@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
struct file *backing;
long ret;
 
-   va_page = sgx_encl_grow(encl, true);
+   va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
if (IS_ERR(va_page))
return PTR_ERR(va_page);
else if (va_page)
@@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
 
encl->backing = backing;
 
-   secs_epc = sgx_alloc_epc_page(>secs, true);
+   secs_epc = 

[PATCH] virtio_net: Warn if insufficient queue length for transmitting

2024-04-30 Thread Darius Rad
The transmit queue is stopped when the number of free queue entries is less
than 2+MAX_SKB_FRAGS, in start_xmit().  If the queue length (QUEUE_NUM_MAX)
is less than this, transmission will immediately trigger a netdev
watchdog timeout.  Report this condition earlier and more directly.
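
For reference, the stop condition in start_xmit() that this check mirrors is
roughly the following (a simplified sketch, not the exact driver code):

  if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
          netif_stop_subqueue(dev, qnum);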

Signed-off-by: Darius Rad 
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 115c3c5414f2..72ee8473b61c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4917,6 +4917,9 @@ static int virtnet_probe(struct virtio_device *vdev)
set_bit(guest_offloads[i], >guest_offloads);
vi->guest_offloads_capable = vi->guest_offloads;
 
+   if (virtqueue_get_vring_size(vi->sq->vq) < 2 + MAX_SKB_FRAGS)
+   netdev_warn_once(dev, "not enough queue entries, expect xmit 
timeout\n");
+
pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
 dev->name, max_queue_pairs);
 
-- 
2.39.2




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread David Hildenbrand

On 26.04.24 21:55, Guillaume Morin wrote:

On 26 Apr  9:19, David Hildenbrand wrote:

A couple of points:

a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
to check PageAnonExclusive.

b) If you're not following the can_follow_write_pte/_pmd model, you are
doing something wrong :)

c) The code was heavily changed in mm/mm-unstable. It was merged with t
the common code.

Likely, in mm/mm-unstable, the existing can_follow_write_pte and
can_follow_write_pmd checks will already cover what you want in most cases.

We'd need a can_follow_write_pud() to cover follow_huge_pud() and
(unfortunately) something to handle follow_hugepd() as well similarly.

Copy-pasting what we do in can_follow_write_pte() and adjusting for
different PTE types is the right thing to do. Maybe now it's time to factor
out the common checks into a separate helper.


I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I'll have to have a closer look at some details (the hugepd writability 
check looks a bit odd), but it's mostly what I would have expected!


--
Cheers,

David / dhildenb




Re: [PATCH v7 6/6] remoteproc: qcom: enable in-kernel PD mapper

2024-04-30 Thread Chris Lew




On 4/26/2024 6:36 PM, Dmitry Baryshkov wrote:

On Sat, 27 Apr 2024 at 04:03, Chris Lew  wrote:




On 4/24/2024 2:28 AM, Dmitry Baryshkov wrote:

diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c 
b/drivers/remoteproc/qcom_q6v5_adsp.c
index 1d24c9b656a8..02d0c626b03b 100644
--- a/drivers/remoteproc/qcom_q6v5_adsp.c
+++ b/drivers/remoteproc/qcom_q6v5_adsp.c
@@ -23,6 +23,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 

@@ -375,10 +376,14 @@ static int adsp_start(struct rproc *rproc)
   int ret;
   unsigned int val;

- ret = qcom_q6v5_prepare(>q6v5);
+ ret = qcom_pdm_get();
   if (ret)
   return ret;


Would it make sense to try and model this as a rproc subdev? This
section of the remoteproc code seems to be focused on making specific
calls to setup and enable hardware resources, where as pd mapper is
software.

sysmon and ssr are also purely software and they are modeled as subdevs
in qcom_common. I'm not an expert on remoteproc organization but this
was just a thought.


Well, the issue is that the pd-mapper is a global, not a per-remoteproc instance



Both sysmon and ssr have some kind of global states that they manage 
too. Each subdev functionality tends to be a mix of per-remoteproc 
instance management and global state management.


If pd-mapper were completely global, it would be able to instantiate by
itself. Instead, instantiation is dependent on each remoteproc instance
properly getting and putting references.


The pdm subdev could manage the references to pd-mapper for that 
remoteproc instance.


On the other hand, I think Bjorn recommended this could be moved to 
probe time in v4. The v4 version was doing the reinitialization-dance, 
but I think the recommendation could still apply to this version.




Thanks!
Chris



+ ret = qcom_q6v5_prepare(>q6v5);
+ if (ret)
+ goto put_pdm;
+
   ret = adsp_map_carveout(rproc);
   if (ret) {
   dev_err(adsp->dev, "ADSP smmu mapping failed\n");
@@ -446,6 +451,8 @@ static int adsp_start(struct rproc *rproc)
   adsp_unmap_carveout(rproc);
   disable_irqs:
   qcom_q6v5_unprepare(>q6v5);
+put_pdm:
+ qcom_pdm_release();

   return ret;
   }









Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread Guillaume Morin
On 30 Apr 20:21, David Hildenbrand wrote:
> Sorry for not replying earlier, was busy with other stuff. I'll try getiing
> that stuff into shape and send it out soonish.

No worries. Let me know what you think of the FOLL_FORCE patch when you
have a sec.

> > I went with using one write uprobe function with some additional
> > branches. I went back and forth between that and making them 2 different
> > functions.
> 
> All the folio_test_hugetlb() special casing is a bit suboptimal. Likely we
> want a separate variant, because we should be using hugetlb PTE functions
> consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), softdirty does not
> exist etc.)

Ok, I'll give this a whirl and send something probably tomorrow.

> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 2f4e88552d3f..8a33e380f7ea 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -83,6 +83,10 @@ static const struct fs_parameter_spec 
> > hugetlb_fs_parameters[] = {
> > {}
> >   };
> > +bool hugetlbfs_mapping(struct address_space *mapping) {
> > +   return mapping->a_ops == _aops;
> 
> is_vm_hugetlb_page() might be what you are looking for.

I use hugetlbfs_mapping() in __uprobe_register() which does an early
return using the mapping only if it's not supported. There is no vma
that I can get at this point (afaict).

I could refactor so we check this when we have a vma but it looked
cleaner to introduce it since there is already shmem_mapping()

> >   }
> > -static void copy_from_page(struct page *page, unsigned long vaddr, void 
> > *dst, int len)
> > +static void copy_from_page(struct page *page, unsigned long vaddr, void 
> > *dst, int len, unsigned long page_mask)
> >   {
> > void *kaddr = kmap_atomic(page);
> > -   memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
> > +   memcpy(dst, kaddr + (vaddr & ~page_mask), len);
> > kunmap_atomic(kaddr);
> >   }
> 
> > -static void copy_to_page(struct page *page, unsigned long vaddr, const 
> > void *src, int len)
> > +static void copy_to_page(struct page *page, unsigned long vaddr, const 
> > void *src, int len, unsigned long page_mask)
> >   {
> > void *kaddr = kmap_atomic(page);
> > -   memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
> > +   memcpy(kaddr + (vaddr & ~page_mask), src, len);
> > kunmap_atomic(kaddr);
> >   }
> 
> These two changes really are rather ugly ...
> 
> An why are they even required? We get a PAGE_SIZED-based subpage of a
> hugetlb page. We only kmap that one and copy within that one.
> 
> In other words, I don't think the copy_from_page() and copy_to_page()
> changes are even required when we consistently work on subpages and not
> suddenly on head pages.

The main reason is that the previous __replace_page worked directly on the full
HP page so adjusting after gup seemed to make more sense to me. But
now I guess it's not that useful (esp we're going with a different
version of write_uprobe). I'll fix

(...)
> >   {
> > struct uwo_data *data = walk->private;;
> > const bool is_register = !!is_swbp_insn(>opcode);
> > @@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned 
> > long vaddr,
> > /* Unmap + flush the TLB, such that we can write atomically .*/
> > flush_cache_page(vma, vaddr, pte_pfn(pte));
> > -   pte = ptep_clear_flush(vma, vaddr, ptep);
> > +   if (folio_test_hugetlb(folio))
> > +   pte = huge_ptep_clear_flush(vma, vaddr, ptep);
> > +   else
> > +   pte = ptep_clear_flush(vma, vaddr, ptep);
> > copy_to_page(page, data->opcode_vaddr, >opcode,
> > -UPROBE_SWBP_INSN_SIZE);
> > +UPROBE_SWBP_INSN_SIZE, page_mask);
> > /* When unregistering, we may only zap a PTE if uffd is disabled ... */
> > if (is_register || userfaultfd_missing(vma))
> > @@ -443,13 +448,18 @@ static int __write_opcode_pte(pte_t *ptep, unsigned 
> > long vaddr,
> > if (!identical || folio_maybe_dma_pinned(folio))
> > goto remap;
> > -   /* Zap it and try to reclaim swap space. */
> > -   dec_mm_counter(mm, MM_ANONPAGES);
> > -   folio_remove_rmap_pte(folio, page, vma);
> > -   if (!folio_mapped(folio) && folio_test_swapcache(folio) &&
> > -folio_trylock(folio)) {
> > -   folio_free_swap(folio);
> > -   folio_unlock(folio);
> > +   if (folio_test_hugetlb(folio)) {
> > +   hugetlb_remove_rmap(folio);
> > +   large = false;
> > +   } else {
> > +   /* Zap it and try to reclaim swap space. */
> > +   dec_mm_counter(mm, MM_ANONPAGES);
> > +   folio_remove_rmap_pte(folio, page, vma);
> > +   if (!folio_mapped(folio) && folio_test_swapcache(folio) &&
> > +   folio_trylock(folio)) {
> > +   folio_free_swap(folio);
> > +   folio_unlock(folio);
> > +   }
> > }
> > folio_put(folio);
> > @@ -461,11 +471,29 @@ static int __write_opcode_pte(pte_t *ptep, unsigned 
> > long vaddr,
> >  */
> > smp_wmb();
> 

[PATCH] eventfs/tracing: Add callback for release of an eventfs_inode

2024-04-30 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Synthetic events create and destroy tracefs files when they are created
and removed. The tracing subsystem has its own file descriptor
representing the state of the events attached to the tracefs files.
There's a race between the eventfs files and this file descriptor of the
tracing system where the following can cause an issue:

With two scripts 'A' and 'B' doing:

  Script 'A':
echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
while :
do
  echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
done

  Script 'B':
echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero
into its enable file.

Script 'B' removes all synthetic events (including the newly created
"hello" event).

What happens is that the opening of the "enable" file has:

 {
struct trace_event_file *file = inode->i_private;
int ret;

ret = tracing_check_open_get_tr(file->tr);
 [..]

But deleting the events frees the "file" descriptor, and a "use after
free" happens with the dereference at "file->tr".

The file descriptor does have a reference counter, but there needs to be a
way to decrement it from eventfs when the eventfs_inode that represents
this file descriptor is removed.

Add an optional "release" callback to the eventfs_entry array structure
that gets called when the eventfs file is about to be removed. This allows
the creator of the eventfs file to increment the tracing file descriptor's
ref counter. When the eventfs file is deleted, the release callback is
invoked and calls the put function for the tracing file descriptor.

This protects the tracing file from being freed while an eventfs file
that references it is being opened.
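
A minimal sketch of the creation side that pairs with the release callback
(assuming the existing event_file_get() helper; the corresponding
event_create_dir() hunk is truncated in this archive):

  /* in event_create_dir(), once the eventfs entries are set up: */
  event_file_get(file);   /* dropped again by event_release() on removal */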

Link: 
https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/

Cc: sta...@vger.kernel.org
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: Tze-nan wu 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c|  7 +++
 include/linux/tracefs.h |  3 +++
 kernel/trace/trace_events.c | 12 
 3 files changed, 22 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..dc97c19f9e0a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
+   const struct eventfs_entry *entry;
struct eventfs_root_inode *rei;
 
WARN_ON_ONCE(!ei->is_freed);
 
+   for (int i = 0; i < ei->nr_entries; i++) {
+   entry = >entries[i];
+   if (entry->release)
+   entry->release(entry->name, ei->data);
+   }
+
kfree(ei->entry_attrs);
kfree_const(ei->name);
if (ei->is_events) {
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);
 
+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:  Then name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t 
*mode, void **data,
 struct eventfs_entry {
const char  *name;
eventfs_callbackcallback;
+   eventfs_release release;
 };
 
 struct eventfs_inode;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 52f75c36bbca..6ef29eba90ce 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t 
*mode, void **data,
return 0;
 }
 
+/* The file is incremented on creation and freeing the enable file decrements 
it */
+static void event_release(const char *name, void *data)
+{
+   struct trace_event_file *file = data;
+
+   event_file_put(file);
+}
+
 static int
 event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 {
@@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct 
trace_event_file *file)
{
.name   = "enable",
.callback   = event_callback,
+   .release= event_release,
},
{
.name   = "filter",
@@ -2634,6 +2643,9 @@ event_create_dir(struct eventfs_inode *parent, struct 
trace_event_file *file)
return ret;
}
 
+   /* 

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread David Hildenbrand

On 30.04.24 17:22, Guillaume Morin wrote:

On 26 Apr 21:55, Guillaume Morin wrote:


On 26 Apr  9:19, David Hildenbrand wrote:

A couple of points:

a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
to check PageAnonExclusive.

b) If you're not following the can_follow_write_pte/_pmd model, you are
doing something wrong :)

c) The code was heavily changed in mm/mm-unstable. It was merged with t
the common code.

Likely, in mm/mm-unstable, the existing can_follow_write_pte and
can_follow_write_pmd checks will already cover what you want in most cases.

We'd need a can_follow_write_pud() to cover follow_huge_pud() and
(unfortunately) something to handle follow_hugepd() as well similarly.

Copy-pasting what we do in can_follow_write_pte() and adjusting for
different PTE types is the right thing to do. Maybe now it's time to factor
out the common checks into a separate helper.


I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I got it working on top of your uprobes-cow branch with the foll force
patch sent friday. Still pretty lightly tested


Sorry for not replying earlier, was busy with other stuff. I'll try 
getiing that stuff into shape and send it out soonish.




I went with using one write uprobe function with some additional
branches. I went back and forth between that and making them 2 different
functions.


All the folio_test_hugetlb() special casing is a bit suboptimal. Likely 
we want a separate variant, because we should be using hugetlb PTE 
functions consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), 
softdirty does not exist etc.)




diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec 
hugetlb_fs_parameters[] = {
{}
  };
  
+bool hugetlbfs_mapping(struct address_space *mapping) {

+   return mapping->a_ops == _aops;


is_vm_hugetlb_page() might be what you are looking for.

[...]


  }
  
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)

+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, 
int len, unsigned long page_mask)
  {
void *kaddr = kmap_atomic(page);
-   memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+   memcpy(dst, kaddr + (vaddr & ~page_mask), len);
kunmap_atomic(kaddr);
  }


  
-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)

+static void copy_to_page(struct page *page, unsigned long vaddr, const void 
*src, int len, unsigned long page_mask)
  {
void *kaddr = kmap_atomic(page);
-   memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+   memcpy(kaddr + (vaddr & ~page_mask), src, len);
kunmap_atomic(kaddr);
  }


These two changes really are rather ugly ...

An why are they even required? We get a PAGE_SIZED-based subpage of a 
hugetlb page. We only kmap that one and copy within that one.


In other words, I don't think the copy_from_page() and copy_to_page() 
changes are even required when we consistently work on subpages and not 
suddenly on head pages.


  
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)

+static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode, unsigned long page_mask)
  {
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long 
vaddr, uprobe_opcode_t
 * is a trap variant; uprobes always wins over any other (gdb)
 * breakpoint.
 */
-   copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE);
+   copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE,
+  page_mask);
is_swbp = is_swbp_insn(_opcode);
  
  	if (is_swbp_insn(new_opcode)) {

@@ -376,8 +378,8 @@ struct uwo_data {
uprobe_opcode_t opcode;
  };
  
-static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,

-   unsigned long next, struct mm_walk *walk)
+static int __write_opcode(pte_t *ptep, unsigned long vaddr,
+ unsigned long page_mask, struct mm_walk *walk)



Unrelated alignment change.


  {
struct uwo_data *data = walk->private;;
const bool is_register = !!is_swbp_insn(>opcode);
@@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long 
vaddr,
  
  	/* Unmap + flush the TLB, such that we can write atomically .*/

flush_cache_page(vma, vaddr, pte_pfn(pte));
-   pte = ptep_clear_flush(vma, vaddr, ptep);
+   if (folio_test_hugetlb(folio))
+   pte = huge_ptep_clear_flush(vma, vaddr, ptep);
+   else
+   pte = ptep_clear_flush(vma, vaddr, ptep);
copy_to_page(page, data->opcode_vaddr, >opcode,
-  

Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >  static int vhost_task_fn(void *data)
> >  {
> > struct vhost_task *vtsk = data;
> > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > schedule();
> > }
> >  
> > -   mutex_lock(>exit_mutex);
> > +   mutex_lock(_mutex);
> > /*
> >  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >  * When the vhost layer has called vhost_task_stop it's already stopped
> > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > vtsk->handle_sigkill(vtsk->data);
> > }
> > complete(>exited);
> > -   mutex_unlock(>exit_mutex);
> > +   mutex_unlock(_mutex);
> >  
> 
> Edward, thanks for the patch. I think though I just needed to swap the
> order of the calls above.
> 
> Instead of:
> 
> complete(>exited);
> mutex_unlock(>exit_mutex);
> 
> it should have been:
> 
> mutex_unlock(>exit_mutex);
> complete(>exited);
> 
> If my analysis is correct, then Michael do you want me to resubmit a
> patch on top of your vhost branch or resubmit the entire patchset?

Resubmit all please.

-- 
MST




[PATCH] scripts/spdxcheck: Add count of missing files to stats output

2024-04-30 Thread Bird, Tim
Add a count of files missing an SPDX header to the stats
output. This provides useful detail when working on SPDX
header additions.

Signed-off-by: Tim Bird 
---
 scripts/spdxcheck.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py
index 18cb9f5b3d3d..8b8fb115fc81 100755
--- a/scripts/spdxcheck.py
+++ b/scripts/spdxcheck.py
@@ -412,6 +412,9 @@ if __name__ == '__main__':
 if parser.checked:
 pc = int(100 * parser.spdx_valid / parser.checked)
 sys.stderr.write('Files with SPDX:   %12d %3d%%\n' 
%(parser.spdx_valid, pc))
+missing = parser.checked - parser.spdx_valid
+mpc = int(100 * missing / parser.checked)
+sys.stderr.write('Files without SPDX:%12d %3d%%\n' 
%(missing, mpc))
 sys.stderr.write('Files with errors: %12d\n' 
%parser.spdx_errors)
 if ndirs:
 sys.stderr.write('\n')
-- 
2.25.1




[PATCH v3 4/4] dax/bus.c: Use the right locking mode (read vs write) in size_show

2024-04-30 Thread Vishal Verma
In size_show(), dax_dev_rwsem only needs to be taken for read, but was
being acquired for write. Change it to down_read_interruptible() so the
function doesn't unnecessarily hold a write lock.

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Cc: Dan Williams 
Reviewed-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 0011a6e6a8f2..f24b67c64d5e 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -937,11 +937,11 @@ static ssize_t size_show(struct device *dev,
unsigned long long size;
int rc;
 
-   rc = down_write_killable(_dev_rwsem);
+   rc = down_read_interruptible(_dev_rwsem);
if (rc)
return rc;
size = dev_dax_size(dev_dax);
-   up_write(_dev_rwsem);
+   up_read(_dev_rwsem);
 
return sysfs_emit(buf, "%llu\n", size);
 }

-- 
2.44.0




[PATCH v3 3/4] dax/bus.c: Don't use down_write_killable for non-user processes

2024-04-30 Thread Vishal Verma
Change an instance of down_write_killable() to a simple down_write() where
there is no user process that might want to interrupt the operation.

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Reported-by: Dan Williams 
Reviewed-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index e2c7354ce328..0011a6e6a8f2 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1540,12 +1540,8 @@ static struct dev_dax *__devm_create_dev_dax(struct 
dev_dax_data *data)
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dev_dax *dev_dax;
-   int rc;
-
-   rc = down_write_killable(_region_rwsem);
-   if (rc)
-   return ERR_PTR(rc);
 
+   down_write(_region_rwsem);
dev_dax = __devm_create_dev_dax(data);
up_write(_region_rwsem);
 

-- 
2.44.0




[PATCH v3 1/4] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-30 Thread Vishal Verma
In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
referenced patch should be replaced with lockdep_assert_held() or
lockdep_assert_held_write(). Replace these as appropriate.

Link: 
https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch
 [1]
Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Cc: Andrew Morton 
Reported-by: Dan Williams 
Reviewed-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 797e1ebff299..7924dd542a13 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -192,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax)
u64 size = 0;
int i;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_dev_rwsem));
+   lockdep_assert_held(_dev_rwsem);
 
for (i = 0; i < dev_dax->nr_range; i++)
size += range_len(_dax->ranges[i].range);
@@ -302,7 +302,7 @@ static unsigned long long dax_region_avail_size(struct 
dax_region *dax_region)
resource_size_t size = resource_size(_region->res);
struct resource *res;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held(_region_rwsem);
 
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
@@ -447,7 +447,7 @@ static void trim_dev_dax_range(struct dev_dax *dev_dax)
struct range *range = _dax->ranges[i].range;
struct dax_region *dax_region = dev_dax->region;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
dev_dbg(_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);
@@ -507,7 +507,7 @@ static int __free_dev_dax_id(struct dev_dax *dev_dax)
struct dax_region *dax_region;
int rc = dev_dax->id;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_dev_rwsem));
+   lockdep_assert_held_write(_dev_rwsem);
 
if (!dev_dax->dyn_id || dev_dax->id < 0)
return -1;
@@ -713,7 +713,7 @@ static void __unregister_dax_mapping(void *data)
 
dev_dbg(dev, "%s\n", __func__);
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
dev_dax->ranges[mapping->range_id].mapping = NULL;
mapping->range_id = -1;
@@ -830,7 +830,7 @@ static int devm_register_dax_mapping(struct dev_dax 
*dev_dax, int range_id)
struct device *dev;
int rc;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
if (dev_WARN_ONCE(_dax->dev, !dax_region->dev->driver,
"region disabled\n"))
@@ -876,7 +876,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 
start,
struct resource *alloc;
int i, rc;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
/* handle the seed alloc special case */
if (!size) {
@@ -935,7 +935,7 @@ static int adjust_dev_dax_range(struct dev_dax *dev_dax, 
struct resource *res, r
struct device *dev = _dax->dev;
int rc;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
if (dev_WARN_ONCE(dev, !size, "deletion is handled by 
dev_dax_shrink\n"))
return -EINVAL;

-- 
2.44.0




[PATCH v3 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-30 Thread Vishal Verma
Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
aimed to undo device_lock() abuses for protecting changes to dax-driver
internal data-structures like the dax_region resource tree and
device-dax-instance range structures. However, the device_lock() was
legitimately enforcing that devices to be deleted were not currently
attached to any driver nor assigned any capacity from the region.

As a result of the device_lock restoration in delete_store(), the
conditional locking in unregister_dev_dax() and unregister_dax_mapping()
can be removed.

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Reported-by: Dan Williams 
Reviewed-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 42 --
 1 file changed, 8 insertions(+), 34 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 7924dd542a13..e2c7354ce328 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -465,26 +465,17 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
trim_dev_dax_range(dev_dax);
 }
 
-static void __unregister_dev_dax(void *dev)
+static void unregister_dev_dax(void *dev)
 {
struct dev_dax *dev_dax = to_dev_dax(dev);
 
dev_dbg(dev, "%s\n", __func__);
 
+   down_write(_region_rwsem);
kill_dev_dax(dev_dax);
device_del(dev);
free_dev_dax_ranges(dev_dax);
put_device(dev);
-}
-
-static void unregister_dev_dax(void *dev)
-{
-   if (rwsem_is_locked(_region_rwsem))
-   return __unregister_dev_dax(dev);
-
-   if (WARN_ON_ONCE(down_write_killable(_region_rwsem) != 0))
-   return;
-   __unregister_dev_dax(dev);
up_write(_region_rwsem);
 }
 
@@ -560,15 +551,10 @@ static ssize_t delete_store(struct device *dev, struct 
device_attribute *attr,
if (!victim)
return -ENXIO;
 
-   rc = down_write_killable(_region_rwsem);
-   if (rc)
-   return rc;
-   rc = down_write_killable(_dev_rwsem);
-   if (rc) {
-   up_write(_region_rwsem);
-   return rc;
-   }
+   device_lock(dev);
+   device_lock(victim);
dev_dax = to_dev_dax(victim);
+   down_write(_dev_rwsem);
if (victim->driver || dev_dax_size(dev_dax))
rc = -EBUSY;
else {
@@ -589,11 +575,12 @@ static ssize_t delete_store(struct device *dev, struct 
device_attribute *attr,
rc = -EBUSY;
}
up_write(_dev_rwsem);
+   device_unlock(victim);
 
/* won the race to invalidate the device, clean it up */
if (do_del)
devm_release_action(dev, unregister_dev_dax, victim);
-   up_write(_region_rwsem);
+   device_unlock(dev);
put_device(victim);
 
return rc;
@@ -705,7 +692,7 @@ static void dax_mapping_release(struct device *dev)
put_device(parent);
 }
 
-static void __unregister_dax_mapping(void *data)
+static void unregister_dax_mapping(void *data)
 {
struct device *dev = data;
struct dax_mapping *mapping = to_dax_mapping(dev);
@@ -713,25 +700,12 @@ static void __unregister_dax_mapping(void *data)
 
dev_dbg(dev, "%s\n", __func__);
 
-   lockdep_assert_held_write(_region_rwsem);
-
dev_dax->ranges[mapping->range_id].mapping = NULL;
mapping->range_id = -1;
 
device_unregister(dev);
 }
 
-static void unregister_dax_mapping(void *data)
-{
-   if (rwsem_is_locked(_region_rwsem))
-   return __unregister_dax_mapping(data);
-
-   if (WARN_ON_ONCE(down_write_killable(_region_rwsem) != 0))
-   return;
-   __unregister_dax_mapping(data);
-   up_write(_region_rwsem);
-}
-
 static struct dev_dax_range *get_dax_range(struct device *dev)
 {
struct dax_mapping *mapping = to_dax_mapping(dev);

-- 
2.44.0




Re: [PATCH v3 0/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1

2024-04-30 Thread Mathieu Poirier
On Tue, Apr 30, 2024 at 04:23:05PM +0530, Beleswar Padhi wrote:
> PSC controller has a limitation that it can only power-up the second core
> when the first core is in ON state. Power-state for core0 should be equal
> to or higher than core1, else the kernel is seen hanging during rproc
> loading.
> 
> Make the powering up of cores sequential, by waiting for the current core
> to power-up before proceeding to the next core, with a timeout of 2sec.
> Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait
> for the current core to be released from reset before proceeding with the
> next core.
> 
> Also, ensure that core1 can not be powered on before core0 when starting
> cores from sysfs. Similarly, ensure that core0 can not be shutdown
> before core1 from sysfs.
> 
> v3: Changelog:
> 1) Added my own Signed-off-by in PATCH 1, specifying the changes done
> 2) Addressed changes requested by adding comments in code in PATCH 1
> 
> Link to v2:
> https://lore.kernel.org/all/20240424130504.494916-1-b-pa...@ti.com/
> 
> v2: Changelog:
> 1) Fixed multi-line comment format
> 2) Included root cause of bug in comments
> 3) Added a patch to ensure power-up/shutdown is sequential via sysfs
> 
> Link to v1:
> https://lore.kernel.org/all/20230906124756.3480579-1-a-nan...@ti.com/
> 
> Apurva Nandan (1):
>   remoteproc: k3-r5: Wait for core0 power-up before powering up core1
> 
> Beleswar Padhi (1):
>   remoteproc: k3-r5: Do not allow core1 to power up before core0 via
> sysfs
> 
>  drivers/remoteproc/ti_k3_r5_remoteproc.c | 56 +++-
>  1 file changed, 54 insertions(+), 2 deletions(-)

Applied - thanks,
Mathieu

> 
> -- 
> 2.34.1
> 



Re: [PATCH v4 0/4] Support MT8188 SCP core 1

2024-04-30 Thread Mathieu Poirier
On Tue, Apr 30, 2024 at 09:15:30AM +0800, Olivia Wen wrote:
> Change in v4:
> Updating the description of PATCH v4 4/4.
> 
> Olivia Wen (4):
>   dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
>   remoteproc: mediatek: Support MT8188 SCP core 1
>   remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes
>   remoteproc: mediatek: Add IMGSYS IPI command
> 
>  .../devicetree/bindings/remoteproc/mtk,scp.yaml|   2 +
>  drivers/remoteproc/mtk_common.h|  11 +-
>  drivers/remoteproc/mtk_scp.c   | 230 
> +++--
>  drivers/remoteproc/mtk_scp_ipi.c   |   7 +-
>  include/linux/remoteproc/mtk_scp.h |   1 +
>  5 files changed, 225 insertions(+), 26 deletions(-)

I have applied this set.

Thanks,
Mathieu

> 
> -- 
> 2.6.4
> 



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-30 Thread Andrii Nakryiko
On Tue, Apr 30, 2024 at 6:32 AM Masami Hiramatsu  wrote:
>
> On Mon, 29 Apr 2024 13:25:04 -0700
> Andrii Nakryiko  wrote:
>
> > On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu  
> > wrote:
> > >
> > > Hi Andrii,
> > >
> > > On Thu, 25 Apr 2024 13:31:53 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > Hey Masami,
> > > >
> > > > I can't really review most of that code as I'm completely unfamiliar
> > > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > > a few comments where there were somewhat more obvious BPF-related
> > > > pieces.
> > > >
> > > > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > > > and then with your series applied on top. Just to see if there are any
> > > > regressions. I think it will be a useful data point for you.
> > >
> > > Thanks for testing!
> > >
> > > >
> > > > You should be already familiar with the bench tool we have in BPF
> > > > selftests (I used it on some other patches for your tree).
> > >
> > > What patches we need?
> > >
> >
> > You mean for this `bench` tool? They are part of BPF selftests (under
> > tools/testing/selftests/bpf), you can build them by running:
> >
> > $ make RELEASE=1 -j$(nproc) bench
> >
> > After that you'll get a self-container `bench` binary, which has all
> > the self-contained benchmarks.
> >
> > You might also find a small script (benchs/run_bench_trigger.sh inside
> > BPF selftests directory) helpful, it collects final summary of the
> > benchmark run and optionally accepts a specific set of benchmarks. So
> > you can use it like this:
> >
> > $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> > kprobe :   18.731 ± 0.639M/s
> > kprobe-multi   :   23.938 ± 0.612M/s
> >
> > By default it will run a wider set of benchmarks (no uprobes, but a
> > bunch of extra fentry/fexit tests and stuff like this).
>
> origin:
> # benchs/run_bench_trigger.sh
> kretprobe :1.329 ± 0.007M/s
> kretprobe-multi:1.341 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.288 ± 0.014M/s
> kretprobe-multi:1.365 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.329 ± 0.002M/s
> kretprobe-multi:1.331 ± 0.011M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.311 ± 0.003M/s
> kretprobe-multi:1.318 ± 0.002M/s
>
> patched:
>
> # benchs/run_bench_trigger.sh
> kretprobe :1.274 ± 0.003M/s
> kretprobe-multi:1.397 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.307 ± 0.002M/s
> kretprobe-multi:1.406 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.279 ± 0.004M/s
> kretprobe-multi:1.330 ± 0.014M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.256 ± 0.010M/s
> kretprobe-multi:1.412 ± 0.003M/s
>
> Hmm, in my case, the differences seem smaller (~3%?).
> I attached perf report results for those, but I don't see a large difference.

I ran my benchmarks on a bare-metal machine (quite a powerful one at that;
you can see my numbers are almost 10x of yours), with mitigations
disabled, no retpolines, etc. If you have any of those mitigations it
might result in smaller differences, probably. If you are running
inside QEMU/VM, the results might differ significantly as well.

>
> > > >
> > > > BASELINE
> > > > 
> > > > kprobe :   24.634 ± 0.205M/s
> > > > kprobe-multi   :   28.898 ± 0.531M/s
> > > > kretprobe  :   10.478 ± 0.015M/s
> > > > kretprobe-multi:   11.012 ± 0.063M/s
> > > >
> > > > THIS PATCH SET ON TOP
> > > > =
> > > > kprobe :   25.144 ± 0.027M/s (+2%)
> > > > kprobe-multi   :   28.909 ± 0.074M/s
> > > > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > > > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
> > >
> > > This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> > > eventually because it should be a single callback version of
> > > kretprobe-multi.
>
> I ran another benchmark (prctl loop, attached), the origin kernel result is 
> here;
>
> # sh ./benchmark.sh
> count = 1000, took 6.748133 sec
>
> And the patched kernel result;
>
> # sh ./benchmark.sh
> count = 1000, took 6.644095 sec
>
> I confirmed that the perf result shows no big difference.
>
> Thank you,
>
>
> > >
> > > >
> > > > These numbers are pretty stable and look to be more or less 
> > > > representative.
> > > >
> > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > > > about the same, though.
> > > >
> > > > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > > > slower, almost by 10%. Not sure why, but looks real after re-running
> > > > benchmarks a bunch of times and getting stable results.
> > >
> > > Hmm, kretprobe on x86 should use ftrace + rethook even with my series.
> > > So nothing should be changed. Maybe cache access pattern has been
> > > changed?
> > > I'll check it with tracefs (to remove the effect from bpf related changes)
> > >
> > > >
> > > > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > > > Again, I 

Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Mike Christie
On 4/30/24 8:05 AM, Edward Adam Davis wrote:
>  static int vhost_task_fn(void *data)
>  {
>   struct vhost_task *vtsk = data;
> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
>   schedule();
>   }
>  
> - mutex_lock(&vtsk->exit_mutex);
> + mutex_lock(&exit_mutex);
>   /*
>* If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
>* When the vhost layer has called vhost_task_stop it's already stopped
> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
>   vtsk->handle_sigkill(vtsk->data);
>   }
>   complete(&vtsk->exited);
> - mutex_unlock(&vtsk->exit_mutex);
> + mutex_unlock(&exit_mutex);
>  

Edward, thanks for the patch. I think though I just needed to swap the
order of the calls above.

Instead of:

complete(&vtsk->exited);
mutex_unlock(&vtsk->exit_mutex);

it should have been:

mutex_unlock(&vtsk->exit_mutex);
complete(&vtsk->exited);

If my analysis is correct, then Michael do you want me to resubmit a
patch on top of your vhost branch or resubmit the entire patchset?




Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread Guillaume Morin
On 26 Apr 21:55, Guillaume Morin wrote:
>
> On 26 Apr  9:19, David Hildenbrand wrote:
> > A couple of points:
> > 
> > a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
> > to check PageAnonExclusive.
> > 
> > b) If you're not following the can_follow_write_pte/_pmd model, you are
> > doing something wrong :)
> > 
> > c) The code was heavily changed in mm/mm-unstable. It was merged with t
> > the common code.
> > 
> > Likely, in mm/mm-unstable, the existing can_follow_write_pte and
> > can_follow_write_pmd checks will already cover what you want in most cases.
> > 
> > We'd need a can_follow_write_pud() to cover follow_huge_pud() and
> > (unfortunately) something to handle follow_hugepd() as well similarly.
> > 
> > Copy-pasting what we do in can_follow_write_pte() and adjusting for
> > different PTE types is the right thing to do. Maybe now it's time to factor
> > out the common checks into a separate helper.
> 
> I tried to get the hugepd stuff right but this was the first I heard
> about it :-) Afaict follow_huge_pmd and friends were already DTRT

I got it working on top of your uprobes-cow branch with the foll force
patch sent Friday. It is still pretty lightly tested.

I went with using one write uprobe function with some additional
branches. I went back and forth between that and making them 2 different
functions.
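
For reference, here is a minimal sketch of the masking idea the diff below
relies on (my own illustration, not part of the patch): the offset of the
probed address inside its backing page has to be computed with the huge
page's mask rather than PAGE_MASK. is_vm_hugetlb_page(), hstate_vma() and
huge_page_mask() are existing helpers; the wrapper name and the assumption
that <linux/hugetlb.h> and <linux/mm.h> are already included are mine.

	/* Illustration only: offset of vaddr inside the (possibly huge) page. */
	static unsigned long probe_page_offset(struct vm_area_struct *vma,
					       unsigned long vaddr)
	{
		unsigned long page_mask = PAGE_MASK;

		if (is_vm_hugetlb_page(vma))
			page_mask = huge_page_mask(hstate_vma(vma));

		return vaddr & ~page_mask;
	}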

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec 
hugetlb_fs_parameters[] = {
{}
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping) {
+   return mapping->a_ops == &hugetlbfs_aops;
+}
+
 /*
  * Mask used when checking the page offset value passed in via system
  * calls.  This value will be converted to a loff_t which is signed.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 68244bb3637a..66fdcc0b5db2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,6 +511,8 @@ struct hugetlbfs_sb_info {
umode_t mode;
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping);
+
 static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 {
return sb->s_fs_info;
@@ -557,6 +559,8 @@ static inline struct hstate *hstate_inode(struct inode *i)
 {
return NULL;
 }
+
+static inline bool hugetlbfs_mapping(struct address_space *mapping) { return 
false; }
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4237a7477cca..e6e93a574c39 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -11,6 +11,7 @@
 
 #include <linux/kernel.h>
 #include <linux/highmem.h>
+#include <linux/hugetlb.h>
 #include <linux/pagemap.h> /* read_mapping_page */
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -120,7 +121,7 @@ struct xol_area {
  */
 static bool valid_vma(struct vm_area_struct *vma, bool is_register)
 {
-   vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
+   vm_flags_t flags = VM_MAYEXEC | VM_MAYSHARE;
 
if (is_register)
flags |= VM_WRITE;
@@ -163,21 +164,21 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
return is_swbp_insn(insn);
 }
 
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, 
int len)
+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, 
int len, unsigned long page_mask)
 {
void *kaddr = kmap_atomic(page);
-   memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+   memcpy(dst, kaddr + (vaddr & ~page_mask), len);
kunmap_atomic(kaddr);
 }
 
-static void copy_to_page(struct page *page, unsigned long vaddr, const void 
*src, int len)
+static void copy_to_page(struct page *page, unsigned long vaddr, const void 
*src, int len, unsigned long page_mask)
 {
void *kaddr = kmap_atomic(page);
-   memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+   memcpy(kaddr + (vaddr & ~page_mask), src, len);
kunmap_atomic(kaddr);
 }
 
-static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode)
+static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode, unsigned long page_mask)
 {
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long 
vaddr, uprobe_opcode_t
 * is a trap variant; uprobes always wins over any other (gdb)
 * breakpoint.
 */
-   copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
+   copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE,
+  page_mask);
is_swbp = is_swbp_insn(&old_opcode);
 
if (is_swbp_insn(new_opcode)) {
@@ -376,8 +378,8 @@ struct uwo_data {
uprobe_opcode_t opcode;
 };
 
-static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,
-   unsigned long next, struct mm_walk *walk)
+static int __write_opcode(pte_t *ptep, 

Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-04-30 Thread Dmitrii Kuvaiskii
On Mon, Apr 29, 2024 at 04:11:03PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to add and remove the same enclave page
> > simultaneously (e.g., if the SGX runtime supports both lazy allocation
> > and `MADV_DONTNEED` semantics). Consider this race:
> >
> > 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
> >after removing the page table entry and right before re-acquiring the
> >enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
> > 2. T2 tries to access the page, and #PF[not_present] is raised. The
> >condition to EAUG in sgx_vma_fault() is not satisfied because the
> >page is still present in encl->page_array, thus the SGX driver
> >assumes that the fault happened because the page was swapped out. The
> >driver continues on a code path that installs a page table entry
> >*without* performing EAUG.
> > 3. The enclave page metadata is in inconsistent state: the PTE is
> >installed but there was no EAUG. Thus, T2 in userspace infinitely
> >receives SIGSEGV on this page (and EACCEPT always fails).
> >
> > Fix this by making sure that T1 (the page-removing thread) always wins
> > this data race. In particular, the page-being-removed is marked as such,
> > and T2 retries until the page is fully removed.
> >
> > Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Dmitrii Kuvaiskii 
> > ---
> >  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
> >  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
> >  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
> >  3 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 41f14b1a3025..7ccd8b2fce5f 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -257,7 +257,8 @@ static struct sgx_encl_page 
> > *__sgx_encl_load_page(struct sgx_encl *encl,
> >  
> > /* Entry successfully located. */
> > if (entry->epc_page) {
> > -   if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> > +   if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> > +  SGX_ENCL_PAGE_BEING_REMOVED))
> > return ERR_PTR(-EBUSY);
> >  
> > return entry;
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> > index f94ff14c9486..fff5f2293ae7 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -25,6 +25,9 @@
> >  /* 'desc' bit marking that the page is being reclaimed. */
> >  #define SGX_ENCL_PAGE_BEING_RECLAIMED  BIT(3)
> >  
> > +/* 'desc' bit marking that the page is being removed. */
> > +#define SGX_ENCL_PAGE_BEING_REMOVEDBIT(2)
> > +
> >  struct sgx_encl_page {
> > unsigned long desc;
> > unsigned long vm_max_prot_bits:8;
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c 
> > b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index b65ab214bdf5..c542d4dd3e64 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl 
> > *encl,
> >  * Do not keep encl->lock because of dependency on
> >  * mmap_lock acquired in sgx_zap_enclave_ptes().
> >  */
> > +   entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
> > mutex_unlock(&encl->lock);
> >  
> > sgx_zap_enclave_ptes(encl, addr);
> 
> It is somewhat trivial to NAK this as the commit message does
> not do any effort describing the new flag. By default at least
> I have strong opposition against any new flags related to
> reclaiming even if it needs a bit of extra synchronization
> work in the user space.
> 
> One way to describe concurrency scenarios would be to take
> example from https://www.kernel.org/doc/Documentation/memory-barriers.txt
> 
> I.e. see the examples with CPU 1 and CPU 2.

Thank you for the suggestion. Here is my new attempt at describing the racy
scenario:

Consider some enclave page added to the enclave. User space decides to
temporarily remove this page (e.g., emulating the MADV_DONTNEED semantics)
on CPU1. At the same time, user space performs a memory access on the same
page on CPU2, which results in a #PF and ultimately in sgx_vma_fault().
Scenario proceeds as follows:

/*
 * CPU1: User space performs
 * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
 * on a single enclave page
 */
sgx_encl_remove_pages() {

  mutex_lock(&encl->lock);

  entry = sgx_encl_load_page(encl);
  /*
   * verify that page is
   * trimmed and accepted
   */

  mutex_unlock(&encl->lock);

  /*
   * remove PTE entry; cannot
   * be performed under lock
   */
  sgx_zap_enclave_ptes(encl);
   /*
* Fault on CPU2
*/
   sgx_vma_fault() {
  

Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-30 Thread Dmitrii Kuvaiskii
On Mon, Apr 29, 2024 at 04:04:24PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to access the same non-present enclave page
> > simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> > threads will end up in sgx_encl_eaug_page(), racing to acquire the
> > enclave lock. The winning thread will perform EAUG, set up the page
> > table entry, and insert the page into encl->page_array. The losing
> > thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> > to error handling path.
> 
> And that path removes page. Not sure I got gist of this tbh.

Well, this is not about a redundant EREMOVE being performed. This is about the
enclave page becoming inaccessible due to a bug triggered by a data race.

Consider some enclave page not yet added to the enclave. The enclave
performs a memory access to it at the same time on CPU1 and CPU2. Since the
page does not yet have a corresponding PTE, the #PF handler on both CPUs
calls sgx_vma_fault(). Scenario proceeds as follows:

/*
 * Fault on CPU1
 */
sgx_vma_fault() {

  xa_load(&encl->page_array) == NULL ->

  sgx_encl_eaug_page() {

.../*
* Fault on CPU2
*/
   sgx_vma_fault() {

 xa_load(&encl->page_array) == NULL ->

 sgx_encl_eaug_page() {

   ...

   mutex_lock(&encl->lock);
   /*
* alloc encl_page
*/
   /*
* alloc EPC page
*/
   epc_page = sgx_alloc_epc_page(...);
   /*
* add page_to enclave's xarray
*/
   xa_insert(&encl->page_array, ...);
   /*
* add page to enclave via EAUG
* (page is in pending state)
*/
   /*
* add PTE entry
*/
   vmf_insert_pfn(...);

   mutex_unlock(&encl->lock);
   return VM_FAULT_NOPAGE;
 }
   }
  mutex_lock(&encl->lock);
 /*
  * alloc encl_page
  */
 /*
  * alloc EPC page
  */
 epc_page = sgx_alloc_epc_page(...);
 /*
  * add page_to enclave's xarray,
  * this fails with -EBUSY
  */
  xa_insert(&encl->page_array, ...);

   err_out_shrink:
 sgx_encl_free_epc_page(epc_page) {
   /*
* remove page via EREMOVE
*/
   /*
* free EPC page
*/
   sgx_free_epc_page(epc_page);
 }

   mutex_unlock(&encl->lock);
  return VM_FAULT_SIGBUS;
}
  }

CPU2 added the enclave page (in pending state) to the enclave and installed
the PTE. The kernel gives control back to the user space, without raising a
signal. The user space on CPU2 retries the memory access and induces a page
fault, but now with the SGX bit set in the #PF error code. The #PF handler
calls do_user_addr_fault(), which calls access_error() and ultimately
raises a SIGSEGV. The userspace SIGSEGV handler is supposed to perform
EACCEPT, after which point the enclave page becomes accessible.

CPU1 however jumps to the error handling path because the page was already
inserted into the enclave's xarray. This error handling path EREMOVEs the
page and also raises a SIGBUS signal to user space. The PTE entry is not
removed.

After CPU1 performs EREMOVE, this enclave page becomes perpetually
inaccessible (until an SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because
the page is marked accessible in the PTE entry but is not EAUGed. Because
of this combination, the #PF handler sees the SGX bit set in the #PF error
code and does not call sgx_vma_fault() but instead raises a SIGSEGV. The
userspace SIGSEGV handler cannot perform EACCEPT because the page was not
EAUGed. Thus, the user space is stuck with the inaccessible page.

Also note that in the scenario, CPU1 raises a SIGBUS signal to user space
unnecessarily. This signal is spurious because a page-access retry on CPU2
will also raise the SIGBUS signal. That said, this side effect is less
severe because it affects only user space. Therefore, it could be
circumvented in user space alone, but it seems reasonable to fix it in this
patch.

> > This error handling path 

Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-04-30 Thread Dmitrii Kuvaiskii
On Mon, Apr 29, 2024 at 04:06:39PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> > enclave pages and may support MADV_DONTNEED semantics [1]. The former
> > implies #PF-based page allocation, and the latter implies the usage of
> > SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
> >
> > A trivial program like below (run under Gramine and with EDMM enabled)
> > stresses these two flows in the SGX driver and hangs:
> >
> > /* repeatedly touch different enclave pages at random and mix with
> >  * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> > static void* thread_func(void* arg) {
> > size_t num_pages = 0xA000 / page_size;
> > for (int i = 0; i < 5000; i++) {
> > size_t page = get_random_ulong() % num_pages;
> > char data = READ_ONCE(((char*)arg)[page * page_size]);
> >
> > page = get_random_ulong() % num_pages;
> > madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> > }
> > }
> >
> > addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> > pthread_t threads[16];
> > for (int i = 0; i < 16; i++)
> > pthread_create([i], NULL, thread_func, addr);
> 
> I'm not convinced that kernel is the problem here but it could be also
> how Gramine is implemented.
> 
> So maybe you could make a better case of that. The example looks a bit
> artificial to me.

I believe that these are the bugs in the kernel (in the SGX driver). I
provided more detailed descriptions of the races and ensuing bugs in the
other two replies, please check them.

The example is a stress test written to debug very infrequent hangs of
real-world applications that are run with Gramine, EDMM, and two
optimizations (lazy allocation and MADV_DONTNEED semantics). We observed
hangs of Node.js, PyTorch, R, iperf, Blender, Nginx. To root cause these
hangs, we wrote this artificial stress test. This test succeeds on vanilla
Linux, so ideally it should also pass on Gramine.

Please also note that the optimizations of lazy allocation and
MADV_DONTNEED provide significant performance improvement for some
workloads that run on Gramine. For example, a Java workload with a 16GB
enclave size has approx. 57x improvement in total runtime. Thus, we
consider it important to permit these optimizations in Gramine, which IIUC
requires bug fixes in the SGX driver.

You can find more info at
https://github.com/gramineproject/gramine/pull/1513.

Which parts do you consider artificial, and how could I modify the stress
test?

--
Dmitrii Kuvaiskii



Re: [PATCH v9 1/3] soc: qcom: Add qcom_rproc_minidump module

2024-04-30 Thread Bjorn Andersson
On Tue, Mar 26, 2024 at 07:43:12PM +0530, Mukesh Ojha wrote:
> Add the qcom_rproc_minidump module in preparation for removing
> minidump-specific code from drivers/remoteproc/qcom_common.c
> and providing the needed exported API; this also helps to
> abstract the minidump-specific data layout from Qualcomm's
> remoteproc driver.
> 
> It just copies the qcom_minidump() functionality from
> drivers/remoteproc/qcom_common.c into a separate file under
> qcom_rproc_minidump().
> 

I'd prefer to see this enter the git history as one patch that extracts
this logic from the remoteproc driver into its own driver - rather than,
as presented here, giving the sense that it's a new thing being added.
(I'll take care of the maintainer sync...)

I also would prefer for this to include a problem description,
documenting why this is done.


I've not compared patch 1 and 3, but I'd also like a statement in the
commit message telling if there are any changes, or if the functions are
cleanly moved from one place to another.

Regards,
Bjorn

> Signed-off-by: Mukesh Ojha 
> ---
> Changes in v9:
>  - Added source file driver/remoteproc/qcom_common.c copyright
>to qcom_rproc_minidump.c 
>  - Dissociated it from minidump series as this can go separately
>and minidump can put it dependency for the data structure files.
> 
> Nothing much changed in these three patches from previous version,
> However, giving the link of their older versions.
> 
> v8: 
> https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/
> v7: 
> https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/
> v6: 
> https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/
> v5: 
> https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/
> v4: 
> https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/
> 
>  drivers/soc/qcom/Kconfig  |  10 +++
>  drivers/soc/qcom/Makefile |   1 +
>  drivers/soc/qcom/qcom_minidump_internal.h |  64 +
>  drivers/soc/qcom/qcom_rproc_minidump.c| 115 
> ++
>  include/soc/qcom/qcom_minidump.h  |  23 ++
>  5 files changed, 213 insertions(+)
>  create mode 100644 drivers/soc/qcom/qcom_minidump_internal.h
>  create mode 100644 drivers/soc/qcom/qcom_rproc_minidump.c
>  create mode 100644 include/soc/qcom/qcom_minidump.h
> 
> diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
> index 5af33b0e3470..ed23e0275c22 100644
> --- a/drivers/soc/qcom/Kconfig
> +++ b/drivers/soc/qcom/Kconfig
> @@ -277,4 +277,14 @@ config QCOM_PBS
> This module provides the APIs to the client drivers that wants to 
> send the
> PBS trigger event to the PBS RAM.
>  
> +config QCOM_RPROC_MINIDUMP
> + tristate "QCOM Remoteproc Minidump Support"
> + depends on ARCH_QCOM || COMPILE_TEST
> + depends on QCOM_SMEM
> + help
> +   Enablement of core Minidump feature is controlled from boot firmware
> +   side, so if it is enabled from firmware, this config allow Linux to
> +   query predefined Minidump segments associated with the remote 
> processor
> +   and check its validity and end up collecting the dump on remote 
> processor
> +   crash during its recovery.
>  endmenu
> diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
> index ca0bece0dfff..44664589263d 100644
> --- a/drivers/soc/qcom/Makefile
> +++ b/drivers/soc/qcom/Makefile
> @@ -36,3 +36,4 @@ obj-$(CONFIG_QCOM_ICC_BWMON)+= icc-bwmon.o
>  qcom_ice-objs+= ice.o
>  obj-$(CONFIG_QCOM_INLINE_CRYPTO_ENGINE)  += qcom_ice.o
>  obj-$(CONFIG_QCOM_PBS) +=qcom-pbs.o
> +obj-$(CONFIG_QCOM_RPROC_MINIDUMP)+= qcom_rproc_minidump.o
> diff --git a/drivers/soc/qcom/qcom_minidump_internal.h 
> b/drivers/soc/qcom/qcom_minidump_internal.h
> new file mode 100644
> index ..71709235b196
> --- /dev/null
> +++ b/drivers/soc/qcom/qcom_minidump_internal.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +
> +#ifndef _QCOM_MINIDUMP_INTERNAL_H_
> +#define _QCOM_MINIDUMP_INTERNAL_H_
> +
> +#define MAX_NUM_OF_SS   10
> +#define MAX_REGION_NAME_LENGTH  16
> +#define SBL_MINIDUMP_SMEM_ID 602
> +#define MINIDUMP_REGION_VALID   ('V' << 24 | 'A' << 16 | 'L' << 8 | 
> 'I' << 0)
> +#define MINIDUMP_SS_ENCR_DONE   ('D' << 24 | 'O' << 16 | 'N' << 8 | 
> 'E' << 0)
> +#define MINIDUMP_SS_ENABLED ('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0)
> +
> +/**
> + * struct minidump_region - Minidump region
> + * @name : Name of the region to be dumped
> + * @seq_num: : Use to differentiate regions with same name.
> + * @valid: This entry to be dumped (if set to 1)
> + * @address  : Physical address of region to be dumped
> + * @size : Size of the region
> + 

Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-30 Thread Google
On Mon, 29 Apr 2024 13:25:04 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu  wrote:
> >
> > Hi Andrii,
> >
> > On Thu, 25 Apr 2024 13:31:53 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Hey Masami,
> > >
> > > I can't really review most of that code as I'm completely unfamiliar
> > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > a few comments where there were somewhat more obvious BPF-related
> > > pieces.
> > >
> > > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > > and then with your series applied on top. Just to see if there are any
> > > regressions. I think it will be a useful data point for you.
> >
> > Thanks for testing!
> >
> > >
> > > You should be already familiar with the bench tool we have in BPF
> > > selftests (I used it on some other patches for your tree).
> >
> > What patches we need?
> >
> 
> You mean for this `bench` tool? They are part of BPF selftests (under
> tools/testing/selftests/bpf), you can build them by running:
> 
> $ make RELEASE=1 -j$(nproc) bench
> 
> After that you'll get a self-container `bench` binary, which has all
> the self-contained benchmarks.
> 
> You might also find a small script (benchs/run_bench_trigger.sh inside
> BPF selftests directory) helpful, it collects final summary of the
> benchmark run and optionally accepts a specific set of benchmarks. So
> you can use it like this:
> 
> $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> kprobe :   18.731 ± 0.639M/s
> kprobe-multi   :   23.938 ± 0.612M/s
> 
> By default it will run a wider set of benchmarks (no uprobes, but a
> bunch of extra fentry/fexit tests and stuff like this).

origin:
# benchs/run_bench_trigger.sh 
kretprobe :1.329 ± 0.007M/s 
kretprobe-multi:1.341 ± 0.004M/s 
# benchs/run_bench_trigger.sh 
kretprobe :1.288 ± 0.014M/s 
kretprobe-multi:1.365 ± 0.002M/s 
# benchs/run_bench_trigger.sh 
kretprobe :1.329 ± 0.002M/s 
kretprobe-multi:1.331 ± 0.011M/s 
# benchs/run_bench_trigger.sh 
kretprobe :1.311 ± 0.003M/s 
kretprobe-multi:1.318 ± 0.002M/s

patched: 

# benchs/run_bench_trigger.sh
kretprobe :1.274 ± 0.003M/s 
kretprobe-multi:1.397 ± 0.002M/s 
# benchs/run_bench_trigger.sh
kretprobe :1.307 ± 0.002M/s 
kretprobe-multi:1.406 ± 0.004M/s 
# benchs/run_bench_trigger.sh
kretprobe :1.279 ± 0.004M/s 
kretprobe-multi:1.330 ± 0.014M/s 
# benchs/run_bench_trigger.sh
kretprobe :1.256 ± 0.010M/s 
kretprobe-multi:1.412 ± 0.003M/s 

Hmm, in my case, the differences seem smaller (~3%?).
I attached perf report results for those, but I don't see a large difference.

> > >
> > > BASELINE
> > > 
> > > kprobe :   24.634 ± 0.205M/s
> > > kprobe-multi   :   28.898 ± 0.531M/s
> > > kretprobe  :   10.478 ± 0.015M/s
> > > kretprobe-multi:   11.012 ± 0.063M/s
> > >
> > > THIS PATCH SET ON TOP
> > > =
> > > kprobe :   25.144 ± 0.027M/s (+2%)
> > > kprobe-multi   :   28.909 ± 0.074M/s
> > > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
> >
> > This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> > eventually because it should be a single callback version of
> > kretprobe-multi.

I ran another benchmark (prctl loop, attached), the origin kernel result is 
here;

# sh ./benchmark.sh 
count = 1000, took 6.748133 sec

And the patched kernel result;

# sh ./benchmark.sh 
count = 1000, took 6.644095 sec

I confirmed that the perf result shows no big difference.

Thank you,


> >
> > >
> > > These numbers are pretty stable and look to be more or less 
> > > representative.
> > >
> > > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > > about the same, though.
> > >
> > > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > > slower, almost by 10%. Not sure why, but looks real after re-running
> > > benchmarks a bunch of times and getting stable results.
> >
> > Hmm, kretprobe on x86 should use ftrace + rethook even with my series.
> > So nothing should be changed. Maybe cache access pattern has been
> > changed?
> > I'll check it with tracefs (to remove the effect from bpf related changes)
> >
> > >
> > > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > > Again, I don't know if it is expected or not, but it's a nice
> > > improvement.
> >
> > Thanks!
> >
> > >
> > > If you have any idea why kretprobes would get so much slower, it would
> > > be nice to look into that and see if you can mitigate the regression
> > > somehow. Thanks!
> >
> > OK, let me check it.
> >
> > Thank you!
> >
> > >
> > >
> > > >  51 files changed, 2325 insertions(+), 882 deletions(-)
> > > >  create mode 100644 
> > > > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) 
> > > >
> >
> >
> > --
> > Masami Hiramatsu (Google) 


-- 
Masami 

[PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Edward Adam Davis
[syzbot reported]
BUG: KASAN: slab-use-after-free in instrument_atomic_read 
include/linux/instrumented.h:68 [inline]
BUG: KASAN: slab-use-after-free in atomic_long_read 
include/linux/atomic/atomic-instrumented.h:3188 [inline]
BUG: KASAN: slab-use-after-free in __mutex_unlock_slowpath+0xef/0x750 
kernel/locking/mutex.c:921
Read of size 8 at addr 888023632880 by task vhost-5104/5105

CPU: 0 PID: 5105 Comm: vhost-5104 Not tainted 6.9.0-rc5-next-20240426-syzkaller 
#0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
03/27/2024
Call Trace:
 
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
 instrument_atomic_read include/linux/instrumented.h:68 [inline]
 atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline]
 __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921
 vhost_task_fn+0x3bc/0x3f0 kernel/vhost_task.c:65
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 

Allocated by task 5104:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
 __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
 kasan_kmalloc include/linux/kasan.h:211 [inline]
 kmalloc_trace_noprof+0x19c/0x2b0 mm/slub.c:4146
 kmalloc_noprof include/linux/slab.h:660 [inline]
 kzalloc_noprof include/linux/slab.h:778 [inline]
 vhost_task_create+0x149/0x300 kernel/vhost_task.c:134
 vhost_worker_create+0x17b/0x3f0 drivers/vhost/vhost.c:667
 vhost_dev_set_owner+0x563/0x940 drivers/vhost/vhost.c:945
 vhost_dev_ioctl+0xda/0xda0 drivers/vhost/vhost.c:2108
 vhost_vsock_dev_ioctl+0x2bb/0xfa0 drivers/vhost/vsock.c:875
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:907 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5104:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
 kasan_slab_free include/linux/kasan.h:184 [inline]
 slab_free_hook mm/slub.c:2190 [inline]
 slab_free mm/slub.c:4430 [inline]
 kfree+0x149/0x350 mm/slub.c:4551
 vhost_worker_destroy drivers/vhost/vhost.c:629 [inline]
 vhost_workers_free drivers/vhost/vhost.c:648 [inline]
 vhost_dev_cleanup+0x9b0/0xba0 drivers/vhost/vhost.c:1051
 vhost_vsock_dev_release+0x3aa/0x410 drivers/vhost/vsock.c:751
 __fput+0x406/0x8b0 fs/file_table.c:422
 __do_sys_close fs/open.c:1555 [inline]
 __se_sys_close fs/open.c:1540 [inline]
 __x64_sys_close+0x7f/0x110 fs/open.c:1540
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
[Fix]
Delete the exit_mutex member from struct vhost_task and replace it with a
statically defined global mutex.

Fixes: a3df30984f4f ("vhost_task: Handle SIGKILL by flushing work and exiting")
Reported-by: syzbot+98edc2df894917b34...@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis 
---
 kernel/vhost_task.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index 48c289947b99..375356499867 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -20,10 +20,10 @@ struct vhost_task {
struct completion exited;
unsigned long flags;
struct task_struct *task;
-   /* serialize SIGKILL and vhost_task_stop calls */
-   struct mutex exit_mutex;
 };
 
+static DEFINE_MUTEX(exit_mutex); //serialize SIGKILL and vhost_task_stop calls
+
 static int vhost_task_fn(void *data)
 {
struct vhost_task *vtsk = data;
@@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
schedule();
}
 
-   mutex_lock(&vtsk->exit_mutex);
+   mutex_lock(&exit_mutex);
/*
 * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
 * When the vhost layer has called vhost_task_stop it's already stopped
@@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
vtsk->handle_sigkill(vtsk->data);
}
complete(&vtsk->exited);
-   mutex_unlock(&vtsk->exit_mutex);
+   mutex_unlock(&exit_mutex);
 
do_exit(0);
 }
@@ -88,12 +88,12 @@ EXPORT_SYMBOL_GPL(vhost_task_wake);
  */
 void vhost_task_stop(struct vhost_task *vtsk)
 {
-   mutex_lock(&vtsk->exit_mutex);
+   mutex_lock(&exit_mutex);
if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)) {
set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
   

Re: [PATCH v7 05/16] module: make module_memory_{alloc,free} more self-contained

2024-04-30 Thread Philippe Mathieu-Daudé

On 29/4/24 14:16, Mike Rapoport wrote:

From: "Mike Rapoport (IBM)" 

Move the logic related to the memory allocation and freeing into
module_memory_alloc() and module_memory_free().

Signed-off-by: Mike Rapoport (IBM) 
---
  kernel/module/main.c | 64 +++-
  1 file changed, 39 insertions(+), 25 deletions(-)


Nice code simplification.

Reviewed-by: Philippe Mathieu-Daudé 




[PATCH v22 4/5] Documentation: tracing: Add ring-buffer mapping

2024-04-30 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst 
b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..8e296bcc0d7f
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+Tracefs ring-buffer memory mapping
+==
+
+:Author: Vincent Donnefort 
+
+Overview
+
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer becomes
+then a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which can
+be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It 
is
+therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+reader_id = meta->reader->id;
+reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one 
using
+the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
+the meta-page fields.
+
+Limitations
+===
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+either resize it (either by increasing the entire size of the ring-buffer or
+each subbuf). It is also not possible to use snapshot and causes splice to copy
+the ring buffer data instead of using the copyless swap from the ring buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers 
on
+trace_pipe would be.
+
+Example
+===
+
+.. code-block:: c
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/trace_mmap.h>
+
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+
+#define TRACE_PIPE_RAW 
"/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+int main(void)
+{
+int page_size = getpagesize(), fd, reader_id;
+unsigned long meta_len, data_len;
+struct trace_buffer_meta *meta;
+void *map, *reader, *data;
+
+fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+if (fd < 0)
+exit(EXIT_FAILURE);
+
+map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+if (map == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+meta = (struct trace_buffer_meta *)map;
+meta_len = meta->meta_page_size;
+
+printf("entries:%llu\n", meta->entries);
+printf("overrun:%llu\n", meta->overrun);
+printf("read:   %llu\n", meta->read);
+printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+data_len = meta->subbuf_size * meta->nr_subbufs;
+data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, 
meta_len);
+if (data == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+exit(EXIT_FAILURE);
+
+reader_id = meta->reader.id;
+reader = data + meta->subbuf_size * reader_id;
+
+printf("Current reader address: %p\n", reader);
+
+munmap(data, data_len);
+munmap(meta, meta_len);
+close (fd);
+
+return 0;
+}
-- 
2.44.0.769.g3c40516874-goog




[PATCH v22 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-04-30 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.
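
For illustration only (not part of the patch), a rough consumer loop built on
this interface; it assumes fd, meta and data were set up as in the
documentation example of patch 4/5, and process_events() stands in for the
user's own decoding code:

	for (;;) {
		unsigned int id;
		void *subbuf;

		/* Swap in a reader subbuf that contains unread data. */
		if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
			break;

		id     = meta->reader.id;
		subbuf = data + (size_t)meta->subbuf_size * id;

		/* meta->reader.read bytes of this subbuf were already consumed. */
		process_events(subbuf, meta->reader.read, meta->subbuf_size);
	}

Without O_NONBLOCK on the file descriptor, the ioctl above waits for data, as
described in the hunk touching tracing_buffers_ioctl() below.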

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index b682e9925539..bd1066754220 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..a35e7f598233 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct 
trace_array *tr,
return;
}
 
+   if (tr->mapped) {
+   trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);
@@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array 
*tr)
lockdep_assert_held(&trace_types_lock);
 
spin_lock(&tr->snapshot_trigger_lock);
-   if (tr->snapshot == UINT_MAX) {
+   if (tr->snapshot == UINT_MAX || tr->mapped) {
spin_unlock(&tr->snapshot_trigger_lock);
return -EBUSY;
}
@@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
if (tr->current_trace == _trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
+
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent,
+  NULL, NULL);
+   if (err)
+   return err;
+   }
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
 
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
/* Make sure the waiters see the new wait_index */
@@ -8214,6 +8237,76 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+#ifdef CONFIG_TRACER_MAX_TRACE
+static int get_snapshot_map(struct trace_array *tr)
+{
+   int err = 0;
+
+   /*
+* Called with mmap_lock held. lockdep would be unhappy if we would now
+* take trace_types_lock. Instead use the specific
+* snapshot_trigger_lock.
+*/
+   spin_lock(&tr->snapshot_trigger_lock);
+
+   if (tr->snapshot || tr->mapped == UINT_MAX)
+   err = -EBUSY;
+   else
+   tr->mapped++;
+
+   spin_unlock(&tr->snapshot_trigger_lock);
+
+   /* Wait for update_max_tr() to observe iter->tr->mapped */
+   if (tr->mapped == 1)
+   synchronize_rcu();
+
+   return err;
+
+}
+static void put_snapshot_map(struct trace_array *tr)
+{
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (!WARN_ON(!tr->mapped))
+  

[PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-30 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.
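
For illustration only (not part of the patch), a minimal sketch of how a
per-CPU file could expose such a mapping; everything except ring_buffer_map()
and ring_buffer_unmap() (the wrapper struct, the handler name, the assumed
<linux/ring_buffer.h>/<linux/fs.h>/<linux/mm.h> includes) is an assumption of
mine:

	struct example_cpu_file {	/* assumed wrapper, not from the patch */
		struct trace_buffer	*buffer;
		int			cpu;
	};

	static int example_buffer_mmap(struct file *filp, struct vm_area_struct *vma)
	{
		struct example_cpu_file *f = filp->private_data;

		/* Maps the meta-page followed by every subbuf of this CPU. */
		return ring_buffer_map(f->buffer, f->cpu, vma);
	}

The matching teardown path would call ring_buffer_unmap() for the same CPU,
and the reader is advanced with ring_buffer_map_get_reader(), which patch 3/5
wires up to an ioctl on trace_pipe_raw.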

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include <linux/seq_file.h>
 #include <linux/poll.h>
 
+#include <uapi/linux/trace_mmap.h>
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..b682e9925539
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include <linux/types.h>
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbfs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Internal use only.
+ * @Reserved2: Internal use only.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..fc66d01ff472 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include <linux/cacheflush.h>
 #include 
 #include 
 #include 
@@ -26,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <linux/mm.h>
 
 #include 
 #include 
@@ -338,6 +340,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to subbuf VA */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&cpu_buffer->irq_work.waiters);
init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+   

[PATCH v22 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP

2024-04-30 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Signed-off-by: Vincent Donnefort 

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..cc9ebe593571 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
list_add(&bpage->list, pages);
 
page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-   mflags | __GFP_ZERO,
+   mflags | __GFP_COMP | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | 
__GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
@@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
goto out;
 
page = alloc_pages_node(cpu_to_node(cpu),
-   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
+   GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | 
__GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.44.0.769.g3c40516874-goog




[PATCH v22 0/5] Introducing trace buffer mapping by user-space

2024-04-30 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to network
without any copy via splice. However the later doesn't allow real time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v21 -> v22:
  * Remove DONTDUMP (VM_IO implies DONTDUMP already)
  * Remove MIXEDMAP (implicit when using vm_insert_page)
  * Remove PFNMAP (We do not perform raw PFN mappings and MIXEDMAP is
already implicitely set)
  * Add comments to justify the VM_* flags

v20 -> v21:
  * Collect Ack
  * Add .gitignore
  * Few nits
  * Remove meta-page padding (zero-page not supported by vm_insert_pages)
  * Remove single-usage macros
  * Move vma flags handling into ring-buffer.c

v19 -> v20:
  * Fix typos in documentation.
  * Remove useless mmap open and fault callbacks.
  * add mm.h include for vm_insert_pages

v18 -> v19:
  * Use VM_PFNMAP and vm_insert_pages
  * Allocate ring-buffer subbufs with __GFP_COMP
  * Pad the meta-page with the zero-page to align on the subbuf_order
  * Extend the ring-buffer test with mmap() dedicated suite

v17 -> v18:
  * Fix lockdep_assert_held
  * Fix spin_lock_init typo
  * Fix CONFIG_TRACER_MAX_TRACE typo

v16 -> v17:
  * Documentation and comments improvements.
  * Create get/put_snapshot_map() for clearer code.
  * Replace kzalloc with kcalloc.
  * Fix -ENOMEM handling in rb_alloc_meta_page().
  * Move flush(cpu_buffer->reader_page) behind the reader lock.
  * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
mutex. (removes READ_ONCE/WRITE_ONCE accesses).

v15 -> v16:
  * Add comment for the dcache flush.
  * Remove now unnecessary WRITE_ONCE for the meta-page.

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
(intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page 

[PATCH v3 2/2] remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs

2024-04-30 Thread Beleswar Padhi
PSC controller has a limitation that it can only power-up the second
core when the first core is in ON state. Power-state for core0 should be
equal to or higher than core1.

Therefore, prevent core1 from powering up before core0 during the start
process from sysfs. Similarly, prevent core0 from shutting down before
core1 has been shut down from sysfs.

Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F 
subsystem")

Signed-off-by: Beleswar Padhi 
---
 drivers/remoteproc/ti_k3_r5_remoteproc.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 6d6afd6beb3a..1799b4f6d11e 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -548,7 +548,7 @@ static int k3_r5_rproc_start(struct rproc *rproc)
struct k3_r5_rproc *kproc = rproc->priv;
struct k3_r5_cluster *cluster = kproc->cluster;
struct device *dev = kproc->dev;
-   struct k3_r5_core *core;
+   struct k3_r5_core *core0, *core;
u32 boot_addr;
int ret;
 
@@ -574,6 +574,15 @@ static int k3_r5_rproc_start(struct rproc *rproc)
goto unroll_core_run;
}
} else {
+   /* do not allow core 1 to start before core 0 */
+   core0 = list_first_entry(&cluster->cores, struct k3_r5_core,
+				 elem);
+   if (core != core0 && core0->rproc->state == RPROC_OFFLINE) {
+   dev_err(dev, "%s: can not start core 1 before core 0\n",
+   __func__);
+   return -EPERM;
+   }
+
ret = k3_r5_core_run(core);
if (ret)
goto put_mbox;
@@ -619,7 +628,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
 {
struct k3_r5_rproc *kproc = rproc->priv;
struct k3_r5_cluster *cluster = kproc->cluster;
-   struct k3_r5_core *core = kproc->core;
+   struct device *dev = kproc->dev;
+   struct k3_r5_core *core1, *core = kproc->core;
int ret;
 
/* halt all applicable cores */
@@ -632,6 +642,15 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
}
}
} else {
+   /* do not allow core 0 to stop before core 1 */
+   core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
+				elem);
+   if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
+   dev_err(dev, "%s: can not stop core 0 before core 1\n",
+   __func__);
+   return -EPERM;
+   }
+
ret = k3_r5_core_halt(core);
if (ret)
goto out;
-- 
2.34.1




[PATCH v3 1/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1

2024-04-30 Thread Beleswar Padhi
From: Apurva Nandan 

PSC controller has a limitation that it can only power-up the second core
when the first core is in ON state. Power-state for core0 should be equal
to or higher than core1, else the kernel is seen hanging during rproc
loading.

Make the powering up of cores sequential, by waiting for the current core
to power-up before proceeding to the next core, with a timeout of 2sec.
Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait
for the current core to be released from reset before proceeding with the
next core.

Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F subsystem")

Signed-off-by: Apurva Nandan 
[added comments and fixed code style]
Signed-off-by: Beleswar Padhi 
---
 drivers/remoteproc/ti_k3_r5_remoteproc.c | 33 
 1 file changed, 33 insertions(+)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index ad3415a3851b..6d6afd6beb3a 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -103,12 +103,14 @@ struct k3_r5_soc_data {
  * @dev: cached device pointer
  * @mode: Mode to configure the Cluster - Split or LockStep
  * @cores: list of R5 cores within the cluster
+ * @core_transition: wait queue to sync core state changes
  * @soc_data: SoC-specific feature data for a R5FSS
  */
 struct k3_r5_cluster {
struct device *dev;
enum cluster_mode mode;
struct list_head cores;
+   wait_queue_head_t core_transition;
const struct k3_r5_soc_data *soc_data;
 };
 
@@ -128,6 +130,7 @@ struct k3_r5_cluster {
  * @atcm_enable: flag to control ATCM enablement
  * @btcm_enable: flag to control BTCM enablement
  * @loczrama: flag to dictate which TCM is at device address 0x0
+ * @released_from_reset: flag to signal when core is out of reset
  */
 struct k3_r5_core {
struct list_head elem;
@@ -144,6 +147,7 @@ struct k3_r5_core {
u32 atcm_enable;
u32 btcm_enable;
u32 loczrama;
+   bool released_from_reset;
 };
 
 /**
@@ -460,6 +464,8 @@ static int k3_r5_rproc_prepare(struct rproc *rproc)
ret);
return ret;
}
+   core->released_from_reset = true;
+   wake_up_interruptible(&cluster->core_transition);
 
/*
 * Newer IP revisions like on J7200 SoCs support h/w auto-initialization
@@ -1140,6 +1146,12 @@ static int k3_r5_rproc_configure_mode(struct k3_r5_rproc *kproc)
return ret;
}
 
+   /*
+* Skip the waiting mechanism for sequential power-on of cores if the
+* core has already been booted by another entity.
+*/
+   core->released_from_reset = c_state;
+
ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl,
			     &stat);
if (ret < 0) {
@@ -1280,6 +1292,26 @@ static int k3_r5_cluster_rproc_init(struct platform_device *pdev)
cluster->mode == CLUSTER_MODE_SINGLECPU ||
cluster->mode == CLUSTER_MODE_SINGLECORE)
break;
+
+   /*
+* R5 cores require to be powered on sequentially, core0
+* should be in higher power state than core1 in a cluster
+* So, wait for current core to power up before proceeding
+* to next core and put timeout of 2sec for each core.
+*
+* This waiting mechanism is necessary because
+* rproc_auto_boot_callback() for core1 can be called before
+* core0 due to thread execution order.
+*/
+   ret = wait_event_interruptible_timeout(cluster->core_transition,
+  core->released_from_reset,
+  msecs_to_jiffies(2000));
+   if (ret <= 0) {
+   dev_err(dev,
+   "Timed out waiting for %s core to power up!\n",
+   rproc->name);
+   return ret;
+   }
}
 
return 0;
@@ -1709,6 +1741,7 @@ static int k3_r5_probe(struct platform_device *pdev)
cluster->dev = dev;
cluster->soc_data = data;
INIT_LIST_HEAD(&cluster->cores);
+   init_waitqueue_head(&cluster->core_transition);
 
ret = of_property_read_u32(np, "ti,cluster-mode", &cluster->mode);
if (ret < 0 && ret != -EINVAL) {
-- 
2.34.1




[PATCH v3 0/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1

2024-04-30 Thread Beleswar Padhi
PSC controller has a limitation that it can only power-up the second core
when the first core is in ON state. Power-state for core0 should be equal
to or higher than core1, else the kernel is seen hanging during rproc
loading.

Make the powering up of cores sequential, by waiting for the current core
to power-up before proceeding to the next core, with a timeout of 2sec.
Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait
for the current core to be released from reset before proceeding with the
next core.

Also, ensure that core1 cannot be powered on before core0 when starting
cores from sysfs. Similarly, ensure that core0 cannot be shut down
before core1 from sysfs.

v3: Changelog:
1) Added my own Signed-off-by in PATCH 1, specifying the changes done
2) Addressed changes requested by adding comments in code in PATCH 1

Link to v2:
https://lore.kernel.org/all/20240424130504.494916-1-b-pa...@ti.com/

v2: Changelog:
1) Fixed multi-line comment format
2) Included root cause of bug in comments
3) Added a patch to ensure power-up/shutdown is sequential via sysfs

Link to v1:
https://lore.kernel.org/all/20230906124756.3480579-1-a-nan...@ti.com/

Apurva Nandan (1):
  remoteproc: k3-r5: Wait for core0 power-up before powering up core1

Beleswar Padhi (1):
  remoteproc: k3-r5: Do not allow core1 to power up before core0 via
sysfs

 drivers/remoteproc/ti_k3_r5_remoteproc.c | 56 +++-
 1 file changed, 54 insertions(+), 2 deletions(-)

-- 
2.34.1




Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-04-30 Thread Tobias Huschle
It took me a while, but I was able to figure out why EEVDF behaves
differently than CFS does. I'm still waiting for some official confirmation
of my assumptions, but it all seems very plausible to me.

Leaving aside all the specifics of vhost and kworkers, a more general
description of the scenario would be as follows:

Assume that we have two tasks taking turns on a single CPU. 
Task 1 does something and wakes up Task 2.
Task 2 does something and goes to sleep.
And we're just repeating that.
Task 1 and task 2 only run for very short amounts of time, i.e. much 
shorter than a regular time slice (vhost = task1, kworker = task2).

Let's further assume that task 1 runs longer than task 2.
In CFS, this means that the vruntime of task 1 starts to outrun the vruntime
of task 2, i.e. vruntime(task2) < vruntime(task1). Hence, task 2
always gets picked on wake up because it has the smaller vruntime.
In EEVDF, this would translate to a permanent positive lag, which also
causes task 2 to get consistently scheduled on wake up.

Let's now assume that, occasionally, task 2 runs a little bit longer than
task 1. In CFS, this means that task 2 can close the vruntime gap by a
bit, but it can easily remain below the value of task 1. Task 2 would
still get picked on wake up.
With EEVDF, in its current form, task 2 will now get a negative lag, which
in turn will cause it not to be picked on the next wake up.

So, it seems we have a change in how far the two variants look
into the past: CFS is willing to take more history into account, whereas
EEVDF is not (with update_entity_lag setting the lag value from scratch,
and place_entity not taking the original vruntime into account).

All of this can be seen as correct by design: a task that consumes more time
than the others has to give way to them. The big difference
now is that CFS allowed a task to collect some bonus by constantly using
less CPU time than others and trading that time against occasionally taking
more CPU time. EEVDF could do the same thing by allowing the accumulation
of positive lag, which can then be traded against the one time the task
would get negative lag. This might clash with other EEVDF assumptions, though.
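
To make the accumulate-vs-reset difference concrete, here is a small
stand-alone toy model (plain user-space C, not kernel code; the names and
numbers are invented and only the sign of the resulting lag matters). The
"accumulated" variant roughly corresponds to the patch further below, which
keeps the previously collected vlag, while the "from-scratch" variant
corresponds to the current update_entity_lag() behavior.

/*
 * Toy model only: invented numbers, no kernel APIs. It just shows the
 * sign of the lag when history is kept vs. recomputed from scratch.
 */
#include <stdio.h>

int main(void)
{
	/* task 2 usually finishes 20us "early" per turn, then overruns once */
	long per_turn_bonus = 20, one_overrun = -30;
	long accumulated = 0, from_scratch = 0;
	int turn;

	for (turn = 0; turn < 4; turn++) {
		long delta = (turn == 3) ? one_overrun : per_turn_bonus;

		accumulated += delta;	/* patched: se->vlag + fresh lag */
		from_scratch = delta;	/* current: fresh lag only */

		printf("turn %d: accumulated=%+ld from-scratch=%+ld\n",
		       turn, accumulated, from_scratch);
	}
	return 0;
}

After the single overrun, the accumulated variant stays positive while the
from-scratch variant turns negative, which is exactly the point where task 2
stops being picked on wake up.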

The patch below fixes the degradation, but it is not at all aligned with what
EEVDF wants to achieve; it only helps as an indicator that my hypothesis is
correct.

So, what does this now mean for the vhost regression we were discussing?

1. The behavior of the scheduler changed with regard to wake-up scenarios.
2. vhost in its current form relies on the way CFS works by assuming
   that the kworker always gets scheduled.

I would like to argue that it therefore makes sense to reconsider the vhost
implementation to make it less dependent on the internals of the scheduler.
As proposed earlier in this thread, I see two options:

1. Do an explicit schedule() after every iteration across the vhost queues
2. Set the need_resched flag after writing to the socket that would trigger
   eventfd and the underlying kworker

Both options would make sure that vhost gives up the CPU, as it cannot
continue anyway without the kworker handling the event. Option 1 would give
up the CPU regardless of whether something was found in the queues, whereas
option 2 would only give it up if something was found.
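
As a rough, stand-alone illustration of where option 1 would sit in the loop
(plain C; worker_should_stop(), handle_all_queues() and yield_cpu() are
invented stubs, not vhost functions):

#include <stdbool.h>
#include <stdio.h>

static int iteration;

static bool worker_should_stop(void)
{
	return iteration >= 3;		/* pretend we are told to stop after 3 passes */
}

static void handle_all_queues(void)
{
	/* stands in for one iteration across the vhost queues */
	printf("pass %d over the queues\n", iteration);
}

static void yield_cpu(void)
{
	/* in the kernel this would be an explicit schedule() */
	printf("giving up the CPU so the kworker can run\n");
}

int main(void)
{
	while (!worker_should_stop()) {
		handle_all_queues();
		yield_cpu();		/* option 1: yield after every iteration */
		iteration++;
	}
	return 0;
}

Option 2 would drop the unconditional yield and instead only mark the task as
needing a reschedule from inside handle_all_queues(), right after the eventfd
has been signalled, so the CPU is only given up when work was actually queued.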

It should be noted that we encountered similar behavior when running some
fio benchmarks. From a brief glance at the code, I saw similar
intentions: loop over queues, then trigger an action through some event
mechanism. Applying the same patch as mentioned above also fixes this issue.

It could be argued that this is still something that needs to be
addressed by the scheduler, since it might affect others as well and there
are in fact patches coming in. Will they address our issue here? Not sure yet.
On the other hand, it might just be beneficial to make vhost more resilient
towards the scheduler's algorithm by not relying on a certain behavior in
the wakeup path.
Further discussion on additional commits to make EEVDF work correctly can 
be found here: 
https://lore.kernel.org/lkml/20240408090639.gd21...@noisy.programming.kicks-ass.net/T/
So far these patches do not fix the degradation.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
s64 lag, limit;
 
SCHED_WARN_ON(!se->on_rq);
-   lag = avg_vruntime(cfs_rq) - se->vruntime;
+   lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
 
limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
se->vlag = clamp(lag, -limit, limit);




Re: [PATCH v2 2/2] LoongArch: Add steal time support in guest side

2024-04-30 Thread maobibo




On 2024/4/30 at 4:23 PM, Huacai Chen wrote:

Hi, Bibo,

On Tue, Apr 30, 2024 at 9:45 AM Bibo Mao  wrote:


Percpu struct kvm_steal_time is added here, its size is 64 bytes and
also defined as 64 bytes, so that the whole structure is in one physical
page.

When vcpu is onlined, function pv_enable_steal_time() is called. This
function will pass guest physical address of struct kvm_steal_time and
tells hypervisor to enable steal time. When vcpu is offline, physical
address is set as 0 and tells hypervisor to disable steal time.

Here is output of vmstat on guest when there is workload on both host
and guest. It includes steal time stat information.

procs ---memory-- -io -system-- --cpu-
  r  b   swpd   free  inact active   bibo   in   cs us sy id wa st
15  1  0 7583616 184112  72208200  162   52 31  6 43  0 20
17  0  0 7583616 184704  721920 0 6318 6885  5 60  8  5 22
16  0  0 7583616 185392  721440 0 1766 1081  0 49  0  1 50
16  0  0 7583616 184816  723040 0 6300 6166  4 62 12  2 20
18  0  0 7583632 184480  722400 0 2814 1754  2 58  4  1 35

Signed-off-by: Bibo Mao 
---
  arch/loongarch/Kconfig|  11 +++
  arch/loongarch/include/asm/paravirt.h |   5 +
  arch/loongarch/kernel/paravirt.c  | 131 ++
  arch/loongarch/kernel/time.c  |   2 +
  4 files changed, 149 insertions(+)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 0a1540a8853e..f3a03c33a052 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -592,6 +592,17 @@ config PARAVIRT
   over full virtualization.  However, when run without a hypervisor
   the kernel is theoretically slower and slightly larger.

+config PARAVIRT_TIME_ACCOUNTING
+   bool "Paravirtual steal time accounting"
+   select PARAVIRT
+   help
+ Select this option to enable fine granularity task steal time
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+

Can we use a hidden selection manner, which means:

config PARAVIRT_TIME_ACCOUNTING
def_bool PARAVIRT

Because I think we needn't give too many choices to users (and bring
much more effort to test).

PowerPC even hide all the PARAVIRT config options...
see arch/powerpc/platforms/pseries/Kconfig


I do not know either :(  It is just being used at the beginning; maybe there is
a negative effect or potential issue, so I just think that it can be selected
by the user for easier debugging. After it has matured, the hidden selection
manner can be used.


Regards
Bibo Mao


Huacai


  config ARCH_SUPPORTS_KEXEC
 def_bool y

diff --git a/arch/loongarch/include/asm/paravirt.h b/arch/loongarch/include/asm/paravirt.h
index 58f7b7b89f2c..fe27fb5e82b8 100644
--- a/arch/loongarch/include/asm/paravirt.h
+++ b/arch/loongarch/include/asm/paravirt.h
@@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu)
  }

  int pv_ipi_init(void);
+int __init pv_time_init(void);
  #else
  static inline int pv_ipi_init(void)
  {
 return 0;
  }

+static inline int pv_time_init(void)
+{
+   return 0;
+}
  #endif // CONFIG_PARAVIRT
  #endif
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index 9044ed62045c..3f83afc7e2d2 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -5,10 +5,13 @@
  #include 
  #include 
  #include 
+#include 
  #include 

  struct static_key paravirt_steal_enabled;
  struct static_key paravirt_steal_rq_enabled;
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock;

  static u64 native_steal_clock(int cpu)
  {
@@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu)

  DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);

+static bool steal_acc = true;
+static int __init parse_no_stealacc(char *arg)
+{
+   steal_acc = false;
+   return 0;
+}
+early_param("no-steal-acc", parse_no_stealacc);
+
+static u64 para_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+   src = &per_cpu(steal_time, cpu);
+   do {
+
+   version = src->version;
+   /* Make sure that the version is read before the steal */
+   virt_rmb();
+   steal = src->steal;
+   /* Make sure that the steal is read before the next version */
+   virt_rmb();
+
+   } while ((version & 1) || (version != src->version));
+   return steal;
+}
+
+static int pv_enable_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st;
+   unsigned long addr;
+
+   if (!has_steal_clock)
+   return -EPERM;
+
+   st = &per_cpu(steal_time, cpu);
+   addr = 

[syzbot] [net?] [virt?] [kvm?] KASAN: slab-use-after-free Read in vhost_task_fn

2024-04-30 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:bb7a2467e6be Add linux-next specific files for 20240426
git tree:   linux-next
console+strace: https://syzkaller.appspot.com/x/log.txt?x=123bf96b18
kernel config:  https://syzkaller.appspot.com/x/.config?x=5c6a0288262dd108
dashboard link: https://syzkaller.appspot.com/bug?extid=98edc2df894917b3431f
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11c8a4ef18
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16c3002898

Downloadable assets:
disk image: 
https://storage.googleapis.com/syzbot-assets/5175af7dda64/disk-bb7a2467.raw.xz
vmlinux: 
https://storage.googleapis.com/syzbot-assets/70db0462e868/vmlinux-bb7a2467.xz
kernel image: 
https://storage.googleapis.com/syzbot-assets/3217fb825698/bzImage-bb7a2467.xz

The issue was bisected to:

commit a3df30984f4faf82d63d2a96f8ac773403ce935d
Author: Mike Christie 
Date:   Sat Mar 16 00:47:06 2024 +

vhost_task: Handle SIGKILL by flushing work and exiting

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1442391718
final oops: https://syzkaller.appspot.com/x/report.txt?x=1642391718
console output: https://syzkaller.appspot.com/x/log.txt?x=1242391718

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+98edc2df894917b34...@syzkaller.appspotmail.com
Fixes: a3df30984f4f ("vhost_task: Handle SIGKILL by flushing work and exiting")

==
BUG: KASAN: slab-use-after-free in instrument_atomic_read 
include/linux/instrumented.h:68 [inline]
BUG: KASAN: slab-use-after-free in atomic_long_read 
include/linux/atomic/atomic-instrumented.h:3188 [inline]
BUG: KASAN: slab-use-after-free in __mutex_unlock_slowpath+0xef/0x750 
kernel/locking/mutex.c:921
Read of size 8 at addr 88802a9d9080 by task vhost-5103/5104

CPU: 1 PID: 5104 Comm: vhost-5103 Not tainted 6.9.0-rc5-next-20240426-syzkaller 
#0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
03/27/2024
Call Trace:
 
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
 instrument_atomic_read include/linux/instrumented.h:68 [inline]
 atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline]
 __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921
 vhost_task_fn+0x3bc/0x3f0 kernel/vhost_task.c:65
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 

Allocated by task 5103:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
 __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
 kasan_kmalloc include/linux/kasan.h:211 [inline]
 kmalloc_trace_noprof+0x19c/0x2b0 mm/slub.c:4146
 kmalloc_noprof include/linux/slab.h:660 [inline]
 kzalloc_noprof include/linux/slab.h:778 [inline]
 vhost_task_create+0x149/0x300 kernel/vhost_task.c:134
 vhost_worker_create+0x17b/0x3f0 drivers/vhost/vhost.c:667
 vhost_dev_set_owner+0x563/0x940 drivers/vhost/vhost.c:945
 vhost_dev_ioctl+0xda/0xda0 drivers/vhost/vhost.c:2108
 vhost_vsock_dev_ioctl+0x2bb/0xfa0 drivers/vhost/vsock.c:875
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:907 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5103:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
 kasan_slab_free include/linux/kasan.h:184 [inline]
 slab_free_hook mm/slub.c:2190 [inline]
 slab_free mm/slub.c:4430 [inline]
 kfree+0x149/0x350 mm/slub.c:4551
 vhost_worker_destroy drivers/vhost/vhost.c:629 [inline]
 vhost_workers_free drivers/vhost/vhost.c:648 [inline]
 vhost_dev_cleanup+0x9b0/0xba0 drivers/vhost/vhost.c:1051
 vhost_vsock_dev_release+0x3aa/0x410 drivers/vhost/vsock.c:751
 __fput+0x406/0x8b0 fs/file_table.c:422
 __do_sys_close fs/open.c:1555 [inline]
 __se_sys_close fs/open.c:1540 [inline]
 __x64_sys_close+0x7f/0x110 fs/open.c:1540
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at 88802a9d9000
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 128 bytes inside of
 freed 512-byte region 

Re: [PATCH v2 2/2] LoongArch: Add steal time support in guest side

2024-04-30 Thread Huacai Chen
Hi, Bibo,

On Tue, Apr 30, 2024 at 9:45 AM Bibo Mao  wrote:
>
> Percpu struct kvm_steal_time is added here, its size is 64 bytes and
> also defined as 64 bytes, so that the whole structure is in one physical
> page.
>
> When vcpu is onlined, function pv_enable_steal_time() is called. This
> function will pass guest physical address of struct kvm_steal_time and
> tells hypervisor to enable steal time. When vcpu is offline, physical
> address is set as 0 and tells hypervisor to disable steal time.
>
> Here is output of vmstat on guest when there is workload on both host
> and guest. It includes steal time stat information.
>
> procs ---memory-- -io -system-- --cpu-
>  r  b   swpd   free  inact active   bibo   in   cs us sy id wa st
> 15  1  0 7583616 184112  72208200  162   52 31  6 43  0 20
> 17  0  0 7583616 184704  721920 0 6318 6885  5 60  8  5 22
> 16  0  0 7583616 185392  721440 0 1766 1081  0 49  0  1 50
> 16  0  0 7583616 184816  723040 0 6300 6166  4 62 12  2 20
> 18  0  0 7583632 184480  722400 0 2814 1754  2 58  4  1 35
>
> Signed-off-by: Bibo Mao 
> ---
>  arch/loongarch/Kconfig|  11 +++
>  arch/loongarch/include/asm/paravirt.h |   5 +
>  arch/loongarch/kernel/paravirt.c  | 131 ++
>  arch/loongarch/kernel/time.c  |   2 +
>  4 files changed, 149 insertions(+)
>
> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
> index 0a1540a8853e..f3a03c33a052 100644
> --- a/arch/loongarch/Kconfig
> +++ b/arch/loongarch/Kconfig
> @@ -592,6 +592,17 @@ config PARAVIRT
>   over full virtualization.  However, when run without a hypervisor
>   the kernel is theoretically slower and slightly larger.
>
> +config PARAVIRT_TIME_ACCOUNTING
> +   bool "Paravirtual steal time accounting"
> +   select PARAVIRT
> +   help
> + Select this option to enable fine granularity task steal time
> + accounting. Time spent executing other tasks in parallel with
> + the current vCPU is discounted from the vCPU power. To account for
> + that, there can be a small performance impact.
> +
> + If in doubt, say N here.
> +
Can we use a hidden selection manner, which means:

config PARAVIRT_TIME_ACCOUNTING
   def_bool PARAVIRT

Because I think we needn't give too many choices to users (and bring
much more effort to test).

PowerPC even hide all the PARAVIRT config options...
see arch/powerpc/platforms/pseries/Kconfig

Huacai

>  config ARCH_SUPPORTS_KEXEC
> def_bool y
>
> diff --git a/arch/loongarch/include/asm/paravirt.h b/arch/loongarch/include/asm/paravirt.h
> index 58f7b7b89f2c..fe27fb5e82b8 100644
> --- a/arch/loongarch/include/asm/paravirt.h
> +++ b/arch/loongarch/include/asm/paravirt.h
> @@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu)
>  }
>
>  int pv_ipi_init(void);
> +int __init pv_time_init(void);
>  #else
>  static inline int pv_ipi_init(void)
>  {
> return 0;
>  }
>
> +static inline int pv_time_init(void)
> +{
> +   return 0;
> +}
>  #endif // CONFIG_PARAVIRT
>  #endif
> diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
> index 9044ed62045c..3f83afc7e2d2 100644
> --- a/arch/loongarch/kernel/paravirt.c
> +++ b/arch/loongarch/kernel/paravirt.c
> @@ -5,10 +5,13 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  struct static_key paravirt_steal_enabled;
>  struct static_key paravirt_steal_rq_enabled;
> +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
> +static int has_steal_clock;
>
>  static u64 native_steal_clock(int cpu)
>  {
> @@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu)
>
>  DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
>
> +static bool steal_acc = true;
> +static int __init parse_no_stealacc(char *arg)
> +{
> +   steal_acc = false;
> +   return 0;
> +}
> +early_param("no-steal-acc", parse_no_stealacc);
> +
> +static u64 para_steal_clock(int cpu)
> +{
> +   u64 steal;
> +   struct kvm_steal_time *src;
> +   int version;
> +
> +   src = &per_cpu(steal_time, cpu);
> +   do {
> +
> +   version = src->version;
> +   /* Make sure that the version is read before the steal */
> +   virt_rmb();
> +   steal = src->steal;
> +   /* Make sure that the steal is read before the next version */
> +   virt_rmb();
> +
> +   } while ((version & 1) || (version != src->version));
> +   return steal;
> +}
> +
> +static int pv_enable_steal_time(void)
> +{
> +   int cpu = smp_processor_id();
> +   struct kvm_steal_time *st;
> +   unsigned long addr;
> +
> +   if (!has_steal_clock)
> +   return -EPERM;
> +
> +   st = &per_cpu(steal_time, cpu);
> +   addr = per_cpu_ptr_to_phys(st);
> +
> +   /* The whole structure kvm_steal_time should be one 

Re: [PATCH v4 4/4] remoteproc: mediatek: Add IMGSYS IPI command

2024-04-30 Thread AngeloGioacchino Del Regno

On 30/04/24 03:15, Olivia Wen wrote:

Add an IPI command definition for communication with IMGSYS through
SCP mailbox.

Signed-off-by: Olivia Wen 


Reviewed-by: AngeloGioacchino Del Regno 






[syzbot] Monthly trace report (Apr 2024)

2024-04-30 Thread syzbot
Hello trace maintainers/developers,

This is a 31-day syzbot report for the trace subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/trace

During the period, 3 new issues were detected and 2 were fixed.
In total, 9 issues are still open and 34 have been fixed so far.

Some of the still happening issues:

Ref Crashes Repro Title
<1> 471 Yes   WARNING in format_decode (3)
  https://syzkaller.appspot.com/bug?extid=e2c932aec5c8a6e1d31c
<2> 26  Yes   INFO: task hung in blk_trace_ioctl (4)
  https://syzkaller.appspot.com/bug?extid=ed812ed461471ab17a0c
<3> 26  Yes   WARNING in blk_register_tracepoints
  https://syzkaller.appspot.com/bug?extid=c54ded83396afee31eb1
<4> 13  Nopossible deadlock in __send_signal_locked
  https://syzkaller.appspot.com/bug?extid=6e3b6eab5bd4ed584a38
<5> 7   Yes   WARNING in get_probe_ref
  https://syzkaller.appspot.com/bug?extid=8672dcb9d10011c0a160

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

To disable reminders for individual bugs, reply with the following command:
#syz set  no-reminders

To change bug's subsystems, reply with:
#syz set  subsystems: new-subsystem

You may send multiple commands in a single email message.



Re: [PATCH v9 1/3] soc: qcom: Add qcom_rproc_minidump module

2024-04-30 Thread Mukesh Ojha

Gentle ping..

-Mukesh

On 3/26/2024 7:43 PM, Mukesh Ojha wrote:

Add the qcom_rproc_minidump module in preparation for removing
the minidump-specific code from drivers/remoteproc/qcom_common.c
and providing the needed exported API; this also helps to
abstract the minidump-specific data layout from Qualcomm's
remoteproc driver.

It is just a copy of the qcom_minidump() functionality from
drivers/remoteproc/qcom_common.c into a separate file under
qcom_rproc_minidump().

Signed-off-by: Mukesh Ojha 
---
Changes in v9:
  - Added the source file drivers/remoteproc/qcom_common.c copyright
    to qcom_rproc_minidump.c
  - Dissociated it from the minidump series, as this can go separately
    and the minidump series can add a dependency on it for the data structure files.

Not much has changed in these three patches from the previous version;
however, links to their older versions are given below.

v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/
v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/
v6: 
https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/
v5: 
https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/
v4: 
https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/

  drivers/soc/qcom/Kconfig  |  10 +++
  drivers/soc/qcom/Makefile |   1 +
  drivers/soc/qcom/qcom_minidump_internal.h |  64 +
  drivers/soc/qcom/qcom_rproc_minidump.c| 115 ++
  include/soc/qcom/qcom_minidump.h  |  23 ++
  5 files changed, 213 insertions(+)
  create mode 100644 drivers/soc/qcom/qcom_minidump_internal.h
  create mode 100644 drivers/soc/qcom/qcom_rproc_minidump.c
  create mode 100644 include/soc/qcom/qcom_minidump.h

diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
index 5af33b0e3470..ed23e0275c22 100644
--- a/drivers/soc/qcom/Kconfig
+++ b/drivers/soc/qcom/Kconfig
@@ -277,4 +277,14 @@ config QCOM_PBS
  This module provides the APIs to the client drivers that wants to send the
  PBS trigger event to the PBS RAM.
 
+config QCOM_RPROC_MINIDUMP
+   tristate "QCOM Remoteproc Minidump Support"
+   depends on ARCH_QCOM || COMPILE_TEST
+   depends on QCOM_SMEM
+   help
+ Enablement of core Minidump feature is controlled from boot firmware
+ side, so if it is enabled from firmware, this config allow Linux to
+ query predefined Minidump segments associated with the remote processor
+ and check its validity and end up collecting the dump on remote processor
+ crash during its recovery.
  endmenu
diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
index ca0bece0dfff..44664589263d 100644
--- a/drivers/soc/qcom/Makefile
+++ b/drivers/soc/qcom/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_QCOM_ICC_BWMON)  += icc-bwmon.o
  qcom_ice-objs += ice.o
  obj-$(CONFIG_QCOM_INLINE_CRYPTO_ENGINE)   += qcom_ice.o
  obj-$(CONFIG_QCOM_PBS) += qcom-pbs.o
+obj-$(CONFIG_QCOM_RPROC_MINIDUMP)  += qcom_rproc_minidump.o
diff --git a/drivers/soc/qcom/qcom_minidump_internal.h b/drivers/soc/qcom/qcom_minidump_internal.h
new file mode 100644
index ..71709235b196
--- /dev/null
+++ b/drivers/soc/qcom/qcom_minidump_internal.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#ifndef _QCOM_MINIDUMP_INTERNAL_H_
+#define _QCOM_MINIDUMP_INTERNAL_H_
+
+#define MAX_NUM_OF_SS   10
+#define MAX_REGION_NAME_LENGTH  16
+#define SBL_MINIDUMP_SMEM_ID   602
+#define MINIDUMP_REGION_VALID ('V' << 24 | 'A' << 16 | 'L' << 8 | 'I' << 0)
+#define MINIDUMP_SS_ENCR_DONE ('D' << 24 | 'O' << 16 | 'N' << 8 | 'E' << 0)
+#define MINIDUMP_SS_ENABLED   ('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0)
+
+/**
+ * struct minidump_region - Minidump region
+ * @name   : Name of the region to be dumped
+ * @seq_num:   : Use to differentiate regions with same name.
+ * @valid  : This entry to be dumped (if set to 1)
+ * @address: Physical address of region to be dumped
+ * @size   : Size of the region
+ */
+struct minidump_region {
+   charname[MAX_REGION_NAME_LENGTH];
+   __le32  seq_num;
+   __le32  valid;
+   __le64  address;
+   __le64  size;
+};
+
+/**
+ * struct minidump_subsystem - Subsystem's SMEM Table of content
+ * @status : Subsystem toc init status
+ * @enabled : if set to 1, this region would be copied during coredump
+ * @encryption_status: Encryption status for this subsystem
+ * @encryption_required : Decides to encrypt the subsystem regions or not
+ * @region_count : Number of regions added in this subsystem toc
+ * @regions_baseptr : regions base pointer of the subsystem
+ */
+struct minidump_subsystem {
+   __le32  status;
+   __le32  enabled;
+