Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn
On Tue, Apr 30, 2024 at 08:01:11PM -0500, Mike Christie wrote:
> On 4/30/24 7:15 PM, Hillf Danton wrote:
> > On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> >> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >>> static int vhost_task_fn(void *data)
> >>> {
> >>> 	struct vhost_task *vtsk = data;
> >>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> >>> 		schedule();
> >>> 	}
> >>>
> >>> -	mutex_lock(&vtsk->exit_mutex);
> >>> +	mutex_lock(&exit_mutex);
> >>> 	/*
> >>> 	 * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >>> 	 * When the vhost layer has called vhost_task_stop it's already stopped
> >>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> >>> 		vtsk->handle_sigkill(vtsk->data);
> >>> 	}
> >>> 	complete(&vtsk->exited);
> >>> -	mutex_unlock(&vtsk->exit_mutex);
> >>> +	mutex_unlock(&exit_mutex);
> >>>
> >>
> >> Edward, thanks for the patch. I think though I just needed to swap the
> >> order of the calls above.
> >>
> >> Instead of:
> >>
> >> 	complete(&vtsk->exited);
> >> 	mutex_unlock(&vtsk->exit_mutex);
> >>
> >> it should have been:
> >>
> >> 	mutex_unlock(&vtsk->exit_mutex);
> >> 	complete(&vtsk->exited);
> >
> > JFYI Edward did it [1]
> >
> > [1] https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
>
> Thanks.
>
> I tested the code with that change and it no longer triggers the UAF.

Weird, but syzkaller said that yes, it triggers. Compare
dcc0ca06174e6...@google.com, which tests the order

	mutex_unlock(&vtsk->exit_mutex);
	complete(&vtsk->exited);

that you like and says it triggers, with 97bda90617521...@google.com,
which says it does not trigger. Whatever you do, please send it to
syzkaller in the original thread, and when you post please include the
syzkaller report. Given this gets confusing I'm fine with just a fixup
patch; note in the commit log where I should squash it.

> I've fixed up the original patch that had the bug and am going to
> resubmit the patchset like how Michael requested.
Re: [PATCH V1] soc: qcom: smp2p: Introduce tracepoint support
On Tue, Apr 30, 2024 at 04:18:27PM -0700, Chris Lew wrote:
> On 4/29/2024 12:55 AM, Sudeepgoud Patil wrote:
> > Introduce tracepoint support for smp2p providing useful logging
> > for communication between clients.
> >
>
> Let's add some more description as to why these tracepoints are useful. Do
> they help us track latency? debugging information for us? for clients?

+1

> > diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
[..]
> > diff --git a/drivers/soc/qcom/smp2p.c b/drivers/soc/qcom/smp2p.c
> > index a21241cbeec7..dde8513641ae 100644
> > --- a/drivers/soc/qcom/smp2p.c
> > +++ b/drivers/soc/qcom/smp2p.c
> > @@ -20,6 +20,9 @@
> >  #include
> >  #include
> >
> > +#define CREATE_TRACE_POINTS
> > +#include "trace-smp2p.h"
> > +
> >  /*
> >   * The Shared Memory Point to Point (SMP2P) protocol facilitates communication
> >   * of a single 32-bit value between two processors. Each value has a single
> > @@ -191,6 +194,7 @@ static void qcom_smp2p_do_ssr_ack(struct qcom_smp2p *smp2p)
> >  	struct smp2p_smem_item *out = smp2p->out;
> >  	u32 val;
> >
> > +	trace_smp2p_ssr_ack(smp2p->remote_pid);
> >  	smp2p->ssr_ack = !smp2p->ssr_ack;
> >
> >  	val = out->flags & ~BIT(SMP2P_FLAGS_RESTART_ACK_BIT);
> > @@ -213,6 +217,7 @@ static void qcom_smp2p_negotiate(struct qcom_smp2p *smp2p)
> >  			smp2p->ssr_ack_enabled = true;
> >
> >  		smp2p->negotiation_done = true;
> > +		trace_smp2p_negotiate(smp2p->remote_pid, smp2p->ssr_ack_enabled);
>
> since this tracepoint is for negotiating, maybe we should capture all of the
> features (out->features) instead of just the ssr_ack feature.
>

Perhaps we can use __print_flags() in TP_printk() for that, it will attempt
to resolve the bits and if it fails include the numeric value.

> >  	}
> >  }
[..]
> > diff --git a/drivers/soc/qcom/trace-smp2p.h b/drivers/soc/qcom/trace-smp2p.h
[..]
> > +TRACE_EVENT(smp2p_ssr_ack,
> > +	TP_PROTO(unsigned int remote_pid),
>
> Is there any way we can map the remote pid's to a string? I feel like the
> tracepoints would be more useful if they called out modem, adsp, etc instead
> of an integer value.
>

And __print_symbolic() for this one.

Regards,
Bjorn
Re: [PATCH net-next v8] net/ipv4: add tracepoint for icmp_send
On Mon, 29 Apr 2024 17:15:57 +0800 (CST) xu.xi...@zte.com.cn wrote:
> Introduce a tracepoint for icmp_send, which can help users to get more
> detailed information conveniently when icmp abnormal events happen.

Rebase please, it doesn't apply. And please put the change log under
the --- delimiter.
--
pw-bot: cr
Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn
On 4/30/24 7:15 PM, Hillf Danton wrote:
> On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
>> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
>>> static int vhost_task_fn(void *data)
>>> {
>>> 	struct vhost_task *vtsk = data;
>>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
>>> 		schedule();
>>> 	}
>>>
>>> -	mutex_lock(&vtsk->exit_mutex);
>>> +	mutex_lock(&exit_mutex);
>>> 	/*
>>> 	 * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
>>> 	 * When the vhost layer has called vhost_task_stop it's already stopped
>>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
>>> 		vtsk->handle_sigkill(vtsk->data);
>>> 	}
>>> 	complete(&vtsk->exited);
>>> -	mutex_unlock(&vtsk->exit_mutex);
>>> +	mutex_unlock(&exit_mutex);
>>>
>>
>> Edward, thanks for the patch. I think though I just needed to swap the
>> order of the calls above.
>>
>> Instead of:
>>
>> 	complete(&vtsk->exited);
>> 	mutex_unlock(&vtsk->exit_mutex);
>>
>> it should have been:
>>
>> 	mutex_unlock(&vtsk->exit_mutex);
>> 	complete(&vtsk->exited);
>
> JFYI Edward did it [1]
>
> [1] https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

Thanks.

I tested the code with that change and it no longer triggers the UAF.

I've fixed up the original patch that had the bug and am going to
resubmit the patchset like how Michael requested.
Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn
On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> > static int vhost_task_fn(void *data)
> > {
> > 	struct vhost_task *vtsk = data;
> > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > 		schedule();
> > 	}
> >
> > -	mutex_lock(&vtsk->exit_mutex);
> > +	mutex_lock(&exit_mutex);
> > 	/*
> > 	 * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> > 	 * When the vhost layer has called vhost_task_stop it's already stopped
> > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > 		vtsk->handle_sigkill(vtsk->data);
> > 	}
> > 	complete(&vtsk->exited);
> > -	mutex_unlock(&vtsk->exit_mutex);
> > +	mutex_unlock(&exit_mutex);
> >
>
> Edward, thanks for the patch. I think though I just needed to swap the
> order of the calls above.
>
> Instead of:
>
> 	complete(&vtsk->exited);
> 	mutex_unlock(&vtsk->exit_mutex);
>
> it should have been:
>
> 	mutex_unlock(&vtsk->exit_mutex);
> 	complete(&vtsk->exited);

JFYI Edward did it [1]

[1] https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

>
> If my analysis is correct, then Michael do you want me to resubmit a
> patch on top of your vhost branch or resubmit the entire patchset?
Re: [PATCH V1] soc: qcom: smp2p: Introduce tracepoint support
On 4/29/2024 12:55 AM, Sudeepgoud Patil wrote:
> Introduce tracepoint support for smp2p providing useful logging
> for communication between clients.
>

Let's add some more description as to why these tracepoints are useful. Do
they help us track latency? debugging information for us? for clients?

> Signed-off-by: Sudeepgoud Patil
> Signed-off-by: Deepak Kumar Singh

As Elliot mentioned in the internal review, your Signed-off-by should be
last. I would suggest removing Deepak's Signed-off-by and letting him reply
with Reviewed-by since he was the main reviewer internally.

> ---
>  drivers/soc/qcom/Makefile      |  1 +
>  drivers/soc/qcom/smp2p.c       | 10 ++++
>  drivers/soc/qcom/trace-smp2p.h | 99 ++++++++++++++++++++++++++++++++++
>  3 files changed, 110 insertions(+)
>  create mode 100644 drivers/soc/qcom/trace-smp2p.h
>
> diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
> index ca0bece0dfff..30c1bf645501 100644
> --- a/drivers/soc/qcom/Makefile
> +++ b/drivers/soc/qcom/Makefile
> @@ -23,6 +23,7 @@ qcom_rpmh-y += rpmh.o
>  obj-$(CONFIG_QCOM_SMD_RPM)	+= rpm-proc.o smd-rpm.o
>  obj-$(CONFIG_QCOM_SMEM)		+= smem.o
>  obj-$(CONFIG_QCOM_SMEM_STATE)	+= smem_state.o
> +CFLAGS_smp2p.o := -I$(src)
>  obj-$(CONFIG_QCOM_SMP2P)	+= smp2p.o
>  obj-$(CONFIG_QCOM_SMSM)		+= smsm.o
>  obj-$(CONFIG_QCOM_SOCINFO)	+= socinfo.o
> diff --git a/drivers/soc/qcom/smp2p.c b/drivers/soc/qcom/smp2p.c
> index a21241cbeec7..dde8513641ae 100644
> --- a/drivers/soc/qcom/smp2p.c
> +++ b/drivers/soc/qcom/smp2p.c
> @@ -20,6 +20,9 @@
>  #include
>  #include
>
> +#define CREATE_TRACE_POINTS
> +#include "trace-smp2p.h"
> +
>  /*
>   * The Shared Memory Point to Point (SMP2P) protocol facilitates communication
>   * of a single 32-bit value between two processors. Each value has a single
> @@ -191,6 +194,7 @@ static void qcom_smp2p_do_ssr_ack(struct qcom_smp2p *smp2p)
>  	struct smp2p_smem_item *out = smp2p->out;
>  	u32 val;
>
> +	trace_smp2p_ssr_ack(smp2p->remote_pid);
>  	smp2p->ssr_ack = !smp2p->ssr_ack;
>
>  	val = out->flags & ~BIT(SMP2P_FLAGS_RESTART_ACK_BIT);
> @@ -213,6 +217,7 @@ static void qcom_smp2p_negotiate(struct qcom_smp2p *smp2p)
>  			smp2p->ssr_ack_enabled = true;
>
>  		smp2p->negotiation_done = true;
> +		trace_smp2p_negotiate(smp2p->remote_pid, smp2p->ssr_ack_enabled);

since this tracepoint is for negotiating, maybe we should capture all of the
features (out->features) instead of just the ssr_ack feature.

>  	}
>  }
>
> @@ -251,6 +256,8 @@ static void qcom_smp2p_notify_in(struct qcom_smp2p *smp2p)
>  		status = val ^ entry->last_value;
>  		entry->last_value = val;
>
> +		trace_smp2p_notify_in(smp2p->remote_pid, entry->name, status, val);
> +
>  		/* No changes of this entry? */
>  		if (!status)
>  			continue;
> @@ -406,6 +413,9 @@ static int smp2p_update_bits(void *data, u32 mask, u32 value)
>  		writel(val, entry->value);
>  	spin_unlock_irqrestore(&entry->lock, flags);
>
> +	trace_smp2p_update_bits(entry->smp2p->remote_pid,
> +				entry->name, orig, val);
> +
>  	if (val != orig)
>  		qcom_smp2p_kick(entry->smp2p);
>
> diff --git a/drivers/soc/qcom/trace-smp2p.h b/drivers/soc/qcom/trace-smp2p.h
> new file mode 100644
> index ..c61afab23f0c
> --- /dev/null
> +++ b/drivers/soc/qcom/trace-smp2p.h
> @@ -0,0 +1,99 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
> + */
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM qcom_smp2p
> +
> +#if !defined(__QCOM_SMP2P_TRACE_H__) || defined(TRACE_HEADER_MULTI_READ)
> +#define __QCOM_SMP2P_TRACE_H__
> +
> +#include
> +
> +TRACE_EVENT(smp2p_ssr_ack,
> +	TP_PROTO(unsigned int remote_pid),

Is there any way we can map the remote pid's to a string? I feel like the
tracepoints would be more useful if they called out modem, adsp, etc instead
of an integer value.

> +	TP_ARGS(remote_pid),
> +	TP_STRUCT__entry(
> +		__field(u32, remote_pid)
> +	),
> +	TP_fast_assign(
> +		__entry->remote_pid = remote_pid;
> +	),
> +	TP_printk("%d: SSR detected, doing SSR Handshake",
> +		__entry->remote_pid
> +	)
> +);
> +
> +TRACE_EVENT(smp2p_negotiate,
> +	TP_PROTO(unsigned int remote_pid, bool ssr_ack_enabled),
> +	TP_ARGS(remote_pid, ssr_ack_enabled),
> +	TP_STRUCT__entry(
> +		__field(u32, remote_pid)
> +		__field(bool, ssr_ack_enabled)
> +	),
> +	TP_fast_assign(
> +		__entry->remote_pid = remote_pid;
> +		__entry->ssr_ack_enabled = ssr_ack_enabled;
> +	),
> +	TP_printk("%d: state=open ssr_ack=%d",
> +		__entry->remote_pid,
> +		__entry->ssr_ack_enabled
> +	)
> +);
> +
> +TRACE_EVENT(smp2p_notify_in,
> +	TP_PROTO(unsigned int remote_pid, const char *name, unsigned long status, u32 val),
> +	TP_ARGS(remote_pid, name, status, val),
> +
Re: [PATCH] virtio_net: Warn if insufficient queue length for transmitting
On Tue, Apr 30, 2024 at 03:35:09PM -0400, Darius Rad wrote:
> The transmit queue is stopped when the number of free queue entries is less
> than 2+MAX_SKB_FRAGS, in start_xmit(). If the queue length (QUEUE_NUM_MAX)
> is less than this, transmission will immediately trigger a netdev
> watchdog timeout. Report this condition earlier and more directly.
>
> Signed-off-by: Darius Rad
> ---
>  drivers/net/virtio_net.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 115c3c5414f2..72ee8473b61c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4917,6 +4917,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		set_bit(guest_offloads[i], &vi->guest_offloads);
>  	vi->guest_offloads_capable = vi->guest_offloads;
>
> +	if (virtqueue_get_vring_size(vi->sq->vq) < 2 + MAX_SKB_FRAGS)
> +		netdev_warn_once(dev, "not enough queue entries, expect xmit timeout\n");
> +

How about actually fixing it though? E.g. by linearizing...

It also bothers me that there's practically /proc/sys/net/core/max_skb_frags
and if that's low then things could actually work.

Finally, while originally it was just 17 typically, now it's configurable.
So it's possible that you change the config to make big tcp work better
and the device stops working while it worked fine previously.

>  	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
>  		 dev->name, max_queue_pairs);
>
> --
> 2.39.2
[PATCH v13 13/14] Docs/x86/sgx: Add description for cgroup support
From: Sean Christopherson Add initial documentation of how to regulate the distribution of SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup controller. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Bagas Sanjaya Acked-by: Kai Huang Tested-by: Mikko Ylinen Tested-by: Jarkko Sakkinen --- V8: - Limit text width to 80 characters to be consistent. V6: - Remove mentioning of VMM specific behavior on handling SIGBUS - Remove statement of forced reclamation, add statement to specify ENOMEM returned when no reclamation possible. - Added statements on the non-preemptive nature for the max limit - Dropped Reviewed-by tag because of changes V4: - Fix indentation (Randy) - Change misc.events file to be read-only - Fix a typo for 'subsystem' - Add behavior when VMM overcommit EPC with a cgroup (Mikko) --- Documentation/arch/x86/sgx.rst | 83 ++ 1 file changed, 83 insertions(+) diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst index d90796adc2ec..c537e6a9aa65 100644 --- a/Documentation/arch/x86/sgx.rst +++ b/Documentation/arch/x86/sgx.rst @@ -300,3 +300,86 @@ to expected failures and handle them as follows: first call. It indicates a bug in the kernel or the userspace client if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has a return code other than 0. + + +Cgroup Support +== + +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates +distribution of SGX EPC memory, which is a subset of system RAM that is used to +provide SGX-enabled applications with protected memory, and is otherwise +inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be +read/written outside of an SGX enclave. 
+ +Although current systems implement EPC by stealing memory from RAM, for all +intents and purposes the EPC is independent from normal system memory, e.g. must +be reserved at boot from RAM and cannot be converted between EPC and normal +memory while the system is running. The EPC is managed by the SGX subsystem and +is not accounted by the memory controller. Note that this is true only for EPC +memory itself, i.e. normal memory allocations related to SGX and EPC memory, +e.g. the backing memory for evicted EPC pages, are accounted, limited and +protected by the memory controller. + +Much like normal system memory, EPC memory can be overcommitted via virtual +memory techniques and pages can be swapped out of the EPC to their backing store +(normal system memory allocated via shmem). The SGX EPC subsystem is analogous +to the memory subsystem, and it implements limit and protection models for EPC +memory. + +SGX EPC Interface Files +--- + +For a generic description of the Miscellaneous controller interface files, +please see Documentation/admin-guide/cgroup-v2.rst + +All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If +a value which is not PAGE_SIZE aligned is written, the actual value used by the +controller will be rounded down to the closest PAGE_SIZE multiple. + + misc.capacity +A read-only flat-keyed file shown only in the root cgroup. The sgx_epc +resource will show the total amount of EPC memory available on the +platform. + + misc.current +A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc +resource will show the current active EPC memory usage of the cgroup and +its descendants. EPC pages that are swapped out to backing RAM are not +included in the current count. + + misc.max +A read-write single value file which exists on non-root cgroups. The +sgx_epc resource will show the EPC usage hard limit. The default is +"max". 
+ +If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for +page fault handling, will be blocked until EPC can be reclaimed from the +cgroup. If there are no pages left that are reclaimable within the same +group, the kernel returns ENOMEM. + +The EPC pages allocated for a guest VM by the virtual EPC driver are not +reclaimable by the host kernel. In case the guest cgroup's limit is +reached and no reclaimable pages left in the same cgroup, the virtual +EPC driver returns SIGBUS to the user space process to indicate failure +on new EPC allocation requests. + +The misc.max limit is non-preemptive. If a user writes a limit lower +than the current usage to this file, the cgroup will not preemptively +deallocate pages currently in use, and will only start blocking the next +allocation and reclaiming EPC at that time. + + misc.events +A read-only flat-keyed file which exists on non-root cgroups. +A value
[PATCH v13 14/14] selftests/sgx: Add scripts for EPC cgroup testing
With different cgroups, the script starts one or multiple concurrent SGX selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed test case, which loads an enclave of EPC size equal to the EPC capacity available on the platform. The script checks results against the expectation set for each cgroup and reports success or failure. The script creates 3 different cgroups at the beginning with following expectations: 1) small - intentionally small enough to fail the test loading an enclave of size equal to the capacity. 2) large - large enough to run up to 4 concurrent tests but fail some if more than 4 concurrent tests are run. The script starts 4 expecting at least one test to pass, and then starts 5 expecting at least one test to fail. 3) larger - limit is the same as the capacity, large enough to run lots of concurrent tests. The script starts 8 of them and expects all pass. Then it reruns the same test with one process randomly killed and usage checked to be zero after all processes exit. The script also includes a test with low mem_cg limit and large sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged to a proper mem_cg. For this test, it turns off swapping before start, and turns swapping back on afterwards. Add README to document how to run the tests. Signed-off-by: Haitao Huang --- V13: - More improvement on handling error cases and style fixes. - Add settings file for custom timeout V12: - Integrate the scripts to the "run_tests" target. (Jarkko) V11: - Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko) - Drop support for cgroup v1 and simplify. (Michal, Jarkko) - Add documentation for functions. (Jarkko) - Turn off swapping before memcontrol tests and back on after - Format and style fixes, name for hard coded values V7: - Added memcontrol test. V5: - Added script with automatic results checking, remove the interactive script. - The script can run independent from the series below. 
---
 tools/testing/selftests/sgx/Makefile          |   3 +-
 tools/testing/selftests/sgx/README            | 109 +++++++
 tools/testing/selftests/sgx/ash_cgexec.sh     |  16 +
 tools/testing/selftests/sgx/config            |   4 +
 .../selftests/sgx/run_epc_cg_selftests.sh     | 295 ++++++++++++++
 tools/testing/selftests/sgx/settings          |   2 +
 .../selftests/sgx/watch_misc_for_tests.sh     |  11 +
 7 files changed, 439 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/sgx/README
 create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
 create mode 100644 tools/testing/selftests/sgx/config
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100644 tools/testing/selftests/sgx/settings
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile
index 867f88ce2570..739376af9e33 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none

 ifeq ($(CAN_BUILD_X86_64), 1)
 TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
-TEST_FILES := $(OUTPUT)/test_encl.elf
+TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh
+TEST_PROGS := run_epc_cg_selftests.sh

 all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
 endif
diff --git a/tools/testing/selftests/sgx/README b/tools/testing/selftests/sgx/README
new file mode 100644
index ..f84406bf29a4
--- /dev/null
+++ b/tools/testing/selftests/sgx/README
@@ -0,0 +1,109 @@
+SGX selftests
+
+The SGX selftests include a C program (test_sgx) that covers basic user space
+facing APIs and a shell script (run_sgx_cg_selftests.sh) testing the SGX misc
+cgroup. The SGX cgroup test script requires root privileges and runs a
+specific test case of test_sgx in different cgroups configured by the
+script. More details about the cgroup test can be found below.
+
+All SGX selftests can run with or without the kselftest framework.
+
+WITH KSELFTEST FRAMEWORK
+========================
+
+BUILD
+-----
+
+Build executable file "test_sgx" from top level directory of the kernel source:
+ $ make -C tools/testing/selftests TARGETS=sgx
+
+RUN
+---
+
+Run all SGX tests as sudo or root since the cgroup tests need to configure
+cgroup limits in files under /sys/fs/cgroup.
+
+ $ sudo make -C tools/testing/selftests/sgx run_tests
+
+Without sudo, SGX cgroup tests will be skipped.
+
+On platforms with a large Enclave Page Cache (EPC) and/or fewer CPU cores,
+you may need to adjust the timeout in 'settings' to avoid timeouts.
+
+More details about the kselftest framework can be found in
+Documentation/dev-tools/kselftest.rst.
+
+WITHOUT KSELFTEST FRAMEWORK
+===========================
+
+BUILD
+-----
+
+Build executable file "test_sgx" from this
+directory (tools/testing/selftests/sgx/):
+
+ $ make
+
+RUN
+---
+
+Run all non-cgroup tests:
+
+ $ ./test_sgx
+
+To test SGX cgroup:
+
+ $
[PATCH v13 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
From: Kristen Carlson Accardi

Previous patches have implemented all infrastructure needed for per-cgroup
EPC page tracking and reclaiming. But all reclaimable EPC pages are still
tracked in the global LRU as sgx_epc_page_lru() always returns reference to
the global LRU.

Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
given EPC page is allocated. This makes all EPC pages tracked in per-cgroup
LRUs and the global reclaimer (ksgxd) will not be able to reclaim any pages
from the global LRU. However, in cases of over-committing, i.e., the sum of
cgroup limits greater than the total capacity, cgroups may never reclaim
but the total usage can still be near the capacity. Therefore a global
reclamation is still needed in those cases and it should be performed from
the root cgroup. Modify sgx_reclaim_pages_global(), to reclaim from the
root EPC cgroup when cgroup is enabled, otherwise from the global LRU.
Export sgx_cgroup_reclaim_pages() in the header file so it can be reused
for this purpose.

Similarly, modify sgx_can_reclaim_global(), to check emptiness of LRUs of
all cgroups when EPC cgroup is enabled, otherwise only check the global
LRU. Export sgx_cgroup_lru_empty() so it can be reused for this purpose.

Finally, change sgx_reclaim_direct(), to check and ensure there are free
pages at cgroup level so forward progress can be made by the caller.
Export sgx_cgroup_should_reclaim() for reuse.

With these changes, the global reclamation and per-cgroup reclamation both
work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson
Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Tested-by: Mikko Ylinen
Tested-by: Jarkko Sakkinen
---
V13:
- Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC, conditional compile SGX Cgroup for
  CONFIG_CGROUP_MISC.
(Jarkko) V11: - Reword the comments for global reclamation for allocation failure after passing cgroup charging. (Kai) - Add stub functions to remove ifdefs in c file (Kai) - Add more detailed comments to clarify each page belongs to one cgroup, or the root. (Kai) V10: - Add comment to clarify each page belongs to one cgroup, or the root by default. (Kai) - Merge the changes that expose sgx_cgroup_* functions to this patch. - Add changes for sgx_reclaim_direct() that was missed previously. V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 6 ++-- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 27 arch/x86/kernel/cpu/sgx/main.c | 46 ++-- 3 files changed, 74 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index c237bc3330ee..fe2fd6034023 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -67,7 +67,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg) * * Return: %true if all cgroups under the specified root have empty LRU lists. */ -static bool sgx_cgroup_lru_empty(struct misc_cg *root) +bool sgx_cgroup_lru_empty(struct misc_cg *root) { struct cgroup_subsys_state *css_root; struct cgroup_subsys_state *pos; @@ -115,7 +115,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root) * the LRUs are recently accessed, i.e., considered "too young" to reclaim, no * page will actually be reclaimed after walking the whole tree. */ -static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm) +void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm) { struct cgroup_subsys_state *css_root; struct cgroup_subsys_state *pos; @@ -156,7 +156,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *cha * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the * cgroup. 
*/ -static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) +bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) { u64 cur, max; diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index 2044e0d64076..9d69608eadf6 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -13,6 +13,11 @@ #define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES struct sgx_cgroup; +static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg) +{ + return NULL; +} + static inline struct sgx_cgroup *sgx_get_current_cg(void) { return NULL; @@ -27,8 +32,22 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { } +static inline bool sgx_cgroup_lru_empty(struct misc_cg *root) +{ + return true; +} + +static inline bool
[PATCH v13 11/14] x86/sgx: Abstract check for global reclaimable pages
From: Kristen Carlson Accardi

For the global reclaimer to determine if any page available for
reclamation at the global level, it currently only checks for emptiness
of the global LRU. That will be inadequate when pages are tracked in
multiple LRUs, one per cgroup. For this purpose, create a new helper,
sgx_can_reclaim_global(), to abstract this check. Currently it only
checks the global LRU, later will check emptiness of LRUs of all cgroups
when per-cgroup tracking is turned on.

Replace all the checks for emptiness of the global LRU,
list_empty(&sgx_global_lru.reclaimable), with calls to
sgx_can_reclaim_global().

Rename sgx_should_reclaim() to sgx_should_reclaim_global() as it is used
to check if a global reclamation should be performed.

Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson
Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Tested-by: Jarkko Sakkinen
---
V13:
- Rename sgx_can_reclaim() to sgx_can_reclaim_global() and
  sgx_should_reclaim() to sgx_should_reclaim_global(). (Kai)

V10:
- Add comments for the new function. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f1bd15620b83..92bd3151a589 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -36,6 +36,14 @@ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc
 	return &sgx_global_lru;
 }

+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim_global(void)
+{
+	return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);

 /* Nodes with one or more EPC sections. */
@@ -387,10 +395,10 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *c
 	return cnt;
 }

-static bool sgx_should_reclaim(unsigned long watermark)
+static bool sgx_should_reclaim_global(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_global_lru.reclaimable);
+	       sgx_can_reclaim_global();
 }

 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
@@ -405,7 +413,7 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
  */
 void sgx_reclaim_direct(void)
 {
-	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
 		sgx_reclaim_pages_global(current->mm);
 }

@@ -426,9 +434,9 @@ static int ksgxd(void *p)
 		wait_event_freezable(ksgxd_waitq,
 				     kthread_should_stop() ||
-				     sgx_should_reclaim(SGX_NR_HIGH_PAGES));
+				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));

-		if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
+		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
 			/* Indirect reclaim, no mm to charge, so NULL: */
 			sgx_reclaim_pages_global(NULL);

@@ -592,7 +600,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 			break;
 		}

-		if (list_empty(&sgx_global_lru.reclaimable)) {
+		if (!sgx_can_reclaim_global()) {
 			page = ERR_PTR(-ENOMEM);
 			break;
 		}
@@ -620,7 +628,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
 		sgx_put_cg(sgx_cg);
 	}

-	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
 		wake_up(&ksgxd_waitq);

 	return page;
--
2.25.1
[PATCH v13 09/14] x86/sgx: Implement async reclamation for cgroup
From: Kristen Carlson Accardi In cases EPC pages need be allocated during a page fault and the cgroup usage is near its limit, an asynchronous reclamation needs be triggered to avoid blocking the page fault handling. Create a workqueue, corresponding work item and function definitions for EPC cgroup to support the asynchronous reclamation. In sgx_cgroup_try_charge(), if caller does not allow synchronous reclamation, queue an asynchronous work into the workqueue. Reclaiming only when the usage is at or very close to the limit would cause thrashing. To avoid that, before returning from sgx_cgroup_try_charge(), check the need for reclamation (usage too close to the limit), queue an async work if needed, similar to how the global reclaimer wakes up its reclaiming thread after each allocation in sgx_alloc_epc_pages(). Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Tested-by: Jarkko Sakkinen --- V13: - Revert to BUG_ON() in case of workq allocation failure in init and only alloc if misc is enabled. V11: - Print error instead of WARN (Kai) - Add check for need to queue an async reclamation before returning from try_charge(), do so if needed. This is to be consistent with global reclaimer to minimize thrashing during allocation time. V10: - Split asynchronous flow in separate patch. (Kai) - Consider cgroup disabled when the workqueue allocation fail during init. (Kai) - Abstract out sgx_cgroup_should_reclaim(). V9: - Add comments for static variables. (Jarkko) V8: - Remove alignment for substructure variables. (Jarkko) V7: - Split this out from the big patch, #10 in V6. 
  (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 135 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   1 +
 2 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 3602616726ff..6368611cb29e 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -4,9 +4,63 @@
 #include
 #include "epc_cgroup.h"
 
+/*
+ * The minimal free pages maintained by the per-cgroup reclaimer.
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE	(SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal
+ * free pages would barely leave any page for use, causing excessive reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which a cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75% of
+ * pages are used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT	(SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+	return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+	return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors', hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+	struct misc_cg *i = sgx_cg->cg;
+	u64 m = U64_MAX;
+
+	while (i) {
+		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+		i = misc_cg_parent(i);
+	}
+
+	return m / PAGE_SIZE;
+}
+
 /**
  * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
  * @root:	Root of the tree to check
@@ -89,6 +143,61 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
 	rcu_read_unlock();
 }
 
+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a cgroup
+ * @sgx_cg:	The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() where
+ * the minimal number of free pages needs to be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if the number of free pages available for the cgroup is below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+	u64 cur, max;
+
+	if (sgx_cgroup_lru_empty(sgx_cg->cg))
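The per-cgroup watermark logic described in this patch — keep a minimal number of free pages below the effective limit, but skip that for cgroups whose limit is so small that doing so would thrash — can be modeled in plain userspace C. This is an illustrative sketch, not the kernel code: `should_reclaim()` and the two constants are hypothetical stand-ins for `sgx_cgroup_should_reclaim()`, `SGX_CG_MIN_FREE_PAGE` and `SGX_CG_LOW_LIMIT`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants mirroring the patch: a low watermark, and the
 * limit below which no minimal free-page count is maintained. */
#define CG_MIN_FREE_PAGE 32ULL
#define CG_LOW_LIMIT     (CG_MIN_FREE_PAGE * 4)

/* Reclaim when usage leaves fewer than CG_MIN_FREE_PAGE pages before the
 * effective limit, but for small-limit cgroups only reclaim at the hard
 * limit itself, to avoid thrashing. */
static bool should_reclaim(uint64_t cur, uint64_t max, bool lru_empty)
{
	if (lru_empty)                     /* nothing reclaimable anyway */
		return false;
	if (max < CG_LOW_LIMIT)
		return cur >= max;         /* only reclaim at the hard limit */
	return cur + CG_MIN_FREE_PAGE >= max;
}
```

With `max = 1000`, reclaim starts once usage crosses `1000 - 32`; with `max = 100` (below `CG_LOW_LIMIT`), reclaim only triggers at the limit itself.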
[PATCH v13 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
From: Kristen Carlson Accardi

Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching its limit. This is not ideal. When that happens, a better way is
to reclaim EPC page(s) from the current EPC cgroup (and/or its
descendants) to reduce its usage so the new allocation can succeed.

Add the basic building blocks to support per-cgroup reclamation.

Currently the kernel only has one place to reclaim EPC pages: the global
EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU
list for each EPC cgroup, and introduce a "cgroup" variant function to
reclaim EPC pages from a given EPC cgroup and its descendants.

Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
It always tries to reclaim EPC pages in a batch of SGX_NR_TO_SCAN (16)
pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
pages from the global LRU, and then tries to reclaim these pages at once
for better performance.

Implement the "cgroup" variant EPC reclaim in a similar way, but keep the
implementation simple: 1) change sgx_reclaim_pages() to take an LRU as
input, and return the pages that are "scanned" and attempted for
reclamation (but not necessarily reclaimed successfully); 2) loop over
the given EPC cgroup and its descendants and do the new
sgx_reclaim_pages() until SGX_NR_TO_SCAN pages are "scanned".

This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
cgroup, and only moves to its descendants when there are not enough
reclaimable EPC pages to "scan" in its LRU. It should be enough for most
cases.

Note, this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in a
batch of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
reclaimable pages, the actual reclaim of EPC pages will be split into
smaller batches _across_ multiple LRUs, with each being smaller than
SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch. But this is unnecessarily complicated at
this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to return
the actual "reclaimed" pages, not "scanned" pages. However, the
reclamation is a lengthy process; forcing a successful reclamation of a
predetermined number of pages may block the caller for too long. And that
may not be acceptable in some synchronous contexts, e.g., in serving an
ioctl().

With this building block in place, add synchronous reclamation support in
sgx_cgroup_try_charge(): trigger a call to sgx_cgroup_reclaim_pages() if
the cgroup reaches its limit and the caller allows synchronous reclaim,
as indicated by a newly added parameter.

A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().

Note all reclaimable EPC pages are still tracked in the global LRU, thus
no per-cgroup reclamation is actually active at the moment. Per-cgroup
tracking and reclamation will be turned on in the end after all necessary
infrastructure is in place.

Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson
Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Reviewed-by: Kai Huang
Tested-by: Jarkko Sakkinen
---
V11:
- Use commit message suggested by Kai
- Remove "usage" comments for functions.
  (Kai)

V10:
- Simplify the signature by removing a pointer to nr_to_scan (Kai)
- Return pages attempted instead of reclaimed as it is really what the
  cgroup caller needs to track progress. This further simplifies the
  design.
- Merge patch for exposing sgx_reclaim_pages() with basic synchronous
  reclamation. (Kai)
- Shorten names for EPC cgroup functions. (Jarkko)
- Fix/add comments to justify the design (Kai)
- Separate out a helper for addressing a single iteration of the loop in
  sgx_cgroup_try_charge(). (Jarkko)

V9:
- Add comments for static variables. (Jarkko)

V8:
- Use width of 80 characters in text paragraphs. (Jarkko)
- Remove alignment for substructure variables. (Jarkko)

V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC
  reclaim function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 119 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   5 +-
 arch/x86/kernel/cpu/sgx/main.c       |  45 ++
 arch/x86/kernel/cpu/sgx/sgx.h        |   1 +
 4 files changed, 148 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 5c484fd10160..3602616726ff 100644
---
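The "scan up to SGX_NR_TO_SCAN pages, draining the current cgroup's LRU before moving to a descendant" loop described in this commit message can be sketched as a small userspace model. This is illustrative only: each "LRU" is reduced to a count of reclaimable pages, and `reclaim_pages()` is a hypothetical stand-in for the sgx_cgroup_reclaim_pages() walk, returning pages scanned rather than reclaimed, as the patch does.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TO_SCAN 16   /* models SGX_NR_TO_SCAN */

/* Walk the LRUs in order (cgroup first, then descendants), scanning up
 * to NR_TO_SCAN pages total; return how many were scanned, which may be
 * fewer if the whole tree has fewer reclaimable pages. */
static int reclaim_pages(size_t nlru, int lru[])
{
	int scanned = 0;

	for (size_t i = 0; i < nlru && scanned < NR_TO_SCAN; i++) {
		int take = NR_TO_SCAN - scanned;

		if (take > lru[i])
			take = lru[i];  /* smaller batch from this LRU */
		lru[i] -= take;
		scanned += take;
	}
	return scanned;
}
```

Note how a short first LRU causes the remaining quota to be scanned from later LRUs in smaller batches — exactly the deviation from the one-big-batch global reclaim that the commit message calls out as acceptable.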
[PATCH v13 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU
From: Kristen Carlson Accardi

The SGX driver tracks reclaimable EPC pages by adding a newly allocated
page into the global LRU list in sgx_mark_page_reclaimable(), and doing
the opposite in sgx_unmark_page_reclaimable().

To support an SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and each newly allocated EPC page will need to be
added to the LRU of its associated cgroup, not always the global LRU
list.

When sgx_mark_page_reclaimable() is called, the cgroup that the newly
allocated EPC page belongs to is already known, i.e., it has been set in
the 'struct sgx_epc_page'.

Add a helper, sgx_epc_page_lru(), to return the LRU that a given EPC page
should be added to. Currently it just returns the global LRU. Change
sgx_{mark|unmark}_page_reclaimable() to use the helper function to get
the LRU from the EPC page instead of referring to the global LRU
directly. This allows EPC pages to be tracked in "per-cgroup" LRUs when
that becomes ready.

Co-developed-by: Sean Christopherson
Signed-off-by: Sean Christopherson
Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Reviewed-by: Jarkko Sakkinen
Reviewed-by: Kai Huang
Tested-by: Jarkko Sakkinen
---
V13:
- Revise commit log (Kai)
- Rename sgx_lru_list() to sgx_epc_page_lru()

V7:
- Split this out from the big patch, #10 in V6.
  (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 59736dd02ca7..2bf9cca3be21 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -31,6 +31,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page)
+{
+	return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -499,25 +504,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU.
  * @page:	EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+	spin_lock(&lru->lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-	spin_unlock(&sgx_global_lru.lock);
+	list_add_tail(&page->list, &lru->reclaimable);
+	spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:	EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
 *
 * Return:
 *   0 on success,
@@ -525,18 +529,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 */
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
-	spin_lock(&sgx_global_lru.lock);
+	struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+	spin_lock(&lru->lock);
	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
		/* The page is being reclaimed. */
		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
+			spin_unlock(&lru->lock);
			return -EBUSY;
		}

		list_del(&page->list);
		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);

	return 0;
}
-- 
2.25.1
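The -EBUSY path in sgx_unmark_page_reclaimable() above encodes a small state machine: a page can be "tracked" but temporarily off its LRU while the reclaimer holds it isolated. The following userspace model is an illustrative sketch of that invariant only — `struct epc_page_model`, `mark_reclaimable()` and `unmark_reclaimable()` are hypothetical stand-ins, with a bool standing in for the kernel's list membership test.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define PAGE_TRACKED 0x1   /* models SGX_EPC_PAGE_RECLAIMER_TRACKED */

struct epc_page_model {
	int flags;
	bool on_lru;       /* models !list_empty(&page->list) */
};

static void mark_reclaimable(struct epc_page_model *p)
{
	p->flags |= PAGE_TRACKED;
	p->on_lru = true;          /* models list_add_tail() to the LRU */
}

static int unmark_reclaimable(struct epc_page_model *p)
{
	if (p->flags & PAGE_TRACKED) {
		/* Tracked but off-list: the reclaimer isolated it. */
		if (!p->on_lru)
			return -EBUSY;
		p->on_lru = false; /* models list_del() */
		p->flags &= ~PAGE_TRACKED;
	}
	return 0;              /* untracked pages are a successful no-op */
}
```

The design point this captures is why the helper can stay LRU-agnostic: mark/unmark only need the page's own state plus whichever LRU `sgx_epc_page_lru()` hands back.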
[PATCH v13 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
From: Sean Christopherson Introduce a data structure to wrap the existing reclaimable list and its spinlock. Each cgroup later will have one instance of this structure to track EPC pages allocated for processes associated with the same cgroup. Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages from the reclaimable list in this structure when its usage reaches near its limit. Use this structure to encapsulate the LRU list and its lock used by the global reclaimer. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V6: - removed introduction to unreclaimables in commit message. V4: - Removed unneeded comments for the spinlock and the non-reclaimables. (Kai, Jarkko) - Revised the commit to add introduction comments for unreclaimables and multiple LRU lists.(Kai) - Reordered the patches: delay all changes for unreclaimables to later, and this one becomes the first change in the SGX subsystem. V3: - Removed the helper functions and revised commit messages. --- arch/x86/kernel/cpu/sgx/main.c | 39 +- arch/x86/kernel/cpu/sgx/sgx.h | 15 + 2 files changed, 35 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 1226ea0d5b3c..59736dd02ca7 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -27,10 +27,9 @@ static DEFINE_XARRAY(sgx_epc_address_space); /* * These variables are part of the state of the reclaimer, and must be accessed - * with sgx_reclaimer_lock acquired. + * with sgx_global_lru.lock acquired. 
 */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -305,13 +304,13 @@ static void sgx_reclaim_pages(void)
 	int ret;
 	int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
 			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
 		encl_page = epc_page->owner;
 
@@ -323,7 +322,7 @@ static void sgx_reclaim_pages(void)
 		 */
 		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
 	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
@@ -346,9 +345,9 @@ static void sgx_reclaim_pages(void)
 		continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -379,7 +378,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -431,6 +430,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
 	ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
 	return true;
 }
 
@@ -506,10 +507,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
 	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -524,18 +525,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 */
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
		/* The page is being reclaimed. */
		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
			return -EBUSY;
		}
[PATCH v13 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
Enclave Page Cache(EPC) memory can be swapped out to regular system memory, and the consumed memory should be charged to a proper mem_cgroup. Currently the selection of mem_cgroup to charge is done in sgx_encl_get_mem_cgroup(). But it considers all contexts other than the ksgxd thread are user processes. With the new EPC cgroup implementation, the swapping can also happen in EPC cgroup work-queue threads. In those cases, it improperly selects the root mem_cgroup to charge for the RAM usage. Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take an additional argument to explicitly specify the mm struct to charge for allocations. Callers from background kthreads not associated with a charging mm struct would set it to NULL, while callers in user process contexts set it to current->mm. Internally, it handles the case when the charging mm given is NULL, by searching for an mm struct from enclave's mm_list. Signed-off-by: Haitao Huang Reported-by: Mikko Ylinen Reviewed-by: Kai Huang Tested-by: Mikko Ylinen Tested-by: Jarkko Sakkinen --- V10: - Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko) V9: - Reduce number of if statements. (Tim) V8: - Limit text paragraphs to 80 characters wide. (Jarkko) --- arch/x86/kernel/cpu/sgx/encl.c | 29 ++-- arch/x86/kernel/cpu/sgx/encl.h | 3 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 10 ++ arch/x86/kernel/cpu/sgx/main.c | 29 +--- arch/x86/kernel/cpu/sgx/sgx.h| 2 +- 5 files changed, 36 insertions(+), 37 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index f474179b6f77..7b77dad41daf 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde } /* - * When called from ksgxd, returns the mem_cgroup of a struct mm stored - * in the enclave's mm_list. When not called from ksgxd, just returns - * the mem_cgroup of the current task. 
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave. + * + * Used in sgx_encl_alloc_backing() for backing store allocation. + * + * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup + * of a struct mm stored in the enclave's mm_list. */ -static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) +static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl, + struct mm_struct *charge_mm) { struct mem_cgroup *memcg = NULL; struct sgx_encl_mm *encl_mm; int idx; - /* -* If called from normal task context, return the mem_cgroup -* of the current task's mm. The remainder of the handling is for -* ksgxd. -*/ - if (!current_is_ksgxd()) - return get_mem_cgroup_from_mm(current->mm); +/* Use the charge_mm if given. */ + if (charge_mm) + return get_mem_cgroup_from_mm(charge_mm); /* * Search the enclave's mm_list to find an mm associated with @@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * @encl: an enclave pointer * @page_index:enclave page index * @backing: data for accessing backing storage for the page + * @charge_mm: the mm to charge for the allocation * - * When called from ksgxd, sets the active memcg from one of the + * When charge_mm is NULL, sets the active memcg from one of the * mms in the enclave's mm_list prior to any backing page allocation, * in order to ensure that shmem page allocations are charged to the * enclave. Create a backing page for loading data back into an EPC page with @@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * -errno otherwise. 
*/ int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index, - struct sgx_backing *backing) + struct sgx_backing *backing, struct mm_struct *charge_mm) { - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl); + struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm); struct mem_cgroup *memcg = set_active_memcg(encl_memcg); int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index fe15ade02ca1..5ce9d108290f 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr, int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start, unsigned long end, unsigned long vm_flags); -bool current_is_ksgxd(void); void sgx_encl_release(struct kref *ref); int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm); const
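The mem_cgroup selection policy this patch introduces — prefer the caller's explicit `charge_mm`, otherwise fall back to an mm from the enclave's mm_list — reduces to a small decision function. The sketch below is a hypothetical userspace model, not the kernel code: memcgs are just integers, and `pick_memcg()` stands in for sgx_encl_get_mem_cgroup() with an extra default for the "enclave has no mms" case.

```c
#include <assert.h>
#include <stddef.h>

/* Toy mm: the only thing we model is which memcg it belongs to. */
struct mm_model { int memcg; };

/* Prefer the explicit charging mm (user-context callers pass their own);
 * background threads pass NULL and we fall back to the first mm on the
 * enclave's list, then to a root/default group. */
static int pick_memcg(const struct mm_model *charge_mm,
		      const struct mm_model *encl_mms, size_t n,
		      int root_memcg)
{
	if (charge_mm)
		return charge_mm->memcg;
	if (n > 0)
		return encl_mms[0].memcg;  /* models the mm_list search */
	return root_memcg;
}
```

This mirrors the commit's point: with an explicit `charge_mm`, workqueue threads no longer mischarge the root mem_cgroup the way the old `current_is_ksgxd()` test did.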
[PATCH v13 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
From: Kristen Carlson Accardi SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments. For instance, within a Kubernetes environment, while a user may specify a particular EPC quota for a pod, the orchestrator requires a mechanism to enforce that the pod's actual runtime EPC usage does not exceed the allocated quota. Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to limit and track EPC allocations per cgroup. Earlier patches have added the "sgx_epc" resource type in the misc cgroup subsystem. Add basic support in SGX driver as the "sgx_epc" resource provider: - Set "capacity" of EPC by calling misc_cg_set_capacity() - Update EPC usage counter, "current", by calling charge and uncharge APIs for EPC allocation and deallocation, respectively. - Setup sgx_epc resource type specific callbacks, which perform initialization and cleanup during cgroup allocation and deallocation, respectively. With these changes, the misc cgroup controller enables users to set a hard limit for EPC usage in the "misc.max" interface file. It reports current usage in "misc.current", the total EPC memory available in "misc.capacity", and the number of times EPC usage reached the max limit in "misc.events". For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list, only reclaimed by the global reclaimer when the total free page count is lower than a threshold. Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming. 
Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V13: - Remove unneeded includes. (Kai) V12: - Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko) V11: - Update copyright and format better (Kai) - Create wrappers to remove #ifdefs in c file. (Kai) - Remove unneeded comments (Kai) V10: - Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko) - Use enums instead of booleans for the parameters. (Dave, Jarkko) V8: - Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko) - Remove extra space, '_INTEL'. (Jarkko) V7: - Use a static for root cgroup (Kai) - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai) - Correct check for charge API return (Kai) - Start initialization in SGX device driver init (Kai) - Remove unneeded BUG_ON (Kai) - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai) V6: - Split the original large patch"Limit process EPC usage with misc cgroup controller" and restructure it (Kai) --- arch/x86/kernel/cpu/sgx/Makefile | 1 + arch/x86/kernel/cpu/sgx/epc_cgroup.c | 71 +++ arch/x86/kernel/cpu/sgx/epc_cgroup.h | 72 arch/x86/kernel/cpu/sgx/main.c | 42 +++- arch/x86/kernel/cpu/sgx/sgx.h| 21 include/linux/misc_cgroup.h | 2 + 6 files changed, 207 insertions(+), 2 deletions(-) create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile index 9c1656779b2a..081cb424575e 100644 --- a/arch/x86/kernel/cpu/sgx/Makefile +++ b/arch/x86/kernel/cpu/sgx/Makefile @@ -4,3 +4,4 @@ obj-y += \ ioctl.o \ main.o obj-$(CONFIG_X86_SGX_KVM) += virt.o +obj-$(CONFIG_CGROUP_MISC) += epc_cgroup.o diff --git 
a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c new file mode 100644 index ..5c484fd10160 --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -0,0 +1,71 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2022-2024 Intel Corporation. */ + +#include +#include "epc_cgroup.h" + +/* The root SGX EPC cgroup */ +static struct sgx_cgroup sgx_cg_root; + +/** + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page + * + * @sgx_cg:The EPC cgroup to be charged for the page. + * Return: + * * %0 - If successfully charged. + * * -errno - for failures. + */ +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +{ + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE); +} + +/** + * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page + * @sgx_cg:The charged sgx cgroup. + */ +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) +{ + misc_cg_uncharge(MISC_CG_RES_SGX_EPC,
[PATCH v13 04/14] cgroup/misc: Add SGX EPC resource type
From: Kristen Carlson Accardi Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type for the misc controller. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V12: - Remove CONFIG_CGROUP_SGX_EPC (Jarkko) V6: - Split the original patch into this and the preceding one (Kai) --- include/linux/misc_cgroup.h | 4 kernel/cgroup/misc.c| 4 2 files changed, 8 insertions(+) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 541a5611c597..440ed2bb8053 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -17,6 +17,10 @@ enum misc_res_type { MISC_CG_RES_SEV, /* AMD SEV-ES ASIDs resource */ MISC_CG_RES_SEV_ES, +#endif +#ifdef CONFIG_X86_SGX + /* SGX EPC memory resource */ + MISC_CG_RES_SGX_EPC, #endif MISC_CG_RES_TYPES }; diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 7d852139121a..863f9147602b 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = { /* AMD SEV-ES ASIDs resource */ "sev_es", #endif +#ifdef CONFIG_X86_SGX + /* Intel SGX EPC memory bytes */ + "sgx_epc", +#endif }; /* Root misc cgroup */ -- 2.25.1
[PATCH v13 03/14] cgroup/misc: Export APIs for SGX driver
From: Kristen Carlson Accardi

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its or its ancestor's limit. This requires a walk from the current cgroup
up to the root, similar to misc_cg_try_charge(). Export misc_cg_parent()
to enable this walk.

The SGX driver also needs to start a global-level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.

Signed-off-by: Kristen Carlson Accardi
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Reviewed-by: Jarkko Sakkinen
Reviewed-by: Tejun Heo
Reviewed-by: Kai Huang
Tested-by: Jarkko Sakkinen
---
V6:
- Make commit messages more concise and split the original patch into
  two (Kai)
---
 include/linux/misc_cgroup.h | 24 
 kernel/cgroup/misc.c        | 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
 	struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
 *
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
 
+static inline struct misc_cg *misc_cg_root(void)
+{
+	return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+	return NULL;
+}
+
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 4a602d68cf7d..7d852139121a 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
 */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-	return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
 * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!amount)
 		return 0;
 
-	for (i = cg; i; i = parent_misc(i)) {
+	for (i = cg; i; i = misc_cg_parent(i)) {
 		res = &i->res[type];
 
 		new_usage = atomic64_add_return(amount, &res->usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	return 0;
 
 err_charge:
-	for (j = i; j; j = parent_misc(j)) {
+	for (j = i; j; j = misc_cg_parent(j)) {
 		atomic64_inc(&j->res[type].events);
 		cgroup_file_notify(&j->events_file);
 	}
 
-	for (j = cg; j != i; j = parent_misc(j))
+	for (j = cg; j != i; j = misc_cg_parent(j))
 		misc_cg_cancel_charge(type, j, amount);
 	misc_cg_cancel_charge(type, i, amount);
 	return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
 	if (!(amount && valid_type(type) && cg))
 		return;
 
-	for (i = cg; i; i = parent_misc(i))
+	for (i = cg; i; i = misc_cg_parent(i))
 		misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1
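The ancestor walk that this patch exports — charge every cgroup from the current one up to the root, and unwind the partial charges if any level fails — can be modeled in a few lines of userspace C. This is an illustrative sketch of misc_cg_try_charge()'s control flow only; `struct cg` and `try_charge()` are hypothetical stand-ins, with plain integers instead of atomics and events.

```c
#include <assert.h>
#include <stdint.h>

/* Toy misc-cgroup node: usage is charged against every ancestor up to
 * the root, mirroring the misc_cg_parent() walk above. */
struct cg {
	struct cg *parent;
	uint64_t usage, max;
};

static int try_charge(struct cg *cg, uint64_t amount)
{
	struct cg *i, *j;

	for (i = cg; i; i = i->parent) {
		i->usage += amount;
		if (i->usage > i->max)
			goto rollback;     /* this level is over its limit */
	}
	return 0;

rollback:
	/* Undo the charges applied so far, including the failing level i. */
	for (j = cg; j != i->parent; j = j->parent)
		j->usage -= amount;
	return -1;
}
```

A failed charge at an ancestor leaves every counter exactly as it was, which is the invariant the kernel's err_charge path preserves with misc_cg_cancel_charge().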
[PATCH v13 02/14] cgroup/misc: Add per resource callbacks for CSS events
From: Kristen Carlson Accardi The misc cgroup controller (subsystem) currently does not perform resource type specific action for Cgroups Subsystem State (CSS) events: the 'css_alloc' event when a cgroup is created and the 'css_free' event when a cgroup is destroyed. Define callbacks for those events and allow resource providers to register the callbacks per resource type as needed. This will be utilized later by the EPC misc cgroup support implemented in the SGX driver. This design passes a struct misc_cg into the callbacks. An alternative to pass only a struct misc_res was considered for encapsulating how misc_cg stores its misc_res array. However, the SGX driver would still need to access the misc_res array in the statically defined misc root_cg during initialization to set resource specific fields. Therefore, some extra getter/setter APIs to abstract such access to the misc_res array within the misc_cg struct would be needed. That seems an overkill at the moment and is deferred to later improvement when there are more users of these callbacks. Link: https://lore.kernel.org/lkml/op.2kdw36otwjv...@hhuan26-mobl.amr.corp.intel.com/ Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V12: - Add comments in commit to clarify reason to pass in misc_cg, not misc_res. (Kai) - Remove unlikely (Kai) V8: - Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko) V7: - Make ops one per resource type and store them in array (Michal) - Rename the ops struct to misc_res_ops, and enforce the constraints of required callback functions (Jarkko) - Moved addition of priv field to patch 4 where it was used first. 
(Jarkko) V6: - Create ops struct for per resource callbacks (Jarkko) - Drop max_write callback (Dave, Michal) - Style fixes (Kai) --- include/linux/misc_cgroup.h | 11 + kernel/cgroup/misc.c| 82 + 2 files changed, 86 insertions(+), 7 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index e799b1f8d05b..0806d4436208 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -27,6 +27,16 @@ struct misc_cg; #include +/** + * struct misc_res_ops: per resource type callback ops. + * @alloc: invoked for resource specific initialization when cgroup is allocated. + * @free: invoked for resource specific cleanup when cgroup is deallocated. + */ +struct misc_res_ops { + int (*alloc)(struct misc_cg *cg); + void (*free)(struct misc_cg *cg); +}; + /** * struct misc_res: Per cgroup per misc type resource * @max: Maximum limit on the resource. @@ -56,6 +66,7 @@ struct misc_cg { u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount); void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount); diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 79a3717a5803..4a602d68cf7d 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -39,6 +39,9 @@ static struct misc_cg root_cg; */ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; +/* Resource type specific operations */ +static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; + /** * parent_misc() - Get the parent of the passed misc cgroup. * @cgroup: cgroup whose parent needs to be fetched. @@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) } EXPORT_SYMBOL_GPL(misc_cg_set_capacity); +/** + * misc_cg_set_ops() - set resource specific operations. + * @type: Type of the misc res. 
+ * @ops: Operations for the given type. + * + * Context: Any context. + * Return: + * * %0 - Successfully registered the operations. + * * %-EINVAL - If @type is invalid, or the operations are missing any required callbacks. + */ +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops) +{ + if (!valid_type(type)) + return -EINVAL; + + if (!ops->alloc) { + pr_err("%s: alloc missing\n", __func__); + return -EINVAL; + } + + if (!ops->free) { + pr_err("%s: free missing\n", __func__); + return -EINVAL; + } + + misc_res_ops[type] = ops; + return 0; +} +EXPORT_SYMBOL_GPL(misc_cg_set_ops); + /** * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup. * @type: Misc res type in misc cg to cancel the charge from. @@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = { {} }; +static inline int _misc_cg_res_alloc(struct misc_cg *cg) +{ + enum misc_res_type i; + int ret; + + for (i = 0; i < MISC_CG_RES_TYPES; i++) { +
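The registration contract that misc_cg_set_ops() enforces — reject invalid types, reject an ops table missing a required callback, then record the table — can be shown as a standalone sketch. Names here are simplified stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define RES_TYPES 2
enum res_type { RES_A, RES_B };

struct cg;	/* opaque cgroup, stands in for struct misc_cg */

struct res_ops {
	int  (*alloc)(struct cg *cg);	/* resource init on css_alloc */
	void (*free)(struct cg *cg);	/* resource cleanup on css_free */
};

static const struct res_ops *registered[RES_TYPES];

/* Mirrors misc_cg_set_ops(): both callbacks are mandatory. */
static int set_ops(enum res_type type, const struct res_ops *ops)
{
	if ((int)type < 0 || type >= RES_TYPES)
		return -1;	/* -EINVAL in the kernel */
	if (!ops->alloc || !ops->free)
		return -1;
	registered[type] = ops;
	return 0;
}

/* Example provider callbacks, as a resource driver would supply. */
static int my_alloc(struct cg *cg) { (void)cg; return 0; }
static void my_free(struct cg *cg) { (void)cg; }

static const struct res_ops my_ops  = { .alloc = my_alloc, .free = my_free };
static const struct res_ops bad_ops = { .alloc = my_alloc, .free = NULL };
```

A provider such as the SGX EPC support would call the registration function once at init time, before any cgroup using its resource type can be created.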
[PATCH v13 00/14] Add Cgroup support for SGX EPC memory
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments, e.g., support for pod level control in a Kubernetes cluster on a VM or bare-metal host [1,2]. This patchset implements the support for sgx_epc memory within the misc cgroup controller. A user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per cgroup. The implementation reports current usage and events of reaching the limit per cgroup as well as the total system capacity. Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques and pages can be swapped out of the EPC to their backing store, which is normal system memory allocated via shmem and accounted by the memory controller. Similar to per-cgroup reclamation done by the memory controller, the EPC misc controller needs to implement a per-cgroup EPC reclaiming process: when the EPC usage of a cgroup reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out some EPC pages within the same cgroup to make room for new allocations. For that, this implementation tracks reclaimable EPC pages in a separate LRU list in each cgroup, and below are more details and justification of this design. Track EPC pages in per-cgroup LRUs (from Dave) -- tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes to the existing LRU scanning algorithm, or adding new LRUs. The result: Add a new LRU for each cgroup and scan those instead. Replace the existing global LRU with the root cgroup's LRU (only when this new support is compiled in, obviously). 
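The per-cgroup LRU idea above can be sketched in a few lines of plain C. This is purely illustrative — the list handling and names are invented for the sketch, not the kernel's EPC LRU implementation — but it shows why reclaim for one cgroup becomes O(that cgroup's pages) instead of a filter over a global list:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Each reclaimable EPC page lives on exactly one LRU: its owning cgroup's
 * (with this support compiled in, the root cgroup's LRU takes the place of
 * the old single global list).
 */
struct epc_page {
	struct epc_page *next;
	int cgroup_id;
};

struct cgroup_lru {
	struct epc_page *head;
	int nr_pages;
};

static void lru_add(struct cgroup_lru *lru, struct epc_page *p)
{
	p->next = lru->head;
	lru->head = p;
	lru->nr_pages++;
}

/*
 * Reclaim up to nr pages for one cgroup. With a per-cgroup LRU this scans
 * only that cgroup's own list, rather than walking a global list that may
 * hold millions of other cgroups' pages to find a few candidates.
 */
static int reclaim_from(struct cgroup_lru *lru, int nr)
{
	int done = 0;

	while (lru->head && done < nr) {
		lru->head = lru->head->next;	/* "swap out" the head page */
		lru->nr_pages--;
		done++;
	}
	return done;
}
```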
The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die. This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves. There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim. Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges. Unreclaimable Enclave Pages --- There are a variety of page types for enclaves, each serving different purposes [5]. Although the SGX architecture supports swapping for all types, some special pages, e.g., Version Array (VA) and Secure Enclave Control Structure (SECS) [5], hold metadata of reclaimed pages and enclaves. That makes reclamation of such pages more intricate to manage. The SGX driver global reclaimer currently does not swap out VA pages. It only swaps the SECS page of an enclave when all other associated pages have been swapped out. The cgroup reclaimer follows the same approach and does not track those in per-cgroup LRUs and considers them unreclaimable pages. The allocation of these pages is counted towards the usage of a specific cgroup and is subject to the cgroup's set EPC limits. 
Earlier versions of this series implemented forced enclave-killing to reclaim VA and SECS pages. That was designed to enforce the 'max' limit, particularly in scenarios where a user or administrator reduces this limit post-launch of enclaves. However, subsequent discussions [3, 4] indicated that such preemptive enforcement is not necessary for the misc-controllers. Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed, and the limit is only enforced at the time of a new EPC allocation request. When a cgroup hits its limit but nothing is left in the LRUs of the subtree, i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC within that cgroup will result in 'ENOMEM'. Unreclaimable Guest VM EPC Pages The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats those as unreclaimable and returns ENOMEM when its limit is hit and nothing reclaimable is left within the cgroup. The virtual EPC driver
[PATCH v13 01/14] x86/sgx: Replace boolean parameters with enums
Replace boolean parameters for 'reclaim' in the function sgx_alloc_epc_page() and its callers with an enum. Also opportunistically remove non-static declaration of __sgx_alloc_epc_page() and fix a typo. Signed-off-by: Haitao Huang Suggested-by: Jarkko Sakkinen Suggested-by: Dave Hansen Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- arch/x86/kernel/cpu/sgx/encl.c | 12 ++-- arch/x86/kernel/cpu/sgx/encl.h | 4 ++-- arch/x86/kernel/cpu/sgx/ioctl.c | 10 +- arch/x86/kernel/cpu/sgx/main.c | 14 +++--- arch/x86/kernel/cpu/sgx/sgx.h | 13 +++-- arch/x86/kernel/cpu/sgx/virt.c | 2 +- 6 files changed, 32 insertions(+), 23 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..f474179b6f77 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page, struct sgx_epc_page *epc_page; int ret; - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) return epc_page; @@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma, goto err_out_unlock; } - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) { if (PTR_ERR(epc_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; goto err_out_unlock; } - va_page = sgx_encl_grow(encl, false); + va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM); if (IS_ERR(va_page)) { if (PTR_ERR(va_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; @@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) /** * sgx_alloc_va_page() - Allocate a Version Array (VA) page - * @reclaim: Reclaim EPC pages directly if none available. Enclave - * mutex should not be held if this is set. + * @reclaim: Whether to reclaim EPC pages directly if none are available. 
Enclave + * mutex should not be held for SGX_DO_RECLAIM. * * Allocate a free EPC page and convert it to a Version Array (VA) page. * @@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) * a VA page, * -errno otherwise */ -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim) +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim) { struct sgx_epc_page *epc_page; int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index f94ff14c9486..fe15ade02ca1 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl, unsigned long offset, u64 secinfo_flags); void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr); -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim); +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim); unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page); void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset); bool sgx_va_page_full(struct sgx_va_page *va_page); void sgx_encl_free_epc_page(struct sgx_epc_page *page); struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, unsigned long addr); -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim); +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim); void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page); #endif /* _X86_ENCL_H */ diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c index b65ab214bdf5..793a0ba2cb16 100644 --- a/arch/x86/kernel/cpu/sgx/ioctl.c +++ b/arch/x86/kernel/cpu/sgx/ioctl.c @@ -17,7 +17,7 @@ #include "encl.h" #include "encls.h" -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim) +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim) { struct sgx_va_page *va_page = NULL; void *err; @@ -64,7 +64,7 @@ static 
int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) struct file *backing; long ret; - va_page = sgx_encl_grow(encl, true); + va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM); if (IS_ERR(va_page)) return PTR_ERR(va_page); else if (va_page) @@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) encl->backing = backing; - secs_epc = sgx_alloc_epc_page(&encl->secs, true); + secs_epc =
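The value of the bool-to-enum conversion in this patch is easiest to see in a standalone sketch. The enum names below match the patch; the tiny allocator is a hypothetical stand-in for sgx_alloc_epc_page(), not the real function:

```c
#include <assert.h>

/*
 * A named constant at the call site documents intent where a bare
 * 'true'/'false' argument does not: compare
 *     sgx_encl_grow(encl, true);
 * with
 *     sgx_encl_grow(encl, SGX_DO_RECLAIM);
 */
enum sgx_reclaim {
	SGX_NO_RECLAIM = 0,
	SGX_DO_RECLAIM
};

/*
 * Hypothetical allocator shaped like sgx_alloc_epc_page(): if no page is
 * free, either fail immediately (caller cannot sleep to reclaim) or
 * "reclaim" to make room.
 */
static int alloc_page(int *free_pages, enum sgx_reclaim reclaim)
{
	if (*free_pages == 0) {
		if (reclaim == SGX_NO_RECLAIM)
			return -1;	/* -EBUSY/-ENOMEM in the real driver */
		(*free_pages)++;	/* pretend reclaim freed one page */
	}
	(*free_pages)--;
	return 0;
}
```

As a bonus, the enum gives the compiler a distinct type to check at call sites, so accidentally passing an unrelated integer is more likely to be caught than with a bool.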
[PATCH] virtio_net: Warn if insufficient queue length for transmitting
The transmit queue is stopped when the number of free queue entries is less than 2+MAX_SKB_FRAGS, in start_xmit(). If the queue length (QUEUE_NUM_MAX) is less than this, transmission will immediately trigger a netdev watchdog timeout. Report this condition earlier and more directly. Signed-off-by: Darius Rad --- drivers/net/virtio_net.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 115c3c5414f2..72ee8473b61c 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -4917,6 +4917,9 @@ static int virtnet_probe(struct virtio_device *vdev) set_bit(guest_offloads[i], &vi->guest_offloads); vi->guest_offloads_capable = vi->guest_offloads; + if (virtqueue_get_vring_size(vi->sq->vq) < 2 + MAX_SKB_FRAGS) + netdev_warn_once(dev, "not enough queue entries, expect xmit timeout\n"); + pr_debug("virtnet: registered device %s with %d RX and TX vq's\n", dev->name, max_queue_pairs); -- 2.39.2
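The condition being warned about reduces to a one-line predicate. Note that MAX_SKB_FRAGS is configuration-dependent in the kernel; 17 below is just a typical placeholder for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SKB_FRAGS 17	/* typical value; config-dependent in the kernel */

/*
 * start_xmit() stops the queue when fewer than 2 + MAX_SKB_FRAGS
 * descriptors remain free (room for a maximally fragmented skb plus
 * slack). A vring smaller than that threshold can never satisfy the
 * wake condition, so the queue stays stopped and the netdev watchdog
 * fires. The probe-time check in the patch reports this up front.
 */
static bool ring_too_small(unsigned int vring_size)
{
	return vring_size < 2 + MAX_SKB_FRAGS;
}
```

With MAX_SKB_FRAGS == 17, any ring with fewer than 19 entries would trip the warning.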
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 26.04.24 21:55, Guillaume Morin wrote: On 26 Apr 9:19, David Hildenbrand wrote: A couple of points: a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want to check PageAnonExclusive. b) If you're not following the can_follow_write_pte/_pmd model, you are doing something wrong :) c) The code was heavily changed in mm/mm-unstable. It was merged with the common code. Likely, in mm/mm-unstable, the existing can_follow_write_pte and can_follow_write_pmd checks will already cover what you want in most cases. We'd need a can_follow_write_pud() to cover follow_huge_pud() and (unfortunately) something to handle follow_hugepd() as well similarly. Copy-pasting what we do in can_follow_write_pte() and adjusting for different PTE types is the right thing to do. Maybe now it's time to factor out the common checks into a separate helper. I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT I'll have to have a closer look at some details (the hugepd writability check looks a bit odd), but it's mostly what I would have expected! -- Cheers, David / dhildenb
Re: [PATCH v7 6/6] remoteproc: qcom: enable in-kernel PD mapper
On 4/26/2024 6:36 PM, Dmitry Baryshkov wrote: On Sat, 27 Apr 2024 at 04:03, Chris Lew wrote: On 4/24/2024 2:28 AM, Dmitry Baryshkov wrote: diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c b/drivers/remoteproc/qcom_q6v5_adsp.c index 1d24c9b656a8..02d0c626b03b 100644 --- a/drivers/remoteproc/qcom_q6v5_adsp.c +++ b/drivers/remoteproc/qcom_q6v5_adsp.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -375,10 +376,14 @@ static int adsp_start(struct rproc *rproc) int ret; unsigned int val; - ret = qcom_q6v5_prepare(&adsp->q6v5); + ret = qcom_pdm_get(); if (ret) return ret; Would it make sense to try and model this as a rproc subdev? This section of the remoteproc code seems to be focused on making specific calls to setup and enable hardware resources, where as pd mapper is software. sysmon and ssr are also purely software and they are modeled as subdevs in qcom_common. I'm not an expert on remoteproc organization but this was just a thought. Well, the issue is that the pd-mapper is a global, not a per-remoteproc instance Both sysmon and ssr have some kind of global states that they manage too. Each subdev functionality tends to be a mix of per-remoteproc instance management and global state management. If pd-mapper was completely global, pd-mapper would be able to instantiate by itself. Instead, instantiation is dependent on each remoteproc instance properly getting and putting references. The pdm subdev could manage the references to pd-mapper for that remoteproc instance. On the other hand, I think Bjorn recommended this could be moved to probe time in v4. The v4 version was doing the reinitialization-dance, but I think the recommendation could still apply to this version. Thanks! 
Chris + ret = qcom_q6v5_prepare(&adsp->q6v5); + if (ret) + goto put_pdm; + ret = adsp_map_carveout(rproc); if (ret) { dev_err(adsp->dev, "ADSP smmu mapping failed\n"); @@ -446,6 +451,8 @@ static int adsp_start(struct rproc *rproc) adsp_unmap_carveout(rproc); disable_irqs: qcom_q6v5_unprepare(&adsp->q6v5); +put_pdm: + qcom_pdm_release(); return ret; }
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 30 Apr 20:21, David Hildenbrand wrote: > Sorry for not replying earlier, was busy with other stuff. I'll try getting > that stuff into shape and send it out soonish. No worries. Let me know what you think of the FOLL_FORCE patch when you have a sec. > > I went with using one write uprobe function with some additional > > branches. I went back and forth between that and making them 2 different > > functions. > > All the folio_test_hugetlb() special casing is a bit suboptimal. Likely we > want a separate variant, because we should be using hugetlb PTE functions > consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), softdirty does not > exist etc.) Ok, I'll give this a whirl and send something probably tomorrow. > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > > index 2f4e88552d3f..8a33e380f7ea 100644 > > --- a/fs/hugetlbfs/inode.c > > +++ b/fs/hugetlbfs/inode.c > > @@ -83,6 +83,10 @@ static const struct fs_parameter_spec > > hugetlb_fs_parameters[] = { > > {} > > }; > > +bool hugetlbfs_mapping(struct address_space *mapping) { > > + return mapping->a_ops == &hugetlbfs_aops; > > is_vm_hugetlb_page() might be what you are looking for. I use hugetlbfs_mapping() in __uprobe_register() which does an early return using the mapping only if it's not supported. There is no vma that I can get at this point (afaict). 
I could refactor so we check this when we have a vma but it looked cleaner to introduce it since there is already shmem_mapping() > > } > > -static void copy_from_page(struct page *page, unsigned long vaddr, void > > *dst, int len) > > +static void copy_from_page(struct page *page, unsigned long vaddr, void > > *dst, int len, unsigned long page_mask) > > { > > void *kaddr = kmap_atomic(page); > > - memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len); > > + memcpy(dst, kaddr + (vaddr & ~page_mask), len); > > kunmap_atomic(kaddr); > > } > > > -static void copy_to_page(struct page *page, unsigned long vaddr, const > > void *src, int len) > > +static void copy_to_page(struct page *page, unsigned long vaddr, const > > void *src, int len, unsigned long page_mask) > > { > > void *kaddr = kmap_atomic(page); > > - memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len); > > + memcpy(kaddr + (vaddr & ~page_mask), src, len); > > kunmap_atomic(kaddr); > > } > > These two changes really are rather ugly ... > > An why are they even required? We get a PAGE_SIZED-based subpage of a > hugetlb page. We only kmap that one and copy within that one. > > In other words, I don't think the copy_from_page() and copy_to_page() > changes are even required when we consistently work on subpages and not > suddenly on head pages. The main reason is that the previous __replace_page worked directly on the full HP page so adjusting after gup seemed to make more sense to me. But now I guess it's not that useful (esp we're going with a different version of write_uprobe). I'll fix (...) 
> > { > > struct uwo_data *data = walk->private; > > const bool is_register = !!is_swbp_insn(&data->opcode); > > @@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned > > long vaddr, > > /* Unmap + flush the TLB, such that we can write atomically. */ > > flush_cache_page(vma, vaddr, pte_pfn(pte)); > > - pte = ptep_clear_flush(vma, vaddr, ptep); > > + if (folio_test_hugetlb(folio)) > > + pte = huge_ptep_clear_flush(vma, vaddr, ptep); > > + else > > + pte = ptep_clear_flush(vma, vaddr, ptep); > > copy_to_page(page, data->opcode_vaddr, &data->opcode, > > -UPROBE_SWBP_INSN_SIZE); > > +UPROBE_SWBP_INSN_SIZE, page_mask); > > /* When unregistering, we may only zap a PTE if uffd is disabled ... */ > > if (is_register || userfaultfd_missing(vma)) > > @@ -443,13 +448,18 @@ static int __write_opcode_pte(pte_t *ptep, unsigned > > long vaddr, > > if (!identical || folio_maybe_dma_pinned(folio)) > > goto remap; > > - /* Zap it and try to reclaim swap space. */ > > - dec_mm_counter(mm, MM_ANONPAGES); > > - folio_remove_rmap_pte(folio, page, vma); > > - if (!folio_mapped(folio) && folio_test_swapcache(folio) && > > -folio_trylock(folio)) { > > - folio_free_swap(folio); > > - folio_unlock(folio); > > + if (folio_test_hugetlb(folio)) { > > + hugetlb_remove_rmap(folio); > > + large = false; > > + } else { > > + /* Zap it and try to reclaim swap space. */ > > + dec_mm_counter(mm, MM_ANONPAGES); > > + folio_remove_rmap_pte(folio, page, vma); > > + if (!folio_mapped(folio) && folio_test_swapcache(folio) && > > + folio_trylock(folio)) { > > + folio_free_swap(folio); > > + folio_unlock(folio); > > + } > > } > > folio_put(folio); > > @@ -461,11 +471,29 @@ static int __write_opcode_pte(pte_t *ptep, unsigned > > long vaddr, > > */ > > smp_wmb(); >
[PATCH] eventfs/tracing: Add callback for release of an eventfs_inode
From: "Steven Rostedt (Google)" Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing: Script 'A': echo "hello int aaa" > /sys/kernel/tracing/synthetic_events while : do echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable done Script 'B': echo > /sys/kernel/tracing/synthetic_events Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has: { struct trace_event_file *file = inode->i_private; int ret; ret = tracing_check_open_get_tr(file->tr); [..] But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows the creation of the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while an eventfs file that references it is being opened. 
Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/ Cc: sta...@vger.kernel.org Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c| 7 +++ include/linux/tracefs.h | 3 +++ kernel/trace/trace_events.c | 12 3 files changed, 22 insertions(+) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 894c6ca1e500..dc97c19f9e0a 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -84,10 +84,17 @@ enum { static void release_ei(struct kref *ref) { struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + const struct eventfs_entry *entry; struct eventfs_root_inode *rei; WARN_ON_ONCE(!ei->is_freed); + for (int i = 0; i < ei->nr_entries; i++) { + entry = &ei->entries[i]; + if (entry->release) + entry->release(entry->name, ei->data); + } + kfree(ei->entry_attrs); kfree_const(ei->name); if (ei->is_events) { diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 7a5fe17b6bf9..d03f74658716 100644 --- a/include/linux/tracefs.h +++ b/include/linux/tracefs.h @@ -62,6 +62,8 @@ struct eventfs_file; typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); +typedef void (*eventfs_release)(const char *name, void *data); + /** * struct eventfs_entry - dynamically created eventfs file call back handler * @name: The name of the dynamic file in an eventfs directory @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, struct eventfs_entry { const char *name; eventfs_callback callback; + eventfs_release release; }; struct eventfs_inode; diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 52f75c36bbca..6ef29eba90ce 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2552,6 +2552,14 @@ static int event_callback(const char 
*name, umode_t *mode, void **data, return 0; } +/* The file is incremented on creation and freeing the enable file decrements it */ +static void event_release(const char *name, void *data) +{ + struct trace_event_file *file = data; + + event_file_put(file); +} + static int event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file) { @@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file) { .name = "enable", .callback = event_callback, + .release= event_release, }, { .name = "filter", @@ -2634,6 +2643,9 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file) return ret; } + /*
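The ref-counting scheme this patch sets up — take a reference when the eventfs "enable" file is created, drop it from the new release callback when the eventfs inode goes away — can be modeled in userspace. The types below are illustrative stand-ins, not the tracefs API:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct trace_event_file with a simple refcount. */
struct event_file {
	int ref;
	int freed;
};

static void file_get(struct event_file *f) { f->ref++; }

static void file_put(struct event_file *f)
{
	if (--f->ref == 0)
		f->freed = 1;	/* the kernel would kfree() here */
}

/*
 * Stand-in for the eventfs_entry release callback added by the patch:
 * it drops the reference taken when the eventfs file was created.
 */
static void event_release(void *data)
{
	file_put(data);
}

/* Creating the eventfs "enable" file pins the trace_event_file... */
static void create_enable_file(struct event_file *f)
{
	file_get(f);
}

/*
 * ...and deleting the eventfs inode invokes the release callback, so
 * the trace_event_file cannot be freed while an eventfs file that
 * references it still exists.
 */
static void delete_enable_file(struct event_file *f)
{
	event_release(f);
}
```

The key property: even after the tracing core drops its own reference (event deleted), the object survives until the eventfs side releases its hold, closing the use-after-free window described in the changelog.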
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 30.04.24 17:22, Guillaume Morin wrote: On 26 Apr 21:55, Guillaume Morin wrote: On 26 Apr 9:19, David Hildenbrand wrote: A couple of points: a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want to check PageAnonExclusive. b) If you're not following the can_follow_write_pte/_pmd model, you are doing something wrong :) c) The code was heavily changed in mm/mm-unstable. It was merged with the common code. Likely, in mm/mm-unstable, the existing can_follow_write_pte and can_follow_write_pmd checks will already cover what you want in most cases. We'd need a can_follow_write_pud() to cover follow_huge_pud() and (unfortunately) something to handle follow_hugepd() as well similarly. Copy-pasting what we do in can_follow_write_pte() and adjusting for different PTE types is the right thing to do. Maybe now it's time to factor out the common checks into a separate helper. I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT I got it working on top of your uprobes-cow branch with the foll force patch sent Friday. Still pretty lightly tested Sorry for not replying earlier, was busy with other stuff. I'll try getting that stuff into shape and send it out soonish. I went with using one write uprobe function with some additional branches. I went back and forth between that and making them 2 different functions. All the folio_test_hugetlb() special casing is a bit suboptimal. Likely we want a separate variant, because we should be using hugetlb PTE functions consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), softdirty does not exist etc.) 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 2f4e88552d3f..8a33e380f7ea 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -83,6 +83,10 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = { {} }; +bool hugetlbfs_mapping(struct address_space *mapping) { + return mapping->a_ops == &hugetlbfs_aops; is_vm_hugetlb_page() might be what you are looking for. [...] } -static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len) +static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len, unsigned long page_mask) { void *kaddr = kmap_atomic(page); - memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len); + memcpy(dst, kaddr + (vaddr & ~page_mask), len); kunmap_atomic(kaddr); } -static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len) +static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len, unsigned long page_mask) { void *kaddr = kmap_atomic(page); - memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len); + memcpy(kaddr + (vaddr & ~page_mask), src, len); kunmap_atomic(kaddr); } These two changes really are rather ugly ... And why are they even required? We get a PAGE_SIZED-based subpage of a hugetlb page. We only kmap that one and copy within that one. In other words, I don't think the copy_from_page() and copy_to_page() changes are even required when we consistently work on subpages and not suddenly on head pages. -static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode) +static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode, unsigned long page_mask) { uprobe_opcode_t old_opcode; bool is_swbp; @@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t * is a trap variant; uprobes always wins over any other (gdb) * breakpoint. 
*/ - copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE); + copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE, + page_mask); is_swbp = is_swbp_insn(&old_opcode); if (is_swbp_insn(new_opcode)) { @@ -376,8 +378,8 @@ struct uwo_data { uprobe_opcode_t opcode; }; -static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr, - unsigned long next, struct mm_walk *walk) +static int __write_opcode(pte_t *ptep, unsigned long vaddr, + unsigned long page_mask, struct mm_walk *walk) Unrelated alignment change. { struct uwo_data *data = walk->private; const bool is_register = !!is_swbp_insn(&data->opcode); @@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr, /* Unmap + flush the TLB, such that we can write atomically. */ flush_cache_page(vma, vaddr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, vaddr, ptep); + if (folio_test_hugetlb(folio)) + pte = huge_ptep_clear_flush(vma, vaddr, ptep); + else + pte = ptep_clear_flush(vma, vaddr, ptep); copy_to_page(page, data->opcode_vaddr, &data->opcode, -
Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn
On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote: > On 4/30/24 8:05 AM, Edward Adam Davis wrote: > > static int vhost_task_fn(void *data) > > { > > struct vhost_task *vtsk = data; > > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data) > > schedule(); > > } > > > > - mutex_lock(&vtsk->exit_mutex); > > + mutex_lock(&exit_mutex); > > /* > > * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL. > > * When the vhost layer has called vhost_task_stop it's already stopped > > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data) > > vtsk->handle_sigkill(vtsk->data); > > } > > complete(&vtsk->exited); > > - mutex_unlock(&vtsk->exit_mutex); > > + mutex_unlock(&exit_mutex); > > > > Edward, thanks for the patch. I think though I just needed to swap the > order of the calls above. > > Instead of: > > complete(&vtsk->exited); > mutex_unlock(&vtsk->exit_mutex); > > it should have been: > > mutex_unlock(&vtsk->exit_mutex); > complete(&vtsk->exited); > > If my analysis is correct, then Michael do you want me to resubmit a > patch on top of your vhost branch or resubmit the entire patchset? Resubmit all please. -- MST
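For the ordering question being debated in this thread, the invariant that has to hold is: signalling the completion must be the final access to any memory the waiter is allowed to free. One way to make that robust is to keep the completion object in memory the waiter owns (its stack) rather than inside the struct being freed. The userspace sketch below illustrates that pattern only; it is not the vhost code or the patch under discussion:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

struct task {
	sem_t *exited;	/* lives on the WAITER's stack, not in this struct */
	int did_work;
};

/*
 * Worker: once the completion is posted, the waiter may free the task at
 * any moment. So the post must come after the last access to *t, and the
 * object being posted must not itself live inside *t.
 */
static void *task_fn(void *data)
{
	struct task *t = data;
	sem_t *exited = t->exited;	/* grab before the final store */

	t->did_work = 1;		/* last access to *t */
	sem_post(exited);		/* from here on, *t may be freed */
	return NULL;
}

static int run_and_reap(int *result)
{
	struct task *t = malloc(sizeof(*t));
	pthread_t thr;
	sem_t exited;			/* waiter-owned completion */

	sem_init(&exited, 0, 0);
	t->exited = &exited;
	t->did_work = 0;
	pthread_create(&thr, NULL, task_fn, t);

	sem_wait(&exited);		/* worker is done touching *t */
	*result = t->did_work;
	free(t);			/* safe: no further worker access */
	pthread_join(&thr, NULL);
	sem_destroy(&exited);
	return 0;
}
```

In the kernel discussion above, both proposed orderings still leave the completion (and mutex) embedded in the struct that gets freed, which is exactly why syzkaller keeps finding the race; moving the waited-on object out of the freed struct sidesteps the ordering question entirely.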
[PATCH] scripts/spdxcheck: Add count of missing files to stats output
Add a count of files missing an SPDX header to the stats output. This detail is useful when working on SPDX header additions. Signed-off-by: Tim Bird --- scripts/spdxcheck.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py index 18cb9f5b3d3d..8b8fb115fc81 100755 --- a/scripts/spdxcheck.py +++ b/scripts/spdxcheck.py @@ -412,6 +412,9 @@ if __name__ == '__main__': if parser.checked: pc = int(100 * parser.spdx_valid / parser.checked) sys.stderr.write('Files with SPDX: %12d %3d%%\n' %(parser.spdx_valid, pc)) +missing = parser.checked - parser.spdx_valid +mpc = int(100 * missing / parser.checked) +sys.stderr.write('Files without SPDX:%12d %3d%%\n' %(missing, mpc)) sys.stderr.write('Files with errors: %12d\n' %parser.spdx_errors) if ndirs: sys.stderr.write('\n') -- 2.25.1
[PATCH v3 4/4] dax/bus.c: Use the right locking mode (read vs write) in size_show
In size_show(), the dax_dev_rwsem only needs a read lock, but was
acquiring a write lock. Change it to down_read_interruptible() so it
doesn't unnecessarily hold a write lock.

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Cc: Dan Williams
Reviewed-by: Dan Williams
Signed-off-by: Vishal Verma
---
 drivers/dax/bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 0011a6e6a8f2..f24b67c64d5e 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -937,11 +937,11 @@ static ssize_t size_show(struct device *dev,
 	unsigned long long size;
 	int rc;
 
-	rc = down_write_killable(&dax_dev_rwsem);
+	rc = down_read_interruptible(&dax_dev_rwsem);
 	if (rc)
 		return rc;
 	size = dev_dax_size(dev_dax);
-	up_write(&dax_dev_rwsem);
+	up_read(&dax_dev_rwsem);
 
 	return sysfs_emit(buf, "%llu\n", size);
 }
-- 
2.44.0
[PATCH v3 3/4] dax/bus.c: Don't use down_write_killable for non-user processes
Change an instance of down_write_killable() to a simple down_write() where
there is no user process that might want to interrupt the operation.

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Reported-by: Dan Williams
Reviewed-by: Dan Williams
Signed-off-by: Vishal Verma
---
 drivers/dax/bus.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index e2c7354ce328..0011a6e6a8f2 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1540,12 +1540,8 @@ static struct dev_dax *__devm_create_dev_dax(struct dev_dax_data *data)
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dev_dax *dev_dax;
-	int rc;
-
-	rc = down_write_killable(&dax_region_rwsem);
-	if (rc)
-		return ERR_PTR(rc);
 
+	down_write(&dax_region_rwsem);
 	dev_dax = __devm_create_dev_dax(data);
 	up_write(&dax_region_rwsem);
-- 
2.44.0
[PATCH v3 1/4] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts
In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
referenced patch should be replaced with lockdep_assert_held() or
lockdep_assert_held_write(). Replace these as appropriate.

Link: https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch [1]
Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Cc: Andrew Morton
Reported-by: Dan Williams
Reviewed-by: Dan Williams
Signed-off-by: Vishal Verma
---
 drivers/dax/bus.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 797e1ebff299..7924dd542a13 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -192,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax)
 	u64 size = 0;
 	int i;
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem));
+	lockdep_assert_held(&dax_dev_rwsem);
 
 	for (i = 0; i < dev_dax->nr_range; i++)
 		size += range_len(&dev_dax->ranges[i].range);
@@ -302,7 +302,7 @@ static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
 	resource_size_t size = resource_size(&dax_region->res);
 	struct resource *res;
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
+	lockdep_assert_held(&dax_region_rwsem);
 
 	for_each_dax_region_resource(dax_region, res)
 		size -= resource_size(res);
@@ -447,7 +447,7 @@ static void trim_dev_dax_range(struct dev_dax *dev_dax)
 	struct range *range = &dev_dax->ranges[i].range;
 	struct dax_region *dax_region = dev_dax->region;
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
+	lockdep_assert_held_write(&dax_region_rwsem);
 	dev_dbg(&dev_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
 		(unsigned long long)range->start,
 		(unsigned long long)range->end);
@@ -507,7 +507,7 @@ static int __free_dev_dax_id(struct dev_dax *dev_dax)
 	struct dax_region *dax_region;
 	int rc = dev_dax->id;
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem));
+	lockdep_assert_held_write(&dax_dev_rwsem);
 
 	if (!dev_dax->dyn_id || dev_dax->id < 0)
 		return -1;
@@ -713,7 +713,7 @@ static void __unregister_dax_mapping(void *data)
 
 	dev_dbg(dev, "%s\n", __func__);
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
+	lockdep_assert_held_write(&dax_region_rwsem);
 
 	dev_dax->ranges[mapping->range_id].mapping = NULL;
 	mapping->range_id = -1;
@@ -830,7 +830,7 @@ static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
 	struct device *dev;
 	int rc;
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
+	lockdep_assert_held_write(&dax_region_rwsem);
 
 	if (dev_WARN_ONCE(&dev_dax->dev, !dax_region->dev->driver,
 			"region disabled\n"))
@@ -876,7 +876,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 	struct resource *alloc;
 	int i, rc;
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
+	lockdep_assert_held_write(&dax_region_rwsem);
 
 	/* handle the seed alloc special case */
 	if (!size) {
@@ -935,7 +935,7 @@ static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, r
 	struct device *dev = &dev_dax->dev;
 	int rc;
 
-	WARN_ON_ONCE(!rwsem_is_locked(&dax_region_rwsem));
+	lockdep_assert_held_write(&dax_region_rwsem);
 
 	if (dev_WARN_ONCE(dev, !size, "deletion is handled by dev_dax_shrink\n"))
 		return -EINVAL;
-- 
2.44.0
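The motivation for swapping WARN_ON_ONCE(!rwsem_is_locked(...)) for lockdep asserts can be shown with a toy model (illustrative only, not the kernel's rwsem): rwsem_is_locked() only says that *somebody* holds the lock, while lockdep_assert_held() checks that the *current* task holds it — a strictly stronger assertion.

```c
/* Toy rwsem with an explicit owner field; the "current task" is passed
 * in as a plain id to keep the sketch single-threaded and testable. */
struct toy_rwsem {
	int locked;
	int owner;	/* valid only while locked for write */
};

/* analogue of rwsem_is_locked(): true even if ANOTHER task holds it */
int toy_is_locked(const struct toy_rwsem *s)
{
	return s->locked;
}

/* analogue of lockdep_assert_held(): true only for the actual holder */
int toy_held_by(const struct toy_rwsem *s, int task)
{
	return s->locked && s->owner == task;
}
```

With the weaker check, a function called without the lock would still pass whenever an unrelated task happened to hold it — precisely the false negative the lockdep asserts eliminate (and lockdep checks compile away entirely on !CONFIG_LOCKDEP builds).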
[PATCH v3 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths
Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a
local rwsem") aimed to undo device_lock() abuses for protecting changes
to dax-driver internal data-structures like the dax_region resource tree
to device-dax-instance range structures. However, the device_lock() was
legitimately enforcing that devices to be deleted were not currently
attached to any driver nor assigned any capacity from the region.

As a result of the device_lock restoration in delete_store(), the
conditional locking in unregister_dev_dax() and unregister_dax_mapping()
can be removed.

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem")
Reported-by: Dan Williams
Reviewed-by: Dan Williams
Signed-off-by: Vishal Verma
---
 drivers/dax/bus.c | 42 ++++++++----------------------------------
 1 file changed, 8 insertions(+), 34 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 7924dd542a13..e2c7354ce328 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -465,26 +465,17 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
 		trim_dev_dax_range(dev_dax);
 }
 
-static void __unregister_dev_dax(void *dev)
+static void unregister_dev_dax(void *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 
 	dev_dbg(dev, "%s\n", __func__);
 
+	down_write(&dax_region_rwsem);
 	kill_dev_dax(dev_dax);
 	device_del(dev);
 	free_dev_dax_ranges(dev_dax);
 	put_device(dev);
-}
-
-static void unregister_dev_dax(void *dev)
-{
-	if (rwsem_is_locked(&dax_region_rwsem))
-		return __unregister_dev_dax(dev);
-
-	if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0))
-		return;
-	__unregister_dev_dax(dev);
 	up_write(&dax_region_rwsem);
 }
 
@@ -560,15 +551,10 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
 	if (!victim)
 		return -ENXIO;
 
-	rc = down_write_killable(&dax_region_rwsem);
-	if (rc)
-		return rc;
-	rc = down_write_killable(&dax_dev_rwsem);
-	if (rc) {
-		up_write(&dax_region_rwsem);
-		return rc;
-	}
+	device_lock(dev);
+	device_lock(victim);
 	dev_dax = to_dev_dax(victim);
+	down_write(&dax_dev_rwsem);
 	if (victim->driver || dev_dax_size(dev_dax))
 		rc = -EBUSY;
 	else {
@@ -589,11 +575,12 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
 			rc = -EBUSY;
 	}
 	up_write(&dax_dev_rwsem);
+	device_unlock(victim);
 
 	/* won the race to invalidate the device, clean it up */
 	if (do_del)
 		devm_release_action(dev, unregister_dev_dax, victim);
-	up_write(&dax_region_rwsem);
+	device_unlock(dev);
 	put_device(victim);
 
 	return rc;
@@ -705,7 +692,7 @@ static void dax_mapping_release(struct device *dev)
 	put_device(parent);
 }
 
-static void __unregister_dax_mapping(void *data)
+static void unregister_dax_mapping(void *data)
 {
 	struct device *dev = data;
 	struct dax_mapping *mapping = to_dax_mapping(dev);
@@ -713,25 +700,12 @@ static void __unregister_dax_mapping(void *data)
 
 	dev_dbg(dev, "%s\n", __func__);
 
-	lockdep_assert_held_write(&dax_region_rwsem);
-
 	dev_dax->ranges[mapping->range_id].mapping = NULL;
 	mapping->range_id = -1;
 
 	device_unregister(dev);
 }
 
-static void unregister_dax_mapping(void *data)
-{
-	if (rwsem_is_locked(&dax_region_rwsem))
-		return __unregister_dax_mapping(data);
-
-	if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0))
-		return;
-	__unregister_dax_mapping(data);
-	up_write(&dax_region_rwsem);
-}
-
 static struct dev_dax_range *get_dax_range(struct device *dev)
 {
 	struct dax_mapping *mapping = to_dax_mapping(dev);
-- 
2.44.0
Re: [PATCH v3 0/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1
On Tue, Apr 30, 2024 at 04:23:05PM +0530, Beleswar Padhi wrote: > PSC controller has a limitation that it can only power-up the second core > when the first core is in ON state. Power-state for core0 should be equal > to or higher than core1, else the kernel is seen hanging during rproc > loading. > > Make the powering up of cores sequential, by waiting for the current core > to power-up before proceeding to the next core, with a timeout of 2sec. > Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait > for the current core to be released from reset before proceeding with the > next core. > > Also, ensure that core1 can not be powered on before core0 when starting > cores from sysfs. Similarly, ensure that core0 can not be shutdown > before core1 from sysfs. > > v3: Changelog: > 1) Added my own Signed-off-by in PATCH 1, specifying the changes done > 2) Addressed changes requested by adding comments in code in PATCH 1 > > Link to v2: > https://lore.kernel.org/all/20240424130504.494916-1-b-pa...@ti.com/ > > v2: Changelog: > 1) Fixed multi-line comment format > 2) Included root cause of bug in comments > 3) Added a patch to ensure power-up/shutdown is sequential via sysfs > > Link to v1: > https://lore.kernel.org/all/20230906124756.3480579-1-a-nan...@ti.com/ > > Apurva Nandan (1): > remoteproc: k3-r5: Wait for core0 power-up before powering up core1 > > Beleswar Padhi (1): > remoteproc: k3-r5: Do not allow core1 to power up before core0 via > sysfs > > drivers/remoteproc/ti_k3_r5_remoteproc.c | 56 +++- > 1 file changed, 54 insertions(+), 2 deletions(-) Applied - thanks, Mathieu > > -- > 2.34.1 >
Re: [PATCH v4 0/4] Support MT8188 SCP core 1
On Tue, Apr 30, 2024 at 09:15:30AM +0800, Olivia Wen wrote: > Change in v4: > Updating the description of PATCH v4 4/4. > > Olivia Wen (4): > dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP > remoteproc: mediatek: Support MT8188 SCP core 1 > remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes > remoteproc: mediatek: Add IMGSYS IPI command > > .../devicetree/bindings/remoteproc/mtk,scp.yaml| 2 + > drivers/remoteproc/mtk_common.h| 11 +- > drivers/remoteproc/mtk_scp.c | 230 > +++-- > drivers/remoteproc/mtk_scp_ipi.c | 7 +- > include/linux/remoteproc/mtk_scp.h | 1 + > 5 files changed, 225 insertions(+), 26 deletions(-) I have applied this set. Thanks, Mathieu > > -- > 2.6.4 >
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Tue, Apr 30, 2024 at 6:32 AM Masami Hiramatsu wrote:
>
> On Mon, 29 Apr 2024 13:25:04 -0700
> Andrii Nakryiko wrote:
>
> > On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu
> > wrote:
> > >
> > > Hi Andrii,
> > >
> > > On Thu, 25 Apr 2024 13:31:53 -0700
> > > Andrii Nakryiko wrote:
> > >
> > > > Hey Masami,
> > > >
> > > > I can't really review most of that code as I'm completely unfamiliar
> > > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > > a few comments where there were somewhat more obvious BPF-related
> > > > pieces.
> > > >
> > > > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > > > and then with your series applied on top. Just to see if there are any
> > > > regressions. I think it will be a useful data point for you.
> > >
> > > Thanks for testing!
> > >
> > > >
> > > > You should be already familiar with the bench tool we have in BPF
> > > > selftests (I used it on some other patches for your tree).
> > >
> > > What patches we need?
> > >
> >
> > You mean for this `bench` tool? They are part of BPF selftests (under
> > tools/testing/selftests/bpf), you can build them by running:
> >
> > $ make RELEASE=1 -j$(nproc) bench
> >
> > After that you'll get a self-contained `bench` binary, which has all
> > the self-contained benchmarks.
> >
> > You might also find a small script (benchs/run_bench_trigger.sh inside
> > BPF selftests directory) helpful, it collects final summary of the
> > benchmark run and optionally accepts a specific set of benchmarks. So
> > you can use it like this:
> >
> > $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> > kprobe : 18.731 ± 0.639M/s
> > kprobe-multi : 23.938 ± 0.612M/s
> >
> > By default it will run a wider set of benchmarks (no uprobes, but a
> > bunch of extra fentry/fexit tests and stuff like this).
> >
> origin:
>
> # benchs/run_bench_trigger.sh
> kretprobe : 1.329 ± 0.007M/s
> kretprobe-multi : 1.341 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.288 ± 0.014M/s
> kretprobe-multi : 1.365 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.329 ± 0.002M/s
> kretprobe-multi : 1.331 ± 0.011M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.311 ± 0.003M/s
> kretprobe-multi : 1.318 ± 0.002M/s
>
> patched:
>
> # benchs/run_bench_trigger.sh
> kretprobe : 1.274 ± 0.003M/s
> kretprobe-multi : 1.397 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.307 ± 0.002M/s
> kretprobe-multi : 1.406 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.279 ± 0.004M/s
> kretprobe-multi : 1.330 ± 0.014M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.256 ± 0.010M/s
> kretprobe-multi : 1.412 ± 0.003M/s
>
> Hmm, in my case, it seems smaller differences (~3%?).
> I attached perf report results for those, but I don't see large difference.

I ran my benchmarks on bare metal machine (and quite powerful at that,
you can see my numbers are almost 10x of yours), with mitigations
disabled, no retpolines, etc. If you have any of those mitigations it
might result in smaller differences, probably. If you are running
inside QEMU/VM, the results might differ significantly as well.

> > >
> > > >
> > > > BASELINE
> > > > ========
> > > > kprobe : 24.634 ± 0.205M/s
> > > > kprobe-multi : 28.898 ± 0.531M/s
> > > > kretprobe : 10.478 ± 0.015M/s
> > > > kretprobe-multi : 11.012 ± 0.063M/s
> > > >
> > > > THIS PATCH SET ON TOP
> > > > =====================
> > > > kprobe : 25.144 ± 0.027M/s (+2%)
> > > > kprobe-multi : 28.909 ± 0.074M/s
> > > > kretprobe : 9.482 ± 0.008M/s (-9.5%)
> > > > kretprobe-multi: 13.688 ± 0.027M/s (+24%)
> > >
> > > This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> > > eventually because it should be a single callback version of
> > > kretprobe-multi.
> >
> I ran another benchmark (prctl loop, attached), the origin kernel result is
> here;
>
> # sh ./benchmark.sh
> count = 1000, took 6.748133 sec
>
> And the patched kernel result;
>
> # sh ./benchmark.sh
> count = 1000, took 6.644095 sec
>
> I confirmed that the perf result has no big difference.
>
> Thank you,
>
> >
> > >
> > > >
> > > > These numbers are pretty stable and look to be more or less
> > > > representative.
> > > >
> > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > > > about the same, though.
> > > >
> > > > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > > > slower, almost by 10%. Not sure why, but looks real after re-running
> > > > benchmarks a bunch of times and getting stable results.
> > >
> > > Hmm, kretprobe on x86 should use ftrace + rethook even with my series.
> > > So nothing should be changed. Maybe cache access pattern has been
> > > changed?
> > > I'll check it with tracefs (to remove the effect from bpf related changes)
> > >
> > > >
> > > > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > > > Again, I
Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn
On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> static int vhost_task_fn(void *data)
> {
> 	struct vhost_task *vtsk = data;
> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> 		schedule();
> 	}
>
> -	mutex_lock(&vtsk->exit_mutex);
> +	mutex_lock(&exit_mutex);
> 	/*
> 	 * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> 	 * When the vhost layer has called vhost_task_stop it's already stopped
> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> 		vtsk->handle_sigkill(vtsk->data);
> 	}
> 	complete(&vtsk->exited);
> -	mutex_unlock(&vtsk->exit_mutex);
> +	mutex_unlock(&exit_mutex);
>

Edward, thanks for the patch. I think though I just needed to swap the
order of the calls above.

Instead of:

complete(&vtsk->exited);
mutex_unlock(&vtsk->exit_mutex);

it should have been:

mutex_unlock(&vtsk->exit_mutex);
complete(&vtsk->exited);

If my analysis is correct, then Michael do you want me to resubmit a
patch on top of your vhost branch or resubmit the entire patchset?
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 26 Apr 21:55, Guillaume Morin wrote:
> On 26 Apr 9:19, David Hildenbrand wrote:
> > A couple of points:
> >
> > a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
> > to check PageAnonExclusive.
> >
> > b) If you're not following the can_follow_write_pte/_pmd model, you are
> > doing something wrong :)
> >
> > c) The code was heavily changed in mm/mm-unstable. It was merged with t
> > the common code.
> >
> > Likely, in mm/mm-unstable, the existing can_follow_write_pte and
> > can_follow_write_pmd checks will already cover what you want in most cases.
> >
> > We'd need a can_follow_write_pud() to cover follow_huge_pud() and
> > (unfortunately) something to handle follow_hugepd() as well similarly.
> >
> > Copy-pasting what we do in can_follow_write_pte() and adjusting for
> > different PTE types is the right thing to do. Maybe now it's time to factor
> > out the common checks into a separate helper.
>
> I tried to get the hugepd stuff right but this was the first I heard
> about it :-) Afaict follow_huge_pmd and friends were already DTRT

I got it working on top of your uprobes-cow branch with the foll force
patch sent friday. Still pretty lightly tested

I went with using one write uprobe function with some additional
branches. I went back and forth between that and making them 2 different
functions.

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
 	{}
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping) {
+	return mapping->a_ops == &hugetlbfs_aops;
+}
+
 /*
  * Mask used when checking the page offset value passed in via system
  * calls. This value will be converted to a loff_t which is signed.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 68244bb3637a..66fdcc0b5db2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,6 +511,8 @@ struct hugetlbfs_sb_info {
 	umode_t mode;
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping);
+
 static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -557,6 +559,8 @@ static inline struct hstate *hstate_inode(struct inode *i)
 {
 	return NULL;
 }
+
+static inline bool hugetlbfs_mapping(struct address_space *mapping) { return false; }
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4237a7477cca..e6e93a574c39 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/highmem.h>
+#include <linux/hugetlb.h>
 #include <linux/pagemap.h>	/* read_mapping_page */
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -120,7 +121,7 @@ struct xol_area {
  */
 static bool valid_vma(struct vm_area_struct *vma, bool is_register)
 {
-	vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
+	vm_flags_t flags = VM_MAYEXEC | VM_MAYSHARE;
 
 	if (is_register)
 		flags |= VM_WRITE;
@@ -163,21 +164,21 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
 	return is_swbp_insn(insn);
 }
 
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len, unsigned long page_mask)
 {
 	void *kaddr = kmap_atomic(page);
-	memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+	memcpy(dst, kaddr + (vaddr & ~page_mask), len);
 	kunmap_atomic(kaddr);
 }
 
-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
+static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len, unsigned long page_mask)
 {
 	void *kaddr = kmap_atomic(page);
-	memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+	memcpy(kaddr + (vaddr & ~page_mask), src, len);
 	kunmap_atomic(kaddr);
 }
 
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
+static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode, unsigned long page_mask)
 {
 	uprobe_opcode_t old_opcode;
 	bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	 * is a trap variant; uprobes always wins over any other (gdb)
 	 * breakpoint.
 	 */
-	copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE);
+	copy_from_page(page, vaddr, &old_opcode, UPROBE_SWBP_INSN_SIZE,
+		       page_mask);
 	is_swbp = is_swbp_insn(&old_opcode);
 
 	if (is_swbp_insn(new_opcode)) {
@@ -376,8 +378,8 @@ struct uwo_data {
 	uprobe_opcode_t opcode;
 };
 
-static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,
-		unsigned long next, struct mm_walk *walk)
+static int __write_opcode(pte_t *ptep,
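The core of the copy_from_page()/copy_to_page() change above is that the offset within the mapped page is computed with a caller-supplied page_mask instead of the fixed PAGE_MASK, so the same helpers work for a 2MiB hugetlb page. A minimal sketch of that arithmetic (constants are illustrative):

```c
#define SZ_4K	(1UL << 12)	/* base page size */
#define SZ_2M	(1UL << 21)	/* PMD-sized huge page */

/*
 * Offset of vaddr within its containing page, for an arbitrary page size.
 * page_mask plays the role of PAGE_MASK (or its huge-page equivalent).
 */
unsigned long page_offset(unsigned long vaddr, unsigned long page_size)
{
	unsigned long page_mask = ~(page_size - 1);

	return vaddr & ~page_mask;
}
```

With a fixed PAGE_MASK, an address 0x1234 bytes into a 2MiB page would be folded to an offset within the wrong 4KiB window of the kmap'd huge page; passing the real mask keeps the probe bytes landing at the correct offset.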
Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
On Mon, Apr 29, 2024 at 04:11:03PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to add and remove the same enclave page
> > simultaneously (e.g., if the SGX runtime supports both lazy allocation
> > and `MADV_DONTNEED` semantics). Consider this race:
> >
> > 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
> >    after removing the page table entry and right before re-acquiring the
> >    enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
> > 2. T2 tries to access the page, and #PF[not_present] is raised. The
> >    condition to EAUG in sgx_vma_fault() is not satisfied because the
> >    page is still present in encl->page_array, thus the SGX driver
> >    assumes that the fault happened because the page was swapped out. The
> >    driver continues on a code path that installs a page table entry
> >    *without* performing EAUG.
> > 3. The enclave page metadata is in inconsistent state: the PTE is
> >    installed but there was no EAUG. Thus, T2 in userspace infinitely
> >    receives SIGSEGV on this page (and EACCEPT always fails).
> >
> > Fix this by making sure that T1 (the page-removing thread) always wins
> > this data race. In particular, the page-being-removed is marked as such,
> > and T2 retries until the page is fully removed.
> >
> > Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Dmitrii Kuvaiskii
> > ---
> >  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
> >  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
> >  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
> >  3 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 41f14b1a3025..7ccd8b2fce5f 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
> >
> >  	/* Entry successfully located. */
> >  	if (entry->epc_page) {
> > -		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> > +		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> > +				   SGX_ENCL_PAGE_BEING_REMOVED))
> >  			return ERR_PTR(-EBUSY);
> >
> >  		return entry;
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> > index f94ff14c9486..fff5f2293ae7 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -25,6 +25,9 @@
> >  /* 'desc' bit marking that the page is being reclaimed. */
> >  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
> >
> > +/* 'desc' bit marking that the page is being removed. */
> > +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
> > +
> >  struct sgx_encl_page {
> >  	unsigned long desc;
> >  	unsigned long vm_max_prot_bits:8;
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index b65ab214bdf5..c542d4dd3e64 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> >  		 * Do not keep encl->lock because of dependency on
> >  		 * mmap_lock acquired in sgx_zap_enclave_ptes().
> >  		 */
> > +		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
> >  		mutex_unlock(&encl->lock);
> >
> >  		sgx_zap_enclave_ptes(encl, addr);
>
> It is somewhat trivial to NAK this as the commit message does
> not do any effort describing the new flag. By default at least
> I have strong opposition against any new flags related to
> reclaiming even if it needs a bit of extra synchronization
> work in the user space.
>
> One way to describe concurrency scenarios would be to take
> example from https://www.kernel.org/doc/Documentation/memory-barriers.txt
>
> I.e. see the examples with CPU 1 and CPU 2.

Thank you for the suggestion. Here is my new attempt at describing the
racy scenario:

Consider some enclave page added to the enclave. User space decides to
temporarily remove this page (e.g., emulating the MADV_DONTNEED
semantics) on CPU1. At the same time, user space performs a memory
access on the same page on CPU2, which results in a #PF and ultimately
in sgx_vma_fault(). Scenario proceeds as follows:

/*
 * CPU1: User space performs
 * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
 * on a single enclave page
 */
sgx_encl_remove_pages() {

	mutex_lock(&encl->lock);

	entry = sgx_encl_load_page(encl);
	/*
	 * verify that page is
	 * trimmed and accepted
	 */

	mutex_unlock(&encl->lock);

	/*
	 * remove PTE entry; cannot
	 * be performed under lock
	 */
	sgx_zap_enclave_ptes(encl);

				/*
				 * Fault on CPU2
				 */
				sgx_vma_fault() {
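The check the patch adds to __sgx_encl_load_page() can be modeled in a few lines of userspace C (hypothetical struct, not the driver code): a present page marked as being reclaimed *or* being removed is reported as transiently busy, so the fault handler retries instead of wiring up a PTE for a page that is about to be EREMOVEd.

```c
#include <errno.h>

/* bit values mirror the patch: BEING_RECLAIMED is BIT(3), the new
 * BEING_REMOVED flag is BIT(2) */
#define SGX_ENCL_PAGE_BEING_RECLAIMED	(1UL << 3)
#define SGX_ENCL_PAGE_BEING_REMOVED	(1UL << 2)

struct fake_encl_page {
	unsigned long desc;
	int has_epc_page;	/* stand-in for entry->epc_page != NULL */
};

/* models the fault-path check in __sgx_encl_load_page() */
int load_page_sketch(const struct fake_encl_page *entry)
{
	if (entry->has_epc_page) {
		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
				   SGX_ENCL_PAGE_BEING_REMOVED))
			return -EBUSY;	/* T2 backs off and retries */
		return 0;		/* page usable: install PTE */
	}
	return -ENOENT;			/* truly absent: EAUG path */
}
```

In the race described above, T1 sets the BEING_REMOVED bit before dropping the enclave lock, so T2's fault sees -EBUSY rather than mistaking the half-removed page for a swapped-out one.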
Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Mon, Apr 29, 2024 at 04:04:24PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to access the same non-present enclave page
> > simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> > threads will end up in sgx_encl_eaug_page(), racing to acquire the
> > enclave lock. The winning thread will perform EAUG, set up the page
> > table entry, and insert the page into encl->page_array. The losing
> > thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> > to error handling path.
>
> And that path removes page. Not sure I got gist of this tbh.

Well, this is not about a redundant EREMOVE performed. This is about the
enclave page becoming inaccessible due to a bug triggered with a data
race.

Consider some enclave page not yet added to the enclave. The enclave
performs a memory access to it at the same time on CPU1 and CPU2. Since
the page does not yet have a corresponding PTE, the #PF handler on both
CPUs calls sgx_vma_fault(). Scenario proceeds as follows:

/*
 * Fault on CPU1
 */
sgx_vma_fault() {

	xa_load(&encl->page_array) == NULL ->

	sgx_encl_eaug_page() {

		...

				/*
				 * Fault on CPU2
				 */
				sgx_vma_fault() {

					xa_load(&encl->page_array) == NULL ->

					sgx_encl_eaug_page() {

						...

						mutex_lock(&encl->lock);
						/*
						 * alloc encl_page
						 */
						/*
						 * alloc EPC page
						 */
						epc_page = sgx_alloc_epc_page(...);
						/*
						 * add page to enclave's xarray
						 */
						xa_insert(&encl->page_array, ...);
						/*
						 * add page to enclave via EAUG
						 * (page is in pending state)
						 */
						/*
						 * add PTE entry
						 */
						vmf_insert_pfn(...);

						mutex_unlock(&encl->lock);
						return VM_FAULT_NOPAGE;
					}
				}

		mutex_lock(&encl->lock);
		/*
		 * alloc encl_page
		 */
		/*
		 * alloc EPC page
		 */
		epc_page = sgx_alloc_epc_page(...);
		/*
		 * add page to enclave's xarray,
		 * this fails with -EBUSY
		 */
		xa_insert(&encl->page_array, ...);

	err_out_shrink:
		sgx_encl_free_epc_page(epc_page) {
			/*
			 * remove page via EREMOVE
			 */
			/*
			 * free EPC page
			 */
			sgx_free_epc_page(epc_page);
		}

		mutex_unlock(&encl->lock);
		return VM_FAULT_SIGBUS;
	}
}

CPU2 added the enclave page (in pending state) to the enclave and
installed the PTE. The kernel gives control back to the user space,
without raising a signal. The user space on CPU2 retries the memory
access and induces a page fault, but now with the SGX bit set in the #PF
error code. The #PF handler calls do_user_addr_fault(), which calls
access_error() and ultimately raises a SIGSEGV. The userspace SIGSEGV
handler is supposed to perform EACCEPT, after which point the enclave
page becomes accessible.

CPU1 however jumps to the error handling path because the page was
already inserted into the enclave's xarray. This error handling path
EREMOVEs the page and also raises a SIGBUS signal to user space. The PTE
entry is not removed.

After CPU1 performs EREMOVE, this enclave page becomes perpetually
inaccessible (until an SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is
because the page is marked accessible in the PTE entry but is not
EAUGed. Because of this combination, the #PF handler sees the SGX bit
set in the #PF error code and does not call sgx_vma_fault() but instead
raises a SIGSEGV. The userspace SIGSEGV handler cannot perform EACCEPT
because the page was not EAUGed. Thus, the user space is stuck with the
inaccessible page.

Also note that in the scenario, CPU1 raises a SIGBUS signal to user
space unnecessarily. This signal is spurious because a page-access retry
on CPU2 will also raise the SIGBUS signal. That said, this side effect
is less severe because it affects only user space. Therefore, it could
be circumvented in user space alone, but it seems reasonable to fix it
in this patch.

> > This error handling path
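The behavior argued for above can be condensed into a toy model (userspace sketch, not the actual driver code): when the racing insert into the page xarray fails because another CPU already added the page, the loser should back off and let the fault be retried, rather than tear the winner's page down and signal SIGBUS.

```c
enum fault_ret { FAULT_NOPAGE, FAULT_SIGBUS };

/* single-slot stand-in for encl->page_array at one address */
struct fake_xarray { int occupied; };

/* returns 0 on success, -1 if the slot was already taken (like -EBUSY) */
static int xa_insert_sketch(struct fake_xarray *xa)
{
	if (xa->occupied)
		return -1;
	xa->occupied = 1;
	return 0;
}

/* models the fixed sgx_encl_eaug_page() outcome for winner and loser */
enum fault_ret eaug_page_sketch(struct fake_xarray *page_array)
{
	if (xa_insert_sketch(page_array) != 0) {
		/*
		 * Lost the race: the winner already EAUGed the page and
		 * installed the PTE. Do NOT EREMOVE it and do not signal
		 * SIGBUS; just let the fault be retried.
		 */
		return FAULT_NOPAGE;
	}
	/* won the race: EAUG + vmf_insert_pfn() would happen here */
	return FAULT_NOPAGE;
}
```

The buggy flow corresponds to the loser clearing `occupied` (EREMOVE) and returning FAULT_SIGBUS, which is what leaves the page with a PTE but no EAUG.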
Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows
On Mon, Apr 29, 2024 at 04:06:39PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> > enclave pages and may support MADV_DONTNEED semantics [1]. The former
> > implies #PF-based page allocation, and the latter implies the usage of
> > SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
> >
> > A trivial program like below (run under Gramine and with EDMM enabled)
> > stresses these two flows in the SGX driver and hangs:
> >
> > /* repeatedly touch different enclave pages at random and mix with
> >  * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> > static void* thread_func(void* arg) {
> >     size_t num_pages = 0xA000 / page_size;
> >     for (int i = 0; i < 5000; i++) {
> >         size_t page = get_random_ulong() % num_pages;
> >         char data = READ_ONCE(((char*)arg)[page * page_size]);
> >
> >         page = get_random_ulong() % num_pages;
> >         madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> >     }
> > }
> >
> > addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> > pthread_t threads[16];
> > for (int i = 0; i < 16; i++)
> >     pthread_create(&threads[i], NULL, thread_func, addr);
>
> I'm not convinced that kernel is the problem here but it could be also
> how Gramine is implemented.
>
> So maybe you could make a better case of that. The example looks a bit
> artificial to me.

I believe that these are the bugs in the kernel (in the SGX driver). I
provided more detailed descriptions of the races and ensuing bugs in the
other two replies, please check them.

The example is a stress test written to debug very infrequent hangs of
real-world applications that are run with Gramine, EDMM, and two
optimizations (lazy allocation and MADV_DONTNEED semantics). We observed
hangs of Node.js, PyTorch, R, iperf, Blender, Nginx. To root cause these
hangs, we wrote this artificial stress test. This test succeeds on
vanilla Linux, so ideally it should also pass on Gramine.

Please also note that the optimizations of lazy allocation and
MADV_DONTNEED provide significant performance improvement for some
workloads that run on Gramine. For example, a Java workload with a 16GB
enclave size has approx. 57x improvement in total runtime. Thus, we
consider it important to permit these optimizations in Gramine, which
IIUC requires bug fixes in the SGX driver. You can find more info at
https://github.com/gramineproject/gramine/pull/1513.

Which parts do you consider artificial, and how could I modify the
stress test?

-- 
Dmitrii Kuvaiskii
Re: [PATCH v9 1/3] soc: qcom: Add qcom_rproc_minidump module
On Tue, Mar 26, 2024 at 07:43:12PM +0530, Mukesh Ojha wrote: > Add qcom_rproc_minidump module in a preparation to remove > minidump specific code from driver/remoteproc/qcom_common.c > and provide needed exported API, this as well helps to > abstract minidump specific data layout from qualcomm's > remoteproc driver. > > It is just a copying of qcom_minidump() functionality from > driver/remoteproc/qcom_common.c into a separate file under > qcom_rproc_minidump(). > I'd prefer to see this enter the git history as one patch, extracting this logic from the remoteproc into its own driver - rather than as presented here give a sense that it's a new thing added. (I'll take care of the maintainer sync...) I also would prefer for this to include a problem description, documenting why this is done. I've not compared patch 1 and 3, but I'd also like a statement in the commit message telling if there are any changes, or if the functions are cleanly moved from one place to another. Regards, Bjorn > Signed-off-by: Mukesh Ojha > --- > Changes in v9: > - Added source file driver/remoteproc/qcom_common.c copyright >to qcom_rproc_minidump.c > - Dissociated it from minidump series as this can go separately >and minidump can put it dependency for the data structure files. > > Nothing much changed in these three patches from previous version, > However, giving the link of their older versions. 
> > v8: > https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/ > v7: > https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/ > v6: > https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/ > v5: > https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/ > v4: > https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/ > > drivers/soc/qcom/Kconfig | 10 +++ > drivers/soc/qcom/Makefile | 1 + > drivers/soc/qcom/qcom_minidump_internal.h | 64 + > drivers/soc/qcom/qcom_rproc_minidump.c| 115 > ++ > include/soc/qcom/qcom_minidump.h | 23 ++ > 5 files changed, 213 insertions(+) > create mode 100644 drivers/soc/qcom/qcom_minidump_internal.h > create mode 100644 drivers/soc/qcom/qcom_rproc_minidump.c > create mode 100644 include/soc/qcom/qcom_minidump.h > > diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig > index 5af33b0e3470..ed23e0275c22 100644 > --- a/drivers/soc/qcom/Kconfig > +++ b/drivers/soc/qcom/Kconfig > @@ -277,4 +277,14 @@ config QCOM_PBS > This module provides the APIs to the client drivers that wants to > send the > PBS trigger event to the PBS RAM. > > +config QCOM_RPROC_MINIDUMP > + tristate "QCOM Remoteproc Minidump Support" > + depends on ARCH_QCOM || COMPILE_TEST > + depends on QCOM_SMEM > + help > + Enablement of core Minidump feature is controlled from boot firmware > + side, so if it is enabled from firmware, this config allow Linux to > + query predefined Minidump segments associated with the remote > processor > + and check its validity and end up collecting the dump on remote > processor > + crash during its recovery. 
> endmenu > diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile > index ca0bece0dfff..44664589263d 100644 > --- a/drivers/soc/qcom/Makefile > +++ b/drivers/soc/qcom/Makefile > @@ -36,3 +36,4 @@ obj-$(CONFIG_QCOM_ICC_BWMON)+= icc-bwmon.o > qcom_ice-objs+= ice.o > obj-$(CONFIG_QCOM_INLINE_CRYPTO_ENGINE) += qcom_ice.o > obj-$(CONFIG_QCOM_PBS) +=qcom-pbs.o > +obj-$(CONFIG_QCOM_RPROC_MINIDUMP)+= qcom_rproc_minidump.o > diff --git a/drivers/soc/qcom/qcom_minidump_internal.h > b/drivers/soc/qcom/qcom_minidump_internal.h > new file mode 100644 > index ..71709235b196 > --- /dev/null > +++ b/drivers/soc/qcom/qcom_minidump_internal.h > @@ -0,0 +1,64 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved. > + */ > + > +#ifndef _QCOM_MINIDUMP_INTERNAL_H_ > +#define _QCOM_MINIDUMP_INTERNAL_H_ > + > +#define MAX_NUM_OF_SS 10 > +#define MAX_REGION_NAME_LENGTH 16 > +#define SBL_MINIDUMP_SMEM_ID 602 > +#define MINIDUMP_REGION_VALID ('V' << 24 | 'A' << 16 | 'L' << 8 | > 'I' << 0) > +#define MINIDUMP_SS_ENCR_DONE ('D' << 24 | 'O' << 16 | 'N' << 8 | > 'E' << 0) > +#define MINIDUMP_SS_ENABLED ('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0) > + > +/** > + * struct minidump_region - Minidump region > + * @name : Name of the region to be dumped > + * @seq_num: : Use to differentiate regions with same name. > + * @valid: This entry to be dumped (if set to 1) > + * @address : Physical address of region to be dumped > + * @size : Size of the region > +
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Mon, 29 Apr 2024 13:25:04 -0700 Andrii Nakryiko wrote: > On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu wrote: > > > > Hi Andrii, > > > > On Thu, 25 Apr 2024 13:31:53 -0700 > > Andrii Nakryiko wrote: > > > > > Hey Masami, > > > > > > I can't really review most of that code as I'm completely unfamiliar > > > with all those inner workings of fprobe/ftrace/function_graph. I left > > > a few comments where there were somewhat more obvious BPF-related > > > pieces. > > > > > > But I also did run our BPF benchmarks on probes/for-next as a baseline > > > and then with your series applied on top. Just to see if there are any > > > regressions. I think it will be a useful data point for you. > > > > Thanks for testing! > > > > > > > > You should be already familiar with the bench tool we have in BPF > > > selftests (I used it on some other patches for your tree). > > > > What patches we need? > > > > You mean for this `bench` tool? They are part of BPF selftests (under > tools/testing/selftests/bpf), you can build them by running: > > $ make RELEASE=1 -j$(nproc) bench > > After that you'll get a self-container `bench` binary, which has all > the self-contained benchmarks. > > You might also find a small script (benchs/run_bench_trigger.sh inside > BPF selftests directory) helpful, it collects final summary of the > benchmark run and optionally accepts a specific set of benchmarks. So > you can use it like this: > > $ benchs/run_bench_trigger.sh kprobe kprobe-multi > kprobe : 18.731 ± 0.639M/s > kprobe-multi : 23.938 ± 0.612M/s > > By default it will run a wider set of benchmarks (no uprobes, but a > bunch of extra fentry/fexit tests and stuff like this). 
origin:
# benchs/run_bench_trigger.sh
kretprobe : 1.329 ± 0.007M/s
kretprobe-multi : 1.341 ± 0.004M/s
# benchs/run_bench_trigger.sh
kretprobe : 1.288 ± 0.014M/s
kretprobe-multi : 1.365 ± 0.002M/s
# benchs/run_bench_trigger.sh
kretprobe : 1.329 ± 0.002M/s
kretprobe-multi : 1.331 ± 0.011M/s
# benchs/run_bench_trigger.sh
kretprobe : 1.311 ± 0.003M/s
kretprobe-multi : 1.318 ± 0.002M/s

patched:
# benchs/run_bench_trigger.sh
kretprobe : 1.274 ± 0.003M/s
kretprobe-multi : 1.397 ± 0.002M/s
# benchs/run_bench_trigger.sh
kretprobe : 1.307 ± 0.002M/s
kretprobe-multi : 1.406 ± 0.004M/s
# benchs/run_bench_trigger.sh
kretprobe : 1.279 ± 0.004M/s
kretprobe-multi : 1.330 ± 0.014M/s
# benchs/run_bench_trigger.sh
kretprobe : 1.256 ± 0.010M/s
kretprobe-multi : 1.412 ± 0.003M/s

Hmm, in my case the differences seem smaller (~3%?). I attached perf report results for those, but I don't see a large difference. > > > > > > BASELINE > > > > > > kprobe : 24.634 ± 0.205M/s > > > kprobe-multi : 28.898 ± 0.531M/s > > > kretprobe : 10.478 ± 0.015M/s > > > kretprobe-multi: 11.012 ± 0.063M/s > > > > > > THIS PATCH SET ON TOP > > > = > > > kprobe : 25.144 ± 0.027M/s (+2%) > > > kprobe-multi : 28.909 ± 0.074M/s > > > kretprobe : 9.482 ± 0.008M/s (-9.5%) > > > kretprobe-multi: 13.688 ± 0.027M/s (+24%) > > > > This looks good. Kretprobe should also use kretprobe-multi (fprobe) > > eventually because it should be a single callback version of > > kretprobe-multi. I ran another benchmark (prctl loop, attached); the origin kernel result is here: # sh ./benchmark.sh count = 1000, took 6.748133 sec And the patched kernel result: # sh ./benchmark.sh count = 1000, took 6.644095 sec I confirmed that the perf result has no big difference. Thank you, > > > > > > These numbers are pretty stable and look to be more or less > > > representative. > > > > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be > > > about the same, though.
> > > > > > Then (I suppose they are "legacy") kretprobes got quite noticeably > > > slower, almost by 10%. Not sure why, but looks real after re-running > > > benchmarks a bunch of times and getting stable results. > > > > Hmm, kretprobe on x86 should use ftrace + rethook even with my series. > > So nothing should be changed. Maybe cache access pattern has been > > changed? > > I'll check it with tracefs (to remove the effect from bpf related changes) > > > > > > > > On the other hand, multi-kretprobes got significantly faster (+24%!). > > > Again, I don't know if it is expected or not, but it's a nice > > > improvement. > > > > Thanks! > > > > > > > > If you have any idea why kretprobes would get so much slower, it would > > > be nice to look into that and see if you can mitigate the regression > > > somehow. Thanks! > > > > OK, let me check it. > > > > Thank you! > > > > > > > > > > > > 51 files changed, 2325 insertions(+), 882 deletions(-) > > > > create mode 100644 > > > > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc > > > > > > > > -- > > > > Masami Hiramatsu (Google) > > > > > > > > > > -- > > Masami Hiramatsu (Google) -- Masami
[PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn
[syzbot reported] BUG: KASAN: slab-use-after-free in instrument_atomic_read include/linux/instrumented.h:68 [inline] BUG: KASAN: slab-use-after-free in atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline] BUG: KASAN: slab-use-after-free in __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921 Read of size 8 at addr 888023632880 by task vhost-5104/5105 CPU: 0 PID: 5105 Comm: vhost-5104 Not tainted 6.9.0-rc5-next-20240426-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 instrument_atomic_read include/linux/instrumented.h:68 [inline] atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline] __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921 vhost_task_fn+0x3bc/0x3f0 kernel/vhost_task.c:65 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Allocated by task 5104: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:370 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387 kasan_kmalloc include/linux/kasan.h:211 [inline] kmalloc_trace_noprof+0x19c/0x2b0 mm/slub.c:4146 kmalloc_noprof include/linux/slab.h:660 [inline] kzalloc_noprof include/linux/slab.h:778 [inline] vhost_task_create+0x149/0x300 kernel/vhost_task.c:134 vhost_worker_create+0x17b/0x3f0 drivers/vhost/vhost.c:667 vhost_dev_set_owner+0x563/0x940 drivers/vhost/vhost.c:945 vhost_dev_ioctl+0xda/0xda0 drivers/vhost/vhost.c:2108 vhost_vsock_dev_ioctl+0x2bb/0xfa0 drivers/vhost/vsock.c:875 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] 
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 5104: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2190 [inline] slab_free mm/slub.c:4430 [inline] kfree+0x149/0x350 mm/slub.c:4551 vhost_worker_destroy drivers/vhost/vhost.c:629 [inline] vhost_workers_free drivers/vhost/vhost.c:648 [inline] vhost_dev_cleanup+0x9b0/0xba0 drivers/vhost/vhost.c:1051 vhost_vsock_dev_release+0x3aa/0x410 drivers/vhost/vsock.c:751 __fput+0x406/0x8b0 fs/file_table.c:422 __do_sys_close fs/open.c:1555 [inline] __se_sys_close fs/open.c:1540 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1540 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [Fix] Delete the member exit_mutex from the struct vhost_task and replace it with a declared global static mutex. 
Fixes: a3df30984f4f ("vhost_task: Handle SIGKILL by flushing work and exiting") Reported-by: syzbot+98edc2df894917b34...@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis --- kernel/vhost_task.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index 48c289947b99..375356499867 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -20,10 +20,10 @@ struct vhost_task { struct completion exited; unsigned long flags; struct task_struct *task; - /* serialize SIGKILL and vhost_task_stop calls */ - struct mutex exit_mutex; }; +static DEFINE_MUTEX(exit_mutex); //serialize SIGKILL and vhost_task_stop calls + static int vhost_task_fn(void *data) { struct vhost_task *vtsk = data; @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data) schedule(); } - mutex_lock(&vtsk->exit_mutex); + mutex_lock(&exit_mutex); /* * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL. * When the vhost layer has called vhost_task_stop it's already stopped @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data) vtsk->handle_sigkill(vtsk->data); } complete(&vtsk->exited); - mutex_unlock(&vtsk->exit_mutex); + mutex_unlock(&exit_mutex); do_exit(0); } @@ -88,12 +88,12 @@ EXPORT_SYMBOL_GPL(vhost_task_wake); */ void vhost_task_stop(struct vhost_task *vtsk) { - mutex_lock(&vtsk->exit_mutex); + mutex_lock(&exit_mutex); if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)) { set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
Re: [PATCH v7 05/16] module: make module_memory_{alloc,free} more self-contained
On 29/4/24 14:16, Mike Rapoport wrote: From: "Mike Rapoport (IBM)" Move the logic related to the memory allocation and freeing into module_memory_alloc() and module_memory_free(). Signed-off-by: Mike Rapoport (IBM) --- kernel/module/main.c | 64 +++- 1 file changed, 39 insertions(+), 25 deletions(-) Nice code simplification. Reviewed-by: Philippe Mathieu-Daudé
[PATCH v22 4/5] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some documentation and a code example. Signed-off-by: Vincent Donnefort diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 5092d6c13af5..0b300901fd75 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -29,6 +29,7 @@ Linux Tracing Technologies timerlat-tracer intel_th ring-buffer-design + ring-buffer-map stm sys-t coresight/index diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst new file mode 100644 index ..8e296bcc0d7f --- /dev/null +++ b/Documentation/trace/ring-buffer-map.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +Tracefs ring-buffer memory mapping +== + +:Author: Vincent Donnefort + +Overview + +Tracefs ring-buffer memory map provides an efficient method to stream data +as no memory copy is necessary. The application mapping the ring-buffer becomes +then a consumer for that ring-buffer, in a similar fashion to trace_pipe. + +Memory mapping setup + +The mapping works with a mmap() of the trace_pipe_raw interface. + +The first system page of the mapping contains ring-buffer statistics and +description. It is referred to as the meta-page. One of the most important +fields of the meta-page is the reader. It contains the sub-buffer ID which can +be safely read by the mapper (see ring-buffer-design.rst). + +The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is +therefore effortless to know where the reader starts in the mapping: + +.. code-block:: c + +reader_id = meta->reader->id; +reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; + +When the application is done with the current reader, it can get a new one using +the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates +the meta-page fields. 
+ +Limitations +=== +When a mapping is in place on a Tracefs ring-buffer, it is not possible to +either resize it (either by increasing the entire size of the ring-buffer or +each subbuf). It is also not possible to use snapshot and causes splice to copy +the ring buffer data instead of using the copyless swap from the ring buffer. + +Concurrent readers (either another application mapping that ring-buffer or the +kernel with trace_pipe) are allowed but not recommended. They will compete for +the ring-buffer and the output is unpredictable, just like concurrent readers on +trace_pipe would be. + +Example +=== + +.. code-block:: c + +#include +#include +#include +#include + +#include + +#include +#include + +#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw" + +int main(void) +{ +int page_size = getpagesize(), fd, reader_id; +unsigned long meta_len, data_len; +struct trace_buffer_meta *meta; +void *map, *reader, *data; + +fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK); +if (fd < 0) +exit(EXIT_FAILURE); + +map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); +if (map == MAP_FAILED) +exit(EXIT_FAILURE); + +meta = (struct trace_buffer_meta *)map; +meta_len = meta->meta_page_size; + +printf("entries:%llu\n", meta->entries); +printf("overrun:%llu\n", meta->overrun); +printf("read: %llu\n", meta->read); +printf("nr_subbufs: %u\n", meta->nr_subbufs); + +data_len = meta->subbuf_size * meta->nr_subbufs; +data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len); +if (data == MAP_FAILED) +exit(EXIT_FAILURE); + +if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0) +exit(EXIT_FAILURE); + +reader_id = meta->reader.id; +reader = data + meta->subbuf_size * reader_id; + +printf("Current reader address: %p\n", reader); + +munmap(data, data_len); +munmap(meta, meta_len); +close (fd); + +return 0; +} -- 2.44.0.769.g3c40516874-goog
[PATCH v22 3/5] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is impossible to do real-time analysis without a copy. A solution for that problem is to let user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Mapping will prevent snapshot and buffer size modifications.
CC: Signed-off-by: Vincent Donnefort diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index b682e9925539..bd1066754220 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -43,4 +43,6 @@ struct trace_buffer_meta { __u64 Reserved2; }; +#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1) + #endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 233d1af39fff..a35e7f598233 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr, return; } + if (tr->mapped) { + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); + return; + } + local_irq_save(flags); update_max_tr(tr, current, smp_processor_id(), cond_data); local_irq_restore(flags); @@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr) lockdep_assert_held(_types_lock); spin_lock(>snapshot_trigger_lock); - if (tr->snapshot == UINT_MAX) { + if (tr->snapshot == UINT_MAX || tr->mapped) { spin_unlock(>snapshot_trigger_lock); return -EBUSY; } @@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr) { if (tr->current_trace == _trace) return; - + tr->current_trace->enabled--; if (tr->current_trace->reset) @@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = >iter; + int err; + + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent, + NULL, NULL); + if (err) + return err; + } - if (cmd) - return 
-ENOIOCTLCMD; + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + /* +* An ioctl call with cmd 0 to the ring buffer file will wake up all +* waiters +*/ mutex_lock(_types_lock); /* Make sure the waiters see the new wait_index */ @@ -8214,6 +8237,76 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +#ifdef CONFIG_TRACER_MAX_TRACE +static int get_snapshot_map(struct trace_array *tr) +{ + int err = 0; + + /* +* Called with mmap_lock held. lockdep would be unhappy if we would now +* take trace_types_lock. Instead use the specific +* snapshot_trigger_lock. +*/ + spin_lock(>snapshot_trigger_lock); + + if (tr->snapshot || tr->mapped == UINT_MAX) + err = -EBUSY; + else + tr->mapped++; + + spin_unlock(>snapshot_trigger_lock); + + /* Wait for update_max_tr() to observe iter->tr->mapped */ + if (tr->mapped == 1) + synchronize_rcu(); + + return err; + +} +static void put_snapshot_map(struct trace_array *tr) +{ + spin_lock(>snapshot_trigger_lock); + if (!WARN_ON(!tr->mapped)) +
[PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..b682e9925539 --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. 
+ * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Internal use only. + * @Reserved2: Internal use only. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..fc66d01ff472 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ 
rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(&cpu_buffer->irq_work.waiters); init_waitqueue_head(&cpu_buffer->irq_work.full_waiters); +
[PATCH v22 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP
In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Signed-off-by: Vincent Donnefort diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..cc9ebe593571 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(>list, pages); page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), - mflags | __GFP_ZERO, + mflags | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto free_pages; @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) cpu_buffer->reader_page = bpage; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto fail_free_reader; @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu) goto out; page = alloc_pages_node(cpu_to_node(cpu), - GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, + GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) { kfree(bpage); -- 2.44.0.769.g3c40516874-goog
[PATCH v22 0/5] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent to network without any copy via splice. However, the latter doesn't allow real-time processing of the traces. A solution is to give userspace direct access to the ring-buffer pages via a mapping. An application can now become a consumer of the ring-buffer, in a similar fashion to what trace_pipe offers. Support for this new feature can already be found in libtracefs from version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE. Vincent v21 -> v22: * Remove DONTDUMP (VM_IO implies DONTDUMP already) * Remove MIXEDMAP (implicit when using vm_insert_page) * Remove PFNMAP (We do not perform raw PFN mappings and MIXEDMAP is already implicitly set) * Add comments to justify the VM_* flags v20 -> v21: * Collect Ack * Add .gitignore * Few nits * Remove meta-page padding (zero-page not supported by vm_insert_pages) * Remove single-usage macros * Move vma flags handling into ring-buffer.c v19 -> v20: * Fix typos in documentation. * Remove useless mmap open and fault callbacks. * add mm.h include for vm_insert_pages v18 -> v19: * Use VM_PFNMAP and vm_insert_pages * Allocate ring-buffer subbufs with __GFP_COMP * Pad the meta-page with the zero-page to align on the subbuf_order * Extend the ring-buffer test with mmap() dedicated suite v17 -> v18: * Fix lockdep_assert_held * Fix spin_lock_init typo * Fix CONFIG_TRACER_MAX_TRACE typo v16 -> v17: * Documentation and comments improvements. * Create get/put_snapshot_map() for clearer code. * Replace kzalloc with kcalloc. * Fix -ENOMEM handling in rb_alloc_meta_page(). * Move flush(cpu_buffer->reader_page) behind the reader lock. * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer mutex. (removes READ_ONCE/WRITE_ONCE accesses). v15 -> v16: * Add comment for the dcache flush. * Remove now unnecessary WRITE_ONCE for the meta-page. v14 -> v15: * Add meta-page and reader-page flush. Intends to fix the mapping for VIVT and aliasing-VIPT data caches.
* -EPERM on VM_EXEC. * Fix build warning !CONFIG_TRACER_MAX_TRACE. v13 -> v14: * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu) * on unmap, sync meta-page teardown with the reader_lock instead of the synchronize_rcu. * Add a dedicated spinlock for trace_array ->snapshot and ->mapped. (intends to fix a lockdep issue) * Add kerneldoc for flags and Reserved fields. * Add kselftest for snapshot/map mutual exclusion. v12 -> v13: * Swap subbufs_{touched,lost} for Reserved fields. * Add a flag field in the meta-page. * Fix CONFIG_TRACER_MAX_TRACE. * Rebase on top of trace/urgent. * Add a comment for try_unregister_trigger() v11 -> v12: * Fix code sample mmap bug. * Add logging in sample code. * Reset tracer in selftest. * Add a refcount for the snapshot users. * Prevent mapping when there are snapshot users and vice versa. * Refine the meta-page. * Fix types in the meta-page. * Collect Reviewed-by. v10 -> v11: * Add Documentation and code sample. * Add a selftest. * Move all the update to the meta-page into a single rb_update_meta_page(). * rb_update_meta_page() is now called from ring_buffer_map_get_reader() to fix NOBLOCK callers. * kerneldoc for struct trace_meta_page. * Add a patch to zero all the ring-buffer allocations. v9 -> v10: * Refactor rb_update_meta_page() * In-loop declaration for foreach_subbuf_page() * Check for cpu_buffer->mapped overflow v8 -> v9: * Fix the unlock path in ring_buffer_map() * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a) v7 -> v8: * Drop the subbufs renaming into bpages * Use subbuf as a name when relevant v6 -> v7: * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/ * Support for subbufs * Rename subbufs into bpages v5 -> v6: * Rebase on next-20230802. * (unsigned long) -> (void *) cast for virt_to_page(). * Add a wait for the GET_READER_PAGE ioctl. 
* Move writer fields update (overrun/pages_lost/entries/pages_touched) in the irq_work. * Rearrange id in struct buffer_page. * Rearrange the meta-page. * ring_buffer_meta_page -> trace_buffer_meta_page. * Add meta_struct_len into the meta-page. v4 -> v5: * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3) v3 -> v4: * Add to the meta-page: - pages_lost / pages_read (allow to compute how full is the ring-buffer) - read (allow to compute how many entries can be read) - A reader_page struct. * Rename ring_buffer_meta_header -> ring_buffer_meta * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page * Properly consume events on ring_buffer_map_get_reader_page() with rb_advance_reader(). v2 -> v3: * Remove data page list (for non-consuming read) ** Implies removing order > 0 meta-page * Add a new meta page
[PATCH v3 2/2] remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs
The PSC controller has a limitation that it can only power up the second core when the first core is in the ON state; the power state of core0 must be equal to or higher than that of core1. Therefore, prevent core1 from powering up before core0 during the start process from sysfs. Similarly, prevent core0 from shutting down before core1 has been shut down from sysfs.

Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F subsystem")
Signed-off-by: Beleswar Padhi
---
 drivers/remoteproc/ti_k3_r5_remoteproc.c | 23 +++++++++++++++++++---
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 6d6afd6beb3a..1799b4f6d11e 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -548,7 +548,7 @@ static int k3_r5_rproc_start(struct rproc *rproc)
 	struct k3_r5_rproc *kproc = rproc->priv;
 	struct k3_r5_cluster *cluster = kproc->cluster;
 	struct device *dev = kproc->dev;
-	struct k3_r5_core *core;
+	struct k3_r5_core *core0, *core;
 	u32 boot_addr;
 	int ret;
@@ -574,6 +574,15 @@ static int k3_r5_rproc_start(struct rproc *rproc)
 			goto unroll_core_run;
 		}
 	} else {
+		/* do not allow core 1 to start before core 0 */
+		core0 = list_first_entry(&cluster->cores, struct k3_r5_core,
+					 elem);
+		if (core != core0 && core0->rproc->state == RPROC_OFFLINE) {
+			dev_err(dev, "%s: can not start core 1 before core 0\n",
+				__func__);
+			return -EPERM;
+		}
+
 		ret = k3_r5_core_run(core);
 		if (ret)
 			goto put_mbox;
@@ -619,7 +628,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
 {
 	struct k3_r5_rproc *kproc = rproc->priv;
 	struct k3_r5_cluster *cluster = kproc->cluster;
-	struct k3_r5_core *core = kproc->core;
+	struct device *dev = kproc->dev;
+	struct k3_r5_core *core1, *core = kproc->core;
 	int ret;

 	/* halt all applicable cores */
@@ -632,6 +642,15 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
 			}
 		}
 	} else {
+		/* do not allow core 0 to stop before core 1 */
+		core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
+					elem);
+		if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
+			dev_err(dev, "%s: can not stop core 0 before core 1\n",
+				__func__);
+			return -EPERM;
+		}
+
 		ret = k3_r5_core_halt(core);
 		if (ret)
 			goto out;
--
2.34.1
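The ordering policy the patch enforces can be modeled in isolation to make the two checks easy to reason about. This is a hypothetical simplification, not the driver code: `model_start`/`model_stop` stand in for the sysfs start/stop paths, and `-1` stands in for `-EPERM`.

```c
#include <assert.h>

/*
 * Stand-alone model of the sysfs ordering rules: core1 must not start
 * while core0 is offline, and core0 must not stop while core1 is still
 * running. Illustrative names; not the k3-r5 driver itself.
 */
enum { CORE_OFFLINE = 0, CORE_RUNNING = 1 };

struct cluster_model {
	int core0;
	int core1;
};

/* Mirrors the k3_r5_rproc_start() check (returns -1 where the driver
 * returns -EPERM). */
int model_start(struct cluster_model *c, int core_idx)
{
	if (core_idx == 1 && c->core0 == CORE_OFFLINE)
		return -1;
	if (core_idx == 0)
		c->core0 = CORE_RUNNING;
	else
		c->core1 = CORE_RUNNING;
	return 0;
}

/* Mirrors the k3_r5_rproc_stop() check. */
int model_stop(struct cluster_model *c, int core_idx)
{
	if (core_idx == 0 && c->core1 == CORE_RUNNING)
		return -1;
	if (core_idx == 0)
		c->core0 = CORE_OFFLINE;
	else
		c->core1 = CORE_OFFLINE;
	return 0;
}
```

The net effect is that only the power-up order core0, core1 and the shutdown order core1, core0 are permitted.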
[PATCH v3 1/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1
From: Apurva Nandan PSC controller has a limitation that it can only power-up the second core when the first core is in ON state. Power-state for core0 should be equal to or higher than core1, else the kernel is seen hanging during rproc loading. Make the powering up of cores sequential, by waiting for the current core to power-up before proceeding to the next core, with a timeout of 2sec. Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait for the current core to be released from reset before proceeding with the next core. Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F subsystem") Signed-off-by: Apurva Nandan [added comments and fixed code style] Signed-off-by: Beleswar Padhi --- drivers/remoteproc/ti_k3_r5_remoteproc.c | 33 1 file changed, 33 insertions(+) diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c index ad3415a3851b..6d6afd6beb3a 100644 --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c @@ -103,12 +103,14 @@ struct k3_r5_soc_data { * @dev: cached device pointer * @mode: Mode to configure the Cluster - Split or LockStep * @cores: list of R5 cores within the cluster + * @core_transition: wait queue to sync core state changes * @soc_data: SoC-specific feature data for a R5FSS */ struct k3_r5_cluster { struct device *dev; enum cluster_mode mode; struct list_head cores; + wait_queue_head_t core_transition; const struct k3_r5_soc_data *soc_data; }; @@ -128,6 +130,7 @@ struct k3_r5_cluster { * @atcm_enable: flag to control ATCM enablement * @btcm_enable: flag to control BTCM enablement * @loczrama: flag to dictate which TCM is at device address 0x0 + * @released_from_reset: flag to signal when core is out of reset */ struct k3_r5_core { struct list_head elem; @@ -144,6 +147,7 @@ struct k3_r5_core { u32 atcm_enable; u32 btcm_enable; u32 loczrama; + bool released_from_reset; }; /** @@ -460,6 +464,8 @@ static int 
k3_r5_rproc_prepare(struct rproc *rproc)
 			ret);
 		return ret;
 	}
+	core->released_from_reset = true;
+	wake_up_interruptible(&cluster->core_transition);

 	/*
 	 * Newer IP revisions like on J7200 SoCs support h/w auto-initialization
@@ -1140,6 +1146,12 @@ static int k3_r5_rproc_configure_mode(struct k3_r5_rproc *kproc)
 		return ret;
 	}

+	/*
+	 * Skip the waiting mechanism for sequential power-on of cores if the
+	 * core has already been booted by another entity.
+	 */
+	core->released_from_reset = c_state;
+
 	ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl, &stat);
 	if (ret < 0) {
@@ -1280,6 +1292,26 @@ static int k3_r5_cluster_rproc_init(struct platform_device *pdev)
 		    cluster->mode == CLUSTER_MODE_SINGLECPU ||
 		    cluster->mode == CLUSTER_MODE_SINGLECORE)
 			break;
+
+		/*
+		 * R5 cores require to be powered on sequentially, core0
+		 * should be in higher power state than core1 in a cluster.
+		 * So, wait for current core to power up before proceeding
+		 * to next core and put timeout of 2sec for each core.
+		 *
+		 * This waiting mechanism is necessary because
+		 * rproc_auto_boot_callback() for core1 can be called before
+		 * core0 due to thread execution order.
+		 */
+		ret = wait_event_interruptible_timeout(cluster->core_transition,
+						       core->released_from_reset,
+						       msecs_to_jiffies(2000));
+		if (ret <= 0) {
+			dev_err(dev,
+				"Timed out waiting for %s core to power up!\n",
+				rproc->name);
+			return ret;
+		}
 	}

 	return 0;
@@ -1709,6 +1741,7 @@ static int k3_r5_probe(struct platform_device *pdev)
 	cluster->dev = dev;
 	cluster->soc_data = data;
 	INIT_LIST_HEAD(&cluster->cores);
+	init_waitqueue_head(&cluster->core_transition);

 	ret = of_property_read_u32(np, "ti,cluster-mode", &cluster->mode);
 	if (ret < 0 && ret != -EINVAL) {
--
2.34.1
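One detail worth noting about the timeout branch: `wait_event_interruptible_timeout()` returns the remaining jiffies (> 0) when the condition became true, 0 on timeout, and -ERESTARTSYS when interrupted by a signal. Returning `ret` directly therefore reports 0, i.e. success, in the timeout case. A sketch of how the result might instead be mapped to an explicit error; the -ETIMEDOUT mapping is a suggestion, not part of the posted patch.

```c
#include <assert.h>

/*
 * Model of handling the wait_event_interruptible_timeout() return value:
 * > 0  -> condition true (remaining jiffies), treat as success
 * == 0 -> timed out, report an explicit error instead of 0
 * < 0  -> interrupted (-ERESTARTSYS), propagate as-is
 */
#define MODEL_ETIMEDOUT 110	/* errno value of ETIMEDOUT on Linux */

int check_core_wait_result(long ret)
{
	if (ret > 0)
		return 0;			/* core0 left reset in time */
	if (ret == 0)
		return -MODEL_ETIMEDOUT;	/* timed out */
	return (int)ret;			/* signal: propagate */
}
```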
[PATCH v3 0/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1
PSC controller has a limitation that it can only power-up the second core when the first core is in ON state. Power-state for core0 should be equal to or higher than core1, else the kernel is seen hanging during rproc loading. Make the powering up of cores sequential, by waiting for the current core to power-up before proceeding to the next core, with a timeout of 2sec. Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait for the current core to be released from reset before proceeding with the next core. Also, ensure that core1 can not be powered on before core0 when starting cores from sysfs. Similarly, ensure that core0 can not be shutdown before core1 from sysfs. v3: Changelog: 1) Added my own Signed-off-by in PATCH 1, specifying the changes done 2) Addressed changes requested by adding comments in code in PATCH 1 Link to v2: https://lore.kernel.org/all/20240424130504.494916-1-b-pa...@ti.com/ v2: Changelog: 1) Fixed multi-line comment format 2) Included root cause of bug in comments 3) Added a patch to ensure power-up/shutdown is sequential via sysfs Link to v1: https://lore.kernel.org/all/20230906124756.3480579-1-a-nan...@ti.com/ Apurva Nandan (1): remoteproc: k3-r5: Wait for core0 power-up before powering up core1 Beleswar Padhi (1): remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs drivers/remoteproc/ti_k3_r5_remoteproc.c | 56 +++- 1 file changed, 54 insertions(+), 2 deletions(-) -- 2.34.1
Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
It took me a while, but I was able to figure out why EEVDF behaves differently than CFS does. I'm still waiting for some official confirmation of my assumptions but it all seems very plausible to me.

Leaving aside all the specifics of vhost and kworkers, a more general description of the scenario would be as follows:

Assume that we have two tasks taking turns on a single CPU. Task 1 does something and wakes up Task 2. Task 2 does something and goes to sleep. And we're just repeating that. Task 1 and task 2 only run for very short amounts of time, i.e. much shorter than a regular time slice (vhost = task 1, kworker = task 2).

Let's further assume that task 1 runs longer than task 2. In CFS, this means that the vruntime of task 1 starts to outrun the vruntime of task 2, i.e. vruntime(task2) < vruntime(task1). Hence, task 2 always gets picked on wake-up because it has the smaller vruntime. In EEVDF, this would translate to a permanent positive lag, which also causes task 2 to get consistently scheduled on wake-up.

Let's now assume that, occasionally, task 2 runs a little bit longer than task 1. In CFS, this means that task 2 can close the vruntime gap by a bit, but it can easily remain below the value of task 1. Task 2 would still get picked on wake-up. With EEVDF, in its current form, task 2 will now get a negative lag, which in turn will cause it not to be picked on the next wake-up.

So, it seems we have a change in how far the two variants look into the past: CFS is willing to take more history into account, whereas EEVDF is not (with update_entity_lag setting the lag value from scratch, and place_entity not taking the original vruntime into account). All of this can be seen as correct by design: a task that consumes more time than the others has to give way to them.
The big difference now is that CFS allowed a task to collect a bonus by constantly using less CPU time than others and to trade that time against occasionally taking more CPU time. EEVDF could do the same thing by allowing the accumulation of positive lag, which could then be traded against the one time the task would get negative lag. This might clash with other EEVDF assumptions though.

The patch below fixes the degradation, but it is not at all aligned with what EEVDF wants to achieve; still, it helps as an indicator that my hypothesis is correct.

So, what does this now mean for the vhost regression we were discussing?

1. The behavior of the scheduler changed with regard to wake-up scenarios.
2. vhost in its current form relies on the way CFS works by assuming that the kworker always gets scheduled.

I would like to argue that it therefore makes sense to reconsider the vhost implementation to make it less dependent on the internals of the scheduler. As proposed earlier in this thread, I see two options:

1. Do an explicit schedule() after every iteration across the vhost queues.
2. Set the need_resched flag after writing to the socket that would trigger eventfd and the underlying kworker.

Both options would make sure that vhost gives up the CPU as it cannot continue anyway without the kworker handling the event. Option 1 will give up the CPU regardless of whether something was found in the queues, whereas option 2 would only give up the CPU if there is.

It shall be noted that we encountered similar behavior when running some fio benchmarks. From a brief glance at the code, I was seeing similar intentions: loop over queues, then trigger an action through some event mechanism. Applying the same patch as mentioned above also fixes this issue.

It could be argued that this is still something that needs to be somehow addressed by the scheduler, since it might affect others as well and there are in fact patches coming in. Will they address our issue here?
Not sure yet. On the other hand, it might just be beneficial to make vhost more resilient towards the scheduler's algorithm by not relying on a certain behavior in the wake-up path.

Further discussion on additional commits to make EEVDF work correctly can be found here:
https://lore.kernel.org/lkml/20240408090639.gd21...@noisy.programming.kicks-ass.net/T/
So far these patches do not fix the degradation.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	s64 lag, limit;

 	SCHED_WARN_ON(!se->on_rq);
-	lag = avg_vruntime(cfs_rq) - se->vruntime;
+	lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;

 	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
 	se->vlag = clamp(lag, -limit, limit);
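The difference between the two lag policies can be sketched with a toy model. This is illustrative arithmetic only, not the scheduler code; the field names and numbers are made up for the example.

```c
#include <assert.h>

/*
 * Toy model of the two lag policies discussed above.
 */

/* Current EEVDF: update_entity_lag() computes the lag from scratch,
 * discarding any bonus the task built up in earlier rounds. */
long lag_reset(long avg_vruntime, long vruntime)
{
	return avg_vruntime - vruntime;
}

/* The experimental fair.c patch above: the previous vlag is carried over,
 * so a history of "used less CPU than others" can absorb one overrun. */
long lag_accumulate(long prev_vlag, long avg_vruntime, long vruntime)
{
	return prev_vlag + (avg_vruntime - vruntime);
}
```

With an average vruntime of 10, a task whose vruntime reached 13 after one overrun ends up with negative lag under the reset policy (and is skipped on wake-up), while an accumulated bonus of, say, 4 from earlier rounds keeps the lag positive under the accumulation policy.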
Re: [PATCH v2 2/2] LoongArch: Add steal time support in guest side
On 2024/4/30 下午4:23, Huacai Chen wrote: Hi, Bibo, On Tue, Apr 30, 2024 at 9:45 AM Bibo Mao wrote: Percpu struct kvm_steal_time is added here, its size is 64 bytes and also defined as 64 bytes, so that the whole structure is in one physical page. When vcpu is onlined, function pv_enable_steal_time() is called. This function will pass guest physical address of struct kvm_steal_time and tells hypervisor to enable steal time. When vcpu is offline, physical address is set as 0 and tells hypervisor to disable steal time. Here is output of vmstat on guest when there is workload on both host and guest. It includes steal time stat information. procs ---memory-- -io -system-- --cpu- r b swpd free inact active bibo in cs us sy id wa st 15 1 0 7583616 184112 72208200 162 52 31 6 43 0 20 17 0 0 7583616 184704 721920 0 6318 6885 5 60 8 5 22 16 0 0 7583616 185392 721440 0 1766 1081 0 49 0 1 50 16 0 0 7583616 184816 723040 0 6300 6166 4 62 12 2 20 18 0 0 7583632 184480 722400 0 2814 1754 2 58 4 1 35 Signed-off-by: Bibo Mao --- arch/loongarch/Kconfig| 11 +++ arch/loongarch/include/asm/paravirt.h | 5 + arch/loongarch/kernel/paravirt.c | 131 ++ arch/loongarch/kernel/time.c | 2 + 4 files changed, 149 insertions(+) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 0a1540a8853e..f3a03c33a052 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -592,6 +592,17 @@ config PARAVIRT over full virtualization. However, when run without a hypervisor the kernel is theoretically slower and slightly larger. +config PARAVIRT_TIME_ACCOUNTING + bool "Paravirtual steal time accounting" + select PARAVIRT + help + Select this option to enable fine granularity task steal time + accounting. Time spent executing other tasks in parallel with + the current vCPU is discounted from the vCPU power. To account for + that, there can be a small performance impact. + + If in doubt, say N here. 
+ Can we use a hidden selection manner, which means: config PARAVIRT_TIME_ACCOUNTING def_bool PARAVIRT Because I think we needn't give too many choices to users (and bring much more effort to test). PowerPC even hide all the PARAVIRT config options... see arch/powerpc/platforms/pseries/Kconfig I do not know neither :( It is just used at beginning, maybe there is negative effect or potential issue, I just think that it can be selected by user for easily debugging. After it is matured, hidden selection manner can be used. Regards Bibo Mao Huacai config ARCH_SUPPORTS_KEXEC def_bool y diff --git a/arch/loongarch/include/asm/paravirt.h b/arch/loongarch/include/asm/paravirt.h index 58f7b7b89f2c..fe27fb5e82b8 100644 --- a/arch/loongarch/include/asm/paravirt.h +++ b/arch/loongarch/include/asm/paravirt.h @@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu) } int pv_ipi_init(void); +int __init pv_time_init(void); #else static inline int pv_ipi_init(void) { return 0; } +static inline int pv_time_init(void) +{ + return 0; +} #endif // CONFIG_PARAVIRT #endif diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index 9044ed62045c..3f83afc7e2d2 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -5,10 +5,13 @@ #include #include #include +#include #include struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +static int has_steal_clock; static u64 native_steal_clock(int cpu) { @@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu) DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +static bool steal_acc = true; +static int __init parse_no_stealacc(char *arg) +{ + steal_acc = false; + return 0; +} +early_param("no-steal-acc", parse_no_stealacc); + +static u64 para_steal_clock(int cpu) +{ + u64 steal; + struct kvm_steal_time *src; + int version; + + src = _cpu(steal_time, cpu); + do { + + 
version = src->version;
+		/* Make sure that the version is read before the steal */
+		virt_rmb();
+		steal = src->steal;
+		/* Make sure that the steal is read before the next version */
+		virt_rmb();
+
+	} while ((version & 1) || (version != src->version));
+	return steal;
+}
+
+static int pv_enable_steal_time(void)
+{
+	int cpu = smp_processor_id();
+	struct kvm_steal_time *st;
+	unsigned long addr;
+
+	if (!has_steal_clock)
+		return -EPERM;
+
+	st = &per_cpu(steal_time, cpu);
+	addr = per_cpu_ptr_to_phys(st);
[syzbot] [net?] [virt?] [kvm?] KASAN: slab-use-after-free Read in vhost_task_fn
Hello, syzbot found the following issue on: HEAD commit:bb7a2467e6be Add linux-next specific files for 20240426 git tree: linux-next console+strace: https://syzkaller.appspot.com/x/log.txt?x=123bf96b18 kernel config: https://syzkaller.appspot.com/x/.config?x=5c6a0288262dd108 dashboard link: https://syzkaller.appspot.com/bug?extid=98edc2df894917b3431f compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11c8a4ef18 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=16c3002898 Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/5175af7dda64/disk-bb7a2467.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/70db0462e868/vmlinux-bb7a2467.xz kernel image: https://storage.googleapis.com/syzbot-assets/3217fb825698/bzImage-bb7a2467.xz The issue was bisected to: commit a3df30984f4faf82d63d2a96f8ac773403ce935d Author: Mike Christie Date: Sat Mar 16 00:47:06 2024 + vhost_task: Handle SIGKILL by flushing work and exiting bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=1442391718 final oops: https://syzkaller.appspot.com/x/report.txt?x=1642391718 console output: https://syzkaller.appspot.com/x/log.txt?x=1242391718 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+98edc2df894917b34...@syzkaller.appspotmail.com Fixes: a3df30984f4f ("vhost_task: Handle SIGKILL by flushing work and exiting") == BUG: KASAN: slab-use-after-free in instrument_atomic_read include/linux/instrumented.h:68 [inline] BUG: KASAN: slab-use-after-free in atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline] BUG: KASAN: slab-use-after-free in __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921 Read of size 8 at addr 88802a9d9080 by task vhost-5103/5104 CPU: 1 PID: 5104 Comm: vhost-5103 Not tainted 6.9.0-rc5-next-20240426-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
Google 03/27/2024 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 instrument_atomic_read include/linux/instrumented.h:68 [inline] atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline] __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921 vhost_task_fn+0x3bc/0x3f0 kernel/vhost_task.c:65 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Allocated by task 5103: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:370 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387 kasan_kmalloc include/linux/kasan.h:211 [inline] kmalloc_trace_noprof+0x19c/0x2b0 mm/slub.c:4146 kmalloc_noprof include/linux/slab.h:660 [inline] kzalloc_noprof include/linux/slab.h:778 [inline] vhost_task_create+0x149/0x300 kernel/vhost_task.c:134 vhost_worker_create+0x17b/0x3f0 drivers/vhost/vhost.c:667 vhost_dev_set_owner+0x563/0x940 drivers/vhost/vhost.c:945 vhost_dev_ioctl+0xda/0xda0 drivers/vhost/vhost.c:2108 vhost_vsock_dev_ioctl+0x2bb/0xfa0 drivers/vhost/vsock.c:875 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 5103: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2190 
[inline] slab_free mm/slub.c:4430 [inline] kfree+0x149/0x350 mm/slub.c:4551 vhost_worker_destroy drivers/vhost/vhost.c:629 [inline] vhost_workers_free drivers/vhost/vhost.c:648 [inline] vhost_dev_cleanup+0x9b0/0xba0 drivers/vhost/vhost.c:1051 vhost_vsock_dev_release+0x3aa/0x410 drivers/vhost/vsock.c:751 __fput+0x406/0x8b0 fs/file_table.c:422 __do_sys_close fs/open.c:1555 [inline] __se_sys_close fs/open.c:1540 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1540 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at 88802a9d9000 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 128 bytes inside of freed 512-byte region
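The report above boils down to one rule: once the exiting vhost task signals the completion, the stop path may kfree() the vhost_task at any moment, so nothing inside the object (such as a mutex embedded in it) may be touched afterwards. A minimal single-threaded model of that rule follows; it is purely illustrative and deliberately does not claim which exact unlock/complete ordering is safe, since the thread's syzbot runs disagreed on that.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Lifetime model: "freed" means the vhost_task_stop() side has kfree()d
 * the object; any touch of the object after that is a use-after-free.
 * Names are illustrative, not kernel code.
 */
struct task_model {
	bool freed;	/* stop path has freed the object */
	bool uaf;	/* the object was touched after being freed */
};

static void touch_object(struct task_model *t)
{
	if (t->freed)
		t->uaf = true;
}

/* Order from the reported code, worst case: the waiter frees the object
 * immediately after complete() fires, then the exiting thread unlocks a
 * mutex that lives inside it. */
bool complete_then_unlock(struct task_model *t)
{
	t->freed = true;	/* complete() -> waiter wakes and frees */
	touch_object(t);	/* mutex_unlock(&vtsk->exit_mutex) */
	return t->uaf;
}

/* Safe shape: all accesses to the object finish before the completion
 * that hands ownership to the freeing side. */
bool unlock_then_complete(struct task_model *t)
{
	touch_object(t);	/* last access while the object is alive */
	t->freed = true;	/* complete() is the final operation */
	return t->uaf;
}
```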
Re: [PATCH v2 2/2] LoongArch: Add steal time support in guest side
Hi, Bibo, On Tue, Apr 30, 2024 at 9:45 AM Bibo Mao wrote: > > Percpu struct kvm_steal_time is added here, its size is 64 bytes and > also defined as 64 bytes, so that the whole structure is in one physical > page. > > When vcpu is onlined, function pv_enable_steal_time() is called. This > function will pass guest physical address of struct kvm_steal_time and > tells hypervisor to enable steal time. When vcpu is offline, physical > address is set as 0 and tells hypervisor to disable steal time. > > Here is output of vmstat on guest when there is workload on both host > and guest. It includes steal time stat information. > > procs ---memory-- -io -system-- --cpu- > r b swpd free inact active bibo in cs us sy id wa st > 15 1 0 7583616 184112 72208200 162 52 31 6 43 0 20 > 17 0 0 7583616 184704 721920 0 6318 6885 5 60 8 5 22 > 16 0 0 7583616 185392 721440 0 1766 1081 0 49 0 1 50 > 16 0 0 7583616 184816 723040 0 6300 6166 4 62 12 2 20 > 18 0 0 7583632 184480 722400 0 2814 1754 2 58 4 1 35 > > Signed-off-by: Bibo Mao > --- > arch/loongarch/Kconfig| 11 +++ > arch/loongarch/include/asm/paravirt.h | 5 + > arch/loongarch/kernel/paravirt.c | 131 ++ > arch/loongarch/kernel/time.c | 2 + > 4 files changed, 149 insertions(+) > > diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig > index 0a1540a8853e..f3a03c33a052 100644 > --- a/arch/loongarch/Kconfig > +++ b/arch/loongarch/Kconfig > @@ -592,6 +592,17 @@ config PARAVIRT > over full virtualization. However, when run without a hypervisor > the kernel is theoretically slower and slightly larger. > > +config PARAVIRT_TIME_ACCOUNTING > + bool "Paravirtual steal time accounting" > + select PARAVIRT > + help > + Select this option to enable fine granularity task steal time > + accounting. Time spent executing other tasks in parallel with > + the current vCPU is discounted from the vCPU power. To account for > + that, there can be a small performance impact. > + > + If in doubt, say N here. 
> + Can we use a hidden selection manner, which means: config PARAVIRT_TIME_ACCOUNTING def_bool PARAVIRT Because I think we needn't give too many choices to users (and bring much more effort to test). PowerPC even hide all the PARAVIRT config options... see arch/powerpc/platforms/pseries/Kconfig Huacai > config ARCH_SUPPORTS_KEXEC > def_bool y > > diff --git a/arch/loongarch/include/asm/paravirt.h > b/arch/loongarch/include/asm/paravirt.h > index 58f7b7b89f2c..fe27fb5e82b8 100644 > --- a/arch/loongarch/include/asm/paravirt.h > +++ b/arch/loongarch/include/asm/paravirt.h > @@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu) > } > > int pv_ipi_init(void); > +int __init pv_time_init(void); > #else > static inline int pv_ipi_init(void) > { > return 0; > } > > +static inline int pv_time_init(void) > +{ > + return 0; > +} > #endif // CONFIG_PARAVIRT > #endif > diff --git a/arch/loongarch/kernel/paravirt.c > b/arch/loongarch/kernel/paravirt.c > index 9044ed62045c..3f83afc7e2d2 100644 > --- a/arch/loongarch/kernel/paravirt.c > +++ b/arch/loongarch/kernel/paravirt.c > @@ -5,10 +5,13 @@ > #include > #include > #include > +#include > #include > > struct static_key paravirt_steal_enabled; > struct static_key paravirt_steal_rq_enabled; > +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); > +static int has_steal_clock; > > static u64 native_steal_clock(int cpu) > { > @@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu) > > DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); > > +static bool steal_acc = true; > +static int __init parse_no_stealacc(char *arg) > +{ > + steal_acc = false; > + return 0; > +} > +early_param("no-steal-acc", parse_no_stealacc); > + > +static u64 para_steal_clock(int cpu) > +{ > + u64 steal; > + struct kvm_steal_time *src; > + int version; > + > + src = _cpu(steal_time, cpu); > + do { > + > + version = src->version; > + /* Make sure that the version is read before the steal */ > + virt_rmb(); > + steal = 
src->steal;
> +	/* Make sure that the steal is read before the next version */
> +	virt_rmb();
> +
> +	} while ((version & 1) || (version != src->version));
> +	return steal;
> +}
> +
> +static int pv_enable_steal_time(void)
> +{
> +	int cpu = smp_processor_id();
> +	struct kvm_steal_time *st;
> +	unsigned long addr;
> +
> +	if (!has_steal_clock)
> +		return -EPERM;
> +
> +	st = &per_cpu(steal_time, cpu);
> +	addr = per_cpu_ptr_to_phys(st);
> +
> +	/* The whole structure kvm_steal_time should be one
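The retry loop in para_steal_clock() is the classic seqcount pattern: the hypervisor makes the version odd while it updates the record, and the guest rereads until it observes a stable, even version. A user-space sketch of the same protocol, single-threaded and without the virt_rmb() barriers the kernel side needs:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the kvm_steal_time version/retry protocol; illustrative only. */
struct steal_model {
	uint32_t version;	/* odd while the writer is mid-update */
	uint64_t steal;
};

/* Hypervisor side: version goes odd, data is updated, version goes even. */
void writer_update(struct steal_model *s, uint64_t new_steal)
{
	s->version++;		/* odd: update in progress */
	s->steal = new_steal;
	s->version++;		/* even: update published */
}

/* Guest side: retry until the version is even and unchanged across the
 * read (virt_rmb() pairs order the reads in the kernel version). */
uint64_t reader_get_steal(const struct steal_model *s)
{
	uint32_t version;
	uint64_t steal;

	do {
		version = s->version;
		steal = s->steal;
	} while ((version & 1) || (version != s->version));

	return steal;
}
```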
Re: [PATCH v4 4/4] remoteproc: mediatek: Add IMGSYS IPI command
Il 30/04/24 03:15, Olivia Wen ha scritto: Add an IPI command definition for communication with IMGSYS through SCP mailbox. Signed-off-by: Olivia Wen Reviewed-by: AngeloGioacchino Del Regno
[syzbot] Monthly trace report (Apr 2024)
Hello trace maintainers/developers, This is a 31-day syzbot report for the trace subsystem. All related reports/information can be found at: https://syzkaller.appspot.com/upstream/s/trace During the period, 3 new issues were detected and 2 were fixed. In total, 9 issues are still open and 34 have been fixed so far. Some of the still happening issues: Ref Crashes Repro Title <1> 471 Yes WARNING in format_decode (3) https://syzkaller.appspot.com/bug?extid=e2c932aec5c8a6e1d31c <2> 26 Yes INFO: task hung in blk_trace_ioctl (4) https://syzkaller.appspot.com/bug?extid=ed812ed461471ab17a0c <3> 26 Yes WARNING in blk_register_tracepoints https://syzkaller.appspot.com/bug?extid=c54ded83396afee31eb1 <4> 13 Nopossible deadlock in __send_signal_locked https://syzkaller.appspot.com/bug?extid=6e3b6eab5bd4ed584a38 <5> 7 Yes WARNING in get_probe_ref https://syzkaller.appspot.com/bug?extid=8672dcb9d10011c0a160 --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. To disable reminders for individual bugs, reply with the following command: #syz set no-reminders To change bug's subsystems, reply with: #syz set subsystems: new-subsystem You may send multiple commands in a single email message.
Re: [PATCH v9 1/3] soc: qcom: Add qcom_rproc_minidump module
Gentle ping.. -Mukesh On 3/26/2024 7:43 PM, Mukesh Ojha wrote: Add qcom_rproc_minidump module in a preparation to remove minidump specific code from driver/remoteproc/qcom_common.c and provide needed exported API, this as well helps to abstract minidump specific data layout from qualcomm's remoteproc driver. It is just a copying of qcom_minidump() functionality from driver/remoteproc/qcom_common.c into a separate file under qcom_rproc_minidump(). Signed-off-by: Mukesh Ojha --- Changes in v9: - Added source file driver/remoteproc/qcom_common.c copyright to qcom_rproc_minidump.c - Dissociated it from minidump series as this can go separately and minidump can put it dependency for the data structure files. Nothing much changed in these three patches from previous version, However, giving the link of their older versions. v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/ v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/ v6: https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/ v5: https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/ v4: https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/ drivers/soc/qcom/Kconfig | 10 +++ drivers/soc/qcom/Makefile | 1 + drivers/soc/qcom/qcom_minidump_internal.h | 64 + drivers/soc/qcom/qcom_rproc_minidump.c| 115 ++ include/soc/qcom/qcom_minidump.h | 23 ++ 5 files changed, 213 insertions(+) create mode 100644 drivers/soc/qcom/qcom_minidump_internal.h create mode 100644 drivers/soc/qcom/qcom_rproc_minidump.c create mode 100644 include/soc/qcom/qcom_minidump.h diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig index 5af33b0e3470..ed23e0275c22 100644 --- a/drivers/soc/qcom/Kconfig +++ b/drivers/soc/qcom/Kconfig @@ -277,4 +277,14 @@ config QCOM_PBS This module provides the APIs to the client drivers that wants to send the PBS trigger event to the PBS RAM. 
+config QCOM_RPROC_MINIDUMP + tristate "QCOM Remoteproc Minidump Support" + depends on ARCH_QCOM || COMPILE_TEST + depends on QCOM_SMEM + help + Enablement of core Minidump feature is controlled from boot firmware + side, so if it is enabled from firmware, this config allow Linux to + query predefined Minidump segments associated with the remote processor + and check its validity and end up collecting the dump on remote processor + crash during its recovery. endmenu diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile index ca0bece0dfff..44664589263d 100644 --- a/drivers/soc/qcom/Makefile +++ b/drivers/soc/qcom/Makefile @@ -36,3 +36,4 @@ obj-$(CONFIG_QCOM_ICC_BWMON) += icc-bwmon.o qcom_ice-objs += ice.o obj-$(CONFIG_QCOM_INLINE_CRYPTO_ENGINE) += qcom_ice.o obj-$(CONFIG_QCOM_PBS) += qcom-pbs.o +obj-$(CONFIG_QCOM_RPROC_MINIDUMP) += qcom_rproc_minidump.o diff --git a/drivers/soc/qcom/qcom_minidump_internal.h b/drivers/soc/qcom/qcom_minidump_internal.h new file mode 100644 index ..71709235b196 --- /dev/null +++ b/drivers/soc/qcom/qcom_minidump_internal.h @@ -0,0 +1,64 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved. + */ + +#ifndef _QCOM_MINIDUMP_INTERNAL_H_ +#define _QCOM_MINIDUMP_INTERNAL_H_ + +#define MAX_NUM_OF_SS 10 +#define MAX_REGION_NAME_LENGTH 16 +#define SBL_MINIDUMP_SMEM_ID 602 +#define MINIDUMP_REGION_VALID ('V' << 24 | 'A' << 16 | 'L' << 8 | 'I' << 0) +#define MINIDUMP_SS_ENCR_DONE ('D' << 24 | 'O' << 16 | 'N' << 8 | 'E' << 0) +#define MINIDUMP_SS_ENABLED ('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0) + +/** + * struct minidump_region - Minidump region + * @name : Name of the region to be dumped + * @seq_num: : Use to differentiate regions with same name. 
+ * @valid : This entry to be dumped (if set to 1) + * @address: Physical address of region to be dumped + * @size : Size of the region + */ +struct minidump_region { + charname[MAX_REGION_NAME_LENGTH]; + __le32 seq_num; + __le32 valid; + __le64 address; + __le64 size; +}; + +/** + * struct minidump_subsystem - Subsystem's SMEM Table of content + * @status : Subsystem toc init status + * @enabled : if set to 1, this region would be copied during coredump + * @encryption_status: Encryption status for this subsystem + * @encryption_required : Decides to encrypt the subsystem regions or not + * @region_count : Number of regions added in this subsystem toc + * @regions_baseptr : regions base pointer of the subsystem + */ +struct minidump_subsystem { + __le32 status; + __le32 enabled; +
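The minidump magic constants in the patch pack four ASCII characters into a 32-bit word. A small standalone check of that composition (macro values copied from the patch; the helper function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Same composition as the patch's MINIDUMP_* magics. */
#define MD_REGION_VALID	((uint32_t)('V' << 24 | 'A' << 16 | 'L' << 8 | 'I' << 0))
#define MD_SS_ENCR_DONE	((uint32_t)('D' << 24 | 'O' << 16 | 'N' << 8 | 'E' << 0))
#define MD_SS_ENABLED	((uint32_t)('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0))

/* A region entry is dumped only when its valid field holds the magic. */
int region_is_valid(uint32_t valid_field)
{
	return valid_field == MD_REGION_VALID;
}
```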