Re: [PATCH] tracing/fprobe: Support raw tracepoint events on modules
On Sat, 1 Jun 2024 01:01:26 +0800 kernel test robot wrote: > Hi Masami, > > kernel test robot noticed the following build errors: > > [auto build test ERROR on linus/master] > [also build test ERROR on v6.10-rc1 next-20240531] > [cannot apply to rostedt-trace/for-next rostedt-trace/for-next-urgent] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use '--base' as documented in > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > url: > https://github.com/intel-lab-lkp/linux/commits/Masami-Hiramatsu-Google/tracing-fprobe-Support-raw-tracepoint-events-on-modules/20240531-175013 > base: linus/master > patch link: > https://lore.kernel.org/r/171714888633.198965.13093663631481169611.stgit%40devnote2 > patch subject: [PATCH] tracing/fprobe: Support raw tracepoint events on > modules > config: s390-defconfig > (https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/config) > compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project > bafda89a0944d947fc4b3b5663185e07a397ac30) > reproduce (this is a W=1 build): > (https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version > of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot > | Closes: > https://lore.kernel.org/oe-kbuild-all/202406010034.fsnp9rsq-...@intel.com/ OK, I must be tired... let me fix that. And I'll add a selftest for this case. Thanks! > > All errors (new ones prefixed by >>): > >In file included from kernel/tracepoint.c:5: >In file included from include/linux/module.h:19: >In file included from include/linux/elf.h:6: >In file included from arch/s390/include/asm/elf.h:173: >In file included from arch/s390/include/asm/mmu_context.h:11: >In file included from arch/s390/include/asm/pgalloc.h:18: >In file included from include/linux/mm.h:2253: >include/linux/vmstat.h:500:43: warning: arithmetic between different > enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') > [-Wenum-enum-conversion] > 500 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + > |~ ^ > 501 |item]; > | >include/linux/vmstat.h:507:43: warning: arithmetic between different > enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') > [-Wenum-enum-conversion] > 507 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + > |~ ^ > 508 |NR_VM_NUMA_EVENT_ITEMS + > |~~ >include/linux/vmstat.h:514:36: warning: arithmetic between different > enumeration types ('enum node_stat_item' and 'enum lru_list') > [-Wenum-enum-conversion] > 514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_" > | ~~~ ^ ~~~ >include/linux/vmstat.h:519:43: warning: arithmetic between different > enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') > [-Wenum-enum-conversion] > 519 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + > |~ ^ > 520 |NR_VM_NUMA_EVENT_ITEMS + > |~~ >include/linux/vmstat.h:528:43: warning: arithmetic between different > enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') > [-Wenum-enum-conversion] > 528 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + > |~ ^ > 529 |NR_VM_NUMA_EVENT_ITEMS + > |~~ > >> kernel/tracepoint.c:751:34: error: no member named > >> '__start___tracepoints_ptrs' in 'struct module' > 751 | > for_each_tracepoint_range(mod->__start___tracepoints_ptrs, > | ~~~ ^ >5 warnings and 1 error generated. > > > vim +751 kernel/tracepoint.c > >738 >739void for_each_module_tracepoint(void (*fct)(struct tracepoint > *tp, void *priv), >740void *priv) >741{ >742struct tp_module *tp_mod; >743struct module *mod; >
[PATCH] lib/test_kmod: add missing MODULE_DESCRIPTION() macro
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_kmod.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson --- lib/test_kmod.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/test_kmod.c b/lib/test_kmod.c index 1eec3b7ac67c..064ed0fce75a 100644 --- a/lib/test_kmod.c +++ b/lib/test_kmod.c @@ -1223,4 +1223,5 @@ static void __exit test_kmod_exit(void) module_exit(test_kmod_exit); MODULE_AUTHOR("Luis R. Rodriguez "); +MODULE_DESCRIPTION("kmod stress test driver"); MODULE_LICENSE("GPL"); --- base-commit: b050496579632f86ee1ef7e7501906db579f3457 change-id: 20240531-md-lib-test_kmod-83bf2ee7e725
Re: [PATCH RFC 4/4] mm, page_alloc: add static key for should_fail_alloc_page()
On Fri, May 31, 2024 at 11:33:35AM +0200, Vlastimil Babka wrote: > Similarly to should_failslab(), remove the overhead of calling the > noinline function should_fail_alloc_page() with a static key that guards > the allocation hotpath callsite and is controlled by the fault and error > injection frameworks. > > Signed-off-by: Vlastimil Babka Reviewed-by: Roman Gushchin Thanks!
Re: [PATCH RFC 3/4] mm, slab: add static key for should_failslab()
On Fri, May 31, 2024 at 11:33:34AM +0200, Vlastimil Babka wrote: > Since commit 4f6923fbb352 ("mm: make should_failslab always available for > fault injection") should_failslab() is unconditionally a noinline > function. This adds visible overhead to the slab allocation hotpath, > even if the function is empty. With CONFIG_FAILSLAB=y there's additional > overhead when the functionality is not enabled by a boot parameter or > debugfs. > > The overhead can be eliminated with a static key around the callsite. > Fault injection and error injection frameworks can now be told that the > this function has a static key associated, and are able to enable and > disable it accordingly. > > Signed-off-by: Vlastimil Babka Reviewed-by: Roman Gushchin
Re: [PATCH RFC 0/4] static key support for error injection functions
On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote: > Incomplete, help needed from ftrace/kprobe and bpf folks. > > As previously mentioned by myself [1] and others [2] the functions > designed for error injection can bring visible overhead in fastpaths > such as slab or page allocation, because even if nothing hooks into them > at a given moment, they are noninline function calls regardless of > CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab > always available for fault injection") and af3b854492f3 > ("mm/page_alloc.c: allow error injection"). > > Live patching their callsites has been also suggested in both [1] and > [2] threads, and this is an attempt to do that with static keys that > guard the call sites. When disabled, the error injection functions still > exist and are noinline, but are not being called. Any of the existing > mechanisms that can inject errors should make sure to enable the > respective static key. I have added that support to some of them but > need help with the others. I think it's a clever idea and makes total sense! > > - the legacy fault injection, i.e. CONFIG_FAILSLAB and > CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the > address of the static key if it exists. The key will be activated if the > fault injection probability becomes non-zero, and deactivated in the > opposite transition. This also removes the overhead of the evaluation > (on top of the noninline function call) when these mechanisms are > configured in the kernel but unused at the moment. > > - the generic error injection using kretprobes with > override_function_with_return is handled in patch 2. The > ALLOW_ERROR_INJECTION() annotation is extended so that static key > address can be passed, and the framework controls it when error > injection is enabled or disabled in debugfs for the function. > > There are two more users I know of but am not familiar enough to fix up > myself. I hope people that are more familiar can help me here. > > - ftrace seems to be using override_function_with_return from > #define ftrace_override_function_with_return but I found no place > where the latter is used. I assume it might be hidden behind more > macro magic? But the point is if ftrace can be instructed to act like > an error injection, it would also have to use some form of metadata > (from patch 2 presumably?) to get to the static key and control it. > > If ftrace can only observe the function being called, maybe it > wouldn't be wrong to just observe nothing if the static key isn't > enabled because nobody is doing the fault injection? > > - bpftrace, as can be seen from the example in commit 4f6923fbb352 > description. I suppose bpf is already aware what functions the > currently loaded bpf programs hook into, so that it could look up the > static key and control it. Maybe using again the metadata from patch 2, > or extending its own, as I've noticed there's e.g. BTF_ID(func, > should_failslab) > > Now I realize maybe handling this at the k(ret)probe level would be > sufficient for all cases except the legacy fault injection from Patch 1? > Also wanted to note that by AFAIU by using the static_key_slow_dec/inc > API (as done in patches 1/2) should allow all mechanisms to coexist > naturally without fighting each other on the static key state, and also > handle the reference count for e.g. active probes or bpf programs if > there's no similar internal mechanism. > > Patches 3 and 4 implement the static keys for the two mm fault injection > sites in slab and page allocators. For a quick demonstration I've run a > VM and the simple test from [1] that stresses the slab allocator and got > this time before the series: > > real0m8.349s > user0m0.694s > sys 0m7.648s > > with perf showing > >0.61% nonexistent [kernel.kallsyms] [k] should_failslab.constprop.0 >0.00% nonexistent [kernel.kallsyms] [k] should_fail_alloc_page > > > ▒ > > And after the series > > real0m7.924s > user0m0.727s > sys 0m7.191s Is "user" increase a measurement error or it's real? Otherwise, nice savings!
Re: [PATCH 10/20] function_graph: Have the instances use their own ftrace_ops for filtering
On Fri, 31 May 2024 23:50:23 +0900 Masami Hiramatsu (Google) wrote: > So is it similar to the fprobe/kprobe, use shared signle ftrace_ops, > but keep each fgraph has own hash table? Sort of. I created helper functions in ftrace that lets you have a "manager ftrace_ops" that will be used to assign to ftrace (with the function that will demultiplex), and then you can have "subops" that can be assigned that is per user. Here's a glimpse of the code: /** * ftrace_startup_subops - enable tracing for subops of an ops * @ops: Manager ops (used to pick all the functions of its subops) * @subops: A new ops to add to @ops * @command: Extra commands to use to enable tracing * * The @ops is a manager @ops that has the filter that includes all the functions * that its list of subops are tracing. Adding a new @subops will add the * functions of @subops to @ops. */ int ftrace_startup_subops(struct ftrace_ops *ops, struct ftrace_ops *subops, int command) { struct ftrace_hash *filter_hash; struct ftrace_hash *notrace_hash; struct ftrace_hash *save_filter_hash; struct ftrace_hash *save_notrace_hash; int size_bits; int ret; if (unlikely(ftrace_disabled)) return -ENODEV; ftrace_ops_init(ops); ftrace_ops_init(subops); /* Make everything canonical (Just in case!) */ if (!ops->func_hash->filter_hash) ops->func_hash->filter_hash = EMPTY_HASH; if (!ops->func_hash->notrace_hash) ops->func_hash->notrace_hash = EMPTY_HASH; if (!subops->func_hash->filter_hash) subops->func_hash->filter_hash = EMPTY_HASH; if (!subops->func_hash->notrace_hash) subops->func_hash->notrace_hash = EMPTY_HASH; /* For the first subops to ops just enable it normally */ if (list_empty(>subop_list)) { /* Just use the subops hashes */ filter_hash = copy_hash(subops->func_hash->filter_hash); notrace_hash = copy_hash(subops->func_hash->notrace_hash); if (!filter_hash || !notrace_hash) { free_ftrace_hash(filter_hash); free_ftrace_hash(notrace_hash); return -ENOMEM; } save_filter_hash = ops->func_hash->filter_hash; save_notrace_hash = ops->func_hash->notrace_hash; ops->func_hash->filter_hash = filter_hash; ops->func_hash->notrace_hash = notrace_hash; list_add(>list, >subop_list); ret = ftrace_startup(ops, command); if (ret < 0) { list_del(>list); ops->func_hash->filter_hash = save_filter_hash; ops->func_hash->notrace_hash = save_notrace_hash; free_ftrace_hash(filter_hash); free_ftrace_hash(notrace_hash); } else { free_ftrace_hash(save_filter_hash); free_ftrace_hash(save_notrace_hash); subops->flags |= FTRACE_OPS_FL_ENABLED; } return ret; } /* * Here there's already something attached. Here are the rules: * o If either filter_hash is empty then the final stays empty * o Otherwise, the final is a superset of both hashes * o If either notrace_hash is empty then the final stays empty * o Otherwise, the final is an intersection between the hashes */ if (ops->func_hash->filter_hash == EMPTY_HASH || subops->func_hash->filter_hash == EMPTY_HASH) { filter_hash = EMPTY_HASH; } else { size_bits = max(ops->func_hash->filter_hash->size_bits, subops->func_hash->filter_hash->size_bits); filter_hash = alloc_and_copy_ftrace_hash(size_bits, ops->func_hash->filter_hash); if (!filter_hash) return -ENOMEM; ret = append_hash(_hash, subops->func_hash->filter_hash); if (ret < 0) { free_ftrace_hash(filter_hash); return ret; } } if (ops->func_hash->notrace_hash == EMPTY_HASH || subops->func_hash->notrace_hash == EMPTY_HASH) { notrace_hash = EMPTY_HASH; } else { size_bits = max(ops->func_hash->filter_hash->size_bits, subops->func_hash->filter_hash->size_bits); notrace_hash = alloc_ftrace_hash(size_bits); if (!notrace_hash) { free_ftrace_hash(filter_hash); return -ENOMEM; } ret = intersect_hash(_hash, ops->func_hash->filter_hash,
[PATCH v14 14/14] selftests/sgx: Add scripts for EPC cgroup testing
With different cgroups, the script starts one or multiple concurrent SGX selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed test case, which loads an enclave of EPC size equal to the EPC capacity available on the platform. The script checks results against the expectation set for each cgroup and reports success or failure. The script creates 3 different cgroups at the beginning with following expectations: 1) small - intentionally small enough to fail the test loading an enclave of size equal to the capacity. 2) large - large enough to run up to 4 concurrent tests but fail some if more than 4 concurrent tests are run. The script starts 4 expecting at least one test to pass, and then starts 5 expecting at least one test to fail. 3) larger - limit is the same as the capacity, large enough to run lots of concurrent tests. The script starts 8 of them and expects all pass. Then it reruns the same test with one process randomly killed and usage checked to be zero after all processes exit. The script also includes a test with low mem_cg limit and large sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged to a proper mem_cg. For this test, it turns off swapping before start, and turns swapping back on afterwards. Add README to document how to run the tests. Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Tested-by: Jarkko Sakkinen --- V13: - More improvement on handling error cases and style fixes. - Add settings file for custom timeout V12: - Integrate the scripts to the "run_tests" target. (Jarkko) V11: - Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko) - Drop support for cgroup v1 and simplify. (Michal, Jarkko) - Add documentation for functions. (Jarkko) - Turn off swapping before memcontrol tests and back on after - Format and style fixes, name for hard coded values V7: - Added memcontrol test. V5: - Added script with automatic results checking, remove the interactive script. - The script can run independent from the series below. --- tools/testing/selftests/sgx/Makefile | 3 +- tools/testing/selftests/sgx/README| 109 +++ tools/testing/selftests/sgx/ash_cgexec.sh | 16 + tools/testing/selftests/sgx/config| 4 + .../selftests/sgx/run_epc_cg_selftests.sh | 295 ++ tools/testing/selftests/sgx/settings | 2 + .../selftests/sgx/watch_misc_for_tests.sh | 11 + 7 files changed, 439 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/sgx/README create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh create mode 100644 tools/testing/selftests/sgx/config create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh create mode 100644 tools/testing/selftests/sgx/settings create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile index 867f88ce2570..739376af9e33 100644 --- a/tools/testing/selftests/sgx/Makefile +++ b/tools/testing/selftests/sgx/Makefile @@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none ifeq ($(CAN_BUILD_X86_64), 1) TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx -TEST_FILES := $(OUTPUT)/test_encl.elf +TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh +TEST_PROGS := run_epc_cg_selftests.sh all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf endif diff --git a/tools/testing/selftests/sgx/README b/tools/testing/selftests/sgx/README new file mode 100644 index ..f84406bf29a4 --- /dev/null +++ b/tools/testing/selftests/sgx/README @@ -0,0 +1,109 @@ +SGX selftests + +The SGX selftests includes a c program (test_sgx) that covers basic user space +facing APIs and a shell scripts (run_sgx_cg_selftests.sh) testing SGX misc +cgroup. The SGX cgroup test script requires root privileges and runs a +specific test case of the test_sgx in different cgroups configured by the +script. More details about the cgroup test can be found below. + +All SGX selftests can run with or without kselftest framework. + +WITH KSELFTEST FRAMEWORK +=== + +BUILD +- + +Build executable file "test_sgx" from top level directory of the kernel source: + $ make -C tools/testing/selftests TARGETS=sgx + +RUN +--- + +Run all sgx tests as sudo or root since the cgroup tests need to configure cgroup +limits in files under /sys/fs/cgroup. + + $ sudo make -C tools/testing/selftests/sgx run_tests + +Without sudo, SGX cgroup tests will be skipped. + +On platforms with large Enclave Page Cache (EPC) and/or less cpu cores, you may +need adjust the timeout in 'settings' to avoid timeouts. + +More details about kselftest framework can be found in +Documentation/dev-tools/kselftest.rst. + +WITHOUT KSELFTEST FRAMEWORK +=== + +BUILD +- + +Build executable file "test_sgx" from this +directory(tools/testing/selftests/sgx/): + + $ make + +RUN +--- + +Run all
[PATCH v14 13/14] Docs/x86/sgx: Add description for cgroup support
From: Sean Christopherson Add initial documentation of how to regulate the distribution of SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup controller. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Bagas Sanjaya Reviewed-by: Jarkko Sakkinen Acked-by: Kai Huang Tested-by: Mikko Ylinen Tested-by: Jarkko Sakkinen --- V8: - Limit text width to 80 characters to be consistent. V6: - Remove mentioning of VMM specific behavior on handling SIGBUS - Remove statement of forced reclamation, add statement to specify ENOMEM returned when no reclamation possible. - Added statements on the non-preemptive nature for the max limit - Dropped Reviewed-by tag because of changes V4: - Fix indentation (Randy) - Change misc.events file to be read-only - Fix a typo for 'subsystem' - Add behavior when VMM overcommit EPC with a cgroup (Mikko) --- Documentation/arch/x86/sgx.rst | 83 ++ 1 file changed, 83 insertions(+) diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst index d90796adc2ec..c537e6a9aa65 100644 --- a/Documentation/arch/x86/sgx.rst +++ b/Documentation/arch/x86/sgx.rst @@ -300,3 +300,86 @@ to expected failures and handle them as follows: first call. It indicates a bug in the kernel or the userspace client if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has a return code other than 0. + + +Cgroup Support +== + +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates +distribution of SGX EPC memory, which is a subset of system RAM that is used to +provide SGX-enabled applications with protected memory, and is otherwise +inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be +read/written outside of an SGX enclave. + +Although current systems implement EPC by stealing memory from RAM, for all +intents and purposes the EPC is independent from normal system memory, e.g. must +be reserved at boot from RAM and cannot be converted between EPC and normal +memory while the system is running. The EPC is managed by the SGX subsystem and +is not accounted by the memory controller. Note that this is true only for EPC +memory itself, i.e. normal memory allocations related to SGX and EPC memory, +e.g. the backing memory for evicted EPC pages, are accounted, limited and +protected by the memory controller. + +Much like normal system memory, EPC memory can be overcommitted via virtual +memory techniques and pages can be swapped out of the EPC to their backing store +(normal system memory allocated via shmem). The SGX EPC subsystem is analogous +to the memory subsystem, and it implements limit and protection models for EPC +memory. + +SGX EPC Interface Files +--- + +For a generic description of the Miscellaneous controller interface files, +please see Documentation/admin-guide/cgroup-v2.rst + +All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If +a value which is not PAGE_SIZE aligned is written, the actual value used by the +controller will be rounded down to the closest PAGE_SIZE multiple. + + misc.capacity +A read-only flat-keyed file shown only in the root cgroup. The sgx_epc +resource will show the total amount of EPC memory available on the +platform. + + misc.current +A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc +resource will show the current active EPC memory usage of the cgroup and +its descendants. EPC pages that are swapped out to backing RAM are not +included in the current count. + + misc.max +A read-write single value file which exists on non-root cgroups. The +sgx_epc resource will show the EPC usage hard limit. The default is +"max". + +If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for +page fault handling, will be blocked until EPC can be reclaimed from the +cgroup. If there are no pages left that are reclaimable within the same +group, the kernel returns ENOMEM. + +The EPC pages allocated for a guest VM by the virtual EPC driver are not +reclaimable by the host kernel. In case the guest cgroup's limit is +reached and no reclaimable pages left in the same cgroup, the virtual +EPC driver returns SIGBUS to the user space process to indicate failure +on new EPC allocation requests. + +The misc.max limit is non-preemptive. If a user writes a limit lower +than the current usage to this file, the cgroup will not preemptively +deallocate pages currently in use, and will only start blocking the next +allocation and reclaiming EPC at that time. + + misc.events +A read-only flat-keyed file which exists on non-root
[PATCH v14 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
From: Kristen Carlson Accardi Previous patches have implemented all infrastructure needed for per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC pages are still tracked in the global LRU as sgx_epc_page_lru() always returns reference to the global LRU. Change sgx_epc_page_lru() to return the LRU of the cgroup in which the given EPC page is allocated. This makes all EPC pages tracked in per-cgroup LRUs and the global reclaimer (ksgxd) will not be able to reclaim any pages from the global LRU. However, in cases of over-committing, i.e., the sum of cgroup limits greater than the total capacity, cgroups may never reclaim but the total usage can still be near the capacity. Therefore a global reclamation is still needed in those cases and it should be performed from the root cgroup. Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup when cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(), return the next cgroup so callers can use it as the new starting node for next round of reclamation if needed. Also update sgx_can_reclaim_global(), to check emptiness of LRUs of all cgroups when EPC cgroup is enabled, otherwise only check the global LRU. Finally, change sgx_reclaim_direct(), to check and ensure there are free pages at cgroup level so forward progress can be made by the caller. With these changes, the global reclamation and per-cgroup reclamation both work properly with all pages tracked in per-cgroup LRUs. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Tested-by: Mikko Ylinen Tested-by: Jarkko Sakkinen --- V14: - Update global reclamation to use the new sgx_cgroup_reclaim_pages() to iterate cgroups at lower level if the top cgroups are too busy. V13: - Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai) V12: - Remove CONFIG_CGROUP_SGX_EPC, conditional compile SGX Cgroup for CONFIGCONFIG_CGROUPMISC. (Jarkko) V11: - Reword the comments for global reclamation for allocation failure after passing cgroup charging. (Kai) - Add stub functions to remove ifdefs in c file (Kai) - Add more detailed comments to clarify each page belongs to one cgroup, or the root. (Kai) V10: - Add comment to clarify each page belongs to one cgroup, or the root by default. (Kai) - Merge the changes that expose sgx_cgroup_* functions to this patch. - Add changes for sgx_reclaim_direct() that was missed previously. V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 8 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 32 +++ arch/x86/kernel/cpu/sgx/main.c | 80 +--- 3 files changed, 109 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index 7e0e9a4fedf0..2702bbd49c60 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -67,7 +67,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg) * * Return: %true if all cgroups under the specified root have empty LRU lists. */ -static bool sgx_cgroup_lru_empty(struct misc_cg *root) +bool sgx_cgroup_lru_empty(struct misc_cg *root) { struct cgroup_subsys_state *css_root; struct cgroup_subsys_state *pos; @@ -127,8 +127,8 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root) * release the reference if the returned is not used as %start for a subsequent * call. */ -static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start, - struct mm_struct *charge_mm) +struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start, +struct mm_struct *charge_mm) { struct cgroup_subsys_state *css_root, *pos; struct cgroup_subsys_state *next = NULL; @@ -181,7 +181,7 @@ static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mis * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the * cgroup. */ -static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) +bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) { u64 cur, max; diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index 2044e0d64076..c6940bb8ec3b 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -13,6 +13,11 @@ #define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES struct sgx_cgroup; +static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg) +{ + return NULL; +} + static inline struct sgx_cgroup *sgx_get_current_cg(void) { return NULL; @@ -27,8 +32,25 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl
[PATCH v14 11/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
Enclave Page Cache(EPC) memory can be swapped out to regular system memory, and the consumed memory should be charged to a proper mem_cgroup. Currently the selection of mem_cgroup to charge is done in sgx_encl_get_mem_cgroup(). But it considers all contexts other than the ksgxd thread are user processes. With the new EPC cgroup implementation, the swapping can also happen in EPC cgroup work-queue threads. In those cases, it improperly selects the root mem_cgroup to charge for the RAM usage. Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take an additional argument to explicitly specify the mm struct to charge for allocations. Callers from background kthreads not associated with a charging mm struct would set it to NULL, while callers in user process contexts set it to current->mm. Internally, it handles the case when the charging mm given is NULL, by searching for an mm struct from enclave's mm_list. Signed-off-by: Haitao Huang Reported-by: Mikko Ylinen Reviewed-by: Kai Huang Reviewed-by: Jarkko Sakkinen Tested-by: Mikko Ylinen Tested-by: Jarkko Sakkinen --- V10: - Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko) V9: - Reduce number of if statements. (Tim) V8: - Limit text paragraphs to 80 characters wide. (Jarkko) --- arch/x86/kernel/cpu/sgx/encl.c | 29 ++-- arch/x86/kernel/cpu/sgx/encl.h | 3 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 11 +++ arch/x86/kernel/cpu/sgx/main.c | 29 +--- arch/x86/kernel/cpu/sgx/sgx.h| 2 +- 5 files changed, 37 insertions(+), 37 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index f474179b6f77..7b77dad41daf 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde } /* - * When called from ksgxd, returns the mem_cgroup of a struct mm stored - * in the enclave's mm_list. When not called from ksgxd, just returns - * the mem_cgroup of the current task. + * Find the mem_cgroup to charge for memory allocated on behalf of an enclave. + * + * Used in sgx_encl_alloc_backing() for backing store allocation. + * + * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup + * of a struct mm stored in the enclave's mm_list. */ -static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) +static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl, + struct mm_struct *charge_mm) { struct mem_cgroup *memcg = NULL; struct sgx_encl_mm *encl_mm; int idx; - /* -* If called from normal task context, return the mem_cgroup -* of the current task's mm. The remainder of the handling is for -* ksgxd. -*/ - if (!current_is_ksgxd()) - return get_mem_cgroup_from_mm(current->mm); +/* Use the charge_mm if given. */ + if (charge_mm) + return get_mem_cgroup_from_mm(charge_mm); /* * Search the enclave's mm_list to find an mm associated with @@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * @encl: an enclave pointer * @page_index:enclave page index * @backing: data for accessing backing storage for the page + * @charge_mm: the mm to charge for the allocation * - * When called from ksgxd, sets the active memcg from one of the + * When charge_mm is NULL, sets the active memcg from one of the * mms in the enclave's mm_list prior to any backing page allocation, * in order to ensure that shmem page allocations are charged to the * enclave. Create a backing page for loading data back into an EPC page with @@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * -errno otherwise. */ int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index, - struct sgx_backing *backing) + struct sgx_backing *backing, struct mm_struct *charge_mm) { - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl); + struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm); struct mem_cgroup *memcg = set_active_memcg(encl_memcg); int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index fe15ade02ca1..5ce9d108290f 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr, int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start, unsigned long end, unsigned long vm_flags); -bool current_is_ksgxd(void); void sgx_encl_release(struct kref *ref); int sgx_encl_mm_add(struct sgx_encl *encl, struct
[PATCH v14 10/14] x86/sgx: Implement async reclamation for cgroup
From: Kristen Carlson Accardi In cases EPC pages need be allocated during a page fault and the cgroup usage is near its limit, an asynchronous reclamation needs be triggered to avoid blocking the page fault handling. Create a workqueue, corresponding work item and function definitions for EPC cgroup to support the asynchronous reclamation. In sgx_cgroup_try_charge(), if caller does not allow synchronous reclamation, queue an asynchronous work into the workqueue. Reclaiming only when the usage is at or very close to the limit would cause thrashing. To avoid that, before returning from sgx_cgroup_try_charge(), check the need for reclamation (usage too close to the limit), queue an async work if needed, similar to how the global reclaimer wakes up its reclaiming thread after each allocation in sgx_alloc_epc_pages(). Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Tested-by: Jarkko Sakkinen --- V13: - Revert to BUG_ON() in case of workq allocation failure in init and only alloc if misc is enabled. V11: - Print error instead of WARN (Kai) - Add check for need to queue an async reclamation before returning from try_charge(), do so if needed. This is to be consistent with global reclaimer to minimize thrashing during allocation time. V10: - Split asynchronous flow in separate patch. (Kai) - Consider cgroup disabled when the workqueue allocation fail during init. (Kai) - Abstract out sgx_cgroup_should_reclaim(). V9: - Add comments for static variables. (Jarkko) V8: - Remove alignment for substructure variables. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 136 ++- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 1 + 2 files changed, 136 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index 4406077acd1c..9e366068e0cd 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -4,9 +4,63 @@ #include #include "epc_cgroup.h" +/* + * The minimal free pages maintained by per-cgroup reclaimer + * Set this to the low threshold used by the global reclaimer, ksgxd. + */ +#define SGX_CG_MIN_FREE_PAGE (SGX_NR_LOW_PAGES) + +/* + * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal + * free pages would barely leave any page for use, causing excessive reclamation + * and thrashing. + * + * Define the following limit, below which cgroup does not maintain the minimal + * free page threshold. Set this to quadruple of the minimal so at least 75% + * pages used without being reclaimed. + */ +#define SGX_CG_LOW_LIMIT (SGX_CG_MIN_FREE_PAGE * 4) + /* The root SGX EPC cgroup */ static struct sgx_cgroup sgx_cg_root; +/* + * The work queue that reclaims EPC pages in the background for cgroups. + * + * A cgroup schedules a work item into this queue to reclaim pages within the + * same cgroup when its usage limit is reached and synchronous reclamation is not + * an option, i.e., in a page fault handler. + */ +static struct workqueue_struct *sgx_cg_wq; + +static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg) +{ + return atomic64_read(_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE; +} + +static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg) +{ + return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE; +} + +/* + * Get the lower bound of limits of a cgroup and its ancestors. Used in + * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is + * close to its limit or its ancestors' hence reclamation is needed. + */ +static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg) +{ + struct misc_cg *i = sgx_cg->cg; + u64 m = U64_MAX; + + while (i) { + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max)); + i = misc_cg_parent(i); + } + + return m / PAGE_SIZE; +} + /** * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs * @root: Root of the tree to check @@ -113,6 +167,65 @@ static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mis return css_misc(next); } +/** + * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a cgroup + * @sgx_cg: The cgroup to be checked. + * + * This function can be used to guard a call to sgx_cgroup_reclaim_pages() where + * the minimal number of free page needs be maintained for the cgroup to make + * good forward progress. + * + * Return: %true if number of free pages available for the cgroup below a + * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the + * cgroup. + */ +static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) +{ + u64 cur,
[PATCH v14 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
From: Kristen Carlson Accardi Currently in the EPC page allocation, the kernel simply fails the allocation when the current EPC cgroup fails to charge due to its usage reaching limit. This is not ideal. When that happens, a better way is to reclaim EPC page(s) from the current EPC cgroup (and/or its descendants) to reduce its usage so the new allocation can succeed. Add the basic building blocks to support per-cgroup reclamation. Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU list for each EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC pages from a given EPC cgroup and its descendants. Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the global LRU, and then tries to reclaim these pages at once for better performance. Implement the "cgroup" variant EPC reclaim in a similar way, but keep the implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input, and return the pages that are "scanned" and attempted for reclamation (but not necessarily reclaimed successfully); 2) loop the given EPC cgroup and its descendants and do the new sgx_reclaim_pages() until SGX_NR_TO_SCAN pages are "scanned". This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC cgroup, and only moves to its descendants when there's no enough reclaimable EPC pages to "scan" in its LRU. It should be enough for most cases. In other cases, the caller may invoke this function in a loop to ensure enough pages reclaimed for its usage. To ensure all descendant groups scanned in a round-robin fashion in those cases, sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the next cgroup that the caller can pass in as the new starting cgroup for a subsequent call. Note, this simple implementation doesn't _exactly_ mimic the current global EPC reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs with each being smaller than SGX_NR_TO_SCAN pages. A more precise way to mimic the current global EPC reclaim would be to have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one batch. But this is unnecessarily complicated at this stage. Alternatively, the current sgx_reclaim_pages() could be changed to return the actual "reclaimed" pages, but not "scanned" pages. However, the reclamation is a lengthy process, forcing a successful reclamation of predetermined number of pages may block the caller for too long. And that may not be acceptable in some synchronous contexts, e.g., in serving an ioctl(). With this building block in place, add synchronous reclamation support in sgx_cgroup_try_charge(): trigger a call to sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the caller allows synchronous reclaim as indicated by s newly added parameter. A later patch will add support for asynchronous reclamation reusing sgx_cgroup_reclaim_pages(). Note all reclaimable EPC pages are still tracked in the global LRU thus no per-cgroup reclamation is actually active at the moment. Per-cgroup tracking and reclamation will be turned on in the end after all necessary infrastructure is in place. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V14: - Allow sgx_cgroup_reclaim_pages() to continue from previous tree-walk. It takes in a 'start' node and returns the 'next' node for the caller to use as the new 'start'. This is to ensure pages in lower level cgroups can be reclaimed if all pages in upper level nodes are "too young". (Kai) - Move renaming sgx_should_reclaim() to sgx_should_reclaim_global() from a later patch to this one. (Kai) V11: - Use commit message suggested by Kai - Remove "usage" comments for functions. (Kai) V10: - Simplify the signature by removing a pointer to nr_to_scan (Kai) - Return pages attempted instead of reclaimed as it is really what the cgroup caller needs to track progress. This further simplifies the design. - Merge patch for exposing sgx_reclaim_pages() with basic synchronous reclamation. (Kai) - Shorten names for EPC cgroup functions. (Jarkko) - Fix/add comments to justify the design (Kai) - Separate out a helper for for addressing single iteration of the loop in sgx_cgroup_try_charge(). (Jarkko) V9: - Add comments for static
[PATCH v14 09/14] x86/sgx: Abstract check for global reclaimable pages
From: Kristen Carlson Accardi For the global reclaimer to determine if any page available for reclamation at the global level, it currently only checks for emptiness of the global LRU. That will be inadequate when pages are tracked in multiple LRUs, one per cgroup. For this purpose, create a new helper, sgx_can_reclaim_global(), to abstract this check. Currently it only checks the global LRU, later will check emptiness of LRUs of all cgroups when per-cgroup tracking is turned on. Replace all the checks for emptiness of the global LRU, list_empty(_global_lru.reclaimable), with calls to sgx_can_reclaim_global(). Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V13: - Rename sgx_can_reclaim() to sgx_can_reclaim_global() and sgx_should_reclaim() to sgx_should_reclaim_global(). (Kai) V10: - Add comments for the new function. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 71e84937bc17..0a06b80e26b8 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -37,6 +37,14 @@ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc return _global_lru; } +/* + * Check if there is any reclaimable page at global level. + */ +static inline bool sgx_can_reclaim_global(void) +{ + return !list_empty(_global_lru.reclaimable); +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -390,7 +398,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru) static bool sgx_should_reclaim_global(unsigned long watermark) { return atomic_long_read(_nr_free_pages) < watermark && - !list_empty(_global_lru.reclaimable); + sgx_can_reclaim_global(); } static void sgx_reclaim_pages_global(void) @@ -596,7 +604,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim) break; } - if (list_empty(_global_lru.reclaimable)) { + if (!sgx_can_reclaim_global()) { page = ERR_PTR(-ENOMEM); break; } -- 2.25.1
[PATCH v14 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU
From: Kristen Carlson Accardi The SGX driver tracks reclaimable EPC pages by adding a newly allocated page into the global LRU list in sgx_mark_page_reclaimable(), and doing the opposite in sgx_unmark_page_reclaimable(). To support SGX EPC cgroup, the SGX driver will need to maintain an LRU list for each cgroup, and each newly allocated EPC page will need to be added to the LRU of associated cgroup, not always the global LRU list. When sgx_mark_page_reclaimable() is called, the cgroup that the newly allocated EPC page belongs to is already known, i.e., it has been set to the 'struct sgx_epc_page'. Add a helper, sgx_epc_page_lru(), to return the LRU that the EPC page should be added to for the given EPC page. Currently it just returns the global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the helper function to get the LRU from the EPC page instead of referring to the global LRU directly. This allows EPC page being able to be tracked in "per-cgroup" LRU when that becomes ready. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V13: - Revise commit log (Kai) - Rename sgx_lru_list() to sgx_epc_page_lru() V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 30 ++ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 75d6ce72b9b8..1fa73250e2b9 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space); */ static struct sgx_epc_lru_list sgx_global_lru; +static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page) +{ + return _global_lru; +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void) } /** - * sgx_mark_page_reclaimable() - Mark a page as reclaimable + * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU. * @page: EPC page - * - * Mark a page as reclaimable and add it to the active page list. Pages - * are automatically removed from the active list when freed. */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page); + + spin_lock(>lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(>list, _global_lru.reclaimable); - spin_unlock(_global_lru.lock); + list_add_tail(>list, >reclaimable); + spin_unlock(>lock); } /** - * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list + * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU * @page: EPC page * - * Clear the reclaimable flag and remove the page from the active page list. + * Clear the reclaimable flag if set and remove the page from its LRU. * * Return: * 0 on success, @@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page); + + spin_lock(>lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. */ if (list_empty(>list)) { - spin_unlock(_global_lru.lock); + spin_unlock(>lock); return -EBUSY; } list_del(>list); page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(_global_lru.lock); + spin_unlock(>lock); return 0; } -- 2.25.1
[PATCH v14 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
From: Sean Christopherson Introduce a data structure to wrap the existing reclaimable list and its spinlock. Each cgroup later will have one instance of this structure to track EPC pages allocated for processes associated with the same cgroup. Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages from the reclaimable list in this structure when its usage reaches near its limit. Use this structure to encapsulate the LRU list and its lock used by the global reclaimer. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V6: - removed introduction to unreclaimables in commit message. V4: - Removed unneeded comments for the spinlock and the non-reclaimables. (Kai, Jarkko) - Revised the commit to add introduction comments for unreclaimables and multiple LRU lists.(Kai) - Reordered the patches: delay all changes for unreclaimables to later, and this one becomes the first change in the SGX subsystem. V3: - Removed the helper functions and revised commit messages. --- arch/x86/kernel/cpu/sgx/main.c | 39 +- arch/x86/kernel/cpu/sgx/sgx.h | 15 + 2 files changed, 35 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index d7a945d44268..75d6ce72b9b8 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space); /* * These variables are part of the state of the reclaimer, and must be accessed - * with sgx_reclaimer_lock acquired. + * with sgx_global_lru.lock acquired. */ -static LIST_HEAD(sgx_active_page_list); -static DEFINE_SPINLOCK(sgx_reclaimer_lock); +static struct sgx_epc_lru_list sgx_global_lru; static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); @@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(_reclaimer_lock); + spin_lock(_global_lru.lock); for (i = 0; i < SGX_NR_TO_SCAN; i++) { - if (list_empty(_active_page_list)) + epc_page = list_first_entry_or_null(_global_lru.reclaimable, + struct sgx_epc_page, list); + if (!epc_page) break; - epc_page = list_first_entry(_active_page_list, - struct sgx_epc_page, list); list_del_init(_page->list); encl_page = epc_page->owner; @@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(_reclaimer_lock); + spin_unlock(_global_lru.lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void) continue; skip: - spin_lock(_reclaimer_lock); - list_add_tail(_page->list, _active_page_list); - spin_unlock(_reclaimer_lock); + spin_lock(_global_lru.lock); + list_add_tail(_page->list, _global_lru.reclaimable); + spin_unlock(_global_lru.lock); kref_put(_page->encl->refcount, sgx_encl_release); @@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void) static bool sgx_should_reclaim(unsigned long watermark) { return atomic_long_read(_nr_free_pages) < watermark && - !list_empty(_active_page_list); + !list_empty(_global_lru.reclaimable); } /* @@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void) ksgxd_tsk = tsk; + sgx_lru_init(_global_lru); + return true; } @@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void) */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(_reclaimer_lock); + spin_lock(_global_lru.lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(>list, _active_page_list); - spin_unlock(_reclaimer_lock); + list_add_tail(>list, _global_lru.reclaimable); + spin_unlock(_global_lru.lock); } /** @@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(_reclaimer_lock); + spin_lock(_global_lru.lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. */ if (list_empty(>list)) { - spin_unlock(_reclaimer_lock); + spin_unlock(_global_lru.lock); return -EBUSY; }
[PATCH v14 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
From: Kristen Carlson Accardi SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments. For instance, within a Kubernetes environment, while a user may specify a particular EPC quota for a pod, the orchestrator requires a mechanism to enforce that the pod's actual runtime EPC usage does not exceed the allocated quota. Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to limit and track EPC allocations per cgroup. Earlier patches have added the "sgx_epc" resource type in the misc cgroup subsystem. Add basic support in SGX driver as the "sgx_epc" resource provider: - Set "capacity" of EPC by calling misc_cg_set_capacity() - Update EPC usage counter, "current", by calling charge and uncharge APIs for EPC allocation and deallocation, respectively. - Setup sgx_epc resource type specific callbacks, which perform initialization and cleanup during cgroup allocation and deallocation, respectively. With these changes, the misc cgroup controller enables users to set a hard limit for EPC usage in the "misc.max" interface file. It reports current usage in "misc.current", the total EPC memory available in "misc.capacity", and the number of times EPC usage reached the max limit in "misc.events". For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list, only reclaimed by the global reclaimer when the total free page count is lower than a threshold. Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V13: - Remove unneeded includes. (Kai) V12: - Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko) V11: - Update copyright and format better (Kai) - Create wrappers to remove #ifdefs in c file. (Kai) - Remove unneeded comments (Kai) V10: - Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko) - Use enums instead of booleans for the parameters. (Dave, Jarkko) V8: - Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko) - Remove extra space, '_INTEL'. (Jarkko) V7: - Use a static for root cgroup (Kai) - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai) - Correct check for charge API return (Kai) - Start initialization in SGX device driver init (Kai) - Remove unneeded BUG_ON (Kai) - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai) V6: - Split the original large patch"Limit process EPC usage with misc cgroup controller" and restructure it (Kai) --- arch/x86/kernel/cpu/sgx/Makefile | 1 + arch/x86/kernel/cpu/sgx/epc_cgroup.c | 71 +++ arch/x86/kernel/cpu/sgx/epc_cgroup.h | 72 arch/x86/kernel/cpu/sgx/main.c | 42 +++- arch/x86/kernel/cpu/sgx/sgx.h| 21 include/linux/misc_cgroup.h | 2 + 6 files changed, 207 insertions(+), 2 deletions(-) create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile index 9c1656779b2a..081cb424575e 100644 --- a/arch/x86/kernel/cpu/sgx/Makefile +++ b/arch/x86/kernel/cpu/sgx/Makefile @@ -4,3 +4,4 @@ obj-y += \ ioctl.o \ main.o obj-$(CONFIG_X86_SGX_KVM) += virt.o +obj-$(CONFIG_CGROUP_MISC) += epc_cgroup.o diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c new file mode 100644 index ..5c484fd10160 --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -0,0 +1,71 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2022-2024 Intel Corporation. */ + +#include +#include "epc_cgroup.h" + +/* The root SGX EPC cgroup */ +static struct sgx_cgroup sgx_cg_root; + +/** + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page + * + * @sgx_cg:The EPC cgroup to be charged for the page. + * Return: + * * %0 - If successfully charged. + * * -errno - for failures. + */ +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +{ + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE); +} + +/** + * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page + * @sgx_cg:The charged sgx cgroup. + */ +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) +{ + misc_cg_uncharge(MISC_CG_RES_SGX_EPC,
[PATCH v14 04/14] cgroup/misc: Add SGX EPC resource type
From: Kristen Carlson Accardi Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type for the misc controller. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V12: - Remove CONFIG_CGROUP_SGX_EPC (Jarkko) V6: - Split the original patch into this and the preceding one (Kai) --- include/linux/misc_cgroup.h | 4 kernel/cgroup/misc.c| 4 2 files changed, 8 insertions(+) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 541a5611c597..440ed2bb8053 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -17,6 +17,10 @@ enum misc_res_type { MISC_CG_RES_SEV, /* AMD SEV-ES ASIDs resource */ MISC_CG_RES_SEV_ES, +#endif +#ifdef CONFIG_X86_SGX + /* SGX EPC memory resource */ + MISC_CG_RES_SGX_EPC, #endif MISC_CG_RES_TYPES }; diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 7d852139121a..863f9147602b 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = { /* AMD SEV-ES ASIDs resource */ "sev_es", #endif +#ifdef CONFIG_X86_SGX + /* Intel SGX EPC memory bytes */ + "sgx_epc", +#endif }; /* Root misc cgroup */ -- 2.25.1
[PATCH v14 02/14] cgroup/misc: Add per resource callbacks for CSS events
From: Kristen Carlson Accardi The misc cgroup controller (subsystem) currently does not perform resource type specific action for Cgroups Subsystem State (CSS) events: the 'css_alloc' event when a cgroup is created and the 'css_free' event when a cgroup is destroyed. Define callbacks for those events and allow resource providers to register the callbacks per resource type as needed. This will be utilized later by the EPC misc cgroup support implemented in the SGX driver. This design passes a struct misc_cg into the callbacks. An alternative to pass only a struct misc_res was considered for encapsulating how misc_cg stores its misc_res array. However, the SGX driver would still need to access the misc_res array in the statically defined misc root_cg during initialization to set resource specific fields. Therefore, some extra getter/setter APIs to abstract such access to the misc_res array within the misc_cg struct would be needed. That seems an overkill at the moment and is deferred to later improvement when there are more users of these callbacks. Link: https://lore.kernel.org/lkml/op.2kdw36otwjv...@hhuan26-mobl.amr.corp.intel.com/ Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V12: - Add comments in commit to clarify reason to pass in misc_cg, not misc_res. (Kai) - Remove unlikely (Kai) V8: - Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko) V7: - Make ops one per resource type and store them in array (Michal) - Rename the ops struct to misc_res_ops, and enforce the constraints of required callback functions (Jarkko) - Moved addition of priv field to patch 4 where it was used first. (Jarkko) V6: - Create ops struct for per resource callbacks (Jarkko) - Drop max_write callback (Dave, Michal) - Style fixes (Kai) --- include/linux/misc_cgroup.h | 11 + kernel/cgroup/misc.c| 82 + 2 files changed, 86 insertions(+), 7 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index e799b1f8d05b..0806d4436208 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -27,6 +27,16 @@ struct misc_cg; #include +/** + * struct misc_res_ops: per resource type callback ops. + * @alloc: invoked for resource specific initialization when cgroup is allocated. + * @free: invoked for resource specific cleanup when cgroup is deallocated. + */ +struct misc_res_ops { + int (*alloc)(struct misc_cg *cg); + void (*free)(struct misc_cg *cg); +}; + /** * struct misc_res: Per cgroup per misc type resource * @max: Maximum limit on the resource. @@ -56,6 +66,7 @@ struct misc_cg { u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount); void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount); diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 79a3717a5803..4a602d68cf7d 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -39,6 +39,9 @@ static struct misc_cg root_cg; */ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; +/* Resource type specific operations */ +static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; + /** * parent_misc() - Get the parent of the passed misc cgroup. * @cgroup: cgroup whose parent needs to be fetched. @@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) } EXPORT_SYMBOL_GPL(misc_cg_set_capacity); +/** + * misc_cg_set_ops() - set resource specific operations. + * @type: Type of the misc res. + * @ops: Operations for the given type. + * + * Context: Any context. + * Return: + * * %0 - Successfully registered the operations. + * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks. + */ +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops) +{ + if (!valid_type(type)) + return -EINVAL; + + if (!ops->alloc) { + pr_err("%s: alloc missing\n", __func__); + return -EINVAL; + } + + if (!ops->free) { + pr_err("%s: free missing\n", __func__); + return -EINVAL; + } + + misc_res_ops[type] = ops; + return 0; +} +EXPORT_SYMBOL_GPL(misc_cg_set_ops); + /** * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup. * @type: Misc res type in misc cg to cancel the charge from. @@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = { {} }; +static inline int _misc_cg_res_alloc(struct misc_cg *cg) +{ + enum misc_res_type i; + int ret; + + for (i = 0; i < MISC_CG_RES_TYPES; i++) { +
[PATCH v14 03/14] cgroup/misc: Export APIs for SGX driver
From: Kristen Carlson Accardi The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches its or ancestor's limit. This requires a walk from the current cgroup up to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to enable this walk. The SGX driver also needs start a global level reclamation from the root. Export misc_cg_root() for the SGX driver to access. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V6: - Make commit messages more concise and split the original patch into two(Kai) --- include/linux/misc_cgroup.h | 24 kernel/cgroup/misc.c| 21 - 2 files changed, 32 insertions(+), 13 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 0806d4436208..541a5611c597 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -64,6 +64,7 @@ struct misc_cg { struct misc_res res[MISC_CG_RES_TYPES]; }; +struct misc_cg *misc_cg_root(void); u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); @@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css) return css ? container_of(css, struct misc_cg, css) : NULL; } +/** + * misc_cg_parent() - Get the parent of the passed misc cgroup. + * @cgroup: cgroup whose parent needs to be fetched. + * + * Context: Any context. + * Return: + * * struct misc_cg* - Parent of the @cgroup. + * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + */ +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup) +{ + return cgroup ? css_misc(cgroup->css.parent) : NULL; +} + /* * get_current_misc_cg() - Find and get the misc cgroup of the current task. * @@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg) } #else /* !CONFIG_CGROUP_MISC */ +static inline struct misc_cg *misc_cg_root(void) +{ + return NULL; +} + +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg) +{ + return NULL; +} static inline u64 misc_cg_res_total_usage(enum misc_res_type type) { diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 4a602d68cf7d..7d852139121a 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; /** - * parent_misc() - Get the parent of the passed misc cgroup. - * @cgroup: cgroup whose parent needs to be fetched. - * - * Context: Any context. - * Return: - * * struct misc_cg* - Parent of the @cgroup. - * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + * misc_cg_root() - Return the root misc cgroup. */ -static struct misc_cg *parent_misc(struct misc_cg *cgroup) +struct misc_cg *misc_cg_root(void) { - return cgroup ? css_misc(cgroup->css.parent) : NULL; + return _cg; } +EXPORT_SYMBOL_GPL(misc_cg_root); /** * valid_type() - Check if @type is valid or not. @@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!amount) return 0; - for (i = cg; i; i = parent_misc(i)) { + for (i = cg; i; i = misc_cg_parent(i)) { res = >res[type]; new_usage = atomic64_add_return(amount, >usage); @@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) return 0; err_charge: - for (j = i; j; j = parent_misc(j)) { + for (j = i; j; j = misc_cg_parent(j)) { atomic64_inc(>res[type].events); cgroup_file_notify(>events_file); } - for (j = cg; j != i; j = parent_misc(j)) + for (j = cg; j != i; j = misc_cg_parent(j)) misc_cg_cancel_charge(type, j, amount); misc_cg_cancel_charge(type, i, amount); return ret; @@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!(amount && valid_type(type) && cg)) return; - for (i = cg; i; i = parent_misc(i)) + for (i = cg; i; i = misc_cg_parent(i)) misc_cg_cancel_charge(type, i, amount); } EXPORT_SYMBOL_GPL(misc_cg_uncharge); -- 2.25.1
[PATCH v14 01/14] x86/sgx: Replace boolean parameters with enums
Replace boolean parameters for 'reclaim' in the function sgx_alloc_epc_page() and its callers with an enum. Also opportunistically remove non-static declaration of __sgx_alloc_epc_page() and a typo Signed-off-by: Haitao Huang Suggested-by: Jarkko Sakkinen Suggested-by: Dave Hansen Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- arch/x86/kernel/cpu/sgx/encl.c | 12 ++-- arch/x86/kernel/cpu/sgx/encl.h | 4 ++-- arch/x86/kernel/cpu/sgx/ioctl.c | 10 +- arch/x86/kernel/cpu/sgx/main.c | 14 +++--- arch/x86/kernel/cpu/sgx/sgx.h | 13 +++-- arch/x86/kernel/cpu/sgx/virt.c | 2 +- 6 files changed, 32 insertions(+), 23 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..f474179b6f77 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page, struct sgx_epc_page *epc_page; int ret; - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) return epc_page; @@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma, goto err_out_unlock; } - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) { if (PTR_ERR(epc_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; goto err_out_unlock; } - va_page = sgx_encl_grow(encl, false); + va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM); if (IS_ERR(va_page)) { if (PTR_ERR(va_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; @@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) /** * sgx_alloc_va_page() - Allocate a Version Array (VA) page - * @reclaim: Reclaim EPC pages directly if none available. Enclave - * mutex should not be held if this is set. + * @reclaim: Whether reclaim EPC pages directly if none available. Enclave + * mutex should not be held for SGX_DO_RECLAIM. * * Allocate a free EPC page and convert it to a Version Array (VA) page. * @@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) * a VA page, * -errno otherwise */ -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim) +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim) { struct sgx_epc_page *epc_page; int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index f94ff14c9486..fe15ade02ca1 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl, unsigned long offset, u64 secinfo_flags); void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr); -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim); +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim); unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page); void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset); bool sgx_va_page_full(struct sgx_va_page *va_page); void sgx_encl_free_epc_page(struct sgx_epc_page *page); struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, unsigned long addr); -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim); +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim); void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page); #endif /* _X86_ENCL_H */ diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c index b65ab214bdf5..793a0ba2cb16 100644 --- a/arch/x86/kernel/cpu/sgx/ioctl.c +++ b/arch/x86/kernel/cpu/sgx/ioctl.c @@ -17,7 +17,7 @@ #include "encl.h" #include "encls.h" -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim) +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim) { struct sgx_va_page *va_page = NULL; void *err; @@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) struct file *backing; long ret; - va_page = sgx_encl_grow(encl, true); + va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM); if (IS_ERR(va_page)) return PTR_ERR(va_page); else if (va_page) @@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) encl->backing = backing; - secs_epc = sgx_alloc_epc_page(>secs, true); + secs_epc =
[PATCH v14 00/14] Add Cgroup support for SGX EPC memory
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments, e.g., support for pod level control in a Kubernates cluster on a VM or bare-metal host [1,2]. This patchset implements the support for sgx_epc memory within the misc cgroup controller. A user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per cgroup. The implementation reports current usage and events of reaching the limit per cgroup as well as the total system capacity. Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques and pages can be swapped out of the EPC to their backing store, which are normal system memory allocated via shmem and accounted by the memory controller. Similar to per-cgroup reclamation done by the memory controller, the EPC misc controller needs to implement a per-cgroup EPC reclaiming process: when the EPC usage of a cgroup reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out some EPC pages within the same cgroup to make room for new allocations. For that, this implementation tracks reclaimable EPC pages in a separate LRU list in each cgroup, and below are more details and justification of this design. Track EPC pages in per-cgroup LRUs (from Dave) -- tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes the existing LRU scanning algorithm, or to add new LRUs. The result: Add a new LRU for each cgroup and scans those instead. Replace the existing global cgroup with the root cgroup's LRU (only when this new support is compiled in, obviously). The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die. This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves. There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim. Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges. Unreclaimable Enclave Pages --- There are a variety of page types for enclaves, each serving different purposes [5]. Although the SGX architecture supports swapping for all types, some special pages, e.g., Version Array(VA) and Secure Enclave Control Structure (SECS)[5], holds meta data of reclaimed pages and enclaves. That makes reclamation of such pages more intricate to manage. The SGX driver global reclaimer currently does not swap out VA pages. It only swaps the SECS page of an enclave when all other associated pages have been swapped out. The cgroup reclaimer follows the same approach and does not track those in per-cgroup LRUs and considers them as unreclaimable pages. The allocation of these pages is counted towards the usage of a specific cgroup and is subject to the cgroup's set EPC limits. Earlier versions of this series implemented forced enclave-killing to reclaim VA and SECS pages. That was designed to enforce the 'max' limit, particularly in scenarios where a user or administrator reduces this limit post-launch of enclaves. However, subsequent discussions [3, 4] indicated that such preemptive enforcement is not necessary for the misc-controllers. Therefore, reclaiming SECS/VA pages by force-killing enclaves were removed, and the limit is only enforced at the time of new EPC allocation request. When a cgroup hits its limit but nothing left in the LRUs of the subtree, i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC within that cgroup will result in an 'ENOMEM'. Unreclaimable Guest VM EPC Pages The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats those as unreclaimable and returns ENOMEM when its limit is hit and nothing reclaimable left within the cgroup. The virtual EPC driver
Re: [PATCH] mctp i2c: Add rx trace
Hi Tal, kernel test robot noticed the following build errors: [auto build test ERROR on linus/master] [also build test ERROR on horms-ipvs/master v6.10-rc1 next-20240531] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Tal-Yacobi/mctp-i2c-Add-rx-trace/20240528-223555 base: linus/master patch link: https://lore.kernel.org/r/20240528143420.742611-1-talycb8%40gmail.com patch subject: [PATCH] mctp i2c: Add rx trace config: hexagon-randconfig-r052-20240531 (https://download.01.org/0day-ci/archive/20240601/202406010530.skassbs4-...@intel.com/config) compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project bafda89a0944d947fc4b3b5663185e07a397ac30) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240601/202406010530.skassbs4-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406010530.skassbs4-...@intel.com/ All errors (new ones prefixed by >>, old ones prefixed by <<): WARNING: modpost: missing MODULE_DESCRIPTION() in vmlinux.o WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/locking/locktorture.o WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcutorture.o WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcuscale.o WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/refscale.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp855.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp857.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp863.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp864.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp865.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp869.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp936.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_ascii.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-5.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-7.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-9.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-15.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-celtic.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-inuit.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-romanian.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/jbd2/jbd2.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/fat/fat.o WARNING: modpost: missing MODULE_DESCRIPTION() in crypto/xor.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/math/rational.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/crypto/libarc4.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/zlib_inflate/zlib_inflate.o WARNING: modpost: missing MODULE_DESCRIPTION() in lib/zlib_deflate/zlib_deflate.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/video/backlight/rt4831-backlight.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/video/fbdev/vfb.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/regulator/rt4831-regulator.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/reset/hisilicon/hi6220_reset.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/tty/goldfish.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/lp.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/ppdev.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpu/drm/gud/gud.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpu/drm/drm_panel_orientation_quirks.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpu/drm/udl/udl.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/base/regmap/regmap-spmi.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/base/regmap/regmap-w1.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/mfd/pcf50633-gpio.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/mfd/rt4831.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/dax/dax.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/host/ohci-exynos.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/host/xhci-pci-renesas.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/serial/usb_debug.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/serial/navman.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/input/matrix-keymap.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/i2c/busses/i2c-qup.o WARNING: modpost: missing MODULE_DESCRIPTION() in driv
Re: [PATCH 0/2] livepatch: Add compiler optimization disclaimer/docs
On 5/31/24 07:23, Miroslav Benes wrote: > Hi, > > On Tue, 21 Jul 2020, Joe Lawrence wrote: > >> In light of [PATCH] Revert "kbuild: use -flive-patching when >> CONFIG_LIVEPATCH is enabled" [1], we should add some loud disclaimers >> and explanation of the impact compiler optimizations have on >> livepatching. >> >> The first commit provides detailed explanations and examples. The list >> was taken mostly from Miroslav's LPC talk a few years back. This is a >> bit rough, so corrections and additional suggestions welcome. Expanding >> upon the source-based patching approach would be helpful, too. >> >> The second commit adds a small README.rst file in the livepatch samples >> directory pointing the reader to the doc introduced in the first commit. >> >> I didn't touch the livepatch kselftests yet as I'm still unsure about >> how to best account for IPA here. We could add the same README.rst >> disclaimer here, too, but perhaps we have a chance to do something more. >> Possibilities range from checking for renamed functions as part of their >> build, or the selftest scripts, or even adding something to the kernel >> API. I think we'll have a better idea after reviewing the compiler >> considerations doc. > > thanks to Marcos for resurrecting this. > > Joe, do you have an updated version by any chance? Some things have > changed since July 2020 so it calls for a new review. If there was an > improved version, it would be easier. If not, no problem at all. > Yea, it's been a little while :) I don't have any newer version than this one. I can rebase, apply all of the v1 suggestions, and see where it stands. LMK if you can think of any specifics that could be added. For example, CONFIG_KERNEL_IBT will be driving some changes soon, whether it be klp-convert for source-based patches or vmlinux.o binary comparison for kpatch-build. I can push a v2 with a few changes, but IIRC, last time we reviewed this, it kinda begged the question of how someone is creating the livepatch in the first place. As long as we're fine holding that thought for a while longer, this doc may still be useful by itself. -- Joe
Re: [PATCH] livepatch: introduce klp_func called interface
On Tue, May 21, 2024 at 08:34:46AM +0200, Miroslav Benes wrote: > Hello, > > On Mon, 20 May 2024, zhang warden wrote: > > > > > > > > On May 20, 2024, at 14:46, Miroslav Benes wrote: > > > > > > Hi, > > > > > > On Mon, 20 May 2024, Wardenjohn wrote: > > > > > >> Livepatch module usually used to modify kernel functions. > > >> If the patched function have bug, it may cause serious result > > >> such as kernel crash. > > >> > > >> This is a kobject attribute of klp_func. Sysfs interface named > > >> "called" is introduced to livepatch which will be set as true > > >> if the patched function is called. > > >> > > >> /sys/kernel/livepatchcalled > > >> > > >> This value "called" is quite necessary for kernel stability > > >> assurance for livepatching module of a running system. > > >> Testing process is important before a livepatch module apply to > > >> a production system. With this interface, testing process can > > >> easily find out which function is successfully called. > > >> Any testing process can make sure they have successfully cover > > >> all the patched function that changed with the help of this interface. > > > > > > Even easier is to use the existing tracing infrastructure in the kernel > > > (ftrace for example) to track the new function. You can obtain much more > > > information with that than the new attribute provides. > > > > > > Regards, > > > Miroslav > > Hi Miroslav > > > > First, in most cases, testing process is should be automated, which make > > using existing tracing infrastructure inconvenient. > > could you elaborate, please? We use ftrace exactly for this purpose and > our testing process is also more or less automated. > > > Second, livepatch is > > already use ftrace for functional replacement, I don’t think it is a > > good choice of using kernel tracing tool to trace a patched function. > > Why? > > > At last, this attribute can be thought of as a state of a livepatch > > function. It is a state, like the "patched" "transition" state of a > > klp_patch. Adding this state will not break the state consistency of > > livepatch. > > Yes, but the information you get is limited compared to what is available > now. You would obtain the information that a patched function was called > but ftrace could also give you the context and more. > > It would not hurt to add a new attribute per se but I am wondering about > the reason and its usage. Once we have it, the commit message should be > improved based on that. > I agree with Miroslav about using ftrace, or Dan in the other thread about using gcov if even more granular coverage is needed. Admittedly, I would have to go find documentation to remember how to do either, but at least I'd be using well-tested facilities designed for this exact purpose. Adding these attributes to livepatch sysfs would be expedient and probably easier for us to use, but imposes a recurring burden on us to maintain and test (where is the documentation and kselftest for this new interface?). Or, we could let the other tools handle all of that for us. Perhaps if someone already has an off-the-shelf script that is using ftrace to monitor livepatched code, it could be donated to Documentation/livepatch/? I can ask our QE folks if they have something like this. Regards, -- Joe
Re: [PATCH 02/12] wifi: wcn36xx: make use of QCOM_FW_HELPER
Dmitry Baryshkov writes: > Make the driver use qcom_fw_helper to autodetect the path to the > calibration data file. > > Signed-off-by: Dmitry Baryshkov Not a fan of one sentence commit messages. It would be nice to explain a bit more in the commit message, for instance answering to the question 'why?' and maybe provide a short example how this is supposed to work? -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH 01/12] soc: qcom: add firmware name helper
Bjorn Andersson writes: > On Mon, May 27, 2024 at 02:42:44PM GMT, Dmitry Baryshkov wrote: > >> On Thu, 23 May 2024 at 01:48, Bjorn Andersson >> wrote: >> > >> > On Tue, May 21, 2024 at 03:08:31PM +0200, Dmitry Baryshkov wrote: >> > > On Tue, 21 May 2024 at 13:20, Kalle Valo wrote: >> > > > >> > > > Dmitry Baryshkov writes: >> > > > >> > > > > On Tue, 21 May 2024 at 12:52, wrote: >> > > > >> >> > > > >> On 21/05/2024 11:45, Dmitry Baryshkov wrote: >> > > > >> > Qualcomm platforms have different sets of the firmware files, >> > > > >> > which >> > > > >> > differ from platform to platform (and from board to board, due to >> > > > >> > the >> > > > >> > embedded signatures). Rather than listing all the firmware files, >> > > > >> > including full paths, in the DT, provide a way to determine >> > > > >> > firmware >> > > > >> > path based on the root DT node compatible. >> > > > >> >> > > > >> Ok this looks quite over-engineered but necessary to handle the >> > > > >> legacy, >> > > > >> but I really think we should add a way to look for a board-specific >> > > > >> path >> > > > >> first and fallback to those SoC specific paths. >> > > > > >> > > > > Again, CONFIG_FW_LOADER_USER_HELPER => delays. >> > > > >> > > > To me this also looks like very over-engineered, can you elaborate more >> > > > why this is needed? Concrete examples would help to understand better. >> > > >> > > Sure. During the meeting last week Arnd suggested evaluating if we can >> > > drop firmware-name from the board DT files. Several reasons for that: >> > > - DT should describe the hardware, not the Linux-firmware locations >> > > - having firmware name in DT complicates updating the tree to use >> > > different firmware API (think of mbn vs mdt vs any other format) >> > > - If the DT gets supplied by the vendor (e.g. for >> > > SystemReady-certified devices), there should be a sync between the >> > > vendor's DT, linux kernel and the rootfs. Dropping firmware names from >> > > DT solves that by removing one piece of the equation >> > > >> > > Now for the complexity of the solution. Each SoC family has their own >> > > firmware set. This includes firmware for the DSPs, for modem, WiFi >> > > bits, GPU shader, etc. >> > > For the development boards these devices are signed by the testing key >> > > and the actual signature is not validated against the root of trust >> > > certificate. >> > > For the end-user devices the signature is actually validated against >> > > the bits fused to the SoC during manufacturing process. CA certificate >> > > (and thus the fuses) differ from vendor to vendor (and from the device >> > > to device) >> > > >> > > Not all of the firmware files are a part of the public linux-firmware >> > > tree. However we need to support the rootfs bundled with the firmware >> > > for different platforms (both public and vendor). The non-signed files >> > > come from the Adreno GPU and can be shared between platforms. All >> > > other files are SoC-specific and in some cases device-specific. >> > > >> > > So for example the SDM845 db845c (open device) loads following firmware >> > > files: >> > > Not signed: >> > > - qcom/a630_sqe.fw >> > > - qcom/a630_gmu.bin >> > > >> > > Signed, will work for any non-secured sdm845 device: >> > > - qcom/sdm845/a630_zap.mbn >> > > - qcom/sdm845/adsp.mbn >> > > - qcom/sdm845/cdsp.mbn >> > > - qcom/sdm485/mba.mbn >> > > - qcom/sdm845/modem.mbn >> > > - qcom/sdm845/wlanmdsp.mbn (loaded via TQFTP) >> > > - qcom/venus-5.2/venus.mbn >> > > >> > > Signed, works only for DB845c. >> > > - qcom/sdm845/Thundercomm/db845c/slpi.mbn >> > > >> > > In comparison, the SDM845 Pixel-3 phone (aka blueline) should load the >> > > following firmware files: >> > > - qcom/a630_sqe.fw (the same, non-signed file) >> > > - qcom/a630_gmu.bin (the same, non-signed file) >> > > - qcom/sdm845/Google/blueline/a630_zap.mbn >> > >> > How do you get from "a630_zap.mbn" to this? By extending the lookup >> > table for every target, or what am I missing? >> >> More or less so. Matching the root OF node gives us the firmware >> location, then it gets prepended to all firmware targets. Not an ideal >> solution, as there is no fallback support, but at least it gives us >> some points to discuss (and to decide whether to move to some >> particular direction or to abandon the idea completely, making Arnd >> unhappy again). >> > > I understand the desire to not put linux-firmware-specific paths in the > DeviceTree Me too. > but I think I'm less keen on having a big lookup table in the > kernel... Yeah, also for me this feels wrong. But on the other hand I don't have anything better to suggest either... -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCHv7 bpf-next 0/9] uprobe: uretprobe speed up
On Thu, May 23, 2024 at 5:11 AM Jiri Olsa wrote: > > hi, > as part of the effort on speeding up the uprobes [0] coming with > return uprobe optimization by using syscall instead of the trap > on the uretprobe trampoline. > > The speed up depends on instruction type that uprobe is installed > and depends on specific HW type, please check patch 1 for details. > > Patches 1-8 are based on bpf-next/master, but patch 2 and 3 are > apply-able on linux-trace.git tree probes/for-next branch. > Patch 9 is based on man-pages master. > > v7 changes: > - fixes in man page [Alejandro Colomar] > - fixed patch #1 fixes tag [Oleg] > > Also available at: > https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git > uretprobe_syscall > > thanks, > jirka > > > Notes to check list items in Documentation/process/adding-syscalls.rst: > > - System Call Alternatives > New syscall seems like the best way in here, because we need > just to quickly enter kernel with no extra arguments processing, > which we'd need to do if we decided to use another syscall. > > - Designing the API: Planning for Extension > The uretprobe syscall is very specific and most likely won't be > extended in the future. > > At the moment it does not take any arguments and even if it does > in future, it's allowed to be called only from trampoline prepared > by kernel, so there'll be no broken user. > > - Designing the API: Other Considerations > N/A because uretprobe syscall does not return reference to kernel > object. > > - Proposing the API > Wiring up of the uretprobe system call is in separate change, > selftests and man page changes are part of the patchset. > > - Generic System Call Implementation > There's no CONFIG option for the new functionality because it > keeps the same behaviour from the user POV. > > - x86 System Call Implementation > It's 64-bit syscall only. > > - Compatibility System Calls (Generic) > N/A uretprobe syscall has no arguments and is not supported > for compat processes. > > - Compatibility System Calls (x86) > N/A uretprobe syscall is not supported for compat processes. > > - System Calls Returning Elsewhere > N/A. > > - Other Details > N/A. > > - Testing > Adding new bpf selftests and ran ltp on top of this change. > > - Man Page > Attached. > > - Do not call System Calls in the Kernel > N/A. > > > [0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/ > --- > Jiri Olsa (8): > x86/shstk: Make return uprobe work with shadow stack > uprobe: Wire up uretprobe system call > uprobe: Add uretprobe syscall to speed up return probe > selftests/x86: Add return uprobe shadow stack test > selftests/bpf: Add uretprobe syscall test for regs integrity > selftests/bpf: Add uretprobe syscall test for regs changes > selftests/bpf: Add uretprobe syscall call from user space test > selftests/bpf: Add uretprobe shadow stack test > Masami, Steven, It seems like the series is ready to go in. Are you planning to take the first 4 patches through your linux-trace tree? > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/x86/include/asm/shstk.h| 4 + > arch/x86/kernel/shstk.c | 16 > arch/x86/kernel/uprobes.c | 124 > - > include/linux/syscalls.h| 2 + > include/linux/uprobes.h | 3 + > include/uapi/asm-generic/unistd.h | 5 +- > kernel/events/uprobes.c | 24 -- > kernel/sys_ni.c | 2 + > tools/include/linux/compiler.h | 4 + > tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 123 > - > tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 385 > +++ > tools/testing/selftests/bpf/progs/uprobe_syscall.c | 15 > tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c | 17 > tools/testing/selftests/x86/test_shadow_stack.c | 145 > ++ > 15 files changed, 860 insertions(+), 10 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c > create mode 100644 > tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c > > Jiri Olsa (1): > man2: Add uretprobe syscall page > > man/man2/uretprobe.2 | 56 > > 1 file changed, 56 insertions(+) > create mode 100644 man/man2/uretprobe.2
[PATCH v7 5/5] MAINTAINERS: add myself for Marvell 88PM886 PMIC
Add an entry to MAINTAINERS for the Marvell 88PM886 PMIC MFD, onkey and regulator drivers. Signed-off-by: Karel Balej --- Notes: RFC v3: - Remove onkey bindings file. RFC v2: - Only mention 88PM886 in the commit message. - Add regulator driver. - Rename the entry. MAINTAINERS | 9 + 1 file changed, 9 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index d6c90161c7bf..9d6c940029b8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13295,6 +13295,15 @@ F: drivers/net/dsa/mv88e6xxx/ F: include/linux/dsa/mv88e6xxx.h F: include/linux/platform_data/mv88e6xxx.h +MARVELL 88PM886 PMIC DRIVER +M: Karel Balej +S: Maintained +F: Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml +F: drivers/input/misc/88pm886-onkey.c +F: drivers/mfd/88pm886.c +F: drivers/regulators/88pm886-regulator.c +F: include/linux/mfd/88pm886.h + MARVELL ARMADA 3700 PHY DRIVERS M: Miquel Raynal S: Maintained -- 2.45.1
[PATCH v7 3/5] regulator: add regulators driver for Marvell 88PM886 PMIC
Support the LDO and buck regulators of the Marvell 88PM886 PMIC. Signed-off-by: Karel Balej --- Notes: v7: - Address Mark's feedback: - Drop get_current_limit op, max_uA values and thus unneeded struct pm886_regulator and adapt the code accordingly. v6: - Remove all definitions (now present in the header). v5: - Add remaining regulators. - Clean up includes. - Address Mark's feedback: - Use dedicated regmap config. RFC v4: - Initialize regulators regmap in the regulators driver. - Register all regulators at once. - Drop regulator IDs. - Add missing '\n' to dev_err_probe message. - Fix includes. - Add ID table. RFC v3: - Do not have a variable for each regulator -- define them all in the pm886_regulators array. - Use new regulators regmap index name. - Use dev_err_probe. RFC v2: - Drop of_compatible and related code. - Drop unused include. - Remove some abstraction: use only one regmap for all regulators and only mention 88PM886 in Kconfig description. - Reword commit message. drivers/regulator/88pm886-regulator.c | 392 ++ drivers/regulator/Kconfig | 6 + drivers/regulator/Makefile| 1 + 3 files changed, 399 insertions(+) create mode 100644 drivers/regulator/88pm886-regulator.c diff --git a/drivers/regulator/88pm886-regulator.c b/drivers/regulator/88pm886-regulator.c new file mode 100644 index ..a38bd4f312b7 --- /dev/null +++ b/drivers/regulator/88pm886-regulator.c @@ -0,0 +1,392 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include + +#include + +static const struct regmap_config pm886_regulator_regmap_config = { + .reg_bits = 8, + .val_bits = 8, + .max_register = PM886_REG_BUCK5_VOUT, +}; + +static const struct regulator_ops pm886_ldo_ops = { + .list_voltage = regulator_list_voltage_table, + .map_voltage = regulator_map_voltage_iterate, + .set_voltage_sel = regulator_set_voltage_sel_regmap, + .get_voltage_sel = regulator_get_voltage_sel_regmap, + .enable = regulator_enable_regmap, + .disable = regulator_disable_regmap, + .is_enabled = regulator_is_enabled_regmap, +}; + +static const struct regulator_ops pm886_buck_ops = { + .list_voltage = regulator_list_voltage_linear_range, + .map_voltage = regulator_map_voltage_linear_range, + .set_voltage_sel = regulator_set_voltage_sel_regmap, + .get_voltage_sel = regulator_get_voltage_sel_regmap, + .enable = regulator_enable_regmap, + .disable = regulator_disable_regmap, + .is_enabled = regulator_is_enabled_regmap, +}; + +static const unsigned int pm886_ldo_volt_table1[] = { + 170, 180, 190, 250, 280, 290, 310, 330, +}; + +static const unsigned int pm886_ldo_volt_table2[] = { + 120, 125, 170, 180, 185, 190, 250, 260, + 270, 275, 280, 285, 290, 300, 310, 330, +}; + +static const unsigned int pm886_ldo_volt_table3[] = { + 170, 180, 190, 200, 210, 250, 270, 280, +}; + +static const struct linear_range pm886_buck_volt_ranges1[] = { + REGULATOR_LINEAR_RANGE(60, 0, 79, 12500), + REGULATOR_LINEAR_RANGE(160, 80, 84, 5), +}; + +static const struct linear_range pm886_buck_volt_ranges2[] = { + REGULATOR_LINEAR_RANGE(60, 0, 79, 12500), + REGULATOR_LINEAR_RANGE(160, 80, 114, 5), +}; + +static struct regulator_desc pm886_regulators[] = { + { + .name = "LDO1", + .regulators_node = "regulators", + .of_match = "ldo1", + .ops = _ldo_ops, + .type = REGULATOR_VOLTAGE, + .enable_reg = PM886_REG_LDO_EN1, + .enable_mask = BIT(0), + .volt_table = pm886_ldo_volt_table1, + .n_voltages = ARRAY_SIZE(pm886_ldo_volt_table1), + .vsel_reg = PM886_REG_LDO1_VOUT, + .vsel_mask = PM886_LDO_VSEL_MASK, + }, + { + .name = "LDO2", + .regulators_node = "regulators", + .of_match = "ldo2", + .ops = _ldo_ops, + .type = REGULATOR_VOLTAGE, + .enable_reg = PM886_REG_LDO_EN1, + .enable_mask = BIT(1), + .volt_table = pm886_ldo_volt_table1, + .n_voltages = ARRAY_SIZE(pm886_ldo_volt_table1), + .vsel_reg = PM886_REG_LDO2_VOUT, + .vsel_mask = PM886_LDO_VSEL_MASK, + }, + { + .name = "LDO3", + .regulators_node = "regulators", + .of_match = "ldo3", + .ops = _ldo_ops, + .type = REGULATOR_VOLTAGE, + .enable_reg = PM886_REG_LDO_EN1, + .enable_mask =
[PATCH v7 4/5] input: add onkey driver for Marvell 88PM886 PMIC
Marvell 88PM886 PMIC provides onkey among other things. Add client driver to handle it. The driver currently only provides a basic support omitting additional functions found in the vendor version, such as long onkey and GPIO integration. Acked-by: Dmitry Torokhov Signed-off-by: Karel Balej --- Notes: v5: - Remove kernel.h include. RFC v4: - Reflect MFD driver changes: - chip->regmaps[...] -> chip->regmap - Address Dmitry's feedback: - Add ID table. - Add Ack. RFC v3: - Drop wakeup-source. RFC v2: - Address Dmitry's feedback: - Sort includes alphabetically. - Drop onkey->irq. - ret -> err in irq_handler and no initialization. - Break long lines and other formatting. - Do not clobber platform_get_irq error. - Do not set device parent manually. - Use input_set_capability. - Use the wakeup-source DT property. - Drop of_match_table. - Use more temporaries. - Use dev_err_probe. - Modify Kconfig description. drivers/input/misc/88pm886-onkey.c | 98 ++ drivers/input/misc/Kconfig | 7 +++ drivers/input/misc/Makefile| 1 + 3 files changed, 106 insertions(+) create mode 100644 drivers/input/misc/88pm886-onkey.c diff --git a/drivers/input/misc/88pm886-onkey.c b/drivers/input/misc/88pm886-onkey.c new file mode 100644 index ..284ff5190b6e --- /dev/null +++ b/drivers/input/misc/88pm886-onkey.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include + +#include + +struct pm886_onkey { + struct input_dev *idev; + struct pm886_chip *chip; +}; + +static irqreturn_t pm886_onkey_irq_handler(int irq, void *data) +{ + struct pm886_onkey *onkey = data; + struct regmap *regmap = onkey->chip->regmap; + struct input_dev *idev = onkey->idev; + struct device *parent = idev->dev.parent; + unsigned int val; + int err; + + err = regmap_read(regmap, PM886_REG_STATUS1, ); + if (err) { + dev_err(parent, "Failed to read status: %d\n", err); + return IRQ_NONE; + } + val &= PM886_ONKEY_STS1; + + input_report_key(idev, KEY_POWER, val); + input_sync(idev); + + return IRQ_HANDLED; +} + +static int pm886_onkey_probe(struct platform_device *pdev) +{ + struct pm886_chip *chip = dev_get_drvdata(pdev->dev.parent); + struct device *dev = >dev; + struct pm886_onkey *onkey; + struct input_dev *idev; + int irq, err; + + onkey = devm_kzalloc(dev, sizeof(*onkey), GFP_KERNEL); + if (!onkey) + return -ENOMEM; + + onkey->chip = chip; + + irq = platform_get_irq(pdev, 0); + if (irq < 0) + return dev_err_probe(dev, irq, "Failed to get IRQ\n"); + + idev = devm_input_allocate_device(dev); + if (!idev) { + dev_err(dev, "Failed to allocate input device\n"); + return -ENOMEM; + } + onkey->idev = idev; + + idev->name = "88pm886-onkey"; + idev->phys = "88pm886-onkey/input0"; + idev->id.bustype = BUS_I2C; + + input_set_capability(idev, EV_KEY, KEY_POWER); + + err = devm_request_threaded_irq(dev, irq, NULL, pm886_onkey_irq_handler, + IRQF_ONESHOT | IRQF_NO_SUSPEND, "onkey", + onkey); + if (err) + return dev_err_probe(dev, err, "Failed to request IRQ\n"); + + err = input_register_device(idev); + if (err) + return dev_err_probe(dev, err, "Failed to register input device\n"); + + return 0; +} + +static const struct platform_device_id pm886_onkey_id_table[] = { + { "88pm886-onkey", }, + { } +}; +MODULE_DEVICE_TABLE(platform, pm886_onkey_id_table); + +static struct platform_driver pm886_onkey_driver = { + .driver = { + .name = "88pm886-onkey", + }, + .probe = pm886_onkey_probe, + .id_table = pm886_onkey_id_table, +}; +module_platform_driver(pm886_onkey_driver); + +MODULE_DESCRIPTION("Marvell 88PM886 onkey driver"); +MODULE_AUTHOR("Karel Balej "); +MODULE_LICENSE("GPL"); diff --git a/drivers/input/misc/Kconfig b/drivers/input/misc/Kconfig index 6ba984d7f0b1..16a079d9f0f2 100644 --- a/drivers/input/misc/Kconfig +++ b/drivers/input/misc/Kconfig @@ -33,6 +33,13 @@ config INPUT_88PM80X_ONKEY To compile this driver as a module, choose M here: the module will be called 88pm80x_onkey. +config INPUT_88PM886_ONKEY + tristate "Marvell 88PM886 onkey support" + depends on MFD_88PM886_PMIC + help + Support the onkey of Marvell 88PM886 PMIC as an input device + reporting power button status. + config INPUT_AB8500_PONKEY tristate "AB8500 Pon (PowerOn) Key" depends on AB8500_CORE diff --git
[PATCH v7 2/5] mfd: add driver for Marvell 88PM886 PMIC
Marvell 88PM886 is a PMIC which provides various functions such as onkey, battery, charger and regulators. It is found for instance in the samsung,coreprimevelte smartphone with which this was tested. Implement basic support to allow for the use of regulators and onkey. Signed-off-by: Karel Balej --- Notes: v6: - Address Lee's comments: - Don't break long line in the power off handler. - Set PLATFORM_DEVID_NONE. This should be safe with respect to collisions since there are no known devices with more than one of these PMICs, plus given their general purpose nature it is unlikely that there would ever be. Also include the corresponding header. - Move all defines to the header. - Give the base page's maximum register its real name. - Set irq_base to 0. v5: - Address Mark's feedback: - Move regmap config back out of the header and rename it. Also lower its maximum register based on what's actually used in the downstream code. RFC v4: - Use MFD_CELL_* macros. - Address Lee's feedback: - Do not define regmap_config.val_bits and .reg_bits. - Drop everything regulator related except mfd_cell (regmap initialization, IDs enum etc.). Drop pm886_initialize_subregmaps. - Do not store regmap pointers as an array as there is now only one regmap. Also drop the corresponding enum. - Move regmap_config to the header as it is needed in the regulators driver. - pm886_chip.whoami -> chip_id - Reword chip ID mismatch error message and print the ID as hexadecimal. - Fix includes in include/linux/88pm886.h. - Drop the pm886_irq_number enum and define the (for the moment) only IRQ explicitly. - Have only one MFD cell for all regulators as they are now registered all at once in the regulators driver. - Reword commit message. - Make device table static and remove comma after the sentinel to signal that nothing should come after it. RFC v3: - Drop onkey cell .of_compatible. - Rename LDO page offset and regmap to REGULATORS. RFC v2: - Remove some abstraction. - Sort includes alphabetically and add linux/of.h. - Depend on OF, remove of_match_ptr and add MODULE_DEVICE_TABLE. - Use more temporaries and break long lines. - Do not initialize ret in probe. - Use the wakeup-source DT property. - Rename ret to err. - Address Lee's comments: - Drop patched in presets for base regmap and related defines. - Use full sentences in comments. - Remove IRQ comment. - Define regmap_config member values. - Rename data to sys_off_data. - Add _PMIC suffix to Kconfig. - Use dev_err_probe. - Do not store irq_data. - s/WHOAMI/CHIP_ID - Drop LINUX part of include guard name. - Merge in the regulator series modifications in order to have more devices and modify the commit message accordingly. Changes with respect to the original regulator series patches: - ret -> err - Add temporary for dev in pm88x_initialize_subregmaps. - Drop of_compatible for the regulators. - Do not duplicate LDO regmap for bucks. - Rewrite commit message. Friday late afternoon counts as a weekend too... drivers/mfd/88pm886.c | 148 drivers/mfd/Kconfig | 12 +++ drivers/mfd/Makefile| 1 + include/linux/mfd/88pm886.h | 69 + 4 files changed, 230 insertions(+) create mode 100644 drivers/mfd/88pm886.c create mode 100644 include/linux/mfd/88pm886.h diff --git a/drivers/mfd/88pm886.c b/drivers/mfd/88pm886.c new file mode 100644 index ..dbe9efc027d2 --- /dev/null +++ b/drivers/mfd/88pm886.c @@ -0,0 +1,148 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +static const struct regmap_config pm886_regmap_config = { + .reg_bits = 8, + .val_bits = 8, + .max_register = PM886_REG_RTC_SPARE6, +}; + +static struct regmap_irq pm886_regmap_irqs[] = { + REGMAP_IRQ_REG(PM886_IRQ_ONKEY, 0, PM886_INT_ENA1_ONKEY), +}; + +static struct regmap_irq_chip pm886_regmap_irq_chip = { + .name = "88pm886", + .irqs = pm886_regmap_irqs, + .num_irqs = ARRAY_SIZE(pm886_regmap_irqs), + .num_regs = 4, + .status_base = PM886_REG_INT_STATUS1, + .ack_base = PM886_REG_INT_STATUS1, + .unmask_base = PM886_REG_INT_ENA_1, +}; + +static struct resource pm886_onkey_resources[] = { + DEFINE_RES_IRQ_NAMED(PM886_IRQ_ONKEY, "88pm886-onkey"), +}; + +static struct mfd_cell pm886_devs[] = { + MFD_CELL_RES("88pm886-onkey", pm886_onkey_resources), + MFD_CELL_NAME("88pm886-regulator"), +}; + +static int pm886_power_off_handler(struct sys_off_data *sys_off_data) +{ + struct pm886_chip *chip
[PATCH v7 1/5] dt-bindings: mfd: add entry for Marvell 88PM886 PMIC
Marvell 88PM886 is a PMIC with several subdevices such as onkey, regulators or battery and charger. It comes in at least two revisions, A0 and A1 -- only A1 is described here at the moment. Reviewed-by: Krzysztof Kozlowski Signed-off-by: Karel Balej --- Notes: RFC v4: - Address Krzysztof's comments: - Fix regulators indentation. - Add Krzysztof's trailer. RFC v3: - Add wakeup-source property. - Address Rob's feedback: - Move regulators into the MFD file. - Remove interrupt-controller and #interrupt-cells properties. RFC v2: - Address Rob's feedback: - Drop mention of 88PM880. - Make sure the file passes bindings check (add the necessary header and fix `interrupt-cells`). - Other small changes. - Add regulators. Changes with respect to the regulator RFC series: - Address Krzysztof's comments: - Drop unused compatible. - Fix sub-node pattern. .../bindings/mfd/marvell,88pm886-a1.yaml | 76 +++ 1 file changed, 76 insertions(+) create mode 100644 Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml diff --git a/Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml b/Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml new file mode 100644 index ..d6a71c912b76 --- /dev/null +++ b/Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/mfd/marvell,88pm886-a1.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Marvell 88PM886 PMIC core + +maintainers: + - Karel Balej + +description: + Marvell 88PM886 is a PMIC providing several functions such as onkey, + regulators or battery and charger. + +properties: + compatible: +const: marvell,88pm886-a1 + + reg: +maxItems: 1 + + interrupts: +maxItems: 1 + + wakeup-source: true + + regulators: +type: object +additionalProperties: false +patternProperties: + "^(ldo(1[0-6]|[1-9])|buck[1-5])$": +type: object +$ref: /schemas/regulator/regulator.yaml# +description: LDO or buck regulator. +unevaluatedProperties: false + +required: + - compatible + - reg + - interrupts + +additionalProperties: false + +examples: + - | +#include +i2c { + #address-cells = <1>; + #size-cells = <0>; + pmic@30 { +compatible = "marvell,88pm886-a1"; +reg = <0x30>; +interrupts = <0 4 IRQ_TYPE_LEVEL_HIGH>; +interrupt-parent = <>; +wakeup-source; + +regulators { + ldo2: ldo2 { +regulator-min-microvolt = <310>; +regulator-max-microvolt = <330>; + }; + + ldo15: ldo15 { +regulator-min-microvolt = <330>; +regulator-max-microvolt = <330>; + }; + + buck2: buck2 { +regulator-min-microvolt = <180>; +regulator-max-microvolt = <180>; + }; +}; + }; +}; +... -- 2.45.1
[PATCH v7 0/5] initial support for Marvell 88PM886 PMIC
Hello, the following implements basic support for Marvell's 88PM886 PMIC which is found for instance as a component of the samsung,coreprimevelte smartphone which inspired this and also serves as a testing platform. The code for the MFD is based primarily on this old series [1] with the addition of poweroff based on the smartphone's downstream kernel tree [2]. The onkey and regulators drivers are based on the latter. I am not in possesion of the datasheet. [1] https://lore.kernel.org/all/1434098601-3498-1-git-send-email-yizh...@marvell.com/ [2] https://github.com/CoderCharmander/g361f-kernel Thank you, kind regards, K. B. --- v7: - Rebase to v6.10-rc1. - v6: https://lore.kernel.org/r/20240504194632.2456-1-bal...@matfyz.cz/ v6: - Rebase to v6.9-rc6. - Fix patchset versioning: the previous version was marked as v1 because I considered RFC to be its own thing. Thank you to Krzysztof for explaining that that is not the case. The previous version is thus now marked as v5 and this is v6, sorry for any confusion. - v5: https://lore.kernel.org/r/20240331105608.7338-2-bal...@matfyz.cz/ v5: - RFC v4: https://lore.kernel.org/r/20240311160110.32185-1-kar...@gimli.ms.mff.cuni.cz/ - Rebase to v6.9-rc1. - Thank you to everybody for their feedback on the RFC! RFC v4: - RFC v3: https://lore.kernel.org/all/20240303101506.4187-1-kar...@gimli.ms.mff.cuni.cz/ RFC v3: - Address Rob's feedback: - Drop onkey bindings patch. - Rename PM88X -> PM886 everywhere. - RFC v2: https://lore.kernel.org/all/20240211094609.2223-1-kar...@gimli.ms.mff.cuni.cz/ RFC v2: - Merge with the regulators series to have multiple devices and thus justify the use of the MFD framework. - Rebase on v6.8-rc3. - Reorder patches. - MFD RFC v1: https://lore.kernel.org/all/20231217131838.7569-1-kar...@gimli.ms.mff.cuni.cz/ - regulators RFC v1: https://lore.kernel.org/all/20231228100208.2932-1-kar...@gimli.ms.mff.cuni.cz/ Karel Balej (5): dt-bindings: mfd: add entry for Marvell 88PM886 PMIC mfd: add driver for Marvell 88PM886 PMIC regulator: add regulators driver for Marvell 88PM886 PMIC input: add onkey driver for Marvell 88PM886 PMIC MAINTAINERS: add myself for Marvell 88PM886 PMIC .../bindings/mfd/marvell,88pm886-a1.yaml | 76 MAINTAINERS | 9 + drivers/input/misc/88pm886-onkey.c| 98 + drivers/input/misc/Kconfig| 7 + drivers/input/misc/Makefile | 1 + drivers/mfd/88pm886.c | 148 +++ drivers/mfd/Kconfig | 12 + drivers/mfd/Makefile | 1 + drivers/regulator/88pm886-regulator.c | 392 ++ drivers/regulator/Kconfig | 6 + drivers/regulator/Makefile| 1 + include/linux/mfd/88pm886.h | 69 +++ 12 files changed, 820 insertions(+) create mode 100644 Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml create mode 100644 drivers/input/misc/88pm886-onkey.c create mode 100644 drivers/mfd/88pm886.c create mode 100644 drivers/regulator/88pm886-regulator.c create mode 100644 include/linux/mfd/88pm886.h -- 2.45.1
Re: [PATCH v5 5/7] remoteproc: core: support of the tee interface
On Thu, May 30, 2024 at 09:42:26AM +0200, Arnaud POULIQUEN wrote: > Hello Mathieu, > > On 5/29/24 22:35, Mathieu Poirier wrote: > > On Wed, May 29, 2024 at 09:13:26AM +0200, Arnaud POULIQUEN wrote: > >> Hello Mathieu, > >> > >> On 5/28/24 23:30, Mathieu Poirier wrote: > >>> On Tue, May 21, 2024 at 10:09:59AM +0200, Arnaud Pouliquen wrote: > 1) on start: > - Using the TEE loader, the resource table is loaded by an external > entity. > In such case the resource table address is not find from the firmware but > provided by the TEE remoteproc framework. > Use the rproc_get_loaded_rsc_table instead of rproc_find_loaded_rsc_table > - test that rproc->cached_table is not null before performing the memcpy > > 2)on stop > The use of the cached_table seems mandatory: > - during recovery sequence to have a snapshot of the resource table > resources used, > - on stop to allow for the deinitialization of resources after the > the remote processor has been shutdown. > However if the TEE interface is being used, we first need to unmap the > table_ptr before setting it to rproc->cached_table. > The update of rproc->table_ptr to rproc->cached_table is performed in > tee_remoteproc. > > Signed-off-by: Arnaud Pouliquen > --- > drivers/remoteproc/remoteproc_core.c | 31 +--- > 1 file changed, 23 insertions(+), 8 deletions(-) > > diff --git a/drivers/remoteproc/remoteproc_core.c > b/drivers/remoteproc/remoteproc_core.c > index 42bca01f3bde..3a642151c983 100644 > --- a/drivers/remoteproc/remoteproc_core.c > +++ b/drivers/remoteproc/remoteproc_core.c > @@ -1267,6 +1267,7 @@ EXPORT_SYMBOL(rproc_resource_cleanup); > static int rproc_set_rsc_table_on_start(struct rproc *rproc, const > struct firmware *fw) > { > struct resource_table *loaded_table; > +struct device *dev = >dev; > > /* > * The starting device has been given the rproc->cached_table > as the > @@ -1276,12 +1277,21 @@ static int rproc_set_rsc_table_on_start(struct > rproc *rproc, const struct firmwa > * this information to device memory. We also update the > table_ptr so > * that any subsequent changes will be applied to the loaded > version. > */ > -loaded_table = rproc_find_loaded_rsc_table(rproc, fw); > -if (loaded_table) { > -memcpy(loaded_table, rproc->cached_table, > rproc->table_sz); > -rproc->table_ptr = loaded_table; > +if (rproc->tee_interface) { > +loaded_table = rproc_get_loaded_rsc_table(rproc, > >table_sz); > +if (IS_ERR(loaded_table)) { > +dev_err(dev, "can't get resource table\n"); > +return PTR_ERR(loaded_table); > +} > +} else { > +loaded_table = rproc_find_loaded_rsc_table(rproc, fw); > } > > +if (loaded_table && rproc->cached_table) > +memcpy(loaded_table, rproc->cached_table, > rproc->table_sz); > + > >>> > >>> Why is this not part of the else {} above as it was the case before this > >>> patch? > >>> And why was an extra check for ->cached_table added? > >> > >> Here we have to cover 2 use cases if rproc->tee_interface is set. > >> 1) The remote processor is in stop state > >> - loaded_table points to the resource table in the remote memory and > >> - rproc->cached_table is null > >> => no memcopy > >> 2) crash recovery > >> - loaded_table points to the resource table in the remote memory > >> - rproc-cached_table point to a copy of the resource table > > > > A cached_table exists because it was created in > > rproc_reset_rsc_table_on_stop(). > > But as the comment says [1], that part of the code was meant to be used for > > the > > attach()/detach() use case. Mixing both will become extremely confusing and > > impossible to maintain. > > i am not sure to understand your point here... the cached_table table was > already existing for the "normal" case[2]. Seems to me that the cache table is > needed on stop in all scenarios. > > [2] > https://elixir.bootlin.com/linux/v4.20.17/source/drivers/remoteproc/remoteproc_core.c#L1402 > > > > > I think the TEE scenario should be as similar as the "normal" one where TEE > > is > > not involved. To that end, I suggest to create a cached_table in > > tee_rproc_parse_fw(), exactly the same way it is done in > > rproc_elf_load_rsc_table(). That way the code path in > > rproc_set_rsc_table_on_start() become very similar and we have a > > cached_table to > > work with when the remote processor is recovered. In fact we may not need > >
Re: [PATCH RFC 3/4] mm, slab: add static key for should_failslab()
On Fri, May 31, 2024 at 9:44 AM Alexei Starovoitov wrote: > > On Fri, May 31, 2024 at 2:33 AM Vlastimil Babka wrote: > > > > Since commit 4f6923fbb352 ("mm: make should_failslab always available for > > fault injection") should_failslab() is unconditionally a noinline > > function. This adds visible overhead to the slab allocation hotpath, > > even if the function is empty. With CONFIG_FAILSLAB=y there's additional > > overhead when the functionality is not enabled by a boot parameter or > > debugfs. > > > > The overhead can be eliminated with a static key around the callsite. > > Fault injection and error injection frameworks can now be told that the > > this function has a static key associated, and are able to enable and > > disable it accordingly. > > > > Signed-off-by: Vlastimil Babka > > --- > > mm/failslab.c | 2 +- > > mm/slab.h | 3 +++ > > mm/slub.c | 10 +++--- > > 3 files changed, 11 insertions(+), 4 deletions(-) > > > > diff --git a/mm/failslab.c b/mm/failslab.c > > index ffc420c0e767..878fd08e5dac 100644 > > --- a/mm/failslab.c > > +++ b/mm/failslab.c > > @@ -9,7 +9,7 @@ static struct { > > bool ignore_gfp_reclaim; > > bool cache_filter; > > } failslab = { > > - .attr = FAULT_ATTR_INITIALIZER, > > + .attr = FAULT_ATTR_INITIALIZER_KEY(_failslab_active.key), > > .ignore_gfp_reclaim = true, > > .cache_filter = false, > > }; > > diff --git a/mm/slab.h b/mm/slab.h > > index 5f8f47c5bee0..792e19cb37b8 100644 > > --- a/mm/slab.h > > +++ b/mm/slab.h > > @@ -11,6 +11,7 @@ > > #include > > #include > > #include > > +#include > > > > /* > > * Internal slab definitions > > @@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, > > freelist), sizeof(freelist_aba_t) > > */ > > #define slab_page(s) folio_page(slab_folio(s), 0) > > > > +DECLARE_STATIC_KEY_FALSE(should_failslab_active); > > + > > /* > > * If network-based swap is enabled, sl*b must keep track of whether pages > > * were allocated from pfmemalloc reserves. > > diff --git a/mm/slub.c b/mm/slub.c > > index 0809760cf789..3bb579760a37 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -3874,13 +3874,15 @@ static __always_inline void > > maybe_wipe_obj_freeptr(struct kmem_cache *s, > > 0, sizeof(void *)); > > } > > > > +DEFINE_STATIC_KEY_FALSE(should_failslab_active); > > + > > noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags) > > { > > if (__should_failslab(s, gfpflags)) > > return -ENOMEM; > > return 0; > > } > > -ALLOW_ERROR_INJECTION(should_failslab, ERRNO); > > +ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, _failslab_active); > > > > static __fastpath_inline > > struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) > > @@ -3889,8 +3891,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct > > kmem_cache *s, gfp_t flags) > > > > might_alloc(flags); > > > > - if (unlikely(should_failslab(s, flags))) > > - return NULL; > > + if (static_branch_unlikely(_failslab_active)) { > > + if (should_failslab(s, flags)) > > + return NULL; > > + } > > makes sense. > Acked-by: Alexei Starovoitov > > Do you have any microbenchmark numbers before/after this optimization? There are numbers in the cover letter for the entire series: https://lore.kernel.org/lkml/20240531-fault-injection-statickeys-v1-0-a513fd0a9...@suse.cz/
Re: [PATCH] tracing/fprobe: Support raw tracepoint events on modules
Hi Masami, kernel test robot noticed the following build errors: [auto build test ERROR on linus/master] [also build test ERROR on v6.10-rc1 next-20240531] [cannot apply to rostedt-trace/for-next rostedt-trace/for-next-urgent] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Masami-Hiramatsu-Google/tracing-fprobe-Support-raw-tracepoint-events-on-modules/20240531-175013 base: linus/master patch link: https://lore.kernel.org/r/171714888633.198965.13093663631481169611.stgit%40devnote2 patch subject: [PATCH] tracing/fprobe: Support raw tracepoint events on modules config: s390-defconfig (https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/config) compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project bafda89a0944d947fc4b3b5663185e07a397ac30) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202406010034.fsnp9rsq-...@intel.com/ All errors (new ones prefixed by >>): In file included from kernel/tracepoint.c:5: In file included from include/linux/module.h:19: In file included from include/linux/elf.h:6: In file included from arch/s390/include/asm/elf.h:173: In file included from arch/s390/include/asm/mmu_context.h:11: In file included from arch/s390/include/asm/pgalloc.h:18: In file included from include/linux/mm.h:2253: include/linux/vmstat.h:500:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion] 500 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + |~ ^ 501 |item]; | include/linux/vmstat.h:507:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion] 507 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + |~ ^ 508 |NR_VM_NUMA_EVENT_ITEMS + |~~ include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion] 514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_" | ~~~ ^ ~~~ include/linux/vmstat.h:519:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion] 519 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + |~ ^ 520 |NR_VM_NUMA_EVENT_ITEMS + |~~ include/linux/vmstat.h:528:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion] 528 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS + |~ ^ 529 |NR_VM_NUMA_EVENT_ITEMS + |~~ >> kernel/tracepoint.c:751:34: error: no member named >> '__start___tracepoints_ptrs' in 'struct module' 751 | for_each_tracepoint_range(mod->__start___tracepoints_ptrs, | ~~~ ^ 5 warnings and 1 error generated. vim +751 kernel/tracepoint.c 738 739 void for_each_module_tracepoint(void (*fct)(struct tracepoint *tp, void *priv), 740 void *priv) 741 { 742 struct tp_module *tp_mod; 743 struct module *mod; 744 745 if (!mod->num_tracepoints) 746 return; 747 748 mutex_lock(_module_list_mutex); 749 list_for_each_entry(tp_mod, _module_list, list) { 750 mod = tp_mod->mod; > 751 > for_each_tracepoint_range(mod->__start___tracepoints_ptrs, 752 mod->tracepoints_ptrs + mod->num_tracepoints, 753 fct, priv); 754 } 755 mutex_unlock(_module_list_mutex); 756 } 757 #endif /* CONFIG_MODULES */ 758 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH RFC 3/4] mm, slab: add static key for should_failslab()
On Fri, May 31, 2024 at 2:33 AM Vlastimil Babka wrote: > > Since commit 4f6923fbb352 ("mm: make should_failslab always available for > fault injection") should_failslab() is unconditionally a noinline > function. This adds visible overhead to the slab allocation hotpath, > even if the function is empty. With CONFIG_FAILSLAB=y there's additional > overhead when the functionality is not enabled by a boot parameter or > debugfs. > > The overhead can be eliminated with a static key around the callsite. > Fault injection and error injection frameworks can now be told that the > this function has a static key associated, and are able to enable and > disable it accordingly. > > Signed-off-by: Vlastimil Babka > --- > mm/failslab.c | 2 +- > mm/slab.h | 3 +++ > mm/slub.c | 10 +++--- > 3 files changed, 11 insertions(+), 4 deletions(-) > > diff --git a/mm/failslab.c b/mm/failslab.c > index ffc420c0e767..878fd08e5dac 100644 > --- a/mm/failslab.c > +++ b/mm/failslab.c > @@ -9,7 +9,7 @@ static struct { > bool ignore_gfp_reclaim; > bool cache_filter; > } failslab = { > - .attr = FAULT_ATTR_INITIALIZER, > + .attr = FAULT_ATTR_INITIALIZER_KEY(_failslab_active.key), > .ignore_gfp_reclaim = true, > .cache_filter = false, > }; > diff --git a/mm/slab.h b/mm/slab.h > index 5f8f47c5bee0..792e19cb37b8 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > > /* > * Internal slab definitions > @@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), > sizeof(freelist_aba_t) > */ > #define slab_page(s) folio_page(slab_folio(s), 0) > > +DECLARE_STATIC_KEY_FALSE(should_failslab_active); > + > /* > * If network-based swap is enabled, sl*b must keep track of whether pages > * were allocated from pfmemalloc reserves. > diff --git a/mm/slub.c b/mm/slub.c > index 0809760cf789..3bb579760a37 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -3874,13 +3874,15 @@ static __always_inline void > maybe_wipe_obj_freeptr(struct kmem_cache *s, > 0, sizeof(void *)); > } > > +DEFINE_STATIC_KEY_FALSE(should_failslab_active); > + > noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags) > { > if (__should_failslab(s, gfpflags)) > return -ENOMEM; > return 0; > } > -ALLOW_ERROR_INJECTION(should_failslab, ERRNO); > +ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, _failslab_active); > > static __fastpath_inline > struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) > @@ -3889,8 +3891,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct > kmem_cache *s, gfp_t flags) > > might_alloc(flags); > > - if (unlikely(should_failslab(s, flags))) > - return NULL; > + if (static_branch_unlikely(_failslab_active)) { > + if (should_failslab(s, flags)) > + return NULL; > + } makes sense. Acked-by: Alexei Starovoitov Do you have any microbenchmark numbers before/after this optimization?
Re: [PATCH RFC 0/4] static key support for error injection functions
Hi, On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote: > Incomplete, help needed from ftrace/kprobe and bpf folks. > - the generic error injection using kretprobes with > override_function_with_return is handled in patch 2. The > ALLOW_ERROR_INJECTION() annotation is extended so that static key > address can be passed, and the framework controls it when error > injection is enabled or disabled in debugfs for the function. > > There are two more users I know of but am not familiar enough to fix up > myself. I hope people that are more familiar can help me here. > > - ftrace seems to be using override_function_with_return from > #define ftrace_override_function_with_return but I found no place > where the latter is used. I assume it might be hidden behind more > macro magic? But the point is if ftrace can be instructed to act like > an error injection, it would also have to use some form of metadata > (from patch 2 presumably?) to get to the static key and control it. I don't think you've missed anything; nothing currently uses ftrace_override_function_with_return(). I added that in commit: 94d095ffa0e16bb7 ("ftrace: abstract DYNAMIC_FTRACE_WITH_ARGS accesses") ... so that it was possible to do anything that was possible with FTRACE_WITH_REGS and/or kprobes, under the expectation that we might want to move fault injection and BPF probes over to fprobes in future, as ftrace/fprobes is generally faster than kprobes (e.g. for architectures that can't do KPROBES_ON_FTRACE or OPTPROBES). That's just the mechanism for the handler to use; I'd expect whatever registered the handler to be responsible for flipping the static key, and I don't think anything needs to change within ftrace itself. > If ftrace can only observe the function being called, maybe it > wouldn't be wrong to just observe nothing if the static key isn't > enabled because nobody is doing the fault injection? Yep, that sounds right to me. Mark.
Re: [PATCH 10/20] function_graph: Have the instances use their own ftrace_ops for filtering
On Fri, 31 May 2024 02:03:46 -0400 Steven Rostedt wrote: > On Fri, 31 May 2024 12:12:41 +0900 > Masami Hiramatsu (Google) wrote: > > > On Thu, 30 May 2024 22:30:57 -0400 > > Steven Rostedt wrote: > > > > > On Fri, 24 May 2024 22:37:02 -0400 > > > Steven Rostedt wrote: > > > > > > > From: "Steven Rostedt (VMware)" > > > > > > > > Allow for instances to have their own ftrace_ops part of the fgraph_ops > > > > that makes the funtion_graph tracer filter on the set_ftrace_filter file > > > > of the instance and not the top instance. > > > > > > > > Note that this also requires to update ftrace_graph_func() to call new > > > > function_graph_enter_ops() instead of function_graph_enter() so that > > > > it avoid pushing on shadow stack multiple times on the same function. > > > > > > So I found a major design flaw in this patch. > > > > > > > > > > > Co-developed with Masami Hiramatsu: > > > > Link: > > > > https://lore.kernel.org/linux-trace-kernel/171509102088.162236.15758883237657317789.stgit@devnote2 > > > > > > > > Signed-off-by: Steven Rostedt (VMware) > > > > Signed-off-by: Masami Hiramatsu (Google) > > > > Signed-off-by: Steven Rostedt (Google) > > > > --- > > > > > > > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c > > > > index 8da0e66ca22d..998558cb8f15 100644 > > > > --- a/arch/x86/kernel/ftrace.c > > > > +++ b/arch/x86/kernel/ftrace.c > > > > @@ -648,9 +648,24 @@ void ftrace_graph_func(unsigned long ip, unsigned > > > > long parent_ip, > > > >struct ftrace_ops *op, struct ftrace_regs *fregs) > > > > { > > > > struct pt_regs *regs = >regs; > > > > - unsigned long *stack = (unsigned long > > > > *)kernel_stack_pointer(regs); > > > > + unsigned long *parent = (unsigned long > > > > *)kernel_stack_pointer(regs); > > > > + struct fgraph_ops *gops = container_of(op, struct fgraph_ops, > > > > ops); > > > > + int bit; > > > > + > > > > + if (unlikely(ftrace_graph_is_dead())) > > > > + return; > > > > + > > > > + if (unlikely(atomic_read(>tracing_graph_pause))) > > > > + return; > > > > > > > > - prepare_ftrace_return(ip, (unsigned long *)stack, 0); > > > > + bit = ftrace_test_recursion_trylock(ip, *parent); > > > > + if (bit < 0) > > > > + return; > > > > + > > > > + if (!function_graph_enter_ops(*parent, ip, 0, parent, gops)) > > > > > > So each registered graph ops has its own ftrace_ops which gets > > > registered with ftrace, so this function does get called in a loop (by > > > the ftrace iterator function). This means that we would need that code > > > to detect the function_graph_enter_ops() getting called multiple times > > > for the same function. This means each fgraph_ops gits its own retstack > > > on the shadow stack. > > > > Ah, that is my concern and the reason why I added bitmap and stack reuse > > code in the ftrace_push_return_trace(). > > > > > > > > I find this a waste of shadow stack resources, and also complicates the > > > code with having to deal with tail calls and all that. > > > > > > BUT! There's good news! I also thought about another way of handling > > > this. I have something working, but requires a bit of rewriting the > > > code. I should have something out in a day or two. > > > > Hmm, I just wonder why you don't reocver my bitmap check and stack > > reusing code. Are there any problem on it? (Too complicated?) > > > > I actually dislike the use of ftrace itself to do the loop. I rather > have fgraph be in control of it. (actually, I agreed with you, looping in ftrace may cause trouble) > > I've come up with a new "subops" assignment, where you can have one > ftrace_ops represent multiple sub ftrace_ops. Basically, each fgraph > ops can register its own ftrace_ops under a single graph_ops > ftrace_ops. The graph_ops will be used to decide what functions call > the callback, and then the callback does the multiplexing. So is it similar to the fprobe/kprobe, use shared signle ftrace_ops, but keep each fgraph has own hash table? > This removes the need to touch the architecture code. It can also be > used by fprobes to handle the attachments to functions for several > different sets of callbacks. > > I'll send out patches soon. OK, I'll wait for that. Thank you! > > -- Steve > -- Masami Hiramatsu (Google)
Re: [PATCH 0/3] tracing: Fix some selftest issues
On Fri, 31 May 2024 03:24:25 -0400 Steven Rostedt wrote: > On Fri, 31 May 2024 11:37:21 +0900 > Masami Hiramatsu (Google) wrote: > > > So, in summary, it is designed to be a module. Steve, I think these tests > > should be kept as modules. There are many reason to do so. > > > > - This test is designed to be used as module. > > - This can conflict with other boot time selftest if it is embedded. > > - We can make these tests and boot time selftest mutable exclusive but > >if we make these tests as modules, we can build and run both tests > >safely. > > - Embedding these tests leave new events when the kernel boot, which > >user must be cleaned up by manual. > > > > What would you think? > > I was mostly following what Ingo told me long ago, where having it > built in is just one more way to test it ;-) Ah, would you mean embedding the module is also a part of tests? > > But that said, from your first patch, you show the stack dump and > mention: > > > Since the kprobes and synth event generation tests adds and enable > > generated events in init_module() and delete it in exit_module(), > > if we make it as built-in, those events are left in kernel and cause > > kprobe event self-test failure. > > But you don't explain what exactly the conflict is. What about those > events causes kprobe selftests to fail? The major conflict happens when the boot-time test cleans up the kprobe events by dyn_events_release_all(_kprobe_ops); And I removed it by [3/3] patch in this series :) because it does not needed and not confirmed there is no other kprobe events when the test starts. Also the warning message are redundant so I removed it by [2/3]. So without this [1/3], if we apply [2/3] and [3/3], the problem will be mitigated, but I think the root cause is that these modules are built-in. Thank you, > > > -- Steve -- Masami Hiramatsu (Google)
Re: [PATCH] livepatch: introduce klp_func called interface
> And for the unlikely branch, isn’t the complier will compile this branch > into a cold branch that will do no harm to the function performance? The test (cmp insn or something like that) still needs to be there. Since there is only a simple assignment in the branch, the compiler may just choose not to have a cold branch in this case. The only difference is in which case you would jump here. You can see for yourself (and prove me wrong if it comes to it). Miroslav
Re: [PATCH] livepatch: introduce klp_func called interface
> On May 31, 2024, at 15:21, Miroslav Benes wrote: > > Hi, > > On Fri, 31 May 2024, zhang warden wrote: > > you have not replied to my questions/feedback yet. > > Also, I do not think that unlikely changes anything here. It is a simple > branch after all. > > Miroslav Hi Miroslav, Sorry for my carelessness. I apologise for my ignorance. > Second, livepatch is > already use ftrace for functional replacement, I don’t think it is a > good choice of using kernel tracing tool to trace a patched function. I admit that ftrace can use for tracing the new patched function. But for some cases, user who what to know the state of this function can easily cat the 'called' interface. It is more convenient than using ftrace to trace the state. And for the unlikely branch, isn’t the complier will compile this branch into a cold branch that will do no harm to the function performance? Regards, Wardenjohn
Re: [PATCH v6 2/5] mfd: add driver for Marvell 88PM886 PMIC
Lee Jones, 2024-05-31T11:24:52+01:00: > Are you planning on seeing to Mark's review comments? Indeed, I'm hoping that I will be able to send it over the weekend. K. B.
Re: [PATCH v3 1/3] arm64: dts: qcom: pm7250b: Add node for PMIC VBUS booster
On 30.05.2024 5:05 PM, Luca Weiss wrote: > Add the required DTS node for the USB VBUS output regulator, which is > available on PM7250B. This will provide the VBUS source to connected > peripherals. > > Reviewed-by: Bryan O'Donoghue > Signed-off-by: Luca Weiss > --- Reviewed-by: Konrad Dybcio Konrad
Re: [PATCH] arm64: dts: qcom: sm8550-samsung-q5q: fix typo
On 31/05/2024 14:05, David Wronek wrote: > It looks like "cdsp_mem" was pasted in the license header by accident. > Fix the typo by removing it. > > Signed-off-by: David Wronek > --- > arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts | 2 +- Fixes: ba2c082a401f ("arm64: dts: qcom: sm8550: Add support for Samsung Galaxy Z Fold5") Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof
[PATCH] arm64: dts: qcom: sm8550-samsung-q5q: fix typo
It looks like "cdsp_mem" was pasted in the license header by accident. Fix the typo by removing it. Signed-off-by: David Wronek --- arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts b/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts index 4654ae1364ba..3d351e90bb39 100644 --- a/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts +++ b/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts @@ -1,4 +1,4 @@ -// SPDX-License-cdsp_memIdentifier: BSD-3-Clause +// SPDX-License-Identifier: BSD-3-Clause /* * Copyright (c) 2024, Alexandru Marc Serdeliuc * Copyright (c) 2024, David Wronek --- base-commit: 0e1980c40b6edfa68b6acf926bab22448a6e40c9 change-id: 20240531-fix-typo-q5q-5d34423b7bb4 Best regards, -- David Wronek
Re: [PATCH 0/2] livepatch: Add compiler optimization disclaimer/docs
Hi, On Tue, 21 Jul 2020, Joe Lawrence wrote: > In light of [PATCH] Revert "kbuild: use -flive-patching when > CONFIG_LIVEPATCH is enabled" [1], we should add some loud disclaimers > and explanation of the impact compiler optimizations have on > livepatching. > > The first commit provides detailed explanations and examples. The list > was taken mostly from Miroslav's LPC talk a few years back. This is a > bit rough, so corrections and additional suggestions welcome. Expanding > upon the source-based patching approach would be helpful, too. > > The second commit adds a small README.rst file in the livepatch samples > directory pointing the reader to the doc introduced in the first commit. > > I didn't touch the livepatch kselftests yet as I'm still unsure about > how to best account for IPA here. We could add the same README.rst > disclaimer here, too, but perhaps we have a chance to do something more. > Possibilities range from checking for renamed functions as part of their > build, or the selftest scripts, or even adding something to the kernel > API. I think we'll have a better idea after reviewing the compiler > considerations doc. thanks to Marcos for resurrecting this. Joe, do you have an updated version by any chance? Some things have changed since July 2020 so it calls for a new review. If there was an improved version, it would be easier. If not, no problem at all. Miroslav
Re: [PATCH v6 2/5] mfd: add driver for Marvell 88PM886 PMIC
On Sat, 04 May 2024, Karel Balej wrote: > Marvell 88PM886 is a PMIC which provides various functions such as > onkey, battery, charger and regulators. It is found for instance in the > samsung,coreprimevelte smartphone with which this was tested. Implement > basic support to allow for the use of regulators and onkey. > > Signed-off-by: Karel Balej > --- > > Notes: > v6: > - Address Lee's comments: > - Don't break long line in the power off handler. > - Set PLATFORM_DEVID_NONE. This should be safe with respect to > collisions since there are no known devices with more than one of > these PMICs, plus given their general purpose nature it is unlikely > that there would ever be. Also include the corresponding header. > - Move all defines to the header. > - Give the base page's maximum register its real name. > - Set irq_base to 0. > v5: > - Address Mark's feedback: > - Move regmap config back out of the header and rename it. Also lower > its maximum register based on what's actually used in the downstream > code. > RFC v4: > - Use MFD_CELL_* macros. > - Address Lee's feedback: > - Do not define regmap_config.val_bits and .reg_bits. > - Drop everything regulator related except mfd_cell (regmap > initialization, IDs enum etc.). Drop pm886_initialize_subregmaps. > - Do not store regmap pointers as an array as there is now only one > regmap. Also drop the corresponding enum. > - Move regmap_config to the header as it is needed in the regulators > driver. > - pm886_chip.whoami -> chip_id > - Reword chip ID mismatch error message and print the ID as > hexadecimal. > - Fix includes in include/linux/88pm886.h. > - Drop the pm886_irq_number enum and define the (for the moment) only > IRQ explicitly. > - Have only one MFD cell for all regulators as they are now registered > all at once in the regulators driver. > - Reword commit message. > - Make device table static and remove comma after the sentinel to signal > that nothing should come after it. > RFC v3: > - Drop onkey cell .of_compatible. > - Rename LDO page offset and regmap to REGULATORS. > RFC v2: > - Remove some abstraction. > - Sort includes alphabetically and add linux/of.h. > - Depend on OF, remove of_match_ptr and add MODULE_DEVICE_TABLE. > - Use more temporaries and break long lines. > - Do not initialize ret in probe. > - Use the wakeup-source DT property. > - Rename ret to err. > - Address Lee's comments: > - Drop patched in presets for base regmap and related defines. > - Use full sentences in comments. > - Remove IRQ comment. > - Define regmap_config member values. > - Rename data to sys_off_data. > - Add _PMIC suffix to Kconfig. > - Use dev_err_probe. > - Do not store irq_data. > - s/WHOAMI/CHIP_ID > - Drop LINUX part of include guard name. > - Merge in the regulator series modifications in order to have more > devices and modify the commit message accordingly. Changes with > respect to the original regulator series patches: > - ret -> err > - Add temporary for dev in pm88x_initialize_subregmaps. > - Drop of_compatible for the regulators. > - Do not duplicate LDO regmap for bucks. > - Rewrite commit message. > > drivers/mfd/88pm886.c | 148 > drivers/mfd/Kconfig | 12 +++ > drivers/mfd/Makefile| 1 + > include/linux/mfd/88pm886.h | 69 + > 4 files changed, 230 insertions(+) > create mode 100644 drivers/mfd/88pm886.c > create mode 100644 include/linux/mfd/88pm886.h I don't see any more issues. Are you planning on seeing to Mark's review comments? -- Lee Jones [李琼斯]
[PATCH] tracing/fprobe: Support raw tracepoint events on modules
From: Masami Hiramatsu (Google) Support raw tracepoint event on module by fprobe events. Since it only uses for_each_kernel_tracepoint() to find a tracepoint, the tracepoints on modules are not handled. Thus if user specified a tracepoint on a module, it shows an error. This adds new for_each_module_tracepoint() API to tracepoint subsystem, and uses it to find tracepoints on modules. Reported-by: don Closes: https://lore.kernel.org/all/20240530215718.aeec973a1d0bf058d39cb...@kernel.org/ Signed-off-by: Masami Hiramatsu (Google) --- include/linux/tracepoint.h |7 +++ kernel/trace/trace_fprobe.c | 46 --- kernel/tracepoint.c | 19 ++ 3 files changed, 64 insertions(+), 8 deletions(-) diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 689b6d71590e..46e6a5e759fd 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -65,6 +65,8 @@ struct tp_module { bool trace_module_has_bad_taint(struct module *mod); extern int register_tracepoint_module_notifier(struct notifier_block *nb); extern int unregister_tracepoint_module_notifier(struct notifier_block *nb); +void for_each_module_tracepoint(void (*fct)(struct tracepoint *, void *), + void *priv); #else static inline bool trace_module_has_bad_taint(struct module *mod) { @@ -80,6 +82,11 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb) { return 0; } +static inline +void for_each_module_tracepoint(void (*fct)(struct tracepoint *, void *), + void *priv) +{ +} #endif /* CONFIG_MODULES */ /* diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c index 62e6a8f4aae9..1d8a983e1edc 100644 --- a/kernel/trace/trace_fprobe.c +++ b/kernel/trace/trace_fprobe.c @@ -385,6 +385,7 @@ static struct trace_fprobe *alloc_trace_fprobe(const char *group, const char *event, const char *symbol, struct tracepoint *tpoint, + struct module *mod, int maxactive, int nargs, bool is_return) { @@ -405,6 +406,7 @@ static struct trace_fprobe *alloc_trace_fprobe(const char *group, tf->fp.entry_handler = fentry_dispatcher; tf->tpoint = tpoint; + tf->mod = mod; tf->fp.nr_maxactive = maxactive; ret = trace_probe_init(>tp, event, group, false, nargs); @@ -895,8 +897,23 @@ static struct notifier_block tracepoint_module_nb = { struct __find_tracepoint_cb_data { const char *tp_name; struct tracepoint *tpoint; + struct module *mod; }; +static void __find_tracepoint_module_cb(struct tracepoint *tp, void *priv) +{ + struct __find_tracepoint_cb_data *data = priv; + + if (!data->tpoint && !strcmp(data->tp_name, tp->name)) { + data->tpoint = tp; + data->mod = __module_text_address((unsigned long)tp->probestub); + if (!try_module_get(data->mod)) { + data->tpoint = NULL; + data->mod = NULL; + } + } +} + static void __find_tracepoint_cb(struct tracepoint *tp, void *priv) { struct __find_tracepoint_cb_data *data = priv; @@ -905,14 +922,28 @@ static void __find_tracepoint_cb(struct tracepoint *tp, void *priv) data->tpoint = tp; } -static struct tracepoint *find_tracepoint(const char *tp_name) +/* + * Find a tracepoint from kernel and module. If the tracepoint is in a module, + * this increments the module refcount to prevent unloading until the + * trace_fprobe is registered to the list. After registering the trace_fprobe + * on the trace_fprobe list, the module refcount is decremented because + * tracepoint_probe_module_cb will handle it. + */ +static struct tracepoint *find_tracepoint(const char *tp_name, + struct module **tp_mod) { struct __find_tracepoint_cb_data data = { .tp_name = tp_name, + .mod = NULL, }; for_each_kernel_tracepoint(__find_tracepoint_cb, ); + if (!data.tpoint && IS_ENABLED(CONFIG_MODULES)) { + for_each_module_tracepoint(__find_tracepoint_module_cb, ); + *tp_mod = data.mod; + } + return data.tpoint; } @@ -996,6 +1027,7 @@ static int __trace_fprobe_create(int argc, const char *argv[]) char abuf[MAX_BTF_ARGS_LEN]; char *dbuf = NULL; bool is_tracepoint = false; + struct module *tp_mod = NULL; struct tracepoint *tpoint = NULL; struct traceprobe_parse_context ctx = { .flags = TPARG_FL_KERNEL | TPARG_FL_FPROBE, @@ -1080,7 +1112,7 @@ static int
[PATCH RFC 4/4] mm, page_alloc: add static key for should_fail_alloc_page()
Similarly to should_failslab(), remove the overhead of calling the noinline function should_fail_alloc_page() with a static key that guards the allocation hotpath callsite and is controlled by the fault and error injection frameworks. Signed-off-by: Vlastimil Babka --- mm/fail_page_alloc.c | 3 ++- mm/internal.h| 2 ++ mm/page_alloc.c | 11 --- 3 files changed, 12 insertions(+), 4 deletions(-) diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c index b1b09cce9394..0906b76d78e8 100644 --- a/mm/fail_page_alloc.c +++ b/mm/fail_page_alloc.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include #include +#include "internal.h" static struct { struct fault_attr attr; @@ -9,7 +10,7 @@ static struct { bool ignore_gfp_reclaim; u32 min_order; } fail_page_alloc = { - .attr = FAULT_ATTR_INITIALIZER, + .attr = FAULT_ATTR_INITIALIZER_KEY(_fail_alloc_page_active.key), .ignore_gfp_reclaim = true, .ignore_gfp_highmem = true, .min_order = 1, diff --git a/mm/internal.h b/mm/internal.h index b2c75b12014e..8539e39b02e6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -410,6 +410,8 @@ extern char * const zone_names[MAX_NR_ZONES]; /* perform sanity checks on struct pages being allocated or freed */ DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled); +DECLARE_STATIC_KEY_FALSE(should_fail_alloc_page_active); + extern int min_free_kbytes; void setup_per_zone_wmarks(void); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2e22ce5675ca..e5dc3bafa549 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -274,6 +274,8 @@ int user_min_free_kbytes = -1; static int watermark_boost_factor __read_mostly = 15000; static int watermark_scale_factor = 10; +DEFINE_STATIC_KEY_FALSE(should_fail_alloc_page_active); + /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ int movable_zone; EXPORT_SYMBOL(movable_zone); @@ -3012,7 +3014,7 @@ noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { return __should_fail_alloc_page(gfp_mask, order); } -ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); +ALLOW_ERROR_INJECTION_KEY(should_fail_alloc_page, TRUE, _fail_alloc_page_active); static inline long __zone_watermark_unusable_free(struct zone *z, unsigned int order, unsigned int alloc_flags) @@ -4430,8 +4432,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, might_alloc(gfp_mask); - if (should_fail_alloc_page(gfp_mask, order)) - return false; + if (static_branch_unlikely(_fail_alloc_page_active)) { + if (should_fail_alloc_page(gfp_mask, order)) { + return false; + } + } *alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags); -- 2.45.1
[PATCH RFC 0/4] static key support for error injection functions
Incomplete, help needed from ftrace/kprobe and bpf folks. As previously mentioned by myself [1] and others [2] the functions designed for error injection can bring visible overhead in fastpaths such as slab or page allocation, because even if nothing hooks into them at a given moment, they are noninline function calls regardless of CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab always available for fault injection") and af3b854492f3 ("mm/page_alloc.c: allow error injection"). Live patching their callsites has been also suggested in both [1] and [2] threads, and this is an attempt to do that with static keys that guard the call sites. When disabled, the error injection functions still exist and are noinline, but are not being called. Any of the existing mechanisms that can inject errors should make sure to enable the respective static key. I have added that support to some of them but need help with the others. - the legacy fault injection, i.e. CONFIG_FAILSLAB and CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the address of the static key if it exists. The key will be activated if the fault injection probability becomes non-zero, and deactivated in the opposite transition. This also removes the overhead of the evaluation (on top of the noninline function call) when these mechanisms are configured in the kernel but unused at the moment. - the generic error injection using kretprobes with override_function_with_return is handled in patch 2. The ALLOW_ERROR_INJECTION() annotation is extended so that static key address can be passed, and the framework controls it when error injection is enabled or disabled in debugfs for the function. There are two more users I know of but am not familiar enough to fix up myself. I hope people that are more familiar can help me here. - ftrace seems to be using override_function_with_return from #define ftrace_override_function_with_return but I found no place where the latter is used. I assume it might be hidden behind more macro magic? But the point is if ftrace can be instructed to act like an error injection, it would also have to use some form of metadata (from patch 2 presumably?) to get to the static key and control it. If ftrace can only observe the function being called, maybe it wouldn't be wrong to just observe nothing if the static key isn't enabled because nobody is doing the fault injection? - bpftrace, as can be seen from the example in commit 4f6923fbb352 description. I suppose bpf is already aware what functions the currently loaded bpf programs hook into, so that it could look up the static key and control it. Maybe using again the metadata from patch 2, or extending its own, as I've noticed there's e.g. BTF_ID(func, should_failslab) Now I realize maybe handling this at the k(ret)probe level would be sufficient for all cases except the legacy fault injection from Patch 1? Also wanted to note that by AFAIU by using the static_key_slow_dec/inc API (as done in patches 1/2) should allow all mechanisms to coexist naturally without fighting each other on the static key state, and also handle the reference count for e.g. active probes or bpf programs if there's no similar internal mechanism. Patches 3 and 4 implement the static keys for the two mm fault injection sites in slab and page allocators. For a quick demonstration I've run a VM and the simple test from [1] that stresses the slab allocator and got this time before the series: real0m8.349s user0m0.694s sys 0m7.648s with perf showing 0.61% nonexistent [kernel.kallsyms] [k] should_failslab.constprop.0 0.00% nonexistent [kernel.kallsyms] [k] should_fail_alloc_page ▒ And after the series real0m7.924s user0m0.727s sys 0m7.191s and the functions gone from perf report. There might be other such fault injection callsites in hotpaths of other subsystems but I didn't search for them at this point. [1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccb...@suse.cz/ [2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/ Signed-off-by: Vlastimil Babka --- Vlastimil Babka (4): fault-inject: add support for static keys around fault injection sites error-injection: support static keys around injectable functions mm, slab: add static key for should_failslab() mm, page_alloc: add static key for should_fail_alloc_page() include/asm-generic/error-injection.h | 13 ++- include/asm-generic/vmlinux.lds.h | 2 +- include/linux/error-injection.h | 9 +--- include/linux/fault-inject.h | 7 +- kernel/fail_function.c| 22 +++--- lib/error-inject.c| 6
[PATCH RFC 2/4] error-injection: support static keys around injectable functions
Error injectable functions cannot be inlined and since some are called from hot paths, this incurrs overhead even if no error injection is enabled for them. To remove this overhead when disabled, allow the callsites of error injectable functions to put the calls behind a static key, which the framework can control when error injection is enabled or disabled for the function. Introduce a new ALLOW_ERROR_INJECTION_KEY() macro that adds a parameter with the static key's address, and store it in struct error_injection_entry. This new field has caused a mismatch when populating the injection list from the _error_injection_whitelist section with the current STRUCT_ALIGN(), so change the alignment to 8. During the population, copy the key's address also to struct ei_entry, and make it possible to retrieve it along with the error type by get_injectable_error_type(). Finally, make the processing of writes to the debugfs inject file enable the static key when the function is added to the injection list, and disable when removed. Signed-off-by: Vlastimil Babka --- include/asm-generic/error-injection.h | 13 - include/asm-generic/vmlinux.lds.h | 2 +- include/linux/error-injection.h | 9 ++--- kernel/fail_function.c| 22 +++--- lib/error-inject.c| 6 +- 5 files changed, 43 insertions(+), 9 deletions(-) diff --git a/include/asm-generic/error-injection.h b/include/asm-generic/error-injection.h index b05253f68eaa..eed2731f3820 100644 --- a/include/asm-generic/error-injection.h +++ b/include/asm-generic/error-injection.h @@ -12,6 +12,7 @@ enum { struct error_injection_entry { unsigned long addr; + unsigned long static_key_addr; int etype; }; @@ -25,16 +26,26 @@ struct pt_regs; * 'Error Injectable Functions' section. */ #define ALLOW_ERROR_INJECTION(fname, _etype) \ -static struct error_injection_entry __used \ +static struct error_injection_entry __used __aligned(8) \ __section("_error_injection_whitelist") \ _eil_addr_##fname = { \ .addr = (unsigned long)fname, \ .etype = EI_ETYPE_##_etype, \ } +#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key) \ +static struct error_injection_entry __used __aligned(8) \ + __section("_error_injection_whitelist") \ + _eil_addr_##fname = { \ + .addr = (unsigned long)fname, \ + .static_key_addr = (unsigned long)key, \ + .etype = EI_ETYPE_##_etype, \ + } + void override_function_with_return(struct pt_regs *regs); #else #define ALLOW_ERROR_INJECTION(fname, _etype) +#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key) static inline void override_function_with_return(struct pt_regs *regs) { } #endif diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 5703526d6ebf..1b15a0af2a00 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -248,7 +248,7 @@ #ifdef CONFIG_FUNCTION_ERROR_INJECTION #define ERROR_INJECT_WHITELIST() \ - STRUCT_ALIGN(); \ + . = ALIGN(8); \ BOUNDED_SECTION(_error_injection_whitelist) #else #define ERROR_INJECT_WHITELIST() diff --git a/include/linux/error-injection.h b/include/linux/error-injection.h index 20e738f4eae8..bec81b57a9d5 100644 --- a/include/linux/error-injection.h +++ b/include/linux/error-injection.h @@ -6,10 +6,12 @@ #include #include +struct static_key; + #ifdef CONFIG_FUNCTION_ERROR_INJECTION -extern bool within_error_injection_list(unsigned long addr); -extern int get_injectable_error_type(unsigned long addr); +bool within_error_injection_list(unsigned long addr); +int get_injectable_error_type(unsigned long addr, struct static_key **key_addr); #else /* !CONFIG_FUNCTION_ERROR_INJECTION */ @@ -18,7 +20,8 @@ static inline bool within_error_injection_list(unsigned long addr) return false; } -static inline int get_injectable_error_type(unsigned long addr) +static inline int get_injectable_error_type(unsigned long addr, + struct static_key **key_addr) { return -EOPNOTSUPP; } diff --git a/kernel/fail_function.c b/kernel/fail_function.c index d971a0189319..9240eb137e00 100644 --- a/kernel/fail_function.c +++ b/kernel/fail_function.c @@ -27,15 +27,16 @@ struct fei_attr { struct list_head list; struct kprobe kp; unsigned long retval; + struct
[PATCH RFC 3/4] mm, slab: add static key for should_failslab()
Since commit 4f6923fbb352 ("mm: make should_failslab always available for fault injection") should_failslab() is unconditionally a noinline function. This adds visible overhead to the slab allocation hotpath, even if the function is empty. With CONFIG_FAILSLAB=y there's additional overhead when the functionality is not enabled by a boot parameter or debugfs. The overhead can be eliminated with a static key around the callsite. Fault injection and error injection frameworks can now be told that the this function has a static key associated, and are able to enable and disable it accordingly. Signed-off-by: Vlastimil Babka --- mm/failslab.c | 2 +- mm/slab.h | 3 +++ mm/slub.c | 10 +++--- 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/failslab.c b/mm/failslab.c index ffc420c0e767..878fd08e5dac 100644 --- a/mm/failslab.c +++ b/mm/failslab.c @@ -9,7 +9,7 @@ static struct { bool ignore_gfp_reclaim; bool cache_filter; } failslab = { - .attr = FAULT_ATTR_INITIALIZER, + .attr = FAULT_ATTR_INITIALIZER_KEY(_failslab_active.key), .ignore_gfp_reclaim = true, .cache_filter = false, }; diff --git a/mm/slab.h b/mm/slab.h index 5f8f47c5bee0..792e19cb37b8 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -11,6 +11,7 @@ #include #include #include +#include /* * Internal slab definitions @@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t) */ #define slab_page(s) folio_page(slab_folio(s), 0) +DECLARE_STATIC_KEY_FALSE(should_failslab_active); + /* * If network-based swap is enabled, sl*b must keep track of whether pages * were allocated from pfmemalloc reserves. diff --git a/mm/slub.c b/mm/slub.c index 0809760cf789..3bb579760a37 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3874,13 +3874,15 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s, 0, sizeof(void *)); } +DEFINE_STATIC_KEY_FALSE(should_failslab_active); + noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags) { if (__should_failslab(s, gfpflags)) return -ENOMEM; return 0; } -ALLOW_ERROR_INJECTION(should_failslab, ERRNO); +ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, _failslab_active); static __fastpath_inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) @@ -3889,8 +3891,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) might_alloc(flags); - if (unlikely(should_failslab(s, flags))) - return NULL; + if (static_branch_unlikely(_failslab_active)) { + if (should_failslab(s, flags)) + return NULL; + } return s; } -- 2.45.1
[PATCH RFC 1/4] fault-inject: add support for static keys around fault injection sites
Some fault injection sites are placed in hotpaths and incur overhead even if not enabled, due to one or more function calls leading up to should_fail_ex() that returns false due to attr->probability == 0. This overhead can be eliminated if the outermost call into the checks is guarded with a static key, so add support for that. The framework should be told that such static key exist for a fault_attr, by initializing fault_attr->active with the static key address. When it's not NULL, enable the static key from setup_fault_attr() when the fault probability is non-zero. Also wire up writing into debugfs "probability" file to enable or disable the static key when transitioning between zero and non-zero probability. For now, do not add configfs interface support as the immediate plan is to leverage this for should_failslab() and should_fail_alloc_page() after other necessary preparatory changes, and none of the configfs based fault injection users. Signed-off-by: Vlastimil Babka --- include/linux/fault-inject.h | 7 ++- lib/fault-inject.c | 43 ++- 2 files changed, 48 insertions(+), 2 deletions(-) diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h index 6d5edef09d45..cfe75cc1bac4 100644 --- a/include/linux/fault-inject.h +++ b/include/linux/fault-inject.h @@ -9,6 +9,7 @@ #include #include #include +#include /* * For explanation of the elements of this struct, see @@ -30,13 +31,14 @@ struct fault_attr { unsigned long count; struct ratelimit_state ratelimit_state; struct dentry *dname; + struct static_key *active; }; enum fault_flags { FAULT_NOWARN = 1 << 0, }; -#define FAULT_ATTR_INITIALIZER { \ +#define FAULT_ATTR_INITIALIZER_KEY(_key) { \ .interval = 1, \ .times = ATOMIC_INIT(1),\ .require_end = ULONG_MAX, \ @@ -44,8 +46,11 @@ enum fault_flags { .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \ .verbose = 2, \ .dname = NULL, \ + .active = (_key), \ } +#define FAULT_ATTR_INITIALIZER FAULT_ATTR_INITIALIZER_KEY(NULL) + #define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER int setup_fault_attr(struct fault_attr *attr, char *str); bool should_fail_ex(struct fault_attr *attr, ssize_t size, int flags); diff --git a/lib/fault-inject.c b/lib/fault-inject.c index d608f9b48c10..93c46d2ec106 100644 --- a/lib/fault-inject.c +++ b/lib/fault-inject.c @@ -35,6 +35,9 @@ int setup_fault_attr(struct fault_attr *attr, char *str) atomic_set(>times, times); atomic_set(>space, space); + if (probability != 0 && attr->active) + static_key_slow_inc(attr->active); + return 1; } EXPORT_SYMBOL_GPL(setup_fault_attr); @@ -166,6 +169,12 @@ EXPORT_SYMBOL_GPL(should_fail); #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS +/* + * Protect updating probability from debugfs as that may trigger static key + * changes when changing between zero and non-zero. + */ +static DEFINE_MUTEX(probability_mutex); + static int debugfs_ul_set(void *data, u64 val) { *(unsigned long *)data = val; @@ -186,6 +195,38 @@ static void debugfs_create_ul(const char *name, umode_t mode, debugfs_create_file(name, mode, parent, value, _ul); } +static int debugfs_prob_set(void *data, u64 val) +{ + struct fault_attr *attr = data; + + mutex_lock(_mutex); + + if (attr->active) { + if (attr->probability != 0 && val == 0) { + static_key_slow_dec(attr->active); + } else if (attr->probability == 0 && val != 0) { + static_key_slow_inc(attr->active); + } + } + + attr->probability = val; + + mutex_unlock(_mutex); + + return 0; +} + +static int debugfs_prob_get(void *data, u64 *val) +{ + struct fault_attr *attr = data; + + *val = attr->probability; + + return 0; +} + +DEFINE_SIMPLE_ATTRIBUTE(fops_prob, debugfs_prob_get, debugfs_prob_set, "%llu\n"); + #ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER static int debugfs_stacktrace_depth_set(void *data, u64 val) @@ -218,7 +259,7 @@ struct dentry *fault_create_debugfs_attr(const char *name, if (IS_ERR(dir)) return dir; - debugfs_create_ul("probability", mode, dir, >probability); + debugfs_create_file("probability", mode, dir, , _prob); debugfs_create_ul("interval", mode, dir, >interval); debugfs_create_atomic_t("times", mode, dir, >times); debugfs_create_atomic_t("space", mode,
Re: [PATCH v2] dt-bindings: remoteproc: k3-dsp: correct optional sram properties for AM62A SoCs
On 30/05/2024 18:48, Hari Nagalla wrote: > The C7xv-dsp on AM62A have 32KB L1 I-cache and a 64KB L1 D-cache. It > does not have an addressable l1dram . So, remove this optional sram > property from the bindings to fix device tree build warnings. > > Signed-off-by: Hari Nagalla > --- > Changes from v1 to v2: > *) Kept back memory-regions property, as it is unrelated to this patch >correcting the sram property for AM62A C7xv-dsp nodes. > > DT binding check log: > https://paste.sr.ht/~hnagalla/cb26237560a572a17c0874b687353e00b400285a > > v1: https://lore.kernel.org/all/20230810110545.11644-1-hnaga...@ti.com/ > > .../bindings/remoteproc/ti,k3-dsp-rproc.yaml | 15 ++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git > a/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml > b/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml > index 9768db8663eb..771cfceb5458 100644 > --- a/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml > +++ b/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml > @@ -111,7 +111,6 @@ else: > properties: >compatible: > enum: > - - ti,am62a-c7xv-dsp >- ti,j721e-c71-dsp >- ti,j721s2-c71-dsp >then: > @@ -124,6 +123,20 @@ else: > items: >- const: l2sram >- const: l1dram > + else: > +if: Please use allOf (move top one to bottom) and define separate ifs for each variant instead of nesting them. Best regards, Krzysztof
Re: [PATCH 0/3] tracing: Fix some selftest issues
On Fri, 31 May 2024 11:37:21 +0900 Masami Hiramatsu (Google) wrote: > So, in summary, it is designed to be a module. Steve, I think these tests > should be kept as modules. There are many reason to do so. > > - This test is designed to be used as module. > - This can conflict with other boot time selftest if it is embedded. > - We can make these tests and boot time selftest mutable exclusive but >if we make these tests as modules, we can build and run both tests >safely. > - Embedding these tests leave new events when the kernel boot, which >user must be cleaned up by manual. > > What would you think? I was mostly following what Ingo told me long ago, where having it built in is just one more way to test it ;-) But that said, from your first patch, you show the stack dump and mention: > Since the kprobes and synth event generation tests adds and enable > generated events in init_module() and delete it in exit_module(), > if we make it as built-in, those events are left in kernel and cause > kprobe event self-test failure. But you don't explain what exactly the conflict is. What about those events causes kprobe selftests to fail? -- Steve
Re: [PATCH] livepatch: introduce klp_func called interface
Hi, On Fri, 31 May 2024, zhang warden wrote: > > Hi Bros, > > How about my patch? I do think it is a viable feature to show the state > of the patched function. If we add an unlikely branch test before we set > the 'called' state, once this function is called, there maybe no > negative effect to the performance. you have not replied to my questions/feedback yet. Also, I do not think that unlikely changes anything here. It is a simple branch after all. Miroslav
[PATCH RT 0/1] Linux v4.19.315-rt135-rc1
Dear RT Folks, This is the RT stable review cycle of patch 4.19.315-rt135-rc1. Please scream at me if I messed something up. Please test the patches too. The -rc release is also available on kernel.org https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git on the v4.19-rt-next branch. If all goes well, this patch will be converted to the next main release on 2024-06-07. Signing key fingerprint: 5BF6 7BC5 0826 72CA BB45 ACAE 587C 5ECA 5D0A 306C All keys used for the above files and repositories can be found on the following git repository: git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git Enjoy! Daniel Changes from v4.19.312-rt134: Daniel Wagner (1): Linux 4.19.315-rt135 localversion-rt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- 2.45.1
[PATCH RT 1/1] Linux 4.19.315-rt135
v4.19.315-rt135-rc1 stable review patch. If anyone has any objections, please let me know. --- Signed-off-by: Daniel Wagner --- localversion-rt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/localversion-rt b/localversion-rt index 6067da4c8c99..e3026053f01e 100644 --- a/localversion-rt +++ b/localversion-rt @@ -1 +1 @@ --rt134 +-rt135 -- 2.45.1
[syzbot] Monthly trace report (May 2024)
Hello trace maintainers/developers, This is a 31-day syzbot report for the trace subsystem. All related reports/information can be found at: https://syzkaller.appspot.com/upstream/s/trace During the period, 1 new issues were detected and 0 were fixed. In total, 10 issues are still open and 35 have been fixed so far. Some of the still happening issues: Ref Crashes Repro Title <1> 705 Yes WARNING in format_decode (3) https://syzkaller.appspot.com/bug?extid=e2c932aec5c8a6e1d31c <2> 26 Yes INFO: task hung in blk_trace_ioctl (4) https://syzkaller.appspot.com/bug?extid=ed812ed461471ab17a0c <3> 7 Yes WARNING in get_probe_ref https://syzkaller.appspot.com/bug?extid=8672dcb9d10011c0a160 <4> 6 Yes INFO: task hung in blk_trace_remove (2) https://syzkaller.appspot.com/bug?extid=2373f6be3e6de4f92562 <5> 5 Yes general protection fault in bpf_get_attach_cookie_tracing https://syzkaller.appspot.com/bug?extid=3ab78ff125b7979e45f9 --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. To disable reminders for individual bugs, reply with the following command: #syz set no-reminders To change bug's subsystems, reply with: #syz set subsystems: new-subsystem You may send multiple commands in a single email message.
Re: [PATCH v2] sched/rt: Clean up usage of rt_task()
On 2024-05-30 12:10:44 [+0100], Qais Yousef wrote: > > This is not consistent because IMHO the clock setup & slack should be > > handled equally. So I am asking the sched folks for a policy and I am > > leaning towards looking at task-policy in this case instead of prio > > because you shouldn't do anything that can delay. > > Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply > the slack in hrtimer_set_expires_range_ns() instead? We need to decide on a policy first. You don't want to add overhead on each invocation plus some in-kernel ask for delta. ->is_soft is not a good criteria. Sebastian
Re: [PATCH 10/20] function_graph: Have the instances use their own ftrace_ops for filtering
On Fri, 31 May 2024 12:12:41 +0900 Masami Hiramatsu (Google) wrote: > On Thu, 30 May 2024 22:30:57 -0400 > Steven Rostedt wrote: > > > On Fri, 24 May 2024 22:37:02 -0400 > > Steven Rostedt wrote: > > > > > From: "Steven Rostedt (VMware)" > > > > > > Allow for instances to have their own ftrace_ops part of the fgraph_ops > > > that makes the funtion_graph tracer filter on the set_ftrace_filter file > > > of the instance and not the top instance. > > > > > > Note that this also requires to update ftrace_graph_func() to call new > > > function_graph_enter_ops() instead of function_graph_enter() so that > > > it avoid pushing on shadow stack multiple times on the same function. > > > > So I found a major design flaw in this patch. > > > > > > > > Co-developed with Masami Hiramatsu: > > > Link: > > > https://lore.kernel.org/linux-trace-kernel/171509102088.162236.15758883237657317789.stgit@devnote2 > > > > > > Signed-off-by: Steven Rostedt (VMware) > > > Signed-off-by: Masami Hiramatsu (Google) > > > Signed-off-by: Steven Rostedt (Google) > > > --- > > > > > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c > > > index 8da0e66ca22d..998558cb8f15 100644 > > > --- a/arch/x86/kernel/ftrace.c > > > +++ b/arch/x86/kernel/ftrace.c > > > @@ -648,9 +648,24 @@ void ftrace_graph_func(unsigned long ip, unsigned > > > long parent_ip, > > > struct ftrace_ops *op, struct ftrace_regs *fregs) > > > { > > > struct pt_regs *regs = >regs; > > > - unsigned long *stack = (unsigned long *)kernel_stack_pointer(regs); > > > + unsigned long *parent = (unsigned long *)kernel_stack_pointer(regs); > > > + struct fgraph_ops *gops = container_of(op, struct fgraph_ops, ops); > > > + int bit; > > > + > > > + if (unlikely(ftrace_graph_is_dead())) > > > + return; > > > + > > > + if (unlikely(atomic_read(>tracing_graph_pause))) > > > + return; > > > > > > - prepare_ftrace_return(ip, (unsigned long *)stack, 0); > > > + bit = ftrace_test_recursion_trylock(ip, *parent); > > > + if (bit < 0) > > > + return; > > > + > > > + if (!function_graph_enter_ops(*parent, ip, 0, parent, gops)) > > > > So each registered graph ops has its own ftrace_ops which gets > > registered with ftrace, so this function does get called in a loop (by > > the ftrace iterator function). This means that we would need that code > > to detect the function_graph_enter_ops() getting called multiple times > > for the same function. This means each fgraph_ops gits its own retstack > > on the shadow stack. > > Ah, that is my concern and the reason why I added bitmap and stack reuse > code in the ftrace_push_return_trace(). > > > > > I find this a waste of shadow stack resources, and also complicates the > > code with having to deal with tail calls and all that. > > > > BUT! There's good news! I also thought about another way of handling > > this. I have something working, but requires a bit of rewriting the > > code. I should have something out in a day or two. > > Hmm, I just wonder why you don't reocver my bitmap check and stack > reusing code. Are there any problem on it? (Too complicated?) > I actually dislike the use of ftrace itself to do the loop. I rather have fgraph be in control of it. I've come up with a new "subops" assignment, where you can have one ftrace_ops represent multiple sub ftrace_ops. Basically, each fgraph ops can register its own ftrace_ops under a single graph_ops ftrace_ops. The graph_ops will be used to decide what functions call the callback, and then the callback does the multiplexing. This removes the need to touch the architecture code. It can also be used by fprobes to handle the attachments to functions for several different sets of callbacks. I'll send out patches soon. -- Steve