Re: [PATCH] tracing/fprobe: Support raw tracepoint events on modules

2024-05-31 Thread Google
On Sat, 1 Jun 2024 01:01:26 +0800
kernel test robot  wrote:

> Hi Masami,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v6.10-rc1 next-20240531]
> [cannot apply to rostedt-trace/for-next rostedt-trace/for-next-urgent]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:
> https://github.com/intel-lab-lkp/linux/commits/Masami-Hiramatsu-Google/tracing-fprobe-Support-raw-tracepoint-events-on-modules/20240531-175013
> base:   linus/master
> patch link:
> https://lore.kernel.org/r/171714888633.198965.13093663631481169611.stgit%40devnote2
> patch subject: [PATCH] tracing/fprobe: Support raw tracepoint events on 
> modules
> config: s390-defconfig 
> (https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/config)
> compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 
> bafda89a0944d947fc4b3b5663185e07a397ac30)
> reproduce (this is a W=1 build): 
> (https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202406010034.fsnp9rsq-...@intel.com/

OK, I must be tired... let me fix that. And I'll add a selftest for this case.

Thanks!

> 
> All errors (new ones prefixed by >>):
> 
>In file included from kernel/tracepoint.c:5:
>In file included from include/linux/module.h:19:
>In file included from include/linux/elf.h:6:
>In file included from arch/s390/include/asm/elf.h:173:
>In file included from arch/s390/include/asm/mmu_context.h:11:
>In file included from arch/s390/include/asm/pgalloc.h:18:
>In file included from include/linux/mm.h:2253:
>include/linux/vmstat.h:500:43: warning: arithmetic between different 
> enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
> [-Wenum-enum-conversion]
>  500 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
>  |~ ^
>  501 |item];
>  |
>include/linux/vmstat.h:507:43: warning: arithmetic between different 
> enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
> [-Wenum-enum-conversion]
>  507 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
>  |~ ^
>  508 |NR_VM_NUMA_EVENT_ITEMS +
>  |~~
>include/linux/vmstat.h:514:36: warning: arithmetic between different 
> enumeration types ('enum node_stat_item' and 'enum lru_list') 
> [-Wenum-enum-conversion]
>  514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
>  |   ~~~ ^ ~~~
>include/linux/vmstat.h:519:43: warning: arithmetic between different 
> enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
> [-Wenum-enum-conversion]
>  519 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
>  |~ ^
>  520 |NR_VM_NUMA_EVENT_ITEMS +
>  |~~
>include/linux/vmstat.h:528:43: warning: arithmetic between different 
> enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
> [-Wenum-enum-conversion]
>  528 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
>  |~ ^
>  529 |NR_VM_NUMA_EVENT_ITEMS +
>  |~~
> >> kernel/tracepoint.c:751:34: error: no member named 
> >> '__start___tracepoints_ptrs' in 'struct module'
>  751 | 
> for_each_tracepoint_range(mod->__start___tracepoints_ptrs,
>  |   ~~~  ^
>5 warnings and 1 error generated.
> 
> 
> vim +751 kernel/tracepoint.c
> 
>738
>739void for_each_module_tracepoint(void (*fct)(struct tracepoint 
> *tp, void *priv),
>740void *priv)
>741{
>742struct tp_module *tp_mod;
>743struct module *mod;
>
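
For context, struct module has no '__start___tracepoints_ptrs' member; under
CONFIG_TRACEPOINTS it instead carries 'tracepoints_ptrs' and 'num_tracepoints'.
A hedged sketch of the likely shape of the fix, extending the excerpt above
(illustrative only, not the actual follow-up patch):

void for_each_module_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
				void *priv)
{
	struct tp_module *tp_mod;
	struct module *mod;

	/* Walk the modules that registered tracepoints (CONFIG_MODULES). */
	mutex_lock(&tracepoint_module_list_mutex);
	list_for_each_entry(tp_mod, &tracepoint_module_list, list) {
		mod = tp_mod->mod;
		/* Use the per-module fields, not a non-existent section symbol. */
		for_each_tracepoint_range(mod->tracepoints_ptrs,
					  mod->tracepoints_ptrs + mod->num_tracepoints,
					  fct, priv);
	}
	mutex_unlock(&tracepoint_module_list_mutex);
}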

[PATCH] lib/test_kmod: add missing MODULE_DESCRIPTION() macro

2024-05-31 Thread Jeff Johnson
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/test_kmod.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson 
---
 lib/test_kmod.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/test_kmod.c b/lib/test_kmod.c
index 1eec3b7ac67c..064ed0fce75a 100644
--- a/lib/test_kmod.c
+++ b/lib/test_kmod.c
@@ -1223,4 +1223,5 @@ static void __exit test_kmod_exit(void)
 module_exit(test_kmod_exit);
 
 MODULE_AUTHOR("Luis R. Rodriguez ");
+MODULE_DESCRIPTION("kmod stress test driver");
 MODULE_LICENSE("GPL");

---
base-commit: b050496579632f86ee1ef7e7501906db579f3457
change-id: 20240531-md-lib-test_kmod-83bf2ee7e725




Re: [PATCH RFC 4/4] mm, page_alloc: add static key for should_fail_alloc_page()

2024-05-31 Thread Roman Gushchin
On Fri, May 31, 2024 at 11:33:35AM +0200, Vlastimil Babka wrote:
> Similarly to should_failslab(), remove the overhead of calling the
> noinline function should_fail_alloc_page() with a static key that guards
> the allocation hotpath callsite and is controlled by the fault and error
> injection frameworks.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Roman Gushchin 

Thanks!



Re: [PATCH RFC 3/4] mm, slab: add static key for should_failslab()

2024-05-31 Thread Roman Gushchin
On Fri, May 31, 2024 at 11:33:34AM +0200, Vlastimil Babka wrote:
> Since commit 4f6923fbb352 ("mm: make should_failslab always available for
> fault injection") should_failslab() is unconditionally a noinline
> function. This adds visible overhead to the slab allocation hotpath,
> even if the function is empty. With CONFIG_FAILSLAB=y there's additional
> overhead when the functionality is not enabled by a boot parameter or
> debugfs.
> 
> The overhead can be eliminated with a static key around the callsite.
> Fault injection and error injection frameworks can now be told that this
> function has an associated static key, and are able to enable and
> disable it accordingly.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Roman Gushchin 



Re: [PATCH RFC 0/4] static key support for error injection functions

2024-05-31 Thread Roman Gushchin
On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
> Incomplete, help needed from ftrace/kprobe and bpf folks.
> 
> As previously mentioned by myself [1] and others [2] the functions
> designed for error injection can bring visible overhead in fastpaths
> such as slab or page allocation, because even if nothing hooks into them
> at a given moment, they are noninline function calls regardless of
> CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
> always available for fault injection") and af3b854492f3
> ("mm/page_alloc.c: allow error injection").
> 
> Live patching their callsites has been also suggested in both [1] and
> [2] threads, and this is an attempt to do that with static keys that
> guard the call sites. When disabled, the error injection functions still
> exist and are noinline, but are not being called. Any of the existing
> mechanisms that can inject errors should make sure to enable the
> respective static key. I have added that support to some of them but
> need help with the others.

I think it's a clever idea and makes total sense!

> 
> - the legacy fault injection, i.e. CONFIG_FAILSLAB and
>   CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
>   address of the static key if it exists. The key will be activated if the
>   fault injection probability becomes non-zero, and deactivated in the
>   opposite transition. This also removes the overhead of the evaluation
>   (on top of the noninline function call) when these mechanisms are
>   configured in the kernel but unused at the moment.
> 
> - the generic error injection using kretprobes with
>   override_function_with_return is handled in patch 2. The
>   ALLOW_ERROR_INJECTION() annotation is extended so that static key
>   address can be passed, and the framework controls it when error
>   injection is enabled or disabled in debugfs for the function.
> 
> There are two more users I know of but am not familiar enough to fix up
> myself. I hope people that are more familiar can help me here.
> 
> - ftrace seems to be using override_function_with_return from
>   #define ftrace_override_function_with_return but I found no place
>   where the latter is used. I assume it might be hidden behind more
>   macro magic? But the point is if ftrace can be instructed to act like
>   an error injection, it would also have to use some form of metadata
>   (from patch 2 presumably?) to get to the static key and control it.
> 
>   If ftrace can only observe the function being called, maybe it
>   wouldn't be wrong to just observe nothing if the static key isn't
>   enabled because nobody is doing the fault injection?
> 
> - bpftrace, as can be seen from the example in commit 4f6923fbb352
>   description. I suppose bpf is already aware what functions the
>   currently loaded bpf programs hook into, so that it could look up the
>   static key and control it. Maybe using again the metadata from patch 2,
>   or extending its own, as I've noticed there's e.g. BTF_ID(func,
>   should_failslab)
> 
> Now I realize maybe handling this at the k(ret)probe level would be
> sufficient for all cases except the legacy fault injection from Patch 1?
> Also wanted to note that, AFAIU, using the static_key_slow_dec/inc
> API (as done in patches 1/2) should allow all mechanisms to coexist
> naturally without fighting each other on the static key state, and also
> handle the reference count for e.g. active probes or bpf programs if
> there's no similar internal mechanism.
> 
> Patches 3 and 4 implement the static keys for the two mm fault injection
> sites in slab and page allocators. For a quick demonstration I've run a
> VM and the simple test from [1] that stresses the slab allocator and got
> this time before the series:
> 
> real0m8.349s
> user0m0.694s
> sys 0m7.648s
> 
> with perf showing
> 
>0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page  
> 
> And after the series
> 
> real0m7.924s
> user0m0.727s
> sys 0m7.191s

Is "user" increase a measurement error or it's real?

Otherwise, nice savings!
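
For readers new to the pattern under discussion, a minimal sketch of a
static-key-guarded error-injection callsite (the names here are illustrative,
not the actual patches; DEFINE_STATIC_KEY_FALSE, static_branch_unlikely and
static_branch_inc are the real jump-label APIs):

#include <linux/jump_label.h>
#include <linux/gfp.h>

/* Hypothetical allocator, only here to make the sketch self-contained. */
extern void *do_real_alloc(gfp_t flags);

DEFINE_STATIC_KEY_FALSE(should_fail_example_active);

/* The noinline error-injection hook; does nothing unless an injector overrides it. */
noinline bool should_fail_example(gfp_t flags)
{
	return false;
}

static inline void *example_alloc(gfp_t flags)
{
	/*
	 * The hot path normally pays only for a patched-out jump; the hook is
	 * called only while some injection mechanism has enabled the key via
	 * static_branch_inc(&should_fail_example_active).
	 */
	if (static_branch_unlikely(&should_fail_example_active) &&
	    should_fail_example(flags))
		return NULL;

	return do_real_alloc(flags);
}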



Re: [PATCH 10/20] function_graph: Have the instances use their own ftrace_ops for filtering

2024-05-31 Thread Steven Rostedt
On Fri, 31 May 2024 23:50:23 +0900
Masami Hiramatsu (Google)  wrote:

> So is it similar to the fprobe/kprobe, using a shared single ftrace_ops,
> but keeping each fgraph's own hash table?

Sort of.

I created helper functions in ftrace that let you have a "manager
ftrace_ops" that will be assigned to ftrace (with the function that
will demultiplex), and then you can have per-user "subops" that can be
assigned to it. Here's a glimpse of the code:

/**
 * ftrace_startup_subops - enable tracing for subops of an ops
 * @ops: Manager ops (used to pick all the functions of its subops)
 * @subops: A new ops to add to @ops
 * @command: Extra commands to use to enable tracing
 *
 * The @ops is a manager @ops that has the filter that includes all the 
functions
 * that its list of subops are tracing. Adding a new @subops will add the
 * functions of @subops to @ops.
 */
int ftrace_startup_subops(struct ftrace_ops *ops, struct ftrace_ops *subops, 
int command)
{
struct ftrace_hash *filter_hash;
struct ftrace_hash *notrace_hash;
struct ftrace_hash *save_filter_hash;
struct ftrace_hash *save_notrace_hash;
int size_bits;
int ret;

if (unlikely(ftrace_disabled))
return -ENODEV;

ftrace_ops_init(ops);
ftrace_ops_init(subops);

/* Make everything canonical (Just in case!) */
if (!ops->func_hash->filter_hash)
ops->func_hash->filter_hash = EMPTY_HASH;
if (!ops->func_hash->notrace_hash)
ops->func_hash->notrace_hash = EMPTY_HASH;
if (!subops->func_hash->filter_hash)
subops->func_hash->filter_hash = EMPTY_HASH;
if (!subops->func_hash->notrace_hash)
subops->func_hash->notrace_hash = EMPTY_HASH;

/* For the first subops to ops just enable it normally */
if (list_empty(&ops->subop_list)) {
/* Just use the subops hashes */
filter_hash = copy_hash(subops->func_hash->filter_hash);
notrace_hash = copy_hash(subops->func_hash->notrace_hash);
if (!filter_hash || !notrace_hash) {
free_ftrace_hash(filter_hash);
free_ftrace_hash(notrace_hash);
return -ENOMEM;
}

save_filter_hash = ops->func_hash->filter_hash;
save_notrace_hash = ops->func_hash->notrace_hash;

ops->func_hash->filter_hash = filter_hash;
ops->func_hash->notrace_hash = notrace_hash;
list_add(&subops->list, &ops->subop_list);
ret = ftrace_startup(ops, command);
if (ret < 0) {
list_del(&subops->list);
ops->func_hash->filter_hash = save_filter_hash;
ops->func_hash->notrace_hash = save_notrace_hash;
free_ftrace_hash(filter_hash);
free_ftrace_hash(notrace_hash);
} else {
free_ftrace_hash(save_filter_hash);
free_ftrace_hash(save_notrace_hash);
subops->flags |= FTRACE_OPS_FL_ENABLED;
}
return ret;
}

/*
 * Here there's already something attached. Here are the rules:
 *   o If either filter_hash is empty then the final stays empty
 *  o Otherwise, the final is a superset of both hashes
 *   o If either notrace_hash is empty then the final stays empty
 *  o Otherwise, the final is an intersection between the hashes
 */
if (ops->func_hash->filter_hash == EMPTY_HASH ||
subops->func_hash->filter_hash == EMPTY_HASH) {
filter_hash = EMPTY_HASH;
} else {
size_bits = max(ops->func_hash->filter_hash->size_bits,
subops->func_hash->filter_hash->size_bits);
filter_hash = alloc_and_copy_ftrace_hash(size_bits, 
ops->func_hash->filter_hash);
if (!filter_hash)
return -ENOMEM;
ret = append_hash(&filter_hash, subops->func_hash->filter_hash);
if (ret < 0) {
free_ftrace_hash(filter_hash);
return ret;
}
}

if (ops->func_hash->notrace_hash == EMPTY_HASH ||
subops->func_hash->notrace_hash == EMPTY_HASH) {
notrace_hash = EMPTY_HASH;
} else {
size_bits = max(ops->func_hash->filter_hash->size_bits,
subops->func_hash->filter_hash->size_bits);
notrace_hash = alloc_ftrace_hash(size_bits);
if (!notrace_hash) {
free_ftrace_hash(filter_hash);
return -ENOMEM;
}

ret = intersect_hash(&notrace_hash, ops->func_hash->filter_hash,
   

[PATCH v14 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-05-31 Thread Haitao Huang
With different cgroups, the script starts one or multiple concurrent SGX
selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
test case, which loads an enclave of EPC size equal to the EPC capacity
available on the platform. The script checks results against the
expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) small - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) large - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) larger - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all processes exit.

The script also includes a test with low mem_cg limit and large sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg. For this test, it turns off swapping before start,
and turns swapping back on afterwards.

Add README to document how to run the tests.

Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 
---
V13:
- More improvement on handling error cases and style fixes.
- Add settings file for custom timeout

V12:
- Integrate the scripts to the "run_tests" target. (Jarkko)

V11:
- Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko)
- Drop support for cgroup v1 and simplify. (Michal, Jarkko)
- Add documentation for functions. (Jarkko)
- Turn off swapping before memcontrol tests and back on after
- Format and style fixes, name for hard coded values

V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
 tools/testing/selftests/sgx/Makefile  |   3 +-
 tools/testing/selftests/sgx/README| 109 +++
 tools/testing/selftests/sgx/ash_cgexec.sh |  16 +
 tools/testing/selftests/sgx/config|   4 +
 .../selftests/sgx/run_epc_cg_selftests.sh | 295 ++
 tools/testing/selftests/sgx/settings  |   2 +
 .../selftests/sgx/watch_misc_for_tests.sh |  11 +
 7 files changed, 439 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/sgx/README
 create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
 create mode 100644 tools/testing/selftests/sgx/config
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100644 tools/testing/selftests/sgx/settings
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/Makefile 
b/tools/testing/selftests/sgx/Makefile
index 867f88ce2570..739376af9e33 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none
 
 ifeq ($(CAN_BUILD_X86_64), 1)
 TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
-TEST_FILES := $(OUTPUT)/test_encl.elf
+TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh
+TEST_PROGS := run_epc_cg_selftests.sh
 
 all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
 endif
diff --git a/tools/testing/selftests/sgx/README 
b/tools/testing/selftests/sgx/README
new file mode 100644
index ..f84406bf29a4
--- /dev/null
+++ b/tools/testing/selftests/sgx/README
@@ -0,0 +1,109 @@
+SGX selftests
+
+The SGX selftests include a C program (test_sgx) that covers basic user-space
+facing APIs and a shell script (run_sgx_cg_selftests.sh) testing the SGX misc
+cgroup. The SGX cgroup test script requires root privileges and runs a
+specific test case of test_sgx in different cgroups configured by the
+script. More details about the cgroup test can be found below.
+
+All SGX selftests can run with or without kselftest framework.
+
+WITH KSELFTEST FRAMEWORK
+===
+
+BUILD
+-
+
+Build executable file "test_sgx" from top level directory of the kernel source:
+ $ make -C tools/testing/selftests TARGETS=sgx
+
+RUN
+---
+
+Run all sgx tests as sudo or root since the cgroup tests need to configure 
cgroup
+limits in files under /sys/fs/cgroup.
+
+ $ sudo make -C tools/testing/selftests/sgx run_tests
+
+Without sudo, SGX cgroup tests will be skipped.
+
+On platforms with a large Enclave Page Cache (EPC) and/or fewer CPU cores, you
+may need to adjust the timeout in 'settings' to avoid timeouts.
+
+More details about kselftest framework can be found in
+Documentation/dev-tools/kselftest.rst.
+
+WITHOUT KSELFTEST FRAMEWORK
+===
+
+BUILD
+-
+
+Build executable file "test_sgx" from this
+directory (tools/testing/selftests/sgx/):
+
+  $ make
+
+RUN
+---
+
+Run all 

[PATCH v14 13/14] Docs/x86/sgx: Add description for cgroup support

2024-05-31 Thread Haitao Huang
From: Sean Christopherson 

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Cc: Sean Christopherson 
Reviewed-by: Bagas Sanjaya 
Reviewed-by: Jarkko Sakkinen 
Acked-by: Kai Huang 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call.  It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. 
must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running.  The EPC is managed by the SGX subsystem 
and
+is not accounted by the memory controller.  Note that this is true only for EPC
+memory itself, i.e.  normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing 
store
+(normal system memory allocated via shmem).  The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+---
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+  misc.capacity
+A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+resource will show the total amount of EPC memory available on the
+platform.
+
+  misc.current
+A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+resource will show the current active EPC memory usage of the cgroup 
and
+its descendants. EPC pages that are swapped out to backing RAM are not
+included in the current count.
+
+  misc.max
+A read-write single value file which exists on non-root cgroups. The
+sgx_epc resource will show the EPC usage hard limit. The default is
+"max".
+
+If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+page fault handling, will be blocked until EPC can be reclaimed from 
the
+cgroup. If there are no pages left that are reclaimable within the same
+group, the kernel returns ENOMEM.
+
+The EPC pages allocated for a guest VM by the virtual EPC driver are 
not
+reclaimable by the host kernel. In case the guest cgroup's limit is
+reached and no reclaimable pages left in the same cgroup, the virtual
+EPC driver returns SIGBUS to the user space process to indicate failure
+on new EPC allocation requests.
+
+The misc.max limit is non-preemptive. If a user writes a limit lower
+than the current usage to this file, the cgroup will not preemptively
+deallocate pages currently in use, and will only start blocking the 
next
+allocation and reclaiming EPC at that time.
+
+  misc.events
+A read-only flat-keyed file which exists on non-root 

[PATCH v14 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_epc_page_lru() always
returns reference to the global LRU.

Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
given EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., the sum of cgroup
limits greater than the total capacity, cgroups may never reclaim but
the total usage can still be near the capacity. Therefore a global
reclamation is still needed in those cases and it should be performed
from the root cgroup.

Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(), return
the next cgroup so callers can use it as the new starting node for next
round of reclamation if needed.

Also update sgx_can_reclaim_global(), to check emptiness of LRUs of all
cgroups when EPC cgroup is enabled, otherwise only check the global LRU.

Finally, change sgx_reclaim_direct(), to check and ensure there are free
pages at cgroup level so forward progress can be made by the caller.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.
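
A hedged sketch of the reworked emptiness check described above (misc_cg_root()
is an assumed accessor for the root misc cgroup; sgx_cgroup_lru_empty() is the
helper exported by this patch; illustrative only, not the actual patch):

static bool sgx_can_reclaim_global(void)
{
	/*
	 * With per-cgroup LRUs there is no single global list to test;
	 * instead ask whether any LRU in the whole EPC cgroup tree has pages.
	 */
	if (IS_ENABLED(CONFIG_CGROUP_MISC))
		return !sgx_cgroup_lru_empty(misc_cg_root());

	return !list_empty(&sgx_global_lru.reclaimable);
}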

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V14:
- Update global reclamation to use the new sgx_cgroup_reclaim_pages() to
iterate cgroups at lower level if the top cgroups are too busy.

V13:
- Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC, conditional compile SGX Cgroup for
CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Reword the comments for global reclamation for allocation failure
after passing cgroup charging. (Kai)
- Add stub functions to remove ifdefs in c file (Kai)
- Add more detailed comments to clarify each page belongs to one cgroup, or the
root. (Kai)

V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  8 +--
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 32 +++
 arch/x86/kernel/cpu/sgx/main.c   | 80 +---
 3 files changed, 109 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 7e0e9a4fedf0..2702bbd49c60 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -67,7 +67,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct 
sgx_cgroup *sgx_cg)
  *
  * Return: %true if all cgroups under the specified root have empty LRU lists.
  */
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
 {
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -127,8 +127,8 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
  * release the reference if the returned is not used as %start for a subsequent
  * call.
  */
-static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct 
misc_cg *start,
-   struct mm_struct *charge_mm)
+struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg 
*start,
+struct mm_struct *charge_mm)
 {
struct cgroup_subsys_state *css_root, *pos;
struct cgroup_subsys_state *next = NULL;
@@ -181,7 +181,7 @@ static struct misc_cg *sgx_cgroup_reclaim_pages(struct 
misc_cg *root, struct mis
  * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
  * cgroup.
  */
-static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
 {
u64 cur, max;
 
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 2044e0d64076..c6940bb8ec3b 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -13,6 +13,11 @@
 #define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
 struct sgx_cgroup;
 
+static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg)
+{
+   return NULL;
+}
+
 static inline struct sgx_cgroup *sgx_get_current_cg(void)
 {
return NULL;
@@ -27,8 +32,25 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup 
*sgx_cg, enum sgx_recl
 
 

[PATCH v14 11/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-05-31 Thread Haitao Huang
Enclave Page Cache (EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(). But it considers all contexts other than the
ksgxd thread are user processes. With the new EPC cgroup implementation,
the swapping can also happen in EPC cgroup work-queue threads. In those
cases, it improperly selects the root mem_cgroup to charge for the RAM
usage.

Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
an additional argument to explicitly specify the mm struct to charge for
allocations. Callers from background kthreads not associated with a
charging mm struct would set it to NULL, while callers in user process
contexts set it to current->mm.

Internally, it handles the case when the charging mm given is NULL, by
searching for an mm struct from enclave's mm_list.

Signed-off-by: Haitao Huang 
Reported-by: Mikko Ylinen 
Reviewed-by: Kai Huang 
Reviewed-by: Jarkko Sakkinen 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V10:
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)

V9:
- Reduce number of if statements. (Tim)

V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c   | 29 ++--
 arch/x86/kernel/cpu/sgx/encl.h   |  3 +--
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 11 +++
 arch/x86/kernel/cpu/sgx/main.c   | 29 +---
 arch/x86/kernel/cpu/sgx/sgx.h|  2 +-
 5 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f474179b6f77..7b77dad41daf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, 
unsigned long page_inde
 }
 
 /*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave.
+ *
+ * Used in sgx_encl_alloc_backing() for backing store allocation.
+ *
+ * Return the mem_cgroup of the given charge_mm. Otherwise return the 
mem_cgroup
+ * of a struct mm stored in the enclave's mm_list.
  */
-static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
+ struct mm_struct *charge_mm)
 {
struct mem_cgroup *memcg = NULL;
struct sgx_encl_mm *encl_mm;
int idx;
 
-   /*
-* If called from normal task context, return the mem_cgroup
-* of the current task's mm. The remainder of the handling is for
-* ksgxd.
-*/
-   if (!current_is_ksgxd())
-   return get_mem_cgroup_from_mm(current->mm);
+/* Use the charge_mm if given. */
+   if (charge_mm)
+   return get_mem_cgroup_from_mm(charge_mm);
 
/*
 * Search the enclave's mm_list to find an mm associated with
@@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct 
sgx_encl *encl)
  * @encl:  an enclave pointer
  * @page_index:enclave page index
  * @backing:   data for accessing backing storage for the page
+ * @charge_mm: the mm to charge for the allocation
  *
- * When called from ksgxd, sets the active memcg from one of the
+ * When charge_mm is NULL, sets the active memcg from one of the
  * mms in the enclave's mm_list prior to any backing page allocation,
  * in order to ensure that shmem page allocations are charged to the
  * enclave.  Create a backing page for loading data back into an EPC page with
@@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct 
sgx_encl *encl)
  *   -errno otherwise.
  */
 int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
-  struct sgx_backing *backing)
+  struct sgx_backing *backing, struct mm_struct 
*charge_mm)
 {
-   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, 
charge_mm);
struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
int ret;
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fe15ade02ca1..5ce9d108290f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, 
unsigned long addr,
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 unsigned long end, unsigned long vm_flags);
 
-bool current_is_ksgxd(void);
 void sgx_encl_release(struct kref *ref);
 int sgx_encl_mm_add(struct sgx_encl *encl, struct 

[PATCH v14 10/14] x86/sgx: Implement async reclamation for cgroup

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

In cases where EPC pages need to be allocated during a page fault and the
cgroup usage is near its limit, an asynchronous reclamation needs to be triggered
to avoid blocking the page fault handling.

Create a workqueue, corresponding work item and function definitions
for EPC cgroup to support the asynchronous reclamation.

In sgx_cgroup_try_charge(), if caller does not allow synchronous
reclamation, queue an asynchronous work into the workqueue.

Reclaiming only when the usage is at or very close to the limit would
cause thrashing. To avoid that, before returning from
sgx_cgroup_try_charge(), check the need for reclamation (usage too close
to the limit), queue an async work if needed, similar to how the global
reclaimer wakes up its reclaiming thread after each allocation in
sgx_alloc_epc_pages().
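
A hedged sketch of the charge-path behaviour described above (SGX_NO_RECLAIM
and the sgx_cg->reclaim_work field are assumed names; misc_cg_try_charge(),
sgx_cg_wq, sgx_cgroup_should_reclaim() and sgx_cgroup_reclaim_pages() appear
elsewhere in the series; illustrative only, not the actual patch):

static int sgx_cgroup_try_charge_sketch(struct sgx_cgroup *sgx_cg,
					enum sgx_reclaim reclaim)
{
	for (;;) {
		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg,
					PAGE_SIZE))
			break;

		if (reclaim == SGX_NO_RECLAIM) {
			/* Page fault path: reclaim in the background instead. */
			queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
			return -EBUSY;
		}

		/* Synchronous path: reclaim from this subtree, then retry. */
		sgx_cgroup_reclaim_pages(sgx_cg->cg, NULL, current->mm);
	}

	/* Charged, but close to the limit: let the async worker get ahead. */
	if (sgx_cgroup_should_reclaim(sgx_cg))
		queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);

	return 0;
}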

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 
---
V13:
- Revert to BUG_ON() in case of workq allocation failure in init and
only alloc if misc is enabled.

V11:
- Print error instead of WARN (Kai)
- Add check for need to queue an async reclamation before returning from
try_charge(), do so if needed. This is to be consistent with global
reclaimer to minimize thrashing during allocation time.

V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fail during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 136 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   1 +
 2 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 4406077acd1c..9e366068e0cd 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -4,9 +4,63 @@
 #include
 #include "epc_cgroup.h"
 
+/*
+ * The minimal free pages maintained by per-cgroup reclaimer
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE   (SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the 
minimal
+ * free pages would barely leave any page for use, causing excessive 
reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75%
+ * pages used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT   (SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is 
not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+   return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / 
PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+   return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+   struct misc_cg *i = sgx_cg->cg;
+   u64 m = U64_MAX;
+
+   while (i) {
+   m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+   i = misc_cg_parent(i);
+   }
+
+   return m / PAGE_SIZE;
+}
+
 /**
  * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
  * @root:  Root of the tree to check
@@ -113,6 +167,65 @@ static struct misc_cg *sgx_cgroup_reclaim_pages(struct 
misc_cg *root, struct mis
return css_misc(next);
 }
 
+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a 
cgroup
+ * @sgx_cg: The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() 
where
+ * the minimal number of free page needs be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if number of free pages available for the cgroup below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+   u64 cur, 

[PATCH v14 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching limit.  This is not ideal. When that happens, a better way is
to reclaim EPC page(s) from the current EPC cgroup (and/or its
descendants) to reduce its usage so the new allocation can succeed.

Add the basic building blocks to support per-cgroup reclamation.

Currently the kernel only has one place to reclaim EPC pages: the global
EPC LRU list.  To support the "per-cgroup" EPC reclaim, maintain an LRU
list for each EPC cgroup, and introduce a "cgroup" variant function to
reclaim EPC pages from a given EPC cgroup and its descendants.

Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
pages.  Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
pages from the global LRU, and then tries to reclaim these pages at once
for better performance.

Implement the "cgroup" variant EPC reclaim in a similar way, but keep
the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
as input, and return the pages that are "scanned" and attempted for
reclamation (but not necessarily reclaimed successfully); 2) loop the
given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".

This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
cgroup, and only moves to its descendants when there are not enough
reclaimable EPC pages to "scan" in its LRU.  It should be enough for
most cases. In other cases, the caller may invoke this function in a
loop to ensure enough pages reclaimed for its usage. To ensure all
descendant groups scanned in a round-robin fashion in those cases,
sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the
next cgroup that the caller can pass in as the new starting cgroup for a
subsequent call.

Note, this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in batch
of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
reclaimable pages, the actual reclaim of EPC pages will be split into
smaller batches _across_ multiple LRUs with each being smaller than
SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch.  But this is unnecessarily complicated at
this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to
return the actual "reclaimed" pages, but not "scanned" pages. However,
the reclamation is a lengthy process, forcing a successful reclamation
of a predetermined number of pages may block the caller for too long. And
that may not be acceptable in some synchronous contexts, e.g., in
serving an ioctl().

With this building block in place, add synchronous reclamation support
in sgx_cgroup_try_charge(): trigger a call to
sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
caller allows synchronous reclaim as indicated by a newly added
parameter.

A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().

Note all reclaimable EPC pages are still tracked in the global LRU thus
no per-cgroup reclamation is actually active at the moment. Per-cgroup
tracking and reclamation will be turned on in the end after all
necessary infrastructure is in place.
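
A hedged sketch of the tree walk described above (the per-cgroup 'lru' field
and sgx_cgroup_next_descendant() are stand-ins for the series' css-based
pre-order walk; sgx_reclaim_pages() is the reworked helper that returns the
number of pages scanned; illustrative only, not the actual patch):

static struct sgx_cgroup *sgx_cgroup_reclaim_sketch(struct sgx_cgroup *root,
						    struct sgx_cgroup *start)
{
	struct sgx_cgroup *pos = start ? start : root;
	unsigned int scanned = 0;

	while (pos && scanned < SGX_NR_TO_SCAN) {
		/* Scan/reclaim from this cgroup's own LRU first. */
		scanned += sgx_reclaim_pages(&pos->lru);

		/* Not enough scanned here: move to the next descendant of @root. */
		pos = sgx_cgroup_next_descendant(root, pos);
	}

	/* The caller passes this back as @start for round-robin fairness. */
	return pos;
}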

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V14:
- Allow sgx_cgroup_reclaim_pages() to continue from previous tree-walk.
It takes in a 'start' node and returns the 'next' node for the caller to
use as the new 'start'. This is to ensure pages in lower level cgroups
can be reclaimed if all pages in upper level nodes are "too young".
(Kai)
- Move renaming sgx_should_reclaim() to sgx_should_reclaim_global() from
a later patch to this one. (Kai)

V11:
- Use commit message suggested by Kai
- Remove "usage" comments for functions. (Kai)

V10:
- Simplify the signature by removing a pointer to nr_to_scan (Kai)
- Return pages attempted instead of reclaimed as it is really what the
cgroup caller needs to track progress. This further simplifies the design.
- Merge patch for exposing sgx_reclaim_pages() with basic synchronous
reclamation. (Kai)
- Shorten names for EPC cgroup functions. (Jarkko)
- Fix/add comments to justify the design (Kai)
- Separate out a helper for for addressing single iteration of the loop
in sgx_cgroup_try_charge(). (Jarkko)

V9:
- Add comments for static 

[PATCH v14 09/14] x86/sgx: Abstract check for global reclaimable pages

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

For the global reclaimer to determine if any page is available for
reclamation at the global level, it currently only checks for emptiness
of the global LRU. That will be inadequate when pages are tracked in
multiple LRUs, one per cgroup. For this purpose, create a new helper,
sgx_can_reclaim_global(), to abstract this check. Currently it only
checks the global LRU, later will check emptiness of LRUs of all cgroups
when per-cgroup tracking is turned on.

Replace all the checks for emptiness of the global LRU,
list_empty(_global_lru.reclaimable), with calls to
sgx_can_reclaim_global().

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Rename sgx_can_reclaim() to sgx_can_reclaim_global() and
sgx_should_reclaim() to sgx_should_reclaim_global(). (Kai)

V10:
- Add comments for the new function. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 71e84937bc17..0a06b80e26b8 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,14 @@ static inline struct sgx_epc_lru_list 
*sgx_epc_page_lru(struct sgx_epc_page *epc
return &sgx_global_lru;
 }
 
+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim_global(void)
+{
+   return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -390,7 +398,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
 static bool sgx_should_reclaim_global(unsigned long watermark)
 {
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_global_lru.reclaimable);
+   sgx_can_reclaim_global();
 }
 
 static void sgx_reclaim_pages_global(void)
@@ -596,7 +604,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum 
sgx_reclaim reclaim)
break;
}
 
-   if (list_empty(&sgx_global_lru.reclaimable)) {
+   if (!sgx_can_reclaim_global()) {
page = ERR_PTR(-ENOMEM);
break;
}
-- 
2.25.1




[PATCH v14 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX driver tracks reclaimable EPC pages by adding a newly allocated
page into the global LRU list in sgx_mark_page_reclaimable(), and doing
the opposite in sgx_unmark_page_reclaimable().

To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and each newly allocated EPC page will need to be
added to the LRU of associated cgroup, not always the global LRU list.

When sgx_mark_page_reclaimable() is called, the cgroup that the newly
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.

Add a helper, sgx_epc_page_lru(), to return the LRU that the EPC page
should be added to for the given EPC page.  Currently it just returns
the global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the
helper function to get the LRU from the EPC page instead of referring to
the global LRU directly.

This allows EPC pages to be tracked in a "per-cgroup" LRU when
that becomes ready.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Revise commit log (Kai)
- Rename sgx_lru_list() to sgx_epc_page_lru()

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 75d6ce72b9b8..1fa73250e2b9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page 
*epc_page)
+{
+   return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a 
LRU.
  * @page:  EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+   spin_lock(&lru->lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-   spin_unlock(&sgx_global_lru.lock);
+   list_add_tail(&page->list, &lru->reclaimable);
+   spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:  EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+   spin_lock(&lru->lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
return -EBUSY;
}
 
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
 
return 0;
 }
-- 
2.25.1




[PATCH v14 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2024-05-31 Thread Haitao Huang
From: Sean Christopherson 

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.

Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +-
 arch/x86/kernel/cpu/sgx/sgx.h  | 15 +
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d7a945d44268..75d6ce72b9b8 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-   if (list_empty(&sgx_active_page_list))
+   epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+   struct sgx_epc_page, list);
+   if (!epc_page)
break;
 
-   epc_page = list_first_entry(&sgx_active_page_list,
-   struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
 
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
 */
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_unlock(&sgx_global_lru.lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
 
 skip:
-   spin_lock(&sgx_reclaimer_lock);
-   list_add_tail(&epc_page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
+   list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);
 
kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_active_page_list);
+  !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
ksgxd_tsk = tsk;
 
+   sgx_lru_init(&sgx_global_lru);
+
return true;
 }
 
@@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
 
 

[PATCH v14 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
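
A hedged sketch of how the allocation path consumes the charge/uncharge API
described above (sgx_get_current_cg() is named in the change log below;
sgx_put_cg() and sgx_epc_page_set_cgroup() are assumed names, and the exact
flow is illustrative, not the actual patch):

static struct sgx_epc_page *sgx_alloc_epc_page_sketch(void *owner)
{
	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
	struct sgx_epc_page *page;

	/* Block the allocation if it would push usage past misc.max. */
	if (sgx_cgroup_try_charge(sgx_cg)) {
		sgx_put_cg(sgx_cg);
		return ERR_PTR(-ENOMEM);
	}

	page = __sgx_alloc_epc_page();
	if (IS_ERR(page)) {
		sgx_cgroup_uncharge(sgx_cg);
		sgx_put_cg(sgx_cg);
		return page;
	}

	/*
	 * Remember the owning cgroup so freeing the page can uncharge it and
	 * later patches can track the page in that cgroup's LRU.
	 */
	sgx_epc_page_set_cgroup(page, sgx_cg);
	page->owner = owner;

	return page;
}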

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Remove unneeded includes. (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation
conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Update copyright and format better (Kai)
- Create wrappers to remove #ifdefs in c file. (Kai)
- Remove unneeded comments (Kai)

V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. 
(Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)

V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch "Limit process EPC usage with misc
cgroup controller" and restructure it (Kai)
---
 arch/x86/kernel/cpu/sgx/Makefile |  1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 71 +++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 72 
 arch/x86/kernel/cpu/sgx/main.c   | 42 +++-
 arch/x86/kernel/cpu/sgx/sgx.h| 21 
 include/linux/misc_cgroup.h  |  2 +
 6 files changed, 207 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..081cb424575e 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
 obj-$(CONFIG_X86_SGX_KVM)  += virt.o
+obj-$(CONFIG_CGROUP_MISC)  += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index ..5c484fd10160
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg:The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+   return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg:The charged sgx cgroup.
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+   misc_cg_uncharge(MISC_CG_RES_SGX_EPC, 

[PATCH v14 04/14] cgroup/misc: Add SGX EPC resource type

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V12:
- Remove CONFIG_CGROUP_SGX_EPC (Jarkko)

V6:
- Split the original patch into this and the preceding one (Kai)
---
 include/linux/misc_cgroup.h | 4 
 kernel/cgroup/misc.c| 4 
 2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 541a5611c597..440ed2bb8053 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_X86_SGX
+   /* SGX EPC memory resource */
+   MISC_CG_RES_SGX_EPC,
 #endif
MISC_CG_RES_TYPES
 };
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 7d852139121a..863f9147602b 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
 #endif
+#ifdef CONFIG_X86_SGX
+   /* Intel SGX EPC memory bytes */
+   "sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
-- 
2.25.1




[PATCH v14 02/14] cgroup/misc: Add per resource callbacks for CSS events

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.

This design passes a struct misc_cg into the callbacks. An alternative
of passing only a struct misc_res was considered, to encapsulate how
misc_cg stores its misc_res array. However, the SGX driver would still
need to access the misc_res array in the statically defined misc root_cg
during initialization to set resource specific fields. Therefore, some
extra getter/setter APIs to abstract such access to the misc_res array
within the misc_cg struct would be needed. That seems like overkill at
the moment and is deferred to a later improvement when there are more
users of these callbacks.
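
As a rough illustration of how a resource provider is expected to hook in
(the sgx_cgroup_* callback names and bodies are placeholders, and
MISC_CG_RES_SGX_EPC is only added later in the series; struct misc_res_ops
and misc_cg_set_ops() come from this patch):

static int sgx_cgroup_alloc(struct misc_cg *cg)
{
	/* Allocate and attach per-cgroup EPC state for this resource. */
	return 0;
}

static void sgx_cgroup_free(struct misc_cg *cg)
{
	/* Release whatever sgx_cgroup_alloc() set up. */
}

static const struct misc_res_ops sgx_cgroup_ops = {
	.alloc = sgx_cgroup_alloc,
	.free = sgx_cgroup_free,
};

static int __init sgx_cgroup_init(void)
{
	return misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
}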

Link: 
https://lore.kernel.org/lkml/op.2kdw36otwjv...@hhuan26-mobl.amr.corp.intel.com/

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V12:
- Add comments in commit to clarify reason to pass in misc_cg, not
misc_res. (Kai)
- Remove unlikely (Kai)

V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)

V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of 
required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
 include/linux/misc_cgroup.h | 11 +
 kernel/cgroup/misc.c| 82 +
 2 files changed, 86 insertions(+), 7 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;
 
 #include 
 
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is 
allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+};
+
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {
 
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 
amount);
 void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..4a602d68cf7d 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
  */
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
  * @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 
capacity)
 }
 EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
 
+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required 
callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+   if (!valid_type(type))
+   return -EINVAL;
+
+   if (!ops->alloc) {
+   pr_err("%s: alloc missing\n", __func__);
+   return -EINVAL;
+   }
+
+   if (!ops->free) {
+   pr_err("%s: free missing\n", __func__);
+   return -EINVAL;
+   }
+
+   misc_res_ops[type] = ops;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
 /**
  * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
  * @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
 };
 
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+   enum misc_res_type i;
+   int ret;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+   

[PATCH v14 03/14] cgroup/misc: Export APIs for SGX driver

2024-05-31 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its or ancestor's limit. This requires a walk from the current cgroup up
to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to
enable this walk.

The SGX driver also needs to start a global-level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.
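
For illustration, the kind of walk these exports enable (the limit check and
reclaim helper below are placeholders; only misc_cg_root() and
misc_cg_parent() come from this patch):

static void sgx_cgroup_reclaim_hierarchy(struct misc_cg *cg)
{
	struct misc_cg *i;

	/* Walk from the charged cgroup up to the root, as misc_cg_try_charge() does. */
	for (i = cg; i; i = misc_cg_parent(i)) {
		if (epc_usage_over_limit(i))	/* placeholder check */
			reclaim_epc_pages(i);	/* placeholder reclaim */
	}
}

/* A global reclamation pass operates on misc_cg_root() directly. */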

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
 include/linux/misc_cgroup.h | 24 
 kernel/cgroup/misc.c| 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct 
cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+   return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+   return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 4a602d68cf7d..7d852139121a 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+   return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!amount)
return 0;
 
-   for (i = cg; i; i = parent_misc(i)) {
+   for (i = cg; i; i = misc_cg_parent(i)) {
res = &i->res[type];
 
new_usage = atomic64_add_return(amount, &res->usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
return 0;
 
 err_charge:
-   for (j = i; j; j = parent_misc(j)) {
+   for (j = i; j; j = misc_cg_parent(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}
 
-   for (j = cg; j != i; j = parent_misc(j))
+   for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
 
-   for (i = cg; i; i = parent_misc(i))
+   for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1




[PATCH v14 01/14] x86/sgx: Replace boolean parameters with enums

2024-05-31 Thread Haitao Huang
Replace the boolean 'reclaim' parameter in the function
sgx_alloc_epc_page() and its callers with an enum.

Also opportunistically remove the non-static declaration of
__sgx_alloc_epc_page() and fix a typo.
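
For reference, a sketch of the enum this introduces (its exact definition is
not visible in the quoted hunks, but both values are used throughout them):

enum sgx_reclaim {
	SGX_NO_RECLAIM,	/* fail the allocation if no free EPC page is available */
	SGX_DO_RECLAIM,	/* reclaim EPC pages directly if needed */
};

Call sites therefore change from, e.g., sgx_alloc_epc_page(encl_page, false)
to sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM), as the hunks below show.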

Signed-off-by: Haitao Huang 
Suggested-by: Jarkko Sakkinen 
Suggested-by: Dave Hansen 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 12 ++--
 arch/x86/kernel/cpu/sgx/encl.h  |  4 ++--
 arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
 arch/x86/kernel/cpu/sgx/main.c  | 14 +++---
 arch/x86/kernel/cpu/sgx/sgx.h   | 13 +++--
 arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
 6 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..f474179b6f77 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct 
sgx_encl_page *encl_page,
struct sgx_epc_page *epc_page;
int ret;
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page))
return epc_page;
 
@@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct 
vm_area_struct *vma,
goto err_out_unlock;
}
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page)) {
if (PTR_ERR(epc_page) == -EBUSY)
vmret =  VM_FAULT_NOPAGE;
goto err_out_unlock;
}
 
-   va_page = sgx_encl_grow(encl, false);
+   va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM);
if (IS_ERR(va_page)) {
if (PTR_ERR(va_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
@@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
- * @reclaim: Reclaim EPC pages directly if none available. Enclave
- *   mutex should not be held if this is set.
+ * @reclaim: Whether reclaim EPC pages directly if none available. Enclave
+ *   mutex should not be held for SGX_DO_RECLAIM.
  *
  * Allocate a free EPC page and convert it to a Version Array (VA) page.
  *
@@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim)
 {
struct sgx_epc_page *epc_page;
int ret;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fe15ade02ca1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl 
*encl,
  unsigned long offset,
  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
 void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 unsigned long addr);
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..793a0ba2cb16 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -17,7 +17,7 @@
 #include "encl.h"
 #include "encls.h"
 
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim)
 {
struct sgx_va_page *va_page = NULL;
void *err;
@@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
struct file *backing;
long ret;
 
-   va_page = sgx_encl_grow(encl, true);
+   va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
if (IS_ERR(va_page))
return PTR_ERR(va_page);
else if (va_page)
@@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
 
encl->backing = backing;
 
-   secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+   secs_epc = 

[PATCH v14 00/14] Add Cgroup support for SGX EPC memory

2024-05-31 Thread Haitao Huang
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod-level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].
 
This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
 
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which is normal system memory allocated via shmem and accounted for by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.
 
For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design. 
 
Track EPC pages in per-cgroup LRUs (from Dave)
--
 
tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes to the existing LRU scanning algorithm, or adding new LRUs.
The result: add a new LRU for each cgroup and scan those instead. Replace
the existing global LRU with the root cgroup's LRU (only when this new
support is compiled in, obviously).
 
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
 
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
 
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
 
Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
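
A sketch of the LRU abstraction this converges on, inferred from the hunks at
the top of this digest (the lock/reclaimable field names and sgx_lru_init()
appear there; the struct name itself is an assumption):

struct sgx_epc_lru_list {
	spinlock_t lock;
	struct list_head reclaimable;
};

static void sgx_lru_init(struct sgx_epc_lru_list *lru)
{
	spin_lock_init(&lru->lock);
	INIT_LIST_HEAD(&lru->reclaimable);
}

/* One instance backs the global (root) list; each cgroup then gets its own. */
static struct sgx_epc_lru_list sgx_global_lru;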
 
Unreclaimable Enclave Pages
---
 
There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., the Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata of reclaimed pages and
enclaves, which makes reclamation of such pages more intricate to manage.
The SGX driver's global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach: it does
not track those pages in per-cgroup LRUs and considers them unreclaimable.
The allocation of these pages is counted towards the usage of a specific
cgroup and is subject to the cgroup's set EPC limits.
 
Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for the misc-controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed,
and the limit is only enforced at the time of a new EPC allocation request.
When a cgroup hits its limit but nothing is left in the LRUs of the subtree,
i.e., there is nothing to reclaim in the cgroup, any new attempt to allocate
EPC within that cgroup will result in an 'ENOMEM'.
 
Unreclaimable Guest VM EPC Pages

 
The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable left within the cgroup. The virtual EPC driver 

Re: [PATCH] mctp i2c: Add rx trace

2024-05-31 Thread kernel test robot
Hi Tal,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on horms-ipvs/master v6.10-rc1 next-20240531]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Tal-Yacobi/mctp-i2c-Add-rx-trace/20240528-223555
base:   linus/master
patch link:
https://lore.kernel.org/r/20240528143420.742611-1-talycb8%40gmail.com
patch subject: [PATCH] mctp i2c: Add rx trace
config: hexagon-randconfig-r052-20240531 
(https://download.01.org/0day-ci/archive/20240601/202406010530.skassbs4-...@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 
bafda89a0944d947fc4b3b5663185e07a397ac30)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240601/202406010530.skassbs4-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202406010530.skassbs4-...@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

WARNING: modpost: missing MODULE_DESCRIPTION() in vmlinux.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/locking/locktorture.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcutorture.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcuscale.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/refscale.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp855.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp857.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp863.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp864.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp865.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp869.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_cp936.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_ascii.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-5.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-7.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-9.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/nls_iso8859-15.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-celtic.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-inuit.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nls/mac-romanian.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/jbd2/jbd2.o
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/fat/fat.o
WARNING: modpost: missing MODULE_DESCRIPTION() in crypto/xor.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/math/rational.o
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/crypto/libarc4.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
lib/zlib_inflate/zlib_inflate.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
lib/zlib_deflate/zlib_deflate.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/video/backlight/rt4831-backlight.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/video/fbdev/vfb.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/regulator/rt4831-regulator.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/reset/hisilicon/hi6220_reset.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/tty/goldfish.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/lp.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/ppdev.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpu/drm/gud/gud.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/gpu/drm/drm_panel_orientation_quirks.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/gpu/drm/udl/udl.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/base/regmap/regmap-spmi.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/base/regmap/regmap-w1.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/mfd/pcf50633-gpio.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/mfd/rt4831.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/dax/dax.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/host/ohci-exynos.o
WARNING: modpost: missing MODULE_DESCRIPTION() in 
drivers/usb/host/xhci-pci-renesas.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/serial/usb_debug.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/usb/serial/navman.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/input/matrix-keymap.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/i2c/busses/i2c-qup.o
WARNING: modpost: missing MODULE_DESCRIPTION() in driv

Re: [PATCH 0/2] livepatch: Add compiler optimization disclaimer/docs

2024-05-31 Thread Joe Lawrence
On 5/31/24 07:23, Miroslav Benes wrote:
> Hi,
> 
> On Tue, 21 Jul 2020, Joe Lawrence wrote:
> 
>> In light of [PATCH] Revert "kbuild: use -flive-patching when
>> CONFIG_LIVEPATCH is enabled" [1], we should add some loud disclaimers
>> and explanation of the impact compiler optimizations have on
>> livepatching.
>>
>> The first commit provides detailed explanations and examples.  The list
>> was taken mostly from Miroslav's LPC talk a few years back.  This is a
>> bit rough, so corrections and additional suggestions welcome.  Expanding
>> upon the source-based patching approach would be helpful, too.
>>
>> The second commit adds a small README.rst file in the livepatch samples
>> directory pointing the reader to the doc introduced in the first commit.
>>
>> I didn't touch the livepatch kselftests yet as I'm still unsure about
>> how to best account for IPA here.  We could add the same README.rst
>> disclaimer here, too, but perhaps we have a chance to do something more.
>> Possibilities range from checking for renamed functions as part of their
>> build, or the selftest scripts, or even adding something to the kernel
>> API.  I think we'll have a better idea after reviewing the compiler
>> considerations doc.
> 
> thanks to Marcos for resurrecting this.
> 
> Joe, do you have an updated version by any chance? Some things have 
> changed since July 2020 so it calls for a new review. If there was an 
> improved version, it would be easier. If not, no problem at all.
> 

Yea, it's been a little while :) I don't have any newer version than
this one.  I can rebase,  apply all of the v1 suggestions, and see where
it stands.  LMK if you can think of any specifics that could be added.

For example, CONFIG_KERNEL_IBT will be driving some changes soon,
whether it be klp-convert for source-based patches or vmlinux.o binary
comparison for kpatch-build.

I can push a v2 with a few changes, but IIRC, last time we reviewed
this, it kinda begged the question of how someone is creating the
livepatch in the first place.  As long as we're fine holding that
thought for a while longer, this doc may still be useful by itself.

-- 
Joe




Re: [PATCH] livepatch: introduce klp_func called interface

2024-05-31 Thread Joe Lawrence
On Tue, May 21, 2024 at 08:34:46AM +0200, Miroslav Benes wrote:
> Hello,
> 
> On Mon, 20 May 2024, zhang warden wrote:
> 
> > 
> > 
> > > On May 20, 2024, at 14:46, Miroslav Benes  wrote:
> > > 
> > > Hi,
> > > 
> > > On Mon, 20 May 2024, Wardenjohn wrote:
> > > 
> > >> Livepatch module usually used to modify kernel functions.
> > >> If the patched function have bug, it may cause serious result
> > >> such as kernel crash.
> > >> 
> > >> This is a kobject attribute of klp_func. Sysfs interface named
> > >> "called" is introduced to livepatch which will be set as true
> > >> if the patched function is called.
> > >> 
> > >> /sys/kernel/livepatchcalled
> > >> 
> > >> This value "called" is quite necessary for kernel stability
> > >> assurance for livepatching module of a running system.
> > >> Testing process is important before a livepatch module apply to
> > >> a production system. With this interface, testing process can
> > >> easily find out which function is successfully called.
> > >> Any testing process can make sure they have successfully cover
> > >> all the patched function that changed with the help of this interface.
> > > 
> > > Even easier is to use the existing tracing infrastructure in the kernel 
> > > (ftrace for example) to track the new function. You can obtain much more 
> > > information with that than the new attribute provides.
> > > 
> > > Regards,
> > > Miroslav
> > Hi Miroslav
> > 
> > First, in most cases, testing process is should be automated, which make 
> > using existing tracing infrastructure inconvenient.
> 
> could you elaborate, please? We use ftrace exactly for this purpose and 
> our testing process is also more or less automated.
> 
> > Second, livepatch is 
> > already use ftrace for functional replacement, I don’t think it is a 
> > good choice of using kernel tracing tool to trace a patched function.
> 
> Why?
> 
> > At last, this attribute can be thought of as a state of a livepatch 
> > function. It is a state, like the "patched" "transition" state of a 
> > klp_patch.  Adding this state will not break the state consistency of 
> > livepatch.
> 
> Yes, but the information you get is limited compared to what is available 
> now. You would obtain the information that a patched function was called 
> but ftrace could also give you the context and more.
> 
> It would not hurt to add a new attribute per se but I am wondering about 
> the reason and its usage. Once we have it, the commit message should be 
> improved based on that.
> 

I agree with Miroslav about using ftrace, or Dan in the other thread
about using gcov if even more granular coverage is needed.

Admittedly, I would have to go find documentation to remember how to do
either, but at least I'd be using well-tested facilities designed for
this exact purpose.

Adding these attributes to livepatch sysfs would be expedient and
probably easier for us to use, but imposes a recurring burden on us to
maintain and test (where is the documentation and kselftest for this new
interface?).  Or, we could let the other tools handle all of that for
us.

Perhaps if someone already has an off-the-shelf script that is using
ftrace to monitor livepatched code, it could be donated to
Documentation/livepatch/?  I can ask our QE folks if they have something
like this.

Regards,

--
Joe




Re: [PATCH 02/12] wifi: wcn36xx: make use of QCOM_FW_HELPER

2024-05-31 Thread Kalle Valo
Dmitry Baryshkov  writes:

> Make the driver use qcom_fw_helper to autodetect the path to the
> calibration data file.
>
> Signed-off-by: Dmitry Baryshkov 

Not a fan of one-sentence commit messages. It would be nice to explain a
bit more in the commit message, for instance answering the question
'why?' and maybe providing a short example of how this is supposed to work?

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches



Re: [PATCH 01/12] soc: qcom: add firmware name helper

2024-05-31 Thread Kalle Valo
Bjorn Andersson  writes:

> On Mon, May 27, 2024 at 02:42:44PM GMT, Dmitry Baryshkov wrote:
>
>> On Thu, 23 May 2024 at 01:48, Bjorn Andersson  
>> wrote:
>> >
>> > On Tue, May 21, 2024 at 03:08:31PM +0200, Dmitry Baryshkov wrote:
>> > > On Tue, 21 May 2024 at 13:20, Kalle Valo  wrote:
>> > > >
>> > > > Dmitry Baryshkov  writes:
>> > > >
>> > > > > On Tue, 21 May 2024 at 12:52,  wrote:
>> > > > >>
>> > > > >> On 21/05/2024 11:45, Dmitry Baryshkov wrote:
>> > > > >> > Qualcomm platforms have different sets of the firmware files, 
>> > > > >> > which
>> > > > >> > differ from platform to platform (and from board to board, due to 
>> > > > >> > the
>> > > > >> > embedded signatures). Rather than listing all the firmware files,
>> > > > >> > including full paths, in the DT, provide a way to determine 
>> > > > >> > firmware
>> > > > >> > path based on the root DT node compatible.
>> > > > >>
>> > > > >> Ok this looks quite over-engineered but necessary to handle the 
>> > > > >> legacy,
>> > > > >> but I really think we should add a way to look for a board-specific 
>> > > > >> path
>> > > > >> first and fallback to those SoC specific paths.
>> > > > >
>> > > > > Again, CONFIG_FW_LOADER_USER_HELPER => delays.
>> > > >
>> > > > To me this also looks like very over-engineered, can you elaborate more
>> > > > why this is needed? Concrete examples would help to understand better.
>> > >
>> > > Sure. During the meeting last week Arnd suggested evaluating if we can
>> > > drop firmware-name from the board DT files. Several reasons for that:
>> > > - DT should describe the hardware, not the Linux-firmware locations
>> > > - having firmware name in DT complicates updating the tree to use
>> > > different firmware API (think of mbn vs mdt vs any other format)
>> > > - If the DT gets supplied by the vendor (e.g. for
>> > > SystemReady-certified devices), there should be a sync between the
>> > > vendor's DT, linux kernel and the rootfs. Dropping firmware names from
>> > > DT solves that by removing one piece of the equation
>> > >
>> > > Now for the complexity of the solution. Each SoC family has their own
>> > > firmware set. This includes firmware for the DSPs, for modem, WiFi
>> > > bits, GPU shader, etc.
>> > > For the development boards these devices are signed by the testing key
>> > > and the actual signature is not validated against the root of trust
>> > > certificate.
>> > > For the end-user devices the signature is actually validated against
>> > > the bits fused to the SoC during manufacturing process. CA certificate
>> > > (and thus the fuses) differ from vendor to vendor (and from the device
>> > > to device)
>> > >
>> > > Not all of the firmware files are a part of the public linux-firmware
>> > > tree. However we need to support the rootfs bundled with the firmware
>> > > for different platforms (both public and vendor). The non-signed files
>> > > come from the Adreno GPU and can be shared between platforms. All
>> > > other files are SoC-specific and in some cases device-specific.
>> > >
>> > > So for example the SDM845 db845c (open device) loads following firmware 
>> > > files:
>> > > Not signed:
>> > > - qcom/a630_sqe.fw
>> > > - qcom/a630_gmu.bin
>> > >
>> > > Signed, will work for any non-secured sdm845 device:
>> > > - qcom/sdm845/a630_zap.mbn
>> > > - qcom/sdm845/adsp.mbn
>> > > - qcom/sdm845/cdsp.mbn
>> > > - qcom/sdm485/mba.mbn
>> > > - qcom/sdm845/modem.mbn
>> > > - qcom/sdm845/wlanmdsp.mbn (loaded via TQFTP)
>> > > - qcom/venus-5.2/venus.mbn
>> > >
>> > > Signed, works only for DB845c.
>> > > - qcom/sdm845/Thundercomm/db845c/slpi.mbn
>> > >
>> > > In comparison, the SDM845 Pixel-3 phone (aka blueline) should load the
>> > > following firmware files:
>> > > - qcom/a630_sqe.fw (the same, non-signed file)
>> > > - qcom/a630_gmu.bin (the same, non-signed file)
>> > > - qcom/sdm845/Google/blueline/a630_zap.mbn
>> >
>> > How do you get from "a630_zap.mbn" to this? By extending the lookup
>> > table for every target, or what am I missing?
>> 
>> More or less so. Matching the root OF node gives us the firmware
>> location, then it gets prepended to all firmware targets. Not an ideal
>> solution, as there is no fallback support, but at least it gives us
>> some points to discuss (and to decide whether to move to some
>> particular direction or to abandon the idea completely, making Arnd
>> unhappy again).
>> 
>
> I understand the desire to not put linux-firmware-specific paths in the
> DeviceTree

Me too.

> but I think I'm less keen on having a big lookup table in the
> kernel...

Yeah, also for me this feels wrong. But on the other hand I don't have
anything better to suggest either...

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
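
For context, a sketch (not code from this thread) of the lookup being
described: match the root node compatible against a table and prepend the
resulting directory to each requested firmware name. The table contents and
the qcom_fw_path() helper name are illustrative only.

static const struct { const char *compat; const char *dir; } qcom_fw_dirs[] = {
	{ "thundercomm,db845c", "qcom/sdm845/Thundercomm/db845c" },
	{ "google,blueline",    "qcom/sdm845/Google/blueline" },
	{ "qcom,sdm845",        "qcom/sdm845" },
};

static char *qcom_fw_path(struct device *dev, const char *name)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(qcom_fw_dirs); i++)
		if (of_machine_is_compatible(qcom_fw_dirs[i].compat))
			return devm_kasprintf(dev, GFP_KERNEL, "%s/%s",
					      qcom_fw_dirs[i].dir, name);

	/* No board or SoC match: fall back to the bare file name. */
	return devm_kstrdup(dev, name, GFP_KERNEL);
}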



Re: [PATCHv7 bpf-next 0/9] uprobe: uretprobe speed up

2024-05-31 Thread Andrii Nakryiko
On Thu, May 23, 2024 at 5:11 AM Jiri Olsa  wrote:
>
> hi,
> as part of the effort on speeding up the uprobes [0] coming with
> return uprobe optimization by using syscall instead of the trap
> on the uretprobe trampoline.
>
> The speed up depends on instruction type that uprobe is installed
> and depends on specific HW type, please check patch 1 for details.
>
> Patches 1-8 are based on bpf-next/master, but patch 2 and 3 are
> apply-able on linux-trace.git tree probes/for-next branch.
> Patch 9 is based on man-pages master.
>
> v7 changes:
> - fixes in man page [Alejandro Colomar]
> - fixed patch #1 fixes tag [Oleg]
>
> Also available at:
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   uretprobe_syscall
>
> thanks,
> jirka
>
>
> Notes to check list items in Documentation/process/adding-syscalls.rst:
>
> - System Call Alternatives
>   New syscall seems like the best way in here, because we need
>   just to quickly enter kernel with no extra arguments processing,
>   which we'd need to do if we decided to use another syscall.
>
> - Designing the API: Planning for Extension
>   The uretprobe syscall is very specific and most likely won't be
>   extended in the future.
>
>   At the moment it does not take any arguments and even if it does
>   in future, it's allowed to be called only from trampoline prepared
>   by kernel, so there'll be no broken user.
>
> - Designing the API: Other Considerations
>   N/A because uretprobe syscall does not return reference to kernel
>   object.
>
> - Proposing the API
>   Wiring up of the uretprobe system call is in separate change,
>   selftests and man page changes are part of the patchset.
>
> - Generic System Call Implementation
>   There's no CONFIG option for the new functionality because it
>   keeps the same behaviour from the user POV.
>
> - x86 System Call Implementation
>   It's 64-bit syscall only.
>
> - Compatibility System Calls (Generic)
>   N/A uretprobe syscall has no arguments and is not supported
>   for compat processes.
>
> - Compatibility System Calls (x86)
>   N/A uretprobe syscall is not supported for compat processes.
>
> - System Calls Returning Elsewhere
>   N/A.
>
> - Other Details
>   N/A.
>
> - Testing
>   Adding new bpf selftests and ran ltp on top of this change.
>
> - Man Page
>   Attached.
>
> - Do not call System Calls in the Kernel
>   N/A.
>
>
> [0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
> ---
> Jiri Olsa (8):
>   x86/shstk: Make return uprobe work with shadow stack
>   uprobe: Wire up uretprobe system call
>   uprobe: Add uretprobe syscall to speed up return probe
>   selftests/x86: Add return uprobe shadow stack test
>   selftests/bpf: Add uretprobe syscall test for regs integrity
>   selftests/bpf: Add uretprobe syscall test for regs changes
>   selftests/bpf: Add uretprobe syscall call from user space test
>   selftests/bpf: Add uretprobe shadow stack test
>

Masami, Steven,

It seems like the series is ready to go in. Are you planning to take
the first 4 patches through your linux-trace tree?

>  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>  arch/x86/include/asm/shstk.h|   4 +
>  arch/x86/kernel/shstk.c |  16 
>  arch/x86/kernel/uprobes.c   | 124 
> -
>  include/linux/syscalls.h|   2 +
>  include/linux/uprobes.h |   3 +
>  include/uapi/asm-generic/unistd.h   |   5 +-
>  kernel/events/uprobes.c |  24 --
>  kernel/sys_ni.c |   2 +
>  tools/include/linux/compiler.h  |   4 +
>  tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c   | 123 
> -
>  tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 385 
> +++
>  tools/testing/selftests/bpf/progs/uprobe_syscall.c  |  15 
>  tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c |  17 
>  tools/testing/selftests/x86/test_shadow_stack.c | 145 
> ++
>  15 files changed, 860 insertions(+), 10 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
>  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c
>  create mode 100644 
> tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
>
> Jiri Olsa (1):
>   man2: Add uretprobe syscall page
>
>  man/man2/uretprobe.2 | 56 
> 
>  1 file changed, 56 insertions(+)
>  create mode 100644 man/man2/uretprobe.2



[PATCH v7 5/5] MAINTAINERS: add myself for Marvell 88PM886 PMIC

2024-05-31 Thread Karel Balej
Add an entry to MAINTAINERS for the Marvell 88PM886 PMIC MFD, onkey and
regulator drivers.

Signed-off-by: Karel Balej 
---

Notes:
RFC v3:
- Remove onkey bindings file.
RFC v2:
- Only mention 88PM886 in the commit message.
- Add regulator driver.
- Rename the entry.

 MAINTAINERS | 9 +
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index d6c90161c7bf..9d6c940029b8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13295,6 +13295,15 @@ F: drivers/net/dsa/mv88e6xxx/
 F: include/linux/dsa/mv88e6xxx.h
 F: include/linux/platform_data/mv88e6xxx.h
 
+MARVELL 88PM886 PMIC DRIVER
+M: Karel Balej 
+S: Maintained
+F: Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml
+F: drivers/input/misc/88pm886-onkey.c
+F: drivers/mfd/88pm886.c
+F: drivers/regulator/88pm886-regulator.c
+F: include/linux/mfd/88pm886.h
+
 MARVELL ARMADA 3700 PHY DRIVERS
 M: Miquel Raynal 
 S: Maintained
-- 
2.45.1




[PATCH v7 3/5] regulator: add regulators driver for Marvell 88PM886 PMIC

2024-05-31 Thread Karel Balej
Support the LDO and buck regulators of the Marvell 88PM886 PMIC.

Signed-off-by: Karel Balej 
---

Notes:
v7:
- Address Mark's feedback:
  - Drop get_current_limit op, max_uA values and thus unneeded struct
pm886_regulator and adapt the code accordingly.
v6:
- Remove all definitions (now present in the header).
v5:
- Add remaining regulators.
- Clean up includes.
- Address Mark's feedback:
  - Use dedicated regmap config.
RFC v4:
- Initialize regulators regmap in the regulators driver.
- Register all regulators at once.
- Drop regulator IDs.
- Add missing '\n' to dev_err_probe message.
- Fix includes.
- Add ID table.
RFC v3:
- Do not have a variable for each regulator -- define them all in the
  pm886_regulators array.
- Use new regulators regmap index name.
- Use dev_err_probe.
RFC v2:
- Drop of_compatible and related code.
- Drop unused include.
- Remove some abstraction: use only one regmap for all regulators and
  only mention 88PM886 in Kconfig description.
- Reword commit message.

 drivers/regulator/88pm886-regulator.c | 392 ++
 drivers/regulator/Kconfig |   6 +
 drivers/regulator/Makefile|   1 +
 3 files changed, 399 insertions(+)
 create mode 100644 drivers/regulator/88pm886-regulator.c

diff --git a/drivers/regulator/88pm886-regulator.c 
b/drivers/regulator/88pm886-regulator.c
new file mode 100644
index ..a38bd4f312b7
--- /dev/null
+++ b/drivers/regulator/88pm886-regulator.c
@@ -0,0 +1,392 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+static const struct regmap_config pm886_regulator_regmap_config = {
+   .reg_bits = 8,
+   .val_bits = 8,
+   .max_register = PM886_REG_BUCK5_VOUT,
+};
+
+static const struct regulator_ops pm886_ldo_ops = {
+   .list_voltage = regulator_list_voltage_table,
+   .map_voltage = regulator_map_voltage_iterate,
+   .set_voltage_sel = regulator_set_voltage_sel_regmap,
+   .get_voltage_sel = regulator_get_voltage_sel_regmap,
+   .enable = regulator_enable_regmap,
+   .disable = regulator_disable_regmap,
+   .is_enabled = regulator_is_enabled_regmap,
+};
+
+static const struct regulator_ops pm886_buck_ops = {
+   .list_voltage = regulator_list_voltage_linear_range,
+   .map_voltage = regulator_map_voltage_linear_range,
+   .set_voltage_sel = regulator_set_voltage_sel_regmap,
+   .get_voltage_sel = regulator_get_voltage_sel_regmap,
+   .enable = regulator_enable_regmap,
+   .disable = regulator_disable_regmap,
+   .is_enabled = regulator_is_enabled_regmap,
+};
+
+static const unsigned int pm886_ldo_volt_table1[] = {
+   1700000, 1800000, 1900000, 2500000, 2800000, 2900000, 3100000, 3300000,
+};
+
+static const unsigned int pm886_ldo_volt_table2[] = {
+   1200000, 1250000, 1700000, 1800000, 1850000, 1900000, 2500000, 2600000,
+   2700000, 2750000, 2800000, 2850000, 2900000, 3000000, 3100000, 3300000,
+};
+
+static const unsigned int pm886_ldo_volt_table3[] = {
+   1700000, 1800000, 1900000, 2000000, 2100000, 2500000, 2700000, 2800000,
+};
+
+static const struct linear_range pm886_buck_volt_ranges1[] = {
+   REGULATOR_LINEAR_RANGE(600000, 0, 79, 12500),
+   REGULATOR_LINEAR_RANGE(1600000, 80, 84, 50000),
+};
+
+static const struct linear_range pm886_buck_volt_ranges2[] = {
+   REGULATOR_LINEAR_RANGE(600000, 0, 79, 12500),
+   REGULATOR_LINEAR_RANGE(1600000, 80, 114, 50000),
+};
+
+static struct regulator_desc pm886_regulators[] = {
+   {
+   .name = "LDO1",
+   .regulators_node = "regulators",
+   .of_match = "ldo1",
+   .ops = &pm886_ldo_ops,
+   .type = REGULATOR_VOLTAGE,
+   .enable_reg = PM886_REG_LDO_EN1,
+   .enable_mask = BIT(0),
+   .volt_table = pm886_ldo_volt_table1,
+   .n_voltages = ARRAY_SIZE(pm886_ldo_volt_table1),
+   .vsel_reg = PM886_REG_LDO1_VOUT,
+   .vsel_mask = PM886_LDO_VSEL_MASK,
+   },
+   {
+   .name = "LDO2",
+   .regulators_node = "regulators",
+   .of_match = "ldo2",
+   .ops = &pm886_ldo_ops,
+   .type = REGULATOR_VOLTAGE,
+   .enable_reg = PM886_REG_LDO_EN1,
+   .enable_mask = BIT(1),
+   .volt_table = pm886_ldo_volt_table1,
+   .n_voltages = ARRAY_SIZE(pm886_ldo_volt_table1),
+   .vsel_reg = PM886_REG_LDO2_VOUT,
+   .vsel_mask = PM886_LDO_VSEL_MASK,
+   },
+   {
+   .name = "LDO3",
+   .regulators_node = "regulators",
+   .of_match = "ldo3",
+   .ops = &pm886_ldo_ops,
+   .type = REGULATOR_VOLTAGE,
+   .enable_reg = PM886_REG_LDO_EN1,
+   .enable_mask = 

[PATCH v7 4/5] input: add onkey driver for Marvell 88PM886 PMIC

2024-05-31 Thread Karel Balej
Marvell 88PM886 PMIC provides onkey among other things. Add a client
driver to handle it. The driver currently only provides basic support,
omitting additional functions found in the vendor version, such as long
onkey and GPIO integration.

Acked-by: Dmitry Torokhov 
Signed-off-by: Karel Balej 
---

Notes:
v5:
- Remove kernel.h include.
RFC v4:
- Reflect MFD driver changes:
  - chip->regmaps[...] -> chip->regmap
- Address Dmitry's feedback:
  - Add ID table.
  - Add Ack.
RFC v3:
- Drop wakeup-source.
RFC v2:
- Address Dmitry's feedback:
  - Sort includes alphabetically.
  - Drop onkey->irq.
  - ret -> err in irq_handler and no initialization.
  - Break long lines and other formatting.
  - Do not clobber platform_get_irq error.
  - Do not set device parent manually.
  - Use input_set_capability.
  - Use the wakeup-source DT property.
  - Drop of_match_table.
  - Use more temporaries.
  - Use dev_err_probe.
- Modify Kconfig description.

 drivers/input/misc/88pm886-onkey.c | 98 ++
 drivers/input/misc/Kconfig |  7 +++
 drivers/input/misc/Makefile|  1 +
 3 files changed, 106 insertions(+)
 create mode 100644 drivers/input/misc/88pm886-onkey.c

diff --git a/drivers/input/misc/88pm886-onkey.c 
b/drivers/input/misc/88pm886-onkey.c
new file mode 100644
index ..284ff5190b6e
--- /dev/null
+++ b/drivers/input/misc/88pm886-onkey.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+struct pm886_onkey {
+   struct input_dev *idev;
+   struct pm886_chip *chip;
+};
+
+static irqreturn_t pm886_onkey_irq_handler(int irq, void *data)
+{
+   struct pm886_onkey *onkey = data;
+   struct regmap *regmap = onkey->chip->regmap;
+   struct input_dev *idev = onkey->idev;
+   struct device *parent = idev->dev.parent;
+   unsigned int val;
+   int err;
+
+   err = regmap_read(regmap, PM886_REG_STATUS1, &val);
+   if (err) {
+   dev_err(parent, "Failed to read status: %d\n", err);
+   return IRQ_NONE;
+   }
+   val &= PM886_ONKEY_STS1;
+
+   input_report_key(idev, KEY_POWER, val);
+   input_sync(idev);
+
+   return IRQ_HANDLED;
+}
+
+static int pm886_onkey_probe(struct platform_device *pdev)
+{
+   struct pm886_chip *chip = dev_get_drvdata(pdev->dev.parent);
+   struct device *dev = &pdev->dev;
+   struct pm886_onkey *onkey;
+   struct input_dev *idev;
+   int irq, err;
+
+   onkey = devm_kzalloc(dev, sizeof(*onkey), GFP_KERNEL);
+   if (!onkey)
+   return -ENOMEM;
+
+   onkey->chip = chip;
+
+   irq = platform_get_irq(pdev, 0);
+   if (irq < 0)
+   return dev_err_probe(dev, irq, "Failed to get IRQ\n");
+
+   idev = devm_input_allocate_device(dev);
+   if (!idev) {
+   dev_err(dev, "Failed to allocate input device\n");
+   return -ENOMEM;
+   }
+   onkey->idev = idev;
+
+   idev->name = "88pm886-onkey";
+   idev->phys = "88pm886-onkey/input0";
+   idev->id.bustype = BUS_I2C;
+
+   input_set_capability(idev, EV_KEY, KEY_POWER);
+
+   err = devm_request_threaded_irq(dev, irq, NULL, pm886_onkey_irq_handler,
+   IRQF_ONESHOT | IRQF_NO_SUSPEND, "onkey",
+   onkey);
+   if (err)
+   return dev_err_probe(dev, err, "Failed to request IRQ\n");
+
+   err = input_register_device(idev);
+   if (err)
+   return dev_err_probe(dev, err, "Failed to register input 
device\n");
+
+   return 0;
+}
+
+static const struct platform_device_id pm886_onkey_id_table[] = {
+   { "88pm886-onkey", },
+   { }
+};
+MODULE_DEVICE_TABLE(platform, pm886_onkey_id_table);
+
+static struct platform_driver pm886_onkey_driver = {
+   .driver = {
+   .name = "88pm886-onkey",
+   },
+   .probe = pm886_onkey_probe,
+   .id_table = pm886_onkey_id_table,
+};
+module_platform_driver(pm886_onkey_driver);
+
+MODULE_DESCRIPTION("Marvell 88PM886 onkey driver");
+MODULE_AUTHOR("Karel Balej ");
+MODULE_LICENSE("GPL");
diff --git a/drivers/input/misc/Kconfig b/drivers/input/misc/Kconfig
index 6ba984d7f0b1..16a079d9f0f2 100644
--- a/drivers/input/misc/Kconfig
+++ b/drivers/input/misc/Kconfig
@@ -33,6 +33,13 @@ config INPUT_88PM80X_ONKEY
  To compile this driver as a module, choose M here: the module
  will be called 88pm80x_onkey.
 
+config INPUT_88PM886_ONKEY
+   tristate "Marvell 88PM886 onkey support"
+   depends on MFD_88PM886_PMIC
+   help
+ Support the onkey of Marvell 88PM886 PMIC as an input device
+ reporting power button status.
+
 config INPUT_AB8500_PONKEY
tristate "AB8500 Pon (PowerOn) Key"
depends on AB8500_CORE
diff --git 

[PATCH v7 2/5] mfd: add driver for Marvell 88PM886 PMIC

2024-05-31 Thread Karel Balej
Marvell 88PM886 is a PMIC which provides various functions such as
onkey, battery, charger and regulators. It is found for instance in the
samsung,coreprimevelte smartphone with which this was tested. Implement
basic support to allow for the use of regulators and onkey.

Signed-off-by: Karel Balej 
---

Notes:
v6:
- Address Lee's comments:
  - Don't break long line in the power off handler.
  - Set PLATFORM_DEVID_NONE. This should be safe with respect to
collisions since there are no known devices with more than one of
these PMICs, plus given their general purpose nature it is unlikely
that there would ever be. Also include the corresponding header.
  - Move all defines to the header.
- Give the base page's maximum register its real name.
- Set irq_base to 0.
v5:
- Address Mark's feedback:
  - Move regmap config back out of the header and rename it. Also lower
its maximum register based on what's actually used in the downstream
code.
RFC v4:
- Use MFD_CELL_* macros.
- Address Lee's feedback:
  - Do not define regmap_config.val_bits and .reg_bits.
  - Drop everything regulator related except mfd_cell (regmap
initialization, IDs enum etc.). Drop pm886_initialize_subregmaps.
  - Do not store regmap pointers as an array as there is now only one
regmap. Also drop the corresponding enum.
  - Move regmap_config to the header as it is needed in the regulators
driver.
  - pm886_chip.whoami -> chip_id
  - Reword chip ID mismatch error message and print the ID as
hexadecimal.
  - Fix includes in include/linux/88pm886.h.
  - Drop the pm886_irq_number enum and define the (for the moment) only
IRQ explicitly.
- Have only one MFD cell for all regulators as they are now registered
  all at once in the regulators driver.
- Reword commit message.
- Make device table static and remove comma after the sentinel to signal
  that nothing should come after it.
RFC v3:
- Drop onkey cell .of_compatible.
- Rename LDO page offset and regmap to REGULATORS.
RFC v2:
- Remove some abstraction.
- Sort includes alphabetically and add linux/of.h.
- Depend on OF, remove of_match_ptr and add MODULE_DEVICE_TABLE.
- Use more temporaries and break long lines.
- Do not initialize ret in probe.
- Use the wakeup-source DT property.
- Rename ret to err.
- Address Lee's comments:
  - Drop patched in presets for base regmap and related defines.
  - Use full sentences in comments.
  - Remove IRQ comment.
  - Define regmap_config member values.
  - Rename data to sys_off_data.
  - Add _PMIC suffix to Kconfig.
  - Use dev_err_probe.
  - Do not store irq_data.
  - s/WHOAMI/CHIP_ID
  - Drop LINUX part of include guard name.
  - Merge in the regulator series modifications in order to have more
devices and modify the commit message accordingly. Changes with
respect to the original regulator series patches:
- ret -> err
- Add temporary for dev in pm88x_initialize_subregmaps.
- Drop of_compatible for the regulators.
- Do not duplicate LDO regmap for bucks.
- Rewrite commit message.

Friday late afternoon counts as a weekend too...

 drivers/mfd/88pm886.c   | 148 
 drivers/mfd/Kconfig |  12 +++
 drivers/mfd/Makefile|   1 +
 include/linux/mfd/88pm886.h |  69 +
 4 files changed, 230 insertions(+)
 create mode 100644 drivers/mfd/88pm886.c
 create mode 100644 include/linux/mfd/88pm886.h

diff --git a/drivers/mfd/88pm886.c b/drivers/mfd/88pm886.c
new file mode 100644
index ..dbe9efc027d2
--- /dev/null
+++ b/drivers/mfd/88pm886.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+static const struct regmap_config pm886_regmap_config = {
+   .reg_bits = 8,
+   .val_bits = 8,
+   .max_register = PM886_REG_RTC_SPARE6,
+};
+
+static struct regmap_irq pm886_regmap_irqs[] = {
+   REGMAP_IRQ_REG(PM886_IRQ_ONKEY, 0, PM886_INT_ENA1_ONKEY),
+};
+
+static struct regmap_irq_chip pm886_regmap_irq_chip = {
+   .name = "88pm886",
+   .irqs = pm886_regmap_irqs,
+   .num_irqs = ARRAY_SIZE(pm886_regmap_irqs),
+   .num_regs = 4,
+   .status_base = PM886_REG_INT_STATUS1,
+   .ack_base = PM886_REG_INT_STATUS1,
+   .unmask_base = PM886_REG_INT_ENA_1,
+};
+
+static struct resource pm886_onkey_resources[] = {
+   DEFINE_RES_IRQ_NAMED(PM886_IRQ_ONKEY, "88pm886-onkey"),
+};
+
+static struct mfd_cell pm886_devs[] = {
+   MFD_CELL_RES("88pm886-onkey", pm886_onkey_resources),
+   MFD_CELL_NAME("88pm886-regulator"),
+};
+
+static int pm886_power_off_handler(struct sys_off_data *sys_off_data)
+{
+   struct pm886_chip *chip 

[PATCH v7 1/5] dt-bindings: mfd: add entry for Marvell 88PM886 PMIC

2024-05-31 Thread Karel Balej
Marvell 88PM886 is a PMIC with several subdevices such as onkey,
regulators or battery and charger. It comes in at least two revisions,
A0 and A1 -- only A1 is described here at the moment.

Reviewed-by: Krzysztof Kozlowski 
Signed-off-by: Karel Balej 
---

Notes:
RFC v4:
- Address Krzysztof's comments:
  - Fix regulators indentation.
  - Add Krzysztof's trailer.
RFC v3:
- Add wakeup-source property.
- Address Rob's feedback:
  - Move regulators into the MFD file.
  - Remove interrupt-controller and #interrupt-cells properties.
RFC v2:
- Address Rob's feedback:
  - Drop mention of 88PM880.
  - Make sure the file passes bindings check (add the necessary header
and fix `interrupt-cells`).
  - Other small changes.
- Add regulators. Changes with respect to the regulator RFC series:
  - Address Krzysztof's comments:
- Drop unused compatible.
- Fix sub-node pattern.

 .../bindings/mfd/marvell,88pm886-a1.yaml  | 76 +++
 1 file changed, 76 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml

diff --git a/Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml 
b/Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml
new file mode 100644
index ..d6a71c912b76
--- /dev/null
+++ b/Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/mfd/marvell,88pm886-a1.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell 88PM886 PMIC core
+
+maintainers:
+  - Karel Balej 
+
+description:
+  Marvell 88PM886 is a PMIC providing several functions such as onkey,
+  regulators or battery and charger.
+
+properties:
+  compatible:
+const: marvell,88pm886-a1
+
+  reg:
+maxItems: 1
+
+  interrupts:
+maxItems: 1
+
+  wakeup-source: true
+
+  regulators:
+type: object
+additionalProperties: false
+patternProperties:
+  "^(ldo(1[0-6]|[1-9])|buck[1-5])$":
+type: object
+$ref: /schemas/regulator/regulator.yaml#
+description: LDO or buck regulator.
+unevaluatedProperties: false
+
+required:
+  - compatible
+  - reg
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |
+#include 
+i2c {
+  #address-cells = <1>;
+  #size-cells = <0>;
+  pmic@30 {
+compatible = "marvell,88pm886-a1";
+reg = <0x30>;
+interrupts = <0 4 IRQ_TYPE_LEVEL_HIGH>;
+interrupt-parent = <&gic>;
+wakeup-source;
+
+regulators {
+  ldo2: ldo2 {
+regulator-min-microvolt = <310>;
+regulator-max-microvolt = <330>;
+  };
+
+  ldo15: ldo15 {
+regulator-min-microvolt = <330>;
+regulator-max-microvolt = <330>;
+  };
+
+  buck2: buck2 {
+regulator-min-microvolt = <180>;
+regulator-max-microvolt = <180>;
+  };
+};
+  };
+};
+...
-- 
2.45.1




[PATCH v7 0/5] initial support for Marvell 88PM886 PMIC

2024-05-31 Thread Karel Balej
Hello,

the following implements basic support for Marvell's 88PM886 PMIC which
is found for instance as a component of the samsung,coreprimevelte
smartphone which inspired this and also serves as a testing platform.

The code for the MFD is based primarily on this old series [1] with the
addition of poweroff based on the smartphone's downstream kernel tree
[2]. The onkey and regulators drivers are based on the latter. I am not
in possession of the datasheet.

[1] 
https://lore.kernel.org/all/1434098601-3498-1-git-send-email-yizh...@marvell.com/
[2] https://github.com/CoderCharmander/g361f-kernel

Thank you, kind regards,
K. B.
---
v7:
- Rebase to v6.10-rc1.
- v6: https://lore.kernel.org/r/20240504194632.2456-1-bal...@matfyz.cz/
v6:
- Rebase to v6.9-rc6.
- Fix patchset versioning: the previous version was marked as v1 because I
  considered RFC to be its own thing. Thank you to Krzysztof for
  explaining that that is not the case. The previous version is thus now
  marked as v5 and this is v6, sorry for any confusion.
- v5: https://lore.kernel.org/r/20240331105608.7338-2-bal...@matfyz.cz/
v5:
- RFC v4: 
https://lore.kernel.org/r/20240311160110.32185-1-kar...@gimli.ms.mff.cuni.cz/
- Rebase to v6.9-rc1.
- Thank you to everybody for their feedback on the RFC!
RFC v4:
- RFC v3: 
https://lore.kernel.org/all/20240303101506.4187-1-kar...@gimli.ms.mff.cuni.cz/
RFC v3:
- Address Rob's feedback:
  - Drop onkey bindings patch.
- Rename PM88X -> PM886 everywhere.
- RFC v2: 
https://lore.kernel.org/all/20240211094609.2223-1-kar...@gimli.ms.mff.cuni.cz/
RFC v2:
- Merge with the regulators series to have multiple devices and thus
  justify the use of the MFD framework.
- Rebase on v6.8-rc3.
- Reorder patches.
- MFD RFC v1: 
https://lore.kernel.org/all/20231217131838.7569-1-kar...@gimli.ms.mff.cuni.cz/
- regulators RFC v1: 
https://lore.kernel.org/all/20231228100208.2932-1-kar...@gimli.ms.mff.cuni.cz/

Karel Balej (5):
  dt-bindings: mfd: add entry for Marvell 88PM886 PMIC
  mfd: add driver for Marvell 88PM886 PMIC
  regulator: add regulators driver for Marvell 88PM886 PMIC
  input: add onkey driver for Marvell 88PM886 PMIC
  MAINTAINERS: add myself for Marvell 88PM886 PMIC

 .../bindings/mfd/marvell,88pm886-a1.yaml  |  76 
 MAINTAINERS   |   9 +
 drivers/input/misc/88pm886-onkey.c|  98 +
 drivers/input/misc/Kconfig|   7 +
 drivers/input/misc/Makefile   |   1 +
 drivers/mfd/88pm886.c | 148 +++
 drivers/mfd/Kconfig   |  12 +
 drivers/mfd/Makefile  |   1 +
 drivers/regulator/88pm886-regulator.c | 392 ++
 drivers/regulator/Kconfig |   6 +
 drivers/regulator/Makefile|   1 +
 include/linux/mfd/88pm886.h   |  69 +++
 12 files changed, 820 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml
 create mode 100644 drivers/input/misc/88pm886-onkey.c
 create mode 100644 drivers/mfd/88pm886.c
 create mode 100644 drivers/regulator/88pm886-regulator.c
 create mode 100644 include/linux/mfd/88pm886.h

-- 
2.45.1




Re: [PATCH v5 5/7] remoteproc: core: support of the tee interface

2024-05-31 Thread Mathieu Poirier
On Thu, May 30, 2024 at 09:42:26AM +0200, Arnaud POULIQUEN wrote:
> Hello Mathieu,
> 
> On 5/29/24 22:35, Mathieu Poirier wrote:
> > On Wed, May 29, 2024 at 09:13:26AM +0200, Arnaud POULIQUEN wrote:
> >> Hello Mathieu,
> >>
> >> On 5/28/24 23:30, Mathieu Poirier wrote:
> >>> On Tue, May 21, 2024 at 10:09:59AM +0200, Arnaud Pouliquen wrote:
>  1) on start:
>  - Using the TEE loader, the resource table is loaded by an external 
>  entity.
>  In such a case the resource table address is not found in the firmware but
>  provided by the TEE remoteproc framework.
>  Use the rproc_get_loaded_rsc_table instead of rproc_find_loaded_rsc_table
>  - test that rproc->cached_table is not null before performing the memcpy
> 
>  2)on stop
>  The use of the cached_table seems mandatory:
>  - during recovery sequence to have a snapshot of the resource table
>    resources used,
>  - on stop to allow for the deinitialization of resources after the
>    remote processor has been shut down.
>  However if the TEE interface is being used, we first need to unmap the
>  table_ptr before setting it to rproc->cached_table.
>  The update of rproc->table_ptr to rproc->cached_table is performed in
>  tee_remoteproc.
> 
>  Signed-off-by: Arnaud Pouliquen 
>  ---
>   drivers/remoteproc/remoteproc_core.c | 31 +---
>   1 file changed, 23 insertions(+), 8 deletions(-)
> 
>  diff --git a/drivers/remoteproc/remoteproc_core.c 
>  b/drivers/remoteproc/remoteproc_core.c
>  index 42bca01f3bde..3a642151c983 100644
>  --- a/drivers/remoteproc/remoteproc_core.c
>  +++ b/drivers/remoteproc/remoteproc_core.c
>  @@ -1267,6 +1267,7 @@ EXPORT_SYMBOL(rproc_resource_cleanup);
>   static int rproc_set_rsc_table_on_start(struct rproc *rproc, const 
>  struct firmware *fw)
>   {
>   struct resource_table *loaded_table;
>  +struct device *dev = &rproc->dev;
>   
>   /*
>    * The starting device has been given the rproc->cached_table 
>  as the
>  @@ -1276,12 +1277,21 @@ static int rproc_set_rsc_table_on_start(struct 
>  rproc *rproc, const struct firmwa
>    * this information to device memory. We also update the 
>  table_ptr so
>    * that any subsequent changes will be applied to the loaded 
>  version.
>    */
>  -loaded_table = rproc_find_loaded_rsc_table(rproc, fw);
>  -if (loaded_table) {
>  -memcpy(loaded_table, rproc->cached_table, 
>  rproc->table_sz);
>  -rproc->table_ptr = loaded_table;
>  +if (rproc->tee_interface) {
>  +loaded_table = rproc_get_loaded_rsc_table(rproc, 
>  &rproc->table_sz);
>  +if (IS_ERR(loaded_table)) {
>  +dev_err(dev, "can't get resource table\n");
>  +return PTR_ERR(loaded_table);
>  +}
>  +} else {
>  +loaded_table = rproc_find_loaded_rsc_table(rproc, fw);
>   }
>   
>  +if (loaded_table && rproc->cached_table)
>  +memcpy(loaded_table, rproc->cached_table, 
>  rproc->table_sz);
>  +
> >>>
> >>> Why is this not part of the else {} above as it was the case before this 
> >>> patch?
> >>> And why was an extra check for ->cached_table added?
> >>
> >> Here we have to cover 2 use cases if rproc->tee_interface is set.
> >> 1) The remote processor is in stop state
> >>  - loaded_table points to the resource table in the remote memory and
> >>  -  rproc->cached_table is null
> >>  => no memcopy
> >> 2) crash recovery
> >>  - loaded_table points to the resource table in the remote memory
> >>  - rproc-cached_table point to a copy of the resource table
> > 
> > A cached_table exists because it was created in 
> > rproc_reset_rsc_table_on_stop().
> > But as the comment says [1], that part of the code was meant to be used for 
> > the
> > attach()/detach() use case.  Mixing both will become extremely confusing and
> > impossible to maintain.
> 
> I am not sure I understand your point here... the cached_table already
> existed for the "normal" case [2]. It seems to me that the cached table is
> needed on stop in all scenarios.
> 
> [2]
> https://elixir.bootlin.com/linux/v4.20.17/source/drivers/remoteproc/remoteproc_core.c#L1402
> 
> > 
> > I think the TEE scenario should be as similar as the "normal" one where TEE 
> > is
> > not involved.  To that end, I suggest to create a cached_table in
> > tee_rproc_parse_fw(), exactly the same way it is done in
> > rproc_elf_load_rsc_table().  That way the code path in
> > rproc_set_rsc_table_on_start() become very similar and we have a 
> > cached_table to
> > work with when the remote processor is recovered.  In fact we may not need
> > 

Re: [PATCH RFC 3/4] mm, slab: add static key for should_failslab()

2024-05-31 Thread Yosry Ahmed
On Fri, May 31, 2024 at 9:44 AM Alexei Starovoitov
 wrote:
>
> On Fri, May 31, 2024 at 2:33 AM Vlastimil Babka  wrote:
> >
> > Since commit 4f6923fbb352 ("mm: make should_failslab always available for
> > fault injection") should_failslab() is unconditionally a noinline
> > function. This adds visible overhead to the slab allocation hotpath,
> > even if the function is empty. With CONFIG_FAILSLAB=y there's additional
> > overhead when the functionality is not enabled by a boot parameter or
> > debugfs.
> >
> > The overhead can be eliminated with a static key around the callsite.
> > Fault injection and error injection frameworks can now be told that
> > this function has a static key associated with it, and are able to enable and
> > disable it accordingly.
> >
> > Signed-off-by: Vlastimil Babka 
> > ---
> >  mm/failslab.c |  2 +-
> >  mm/slab.h |  3 +++
> >  mm/slub.c | 10 +++---
> >  3 files changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/failslab.c b/mm/failslab.c
> > index ffc420c0e767..878fd08e5dac 100644
> > --- a/mm/failslab.c
> > +++ b/mm/failslab.c
> > @@ -9,7 +9,7 @@ static struct {
> > bool ignore_gfp_reclaim;
> > bool cache_filter;
> >  } failslab = {
> > -   .attr = FAULT_ATTR_INITIALIZER,
> > +   .attr = FAULT_ATTR_INITIALIZER_KEY(&should_failslab_active.key),
> > .ignore_gfp_reclaim = true,
> > .cache_filter = false,
> >  };
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 5f8f47c5bee0..792e19cb37b8 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -11,6 +11,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  /*
> >   * Internal slab definitions
> > @@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, 
> > freelist), sizeof(freelist_aba_t)
> >   */
> >  #define slab_page(s) folio_page(slab_folio(s), 0)
> >
> > +DECLARE_STATIC_KEY_FALSE(should_failslab_active);
> > +
> >  /*
> >   * If network-based swap is enabled, sl*b must keep track of whether pages
> >   * were allocated from pfmemalloc reserves.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 0809760cf789..3bb579760a37 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3874,13 +3874,15 @@ static __always_inline void 
> > maybe_wipe_obj_freeptr(struct kmem_cache *s,
> > 0, sizeof(void *));
> >  }
> >
> > +DEFINE_STATIC_KEY_FALSE(should_failslab_active);
> > +
> >  noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
> >  {
> > if (__should_failslab(s, gfpflags))
> > return -ENOMEM;
> > return 0;
> >  }
> > -ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
> > +ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, &should_failslab_active);
> >
> >  static __fastpath_inline
> >  struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
> > @@ -3889,8 +3891,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct 
> > kmem_cache *s, gfp_t flags)
> >
> >     might_alloc(flags);
> >
> > -   if (unlikely(should_failslab(s, flags)))
> > -   return NULL;
> > +   if (static_branch_unlikely(&should_failslab_active)) {
> > +   if (should_failslab(s, flags))
> > +   return NULL;
> > +   }
>
> makes sense.
> Acked-by: Alexei Starovoitov 
>
> Do you have any microbenchmark numbers before/after this optimization?

There are numbers in the cover letter for the entire series:
https://lore.kernel.org/lkml/20240531-fault-injection-statickeys-v1-0-a513fd0a9...@suse.cz/



Re: [PATCH] tracing/fprobe: Support raw tracepoint events on modules

2024-05-31 Thread kernel test robot
Hi Masami,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.10-rc1 next-20240531]
[cannot apply to rostedt-trace/for-next rostedt-trace/for-next-urgent]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Masami-Hiramatsu-Google/tracing-fprobe-Support-raw-tracepoint-events-on-modules/20240531-175013
base:   linus/master
patch link:
https://lore.kernel.org/r/171714888633.198965.13093663631481169611.stgit%40devnote2
patch subject: [PATCH] tracing/fprobe: Support raw tracepoint events on modules
config: s390-defconfig 
(https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 
bafda89a0944d947fc4b3b5663185e07a397ac30)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240601/202406010034.fsnp9rsq-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202406010034.fsnp9rsq-...@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/tracepoint.c:5:
   In file included from include/linux/module.h:19:
   In file included from include/linux/elf.h:6:
   In file included from arch/s390/include/asm/elf.h:173:
   In file included from arch/s390/include/asm/mmu_context.h:11:
   In file included from arch/s390/include/asm/pgalloc.h:18:
   In file included from include/linux/mm.h:2253:
   include/linux/vmstat.h:500:43: warning: arithmetic between different 
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
[-Wenum-enum-conversion]
 500 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
 |~ ^
 501 |item];
 |
   include/linux/vmstat.h:507:43: warning: arithmetic between different 
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
[-Wenum-enum-conversion]
 507 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
 |~ ^
 508 |NR_VM_NUMA_EVENT_ITEMS +
 |~~
   include/linux/vmstat.h:514:36: warning: arithmetic between different 
enumeration types ('enum node_stat_item' and 'enum lru_list') 
[-Wenum-enum-conversion]
 514 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
 |   ~~~ ^ ~~~
   include/linux/vmstat.h:519:43: warning: arithmetic between different 
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
[-Wenum-enum-conversion]
 519 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
 |~ ^
 520 |NR_VM_NUMA_EVENT_ITEMS +
 |~~
   include/linux/vmstat.h:528:43: warning: arithmetic between different 
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
[-Wenum-enum-conversion]
 528 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
 |~ ^
 529 |NR_VM_NUMA_EVENT_ITEMS +
 |~~
>> kernel/tracepoint.c:751:34: error: no member named 
>> '__start___tracepoints_ptrs' in 'struct module'
 751 | 
for_each_tracepoint_range(mod->__start___tracepoints_ptrs,
 |   ~~~  ^
   5 warnings and 1 error generated.


vim +751 kernel/tracepoint.c

   738  
   739  void for_each_module_tracepoint(void (*fct)(struct tracepoint *tp, void 
*priv),
   740  void *priv)
   741  {
   742  struct tp_module *tp_mod;
   743  struct module *mod;
   744  
   745  if (!mod->num_tracepoints)
   746  return;
   747  
   748  mutex_lock(&tracepoint_module_list_mutex);
   749  list_for_each_entry(tp_mod, &tracepoint_module_list, list) {
   750  mod = tp_mod->mod;
 > 751  
 > for_each_tracepoint_range(mod->__start___tracepoints_ptrs,
   752  mod->tracepoints_ptrs + mod->num_tracepoints,
   753  fct, priv);
   754  }
   755  mutex_unlock(&tracepoint_module_list_mutex);
   756  }
   757  #endif /* CONFIG_MODULES */
   758  
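
For reference, the fix would presumably take 'mod' from the tp_module entry
before testing it and use the struct module members that do exist
(tracepoints_ptrs / num_tracepoints) rather than __start___tracepoints_ptrs.
A hedged sketch, not the author's actual fix:

void for_each_module_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
				void *priv)
{
	struct tp_module *tp_mod;
	struct module *mod;

	mutex_lock(&tracepoint_module_list_mutex);
	list_for_each_entry(tp_mod, &tracepoint_module_list, list) {
		mod = tp_mod->mod;
		if (!mod->num_tracepoints)
			continue;

		for_each_tracepoint_range(mod->tracepoints_ptrs,
				mod->tracepoints_ptrs + mod->num_tracepoints,
				fct, priv);
	}
	mutex_unlock(&tracepoint_module_list_mutex);
}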

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH RFC 3/4] mm, slab: add static key for should_failslab()

2024-05-31 Thread Alexei Starovoitov
On Fri, May 31, 2024 at 2:33 AM Vlastimil Babka  wrote:
>
> Since commit 4f6923fbb352 ("mm: make should_failslab always available for
> fault injection") should_failslab() is unconditionally a noinline
> function. This adds visible overhead to the slab allocation hotpath,
> even if the function is empty. With CONFIG_FAILSLAB=y there's additional
> overhead when the functionality is not enabled by a boot parameter or
> debugfs.
>
> The overhead can be eliminated with a static key around the callsite.
> Fault injection and error injection frameworks can now be told that
> this function has a static key associated with it, and are able to enable and
> disable it accordingly.
>
> Signed-off-by: Vlastimil Babka 
> ---
>  mm/failslab.c |  2 +-
>  mm/slab.h |  3 +++
>  mm/slub.c | 10 +++---
>  3 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/mm/failslab.c b/mm/failslab.c
> index ffc420c0e767..878fd08e5dac 100644
> --- a/mm/failslab.c
> +++ b/mm/failslab.c
> @@ -9,7 +9,7 @@ static struct {
> bool ignore_gfp_reclaim;
> bool cache_filter;
>  } failslab = {
> -   .attr = FAULT_ATTR_INITIALIZER,
> +   .attr = FAULT_ATTR_INITIALIZER_KEY(&should_failslab_active.key),
> .ignore_gfp_reclaim = true,
> .cache_filter = false,
>  };
> diff --git a/mm/slab.h b/mm/slab.h
> index 5f8f47c5bee0..792e19cb37b8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  /*
>   * Internal slab definitions
> @@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), 
> sizeof(freelist_aba_t)
>   */
>  #define slab_page(s) folio_page(slab_folio(s), 0)
>
> +DECLARE_STATIC_KEY_FALSE(should_failslab_active);
> +
>  /*
>   * If network-based swap is enabled, sl*b must keep track of whether pages
>   * were allocated from pfmemalloc reserves.
> diff --git a/mm/slub.c b/mm/slub.c
> index 0809760cf789..3bb579760a37 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3874,13 +3874,15 @@ static __always_inline void 
> maybe_wipe_obj_freeptr(struct kmem_cache *s,
> 0, sizeof(void *));
>  }
>
> +DEFINE_STATIC_KEY_FALSE(should_failslab_active);
> +
>  noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
>  {
> if (__should_failslab(s, gfpflags))
> return -ENOMEM;
> return 0;
>  }
> -ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
> +ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, &should_failslab_active);
>
>  static __fastpath_inline
>  struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
> @@ -3889,8 +3891,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct 
> kmem_cache *s, gfp_t flags)
>
> might_alloc(flags);
>
> -   if (unlikely(should_failslab(s, flags)))
> -   return NULL;
> +   if (static_branch_unlikely(&should_failslab_active)) {
> +   if (should_failslab(s, flags))
> +   return NULL;
> +   }

makes sense.
Acked-by: Alexei Starovoitov 

Do you have any microbenchmark numbers before/after this optimization?



Re: [PATCH RFC 0/4] static key support for error injection functions

2024-05-31 Thread Mark Rutland
Hi,

On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
> Incomplete, help needed from ftrace/kprobe and bpf folks.

> - the generic error injection using kretprobes with
>   override_function_with_return is handled in patch 2. The
>   ALLOW_ERROR_INJECTION() annotation is extended so that static key
>   address can be passed, and the framework controls it when error
>   injection is enabled or disabled in debugfs for the function.
> 
> There are two more users I know of but am not familiar enough to fix up
> myself. I hope people that are more familiar can help me here.
> 
> - ftrace seems to be using override_function_with_return from
>   #define ftrace_override_function_with_return but I found no place
>   where the latter is used. I assume it might be hidden behind more
>   macro magic? But the point is if ftrace can be instructed to act like
>   an error injection, it would also have to use some form of metadata
>   (from patch 2 presumably?) to get to the static key and control it.

I don't think you've missed anything; nothing currently uses
ftrace_override_function_with_return(). I added that in commit:

  94d095ffa0e16bb7 ("ftrace: abstract DYNAMIC_FTRACE_WITH_ARGS accesses")

... so that it was possible to do anything that was possible with
FTRACE_WITH_REGS and/or kprobes, under the expectation that we might
want to move fault injection and BPF probes over to fprobes in future,
as ftrace/fprobes is generally faster than kprobes (e.g. for
architectures that can't do KPROBES_ON_FTRACE or OPTPROBES).

That's just the mechanism for the handler to use; I'd expect whatever
registered the handler to be responsible for flipping the static key,
and I don't think anything needs to change within ftrace itself.
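
For what it's worth, a handler built on that helper might look roughly like
this (a sketch assuming DYNAMIC_FTRACE_WITH_ARGS; should_inject() and the
error value stand in for whatever the registering framework provides):

static void error_inject_ftrace_handler(unsigned long ip, unsigned long parent_ip,
					struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	/* the framework that registered this handler also flips its static key */
	if (!should_inject(ip))
		return;

	ftrace_regs_set_return_value(fregs, -ENOMEM);
	ftrace_override_function_with_return(fregs);
}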

>   If ftrace can only observe the function being called, maybe it
>   wouldn't be wrong to just observe nothing if the static key isn't
>   enabled because nobody is doing the fault injection?

Yep, that sounds right to me.

Mark.



Re: [PATCH 10/20] function_graph: Have the instances use their own ftrace_ops for filtering

2024-05-31 Thread Google
On Fri, 31 May 2024 02:03:46 -0400
Steven Rostedt  wrote:

> On Fri, 31 May 2024 12:12:41 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > On Thu, 30 May 2024 22:30:57 -0400
> > Steven Rostedt  wrote:
> > 
> > > On Fri, 24 May 2024 22:37:02 -0400
> > > Steven Rostedt  wrote:
> > >   
> > > > From: "Steven Rostedt (VMware)" 
> > > > 
> > > > Allow for instances to have their own ftrace_ops part of the fgraph_ops
> > > > that makes the function_graph tracer filter on the set_ftrace_filter file
> > > > of the instance and not the top instance.
> > > > 
> > > > Note that this also requires to update ftrace_graph_func() to call new
> > > > function_graph_enter_ops() instead of function_graph_enter() so that
> > > > it avoids pushing on the shadow stack multiple times on the same function.  
> > > 
> > > So I found a major design flaw in this patch.
> > >   
> > > > 
> > > > Co-developed with Masami Hiramatsu:
> > > > Link: 
> > > > https://lore.kernel.org/linux-trace-kernel/171509102088.162236.15758883237657317789.stgit@devnote2
> > > > 
> > > > Signed-off-by: Steven Rostedt (VMware) 
> > > > Signed-off-by: Masami Hiramatsu (Google) 
> > > > Signed-off-by: Steven Rostedt (Google) 
> > > > ---  
> > >   
> > > > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > > > index 8da0e66ca22d..998558cb8f15 100644
> > > > --- a/arch/x86/kernel/ftrace.c
> > > > +++ b/arch/x86/kernel/ftrace.c
> > > > @@ -648,9 +648,24 @@ void ftrace_graph_func(unsigned long ip, unsigned 
> > > > long parent_ip,
> > > >struct ftrace_ops *op, struct ftrace_regs *fregs)
> > > >  {
> > > > struct pt_regs *regs = &fregs->regs;
> > > > -   unsigned long *stack = (unsigned long 
> > > > *)kernel_stack_pointer(regs);
> > > > +   unsigned long *parent = (unsigned long 
> > > > *)kernel_stack_pointer(regs);
> > > > +   struct fgraph_ops *gops = container_of(op, struct fgraph_ops, 
> > > > ops);
> > > > +   int bit;
> > > > +
> > > > +   if (unlikely(ftrace_graph_is_dead()))
> > > > +   return;
> > > > +
> > > > +   if (unlikely(atomic_read(&current->tracing_graph_pause)))
> > > > +   return;
> > > >  
> > > > -   prepare_ftrace_return(ip, (unsigned long *)stack, 0);
> > > > +   bit = ftrace_test_recursion_trylock(ip, *parent);
> > > > +   if (bit < 0)
> > > > +   return;
> > > > +
> > > > +   if (!function_graph_enter_ops(*parent, ip, 0, parent, gops))  
> > > 
> > > So each registered graph ops has its own ftrace_ops which gets
> > > registered with ftrace, so this function does get called in a loop (by
> > > the ftrace iterator function). This means that we would need that code
> > > to detect the function_graph_enter_ops() getting called multiple times
> > > for the same function. This means each fgraph_ops gets its own retstack
> > > on the shadow stack.  
> > 
> > Ah, that is my concern and the reason why I added bitmap and stack reuse
> > code in the ftrace_push_return_trace().
> > 
> > > 
> > > I find this a waste of shadow stack resources, and also complicates the
> > > code with having to deal with tail calls and all that.
> > > 
> > > BUT! There's good news! I also thought about another way of handling
> > > this. I have something working, but requires a bit of rewriting the
> > > code. I should have something out in a day or two.  
> > 
> > Hmm, I just wonder why you don't recover my bitmap check and stack
> > reusing code. Are there any problem on it? (Too complicated?)
> > 
> 
> I actually dislike the use of ftrace itself to do the loop. I rather
> have fgraph be in control of it.

(actually, I agreed with you, looping in ftrace may cause trouble)

> 
> I've come up with a new "subops" assignment, where you can have one
> ftrace_ops represent multiple sub ftrace_ops. Basically, each fgraph
> ops can register its own ftrace_ops under a single graph_ops
> ftrace_ops. The graph_ops will be used to decide what functions call
> the callback, and then the callback does the multiplexing.

So is it similar to the fprobe/kprobe case, using a shared single ftrace_ops,
but keeping each fgraph's own hash table?

> This removes the need to touch the architecture code. It can also be
> used by fprobes to handle the attachments to functions for several
> different sets of callbacks.
> 
> I'll send out patches soon.

OK, I'll wait for that.

Thank you!

> 
> -- Steve
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 0/3] tracing: Fix some selftest issues

2024-05-31 Thread Google
On Fri, 31 May 2024 03:24:25 -0400
Steven Rostedt  wrote:

> On Fri, 31 May 2024 11:37:21 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > So, in summary, it is designed to be a module. Steve, I think these tests
> > should be kept as modules. There are many reason to do so.
> > 
> >  - This test is designed to be used as a module.
> >  - This can conflict with other boot time selftest if it is embedded.
> >  - We can make these tests and the boot time selftest mutually exclusive but
> >if we make these tests as modules, we can build and run both tests
> >safely.
> >  - Embedding these tests leaves new events behind when the kernel boots, which
> >    the user must clean up manually.
> > 
> > What would you think?
> 
> I was mostly following what Ingo told me long ago, where having it
> built in is just one more way to test it ;-)

Ah, do you mean embedding the module is also a part of the tests?

> 
> But that said, from your first patch, you show the stack dump and
> mention:
> 
> > Since the kprobes and synth event generation tests add and enable
> > generated events in init_module() and delete them in exit_module(),
> > if we make them built-in, those events are left in the kernel and cause
> > the kprobe event self-test to fail.
> 
> But you don't explain what exactly the conflict is. What about those
> events causes kprobe selftests to fail?

The major conflict happens when the boot-time test cleans up the kprobe
events by

  dyn_events_release_all(&trace_kprobe_ops);

And I removed that in patch [3/3] of this series :) because it is not
needed and it is not guaranteed that there are no other kprobe events when the test
starts. Also the warning messages are redundant, so I removed them in [2/3].

So without this [1/3], if we apply [2/3] and [3/3], the problem will be
mitigated, but I think the root cause is that these modules are built-in.

Thank you,

> 
> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] livepatch: introduce klp_func called interface

2024-05-31 Thread Miroslav Benes
> And for the unlikely branch, won't the compiler compile this branch
> into a cold branch that does no harm to the function's performance?

The test (cmp insn or something like that) still needs to be there. Since 
there is only a simple assignment in the branch, the compiler may just 
choose not to have a cold branch in this case. The only difference is in 
which case you would jump here. You can see for yourself (and prove me 
wrong if it comes to it).
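
For illustration, the construct being discussed boils down to something like
this sketch (the 'called' flag in the klp_func structure is assumed from the
patch; this is not the patch itself):

struct klp_func_sketch {
	bool called;
	/* ... */
};

static void handler_sketch(struct klp_func_sketch *func)
{
	if (unlikely(!func->called))	/* the test itself runs on every call */
		func->called = true;	/* only this store can be placed out of line */
}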

Miroslav

Re: [PATCH] livepatch: introduce klp_func called interface

2024-05-31 Thread zhang warden



> On May 31, 2024, at 15:21, Miroslav Benes  wrote:
> 
> Hi,
> 
> On Fri, 31 May 2024, zhang warden wrote:
> 
> you have not replied to my questions/feedback yet.
> 
> Also, I do not think that unlikely changes anything here. It is a simple 
> branch after all.
> 
> Miroslav
Hi Miroslav,

Sorry for my carelessness. I apologise for my ignorance.

> Second, livepatch
> already uses ftrace for function replacement; I don't think it is a
> good choice to use a kernel tracing tool to trace a patched function.

I admit that ftrace can be used for tracing the newly patched function. But in some
cases, a user who wants to know the state of this function can simply cat the
'called' interface.

It is more convenient than using ftrace to trace the state.

And for the unlikely branch, won't the compiler compile this branch into a
cold branch that does no harm to the function's performance?

Regards,
Wardenjohn





Re: [PATCH v6 2/5] mfd: add driver for Marvell 88PM886 PMIC

2024-05-31 Thread Karel Balej
Lee Jones, 2024-05-31T11:24:52+01:00:
> Are you planning on seeing to Mark's review comments?

Indeed, I'm hoping that I will be able to send it over the weekend.

K. B.



Re: [PATCH v3 1/3] arm64: dts: qcom: pm7250b: Add node for PMIC VBUS booster

2024-05-31 Thread Konrad Dybcio
On 30.05.2024 5:05 PM, Luca Weiss wrote:
> Add the required DTS node for the USB VBUS output regulator, which is
> available on PM7250B. This will provide the VBUS source to connected
> peripherals.
> 
> Reviewed-by: Bryan O'Donoghue 
> Signed-off-by: Luca Weiss 
> ---

Reviewed-by: Konrad Dybcio 

Konrad



Re: [PATCH] arm64: dts: qcom: sm8550-samsung-q5q: fix typo

2024-05-31 Thread Krzysztof Kozlowski
On 31/05/2024 14:05, David Wronek wrote:
> It looks like "cdsp_mem" was pasted in the license header by accident.
> Fix the typo by removing it.
> 
> Signed-off-by: David Wronek 
> ---
>  arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts | 2 +-

Fixes: ba2c082a401f ("arm64: dts: qcom: sm8550: Add support for Samsung Galaxy 
Z Fold5")
Reviewed-by: Krzysztof Kozlowski 


Best regards,
Krzysztof




[PATCH] arm64: dts: qcom: sm8550-samsung-q5q: fix typo

2024-05-31 Thread David Wronek
It looks like "cdsp_mem" was pasted in the license header by accident.
Fix the typo by removing it.

Signed-off-by: David Wronek 
---
 arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts 
b/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts
index 4654ae1364ba..3d351e90bb39 100644
--- a/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts
+++ b/arch/arm64/boot/dts/qcom/sm8550-samsung-q5q.dts
@@ -1,4 +1,4 @@
-// SPDX-License-cdsp_memIdentifier: BSD-3-Clause
+// SPDX-License-Identifier: BSD-3-Clause
 /*
  * Copyright (c) 2024, Alexandru Marc Serdeliuc 
  * Copyright (c) 2024, David Wronek 

---
base-commit: 0e1980c40b6edfa68b6acf926bab22448a6e40c9
change-id: 20240531-fix-typo-q5q-5d34423b7bb4

Best regards,
-- 
David Wronek 




Re: [PATCH 0/2] livepatch: Add compiler optimization disclaimer/docs

2024-05-31 Thread Miroslav Benes
Hi,

On Tue, 21 Jul 2020, Joe Lawrence wrote:

> In light of [PATCH] Revert "kbuild: use -flive-patching when
> CONFIG_LIVEPATCH is enabled" [1], we should add some loud disclaimers
> and explanation of the impact compiler optimizations have on
> livepatching.
> 
> The first commit provides detailed explanations and examples.  The list
> was taken mostly from Miroslav's LPC talk a few years back.  This is a
> bit rough, so corrections and additional suggestions welcome.  Expanding
> upon the source-based patching approach would be helpful, too.
> 
> The second commit adds a small README.rst file in the livepatch samples
> directory pointing the reader to the doc introduced in the first commit.
> 
> I didn't touch the livepatch kselftests yet as I'm still unsure about
> how to best account for IPA here.  We could add the same README.rst
> disclaimer here, too, but perhaps we have a chance to do something more.
> Possibilities range from checking for renamed functions as part of their
> build, or the selftest scripts, or even adding something to the kernel
> API.  I think we'll have a better idea after reviewing the compiler
> considerations doc.

thanks to Marcos for resurrecting this.

Joe, do you have an updated version by any chance? Some things have 
changed since July 2020 so it calls for a new review. If there was an 
improved version, it would be easier. If not, no problem at all.

Miroslav



Re: [PATCH v6 2/5] mfd: add driver for Marvell 88PM886 PMIC

2024-05-31 Thread Lee Jones
On Sat, 04 May 2024, Karel Balej wrote:

> Marvell 88PM886 is a PMIC which provides various functions such as
> onkey, battery, charger and regulators. It is found for instance in the
> samsung,coreprimevelte smartphone with which this was tested. Implement
> basic support to allow for the use of regulators and onkey.
> 
> Signed-off-by: Karel Balej 
> ---
> 
> Notes:
> v6:
> - Address Lee's comments:
>   - Don't break long line in the power off handler.
>   - Set PLATFORM_DEVID_NONE. This should be safe with respect to
> collisions since there are no known devices with more than one of
> these PMICs, plus given their general purpose nature it is unlikely
> that there would ever be. Also include the corresponding header.
>   - Move all defines to the header.
> - Give the base page's maximum register its real name.
> - Set irq_base to 0.
> v5:
> - Address Mark's feedback:
>   - Move regmap config back out of the header and rename it. Also lower
> its maximum register based on what's actually used in the downstream
> code.
> RFC v4:
> - Use MFD_CELL_* macros.
> - Address Lee's feedback:
>   - Do not define regmap_config.val_bits and .reg_bits.
>   - Drop everything regulator related except mfd_cell (regmap
> initialization, IDs enum etc.). Drop pm886_initialize_subregmaps.
>   - Do not store regmap pointers as an array as there is now only one
> regmap. Also drop the corresponding enum.
>   - Move regmap_config to the header as it is needed in the regulators
> driver.
>   - pm886_chip.whoami -> chip_id
>   - Reword chip ID mismatch error message and print the ID as
> hexadecimal.
>   - Fix includes in include/linux/88pm886.h.
>   - Drop the pm886_irq_number enum and define the (for the moment) only
> IRQ explicitly.
> - Have only one MFD cell for all regulators as they are now registered
>   all at once in the regulators driver.
> - Reword commit message.
> - Make device table static and remove comma after the sentinel to signal
>   that nothing should come after it.
> RFC v3:
> - Drop onkey cell .of_compatible.
> - Rename LDO page offset and regmap to REGULATORS.
> RFC v2:
> - Remove some abstraction.
> - Sort includes alphabetically and add linux/of.h.
> - Depend on OF, remove of_match_ptr and add MODULE_DEVICE_TABLE.
> - Use more temporaries and break long lines.
> - Do not initialize ret in probe.
> - Use the wakeup-source DT property.
> - Rename ret to err.
> - Address Lee's comments:
>   - Drop patched in presets for base regmap and related defines.
>   - Use full sentences in comments.
>   - Remove IRQ comment.
>   - Define regmap_config member values.
>   - Rename data to sys_off_data.
>   - Add _PMIC suffix to Kconfig.
>   - Use dev_err_probe.
>   - Do not store irq_data.
>   - s/WHOAMI/CHIP_ID
>   - Drop LINUX part of include guard name.
>   - Merge in the regulator series modifications in order to have more
> devices and modify the commit message accordingly. Changes with
> respect to the original regulator series patches:
> - ret -> err
> - Add temporary for dev in pm88x_initialize_subregmaps.
> - Drop of_compatible for the regulators.
> - Do not duplicate LDO regmap for bucks.
> - Rewrite commit message.
> 
>  drivers/mfd/88pm886.c   | 148 
>  drivers/mfd/Kconfig |  12 +++
>  drivers/mfd/Makefile|   1 +
>  include/linux/mfd/88pm886.h |  69 +
>  4 files changed, 230 insertions(+)
>  create mode 100644 drivers/mfd/88pm886.c
>  create mode 100644 include/linux/mfd/88pm886.h

I don't see any more issues.

Are you planning on seeing to Mark's review comments?

-- 
Lee Jones [李琼斯]



[PATCH] tracing/fprobe: Support raw tracepoint events on modules

2024-05-31 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Support raw tracepoint event on module by fprobe events.
Since it only uses for_each_kernel_tracepoint() to find a tracepoint,
the tracepoints on modules are not handled. Thus if user specified a
tracepoint on a module, it shows an error.
This adds new for_each_module_tracepoint() API to tracepoint subsystem,
and uses it to find tracepoints on modules.

Reported-by: don 
Closes: 
https://lore.kernel.org/all/20240530215718.aeec973a1d0bf058d39cb...@kernel.org/
Signed-off-by: Masami Hiramatsu (Google) 
---
 include/linux/tracepoint.h  |7 +++
 kernel/trace/trace_fprobe.c |   46 ---
 kernel/tracepoint.c |   19 ++
 3 files changed, 64 insertions(+), 8 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 689b6d71590e..46e6a5e759fd 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -65,6 +65,8 @@ struct tp_module {
 bool trace_module_has_bad_taint(struct module *mod);
 extern int register_tracepoint_module_notifier(struct notifier_block *nb);
 extern int unregister_tracepoint_module_notifier(struct notifier_block *nb);
+void for_each_module_tracepoint(void (*fct)(struct tracepoint *, void *),
+   void *priv);
 #else
 static inline bool trace_module_has_bad_taint(struct module *mod)
 {
@@ -80,6 +82,11 @@ int unregister_tracepoint_module_notifier(struct 
notifier_block *nb)
 {
return 0;
 }
+static inline
+void for_each_module_tracepoint(void (*fct)(struct tracepoint *, void *),
+   void *priv)
+{
+}
 #endif /* CONFIG_MODULES */
 
 /*
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 62e6a8f4aae9..1d8a983e1edc 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -385,6 +385,7 @@ static struct trace_fprobe *alloc_trace_fprobe(const char 
*group,
   const char *event,
   const char *symbol,
   struct tracepoint *tpoint,
+  struct module *mod,
   int maxactive,
   int nargs, bool is_return)
 {
@@ -405,6 +406,7 @@ static struct trace_fprobe *alloc_trace_fprobe(const char 
*group,
tf->fp.entry_handler = fentry_dispatcher;
 
tf->tpoint = tpoint;
+   tf->mod = mod;
tf->fp.nr_maxactive = maxactive;
 
	ret = trace_probe_init(&tf->tp, event, group, false, nargs);
@@ -895,8 +897,23 @@ static struct notifier_block tracepoint_module_nb = {
 struct __find_tracepoint_cb_data {
const char *tp_name;
struct tracepoint *tpoint;
+   struct module *mod;
 };
 
+static void __find_tracepoint_module_cb(struct tracepoint *tp, void *priv)
+{
+   struct __find_tracepoint_cb_data *data = priv;
+
+   if (!data->tpoint && !strcmp(data->tp_name, tp->name)) {
+   data->tpoint = tp;
+   data->mod = __module_text_address((unsigned long)tp->probestub);
+   if (!try_module_get(data->mod)) {
+   data->tpoint = NULL;
+   data->mod = NULL;
+   }
+   }
+}
+
 static void __find_tracepoint_cb(struct tracepoint *tp, void *priv)
 {
struct __find_tracepoint_cb_data *data = priv;
@@ -905,14 +922,28 @@ static void __find_tracepoint_cb(struct tracepoint *tp, 
void *priv)
data->tpoint = tp;
 }
 
-static struct tracepoint *find_tracepoint(const char *tp_name)
+/*
+ * Find a tracepoint from kernel and module. If the tracepoint is in a module,
+ * this increments the module refcount to prevent unloading until the
+ * trace_fprobe is registered to the list. After registering the trace_fprobe
+ * on the trace_fprobe list, the module refcount is decremented because
+ * tracepoint_probe_module_cb will handle it.
+ */
+static struct tracepoint *find_tracepoint(const char *tp_name,
+ struct module **tp_mod)
 {
struct __find_tracepoint_cb_data data = {
.tp_name = tp_name,
+   .mod = NULL,
};
 
	for_each_kernel_tracepoint(__find_tracepoint_cb, &data);
 
+   if (!data.tpoint && IS_ENABLED(CONFIG_MODULES)) {
+   for_each_module_tracepoint(__find_tracepoint_module_cb, &data);
+   *tp_mod = data.mod;
+   }
+
return data.tpoint;
 }
 
@@ -996,6 +1027,7 @@ static int __trace_fprobe_create(int argc, const char 
*argv[])
char abuf[MAX_BTF_ARGS_LEN];
char *dbuf = NULL;
bool is_tracepoint = false;
+   struct module *tp_mod = NULL;
struct tracepoint *tpoint = NULL;
struct traceprobe_parse_context ctx = {
.flags = TPARG_FL_KERNEL | TPARG_FL_FPROBE,
@@ -1080,7 +1112,7 @@ static int 

[PATCH RFC 4/4] mm, page_alloc: add static key for should_fail_alloc_page()

2024-05-31 Thread Vlastimil Babka
Similarly to should_failslab(), remove the overhead of calling the
noinline function should_fail_alloc_page() with a static key that guards
the allocation hotpath callsite and is controlled by the fault and error
injection frameworks.

Signed-off-by: Vlastimil Babka 
---
 mm/fail_page_alloc.c |  3 ++-
 mm/internal.h|  2 ++
 mm/page_alloc.c  | 11 ---
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/fail_page_alloc.c b/mm/fail_page_alloc.c
index b1b09cce9394..0906b76d78e8 100644
--- a/mm/fail_page_alloc.c
+++ b/mm/fail_page_alloc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
 #include 
+#include "internal.h"
 
 static struct {
struct fault_attr attr;
@@ -9,7 +10,7 @@ static struct {
bool ignore_gfp_reclaim;
u32 min_order;
 } fail_page_alloc = {
-   .attr = FAULT_ATTR_INITIALIZER,
+   .attr = FAULT_ATTR_INITIALIZER_KEY(&should_fail_alloc_page_active.key),
.ignore_gfp_reclaim = true,
.ignore_gfp_highmem = true,
.min_order = 1,
diff --git a/mm/internal.h b/mm/internal.h
index b2c75b12014e..8539e39b02e6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -410,6 +410,8 @@ extern char * const zone_names[MAX_NR_ZONES];
 /* perform sanity checks on struct pages being allocated or freed */
 DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
 
+DECLARE_STATIC_KEY_FALSE(should_fail_alloc_page_active);
+
 extern int min_free_kbytes;
 
 void setup_per_zone_wmarks(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..e5dc3bafa549 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -274,6 +274,8 @@ int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
 
+DEFINE_STATIC_KEY_FALSE(should_fail_alloc_page_active);
+
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
 EXPORT_SYMBOL(movable_zone);
@@ -3012,7 +3014,7 @@ noinline bool should_fail_alloc_page(gfp_t gfp_mask, 
unsigned int order)
 {
return __should_fail_alloc_page(gfp_mask, order);
 }
-ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
+ALLOW_ERROR_INJECTION_KEY(should_fail_alloc_page, TRUE, 
&should_fail_alloc_page_active);
 
 static inline long __zone_watermark_unusable_free(struct zone *z,
unsigned int order, unsigned int alloc_flags)
@@ -4430,8 +4432,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
 
might_alloc(gfp_mask);
 
-   if (should_fail_alloc_page(gfp_mask, order))
-   return false;
+   if (static_branch_unlikely(&should_fail_alloc_page_active)) {
+   if (should_fail_alloc_page(gfp_mask, order)) {
+   return false;
+   }
+   }
 
*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
 

-- 
2.45.1




[PATCH RFC 0/4] static key support for error injection functions

2024-05-31 Thread Vlastimil Babka
Incomplete, help needed from ftrace/kprobe and bpf folks.

As previously mentioned by myself [1] and others [2] the functions
designed for error injection can bring visible overhead in fastpaths
such as slab or page allocation, because even if nothing hooks into them
at a given moment, they are noninline function calls regardless of
CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
always available for fault injection") and af3b854492f3
("mm/page_alloc.c: allow error injection").

Live patching their callsites has been also suggested in both [1] and
[2] threads, and this is an attempt to do that with static keys that
guard the call sites. When disabled, the error injection functions still
exist and are noinline, but are not being called. Any of the existing
mechanisms that can inject errors should make sure to enable the
respective static key. I have added that support to some of them but
need help with the others.

- the legacy fault injection, i.e. CONFIG_FAILSLAB and
  CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
  address of the static key if it exists. The key will be activated if the
  fault injection probability becomes non-zero, and deactivated in the
  opposite transition. This also removes the overhead of the evaluation
  (on top of the noninline function call) when these mechanisms are
  configured in the kernel but unused at the moment.

- the generic error injection using kretprobes with
  override_function_with_return is handled in patch 2. The
  ALLOW_ERROR_INJECTION() annotation is extended so that static key
  address can be passed, and the framework controls it when error
  injection is enabled or disabled in debugfs for the function.

There are two more users I know of but am not familiar enough to fix up
myself. I hope people that are more familiar can help me here.

- ftrace seems to be using override_function_with_return from
  #define ftrace_override_function_with_return but I found no place
  where the latter is used. I assume it might be hidden behind more
  macro magic? But the point is if ftrace can be instructed to act like
  an error injection, it would also have to use some form of metadata
  (from patch 2 presumably?) to get to the static key and control it.

  If ftrace can only observe the function being called, maybe it
  wouldn't be wrong to just observe nothing if the static key isn't
  enabled because nobody is doing the fault injection?

- bpftrace, as can be seen from the example in commit 4f6923fbb352
  description. I suppose bpf is already aware what functions the
  currently loaded bpf programs hook into, so that it could look up the
  static key and control it. Maybe using again the metadata from patch 2,
  or extending its own, as I've noticed there's e.g. BTF_ID(func,
  should_failslab)

Now I realize maybe handling this at the k(ret)probe level would be
sufficient for all cases except the legacy fault injection from Patch 1?
Also wanted to note that, AFAIU, using the static_key_slow_dec/inc
API (as done in patches 1/2) should allow all mechanisms to coexist
naturally without fighting each other on the static key state, and also
handle the reference count for e.g. active probes or bpf programs if
there's no similar internal mechanism.
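
For illustration, the pattern each mechanism is expected to follow looks
roughly like this (a hedged sketch with made-up names, not code from this
series):

DEFINE_STATIC_KEY_FALSE(should_fail_foo_active);

noinline int should_fail_foo(void)
{
	return 0;	/* return value overridden when injection fires */
}

/* hot path: no call at all while the key is disabled */
static inline int foo_check(void)
{
	if (static_branch_unlikely(&should_fail_foo_active))
		return should_fail_foo();
	return 0;
}

/* each injection mechanism pairs these on attach/detach */
static void foo_injection_attach(void)
{
	static_key_slow_inc(&should_fail_foo_active.key);
}

static void foo_injection_detach(void)
{
	static_key_slow_dec(&should_fail_foo_active.key);
}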

Patches 3 and 4 implement the static keys for the two mm fault injection
sites in slab and page allocators. For a quick demonstration I've run a
VM and the simple test from [1] that stresses the slab allocator and got
this time before the series:

real0m8.349s
user0m0.694s
sys 0m7.648s

with perf showing

   0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
   0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page



And after the series

real0m7.924s
user0m0.727s
sys 0m7.191s

and the functions gone from perf report.

There might be other such fault injection callsites in hotpaths of other
subsystems but I didn't search for them at this point.

[1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccb...@suse.cz/
[2] 
https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/

Signed-off-by: Vlastimil Babka 
---
Vlastimil Babka (4):
  fault-inject: add support for static keys around fault injection sites
  error-injection: support static keys around injectable functions
  mm, slab: add static key for should_failslab()
  mm, page_alloc: add static key for should_fail_alloc_page()

 include/asm-generic/error-injection.h | 13 ++-
 include/asm-generic/vmlinux.lds.h |  2 +-
 include/linux/error-injection.h   |  9 +---
 include/linux/fault-inject.h  |  7 +-
 kernel/fail_function.c| 22 +++---
 lib/error-inject.c|  6 

[PATCH RFC 2/4] error-injection: support static keys around injectable functions

2024-05-31 Thread Vlastimil Babka
Error injectable functions cannot be inlined and since some are called
from hot paths, this incurs overhead even if no error injection is
enabled for them.

To remove this overhead when disabled, allow the callsites of error
injectable functions to put the calls behind a static key, which the
framework can control when error injection is enabled or disabled for
the function.

Introduce a new ALLOW_ERROR_INJECTION_KEY() macro that adds a parameter
with the static key's address, and store it in struct
error_injection_entry. This new field has caused a mismatch when
populating the injection list from the _error_injection_whitelist
section with the current STRUCT_ALIGN(), so change the alignment to 8.

During the population, copy the key's address also to struct ei_entry,
and make it possible to retrieve it along with the error type by
get_injectable_error_type().

Finally, make the processing of writes to the debugfs inject file enable
the static key when the function is added to the injection list, and
disable when removed.
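
For illustration, an error injectable function would then be annotated and
its callsite guarded roughly as follows (made-up names, a sketch rather than
part of the patch):

DEFINE_STATIC_KEY_FALSE(foo_error_hook_active);

noinline int foo_error_hook(void)
{
	return 0;	/* overridden with an errno while injection is armed */
}
ALLOW_ERROR_INJECTION_KEY(foo_error_hook, ERRNO, &foo_error_hook_active);

static inline int foo_submit(void)
{
	if (static_branch_unlikely(&foo_error_hook_active)) {
		int err = foo_error_hook();

		if (err)
			return err;
	}
	/* fast path continues */
	return 0;
}

When the function is armed through the debugfs inject file, fail_function
looks up the key via get_injectable_error_type() and enables it; removing
the function from the injection list disables it again.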

Signed-off-by: Vlastimil Babka 
---
 include/asm-generic/error-injection.h | 13 -
 include/asm-generic/vmlinux.lds.h |  2 +-
 include/linux/error-injection.h   |  9 ++---
 kernel/fail_function.c| 22 +++---
 lib/error-inject.c|  6 +-
 5 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/error-injection.h 
b/include/asm-generic/error-injection.h
index b05253f68eaa..eed2731f3820 100644
--- a/include/asm-generic/error-injection.h
+++ b/include/asm-generic/error-injection.h
@@ -12,6 +12,7 @@ enum {
 
 struct error_injection_entry {
unsigned long   addr;
+   unsigned long   static_key_addr;
int etype;
 };
 
@@ -25,16 +26,26 @@ struct pt_regs;
  * 'Error Injectable Functions' section.
  */
 #define ALLOW_ERROR_INJECTION(fname, _etype)   \
-static struct error_injection_entry __used \
+static struct error_injection_entry __used __aligned(8)
\
__section("_error_injection_whitelist") \
_eil_addr_##fname = {   \
.addr = (unsigned long)fname,   \
.etype = EI_ETYPE_##_etype, \
}
 
+#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key)  \
+static struct error_injection_entry __used __aligned(8)
\
+   __section("_error_injection_whitelist") \
+   _eil_addr_##fname = {   \
+   .addr = (unsigned long)fname,   \
+   .static_key_addr = (unsigned long)key,  \
+   .etype = EI_ETYPE_##_etype, \
+   }
+
 void override_function_with_return(struct pt_regs *regs);
 #else
 #define ALLOW_ERROR_INJECTION(fname, _etype)
+#define ALLOW_ERROR_INJECTION_KEY(fname, _etype, key)
 
 static inline void override_function_with_return(struct pt_regs *regs) { }
 #endif
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 5703526d6ebf..1b15a0af2a00 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -248,7 +248,7 @@
 
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
 #define ERROR_INJECT_WHITELIST()   \
-   STRUCT_ALIGN(); \
+   . = ALIGN(8);   \
BOUNDED_SECTION(_error_injection_whitelist)
 #else
 #define ERROR_INJECT_WHITELIST()
diff --git a/include/linux/error-injection.h b/include/linux/error-injection.h
index 20e738f4eae8..bec81b57a9d5 100644
--- a/include/linux/error-injection.h
+++ b/include/linux/error-injection.h
@@ -6,10 +6,12 @@
 #include 
 #include 
 
+struct static_key;
+
 #ifdef CONFIG_FUNCTION_ERROR_INJECTION
 
-extern bool within_error_injection_list(unsigned long addr);
-extern int get_injectable_error_type(unsigned long addr);
+bool within_error_injection_list(unsigned long addr);
+int get_injectable_error_type(unsigned long addr, struct static_key 
**key_addr);
 
 #else /* !CONFIG_FUNCTION_ERROR_INJECTION */
 
@@ -18,7 +20,8 @@ static inline bool within_error_injection_list(unsigned long 
addr)
return false;
 }
 
-static inline int get_injectable_error_type(unsigned long addr)
+static inline int get_injectable_error_type(unsigned long addr,
+   struct static_key **key_addr)
 {
return -EOPNOTSUPP;
 }
diff --git a/kernel/fail_function.c b/kernel/fail_function.c
index d971a0189319..9240eb137e00 100644
--- a/kernel/fail_function.c
+++ b/kernel/fail_function.c
@@ -27,15 +27,16 @@ struct fei_attr {
struct list_head list;
struct kprobe kp;
unsigned long retval;
+   struct 

[PATCH RFC 3/4] mm, slab: add static key for should_failslab()

2024-05-31 Thread Vlastimil Babka
Since commit 4f6923fbb352 ("mm: make should_failslab always available for
fault injection") should_failslab() is unconditionally a noinline
function. This adds visible overhead to the slab allocation hotpath,
even if the function is empty. With CONFIG_FAILSLAB=y there's additional
overhead when the functionality is not enabled by a boot parameter or
debugfs.

The overhead can be eliminated with a static key around the callsite.
Fault injection and error injection frameworks can now be told that
this function has a static key associated with it, and are able to enable and
disable it accordingly.

Signed-off-by: Vlastimil Babka 
---
 mm/failslab.c |  2 +-
 mm/slab.h |  3 +++
 mm/slub.c | 10 +++---
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/failslab.c b/mm/failslab.c
index ffc420c0e767..878fd08e5dac 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -9,7 +9,7 @@ static struct {
bool ignore_gfp_reclaim;
bool cache_filter;
 } failslab = {
-   .attr = FAULT_ATTR_INITIALIZER,
+   .attr = FAULT_ATTR_INITIALIZER_KEY(&should_failslab_active.key),
.ignore_gfp_reclaim = true,
.cache_filter = false,
 };
diff --git a/mm/slab.h b/mm/slab.h
index 5f8f47c5bee0..792e19cb37b8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Internal slab definitions
@@ -160,6 +161,8 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), 
sizeof(freelist_aba_t)
  */
 #define slab_page(s) folio_page(slab_folio(s), 0)
 
+DECLARE_STATIC_KEY_FALSE(should_failslab_active);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/slub.c b/mm/slub.c
index 0809760cf789..3bb579760a37 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3874,13 +3874,15 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
0, sizeof(void *));
 }
 
+DEFINE_STATIC_KEY_FALSE(should_failslab_active);
+
 noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
 {
if (__should_failslab(s, gfpflags))
return -ENOMEM;
return 0;
 }
-ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
+ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, &should_failslab_active);
 
 static __fastpath_inline
 struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
@@ -3889,8 +3891,10 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
 
might_alloc(flags);
 
-   if (unlikely(should_failslab(s, flags)))
-   return NULL;
+   if (static_branch_unlikely(&should_failslab_active)) {
+   if (should_failslab(s, flags))
+   return NULL;
+   }
 
return s;
 }

-- 
2.45.1
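Putting the three hunks together, the resulting pattern looks roughly like
the condensed sketch below (not a literal copy of mm/slub.c):

/* Condensed view of the pattern the diff implements. */
DEFINE_STATIC_KEY_FALSE(should_failslab_active);

/* Out-of-line hook, now reached only while the key is enabled. */
noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
{
	if (__should_failslab(s, gfpflags))
		return -ENOMEM;
	return 0;
}
/* Error injection (e.g. BPF overrides) is told about the key as well. */
ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, &should_failslab_active);

static __fastpath_inline
struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
{
	might_alloc(flags);

	/* A patched-out jump when neither failslab nor error injection
	 * is active, so the hot path makes no function call at all. */
	if (static_branch_unlikely(&should_failslab_active)) {
		if (should_failslab(s, flags))
			return NULL;
	}

	return s;
}
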




[PATCH RFC 1/4] fault-inject: add support for static keys around fault injection sites

2024-05-31 Thread Vlastimil Babka
Some fault injection sites are placed in hotpaths and incur overhead
even when not enabled, due to one or more function calls leading up to
should_fail_ex(), which returns false because attr->probability == 0.

This overhead can be eliminated if the outermost call into the checks is
guarded with a static key, so add support for that. The framework should
be told that such a static key exists for a fault_attr by initializing
fault_attr->active with the static key's address. When it is not NULL,
setup_fault_attr() enables the static key when the fault probability
is non-zero.

Also wire up writes to the debugfs "probability" file to enable or
disable the static key when the probability transitions between zero
and non-zero.

For now, do not add configfs interface support, as the immediate plan is
to leverage this for should_failslab() and should_fail_alloc_page()
after other necessary preparatory changes, and not for any of the
configfs-based fault injection users.

Signed-off-by: Vlastimil Babka 
---
 include/linux/fault-inject.h |  7 ++-
 lib/fault-inject.c   | 43 ++-
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h
index 6d5edef09d45..cfe75cc1bac4 100644
--- a/include/linux/fault-inject.h
+++ b/include/linux/fault-inject.h
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * For explanation of the elements of this struct, see
@@ -30,13 +31,14 @@ struct fault_attr {
unsigned long count;
struct ratelimit_state ratelimit_state;
struct dentry *dname;
+   struct static_key *active;
 };
 
 enum fault_flags {
FAULT_NOWARN =  1 << 0,
 };
 
-#define FAULT_ATTR_INITIALIZER {   \
+#define FAULT_ATTR_INITIALIZER_KEY(_key) { \
.interval = 1,  \
.times = ATOMIC_INIT(1),\
.require_end = ULONG_MAX,   \
@@ -44,8 +46,11 @@ enum fault_flags {
.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,   \
.verbose = 2,   \
.dname = NULL,  \
+   .active = (_key),   \
}
 
+#define FAULT_ATTR_INITIALIZER FAULT_ATTR_INITIALIZER_KEY(NULL)
+
#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
 int setup_fault_attr(struct fault_attr *attr, char *str);
 bool should_fail_ex(struct fault_attr *attr, ssize_t size, int flags);
diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index d608f9b48c10..93c46d2ec106 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -35,6 +35,9 @@ int setup_fault_attr(struct fault_attr *attr, char *str)
atomic_set(&attr->times, times);
atomic_set(&attr->space, space);
 
+   if (probability != 0 && attr->active)
+   static_key_slow_inc(attr->active);
+
return 1;
 }
 EXPORT_SYMBOL_GPL(setup_fault_attr);
@@ -166,6 +169,12 @@ EXPORT_SYMBOL_GPL(should_fail);
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
 
+/*
+ * Protect updating probability from debugfs as that may trigger static key
+ * changes when changing between zero and non-zero.
+ */
+static DEFINE_MUTEX(probability_mutex);
+
 static int debugfs_ul_set(void *data, u64 val)
 {
*(unsigned long *)data = val;
@@ -186,6 +195,38 @@ static void debugfs_create_ul(const char *name, umode_t mode,
debugfs_create_file(name, mode, parent, value, &fops_ul);
 }
 
+static int debugfs_prob_set(void *data, u64 val)
+{
+   struct fault_attr *attr = data;
+
+   mutex_lock(&probability_mutex);
+
+   if (attr->active) {
+   if (attr->probability != 0 && val == 0) {
+   static_key_slow_dec(attr->active);
+   } else if (attr->probability == 0 && val != 0) {
+   static_key_slow_inc(attr->active);
+   }
+   }
+
+   attr->probability = val;
+
+   mutex_unlock(&probability_mutex);
+
+   return 0;
+}
+
+static int debugfs_prob_get(void *data, u64 *val)
+{
+   struct fault_attr *attr = data;
+
+   *val = attr->probability;
+
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_prob, debugfs_prob_get, debugfs_prob_set, "%llu\n");
+
 #ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER
 
 static int debugfs_stacktrace_depth_set(void *data, u64 val)
@@ -218,7 +259,7 @@ struct dentry *fault_create_debugfs_attr(const char *name,
if (IS_ERR(dir))
return dir;
 
-   debugfs_create_ul("probability", mode, dir, &attr->probability);
+   debugfs_create_file("probability", mode, dir, attr, &fops_prob);
debugfs_create_ul("interval", mode, dir, &attr->interval);
debugfs_create_atomic_t("times", mode, dir, &attr->times);
debugfs_create_atomic_t("space", mode, 

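The rest of this patch is cut off in the archive. For a subsystem opting in
to the new support, usage would look roughly like the following sketch;
my_fail_key, my_fail_attr and my_op_should_fail are hypothetical names, not
part of the posted series:

#include <linux/fault-inject.h>
#include <linux/jump_label.h>

/* The subsystem owns both the static key and the fault_attr. */
static DEFINE_STATIC_KEY_FALSE(my_fail_key);
static struct fault_attr my_fail_attr =
	FAULT_ATTR_INITIALIZER_KEY(&my_fail_key.key);

/*
 * Hot path: the should_fail() call sits behind a patched-out branch and
 * only runs once the probability becomes non-zero, either via a boot
 * parameter handled by setup_fault_attr() or via debugfs.
 */
static bool my_op_should_fail(ssize_t size)
{
	if (static_branch_unlikely(&my_fail_key))
		return should_fail(&my_fail_attr, size);
	return false;
}

Writing a non-zero value to the debugfs "probability" file then enables the
key through the new debugfs_prob_set() handler, and writing zero disables it
again.
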
Re: [PATCH v2] dt-bindings: remoteproc: k3-dsp: correct optional sram properties for AM62A SoCs

2024-05-31 Thread Krzysztof Kozlowski
On 30/05/2024 18:48, Hari Nagalla wrote:
> The C7xv-dsp on AM62A has a 32KB L1 I-cache and a 64KB L1 D-cache. It
> does not have an addressable l1dram. So, remove this optional sram
> property from the bindings to fix device tree build warnings.
> 
> Signed-off-by: Hari Nagalla 
> ---
> Changes from v1 to v2:
> *) Kept back memory-regions property, as it is unrelated to this patch
>correcting the sram property for AM62A C7xv-dsp nodes.
> 
> DT binding check log:
> https://paste.sr.ht/~hnagalla/cb26237560a572a17c0874b687353e00b400285a
> 
> v1: https://lore.kernel.org/all/20230810110545.11644-1-hnaga...@ti.com/
> 
>  .../bindings/remoteproc/ti,k3-dsp-rproc.yaml  | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml b/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml
> index 9768db8663eb..771cfceb5458 100644
> --- a/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml
> +++ b/Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml
> @@ -111,7 +111,6 @@ else:
>  properties:
>compatible:
>  enum:
> -  - ti,am62a-c7xv-dsp
>- ti,j721e-c71-dsp
>- ti,j721s2-c71-dsp
>then:
> @@ -124,6 +123,20 @@ else:
>  items:
>- const: l2sram
>- const: l1dram
> +  else:
> +if:

Please use allOf (move top one to bottom) and define separate ifs for
each variant instead of nesting them.

Best regards,
Krzysztof




Re: [PATCH 0/3] tracing: Fix some selftest issues

2024-05-31 Thread Steven Rostedt
On Fri, 31 May 2024 11:37:21 +0900
Masami Hiramatsu (Google)  wrote:

> So, in summary, it is designed to be a module. Steve, I think these tests
> should be kept as modules. There are many reasons to do so.
> 
>  - This test is designed to be used as a module.
>  - It can conflict with other boot-time selftests if it is built in.
>  - We can make these tests and the boot-time selftests mutually
>    exclusive, but if we keep these tests as modules, we can build and
>    run both safely.
>  - Embedding these tests leaves new events behind when the kernel
>    boots, which the user must clean up manually.
> 
> What would you think?

I was mostly following what Ingo told me long ago, where having it
built in is just one more way to test it ;-)

But that said, from your first patch, you show the stack dump and
mention:

> Since the kprobes and synth event generation tests add and enable
> generated events in init_module() and delete them in exit_module(),
> if we make them built-in, those events are left in the kernel and cause
> kprobe event self-test failures.

But you don't explain what exactly the conflict is. What about those
events causes kprobe selftests to fail?


-- Steve



Re: [PATCH] livepatch: introduce klp_func called interface

2024-05-31 Thread Miroslav Benes
Hi,

On Fri, 31 May 2024, zhang warden wrote:

> 
> Hi Bros,
> 
> How about my patch? I do think it is a viable feature to show the state 
> of the patched function. If we add an unlikely branch test before we set 
> the 'called' state, once this function is called, there may be no
> negative effect on performance.

you have not replied to my questions/feedback yet.

Also, I do not think that unlikely changes anything here. It is a simple 
branch after all.

Miroslav



[PATCH RT 0/1] Linux v4.19.315-rt135-rc1

2024-05-31 Thread Daniel Wagner
Dear RT Folks,

This is the RT stable review cycle of patch 4.19.315-rt135-rc1.

Please scream at me if I messed something up. Please test the patches
too.

The -rc release is also available on kernel.org

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

on the v4.19-rt-next branch.

If all goes well, this patch will be converted to the next main
release on 2024-06-07.

Signing key fingerprint:

  5BF6 7BC5 0826 72CA BB45  ACAE 587C 5ECA 5D0A 306C

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Daniel

Changes from v4.19.312-rt134:


Daniel Wagner (1):
  Linux 4.19.315-rt135

 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.45.1




[PATCH RT 1/1] Linux 4.19.315-rt135

2024-05-31 Thread Daniel Wagner
v4.19.315-rt135-rc1 stable review patch.
If anyone has any objections, please let me know.

---


Signed-off-by: Daniel Wagner 
---
 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/localversion-rt b/localversion-rt
index 6067da4c8c99..e3026053f01e 100644
--- a/localversion-rt
+++ b/localversion-rt
@@ -1 +1 @@
--rt134
+-rt135
-- 
2.45.1




[syzbot] Monthly trace report (May 2024)

2024-05-31 Thread syzbot
Hello trace maintainers/developers,

This is a 31-day syzbot report for the trace subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/trace

During the period, 1 new issue was detected and 0 were fixed.
In total, 10 issues are still open and 35 have been fixed so far.

Some of the still happening issues:

Ref Crashes Repro Title
<1> 705 Yes   WARNING in format_decode (3)
  https://syzkaller.appspot.com/bug?extid=e2c932aec5c8a6e1d31c
<2> 26  Yes   INFO: task hung in blk_trace_ioctl (4)
  https://syzkaller.appspot.com/bug?extid=ed812ed461471ab17a0c
<3> 7   Yes   WARNING in get_probe_ref
  https://syzkaller.appspot.com/bug?extid=8672dcb9d10011c0a160
<4> 6   Yes   INFO: task hung in blk_trace_remove (2)
  https://syzkaller.appspot.com/bug?extid=2373f6be3e6de4f92562
<5> 5   Yes   general protection fault in bpf_get_attach_cookie_tracing
  https://syzkaller.appspot.com/bug?extid=3ab78ff125b7979e45f9

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

To disable reminders for individual bugs, reply with the following command:
#syz set  no-reminders

To change bug's subsystems, reply with:
#syz set  subsystems: new-subsystem

You may send multiple commands in a single email message.



Re: [PATCH v2] sched/rt: Clean up usage of rt_task()

2024-05-31 Thread Sebastian Andrzej Siewior
On 2024-05-30 12:10:44 [+0100], Qais Yousef wrote:
> > This is not consistent because IMHO the clock setup & slack should be
> > handled equally. So I am asking the sched folks for a policy and I am
> > leaning towards looking at task-policy in this case instead of prio
> > because you shouldn't do anything that can delay.
> 
> Can't we do that based on is_soft/is_hard flag in hrtimer struct when we apply
> the slack in hrtimer_set_expires_range_ns() instead?

We need to decide on a policy first.
You don't want to add overhead to each invocation plus an in-kernel
ask for the delta. ->is_soft is not a good criterion.

Sebastian
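
The two places the decision could be keyed look roughly like this;
illustrative only, not a proposed patch, and the helper names are made up:

#include <linux/hrtimer.h>
#include <linux/sched.h>

/* Option raised by Qais: key the decision off the timer itself. */
static u64 slack_by_timer(const struct hrtimer *timer, u64 slack)
{
	return timer->is_soft ? slack : 0;
}

/* Direction Sebastian leans towards: key it off the caller's scheduling
 * policy, so realtime tasks never get their expiry delayed by slack. */
static u64 slack_by_task_policy(const struct task_struct *tsk, u64 slack)
{
	if (tsk->policy == SCHED_FIFO || tsk->policy == SCHED_RR ||
	    tsk->policy == SCHED_DEADLINE)
		return 0;
	return slack;
}
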



Re: [PATCH 10/20] function_graph: Have the instances use their own ftrace_ops for filtering

2024-05-31 Thread Steven Rostedt
On Fri, 31 May 2024 12:12:41 +0900
Masami Hiramatsu (Google)  wrote:

> On Thu, 30 May 2024 22:30:57 -0400
> Steven Rostedt  wrote:
> 
> > On Fri, 24 May 2024 22:37:02 -0400
> > Steven Rostedt  wrote:
> >   
> > > From: "Steven Rostedt (VMware)" 
> > > 
> > > Allow for instances to have their own ftrace_ops part of the fgraph_ops
> > > that makes the function_graph tracer filter on the set_ftrace_filter file
> > > of the instance and not the top instance.
> > > 
> > > Note that this also requires updating ftrace_graph_func() to call the new
> > > function_graph_enter_ops() instead of function_graph_enter() so that
> > > it avoids pushing onto the shadow stack multiple times for the same function.
> > 
> > So I found a major design flaw in this patch.
> >   
> > > 
> > > Co-developed with Masami Hiramatsu:
> > > Link: 
> > > https://lore.kernel.org/linux-trace-kernel/171509102088.162236.15758883237657317789.stgit@devnote2
> > > 
> > > Signed-off-by: Steven Rostedt (VMware) 
> > > Signed-off-by: Masami Hiramatsu (Google) 
> > > Signed-off-by: Steven Rostedt (Google) 
> > > ---  
> >   
> > > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > > index 8da0e66ca22d..998558cb8f15 100644
> > > --- a/arch/x86/kernel/ftrace.c
> > > +++ b/arch/x86/kernel/ftrace.c
> > > @@ -648,9 +648,24 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
> > >  struct ftrace_ops *op, struct ftrace_regs *fregs)
> > >  {
> > >   struct pt_regs *regs = &fregs->regs;
> > > - unsigned long *stack = (unsigned long *)kernel_stack_pointer(regs);
> > > + unsigned long *parent = (unsigned long *)kernel_stack_pointer(regs);
> > > + struct fgraph_ops *gops = container_of(op, struct fgraph_ops, ops);
> > > + int bit;
> > > +
> > > + if (unlikely(ftrace_graph_is_dead()))
> > > + return;
> > > +
> > > + if (unlikely(atomic_read(&current->tracing_graph_pause)))
> > > + return;
> > >  
> > > - prepare_ftrace_return(ip, (unsigned long *)stack, 0);
> > > + bit = ftrace_test_recursion_trylock(ip, *parent);
> > > + if (bit < 0)
> > > + return;
> > > +
> > > + if (!function_graph_enter_ops(*parent, ip, 0, parent, gops))  
> > 
> > So each registered graph ops has its own ftrace_ops which gets
> > registered with ftrace, so this function does get called in a loop (by
> > the ftrace iterator function). This means that we would need that code
> > to detect the function_graph_enter_ops() getting called multiple times
> > for the same function. This means each fgraph_ops gets its own retstack
> > on the shadow stack.  
> 
> Ah, that is my concern and the reason why I added bitmap and stack reuse
> code in the ftrace_push_return_trace().
> 
> > 
> > I find this a waste of shadow stack resources, and also complicates the
> > code with having to deal with tail calls and all that.
> > 
> > BUT! There's good news! I also thought about another way of handling
> > this. I have something working, but requires a bit of rewriting the
> > code. I should have something out in a day or two.  
> 
> Hmm, I just wonder why you don't recover my bitmap check and stack
> reusing code. Is there any problem with it? (Too complicated?)
> 

I actually dislike the use of ftrace itself to do the loop. I'd rather
have fgraph be in control of it.

I've come up with a new "subops" assignment, where you can have one
ftrace_ops represent multiple sub ftrace_ops. Basically, each fgraph
ops can register its own ftrace_ops under a single graph_ops
ftrace_ops. The graph_ops will be used to decide what functions call
the callback, and then the callback does the multiplexing.

This removes the need to touch the architecture code. It can also be
used by fprobes to handle the attachments to functions for several
different sets of callbacks.

I'll send out patches soon.

-- Steve
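
The "subops" idea can be pictured with a rough sketch: one ftrace_ops is
registered with ftrace on the union of all filters, and its callback fans
out to the per-instance users whose own filter matches the ip. The types
and helpers below are made up for illustration and are not the actual
patches:

#include <linux/types.h>

/* Made-up per-instance structure standing in for an fgraph user. */
struct sub_user {
	bool (*wants_ip)(struct sub_user *sub, unsigned long ip);
	void (*entry)(struct sub_user *sub, unsigned long ip,
		      unsigned long parent_ip);
};

#define MAX_SUB_USERS 8
static struct sub_user *sub_users[MAX_SUB_USERS];
static int nr_sub_users;

/* The single callback the manager ftrace_ops would point at: ftrace calls
 * it once per traced function, and it multiplexes to every registered
 * instance whose own filter matches that function. */
static void muxed_graph_entry(unsigned long ip, unsigned long parent_ip)
{
	int i;

	for (i = 0; i < nr_sub_users; i++) {
		struct sub_user *sub = sub_users[i];

		if (sub->wants_ip(sub, ip))
			sub->entry(sub, ip, parent_ip);
	}
}

The manager's filter being the union of the instances' filters means ftrace
only invokes the callback for functions at least one instance cares about,
and a single shadow-stack entry can cover every matching instance, avoiding
the per-instance retstack duplication mentioned above.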