Re: [PATCH v12 3/6] tracing: Add snapshot refcount
Hi Vincent, kernel test robot noticed the following build errors: [auto build test ERROR on 4f1991a92cfe89096b2d1f5583a2e093bdd55c37] url: https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/ring-buffer-Zero-ring-buffer-sub-buffers/20240123-191131 base: 4f1991a92cfe89096b2d1f5583a2e093bdd55c37 patch link: https://lore.kernel.org/r/20240123110757.3657908-4-vdonnefort%40google.com patch subject: [PATCH v12 3/6] tracing: Add snapshot refcount config: i386-randconfig-012-20240126 (https://download.01.org/0day-ci/archive/20240126/202401261026.cc8ubwxf-...@intel.com/config) compiler: gcc-9 (Debian 9.3.0-22) 9.3.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240126/202401261026.cc8ubwxf-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202401261026.cc8ubwxf-...@intel.com/ All errors (new ones prefixed by >>): kernel/trace/trace.c: In function 'tracing_set_tracer': >> kernel/trace/trace.c:6645:9: error: 'struct tracer' has no member named >> 'use_max_tr' 6645 |if (t->use_max_tr) | ^~ >> kernel/trace/trace.c:6646:5: error: implicit declaration of function >> 'tracing_disarm_snapshot_locked'; did you mean 'tracing_disarm_snapshot'? >> [-Werror=implicit-function-declaration] 6646 | tracing_disarm_snapshot_locked(tr); | ^~ | tracing_disarm_snapshot cc1: some warnings being treated as errors vim +6645 kernel/trace/trace.c 6641 6642 if (t->init) { 6643 ret = tracer_init(t, tr); 6644 if (ret) { > 6645 if (t->use_max_tr) > 6646 tracing_disarm_snapshot_locked(tr); 6647 goto out; 6648 } 6649 } 6650 6651 tr->current_trace = t; 6652 tr->current_trace->enabled++; 6653 trace_branch_enable(tr); 6654 out: 6655 mutex_unlock(&trace_types_lock); 6656 6657 return ret; 6658 } 6659 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH v12 3/6] tracing: Add snapshot refcount
Hi Vincent, kernel test robot noticed the following build errors: [auto build test ERROR on 4f1991a92cfe89096b2d1f5583a2e093bdd55c37] url: https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/ring-buffer-Zero-ring-buffer-sub-buffers/20240123-191131 base: 4f1991a92cfe89096b2d1f5583a2e093bdd55c37 patch link: https://lore.kernel.org/r/20240123110757.3657908-4-vdonnefort%40google.com patch subject: [PATCH v12 3/6] tracing: Add snapshot refcount config: i386-buildonly-randconfig-005-20240126 (https://download.01.org/0day-ci/archive/20240126/202401260918.sn23mdek-...@intel.com/config) compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240126/202401260918.sn23mdek-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202401260918.sn23mdek-...@intel.com/ All errors (new ones prefixed by >>): >> kernel/trace/trace.c:6645:11: error: no member named 'use_max_tr' in 'struct >> tracer' 6645 | if (t->use_max_tr) | ~ ^ >> kernel/trace/trace.c:6646:5: error: call to undeclared function >> 'tracing_disarm_snapshot_locked'; ISO C99 and later do not support implicit >> function declarations [-Wimplicit-function-declaration] 6646 | tracing_disarm_snapshot_locked(tr); | ^ kernel/trace/trace.c:6646:5: note: did you mean 'tracing_disarm_snapshot'? kernel/trace/trace.h:1986:20: note: 'tracing_disarm_snapshot' declared here 1986 | static inline void tracing_disarm_snapshot(struct trace_array *tr) { } |^ 2 errors generated. vim +6645 kernel/trace/trace.c 6641 6642 if (t->init) { 6643 ret = tracer_init(t, tr); 6644 if (ret) { > 6645 if (t->use_max_tr) > 6646 tracing_disarm_snapshot_locked(tr); 6647 goto out; 6648 } 6649 } 6650 6651 tr->current_trace = t; 6652 tr->current_trace->enabled++; 6653 trace_branch_enable(tr); 6654 out: 6655 mutex_unlock(&trace_types_lock); 6656 6657 return ret; 6658 } 6659 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
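For reference, a minimal sketch of how the error path in tracing_set_tracer() could be made to build without CONFIG_TRACER_MAX_TRACE, based only on the two diagnostics above: 'use_max_tr' exists only under CONFIG_TRACER_MAX_TRACE, and trace.h appears to already provide a tracing_disarm_snapshot() stub for that configuration (the _locked variant does not). The guard and function choice here are an assumption about a possible fix, not the actual fix:

	if (t->init) {
		ret = tracer_init(t, tr);
		if (ret) {
#ifdef CONFIG_TRACER_MAX_TRACE
			/* undo the snapshot arming done earlier on this path */
			if (t->use_max_tr)
				tracing_disarm_snapshot(tr);
#endif
			goto out;
		}
	}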
Re: [PATCH v2 2/2] tracing: Include Microcode Revision in mce_record tracepoint
On 1/25/2024 3:03 PM, Sohil Mehta wrote: > On 1/25/2024 10:48 AM, Avadhut Naik wrote: >> Currently, the microcode field (Microcode Revision) of struct mce is not >> exported to userspace through the mce_record tracepoint. >> >> Export it through the tracepoint as it may provide useful information for >> debug and analysis. >> >> Signed-off-by: Avadhut Naik >> --- > > A couple of nits below. > > Apart from that the patch looks fine to me. > > Reviewed-by: Sohil Mehta > >> include/trace/events/mce.h | 7 +-- >> 1 file changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h >> index 657b93ec8176..203baccd3c5c 100644 >> --- a/include/trace/events/mce.h >> +++ b/include/trace/events/mce.h >> @@ -34,6 +34,7 @@ TRACE_EVENT(mce_record, >> __field(u8, cs ) >> __field(u8, bank) >> __field(u8, cpuvendor ) >> +__field(u32,microcode ) > > Tab alignment is inconsistent. > >> ), >> >> TP_fast_assign( >> @@ -55,9 +56,10 @@ TRACE_EVENT(mce_record, >> __entry->cs = m->cs; >> __entry->bank = m->bank; >> __entry->cpuvendor = m->cpuvendor; >> +__entry->microcode = m->microcode; >> ), >> >> -TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, >> ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: >> %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", >> +TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, >> ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: >> %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE >> REVISION: %u", > > Should microcode by printed as a decimal or an hexadecimal? Elsewhere > such as __print_mce(), it is printed as an hexadecimal: > > /* > * Note this output is parsed by external tools and old fields > * should not be changed. > */ > pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x > microcode %x\n", > m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid, > m->microcode); > > Had kept the field as decimal since I considered that version should be a decimal number instead of a hex value. Hadn't noticed the above log in core.c file. Thanks for pointing it out. Since we now have precedent that microcode version values are being reported in hex, will change the above format specifier to %x. Will just submit a new version, addressing the tab alignments too. > > >> __entry->cpu, >> __entry->mcgcap, __entry->mcgstatus, >> __entry->bank, __entry->status, >> @@ -69,7 +71,8 @@ TRACE_EVENT(mce_record, >> __entry->cpuvendor, __entry->cpuid, >> __entry->walltime, >> __entry->socketid, >> -__entry->apicid) >> +__entry->apicid, >> +__entry->microcode) >> ); >> >> #endif /* _TRACE_MCE_H */ > -- Thanks, Avadhut Naik
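For illustration only (the revision value below is made up), a small user-space snippet showing why the hex form lines up with what __print_mce() and existing parsers already see, while the decimal rendering of the same revision is hard to recognize:

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical microcode revision, not from any real CPU */
		unsigned int microcode = 0x8701021;

		printf("with %%u: %u\n", microcode);	/* 141561889 */
		printf("with %%x: %x\n", microcode);	/* 8701021 */
		return 0;
	}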
Re: [PATCH v7 0/5] Add DAX ABI for memmap_on_memory
On Wed, 24 Jan 2024 12:03:45 -0800 Vishal Verma wrote: > This series adds sysfs ABI to control memmap_on_memory behavior for DAX > devices. Thanks. I'll add this to mm-unstable for some additional testing, but I do think we should have the evidence of additional review on this series's four preparatory patches before proceeding further.
Re: Re: general protection fault in ath9k_wmi_event_tasklet
>Great, thank you for testing! I'll send a proper patch. How would you >like to be credited with reporting? Just as 'Ubisectech Sirius >' ? Hello. Please use 'Ubisectech Sirius' to credit the report. Thanks.
Re: [PATCH net-next v1] vsock/test: print type for SOCK_SEQPACKET
Hello: This patch was applied to netdev/net-next.git (main) by Jakub Kicinski : On Wed, 24 Jan 2024 22:32:55 +0300 you wrote: > SOCK_SEQPACKET is supported for virtio transport, so do not interpret > such type of socket as unknown. > > Signed-off-by: Arseniy Krasnov > --- > tools/testing/vsock/vsock_diag_test.c | 2 ++ > 1 file changed, 2 insertions(+) Here is the summary with links: - [net-next,v1] vsock/test: print type for SOCK_SEQPACKET https://git.kernel.org/netdev/net-next/c/767ec326f985 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
[PATCH] tracing/trigger: Fix to return error if failed to alloc snapshot
From: Masami Hiramatsu (Google) Fix register_snapshot_trigger() to return an error code if it fails to allocate a snapshot, instead of 0 (success). Without this fix, the snapshot trigger is registered even though the allocation failed, and no error is reported. Fixes: 0bbe7f719985 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation") Cc: sta...@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) --- kernel/trace/trace_events_trigger.c |6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c index 46439e3bcec4..b33c3861fbbb 100644 --- a/kernel/trace/trace_events_trigger.c +++ b/kernel/trace/trace_events_trigger.c @@ -1470,8 +1470,10 @@ register_snapshot_trigger(char *glob, struct event_trigger_data *data, struct trace_event_file *file) { - if (tracing_alloc_snapshot_instance(file->tr) != 0) - return 0; + int ret = tracing_alloc_snapshot_instance(file->tr); + + if (ret < 0) + return ret; return register_trigger(glob, data, file); }
Re: [PATCH v12 3/6] tracing: Add snapshot refcount
Hi Vincent, On Thu, 25 Jan 2024 14:53:40 + Vincent Donnefort wrote: > > > @@ -1470,12 +1483,20 @@ register_snapshot_trigger(char *glob, > > > struct event_trigger_data *data, > > > struct trace_event_file *file) > > > { > > > - if (tracing_alloc_snapshot_instance(file->tr) != 0) > > > + if (tracing_arm_snapshot(file->tr)) > > > return 0; > > > > BTW, is this return value correct? It seems that the register_*_trigger() > > will return error code when it fails. > > It should indeed be > > ret = tracing_arm_snapshot() > if (ret) > return ret; OK, then there is a bug. We need to fix it before changing this for backporting. Thank you, -- Masami Hiramatsu (Google)
Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs
On Fri, 2024-01-26 at 09:34 +1100, NeilBrown wrote: > On Fri, 26 Jan 2024, Chuck Lever wrote: > > On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote: > > > Long ago, file locks used to hang off of a singly-linked list in struct > > > inode. Because of this, when leases were added, they were added to the > > > same list and so they had to be tracked using the same sort of > > > structure. > > > > > > Several years ago, we added struct file_lock_context, which allowed us > > > to use separate lists to track different types of file locks. Given > > > that, leases no longer need to be tracked using struct file_lock. > > > > > > That said, a lot of the underlying infrastructure _is_ the same between > > > file leases and locks, so we can't completely separate everything. > > > > > > This patchset first splits a group of fields used by both file locks and > > > leases into a new struct file_lock_core, that is then embedded in struct > > > file_lock. Coccinelle was then used to convert a lot of the callers to > > > deal with the move, with the remaining 25% or so converted by hand. > > > > > > It then converts several internal functions in fs/locks.c to work > > > with struct file_lock_core. Lastly, struct file_lock is split into > > > struct file_lock and file_lease, and the lease-related APIs converted to > > > take struct file_lease. > > > > > > After the first few patches (which I left split up for easier review), > > > the set should be bisectable. I'll plan to squash the first few > > > together to make sure the resulting set is bisectable before merge. > > > > > > Finally, I left the coccinelle scripts I used in tree. I had heard it > > > was preferable to merge those along with the patches that they > > > generate, but I wasn't sure where they go. I can either move those to a > > > more appropriate location or we can just drop that commit if it's not > > > needed. > > > > > > Signed-off-by: Jeff Layton > > > > v2 looks nicer. > > > > I would add a few list handling primitives, as I see enough > > instances of list_for_each_entry, list_for_each_entry_safe, > > list_first_entry, and list_first_entry_or_null on fl_core.flc_list > > to make it worth having those. > > > > Also, there doesn't seem to be benefit for API consumers to have to > > understand the internal structure of struct file_lock/lease to reach > > into fl_core. Having accessor functions for common fields like > > fl_type and fl_flags could be cleaner. > > I'm not a big fan of accessor functions. They don't *look* like normal > field access, so a casual reader has to go find out what the function > does, just to find the it doesn't really do anything. I might have been a bit too hasty with the idea. I took a look earlier today and it gets pretty ugly trying to handle these fields with accessors. flc_flags, for instance will need both a get and a set method, which gets wordy after a while. Some of the flc_list accesses don't involve list walks either so I don't think we'll ever be able to make this "neat" without a ton of one-off accessors. > But neither am I a fan have requiring filesystems to use > "fl_core.flc_foo". As you say, reaching into fl_core isn't ideal. > I too think it's ugly. > It would be nice if we could make fl_core and anonymous structure, but > that really requires -fplan9-extensions which Linus is on-record as not > liking. > Unless... 
> > How horrible would it be to use > >union { >struct file_lock_core flc_core; >struct file_lock_core; >}; > > I think that only requires -fms-extensions, which Linus was less > negative towards. That would allow access to the members of > file_lock_core without the "flc_core." prefix, but would still allow > getting the address of 'flc_core'. > Maybe it's too ugly. > I'd rather not rely on special compiler flags. > While fl_type and fl_flags are most common, fl_pid, fl_owner, fl_file > and even fl_wait are also used. Having accessor functions for all of those > would be too much I think. > Some of them need setters too, and some like fl_flags like to be able to do this: fl->fl_flags |= FL_SLEEP; That's hard to deal with in an accessor unless you want to do it with macros or something. > Maybe higher-level functions which meet the real need of the filesystem > might be a useful approach: > > locks_wakeup(lock) > locks_wait_interruptible(lock, condition) > locks_posix_init(lock, type, pid, ...) ?? > locks_is_unlock() - fl_type is compared with F_UNLCK 22 times. > > While those are probably a good idea, through don't really help much > with reducing the need for accessor functions. > I can take a look at some of those. Reducing the number of instances can only help. > I don't suppose we could just leave the #defines in place? Probably not > a good idea. > > Maybe spell "fl_core" as "c"? lk->c.flc_flags ??? > It's at least a little shorter. I can make that change if it's preferred. > > And I wonder if we could h
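For illustration, a rough sketch of a couple of the higher-level helpers discussed above, assuming the embedded core member ends up spelled "c" as suggested; the helper names and the flc_* field names are assumptions drawn from this discussion, not a merged API:

	/* sketch only -- would live next to the file_lock definitions */
	static inline bool lock_is_unlock(struct file_lock *fl)
	{
		return fl->c.flc_type == F_UNLCK;
	}

	static inline void locks_wake_up(struct file_lock *fl)
	{
		wake_up(&fl->c.flc_wait);
	}

	/* callers then write lock_is_unlock(fl) rather than comparing
	 * fl->c.flc_type against F_UNLCK at each of the ~22 sites */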
Re: [PATCH RFC 2/2] arm64: dts: qcom: msm8953: Add GPU
On 1/25/24 22:56, Luca Weiss wrote: From: Vladimir Lypak Add the GPU node for the Adreno 506 found on this family of SoCs. The clock speeds are a bit different per SoC variant, SDM450 maxes out at 600MHz while MSM8953 (= SDM625) goes up to 650MHz and SDM632 goes up to 725MHz. To achieve this, create a new sdm450.dtsi to hold the 600MHz OPP and use the new dtsi for sdm450-motorola-ali. Signed-off-by: Vladimir Lypak Co-developed-by: Luca Weiss Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8953.dtsi| 115 +++ arch/arm64/boot/dts/qcom/sdm450-motorola-ali.dts | 2 +- arch/arm64/boot/dts/qcom/sdm450.dtsi | 14 +++ arch/arm64/boot/dts/qcom/sdm632.dtsi | 8 ++ 4 files changed, 138 insertions(+), 1 deletion(-) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index 91d083871ab0..1fe0c0c4fd15 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -1046,6 +1046,94 @@ mdss_dsi1_phy: phy@1a96400 { }; }; + gpu: gpu@1c0 { + compatible = "qcom,adreno-506.0", "qcom,adreno"; + reg = <0x01c0 0x4>; + reg-names = "kgsl_3d0_reg_memory"; + interrupts = ; + + clocks = <&gcc GCC_OXILI_GFX3D_CLK>, +<&gcc GCC_OXILI_AHB_CLK>, +<&gcc GCC_BIMC_GFX_CLK>, +<&gcc GCC_BIMC_GPU_CLK>, +<&gcc GCC_OXILI_TIMER_CLK>, +<&gcc GCC_OXILI_AON_CLK>; + clock-names = "core", + "iface", + "mem_iface", + "alt_mem_iface", + "rbbmtimer", + "alwayson"; + power-domains = <&gcc OXILI_GX_GDSC>; + + iommus = <&gpu_iommu 0>; + operating-points-v2 = <&gpu_opp_table>; + + #cooling-cells = <2>; + + status = "disabled"; + + zap-shader { + memory-region = <&zap_shader_region>; + }; + + gpu_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-1920 { + opp-hz = /bits/ 64 <1920>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_min_svs>; + }; If you remove all OPPs but this one, can the GPU still spit out pixels? Konrad
Re: [PATCH RFC 1/2] arm64: dts: qcom: msm8953: Add GPU IOMMU
On 1/25/24 23:24, Dmitry Baryshkov wrote: On 25/01/2024 23:56, Luca Weiss wrote: From: Vladimir Lypak Add the IOMMU used for the GPU on MSM8953. Signed-off-by: Vladimir Lypak --- arch/arm64/boot/dts/qcom/msm8953.dtsi | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index dcb5c98b793c..91d083871ab0 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -1046,6 +1046,37 @@ mdss_dsi1_phy: phy@1a96400 { }; }; + gpu_iommu: iommu@1c48000 { Nit: most of the platforms use the adreno_smmu label. But maybe the msm-iommu vs arm-smmu makes difference here. Not really :) Please keep the labels unified Nevertheless: Reviewed-by: Dmitry Baryshkov + compatible = "qcom,msm8953-iommu", "qcom,msm-iommu-v2"; + ranges = <0 0x01c48000 0x8000>; + + clocks = <&gcc GCC_OXILI_AHB_CLK>, + <&gcc GCC_BIMC_GFX_CLK>; And align these Konrad
Re: [PATCH net-next v1] vsock/test: print type for SOCK_SEQPACKET
On Wed, Jan 24, 2024 at 10:32:55PM +0300, Arseniy Krasnov wrote: > SOCK_SEQPACKET is supported for virtio transport, so do not interpret > such type of socket as unknown. > > Signed-off-by: Arseniy Krasnov Acked-by: Michael S. Tsirkin > --- > tools/testing/vsock/vsock_diag_test.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/tools/testing/vsock/vsock_diag_test.c > b/tools/testing/vsock/vsock_diag_test.c > index 5e6049226b77..17aeba7cbd14 100644 > --- a/tools/testing/vsock/vsock_diag_test.c > +++ b/tools/testing/vsock/vsock_diag_test.c > @@ -39,6 +39,8 @@ static const char *sock_type_str(int type) > return "DGRAM"; > case SOCK_STREAM: > return "STREAM"; > + case SOCK_SEQPACKET: > + return "SEQPACKET"; > default: > return "INVALID TYPE"; > } > -- > 2.25.1
Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs
On Fri, 26 Jan 2024, Chuck Lever wrote: > On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote: > > Long ago, file locks used to hang off of a singly-linked list in struct > > inode. Because of this, when leases were added, they were added to the > > same list and so they had to be tracked using the same sort of > > structure. > > > > Several years ago, we added struct file_lock_context, which allowed us > > to use separate lists to track different types of file locks. Given > > that, leases no longer need to be tracked using struct file_lock. > > > > That said, a lot of the underlying infrastructure _is_ the same between > > file leases and locks, so we can't completely separate everything. > > > > This patchset first splits a group of fields used by both file locks and > > leases into a new struct file_lock_core, that is then embedded in struct > > file_lock. Coccinelle was then used to convert a lot of the callers to > > deal with the move, with the remaining 25% or so converted by hand. > > > > It then converts several internal functions in fs/locks.c to work > > with struct file_lock_core. Lastly, struct file_lock is split into > > struct file_lock and file_lease, and the lease-related APIs converted to > > take struct file_lease. > > > > After the first few patches (which I left split up for easier review), > > the set should be bisectable. I'll plan to squash the first few > > together to make sure the resulting set is bisectable before merge. > > > > Finally, I left the coccinelle scripts I used in tree. I had heard it > > was preferable to merge those along with the patches that they > > generate, but I wasn't sure where they go. I can either move those to a > > more appropriate location or we can just drop that commit if it's not > > needed. > > > > Signed-off-by: Jeff Layton > > v2 looks nicer. > > I would add a few list handling primitives, as I see enough > instances of list_for_each_entry, list_for_each_entry_safe, > list_first_entry, and list_first_entry_or_null on fl_core.flc_list > to make it worth having those. > > Also, there doesn't seem to be benefit for API consumers to have to > understand the internal structure of struct file_lock/lease to reach > into fl_core. Having accessor functions for common fields like > fl_type and fl_flags could be cleaner. I'm not a big fan of accessor functions. They don't *look* like normal field access, so a casual reader has to go find out what the function does, just to find the it doesn't really do anything. But neither am I a fan have requiring filesystems to use "fl_core.flc_foo". As you say, reaching into fl_core isn't ideal. It would be nice if we could make fl_core and anonymous structure, but that really requires -fplan9-extensions which Linus is on-record as not liking. Unless... How horrible would it be to use union { struct file_lock_core flc_core; struct file_lock_core; }; I think that only requires -fms-extensions, which Linus was less negative towards. That would allow access to the members of file_lock_core without the "flc_core." prefix, but would still allow getting the address of 'flc_core'. Maybe it's too ugly. While fl_type and fl_flags are most common, fl_pid, fl_owner, fl_file and even fl_wait are also used. Having accessor functions for all of those would be too much I think. Maybe higher-level functions which meet the real need of the filesystem might be a useful approach: locks_wakeup(lock) locks_wait_interruptible(lock, condition) locks_posix_init(lock, type, pid, ...) ?? 
locks_is_unlock() - fl_type is compared with F_UNLCK 22 times. While those are probably a good idea, through don't really help much with reducing the need for accessor functions. I don't suppose we could just leave the #defines in place? Probably not a good idea. Maybe spell "fl_core" as "c"? lk->c.flc_flags ??? And I wonder if we could have a new fl_flag for 'FOREIGN' locks rather than encoding that flag in the sign of the pid. That seems a bit ... clunky? NeilBrown
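To make the union idea above concrete, a sketch of what the -fms-extensions variant could look like; the field names are assumptions taken from the rest of the thread, and this only illustrates the language feature, not a change that was adopted:

	/* requires -fms-extensions for the unnamed struct member */
	struct file_lock_core {
		struct list_head	flc_list;
		unsigned int		flc_flags;
		unsigned char		flc_type;
		fl_owner_t		flc_owner;
	};

	struct file_lock {
		union {
			struct file_lock_core	flc_core;
			struct file_lock_core;	/* unnamed: flattens the members */
		};
		loff_t			fl_start;	/* lock-only fields follow */
		loff_t			fl_end;
	};

	/* filesystems could keep writing fl->flc_type and fl->flc_flags,
	 * while fs/locks.c can still take &fl->flc_core */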
[PATCH v3 3/3] tracing: convert __trace_seq_init to use WARN_ON_ONCE
The initialization of trace_seq buffers is done elsewhere and therefore __trace_seq_init should yield a warning if it has to actually initialize the buffer. Suggested-by: Steven Rostedt Signed-off-by: Ricardo B. Marliere --- kernel/trace/trace_seq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/trace_seq.c b/kernel/trace/trace_seq.c index 741b2f3d76c0..3c7a7d903b54 100644 --- a/kernel/trace/trace_seq.c +++ b/kernel/trace/trace_seq.c @@ -32,7 +32,7 @@ */ static inline void __trace_seq_init(struct trace_seq *s) { - if (unlikely(!s->seq.size)) + if (WARN_ON_ONCE(!s->seq.size)) trace_seq_init(s); } -- 2.43.0
[PATCH v3 2/3] tracing: add trace_seq_reset function
Currently, trace_seq_init may be called many times with the intent of resetting the buffer. Add a function trace_seq_reset that does that and replace the relevant occurrences to use it instead. Suggested-by: Steven Rostedt Signed-off-by: Ricardo B. Marliere --- include/linux/trace_seq.h| 11 +++ include/trace/trace_events.h | 2 +- kernel/trace/trace.c | 10 +- kernel/trace/trace_output.c | 2 +- kernel/trace/trace_seq.c | 2 +- 5 files changed, 19 insertions(+), 8 deletions(-) diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h index 9ec229dfddaa..d3fa41001813 100644 --- a/include/linux/trace_seq.h +++ b/include/linux/trace_seq.h @@ -29,6 +29,17 @@ trace_seq_init(struct trace_seq *s) s->readpos = 0; } +static inline void +trace_seq_reset(struct trace_seq *s) +{ + if (WARN_ON_ONCE(!s->seq.size)) + seq_buf_init(&s->seq, s->buffer, TRACE_SEQ_BUFFER_SIZE); + else + seq_buf_clear(&s->seq); + s->full = 0; + s->readpos = 0; +} + /** * trace_seq_used - amount of actual data written to buffer * @s: trace sequence descriptor diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h index c2f9cabf154d..2bc79998e5ab 100644 --- a/include/trace/trace_events.h +++ b/include/trace/trace_events.h @@ -227,7 +227,7 @@ trace_raw_output_##call(struct trace_iterator *iter, int flags, \ \ field = (typeof(field))entry; \ \ - trace_seq_init(p); \ + trace_seq_reset(p); \ return trace_output_call(iter, #call, print); \ } \ static struct trace_event_functions trace_event_type_funcs_##call = { \ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 9eddf8168df2..9827700d0164 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2928,7 +2928,7 @@ static void output_printk(struct trace_event_buffer *fbuffer) event = &fbuffer->trace_file->event_call->event; raw_spin_lock_irqsave(&tracepoint_iter_lock, flags); - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); iter->ent = fbuffer->entry; event_call->event.funcs->trace(iter, 0, event); trace_seq_putc(&iter->seq, 0); @@ -6925,7 +6925,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, if (sret != -EBUSY) goto out; - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); if (iter->trace->read) { sret = iter->trace->read(iter, filp, ubuf, cnt, ppos); @@ -6998,7 +6998,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, /* Now copy what we have to the user */ sret = trace_seq_to_user(&iter->seq, ubuf, cnt); if (iter->seq.readpos >= trace_seq_used(&iter->seq)) - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); /* * If there was nothing to send to user, in spite of consuming trace @@ -7130,7 +7130,7 @@ static ssize_t tracing_splice_read_pipe(struct file *filp, spd.partial[i].offset = 0; spd.partial[i].len = trace_seq_used(&iter->seq); - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); } trace_access_unlock(iter->cpu_file); @@ -10282,7 +10282,7 @@ trace_printk_seq(struct trace_seq *s) printk(KERN_TRACE "%s", s->buffer); - trace_seq_init(s); + trace_seq_reset(s); } void trace_init_global_iter(struct trace_iterator *iter) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index 3e7fa44dc2b2..c949e7736618 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -308,7 +308,7 @@ int trace_raw_output_prep(struct trace_iterator *iter, return TRACE_TYPE_UNHANDLED; } - trace_seq_init(p); + trace_seq_reset(p); trace_seq_printf(s, "%s: ", trace_event_name(event)); return trace_handle_return(s); diff --git a/kernel/trace/trace_seq.c 
b/kernel/trace/trace_seq.c index c158d65a8a88..741b2f3d76c0 100644 --- a/kernel/trace/trace_seq.c +++ b/kernel/trace/trace_seq.c @@ -59,7 +59,7 @@ int trace_print_seq(struct seq_file *m, struct trace_seq *s) * do something else with the contents. */ if (!ret) - trace_seq_init(s); + trace_seq_reset(s); return ret; } -- 2.43.0
[PATCH v3 1/3] tracing: initialize trace_seq buffers
In order to extend trace_seq into being dynamic, the struct trace_seq will no longer be valid if simply set to zero. Call trace_seq_init() for all trace_seq when they are first created. Suggested-by: Steven Rostedt Signed-off-by: Ricardo B. Marliere --- kernel/trace/trace.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 46dbe22121e9..9eddf8168df2 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4889,6 +4889,9 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot) mutex_unlock(&trace_types_lock); + trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); + return iter; fail: @@ -6770,6 +6773,7 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp) } trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); iter->trace = tr->current_trace; if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) { @@ -6947,6 +6951,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, trace_iterator_reset(iter); cpumask_clear(iter->started); trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); trace_event_read_lock(); trace_access_lock(iter->cpu_file); @@ -8277,6 +8282,9 @@ static int tracing_buffers_open(struct inode *inode, struct file *filp) if (ret < 0) trace_array_put(tr); + trace_seq_init(&info->iter.seq); + trace_seq_init(&info->iter.tmp_seq); + return ret; } @@ -10300,6 +10308,9 @@ void trace_init_global_iter(struct trace_iterator *iter) iter->temp_size = STATIC_TEMP_BUF_SIZE; iter->fmt = static_fmt_buf; iter->fmt_size = STATIC_FMT_BUF_SIZE; + + trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); } void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) @@ -10712,6 +10723,9 @@ void __init early_trace_init(void) tracepoint_printk = 0; else static_key_enable(&tracepoint_printk_key.key); + + trace_seq_init(&tracepoint_print_iter->seq); + trace_seq_init(&tracepoint_print_iter->tmp_seq); } tracer_alloc_buffers(); -- 2.43.0
[PATCH v3 0/3] tracing: add trace_seq_reset function
This series is a prerequisite for a later effort of making trace_seq more flexible about its buffer size. To achieve that, initializing and resetting the buffers need to be differentiated. Changes in v3: - Reordered commits so it doesn't produce a failing build in-between. - Improved changelogs. Changes in v2: - Added a WARN_ON_ONCE to __trace_seq_init to catch possible misuses. - Properly initialized trace_seq buffers. Ricardo B. Marliere (3): tracing: initialize trace_seq buffers tracing: add trace_seq_reset function tracing: convert __trace_seq_init to use WARN_ON_ONCE include/linux/trace_seq.h| 11 +++ include/trace/trace_events.h | 2 +- kernel/trace/trace.c | 24 +++- kernel/trace/trace_output.c | 2 +- kernel/trace/trace_seq.c | 4 ++-- 5 files changed, 34 insertions(+), 9 deletions(-) -- 2.43.0
Re: [PATCH RFC 2/2] arm64: dts: qcom: msm8953: Add GPU
On 25/01/2024 23:56, Luca Weiss wrote: From: Vladimir Lypak Add the GPU node for the Adreno 506 found on this family of SoCs. The clock speeds are a bit different per SoC variant, SDM450 maxes out at 600MHz while MSM8953 (= SDM625) goes up to 650MHz and SDM632 goes up to 725MHz. To achieve this, create a new sdm450.dtsi to hold the 600MHz OPP and use the new dtsi for sdm450-motorola-ali. Signed-off-by: Vladimir Lypak Co-developed-by: Luca Weiss Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8953.dtsi| 115 +++ arch/arm64/boot/dts/qcom/sdm450-motorola-ali.dts | 2 +- arch/arm64/boot/dts/qcom/sdm450.dtsi | 14 +++ arch/arm64/boot/dts/qcom/sdm632.dtsi | 8 ++ 4 files changed, 138 insertions(+), 1 deletion(-) Reviewed-by: Dmitry Baryshkov -- With best wishes Dmitry
Re: [PATCH RFC 1/2] arm64: dts: qcom: msm8953: Add GPU IOMMU
On 25/01/2024 23:56, Luca Weiss wrote: From: Vladimir Lypak Add the IOMMU used for the GPU on MSM8953. Signed-off-by: Vladimir Lypak --- arch/arm64/boot/dts/qcom/msm8953.dtsi | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index dcb5c98b793c..91d083871ab0 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -1046,6 +1046,37 @@ mdss_dsi1_phy: phy@1a96400 { }; }; + gpu_iommu: iommu@1c48000 { Nit: most of the platforms use the adreno_smmu label. But maybe the msm-iommu vs arm-smmu makes difference here. Nevertheless: Reviewed-by: Dmitry Baryshkov + compatible = "qcom,msm8953-iommu", "qcom,msm-iommu-v2"; + ranges = <0 0x01c48000 0x8000>; + + clocks = <&gcc GCC_OXILI_AHB_CLK>, +<&gcc GCC_BIMC_GFX_CLK>; + clock-names = "iface", "bus"; + + power-domains = <&gcc OXILI_CX_GDSC>; + + qcom,iommu-secure-id = <18>; + + #address-cells = <1>; + #iommu-cells = <1>; + #size-cells = <1>; + + /* gfx3d_user */ + iommu-ctx@0 { + compatible = "qcom,msm-iommu-v2-ns"; + reg = <0x 0x1000>; + interrupts = ; + }; + + /* gfx3d_secure */ + iommu-ctx@2000 { + compatible = "qcom,msm-iommu-v2-sec"; + reg = <0x2000 0x1000>; + interrupts = ; + }; + }; + apps_iommu: iommu@1e2 { compatible = "qcom,msm8953-iommu", "qcom,msm-iommu-v1"; ranges = <0 0x01e2 0x2>; -- With best wishes Dmitry
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On Thu, 25 Jan 2024 16:18:37 -0500 Mathieu Desnoyers wrote: > > > > This is how you are able to avoid the "before/after" logic I have, as > > the race is automatically detected. The least significant bits of the > > timestamp is ignored for the event delta calculation. > > Not quite, as I explained at the beginning of this email. All bits from the > previous timestamp, including its low bits, are useful to know how many > overflows happened since the last tsc. Yes, but it still means updating that timestamp you will compare to doesn't have the race I have. If the timestamp's upper bits are the same, or are off by one and the lower bits are higher than the current timestamp, you don't need to inject. But if the lower bits are higher than the timestamp or the higehr bits are off by more than one then you do. The lower bits are a delta against "0" of the current timestamp lower bits. There's no race in updating those bits as long as the upper bits remain the same or are off by one and the current timestamp lower bits are lower than the saved time stamp. In my case, because the delta is off of the entire timestamp, what I write into the saved timestamp, all bits matter. And to handle that I need the before/after timestamps to know if the currently saved timestamp didn't have a race. > > > And if a race > > happens where the interrupting event saves a later timestamp and comes > > back here, if the interrupted event writes the older timestamp, it just > > causes that delta calculation to overflow again and you inject another > > 64bit timestamp into the buffer. > > This part is correct: in the race you describe, we end up with the > possibility of bringing the last_tsc backwards, which can only cause > the tracer to use the full 64-bit timestamp when in fact it could use > the compact representation. But it's rare and should not matter in > practice. > > And by the way this algorithm is designed to work with preemption/migration > enabled as well, not just interrupts. So the race can come from a thread > running concurrently on another CPU and it should work as well. > > [...] > > > > > Going through a transition of changing it could end up being just as > > complex. I'm not sure the complexity in that transition is better than > > the complexity of the current code, as this code has been there for 15 > > years, and I know of at least 2 other projects that depend on this > > format as is. > > I agree with you that it's not clear-cut whether introducing this change > would be a benefit at this stage considering the extra complexity of > extending the ABI while keeping backward compatibility. > > But it's something we can keep in mind if we ever have to do major ABI > extensions for other reasons. Yeah, it's something to think about if we want to use a different format for something else. Thanks for the review. -- Steve
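As a self-contained illustration of the compact-timestamp scheme being compared against here (a sketch of the encoding Mathieu describes, not ring-buffer code): the writer may use the 16-bit form only when at most one 16-bit overflow happened since the last recorded timestamp, and the reader reconstructs the full value by checking whether the low bits went backwards:

	#include <stdbool.h>
	#include <stdint.h>

	/* writer: the compact form is unambiguous iff the delta fits in 16 bits */
	static bool can_use_compact(uint64_t tsc, uint64_t last_tsc)
	{
		return tsc - last_tsc <= 0xffff;
	}

	/* reader: rebuild the full timestamp from the previous one plus the
	 * 16-bit field; low bits going backwards means exactly one overflow */
	static uint64_t decode_compact(uint64_t prev, uint16_t low16)
	{
		uint64_t ts = (prev & ~0xffffULL) | low16;

		if (low16 < (uint16_t)prev)
			ts += 0x10000ULL;
		return ts;
	}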
Re: [PATCH v2 2/3] clk: qcom: gcc-msm8953: add more resets
On Thu, 25 Jan 2024 at 23:36, Luca Weiss wrote: > > From: Vladimir Lypak > > Add new entries in the gcc driver for some more resets found on MSM8953. > > Signed-off-by: Vladimir Lypak > [luca: expand commit message, move entry, add more entries] > Signed-off-by: Luca Weiss > --- > drivers/clk/qcom/gcc-msm8953.c | 4 > 1 file changed, 4 insertions(+) Reviewed-by: Dmitry Baryshkov -- With best wishes Dmitry
Re: [PATCH v2 3/3] arm64: dts: qcom: msm8953: add reset for display subsystem
On Thu, 25 Jan 2024 at 23:36, Luca Weiss wrote: > > From: Vladimir Lypak > > With this reset we can avoid situations like IRQ storms from DSI host > before it even started probing (because boot-loader left DSI IRQs on). > > Signed-off-by: Vladimir Lypak > Reviewed-by: Konrad Dybcio > Signed-off-by: Luca Weiss > --- > arch/arm64/boot/dts/qcom/msm8953.dtsi | 2 ++ > 1 file changed, 2 insertions(+) Reviewed-by: Dmitry Baryshkov -- With best wishes Dmitry
[PATCH RFC 0/2] Add GPU support to MSM8953 SoC
Add the GPU IOMMU and GPU nodes to the msm8953 dtsi so GPU can work. First of all, functionally this series looks fine, tested on sdm632-fairphone-fp3. Secondly and the reason this is marked RFC for now is basically just dt bindings check fail, and some questions regarding IOMMU compatible. Basically I'm unsure what compatible (or even driver) IOMMU should use. qcom,msm-iommu-v2 is now in the patchset which seems to be okay, and also should handle the gfx3d_secure secure context correctly. Apart from some special handling there qcom,msm-iommu-v1 compatible is equivalent on the driver side. Currently the dt bindings say qcom,msm8953-iommu should be followed by qcom,msm-iommu-v1 which is the case for apps_iommu. But if we use qcom,msm-iommu-v2 for gpu_iommu then we can't re-use the same qcom,msm8953-iommu I think. Possible solutions: 1. Switch apps_iommu to use qcom,msm-iommu-v2 in dts & bindings? Since there's basically no special handling for either in the driver I don't forsee any problems. Then we can also use -v2 for gpu_iommu no problem. 2. Use qcom,msm-iommu-v1 for gpu_iommu? From some testing it also seems to work, I guess because the secure context is never used? 3. Use arm_smmu driver for gpu_iommu? Vladimir Lypak has suggested that, but that would at least need some more patching to work on msm8953. I probably don't have the motivation to take this on myself. Also what benefit would it bring? Hope I haven't rambled too long here and it's somewhat understandable. Please let me know what you think, which direction we can take to resolve this. (Also GPU clock-names dt-schema check fails but that seems to be the case on more a5xx GPUs also) Signed-off-by: Luca Weiss --- Vladimir Lypak (2): arm64: dts: qcom: msm8953: Add GPU IOMMU arm64: dts: qcom: msm8953: Add GPU arch/arm64/boot/dts/qcom/msm8953.dtsi| 146 +++ arch/arm64/boot/dts/qcom/sdm450-motorola-ali.dts | 2 +- arch/arm64/boot/dts/qcom/sdm450.dtsi | 14 +++ arch/arm64/boot/dts/qcom/sdm632.dtsi | 8 ++ 4 files changed, 169 insertions(+), 1 deletion(-) --- base-commit: 0e21aa976976d5fba8cd1f8f64bcce49beb5f895 change-id: 20231212-msm8953-gpu-4c085365f594 Best regards, -- Luca Weiss
[PATCH RFC 1/2] arm64: dts: qcom: msm8953: Add GPU IOMMU
From: Vladimir Lypak Add the IOMMU used for the GPU on MSM8953. Signed-off-by: Vladimir Lypak --- arch/arm64/boot/dts/qcom/msm8953.dtsi | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index dcb5c98b793c..91d083871ab0 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -1046,6 +1046,37 @@ mdss_dsi1_phy: phy@1a96400 { }; }; + gpu_iommu: iommu@1c48000 { + compatible = "qcom,msm8953-iommu", "qcom,msm-iommu-v2"; + ranges = <0 0x01c48000 0x8000>; + + clocks = <&gcc GCC_OXILI_AHB_CLK>, +<&gcc GCC_BIMC_GFX_CLK>; + clock-names = "iface", "bus"; + + power-domains = <&gcc OXILI_CX_GDSC>; + + qcom,iommu-secure-id = <18>; + + #address-cells = <1>; + #iommu-cells = <1>; + #size-cells = <1>; + + /* gfx3d_user */ + iommu-ctx@0 { + compatible = "qcom,msm-iommu-v2-ns"; + reg = <0x 0x1000>; + interrupts = ; + }; + + /* gfx3d_secure */ + iommu-ctx@2000 { + compatible = "qcom,msm-iommu-v2-sec"; + reg = <0x2000 0x1000>; + interrupts = ; + }; + }; + apps_iommu: iommu@1e2 { compatible = "qcom,msm8953-iommu", "qcom,msm-iommu-v1"; ranges = <0 0x01e2 0x2>; -- 2.43.0
[PATCH RFC 2/2] arm64: dts: qcom: msm8953: Add GPU
From: Vladimir Lypak Add the GPU node for the Adreno 506 found on this family of SoCs. The clock speeds are a bit different per SoC variant, SDM450 maxes out at 600MHz while MSM8953 (= SDM625) goes up to 650MHz and SDM632 goes up to 725MHz. To achieve this, create a new sdm450.dtsi to hold the 600MHz OPP and use the new dtsi for sdm450-motorola-ali. Signed-off-by: Vladimir Lypak Co-developed-by: Luca Weiss Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8953.dtsi| 115 +++ arch/arm64/boot/dts/qcom/sdm450-motorola-ali.dts | 2 +- arch/arm64/boot/dts/qcom/sdm450.dtsi | 14 +++ arch/arm64/boot/dts/qcom/sdm632.dtsi | 8 ++ 4 files changed, 138 insertions(+), 1 deletion(-) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index 91d083871ab0..1fe0c0c4fd15 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -1046,6 +1046,94 @@ mdss_dsi1_phy: phy@1a96400 { }; }; + gpu: gpu@1c0 { + compatible = "qcom,adreno-506.0", "qcom,adreno"; + reg = <0x01c0 0x4>; + reg-names = "kgsl_3d0_reg_memory"; + interrupts = ; + + clocks = <&gcc GCC_OXILI_GFX3D_CLK>, +<&gcc GCC_OXILI_AHB_CLK>, +<&gcc GCC_BIMC_GFX_CLK>, +<&gcc GCC_BIMC_GPU_CLK>, +<&gcc GCC_OXILI_TIMER_CLK>, +<&gcc GCC_OXILI_AON_CLK>; + clock-names = "core", + "iface", + "mem_iface", + "alt_mem_iface", + "rbbmtimer", + "alwayson"; + power-domains = <&gcc OXILI_GX_GDSC>; + + iommus = <&gpu_iommu 0>; + operating-points-v2 = <&gpu_opp_table>; + + #cooling-cells = <2>; + + status = "disabled"; + + zap-shader { + memory-region = <&zap_shader_region>; + }; + + gpu_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-1920 { + opp-hz = /bits/ 64 <1920>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_min_svs>; + }; + + opp-13330 { + opp-hz = /bits/ 64 <13330>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_min_svs>; + }; + + opp-21600 { + opp-hz = /bits/ 64 <21600>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_low_svs>; + }; + + opp-32000 { + opp-hz = /bits/ 64 <32000>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_svs>; + }; + + opp-4 { + opp-hz = /bits/ 64 <4>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_svs_plus>; + }; + + opp-51000 { + opp-hz = /bits/ 64 <51000>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_nom>; + }; + + opp-56000 { + opp-hz = /bits/ 64 <56000>; + opp-supported-hw = <0xff>; + required-opps = <&rpmpd_opp_nom_plus>; + }; + + /* +* This opp is only available on msm8953 and +* sdm632, the max for sdm450 is 600MHz. +*/ + opp-65000 { + opp-hz = /bits/ 64 <65000>; +
Re: Re: [PATCH v2 2/3] tracing: initialize trace_seq buffers
Hi Steve, On 25 Jan 15:44, Steven Rostedt wrote: > On Thu, 25 Jan 2024 17:16:21 -0300 > "Ricardo B. Marliere" wrote: > > > Now that trace_seq_reset have been created, correct the places where the > > buffers need to be initialized. > > This patch would need to come first. You don't ever want to intentionally > create a broken kernel. Indeed, sorry for the lack of attention. > > Also, the change log should be: > > In order to extend trace_seq into being dynamic, the struct trace_seq > will no longer be valid if simply set to zero. Call trace_seq_init() for > all trace_seq when they are first created. Ack. > > > > > Suggested-by: Steven Rostedt > > Signed-off-by: Ricardo B. Marliere > > --- > > kernel/trace/trace.c | 14 ++ > > 1 file changed, 14 insertions(+) > > > > You also need to initialize iter.seq in ftrace_dump() Thanks a lot for reviewing, I will send a v3. - Ricardo > > -- Steve > > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > > index d4c55d3e21c2..9827700d0164 100644 > > --- a/kernel/trace/trace.c > > +++ b/kernel/trace/trace.c > > @@ -4889,6 +4889,9 @@ __tracing_open(struct inode *inode, struct file > > *file, bool snapshot) > > > > mutex_unlock(&trace_types_lock); > > > > + trace_seq_init(&iter->seq); > > + trace_seq_init(&iter->tmp_seq); > > + > > return iter; > > > > fail: > > @@ -6770,6 +6773,7 @@ static int tracing_open_pipe(struct inode *inode, > > struct file *filp) > > } > > > > trace_seq_init(&iter->seq); > > + trace_seq_init(&iter->tmp_seq); > > iter->trace = tr->current_trace; > > > > if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) { > > @@ -6947,6 +6951,7 @@ tracing_read_pipe(struct file *filp, char __user > > *ubuf, > > trace_iterator_reset(iter); > > cpumask_clear(iter->started); > > trace_seq_init(&iter->seq); > > + trace_seq_init(&iter->tmp_seq); > > > > trace_event_read_lock(); > > trace_access_lock(iter->cpu_file); > > @@ -8277,6 +8282,9 @@ static int tracing_buffers_open(struct inode *inode, > > struct file *filp) > > if (ret < 0) > > trace_array_put(tr); > > > > + trace_seq_init(&info->iter.seq); > > + trace_seq_init(&info->iter.tmp_seq); > > + > > return ret; > > } > > > > @@ -10300,6 +10308,9 @@ void trace_init_global_iter(struct trace_iterator > > *iter) > > iter->temp_size = STATIC_TEMP_BUF_SIZE; > > iter->fmt = static_fmt_buf; > > iter->fmt_size = STATIC_FMT_BUF_SIZE; > > + > > + trace_seq_init(&iter->seq); > > + trace_seq_init(&iter->tmp_seq); > > } > > > > void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) > > @@ -10712,6 +10723,9 @@ void __init early_trace_init(void) > > tracepoint_printk = 0; > > else > > static_key_enable(&tracepoint_printk_key.key); > > + > > + trace_seq_init(&tracepoint_print_iter->seq); > > + trace_seq_init(&tracepoint_print_iter->tmp_seq); > > } > > tracer_alloc_buffers(); > > >
Re: [PATCH 0/4] tracing/user_events: Introduce multi-format events
It appears an outdated coversheet was attached to this series. Below is the updated coversheet that reflects the changes made: Currently user_events supports only 1 event per name, and it must have the exact same format when referenced by multiple programs. This opens an opportunity for malicious or poorly thought-through programs to create events that others use with different formats. Another scenario is user programs wishing to use the same event name but add more fields later when the software updates. Various versions of a program may be running side-by-side, which is prevented by the current single-format requirement. Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates the user program wishes to use the same user_event name, but may have several different formats of the event in the future. When this flag is used, create the underlying tracepoint backing the user_event with a unique name per version of the format. It's important that existing ABI users do not get this logic automatically, even if one of the multi-format events matches the format. This ensures existing programs that create events and assume the tracepoint name will match exactly continue to work as expected. Add logic so that multi-format events are only checked against other multi-format events, and single-format events only against single-format events, during find. Change the system name of the multi-format event tracepoint to ensure that multi-format events are isolated completely from single-format events. Add a register_name (reg_name) to the user_event struct which allows for split naming of events. We now have the name that was used to register within user_events as well as the unique name for the tracepoint. Upon registering events, matches are made first on the reg_name, followed by the fields and format of the event. This allows multiple events with the same registered name to have different formats. The underlying tracepoint will have a unique name in the format of {reg_name}:[unique_id]. For example, if both "test u32 count" and "test u64 count" are used with the USER_EVENT_REG_MULTI_FORMAT flag, the system would have 2 unique tracepoints. The dynamic_events file would then show the following: u:test u64 count u:test u32 count The actual tracepoint names look like this: test:[d5874fdac44] test:[d5914662cd4] Both would be under the new user_events_multi system name to prevent the older ABI from being used to squat on multi-format events and block their use. Deleting events via "!u:test u64 count" would only delete the first tracepoint that matched that format. When the delete ABI is used, deletion of all events with the same name will be attempted. If per-version deletion is required, user programs should either not use persistent events or delete them via dynamic_events. Thanks, -Beau
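To make the usage concrete, a rough user-space sketch of registering the same name with two formats under the new flag. The flag name comes from this cover letter, but its numeric value and the struct user_reg / DIAG_IOCSREG details are assumptions based on the existing user_events ABI -- treat this as pseudo-code rather than a tested program:

	#include <fcntl.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/user_events.h>

	/* assumed value; introduced by this series, not yet in the uapi header */
	#ifndef USER_EVENT_REG_MULTI_FORMAT
	#define USER_EVENT_REG_MULTI_FORMAT (1 << 1)
	#endif

	static int enabled_v1, enabled_v2;	/* kernel flips a bit here on enable */

	static int reg(int fd, const char *name_args, void *enable_addr)
	{
		struct user_reg r;

		memset(&r, 0, sizeof(r));
		r.size = sizeof(r);
		r.enable_bit = 0;
		r.enable_size = sizeof(int);
		r.enable_addr = (__u64)(uintptr_t)enable_addr;
		r.flags = USER_EVENT_REG_MULTI_FORMAT;
		r.name_args = (__u64)(uintptr_t)name_args;

		return ioctl(fd, DIAG_IOCSREG, &r);	/* r.write_index on success */
	}

	int main(void)
	{
		int fd = open("/sys/kernel/tracing/user_events_data", O_RDWR);

		/* same registered name, two formats -> two unique tracepoints */
		reg(fd, "test u32 count", &enabled_v1);
		reg(fd, "test u64 count", &enabled_v2);
		return 0;
	}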
[PATCH v2 3/3] arm64: dts: qcom: msm8953: add reset for display subsystem
From: Vladimir Lypak With this reset we can avoid situations like IRQ storms from DSI host before it even started probing (because boot-loader left DSI IRQs on). Signed-off-by: Vladimir Lypak Reviewed-by: Konrad Dybcio Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8953.dtsi | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index ad2f8cf9c966..dcb5c98b793c 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -859,6 +859,8 @@ mdss: display-subsystem@1a0 { "vsync", "core"; + resets = <&gcc GCC_MDSS_BCR>; + #address-cells = <1>; #size-cells = <1>; ranges; -- 2.43.0
[PATCH v2 2/3] clk: qcom: gcc-msm8953: add more resets
From: Vladimir Lypak Add new entries in the gcc driver for some more resets found on MSM8953. Signed-off-by: Vladimir Lypak [luca: expand commit message, move entry, add more entries] Signed-off-by: Luca Weiss --- drivers/clk/qcom/gcc-msm8953.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/clk/qcom/gcc-msm8953.c b/drivers/clk/qcom/gcc-msm8953.c index 3e5a8cb14d4d..68359534ff25 100644 --- a/drivers/clk/qcom/gcc-msm8953.c +++ b/drivers/clk/qcom/gcc-msm8953.c @@ -4171,6 +4171,10 @@ static const struct qcom_reset_map gcc_msm8953_resets[] = { [GCC_USB3PHY_PHY_BCR] = { 0x3f03c }, [GCC_USB3_PHY_BCR] = { 0x3f034 }, [GCC_USB_30_BCR]= { 0x3f070 }, + [GCC_MDSS_BCR] = { 0x4d074 }, + [GCC_CRYPTO_BCR]= { 0x16000 }, + [GCC_SDCC1_BCR] = { 0x42000 }, + [GCC_SDCC2_BCR] = { 0x43000 }, }; static const struct regmap_config gcc_msm8953_regmap_config = { -- 2.43.0
[PATCH v2 1/3] dt-bindings: clock: gcc-msm8953: add more resets
From: Vladimir Lypak Add new defines for some more BCRs found on MSM8953. Signed-off-by: Vladimir Lypak [luca: expand commit message, add more resets] Acked-by: Krzysztof Kozlowski Signed-off-by: Luca Weiss --- include/dt-bindings/clock/qcom,gcc-msm8953.h | 4 1 file changed, 4 insertions(+) diff --git a/include/dt-bindings/clock/qcom,gcc-msm8953.h b/include/dt-bindings/clock/qcom,gcc-msm8953.h index 783162da6148..13b4a62877e5 100644 --- a/include/dt-bindings/clock/qcom,gcc-msm8953.h +++ b/include/dt-bindings/clock/qcom,gcc-msm8953.h @@ -218,6 +218,10 @@ #define GCC_USB3PHY_PHY_BCR3 #define GCC_USB3_PHY_BCR 4 #define GCC_USB_30_BCR 5 +#define GCC_MDSS_BCR 6 +#define GCC_CRYPTO_BCR 7 +#define GCC_SDCC1_BCR 8 +#define GCC_SDCC2_BCR 9 /* GDSCs */ #define CPP_GDSC 0 -- 2.43.0
[PATCH v2 0/3] Add MDSS_BCR reset (+some more) for MSM8953
Add the MDSS_BCR reset that is found in the GCC of MSM8953 so we can make sure the MDSS gets properly reset before Linux starts using it. Also add some others that have been found in the LK sources. Signed-off-by: Luca Weiss --- Changes in v2: - Add more resets from LK sources - Pick up tags - Link to v1: https://lore.kernel.org/r/20240123-msm8953-mdss-reset-v1-0-bb8c6d3ce...@z3ntu.xyz --- Vladimir Lypak (3): dt-bindings: clock: gcc-msm8953: add more resets clk: qcom: gcc-msm8953: add more resets arm64: dts: qcom: msm8953: add reset for display subsystem arch/arm64/boot/dts/qcom/msm8953.dtsi| 2 ++ drivers/clk/qcom/gcc-msm8953.c | 4 include/dt-bindings/clock/qcom,gcc-msm8953.h | 4 3 files changed, 10 insertions(+) --- base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d change-id: 20240123-msm8953-mdss-reset-68308a03fff5 Best regards, -- Luca Weiss
Re: [PATCH 2/3] clk: qcom: gcc-msm8953: add MDSS_BCR reset
On Mittwoch, 24. Jänner 2024 13:10:53 CET Konrad Dybcio wrote: > On 1/23/24 22:03, Luca Weiss wrote: > > From: Vladimir Lypak > > > > Add an entry in the gcc driver for the MDSS_BCR reset found on MSM8953. > > > > Signed-off-by: Vladimir Lypak > > [luca: expand commit message, move entry] > > Signed-off-by: Luca Weiss > > --- > > I found some more definitions in lk2nd > > 88:#define GCC_CRYPTO_BCR(CLK_CTL_BASE + 0x16000) > 106:#define SDCC1_BCR (CLK_CTL_BASE + 0x42000) /* > block reset*/ 125:#define SDCC2_BCR (CLK_CTL_BASE > + 0x43000) /* block reset */ 150:#define USB_HS_BCR > (CLK_CTL_BASE + 0x41000) 155:#define GCC_QUSB2_PHY_BCR > (CLK_CTL_BASE + 0x4103C) 168:#define USB_30_BCR > (CLK_CTL_BASE + 0x3F070) > 189:#define USB3_PHY_BCR(CLK_CTL_BASE + 0x3F034) > 190:#define USB3PHY_PHY_BCR (CLK_CTL_BASE + 0x3F03C) > > Couldn't find this one though, did you confirm that MDSS goes off > when you assert it? That one's defined here: https://gerrit-public.fairphone.software/plugins/gitiles/kernel/msm-4.9/+/refs/heads/int/13/fp3/arch/arm64/boot/dts/qcom/msm8953-mdss-pll.dtsi#21 I'll add some of the others in v2. > > Konrad
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-20 08:47, Steven Rostedt wrote: On Fri, 19 Jan 2024 20:49:36 -0500 Mathieu Desnoyers wrote: Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex)Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x1234 16-bit: 0x (When decoded, going from "0x5678" to "0x" does not overflow 16-bit) Event 3 0x1235 16-bit: 0x (When decoded, going from "0x" to "0x" overflow 16-bit exactly once which allows the trace reader to reconstruct timestamp 0x1235 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x1237 64-bit: 0x1237 (Encoding over 16-bit not possible because going from 0x1235 to 0x1237 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream. I will take the point of view of the trace reader for the rest of the discussion. The above example, starts with an timestamp injection into the packet header: 0x12345678, with the lsb 16bits ignore. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read. In the packet header you have 0x12345678 in the first event you have 0x5678 how does that get you the timestamp? If that event had 0x, when the reader reads this packet, it would take the header 0x12345678 chop off (ignore) the 5678, and add the , right? We need to consider not only what happens when the 16 low bits increase, but also what happens when they end up with a value smaller than the previous 16 low bits. As a summary from our video meeting discussion: There are 3 cases we care about here: packet header timestamp: 0x12345678 followed by either: A) first event delta from packet header timestamp is 0: 16-bit value 0x5678 B) first event delta from packet header timestamp is <= 0x: B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value) B.2) 16-bit value example 0x (exactly one 16-bit overflow from previous value) C) first event delta from packet header timestamp is larger than 0x, which would cause the low-order 16 bits to have more than one 16-bit overflow from the previous value. The tracer detects this and uses a full 64-bit timestamp instead of the compact 16 bits. [...] But how do you detect the overflow? That last timestamps to know if the tsc overflowed or not needs to be saved and compared. I would assume you have a similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt local cmpxchg updating the reserved position. 
The algorithm looks like:

do {
  - read current reserved position (old pos)
  - read time
  - compute new reserved position (new pos)
} while (cmpxchg(reserved pos, old pos, new pos) != old pos);
[A]
save_last_tsc()

So the last_tsc that is saved is from the timestamp read before the cmpxchg.

Yes. If interrupted at [A] by another trace producer, it will compare with an older "last tsc" than the tsc of the event physically located just before the nested event. This stale "last tsc" has a value which is necessarily lower than the one we would be saving with the save_last_tsc immediately afterwards, which means in the worst case we end up using a full 64-bit timestamp when in fact we could use a more compact representation. But this race is rare and therefore it does not matter for size.

That's equivalent to me "injecting" an absolute value for the same race.

Yes. The fact that I only need this last_tsc value for the sake of optimization, and not for computation of a time delta from a previous injected timestamp, makes it possible to handle the race gracefully without requiring anything more than a single last_tsc value per buffer and a single comparison for 16-bit overflow.

If you have:
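To make the single-overflow rule in this exchange concrete, here is a small, self-contained sketch of the reader-side reconstruction being described. It is a toy model of the scheme discussed in the thread, not the actual ring-buffer or CTF code, and the function names are made up.

#include <stdint.h>
#include <stdio.h>

/* A writer may emit only the low 16 bits when the delta from the previous
 * timestamp is at most 0xffff, i.e. at most one 16-bit overflow. */
static int fits_compact(uint64_t prev_ts, uint64_t ts)
{
	return ts - prev_ts <= 0xffff;
}

/* Reader side: rebuild the full timestamp from the previous one plus the
 * 16-bit value.  If the low bits went backwards, exactly one 16-bit
 * overflow happened (guaranteed by the writer's delta <= 0xffff rule),
 * so bump the upper bits by one period. */
static uint64_t decode_compact(uint64_t prev_ts, uint16_t low16)
{
	uint64_t ts = (prev_ts & ~(uint64_t)0xffff) | low16;

	if (low16 < (uint16_t)prev_ts)
		ts += 0x10000;
	return ts;
}

int main(void)
{
	/* The sequence used in the example above. */
	uint64_t prev = 0x12345678;		/* packet header timestamp */
	uint16_t events[] = { 0x5678, 0xffff, 0x0000 };

	for (unsigned int i = 0; i < 3; i++) {
		prev = decode_compact(prev, events[i]);
		printf("event %u -> 0x%llx\n", i + 1,
		       (unsigned long long)prev);
	}

	/* 0x12370000 is two overflows away from 0x12350000, so it cannot
	 * be encoded on 16 bits and needs the full 64-bit form. */
	printf("fits_compact: %d\n", fits_compact(prev, 0x12370000));
	return 0;
}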
Re: [PATCH] powerpc/papr_scm: Move duplicate definitions to common header files
Le 18/04/2022 à 06:38, Shivaprasad G Bhat a écrit : > papr_scm and ndtest share common PDSM payload structs like > nd_papr_pdsm_health. Presently these structs are duplicated across > papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially > arch independent and can run on platforms other than PPC64, a way > needs to be deviced to avoid redundancy and duplication of PDSM > structs in future. > > So the patch proposes moving the PDSM header from arch/powerpc/include- > -/uapi/ to the generic include/uapi/linux directory. Also, there are > some #defines common between papr_scm and ndtest which are not exported > to the user space. So, move them to a header file which can be shared > across ndtest and papr_scm via newly introduced include/linux/papr_scm.h. > > Signed-off-by: Shivaprasad G Bhat > Signed-off-by: Vaibhav Jain > Suggested-by: "Aneesh Kumar K.V" This patch doesn't apply, if still relevant can you please rebase and re-submit ? Thanks Christophe > --- > Changelog: > Since v2: > Link: > https://patchwork.kernel.org/project/linux-nvdimm/patch/163454440296.431294.2368481747380790011.st...@lep8c.aus.stglabs.ibm.com/ > * Made it like v1, and rebased. > * Fixed repeating words in comments of the header file papr_scm.h > > Since v1: > Link: > https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/ > * Removed dependency on this patch for the other patches > > MAINTAINERS |2 > arch/powerpc/include/uapi/asm/papr_pdsm.h | 165 > - > arch/powerpc/platforms/pseries/papr_scm.c | 43 > include/linux/papr_scm.h | 49 + > include/uapi/linux/papr_pdsm.h| 165 > + > tools/testing/nvdimm/test/ndtest.c|2 > tools/testing/nvdimm/test/ndtest.h| 31 - > 7 files changed, 220 insertions(+), 237 deletions(-) > delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h > create mode 100644 include/linux/papr_scm.h > create mode 100644 include/uapi/linux/papr_pdsm.h > > diff --git a/MAINTAINERS b/MAINTAINERS > index 1699bb7cc867..03685b074dda 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -11254,6 +11254,8 @@ F:drivers/rtc/rtc-opal.c > F: drivers/scsi/ibmvscsi/ > F: drivers/tty/hvc/hvc_opal.c > F: drivers/watchdog/wdrtas.c > +F: include/linux/papr_scm.h > +F: include/uapi/linux/papr_pdsm.h > F: tools/testing/selftests/powerpc > N: /pmac > N: powermac > diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h > b/arch/powerpc/include/uapi/asm/papr_pdsm.h > deleted file mode 100644 > index 17439925045c.. 
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h > +++ /dev/null > @@ -1,165 +0,0 @@ > -/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > -/* > - * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl > - * > - * (C) Copyright IBM 2020 > - * > - * Author: Vaibhav Jain > - */ > - > -#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_ > -#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_ > - > -#include > -#include > - > -/* > - * PDSM Envelope: > - * > - * The ioctl ND_CMD_CALL exchange data between user-space and kernel via > - * envelope which consists of 2 headers sections and payload sections as > - * illustrated below: > - * +-+---+---+ > - * | 64-Bytes | 8-Bytes | Max 184-Bytes | > - * +-+---+---+ > - * | ND-HEADER | PDSM-HEADER | PDSM-PAYLOAD | > - * +-+---+---+ > - * | nd_family | | | > - * | nd_size_out | cmd_status| | > - * | nd_size_in | reserved | nd_pdsm_payload | > - * | nd_command | payload --> | | > - * | nd_fw_size | | | > - * | nd_payload ---> | | | > - * +---+-+---+ > - * > - * ND Header: > - * This is the generic libnvdimm header described as 'struct nd_cmd_pkg' > - * which is interpreted by libnvdimm before passed on to papr_scm. Important > - * member fields used are: > - * 'nd_family' : (In) NVDIMM_FAMILY_PAPR_SCM > - * 'nd_size_in' : (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0) > - * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD > - * 'nd_command' : (In) One of PAPR_PDSM_XXX > - * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned > - * > - * PDSM Header: > - * This is papr-scm specific header that precedes the payload. This is > defined > - * as nd_cmd_pdsm_pkg. Following fields aare available in this header: > - * > - * 'cmd_status'
Re: [PATCH v2 2/2] tracing: Include Microcode Revision in mce_record tracepoint
On 1/25/2024 10:48 AM, Avadhut Naik wrote: > Currently, the microcode field (Microcode Revision) of struct mce is not > exported to userspace through the mce_record tracepoint. > > Export it through the tracepoint as it may provide useful information for > debug and analysis. > > Signed-off-by: Avadhut Naik > --- A couple of nits below. Apart from that the patch looks fine to me. Reviewed-by: Sohil Mehta > include/trace/events/mce.h | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h > index 657b93ec8176..203baccd3c5c 100644 > --- a/include/trace/events/mce.h > +++ b/include/trace/events/mce.h > @@ -34,6 +34,7 @@ TRACE_EVENT(mce_record, > __field(u8, cs ) > __field(u8, bank) > __field(u8, cpuvendor ) > + __field(u32,microcode ) Tab alignment is inconsistent. > ), > > TP_fast_assign( > @@ -55,9 +56,10 @@ TRACE_EVENT(mce_record, > __entry->cs = m->cs; > __entry->bank = m->bank; > __entry->cpuvendor = m->cpuvendor; > + __entry->microcode = m->microcode; > ), > > - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", > + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE REVISION: > %u", Should microcode by printed as a decimal or an hexadecimal? Elsewhere such as __print_mce(), it is printed as an hexadecimal: /* * Note this output is parsed by external tools and old fields * should not be changed. */ pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode %x\n", m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid, m->microcode); > __entry->cpu, > __entry->mcgcap, __entry->mcgstatus, > __entry->bank, __entry->status, > @@ -69,7 +71,8 @@ TRACE_EVENT(mce_record, > __entry->cpuvendor, __entry->cpuid, > __entry->walltime, > __entry->socketid, > - __entry->apicid) > + __entry->apicid, > + __entry->microcode) > ); > > #endif /* _TRACE_MCE_H */
Re: [PATCH v2 1/2] tracing: Include PPIN in mce_record tracepoint
On 1/25/2024 10:48 AM, Avadhut Naik wrote: > Machine Check Error information from struct mce is exported to userspace > through the mce_record tracepoint. > > Currently, however, the PPIN (Protected Processor Inventory Number) field > of struct mce is not exported through the tracepoint. > > Export PPIN through the tracepoint as it may provide useful information > for debug and analysis. > > Signed-off-by: Avadhut Naik > --- The patch looks fine to me expect for a nit below. With that fixed, please feel free to add: Reviewed-by: Sohil Mehta > include/trace/events/mce.h | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h > index 1391ada0da3b..657b93ec8176 100644 > --- a/include/trace/events/mce.h > +++ b/include/trace/events/mce.h > @@ -25,6 +25,7 @@ TRACE_EVENT(mce_record, > __field(u64,ipid) > __field(u64,ip ) > __field(u64,tsc ) > + __field(u64,ppin) The tabs are not aligned with the rest of the structure. > __field(u64,walltime) > __field(u32,cpu ) > __field(u32,cpuid ) > @@ -45,6 +46,7 @@ TRACE_EVENT(mce_record, > __entry->ipid = m->ipid; > __entry->ip = m->ip; > __entry->tsc= m->tsc; > + __entry->ppin = m->ppin; > __entry->walltime = m->time; > __entry->cpu= m->extcpu; > __entry->cpuid = m->cpuid; > @@ -55,7 +57,7 @@ TRACE_EVENT(mce_record, > __entry->cpuvendor = m->cpuvendor; > ), > > - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, > PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", > + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, > ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: > %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", > __entry->cpu, > __entry->mcgcap, __entry->mcgstatus, > __entry->bank, __entry->status, > @@ -63,6 +65,7 @@ TRACE_EVENT(mce_record, > __entry->addr, __entry->misc, __entry->synd, > __entry->cs, __entry->ip, > __entry->tsc, > + __entry->ppin, > __entry->cpuvendor, __entry->cpuid, > __entry->walltime, > __entry->socketid,
Re: [PATCH v2 2/3] tracing: initialize trace_seq buffers
On Thu, 25 Jan 2024 17:16:21 -0300 "Ricardo B. Marliere" wrote: > Now that trace_seq_reset have been created, correct the places where the > buffers need to be initialized. This patch would need to come first. You don't ever want to intentionally create a broken kernel. Also, the change log should be: In order to extend trace_seq into being dynamic, the struct trace_seq will no longer be valid if simply set to zero. Call trace_seq_init() for all trace_seq when they are first created. > > Suggested-by: Steven Rostedt > Signed-off-by: Ricardo B. Marliere > --- > kernel/trace/trace.c | 14 ++ > 1 file changed, 14 insertions(+) > You also need to initialize iter.seq in ftrace_dump() -- Steve > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index d4c55d3e21c2..9827700d0164 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -4889,6 +4889,9 @@ __tracing_open(struct inode *inode, struct file *file, > bool snapshot) > > mutex_unlock(&trace_types_lock); > > + trace_seq_init(&iter->seq); > + trace_seq_init(&iter->tmp_seq); > + > return iter; > > fail: > @@ -6770,6 +6773,7 @@ static int tracing_open_pipe(struct inode *inode, > struct file *filp) > } > > trace_seq_init(&iter->seq); > + trace_seq_init(&iter->tmp_seq); > iter->trace = tr->current_trace; > > if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) { > @@ -6947,6 +6951,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, > trace_iterator_reset(iter); > cpumask_clear(iter->started); > trace_seq_init(&iter->seq); > + trace_seq_init(&iter->tmp_seq); > > trace_event_read_lock(); > trace_access_lock(iter->cpu_file); > @@ -8277,6 +8282,9 @@ static int tracing_buffers_open(struct inode *inode, > struct file *filp) > if (ret < 0) > trace_array_put(tr); > > + trace_seq_init(&info->iter.seq); > + trace_seq_init(&info->iter.tmp_seq); > + > return ret; > } > > @@ -10300,6 +10308,9 @@ void trace_init_global_iter(struct trace_iterator > *iter) > iter->temp_size = STATIC_TEMP_BUF_SIZE; > iter->fmt = static_fmt_buf; > iter->fmt_size = STATIC_FMT_BUF_SIZE; > + > + trace_seq_init(&iter->seq); > + trace_seq_init(&iter->tmp_seq); > } > > void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) > @@ -10712,6 +10723,9 @@ void __init early_trace_init(void) > tracepoint_printk = 0; > else > static_key_enable(&tracepoint_printk_key.key); > + > + trace_seq_init(&tracepoint_print_iter->seq); > + trace_seq_init(&tracepoint_print_iter->tmp_seq); > } > tracer_alloc_buffers(); >
Re: [PATCH v2 0/2] Update mce_record tracepoint
Hi, On 1/25/2024 1:19 PM, Luck, Tony wrote: >>> The first patch adds PPIN (Protected Processor Inventory Number) field to >>> the tracepoint. >>> >>> The second patch adds the microcode field (Microcode Revision) to the >>> tracepoint. >> >> This is a lot of static information to add to *every* MCE. > > 8 bytes for PPIN, 4 more for microcode. > > Number of recoverable machine checks per system I hope the monthly rate > should > be countable on my fingers. If a system is getting more than that, then > people should > be looking at fixing the underlying problem. > > Corrected errors are much more common. Though Linux takes action to limit the > rate when storms occur. So maybe hundreds or small numbers of thousands of > error trace records? Increase in trace buffer consumption still measured in > Kbytes > not Mbytes. Server systems that do machine check reporting now start at tens > of > GBytes memory. > >> And where does it end? Stick full dmesg in the tracepoint too? > > Seems like overkill. > >> What is the real-life use case here? > > Systems using rasdaemon to track errors will be able to track both of these > (I assume that Naik has plans to update rasdaemon to capture and save these > new fields). > Yes, I do intend to submit a pull request to the rasdaemon to parse and log these new fields. > PPIN is useful when talking to the CPU vendor about patterns of similar errors > seen across a cluster. > > MICROCODE - gives a fast path to root cause problems that have already > been fixed in a microcode update. > > -Tony -- Thanks, Avadhut Naik
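Since the stated consumer is rasdaemon via libtraceevent, a rough sketch of how such a consumer could read the two new fields follows. This is illustrative only: the field names match the patches ("ppin", "microcode"), but the handler itself is hypothetical and is not rasdaemon's actual code.

#include <stdio.h>
#include <event-parse.h>	/* libtraceevent */

/* Hypothetical handler for the mce:mce_record tracepoint. */
static int mce_record_handler(struct trace_seq *s, struct tep_record *record,
			      struct tep_event *event, void *context)
{
	unsigned long long ppin, microcode;

	if (tep_get_field_val(s, event, "ppin", record, &ppin, 1) < 0)
		return -1;
	if (tep_get_field_val(s, event, "microcode", record, &microcode, 1) < 0)
		return -1;

	/* Print both in hex, matching __print_mce() in the kernel. */
	trace_seq_printf(s, "PPIN: %llx, MICROCODE: %x\n",
			 ppin, (unsigned int)microcode);
	return 0;
}

/* Registration would look roughly like:
 *   tep_register_event_handler(tep, -1, "mce", "mce_record",
 *                              mce_record_handler, NULL);
 */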
[PATCH v2 3/3] tracing: convert __trace_seq_init to use WARN_ON_ONCE
The initialization of trace_seq buffers are done elsewhere and therefore __trace_seq_init should yield a warning if it has to actually initialize the buffer. Suggested-by: Steven Rostedt Signed-off-by: Ricardo B. Marliere --- kernel/trace/trace_seq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/trace_seq.c b/kernel/trace/trace_seq.c index 741b2f3d76c0..3c7a7d903b54 100644 --- a/kernel/trace/trace_seq.c +++ b/kernel/trace/trace_seq.c @@ -32,7 +32,7 @@ */ static inline void __trace_seq_init(struct trace_seq *s) { - if (unlikely(!s->seq.size)) + if (WARN_ON_ONCE(!s->seq.size)) trace_seq_init(s); } -- 2.43.0
[PATCH v2 2/3] tracing: initialize trace_seq buffers
Now that trace_seq_reset have been created, correct the places where the buffers need to be initialized. Suggested-by: Steven Rostedt Signed-off-by: Ricardo B. Marliere --- kernel/trace/trace.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d4c55d3e21c2..9827700d0164 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4889,6 +4889,9 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot) mutex_unlock(&trace_types_lock); + trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); + return iter; fail: @@ -6770,6 +6773,7 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp) } trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); iter->trace = tr->current_trace; if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) { @@ -6947,6 +6951,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, trace_iterator_reset(iter); cpumask_clear(iter->started); trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); trace_event_read_lock(); trace_access_lock(iter->cpu_file); @@ -8277,6 +8282,9 @@ static int tracing_buffers_open(struct inode *inode, struct file *filp) if (ret < 0) trace_array_put(tr); + trace_seq_init(&info->iter.seq); + trace_seq_init(&info->iter.tmp_seq); + return ret; } @@ -10300,6 +10308,9 @@ void trace_init_global_iter(struct trace_iterator *iter) iter->temp_size = STATIC_TEMP_BUF_SIZE; iter->fmt = static_fmt_buf; iter->fmt_size = STATIC_FMT_BUF_SIZE; + + trace_seq_init(&iter->seq); + trace_seq_init(&iter->tmp_seq); } void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) @@ -10712,6 +10723,9 @@ void __init early_trace_init(void) tracepoint_printk = 0; else static_key_enable(&tracepoint_printk_key.key); + + trace_seq_init(&tracepoint_print_iter->seq); + trace_seq_init(&tracepoint_print_iter->tmp_seq); } tracer_alloc_buffers(); -- 2.43.0
[PATCH v2 1/3] tracing: add trace_seq_reset function
Currently, trace_seq_init may be called many times with the intent of resetting the buffer. Add a function trace_seq_reset that does that and replace the relevant occurrences to use it instead. Suggested-by: Steven Rostedt Signed-off-by: Ricardo B. Marliere --- include/linux/trace_seq.h| 11 +++ include/trace/trace_events.h | 2 +- kernel/trace/trace.c | 10 +- kernel/trace/trace_output.c | 2 +- kernel/trace/trace_seq.c | 2 +- 5 files changed, 19 insertions(+), 8 deletions(-) diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h index 9ec229dfddaa..d3fa41001813 100644 --- a/include/linux/trace_seq.h +++ b/include/linux/trace_seq.h @@ -29,6 +29,17 @@ trace_seq_init(struct trace_seq *s) s->readpos = 0; } +static inline void +trace_seq_reset(struct trace_seq *s) +{ + if (WARN_ON_ONCE(!s->seq.size)) + seq_buf_init(&s->seq, s->buffer, TRACE_SEQ_BUFFER_SIZE); + else + seq_buf_clear(&s->seq); + s->full = 0; + s->readpos = 0; +} + /** * trace_seq_used - amount of actual data written to buffer * @s: trace sequence descriptor diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h index c2f9cabf154d..2bc79998e5ab 100644 --- a/include/trace/trace_events.h +++ b/include/trace/trace_events.h @@ -227,7 +227,7 @@ trace_raw_output_##call(struct trace_iterator *iter, int flags, \ \ field = (typeof(field))entry; \ \ - trace_seq_init(p); \ + trace_seq_reset(p); \ return trace_output_call(iter, #call, print); \ } \ static struct trace_event_functions trace_event_type_funcs_##call = { \ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 46dbe22121e9..d4c55d3e21c2 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2928,7 +2928,7 @@ static void output_printk(struct trace_event_buffer *fbuffer) event = &fbuffer->trace_file->event_call->event; raw_spin_lock_irqsave(&tracepoint_iter_lock, flags); - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); iter->ent = fbuffer->entry; event_call->event.funcs->trace(iter, 0, event); trace_seq_putc(&iter->seq, 0); @@ -6921,7 +6921,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, if (sret != -EBUSY) goto out; - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); if (iter->trace->read) { sret = iter->trace->read(iter, filp, ubuf, cnt, ppos); @@ -6993,7 +6993,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf, /* Now copy what we have to the user */ sret = trace_seq_to_user(&iter->seq, ubuf, cnt); if (iter->seq.readpos >= trace_seq_used(&iter->seq)) - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); /* * If there was nothing to send to user, in spite of consuming trace @@ -7125,7 +7125,7 @@ static ssize_t tracing_splice_read_pipe(struct file *filp, spd.partial[i].offset = 0; spd.partial[i].len = trace_seq_used(&iter->seq); - trace_seq_init(&iter->seq); + trace_seq_reset(&iter->seq); } trace_access_unlock(iter->cpu_file); @@ -10274,7 +10274,7 @@ trace_printk_seq(struct trace_seq *s) printk(KERN_TRACE "%s", s->buffer); - trace_seq_init(s); + trace_seq_reset(s); } void trace_init_global_iter(struct trace_iterator *iter) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index 3e7fa44dc2b2..c949e7736618 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -308,7 +308,7 @@ int trace_raw_output_prep(struct trace_iterator *iter, return TRACE_TYPE_UNHANDLED; } - trace_seq_init(p); + trace_seq_reset(p); trace_seq_printf(s, "%s: ", trace_event_name(event)); return trace_handle_return(s); diff --git a/kernel/trace/trace_seq.c 
b/kernel/trace/trace_seq.c index c158d65a8a88..741b2f3d76c0 100644 --- a/kernel/trace/trace_seq.c +++ b/kernel/trace/trace_seq.c @@ -59,7 +59,7 @@ int trace_print_seq(struct seq_file *m, struct trace_seq *s) * do something else with the contents. */ if (!ret) - trace_seq_init(s); + trace_seq_reset(s); return ret; } -- 2.43.0
[PATCH v2 0/3] add trace_seq_reset function
This series is a prerequisite for a later effort of making trace_seq more flexible about its buffer size. To achieve that, initializing and resetting the buffers need to be differentiated. Ricardo B. Marliere (3): tracing: add trace_seq_reset function tracing: initialize trace_seq buffers tracing: convert __trace_seq_init to use WARN_ON_ONCE include/linux/trace_seq.h| 11 +++ include/trace/trace_events.h | 2 +- kernel/trace/trace.c | 24 +++- kernel/trace/trace_output.c | 2 +- kernel/trace/trace_seq.c | 4 ++-- 5 files changed, 34 insertions(+), 9 deletions(-) -- 2.43.0
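A minimal sketch of the calling convention the series is aiming for, using the helpers as declared in patch 1/3: trace_seq_init() once when the buffer comes to life, trace_seq_reset() every time it is reused. The surrounding loop is illustrative only.

/* Illustrative use of the init/reset split introduced by this series. */
struct trace_seq seq;

trace_seq_init(&seq);			/* once, when the trace_seq is created */

for (int i = 0; i < 8; i++) {
	trace_seq_printf(&seq, "iteration %d\n", i);
	/* ... hand the contents off somewhere ... */
	trace_seq_reset(&seq);		/* clear for reuse; the point of the
					 * split is that reset can keep
					 * whatever storage init set up */
}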
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > The first patch adds PPIN (Protected Processor Inventory Number) field to > > the tracepoint. > > > > The second patch adds the microcode field (Microcode Revision) to the > > tracepoint. > > This is a lot of static information to add to *every* MCE. 8 bytes for PPIN, 4 more for microcode. Number of recoverable machine checks per system I hope the monthly rate should be countable on my fingers. If a system is getting more than that, then people should be looking at fixing the underlying problem. Corrected errors are much more common. Though Linux takes action to limit the rate when storms occur. So maybe hundreds or small numbers of thousands of error trace records? Increase in trace buffer consumption still measured in Kbytes not Mbytes. Server systems that do machine check reporting now start at tens of GBytes memory. > And where does it end? Stick full dmesg in the tracepoint too? Seems like overkill. > What is the real-life use case here? Systems using rasdaemon to track errors will be able to track both of these (I assume that Naik has plans to update rasdaemon to capture and save these new fields). PPIN is useful when talking to the CPU vendor about patterns of similar errors seen across a cluster. MICROCODE - gives a fast path to root cause problems that have already been fixed in a microcode update. -Tony
Re: [PATCH v2 0/2] Update mce_record tracepoint
On Thu, Jan 25, 2024 at 12:48:55PM -0600, Avadhut Naik wrote: > This patchset updates the mce_record tracepoint so that the recently added > fields of struct mce are exported through it to userspace. > > The first patch adds PPIN (Protected Processor Inventory Number) field to > the tracepoint. > > The second patch adds the microcode field (Microcode Revision) to the > tracepoint. This is a lot of static information to add to *every* MCE. And where does it end? Stick full dmesg in the tracepoint too? What is the real-life use case here? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2 1/4] remoteproc: Add TEE support
Hi Arnaud, On Thu, Jan 18, 2024 at 11:04:30AM +0100, Arnaud Pouliquen wrote: > From: Arnaud Pouliquen > > Add a remoteproc TEE (Trusted Execution Environment) device Device or driver? Seems to be the latter... > that will be probed by the TEE bus. If the associated Trusted > application is supported on secure part this device offers a client > interface to load a firmware in the secure part. > This firmware could be authenticated and decrypted by the secure > trusted application. > > Signed-off-by: Arnaud Pouliquen > --- > drivers/remoteproc/Kconfig | 9 + > drivers/remoteproc/Makefile | 1 + > drivers/remoteproc/tee_remoteproc.c | 393 > include/linux/tee_remoteproc.h | 99 +++ > 4 files changed, 502 insertions(+) > create mode 100644 drivers/remoteproc/tee_remoteproc.c > create mode 100644 include/linux/tee_remoteproc.h > > diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig > index 48845dc8fa85..85299606806c 100644 > --- a/drivers/remoteproc/Kconfig > +++ b/drivers/remoteproc/Kconfig > @@ -365,6 +365,15 @@ config XLNX_R5_REMOTEPROC > > It's safe to say N if not interested in using RPU r5f cores. > > + > +config TEE_REMOTEPROC > + tristate "trusted firmware support by a TEE application" > + depends on OPTEE > + help > + Support for trusted remote processors firmware. The firmware > + authentication and/or decryption are managed by a trusted application. > + This can be either built-in or a loadable module. > + > endif # REMOTEPROC > > endmenu > diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile > index 91314a9b43ce..fa8daebce277 100644 > --- a/drivers/remoteproc/Makefile > +++ b/drivers/remoteproc/Makefile > @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o > obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o > obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o > obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o > +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o > obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o > obj-$(CONFIG_TI_K3_R5_REMOTEPROC)+= ti_k3_r5_remoteproc.o > obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o > diff --git a/drivers/remoteproc/tee_remoteproc.c > b/drivers/remoteproc/tee_remoteproc.c > new file mode 100644 > index ..49e1e0caf889 > --- /dev/null > +++ b/drivers/remoteproc/tee_remoteproc.c > @@ -0,0 +1,393 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) STMicroelectronics 2023 - All Rights Reserved > + * Author: Arnaud Pouliquen > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "remoteproc_internal.h" > + > +#define MAX_TEE_PARAM_ARRY_MEMBER4 > + > +/* > + * Authentication of the firmware and load in the remote processor memory > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [in] params[1].memref: buffer containing the image of the > buffer > + */ > +#define TA_RPROC_FW_CMD_LOAD_FW 1 > + > +/* > + * Start the remote processor > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + */ > +#define TA_RPROC_FW_CMD_START_FW 2 > + > +/* > + * Stop the remote processor > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + */ > +#define TA_RPROC_FW_CMD_STOP_FW 3 > + > +/* > + * Return the address of the resource table, or 0 if not found > + * No check is done to verify that the address returned is accessible by > + * the non secure context. 
If the resource table is loaded in a protected > + * memory the access by the non secure context will lead to a data abort. > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [out] params[1].value.a: 32bit LSB resource table memory address > + * [out] params[1].value.b: 32bit MSB resource table memory address > + * [out] params[2].value.a: 32bit LSB resource table memory size > + * [out] params[2].value.b: 32bit MSB resource table memory size > + */ > +#define TA_RPROC_FW_CMD_GET_RSC_TABLE4 > + > +/* > + * Return the address of the core dump > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [out] params[1].memref: address of the core dump image if exist, > + * else return Null > + */ > +#define TA_RPROC_FW_CMD_GET_COREDUMP 5 > + > +struct tee_rproc_mem { > + char name[20]; > + void __iomem *cpu_addr; > + phys_addr_t bus_addr; > + u32 dev_addr; > + size_t size; > +}; > + > +struct tee_rproc_context { > + struct list_head sessions; > + struct tee_context *tee_ctx; > + struct device *dev; > +}; > + > +static struct tee_rproc_context *tee_rproc_ctx;
Re: [PATCH] tracing: Include PPIN in mce_record tracepoint
On Wed, Jan 24, 2024 at 09:09:08AM -0500, Steven Rostedt wrote: > I don't think that's a worry anymore. The offsets can change based on > kernel config. PowerTop needed to have the library ported to it because > it use to hardcode the offsets but then it broke when running the 32bit > version on a 64bit kernel. > > > > > I guess no until we break some use case and then we will have to revert. > > At least this is what we've done in the past... > > > > But that revert was reverted when we converted PowerTop to use libtraceevent. Ok, sounds like a good plan. /me makes a mental note for the future. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
[PATCH v2 2/2] tracing: Include Microcode Revision in mce_record tracepoint
Currently, the microcode field (Microcode Revision) of struct mce is not exported to userspace through the mce_record tracepoint. Export it through the tracepoint as it may provide useful information for debug and analysis. Signed-off-by: Avadhut Naik --- include/trace/events/mce.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h index 657b93ec8176..203baccd3c5c 100644 --- a/include/trace/events/mce.h +++ b/include/trace/events/mce.h @@ -34,6 +34,7 @@ TRACE_EVENT(mce_record, __field(u8, cs ) __field(u8, bank) __field(u8, cpuvendor ) + __field(u32,microcode ) ), TP_fast_assign( @@ -55,9 +56,10 @@ TRACE_EVENT(mce_record, __entry->cs = m->cs; __entry->bank = m->bank; __entry->cpuvendor = m->cpuvendor; + __entry->microcode = m->microcode; ), - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x, MICROCODE REVISION: %u", __entry->cpu, __entry->mcgcap, __entry->mcgstatus, __entry->bank, __entry->status, @@ -69,7 +71,8 @@ TRACE_EVENT(mce_record, __entry->cpuvendor, __entry->cpuid, __entry->walltime, __entry->socketid, - __entry->apicid) + __entry->apicid, + __entry->microcode) ); #endif /* _TRACE_MCE_H */ -- 2.34.1
[PATCH v2 1/2] tracing: Include PPIN in mce_record tracepoint
Machine Check Error information from struct mce is exported to userspace through the mce_record tracepoint. Currently, however, the PPIN (Protected Processor Inventory Number) field of struct mce is not exported through the tracepoint. Export PPIN through the tracepoint as it may provide useful information for debug and analysis. Signed-off-by: Avadhut Naik --- include/trace/events/mce.h | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h index 1391ada0da3b..657b93ec8176 100644 --- a/include/trace/events/mce.h +++ b/include/trace/events/mce.h @@ -25,6 +25,7 @@ TRACE_EVENT(mce_record, __field(u64,ipid) __field(u64,ip ) __field(u64,tsc ) + __field(u64,ppin) __field(u64,walltime) __field(u32,cpu ) __field(u32,cpuid ) @@ -45,6 +46,7 @@ TRACE_EVENT(mce_record, __entry->ipid = m->ipid; __entry->ip = m->ip; __entry->tsc= m->tsc; + __entry->ppin = m->ppin; __entry->walltime = m->time; __entry->cpu= m->extcpu; __entry->cpuid = m->cpuid; @@ -55,7 +57,7 @@ TRACE_EVENT(mce_record, __entry->cpuvendor = m->cpuvendor; ), - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR/MISC/SYND: %016Lx/%016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", __entry->cpu, __entry->mcgcap, __entry->mcgstatus, __entry->bank, __entry->status, @@ -63,6 +65,7 @@ TRACE_EVENT(mce_record, __entry->addr, __entry->misc, __entry->synd, __entry->cs, __entry->ip, __entry->tsc, + __entry->ppin, __entry->cpuvendor, __entry->cpuid, __entry->walltime, __entry->socketid, -- 2.34.1
[PATCH v2 0/2] Update mce_record tracepoint
This patchset updates the mce_record tracepoint so that the recently added fields of struct mce are exported through it to userspace. The first patch adds PPIN (Protected Processor Inventory Number) field to the tracepoint. The second patch adds the microcode field (Microcode Revision) to the tracepoint. Changes in v2: - Export microcode field (Microcode Revision) through the tracepoiont in addition to PPIN. Avadhut Naik (2): tracing: Include PPIN in mce_record tracepoint tracing: Include Microcode Revision in mce_record tracepoint include/trace/events/mce.h | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) base-commit: 0d4f19418f067465b0a84a287d9a51e443a0bc3a -- 2.34.1
Re: [PATCH 1/4] tracing/user_events: Prepare find/delete for same name events
On Thu, Jan 25, 2024 at 09:59:03AM +0900, Masami Hiramatsu wrote: > On Tue, 23 Jan 2024 22:08:41 + > Beau Belgrave wrote: > > > The current code for finding and deleting events assumes that there will > > never be cases when user_events are registered with the same name, but > > different formats. In the future this scenario will exist to ensure > > user programs can be updated or modify their events and run different > > versions of their programs side-by-side without being blocked. > > Ah, this is a very important point. Kernel always has only one instance > but user program doesn't. Thus it can define the same event name. > For the similar problem, uprobe event assumes that the user (here > admin) will define different group name to avoid it. But for the user > event, it is embedded, hmm. > Yes, the series will handle if multi-processes use the same name, we will find a matching version of that name within the user_event group. If there isn't one, a new one is created. Each is backed by an independent tracepoint which does match up with how uprobe does it. This actually got brought up in the tracefs meetings we've had and it seemed to get wide agreement on how to best handle this. > > > > This change does not yet allow for multi-format events. If user_events > > are registered with the same name but different arguments the programs > > see the same return values as before. This change simply makes it > > possible to easily accomodate for this in future changes. > > > > Update find_user_event() to take in argument parameters and register > > flags to accomodate future multi-format event scenarios. Have find > > validate argument matching and return error pointers to cover address > > in use cases, or allocation errors. Update callers to handle error > > pointer logic. > > Understand, that is similar to what probe events do. > > > > > Move delete_user_event() to use hash walking directly now that find has > > changed. Delete all events found that match the register name, stop > > if an error occurs and report back to the user. > > What happen if we run 2 different version of the applications and terminate > one of them? The event which is used by others will be kept? > Each unique version of a user_event has it's own ref-count. If one version is not-used, but another version is, only the not-used version will get deleted. The other version that is in use will return a -EBUSY when it gets to that version via enumeration. While we only have a single tracepoint per-version, we have several user_event structures in memory that have the same name, yet different formats. Each of which have their own lifetime, enablers and ref-counts to keep them isolated from each other. Thanks, -Beau > Thank you, > > > > > Update user_fields_match() to cover list_empty() scenarios instead of > > each callsite doing it now that find_user_event() uses it directly. 
> > > > Signed-off-by: Beau Belgrave > > --- > > kernel/trace/trace_events_user.c | 106 +-- > > 1 file changed, 58 insertions(+), 48 deletions(-) > > > > diff --git a/kernel/trace/trace_events_user.c > > b/kernel/trace/trace_events_user.c > > index 9365ce407426..0480579ba563 100644 > > --- a/kernel/trace/trace_events_user.c > > +++ b/kernel/trace/trace_events_user.c > > @@ -202,6 +202,8 @@ static struct user_event_mm *user_event_mm_get(struct > > user_event_mm *mm); > > static struct user_event_mm *user_event_mm_get_all(struct user_event > > *user); > > static void user_event_mm_put(struct user_event_mm *mm); > > static int destroy_user_event(struct user_event *user); > > +static bool user_fields_match(struct user_event *user, int argc, > > + const char **argv); > > > > static u32 user_event_key(char *name) > > { > > @@ -1493,17 +1495,24 @@ static int destroy_user_event(struct user_event > > *user) > > } > > > > static struct user_event *find_user_event(struct user_event_group *group, > > - char *name, u32 *outkey) > > + char *name, int argc, const char > > **argv, > > + u32 flags, u32 *outkey) > > { > > struct user_event *user; > > u32 key = user_event_key(name); > > > > *outkey = key; > > > > - hash_for_each_possible(group->register_table, user, node, key) > > - if (!strcmp(EVENT_NAME(user), name)) > > + hash_for_each_possible(group->register_table, user, node, key) { > > + if (strcmp(EVENT_NAME(user), name)) > > + continue; > > + > > + if (user_fields_match(user, argc, argv)) > > return user_event_get(user); > > > > + return ERR_PTR(-EADDRINUSE); > > + } > > + > > return NULL; > > } > > > > @@ -1860,6 +1869,9 @@ static bool user_fields_match(struct user_event > > *user, int argc, > > struct list_head *head = &user->fields; > > int i = 0; > > > > +
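The caller-side pattern implied by the new return convention (an error pointer for a name match with mismatched fields, NULL for no match, a referenced event otherwise) would look roughly like the sketch below. It paraphrases the description above rather than quoting an exact hunk from the patch.

/* Sketch of how a caller deals with the three possible outcomes. */
user = find_user_event(group, name, argc, argv, flags, &key);

if (IS_ERR(user))
	return PTR_ERR(user);	/* same name, incompatible format: -EADDRINUSE */

if (user) {
	/* Existing event with a matching format; reference already taken. */
	*newuser = user;
	return 0;
}

/* No match: fall through and create a new event for this format. */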
Re: [PATCH RFC v3 00/35] Add support for arm64 MTE dynamic tag storage reuse
On Thu, 25 Jan 2024 16:42:21 + Alexandru Elisei wrote: > include/trace/events/cma.h| 59 ++ > include/trace/events/mmflags.h| 5 +- I know others like being Cc'd on every patch in a series, but I'm not about to trudge through 35 patches to review trace events, having no idea which patch they are in. -- Steve
Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs
On Thu, 2024-01-25 at 09:57 -0500, Chuck Lever wrote: > On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote: > > Long ago, file locks used to hang off of a singly-linked list in struct > > inode. Because of this, when leases were added, they were added to the > > same list and so they had to be tracked using the same sort of > > structure. > > > > Several years ago, we added struct file_lock_context, which allowed us > > to use separate lists to track different types of file locks. Given > > that, leases no longer need to be tracked using struct file_lock. > > > > That said, a lot of the underlying infrastructure _is_ the same between > > file leases and locks, so we can't completely separate everything. > > > > This patchset first splits a group of fields used by both file locks and > > leases into a new struct file_lock_core, that is then embedded in struct > > file_lock. Coccinelle was then used to convert a lot of the callers to > > deal with the move, with the remaining 25% or so converted by hand. > > > > It then converts several internal functions in fs/locks.c to work > > with struct file_lock_core. Lastly, struct file_lock is split into > > struct file_lock and file_lease, and the lease-related APIs converted to > > take struct file_lease. > > > > After the first few patches (which I left split up for easier review), > > the set should be bisectable. I'll plan to squash the first few > > together to make sure the resulting set is bisectable before merge. > > > > Finally, I left the coccinelle scripts I used in tree. I had heard it > > was preferable to merge those along with the patches that they > > generate, but I wasn't sure where they go. I can either move those to a > > more appropriate location or we can just drop that commit if it's not > > needed. > > > > Signed-off-by: Jeff Layton > > v2 looks nicer. > > I would add a few list handling primitives, as I see enough > instances of list_for_each_entry, list_for_each_entry_safe, > list_first_entry, and list_first_entry_or_null on fl_core.flc_list > to make it worth having those. > > Also, there doesn't seem to be benefit for API consumers to have to > understand the internal structure of struct file_lock/lease to reach > into fl_core. Having accessor functions for common fields like > fl_type and fl_flags could be cleaner. > That is a good suggestion. I had considered it before and figured "why bother", but I think that would make things simpler. I'll plan to do a v3 that has more helpers. Possibly we can just convert some of the subsystems ahead of time and avoid some churn. Stay tuned... 
> For the series: > > Reviewed-by: Chuck Lever > > For the nfsd and lockd parts: > > Acked-by: Chuck Lever > > > > --- > > Changes in v2: > > - renamed file_lock_core fields to have "flc_" prefix > > - used macros to more easily do the change piecemeal > > - broke up patches into per-subsystem ones > > - Link to v1: > > https://lore.kernel.org/r/20240116-flsplit-v1-0-c9d0f4370...@kernel.org > > > > --- > > Jeff Layton (41): > > filelock: rename some fields in tracepoints > > filelock: rename fl_pid variable in lock_get_status > > dlm: rename fl_flags variable in dlm_posix_unlock > > nfs: rename fl_flags variable in nfs4_proc_unlck > > nfsd: rename fl_type and fl_flags variables in nfsd4_lock > > lockd: rename fl_flags and fl_type variables in nlmclnt_lock > > 9p: rename fl_type variable in v9fs_file_do_lock > > afs: rename fl_type variable in afs_next_locker > > filelock: drop the IS_* macros > > filelock: split common fields into struct file_lock_core > > filelock: add coccinelle scripts to move fields to struct > > file_lock_core > > filelock: have fs/locks.c deal with file_lock_core directly > > filelock: convert some internal functions to use file_lock_core > > instead > > filelock: convert more internal functions to use file_lock_core > > filelock: make posix_same_owner take file_lock_core pointers > > filelock: convert posix_owner_key to take file_lock_core arg > > filelock: make locks_{insert,delete}_global_locks take file_lock_core > > arg > > filelock: convert locks_{insert,delete}_global_blocked > > filelock: make __locks_delete_block and __locks_wake_up_blocks take > > file_lock_core > > filelock: convert __locks_insert_block, conflict and deadlock checks > > to use file_lock_core > > filelock: convert fl_blocker to file_lock_core > > filelock: clean up locks_delete_block internals > > filelock: reorganize locks_delete_block and __locks_insert_block > > filelock: make assign_type helper take a file_lock_core pointer > > filelock: convert locks_wake_up_blocks to take a file_lock_core > > pointer > > filelock: convert locks_insert_lock_ctx and locks_delete_lock_ctx > > filelock: convert locks_translate_pid to take file_lock_core > > filelock: conv
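For illustration, the kind of helpers being discussed might look like the sketch below. The flc_ field names follow the series and the fl_core member name follows the wording of this thread, but the helper names are made up rather than taken from any posted patch.

/* Sketch only: accessors so API consumers don't dereference fl_core directly. */
static inline unsigned int lock_flags(const struct file_lock *fl)
{
	return fl->fl_core.flc_flags;
}

static inline unsigned char lock_type(const struct file_lock *fl)
{
	return fl->fl_core.flc_type;
}

/* And a list-walking primitive so callers stop spelling out flc_list. */
#define for_each_file_lock(fl, head) \
	list_for_each_entry((fl), (head), fl_core.flc_list)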
[PATCH RFC v3 35/35] HACK! arm64: dts: Add fake tag storage to fvp-base-revc.dts
Faking a tag storage region for FVP is useful for testing. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch, not intended to be merged. arch/arm64/boot/dts/arm/fvp-base-revc.dts | 42 +-- 1 file changed, 39 insertions(+), 3 deletions(-) diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts index 60472d65a355..e9f44420cb62 100644 --- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts +++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts @@ -165,10 +165,30 @@ C1_L2: l2-cache1 { }; }; - memory@8000 { + memory0: memory@8000 { device_type = "memory"; - reg = <0x 0x8000 0 0x8000>, - <0x0008 0x8000 0 0x8000>; + reg = <0x00 0x8000 0x00 0x8000>; + numa-node-id = <0x00>; + }; + + /* tags0 */ + tags_memory0: memory@8f800 { + device_type = "memory"; + reg = <0x08 0xf800 0x00 0x400>; + numa-node-id = <0x00>; + }; + + memory1: memory@88000 { + device_type = "memory"; + reg = <0x08 0x8000 0x00 0x7800>; + numa-node-id = <0x01>; + }; + + /* tags1 */ + tags_memory1: memory@8fc0 { + device_type = "memory"; + reg = <0x08 0xfc00 0x00 0x3c0>; + numa-node-id = <0x01>; }; reserved-memory { @@ -183,6 +203,22 @@ vram: vram@1800 { reg = <0x 0x1800 0 0x0080>; no-map; }; + + tags0: tag-storage@8f800 { + compatible = "arm,mte-tag-storage"; + reg = <0x08 0xf800 0x00 0x400>; + block-size = <0x1000>; + tagged-memory = <&memory0>; + reusable; + }; + + tags1: tag-storage@8fc0 { + compatible = "arm,mte-tag-storage"; + reg = <0x08 0xfc00 0x00 0x3c0>; + block-size = <0x1000>; + tagged-memory = <&memory1>; + reusable; + }; }; gic: interrupt-controller@2f00 { -- 2.43.0
[PATCH RFC v3 34/35] arm64: mte: Enable dynamic tag storage management
Everything is in place, enable tag storage management. Signed-off-by: Alexandru Elisei --- arch/arm64/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 088e30fc6d12..95c153705a2c 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2084,7 +2084,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE - bool + bool "MTE tag storage management" select ARCH_HAS_FAULT_ON_ACCESS select CONFIG_CMA help -- 2.43.0
[PATCH RFC v3 33/35] KVM: arm64: mte: Introduce VM_MTE_KVM VMA flag
Tag storage pages mapped by the host in a VM with MTE enabled are migrated when they are first accessed by the guest. This introduces latency spikes for memory accesses made by the guest. Tag storage pages can be mapped in the guest memory when the VM_MTE VMA flag is not set. Introduce a new VMA flag, VM_MTE_KVM, to stop tag storage pages from being mapped in a VM with MTE enabled. The flag is different from VM_MTE, because the pages from the VMA won't be mapped as tagged in the host, and host's userspace can continue to access the guest memory as Untagged. The flag's only function is to instruct the page allocator to treat the allocation as tagged, so tag storage pages aren't used. The page allocator will also try to reserve tag storage for the new page, which can speed up stage 2 aborts further if the VMM has accessed the memory before the guest. For example, qemu and kvmtool will benefit from this change because the guest image is copied after the memslot is created. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. arch/arm64/kvm/mmu.c | 77 ++- arch/arm64/mm/fault.c | 2 +- include/linux/mm.h| 2 ++ 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 986a9544228d..45c57c4b9fe2 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1420,7 +1420,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, unsigned long mmu_seq; struct kvm *kvm = vcpu->kvm; struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; - struct vm_area_struct *vma; + struct vm_area_struct *vma, *old_vma; short vma_shift; gfn_t gfn; kvm_pfn_t pfn; @@ -1428,6 +1428,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, long vma_pagesize, fault_granule; enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; struct kvm_pgtable *pgt; + bool vma_has_kvm_mte = false; if (fault_is_perm) fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu); @@ -1506,6 +1507,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, gfn = fault_ipa >> PAGE_SHIFT; mte_allowed = kvm_vma_mte_allowed(vma); + vma_has_kvm_mte = !!(vma->vm_flags & VM_MTE_KVM); + old_vma = vma; /* Don't use the VMA after the unlock -- it may have vanished */ vma = NULL; @@ -1521,6 +1524,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, mmu_seq = vcpu->kvm->mmu_invalidate_seq; mmap_read_unlock(current->mm); + /* +* If the VMA was created after the memslot, it doesn't have the +* VM_MTE_KVM flag set. +*/ + if (unlikely(tag_storage_enabled() && !fault_is_perm && + kvm_has_mte(kvm) && mte_allowed && !vma_has_kvm_mte)) { + mmap_write_lock(current->mm); + vma = vma_lookup(current->mm, hva); + /* The VMA was changed, replay the fault. 
*/ + if (vma != old_vma) { + mmap_write_unlock(current->mm); + return 0; + } + if (!(vma->vm_flags & VM_MTE_KVM)) { + vma_start_write(vma); + vm_flags_reset(vma, vma->vm_flags | VM_MTE_KVM); + } + vma = NULL; + mmap_write_unlock(current->mm); + } + pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL, write_fault, &writable, NULL); @@ -1986,6 +2010,40 @@ int __init kvm_mmu_init(u32 *hyp_va_bits) return err; } +static int kvm_set_clear_kvm_mte_vma(const struct kvm_memory_slot *memslot, bool set) +{ + struct vm_area_struct *vma; + hva_t hva, memslot_end; + int ret = 0; + + hva = memslot->userspace_addr; + memslot_end = hva + (memslot->npages << PAGE_SHIFT); + + mmap_write_lock(current->mm); + + do { + vma = find_vma_intersection(current->mm, hva, memslot_end); + if (!vma) + break; + if (!kvm_vma_mte_allowed(vma)) + continue; + if (set) { + if (!(vma->vm_flags & VM_MTE_KVM)) { + vma_start_write(vma); + vm_flags_reset(vma, vma->vm_flags | VM_MTE_KVM); + } + } else if (vma->vm_flags & VM_MTE_KVM) { + vma_start_write(vma); + vm_flags_reset(vma, vma->vm_flags & ~VM_MTE_KVM); + } + hva = min(memslot_end, vma->vm_end); + } while (hva < memslot_end); + + mmap_write_unlock(current->mm); + + return ret; +} + void kvm_arch_commit_memory_region(struct kvm *kvm,
[PATCH RFC v3 32/35] KVM: arm64: mte: Reserve tag storage for virtual machines with MTE
KVM allows MTE enabled VMs to be created when the backing VMA does not have MTE enabled. As a result, pages allocated for the virtual machine's memory won't have tag storage reserved. Try to reserve tag storage the first time the page is accessed by the guest. This is similar to how pages mapped without tag storage in an MTE VMA are handled. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. arch/arm64/include/asm/mte_tag_storage.h | 10 ++ arch/arm64/include/asm/pgtable.h | 7 +++- arch/arm64/kvm/mmu.c | 43 arch/arm64/mm/fault.c| 2 +- 4 files changed, 60 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h index 40590a8c3748..32940ef7bcdf 100644 --- a/arch/arm64/include/asm/mte_tag_storage.h +++ b/arch/arm64/include/asm/mte_tag_storage.h @@ -34,6 +34,8 @@ void free_tag_storage(struct page *page, int order); bool page_tag_storage_reserved(struct page *page); bool page_is_tag_storage(struct page *page); +int replace_folio_with_tagged(struct folio *folio); + vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf, bool *map_pte); vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page); @@ -67,6 +69,14 @@ static inline bool page_tag_storage_reserved(struct page *page) { return true; } +static inline bool page_is_tag_storage(struct page *page) +{ + return false; +} +static inline int replace_folio_with_tagged(struct folio *folio) +{ + return -EINVAL; +} #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index d0473538c926..7f89606ad617 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1108,7 +1108,12 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) #define __HAVE_ARCH_FREE_PAGES_PREPARE static inline void arch_free_pages_prepare(struct page *page, int order) { - if (tag_storage_enabled() && page_mte_tagged(page)) + /* +* KVM can free a page after tag storage has been reserved and before is +* marked as tagged, hence use page_tag_storage_reserved() instead of +* page_mte_tagged() to check for tag storage. +*/ + if (tag_storage_enabled() && page_tag_storage_reserved(page)) free_tag_storage(page, order); } diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index b7517c4a19c4..986a9544228d 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1361,6 +1361,8 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, if (!kvm_has_mte(kvm)) return; + WARN_ON_ONCE(tag_storage_enabled() && !page_tag_storage_reserved(pfn_to_page(pfn))); + for (i = 0; i < nr_pages; i++, page++) { if (try_page_mte_tagging(page)) { mte_clear_page_tags(page_address(page)); @@ -1374,6 +1376,39 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma) return vma->vm_flags & VM_MTE_ALLOWED; } +/* + * Called with an elevated reference on the pfn. If successful, the reference + * count is not changed. If it returns an error, the elevated reference is + * dropped. 
+ */ +static int kvm_mte_reserve_tag_storage(kvm_pfn_t pfn) +{ + struct folio *folio; + int ret; + + folio = page_folio(pfn_to_page(pfn)); + + if (page_tag_storage_reserved(folio_page(folio, 0))) +return 0; + + if (page_is_tag_storage(folio_page(folio, 0))) + goto migrate; + + ret = reserve_tag_storage(folio_page(folio, 0), folio_order(folio), + GFP_HIGHUSER_MOVABLE); + if (!ret) + return 0; + +migrate: + replace_folio_with_tagged(folio); + /* +* If migration succeeds, the fault needs to be replayed because 'pfn' +* has been unmapped. If migration fails, KVM will try to reserve tag +* storage again by replaying the fault. +*/ + return -EAGAIN; +} + static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_memory_slot *memslot, unsigned long hva, bool fault_is_perm) @@ -1488,6 +1523,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL, write_fault, &writable, NULL); + if (pfn == KVM_PFN_ERR_HWPOISON) { kvm_send_hwpoison_signal(hva, vma_shift); return 0; @@ -1518,6 +1554,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (exec_fault && device) return -ENOEXEC; +
[PATCH RFC v3 31/35] khugepaged: arm64: Don't collapse MTE enabled VMAs
copy_user_highpage() will do memory allocation if there are saved tags for the destination page, and the page is missing tag storage. After commit a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s"), collapse_huge_page() calls __collapse_huge_page_copy() -> .. -> copy_user_highpage() with the RCU lock held, which means that copy_user_highpage() can only allocate memory using GFP_ATOMIC or equivalent. Get around this by refusing to collapse pages into a transparent huge page if the VMA is MTE-enabled. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. I think an agreement on whether copy*_user_highpage() should be always allowed to sleep, or should not be allowed, would be useful. arch/arm64/include/asm/pgtable.h| 3 +++ arch/arm64/kernel/mte_tag_storage.c | 5 + include/linux/khugepaged.h | 5 + mm/khugepaged.c | 4 4 files changed, 17 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 87ae59436162..d0473538c926 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1120,6 +1120,9 @@ static inline bool arch_alloc_cma(gfp_t gfp_mask) return true; } +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address); +#define arch_hugepage_vma_revalidate arch_hugepage_vma_revalidate + #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */ #endif /* CONFIG_ARM64_MTE */ diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index ac7b9c9c585c..a99959b70573 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -636,3 +636,8 @@ void arch_alloc_page(struct page *page, int order, gfp_t gfp) if (tag_storage_enabled() && alloc_requires_tag_storage(gfp)) reserve_tag_storage(page, order, gfp); } + +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address) +{ + return !(vma->vm_flags & VM_MTE); +} diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index f68865e19b0b..461e4322dff2 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -38,6 +38,11 @@ static inline void khugepaged_exit(struct mm_struct *mm) if (test_bit(MMF_VM_HUGEPAGE, &mm->flags)) __khugepaged_exit(mm); } + +#ifndef arch_hugepage_vma_revalidate +#define arch_hugepage_vma_revalidate(vma, address) 1 +#endif + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm) { diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2b219acb528e..cb9a9ddb4d86 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -935,6 +935,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, */ if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))) return SCAN_PAGE_ANON; + + if (!arch_hugepage_vma_revalidate(vma, address)) + return SCAN_VMA_CHECK; + return SCAN_SUCCEED; } -- 2.43.0
[PATCH RFC v3 30/35] arm64: mte: ptrace: Handle pages with missing tag storage
A page can end up mapped in a MTE enabled VMA without the corresponding tag storage block reserved. Tag accesses made by ptrace in this case can lead to the wrong tags being read or memory corruption for the process that is using the tag storage memory as data. Reserve tag storage by treating ptrace accesses like a fault. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch, issue reported by Peter Collingbourne. arch/arm64/kernel/mte.c | 26 -- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c index faf09da3400a..b1fa02dad4fd 100644 --- a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -412,10 +412,13 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr, while (len) { struct vm_area_struct *vma; unsigned long tags, offset; + unsigned int fault_flags; + struct page *page; + vm_fault_t ret; void *maddr; - struct page *page = get_user_page_vma_remote(mm, addr, -gup_flags, &vma); +get_page: + page = get_user_page_vma_remote(mm, addr, gup_flags, &vma); if (IS_ERR(page)) { err = PTR_ERR(page); break; @@ -433,6 +436,25 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr, put_page(page); break; } + + if (tag_storage_enabled() && !page_tag_storage_reserved(page)) { + fault_flags = FAULT_FLAG_DEFAULT | \ + FAULT_FLAG_USER | \ + FAULT_FLAG_REMOTE | \ + FAULT_FLAG_ALLOW_RETRY | \ + FAULT_FLAG_RETRY_NOWAIT; + if (write) + fault_flags |= FAULT_FLAG_WRITE; + + put_page(page); + ret = handle_mm_fault(vma, addr, fault_flags, NULL); + if (ret & VM_FAULT_ERROR) { + err = -EFAULT; + break; + } + goto get_page; + } + WARN_ON_ONCE(!page_mte_tagged(page)); /* limit access to the end of the page */ -- 2.43.0
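The user-visible path that reaches __access_remote_tags() is the arm64-specific PTRACE_PEEKMTETAGS request. Below is a hedged userspace sketch of that path with minimal error handling; the request number and PROT_MTE fallback defines are assumptions taken from the arm64 uapi headers, and the iovec convention follows the MTE documentation:

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PTRACE_PEEKMTETAGS
#define PTRACE_PEEKMTETAGS 33	/* arm64-specific, from <asm/ptrace.h> */
#endif
#ifndef PROT_MTE
#define PROT_MTE 0x20		/* arm64-specific, from <asm/mman.h> */
#endif

int main(void)
{
	unsigned char tags[16];	/* one tag byte per 16-byte granule */
	struct iovec iov = { .iov_base = tags, .iov_len = sizeof(tags) };
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_MTE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pid_t child;

	if (p == MAP_FAILED) {
		perror("mmap(PROT_MTE)");
		return 1;
	}

	child = fork();
	if (child == 0) {
		/* Tracee: inherits the PROT_MTE mapping, then stops. */
		ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		raise(SIGSTOP);
		_exit(0);
	}

	waitpid(child, NULL, 0);

	/* Reading the tags goes through __access_remote_tags(), which now
	 * faults in the page and its tag storage before copying the tags. */
	if (ptrace(PTRACE_PEEKMTETAGS, child, p, &iov) == -1)
		perror("PTRACE_PEEKMTETAGS");
	else
		printf("read %zu tag bytes\n", iov.iov_len);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}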
[PATCH RFC v3 29/35] arm64: mte: copypage: Handle tag restoring when missing tag storage
There are several situations where copy_highpage() can end up copying tags to a page which doesn't have its tag storage reserved. One situation involves migration racing with mprotect(PROT_MTE): VMA is initially untagged, migration starts and destination page is allocated as untagged, mprotect(PROT_MTE) changes the VMA to tagged and userspace accesses the source page, thus making it tagged. The migration code then calls copy_highpage(), which will copy the tags from the source page (now tagged) to the destination page (allocated as untagged). Yet another situation can happen during THP collapse. The huge page that will replace the HPAGE_PMD_NR contiguous mapped pages is allocated with __GFP_TAGGED not set. copy_highpage() will copy the tags from the pages being replaced to the huge page which doesn't have tag storage reserved. The situation gets even more complicated when the replacement huge page is a tag storage page. The tag storage huge page will be migrated after a fault on access, but the tags from the original pages must be copied over to the huge page that will be replacing the tag storage huge page. Signed-off-by: Alexandru Elisei --- arch/arm64/mm/copypage.c | 56 1 file changed, 56 insertions(+) diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c index a7bb20055ce0..e991ccb43fb7 100644 --- a/arch/arm64/mm/copypage.c +++ b/arch/arm64/mm/copypage.c @@ -13,6 +13,59 @@ #include #include #include +#include + +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE +static inline bool try_transfer_saved_tags(struct page *from, struct page *to) +{ + void *tags; + bool saved; + + VM_WARN_ON_ONCE(!preemptible()); + + if (page_mte_tagged(from)) { + if (page_tag_storage_reserved(to)) + return false; + + tags = mte_allocate_tag_buf(); + if (WARN_ON(!tags)) + return true; + + mte_copy_page_tags_to_buf(page_address(from), tags); + saved = mte_save_tags_for_pfn(tags, page_to_pfn(to)); + if (!saved) + mte_free_tag_buf(tags); + + return saved; + } + + tags_by_pfn_lock(); + tags = mte_erase_tags_for_pfn(page_to_pfn(from)); + tags_by_pfn_unlock(); + + if (likely(!tags)) + return false; + + if (page_tag_storage_reserved(to)) { + WARN_ON_ONCE(!try_page_mte_tagging(to)); + mte_copy_page_tags_from_buf(page_address(to), tags); + set_page_mte_tagged(to); + mte_free_tag_buf(tags); + return true; + } + + saved = mte_save_tags_for_pfn(tags, page_to_pfn(to)); + if (!saved) + mte_free_tag_buf(tags); + + return saved; +} +#else +static inline bool try_transfer_saved_tags(struct page *from, struct page *to) +{ + return false; +} +#endif void copy_highpage(struct page *to, struct page *from) { @@ -24,6 +77,9 @@ void copy_highpage(struct page *to, struct page *from) if (kasan_hw_tags_enabled()) page_kasan_tag_reset(to); + if (tag_storage_enabled() && try_transfer_saved_tags(from, to)) + return; + if (system_supports_mte() && page_mte_tagged(from)) { /* It's a new page, shouldn't have been tagged yet */ WARN_ON_ONCE(!try_page_mte_tagging(to)); -- 2.43.0
[PATCH RFC v3 28/35] arm64: mte: swap: Handle tag restoring when missing tag storage
Linux restores tags when a page is swapped in and there are tags associated with the swap entry which the new page will replace. The saved tags are restored even if the page will not be mapped as tagged, to protect against cases where the page is shared between different VMAs, and is tagged in some, but untagged in others. By using this approach, the process can still access the correct tags following an mprotect(PROT_MTE) on the non-MTE enabled VMA. But this poses a challenge for managing tag storage: in the scenario above, when a new page is allocated to be swapped in for the process where it will be mapped as untagged, the corresponding tag storage block is not reserved. mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will overwrite data in the tag storage block associated with the new page, leading to data corruption if the block is in use by a process. Get around this issue by saving the tags in a new xarray, this time indexed by the page pfn, and then restoring them when tag storage is reserved for the page. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * Restore saved tags **before** setting the PG_tag_storage_reserved bit to eliminate a brief window of opportunity where userspace can access uninitialized tags (Peter Collingbourne). arch/arm64/include/asm/mte_tag_storage.h | 8 ++ arch/arm64/include/asm/pgtable.h | 11 +++ arch/arm64/kernel/mte_tag_storage.c | 12 ++- arch/arm64/mm/mteswap.c | 110 +++ 4 files changed, 140 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h index 50bdae94cf71..40590a8c3748 100644 --- a/arch/arm64/include/asm/mte_tag_storage.h +++ b/arch/arm64/include/asm/mte_tag_storage.h @@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page); vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf, bool *map_pte); +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page); + +void tags_by_pfn_lock(void); +void tags_by_pfn_unlock(void); + +void *mte_erase_tags_for_pfn(unsigned long pfn); +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn); +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order); #else static inline bool tag_storage_enabled(void) { diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 0174e292f890..87ae59436162 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type) mte_invalidate_tags_area_by_swp_entry(type); } +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE +#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE +static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry, + struct folio *folio) +{ + if (tag_storage_enabled()) + return mte_try_transfer_swap_tags(entry, &folio->page); + return 0; +} +#endif + #define __HAVE_ARCH_SWAP_RESTORE static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) { diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index afe2bb754879..ac7b9c9c585c 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp) } } + mte_restore_tags_for_pfn(page_to_pfn(page), order); page_set_tag_storage_reserved(page, order); out_unlock: mutex_unlock(&tag_blocks_lock); @@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order) struct tag_region 
*region; unsigned long page_va; unsigned long flags; - int ret; + void *tags; + int i, ret; ret = tag_storage_find_block(page, &start_block, ®ion); if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page))) @@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order) /* Avoid writeback of dirty tag cache lines corrupting data. */ dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order)); + tags_by_pfn_lock(); + for (i = 0; i < (1 << order); i++) { + tags = mte_erase_tags_for_pfn(page_to_pfn(page + i)); + if (unlikely(tags)) + mte_free_tag_buf(tags); + } + tags_by_pfn_unlock(); + end_block = start_block + order_to_num_blocks(order, region->block_size_pages); xa_lock_irqsave(&tag_blocks_reserved, flags); diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c index 2a43746b803f..e11495fa3c18 100644 --- a/arch/arm64/mm/mteswap.c +++ b/arch/arm64/mm/mteswap.c @@ -20,6 +20,112 @@ void mte_free_tag_buf(void *buf) kfree(buf); } +#ifdef CONFIG_
[PATCH RFC v3 27/35] arm64: mte: Handle tag storage pages mapped in an MTE VMA
Tag stoarge pages cannot be tagged. When such a page is mapped in a MTE-enabled VMA, migrate it out directly and don't try to reserve tag storage for it. Signed-off-by: Alexandru Elisei --- arch/arm64/include/asm/mte_tag_storage.h | 1 + arch/arm64/kernel/mte_tag_storage.c | 15 +++ arch/arm64/mm/fault.c| 11 +-- 3 files changed, 25 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h index 6d0f6ffcfdd6..50bdae94cf71 100644 --- a/arch/arm64/include/asm/mte_tag_storage.h +++ b/arch/arm64/include/asm/mte_tag_storage.h @@ -32,6 +32,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp); void free_tag_storage(struct page *page, int order); bool page_tag_storage_reserved(struct page *page); +bool page_is_tag_storage(struct page *page); vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf, bool *map_pte); diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 1c8469781870..afe2bb754879 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -492,6 +492,21 @@ bool page_tag_storage_reserved(struct page *page) return test_bit(PG_tag_storage_reserved, &page->flags); } +bool page_is_tag_storage(struct page *page) +{ + unsigned long pfn = page_to_pfn(page); + struct range *tag_range; + int i; + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + if (tag_range->start <= pfn && pfn <= tag_range->end) + return true; + } + + return false; +} + int reserve_tag_storage(struct page *page, int order, gfp_t gfp) { unsigned long start_block, end_block; diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 1db3adb6499f..01450ab91a87 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -1014,6 +1014,7 @@ static int replace_folio_with_tagged(struct folio *folio) vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf, bool *map_pte) { + bool is_tag_storage = page_is_tag_storage(folio_page(folio, 0)); struct vm_area_struct *vma = vmf->vma; int ret = 0; @@ -1033,12 +1034,18 @@ vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault if (unlikely(is_migrate_isolate_page(folio_page(folio, 0 goto out_retry; - ret = reserve_tag_storage(folio_page(folio, 0), folio_order(folio), GFP_HIGHUSER_MOVABLE); - if (ret) { + if (!is_tag_storage) { + ret = reserve_tag_storage(folio_page(folio, 0), folio_order(folio), + GFP_HIGHUSER_MOVABLE); + if (!ret) + goto out_map; + /* replace_folio_with_tagged() is expensive, try to avoid it. */ if (fault_flag_allow_retry_first(vmf->flags)) goto out_retry; + } + if (ret || is_tag_storage) { replace_folio_with_tagged(folio); return 0; } -- 2.43.0
[PATCH RFC v3 26/35] arm64: mte: Use fault-on-access to reserve missing tag storage
There are three situations in which a page that is to be mapped as tagged doesn't have the corresponding tag storage reserved: * reserve_tag_storage() failed. * The allocation didn't specifiy __GFP_TAGGED (this can happen during migration, for example). * The page was mapped in a non-MTE enabled VMA, then an mprotect(PROT_MTE) enabled MTE. If a page that is about to be mapped as tagged doesn't have tag storage reserved, map it with the PAGE_FAULT_ON_ACCESS protection to trigger a fault next time they are accessed, and then reserve tag storage when the fault is handled. If tag storage cannot be reserved, then the page is migrated out of the VMA. Tag storage pages (which cannot be tagged) mapped in an MTE enabled MTE will be handled in a subsequent patch. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch, loosely based on the arm64 code from the rfc v2 patch #19 ("mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)") * All the common code has been moved back to the arch independent function handle_{huge_pmd,pte}_protnone() (David Hildenbrand). * Page is migrated if tag storage cannot be reserved after exhausting all attempts (Hyesoo Yu). * Moved folio_isolate_lru() declaration and struct migration_target_control to headers in include/linux (Peter Collingbourne). arch/arm64/Kconfig | 1 + arch/arm64/include/asm/mte.h | 4 +- arch/arm64/include/asm/mte_tag_storage.h | 3 + arch/arm64/include/asm/pgtable-prot.h| 2 + arch/arm64/include/asm/pgtable.h | 44 --- arch/arm64/kernel/mte.c | 11 ++- arch/arm64/mm/fault.c| 98 include/linux/memcontrol.h | 2 + include/linux/migrate.h | 8 +- include/linux/migrate_mode.h | 1 + mm/internal.h| 6 -- 11 files changed, 156 insertions(+), 24 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6f65e9005dc9..088e30fc6d12 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2085,6 +2085,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool + select ARCH_HAS_FAULT_ON_ACCESS select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index 6457b7899207..70dc2e409070 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page) } void mte_zero_clear_page_tags(void *addr); -void mte_sync_tags(pte_t pte, unsigned int nr_pages); +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages); void mte_copy_page_tags(void *kto, const void *kfrom); void mte_thread_init_user(void); void mte_thread_switch(struct task_struct *next); @@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page) static inline void mte_zero_clear_page_tags(void *addr) { } -static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages) +static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages) { } static inline void mte_copy_page_tags(void *kto, const void *kfrom) diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h index 423b19e0cc46..6d0f6ffcfdd6 100644 --- a/arch/arm64/include/asm/mte_tag_storage.h +++ b/arch/arm64/include/asm/mte_tag_storage.h @@ -32,6 +32,9 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp); void free_tag_storage(struct page *page, int order); bool page_tag_storage_reserved(struct page *page); + +vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf, + bool *map_pte); #else static inline 
bool tag_storage_enabled(void) { diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h index 483dbfa39c4c..1820e29244f8 100644 --- a/arch/arm64/include/asm/pgtable-prot.h +++ b/arch/arm64/include/asm/pgtable-prot.h @@ -19,6 +19,7 @@ #define PTE_SPECIAL(_AT(pteval_t, 1) << 56) #define PTE_DEVMAP (_AT(pteval_t, 1) << 57) #define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */ +#define PTE_TAG_STORAGE_NONE (_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */ /* * This bit indicates that the entry is present i.e. pmd_page() @@ -96,6 +97,7 @@ extern bool arm64_use_ng_mappings; }) #define PAGE_NONE __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN) +#define PAGE_FAULT_ON_ACCESS __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN) /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */ #define PAGE_SHARED__pgprot(_PAGE_SHARED
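A standalone sketch of the PTE encoding introduced above, not kernel code: PAGE_FAULT_ON_ACCESS reuses the PTE_PROT_NONE encoding (PTE_VALID clear, bit 58 set) and adds bit 60 (PTE_TAG_STORAGE_NONE) so the arm64 fault handler can tell a reserve-tag-storage fault apart from an ordinary PROT_NONE/NUMA-hinting fault. Bit positions follow the hunk above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_VALID		(1ULL << 0)
#define PTE_PROT_NONE		(1ULL << 58)
#define PTE_TAG_STORAGE_NONE	(1ULL << 60)

static bool pte_fault_on_access(uint64_t pte)
{
	return !(pte & PTE_VALID) && (pte & PTE_PROT_NONE) &&
	       (pte & PTE_TAG_STORAGE_NONE);
}

int main(void)
{
	uint64_t numa_hint_pte = PTE_PROT_NONE;
	uint64_t tag_fault_pte = PTE_PROT_NONE | PTE_TAG_STORAGE_NONE;

	printf("NUMA hinting pte     -> fault-on-access? %d\n",
	       pte_fault_on_access(numa_hint_pte));
	printf("PAGE_FAULT_ON_ACCESS -> fault-on-access? %d\n",
	       pte_fault_on_access(tag_fault_pte));
	return 0;
}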
[PATCH RFC v3 25/35] arm64: mte: Reserve tag block for the zero page
On arm64, when a page is mapped as tagged, its tags are zeroed for two reasons: * To prevent leakage of tags to userspace. * To allow userspace to access the contents of the page without having to set the tags explicitly (bits 59:56 of a userspace pointer are zero, which correspond to tag 0b). The zero page receives special treatment, as the tags for the zero page are zeroed when the MTE feature is being enabled. This is done for performance reasons - the tags are zeroed once, instead of every time the page is mapped. When the tags for the zero page are zeroed, tag storage is not yet enabled. Reserve tag storage for the page immediately after tag storage management becomes enabled. Note that zeroing tags before tag storage management is enabled is safe to do because the tag storage pages are reserved at that point. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * Expanded commit message (David Hildenbrand) arch/arm64/kernel/mte_tag_storage.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 8c347f4855e4..1c8469781870 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -363,6 +363,8 @@ static int __init mte_enable_tag_storage(void) goto out_disabled; } + reserve_tag_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER); + static_branch_enable(&tag_storage_enabled_key); pr_info("MTE tag storage region management enabled"); -- 2.43.0
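A hedged userspace sketch of why the zero page needs valid (zeroed) tags: a read fault on a fresh PROT_MTE anonymous mapping maps in the shared zero page, and with synchronous tag checking the access only succeeds because the zero page's tags match the untagged pointer. Assumes an arm64 kernel with MTE; the prctl constants come from <linux/prctl.h> and the PROT_MTE fallback is taken from the arm64 uapi headers:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

#ifndef PROT_MTE
#define PROT_MTE 0x20	/* arm64-specific, from <asm/mman.h> */
#endif

int main(void)
{
	volatile char *p;

	/* Enable tagged addressing and synchronous MTE tag checks. */
	if (prctl(PR_SET_TAGGED_ADDR_CTRL,
		  PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC, 0, 0, 0))
		perror("prctl(PR_SET_TAGGED_ADDR_CTRL)");

	p = mmap(NULL, 4096, PROT_READ | PROT_MTE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(PROT_MTE)");
		return 1;
	}

	/* The read below faults in the shared zero page; it passes the tag
	 * check only because the zero page's tags are zero. */
	printf("read from zero page: %d\n", p[0]);
	return 0;
}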
[PATCH RFC v3 24/35] arm64: mte: Perform CMOs for tag blocks
Make sure the contents of the tag storage block is not corrupted by performing: 1. A tag dcache inval when the associated tagged pages are freed, to avoid dirty tag cache lines being evicted and corrupting the tag storage block when it's being used to store data. 2. A data cache inval when the tag storage block is being reserved, to ensure that no dirty data cache lines are present, which would trigger a writeback that could corrupt the tags stored in the block. Signed-off-by: Alexandru Elisei --- arch/arm64/include/asm/assembler.h | 10 ++ arch/arm64/include/asm/mte_tag_storage.h | 2 ++ arch/arm64/kernel/mte_tag_storage.c | 11 +++ arch/arm64/lib/mte.S | 16 4 files changed, 39 insertions(+) diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h index 513787e43329..65fe88cce72b 100644 --- a/arch/arm64/include/asm/assembler.h +++ b/arch/arm64/include/asm/assembler.h @@ -310,6 +310,16 @@ alternative_cb_end lsl \reg, \reg, \tmp// actual cache line size .endm +/* + * tcache_line_size - get the safe tag cache line size across all CPUs + */ + .macro tcache_line_size, reg, tmp + read_ctr\tmp + ubfm\tmp, \tmp, #32, #37// tag cache line size encoding + mov \reg, #4// bytes per word + lsl \reg, \reg, \tmp// actual tag cache line size + .endm + /* * raw_icache_line_size - get the minimum I-cache line size on this CPU * from the CTR register. diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h index 09f1318d924e..423b19e0cc46 100644 --- a/arch/arm64/include/asm/mte_tag_storage.h +++ b/arch/arm64/include/asm/mte_tag_storage.h @@ -11,6 +11,8 @@ #include +extern void dcache_inval_tags_poc(unsigned long start, unsigned long end); + #ifdef CONFIG_ARM64_MTE_TAG_STORAGE DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 762c7c803a70..8c347f4855e4 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -17,6 +17,7 @@ #include #include +#include #include __ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key); @@ -421,8 +422,13 @@ static bool tag_storage_block_is_reserved(unsigned long block) static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order) { + unsigned long block_va; int ret; + block_va = (unsigned long)page_to_virt(pfn_to_page(block)); + /* Avoid writeback of dirty data cache lines corrupting tags. */ + dcache_inval_poc(block_va, block_va + region->block_size_pages * PAGE_SIZE); + ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL)); if (!ret) block_ref_add(block, region, order); @@ -570,6 +576,7 @@ void free_tag_storage(struct page *page, int order) { unsigned long block, start_block, end_block; struct tag_region *region; + unsigned long page_va; unsigned long flags; int ret; @@ -577,6 +584,10 @@ void free_tag_storage(struct page *page, int order) if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page))) return; + page_va = (unsigned long)page_to_virt(page); + /* Avoid writeback of dirty tag cache lines corrupting data. 
*/ + dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order)); + end_block = start_block + order_to_num_blocks(order, region->block_size_pages); xa_lock_irqsave(&tag_blocks_reserved, flags); diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S index 9f623e9da09f..bc02b4e95062 100644 --- a/arch/arm64/lib/mte.S +++ b/arch/arm64/lib/mte.S @@ -175,3 +175,19 @@ SYM_FUNC_START(mte_copy_page_tags_from_buf) ret SYM_FUNC_END(mte_copy_page_tags_from_buf) + +/* + * dcache_inval_tags_poc(start, end) + * + * Ensure that any tags in the D-cache for the interval [start, end) + * are invalidated to PoC. + * + * - start - virtual start address of region + * - end - virtual end address of region + */ +SYM_FUNC_START(__pi_dcache_inval_tags_poc) + tcache_line_size x2, x3 + dcache_by_myline_op igvac, sy, x0, x1, x2, x3 + ret +SYM_FUNC_END(__pi_dcache_inval_tags_poc) +SYM_FUNC_ALIAS(dcache_inval_tags_poc, __pi_dcache_inval_tags_poc) -- 2.43.0
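For illustration, the tcache_line_size macro above computes 4 << CTR_EL0.TminLine bytes. Here is a small userspace sketch performing the same decode; it only builds and runs on arm64, where Linux allows (or emulates) EL0 reads of CTR_EL0:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t ctr;
	unsigned int tminline;

	/* CTR_EL0 is readable (or emulated) at EL0 on Linux. */
	__asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));

	/* TminLine is bits [37:32], encoded as log2(words). */
	tminline = (ctr >> 32) & 0x3f;
	printf("safe tag cache line size: %u bytes\n", 4u << tminline);
	return 0;
}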
[PATCH RFC v3 23/35] arm64: mte: Try to reserve tag storage in arch_alloc_page()
Reserve tag storage for a page that is being allocated as tagged. This is a best effort approach, and failing to reserve tag storage is allowed. When all the associated tagged pages have been freed, return the tag storage pages back to the page allocator, where they can be used again for data allocations. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * Based on rfc v2 patch #16 ("arm64: mte: Manage tag storage on page allocation"). * Fixed calculation of the number of associated tag storage blocks (Hyesoo Yu). * Tag storage is reserved in arch_alloc_page() instead of arch_prep_new_page(). arch/arm64/include/asm/mte.h | 16 +- arch/arm64/include/asm/mte_tag_storage.h | 31 +++ arch/arm64/include/asm/page.h| 5 + arch/arm64/include/asm/pgtable.h | 19 ++ arch/arm64/kernel/mte_tag_storage.c | 234 +++ arch/arm64/mm/fault.c| 7 + fs/proc/page.c | 1 + include/linux/kernel-page-flags.h| 1 + include/linux/page-flags.h | 1 + include/trace/events/mmflags.h | 3 +- mm/huge_memory.c | 1 + 11 files changed, 316 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index 8034695b3dd7..6457b7899207 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf); #ifdef CONFIG_ARM64_MTE /* track which pages have valid allocation tags */ -#define PG_mte_tagged PG_arch_2 +#define PG_mte_tagged PG_arch_2 /* simple lock to avoid multiple threads tagging the same page */ -#define PG_mte_lockPG_arch_3 +#define PG_mte_lockPG_arch_3 +/* Track if a tagged page has tag storage reserved */ +#define PG_tag_storage_reservedPG_arch_4 + +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); +extern bool page_tag_storage_reserved(struct page *page); +#endif static inline void set_page_mte_tagged(struct page *page) { +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE + /* Open code mte_tag_storage_enabled() */ + WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) && +!page_tag_storage_reserved(page)); +#endif /* * Ensure that the tags written prior to this function are visible * before the page flags update. 
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h index 7b3f6bff8e6f..09f1318d924e 100644 --- a/arch/arm64/include/asm/mte_tag_storage.h +++ b/arch/arm64/include/asm/mte_tag_storage.h @@ -5,6 +5,12 @@ #ifndef __ASM_MTE_TAG_STORAGE_H #define __ASM_MTE_TAG_STORAGE_H +#ifndef __ASSEMBLY__ + +#include + +#include + #ifdef CONFIG_ARM64_MTE_TAG_STORAGE DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); @@ -15,6 +21,15 @@ static inline bool tag_storage_enabled(void) } void mte_init_tag_storage(void); + +static inline bool alloc_requires_tag_storage(gfp_t gfp) +{ + return gfp & __GFP_TAGGED; +} +int reserve_tag_storage(struct page *page, int order, gfp_t gfp); +void free_tag_storage(struct page *page, int order); + +bool page_tag_storage_reserved(struct page *page); #else static inline bool tag_storage_enabled(void) { @@ -23,6 +38,22 @@ static inline bool tag_storage_enabled(void) static inline void mte_init_tag_storage(void) { } +static inline bool alloc_requires_tag_storage(struct page *page) +{ + return false; +} +static inline int reserve_tag_storage(struct page *page, int order, gfp_t gfp) +{ + return 0; +} +static inline void free_tag_storage(struct page *page, int order) +{ +} +static inline bool page_tag_storage_reserved(struct page *page) +{ + return true; +} #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */ +#endif /* !__ASSEMBLY__ */ #endif /* __ASM_MTE_TAG_STORAGE_H */ diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h index 88bab032a493..3a656492f34a 100644 --- a/arch/arm64/include/asm/page.h +++ b/arch/arm64/include/asm/page.h @@ -35,6 +35,11 @@ void copy_highpage(struct page *to, struct page *from); void tag_clear_highpage(struct page *to); #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE +void arch_alloc_page(struct page *, int order, gfp_t gfp); +#define HAVE_ARCH_ALLOC_PAGE +#endif + #define clear_user_page(page, vaddr, pg) clear_page(page) #define copy_user_page(to, from, vaddr, pg)copy_page(to, from) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 2499cc4fa4f2..f30466199a9b 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -10,6 +10,7 @@ #include #include +#include #include #include #include @@ -1069,6 +1070,24 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) mte_restore_page_tags_by_swp_entry(en
[PATCH RFC v3 22/35] arm64: mte: Enable tag storage if CMA areas have been activated
Before enabling MTE tag storage management, make sure that the CMA areas have been successfully activated. If a CMA area fails activation, the pages are kept as reserved. Reserved pages are never used by the page allocator. If this happens, the kernel would have to manage tag storage only for some of the memory, but not for all memory, and that would make the code unreasonably complicated. Choose to disable tag storage management altogether if a CMA area fails to be activated. Signed-off-by: Alexandru Elisei --- Changes since v2: * New patch. arch/arm64/include/asm/mte_tag_storage.h | 12 ++ arch/arm64/kernel/mte_tag_storage.c | 50 2 files changed, 62 insertions(+) diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h index 3c2cd29e053e..7b3f6bff8e6f 100644 --- a/arch/arm64/include/asm/mte_tag_storage.h +++ b/arch/arm64/include/asm/mte_tag_storage.h @@ -6,8 +6,20 @@ #define __ASM_MTE_TAG_STORAGE_H #ifdef CONFIG_ARM64_MTE_TAG_STORAGE + +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); + +static inline bool tag_storage_enabled(void) +{ + return static_branch_likely(&tag_storage_enabled_key); +} + void mte_init_tag_storage(void); #else +static inline bool tag_storage_enabled(void) +{ + return false; +} static inline void mte_init_tag_storage(void) { } diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 9a1a8a45171e..d58c68b4a849 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -19,6 +19,8 @@ #include +__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key); + struct tag_region { struct range mem_range; /* Memory associated with the tag storage, in PFNs. */ struct range tag_range; /* Tag storage memory, in PFNs. */ @@ -314,3 +316,51 @@ void __init mte_init_tag_storage(void) num_tag_regions = 0; pr_info("MTE tag storage region management disabled"); } + +static int __init mte_enable_tag_storage(void) +{ + struct range *tag_range; + struct cma *cma; + int i, ret; + + if (num_tag_regions == 0) + return 0; + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + cma = tag_regions[i].cma; + /* +* CMA will keep the pages as reserved when the region fails +* activation. +*/ + if (PageReserved(pfn_to_page(tag_range->start))) + goto out_disabled; + } + + static_branch_enable(&tag_storage_enabled_key); + pr_info("MTE tag storage region management enabled"); + + return 0; + +out_disabled: + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + cma = tag_regions[i].cma; + + if (PageReserved(pfn_to_page(tag_range->start))) + continue; + + /* Try really hard to reserve the tag storage. */ + ret = cma_alloc(cma, range_len(tag_range), 8, true); + /* +* Tag storage is still in use for data, memory and/or tag +* corruption will ensue. +*/ + WARN_ON_ONCE(ret); + } + num_tag_regions = 0; + pr_info("MTE tag storage region management disabled"); + + return -EINVAL; +} +arch_initcall(mte_enable_tag_storage); -- 2.43.0
[PATCH RFC v3 21/35] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
To be able to reserve the tag storage associated with a tagged page requires that the tag storage can be migrated, if it's in use for data. The kernel allocates pages in non-preemptible contexts, which makes migration impossible. The only user of tagged pages in the kernel is HW KASAN, so don't use tag storage pages if HW KASAN is enabled. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * Expanded commit message (David Hildenbrand) arch/arm64/kernel/mte_tag_storage.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 90b157132efa..9a1a8a45171e 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -256,6 +256,16 @@ void __init mte_init_tag_storage(void) goto out_disabled; } + /* +* The kernel allocates memory in non-preemptible contexts, which makes +* migration impossible when reserving the associated tag storage. The +* only in-kernel user of tagged pages is HW KASAN. +*/ + if (kasan_hw_tags_enabled()) { + pr_info("KASAN HW tags incompatible with MTE tag storage management"); + goto out_disabled; + } + /* * Check that tag storage is addressable by the kernel. * cma_init_reserved_mem(), unlike cma_declare_contiguous_nid(), doesn't -- 2.43.0
[PATCH RFC v3 20/35] arm64: mte: Add tag storage memory to CMA
Add the MTE tag storage pages to CMA, which allows the page allocator to manage them like regular pages. The CMA migratype lends the tag storage pages some very desirable properties: * They cannot be longterm pinned, meaning they should always be migratable. * The pages can be allocated explicitely by using their PFN (with alloc_cma_range()) when they are needed to store tags. Signed-off-by: Alexandru Elisei --- Changes since v2: * Reworked from rfc v2 patch #12 ("arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype"). * Tag storage memory is now added to the cma_areas array and will be managed like a regular CMA region (David Hildenbrand). * If a tag storage region spans multiple zones, CMA won't be able to activate the region. Split such regions into multiple tag storage regions (Hyesoo Yu). arch/arm64/Kconfig | 1 + arch/arm64/kernel/mte_tag_storage.c | 150 +++- 2 files changed, 150 insertions(+), 1 deletion(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 92d97930b56e..6f65e9005dc9 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2085,6 +2085,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool + select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware for storing MTE tags. This memory, unlike normal memory, cannot be diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 2f32265d8ad8..90b157132efa 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -5,6 +5,8 @@ * Copyright (C) 2023 ARM Ltd. */ +#include +#include #include #include #include @@ -22,6 +24,7 @@ struct tag_region { struct range tag_range; /* Tag storage memory, in PFNs. */ u32 block_size_pages; /* Tag block size, in pages. */ phandle mem_phandle;/* phandle for the associated memory node. */ + struct cma *cma;/* CMA cookie */ }; #define MAX_TAG_REGIONS32 @@ -139,9 +142,88 @@ static int __init mte_find_tagged_memory_regions(void) return -EINVAL; } +static void __init mte_split_tag_region(struct tag_region *region, unsigned long last_tag_pfn) +{ + struct tag_region *new_region; + unsigned long last_mem_pfn; + + new_region = &tag_regions[num_tag_regions]; + last_mem_pfn = region->mem_range.start + (last_tag_pfn - region->tag_range.start) * 32; + + new_region->mem_range.start = last_mem_pfn + 1; + new_region->mem_range.end = region->mem_range.end; + region->mem_range.end = last_mem_pfn; + + new_region->tag_range.start = last_tag_pfn + 1; + new_region->tag_range.end = region->tag_range.end; + region->tag_range.end = last_tag_pfn; + + new_region->block_size_pages = region->block_size_pages; + + num_tag_regions++; +} + +/* + * Split any tag region that spans multiple zones - CMA will fail if that + * happens. + */ +static int __init mte_split_tag_regions(void) +{ + struct tag_region *region; + struct range *tag_range; + struct zone *zone; + unsigned long pfn; + int i; + + for (i = 0; i < num_tag_regions; i++) { + region = &tag_regions[i]; + tag_range = ®ion->tag_range; + zone = page_zone(pfn_to_page(tag_range->start)); + + for (pfn = tag_range->start + 1; pfn <= tag_range->end; pfn++) { + if (page_zone(pfn_to_page(pfn)) == zone) + continue; + + if (WARN_ON_ONCE(pfn % region->block_size_pages)) + goto out_err; + + if (num_tag_regions == MAX_TAG_REGIONS) + goto out_err; + + mte_split_tag_region(&tag_regions[i], pfn - 1); + /* Move on to the next region. 
*/ + break; + } + } + + return 0; + +out_err: + pr_err("Error splitting tag storage region 0x%llx-0x%llx spanning multiple zones", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end + 1) - 1); + return -EINVAL; +} + void __init mte_init_tag_storage(void) { - int ret; + unsigned long long mem_end; + struct tag_region *region; + unsigned long pfn, order; + u64 start, end; + int i, j, ret; + + /* +* Tag storage memory requires that tag storage pages in use for data +* are always migratable when they need to be repurposed to store tags. +* If ARCH_KEEP_MEMBLOCK is enabled, kexec will not scan reserved +* memblocks when trying to find a suitable location for the kernel +* image. This means that kexec will not use tag storage pages for +* copying the kernel, and the pages will remain migratable. +* +* Add the check in c
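A standalone sketch of the split arithmetic in mte_split_tag_region() above: one tag PFN covers 32 data PFNs, so the data region is split 32 PFNs for every tag PFN before the split point. All PFN values below are illustrative:

#include <stdio.h>

struct range { unsigned long start, end; };

int main(void)
{
	/* 2 GB of data PFNs and the 64 MB of tag PFNs covering them. */
	struct range mem = { .start = 0x100000, .end = 0x17ffff };
	struct range tag = { .start = 0x4000, .end = 0x7fff };
	unsigned long last_tag_pfn = 0x5fff;	/* zone boundary, illustrative */
	unsigned long last_mem_pfn;

	/* Same arithmetic as mte_split_tag_region(): 32 data PFNs per tag PFN. */
	last_mem_pfn = mem.start + (last_tag_pfn - tag.start) * 32;

	printf("first data region:  0x%lx-0x%lx\n", mem.start, last_mem_pfn);
	printf("second data region starts at 0x%lx\n", last_mem_pfn + 1);
	return 0;
}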
[PATCH RFC v3 19/35] arm64: mte: Discover tag storage memory
Allow the kernel to get the base address, size, block size and associated memory node for tag storage from the device tree blob. A tag storage region represents the smallest contiguous memory region that holds all the tags for the associated contiguous memory region which can be tagged. For example, for a 32GB contiguous tagged memory the corresponding tag storage region is exactly 1GB of contiguous memory, not two adjacent 512M of tag storage memory, nor one 2GB tag storage region. Tag storage is described as reserved memory; future patches will teach the kernel how to make use of it for data (non-tagged) allocations. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * Reworked from rfc v2 patch #11 ("arm64: mte: Reserve tag storage memory"). * Added device tree schema (Rob Herring) * Tag storage memory is now described in the "reserved-memory" node (Rob Herring). .../reserved-memory/arm,mte-tag-storage.yaml | 78 + arch/arm64/Kconfig| 12 ++ arch/arm64/include/asm/mte_tag_storage.h | 16 ++ arch/arm64/kernel/Makefile| 1 + arch/arm64/kernel/mte_tag_storage.c | 158 ++ arch/arm64/mm/init.c | 3 + 6 files changed, 268 insertions(+) create mode 100644 Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml create mode 100644 arch/arm64/include/asm/mte_tag_storage.h create mode 100644 arch/arm64/kernel/mte_tag_storage.c diff --git a/Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml b/Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml new file mode 100644 index ..a99aaa1e8b6e --- /dev/null +++ b/Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml @@ -0,0 +1,78 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/reserved-memory/arm,mte-tag-storage.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Tag storage memory for Memory Tagging Extension + +description: | + Description of the tag storage memory region that Linux can use to store + data when the associated memory is not tagged. + + The reserved memory described by the node must also be described by a + standalone 'memory' node. + +maintainers: + - Alexandru Elisei + +allOf: + - $ref: reserved-memory.yaml + +properties: + compatible: +const: arm,mte-tag-storage + + reg: +description: | + Specifies the memory region that MTE uses for tag storage. The size of the + region must be equal to the size needed to store all the tags for the + associated tagged memory. + + block-size: +description: | + Specifies the minimum multiple of 4K bytes of tag storage where all the + tags stored in the block correspond to a contiguous memory region. This + is needed for platforms where the memory controller interleaves tag + writes to memory. + + For example, if the memory controller interleaves tag writes for 256KB + of contiguous memory across 8K of tag storage (2-way interleave), then + the correct value for 'block-size' is 0x2000. + + This value is a hardware property, independent of the selected kernel page + size. +$ref: /schemas/types.yaml#/definitions/uint32 + + tagged-memory: +description: | + Specifies the memory node, as a phandle, for which all the tags are + stored in the tag storage region. + + The memory node must describe one contiguous memory region (i.e, the + 'ranges' property of the memory node must have exactly one entry). 
+$ref: /schemas/types.yaml#/definitions/phandle + +unevaluatedProperties: false + +required: + - compatible + - reg + - block-size + - tagged-memory + - reusable + +examples: + - | +reserved-memory { + #address-cells = <2>; + #size-cells = <2>; + + tags0: tag-storage@8f800 { +compatible = "arm,mte-tag-storage"; +reg = <0x08 0xf800 0x00 0x400>; +block-size = <0x1000>; +tagged-memory = <&memory0>; +reusable; + }; +}; diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index aa7c1d435139..92d97930b56e 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2082,6 +2082,18 @@ config ARM64_MTE Documentation/arch/arm64/memory-tagging-extension.rst. +if ARM64_MTE +config ARM64_MTE_TAG_STORAGE + bool + help + Adds support for dynamic management of the memory used by the hardware + for storing MTE tags. This memory, unlike normal memory, cannot be + tagged. When it is used to store tags for another memory location it + cannot be used for any type of allocation. + + If unsure, say N +endif # ARM64_MTE + endmenu # "ARMv8.5 architectural features" menu "ARMv8.7 architectural features" diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/
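A worked example of the sizing rule stated in the commit message above (32 GB of tagged memory needs exactly 1 GB of tag storage): each 4-bit tag covers a 16-byte granule, so tag storage is 1/32 of the memory it describes. Standalone, for illustration only:

#include <stdio.h>

#define MTE_GRANULE_SIZE	16ULL
#define MTE_TAG_SIZE_BITS	4ULL
#define PAGE_SIZE_4K		4096ULL

int main(void)
{
	unsigned long long ratio = MTE_GRANULE_SIZE * 8 / MTE_TAG_SIZE_BITS;	/* 32 */
	unsigned long long tagged_mem = 32ULL << 30;				/* 32 GB */

	printf("tag bytes per 4K page: %llu\n",
	       PAGE_SIZE_4K / MTE_GRANULE_SIZE * MTE_TAG_SIZE_BITS / 8);
	printf("tag storage for 32 GB of tagged memory: %llu MB\n",
	       tagged_mem / ratio >> 20);
	return 0;
}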
[PATCH RFC v3 18/35] arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
__GFP_ZEROTAGS is used to instruct the page allocator to zero the tags at the same time as the physical frame is zeroed. The name can be slightly misleading, because it doesn't mean that the code will zero the tags unconditionally, but that the tags will be zeroed if and only if the physical frame is also zeroed (either __GFP_ZERO is set or init_on_alloc is 1). Rename it to __GFP_TAGGED, in preparation for it to be used by the page allocator to recognize when an allocation is tagged (has metadata). Signed-off-by: Alexandru Elisei --- arch/arm64/mm/fault.c | 2 +- include/linux/gfp_types.h | 6 +++--- include/trace/events/mmflags.h | 2 +- mm/page_alloc.c| 2 +- mm/shmem.c | 2 +- 5 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 4d3f0a870ad8..c022e473c17c 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -944,7 +944,7 @@ NOKPROBE_SYMBOL(do_debug_exception); gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) { if (vma->vm_flags & VM_MTE) - return __GFP_ZEROTAGS; + return __GFP_TAGGED; return 0; } diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h index 1b6053da8754..f638353ebdc7 100644 --- a/include/linux/gfp_types.h +++ b/include/linux/gfp_types.h @@ -45,7 +45,7 @@ typedef unsigned int __bitwise gfp_t; #define ___GFP_HARDWALL0x10u #define ___GFP_THISNODE0x20u #define ___GFP_ACCOUNT 0x40u -#define ___GFP_ZEROTAGS0x80u +#define ___GFP_TAGGED 0x80u #ifdef CONFIG_KASAN_HW_TAGS #define ___GFP_SKIP_ZERO 0x100u #define ___GFP_SKIP_KASAN 0x200u @@ -226,7 +226,7 @@ typedef unsigned int __bitwise gfp_t; * * %__GFP_ZERO returns a zeroed page on success. * - * %__GFP_ZEROTAGS zeroes memory tags at allocation time if the memory itself + * %__GFP_TAGGED zeroes memory tags at allocation time if the memory itself * is being zeroed (either via __GFP_ZERO or via init_on_alloc, provided that * __GFP_SKIP_ZERO is not set). 
This flag is intended for optimization: setting * memory tags at the same time as zeroing memory has minimal additional @@ -241,7 +241,7 @@ typedef unsigned int __bitwise gfp_t; #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) #define __GFP_COMP ((__force gfp_t)___GFP_COMP) #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) -#define __GFP_ZEROTAGS ((__force gfp_t)___GFP_ZEROTAGS) +#define __GFP_TAGGED ((__force gfp_t)___GFP_TAGGED) #define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO) #define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index d801409b33cf..6ca0d5ed46c0 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -50,7 +50,7 @@ gfpflag_string(__GFP_RECLAIM), \ gfpflag_string(__GFP_DIRECT_RECLAIM), \ gfpflag_string(__GFP_KSWAPD_RECLAIM), \ - gfpflag_string(__GFP_ZEROTAGS) + gfpflag_string(__GFP_TAGGED) #ifdef CONFIG_KASAN_HW_TAGS #define __def_gfpflag_names_kasan ,\ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 502ee3eb8583..0a0118612a13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1480,7 +1480,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order, { bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) && !should_skip_init(gfp_flags); - bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS); + bool zero_tags = init && (gfp_flags & __GFP_TAGGED); int i; set_page_private(page, 0); diff --git a/mm/shmem.c b/mm/shmem.c index 621fabc3b8c6..3e28357b0a40 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1585,7 +1585,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp, */ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) { - gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_ZEROTAGS; + gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_TAGGED; gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY; gfp_t zoneflags = limit_gfp & GFP_ZONEMASK; gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK); -- 2.43.0
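A tiny standalone model of the behaviour described above, mirroring the post_alloc_hook() condition: tags are only cleared when the data itself is being cleared, never because of __GFP_TAGGED alone. The flag values here are illustrative, not the real kernel bit positions:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative flag values, not the real kernel bits. */
#define GFP_ZERO_DEMO	(1u << 0)
#define GFP_TAGGED_DEMO	(1u << 1)

static bool zero_tags(unsigned int gfp, bool init_on_alloc)
{
	/* The page is initialised if either init_on_alloc is enabled or
	 * __GFP_ZERO was passed ... */
	bool init = init_on_alloc || (gfp & GFP_ZERO_DEMO);

	/* ... and tags are cleared only if the page is initialised too. */
	return init && (gfp & GFP_TAGGED_DEMO);
}

int main(void)
{
	printf("__GFP_TAGGED alone:        %d\n", zero_tags(GFP_TAGGED_DEMO, false));
	printf("__GFP_TAGGED | __GFP_ZERO: %d\n",
	       zero_tags(GFP_TAGGED_DEMO | GFP_ZERO_DEMO, false));
	return 0;
}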
[PATCH RFC v3 17/35] arm64: mte: Rework naming for tag manipulation functions
The tag save/restore/copy functions could be more explicit about from where the tags are coming from and where they are being copied to. Renaming the functions to make it easier to understand what they are doing: - Rename the mte_clear_page_tags() 'addr' parameter to 'page_addr', to match the other functions that take a page address as parameter. - Rename mte_save/restore_tags() to mte_save/restore_page_tags_by_swp_entry() to make it clear that they are saved in a collection indexed by swp_entry (this will become important when they will be also saved in a collection indexed by page pfn). Same applies to mte_invalidate_tags{,_area}_by_swp_entry(). - Rename mte_save/restore_page_tags() to make it clear where the tags are going to be saved, respectively from where they are restored - in a previously allocated memory buffer, not in an xarray, like when the tags are saved when swapping. Rename the action to 'copy' instead of 'save'/'restore' to match the copy from user functions, which also copy tags to memory. - Rename mte_allocate/free_tag_storage() to mte_allocate/free_tag_buf() to make it clear the functions have nothing to do with the memory where the corresponding tags for a page live. Change the parameter type for mte_free_tag_buf()) to be void *, to match the return value of mte_allocate_tag_buf(). Also do that because that memory is opaque and it is not meant to be directly deferenced. In the name of consistency rename local variables from tag_storage to tags. Give a similar treatment to the hibernation code that saves and restores the tags for all tagged pages. In the same spirit, rename MTE_PAGE_TAG_STORAGE to MTE_PAGE_TAG_STORAGE_SIZE to make it clear that it relates to the size of the memory needed to save the tags for a page. Oportunistically rename MTE_TAG_SIZE to MTE_TAG_SIZE_BITS to make it clear it is measured in bits, not bytes, like the rest of the size variable from the same header file. 
Signed-off-by: Alexandru Elisei --- arch/arm64/include/asm/mte-def.h | 16 +- arch/arm64/include/asm/mte.h | 23 +-- arch/arm64/include/asm/pgtable.h | 8 ++--- arch/arm64/kernel/elfcore.c | 14 - arch/arm64/kernel/hibernate.c| 46 ++--- arch/arm64/lib/mte.S | 18 ++-- arch/arm64/mm/mteswap.c | 50 7 files changed, 90 insertions(+), 85 deletions(-) diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h index 14ee86b019c2..eb0d76a6bdcf 100644 --- a/arch/arm64/include/asm/mte-def.h +++ b/arch/arm64/include/asm/mte-def.h @@ -5,14 +5,14 @@ #ifndef __ASM_MTE_DEF_H #define __ASM_MTE_DEF_H -#define MTE_GRANULE_SIZE UL(16) -#define MTE_GRANULE_MASK (~(MTE_GRANULE_SIZE - 1)) -#define MTE_GRANULES_PER_PAGE (PAGE_SIZE / MTE_GRANULE_SIZE) -#define MTE_TAG_SHIFT 56 -#define MTE_TAG_SIZE 4 -#define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT) -#define MTE_PAGE_TAG_STORAGE (MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE / 8) +#define MTE_GRANULE_SIZE UL(16) +#define MTE_GRANULE_MASK (~(MTE_GRANULE_SIZE - 1)) +#define MTE_GRANULES_PER_PAGE (PAGE_SIZE / MTE_GRANULE_SIZE) +#define MTE_TAG_SHIFT 56 +#define MTE_TAG_SIZE_BITS 4 +#define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE_BITS - 1)), MTE_TAG_SHIFT) +#define MTE_PAGE_TAG_STORAGE_SIZE (MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE_BITS / 8) -#define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n" +#define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n" #endif /* __ASM_MTE_DEF_H */ diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index 91fbd5c8a391..8034695b3dd7 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -18,19 +18,24 @@ #include -void mte_clear_page_tags(void *addr); +void mte_clear_page_tags(void *page_addr); + unsigned long mte_copy_tags_from_user(void *to, const void __user *from, unsigned long n); unsigned long mte_copy_tags_to_user(void __user *to, void *from, unsigned long n); -int mte_save_tags(struct page *page); -void mte_save_page_tags(const void *page_addr, void *tag_storage); -void mte_restore_tags(swp_entry_t entry, struct page *page); -void mte_restore_page_tags(void *page_addr, const void *tag_storage); -void mte_invalidate_tags(int type, pgoff_t offset); -void mte_invalidate_tags_area(int type); -void *mte_allocate_tag_storage(void); -void mte_free_tag_storage(char *storage); + +int mte_save_page_tags_by_swp_entry(struct page *page); +void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page); + +void mte_copy_page_tags_to_buf(const void *page_addr, void *to); +void mte_copy_page_tags_from_buf(void *p
[PATCH RFC v3 16/35] KVM: arm64: Don't deny VM_PFNMAP VMAs when kvm_has_mte()
According to ARM DDI 0487J.a, page D10-5976, a memory location which doesn't have the Normal memory attribute is considered Untagged, and accesses are Tag Unchecked. Tag reads from an Untagged address return 0b, and writes are ignored. Linux uses VM_PFNMAP VMAs represent device memory, and Linux doesn't set the VM_MTE_ALLOWED flag for these VMAs. In user_mem_abort(), KVM requires that all VMAs that back guest memory must allow tagging (VM_MTE_ALLOWED flag set), except for VMAs that represent device memory. When a memslot is created or changed, KVM enforces a different behaviour: **all** VMAs that intersect the memslot must allow tagging, even those that represent device memory. This is too restrictive, and can lead to inconsistent behaviour: a VM_PFNMAP VMA that is present when a memslot is created causes KVM_SET_USER_MEMORY_REGION to fail, but if such a VMA is created after the memslot has been created, the virtual machine will run without errors. Change kvm_arch_prepare_memory_region() to allow VM_PFNMAP VMAs when the VM has the MTE capability enabled. Signed-off-by: Alexandru Elisei --- Changes from rfc v2: * New patch. It's a fix, and can be taken independently of the series. arch/arm64/kvm/mmu.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index d14504821b79..b7517c4a19c4 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -2028,17 +2028,15 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, if (!vma) break; - if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) { - ret = -EINVAL; - break; - } - if (vma->vm_flags & VM_PFNMAP) { /* IO region dirty page logging not allowed */ if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) { ret = -EINVAL; break; } + } else if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) { + ret = -EINVAL; + break; } hva = min(reg_end, vma->vm_end); } while (hva < reg_end); -- 2.43.0
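A hedged userspace sketch of the memslot path discussed above: a VM with KVM_CAP_ARM_MTE enabled registers a slot backed by an anonymous mapping (which carries VM_MTE_ALLOWED); with this patch, a VM_PFNMAP mapping such as an mmap()ed device BAR would also be accepted. Assumes a host kernel exposing KVM_CAP_ARM_MTE; the addresses and sizes are illustrative and error handling is minimal:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

int main(void)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_MTE };
	struct kvm_userspace_memory_region region;
	int kvm, vm;
	void *mem;

	kvm = open("/dev/kvm", O_RDWR);
	if (kvm < 0) {
		perror("/dev/kvm");
		return 1;
	}

	vm = ioctl(kvm, KVM_CREATE_VM, 0);
	if (vm < 0) {
		perror("KVM_CREATE_VM");
		return 1;
	}

	if (ioctl(vm, KVM_ENABLE_CAP, &cap))
		perror("KVM_ENABLE_CAP(KVM_CAP_ARM_MTE)");

	/* Anonymous memory: the VMA gets VM_MTE_ALLOWED, so the memslot is
	 * accepted. A VM_PFNMAP mapping (e.g. a device BAR) lacks it, and is
	 * what the hunk above now permits as guest memory. */
	mem = mmap(NULL, 0x200000, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	region = (struct kvm_userspace_memory_region) {
		.slot = 0,
		.guest_phys_addr = 0x80000000,
		.memory_size = 0x200000,
		.userspace_addr = (unsigned long)mem,
	};

	if (ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region))
		perror("KVM_SET_USER_MEMORY_REGION");
	else
		printf("memslot registered\n");

	return 0;
}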
[PATCH RFC v3 15/35] of: fdt: Add of_flat_read_u32()
Add the function of_flat_read_u32() to return the value of a property as an u32. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch, suggested by Rob Herring. drivers/of/fdt.c | 21 + include/linux/of_fdt.h | 2 ++ 2 files changed, 23 insertions(+) diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index bf502ba8da95..dfcd79fd5fd9 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -755,6 +755,27 @@ const void *__init of_get_flat_dt_prop(unsigned long node, const char *name, return fdt_getprop(initial_boot_params, node, name, size); } +/* + * of_flat_read_u32 - Return the value of the given property as an u32. + * + * @node: device node from which the property value is to be read + * @propname: name of the property + * @out_value: the value of the property + * @return: 0 on success, -EINVAL if property does not exist + */ +int __init of_flat_read_u32(unsigned long node, const char *propname, + u32 *out_value) +{ + const __be32 *reg; + + reg = of_get_flat_dt_prop(node, propname, NULL); + if (!reg) + return -EINVAL; + + *out_value = be32_to_cpup(reg); + return 0; +} + /** * of_fdt_is_compatible - Return true if given node from the given blob has * compat in its compatible list diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h index 0e26f8c3b10e..d7901699061b 100644 --- a/include/linux/of_fdt.h +++ b/include/linux/of_fdt.h @@ -57,6 +57,8 @@ extern const void *of_get_flat_dt_prop(unsigned long node, const char *name, extern int of_flat_dt_is_compatible(unsigned long node, const char *name); extern unsigned long of_get_flat_dt_root(void); extern uint32_t of_get_flat_dt_phandle(unsigned long node); +extern int of_flat_read_u32(unsigned long node, const char *propname, + u32 *out_value); extern int early_init_dt_scan_chosen(char *cmdline); extern int early_init_dt_scan_memory(void); -- 2.43.0
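A standalone illustration of the byte-order handling behind of_flat_read_u32(): FDT property cells are stored big-endian, so the helper is essentially a be32-to-host conversion of the property pointer. Here ntohl() stands in for the kernel's be32_to_cpup(), and the 0x2000 value is illustrative:

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* "block-size = <0x2000>;" as the bytes appear in the flattened tree. */
	const uint8_t prop[4] = { 0x00, 0x00, 0x20, 0x00 };
	uint32_t be;

	memcpy(&be, prop, sizeof(be));
	printf("block-size: 0x%x\n", ntohl(be));
	return 0;
}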
[PATCH RFC v3 14/35] of: fdt: Return the region size in of_flat_dt_translate_address()
Alongside the base address, arm64 will also need to know the size of a tag storage region. Teach of_flat_dt_translate_address() to parse and return the size. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch, suggested by Rob Herring. arch/sh/kernel/cpu/sh2/probe.c | 2 +- drivers/of/fdt_address.c | 12 +--- drivers/tty/serial/earlycon.c | 2 +- include/linux/of_fdt.h | 2 +- 4 files changed, 12 insertions(+), 6 deletions(-) diff --git a/arch/sh/kernel/cpu/sh2/probe.c b/arch/sh/kernel/cpu/sh2/probe.c index 70a07f4f2142..fa8904e8f390 100644 --- a/arch/sh/kernel/cpu/sh2/probe.c +++ b/arch/sh/kernel/cpu/sh2/probe.c @@ -21,7 +21,7 @@ static int __init scan_cache(unsigned long node, const char *uname, if (!of_flat_dt_is_compatible(node, "jcore,cache")) return 0; - j2_ccr_base = ioremap(of_flat_dt_translate_address(node), 4); + j2_ccr_base = ioremap(of_flat_dt_translate_address(node, NULL), 4); return 1; } diff --git a/drivers/of/fdt_address.c b/drivers/of/fdt_address.c index 1dc15ab78b10..4c08d710 100644 --- a/drivers/of/fdt_address.c +++ b/drivers/of/fdt_address.c @@ -160,7 +160,8 @@ static int __init fdt_translate_one(const void *blob, int parent, * that can be mapped to a cpu physical address). This is not really specified * that way, but this is traditionally the way IBM at least do things */ -static u64 __init fdt_translate_address(const void *blob, int node_offset) +static u64 __init fdt_translate_address(const void *blob, int node_offset, + u64 *out_size) { int parent, len; const struct of_bus *bus, *pbus; @@ -193,6 +194,9 @@ static u64 __init fdt_translate_address(const void *blob, int node_offset) goto bail; } memcpy(addr, reg, na * 4); + /* The size of the region doesn't need translating. */ + if (out_size) + *out_size = of_read_number(reg + na, ns); pr_debug("bus (na=%d, ns=%d) on %s\n", na, ns, fdt_get_name(blob, parent, NULL)); @@ -242,8 +246,10 @@ static u64 __init fdt_translate_address(const void *blob, int node_offset) /** * of_flat_dt_translate_address - translate DT addr into CPU phys addr * @node: node in the flat blob + * @out_size: size of the region, can be NULL if not needed + * @return: the address, OF_BAD_ADDR in case of error */ -u64 __init of_flat_dt_translate_address(unsigned long node) +u64 __init of_flat_dt_translate_address(unsigned long node, u64 *out_size) { - return fdt_translate_address(initial_boot_params, node); + return fdt_translate_address(initial_boot_params, node, out_size); } diff --git a/drivers/tty/serial/earlycon.c b/drivers/tty/serial/earlycon.c index a5fbb6ed38ae..e941cf786232 100644 --- a/drivers/tty/serial/earlycon.c +++ b/drivers/tty/serial/earlycon.c @@ -265,7 +265,7 @@ int __init of_setup_earlycon(const struct earlycon_id *match, spin_lock_init(&port->lock); port->iotype = UPIO_MEM; - addr = of_flat_dt_translate_address(node); + addr = of_flat_dt_translate_address(node, NULL); if (addr == OF_BAD_ADDR) { pr_warn("[%s] bad address\n", match->name); return -ENXIO; diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h index d69ad5bb1eb1..0e26f8c3b10e 100644 --- a/include/linux/of_fdt.h +++ b/include/linux/of_fdt.h @@ -36,7 +36,7 @@ extern char __dtb_start[]; extern char __dtb_end[]; /* Other Prototypes */ -extern u64 of_flat_dt_translate_address(unsigned long node); +extern u64 of_flat_dt_translate_address(unsigned long node, u64 *out_size); extern void of_fdt_limit_memory(int limit); #endif /* CONFIG_OF_FLATTREE */ -- 2.43.0
[PATCH RFC v3 13/35] mm: memory: Introduce fault-on-access mechanism for pages
Introduce a mechanism that allows an architecture to trigger a page fault, and add the infrastructure to handle that fault accordingly. To use make use of this, an arch is expected to mark the table entry as PAGE_NONE (which will cause a fault next time it is accessed) and to implement an arch-specific method (like a software bit) for recognizing that the fault needs to be handled by the arch code. arm64 will use of this approach to reserve tag storage for pages which are mapped in an MTE enabled VMA, but the storage needed to store tags isn't reserved (for example, because of an mprotect(PROT_MTE) call on a VMA with existing pages). Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. Split from patch #19 ("mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)") (David Hildenbrand). include/linux/huge_mm.h | 4 ++-- include/linux/pgtable.h | 47 +++-- mm/Kconfig | 3 +++ mm/huge_memory.c| 36 + mm/memory.c | 51 ++--- 5 files changed, 109 insertions(+), 32 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 5adb86af35fc..4678a0a5e6a8 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -346,7 +346,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap); -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); +vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf); extern struct page *huge_zero_page; extern unsigned long huge_zero_pfn; @@ -476,7 +476,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, return NULL; } -static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) +static inline vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf) { return 0; } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2d0f04042f62..81a21be855a2 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1455,7 +1455,7 @@ static inline int pud_trans_unstable(pud_t *pud) return 0; } -#ifndef CONFIG_NUMA_BALANCING +#if !defined(CONFIG_NUMA_BALANCING) && !defined(CONFIG_ARCH_HAS_FAULT_ON_ACCESS) /* * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is * perfectly valid to indicate "no" in that case, which is why our default @@ -1477,7 +1477,50 @@ static inline int pmd_protnone(pmd_t pmd) { return 0; } -#endif /* CONFIG_NUMA_BALANCING */ +#endif /* !CONFIG_NUMA_BALANCING && !CONFIG_ARCH_HAS_FAULT_ON_ACCESS */ + +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS +static inline bool arch_fault_on_access_pte(pte_t pte) +{ + return false; +} + +static inline bool arch_fault_on_access_pmd(pmd_t pmd) +{ + return false; +} + +/* + * The function is called with the fault lock held and an elevated reference on + * the folio. + * + * Rules that an arch implementation of the function must follow: + * + * 1. The function must return with the elevated reference dropped. + * + * 2. If the return value contains VM_FAULT_RETRY or VM_FAULT_COMPLETED then: + * + * - if FAULT_FLAG_RETRY_NOWAIT is not set, the function must return with the + * correct fault lock released, which can be accomplished with + * release_fault_lock(vmf). Note that release_fault_lock() doesn't check if + * FAULT_FLAG_RETRY_NOWAIT is set before releasing the mmap_lock. + * + * - if FAULT_FLAG_RETRY_NOWAIT is set, then the function must not release the + * mmap_lock. The flag should be set only if the mmap_lock is held. + * + * 3. 
If the return value contains neither of the above, the function must not + * release the fault lock; the generic fault handler will take care of releasing + * the correct lock. + */ +static inline vm_fault_t arch_handle_folio_fault_on_access(struct folio *folio, + struct vm_fault *vmf, + bool *map_pte) +{ + *map_pte = false; + + return VM_FAULT_SIGBUS; +} +#endif #endif /* CONFIG_MMU */ diff --git a/mm/Kconfig b/mm/Kconfig index 341cf53898db..153df67221f1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1006,6 +1006,9 @@ config IDLE_PAGE_TRACKING config ARCH_HAS_CACHE_LINE_SIZE bool +config ARCH_HAS_FAULT_ON_ACCESS + bool + config ARCH_HAS_CURRENT_STACK_POINTER bool help diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 94ef5c02b459..2bad63a7ec16 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1698,7 +1698,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, } /* NUMA hinting page fault entry point for trans huge pmds */ -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) +vm_fault_t handle_huge_pmd_protnone(struct
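For illustration only, an architecture selecting ARCH_HAS_FAULT_ON_ACCESS might recognise such faults with a software PTE bit along the lines below; the bit name and position are assumptions, not taken from the arm64 patches in this series:

/* Sketch of the arch-side recognition step described in the commit message. */
#define PTE_SW_FAULT_ON_ACCESS	(_AT(pteval_t, 1) << 58)	/* assumption */

static inline bool arch_fault_on_access_pte(pte_t pte)
{
	/* The entry is also expected to have been made PAGE_NONE. */
	return pte_protnone(pte) && (pte_val(pte) & PTE_SW_FAULT_ON_ACCESS);
}

static inline bool arch_fault_on_access_pmd(pmd_t pmd)
{
	return pmd_protnone(pmd) && (pmd_val(pmd) & PTE_SW_FAULT_ON_ACCESS);
}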
[PATCH RFC v3 12/35] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
arm64 uses arch_swap_restore() to restore saved tags before the page is swapped in and it's called in atomic context (with the ptl lock held). Introduce arch_swap_prepare_to_restore() that will allow an architecture to perform extra work during swap in and outside of a critical section. This will be used by arm64 to allocate a buffer in memory where to temporarily save tags if tag storage is not available for the page being swapped in. Signed-off-by: Alexandru Elisei --- include/linux/pgtable.h | 7 +++ mm/memory.c | 4 mm/shmem.c | 9 + mm/swapfile.c | 5 + 4 files changed, 25 insertions(+) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 98f81ca08cbe..2d0f04042f62 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -959,6 +959,13 @@ static inline void arch_swap_invalidate_area(int type) } #endif +#ifndef __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE +static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio) +{ + return 0; +} +#endif + #ifndef __HAVE_ARCH_SWAP_RESTORE static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) { diff --git a/mm/memory.c b/mm/memory.c index 7e1f4849463a..8a421e168b57 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3975,6 +3975,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_throttle_swaprate(folio, GFP_KERNEL); + ret = arch_swap_prepare_to_restore(entry, folio); + if (ret) + goto out_page; + /* * Back out if somebody else already faulted in this pte. */ diff --git a/mm/shmem.c b/mm/shmem.c index 14427e9982f9..621fabc3b8c6 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1855,6 +1855,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct swap_info_struct *si; struct folio *folio = NULL; swp_entry_t swap; + vm_fault_t ret; int error; VM_BUG_ON(!*foliop || !xa_is_value(*foliop)); @@ -1903,6 +1904,14 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, } folio_wait_writeback(folio); + ret = arch_swap_prepare_to_restore(swap, folio); + if (ret) { + if (fault_type) + *fault_type = ret; + error = -EINVAL; + goto unlock; + } + /* * Some architectures may have to restore extra metadata to the * folio after reading from swap. diff --git a/mm/swapfile.c b/mm/swapfile.c index 556ff7347d5f..49425598f778 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1785,6 +1785,11 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, goto setpte; } + if (arch_swap_prepare_to_restore(entry, folio)) { + ret = -EINVAL; + goto out; + } + /* * Some architectures may have to restore extra metadata to the page * when reading from swap. This metadata may be indexed by swap entry -- 2.43.0
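A minimal sketch of an arch override, assuming the contract above (may sleep, runs before the pte lock is retaken); example_alloc_tag_buffer() is a placeholder name, not a function from the arm64 patches:

#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
						      struct folio *folio)
{
	/*
	 * Safe to sleep here: do_swap_page() calls this hook before it
	 * retakes the pte lock, unlike arch_swap_restore().
	 */
	if (!example_alloc_tag_buffer(entry, folio, GFP_KERNEL)) /* hypothetical */
		return VM_FAULT_OOM;

	return 0;
}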
[PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known
arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA. When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets set in vma_alloc_zeroed_movable_folio(). Expand this to be more generic by adding an arch hook that modifes the gfp flags for an allocation when the VMA is known. Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO is also set; from that point of view, the current behaviour is unchanged, even though the arm64 flag is set in more places. When arm64 will have support to reuse the tag storage for data allocation, the uses of the __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try to reserve the corresponding tag storage for the pages being allocated. The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by the caller; this has been done to keep an architecture from modifying the flags already set by the core memory management code; this is similar to how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been implemented. This can be revisited in the future if there's a need to do so. Signed-off-by: Alexandru Elisei --- arch/arm64/include/asm/page.h| 5 ++--- arch/arm64/include/asm/pgtable.h | 3 +++ arch/arm64/mm/fault.c| 19 ++- include/linux/pgtable.h | 7 +++ mm/mempolicy.c | 1 + mm/shmem.c | 5 - 6 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h index 2312e6ee595f..88bab032a493 100644 --- a/arch/arm64/include/asm/page.h +++ b/arch/arm64/include/asm/page.h @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from, void copy_highpage(struct page *to, struct page *from); #define __HAVE_ARCH_COPY_HIGHPAGE -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, - unsigned long vaddr); -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \ + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false) void tag_clear_highpage(struct page *to); #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 79ce70fbb751..08f0904dbfc2 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) #endif /* CONFIG_ARM64_MTE */ +#define __HAVE_ARCH_CALC_VMA_GFP +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp); + /* * On AArch64, the cache coherency is handled via the set_pte_at() function. */ diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 55f6455a8284..4d3f0a870ad8 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr, NOKPROBE_SYMBOL(do_debug_exception); /* - * Used during anonymous page fault handling. + * If this is called during anonymous page fault handling, and the page is + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as this + * is usually faster than separate DC ZVA and STGM. 
*/ -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, - unsigned long vaddr) +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) { - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; - - /* -* If the page is mapped with PROT_MTE, initialise the tags at the -* point of allocation and page zeroing as this is usually faster than -* separate DC ZVA and STGM. -*/ if (vma->vm_flags & VM_MTE) - flags |= __GFP_ZEROTAGS; - - return vma_alloc_folio(flags, 0, vma, vaddr, false); + return __GFP_ZEROTAGS; + return 0; } void tag_clear_highpage(struct page *page) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index c5ddec6b5305..98f81ca08cbe 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -901,6 +901,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm, } #endif +#ifndef __HAVE_ARCH_CALC_VMA_GFP +static inline gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) +{ + return 0; +} +#endif + #ifndef __HAVE_ARCH_FREE_PAGES_PREPARE static inline void arch_free_pages_prepare(struct page *page, int order) { } #endif diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..f7ef52760b32 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2168,6 +2168,7 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, pgoff_t ilx; struct page *page; + gfp |=
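From the allocator side, the calling convention described in the commit message (arch flags are or'ed into the caller's flags, never replace them) works out to roughly the following; example_alloc_movable_folio() is illustrative and not a function added by the patch:

static struct folio *example_alloc_movable_folio(struct vm_area_struct *vma,
						 unsigned long vaddr)
{
	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;

	/* The arch may only add flags, e.g. __GFP_ZEROTAGS for VM_MTE VMAs. */
	gfp |= arch_calc_vma_gfp(vma, gfp);

	return vma_alloc_folio(gfp, 0, vma, vaddr, false);
}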
[PATCH RFC v3 10/35] mm: cma: Fast track allocating memory when the pages are free
If the pages to be allocated are free, take them directly off the buddy allocator, instead of going through alloc_contig_range() and avoiding costly calls to lru_cache_disable(). Only allocations of the same size as the CMA region order are considered, to avoid taking the zone spinlock for too long. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track reserving tag storage when the block is free") (David Hildenbrand). include/linux/page-flags.h | 15 -- mm/Kconfig | 5 + mm/cma.c | 42 ++ mm/memory-failure.c| 8 mm/page_alloc.c| 23 - 5 files changed, 73 insertions(+), 20 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 735cddc13d20..b7237bce7446 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) #define MAGIC_HWPOISON 0x48575053U /* HWPS */ extern void SetPageHWPoisonTakenOff(struct page *page); extern void ClearPageHWPoisonTakenOff(struct page *page); -extern bool take_page_off_buddy(struct page *page); -extern bool put_page_back_buddy(struct page *page); +extern bool PageHWPoisonTakenOff(struct page *page); #else PAGEFLAG_FALSE(HWPoison, hwpoison) +TESTSCFLAG_FALSE(HWPoison, hwpoison) #define __PG_HWPOISON 0 +static inline void SetPageHWPoisonTakenOff(struct page *page) { } +static inline void ClearPageHWPoisonTakenOff(struct page *page) { } +static inline bool PageHWPoisonTakenOff(struct page *page) +{ + return false; +} +#endif + +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY +extern bool take_page_off_buddy(struct page *page, bool poison); +extern bool put_page_back_buddy(struct page *page, bool unpoison); #endif #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) diff --git a/mm/Kconfig b/mm/Kconfig index ffc3a2ba3a8c..341cf53898db 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR config ARCH_SUPPORTS_MEMORY_FAILURE bool +config WANTS_TAKE_PAGE_OFF_BUDDY + bool + config MEMORY_FAILURE depends on MMU depends on ARCH_SUPPORTS_MEMORY_FAILURE bool "Enable recovery from hardware memory errors" select MEMORY_ISOLATION select RAS + select WANTS_TAKE_PAGE_OFF_BUDDY help Enables code to recover from some memory failures on systems with MCA recovery. This allows a system to continue running @@ -891,6 +895,7 @@ config CMA depends on MMU select MIGRATION select MEMORY_ISOLATION + select WANTS_TAKE_PAGE_OFF_BUDDY help This enables the Contiguous Memory Allocator which allows other subsystems to allocate big physically-contiguous blocks of memory. diff --git a/mm/cma.c b/mm/cma.c index 2881bab12b01..15663f95d77b 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma) static inline void cma_debug_show_areas(struct cma *cma) { } #endif +/* Called with the cma mutex held. */ +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start, + unsigned long end) +{ + bool success = false; + unsigned long i, j; + + /* Avoid contention on the zone lock. 
*/ + if (start - end != 1 << cma->order_per_bit) + return -EINVAL; + + for (i = start; i < end; i++) { + if (!is_free_buddy_page(pfn_to_page(i))) + break; + success = take_page_off_buddy(pfn_to_page(i), false); + if (!success) + break; + } + + if (success) + return 0; + + for (j = start; j < i; j++) + put_page_back_buddy(pfn_to_page(j), false); + + return -EBUSY; +} + /** * cma_alloc_range() - allocate pages in a specific range * @cma: Contiguous memory region for which the allocation is performed. @@ -493,7 +521,11 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count, for (i = 0; i < tries; i++) { mutex_lock(&cma_mutex); - err = alloc_contig_range(start, start + count, MIGRATE_CMA, gfp); + err = cma_alloc_pages_fastpath(cma, start, start + count); + if (err) { + err = alloc_contig_range(start, start + count, +MIGRATE_CMA, gfp); + } mutex_unlock(&cma_mutex); if (err != -EBUSY) @@ -529,7 +561,6 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count, return err; } - /** * cma_alloc() - allocate pages from contiguous area * @cma: Contiguous memory region for which the allocation is perfor
[PATCH RFC v3 09/35] mm: cma: Introduce cma_remove_mem()
Memory is added to CMA with cma_declare_contiguous_nid() and cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in cma_init_reserved_areas(), where the page allocator can make use of it. If a device manages multiple CMA areas, and there's an error when one of the areas is added to CMA, there is no mechanism for the device to prevent the rest of the areas, which were added before the error occured, from being later added to the MIGRATE_CMA list. Add cma_remove_mem() which allows a previously reserved CMA area to be removed and thus it cannot be used by the page allocator. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. include/linux/cma.h | 1 + mm/cma.c| 30 +- 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/include/linux/cma.h b/include/linux/cma.h index e32559da6942..787cbec1702e 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, unsigned int order_per_bit, const char *name, struct cma **res_cma); +extern void cma_remove_mem(struct cma **res_cma); extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align, bool no_warn); extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count, diff --git a/mm/cma.c b/mm/cma.c index 4a0f68b9443b..2881bab12b01 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void) { int i; - for (i = 0; i < cma_area_count; i++) + for (i = 0; i < cma_area_count; i++) { + /* Region was removed. */ + if (!cma_areas[i].count) + continue; cma_activate_area(&cma_areas[i]); + } return 0; } @@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, return 0; } +/** + * cma_remove_mem() - remove cma area + * @res_cma: Pointer to the cma region. + * + * This function removes a cma region created with cma_init_reserved_mem(). The + * ->count is set to 0. + */ +void __init cma_remove_mem(struct cma **res_cma) +{ + struct cma *cma; + + if (WARN_ON_ONCE(!res_cma || !(*res_cma))) + return; + + cma = *res_cma; + if (WARN_ON_ONCE(!cma->count)) + return; + + totalcma_pages -= cma->count; + cma->count = 0; + + *res_cma = NULL; +} + /** * cma_declare_contiguous_nid() - reserve custom contiguous area * @base: Base address of the reserved area optional, use 0 for any -- 2.43.0
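An illustrative (not posted) error-handling pattern for a driver that declares two CMA areas and unwinds the first one on failure; the area names and base/size parameters are hypothetical:

#include <linux/cma.h>

static int __init example_reserve_two_areas(phys_addr_t base0, phys_addr_t size0,
					    phys_addr_t base1, phys_addr_t size1)
{
	struct cma *area0, *area1;
	int ret;

	ret = cma_init_reserved_mem(base0, size0, 0, "example0", &area0);
	if (ret)
		return ret;

	ret = cma_init_reserved_mem(base1, size1, 0, "example1", &area1);
	if (ret) {
		/* Keep area0 from ever reaching the MIGRATE_CMA list. */
		cma_remove_mem(&area0);
		return ret;
	}

	return 0;
}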
[PATCH RFC v3 08/35] mm: cma: Introduce cma_alloc_range()
Today, cma_alloc() is used to allocate a contiguous memory region. The function allows the caller to specify the number of pages to allocate, but not the starting address. cma_alloc() will walk over the entire CMA region trying to allocate the first available range of the specified size. Introduce cma_alloc_range(), which makes CMA more versatile by allowing the caller to specify a particular range in the CMA region, defined by the start pfn and the size. arm64 will make use of this function when tag storage management will be implemented: cma_alloc_range() will be used to reserve the tag storage associated with a tagged page. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. include/linux/cma.h| 2 + include/trace/events/cma.h | 59 ++ mm/cma.c | 86 ++ 3 files changed, 147 insertions(+) diff --git a/include/linux/cma.h b/include/linux/cma.h index 63873b93deaa..e32559da6942 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, struct cma **res_cma); extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align, bool no_warn); +extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count, + unsigned tries, gfp_t gfp); extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count); extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count); diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h index 25103e67737c..a89af313a572 100644 --- a/include/trace/events/cma.h +++ b/include/trace/events/cma.h @@ -36,6 +36,65 @@ TRACE_EVENT(cma_release, __entry->count) ); +TRACE_EVENT(cma_alloc_range_start, + + TP_PROTO(const char *name, unsigned long start, unsigned long count, +unsigned tries), + + TP_ARGS(name, start, count, tries), + + TP_STRUCT__entry( + __string(name, name) + __field(unsigned long, start) + __field(unsigned long, count) + __field(unsigned, tries) + ), + + TP_fast_assign( + __assign_str(name, name); + __entry->start = start; + __entry->count = count; + __entry->tries = tries; + ), + + TP_printk("name=%s start=%lx count=%lu tries=%u", + __get_str(name), + __entry->start, + __entry->count, + __entry->tries) +); + +TRACE_EVENT(cma_alloc_range_finish, + + TP_PROTO(const char *name, unsigned long start, unsigned long count, +unsigned attempts, int err), + + TP_ARGS(name, start, count, attempts, err), + + TP_STRUCT__entry( + __string(name, name) + __field(unsigned long, start) + __field(unsigned long, count) + __field(unsigned, attempts) + __field(int, err) + ), + + TP_fast_assign( + __assign_str(name, name); + __entry->start = start; + __entry->count = count; + __entry->attempts = attempts; + __entry->err = err; + ), + + TP_printk("name=%s start=%lx count=%lu attempts=%u err=%d", + __get_str(name), + __entry->start, + __entry->count, + __entry->attempts, + __entry->err) +); + TRACE_EVENT(cma_alloc_start, TP_PROTO(const char *name, unsigned long count, unsigned int align), diff --git a/mm/cma.c b/mm/cma.c index 543bb6b3be8e..4a0f68b9443b 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -416,6 +416,92 @@ static void cma_debug_show_areas(struct cma *cma) static inline void cma_debug_show_areas(struct cma *cma) { } #endif +/** + * cma_alloc_range() - allocate pages in a specific range + * @cma: Contiguous memory region for which the allocation is performed. + * @start: Starting pfn of the allocation. 
+ * @count: Requested number of pages + * @tries: Number of tries if the range is busy + * @no_warn: Avoid printing message about failed allocation + * + * This function allocates part of contiguous memory from a specific contiguous + * memory area, from the specified starting address. The 'start' pfn and the the + * 'count' number of pages must be aligned to the CMA bitmap order per bit. + */ +int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count, + unsigned tries, gfp_t gfp) +{ + unsigned long bitmap_maxno, bitmap_no, bitmap_start, bitmap_count; + unsigned long i = 0; + struct page *page; + int err = -EINVAL; + + if (!cma || !cma->count || !cma->bitmap) + goto out_s
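A sketch of the intended arm64 use, reserving the tag storage that corresponds to a newly allocated tagged page; tag_storage_cma and page_to_tag_storage_pfn() are invented names, only the cma_alloc_range() signature comes from the patch:

static int example_reserve_tag_storage(struct page *page)
{
	unsigned long tag_pfn = page_to_tag_storage_pfn(page);	/* hypothetical */

	/* Retry up to three times if the range is temporarily busy. */
	return cma_alloc_range(tag_storage_cma, tag_pfn, 1, 3, GFP_KERNEL);
}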
[PATCH RFC v3 07/35] mm: cma: Add CMA_RELEASE_{SUCCESS,FAIL} events
Similar to the two events that relate to CMA allocations, add the CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages are freed. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. include/linux/vm_event_item.h | 2 ++ mm/cma.c | 6 +- mm/vmstat.c | 2 ++ 3 files changed, 9 insertions(+), 1 deletion(-) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 747943bc8cc2..aba5c5bf8127 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -83,6 +83,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_CMA CMA_ALLOC_SUCCESS, CMA_ALLOC_FAIL, + CMA_RELEASE_SUCCESS, + CMA_RELEASE_FAIL, #endif UNEVICTABLE_PGCULLED, /* culled to noreclaim list */ UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */ diff --git a/mm/cma.c b/mm/cma.c index dbf7fe8cb1bd..543bb6b3be8e 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -562,8 +562,10 @@ bool cma_release(struct cma *cma, const struct page *pages, { unsigned long pfn; - if (!cma_pages_valid(cma, pages, count)) + if (!cma_pages_valid(cma, pages, count)) { + count_vm_events(CMA_RELEASE_FAIL, count); return false; + } pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count); @@ -575,6 +577,8 @@ bool cma_release(struct cma *cma, const struct page *pages, cma_clear_bitmap(cma, pfn, count); trace_cma_release(cma->name, pfn, pages, count); + count_vm_events(CMA_RELEASE_SUCCESS, count); + return true; } diff --git a/mm/vmstat.c b/mm/vmstat.c index db79935e4a54..eebfd5c6c723 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1340,6 +1340,8 @@ const char * const vmstat_text[] = { #ifdef CONFIG_CMA "cma_alloc_success", "cma_alloc_fail", + "cma_release_success", + "cma_release_fail", #endif "unevictable_pgs_culled", "unevictable_pgs_scanned", -- 2.43.0
[PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one after each cma_alloc() function call. This is done even though cma_alloc() can allocate an arbitrary number of CMA pages. When looking at /proc/vmstat, the number of successful (or failed) cma_alloc() calls doesn't tell much with regards to how many CMA pages were allocated via cma_alloc() versus via the page allocator (regular allocation request or PCP lists refill). This can also be rather confusing to a user who isn't familiar with the code, since the unit of measurement for nr_free_cma is the number of pages, but cma_alloc_success and cma_alloc_fail count the number of cma_alloc() function calls. Let's make this consistent, and arguably more useful, by having CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to allocate. For users that wish to track the number of cma_alloc() calls, there are tracepoints for that already implemented. Signed-off-by: Alexandru Elisei --- mm/cma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index f49c95f8ee37..dbf7fe8cb1bd 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count, pr_debug("%s(): returned %p\n", __func__, page); out: if (page) { - count_vm_event(CMA_ALLOC_SUCCESS); + count_vm_events(CMA_ALLOC_SUCCESS, count); cma_sysfs_account_success_pages(cma, count); } else { - count_vm_event(CMA_ALLOC_FAIL); + count_vm_events(CMA_ALLOC_FAIL, count); if (cma) cma_sysfs_account_fail_pages(cma, count); } -- 2.43.0
[PATCH RFC v3 05/35] mm: cma: Don't append newline when generating CMA area name
cma->name is displayed in several CMA messages. When the name is generated by the CMA code, don't append a newline to avoid breaking the text across two lines. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. This is a fix, and can be merged independently of the other patches. mm/cma.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/cma.c b/mm/cma.c index 7c09c47e530b..f49c95f8ee37 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -204,7 +204,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, if (name) snprintf(cma->name, CMA_MAX_NAME, name); else - snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count); + snprintf(cma->name, CMA_MAX_NAME, "cma%d", cma_area_count); cma->base_pfn = PFN_DOWN(base); cma->count = size >> PAGE_SHIFT; -- 2.43.0
[PATCH RFC v3 04/35] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"
The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code") removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list because CMA is always allowed when __GFP_MOVABLE is set. With the introduction of the arch_alloc_cma() function, the above is not true anymore, so bring back the filter. This is a partially revert because the stale comment remains removed. Signed-off-by: Alexandru Elisei --- mm/page_alloc.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a96d47a6393e..0fa34bcfb1af 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2897,10 +2897,17 @@ struct page *rmqueue(struct zone *preferred_zone, WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); if (likely(pcp_allowed_order(order))) { - page = rmqueue_pcplist(preferred_zone, zone, order, - migratetype, alloc_flags); - if (likely(page)) - goto out; + /* +* MIGRATE_MOVABLE pcplist could have the pages on CMA area and +* we need to skip it when CMA area isn't allowed. +*/ + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA || + migratetype != MIGRATE_MOVABLE) { + page = rmqueue_pcplist(preferred_zone, zone, order, + migratetype, alloc_flags); + if (likely(page)) + goto out; + } } page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags, -- 2.43.0
[PATCH RFC v3 03/35] mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations
As an architecture might have specific requirements around the allocation of CMA pages, add an arch hook that can disable allocations from MIGRATE_CMA, if the allocation was otherwise allowed. This will be used by arm64, which will put tag storage pages on the MIGRATE_CMA list, and tag storage pages cannot be tagged. The filter will be used to deny using MIGRATE_CMA for __GFP_TAGGED allocations. Signed-off-by: Alexandru Elisei --- include/linux/pgtable.h | 7 +++ mm/page_alloc.c | 3 ++- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 6d98d5fdd697..c5ddec6b5305 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -905,6 +905,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm, static inline void arch_free_pages_prepare(struct page *page, int order) { } #endif +#ifndef __HAVE_ARCH_ALLOC_CMA +static inline bool arch_alloc_cma(gfp_t gfp) +{ + return true; +} +#endif + #ifndef __HAVE_ARCH_UNMAP_ONE /* * Some architectures support metadata associated with a page. When a diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 27282a1c82fe..a96d47a6393e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3157,7 +3157,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask, unsigned int alloc_flags) { #ifdef CONFIG_CMA - if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE) + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE && + arch_alloc_cma(gfp_mask)) alloc_flags |= ALLOC_CMA; #endif return alloc_flags; -- 2.43.0
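A minimal sketch of the arm64-side filter the commit message describes; __GFP_TAGGED is introduced elsewhere in this series, so its exact spelling and use here are assumptions:

#define __HAVE_ARCH_ALLOC_CMA
static inline bool arch_alloc_cma(gfp_t gfp)
{
	/* Tag storage pages sit on MIGRATE_CMA but cannot themselves be tagged. */
	if (gfp & __GFP_TAGGED)
		return false;

	return true;
}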
[PATCH RFC v3 02/35] mm: page_alloc: Add an arch hook early in free_pages_prepare()
The arm64 MTE code uses the PG_arch_2 page flag, which it renames to PG_mte_tagged, to track if a page has been mapped with tagging enabled. That flag is cleared by free_pages_prepare() by doing: page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; When tag storage management is added, tag storage will be reserved for a page if and only if the page is mapped as tagged (the page flag PG_mte_tagged is set). When a page is freed, likewise, the code will have to look at the the page flags to determine if the page has tag storage reserved, which should also be freed. For this purpose, add an arch_free_pages_prepare() hook that is called before that page flags are cleared. The function arch_free_page() has also been considered for this purpose, but it is called after the flags are cleared. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * Expanded commit message (David Hildenbrand). include/linux/pgtable.h | 4 mm/page_alloc.c | 1 + 2 files changed, 5 insertions(+) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index f6d0e3513948..6d98d5fdd697 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -901,6 +901,10 @@ static inline void arch_do_swap_page(struct mm_struct *mm, } #endif +#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE +static inline void arch_free_pages_prepare(struct page *page, int order) { } +#endif + #ifndef __HAVE_ARCH_UNMAP_ONE /* * Some architectures support metadata associated with a page. When a diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2c140abe5ee6..27282a1c82fe 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1092,6 +1092,7 @@ static __always_inline bool free_pages_prepare(struct page *page, trace_mm_page_free(page, order); kmsan_free_page(page, order); + arch_free_pages_prepare(page, order); if (memcg_kmem_online() && PageMemcgKmem(page)) __memcg_kmem_uncharge_page(page, order); -- 2.43.0
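A hedged sketch of the arch hook the commit message anticipates; free_example_tag_storage() is a placeholder, and keying off page_mte_tagged() exactly like this is an assumption rather than code from the series:

#define __HAVE_ARCH_FREE_PAGES_PREPARE
static inline void arch_free_pages_prepare(struct page *page, int order)
{
	/*
	 * Called from free_pages_prepare() before PAGE_FLAGS_CHECK_AT_PREP
	 * (and with it PG_mte_tagged) is cleared, so the flag can still be
	 * consulted here.
	 */
	if (page_mte_tagged(page))
		free_example_tag_storage(page, order);	/* hypothetical */
}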
[PATCH RFC v3 01/35] mm: page_alloc: Add gfp_flags parameter to arch_alloc_page()
Extend the usefulness of arch_alloc_page() by adding the gfp_flags parameter. Signed-off-by: Alexandru Elisei --- Changes since rfc v2: * New patch. arch/s390/include/asm/page.h | 2 +- arch/s390/mm/page-states.c | 2 +- include/linux/gfp.h | 2 +- mm/page_alloc.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h index 73b9c3bf377f..859f0958c574 100644 --- a/arch/s390/include/asm/page.h +++ b/arch/s390/include/asm/page.h @@ -163,7 +163,7 @@ static inline int page_reset_referenced(unsigned long addr) struct page; void arch_free_page(struct page *page, int order); -void arch_alloc_page(struct page *page, int order); +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags); static inline int devmem_is_allowed(unsigned long pfn) { diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c index 01f9b39e65f5..b986c8b158e3 100644 --- a/arch/s390/mm/page-states.c +++ b/arch/s390/mm/page-states.c @@ -21,7 +21,7 @@ void arch_free_page(struct page *page, int order) __set_page_unused(page_to_virt(page), 1UL << order); } -void arch_alloc_page(struct page *page, int order) +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags) { if (!cmma_flag) return; diff --git a/include/linux/gfp.h b/include/linux/gfp.h index de292a007138..9e8aa3d144db 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -172,7 +172,7 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags) static inline void arch_free_page(struct page *page, int order) { } #endif #ifndef HAVE_ARCH_ALLOC_PAGE -static inline void arch_alloc_page(struct page *page, int order) { } +static inline void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags) { } #endif struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 150d4f23b010..2c140abe5ee6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1485,7 +1485,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order, set_page_private(page, 0); set_page_refcounted(page); - arch_alloc_page(page, order); + arch_alloc_page(page, order, gfp_flags); debug_pagealloc_map_pages(page, 1 << order); /* -- 2.43.0
[PATCH RFC v3 00/35] Add support for arm64 MTE dynamic tag storage reuse
The series is based on v6.8-rc1 and can be cloned with: $ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \ -b arm-mte-dynamic-carveout-rfc-v3 Changelog = The changes from the previous version [1] are extensive, so I'll list them first. Only the major changes are below, individual patches will have their own changelog. I would like to point out that patch #31 ("khugepaged: arm64: Don't collapse MTE enabled VMAs") might be controversial. Please have a look. Changes since rfc v2 [1]: - Patches #5 ("mm: cma: Don't append newline when generating CMA area name") and #16 ("KVM: arm64: Don't deny VM_PFNMAP VMAs when kvm_has_mte()") are new and they are fixes. I think they can be merged independently of the rest of the series. - Tag storage now uses the CMA API to allocate and free tag storage pages (David Hildenbrand). - Tag storage is now described as subnode of 'reserved-memory' (Rob Herring). - KVM now has support for dynamic tag storage reuse, added in patches #32 ("KVM: arm64: mte: Reserve tag storage for VMs with MTE") and #33 ("KVM: arm64: mte: Introduce VM_MTE_KVM VMA flag"). - Reserving tag storage when a tagged page is allocated is now a best effort approach instead of being mandatory. If tag storage cannot be reserved, the page is marked as protnone and tag storage is reserved when the fault is taken on the next userspace access to the address. - ptrace support for pages without tag storage has been added, implemented in patch #30 ("arm64: mte: ptrace: Handle pages with missing tag storage"). - The following patches have been dropped: #4 (" mm: migrate/mempolicy: Add hook to modify migration target gfp"), #5 ("mm: page_alloc: Add an arch hook to allow prep_new_page() to fail") because reserving tag storage is now best effort, and to make the series shorter, in the case of patch #4. - Also dropped patch #13 ("arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK") and added a BUILD_BUG_ON() instead (David Hildenbrand). - Dropped patch #15 ("arm64: mte: Check that tag storage blocks are in the same zone") because it's not needed anymore, cma_init_reserved_areas->cma_activate_area() already does that (David Hildenbrand). - Moved patches #1 ("arm64: mte: Rework naming for tag manipulation functions") and #2 ("arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED") after the changes to the common code and before tag storage is discovered. - Patch #12 ("arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype") was replaced with patch #20 ("arm64: mte: Add tag storage memory to CMA") (David Hildenbrand). - Split patch #19 ("mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)") into an arch independent part (patch #13, "mm: memory: Introduce fault-on-access mechanism for pages") and into an arm64 patch (patch #26, "arm64: mte: Use fault-on-access to reserve missing tag storage"). The arm64 code is much smaller because of this (David Hildenbrand). [1] https://lore.kernel.org/linux-arm-kernel/20231119165721.9849-1-alexandru.eli...@arm.com/ Introduction Memory Tagging Extension (MTE) is implemented currently to have a static carve-out of the DRAM to store the allocation tags (a.k.a. memory colour). This is what we call the tag storage. Each 16 bytes have 4 bits of tags, so this means 1/32 of the DRAM, roughly 3% used for the tag storage. This is done transparently by the hardware/interconnect (with firmware setup) and normally hidden from the OS. 
So a checked memory access to location X generates a tag fetch from location Y in the carve-out and this tag is compared with the bits 59:56 in the pointer. The correspondence from X to Y is linear (subject to a minimum block size to deal with some address interleaving). The software doesn't need to know about this correspondence as we have specific instructions like STG/LDG to location X that lead to a tag store/load to Y. Not all memory used by applications is tagged (mmap(PROT_MTE)). For example, some large allocations may not use PROT_MTE at all or only for the first and last page since initialising the tags takes time. And executable memory is never tagged. The side-effect is that of thie 3% of DRAM, only part of it, say 1%, is effectively used. The series aims to take that unused tag storage and release it to the page allocator for normal data usage. The first complication is that a PROT_MTE page allocation at address X will need to reserve the tag storage page at location Y (and migrate any data in that page if it is in use). To make things more complicated, pages in the tag storage/carve-out range cannot use PROT_MTE themselves on current hardware, so this adds the second complication - a heterogeneous memory layout. The kernel needs to know where to allocate a PROT_MTE page from or migrate a current page if it becomes PROT_MTE (mprotect()) and the range it is in does not support tagging. Some other complications are arm64-specific like cache coherency between tags and dat
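As a back-of-the-envelope illustration of the linear X-to-Y correspondence (4 tag bits per 16 data bytes, i.e. 1/32 of DRAM), ignoring the interleaving and block-size details handled by firmware and the interconnect; all names below are invented:

/* Byte in the tag carve-out holding the tags for data address x. */
static inline phys_addr_t example_tag_byte_for(phys_addr_t x,
					       phys_addr_t dram_base,
					       phys_addr_t tag_base)
{
	/* 32 bytes of data need one byte (two 4-bit tags) of tag storage. */
	return tag_base + (x - dram_base) / 32;
}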
Re: [PATCH] ARM: dts: qcom: apq8026-lg-lenok: Add vibrator support
On 1/21/24 11:09, Luca Weiss wrote: This device has a vibrator attached to the CAMSS_GP0_CLK, use clk-pwm and pwm-vibrator to make the vibrator work. Signed-off-by: Luca Weiss --- now your mainlined smartwatch can wake you up! Reviewed-by: Konrad Dybcio Konrad
Re: [PATCH] arm64: dts: qcom: sm6350: Add tsens thermal zones
On 1/24/24 16:31, Luca Weiss wrote: Add the definitions for the various thermal zones found on the SM6350 SoC. Hooking up GPU and CPU cooling can limit the clock speeds there to reduce the temperature again to good levels. Most thermal zones only have one critical temperature configured at 125°C which can be mostly considered a placeholder until those zones can be hooked up to cooling. Signed-off-by: Luca Weiss --- [...] + cpuss0-thermal { + polling-delay-passive = <0>; + polling-delay = <0>; + + thermal-sensors = <&tsens0 7>; cpuss0-thermal and cpuss1-thermal are very likely the sensors for cluster0/1; can you test that out, perhaps with core pinning + stress? You can then assign multiple cpu cooling devices. LGTM otherwise! Konrad
Re: [PATCH v6 32/36] fprobe: Rewrite fprobe on function-graph tracer
On Fri, Jan 12, 2024 at 07:17:06PM +0900, Masami Hiramatsu (Google) wrote: SNIP > * Register @fp to ftrace for enabling the probe on the address given by > @addrs. > @@ -298,23 +547,27 @@ EXPORT_SYMBOL_GPL(register_fprobe); > */ > int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num) > { > - int ret; > - > - if (!fp || !addrs || num <= 0) > - return -EINVAL; > - > - fprobe_init(fp); > + struct fprobe_hlist *hlist_array; > + int ret, i; > > - ret = ftrace_set_filter_ips(&fp->ops, addrs, num, 0, 0); > + ret = fprobe_init(fp, addrs, num); > if (ret) > return ret; > > - ret = fprobe_init_rethook(fp, num); > - if (!ret) > - ret = register_ftrace_function(&fp->ops); > + mutex_lock(&fprobe_mutex); > + > + hlist_array = fp->hlist_array; > + ret = fprobe_graph_add_ips(addrs, num); so fprobe_graph_add_ips registers the ftrace_ops and actually starts the tracing.. and in the code below we prepare fprobe data that is checked in the ftrace_ops callback.. should we do this this earlier before calling fprobe_graph_add_ips/register_ftrace_graph? jirka > + if (!ret) { > + add_fprobe_hash(fp); > + for (i = 0; i < hlist_array->size; i++) > + insert_fprobe_node(&hlist_array->array[i]); > + } > + mutex_unlock(&fprobe_mutex); > > if (ret) > fprobe_fail_cleanup(fp); > + > return ret; > } > EXPORT_SYMBOL_GPL(register_fprobe_ips); > @@ -352,14 +605,13 @@ EXPORT_SYMBOL_GPL(register_fprobe_syms); SNIP
Re: [PATCH v2 00/41] filelock: split struct file_lock into file_lock and file_lease structs
On Thu, Jan 25, 2024 at 05:42:41AM -0500, Jeff Layton wrote: > Long ago, file locks used to hang off of a singly-linked list in struct > inode. Because of this, when leases were added, they were added to the > same list and so they had to be tracked using the same sort of > structure. > > Several years ago, we added struct file_lock_context, which allowed us > to use separate lists to track different types of file locks. Given > that, leases no longer need to be tracked using struct file_lock. > > That said, a lot of the underlying infrastructure _is_ the same between > file leases and locks, so we can't completely separate everything. > > This patchset first splits a group of fields used by both file locks and > leases into a new struct file_lock_core, that is then embedded in struct > file_lock. Coccinelle was then used to convert a lot of the callers to > deal with the move, with the remaining 25% or so converted by hand. > > It then converts several internal functions in fs/locks.c to work > with struct file_lock_core. Lastly, struct file_lock is split into > struct file_lock and file_lease, and the lease-related APIs converted to > take struct file_lease. > > After the first few patches (which I left split up for easier review), > the set should be bisectable. I'll plan to squash the first few > together to make sure the resulting set is bisectable before merge. > > Finally, I left the coccinelle scripts I used in tree. I had heard it > was preferable to merge those along with the patches that they > generate, but I wasn't sure where they go. I can either move those to a > more appropriate location or we can just drop that commit if it's not > needed. > > Signed-off-by: Jeff Layton v2 looks nicer. I would add a few list handling primitives, as I see enough instances of list_for_each_entry, list_for_each_entry_safe, list_first_entry, and list_first_entry_or_null on fl_core.flc_list to make it worth having those. Also, there doesn't seem to be benefit for API consumers to have to understand the internal structure of struct file_lock/lease to reach into fl_core. Having accessor functions for common fields like fl_type and fl_flags could be cleaner. 
For the series: Reviewed-by: Chuck Lever For the nfsd and lockd parts: Acked-by: Chuck Lever > --- > Changes in v2: > - renamed file_lock_core fields to have "flc_" prefix > - used macros to more easily do the change piecemeal > - broke up patches into per-subsystem ones > - Link to v1: > https://lore.kernel.org/r/20240116-flsplit-v1-0-c9d0f4370...@kernel.org > > --- > Jeff Layton (41): > filelock: rename some fields in tracepoints > filelock: rename fl_pid variable in lock_get_status > dlm: rename fl_flags variable in dlm_posix_unlock > nfs: rename fl_flags variable in nfs4_proc_unlck > nfsd: rename fl_type and fl_flags variables in nfsd4_lock > lockd: rename fl_flags and fl_type variables in nlmclnt_lock > 9p: rename fl_type variable in v9fs_file_do_lock > afs: rename fl_type variable in afs_next_locker > filelock: drop the IS_* macros > filelock: split common fields into struct file_lock_core > filelock: add coccinelle scripts to move fields to struct file_lock_core > filelock: have fs/locks.c deal with file_lock_core directly > filelock: convert some internal functions to use file_lock_core instead > filelock: convert more internal functions to use file_lock_core > filelock: make posix_same_owner take file_lock_core pointers > filelock: convert posix_owner_key to take file_lock_core arg > filelock: make locks_{insert,delete}_global_locks take file_lock_core > arg > filelock: convert locks_{insert,delete}_global_blocked > filelock: make __locks_delete_block and __locks_wake_up_blocks take > file_lock_core > filelock: convert __locks_insert_block, conflict and deadlock checks to > use file_lock_core > filelock: convert fl_blocker to file_lock_core > filelock: clean up locks_delete_block internals > filelock: reorganize locks_delete_block and __locks_insert_block > filelock: make assign_type helper take a file_lock_core pointer > filelock: convert locks_wake_up_blocks to take a file_lock_core pointer > filelock: convert locks_insert_lock_ctx and locks_delete_lock_ctx > filelock: convert locks_translate_pid to take file_lock_core > filelock: convert seqfile handling to use file_lock_core > 9p: adapt to breakup of struct file_lock > afs: adapt to breakup of struct file_lock > ceph: adapt to breakup of struct file_lock > dlm: adapt to breakup of struct file_lock > gfs2: adapt to breakup of struct file_lock > lockd: adapt to breakup of struct file_lock > nfs: adapt to breakup of struct file_lock > nfsd: adapt to breakup of struct file_lock > ocfs2: adapt to breakup of struct file_lock > smb/client: adapt to breakup
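For illustration, the kind of helpers being suggested might look like the sketch below; the helper names and the fl_core member spelling are guesses based on this thread, not code from the posted series:

/* Possible accessors and an iterator over struct file_lock_core fields. */
static inline unsigned char lock_type(const struct file_lock *fl)
{
	return fl->fl_core.flc_type;
}

static inline bool lock_is_unlock(const struct file_lock *fl)
{
	return fl->fl_core.flc_type == F_UNLCK;
}

/* Walk locks on a list without open-coding the flc_list member. */
#define for_each_file_lock(_fl, _head) \
	list_for_each_entry(_fl, _head, fl_core.flc_list)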
Re: [PATCH v6 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Fri, Jan 12, 2024 at 07:10:50PM +0900, Masami Hiramatsu (Google) wrote: > Hi, > > Here is the 6th version of the series to re-implement the fprobe on > function-graph tracer. The previous version is; > > https://lore.kernel.org/all/170290509018.220107.1347127510564358608.stgit@devnote2/ > > This version fixes use-after-unregister bug and arm64 stack unwinding > bug [13/36], add an improvement for multiple interrupts during push > operation[20/36], keep SAVE_REGS until BPF and fprobe_event using > ftrace_regs[26/36], also reorder the patches[30/36][31/36] so that new > fprobe can switch to SAVE_ARGS[32/36] safely. > This series also temporarily adds a DIRECT_CALLS bugfix[1/36], which > should be pushed separatedly as a stable bugfix. > > There are some TODOs: > - Add s390x and loongarch support to fprobe (multiple fgraph). > - Fix to get the symbol address from ftrace entry address on arm64. >(This should be done in BPF trace event) > - Cleanup code, rename some terms(offset/index) and FGRAPH_TYPE_BITMAP >part should be merged to FGRAPH_TYPE_ARRAY patch. hi, I'm getting kasan bugs below when running bpf selftests on top of this patchset.. I think it's probably the reason I see failures in some bpf kprobe_multi/fprobe tests so far I couldn't find the reason.. still checking ;-) jirka --- [ 507.585913][ T697] BUG: KASAN: slab-out-of-bounds in ftrace_push_return_trace.isra.0+0x346/0x370 [ 507.586747][ T697] Write of size 8 at addr 888148193ff8 by task test_progs/697 [ 507.587460][ T697] [ 507.587713][ T697] CPU: 2 PID: 697 Comm: test_progs Tainted: G OE 6.7.0+ #309 d8e2cbcdc10865c6eb2d28ed0cbf958842aa75a8 [ 507.588821][ T697] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 [ 507.589681][ T697] Call Trace: [ 507.590044][ T697] [ 507.590357][ T697] dump_stack_lvl+0xf6/0x180 [ 507.590807][ T697] print_report+0xc4/0x610 [ 507.591259][ T697] ? fixup_red_left+0x5/0x20 [ 507.591781][ T697] kasan_report+0xbe/0xf0 [ 507.592241][ T697] ? ftrace_push_return_trace.isra.0+0x346/0x370 [ 507.592928][ T697] ? ftrace_push_return_trace.isra.0+0x346/0x370 [ 507.593535][ T697] ? __pfx_text_poke_loc_init+0x10/0x10 [ 507.594076][ T697] ? ftrace_replace_code+0x17a/0x230 [ 507.594586][ T697] ftrace_push_return_trace.isra.0+0x346/0x370 [ 507.595192][ T697] ? __pfx_text_poke_loc_init+0x10/0x10 [ 507.595747][ T697] function_graph_enter_ops+0xbb/0x2d0 [ 507.596271][ T697] ? ftrace_replace_code+0x17a/0x230 [ 507.596784][ T697] ? __pfx_function_graph_enter_ops+0x10/0x10 [ 507.597353][ T697] ? preempt_count_sub+0x14/0xc0 [ 507.598576][ T697] ? __pfx_text_poke_loc_init+0x10/0x10 [ 507.599145][ T697] ? __pfx_fuse_sync_fs+0x10/0x10 [ 507.599718][ T697] ftrace_graph_func+0x142/0x270 [ 507.600293][ T697] ? __pfx_text_poke_loc_init+0x10/0x10 [ 507.600892][ T697] ? __pfx_fuse_conn_put.part.0+0x10/0x10 [ 507.601484][ T697] 0xa0560097 [ 507.602067][ T697] ? __pfx_fuse_conn_put.part.0+0x10/0x10 [ 507.602715][ T697] ? text_poke_loc_init+0x5/0x2e0 [ 507.603288][ T697] ? __pfx_fuse_conn_put.part.0+0x10/0x10 [ 507.603923][ T697] text_poke_loc_init+0x5/0x2e0 [ 507.604468][ T697] ftrace_replace_code+0x17a/0x230 [ 507.605071][ T697] ftrace_modify_all_code+0x131/0x1a0 [ 507.605663][ T697] ftrace_startup+0x10b/0x210 [ 507.606200][ T697] register_ftrace_graph+0x313/0x8a0 [ 507.606805][ T697] ? register_ftrace_graph+0x3fe/0x8a0 [ 507.607427][ T697] register_fprobe_ips.part.0+0x25a/0x3f0 [ 507.608090][ T697] bpf_kprobe_multi_link_attach+0x49e/0x850 [ 507.608781][ T697] ? 
__pfx_bpf_kprobe_multi_link_attach+0x10/0x10 [ 507.609500][ T697] ? __debug_check_no_obj_freed+0x1d8/0x3a0 [ 507.610194][ T697] ? __fget_light+0x96/0xe0 [ 507.610741][ T697] __sys_bpf+0x307a/0x3180 [ 507.611286][ T697] ? __pfx___sys_bpf+0x10/0x10 [ 507.611838][ T697] ? __kasan_slab_free+0x12d/0x1c0 [ 507.612434][ T697] ? audit_log_exit+0x8e0/0x1960 [ 507.613003][ T697] ? kmem_cache_free+0x19d/0x460 [ 507.613644][ T697] ? rcu_is_watching+0x34/0x60 [ 507.614202][ T697] ? lockdep_hardirqs_on_prepare+0xe/0x250 [ 507.614865][ T697] ? seqcount_lockdep_reader_access.constprop.0+0x105/0x120 [ 507.615662][ T697] ? seqcount_lockdep_reader_access.constprop.0+0xb2/0x120 [ 507.616431][ T697] __x64_sys_bpf+0x44/0x60 [ 507.616940][ T697] do_syscall_64+0x87/0x1b0 [ 507.617495][ T697] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 507.618179][ T697] RIP: 0033:0x7ff2edca6b4d [ 507.618745][ T697] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 92 0c 00 f7 d8 64 89 01 48 [ 507.620863][ T697] RSP: 002b:7ffe2e58a8f8 EFLAGS: 0206 ORIG_RAX: 0141 [ 507.621749][ T697] RAX: ffda RBX: 7ffe2e58b018 RCX: 7ff2edca6b4d
Re: [PATCH v12 3/6] tracing: Add snapshot refcount
Hi Masami, Thanks for taking the time to look at those changes. On Thu, Jan 25, 2024 at 12:11:49AM +0900, Masami Hiramatsu wrote: > On Tue, 23 Jan 2024 11:07:54 + > Vincent Donnefort wrote: > > [...] > > @@ -6592,8 +6641,11 @@ int tracing_set_tracer(struct trace_array *tr, const > > char *buf) > > > > if (t->init) { > > ret = tracer_init(t, tr); > > - if (ret) > > + if (ret) { > > + if (t->use_max_tr) > > + tracing_disarm_snapshot_locked(tr); > > This part is out of CONFIG_TRACER_MAX_TRACE, so it may cause a compile error > if CONFIG_TRACER_MAX_TRACE is not set. Duh, yes it must depends on TRACER_MAX_TRACE :-\ > > > goto out; > > + } > > } > > > > tr->current_trace = t; > [...] > > diff --git a/kernel/trace/trace_events_trigger.c > > b/kernel/trace/trace_events_trigger.c > > index 46439e3bcec4..d41bf64741e2 100644 > > --- a/kernel/trace/trace_events_trigger.c > > +++ b/kernel/trace/trace_events_trigger.c > > @@ -597,20 +597,9 @@ static int register_trigger(char *glob, > > return ret; > > } > > > > -/** > > - * unregister_trigger - Generic event_command @unreg implementation > > - * @glob: The raw string used to register the trigger > > - * @test: Trigger-specific data used to find the trigger to remove > > - * @file: The trace_event_file associated with the event > > - * > > - * Common implementation for event trigger unregistration. > > - * > > - * Usually used directly as the @unreg method in event command > > - * implementations. > > - */ > > -static void unregister_trigger(char *glob, > > - struct event_trigger_data *test, > > - struct trace_event_file *file) > > OK, so __unregister_trigger returns true if data exists, but > unregister_trigger() ignores results. (I want some comment here) Will add something for the __unregister_trigger flavour. > > > +static bool __unregister_trigger(char *glob, > > +struct event_trigger_data *test, > > +struct trace_event_file *file) > > { > > struct event_trigger_data *data = NULL, *iter; > > > > @@ -626,8 +615,32 @@ static void unregister_trigger(char *glob, > > } > > } > > > > - if (data && data->ops->free) > > - data->ops->free(data); > > + if (data) { > > + if (data->ops->free) > > + data->ops->free(data); > > + > > + return true; > > + } > > + > > + return false; > > +} > > + > > +/** > > + * unregister_trigger - Generic event_command @unreg implementation > > + * @glob: The raw string used to register the trigger > > + * @test: Trigger-specific data used to find the trigger to remove > > + * @file: The trace_event_file associated with the event > > + * > > + * Common implementation for event trigger unregistration. > > + * > > + * Usually used directly as the @unreg method in event command > > + * implementations. > > + */ > > +static void unregister_trigger(char *glob, > > + struct event_trigger_data *test, > > + struct trace_event_file *file) > > +{ > > + __unregister_trigger(glob, test, file); > > } > > > > /* > > @@ -1470,12 +1483,20 @@ register_snapshot_trigger(char *glob, > > struct event_trigger_data *data, > > struct trace_event_file *file) > > { > > - if (tracing_alloc_snapshot_instance(file->tr) != 0) > > + if (tracing_arm_snapshot(file->tr)) > > return 0; > > BTW, is this return value correct? It seems that the register_*_trigger() > will return error code when it fails. 
It should indeed be ret = tracing_arm_snapshot() if (ret) return ret; > > Thanks, > > > > > return register_trigger(glob, data, file); > > } > > > > +static void unregister_snapshot_trigger(char *glob, > > + struct event_trigger_data *data, > > + struct trace_event_file *file) > > +{ > > + if (__unregister_trigger(glob, data, file)) > > + tracing_disarm_snapshot(file->tr); > > +} > > + > > static int > > snapshot_trigger_print(struct seq_file *m, struct event_trigger_data *data) > > { > > @@ -1508,7 +1529,7 @@ static struct event_command trigger_snapshot_cmd = { > > .trigger_type = ETT_SNAPSHOT, > > .parse = event_trigger_parse, > > .reg= register_snapshot_trigger, > > - .unreg = unregister_trigger, > > + .unreg = unregister_snapshot_trigger, > > .get_trigger_ops= snapshot_get_trigger_ops, > > .set_filter = set_trigger_filter, > > }; > > -- > > 2.43.0.429.g432eaa2c6b-goog > > > > > -- > Masami Hiramatsu (Google)
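Spelled out, the correction described above would make register_snapshot_trigger() look roughly like this (a sketch of the fix agreed in this thread, not a posted patch):

static int
register_snapshot_trigger(char *glob,
			  struct event_trigger_data *data,
			  struct trace_event_file *file)
{
	int ret;

	ret = tracing_arm_snapshot(file->tr);
	if (ret)
		return ret;

	return register_trigger(glob, data, file);
}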
Re: Reply: general protection fault in ath9k_wmi_event_tasklet
Kalle Valo writes: > Toke Høiland-Jørgensen writes: > >> "Ubisectech Sirius" writes: >> Hmm, so from eyeballing the code in question, this looks like it is another initialisation race along the lines of the one fixed in commit: 8b3046abc99e ("ath9k_htc: fix NULL pointer dereference at ath9k_htc_tx_get_packet()") Could you please test the patch below and see if you can still reproduce this issue with that applied? >>> Hello. >>> I can not reproduce the issue on the Linux with the patch applied >>> Ubisectech Sirius Team >> >> Great, thank you for testing! I'll send a proper patch. How would you >> like to be credited with reporting? Just as 'Ubisectech Sirius >> ' ? > > Ubisectech, please CC linux-wireless on any wireless issues and don't > use HTML format in emails. More info in the wiki below. Doh, didn't even notice that linux-wireless was not in Cc. Sorry about that! :( -Toke
Re: 回复:general protection fault in ath9k_wmi_event_tasklet
Toke Høiland-Jørgensen writes: > "Ubisectech Sirius" writes: > >>>Hmm, so from eyeballing the code in question, this looks like it is >>>another initialisation race along the lines of the one fixed in commit: >>>8b3046abc99e ("ath9k_htc: fix NULL pointer dereference at >>> ath9k_htc_tx_get_packet()") >>>Could you please test the patch below and see if you can still reproduce >>>this issue with that applied? >> Hello. >> I can not reproduce the issue on the Linux with the patch applied >> Ubisectech Sirius Team > > Great, thank you for testing! I'll send a proper patch. How would you > like to be credited with reporting? Just as 'Ubisectech Sirius > ' ? Ubisectech, please CC linux-wireless on any wireless issues and don't use HTML format in emails. More info in the wiki below. -- https://patchwork.kernel.org/project/linux-wireless/list/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: 回复:general protection fault in ath9k_wmi_event_tasklet
"Ubisectech Sirius" writes: >>Hmm, so from eyeballing the code in question, this looks like it is >>another initialisation race along the lines of the one fixed in commit: >>8b3046abc99e ("ath9k_htc: fix NULL pointer dereference at >>ath9k_htc_tx_get_packet()") >>Could you please test the patch below and see if you can still reproduce >>this issue with that applied? > Hello. > I can not reproduce the issue on the Linux with the patch applied > Ubisectech Sirius Team Great, thank you for testing! I'll send a proper patch. How would you like to be credited with reporting? Just as 'Ubisectech Sirius ' ? -Toke
RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> -Original Message- > From: Jason Wang [mailto:jasow...@redhat.com] > Sent: Thursday, January 25, 2024 12:49 PM > To: wangyunjian > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org; > da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org; > linux-kernel@vger.kernel.org; k...@vger.kernel.org; > virtualizat...@lists.linux.dev; xudingke > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang > wrote: > > > > Now the zero-copy feature of AF_XDP socket is supported by some > > drivers, which can reduce CPU utilization on the xdp program. > > This patch set allows tun to support AF_XDP Rx zero-copy feature. > > > > This patch tries to address this by: > > - Use peek_len to consume a xsk->desc and get xsk->desc length. > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe empty. > > So add a check for empty vq's array in vhost_net_buf_produce(). > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support > > - add tun_put_user_desc function to copy the Rx data to VM > > Code explains themselves, let's explain why you need to do this. > > 1) why you want to use peek_len > 2) for "vq's array", what does it mean? > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I guess you > meant TX zerocopy instead of RX (as I don't see codes for > RX?) OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX zerocopy from the view of vhost-net. > > A big question is how could you handle GSO packets from userspace/guests? Now by disabling VM's TSO and csum feature. XDP does not support GSO packets. However, this feature can be added once XDP supports it in the future. > > > > > Signed-off-by: Yunjian Wang > > --- > > drivers/net/tun.c | 165 > +++- > > drivers/vhost/net.c | 18 +++-- > > 2 files changed, 176 insertions(+), 7 deletions(-) > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c index > > afa5497f7c35..248b0f8e07d1 100644 > > --- a/drivers/net/tun.c > > +++ b/drivers/net/tun.c > > @@ -77,6 +77,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -145,6 +146,10 @@ struct tun_file { > > struct tun_struct *detached; > > struct ptr_ring tx_ring; > > struct xdp_rxq_info xdp_rxq; > > + struct xdp_desc desc; > > + /* protects xsk pool */ > > + spinlock_t pool_lock; > > + struct xsk_buff_pool *pool; > > }; > > > > struct tun_page { > > @@ -208,6 +213,8 @@ struct tun_struct { > > struct bpf_prog __rcu *xdp_prog; > > struct tun_prog __rcu *steering_prog; > > struct tun_prog __rcu *filter_prog; > > + /* tracks AF_XDP ZC enabled queues */ > > + unsigned long *af_xdp_zc_qps; > > struct ethtool_link_ksettings link_ksettings; > > /* init args */ > > struct file *file; > > @@ -795,6 +802,8 @@ static int tun_attach(struct tun_struct *tun, > > struct file *file, > > > > tfile->queue_index = tun->numqueues; > > tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN; > > + tfile->desc.len = 0; > > + tfile->pool = NULL; > > > > if (tfile->detached) { > > /* Re-attach detached tfile, updating XDP queue_index > > */ @@ -989,6 +998,13 @@ static int tun_net_init(struct net_device *dev) > > return err; > > } > > > > + tun->af_xdp_zc_qps = bitmap_zalloc(MAX_TAP_QUEUES, > GFP_KERNEL); > > + if (!tun->af_xdp_zc_qps) { > > + security_tun_dev_free_security(tun->security); > > + free_percpu(dev->tstats); > > + return -ENOMEM; > > + } > > + > > tun_flow_init(tun); > > > > dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST | @@ > -1009,6 > > +1025,7 @@ 
static int tun_net_init(struct net_device *dev) > > tun_flow_uninit(tun); > > security_tun_dev_free_security(tun->security); > > free_percpu(dev->tstats); > > + bitmap_free(tun->af_xdp_zc_qps); > > return err; > > } > > return 0; > > @@ -1222,11 +1239,77 @@ static int tun_xdp_set(struct net_device *dev, > struct bpf_prog *prog, > > return 0; > > } > > > > +static int tun_xsk_pool_enable(struct net_device *netdev, > > + struct xsk_buff_pool *pool, > > + u16 qid) { > > + struct tun_struct *tun = netdev_priv(netdev); > > + struct tun_file *tfile; > > + unsigned long flags; > > + > > + rcu_read_lock(); > > + tfile = rtnl_dereference(tun->tfiles[qid]); > > + if (!tfile) { > > + rcu_read_unlock(); > > + return -ENODEV; > > + } > > + > > + spin_lock_irqsave(&tfile->pool_lock, flags); > > + xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq); > > + tfile-
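To make the peek_len point above concrete, here is a minimal sketch of how tun could peek one AF_XDP descriptor and report its length, reusing the tfile->pool, tfile->pool_lock and tfile->desc fields added by the patch; tun_peek_xsk() is a made-up name and this is not the submitted implementation:

	static int tun_peek_xsk(struct tun_file *tfile)
	{
		struct xsk_buff_pool *pool;
		unsigned long flags;
		int len = 0;

		spin_lock_irqsave(&tfile->pool_lock, flags);
		pool = tfile->pool;
		/* Peek the next descriptor queued by the AF_XDP socket and
		 * cache it in tfile->desc for the actual copy-out later.
		 */
		if (pool && xsk_tx_peek_desc(pool, &tfile->desc))
			len = tfile->desc.len;
		spin_unlock_irqrestore(&tfile->pool_lock, flags);

		return len;
	}

Once the cached descriptor has been copied out to the guest, it would be handed back to the pool (for example via xsk_tx_release()) so the userspace producer can make progress.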
Re: [RFC PATCH 2/5] mfd: add 88pm88x driver
On Sun, 17 Dec 2023, Karel Balej wrote: > From: Karel Balej > > Marvell 88PM880 and 8PM886 are two similar PMICs with mostly matching > register mapping. They provide various functions such as onkey, battery, > charger and regulators. > > Add support for 88PM886 found for instance in the samsung,coreprimevelte > smartphone with which this was tested. Support for 88PM880 is not > implemented here but should be straightforward to add. > > Implement only the most basic support omitting the currently unused > registers and I2C subclients which should thus be added with the > respective subdevices. However, add support for the onkey already. > > Signed-off-by: Karel Balej > --- > drivers/mfd/88pm88x.c | 199 > drivers/mfd/Kconfig | 11 ++ > drivers/mfd/Makefile| 1 + > include/linux/mfd/88pm88x.h | 60 +++ > 4 files changed, 271 insertions(+) > create mode 100644 drivers/mfd/88pm88x.c > create mode 100644 include/linux/mfd/88pm88x.h > > diff --git a/drivers/mfd/88pm88x.c b/drivers/mfd/88pm88x.c > new file mode 100644 > index ..5db6c65b667d > --- /dev/null > +++ b/drivers/mfd/88pm88x.c > @@ -0,0 +1,199 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +#include > +#include > +#include > +#include > +#include > +#include Alphabetical > +#include > + > +/* interrupt status registers */ Use correct grammar in comments, including capital letters. - Applies throughout The comment is not required - we can see what they are from the nomenclature. > +#define PM88X_REG_INT_STATUS10x05 > + > +#define PM88X_REG_INT_ENA_1 0x0a > +#define PM88X_INT_ENA1_ONKEY BIT(0) > + > +enum pm88x_irq_number { > + PM88X_IRQ_ONKEY, > + > + PM88X_MAX_IRQ > +}; An enum for a single IRQ? > +static struct regmap_irq pm88x_regmap_irqs[] = { > + REGMAP_IRQ_REG(PM88X_IRQ_ONKEY, 0, PM88X_INT_ENA1_ONKEY), > +}; > + > +static struct regmap_irq_chip pm88x_regmap_irq_chip = { > + .name = "88pm88x", > + .irqs = pm88x_regmap_irqs, > + .num_irqs = ARRAY_SIZE(pm88x_regmap_irqs), > + .num_regs = 4, > + .status_base = PM88X_REG_INT_STATUS1, > + .ack_base = PM88X_REG_INT_STATUS1, > + .unmask_base = PM88X_REG_INT_ENA_1, > +}; > + > +static struct reg_sequence pm886_presets[] = { > + /* disable watchdog */ > + REG_SEQ0(PM88X_REG_WDOG, 0x01), Easier to read if you place spaces between them. > + /* GPIO1: DVC, GPIO0: input */ > + REG_SEQ0(PM88X_REG_GPIO_CTRL1, 0x40), Shouldn't you set these up using Pintrl? > + /* GPIO2: input */ > + REG_SEQ0(PM88X_REG_GPIO_CTRL2, 0x00), > + /* DVC2, DVC1 */ Please unify all of the comments. They all use a different structure. > + REG_SEQ0(PM88X_REG_GPIO_CTRL3, 0x44), > + /* GPIO5V_1:input, GPIO5V_2: input */ > + REG_SEQ0(PM88X_REG_GPIO_CTRL4, 0x00), > + /* output 32 kHz from XO */ > + REG_SEQ0(PM88X_REG_AON_CTRL2, 0x2a), > + /* OSC_FREERUN = 1, to lock FLL */ > + REG_SEQ0(PM88X_REG_BK_OSC_CTRL1, 0x0f), > + /* XO_LJ = 1, enable low jitter for 32 kHz */ > + REG_SEQ0(PM88X_REG_LOWPOWER2, 0x20), > + /* OV_VSYS and UV_VSYS1 comparators on VSYS disabled, VSYS_OVER_TH : > 5.6V */ > + REG_SEQ0(PM88X_REG_LOWPOWER4, 0xc8), > + /* set the duty cycle of charger DC/DC to max */ > + REG_SEQ0(PM88X_REG_BK_OSC_CTRL3, 0xc0), These all looks like they should be handled in their respective drivers? "patch"ing these in seems like a hack. 
> +}; Why this instead of > +static struct resource onkey_resources[] = { > + DEFINE_RES_IRQ_NAMED(PM88X_IRQ_ONKEY, "88pm88x-onkey"), > +}; > + > +static struct mfd_cell pm88x_devs[] = { > + { > + .name = "88pm88x-onkey", > + .num_resources = ARRAY_SIZE(onkey_resources), > + .resources = onkey_resources, > + .id = -1, > + }, > +}; It's not an MFD if it only supports a single device. > +static struct pm88x_data pm886_a1_data = { > + .whoami = PM886_A1_WHOAMI, > + .presets = pm886_presets, > + .num_presets = ARRAY_SIZE(pm886_presets), > +}; Just pass the device ID through DT's .data, then match on that instead of passing pointer to random data structures. > +static const struct regmap_config pm88x_i2c_regmap = { > + .reg_bits = 8, > + .val_bits = 8, > + .max_register = 0xfe, Define this please. > +}; > + > +static int pm88x_power_off_handler(struct sys_off_data *data) 'data' is a terrible variable name. Please change throughout. > +{ > + struct pm88x_chip *chip = data->cb_data; > + int ret; > + > + ret = regmap_update_bits(chip->regmaps[PM88X_REGMAP_BASE], > PM88X_REG_MISC_CONFIG1, > + PM88X_SW_PDOWN, PM88X_SW_PDOWN); > + if (ret) { > + dev_err(&chip->client->dev, "Failed to power off the device: > %d\n", ret); > + return NOTIFY_BAD; > + } > + return NOTIFY_DONE;
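On the ".data" suggestion above, the usual pattern is to store a plain chip identifier in the OF match table and look it up in probe, rather than pointing .data at a driver-private structure. A minimal sketch, with an assumed compatible string and enum value (not taken from the submitted driver):

	#include <linux/i2c.h>
	#include <linux/module.h>
	#include <linux/mod_devicetable.h>
	#include <linux/property.h>

	enum pm88x_chip_id {
		PM886_A1 = 1,
	};

	static const struct of_device_id pm88x_of_match[] = {
		/* compatible string assumed for illustration */
		{ .compatible = "marvell,88pm886-a1", .data = (void *)PM886_A1 },
		{ }
	};
	MODULE_DEVICE_TABLE(of, pm88x_of_match);

	static int pm88x_probe(struct i2c_client *client)
	{
		enum pm88x_chip_id id =
			(enum pm88x_chip_id)(uintptr_t)device_get_match_data(&client->dev);

		switch (id) {
		case PM886_A1:
			/* select the 88PM886 A1 WHOAMI, presets, regmap, ... */
			break;
		default:
			return -ENODEV;
		}

		return 0;
	}

The WHOAMI value, preset sequence and similar per-variant data can then live in a driver-local table indexed by the enum instead of being reached through the match data pointer.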
RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> -Original Message- > From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com] > Sent: Thursday, January 25, 2024 3:05 AM > To: wangyunjian ; m...@redhat.com; > willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org; > da...@davemloft.net; magnus.karls...@intel.com > Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org; > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke > ; wangyunjian > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support > > Yunjian Wang wrote: > > Now the zero-copy feature of AF_XDP socket is supported by some > > drivers, which can reduce CPU utilization on the xdp program. > > This patch set allows tun to support AF_XDP Rx zero-copy feature. > > > > This patch tries to address this by: > > - Use peek_len to consume a xsk->desc and get xsk->desc length. > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe empty. > > So add a check for empty vq's array in vhost_net_buf_produce(). > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support > > - add tun_put_user_desc function to copy the Rx data to VM > > > > Signed-off-by: Yunjian Wang > > I don't fully understand the higher level design of this feature yet. This feature can reduce the memory copy for the xdp program. > > But some initial comments at the code level. > > > --- > > drivers/net/tun.c | 165 > +++- > > drivers/vhost/net.c | 18 +++-- > > 2 files changed, 176 insertions(+), 7 deletions(-) > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c index > > afa5497f7c35..248b0f8e07d1 100644 > > --- a/drivers/net/tun.c > > +++ b/drivers/net/tun.c > > @@ -77,6 +77,7 @@ > > #include > > #include > > #include > > +#include > > > > #include > > #include > > @@ -145,6 +146,10 @@ struct tun_file { > > struct tun_struct *detached; > > struct ptr_ring tx_ring; > > struct xdp_rxq_info xdp_rxq; > > + struct xdp_desc desc; > > + /* protects xsk pool */ > > + spinlock_t pool_lock; > > + struct xsk_buff_pool *pool; > > }; > > > > struct tun_page { > > @@ -208,6 +213,8 @@ struct tun_struct { > > struct bpf_prog __rcu *xdp_prog; > > struct tun_prog __rcu *steering_prog; > > struct tun_prog __rcu *filter_prog; > > + /* tracks AF_XDP ZC enabled queues */ > > + unsigned long *af_xdp_zc_qps; > > struct ethtool_link_ksettings link_ksettings; > > /* init args */ > > struct file *file; > > @@ -795,6 +802,8 @@ static int tun_attach(struct tun_struct *tun, > > struct file *file, > > > > tfile->queue_index = tun->numqueues; > > tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN; > > + tfile->desc.len = 0; > > + tfile->pool = NULL; > > > > if (tfile->detached) { > > /* Re-attach detached tfile, updating XDP queue_index */ @@ > > -989,6 > > +998,13 @@ static int tun_net_init(struct net_device *dev) > > return err; > > } > > > > + tun->af_xdp_zc_qps = bitmap_zalloc(MAX_TAP_QUEUES, GFP_KERNEL); > > + if (!tun->af_xdp_zc_qps) { > > + security_tun_dev_free_security(tun->security); > > + free_percpu(dev->tstats); > > + return -ENOMEM; > > + } > > + > > tun_flow_init(tun); > > > > dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST | @@ -1009,6 > > +1025,7 @@ static int tun_net_init(struct net_device *dev) > > tun_flow_uninit(tun); > > security_tun_dev_free_security(tun->security); > > free_percpu(dev->tstats); > > + bitmap_free(tun->af_xdp_zc_qps); > > Please release state in inverse order of acquire. Will done. 
> > > return err; > > } > > return 0; > > @@ -1222,11 +1239,77 @@ static int tun_xdp_set(struct net_device *dev, > struct bpf_prog *prog, > > return 0; > > } > > > > +static int tun_xsk_pool_enable(struct net_device *netdev, > > + struct xsk_buff_pool *pool, > > + u16 qid) > > +{ > > + struct tun_struct *tun = netdev_priv(netdev); > > + struct tun_file *tfile; > > + unsigned long flags; > > + > > + rcu_read_lock(); > > + tfile = rtnl_dereference(tun->tfiles[qid]); > > + if (!tfile) { > > + rcu_read_unlock(); > > + return -ENODEV; > > + } > > No need for rcu_read_lock with rtnl_dereference. > > Consider ASSERT_RTNL() if unsure whether this patch could be reached without > the rtnl held. Will done. > > > + > > + spin_lock_irqsave(&tfile->pool_lock, flags); > > + xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq); > > + tfile->pool = pool; > > + spin_unlock_irqrestore(&tfile->pool_lock, flags); > > + > > + rcu_read_unlock(); > > + set_bit(qid, tun->af_xdp_zc_qps); > > What are the concurrency semantics: there's a spinlock to make the update to > xdp_rxq and pool a critical section, but the bitmap is not part of this? > Please > also then document why the irqsave. The bitmap indicates whether the XDP pool is e