[PATCHv2] kmemleak: Add option to print warnings to dmesg
Currently, kmemleak only prints the number of suspected leaks to dmesg but requires the user to read a debugfs file to get the actual stack traces of the objects' allocation points. Add an option to print the full object information to dmesg too. This allows easier integration of kmemleak into automated test systems since those kind of systems presumably already save kernel logs. Signed-off-by: Vincent Whitchurch --- v2: Print hex dump too. lib/Kconfig.debug | 9 + mm/kmemleak.c | 37 ++--- 2 files changed, 39 insertions(+), 7 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ab1b599202bc..9a3fc905b8bd 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -593,6 +593,15 @@ config DEBUG_KMEMLEAK_DEFAULT_OFF Say Y here to disable kmemleak by default. It can then be enabled on the command line via kmemleak=on. +config DEBUG_KMEMLEAK_WARN + bool "Print kmemleak object warnings to log buffer" + depends on DEBUG_KMEMLEAK + help + Say Y here to make kmemleak print information about unreferenced + objects (including stacktraces) as warnings to the kernel log buffer. + Otherwise this information is only available by reading the kmemleak + debugfs file. + config DEBUG_STACK_USAGE bool "Stack utilization instrumentation" depends on DEBUG_KERNEL && !IA64 diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 9a085d525bbc..22662715a3dc 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -181,6 +181,7 @@ struct kmemleak_object { /* flag set to not scan the object */ #define OBJECT_NO_SCAN (1 << 2) +#define HEX_PREFIX "" /* number of bytes to print per line; must be 16 or 32 */ #define HEX_ROW_SIZE 16 /* number of bytes to print at a time (1, 2, 4, 8) */ @@ -299,6 +300,25 @@ static void kmemleak_disable(void); kmemleak_disable(); \ } while (0) +#define warn_or_seq_printf(seq, fmt, ...) do {\ + if (seq)\ + seq_printf(seq, fmt, ##__VA_ARGS__);\ + else\ + pr_warn(fmt, ##__VA_ARGS__);\ +} while (0) + +static void warn_or_seq_hex_dump(struct seq_file *seq, int prefix_type, +int rowsize, int groupsize, const void *buf, +size_t len, bool ascii) +{ + if (seq) + seq_hex_dump(seq, HEX_PREFIX, prefix_type, rowsize, groupsize, +buf, len, ascii); + else + print_hex_dump(KERN_WARNING, pr_fmt(HEX_PREFIX), prefix_type, + rowsize, groupsize, buf, len, ascii); +} + /* * Printing of the objects hex dump to the seq file. The number of lines to be * printed is limited to HEX_MAX_LINES to prevent seq file spamming. The @@ -314,10 +334,10 @@ static void hex_dump_object(struct seq_file *seq, /* limit the number of lines to HEX_MAX_LINES */ len = min_t(size_t, object->size, HEX_MAX_LINES * HEX_ROW_SIZE); - seq_printf(seq, " hex dump (first %zu bytes):\n", len); + warn_or_seq_printf(seq, " hex dump (first %zu bytes):\n", len); kasan_disable_current(); - seq_hex_dump(seq, "", DUMP_PREFIX_NONE, HEX_ROW_SIZE, -HEX_GROUP_SIZE, ptr, len, HEX_ASCII); + warn_or_seq_hex_dump(seq, DUMP_PREFIX_NONE, HEX_ROW_SIZE, +HEX_GROUP_SIZE, ptr, len, HEX_ASCII); kasan_enable_current(); } @@ -365,17 +385,17 @@ static void print_unreferenced(struct seq_file *seq, int i; unsigned int msecs_age = jiffies_to_msecs(jiffies - object->jiffies); - seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n", + warn_or_seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n", object->pointer, object->size); - seq_printf(seq, " comm \"%s\", pid %d, jiffies %lu (age %d.%03ds)\n", + warn_or_seq_printf(seq, " comm \"%s\", pid %d, jiffies %lu (age %d.%03ds)\n", object->comm, object->pid, object->jiffies, msecs_age / 1000, msecs_age % 1000); hex_dump_object(seq, object); - seq_printf(seq, " backtrace:\n"); + warn_or_seq_printf(seq, " backtrace:\n"); for (i = 0; i < object->trace_len; i++) { void *ptr = (void *)object->trace[i]; - seq_printf(seq, "[<%p>] %pS\n", ptr, ptr); + warn_or_seq_printf(seq, "[<%p>] %pS\n", ptr, ptr); } } @@ -1598,6 +1618,9 @@ static void kmemleak_scan(void) if (unreferenced_object(object
Re: [PATCHv2] kmemleak: Add option to print warnings to dmesg
On Mon, Aug 27, 2018 at 03:16:41PM -0700, Andrew Morton wrote: > On Mon, 27 Aug 2018 10:38:21 +0200 Vincent Whitchurch > wrote: > > > Currently, kmemleak only prints the number of suspected leaks to dmesg > > but requires the user to read a debugfs file to get the actual stack > > traces of the objects' allocation points. Add an option to print the > > full object information to dmesg too. This allows easier integration of > > kmemleak into automated test systems since those kind of systems > > presumably already save kernel logs. > > "presumably" is a bit rubbery. Are you sure this change is sufficienty > useful to justify including it? Do you have use-cases for it? We (like every one else) have automated test infrastructure which test our Linux systems. With this option, running our tests with kmemleak is as simple as enabling kmemleak and this option; the test infrastructure knows how to save kernel logs, which will now include kmemleak reports. Without this option, the test infrastructure needs to be specifically taught to read out the kmemleak debugfs file. Removing this need for special handling makes kmemleak more similar to other kernel debug options (slab debugging, debug objects, etc). > > > --- a/lib/Kconfig.debug > > +++ b/lib/Kconfig.debug > > @@ -593,6 +593,15 @@ config DEBUG_KMEMLEAK_DEFAULT_OFF > > Say Y here to disable kmemleak by default. It can then be enabled > > on the command line via kmemleak=on. > > > > +config DEBUG_KMEMLEAK_WARN > > + bool "Print kmemleak object warnings to log buffer" > > + depends on DEBUG_KMEMLEAK > > + help > > + Say Y here to make kmemleak print information about unreferenced > > + objects (including stacktraces) as warnings to the kernel log buffer. > > + Otherwise this information is only available by reading the kmemleak > > + debugfs file. > > Why add the config option? Why not simply make the change for all > configs? No particular reason other than preserving the current behaviour for existing users. I can remove the config option if Catalin is fine with it. > > > config DEBUG_STACK_USAGE > > bool "Stack utilization instrumentation" > > depends on DEBUG_KERNEL && !IA64 > > diff --git a/mm/kmemleak.c b/mm/kmemleak.c > > index 9a085d525bbc..22662715a3dc 100644 > > --- a/mm/kmemleak.c > > +++ b/mm/kmemleak.c > > @@ -181,6 +181,7 @@ struct kmemleak_object { > > /* flag set to not scan the object */ > > #define OBJECT_NO_SCAN (1 << 2) > > > > +#define HEX_PREFIX "" > > /* number of bytes to print per line; must be 16 or 32 */ > > #define HEX_ROW_SIZE 16 > > /* number of bytes to print at a time (1, 2, 4, 8) */ > > @@ -299,6 +300,25 @@ static void kmemleak_disable(void); > > kmemleak_disable(); \ > > } while (0) > > > > +#define warn_or_seq_printf(seq, fmt, ...) do {\ > > + if (seq)\ > > + seq_printf(seq, fmt, ##__VA_ARGS__);\ > > + else\ > > + pr_warn(fmt, ##__VA_ARGS__);\ > > +} while (0) > > + > > +static void warn_or_seq_hex_dump(struct seq_file *seq, int prefix_type, > > +int rowsize, int groupsize, const void *buf, > > +size_t len, bool ascii) > > +{ > > + if (seq) > > + seq_hex_dump(seq, HEX_PREFIX, prefix_type, rowsize, groupsize, > > +buf, len, ascii); > > + else > > + print_hex_dump(KERN_WARNING, pr_fmt(HEX_PREFIX), prefix_type, > > + rowsize, groupsize, buf, len, ascii); > > +} > > This will print to the logs OR to the debugfs file, won't it? No, the information is always available in the debugfs file, even after this patch. The dmesg printing is in addition to that. The code is called with and without seq == NULL in different code paths.
[PATCH 1/2] kmemleak: dump all objects for slab usage analysis
In order to be able to analyse the kernel's slab usage, we'd need a list of allocated objects and their allocation stacks. Kmemleak already maintains such a list internally, so we expose it via debugfs file. This file can be post-processed in userspace and converted to a suitable format for slab usage analysis. Signed-off-by: Vincent Whitchurch --- mm/kmemleak.c | 53 + 1 file changed, 53 insertions(+) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 17dd883198ae..7bef05c690d6 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -1759,6 +1759,34 @@ static int kmemleak_seq_show(struct seq_file *seq, void *v) return 0; } +static void kmemleak_print_object(struct seq_file *seq, + struct kmemleak_object *object) +{ + int i; + + seq_printf(seq, "object 0x%08lx (size %zu):\n", + object->pointer, object->size); + seq_printf(seq, " comm \"%s\", pid %d, jiffies %lu\n", + object->comm, object->pid, object->jiffies); + + for (i = 0; i < object->trace_len; i++) { + void *ptr = (void *)object->trace[i]; + + seq_printf(seq, "[<%p>] %pS\n", ptr, ptr); + } +} + +static int kmemleak_all_seq_show(struct seq_file *seq, void *v) +{ + struct kmemleak_object *object = v; + unsigned long flags; + + spin_lock_irqsave(&object->lock, flags); + kmemleak_print_object(seq, object); + spin_unlock_irqrestore(&object->lock, flags); + return 0; +} + static const struct seq_operations kmemleak_seq_ops = { .start = kmemleak_seq_start, .next = kmemleak_seq_next, @@ -1766,11 +1794,23 @@ static const struct seq_operations kmemleak_seq_ops = { .show = kmemleak_seq_show, }; +static const struct seq_operations kmemleak_all_seq_ops = { + .start = kmemleak_seq_start, + .next = kmemleak_seq_next, + .stop = kmemleak_seq_stop, + .show = kmemleak_all_seq_show, +}; + static int kmemleak_open(struct inode *inode, struct file *file) { return seq_open(file, &kmemleak_seq_ops); } +static int kmemleak_all_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &kmemleak_all_seq_ops); +} + static int dump_str_object_info(const char *str) { unsigned long flags; @@ -1911,6 +1951,14 @@ static const struct file_operations kmemleak_fops = { .release= seq_release, }; +static const struct file_operations kmemleak_all_fops = { + .owner = THIS_MODULE, + .open = kmemleak_all_open, + .read = seq_read, + .llseek = seq_lseek, + .release= seq_release, +}; + static void __kmemleak_do_cleanup(void) { struct kmemleak_object *object; @@ -2102,6 +2150,11 @@ static int __init kmemleak_late_init(void) if (!dentry) pr_warn("Failed to create the debugfs kmemleak file\n"); + dentry = debugfs_create_file("kmemleak_all", 0400, NULL, NULL, +&kmemleak_all_fops); + if (!dentry) + pr_warn("Failed to create the debugfs kmemleak_all file\n"); + if (kmemleak_error) { /* * Some error occurred and kmemleak was disabled. There is a -- 2.11.0
[PATCH 2/2] scripts: add kmemleak2pprof.py for slab usage analysis
Add a script which converts /sys/kernel/debug/kmemleak_all to the pprof format, which can be used for analysing memory usage. See https://github.com/google/pprof. $ ./kmemleak2pprof.py kmemleak_all $ pprof -text -ignore free_area_init_node -compact_labels -nodecount 10 prof Showing nodes accounting for 4.85MB, 34.05% of 14.23MB total Dropped 3989 nodes (cum <= 0.07MB) Showing top 10 nodes out of 190 flat flat% sum%cum cum% 1.39MB 9.78% 9.78% 1.61MB 11.29% new_inode_pseudo+0x8/0x4c 0.75MB 5.27% 15.04% 0.75MB 5.27% alloc_large_system_hash+0x19c/0x2b8 0.73MB 5.12% 20.17% 0.86MB 6.07% kernfs_new_node+0x30/0x50 0.66MB 4.62% 24.79% 0.66MB 4.62% __vmalloc_node.constprop.9+0x48/0x50 0.61MB 4.28% 29.06% 0.61MB 4.28% d_alloc+0x10/0x78 0.22MB 1.52% 30.58% 0.22MB 1.52% alloc_inode+0x1c/0xa4 0.18MB 1.28% 31.86% 0.20MB 1.42% _do_fork+0xb0/0x41c 0.13MB 0.88% 32.74% 0.13MB 0.88% early_trace_init+0x16c/0x374 0.09MB 0.66% 33.40% 0.17MB 1.17% inet_init+0x128/0x24c 0.09MB 0.65% 34.05% 0.09MB 0.65% __kernfs_new_node+0x34/0x1a8 Signed-off-by: Vincent Whitchurch --- scripts/kmemleak2pprof.py | 164 ++ 1 file changed, 164 insertions(+) create mode 100755 scripts/kmemleak2pprof.py diff --git a/scripts/kmemleak2pprof.py b/scripts/kmemleak2pprof.py new file mode 100755 index ..1295d3ca9a9d --- /dev/null +++ b/scripts/kmemleak2pprof.py @@ -0,0 +1,164 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (C) 2018 Axis Communications AB +# +# Converts /sys/kernel/debug/kmemleak_all to the pprof format, see +# https://github.com/google/pprof. +# +# profile_pb2.py can be generated with the following commands. protoc is +# packaged as protobuf-compiler in Debian: +# +# wget https://raw.githubusercontent.com/google/pprof/master/proto/profile.proto +# protoc -I. --python_out=. profile.proto + +import argparse + +from collections import defaultdict + +import profile_pb2 + + +# object 0xee0243b0 (size 464): +# comm "swapper/0", pid 0, jiffies 4294937296 +# [<80220673>] alloc_inode+0x13/0x60 +# [<80221cc5>] new_inode_pseudo+0xd/0x38 +# [<802568a3>] proc_setup_thread_self+0x37/0xc4 +# [<8020e8c1>] mount_ns+0x55/0x94 +# [<8024f2e1>] proc_mount+0x45/0x48 +# [<8020ee9b>] mount_fs+0x1f/0x104 +# [<80224785>] vfs_kern_mount.part.3+0x35/0xbc +# [<80224833>] kern_mount_data+0x17/0x2c +# [<8024f44b>] pid_ns_prepare_proc+0x13/0x24 +# [<8012ed0d>] alloc_pid+0x309/0x338 +# [<80118e2b>] copy_process.part.5+0xa2b/0x1308 +# [<80119807>] _do_fork+0x77/0x2f0 +# [<80119abf>] kernel_thread+0x23/0x28 +# [<8053517f>] rest_init+0x27/0xb4 +# [<80900afb>] start_kernel+0x369/0x372 +# [<807b>] 0x807b +class KmemleakAll(object): +def __init__(self): +pass + +def analyze(self, f): +allocs = defaultdict(int) +stack = [] +size = 0 + +while True: +line = f.readline() +if not line: +break + +line = line.strip() + +if line.startswith('['): +# (null) is in the address part so later parsing steps fail. +# Don't bother fixing it up since it's clearly bogus. +if '(null)' in line: +continue + +stack.append(line) +continue +elif line.startswith('comm'): +continue + +if size: +allocs[(tuple(stack), size)] += 1 +size = 0 + +stack = [] +size = int(line.split('(size ')[1].strip('):')) + +return sorted(allocs.items(), key=lambda x: x[0][1] * x[1], reverse=True) + + +class ProfileWriter(object): +def __init__(self, allocs): +self.profile = profile_pb2.Profile() +self.strings = [''] +self.allocs = allocs +self.locations = {} +self.functions = {} + +def stridx(self, s): +try: +idx = self.strings.index(s) +except ValueError: +idx = len(self.strings) +self.strings.append(s) + +return idx + +def get_function_id(self, funcname, filename): +try: +return self.functions[(funcname, filename)].id +except KeyError: +pass + +function = self.profile.function.add() +function.id = len(self.functions) + 1 +function.name = self.stridx(funcname) +function.filename = self.stridx(filename) + +self.functions[(funcname, filename)] = function + +return function.id + +def get_location_id(self, addr): +
Re: [PATCH 2/2] scripts: add kmemleak2pprof.py for slab usage analysis
On Tue, Aug 28, 2018 at 04:28:04PM -0700, Andrew Morton wrote: > On Tue, 28 Aug 2018 12:39:14 +0200 Vincent Whitchurch > wrote: > > > Add a script which converts /sys/kernel/debug/kmemleak_all to the pprof > > format, which can be used for analysing memory usage. See > > https://github.com/google/pprof. > > Why is this better than /proc/slabinfo? slabinfo just tells you how much memory is being used in a particular slab, it doesn't give you a breakdown of who allocated all that memory. slabinfo can't also tell you how much memory a particular subsystem is using. For example, here we can see that tracer_init_tracefs() and its callers are using ~12% of the total tracked memory: $ pprof -top -compact_labels -cum prof Showing nodes accounting for 13418.95kB, 92.07% of 14575.28kB total Dropped 4069 nodes (cum <= 72.88kB) flat flat% sum%cum cum% ... 0 0% 56.71% 1832.15kB 12.57% tracer_init_tracefs+0x74/0x1cc And that tracefs' dentrys use 500 KiB and its inodes use 1+ MiB: $ pprof -text -compact_labels -focus tracer_init_tracefs -nodecount 2 prof Main binary filename not available. Showing nodes accounting for 1794.85kB, 12.31% of 14575.28kB total Dropped 1912 nodes (cum <= 72.88kB) Showing top 2 nodes out of 32 flat flat% sum%cum cum% 1294.56kB 8.88% 8.88% 1294.56kB 8.88% new_inode_pseudo+0x8/0x4c 500.29kB 3.43% 12.31% 500.29kB 3.43% d_alloc+0x10/0x78 ... > > > $ ./kmemleak2pprof.py kmemleak_all > > $ pprof -text -ignore free_area_init_node -compact_labels -nodecount 10 > > prof > > Are we missing an argument here? s/prof/kmemleak_all/? No, the default output filename of this script is called "prof".
Re: [PATCHv2] kmemleak: Add option to print warnings to dmesg
On Tue, Aug 28, 2018 at 11:26:22AM +0100, Catalin Marinas wrote: > On Tue, Aug 28, 2018 at 12:14:12PM +0200, Vincent Whitchurch wrote: > > On Mon, Aug 27, 2018 at 03:16:41PM -0700, Andrew Morton wrote: > > > On Mon, 27 Aug 2018 10:38:21 +0200 Vincent Whitchurch > > > wrote: > > > > +config DEBUG_KMEMLEAK_WARN > > > > + bool "Print kmemleak object warnings to log buffer" > > > > + depends on DEBUG_KMEMLEAK > > > > + help > > > > + Say Y here to make kmemleak print information about > > > > unreferenced > > > > + objects (including stacktraces) as warnings to the kernel log > > > > buffer. > > > > + Otherwise this information is only available by reading the > > > > kmemleak > > > > + debugfs file. > > > > > > Why add the config option? Why not simply make the change for all > > > configs? > > > > No particular reason other than preserving the current behaviour for > > existing users. I can remove the config option if Catalin is fine with > > it. > > IIRC, in the early kmemleak days, people complained about it being to > noisy (the false positives rate was also much higher), so the default > behaviour was changed to monitor (almost) quietly with the details > available via debugfs. I'd like to keep this default behaviour but we > could have a "verbose" command via both debugfs and kernel parameter (as > we do with "off" and "on"). Would this work for you? Either a config option or a parameter are usable for me. How about something like this? It can be enabled with kmemleak.verbose=1 or "echo 1 > /sys/module/kmemleak/parameters/verbose": diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 9a3fc905b8bd..ab1b599202bc 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -593,15 +593,6 @@ config DEBUG_KMEMLEAK_DEFAULT_OFF Say Y here to disable kmemleak by default. It can then be enabled on the command line via kmemleak=on. -config DEBUG_KMEMLEAK_WARN - bool "Print kmemleak object warnings to log buffer" - depends on DEBUG_KMEMLEAK - help - Say Y here to make kmemleak print information about unreferenced - objects (including stacktraces) as warnings to the kernel log buffer. - Otherwise this information is only available by reading the kmemleak - debugfs file. - config DEBUG_STACK_USAGE bool "Stack utilization instrumentation" depends on DEBUG_KERNEL && !IA64 diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 22662715a3dc..c91d43738596 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -86,6 +86,7 @@ #include #include #include +#include #include #include #include @@ -236,6 +237,9 @@ static int kmemleak_skip_disable; /* If there are leaks that can be reported */ static bool kmemleak_found_leaks; +static bool kmemleak_verbose; +module_param_named(verbose, kmemleak_verbose, bool, 0600); + /* * Early object allocation/freeing logging. Kmemleak is initialized after the * kernel allocator. However, both the kernel allocator and kmemleak may @@ -1618,9 +1622,10 @@ static void kmemleak_scan(void) if (unreferenced_object(object) && !(object->flags & OBJECT_REPORTED)) { object->flags |= OBJECT_REPORTED; -#ifdef CONFIG_DEBUG_KMEMLEAK_WARN - print_unreferenced(NULL, object); -#endif + + if (kmemleak_verbose) + print_unreferenced(NULL, object); + new_leaks++; } spin_unlock_irqrestore(&object->lock, flags);
[PATCH] watchdog: Mark watchdog touch functions as notrace
Some architectures need to use stop_machine() to patch functions for ftrace, and the assumption is that the stopped CPUs do not make function calls to traceable functions when they are in the stopped state. Commit ce4f06dcbb5d ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE") added calls to the watchdog touch functions from the stopped CPUs and those functions lack notrace annotations. This leads to crashes when enabling/disabling ftrace on ARM kernels built with the Thumb-2 instruction set. Fix it by adding the necessary notrace annotations. Fixes: ce4f06dcbb5d ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE") Signed-off-by: Vincent Whitchurch --- kernel/watchdog.c | 4 ++-- kernel/watchdog_hld.c | 2 +- kernel/workqueue.c| 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 5470dce212c0..977918d5d350 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -261,7 +261,7 @@ static void __touch_watchdog(void) * entering idle state. This should only be used for scheduler events. * Use touch_softlockup_watchdog() for everything else. */ -void touch_softlockup_watchdog_sched(void) +notrace void touch_softlockup_watchdog_sched(void) { /* * Preemption can be enabled. It doesn't matter which CPU's timestamp @@ -270,7 +270,7 @@ void touch_softlockup_watchdog_sched(void) raw_cpu_write(watchdog_touch_ts, 0); } -void touch_softlockup_watchdog(void) +notrace void touch_softlockup_watchdog(void) { touch_softlockup_watchdog_sched(); wq_watchdog_touch(raw_smp_processor_id()); diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c index 1f7020d65d0a..71381168dede 100644 --- a/kernel/watchdog_hld.c +++ b/kernel/watchdog_hld.c @@ -29,7 +29,7 @@ static struct cpumask dead_events_mask; static unsigned long hardlockup_allcpu_dumped; static atomic_t watchdog_cpus = ATOMIC_INIT(0); -void arch_touch_nmi_watchdog(void) +notrace void arch_touch_nmi_watchdog(void) { /* * Using __raw here because some code paths have diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 78b192071ef7..5f78c6e41796 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -5559,7 +5559,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused) mod_timer(&wq_watchdog_timer, jiffies + thresh); } -void wq_watchdog_touch(int cpu) +notrace void wq_watchdog_touch(int cpu) { if (cpu >= 0) per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies; -- 2.11.0
[PATCH] kmemleak: Add option to print warnings to dmesg
Currently, kmemleak only prints the number of suspected leaks to dmesg but requires the user to read a debugfs file to get the actual stack traces of the objects' allocation points. Add an option to print the stack trace information (except the hex dumps) to dmesg too. This allows easier integration of kmemleak into automated test systems since those kind of systems presumably already save kernel logs. Signed-off-by: Vincent Whitchurch --- lib/Kconfig.debug | 9 + mm/kmemleak.c | 21 + 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ab1b599202bc..9a3fc905b8bd 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -593,6 +593,15 @@ config DEBUG_KMEMLEAK_DEFAULT_OFF Say Y here to disable kmemleak by default. It can then be enabled on the command line via kmemleak=on. +config DEBUG_KMEMLEAK_WARN + bool "Print kmemleak object warnings to log buffer" + depends on DEBUG_KMEMLEAK + help + Say Y here to make kmemleak print information about unreferenced + objects (including stacktraces) as warnings to the kernel log buffer. + Otherwise this information is only available by reading the kmemleak + debugfs file. + config DEBUG_STACK_USAGE bool "Stack utilization instrumentation" depends on DEBUG_KERNEL && !IA64 diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 9a085d525bbc..61ba47a357fc 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -311,6 +311,9 @@ static void hex_dump_object(struct seq_file *seq, const u8 *ptr = (const u8 *)object->pointer; size_t len; + if (!seq) + return; + /* limit the number of lines to HEX_MAX_LINES */ len = min_t(size_t, object->size, HEX_MAX_LINES * HEX_ROW_SIZE); @@ -355,6 +358,13 @@ static bool unreferenced_object(struct kmemleak_object *object) jiffies_last_scan); } +#define warn_or_seq_printf(seq, fmt, ...) do {\ + if (seq)\ + seq_printf(seq, fmt, ##__VA_ARGS__);\ + else\ + pr_warn(fmt, ##__VA_ARGS__);\ +} while (0) + /* * Printing of the unreferenced objects information to the seq file. The * print_unreferenced function must be called with the object->lock held. @@ -365,17 +375,17 @@ static void print_unreferenced(struct seq_file *seq, int i; unsigned int msecs_age = jiffies_to_msecs(jiffies - object->jiffies); - seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n", + warn_or_seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n", object->pointer, object->size); - seq_printf(seq, " comm \"%s\", pid %d, jiffies %lu (age %d.%03ds)\n", + warn_or_seq_printf(seq, " comm \"%s\", pid %d, jiffies %lu (age %d.%03ds)\n", object->comm, object->pid, object->jiffies, msecs_age / 1000, msecs_age % 1000); hex_dump_object(seq, object); - seq_printf(seq, " backtrace:\n"); + warn_or_seq_printf(seq, " backtrace:\n"); for (i = 0; i < object->trace_len; i++) { void *ptr = (void *)object->trace[i]; - seq_printf(seq, "[<%p>] %pS\n", ptr, ptr); + warn_or_seq_printf(seq, "[<%p>] %pS\n", ptr, ptr); } } @@ -1598,6 +1608,9 @@ static void kmemleak_scan(void) if (unreferenced_object(object) && !(object->flags & OBJECT_REPORTED)) { object->flags |= OBJECT_REPORTED; +#ifdef CONFIG_DEBUG_KMEMLEAK_WARN + print_unreferenced(NULL, object); +#endif new_leaks++; } spin_unlock_irqrestore(&object->lock, flags); -- 2.11.0
[PATCH] kmemleak: Always register debugfs file
If kmemleak built in to the kernel, but is disabled by default, the debugfs file is never registered. Because of this, it is not possible to find out if the kernel is built with kmemleak support by checking for the presence of this file. To allow this, always register the file. After this patch, if the file doesn't exist, kmemleak is not available in the kernel. If writing "scan" or any other value than "clear" to this file results in EBUSY, then kmemleak is available but is disabled by default and can be activated via the kernel command line. Signed-off-by: Vincent Whitchurch --- mm/kmemleak.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 9a085d525bbc..17dd883198ae 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -2097,6 +2097,11 @@ static int __init kmemleak_late_init(void) kmemleak_initialized = 1; + dentry = debugfs_create_file("kmemleak", 0644, NULL, NULL, +&kmemleak_fops); + if (!dentry) + pr_warn("Failed to create the debugfs kmemleak file\n"); + if (kmemleak_error) { /* * Some error occurred and kmemleak was disabled. There is a @@ -2108,10 +2113,6 @@ static int __init kmemleak_late_init(void) return -ENOMEM; } - dentry = debugfs_create_file("kmemleak", 0644, NULL, NULL, -&kmemleak_fops); - if (!dentry) - pr_warn("Failed to create the debugfs kmemleak file\n"); mutex_lock(&scan_mutex); start_scan_thread(); mutex_unlock(&scan_mutex); -- 2.11.0
Re: [PATCH 1/2] module: Overwrite st_size instead of st_info
On Thu, Nov 22, 2018 at 12:01:54PM +, Dave Martin wrote: > On Mon, Nov 19, 2018 at 05:25:12PM +0100, Vincent Whitchurch wrote: > > st_info is currently overwritten after relocation and used to store the > > elf_type(). However, we're going to need it fix kallsyms on ARM's > > Thumb-2 kernels, so preserve st_info and overwrite the st_size field > > instead. st_size is neither used by the module core nor by any > > architecture. > > > > Signed-off-by: Vincent Whitchurch > > --- > > v4: Split out to separate patch. Use st_size instead of st_other. > > v1-v3: See PATCH 2/2 > > > > kernel/module.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/module.c b/kernel/module.c > > index 49a405891587..3d86a38b580c 100644 > > --- a/kernel/module.c > > +++ b/kernel/module.c > > @@ -2682,7 +2682,7 @@ static void add_kallsyms(struct module *mod, const > > struct load_info *info) > > > > /* Set types up while we still have access to sections. */ > > for (i = 0; i < mod->kallsyms->num_symtab; i++) > > - mod->kallsyms->symtab[i].st_info > > + mod->kallsyms->symtab[i].st_size > > = elf_type(&mod->kallsyms->symtab[i], info); > > > > /* Now populate the cut down core kallsyms for after init. */ > > @@ -4061,7 +4061,7 @@ int module_get_kallsym(unsigned int symnum, unsigned > > long *value, char *type, > > kallsyms = rcu_dereference_sched(mod->kallsyms); > > if (symnum < kallsyms->num_symtab) { > > *value = kallsyms->symtab[symnum].st_value; > > - *type = kallsyms->symtab[symnum].st_info; > > + *type = kallsyms->symtab[symnum].st_size; > > strlcpy(name, symname(kallsyms, symnum), KSYM_NAME_LEN); > > strlcpy(module_name, mod->name, MODULE_NAME_LEN); > > *exported = is_exported(name, *value, mod); > > This is fine if st_size is really unused, but how sure are you of that? > > grepping for st_size throws up some hits that appear ELF-related, but > I've not investigated them in detail. > > (The fact that struct stat has an identically named field throws up > a load of false positives too.) $ git describe --tags v4.20-rc3-93-g92b419289cee $ rg -m1 '[\.>]st_size' --iglob '!**/tools/**' --iglob '!**/vdso*' --iglob '!**/scripts/**' --iglob '!**/usr/**' --iglob '!**/samples/**' | cat | kernel/kexec_file.c: if (sym->st_size != size) { Symbols in kexec kernel. | fs/stat.c:tmp.st_size = stat->size; | Documentation/networking/tls.txt: sendfile(sock, file, &offset, stat.st_size); | net/9p/client.c: ret->st_rdev, ret->st_size, ret->st_blksize, | net/9p/protocol.c:&stbuf->st_rdev, &stbuf->st_size, | fs/9p/vfs_inode_dotl.c: i_size_write(inode, stat->st_size); | fs/hostfs/hostfs_user.c: p->size = buf->st_size; | arch/powerpc/boot/mktree.c: nblks = (st.st_size + IMGBLK) / IMGBLK; | arch/alpha/kernel/osf_sys.c: tmp.st_size = lstat->size; | arch/x86/ia32/sys_ia32.c: __put_user(stat->size, &ubuf->st_size) || Not Elf_Sym. | arch/x86/kernel/machine_kexec_64.c:sym->st_size); Symbols in kexec kernel. | arch/sparc/boot/piggyback.c: st4(buffer + 12, s.st_size); | arch/sparc/kernel/sys_sparc32.c: err |= put_user(stat->size, &statbuf->st_size); | arch/um/os-Linux/file.c: .ust_size= src->st_size,/* total size, in bytes */ | arch/um/os-Linux/start_up.c: size = (buf.st_size + UM_KERN_PAGE_SIZE) & ~(UM_KERN_PAGE_SIZE - 1); | arch/s390/kernel/compat_linux.c: tmp.st_size = stat->size; | arch/arm/kernel/sys_oabi-compat.c:tmp.st_size = stat->size; | arch/mips/boot/compressed/calc_vmlinuz_load_addr.c: vmlinux_size = (uint64_t)sb.st_size; | drivers/net/ethernet/marvell/sky2.c: hw->st_idx = RING_NEXT(hw->st_idx, hw->st_size); Not Elf_Sym.
[PATCH 1/2] module: Overwrite st_size instead of st_info
st_info is currently overwritten after relocation and used to store the elf_type(). However, we're going to need it fix kallsyms on ARM's Thumb-2 kernels, so preserve st_info and overwrite the st_size field instead. st_size is neither used by the module core nor by any architecture. Signed-off-by: Vincent Whitchurch --- v4: Split out to separate patch. Use st_size instead of st_other. v1-v3: See PATCH 2/2 kernel/module.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..3d86a38b580c 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2682,7 +2682,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) - mod->kallsyms->symtab[i].st_info + mod->kallsyms->symtab[i].st_size = elf_type(&mod->kallsyms->symtab[i], info); /* Now populate the cut down core kallsyms for after init. */ @@ -4061,7 +4061,7 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, kallsyms = rcu_dereference_sched(mod->kallsyms); if (symnum < kallsyms->num_symtab) { *value = kallsyms->symtab[symnum].st_value; - *type = kallsyms->symtab[symnum].st_info; + *type = kallsyms->symtab[symnum].st_size; strlcpy(name, symname(kallsyms, symnum), KSYM_NAME_LEN); strlcpy(module_name, mod->name, MODULE_NAME_LEN); *exported = is_exported(name, *value, mod); -- 2.11.0
[PATCH 2/2] ARM: module: Fix function kallsyms on Thumb-2
Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit. $ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts 95947: 8015dc89 686 FUNCGLOBAL DEFAULT2 show_interrupts $ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts 8015dc88 T show_interrupts $ cat /proc/kallsyms | grep show_interrupts 8015dc88 T show_interrupts However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms. $ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket 333: 2d4d36 FUNCGLOBAL DEFAULT1 tun_get_socket $ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket 2d4c T tun_get_socket $ cat /proc/kallsyms | grep tun_get_socket 7f802d4d t tun_get_socket [tun] Because of this, the symbol+offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this, 2d4c : 2d4c: 6943ldr r3, [r0, #20] 2d4e: 4a07ldr r2, [pc, #28] 2d50: 4293cmp r3, r2 a crash when tun_get_socket is called with NULL results in: PC is at tun_xdp+0xa3/0xa4 [tun] pc : [<7f802d4c>] As can be seen, the "PC is at" line reports the wrong symbol name, and the symbol+offset will point to the wrong source line if it is passed to gdb. To solve this, add a way for archs to fixup the reading of these module kallsyms values, and use that to clear the lowest bit for function symbols on Thumb-2. After the fix: # cat /proc/kallsyms | grep tun_get_socket 7f802d4c t tun_get_socket [tun] PC is at tun_get_socket+0x0/0x24 [tun] pc : [<7f802d4c>] Signed-off-by: Vincent Whitchurch --- v4: Split out st_value overwrite change. Add HAVE* macro to avoid function call. v3: Do not overwrite st_value v2: Fix build warning with !MODULES arch/arm/include/asm/module.h | 11 +++ include/linux/module.h| 7 +++ kernel/module.c | 25 ++--- 3 files changed, 32 insertions(+), 11 deletions(-) diff --git a/arch/arm/include/asm/module.h b/arch/arm/include/asm/module.h index 9e81b7c498d8..e2ccec651e71 100644 --- a/arch/arm/include/asm/module.h +++ b/arch/arm/include/asm/module.h @@ -61,4 +61,15 @@ u32 get_module_plt(struct module *mod, unsigned long loc, Elf32_Addr val); MODULE_ARCH_VERMAGIC_ARMTHUMB \ MODULE_ARCH_VERMAGIC_P2V +#ifdef CONFIG_THUMB2_KERNEL +#define HAVE_ARCH_MODULE_KALLSYMS_SYMBOL_VALUE +static inline unsigned long module_kallsyms_symbol_value(Elf_Sym *sym) +{ + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + return sym->st_value & ~1; + + return sym->st_value; +} +#endif + #endif /* _ASM_ARM_MODULE_H */ diff --git a/include/linux/module.h b/include/linux/module.h index fce6b4335e36..cfc55f358800 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -486,6 +486,13 @@ struct module { #define MODULE_ARCH_INIT {} #endif +#ifndef HAVE_ARCH_MODULE_KALLSYMS_SYMBOL_VALUE +static inline unsigned long module_kallsyms_symbol_value(Elf_Sym *sym) +{ + return sym->st_value; +} +#endif + extern struct mutex module_mutex; /* FIXME: It'd be nice to isolate modules during init, too, so they diff --git a/kernel/module.c b/kernel/module.c index 3d86a38b580c..a36d7915ed2b 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3934,6 +3934,9 @@ static const char *get_ksymbol(struct module *mod, /* Scan for closest preceding symbol, and next symbol. (ELF starts real symbols at 1). */ for (i = 1; i < kallsyms->num_symtab; i++) { + unsigned long thisval = module_kallsyms_symbol_value(&kallsyms->symtab[i]); + unsigned long bestval = module_kallsyms_symbol_value(&kallsyms->symtab[best]); + if (kallsyms->symtab[i].st_shndx == SHN_UNDEF) continue; @@ -3943,21 +3946,21 @@ static const char *get_ksymbol(struct module *mod, || is_arm_mapping_symbol(symname(kallsyms, i))) continue; - if (kallsyms->symtab[i].st_value <= addr - && kallsyms->symtab[i].st_value > kallsyms->symtab[best].st_value) + if (thisval <= addr + && thisval > bestval) best = i; - if (kallsyms->symtab[i].st_value > addr - && kallsyms->symtab[i].st_value < nextval) - nextval = kallsyms->symtab[i].st_value; + if (thisval > addr + && thisval < nextval) +
[PATCH v3] ARM: module: Fix function kallsyms on Thumb-2
Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit. $ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts 95947: 8015dc89 686 FUNCGLOBAL DEFAULT2 show_interrupts $ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts 8015dc88 T show_interrupts $ cat /proc/kallsyms | grep show_interrupts 8015dc88 T show_interrupts However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms. $ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket 268: 00e144 FUNCGLOBAL DEFAULT2 tun_get_socket $ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket 00e0 T tun_get_socket $ cat /proc/kallsyms | grep tun_get_socket 7fcd30e1 t tun_get_socket [tun] Because of this, the offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this, 00e0 : e0: b500push{lr} e2: f7ff fffe bl 0 <__gnu_mcount_nc> e6: 4b08ldr r3, [pc, #32] e8: 6942ldr r2, [r0, #20] ea: 429acmp r2, r3 ec: d002beq.n f4 a crash when tun_get_socket is called with NULL results in: PC is at tun_get_socket+0x7/0x2c [tun] pc : [<7fcdb0e8>] which can result in the incorrect line being reported by gdb if this symbol+offset is used there. If the crash is on the first instruction of a function, the "PC is at" line would also report the symbol name of the preceding function. To solve this, introduce a weak module_kallsyms_symbol_value() function which arches can override to fix up these symbol values, and implement this for Thumb-2. We need to move elf_type() to st_other so that the original value of st_info is preserved. After the fix: $ cat /proc/kallsyms | grep tun_get_socket 7fcd30e0 t tun_get_socket [tun] PC is at tun_get_socket+0x8/0x2c [tun] pc : [<7fcdb0e8>] Signed-off-by: Vincent Whitchurch --- v3: Do not overwrite st_value v2: Fix build warning with !MODULES arch/arm/kernel/module.c | 10 ++ include/linux/moduleloader.h | 2 ++ kernel/module.c | 34 +- 3 files changed, 33 insertions(+), 13 deletions(-) diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index 3ff571c2c71c..89ab84a83600 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -336,6 +336,16 @@ static const Elf_Shdr *find_mod_section(const Elf32_Ehdr *hdr, extern void fixup_pv_table(const void *, unsigned long); extern void fixup_smp(const void *, unsigned long); +#ifdef CONFIG_THUMB2_KERNEL +unsigned long module_kallsyms_symbol_value(Elf_Sym *sym) +{ + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + return sym->st_value & ~1; + + return sym->st_value; +} +#endif + int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs, struct module *mod) { diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index 31013c2effd3..6395409b01a4 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -86,6 +86,8 @@ void module_arch_cleanup(struct module *mod); /* Any cleanup before freeing mod->module_init */ void module_arch_freeing_init(struct module *mod); +unsigned long module_kallsyms_symbol_value(Elf_Sym *sym); + #ifdef CONFIG_KASAN #include #define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..5a588cfbb8f8 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2682,7 +2682,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) - mod->kallsyms->symtab[i].st_info + mod->kallsyms->symtab[i].st_other = elf_type(&mod->kallsyms->symtab[i], info); /* Now populate the cut down core kallsyms for after init. */ @@ -3916,6 +3916,11 @@ static const char *symname(struct mod_kallsyms *kallsyms, unsigned int symnum) return kallsyms->strtab + kallsyms->symtab[symnum].st_name; } +unsigned long __weak module_kallsyms_symbol_value(Elf_Sym *sym) +{ + return sym->st_value; +} + static const char *get_ksymbol(struct module *mod, unsigned long addr, unsigned long *size, @@ -3934,6 +3939,9 @@ static const char *get_ksymbol(struct module *mod, /* Scan for closest preceding sy
[PATCH v5 2/2] ARM: module: Fix function kallsyms on Thumb-2
Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit. $ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts 95947: 8015dc89 686 FUNCGLOBAL DEFAULT2 show_interrupts $ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts 8015dc88 T show_interrupts $ cat /proc/kallsyms | grep show_interrupts 8015dc88 T show_interrupts However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms. $ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket 333: 2d4d36 FUNCGLOBAL DEFAULT1 tun_get_socket $ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket 2d4c T tun_get_socket $ cat /proc/kallsyms | grep tun_get_socket 7f802d4d t tun_get_socket [tun] Because of this, the symbol+offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this, 2d4c : 2d4c: 6943ldr r3, [r0, #20] 2d4e: 4a07ldr r2, [pc, #28] 2d50: 4293cmp r3, r2 a crash when tun_get_socket is called with NULL results in: PC is at tun_xdp+0xa3/0xa4 [tun] pc : [<7f802d4c>] As can be seen, the "PC is at" line reports the wrong symbol name, and the symbol+offset will point to the wrong source line if it is passed to gdb. To solve this, add a way for archs to fixup the reading of these module kallsyms values, and use that to clear the lowest bit for function symbols on Thumb-2. After the fix: # cat /proc/kallsyms | grep tun_get_socket 7f802d4c t tun_get_socket [tun] PC is at tun_get_socket+0x0/0x24 [tun] pc : [<7f802d4c>] Signed-off-by: Vincent Whitchurch --- v5: Use/move local variables to reduce calls and keep lines short. Use const arg. v4: Split out st_value overwrite change. Add HAVE* macro to avoid function call. v3: Do not overwrite st_value v2: Fix build warning with !MODULES arch/arm/include/asm/module.h | 11 +++ include/linux/module.h| 7 +++ kernel/module.c | 45 +++ 3 files changed, 46 insertions(+), 17 deletions(-) diff --git a/arch/arm/include/asm/module.h b/arch/arm/include/asm/module.h index 9e81b7c498d8..c7bcf0347801 100644 --- a/arch/arm/include/asm/module.h +++ b/arch/arm/include/asm/module.h @@ -61,4 +61,15 @@ u32 get_module_plt(struct module *mod, unsigned long loc, Elf32_Addr val); MODULE_ARCH_VERMAGIC_ARMTHUMB \ MODULE_ARCH_VERMAGIC_P2V +#ifdef CONFIG_THUMB2_KERNEL +#define HAVE_ARCH_MODULE_KALLSYMS_SYMBOL_VALUE +static inline unsigned long module_kallsyms_symbol_value(const Elf_Sym *sym) +{ + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + return sym->st_value & ~1; + + return sym->st_value; +} +#endif + #endif /* _ASM_ARM_MODULE_H */ diff --git a/include/linux/module.h b/include/linux/module.h index fce6b4335e36..12146257eb5d 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -486,6 +486,13 @@ struct module { #define MODULE_ARCH_INIT {} #endif +#ifndef HAVE_ARCH_MODULE_KALLSYMS_SYMBOL_VALUE +static inline unsigned long module_kallsyms_symbol_value(const Elf_Sym *sym) +{ + return sym->st_value; +} +#endif + extern struct mutex module_mutex; /* FIXME: It'd be nice to isolate modules during init, too, so they diff --git a/kernel/module.c b/kernel/module.c index 3d86a38b580c..9364017fdc21 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3922,7 +3922,7 @@ static const char *get_ksymbol(struct module *mod, unsigned long *offset) { unsigned int i, best = 0; - unsigned long nextval; + unsigned long nextval, bestval; struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms); /* At worse, next value is at end of module */ @@ -3931,10 +3931,15 @@ static const char *get_ksymbol(struct module *mod, else nextval = (unsigned long)mod->core_layout.base+mod->core_layout.text_size; + bestval = module_kallsyms_symbol_value(&kallsyms->symtab[best]); + /* Scan for closest preceding symbol, and next symbol. (ELF starts real symbols at 1). */ for (i = 1; i < kallsyms->num_symtab; i++) { - if (kallsyms->symtab[i].st_shndx == SHN_UNDEF) + const Elf_Sym *sym = &kallsyms->symtab[i]; + unsigned long thisval = module_kallsyms_symbol_value(sym); + + if (sym->st_shndx == SHN_UNDEF) continue; /* We ignore unnamed symbols: they're unin
[PATCH v5 1/2] module: Overwrite st_size instead of st_info
st_info is currently overwritten after relocation and used to store the elf_type(). However, we're going to need it fix kallsyms on ARM's Thumb-2 kernels, so preserve st_info and overwrite the st_size field instead. st_size is neither used by the module core nor by any architecture. Reviewed-by: Dave Martin Signed-off-by: Vincent Whitchurch --- v5: Add Dave Martin's Reviewed-by v4: Split out to separate patch. Use st_size instead of st_other. v1-v3: See PATCH 2/2 kernel/module.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..3d86a38b580c 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2682,7 +2682,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) - mod->kallsyms->symtab[i].st_info + mod->kallsyms->symtab[i].st_size = elf_type(&mod->kallsyms->symtab[i], info); /* Now populate the cut down core kallsyms for after init. */ @@ -4061,7 +4061,7 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, kallsyms = rcu_dereference_sched(mod->kallsyms); if (symnum < kallsyms->num_symtab) { *value = kallsyms->symtab[symnum].st_value; - *type = kallsyms->symtab[symnum].st_info; + *type = kallsyms->symtab[symnum].st_size; strlcpy(name, symname(kallsyms, symnum), KSYM_NAME_LEN); strlcpy(module_name, mod->name, MODULE_NAME_LEN); *exported = is_exported(name, *value, mod); -- 2.11.0
Re: [PATCH v2] ARM: module: Fix function kallsyms on Thumb-2
On Fri, Nov 02, 2018 at 02:53:22PM +0100, Jessica Yu wrote: > +++ Vincent Whitchurch [01/11/18 16:29 +0100]: > > On Wed, Oct 31, 2018 at 04:53:41PM +0100, Jessica Yu wrote: > > > Could this be done in modpost? I'm guessing the answer is no as some > > > relocations may rely on that bit being set in st_value, right? > > > Therefore we can only clear the bit _after_ relocations to the module > > > are applied at runtime, correct? > > > > Yes, that's correct, it needs to be done after the relocations. > > > > > I'm not against having an arch-specific kallsyms fixup function, my > > > only concern would be if this would interfere with the delayed > > > relocations livepatching does, if there are plans in the future to > > > have livepatching on arm (IIRC there was an attempt at a port in > > > 2016). If there exists some Thumb-2 relocation types that rely on that > > > lowest bit in st_value being set in order to be applied, and we clear > > > it permanently from the symtab, then livepatching wouldn't be able to > > > apply those types of relocations anymore. If this is a legitimate > > > concern, then perhaps an alternative solution would be to have an > > > arch-specific kallsyms symbol-value-fixup function for accesses to > > > sym.st_value, without modifying the module symtab. > > > > I'm not familiar with livepatching, but yes, if it needs to do > > relocations later I guess we should preserve the original value. > > Yeah, I think the symtab needs to remain unmodified for livepatch to > be able to do delayed relocations after module load. > > > I gave the alternative solution a go and it seems to work. > > add_kallsyms() currently overwrites st_info so I had to move the > > elf_type to the unused st_other field instead to preserve st_info: > > I think that's fine, I think the only usages of st_other I've found > are during relocations, and the field is big enough for a char, so it > should be fine to reuse it afterwards. If livepatch can do relocations later, doesn't that mean that we _can't_ reuse st_other for storing the elf_type? OTOH, the current code overwrites st_info, and I see that used from various archs' relocation functions too, so perhaps I'm missing something.
[PATCH v6 modules-next 1/2] module: Overwrite st_size instead of st_info
st_info is currently overwritten after relocation and used to store the elf_type(). However, we're going to need it fix kallsyms on ARM's Thumb-2 kernels, so preserve st_info and overwrite the st_size field instead. st_size is neither used by the module core nor by any architecture. Reviewed-by: Miroslav Benes Reviewed-by: Dave Martin Signed-off-by: Vincent Whitchurch --- v6: Add Miroslav Benes' Reviewed-by v5: Add Dave Martin's Reviewed-by v4: Split out to separate patch. Use st_size instead of st_other. v1-v3: See PATCH 2/2 kernel/module.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/module.c b/kernel/module.c index 1b5edf78694c..b36ff8a3d562 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2684,7 +2684,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) - mod->kallsyms->symtab[i].st_info + mod->kallsyms->symtab[i].st_size = elf_type(&mod->kallsyms->symtab[i], info); /* Now populate the cut down core kallsyms for after init. */ @@ -4070,7 +4070,7 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, kallsyms = rcu_dereference_sched(mod->kallsyms); if (symnum < kallsyms->num_symtab) { *value = kallsyms->symtab[symnum].st_value; - *type = kallsyms->symtab[symnum].st_info; + *type = kallsyms->symtab[symnum].st_size; strlcpy(name, kallsyms_symbol_name(kallsyms, symnum), KSYM_NAME_LEN); strlcpy(module_name, mod->name, MODULE_NAME_LEN); *exported = is_exported(name, *value, mod); -- 2.11.0
[PATCH v6 modules-next 2/2] ARM: module: Fix function kallsyms on Thumb-2
Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit. $ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts 95947: 8015dc89 686 FUNCGLOBAL DEFAULT2 show_interrupts $ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts 8015dc88 T show_interrupts $ cat /proc/kallsyms | grep show_interrupts 8015dc88 T show_interrupts However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms. $ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket 333: 2d4d36 FUNCGLOBAL DEFAULT1 tun_get_socket $ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket 2d4c T tun_get_socket $ cat /proc/kallsyms | grep tun_get_socket 7f802d4d t tun_get_socket [tun] Because of this, the symbol+offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this, 2d4c : 2d4c: 6943ldr r3, [r0, #20] 2d4e: 4a07ldr r2, [pc, #28] 2d50: 4293cmp r3, r2 a crash when tun_get_socket is called with NULL results in: PC is at tun_xdp+0xa3/0xa4 [tun] pc : [<7f802d4c>] As can be seen, the "PC is at" line reports the wrong symbol name, and the symbol+offset will point to the wrong source line if it is passed to gdb. To solve this, add a way for archs to fixup the reading of these module kallsyms values, and use that to clear the lowest bit for function symbols on Thumb-2. After the fix: # cat /proc/kallsyms | grep tun_get_socket 7f802d4c t tun_get_socket [tun] PC is at tun_get_socket+0x0/0x24 [tun] pc : [<7f802d4c>] Signed-off-by: Vincent Whitchurch --- v6: Rename module_kallsyms_symbol_value() -> kallsyms_symbol_value() v5: Use/move local variables to reduce calls and keep lines short. Use const arg. v4: Split out st_value overwrite change. Add HAVE* macro to avoid function call. v3: Do not overwrite st_value v2: Fix build warning with !MODULES arch/arm/include/asm/module.h | 11 +++ include/linux/module.h| 7 +++ kernel/module.c | 43 +++ 3 files changed, 45 insertions(+), 16 deletions(-) diff --git a/arch/arm/include/asm/module.h b/arch/arm/include/asm/module.h index 9e81b7c498d8..182163b55546 100644 --- a/arch/arm/include/asm/module.h +++ b/arch/arm/include/asm/module.h @@ -61,4 +61,15 @@ u32 get_module_plt(struct module *mod, unsigned long loc, Elf32_Addr val); MODULE_ARCH_VERMAGIC_ARMTHUMB \ MODULE_ARCH_VERMAGIC_P2V +#ifdef CONFIG_THUMB2_KERNEL +#define HAVE_ARCH_KALLSYMS_SYMBOL_VALUE +static inline unsigned long kallsyms_symbol_value(const Elf_Sym *sym) +{ + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + return sym->st_value & ~1; + + return sym->st_value; +} +#endif + #endif /* _ASM_ARM_MODULE_H */ diff --git a/include/linux/module.h b/include/linux/module.h index fce6b4335e36..c0b4b7840b57 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -486,6 +486,13 @@ struct module { #define MODULE_ARCH_INIT {} #endif +#ifndef HAVE_ARCH_KALLSYMS_SYMBOL_VALUE +static inline unsigned long kallsyms_symbol_value(const Elf_Sym *sym) +{ + return sym->st_value; +} +#endif + extern struct mutex module_mutex; /* FIXME: It'd be nice to isolate modules during init, too, so they diff --git a/kernel/module.c b/kernel/module.c index b36ff8a3d562..164bf201eae4 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3928,7 +3928,7 @@ static const char *find_kallsyms_symbol(struct module *mod, unsigned long *offset) { unsigned int i, best = 0; - unsigned long nextval; + unsigned long nextval, bestval; struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms); /* At worse, next value is at end of module */ @@ -3937,10 +3937,15 @@ static const char *find_kallsyms_symbol(struct module *mod, else nextval = (unsigned long)mod->core_layout.base+mod->core_layout.text_size; + bestval = kallsyms_symbol_value(&kallsyms->symtab[best]); + /* Scan for closest preceding symbol, and next symbol. (ELF starts real symbols at 1). */ for (i = 1; i < kallsyms->num_symtab; i++) { - if (kallsyms->symtab[i].st_shndx == SHN_UNDEF) + const Elf_Sym *sym = &kallsyms->symtab[i]; + unsigned long thisval = kallsyms_symbol_value(sym); + + if (sym->st_shndx == SHN_UNDEF) continue; /* We ignore
Re: [PATCH v5 2/2] ARM: module: Fix function kallsyms on Thumb-2
On Thu, Dec 06, 2018 at 06:29:20PM +0100, Jessica Yu wrote: > Also, do you mind if I drop the module_ prefix from > module_kallsyms_symbol_value()? I recently submitted some internal > module kallsyms cleanups [1] and there we have the newly renamed > kallsyms_symbol_name(), so I think it'd be nice if we kept the naming > scheme consistent with just kallsyms_symbol_value(). I could just do > this myself if you don't want to respin a v6. I've sent out a v6 with this change. I also rebased the change on top of your cleanups in modules-next. Thanks.
[PATCH] drop_caches: Allow unmapping pages
drop_caches does not drop pages which are currently mapped. Add an option to try to unmap and drop even these pages. This provides a simple way to obtain a rough estimate of how many file pages are used in a particular use case: drop everything and check how much gets read back. # cat /proc/meminfo | grep file Active(file): 16608 kB Inactive(file):23424 kB # echo 3 > /proc/sys/vm/drop_caches && cat /proc/meminfo | grep file Active(file): 10624 kB Inactive(file):15060 kB # echo 11 > /proc/sys/vm/drop_caches && cat /proc/meminfo | grep file Active(file):240 kB Inactive(file): 2344 kB Signed-off-by: Vincent Whitchurch --- Documentation/sysctl/vm.txt | 4 fs/drop_caches.c| 3 ++- include/linux/fs.h | 10 -- kernel/sysctl.c | 4 ++-- mm/truncate.c | 39 - 5 files changed, 41 insertions(+), 19 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 187ce4f599a2..6ea06c2c973b 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -222,6 +222,10 @@ To increase the number of objects freed by this operation, the user may run number of dirty objects on the system and create more candidates to be dropped. +By default, pages which are currently mapped are not dropped from the +pagecache. If you want to unmap and drop these pages too, echo 9 or 11 instead +of 1 or 3 respectively (set bit 4). + This file is not a means to control the growth of the various kernel caches (inodes, dentries, pagecache, etc...) These objects are automatically reclaimed by the kernel when memory is needed elsewhere on the system. diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 82377017130f..9faaa1e3a672 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -17,6 +17,7 @@ int sysctl_drop_caches; static void drop_pagecache_sb(struct super_block *sb, void *unused) { struct inode *inode, *toput_inode = NULL; + bool unmap = sysctl_drop_caches & 8; spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { @@ -30,7 +31,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) spin_unlock(&inode->i_lock); spin_unlock(&sb->s_inode_list_lock); - invalidate_mapping_pages(inode->i_mapping, 0, -1); + __invalidate_mapping_pages(inode->i_mapping, 0, -1, unmap); iput(toput_inode); toput_inode = inode; diff --git a/include/linux/fs.h b/include/linux/fs.h index 811c77743dad..503e176654ce 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2675,8 +2675,14 @@ extern int check_disk_change(struct block_device *); extern int __invalidate_device(struct block_device *, bool); extern int invalidate_partition(struct gendisk *, int); #endif -unsigned long invalidate_mapping_pages(struct address_space *mapping, - pgoff_t start, pgoff_t end); +unsigned long __invalidate_mapping_pages(struct address_space *mapping, + pgoff_t start, pgoff_t end, bool unmap); + +static inline unsigned long invalidate_mapping_pages(struct address_space *mapping, +pgoff_t start, pgoff_t end) +{ + return __invalidate_mapping_pages(mapping, start, end, false); +} static inline void invalidate_remote_inode(struct inode *inode) { diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ba4d9e85feb8..f12c2a8d84fb 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -125,7 +125,7 @@ static int __maybe_unused neg_one = -1; static int zero; static int __maybe_unused one = 1; static int __maybe_unused two = 2; -static int __maybe_unused four = 4; +static int __maybe_unused fifteen = 15; static unsigned long one_ul = 1; static int one_hundred = 100; static int one_thousand = 1000; @@ -1431,7 +1431,7 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = drop_caches_sysctl_handler, .extra1 = &one, - .extra2 = &four, + .extra2 = &fifteen, }, #ifdef CONFIG_COMPACTION { diff --git a/mm/truncate.c b/mm/truncate.c index 798e7ccfb030..613b02e02146 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -245,6 +245,22 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page) } EXPORT_SYMBOL(generic_error_remove_page); +static int __invalidate_inode_page(struct page *page, bool unmap) +{ + struct address_space *mapping = page_mapping(page); + if (!mapping) + return 0; + if (PageDirty(page) || PageWriteback(page)) + return 0; + if (page_mapped(page)) { + if (!unmap)
[PATCH] ARM: module: Fix function kallsyms on Thumb-2
Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit. $ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts 95947: 8015dc89 686 FUNCGLOBAL DEFAULT2 show_interrupts $ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts 8015dc88 T show_interrupts $ cat /proc/kallsyms | grep show_interrupts 8015dc88 T show_interrupts However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms. $ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket 268: 00e144 FUNCGLOBAL DEFAULT2 tun_get_socket $ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket 00e0 T tun_get_socket $ cat /proc/kallsyms | grep tun_get_socket 7fcd30e1 t tun_get_socket [tun] Because of this, the offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this, 00e0 : e0: b500push{lr} e2: f7ff fffe bl 0 <__gnu_mcount_nc> e6: 4b08ldr r3, [pc, #32] e8: 6942ldr r2, [r0, #20] ea: 429acmp r2, r3 ec: d002beq.n f4 a crash when tun_get_socket is called with NULL results in: PC is at tun_get_socket+0x7/0x2c [tun] pc : [<7fcdb0e8>] which can result in the incorrect line being reported by gdb if this symbol+offset is used there. If the crash is on the first instruction of a function, the "PC is at" line would also report the symbol name of the preceding function. To solve this, fix up these symbols like nm does. For this, we need a new hook in the generic module loading code, before the symbols' st_info is overwritten by add_kallsyms(). After the fix: $ cat /proc/kallsyms | grep tun_get_socket 7fcd30e0 t tun_get_socket [tun] PC is at tun_get_socket+0x8/0x2c [tun] pc : [<7fcdb0e8>] Signed-off-by: Vincent Whitchurch --- arch/arm/kernel/module.c | 14 ++ include/linux/moduleloader.h | 2 ++ kernel/module.c | 6 ++ 3 files changed, 22 insertions(+) diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index 3ff571c2c71c..771f86318d84 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -399,6 +399,20 @@ int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs, return 0; } +#if defined(CONFIG_THUMB2_KERNEL) && defined(CONFIG_KALLSYMS) +void module_fixup_kallsyms(struct mod_kallsyms *kallsyms) +{ + int i; + + for (i = 0; i < kallsyms->num_symtab; i++) { + Elf_Sym *sym = &kallsyms->symtab[i]; + + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + sym->st_value &= ~1; + } +} +#endif + void module_arch_cleanup(struct module *mod) { diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index 31013c2effd3..92387dd49b82 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -86,6 +86,8 @@ void module_arch_cleanup(struct module *mod); /* Any cleanup before freeing mod->module_init */ void module_arch_freeing_init(struct module *mod); +void module_fixup_kallsyms(struct mod_kallsyms *kallsyms); + #ifdef CONFIG_KASAN #include #define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..ded4f4b49824 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2659,6 +2659,10 @@ static void layout_symtab(struct module *mod, struct load_info *info) mod->init_layout.size = debug_align(mod->init_layout.size); } +void __weak module_fixup_kallsyms(struct mod_kallsyms *kallsyms) +{ +} + /* * We use the full symtab and strtab which layout_symtab arranged to * be appended to the init section. Later we switch to the cut-down @@ -2680,6 +2684,8 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Make sure we get permanent strtab: don't use info->strtab. */ mod->kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr; + module_fixup_kallsyms(mod->kallsyms); + /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) mod->kallsyms->symtab[i].st_info -- 2.11.0
[PATCH v2] ARM: module: Fix function kallsyms on Thumb-2
Thumb-2 functions have the lowest bit set in the symbol value in the symtab. When kallsyms are generated for the vmlinux, the kallsyms are generated from the output of nm, and nm clears the lowest bit. $ arm-linux-gnueabihf-readelf -a vmlinux | grep show_interrupts 95947: 8015dc89 686 FUNCGLOBAL DEFAULT2 show_interrupts $ arm-linux-gnueabihf-nm vmlinux | grep show_interrupts 8015dc88 T show_interrupts $ cat /proc/kallsyms | grep show_interrupts 8015dc88 T show_interrupts However, for modules, the kallsyms uses the values in the symbol table without modification, so for functions in modules, the lowest bit is set in kallsyms. $ arm-linux-gnueabihf-readelf -a drivers/net/tun.ko | grep tun_get_socket 268: 00e144 FUNCGLOBAL DEFAULT2 tun_get_socket $ arm-linux-gnueabihf-nm drivers/net/tun.ko | grep tun_get_socket 00e0 T tun_get_socket $ cat /proc/kallsyms | grep tun_get_socket 7fcd30e1 t tun_get_socket [tun] Because of this, the offset of the crashing instruction shown in oopses is incorrect when the crash is in a module. For example, given a tun_get_socket which starts like this, 00e0 : e0: b500push{lr} e2: f7ff fffe bl 0 <__gnu_mcount_nc> e6: 4b08ldr r3, [pc, #32] e8: 6942ldr r2, [r0, #20] ea: 429acmp r2, r3 ec: d002beq.n f4 a crash when tun_get_socket is called with NULL results in: PC is at tun_get_socket+0x7/0x2c [tun] pc : [<7fcdb0e8>] which can result in the incorrect line being reported by gdb if this symbol+offset is used there. If the crash is on the first instruction of a function, the "PC is at" line would also report the symbol name of the preceding function. To solve this, fix up these symbols like nm does. For this, we need a new hook in the generic module loading code, before the symbols' st_info is overwritten by add_kallsyms(). After the fix: $ cat /proc/kallsyms | grep tun_get_socket 7fcd30e0 t tun_get_socket [tun] PC is at tun_get_socket+0x8/0x2c [tun] pc : [<7fcdb0e8>] Signed-off-by: Vincent Whitchurch --- v2: Fix build warning with !MODULES arch/arm/kernel/module.c | 14 ++ include/linux/moduleloader.h | 3 +++ kernel/module.c | 6 ++ 3 files changed, 23 insertions(+) diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index 3ff571c2c71c..771f86318d84 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -399,6 +399,20 @@ int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs, return 0; } +#if defined(CONFIG_THUMB2_KERNEL) && defined(CONFIG_KALLSYMS) +void module_fixup_kallsyms(struct mod_kallsyms *kallsyms) +{ + int i; + + for (i = 0; i < kallsyms->num_symtab; i++) { + Elf_Sym *sym = &kallsyms->symtab[i]; + + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + sym->st_value &= ~1; + } +} +#endif + void module_arch_cleanup(struct module *mod) { diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index 31013c2effd3..b1ac5584eaa5 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -86,6 +86,9 @@ void module_arch_cleanup(struct module *mod); /* Any cleanup before freeing mod->module_init */ void module_arch_freeing_init(struct module *mod); +struct mod_kallsyms; +void module_fixup_kallsyms(struct mod_kallsyms *kallsyms); + #ifdef CONFIG_KASAN #include #define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..ded4f4b49824 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2659,6 +2659,10 @@ static void layout_symtab(struct module *mod, struct load_info *info) mod->init_layout.size = debug_align(mod->init_layout.size); } +void __weak module_fixup_kallsyms(struct mod_kallsyms *kallsyms) +{ +} + /* * We use the full symtab and strtab which layout_symtab arranged to * be appended to the init section. Later we switch to the cut-down @@ -2680,6 +2684,8 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Make sure we get permanent strtab: don't use info->strtab. */ mod->kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr; + module_fixup_kallsyms(mod->kallsyms); + /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) mod->kallsyms->symtab[i].st_info -- 2.11.0
Re: [PATCH v2] ARM: module: Fix function kallsyms on Thumb-2
On Wed, Oct 31, 2018 at 04:53:41PM +0100, Jessica Yu wrote: > Could this be done in modpost? I'm guessing the answer is no as some > relocations may rely on that bit being set in st_value, right? > Therefore we can only clear the bit _after_ relocations to the module > are applied at runtime, correct? Yes, that's correct, it needs to be done after the relocations. > I'm not against having an arch-specific kallsyms fixup function, my > only concern would be if this would interfere with the delayed > relocations livepatching does, if there are plans in the future to > have livepatching on arm (IIRC there was an attempt at a port in > 2016). If there exists some Thumb-2 relocation types that rely on that > lowest bit in st_value being set in order to be applied, and we clear > it permanently from the symtab, then livepatching wouldn't be able to > apply those types of relocations anymore. If this is a legitimate > concern, then perhaps an alternative solution would be to have an > arch-specific kallsyms symbol-value-fixup function for accesses to > sym.st_value, without modifying the module symtab. I'm not familiar with livepatching, but yes, if it needs to do relocations later I guess we should preserve the original value. I gave the alternative solution a go and it seems to work. add_kallsyms() currently overwrites st_info so I had to move the elf_type to the unused st_other field instead to preserve st_info: diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index 3ff571c2c71c..f443d0ccd1a0 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -336,6 +336,16 @@ static const Elf_Shdr *find_mod_section(const Elf32_Ehdr *hdr, extern void fixup_pv_table(const void *, unsigned long); extern void fixup_smp(const void *, unsigned long); +#ifdef CONFIG_THUMB2_KERNEL +unsigned long symbol_value(Elf_Sym *sym) +{ + if (ELF_ST_TYPE(sym->st_info) == STT_FUNC) + return sym->st_value & ~1; + + return sym->st_value; +} +#endif + int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs, struct module *mod) { diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index 31013c2effd3..6bf6118db37f 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -86,6 +86,8 @@ void module_arch_cleanup(struct module *mod); /* Any cleanup before freeing mod->module_init */ void module_arch_freeing_init(struct module *mod); +unsigned long symbol_value(Elf_Sym *sym); + #ifdef CONFIG_KASAN #include #define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..871bf4450e9d 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2682,7 +2682,7 @@ static void add_kallsyms(struct module *mod, const struct load_info *info) /* Set types up while we still have access to sections. */ for (i = 0; i < mod->kallsyms->num_symtab; i++) - mod->kallsyms->symtab[i].st_info + mod->kallsyms->symtab[i].st_other = elf_type(&mod->kallsyms->symtab[i], info); /* Now populate the cut down core kallsyms for after init. */ @@ -3916,6 +3916,11 @@ static const char *symname(struct mod_kallsyms *kallsyms, unsigned int symnum) return kallsyms->strtab + kallsyms->symtab[symnum].st_name; } +unsigned long __weak symbol_value(Elf_Sym *sym) +{ + return sym->st_value; +} + static const char *get_ksymbol(struct module *mod, unsigned long addr, unsigned long *size, @@ -3934,6 +3939,9 @@ static const char *get_ksymbol(struct module *mod, /* Scan for closest preceding symbol, and next symbol. (ELF starts real symbols at 1). */ for (i = 1; i < kallsyms->num_symtab; i++) { + unsigned long thisval = symbol_value(&kallsyms->symtab[i]); + unsigned long bestval = symbol_value(&kallsyms->symtab[best]); + if (kallsyms->symtab[i].st_shndx == SHN_UNDEF) continue; @@ -3943,21 +3951,21 @@ static const char *get_ksymbol(struct module *mod, || is_arm_mapping_symbol(symname(kallsyms, i))) continue; - if (kallsyms->symtab[i].st_value <= addr - && kallsyms->symtab[i].st_value > kallsyms->symtab[best].st_value) + if (thisval <= addr + && thisval > bestval) best = i; - if (kallsyms->symtab[i].st_value > addr - && kallsyms->symtab[i].st_value < nextval) - nextval = kallsyms->symtab[i].st_value; + if (thisval > addr + && thisval < nextval) + nextval = thisval; } if (!best) return NULL; if (size) - *size = nextval - kallsyms->symtab[best].st
[PATCH v2] sysctl: Add panic-fatal-signals
Add a sysctl which asks the kernel to panic when any userspace process receives a fatal signal which would trigger a core dump. This has proven to be quite useful when debugging problems seen during testing of embedded systems: When combined with kernel core dumps (saved due to the panic), it allows easier debugging of crashes which have their origin in system-wide problems such as buggy drivers or other kernel or hardware-related issues. The crashing process's core dump can be extracted from the kernel core dump using tools such as the crash utility's gcore extension. Signed-off-by: Vincent Whitchurch --- v2: Put the sysctl behind a config option include/linux/signal.h | 1 + init/Kconfig | 14 ++ kernel/signal.c| 5 - kernel/sysctl.c| 9 + 4 files changed, 28 insertions(+), 1 deletion(-) diff --git a/include/linux/signal.h b/include/linux/signal.h index cc7e2c1cd444..109efd1432e9 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -10,6 +10,7 @@ struct task_struct; /* for sysctl */ extern int print_fatal_signals; +extern int panic_fatal_signals; static inline void copy_siginfo(kernel_siginfo_t *to, const kernel_siginfo_t *from) diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..875442f6ab53 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1242,6 +1242,20 @@ config SYSCTL_SYSCALL If unsure say N here. +config SYSCTL_PANIC_FATAL_SIGNALS + bool "panic-fatal-signals sysctl" if EXPERT + depends on PROC_SYSCTL + help + If you say Y here, a kernel.panic-fatal-signals sysctl will be + offered. If this sysctl is turned on, the kernel will panic if any + userspace process receives a fatal signal which would trigger a core + dump. + + When used together with kernel core dumps, this can be useful for + debugging some system-wide problems, primarily on embedded systems. + + If unsure, say N. + config FHANDLE bool "open by fhandle syscalls" if EXPERT select EXPORTFS diff --git a/kernel/signal.c b/kernel/signal.c index e1d7ad8e6ab1..83c6877b0182 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -59,6 +59,7 @@ static struct kmem_cache *sigqueue_cachep; int print_fatal_signals __read_mostly; +int panic_fatal_signals __read_mostly; static void __user *sig_handler(struct task_struct *t, int sig) { @@ -2497,8 +2498,10 @@ bool get_signal(struct ksignal *ksig) current->flags |= PF_SIGNALED; if (sig_kernel_coredump(signr)) { - if (print_fatal_signals) + if (print_fatal_signals || panic_fatal_signals) print_fatal_signal(ksig->info.si_signo); + if (panic_fatal_signals) + panic("Fatal signal and panic_fatal_signals=1"); proc_coredump_connector(current); /* * If it was able to dump core, this kills all diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ba4d9e85feb8..48023bad5b08 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -557,6 +557,15 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_SYSCTL_PANIC_FATAL_SIGNALS + { + .procname = "panic-fatal-signals", + .data = &panic_fatal_signals, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif #ifdef CONFIG_SPARC { .procname = "reboot-cmd", -- 2.20.0
[PATCH] tty: Add NULL TTY driver
If no console driver is enabled (or if a non-present driver is selected with something like console=null in an attempt to disable the console), opening /dev/console errors out, and init scripts and other userspace code that relies on the existence of a console will fail. Symlinking /dev/null to /dev/console does not solve the problem since /dev/null does not behave like a real TTY. To just provide a dummy console to userspace when no console driver is available or desired, add a ttynull driver which simply discards all writes. It can be chosen on the command line in the standard way, i.e. with console=ttynull. Signed-off-by: Vincent Whitchurch --- drivers/tty/Kconfig | 14 ++ drivers/tty/Makefile | 1 + drivers/tty/ttynull.c | 109 ++ 3 files changed, 124 insertions(+) create mode 100644 drivers/tty/ttynull.c diff --git a/drivers/tty/Kconfig b/drivers/tty/Kconfig index e0a04bfc873e..d862f442f389 100644 --- a/drivers/tty/Kconfig +++ b/drivers/tty/Kconfig @@ -376,6 +376,20 @@ config PPC_EARLY_DEBUG_EHV_BC_HANDLE there simply will be no early console output. This is true also if you don't boot under a hypervisor at all. +config NULL_TTY + tristate "NULL TTY driver" + help + Say Y here if you want a NULL TTY which simply discards messages. + + This is useful to allow userspace applications which expect a console + device to work without modifications even when no console is + available or desired. + + In order to use this driver, you should redirect the console to this + TTY, or boot the kernel with console=ttynull. + + If unsure, say N. + config GOLDFISH_TTY tristate "Goldfish TTY Driver" depends on GOLDFISH diff --git a/drivers/tty/Makefile b/drivers/tty/Makefile index c72cafdf32b4..020b1cd9294f 100644 --- a/drivers/tty/Makefile +++ b/drivers/tty/Makefile @@ -25,6 +25,7 @@ obj-$(CONFIG_ISI) += isicom.o obj-$(CONFIG_MOXA_INTELLIO)+= moxa.o obj-$(CONFIG_MOXA_SMARTIO) += mxser.o obj-$(CONFIG_NOZOMI) += nozomi.o +obj-$(CONFIG_NULL_TTY) += ttynull.o obj-$(CONFIG_ROCKETPORT) += rocket.o obj-$(CONFIG_SYNCLINK_GT) += synclink_gt.o obj-$(CONFIG_SYNCLINKMP) += synclinkmp.o diff --git a/drivers/tty/ttynull.c b/drivers/tty/ttynull.c new file mode 100644 index ..17f05b7eb6d3 --- /dev/null +++ b/drivers/tty/ttynull.c @@ -0,0 +1,109 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 Axis Communications AB + * + * Based on ttyprintk.c: + * Copyright (C) 2010 Samo Pogacnik + */ + +#include +#include +#include + +static const struct tty_port_operations ttynull_port_ops; +static struct tty_driver *ttynull_driver; +static struct tty_port ttynull_port; + +static int ttynull_open(struct tty_struct *tty, struct file *filp) +{ + return tty_port_open(&ttynull_port, tty, filp); +} + +static void ttynull_close(struct tty_struct *tty, struct file *filp) +{ + tty_port_close(&ttynull_port, tty, filp); +} + +static void ttynull_hangup(struct tty_struct *tty) +{ + tty_port_hangup(&ttynull_port); +} + +static int ttynull_write(struct tty_struct *tty, const unsigned char *buf, +int count) +{ + return count; +} + +static int ttynull_write_room(struct tty_struct *tty) +{ + return 65536; +} + +static const struct tty_operations ttynull_ops = { + .open = ttynull_open, + .close = ttynull_close, + .hangup = ttynull_hangup, + .write = ttynull_write, + .write_room = ttynull_write_room, +}; + +static struct tty_driver *ttynull_device(struct console *c, int *index) +{ + *index = 0; + return ttynull_driver; +} + +static struct console ttynull_console = { + .name = "ttynull", + .device = ttynull_device, +}; + +static int __init ttynull_init(void) +{ + struct tty_driver *driver; + int ret; + + driver = tty_alloc_driver(1, + TTY_DRIVER_RESET_TERMIOS | + TTY_DRIVER_REAL_RAW | + TTY_DRIVER_UNNUMBERED_NODE); + if (IS_ERR(driver)) + return PTR_ERR(driver); + + tty_port_init(&ttynull_port); + ttynull_port.ops = &ttynull_port_ops; + + driver->driver_name = "ttynull"; + driver->name = "ttynull"; + driver->type = TTY_DRIVER_TYPE_CONSOLE; + driver->init_termios = tty_std_termios; + driver->init_termios.c_oflag = OPOST | OCRNL | ONOCR | ONLRET; + tty_set_operations(driver, &ttynull_ops); + tty_port_link_device(&ttynull_port, driver, 0); + + ret = tty_register_driver(driver); + if (ret < 0) { + put_tty_driver(driver); + tty_port_destroy(&ttynull_port); + return ret; + } + + ttynull_driver = driver; +
Re: [PATCH] tty: Add NULL TTY driver
On Wed, Apr 03, 2019 at 03:12:13PM +0200, Greg KH wrote: > On Wed, Apr 03, 2019 at 01:33:27PM +0200, Vincent Whitchurch wrote: > > If no console driver is enabled (or if a non-present driver is selected > > with something like console=null in an attempt to disable the console), > > opening /dev/console errors out, and init scripts and other userspace > > code that relies on the existence of a console will fail. Symlinking > > /dev/null to /dev/console does not solve the problem since /dev/null > > does not behave like a real TTY. > > > > To just provide a dummy console to userspace when no console driver is > > available or desired, add a ttynull driver which simply discards all > > writes. It can be chosen on the command line in the standard way, i.e. > > with console=ttynull. > > If they have a broken system that sets "console=null", why would they > know to fix it to be "console=ttynull"? > > I'm all for adding new functionality, but to provide kernel code because > userspace just isn't configured properly, that feels really wrong to me. Especially on embedded systems, it would be convenient to have a simple way to disable the console (both for kernel and userspace) on a system which normally uses it, to free up the UART for other things. In my case some early init script in userspace was actually tring to handle the lack of a console in this way: [ ! -w /dev/console ] || exec 2>/dev/console but the problem is that /dev/console always exists but fails on write, so the obvious way of handling missing devices doesn't work. AFAICS the only way to get rid of /dev/console would be to disable the TTY layer entirely in the kernel or have early userspace delete the file from devtmpfs. > Now if this were to be the "default" if nothing is set up at all, that > might make a bit more sense, but as-is, this doesn't seem very useful. Making it the default now would break users, wouldn't it? Since IIRC if no console is selected, the first registered console will automatically be used by default.
Re: [PATCH] tty: Add NULL TTY driver
On Fri, Apr 05, 2019 at 02:32:41PM +0200, Enrico Weigelt, metux IT consult wrote: > On 05.04.19 11:00, Vincent Whitchurch wrote: > > On Fri, Apr 05, 2019 at 10:39:43AM +0200, Enrico Weigelt, metux IT consult > > wrote: > >> On 03.04.19 16:11, Vincent Whitchurch wrote: > >> > >>> Especially on embedded systems, it would be convenient to have a simple > >>> way to disable the console (both for kernel and userspace) on a system > >>> which normally uses it, to free up the UART for other things. > >> > >> Just symlinking to /dev/null does not work ? > > > > No, /dev/null does not support the TTY ioctls. > > hmm, wo (which programs) do you need, that really need them ? I think it was systemd's debug-shell which complained about the ioctls. > >> OTOH, if you're introducing a dummy console, wouldn't a ringbuffer that, > >> can be read out later, a better option ? > > > > There is already a ttyprintk driver in mainline to send these messages > > to the printk ring buffer if one is actually intrested in what is > > written to the console. There's no option to enable it via console= in > > mainline but I have a patch for that too. > > Great. IMHO, that would be the better way. Like Krzysztof explained, sometimes you just want to discard these messages.
[PATCH] perf bench mem: Always memset source before memcpy
For memcpy, the source pages are memset to zero only when --cycles is used. This leads to wildly different results with or without --cycles, since all sources pages are likely to be mapped to the same zero page without explicit writes. Before this fix: $ export cmd="./perf stat -e LLC-loads -- ./perf bench \ mem memcpy -s 1024MB -l 100 -f default" $ $cmd 2,935,826 LLC-loads 3.821677452 seconds time elapsed $ $cmd --cycles 217,533,436 LLC-loads 8.616725985 seconds time elapsed After this fix: $ $cmd 214,459,686 LLC-loads 8.674301124 seconds time elapsed $ $cmd --cycles 214,758,651 LLC-loads 8.644480006 seconds time elapsed Fixes: 47b5757bac03c3387c ("perf bench mem: Move boilerplate memory allocation to the infrastructure") Signed-off-by: Vincent Whitchurch --- tools/perf/bench/mem-functions.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/tools/perf/bench/mem-functions.c b/tools/perf/bench/mem-functions.c index 9235b76501be..19d45c377ac1 100644 --- a/tools/perf/bench/mem-functions.c +++ b/tools/perf/bench/mem-functions.c @@ -223,12 +223,8 @@ static int bench_mem_common(int argc, const char **argv, struct bench_mem_info * return 0; } -static u64 do_memcpy_cycles(const struct function *r, size_t size, void *src, void *dst) +static void memcpy_prefault(memcpy_t fn, size_t size, void *src, void *dst) { - u64 cycle_start = 0ULL, cycle_end = 0ULL; - memcpy_t fn = r->fn.memcpy; - int i; - /* Make sure to always prefault zero pages even if MMAP_THRESH is crossed: */ memset(src, 0, size); @@ -237,6 +233,15 @@ static u64 do_memcpy_cycles(const struct function *r, size_t size, void *src, vo * to not measure page fault overhead: */ fn(dst, src, size); +} + +static u64 do_memcpy_cycles(const struct function *r, size_t size, void *src, void *dst) +{ + u64 cycle_start = 0ULL, cycle_end = 0ULL; + memcpy_t fn = r->fn.memcpy; + int i; + + memcpy_prefault(fn, size, src, dst); cycle_start = get_cycles(); for (i = 0; i < nr_loops; ++i) @@ -252,11 +257,7 @@ static double do_memcpy_gettimeofday(const struct function *r, size_t size, void memcpy_t fn = r->fn.memcpy; int i; - /* -* We prefault the freshly allocated memory range here, -* to not measure page fault overhead: -*/ - fn(dst, src, size); + memcpy_prefault(fn, size, src, dst); BUG_ON(gettimeofday(&tv_start, NULL)); for (i = 0; i < nr_loops; ++i) -- 2.25.1
[PATCH v3] cifs: Silently ignore unknown oplock break handle
Make SMB2 not print out an error when an oplock break is received for an unknown handle, similar to SMB1. The debug message which is printed for these unknown handles may also be misleading, so fix that too. The SMB2 lease break path is not affected by this patch. Without this, a program which writes to a file from one thread, and opens, reads, and writes the same file from another thread triggers the below errors several times a minute when run against a Samba server configured with "smb2 leases = no". CIFS: VFS: \\192.168.0.1 No task to wake, unknown frame received! NumMids 2 : 424d53fe 0040 0012 .SMB@... 0010: 0001 0020: 0030: Signed-off-by: Vincent Whitchurch --- Notes: v3: - Change debug print to Tom Talpey's suggestion v2: - Drop change to lease break - Rewrite commit message fs/cifs/smb2misc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c index 60d4bd1eae2b..76cd05b8d53b 100644 --- a/fs/cifs/smb2misc.c +++ b/fs/cifs/smb2misc.c @@ -754,8 +754,8 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server) } } spin_unlock(&cifs_tcp_ses_lock); - cifs_dbg(FYI, "Can not process oplock break for non-existent connection\n"); - return false; + cifs_dbg(FYI, "No file id matched, oplock break ignored\n"); + return true; } void -- 2.28.0
[PATCH] CIFS: Prevent error log on spurious oplock break
The SMB1 version of ->is_oplock_break() returns true even if the FileId is not found, as long as the oplock break notification message structure itself appears to be valid. A true return value makes cifs_demultiplex_thread() to not print an error message for such packets. However, the SMB2 version returns false in such cases, leading to an error "No task to wake, unknown frame received!" followed by a hexdump of the packet header being printed by cifs_demultiplex_thread(). Note that before commit fa9c2362497fbd64788063288d ("CIFS: Fix SMB2 oplock break processing"), SMB2 also returned true for the case where a connection was found but the FileId was not, but it's not clear to me if that commit really intended to change the behaviour of the error prints. Change the behaviour of SMB2 to be the same as SMB1 and avoid the error messages for these packets which we ignore as per the spec. Signed-off-by: Vincent Whitchurch --- fs/cifs/smb2misc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c index 60d4bd1eae2b..3ea3bda64083 100644 --- a/fs/cifs/smb2misc.c +++ b/fs/cifs/smb2misc.c @@ -679,7 +679,7 @@ smb2_is_valid_lease_break(char *buffer) } spin_unlock(&cifs_tcp_ses_lock); cifs_dbg(FYI, "Can not process lease break - no lease matched\n"); - return false; + return true; } bool @@ -755,7 +755,7 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server) } spin_unlock(&cifs_tcp_ses_lock); cifs_dbg(FYI, "Can not process oplock break for non-existent connection\n"); - return false; + return true; } void -- 2.28.0
[PATCH v2] cifs: Silently ignore unknown oplock break handle
Make SMB2 not print out an error when an oplock break is received for an unknown handle, similar to SMB1. The SMB2 lease break path is not affected by this patch. Without this, a program which writes to a file from one thread, and opens, reads, and writes the same file from another thread triggers the below errors several times a minute when run against a Samba server configured with "smb2 leases = no". CIFS: VFS: \\192.168.0.1 No task to wake, unknown frame received! NumMids 2 : 424d53fe 0040 0012 .SMB@... 0010: 0001 0020: 0030: Signed-off-by: Vincent Whitchurch --- Notes: v2: - Drop change to lease break - Rewrite commit message fs/cifs/smb2misc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c index 60d4bd1eae2b..4d8576e202e3 100644 --- a/fs/cifs/smb2misc.c +++ b/fs/cifs/smb2misc.c @@ -755,7 +755,7 @@ smb2_is_valid_oplock_break(char *buffer, struct TCP_Server_Info *server) } spin_unlock(&cifs_tcp_ses_lock); cifs_dbg(FYI, "Can not process oplock break for non-existent connection\n"); - return false; + return true; } void -- 2.28.0
Re: [PATCH 1/9] regulator: Update DA9121 dt-bindings
On Fri, Nov 20, 2020 at 01:14:50PM +0100, Adam Ward wrote: > Update bindings for the Dialog Semiconductor DA9121 voltage regulator to add > device variants. > > Signed-off-by: Adam Ward > --- > .../devicetree/bindings/regulator/dlg,da9121.yaml | 177 > +++-- > MAINTAINERS| 2 + > .../dt-bindings/regulator/dlg,da9121-regulator.h | 22 +++ > 3 files changed, 185 insertions(+), 16 deletions(-) > create mode 100644 include/dt-bindings/regulator/dlg,da9121-regulator.h > > diff --git a/Documentation/devicetree/bindings/regulator/dlg,da9121.yaml > b/Documentation/devicetree/bindings/regulator/dlg,da9121.yaml > index cde0d82..1bd177d 100644 > --- a/Documentation/devicetree/bindings/regulator/dlg,da9121.yaml > +++ b/Documentation/devicetree/bindings/regulator/dlg,da9121.yaml > @@ -8,40 +8,185 @@ title: Dialog Semiconductor DA9121 voltage regulator > > maintainers: >- Vincent Whitchurch > + - Adam Ward I'm quite happy to have myself removed from the list instead. You are in a much better position to maintain the bindings for these chips. > - buck1: > -description: > - Initial data for the Buck1 regulator. > -$ref: "regulator.yaml#" > + interrupt-parent: > +maxItems: 1 > +description: Specifies the reference to the interrupt controller. > + > + interrupts: > +maxItems: 1 > +description: IRQ line information. > + > + dlg,irq-polling-delay-passive: > +maxItems: 1 > +description: | > + Specify the polling period, measured in milliseconds, between > interrupt status > + update checks. Range 1000-1 ms. > + > + regulators: > type: object > +$ref: regulator.yaml# > +description: | > + This node defines the settings for the BUCK. The content of the > + sub-node is defined by the standard binding for regulators; see > regulator.yaml. > + The DA9121 regulator is bound using their names listed below > + buck1 - BUCK1 > + buck2 - BUCK2 //DA9122, DA9220, DA9131, DA9132 only This move to a sub-node means that older devicetrees won't work. I assume that's fine since the driver is only in linux-next at the moment, but perhaps it's worth mentioning this in the commit message? > + > +patternProperties: > + "^buck([0-1])$": > +type: object > +$ref: regulator.yaml# > + > +properties: > + regulator-mode: > +maxItems: 1 > +description: Defined in > include/dt-bindings/regulator/dlg,da9121-regulator.h > + > + regulator-initial-mode: > +maxItems: 1 > +description: Defined in > include/dt-bindings/regulator/dlg,da9121-regulator.h > > -unevaluatedProperties: false I think my s/unevaluatedProperties/additionalProperties/ fix is already in linux-next, so this looks like it needs rebasing.
Re: [PATCH 1/9] regulator: Update DA9121 dt-bindings
On Fri, Nov 20, 2020 at 02:47:42PM +0100, Vincent Whitchurch wrote: > On Fri, Nov 20, 2020 at 01:14:50PM +0100, Adam Ward wrote: > > - buck1: > > -description: > > - Initial data for the Buck1 regulator. > > -$ref: "regulator.yaml#" > > + interrupt-parent: > > +maxItems: 1 > > +description: Specifies the reference to the interrupt controller. > > + > > + interrupts: > > +maxItems: 1 > > +description: IRQ line information. > > + > > + dlg,irq-polling-delay-passive: > > +maxItems: 1 > > +description: | > > + Specify the polling period, measured in milliseconds, between > > interrupt status > > + update checks. Range 1000-1 ms. > > + > > + regulators: > > type: object > > +$ref: regulator.yaml# > > +description: | > > + This node defines the settings for the BUCK. The content of the > > + sub-node is defined by the standard binding for regulators; see > > regulator.yaml. > > + The DA9121 regulator is bound using their names listed below > > + buck1 - BUCK1 > > + buck2 - BUCK2 //DA9122, DA9220, DA9131, DA9132 only > > This move to a sub-node means that older devicetrees won't work. I > assume that's fine since the driver is only in linux-next at the moment, > but perhaps it's worth mentioning this in the commit message? Actually, perhaps I'm missing something, but I don't quite see why this move to a sub-node is needed. There is some flexibility in the regulator framework for this as I noted earlier (https://lore.kernel.org/lkml/20201102154848.tm5nsydaukyd7...@axis.com/). For the case of an MFD it certainly makes sense to have a "regulators" sub-node but for these chips it seems rather redundant. Also, perhaps you could split this patch into logical pieces too as Mark has suggested for some of the other patches in this series?
Re: [PATCH] CIFS: Prevent error log on spurious oplock break
On Tue, Mar 09, 2021 at 04:29:14PM +0100, Steve French wrote: > On Tue, Mar 9, 2021, 07:42 Vincent Whitchurch via samba-technical > mailto:samba-techni...@lists.samba.org>> > wrote: >> Thank you for the suggestions. In my case, I've only received some >> reports of this error being emitted very rarely (couple of times a month >> in our stability tests). Right now it looks like the problem may only >> be with a particular NAS, and we're looking into triggering oplock >> breaks more often and catching the problem with some more logging. > > I lean toward reducing or skipping the logging of the 'normsl' (or at > least possible) race between close and oplock break. > > I see this eg spamming the log running xfstest 524 > > Can you repro it as well running that? I haven't run xfstests, but we figured out how to easily trigger the error in a normal use case in our application. I can now easily get the errors to spam the logs with a small program which writes to a file from one thread in a loop and opens, reads, and closes the same file in another thread in a loop. This is against a Samba server configured with "smb2 leases = no". Logs show that the oplock break FileId is not found because of the race between close and oplock break which you mentioned, and in some cases because of another race between open and oplock break (the open was not completed since it was waiting on the response to GetInfo). If this is unavoidable, I think it really would be nice to at least reduce the severity since it's scary-looking and so easy to trigger. How about something like the below? It prints an info message for the first unhandled oplock breaks once. (I'm not sure if the lease key path should be handled differently. If the concerns about removing the message were primarily for that path, perhaps my original patch but with the change to smb2_is_valid_lease_break() dropped could be acceptable?) diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h index 3de3c5908a72..849c3721f8a2 100644 --- a/fs/cifs/cifsglob.h +++ b/fs/cifs/cifsglob.h @@ -256,7 +256,7 @@ struct smb_version_operations { void (*dump_share_caps)(struct seq_file *, struct cifs_tcon *); /* verify the message */ int (*check_message)(char *, unsigned int, struct TCP_Server_Info *); - bool (*is_oplock_break)(char *, struct TCP_Server_Info *); + int (*is_oplock_break)(char *, struct TCP_Server_Info *); int (*handle_cancelled_mid)(char *, struct TCP_Server_Info *); void (*downgrade_oplock)(struct TCP_Server_Info *server, struct cifsInodeInfo *cinode, __u32 oplock, diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h index 75ce6f742b8d..2714b6cdf70a 100644 --- a/fs/cifs/cifsproto.h +++ b/fs/cifs/cifsproto.h @@ -135,7 +135,7 @@ extern int SendReceiveBlockingLock(const unsigned int xid, int *bytes_returned); extern int cifs_reconnect(struct TCP_Server_Info *server); extern int checkSMB(char *buf, unsigned int len, struct TCP_Server_Info *srvr); -extern bool is_valid_oplock_break(char *, struct TCP_Server_Info *); +extern int is_valid_oplock_break(char *, struct TCP_Server_Info *); extern bool backup_cred(struct cifs_sb_info *); extern bool is_size_safe_to_change(struct cifsInodeInfo *, __u64 eof); extern void cifs_update_eof(struct cifsInodeInfo *cifsi, loff_t offset, diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index 112692300fb6..5dc58f0c99b0 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -1009,6 +1009,8 @@ cifs_demultiplex_thread(void *p) server->lstrp = jiffies; for (i = 0; i < num_mids; i++) { + int oplockret = -EINVAL; + if (mids[i] != NULL) { mids[i]->resp_buf_size = server->pdu_size; @@ -1020,17 +1022,24 @@ cifs_demultiplex_thread(void *p) mids[i]->callback(mids[i]); cifs_mid_q_entry_release(mids[i]); - } else if (server->ops->is_oplock_break && - server->ops->is_oplock_break(bufs[i], - server)) { - smb2_add_credits_from_hdr(bufs[i], server); + continue; + } + + if (server->ops->is_oplock_break) + oplockret = server->ops->is_oplock_break(bufs[i], server); + + smb2_add_credits_from_hdr(bufs[i], server); + + if (oplockret == 0) { cifs_dbg(FYI, "Received oplock break\n"); + } else if (oplockret == -ENOENT) { +
Re: [PATCH] CIFS: Prevent error log on spurious oplock break
On Tue, Mar 09, 2021 at 01:05:11AM +0100, ronnie sahlberg wrote: > On Sun, Mar 7, 2021 at 8:52 PM Shyam Prasad N via samba-technical > wrote: > > The reason for rejecting the request maybe a number of things like: > > corrupted request, stale request (for some old session), or for a > > wrong handle. > > I don't think we should treat any of these cases as a success. > > I agree with Shyam here. > We shouldn't change the return value to pretend success just to > suppress a warning. Thank you all for your comments. I see that everyone agrees that the error print is useful for SMB2, so I will drop this patch. > However, if it is common to trigger with false positives we might want > to something to prevent it from > spamming the logs. > These messages could be useful if we encounter bugs in our leasing > code, or bugs in server > lease code, so we should't throw them away completely. But if false > positives are common ... > > Some thoughts I and Stever brainstormed about could be to change the code in > the > demiltiplex thread where we currently dump the packets that were "invalid" > to maybe: > * log once as VFS and then log any future ones as FYI > * log once as VFS and then only make the others available via dynamic > trace points > * rate limit it so we only log it once every n minutes? (this is overkill?) Thank you for the suggestions. In my case, I've only received some reports of this error being emitted very rarely (couple of times a month in our stability tests). Right now it looks like the problem may only be with a particular NAS, and we're looking into triggering oplock breaks more often and catching the problem with some more logging.
Re: [PATCH 0/8] Virtio-over-PCIe on non-MIC
On Fri, Jan 18, 2019 at 04:49:16PM -0700, Stephen Warren wrote: > On 1/16/19 9:32 AM, Vincent Whitchurch wrote: > > The Virtio-over-PCIe framework living under drivers/misc/mic/vop implements > > a > > generic framework to use virtio between two Linux systems, given shared > > memory > > and a couple of interrupts. It does not actually require the Intel MIC > > hardware, x86-64, or even PCIe for that matter. This patch series makes it > > buildable on more systems and adds a loopback driver to test it without > > special > > hardware. > > > > Note that I don't have access to Intel MIC hardware so some testing of the > > patchset (especially the patch "vop: Use consistent DMA") on that platform > > would be appreciated, to ensure that the series does not break anything > > there. > > So a while ago I took a look at running virtio over PCIe. I found virtio > basically had two parts: > > 1) The protocol used to enumerate which virtio devices exist, and perhaps > configure them. > > 2) The ring buffer protocol that actually transfers the data. > > I recall that data transfer was purely based on simple shared memory and > interrupts, and hence could run over PCIe (e.g. via the PCIe endpoint > subsystem in the kernel) without issue. > > However, the enumeration/configuration protocol requires the host to be able > to do all kinds of strange things that can't possibly be emulated over PCIe; > IIRC the configuration data contains "registers" that when written select > the data other "registers" access. When the virtio device is exposed by a > hypervisor, and all the accesses are emulated synchronously through a trap, > this is easy enough to implement. However, if the two ends of this > configuration parsing are on different ends of a PCIe bus, there's no way > this can work. Correct, and that's why the MIC "Virtio-over-PCIe framework" does not try to implement the standard "Virtio Over PCI Bus". (Yes, it's confusing.) > Are you thinking of doing something different for enumeration/configuration, > and just using the virtio ring buffer protocol over PCIe? The mic/vop code already does this. See Documentation/mic/mic_overview.txt for some information. > I did post asking about this quite a while back, but IIRC I didn't receive > much of a response. Yes, here it is: > > > https://lists.linuxfoundation.org/pipermail/virtualization/2018-March/037276.html > "virtio over SW-defined/CPU-driven PCIe endpoint" I came to essentialy the same conclusions before I found the MIC code. (Your "aside" in that email about virtio doing PCIe reads instead of writes is not solved by the MIC code, since that is how the standard virtio devices/drivers work.)
[PATCH] mic: vop: Fix broken virtqueues
VOP is broken in mainline since commit 1ce9e6055fa0a9043 ("virtio_ring: introduce packed ring support"); attempting to use the virtqueues leads to various kernel crashes. I'm testing it with my not-yet-merged loopback patches, but even the in-tree MIC hardware cannot work. The problem is not in the referenced commit per se, but is due to the following hack in vop_find_vq() which depends on the layout of private structures in other source files, which that commit happened to change: /* * To reassign the used ring here we are directly accessing * struct vring_virtqueue which is a private data structure * in virtio_ring.c. At the minimum, a BUILD_BUG_ON() in * vring_new_virtqueue() would ensure that * (&vq->vring == (struct vring *) (&vq->vq + 1)); */ vr = (struct vring *)(vq + 1); vr->used = used; Fix vop by using __vring_new_virtqueue() to create the needed vring layout from the start, instead of attempting to patch in the used ring later. __vring_new_virtqueue() was added way back in commit 2a2d1382fe9dcc ("virtio: Add improved queue allocation API") in order to address mic's usecase, according to the commit message. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 60 +++-- 1 file changed, 34 insertions(+), 26 deletions(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index d2b9782eee87..fef45bf6d519 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -284,6 +284,26 @@ static void vop_del_vqs(struct virtio_device *dev) vop_del_vq(vq, idx++); } +static struct virtqueue *vop_new_virtqueue(unsigned int index, + unsigned int num, + struct virtio_device *vdev, + bool context, + void *pages, + bool (*notify)(struct virtqueue *vq), + void (*callback)(struct virtqueue *vq), + const char *name, + void *used) +{ + bool weak_barriers = false; + struct vring vring; + + vring_init(&vring, num, pages, MIC_VIRTIO_RING_ALIGN); + vring.used = used; + + return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context, +notify, callback, name); +} + /* * This routine will assign vring's allocated in host/io memory. Code in * virtio_ring.c however continues to access this io memory as if it were local @@ -303,7 +323,6 @@ static struct virtqueue *vop_find_vq(struct virtio_device *dev, struct _mic_vring_info __iomem *info; void *used; int vr_size, _vr_size, err, magic; - struct vring *vr; u8 type = ioread8(&vdev->desc->type); if (index >= ioread8(&vdev->desc->num_vq)) @@ -322,17 +341,7 @@ static struct virtqueue *vop_find_vq(struct virtio_device *dev, return ERR_PTR(-ENOMEM); vdev->vr[index] = va; memset_io(va, 0x0, _vr_size); - vq = vring_new_virtqueue( - index, - le16_to_cpu(config.num), MIC_VIRTIO_RING_ALIGN, - dev, - false, - ctx, - (void __force *)va, vop_notify, callback, name); - if (!vq) { - err = -ENOMEM; - goto unmap; - } + info = va + _vr_size; magic = ioread32(&info->magic); @@ -341,7 +350,6 @@ static struct virtqueue *vop_find_vq(struct virtio_device *dev, goto unmap; } - /* Allocate and reassign used ring now */ vdev->used_size[index] = PAGE_ALIGN(sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * le16_to_cpu(config.num)); @@ -351,8 +359,17 @@ static struct virtqueue *vop_find_vq(struct virtio_device *dev, err = -ENOMEM; dev_err(_vop_dev(vdev), "%s %d err %d\n", __func__, __LINE__, err); - goto del_vq; + goto unmap; + } + + vq = vop_new_virtqueue(index, le16_to_cpu(config.num), dev, ctx, + (void __force *)va, vop_notify, callback, + name, used); + if (!vq) { + err = -ENOMEM; + goto free_used; } + vdev->used[index] = dma_map_single(&vpdev->dev, used, vdev->used_size[index], DMA_BIDIRECTIONAL); @@ -360,26 +377,17 @@ static struct virt
[PATCH] sysctl: Add panic-fatal-signals
Add a sysctl which asks the kernel to panic when any userspace process receives a fatal signal which would trigger a core dump. This has proven to be quite useful when debugging problems seen during testing of embedded systems: When combined with kernel core dumps (saved due to the panic), it allows easier debugging of crashes which have their origin in system-wide problems such as buggy drivers or other kernel or hardware-related issues. The crashing process's core dump can be extracted from the kernel core dump using tools such as the crash utility's gcore extension. Signed-off-by: Vincent Whitchurch --- include/linux/signal.h | 1 + kernel/signal.c| 5 - kernel/sysctl.c| 7 +++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/include/linux/signal.h b/include/linux/signal.h index cc7e2c1cd444..109efd1432e9 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -10,6 +10,7 @@ struct task_struct; /* for sysctl */ extern int print_fatal_signals; +extern int panic_fatal_signals; static inline void copy_siginfo(kernel_siginfo_t *to, const kernel_siginfo_t *from) diff --git a/kernel/signal.c b/kernel/signal.c index e1d7ad8e6ab1..83c6877b0182 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -59,6 +59,7 @@ static struct kmem_cache *sigqueue_cachep; int print_fatal_signals __read_mostly; +int panic_fatal_signals __read_mostly; static void __user *sig_handler(struct task_struct *t, int sig) { @@ -2497,8 +2498,10 @@ bool get_signal(struct ksignal *ksig) current->flags |= PF_SIGNALED; if (sig_kernel_coredump(signr)) { - if (print_fatal_signals) + if (print_fatal_signals || panic_fatal_signals) print_fatal_signal(ksig->info.si_signo); + if (panic_fatal_signals) + panic("Fatal signal and panic_fatal_signals=1"); proc_coredump_connector(current); /* * If it was able to dump core, this kills all diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ba4d9e85feb8..ad1ac6ec2004 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -557,6 +557,13 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "panic-fatal-signals", + .data = &panic_fatal_signals, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, #ifdef CONFIG_SPARC { .procname = "reboot-cmd", -- 2.20.0
[PATCH 8/8] vop: Add loopback
Add a loopback driver to allow testing and evaluation of the VOP framework without special hardware. The host and the guest will run under the same kernel. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/Kconfig| 10 + drivers/misc/mic/vop/Makefile | 2 + drivers/misc/mic/vop/vop_loopback.c | 390 3 files changed, 402 insertions(+) create mode 100644 drivers/misc/mic/vop/vop_loopback.c diff --git a/drivers/misc/mic/Kconfig b/drivers/misc/mic/Kconfig index 319ee3c90a26..b181cf20d40f 100644 --- a/drivers/misc/mic/Kconfig +++ b/drivers/misc/mic/Kconfig @@ -149,6 +149,16 @@ config VOP OS and tools for MIC to use with this driver are available from <http://software.intel.com/en-us/mic-developer>. +config VOP_LOOPBACK + tristate "VOP loopback driver" + depends on VOP + help + This enables a loopback driver to test and evaluate the VOP + infrastructure without actual PCIe hardware. The host and the guest + sides run under the same kernel. + + If unsure, say N. + if VOP source "drivers/vhost/Kconfig.vringh" endif diff --git a/drivers/misc/mic/vop/Makefile b/drivers/misc/mic/vop/Makefile index 78819c8999f1..a6ead25c4418 100644 --- a/drivers/misc/mic/vop/Makefile +++ b/drivers/misc/mic/vop/Makefile @@ -7,3 +7,5 @@ obj-m := vop.o vop-objs += vop_main.o vop-objs += vop_debugfs.o vop-objs += vop_vringh.o + +obj-$(CONFIG_VOP_LOOPBACK) += vop_loopback.o diff --git a/drivers/misc/mic/vop/vop_loopback.c b/drivers/misc/mic/vop/vop_loopback.c new file mode 100644 index ..6e3d24f166cf --- /dev/null +++ b/drivers/misc/mic/vop/vop_loopback.c @@ -0,0 +1,390 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 Axis Communications AB + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../bus/vop_bus.h" + +struct mic_irq { + struct list_head list; + int irq; + irqreturn_t (*func)(int irq, void *data); + void *data; +}; + +struct vop_loopback_end { + struct vop_loopback *loopback; + const char *name; + struct vop_device *vop; + struct list_head irqs; + struct mutex mutex; + struct work_struct work; +}; + +struct vop_loopback { + struct device *dev; + void *dp; + struct vop_loopback_end host; + struct vop_loopback_end guest; +}; + +static inline struct vop_loopback *vop_to_loopback(struct device *dev) +{ + return dev_get_drvdata(dev->parent); +} + +static dma_addr_t +vop_loopback_dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, unsigned long attrs) +{ + return page_to_phys(page) + offset; +} + +static void vop_loopback_dma_unmap_page(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ +} + +static int vop_loopback_dma_mmap(struct device *dev, struct vm_area_struct *vma, +void *cpu_addr, dma_addr_t dma_addr, size_t size, +unsigned long attrs) +{ + return remap_pfn_range(vma, vma->vm_start, + PHYS_PFN(dma_addr), size, + vma->vm_page_prot); +} + +static void *vop_loopback_dma_alloc(struct device *dev, size_t size, + dma_addr_t *handle, gfp_t gfp, + unsigned long attrs) +{ + void *p = (void *) __get_free_pages(gfp, get_order(size)); + + if (p) + *handle = virt_to_phys(p); + + return p; +} + +static void vop_loopback_dma_free(struct device *dev, size_t size, + void *cpu_addr, dma_addr_t handle, + unsigned long attrs) +{ + free_pages((unsigned long) (uintptr_t) cpu_addr, get_order(size)); +} + +static const struct dma_map_ops vop_loopback_dma_ops = { + .map_page = vop_loopback_dma_map_page, + .unmap_page = vop_loopback_dma_unmap_page, + .mmap = vop_loopback_dma_mmap, + .alloc = vop_loopback_dma_alloc, + .free = vop_loopback_dma_free, +}; + + +static void vop_loopback_ack_interrupt(struct vop_device *vop, int num) +{ +} + +static int vop_loopback_next_db(struct vop_device *vop) +{ + return 0; +} + +static void *vop_loopback_get_dp(struct vop_device *vop) +{ + struct vop_loopback *loopback = vop_to_loopback(&vop->dev); + + return loopback->dp; +} + +static dma_addr_t vop_loopback_get_dp_dma(struct vop_device *vop) +{ + struct vop_loopback *loopback = vop_to_loopback(&vop->dev); + + return virt_to_phys(loopback->dp); +} + +static void __iomem *vop_loopback_get_remote_dp
[PATCH 3/8] vop: Add definition of readq/writeq if missing
Include so that readq/writeq are replaced by two readl/writel on systems that do not support them. The values read/written are pointers which will be 32-bit on 32-bit systems so the non-atomicity should not matter. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index 5a366cf68d95..26b23f2cf94c 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "vop_main.h" -- 2.20.0
[PATCH 7/8] vop: Use consistent DMA
The vop code maps buffers using the streaming DMA API but never syncs them so it doesn't work on systems without cache coherence. The vrings want consistent mappings, and not streaming mappings so use that API to allocate and map buffers. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/bus/vop_bus.h| 2 + drivers/misc/mic/host/mic_boot.c | 46 + drivers/misc/mic/vop/vop_main.c | 23 ++- drivers/misc/mic/vop/vop_vringh.c | 111 -- 4 files changed, 114 insertions(+), 68 deletions(-) diff --git a/drivers/misc/mic/bus/vop_bus.h b/drivers/misc/mic/bus/vop_bus.h index fff7a865d721..a2db11bea277 100644 --- a/drivers/misc/mic/bus/vop_bus.h +++ b/drivers/misc/mic/bus/vop_bus.h @@ -86,6 +86,7 @@ struct vop_driver { * node to add/remove/configure virtio devices. * @get_dp: Get access to the virtio device page used by the self * node to add/remove/configure virtio devices. + * @get_dp_dma: Get DMA address of the virtio device page. * @send_intr: Send an interrupt to the peer node on a specified doorbell. * @ioremap: Map a buffer with the specified DMA address and length. * @iounmap: Unmap a buffer previously mapped. @@ -103,6 +104,7 @@ struct vop_hw_ops { void (*ack_interrupt)(struct vop_device *vpdev, int num); void __iomem * (*get_remote_dp)(struct vop_device *vpdev); void * (*get_dp)(struct vop_device *vpdev); + dma_addr_t (*get_dp_dma)(struct vop_device *vpdev); void (*send_intr)(struct vop_device *vpdev, int db); void __iomem * (*ioremap)(struct vop_device *vpdev, dma_addr_t pa, size_t len); diff --git a/drivers/misc/mic/host/mic_boot.c b/drivers/misc/mic/host/mic_boot.c index 6479435ac96b..1dcd25917eca 100644 --- a/drivers/misc/mic/host/mic_boot.c +++ b/drivers/misc/mic/host/mic_boot.c @@ -55,9 +55,47 @@ static void _mic_dma_unmap_page(struct device *dev, dma_addr_t dma_addr, mic_unmap_single(mdev, dma_addr, size); } +static int _mic_dma_mmap(struct device *dev, struct vm_area_struct *vma, +void *cpu_addr, dma_addr_t dma_addr, size_t size, +unsigned long attrs) +{ + return remap_pfn_range(vma, vma->vm_start, + virt_to_pfn(cpu_addr), size, + vma->vm_page_prot); +} + +static void *_mic_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, + gfp_t gfp, unsigned long attrs) +{ + void *cpu_addr = (void *) __get_free_pages(gfp, get_order(size)); + dma_addr_t dma_addr; + + if (!cpu_addr) + return NULL; + + *handle = dma_map_single(dev, p, size, attrs); + if (dma_mapping_error(dev, *handle)) { + free_pages((unsigned long) (uintptr_t) cpu_addr, + get_order(size)); + return NULL; + } + + return cpu_addr; +} + +static void _mic_dma_free(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t handle, unsigned long attrs) +{ + dma_unmap_single(dev, cpu_addr, handle, attrs); + free_pages((unsigned long) (uintptr_t) cpu_addr, get_order(size)); +} + static const struct dma_map_ops _mic_dma_ops = { .map_page = _mic_dma_map_page, .unmap_page = _mic_dma_unmap_page, + .mmap = _mic_dma_mmap, + .alloc = _mic_dma_alloc, + .free = _mic_dma_free, }; static struct mic_irq * @@ -100,6 +138,13 @@ static void *__mic_get_dp(struct vop_device *vpdev) return mdev->dp; } +static void *__mic_get_dp_dma(struct vop_device *vpdev) +{ + struct mic_device *mdev = vpdev_to_mdev(&vpdev->dev); + + return mdev->dp_dma_addr; +} + static void __iomem *__mic_get_remote_dp(struct vop_device *vpdev) { return NULL; @@ -131,6 +176,7 @@ static struct vop_hw_ops vop_hw_ops = { .ack_interrupt = __mic_ack_interrupt, .next_db = __mic_next_db, .get_dp = __mic_get_dp, + .get_dp_dma = __mic_get_dp_dma, .get_remote_dp = __mic_get_remote_dp, .send_intr = __mic_send_intr, .ioremap = __mic_ioremap, diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index dd49169ef53d..6d4a4e8993bb 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -264,9 +264,8 @@ static void vop_del_vq(struct virtqueue *vq, int n) struct vring *vr = (struct vring *)(vq + 1); struct vop_device *vpdev = vdev->vpdev; - dma_unmap_single(&vpdev->dev, vdev->used[n], -vdev->used_size[n], DMA_BIDIRECTIONAL); - free_pages((unsigned long)vr->used, get_order(vdev->used_size[n])); + dma_free_wc(&vpdev->dev, vdev->used_size[n], vr->used, vdev->used[n]); + vring_del_virtqueue(vq); vpdev->hw_ops->iounmap(vpdev, vdev->vr[n]);
[PATCH 1/8] vop: Use %z for size_t
Fixes these kind of errors on 32-bit: drivers/misc/mic/vop/vop_vringh.c:590:3: error: format '%lx' expects argument of type 'long unsigned int', but argument 7 has type 'size_t {aka unsigned int}' [-Werror=format=] Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_vringh.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/misc/mic/vop/vop_vringh.c b/drivers/misc/mic/vop/vop_vringh.c index cbc8ebcff5cf..4267fa08b483 100644 --- a/drivers/misc/mic/vop/vop_vringh.c +++ b/drivers/misc/mic/vop/vop_vringh.c @@ -587,7 +587,7 @@ static int vop_virtio_copy_to_user(struct vop_vdev *vdev, void __user *ubuf, err: vpdev->hw_ops->iounmap(vpdev, dbuf); dev_dbg(vop_dev(vdev), - "%s: ubuf %p dbuf %p len 0x%lx vr_idx 0x%x\n", + "%s: ubuf %p dbuf %p len 0x%zx vr_idx 0x%x\n", __func__, ubuf, dbuf, len, vr_idx); return err; } @@ -670,7 +670,7 @@ static int vop_virtio_copy_from_user(struct vop_vdev *vdev, void __user *ubuf, err: vpdev->hw_ops->iounmap(vpdev, dbuf); dev_dbg(vop_dev(vdev), - "%s: ubuf %p dbuf %p len 0x%lx vr_idx 0x%x\n", + "%s: ubuf %p dbuf %p len 0x%zx vr_idx 0x%x\n", __func__, ubuf, dbuf, len, vr_idx); return err; } -- 2.20.0
[PATCH 5/8] vop: vringh: Do not crash if no DMA channel
Fallback gracefully if no DMA channel is provided instead of dereferencing NULL pointers. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_vringh.c | 32 +++ 1 file changed, 20 insertions(+), 12 deletions(-) diff --git a/drivers/misc/mic/vop/vop_vringh.c b/drivers/misc/mic/vop/vop_vringh.c index ccdfbb652123..18d6ecff1834 100644 --- a/drivers/misc/mic/vop/vop_vringh.c +++ b/drivers/misc/mic/vop/vop_vringh.c @@ -531,12 +531,12 @@ static int vop_virtio_copy_to_user(struct vop_vdev *vdev, void __user *ubuf, void __iomem *dbuf = vpdev->hw_ops->ioremap(vpdev, daddr, len); struct vop_vringh *vvr = &vdev->vvr[vr_idx]; struct vop_info *vi = dev_get_drvdata(&vpdev->dev); - size_t dma_alignment = 1 << vi->dma_ch->device->copy_align; - bool x200 = is_dma_copy_aligned(vi->dma_ch->device, 1, 1, 1); + size_t dma_alignment; + bool x200; size_t dma_offset, partlen; int err; - if (!VOP_USE_DMA) { + if (!VOP_USE_DMA || !vi->dma_ch) { if (copy_to_user(ubuf, (void __force *)dbuf, len)) { err = -EFAULT; dev_err(vop_dev(vdev), "%s %d err %d\n", @@ -548,6 +548,9 @@ static int vop_virtio_copy_to_user(struct vop_vdev *vdev, void __user *ubuf, goto err; } + dma_alignment = 1 << vi->dma_ch->device->copy_align; + x200 = is_dma_copy_aligned(vi->dma_ch->device, 1, 1, 1); + dma_offset = daddr - round_down(daddr, dma_alignment); daddr -= dma_offset; len += dma_offset; @@ -606,18 +609,23 @@ static int vop_virtio_copy_from_user(struct vop_vdev *vdev, void __user *ubuf, void __iomem *dbuf = vpdev->hw_ops->ioremap(vpdev, daddr, len); struct vop_vringh *vvr = &vdev->vvr[vr_idx]; struct vop_info *vi = dev_get_drvdata(&vdev->vpdev->dev); - size_t dma_alignment = 1 << vi->dma_ch->device->copy_align; - bool x200 = is_dma_copy_aligned(vi->dma_ch->device, 1, 1, 1); + size_t dma_alignment; + bool x200; size_t partlen; - bool dma = VOP_USE_DMA; + bool dma = VOP_USE_DMA && vi->dma_ch; int err = 0; - if (daddr & (dma_alignment - 1)) { - vdev->tx_dst_unaligned += len; - dma = false; - } else if (ALIGN(len, dma_alignment) > dlen) { - vdev->tx_len_unaligned += len; - dma = false; + if (dma) { + dma_alignment = 1 << vi->dma_ch->device->copy_align; + x200 = is_dma_copy_aligned(vi->dma_ch->device, 1, 1, 1); + + if (daddr & (dma_alignment - 1)) { + vdev->tx_dst_unaligned += len; + dma = false; + } else if (ALIGN(len, dma_alignment) > dlen) { + vdev->tx_len_unaligned += len; + dma = false; + } } if (!dma) -- 2.20.0
[PATCH 6/8] vop: Fix handling of >32 feature bits
This is needed, for example, for VIRTIO_F_IOMMU_PLATFORM. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index 26b23f2cf94c..dd49169ef53d 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -117,7 +117,7 @@ _vop_total_desc_size(struct mic_device_desc __iomem *desc) static u64 vop_get_features(struct virtio_device *vdev) { unsigned int i, bits; - u32 features = 0; + u64 features = 0; struct mic_device_desc __iomem *desc = to_vopvdev(vdev)->desc; u8 __iomem *in_features = _vop_vq_features(desc); int feature_len = ioread8(&desc->feature_len); @@ -125,7 +125,7 @@ static u64 vop_get_features(struct virtio_device *vdev) bits = min_t(unsigned, feature_len, sizeof(vdev->features)) * 8; for (i = 0; i < bits; i++) if (ioread8(&in_features[i / 8]) & (BIT(i % 8))) - features |= BIT(i); + features |= BIT_ULL(i); return features; } -- 2.20.0
[PATCH 4/8] vop: Allow building on more systems
VOP_BUS does not actually depend on 32-bit X86 or PCI. The code uses archdata.dma_ops so it can be built on ARM too. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/misc/mic/Kconfig b/drivers/misc/mic/Kconfig index 227cc7443671..319ee3c90a26 100644 --- a/drivers/misc/mic/Kconfig +++ b/drivers/misc/mic/Kconfig @@ -38,7 +38,7 @@ comment "VOP Bus Driver" config VOP_BUS tristate "VOP Bus Driver" - depends on 64BIT && PCI && X86 && X86_DEV_DMA_OPS + depends on ARM || (X86 && X86_DEV_DMA_OPS) help This option is selected by any driver which registers a device or driver on the VOP Bus, such as CONFIG_INTEL_MIC_HOST @@ -132,7 +132,7 @@ comment "VOP Driver" config VOP tristate "VOP Driver" - depends on 64BIT && PCI && X86 && VOP_BUS + depends on VOP_BUS select VHOST_RING select VIRTIO help -- 2.20.0
[PATCH 2/8] vop: Cast pointers to uintptr_t
Fix these on 32-bit: vop_vringh.c:711:13: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast] Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 8 drivers/misc/mic/vop/vop_vringh.c | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index 6b212c8b78e7..5a366cf68d95 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -497,7 +497,7 @@ static int _vop_add_device(struct mic_device_desc __iomem *d, vdev->desc = d; vdev->dc = (void __iomem *)d + _vop_aligned_desc_size(d); vdev->dnode = dnode; - vdev->vdev.priv = (void *)(u64)dnode; + vdev->vdev.priv = (void *)(uintptr_t)dnode; init_completion(&vdev->reset_done); vdev->h2c_vdev_db = vpdev->hw_ops->next_db(vpdev); @@ -519,7 +519,7 @@ static int _vop_add_device(struct mic_device_desc __iomem *d, offset, type); goto free_irq; } - writeq((u64)vdev, &vdev->dc->vdev); + writeq((uintptr_t)vdev, &vdev->dc->vdev); dev_dbg(_vop_dev(vdev), "%s: registered vop device %u type %u vdev %p\n", __func__, offset, type, vdev); @@ -552,7 +552,7 @@ static void _vop_handle_config_change(struct mic_device_desc __iomem *d, { struct mic_device_ctrl __iomem *dc = (void __iomem *)d + _vop_aligned_desc_size(d); - struct _vop_vdev *vdev = (struct _vop_vdev *)readq(&dc->vdev); + struct _vop_vdev *vdev = (struct _vop_vdev *)(uintptr_t)readq(&dc->vdev); if (ioread8(&dc->config_change) != MIC_VIRTIO_PARAM_CONFIG_CHANGED) return; @@ -571,7 +571,7 @@ static int _vop_remove_device(struct mic_device_desc __iomem *d, { struct mic_device_ctrl __iomem *dc = (void __iomem *)d + _vop_aligned_desc_size(d); - struct _vop_vdev *vdev = (struct _vop_vdev *)readq(&dc->vdev); + struct _vop_vdev *vdev = (struct _vop_vdev *)(uintptr_t)readq(&dc->vdev); u8 status; int ret = -1; diff --git a/drivers/misc/mic/vop/vop_vringh.c b/drivers/misc/mic/vop/vop_vringh.c index 4267fa08b483..ccdfbb652123 100644 --- a/drivers/misc/mic/vop/vop_vringh.c +++ b/drivers/misc/mic/vop/vop_vringh.c @@ -708,12 +708,12 @@ static int vop_vringh_copy(struct vop_vdev *vdev, struct vringh_kiov *iov, partlen = min(kiov->iov_len, len); if (read) ret = vop_virtio_copy_to_user(vdev, ubuf, partlen, - (u64)kiov->iov_base, + (uintptr_t)kiov->iov_base, kiov->iov_len, vr_idx); else ret = vop_virtio_copy_from_user(vdev, ubuf, partlen, - (u64)kiov->iov_base, + (uintptr_t)kiov->iov_base, kiov->iov_len, vr_idx); if (ret) { -- 2.20.0
[PATCH 0/8] Virtio-over-PCIe on non-MIC
The Virtio-over-PCIe framework living under drivers/misc/mic/vop implements a generic framework to use virtio between two Linux systems, given shared memory and a couple of interrupts. It does not actually require the Intel MIC hardware, x86-64, or even PCIe for that matter. This patch series makes it buildable on more systems and adds a loopback driver to test it without special hardware. Note that I don't have access to Intel MIC hardware so some testing of the patchset (especially the patch "vop: Use consistent DMA") on that platform would be appreciated, to ensure that the series does not break anything there. Vincent Whitchurch (8): vop: Use %z for size_t vop: Cast pointers to uintptr_t vop: Add definition of readq/writeq if missing vop: Allow building on more systems vop: vringh: Do not crash if no DMA channel vop: Fix handling of >32 feature bits vop: Use consistent DMA vop: Add loopback drivers/misc/mic/Kconfig| 14 +- drivers/misc/mic/bus/vop_bus.h | 2 + drivers/misc/mic/host/mic_boot.c| 46 drivers/misc/mic/vop/Makefile | 2 + drivers/misc/mic/vop/vop_loopback.c | 390 drivers/misc/mic/vop/vop_main.c | 36 +-- drivers/misc/mic/vop/vop_vringh.c | 151 ++- 7 files changed, 549 insertions(+), 92 deletions(-) create mode 100644 drivers/misc/mic/vop/vop_loopback.c -- 2.20.0
Re: [PATCH 0/8] Virtio-over-PCIe on non-MIC
On Wed, Jan 16, 2019 at 06:07:53PM +0100, Arnd Bergmann wrote: > On Wed, Jan 16, 2019 at 5:33 PM Vincent Whitchurch > wrote: > > The Virtio-over-PCIe framework living under drivers/misc/mic/vop implements > > a > > generic framework to use virtio between two Linux systems, given shared > > memory > > and a couple of interrupts. It does not actually require the Intel MIC > > hardware, x86-64, or even PCIe for that matter. This patch series makes it > > buildable on more systems and adds a loopback driver to test it without > > special > > hardware. > > > > Note that I don't have access to Intel MIC hardware so some testing of the > > patchset (especially the patch "vop: Use consistent DMA") on that platform > > would be appreciated, to ensure that the series does not break anything > > there. > > I think we need to take a step back though and discuss what combinations > we actually do want to support. I have not actually read the whole mic/vop > driver, so I don't know if this would be a good fit as a generic interface -- > it may or may not be, and any other input would be helpful. The MIC driver as a whole is uninteresting as a generic interface since it is quite tied to the Intel hardware. The VOP parts though are logically separated and have no relation to that hardware, even if the ioctls are called MIC_VIRTIO_*. The samples/mic/mpssd/mpssd.c code handles both the boot of the MIC (sysfs) and the VOP parts (ioctls). > Aside from that, I should note that we have two related subsystems > in the kernel: the PCIe endpoint subsystem maintained by Kishon and > Lorenzo, and the NTB subsystem maintained by Jon, Dave and Allen. > > In order to properly support virtio over PCIe, I would hope we can come > up with a user space interface that looks the same way for configuring > virtio drivers in mic, pcie-endpoint and ntb, if at all possible. Have > you looked at those two subsystems? pcie-endpoint is a generic framework that allows Linux to act as an endpoint and set up the BARs, etc. mic appears to have Intel MIC-specific code for this (pre-dating pcie-endpoint) but this is separate from the vop code. pcie-endpoint and vop do not have overlapping functionality and can be used together. I'm not familiar with NTB, but from a quick look it seems to be tied to special hardware, and I don't see any virtio-related code there. A vop backend for NTB-backend would presumably work to allow virtio functionality there.
Re: [PATCH] sysctl: Add panic-fatal-signals
On Wed, Jan 16, 2019 at 09:06:18AM -0800, Kees Cook wrote: > On Wed, Jan 16, 2019 at 4:54 AM Vincent Whitchurch > wrote: > > Add a sysctl which asks the kernel to panic when any userspace process > > receives a fatal signal which would trigger a core dump. This has > > proven to be quite useful when debugging problems seen during testing of > > embedded systems: When combined with kernel core dumps (saved due to > > the panic), it allows easier debugging of crashes which have their > > origin in system-wide problems such as buggy drivers or other kernel or > > hardware-related issues. > > > > The crashing process's core dump can be extracted from the kernel core > > dump using tools such as the crash utility's gcore extension. > > Whoa. That's intense. :) "Insane" is the word I used myself in an earlier draft of the commit message. :) > If you have a use-case for it, then I guess I > can't complain. Given that this is an instant DoS for a system that > isn't expecting it, I do wonder if it should live behind a CONFIG or > something to make it available only when really wanted? Sure, I can put it behind a "depends on EXPERT" config option. Distros and the like should have no reason to provide access to such a sysctl.
Re: [PATCH 0/8] Virtio-over-PCIe on non-MIC
On Thu, Jan 17, 2019 at 01:39:27PM +0100, Arnd Bergmann wrote: > Correct, and again we have to see if this is a good interface. The NTB > and PCIe-endpoint interfaces have a number of differences and a > number of similarities. In particular they should both be usable with > virtio-style drivers, but the underlying hardware differs mainly in how > it is probed by the system: an NTB is seen as a PCI device attached > to two host bridges, while and endpoint is typically a platform_device > on one side, but a pci_dev on the other side. > > Can you describe how you expect a VOP device over NTB or > PCIe-endpoint would get created, configured and used? Assuming PCIe-endpoint: On the RC, a vop-host-backend driver (PCI driver) sets up some shared memory area which the RC and the endpoint can use to communicate the location of the MIC device descriptors and other information such as the MSI address. It implements vop callbacks to allow the vop framework to obtain the address of the MIC descriptors and send/receive interrupts to/from the guest. On the endpoint, the PCIe endpoint driver sets up (hardcoded) BARs and memory regions as required to allow the endpoint and the root complex to access each other's memory. On the endpoint, the vop-guest-backend, via the shared memory set up by the vop-host-backend, obtains the address of the MIC device page and the MSI address, and a method to receive vop interrupts from the host. This information is used to implement the vop callbacks allowing the vop framework to access to the MIC device page and send/receive interrupts from/to the host. vop (despite its name) doesn't care about PCIe. The vop-guest-backend doesn't actually need to talk to the PCIe endpoint driver. The vop-guest-backend can be probed via any means, such as via a device tree on the endpoint. On the RC, userspace opens the vop device and adds the virtio devices, which end up in the MIC device page set up by the vop-host-backend. On the endpoint, when the vop framework (via the vop-guest-backend) sees these devices, it registers devices on the virtio bus and the virtio drivers are probed. On the RC, userspace implements the device end of the virtio communication in userspace, using the MIC_VIRTIO_COPY_DESC ioctl. I also have patches to support vhost. > Is there always one master side that is responsible for creating > virtio devices on it, with the slave side automatically attaching to > them, or can either side create virtio devices? Only the master can create virtio devices. The virtio drivers run on the slave. > Is there any limit on > the number of virtio devices or queues within a VOP device? The virtio device information (mic_device_desc) is put into the MIC device page whose size is limited by the ABI header in include/uapi/linux/mic_ioctl.h (MIC_DP_SIZE, 4096 bytes). So the number of devices is limited by the limit of the number of device descriptors that can fit in that size. There is also a per-device limit on the number of vrings (MIC_VRING_ENTRIES) and vring entries (MIC_VRING_ENTRIES) in the ABI header.
Re: [PATCH 0/8] Virtio-over-PCIe on non-MIC
On Thu, Jan 17, 2019 at 07:21:42AM -0800, Christoph Hellwig wrote: > On Thu, Jan 17, 2019 at 04:19:06PM +0100, Vincent Whitchurch wrote: > > On the RC, a vop-host-backend driver (PCI driver) sets up some shared > > memory area which the RC and the endpoint can use to communicate the > > location of the MIC device descriptors and other information such as the > > MSI address. It implements vop callbacks to allow the vop framework to > > obtain the address of the MIC descriptors and send/receive interrupts > > to/from the guest. > > Why would we require any work on the RC / host side? A properly > setup software controlled virtio device should just show up as a > normal PCIe device, and the virtio-pci device should bind to it. If I understand you correctly, I think you're talking about the RC running the virtio drivers and the endpoint implementing the virtio device? This vop stuff is used for the other way around: the virtio device is implement on the RC and the endpoint runs the virtio drivers.
Re: [PATCH 0/8] Virtio-over-PCIe on non-MIC
On Thu, Jan 17, 2019 at 04:53:25PM +0100, Arnd Bergmann wrote: > On Thu, Jan 17, 2019 at 4:19 PM Vincent Whitchurch > wrote: > > On Thu, Jan 17, 2019 at 01:39:27PM +0100, Arnd Bergmann wrote: > > > Can you describe how you expect a VOP device over NTB or > > > PCIe-endpoint would get created, configured and used? > > > > Assuming PCIe-endpoint: > > > > On the RC, a vop-host-backend driver (PCI driver) sets up some shared > > memory area which the RC and the endpoint can use to communicate the > > location of the MIC device descriptors and other information such as the > > MSI address. It implements vop callbacks to allow the vop framework to > > obtain the address of the MIC descriptors and send/receive interrupts > > to/from the guest. > > > > On the endpoint, the PCIe endpoint driver sets up (hardcoded) BARs and > > memory regions as required to allow the endpoint and the root complex to > > access each other's memory. > > > > On the endpoint, the vop-guest-backend, via the shared memory set up by > > the vop-host-backend, obtains the address of the MIC device page and the > > MSI address, and a method to receive vop interrupts from the host. This > > information is used to implement the vop callbacks allowing the vop > > framework to access to the MIC device page and send/receive interrupts > > from/to the host. > > Ok, this seems fine so far. So the vop-host-backend is a regular PCI > driver that implements the VOP protocol from the host side, and it > can talk to either a MIC, or another guest-backend written for the PCI-EP > framework to implement the same protocol, right? Yes, but just to clarify: the placement of the device page and the way to communicate the location of the device page address and any other information needed by the guest-backend are hardware-specific so there is no generic vop-host-backend implementation which can talk to both a MIC and to something else. > > vop (despite its name) doesn't care about PCIe. The vop-guest-backend > > doesn't actually need to talk to the PCIe endpoint driver. The > > vop-guest-backend can be probed via any means, such as via a device tree > > on the endpoint. > > > > On the RC, userspace opens the vop device and adds the virtio devices, > > which end up in the MIC device page set up by the vop-host-backend. > > > > On the endpoint, when the vop framework (via the vop-guest-backend) sees > > these devices, it registers devices on the virtio bus and the virtio > > drivers are probed. > > Ah, so the direction is fixed, and it's the opposite of what Christoph > and I were expecting. This is probably something we need to discuss > a bit. From what I understand, there is no technical requirement why > it has to be this direction, right? I don't think the vop framework itself has any such requirement. The MIC uses it in this way (see Documentation/mic/mic_overview.txt) and it also makes sense (to me, at least) if one wants to treat the endpoint like one would treat a virtualized guest. > What I mean is that the same vop framework could work with > a PCI-EP driver implementing the vop-host-backend and > a PCI driver implementing the vop-guest-backend? In order > to do this, the PCI-EP configuration would need to pick whether > it wants the EP to be the vop host or guest, but having more > flexibility in it (letting each side add virtio devices) would be > harder to do. Correct, this is my understanding also. > > On the RC, userspace implements the device end of the virtio > > communication in userspace, using the MIC_VIRTIO_COPY_DESC ioctl. I > > also have patches to support vhost. > > This is a part I don't understand yet. Does this mean that the > normal operation is between a user space process on the vop-host > talking to the kernel on the vop-guest? Yes. For example, the guest mounts a 9p filesystem with virtio-9p and the 9p server is implemented in a userspace process on the host. This is again similar to virtualization. > I'm a bit worried about the ioctl interface here, as this combines the > configuration side with the actual data transfer, and that seems > a bit inflexible. > > > > Is there always one master side that is responsible for creating > > > virtio devices on it, with the slave side automatically attaching to > > > them, or can either side create virtio devices? > > > > Only the master can create virtio devices. The virtio drivers run on > > the slave. > > Ok. > > > > Is there any limit on > > > the number of virtio devices or queues within a VOP device? > > > > The virtio device infor
[PATCH] mic: vop: Fix crash on remove
The remove path contains a hack which depends on internal structures in other source files, similar to the one which was recently removed from the registration path. Since commit 1ce9e6055fa0 ("virtio_ring: introduce packed ring support"), this leads to a crash when vop devices are removed. The structure in question is only examined to get the virtual address of the allocated used page. Store that pointer locally instead to fix the crash. Fixes: 1ce9e6055fa0 ("virtio_ring: introduce packed ring support") Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index eb8d0937abda..c4517703dc1d 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -48,7 +48,8 @@ * @dc: Virtio device control * @vpdev: VOP device which is the parent for this virtio device * @vr: Buffer for accessing the VRING - * @used: Buffer for used + * @used_virt: Virtual address of used ring + * @used: DMA address of used ring * @used_size: Size of the used buffer * @reset_done: Track whether VOP reset is complete * @virtio_cookie: Cookie returned upon requesting a interrupt @@ -62,6 +63,7 @@ struct _vop_vdev { struct mic_device_ctrl __iomem *dc; struct vop_device *vpdev; void __iomem *vr[VOP_MAX_VRINGS]; + void *used_virt[VOP_MAX_VRINGS]; dma_addr_t used[VOP_MAX_VRINGS]; int used_size[VOP_MAX_VRINGS]; struct completion reset_done; @@ -261,12 +263,12 @@ static bool vop_notify(struct virtqueue *vq) static void vop_del_vq(struct virtqueue *vq, int n) { struct _vop_vdev *vdev = to_vopvdev(vq->vdev); - struct vring *vr = (struct vring *)(vq + 1); struct vop_device *vpdev = vdev->vpdev; dma_unmap_single(&vpdev->dev, vdev->used[n], vdev->used_size[n], DMA_BIDIRECTIONAL); - free_pages((unsigned long)vr->used, get_order(vdev->used_size[n])); + free_pages((unsigned long)vdev->used_virt[n], + get_order(vdev->used_size[n])); vring_del_virtqueue(vq); vpdev->hw_ops->unmap(vpdev, vdev->vr[n]); vdev->vr[n] = NULL; @@ -355,6 +357,7 @@ static struct virtqueue *vop_find_vq(struct virtio_device *dev, le16_to_cpu(config.num)); used = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, get_order(vdev->used_size[index])); + vdev->used_virt[index] = used; if (!used) { err = -ENOMEM; dev_err(_vop_dev(vdev), "%s %d err %d\n", -- 2.20.0
[PATCH] mic: vop: Fix use-after-free on remove
KASAN detects a use-after-free when vop devices are removed. This problem was introduced by commit 0063e8bbd2b62d136 ("virtio_vop: don't kfree device on register failure"). That patch moved the freeing of the struct _vop_vdev to the release function, but failed to ensure that vop holds a reference to the device when it doesn't want it to go away. A kfree() was replaced with a put_device() in the unregistration path, but the last reference to the device is already dropped in unregister_virtio_device() so the struct is freed before vop is done with it. Fix it by holding a reference until cleanup is done. This is similar to the fix in virtio_pci in commit 2989be09a8a9d6 ("virtio_pci: fix use after free on release"). == BUG: KASAN: use-after-free in vop_scan_devices+0xc6c/0xe50 [vop] Read of size 8 at addr 88800da18580 by task kworker/0:1/12 CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.0.0-rc4+ #53 Workqueue: events vop_hotplug_devices [vop] Call Trace: dump_stack+0x74/0xbb print_address_description+0x5d/0x2b0 ? vop_scan_devices+0xc6c/0xe50 [vop] kasan_report+0x152/0x1aa ? vop_scan_devices+0xc6c/0xe50 [vop] ? vop_scan_devices+0xc6c/0xe50 [vop] vop_scan_devices+0xc6c/0xe50 [vop] ? vop_loopback_free_irq+0x160/0x160 [vop_loopback] process_one_work+0x7c0/0x14b0 ? pwq_dec_nr_in_flight+0x2d0/0x2d0 ? do_raw_spin_lock+0x120/0x280 worker_thread+0x8f/0xbf0 ? __kthread_parkme+0x78/0xf0 ? process_one_work+0x14b0/0x14b0 kthread+0x2ae/0x3a0 ? kthread_park+0x120/0x120 ret_from_fork+0x3a/0x50 Allocated by task 12: kmem_cache_alloc_trace+0x13a/0x2a0 vop_scan_devices+0x473/0xe50 [vop] process_one_work+0x7c0/0x14b0 worker_thread+0x8f/0xbf0 kthread+0x2ae/0x3a0 ret_from_fork+0x3a/0x50 Freed by task 12: kfree+0x104/0x310 device_release+0x73/0x1d0 kobject_put+0x14f/0x420 unregister_virtio_device+0x32/0x50 vop_scan_devices+0x19d/0xe50 [vop] process_one_work+0x7c0/0x14b0 worker_thread+0x8f/0xbf0 kthread+0x2ae/0x3a0 ret_from_fork+0x3a/0x50 The buggy address belongs to the object at 88800da18008 which belongs to the cache kmalloc-2k of size 2048 The buggy address is located 1400 bytes inside of 2048-byte region [88800da18008, 88800da18808) The buggy address belongs to the page: page:ea368600 count:1 mapcount:0 mapping:88801440dbc0 index:0x0 compound_mapcount: 0 flags: 0x40010200(slab|head) raw: 40010200 ea378608 ea37a008 88801440dbc0 raw: 000d000d 0001 page dumped because: kasan: bad access detected Memory state around the buggy address: 88800da18480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 88800da18500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >88800da18580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ 88800da18600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 88800da18680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb == Fixes: 0063e8bbd2b62d136 ("virtio_vop: don't kfree device on register failure") Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index 6606e704b832..c4517703dc1d 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -598,6 +598,8 @@ static int _vop_remove_device(struct mic_device_desc __iomem *d, int ret = -1; if (ioread8(&dc->config_change) == MIC_VIRTIO_PARAM_DEV_REMOVE) { + struct device *dev = get_device(&vdev->vdev.dev); + dev_dbg(&vpdev->dev, "%s %d config_change %d type %d vdev %p\n", __func__, __LINE__, @@ -609,7 +611,7 @@ static int _vop_remove_device(struct mic_device_desc __iomem *d, iowrite8(-1, &dc->h2c_vdev_db); if (status & VIRTIO_CONFIG_S_DRIVER_OK) wait_for_completion(&vdev->reset_done); - put_device(&vdev->vdev.dev); + put_device(dev); iowrite8(1, &dc->guest_ack); dev_dbg(&vpdev->dev, "%s %d guest_ack %d\n", __func__, __LINE__, ioread8(&dc->guest_ack)); -- 2.20.0
Re: [PATCH v2] mic: vop: Fix broken virtqueues
On Wed, Jan 30, 2019 at 08:40:23PM +0100, Greg KH wrote: > This patch is already in my tree, can you just send a "fixup" patch > instead of me having to revert the old one and then adding this one > again? I've now posted the fixup patch separately as "mic: vop: Fix crash on remove". Thanks.
Re: [PATCH] tty: Add NULL TTY driver
On Fri, Apr 05, 2019 at 10:39:43AM +0200, Enrico Weigelt, metux IT consult wrote: > On 03.04.19 16:11, Vincent Whitchurch wrote: > > > Especially on embedded systems, it would be convenient to have a simple > > way to disable the console (both for kernel and userspace) on a system > > which normally uses it, to free up the UART for other things. > > Just symlinking to /dev/null does not work ? No, /dev/null does not support the TTY ioctls. > OTOH, if you're introducing a dummy console, wouldn't a ringbuffer that, > can be read out later, a better option ? There is already a ttyprintk driver in mainline to send these messages to the printk ring buffer if one is actually intrested in what is written to the console. There's no option to enable it via console= in mainline but I have a patch for that too.
[PATCH v2] printk: Do not lose last line in kmsg buffer dump
kmsg_dump_get_buffer() is supposed to select all the youngest log messages which fit into the provided buffer. It determines the correct start index by using msg_print_text() with a NULL buffer to calculate the size of each entry. However, when performing the actual writes, msg_print_text() only writes the entry to the buffer if the written len is lesser than the size of the buffer. So if the lengths of the selected youngest log messages happen to precisely fill up the provided buffer, the last log message is not included. We don't want to modify msg_print_text() to fill up the buffer and start returning a length which is equal to the size of the buffer, since callers of its other users, such as kmsg_dump_get_line(), depend upon the current behaviour. Instead, fix kmsg_dump_get_buffer() to compensate for this. For example, with the following two final prints: [6.427502] A [6.427769] 12345 A dump of a 64-byte buffer filled by kmsg_dump_get_buffer(), before this patch: : 3c 30 3e 5b 20 20 20 20 36 2e 35 32 32 31 39 37 <0>[6.522197 0010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a ] A. 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 After this patch: : 3c 30 3e 5b 20 20 20 20 36 2e 34 35 36 36 37 38 <0>[6.456678 0010: 5d 20 42 42 42 42 42 42 42 42 31 32 33 34 35 0a ] 12345. 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Signed-off-by: Vincent Whitchurch --- v2: Move fix to kmsg_dump_get_buffer() kernel/printk/printk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 1888f6a3b694..424abf802f02 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -3274,7 +3274,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog, /* move first record forward until length fits into the buffer */ seq = dumper->cur_seq; idx = dumper->cur_idx; - while (l > size && seq < dumper->next_seq) { + while (l >= size && seq < dumper->next_seq) { struct printk_log *msg = log_from_idx(idx); l -= msg_print_text(msg, true, time, NULL, 0); -- 2.20.0
Re: [PATCH v2] printk: Do not lose last line in kmsg buffer dump
On Fri, Jul 12, 2019 at 10:09:04AM +0200, Petr Mladek wrote: > The patch looks like a hack using a hole that the next cycle > does not longer check the number of really stored characters. > > What would happen when msg_print_text() starts adding > the trailing '\0' as suggested by > https://lkml.kernel.org/r/20190710121049.rwhk7fknfzn3c...@pathway.suse.cz I did have a look at that possibility, but I didn't see how that could work without potentially affecting userspace users of the syslog ABI. AFAICS the suggested change in msg_print_text() can be done in one of three ways: (1) msg_print_text() adds the '\0' and includes this length both when it estimates the size (NULL buffer) and when it actually prints: If we do this: - kmsg_dump_get_line_nolock() would have to subtract 1 from the len since its callers expected that len is always smaller than the size of the buffer. - The buffers given to use via the syslog interface will now include a '\0', potentially affecting userspace applications which use this ABI. (2) msg_print_text() adds the '\0', and includes this in the length only when estimating the size, and not when it actually prints. If we do this: - SYSLOG_ACTION_SIZE_UNREAD tries uses the size estimate to give userspace a count of how many characters are present in the buffer, and now this count will start differing from the actual count that can be read, potentially affecting userspace applications. (3) msg_print_text() adds the '\0', and does not include this length in the result at all. If we do this: - The original kmsg dump issue is not solved, since the last line is still lost. > BTW: What is the motivation for this fix? Is a bug report > or just some research of possible buffer overflows? The fix is not attempting to fix a buffer overflow, theoretical or otherwise. It's a fix for a bug in functionality which has been observed on our systems: We use pstore to save the kernel log when the kernel crashes, and sometimes the log in the pstore misses the last line, and since the last line usual says why we're panicing so it's rather important not to miss. > The commit message pretends that the problem is bigger than > it really is. It is about one byte and not one line. I'm not quite sure I follow. The current code does fail to include the *entire* last line. The memcpy on line #1294 is never executed for the last line because we stop the loop because of the check on line #1289: 1270 static size_t msg_print_text(const struct printk_log *msg, bool syslog, char *buf, size_t size) 1271 { 1272 const char *text = log_text(msg); 1273 size_t text_size = msg->text_len; 1274 size_t len = 0; 1275 1276 do { 1277 const char *next = memchr(text, '\n', text_size); 1278 size_t text_len; 1279 1280 if (next) { 1281 text_len = next - text; 1282 next++; 1283 text_size -= next - text; 1284 } else { 1285 text_len = text_size; 1286 } 1287 1288 if (buf) { 1289 if (print_prefix(msg, syslog, NULL) + 1290 text_len + 1 > size - len) 1291 break; 1292 1293 len += print_prefix(msg, syslog, buf + len); 1294 memcpy(buf + len, text, text_len); 1295 len += text_len; 1296 buf[len++] = '\n'; 1297 } else { 1298 /* SYSLOG_ACTION_* buffer size only calculation */ 1299 len += print_prefix(msg, syslog, NULL); 1300 len += text_len; 1301 len++; 1302 }
[PATCH] crypto: cryptd - Fix skcipher instance memory leak
cryptd_skcipher_free() fails to free the struct skcipher_instance allocated in cryptd_create_skcipher(), leading to a memory leak. This is detected by kmemleak on bootup on ARM64 platforms: unreferenced object 0x80003377b180 (size 1024): comm "cryptomgr_probe", pid 822, jiffies 4294894830 (age 52.760s) backtrace: kmem_cache_alloc_trace+0x270/0x2d0 cryptd_create+0x990/0x124c cryptomgr_probe+0x5c/0x1e8 kthread+0x258/0x318 ret_from_fork+0x10/0x1c Signed-off-by: Vincent Whitchurch --- crypto/cryptd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/crypto/cryptd.c b/crypto/cryptd.c index 1ce1bf6d3bab..5f76c6e222c6 100644 --- a/crypto/cryptd.c +++ b/crypto/cryptd.c @@ -388,6 +388,7 @@ static void cryptd_skcipher_free(struct skcipher_instance *inst) struct skcipherd_instance_ctx *ctx = skcipher_instance_ctx(inst); crypto_drop_skcipher(&ctx->spawn); + kfree(inst); } static int cryptd_create_skcipher(struct crypto_template *tmpl, -- 2.20.0
[PATCH] printk: Do not lose last line in kmsg dump
kmsg_dump_get_buffer() is supposed to select the youngest log messages which fit into the provided buffer. When that function determines the correct start index, by looping and calling msg_print_text() with a NULL buffer, msg_print_text() calculates a length which would allow the youngest log messages to completely fill the provided buffer. However, when doing the actual printing, an off-by-one error in msg_print_text() leads to that function allowing the provided buffer to only be filled to (size - 1). So if the lengths of the selected youngest log messages happen to precisely fill up the provided buffer, the last log message is not included. For example, with the following two final prints: [6.427502] A [6.427769] 12345 A dump of a 64-byte buffer filled by kmsg_dump_get_buffer(), before this patch: : 3c 30 3e 5b 20 20 20 20 36 2e 35 32 32 31 39 37 <0>[6.522197 0010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a ] A. 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 After this patch: : 3c 30 3e 5b 20 20 20 20 36 2e 34 32 37 35 30 32 <0>[6.427502 0010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a ] A. 0020: 3c 30 3e 5b 20 20 20 20 36 2e 34 32 37 37 36 39 <0>[6.427769 0030: 5d 20 42 42 42 42 42 42 42 42 31 32 33 34 35 0a ] 12345. Note that this bug only affects the kmsg dump code. msg_print_text() is also used from the syslog code and console_unlock() but this bug does trigger there since the buffers used there are never filled up completely (since they are only used to print individual lines, and their size is always LOG_LINE_MAX + PREFIX_MAX, and PREFIX_MAX has a value which is larger than the largest possible prefix). Signed-off-by: Vincent Whitchurch --- I posted this patch two years ago and received no replies. This problem is still present in mainline. https://lore.kernel.org/patchwork/patch/781106/ kernel/printk/printk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 1888f6a3b694..7679d779d5cc 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1318,7 +1318,7 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog, } if (buf) { - if (prefix_len + text_len + 1 >= size - len) + if (prefix_len + text_len + 1 > size - len) break; memcpy(buf + len, prefix, prefix_len); -- 2.20.0
Re: [PATCH] printk: Do not lose last line in kmsg dump
On Tue, Jul 09, 2019 at 04:29:39PM +0200, Petr Mladek wrote: > On Tue 2019-07-09 19:12:30, Sergey Senozhatsky wrote: > > On (07/09/19 10:10), Vincent Whitchurch wrote: > > > A dump of a 64-byte buffer filled by kmsg_dump_get_buffer(), before this > > > patch: > > > > > > : 3c 30 3e 5b 20 20 20 20 36 2e 35 32 32 31 39 37 <0>[ > > > 6.522197 > > > 0010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a ] > > > A. > > > 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > > > > 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > > > > > > > After this patch: > > > > > > : 3c 30 3e 5b 20 20 20 20 36 2e 34 32 37 35 30 32 <0>[ > > > 6.427502 > > > 0010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a ] > > > A. > > > 0020: 3c 30 3e 5b 20 20 20 20 36 2e 34 32 37 37 36 39 <0>[ > > > 6.427769 > > > 0030: 5d 20 42 42 42 42 42 42 42 42 31 32 33 34 35 0a ] > > > 12345. > > > > [..] > > > > > @@ -1318,7 +1318,7 @@ static size_t msg_print_text(const struct > > > printk_log *msg, bool syslog, > > > } > > > > > > if (buf) { > > > - if (prefix_len + text_len + 1 >= size - len) > > > + if (prefix_len + text_len + 1 > size - len) > > > break; > > > > So with this patch the last byte of the buffer is 0xA. It's a bit > > uncomfortable that `len', which we return from msg_print_text(), > > now points one byte beyond the buffer: > > > > buf[len++] = '\n'; > > return len; > > > > This is not very common. Not sure what usually happens to kmsg_dump > > buffers, but anyone who'd do a rather innocent > > > > kmsg_dump(buf, &len); > > buf[len] = 0x00; > > > > will write to something which is not a kmsg buffer (in some cases). > > I have the same worries. > > On the other hand. The function does not store the trailing '\0' > into the buffer itself. The callers would need to add it themself. > It is their responsibility to avoid a buffer overflow. > > I have checked several users and it seems that nobody adds or > needs the trailing '\0'. > > It is ugly to do not use the entire buffer just because theoretical > buggy users. All the callers of kmsg_dump_get_buffer() seem to be fine but these two users of kmsg_dump_get_line(), which also calls msg_print_text(), have exactly the code which you and Sergey worry about: arch/powerpc/xmon/xmon.c 2836: while (kmsg_dump_get_line_nolock(&dumper, false, buf, sizeof(buf), &len)) { 2837- buf[len] = '\0'; arch/um/kernel/kmsg_dump.c 29: while (kmsg_dump_get_line(dumper, true, line, sizeof(line), &len)) { 30- line[len] = '\0'; I guess we should fix these first and leave this patch as is?
[PATCH v2 char-misc-next 5/7] mic: vop: Fix init race with shared interrupts
If a vop virtio device's interrupt is shared with others, there is a window where the interrupt can hit and the ->vqs list can be attempted to be traversed before register_virtio_device() has initialized it, leading to a NULL pointer dereference in vop_virtio_intr_handler(). Fix this by keeping a local list of virtqueues in this driver and using that instead of the list inside the struct virtio_device, similar to how virtio-pci handles this. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 42 ++--- 1 file changed, 39 insertions(+), 3 deletions(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index e37b2c2152a2..6764feea6f55 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -40,6 +40,11 @@ #define VOP_MAX_VRINGS 4 +struct vop_vq { + struct virtqueue *vq; + struct list_head list; +}; + /* * _vop_vdev - Allocated per virtio device instance injected by the peer. * @@ -68,6 +73,9 @@ struct _vop_vdev { int used_size[VOP_MAX_VRINGS]; struct completion reset_done; struct mic_irq *virtio_cookie; + struct vop_vq *vqs; + struct list_head virtqueues; + spinlock_t virtqueues_lock; int c2h_vdev_db; int h2c_vdev_db; int dnode; @@ -264,6 +272,12 @@ static void vop_del_vq(struct virtqueue *vq, int n) { struct _vop_vdev *vdev = to_vopvdev(vq->vdev); struct vop_device *vpdev = vdev->vpdev; + struct vop_vq *vopvq = &vdev->vqs[n]; + unsigned long flags; + + spin_lock_irqsave(&vdev->virtqueues_lock, flags); + list_del(&vopvq->list); + spin_unlock_irqrestore(&vdev->virtqueues_lock, flags); dma_unmap_single(&vpdev->dev, vdev->used[n], vdev->used_size[n], DMA_BIDIRECTIONAL); @@ -284,6 +298,9 @@ static void vop_del_vqs(struct virtio_device *dev) list_for_each_entry_safe(vq, n, &dev->vqs, list) vop_del_vq(vq, idx++); + + kfree(vdev->vqs); + vdev->vqs = NULL; } static struct virtqueue *vop_new_virtqueue(unsigned int index, @@ -411,7 +428,13 @@ static int vop_find_vqs(struct virtio_device *dev, unsigned nvqs, if (nvqs > ioread8(&vdev->desc->num_vq)) return -ENOENT; + vdev->vqs = kcalloc(nvqs, sizeof(*vdev->vqs), GFP_KERNEL); + if (!vdev->vqs) + return -ENOMEM; + for (i = 0; i < nvqs; ++i) { + unsigned long flags; + if (!names[i]) { vqs[i] = NULL; continue; @@ -425,6 +448,12 @@ static int vop_find_vqs(struct virtio_device *dev, unsigned nvqs, err = PTR_ERR(vqs[i]); goto error; } + + vdev->vqs[i].vq = vqs[i]; + + spin_lock_irqsave(&vdev->virtqueues_lock, flags); + list_add(&vdev->vqs[i].list, &vdev->virtqueues); + spin_unlock_irqrestore(&vdev->virtqueues_lock, flags); } iowrite8(1, &dc->used_address_updated); @@ -468,13 +497,17 @@ static struct virtio_config_ops vop_vq_config_ops = { static irqreturn_t vop_virtio_intr_handler(int irq, void *data) { + unsigned long flags; struct _vop_vdev *vdev = data; struct vop_device *vpdev = vdev->vpdev; - struct virtqueue *vq; + struct vop_vq *vopvq; vpdev->hw_ops->ack_interrupt(vpdev, vdev->h2c_vdev_db); - list_for_each_entry(vq, &vdev->vdev.vqs, list) - vring_interrupt(0, vq); + + spin_lock_irqsave(&vdev->virtqueues_lock, flags); + list_for_each_entry(vopvq, &vdev->virtqueues, list) + vring_interrupt(0, vopvq->vq); + spin_unlock_irqrestore(&vdev->virtqueues_lock, flags); return IRQ_HANDLED; } @@ -516,6 +549,9 @@ static int _vop_add_device(struct mic_device_desc __iomem *d, vdev->vdev.priv = (void *)(unsigned long)dnode; init_completion(&vdev->reset_done); + INIT_LIST_HEAD(&vdev->virtqueues); + spin_lock_init(&vdev->virtqueues_lock); + vdev->h2c_vdev_db = vpdev->hw_ops->next_db(vpdev); vdev->virtio_cookie = vpdev->hw_ops->request_irq(vpdev, vop_virtio_intr_handler, "virtio intr", -- 2.20.0
[PATCH v2 char-misc-next 3/7] mic: vop: Allow building on more systems
VOP_BUS does not actually depend on x86-64 or PCI or X86_DEV_DMA_OPS. The dependency on X86_DEV_DMA_OPS has been unnecessary since commit 5657933dbb6e25fe ("treewide: Move dma_ops from struct dev_archdata into struct device"). Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/Kconfig | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/misc/mic/Kconfig b/drivers/misc/mic/Kconfig index 227cc7443671..242dcee14689 100644 --- a/drivers/misc/mic/Kconfig +++ b/drivers/misc/mic/Kconfig @@ -38,7 +38,6 @@ comment "VOP Bus Driver" config VOP_BUS tristate "VOP Bus Driver" - depends on 64BIT && PCI && X86 && X86_DEV_DMA_OPS help This option is selected by any driver which registers a device or driver on the VOP Bus, such as CONFIG_INTEL_MIC_HOST @@ -132,7 +131,7 @@ comment "VOP Driver" config VOP tristate "VOP Driver" - depends on 64BIT && PCI && X86 && VOP_BUS + depends on VOP_BUS select VHOST_RING select VIRTIO help -- 2.20.0
[PATCH v2 char-misc-next 4/7] vop: Add loopback driver
Add a loopback driver to allow testing and evaluation of the VOP framework without special hardware. The host and the guest will run under the same kernel. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/Kconfig| 10 + drivers/misc/mic/vop/Makefile | 2 + drivers/misc/mic/vop/vop_loopback.c | 382 3 files changed, 394 insertions(+) create mode 100644 drivers/misc/mic/vop/vop_loopback.c diff --git a/drivers/misc/mic/Kconfig b/drivers/misc/mic/Kconfig index 242dcee14689..2e2f745afb3a 100644 --- a/drivers/misc/mic/Kconfig +++ b/drivers/misc/mic/Kconfig @@ -148,6 +148,16 @@ config VOP OS and tools for MIC to use with this driver are available from <http://software.intel.com/en-us/mic-developer>. +config VOP_LOOPBACK + tristate "VOP loopback driver" + depends on VOP + help + This enables a loopback driver to test and evaluate the VOP + infrastructure without actual PCIe hardware. The host and the guest + sides run under the same kernel. + + If unsure, say N. + if VOP source "drivers/vhost/Kconfig.vringh" endif diff --git a/drivers/misc/mic/vop/Makefile b/drivers/misc/mic/vop/Makefile index 78819c8999f1..a6ead25c4418 100644 --- a/drivers/misc/mic/vop/Makefile +++ b/drivers/misc/mic/vop/Makefile @@ -7,3 +7,5 @@ obj-m := vop.o vop-objs += vop_main.o vop-objs += vop_debugfs.o vop-objs += vop_vringh.o + +obj-$(CONFIG_VOP_LOOPBACK) += vop_loopback.o diff --git a/drivers/misc/mic/vop/vop_loopback.c b/drivers/misc/mic/vop/vop_loopback.c new file mode 100644 index ..76d787db3d3e --- /dev/null +++ b/drivers/misc/mic/vop/vop_loopback.c @@ -0,0 +1,382 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 Axis Communications AB + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../bus/vop_bus.h" + +struct mic_irq { + struct list_head list; + int irq; + irqreturn_t (*func)(int irq, void *data); + void *data; +}; + +struct vop_loopback_end { + struct vop_loopback *loopback; + const char *name; + struct vop_device *vop; + struct list_head irqs; + struct mutex mutex; + struct work_struct work; +}; + +struct vop_loopback { + struct device *dev; + void *dp; + struct vop_loopback_end host; + struct vop_loopback_end guest; +}; + +static inline struct vop_loopback *vop_to_loopback(struct device *dev) +{ + return dev_get_drvdata(dev->parent); +} + +static dma_addr_t +vop_loopback_dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, unsigned long attrs) +{ + return page_to_phys(page) + offset; +} + +static void vop_loopback_dma_unmap_page(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ +} + +static int vop_loopback_dma_mmap(struct device *dev, struct vm_area_struct *vma, +void *cpu_addr, dma_addr_t dma_addr, size_t size, +unsigned long attrs) +{ + return remap_pfn_range(vma, vma->vm_start, + PHYS_PFN(dma_addr), size, + vma->vm_page_prot); +} + +static void *vop_loopback_dma_alloc(struct device *dev, size_t size, + dma_addr_t *handle, gfp_t gfp, + unsigned long attrs) +{ + void *p = (void *) __get_free_pages(gfp, get_order(size)); + + if (p) + *handle = virt_to_phys(p); + + return p; +} + +static void vop_loopback_dma_free(struct device *dev, size_t size, + void *cpu_addr, dma_addr_t handle, + unsigned long attrs) +{ + free_pages((unsigned long) (uintptr_t) cpu_addr, get_order(size)); +} + +static const struct dma_map_ops vop_loopback_dma_ops = { + .map_page = vop_loopback_dma_map_page, + .unmap_page = vop_loopback_dma_unmap_page, + .mmap = vop_loopback_dma_mmap, + .alloc = vop_loopback_dma_alloc, + .free = vop_loopback_dma_free, +}; + + +static void vop_loopback_ack_interrupt(struct vop_device *vop, int num) +{ +} + +static int vop_loopback_next_db(struct vop_device *vop) +{ + return 0; +} + +static void *vop_loopback_get_dp(struct vop_device *vop) +{ + struct vop_loopback *loopback = vop_to_loopback(&vop->dev); + + return loopback->dp; +} + +static void __iomem *vop_loopback_get_remote_dp(struct vop_device *vop) +{ + struct vop_loopback *loopback = vop_to_loopback(&vop->dev); + + return (void __iomem *) loopback->dp; +} + +static struct mic_irq * +vop_loopback_reques
[PATCH v2 char-misc-next 4/7] mic: vop: Add loopback driver
Add a loopback driver to allow testing and evaluation of the VOP framework without special hardware. The host and the guest will run under the same kernel. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/Kconfig| 10 + drivers/misc/mic/vop/Makefile | 2 + drivers/misc/mic/vop/vop_loopback.c | 382 3 files changed, 394 insertions(+) create mode 100644 drivers/misc/mic/vop/vop_loopback.c diff --git a/drivers/misc/mic/Kconfig b/drivers/misc/mic/Kconfig index 242dcee14689..2e2f745afb3a 100644 --- a/drivers/misc/mic/Kconfig +++ b/drivers/misc/mic/Kconfig @@ -148,6 +148,16 @@ config VOP OS and tools for MIC to use with this driver are available from <http://software.intel.com/en-us/mic-developer>. +config VOP_LOOPBACK + tristate "VOP loopback driver" + depends on VOP + help + This enables a loopback driver to test and evaluate the VOP + infrastructure without actual PCIe hardware. The host and the guest + sides run under the same kernel. + + If unsure, say N. + if VOP source "drivers/vhost/Kconfig.vringh" endif diff --git a/drivers/misc/mic/vop/Makefile b/drivers/misc/mic/vop/Makefile index 78819c8999f1..a6ead25c4418 100644 --- a/drivers/misc/mic/vop/Makefile +++ b/drivers/misc/mic/vop/Makefile @@ -7,3 +7,5 @@ obj-m := vop.o vop-objs += vop_main.o vop-objs += vop_debugfs.o vop-objs += vop_vringh.o + +obj-$(CONFIG_VOP_LOOPBACK) += vop_loopback.o diff --git a/drivers/misc/mic/vop/vop_loopback.c b/drivers/misc/mic/vop/vop_loopback.c new file mode 100644 index ..76d787db3d3e --- /dev/null +++ b/drivers/misc/mic/vop/vop_loopback.c @@ -0,0 +1,382 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 Axis Communications AB + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../bus/vop_bus.h" + +struct mic_irq { + struct list_head list; + int irq; + irqreturn_t (*func)(int irq, void *data); + void *data; +}; + +struct vop_loopback_end { + struct vop_loopback *loopback; + const char *name; + struct vop_device *vop; + struct list_head irqs; + struct mutex mutex; + struct work_struct work; +}; + +struct vop_loopback { + struct device *dev; + void *dp; + struct vop_loopback_end host; + struct vop_loopback_end guest; +}; + +static inline struct vop_loopback *vop_to_loopback(struct device *dev) +{ + return dev_get_drvdata(dev->parent); +} + +static dma_addr_t +vop_loopback_dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, unsigned long attrs) +{ + return page_to_phys(page) + offset; +} + +static void vop_loopback_dma_unmap_page(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ +} + +static int vop_loopback_dma_mmap(struct device *dev, struct vm_area_struct *vma, +void *cpu_addr, dma_addr_t dma_addr, size_t size, +unsigned long attrs) +{ + return remap_pfn_range(vma, vma->vm_start, + PHYS_PFN(dma_addr), size, + vma->vm_page_prot); +} + +static void *vop_loopback_dma_alloc(struct device *dev, size_t size, + dma_addr_t *handle, gfp_t gfp, + unsigned long attrs) +{ + void *p = (void *) __get_free_pages(gfp, get_order(size)); + + if (p) + *handle = virt_to_phys(p); + + return p; +} + +static void vop_loopback_dma_free(struct device *dev, size_t size, + void *cpu_addr, dma_addr_t handle, + unsigned long attrs) +{ + free_pages((unsigned long) (uintptr_t) cpu_addr, get_order(size)); +} + +static const struct dma_map_ops vop_loopback_dma_ops = { + .map_page = vop_loopback_dma_map_page, + .unmap_page = vop_loopback_dma_unmap_page, + .mmap = vop_loopback_dma_mmap, + .alloc = vop_loopback_dma_alloc, + .free = vop_loopback_dma_free, +}; + + +static void vop_loopback_ack_interrupt(struct vop_device *vop, int num) +{ +} + +static int vop_loopback_next_db(struct vop_device *vop) +{ + return 0; +} + +static void *vop_loopback_get_dp(struct vop_device *vop) +{ + struct vop_loopback *loopback = vop_to_loopback(&vop->dev); + + return loopback->dp; +} + +static void __iomem *vop_loopback_get_remote_dp(struct vop_device *vop) +{ + struct vop_loopback *loopback = vop_to_loopback(&vop->dev); + + return (void __iomem *) loopback->dp; +} + +static struct mic_irq * +vop_loopback_reques
[PATCH v2 char-misc-next 1/7] vop: Cast pointers to unsigned long
Fix these on 32-bit: vop_vringh.c:711:13: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast] Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 13 + drivers/misc/mic/vop/vop_vringh.c | 5 +++-- 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index 3f96e19e57eb..263e0c3fa7c6 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -514,7 +514,7 @@ static int _vop_add_device(struct mic_device_desc __iomem *d, vdev->desc = d; vdev->dc = (void __iomem *)d + _vop_aligned_desc_size(d); vdev->dnode = dnode; - vdev->vdev.priv = (void *)(u64)dnode; + vdev->vdev.priv = (void *)(unsigned long)dnode; init_completion(&vdev->reset_done); vdev->h2c_vdev_db = vpdev->hw_ops->next_db(vpdev); @@ -536,7 +536,7 @@ static int _vop_add_device(struct mic_device_desc __iomem *d, offset, type); goto free_irq; } - writeq((u64)vdev, &vdev->dc->vdev); + writeq((unsigned long)vdev, &vdev->dc->vdev); dev_dbg(_vop_dev(vdev), "%s: registered vop device %u type %u vdev %p\n", __func__, offset, type, vdev); @@ -563,13 +563,18 @@ static int vop_match_desc(struct device *dev, void *data) return vdev->desc == (void __iomem *)data; } +static struct _vop_vdev *vop_dc_to_vdev(struct mic_device_ctrl *dc) +{ + return (struct _vop_vdev *)(unsigned long)readq(&dc->vdev); +} + static void _vop_handle_config_change(struct mic_device_desc __iomem *d, unsigned int offset, struct vop_device *vpdev) { struct mic_device_ctrl __iomem *dc = (void __iomem *)d + _vop_aligned_desc_size(d); - struct _vop_vdev *vdev = (struct _vop_vdev *)readq(&dc->vdev); + struct _vop_vdev *vdev = vop_dc_to_vdev(dc); if (ioread8(&dc->config_change) != MIC_VIRTIO_PARAM_CONFIG_CHANGED) return; @@ -588,7 +593,7 @@ static int _vop_remove_device(struct mic_device_desc __iomem *d, { struct mic_device_ctrl __iomem *dc = (void __iomem *)d + _vop_aligned_desc_size(d); - struct _vop_vdev *vdev = (struct _vop_vdev *)readq(&dc->vdev); + struct _vop_vdev *vdev = vop_dc_to_vdev(dc); u8 status; int ret = -1; diff --git a/drivers/misc/mic/vop/vop_vringh.c b/drivers/misc/mic/vop/vop_vringh.c index 0bac8bce933c..73d9fe6bc18d 100644 --- a/drivers/misc/mic/vop/vop_vringh.c +++ b/drivers/misc/mic/vop/vop_vringh.c @@ -712,16 +712,17 @@ static int vop_vringh_copy(struct vop_vdev *vdev, struct vringh_kiov *iov, while (len && iov->i < iov->used) { struct kvec *kiov = &iov->iov[iov->i]; + unsigned long daddr = (unsigned long)kiov->iov_base; partlen = min(kiov->iov_len, len); if (read) ret = vop_virtio_copy_to_user(vdev, ubuf, partlen, - (u64)kiov->iov_base, + daddr, kiov->iov_len, vr_idx); else ret = vop_virtio_copy_from_user(vdev, ubuf, partlen, - (u64)kiov->iov_base, + daddr, kiov->iov_len, vr_idx); if (ret) { -- 2.20.0
[PATCH v2 char-misc-next 3/7] vop: Allow building on more systems
VOP_BUS does not actually depend on x86-64 or PCI or X86_DEV_DMA_OPS. The dependency on X86_DEV_DMA_OPS has been unnecessary since commit 5657933dbb6e25fe ("treewide: Move dma_ops from struct dev_archdata into struct device"). Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/Kconfig | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/misc/mic/Kconfig b/drivers/misc/mic/Kconfig index 227cc7443671..242dcee14689 100644 --- a/drivers/misc/mic/Kconfig +++ b/drivers/misc/mic/Kconfig @@ -38,7 +38,6 @@ comment "VOP Bus Driver" config VOP_BUS tristate "VOP Bus Driver" - depends on 64BIT && PCI && X86 && X86_DEV_DMA_OPS help This option is selected by any driver which registers a device or driver on the VOP Bus, such as CONFIG_INTEL_MIC_HOST @@ -132,7 +131,7 @@ comment "VOP Driver" config VOP tristate "VOP Driver" - depends on 64BIT && PCI && X86 && VOP_BUS + depends on VOP_BUS select VHOST_RING select VIRTIO help -- 2.20.0
[PATCH v2 char-misc-next 6/7] samples: mic: Split out vop code from mpssd
mpssd.c contains both code to boot the MIC hardware and code to serve virtio devices over vop. Split the vop-related code to a separate file so that it can be used by other hardware and vop-loopback. Except for changes to change_virtblk_backend() and the introduction of serve_virtio(), the rest is just movement of unmodified code between files. Signed-off-by: Vincent Whitchurch --- samples/mic/mpssd/Makefile |4 +- samples/mic/mpssd/mpssd.c | 1317 +- samples/mic/mpssd/mpssd.h |1 + samples/mic/mpssd/vop.c| 1357 4 files changed, 1364 insertions(+), 1315 deletions(-) create mode 100644 samples/mic/mpssd/vop.c diff --git a/samples/mic/mpssd/Makefile b/samples/mic/mpssd/Makefile index a7a6e0c70424..bc94054dbe2b 100644 --- a/samples/mic/mpssd/Makefile +++ b/samples/mic/mpssd/Makefile @@ -14,8 +14,8 @@ CFLAGS += -DDEBUG=$(DEBUG) endif all: $(PROGS) -mpssd: mpssd.c sysfs.c - $(CC) $(CFLAGS) mpssd.c sysfs.c -o mpssd -lpthread +mpssd: mpssd.c sysfs.c vop.o + $(CC) $(CFLAGS) $^ -o $@ -lpthread install: install mpssd /usr/sbin/mpssd diff --git a/samples/mic/mpssd/mpssd.c b/samples/mic/mpssd/mpssd.c index f42ce551bb48..5e4233c1ff93 100644 --- a/samples/mic/mpssd/mpssd.c +++ b/samples/mic/mpssd/mpssd.c @@ -31,1292 +31,18 @@ #include #include #include -#include -#include -#include -#include -#include -#include -#include #include "mpssd.h" #include #include #include -static void *init_mic(void *arg); - static FILE *logfp; -static struct mic_info mic_list; - -#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) - -#define min_t(type, x, y) ({ \ - type __min1 = (x); \ - type __min2 = (y); \ - __min1 < __min2 ? __min1 : __min2; }) - -/* align addr on a size boundary - adjust address up/down if needed */ -#define _ALIGN_DOWN(addr, size) ((addr)&(~((size)-1))) -#define _ALIGN_UP(addr, size)_ALIGN_DOWN(addr + size - 1, size) - -/* align addr on a size boundary - adjust address up if needed */ -#define _ALIGN(addr, size) _ALIGN_UP(addr, size) - -/* to align the pointer to the (next) page boundary */ -#define PAGE_ALIGN(addr)_ALIGN(addr, PAGE_SIZE) - -#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) - -#define GSO_ENABLED1 -#define MAX_GSO_SIZE (64 * 1024) -#define ETH_H_LEN 14 -#define MAX_NET_PKT_SIZE (_ALIGN_UP(MAX_GSO_SIZE + ETH_H_LEN, 64)) -#define MIC_DEVICE_PAGE_END0x1000 - -#ifndef VIRTIO_NET_HDR_F_DATA_VALID -#define VIRTIO_NET_HDR_F_DATA_VALID2 /* Csum is valid */ -#endif -static struct { - struct mic_device_desc dd; - struct mic_vqconfig vqconfig[2]; - __u32 host_features, guest_acknowledgements; - struct virtio_console_config cons_config; -} virtcons_dev_page = { - .dd = { - .type = VIRTIO_ID_CONSOLE, - .num_vq = ARRAY_SIZE(virtcons_dev_page.vqconfig), - .feature_len = sizeof(virtcons_dev_page.host_features), - .config_len = sizeof(virtcons_dev_page.cons_config), - }, - .vqconfig[0] = { - .num = htole16(MIC_VRING_ENTRIES), - }, - .vqconfig[1] = { - .num = htole16(MIC_VRING_ENTRIES), - }, -}; - -static struct { - struct mic_device_desc dd; - struct mic_vqconfig vqconfig[2]; - __u32 host_features, guest_acknowledgements; - struct virtio_net_config net_config; -} virtnet_dev_page = { - .dd = { - .type = VIRTIO_ID_NET, - .num_vq = ARRAY_SIZE(virtnet_dev_page.vqconfig), - .feature_len = sizeof(virtnet_dev_page.host_features), - .config_len = sizeof(virtnet_dev_page.net_config), - }, - .vqconfig[0] = { - .num = htole16(MIC_VRING_ENTRIES), - }, - .vqconfig[1] = { - .num = htole16(MIC_VRING_ENTRIES), - }, -#if GSO_ENABLED - .host_features = htole32( - 1 << VIRTIO_NET_F_CSUM | - 1 << VIRTIO_NET_F_GSO | - 1 << VIRTIO_NET_F_GUEST_TSO4 | - 1 << VIRTIO_NET_F_GUEST_TSO6 | - 1 << VIRTIO_NET_F_GUEST_ECN), -#else - .host_features = 0, -#endif -}; - -static const char *mic_config_dir = "/etc/mpss"; -static const char *virtblk_backend = "VIRTBLK_BACKEND"; -static struct { - struct mic_device_desc dd; - struct mic_vqconfig vqconfig[1]; - __u32 host_features, guest_acknowledgements; - struct virtio_blk_config blk_config; -} virtblk_dev_page = { - .dd = { - .type = VIRTIO_ID_BLOCK, - .num_vq = ARRAY_SIZE(virtblk_dev_page.vqconfig), - .feature_len = sizeof(virtblk_dev_page.host_feat
[PATCH v2 char-misc-next 2/7] mic: Rename ioremap pointer to remap
Some architectures (like MIPS) implement ioremap as a macro, and this leads to conflicts with the ioremap function pointer in various mic structures. drivers/misc/mic/vop/vop_vringh.c: In function 'vop_virtio_init_post': drivers/misc/mic/vop/vop_vringh.c:86:13: error: macro "ioremap" passed 3 arguments, but takes just 2 Rename ioremap to remap to fix this. Likewise for iounmap. Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/bus/scif_bus.h| 8 drivers/misc/mic/bus/vop_bus.h | 8 drivers/misc/mic/card/mic_device.c | 8 drivers/misc/mic/host/mic_boot.c | 8 drivers/misc/mic/scif/scif_map.h | 4 ++-- drivers/misc/mic/vop/vop_main.c| 7 +++ drivers/misc/mic/vop/vop_vringh.c | 10 +- 7 files changed, 26 insertions(+), 27 deletions(-) diff --git a/drivers/misc/mic/bus/scif_bus.h b/drivers/misc/mic/bus/scif_bus.h index ff59568219ad..377a4f38cd7e 100644 --- a/drivers/misc/mic/bus/scif_bus.h +++ b/drivers/misc/mic/bus/scif_bus.h @@ -88,8 +88,8 @@ struct scif_driver { * @send_intr: Send an interrupt to the remote node on a specified doorbell. * @send_p2p_intr: Send an interrupt to the peer node on a specified doorbell * which is specifically targeted for a peer to peer node. - * @ioremap: Map a buffer with the specified physical address and length. - * @iounmap: Unmap a buffer previously mapped. + * @remap: Map a buffer with the specified physical address and length. + * @unmap: Unmap a buffer previously mapped. */ struct scif_hw_ops { int (*next_db)(struct scif_hw_dev *sdev); @@ -104,9 +104,9 @@ struct scif_hw_ops { void (*send_intr)(struct scif_hw_dev *sdev, int db); void (*send_p2p_intr)(struct scif_hw_dev *sdev, int db, struct mic_mw *mw); - void __iomem * (*ioremap)(struct scif_hw_dev *sdev, + void __iomem * (*remap)(struct scif_hw_dev *sdev, phys_addr_t pa, size_t len); - void (*iounmap)(struct scif_hw_dev *sdev, void __iomem *va); + void (*unmap)(struct scif_hw_dev *sdev, void __iomem *va); }; int scif_register_driver(struct scif_driver *driver); diff --git a/drivers/misc/mic/bus/vop_bus.h b/drivers/misc/mic/bus/vop_bus.h index fff7a865d721..cf5f3fae573c 100644 --- a/drivers/misc/mic/bus/vop_bus.h +++ b/drivers/misc/mic/bus/vop_bus.h @@ -87,8 +87,8 @@ struct vop_driver { * @get_dp: Get access to the virtio device page used by the self * node to add/remove/configure virtio devices. * @send_intr: Send an interrupt to the peer node on a specified doorbell. - * @ioremap: Map a buffer with the specified DMA address and length. - * @iounmap: Unmap a buffer previously mapped. + * @remap: Map a buffer with the specified DMA address and length. + * @unmap: Unmap a buffer previously mapped. * @dma_filter: The DMA filter function to use for obtaining access to * a DMA channel on the peer node. */ @@ -104,9 +104,9 @@ struct vop_hw_ops { void __iomem * (*get_remote_dp)(struct vop_device *vpdev); void * (*get_dp)(struct vop_device *vpdev); void (*send_intr)(struct vop_device *vpdev, int db); - void __iomem * (*ioremap)(struct vop_device *vpdev, + void __iomem * (*remap)(struct vop_device *vpdev, dma_addr_t pa, size_t len); - void (*iounmap)(struct vop_device *vpdev, void __iomem *va); + void (*unmap)(struct vop_device *vpdev, void __iomem *va); }; struct vop_device * diff --git a/drivers/misc/mic/card/mic_device.c b/drivers/misc/mic/card/mic_device.c index e749af48f736..dcd07ef29801 100644 --- a/drivers/misc/mic/card/mic_device.c +++ b/drivers/misc/mic/card/mic_device.c @@ -245,8 +245,8 @@ static struct scif_hw_ops scif_hw_ops = { .next_db = ___mic_next_db, .send_intr = ___mic_send_intr, .send_p2p_intr = ___mic_send_p2p_intr, - .ioremap = ___mic_ioremap, - .iounmap = ___mic_iounmap, + .remap = ___mic_ioremap, + .unmap = ___mic_iounmap, }; static inline struct mic_driver *vpdev_to_mdrv(struct vop_device *vpdev) @@ -316,8 +316,8 @@ static struct vop_hw_ops vop_hw_ops = { .next_db = __mic_next_db, .get_remote_dp = __mic_get_remote_dp, .send_intr = __mic_send_intr, - .ioremap = __mic_ioremap, - .iounmap = __mic_iounmap, + .remap = __mic_ioremap, + .unmap = __mic_iounmap, }; static int mic_request_dma_chans(struct mic_driver *mdrv) diff --git a/drivers/misc/mic/host/mic_boot.c b/drivers/misc/mic/host/mic_boot.c index 6479435ac96b..079c36f0ce6e 100644 --- a/drivers/misc/mic/host/mic_boot.c +++ b/drivers/misc/mic/host/mic_boot.c @@ -133,8 +133,8 @@ static struct vop_hw_ops vop_hw_ops = { .get_dp = __mic_get_dp, .get_remote_dp = __mic_get_remote_dp, .send_intr = __mic_send_intr, - .ioremap = __mic_ioremap, - .iounmap = __mic_iou
[PATCH v2 char-misc-next 1/7] mic: vop: Cast pointers to unsigned long
Fix these on 32-bit: vop_vringh.c:711:13: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast] Signed-off-by: Vincent Whitchurch --- drivers/misc/mic/vop/vop_main.c | 13 + drivers/misc/mic/vop/vop_vringh.c | 5 +++-- 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/misc/mic/vop/vop_main.c b/drivers/misc/mic/vop/vop_main.c index 3f96e19e57eb..263e0c3fa7c6 100644 --- a/drivers/misc/mic/vop/vop_main.c +++ b/drivers/misc/mic/vop/vop_main.c @@ -514,7 +514,7 @@ static int _vop_add_device(struct mic_device_desc __iomem *d, vdev->desc = d; vdev->dc = (void __iomem *)d + _vop_aligned_desc_size(d); vdev->dnode = dnode; - vdev->vdev.priv = (void *)(u64)dnode; + vdev->vdev.priv = (void *)(unsigned long)dnode; init_completion(&vdev->reset_done); vdev->h2c_vdev_db = vpdev->hw_ops->next_db(vpdev); @@ -536,7 +536,7 @@ static int _vop_add_device(struct mic_device_desc __iomem *d, offset, type); goto free_irq; } - writeq((u64)vdev, &vdev->dc->vdev); + writeq((unsigned long)vdev, &vdev->dc->vdev); dev_dbg(_vop_dev(vdev), "%s: registered vop device %u type %u vdev %p\n", __func__, offset, type, vdev); @@ -563,13 +563,18 @@ static int vop_match_desc(struct device *dev, void *data) return vdev->desc == (void __iomem *)data; } +static struct _vop_vdev *vop_dc_to_vdev(struct mic_device_ctrl *dc) +{ + return (struct _vop_vdev *)(unsigned long)readq(&dc->vdev); +} + static void _vop_handle_config_change(struct mic_device_desc __iomem *d, unsigned int offset, struct vop_device *vpdev) { struct mic_device_ctrl __iomem *dc = (void __iomem *)d + _vop_aligned_desc_size(d); - struct _vop_vdev *vdev = (struct _vop_vdev *)readq(&dc->vdev); + struct _vop_vdev *vdev = vop_dc_to_vdev(dc); if (ioread8(&dc->config_change) != MIC_VIRTIO_PARAM_CONFIG_CHANGED) return; @@ -588,7 +593,7 @@ static int _vop_remove_device(struct mic_device_desc __iomem *d, { struct mic_device_ctrl __iomem *dc = (void __iomem *)d + _vop_aligned_desc_size(d); - struct _vop_vdev *vdev = (struct _vop_vdev *)readq(&dc->vdev); + struct _vop_vdev *vdev = vop_dc_to_vdev(dc); u8 status; int ret = -1; diff --git a/drivers/misc/mic/vop/vop_vringh.c b/drivers/misc/mic/vop/vop_vringh.c index 0bac8bce933c..73d9fe6bc18d 100644 --- a/drivers/misc/mic/vop/vop_vringh.c +++ b/drivers/misc/mic/vop/vop_vringh.c @@ -712,16 +712,17 @@ static int vop_vringh_copy(struct vop_vdev *vdev, struct vringh_kiov *iov, while (len && iov->i < iov->used) { struct kvec *kiov = &iov->iov[iov->i]; + unsigned long daddr = (unsigned long)kiov->iov_base; partlen = min(kiov->iov_len, len); if (read) ret = vop_virtio_copy_to_user(vdev, ubuf, partlen, - (u64)kiov->iov_base, + daddr, kiov->iov_len, vr_idx); else ret = vop_virtio_copy_from_user(vdev, ubuf, partlen, - (u64)kiov->iov_base, + daddr, kiov->iov_len, vr_idx); if (ret) { -- 2.20.0
[PATCH v2 char-misc-next 0/7] Virtio-over-PCIe on non-MIC
This allows the drivers/misc/mic/vop/ code to be used on other architectures and adds a loopback driver to be able to use/test/evaluate it without special hardware. See v1 for more info: https://lore.kernel.org/lkml/20190116163253.23780-1-vincent.whitchu...@axis.com/ v2: - Rebase on top of the patches which have merged - Use unsigned long instead of uintptr_t - Drop "Use consistent DMA" for now since it's not needed for loopback - Add race fix - Build on all architectures, not just ARM and x86 - Add sample userspace code Vincent Whitchurch (7): mic: vop: Cast pointers to unsigned long mic: Rename ioremap pointer to remap mic: vop: Allow building on more systems mic: vop: Add loopback driver mic: vop: Fix init race with shared interrupts samples: mic: Split out vop code from mpssd samples: mic: Add sample VOP userspace drivers/misc/mic/Kconfig| 13 +- drivers/misc/mic/bus/scif_bus.h |8 +- drivers/misc/mic/bus/vop_bus.h |8 +- drivers/misc/mic/card/mic_device.c |8 +- drivers/misc/mic/host/mic_boot.c|8 +- drivers/misc/mic/scif/scif_map.h|4 +- drivers/misc/mic/vop/Makefile |2 + drivers/misc/mic/vop/vop_loopback.c | 382 drivers/misc/mic/vop/vop_main.c | 62 +- drivers/misc/mic/vop/vop_vringh.c | 15 +- samples/mic/mpssd/.gitignore|1 + samples/mic/mpssd/Makefile |9 +- samples/mic/mpssd/mpssd.c | 1317 +- samples/mic/mpssd/mpssd.h |1 + samples/mic/mpssd/vop.c | 1357 +++ samples/mic/mpssd/vopd.c| 25 + 16 files changed, 1866 insertions(+), 1354 deletions(-) create mode 100644 drivers/misc/mic/vop/vop_loopback.c create mode 100644 samples/mic/mpssd/vop.c create mode 100644 samples/mic/mpssd/vopd.c -- 2.20.0
[PATCH v2 char-misc-next 7/7] samples: mic: Add sample VOP userspace
Add a sample userspace program which can be used with the VOP framework with vop-loopback and non-MIC hardware. For example, the following commands can be used to test networking over vop-loopback. (IPv6 is used with zone indices to force traffic through the interfaces instead of short-circuit delivery, since both interfaces are on the same system.) # modprobe vop_loopback # ./vopd & Setup host interface: # ip a a dev mic0 scope link fe80::0 # ip link set dev mic0 up Setup guest interface: # ip a a dev eth1 scope link fe80::1 # ip link set dev eth1 up Ping from host to guest: # ping6 fe80::1%mic0 Ping from guest to host: # ping6 fe80::0%eth1 Signed-off-by: Vincent Whitchurch --- samples/mic/mpssd/.gitignore | 1 + samples/mic/mpssd/Makefile | 5 - samples/mic/mpssd/vopd.c | 25 + 3 files changed, 30 insertions(+), 1 deletion(-) create mode 100644 samples/mic/mpssd/vopd.c diff --git a/samples/mic/mpssd/.gitignore b/samples/mic/mpssd/.gitignore index 8b7c72f07c92..34c001135abb 100644 --- a/samples/mic/mpssd/.gitignore +++ b/samples/mic/mpssd/.gitignore @@ -1 +1,2 @@ mpssd +vopd diff --git a/samples/mic/mpssd/Makefile b/samples/mic/mpssd/Makefile index bc94054dbe2b..b94b18de5a43 100644 --- a/samples/mic/mpssd/Makefile +++ b/samples/mic/mpssd/Makefile @@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/) ifeq ($(ARCH),x86) -PROGS := mpssd +PROGS := mpssd vopd CC = $(CROSS_COMPILE)gcc CFLAGS := -I../../../usr/include -I../../../tools/include @@ -17,6 +17,9 @@ all: $(PROGS) mpssd: mpssd.c sysfs.c vop.o $(CC) $(CFLAGS) $^ -o $@ -lpthread +vopd: vopd.c vop.o + $(CC) $(CFLAGS) $^ -o $@ -lpthread + install: install mpssd /usr/sbin/mpssd install micctrl /usr/sbin/micctrl diff --git a/samples/mic/mpssd/vopd.c b/samples/mic/mpssd/vopd.c new file mode 100644 index ..48c1877fed28 --- /dev/null +++ b/samples/mic/mpssd/vopd.c @@ -0,0 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include + +#include "mpssd.h" + +void mpsslog(char *format, ...) +{ + va_list args; + + va_start(args, format); + vprintf(format, args); + va_end(args); +} + +int main(int argc, char *argv[]) +{ + struct mic_info themic = { + .name = "mic", + }; + + serve_virtio(&themic, &themic); + + return 0; +} -- 2.20.0
[PATCH] Documentation/sysctl/vm.txt: Fix drop_caches bit number
Bits are usually numbered starting from zero, so 4 should be bit 2, not bit 3. Suggested-by: Matthew Wilcox Signed-off-by: Vincent Whitchurch --- Documentation/sysctl/vm.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 187ce4f599a2..6af24cdb25cc 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -237,7 +237,7 @@ used: cat (1234): drop_caches: 3 These are informational only. They do not mean that anything is wrong -with your system. To disable them, echo 4 (bit 3) into drop_caches. +with your system. To disable them, echo 4 (bit 2) into drop_caches. == -- 2.20.0
[PATCH v3] kmemleak: add module param to print warnings to dmesg
Currently, kmemleak only prints the number of suspected leaks to dmesg but requires the user to read a debugfs file to get the actual stack traces of the objects' allocation points. Add a module option to print the full object information to dmesg too. It can be enabled with kmemleak.verbose=1 on the kernel command line, or "echo 1 > /sys/module/kmemleak/parameters/verbose": This allows easier integration of kmemleak into test systems: We have automated test infrastructure to test our Linux systems. With this option, running our tests with kmemleak is as simple as enabling kmemleak and passing this command line option; the test infrastructure knows how to save kernel logs, which will now include kmemleak reports. Without this option, the test infrastructure needs to be specifically taught to read out the kmemleak debugfs file. Removing this need for special handling makes kmemleak more similar to other kernel debug options (slab debugging, debug objects, etc). Signed-off-by: Vincent Whitchurch --- v3: Expand use case description. Replace config option with module parameter. mm/kmemleak.c | 42 +++--- 1 file changed, 35 insertions(+), 7 deletions(-) diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 9a085d525bbc..c91d43738596 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -86,6 +86,7 @@ #include #include #include +#include #include #include #include @@ -181,6 +182,7 @@ struct kmemleak_object { /* flag set to not scan the object */ #define OBJECT_NO_SCAN (1 << 2) +#define HEX_PREFIX "" /* number of bytes to print per line; must be 16 or 32 */ #define HEX_ROW_SIZE 16 /* number of bytes to print at a time (1, 2, 4, 8) */ @@ -235,6 +237,9 @@ static int kmemleak_skip_disable; /* If there are leaks that can be reported */ static bool kmemleak_found_leaks; +static bool kmemleak_verbose; +module_param_named(verbose, kmemleak_verbose, bool, 0600); + /* * Early object allocation/freeing logging. Kmemleak is initialized after the * kernel allocator. However, both the kernel allocator and kmemleak may @@ -299,6 +304,25 @@ static void kmemleak_disable(void); kmemleak_disable(); \ } while (0) +#define warn_or_seq_printf(seq, fmt, ...) do {\ + if (seq)\ + seq_printf(seq, fmt, ##__VA_ARGS__);\ + else\ + pr_warn(fmt, ##__VA_ARGS__);\ +} while (0) + +static void warn_or_seq_hex_dump(struct seq_file *seq, int prefix_type, +int rowsize, int groupsize, const void *buf, +size_t len, bool ascii) +{ + if (seq) + seq_hex_dump(seq, HEX_PREFIX, prefix_type, rowsize, groupsize, +buf, len, ascii); + else + print_hex_dump(KERN_WARNING, pr_fmt(HEX_PREFIX), prefix_type, + rowsize, groupsize, buf, len, ascii); +} + /* * Printing of the objects hex dump to the seq file. The number of lines to be * printed is limited to HEX_MAX_LINES to prevent seq file spamming. The @@ -314,10 +338,10 @@ static void hex_dump_object(struct seq_file *seq, /* limit the number of lines to HEX_MAX_LINES */ len = min_t(size_t, object->size, HEX_MAX_LINES * HEX_ROW_SIZE); - seq_printf(seq, " hex dump (first %zu bytes):\n", len); + warn_or_seq_printf(seq, " hex dump (first %zu bytes):\n", len); kasan_disable_current(); - seq_hex_dump(seq, "", DUMP_PREFIX_NONE, HEX_ROW_SIZE, -HEX_GROUP_SIZE, ptr, len, HEX_ASCII); + warn_or_seq_hex_dump(seq, DUMP_PREFIX_NONE, HEX_ROW_SIZE, +HEX_GROUP_SIZE, ptr, len, HEX_ASCII); kasan_enable_current(); } @@ -365,17 +389,17 @@ static void print_unreferenced(struct seq_file *seq, int i; unsigned int msecs_age = jiffies_to_msecs(jiffies - object->jiffies); - seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n", + warn_or_seq_printf(seq, "unreferenced object 0x%08lx (size %zu):\n", object->pointer, object->size); - seq_printf(seq, " comm \"%s\", pid %d, jiffies %lu (age %d.%03ds)\n", + warn_or_seq_printf(seq, " comm \"%s\", pid %d, jiffies %lu (age %d.%03ds)\n", object->comm, object->pid, object->jiffies, msecs_age / 1000, msecs_age % 1000); hex_dump_object(seq, object); - seq_printf(seq, " backtrace:\n"); + warn_or_seq_printf(seq, " backtrace:\n"); for (i = 0; i < object->trace_len; i++) { void *ptr = (void *)object->trace[i]; - seq_printf(s
Re: [PATCH v2] dynamic debug: allow printing to trace event
On Fri, Aug 14, 2020 at 11:30:34PM +0200, Jason Baron wrote: > On 8/14/20 1:15 PM, Steven Rostedt wrote: > > On Fri, 14 Aug 2020 15:31:51 +0200 > > Vincent Whitchurch wrote: > >> index aa9ff9e1c0b3..f599ed21ecc5 100644 > >> --- a/include/linux/dynamic_debug.h > >> +++ b/include/linux/dynamic_debug.h > >> @@ -27,13 +27,16 @@ struct _ddebug { > >> * writes commands to /dynamic_debug/control > >> */ > >> #define _DPRINTK_FLAGS_NONE 0 > >> -#define _DPRINTK_FLAGS_PRINT (1<<0) /* printk() a message using the > >> format */ > >> +#define _DPRINTK_FLAGS_PRINTK (1<<0) /* printk() a message using the > >> format */ > > > > The above looks like a cleanup unrelated to this patch, and probably > > should be on its own. > > I read it as we used to have this one thing called 'print', which really meant > printk, but now that we also have the ability to output to the trace buffer, > what does 'print' mean now? So I read it as being part of this change. Yes, that's what was intended, but I think it makes sense to split it out as Steven suggested so I've done that now (and also renamed the combined flag to the less ambiguous _DPRINTK_FLAGS_ENABLE). > > > > >> #define _DPRINTK_FLAGS_INCL_MODNAME (1<<1) > >> #define _DPRINTK_FLAGS_INCL_FUNCNAME (1<<2) > >> #define _DPRINTK_FLAGS_INCL_LINENO(1<<3) > >> #define _DPRINTK_FLAGS_INCL_TID (1<<4) > >> +#define _DPRINTK_FLAGS_TRACE (1<<5) > >> +#define _DPRINTK_FLAGS_PRINT (_DPRINTK_FLAGS_PRINTK | \ > >> + _DPRINTK_FLAGS_TRACE) > > > Is _DPRINTK_FLAGS_PRINT actually used anywhere? Looks to me like > it can be removed. It's used from DYNAMIC_DEBUG_BRANCH() as well as from lib/dynamic_debug.c to check if the location is enabled. > This is a feature I've wanted for dynamic debug for a while. Thanks for > implementing it! > > Dynamic can be enabled on the command line in order to print things early > in boot (I think ftrace can as well), I want to make sure that there are > no ordering issues here? And things wouldn't blow up if we enable printing > to the ftrace buffer early on via dyanmic debug? I tried enabling all dynamic debug locations and tracing via the command line and that worked fine: dyndbg="file * +x" trace_event=printk:*
[PATCH v3 1/2] dynamic debug: split enable and printk flags
We're going to add support to allow dynamic debug locations to print to a trace event, and that will make _DPRINTK_FLAGS_PRINT a bit ambiguous. Will it mean printk(), or printing to the trace event, or any of the two? To make it clearer, split _DPRINTK_FLAGS_PRINT into two: (1) _DPRINTK_FLAGS_PRINTK for turning on the printk() and (2) _DPRINTK_FLAGS_ENABLE when checking if the dynamic debug location is enabled. _DPRINTK_FLAGS_ENABLE is currently just an alias of _DPRINTK_FLAGS_PRINTK but will later also include a new flag for enabling printing to the trace event. Signed-off-by: Vincent Whitchurch --- include/linux/dynamic_debug.h | 9 + lib/dynamic_debug.c | 8 2 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h index aa9ff9e1c0b3..738421898aac 100644 --- a/include/linux/dynamic_debug.h +++ b/include/linux/dynamic_debug.h @@ -27,13 +27,14 @@ struct _ddebug { * writes commands to /dynamic_debug/control */ #define _DPRINTK_FLAGS_NONE0 -#define _DPRINTK_FLAGS_PRINT (1<<0) /* printk() a message using the format */ +#define _DPRINTK_FLAGS_PRINTK (1<<0) /* printk() a message using the format */ #define _DPRINTK_FLAGS_INCL_MODNAME(1<<1) #define _DPRINTK_FLAGS_INCL_FUNCNAME (1<<2) #define _DPRINTK_FLAGS_INCL_LINENO (1<<3) #define _DPRINTK_FLAGS_INCL_TID(1<<4) +#define _DPRINTK_FLAGS_ENABLE _DPRINTK_FLAGS_PRINTK #if defined DEBUG -#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT +#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINTK #else #define _DPRINTK_FLAGS_DEFAULT 0 #endif @@ -111,10 +112,10 @@ void __dynamic_ibdev_dbg(struct _ddebug *descriptor, #ifdef DEBUG #define DYNAMIC_DEBUG_BRANCH(descriptor) \ - likely(descriptor.flags & _DPRINTK_FLAGS_PRINT) + likely(descriptor.flags & _DPRINTK_FLAGS_ENABLE) #else #define DYNAMIC_DEBUG_BRANCH(descriptor) \ - unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT) + unlikely(descriptor.flags & _DPRINTK_FLAGS_ENABLE) #endif #endif diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index 1d012e597cc3..88af85cb5222 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -84,7 +84,7 @@ static inline const char *trim_prefix(const char *path) } static struct { unsigned flag:8; char opt_char; } opt_array[] = { - { _DPRINTK_FLAGS_PRINT, 'p' }, + { _DPRINTK_FLAGS_PRINTK, 'p' }, { _DPRINTK_FLAGS_INCL_MODNAME, 'm' }, { _DPRINTK_FLAGS_INCL_FUNCNAME, 'f' }, { _DPRINTK_FLAGS_INCL_LINENO, 'l' }, @@ -206,10 +206,10 @@ static int ddebug_change(const struct ddebug_query *query, if (newflags == dp->flags) continue; #ifdef CONFIG_JUMP_LABEL - if (dp->flags & _DPRINTK_FLAGS_PRINT) { - if (!(modifiers->flags & _DPRINTK_FLAGS_PRINT)) + if (dp->flags & _DPRINTK_FLAGS_ENABLE) { + if (!(modifiers->flags & _DPRINTK_FLAGS_ENABLE)) static_branch_disable(&dp->key.dd_key_true); - } else if (modifiers->flags & _DPRINTK_FLAGS_PRINT) + } else if (modifiers->flags & _DPRINTK_FLAGS_ENABLE) static_branch_enable(&dp->key.dd_key_true); #endif dp->flags = newflags; -- 2.28.0
[PATCH v3 0/2] Dynamic debug trace support
v3: - Split flag rename to a separate patch - Rename event to printk:dyndbg v2: - Remove stack buffer and use code similar to __ftrace_trace_stack() - Use an event with the same class as printk:console Vincent Whitchurch (2): dynamic debug: split enable and printk flags dynamic debug: allow printing to trace event .../admin-guide/dynamic-debug-howto.rst | 1 + include/linux/dynamic_debug.h | 11 +- include/trace/events/printk.h | 12 +- lib/dynamic_debug.c | 161 ++ 4 files changed, 151 insertions(+), 34 deletions(-) Range-diff: -: > 1: 2564b3dbbb04 dynamic debug: split enable and printk flags 1: 7bd3fb553503 ! 2: 90291c35d751 dynamic debug: allow printing to trace event @@ Commit message debug do it. Add an "x" flag to make the dynamic debug call site print to a new -printk:dynamic trace event. The trace event can be emitted instead of -or in addition to the printk(). +printk:dyndbg trace event. The trace event can be emitted instead of or +in addition to the printk(). The print buffer is statically allocated and managed using code borrowed from __ftrace_trace_stack() and is limited to 256 bytes (four of these @@ Documentation/admin-guide/dynamic-debug-howto.rst: of the characters:: The flags are:: penables the pr_debug() callsite. -+ xenables trace to the printk:dynamic event ++ xenables trace to the printk:dyndbg event fInclude the function name in the printed message lInclude line number in the printed message mInclude module name in the printed message ## include/linux/dynamic_debug.h ## @@ include/linux/dynamic_debug.h: struct _ddebug { -* writes commands to /dynamic_debug/control -*/ - #define _DPRINTK_FLAGS_NONE 0 --#define _DPRINTK_FLAGS_PRINT (1<<0) /* printk() a message using the format */ -+#define _DPRINTK_FLAGS_PRINTK (1<<0) /* printk() a message using the format */ - #define _DPRINTK_FLAGS_INCL_MODNAME (1<<1) #define _DPRINTK_FLAGS_INCL_FUNCNAME (1<<2) #define _DPRINTK_FLAGS_INCL_LINENO(1<<3) #define _DPRINTK_FLAGS_INCL_TID (1<<4) +-#define _DPRINTK_FLAGS_ENABLE _DPRINTK_FLAGS_PRINTK +#define _DPRINTK_FLAGS_TRACE (1<<5) -+#define _DPRINTK_FLAGS_PRINT (_DPRINTK_FLAGS_PRINTK | \ ++#define _DPRINTK_FLAGS_ENABLE (_DPRINTK_FLAGS_PRINTK | \ + _DPRINTK_FLAGS_TRACE) #if defined DEBUG --#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT -+#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINTK + #define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINTK #else - #define _DPRINTK_FLAGS_DEFAULT 0 - #endif ## include/trace/events/printk.h ## @@ @@ include/trace/events/printk.h: TRACE_EVENT(console, + TP_ARGS(text, len) +); + -+DEFINE_EVENT(printk, dynamic, ++DEFINE_EVENT(printk, dyndbg, + TP_PROTO(const char *text, size_t len), + TP_ARGS(text, len) +); @@ lib/dynamic_debug.c #include -@@ lib/dynamic_debug.c: static inline const char *trim_prefix(const char *path) - } - - static struct { unsigned flag:8; char opt_char; } opt_array[] = { -- { _DPRINTK_FLAGS_PRINT, 'p' }, -+ { _DPRINTK_FLAGS_PRINTK, 'p' }, - { _DPRINTK_FLAGS_INCL_MODNAME, 'm' }, +@@ lib/dynamic_debug.c: static struct { unsigned flag:8; char opt_char; } opt_array[] = { { _DPRINTK_FLAGS_INCL_FUNCNAME, 'f' }, { _DPRINTK_FLAGS_INCL_LINENO, 'l' }, { _DPRINTK_FLAGS_INCL_TID, 't' }, @@ lib/dynamic_debug.c: static char *dynamic_emit_prefix(const struct _ddebug *desc + buf = this_cpu_ptr(dynamic_trace_bufs.bufs) + bufidx; + + len = vscnprintf(buf->buf, sizeof(buf->buf), fmt, args); -+ trace_dynamic(buf->buf, len); ++ trace_dyndbg(buf->buf, len); + +out: + /* As above. */ -- 2.28.0
[PATCH v3 2/2] dynamic debug: allow printing to trace event
When debugging device drivers, I've found it very useful to be able to redirect existing pr_debug()/dev_dbg() prints to the trace buffer instead of dmesg. Among the many advantages of the trace buffer is that it can be dynamically resized, allows these prints to combined with other trace events, and doesn't fill up system logs. Since dynamic debug already has hooks in these call sites, getting these prints into the ftrace buffer is straightforward if we have dynamic debug do it. Add an "x" flag to make the dynamic debug call site print to a new printk:dyndbg trace event. The trace event can be emitted instead of or in addition to the printk(). The print buffer is statically allocated and managed using code borrowed from __ftrace_trace_stack() and is limited to 256 bytes (four of these are allocated per CPU to handle the various contexts); anything larger will be truncated. Signed-off-by: Vincent Whitchurch --- .../admin-guide/dynamic-debug-howto.rst | 1 + include/linux/dynamic_debug.h | 4 +- include/trace/events/printk.h | 12 +- lib/dynamic_debug.c | 153 +++--- 4 files changed, 143 insertions(+), 27 deletions(-) diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index e5a8def45f3f..e717054ae142 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -229,6 +229,7 @@ of the characters:: The flags are:: penables the pr_debug() callsite. + xenables trace to the printk:dyndbg event fInclude the function name in the printed message lInclude line number in the printed message mInclude module name in the printed message diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h index 738421898aac..74aee3f922d7 100644 --- a/include/linux/dynamic_debug.h +++ b/include/linux/dynamic_debug.h @@ -32,7 +32,9 @@ struct _ddebug { #define _DPRINTK_FLAGS_INCL_FUNCNAME (1<<2) #define _DPRINTK_FLAGS_INCL_LINENO (1<<3) #define _DPRINTK_FLAGS_INCL_TID(1<<4) -#define _DPRINTK_FLAGS_ENABLE _DPRINTK_FLAGS_PRINTK +#define _DPRINTK_FLAGS_TRACE (1<<5) +#define _DPRINTK_FLAGS_ENABLE (_DPRINTK_FLAGS_PRINTK | \ +_DPRINTK_FLAGS_TRACE) #if defined DEBUG #define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINTK #else diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h index 13d405b2fd8b..1f78bd237a91 100644 --- a/include/trace/events/printk.h +++ b/include/trace/events/printk.h @@ -7,7 +7,7 @@ #include -TRACE_EVENT(console, +DECLARE_EVENT_CLASS(printk, TP_PROTO(const char *text, size_t len), TP_ARGS(text, len), @@ -31,6 +31,16 @@ TRACE_EVENT(console, TP_printk("%s", __get_str(msg)) ); + +DEFINE_EVENT(printk, console, + TP_PROTO(const char *text, size_t len), + TP_ARGS(text, len) +); + +DEFINE_EVENT(printk, dyndbg, + TP_PROTO(const char *text, size_t len), + TP_ARGS(text, len) +); #endif /* _TRACE_PRINTK_H */ /* This part must be outside protection */ diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index 88af85cb5222..3319da2e235c 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -36,6 +36,7 @@ #include #include #include +#include #include @@ -89,6 +90,7 @@ static struct { unsigned flag:8; char opt_char; } opt_array[] = { { _DPRINTK_FLAGS_INCL_FUNCNAME, 'f' }, { _DPRINTK_FLAGS_INCL_LINENO, 'l' }, { _DPRINTK_FLAGS_INCL_TID, 't' }, + { _DPRINTK_FLAGS_TRACE, 'x' }, { _DPRINTK_FLAGS_NONE, '_' }, }; @@ -600,6 +602,96 @@ static char *dynamic_emit_prefix(const struct _ddebug *desc, char *buf) return buf; } +/* + * This code is heavily based on __ftrace_trace_stack(). + * + * Allow 4 levels of nesting: normal, softirq, irq, NMI. + */ +#define DYNAMIC_TRACE_NESTING 4 + +struct dynamic_trace_buf { + char buf[256]; +}; + +struct dynamic_trace_bufs { + struct dynamic_trace_buf bufs[DYNAMIC_TRACE_NESTING]; +}; + +static DEFINE_PER_CPU(struct dynamic_trace_bufs, dynamic_trace_bufs); +static DEFINE_PER_CPU(int, dynamic_trace_reserve); + +static void dynamic_trace(const char *fmt, va_list args) +{ + struct dynamic_trace_buf *buf; + int bufidx; + int len; + + preempt_disable_notrace(); + + bufidx = __this_cpu_inc_return(dynamic_trace_reserve) - 1; + + if (WARN_ON_ONCE(bufidx > DYNAMIC_TRACE_NESTING)) + goto out; + + /* For the same reasons as in __ftrace_trace_stack(). */ + barrier(); + + buf = this_cpu_ptr(dynamic_trace_bufs.bufs) + bufidx; + + len = vscnprintf(buf->buf, sizeof(buf->buf), fmt, args); + trace_dyndb
Re: [PATCH v2] dynamic debug: allow printing to trace event
On Fri, Aug 14, 2020 at 07:15:31PM +0200, Steven Rostedt wrote: > On Fri, 14 Aug 2020 15:31:51 +0200 > Vincent Whitchurch wrote: > > index aa9ff9e1c0b3..f599ed21ecc5 100644 > > --- a/include/linux/dynamic_debug.h > > +++ b/include/linux/dynamic_debug.h > > @@ -27,13 +27,16 @@ struct _ddebug { > > * writes commands to /dynamic_debug/control > > */ > > #define _DPRINTK_FLAGS_NONE0 > > -#define _DPRINTK_FLAGS_PRINT (1<<0) /* printk() a message using the > > format */ > > +#define _DPRINTK_FLAGS_PRINTK (1<<0) /* printk() a message using the > > format */ > > The above looks like a cleanup unrelated to this patch, and probably > should be on its own. I've moved this and the other renaming hunk out to a separate patch. > > #define _DPRINTK_FLAGS_INCL_MODNAME(1<<1) > > #define _DPRINTK_FLAGS_INCL_FUNCNAME (1<<2) > > #define _DPRINTK_FLAGS_INCL_LINENO (1<<3) > > #define _DPRINTK_FLAGS_INCL_TID(1<<4) > > +#define _DPRINTK_FLAGS_TRACE (1<<5) > > +#define _DPRINTK_FLAGS_PRINT (_DPRINTK_FLAGS_PRINTK | \ > > +_DPRINTK_FLAGS_TRACE) > > #if defined DEBUG > > -#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT > > +#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINTK > > #else > > #define _DPRINTK_FLAGS_DEFAULT 0 > > #endif > > diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h > > index 13d405b2fd8b..6c89121a1669 100644 > > --- a/include/trace/events/printk.h > > +++ b/include/trace/events/printk.h > > @@ -7,7 +7,7 @@ > > > > #include > > > > -TRACE_EVENT(console, > > +DECLARE_EVENT_CLASS(printk, > > TP_PROTO(const char *text, size_t len), > > > > TP_ARGS(text, len), > > @@ -31,6 +31,16 @@ TRACE_EVENT(console, > > > > TP_printk("%s", __get_str(msg)) > > ); > > + > > +DEFINE_EVENT(printk, console, > > + TP_PROTO(const char *text, size_t len), > > + TP_ARGS(text, len) > > +); > > + > > +DEFINE_EVENT(printk, dynamic, > > Can we call this "dynamic_printk" or "printk_dynamic", as > trace_dynamic() is too generic. I went for "dyndbg" (printk:dyndbg / trace_dyndbg()) which is the documented name of the infrastructure.
[PATCH v2 2/2] ARM: module: fix handling of unwind init sections
Unwind information for init sections is placed in .ARM.exidx.init.text and .ARM.extab.init.text. The module core doesn't know that these are init sections so they are allocated along with the core sections, and if the core and init sections get allocated in different memory regions (which is possible with CONFIG_ARM_MODULE_PLTS=y) and they can't reach each other, relocation fails: final section addresses: ... 0x7f80 .init.text .. 0xcbb54078 .ARM.exidx.init.text .. section 16 reloc 0 sym '': relocation 42 out of range (0xcbb54078 -> 0x7f80) Fix this by informing the module core that these sections are init sections, and by removing the init unwind tables before the module core frees the init sections. Signed-off-by: Vincent Whitchurch --- v2: No changes. arch/arm/kernel/module.c | 22 +++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index deef17f34bd2..af0a8500a24e 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -55,6 +55,13 @@ void *module_alloc(unsigned long size) } #endif +bool module_init_section(const char *name) +{ + return strstarts(name, ".init") || + strstarts(name, ".ARM.extab.init") || + strstarts(name, ".ARM.exidx.init"); +} + bool module_exit_section(const char *name) { return strstarts(name, ".exit") || @@ -409,8 +416,17 @@ module_arch_cleanup(struct module *mod) #ifdef CONFIG_ARM_UNWIND int i; - for (i = 0; i < ARM_SEC_MAX; i++) - if (mod->arch.unwind[i]) - unwind_table_del(mod->arch.unwind[i]); + for (i = 0; i < ARM_SEC_MAX; i++) { + unwind_table_del(mod->arch.unwind[i]); + mod->arch.unwind[i] = NULL; + } +#endif +} + +void __weak module_arch_freeing_init(struct module *mod) +{ +#ifdef CONFIG_ARM_UNWIND + unwind_table_del(mod->arch.unwind[ARM_SEC_INIT]); + mod->arch.unwind[ARM_SEC_INIT] = NULL; #endif } -- 2.25.1
[PATCH v2 1/2] module: allow arch overrides for .init section names
ARM stores unwind information for .init.text in sections named .ARM.extab.init.text and .ARM.exidx.init.text. Since those aren't currently recognized as init sections, they're allocated along with the core section, and relocation fails if the core and the init section are allocated from different regions and can't reach other. final section addresses: ... 0x7f80 .init.text .. 0xcbb54078 .ARM.exidx.init.text .. section 16 reloc 0 sym '': relocation 42 out of range (0xcbb54078 -> 0x7f80) Allow architectures to override the section name so that ARM can fix this. Signed-off-by: Vincent Whitchurch --- v2: Add comment and move module_init_section() next to module_exit_section(). include/linux/moduleloader.h | 5 + kernel/module.c | 9 +++-- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index ca92aea8a6bd..4fa67a8b2265 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -29,6 +29,11 @@ void *module_alloc(unsigned long size); /* Free memory returned from module_alloc. */ void module_memfree(void *module_region); +/* Determines if the section name is an init section (that is only used during + * module loading). + */ +bool module_init_section(const char *name); + /* Determines if the section name is an exit section (that is only used during * module unloading) */ diff --git a/kernel/module.c b/kernel/module.c index 33569a01d6e1..84d0c455fb44 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -2400,7 +2400,7 @@ static void layout_sections(struct module *mod, struct load_info *info) if ((s->sh_flags & masks[m][0]) != masks[m][0] || (s->sh_flags & masks[m][1]) || s->sh_entsize != ~0UL - || strstarts(sname, ".init")) + || module_init_section(sname)) continue; s->sh_entsize = get_offset(mod, &mod->core_layout.size, s, i); pr_debug("\t%s\n", sname); @@ -2433,7 +2433,7 @@ static void layout_sections(struct module *mod, struct load_info *info) if ((s->sh_flags & masks[m][0]) != masks[m][0] || (s->sh_flags & masks[m][1]) || s->sh_entsize != ~0UL - || !strstarts(sname, ".init")) + || !module_init_section(sname)) continue; s->sh_entsize = (get_offset(mod, &mod->init_layout.size, s, i) | INIT_OFFSET_MASK); @@ -2768,6 +2768,11 @@ void * __weak module_alloc(unsigned long size) return vmalloc_exec(size); } +bool __weak module_init_section(const char *name) +{ + return strstarts(name, ".init"); +} + bool __weak module_exit_section(const char *name) { return strstarts(name, ".exit"); -- 2.25.1
Re: [PATCH v2 1/2] module: allow arch overrides for .init section names
On Mon, May 11, 2020 at 05:45:00PM +0200, Jessica Yu wrote: > +++ Vincent Whitchurch [11/05/20 13:48 +0200]: > >ARM stores unwind information for .init.text in sections named > >.ARM.extab.init.text and .ARM.exidx.init.text. Since those aren't > >currently recognized as init sections, they're allocated along with the > >core section, and relocation fails if the core and the init section are > >allocated from different regions and can't reach other. > > > > final section addresses: > >... > >0x7f80 .init.text > >.. > >0xcbb54078 .ARM.exidx.init.text > >.. > > > > section 16 reloc 0 sym '': relocation 42 out of range (0xcbb54078 -> > > 0x7f80) > > > >Allow architectures to override the section name so that ARM can fix > >this. > > > >Signed-off-by: Vincent Whitchurch > >--- > >v2: Add comment and move module_init_section() next to module_exit_section(). > > Thanks, this patch looks fine to me. You could add my: > >Acked-by: Jessica Yu > > Alternatively, I can take this through modules-next if the second > patch gets a review and ack from an ARM maintainer. Thanks! I've put the patches in Russell's patch system.
Re: [PATCH v3 2/2] dynamic debug: allow printing to trace event
On Wed, Aug 26, 2020 at 09:53:57PM +0200, Joe Perches wrote: > On Wed, 2020-08-26 at 15:32 -0400, Steven Rostedt wrote: > > On Tue, 25 Aug 2020 08:53:25 -0700 > > Joe Perches wrote: > > > > > > The print buffer is statically allocated and managed using code borrowed > > > > from __ftrace_trace_stack() and is limited to 256 bytes (four of these > > > > are allocated per CPU to handle the various contexts); anything larger > > > > will be truncated. > > > > > > There is an effort to avoid using trace_printk and the like > > > so perhaps this feature should have the same compile time > > > guard. > > > > No, this is fine for being in a production kernel. Basically, these are > > simply debug printk()s that can also be put into the trace buffer. The > > key difference to trace_printk() is that they are an event that needs > > to be enabled to write into the buffer. > > It just seems like a backdoor way to convert various pr_debug functions > (dev_dbg/netdev_dbg, et al) into tracing. > > What's the real value? Timing data? Something else? I mentioned my use case for this in the commit message and why it works much better than printk() for that, please let me know if it is unclear: When debugging device drivers, I've found it very useful to be able to redirect existing pr_debug()/dev_dbg() prints to the trace buffer instead of dmesg. Among the many advantages of the trace buffer is that it can be dynamically resized, allows these prints to combined with other trace events, and doesn't fill up system logs. This is my only use case for this, and I've used it very very often during the years I've been carrying this patch locally.
[PATCH] regulator: pwm: Fix machine constraints application
If the zero duty cycle doesn't correspond to any voltage in the voltage table, the PWM regulator returns an -EINVAL from get_voltage_sel() which results in the core erroring out with a "failed to get the current voltage" and ending up not applying the machine constraints. Instead, return -ENOTRECOVERABLE which makes the core set the voltage since it's at an unknown value. For example, with this device tree: fooregulator { compatible = "pwm-regulator"; pwms = <&foopwm 0 10>; regulator-min-microvolt = <225>; regulator-max-microvolt = <225>; regulator-name = "fooregulator"; regulator-always-on; regulator-boot-on; voltage-table = <225 30>; }; Before this patch: fooregulator: failed to get the current voltage(-22) After this patch: fooregulator: Setting 225-225uV fooregulator: 2250 mV Signed-off-by: Vincent Whitchurch --- drivers/regulator/pwm-regulator.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/regulator/pwm-regulator.c b/drivers/regulator/pwm-regulator.c index 3234b118b53e..990bd50771d8 100644 --- a/drivers/regulator/pwm-regulator.c +++ b/drivers/regulator/pwm-regulator.c @@ -279,7 +279,7 @@ static int pwm_regulator_init_table(struct platform_device *pdev, return ret; } - drvdata->state = -EINVAL; + drvdata->state = -ENOTRECOVERABLE; drvdata->duty_cycle_table = duty_cycle_table; drvdata->desc.ops = &pwm_regulator_voltage_table_ops; drvdata->desc.n_voltages= length / sizeof(*duty_cycle_table); -- 2.28.0
[PATCH v2] dynamic debug: allow printing to trace event
When debugging device drivers, I've found it very useful to be able to redirect existing pr_debug()/dev_dbg() prints to the trace buffer instead of dmesg. Among the many advantages of the trace buffer is that it can be dynamically resized, allows these prints to combined with other trace events, and doesn't fill up system logs. Since dynamic debug already has hooks in these call sites, getting these prints into the ftrace buffer is straightforward if we have dynamic debug do it. Add an "x" flag to make the dynamic debug call site print to a new printk:dynamic trace event. The trace event can be emitted instead of or in addition to the printk(). The print buffer is statically allocated and managed using code borrowed from __ftrace_trace_stack() and is limited to 256 bytes (four of these are allocated per CPU to handle the various contexts); anything larger will be truncated. Signed-off-by: Vincent Whitchurch --- v2: - Remove stack buffer and use code similar to __ftrace_trace_stack() - Use an event with the same class as printk:console .../admin-guide/dynamic-debug-howto.rst | 1 + include/linux/dynamic_debug.h | 7 +- include/trace/events/printk.h | 12 +- lib/dynamic_debug.c | 155 +++--- 4 files changed, 146 insertions(+), 29 deletions(-) diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index e5a8def45f3f..d2ebee464db7 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -229,6 +229,7 @@ of the characters:: The flags are:: penables the pr_debug() callsite. + xenables trace to the printk:dynamic event fInclude the function name in the printed message lInclude line number in the printed message mInclude module name in the printed message diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h index aa9ff9e1c0b3..f599ed21ecc5 100644 --- a/include/linux/dynamic_debug.h +++ b/include/linux/dynamic_debug.h @@ -27,13 +27,16 @@ struct _ddebug { * writes commands to /dynamic_debug/control */ #define _DPRINTK_FLAGS_NONE0 -#define _DPRINTK_FLAGS_PRINT (1<<0) /* printk() a message using the format */ +#define _DPRINTK_FLAGS_PRINTK (1<<0) /* printk() a message using the format */ #define _DPRINTK_FLAGS_INCL_MODNAME(1<<1) #define _DPRINTK_FLAGS_INCL_FUNCNAME (1<<2) #define _DPRINTK_FLAGS_INCL_LINENO (1<<3) #define _DPRINTK_FLAGS_INCL_TID(1<<4) +#define _DPRINTK_FLAGS_TRACE (1<<5) +#define _DPRINTK_FLAGS_PRINT (_DPRINTK_FLAGS_PRINTK | \ +_DPRINTK_FLAGS_TRACE) #if defined DEBUG -#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINT +#define _DPRINTK_FLAGS_DEFAULT _DPRINTK_FLAGS_PRINTK #else #define _DPRINTK_FLAGS_DEFAULT 0 #endif diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h index 13d405b2fd8b..6c89121a1669 100644 --- a/include/trace/events/printk.h +++ b/include/trace/events/printk.h @@ -7,7 +7,7 @@ #include -TRACE_EVENT(console, +DECLARE_EVENT_CLASS(printk, TP_PROTO(const char *text, size_t len), TP_ARGS(text, len), @@ -31,6 +31,16 @@ TRACE_EVENT(console, TP_printk("%s", __get_str(msg)) ); + +DEFINE_EVENT(printk, console, + TP_PROTO(const char *text, size_t len), + TP_ARGS(text, len) +); + +DEFINE_EVENT(printk, dynamic, + TP_PROTO(const char *text, size_t len), + TP_ARGS(text, len) +); #endif /* _TRACE_PRINTK_H */ /* This part must be outside protection */ diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index 1d012e597cc3..76fc3e33fe41 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -36,6 +36,7 @@ #include #include #include +#include #include @@ -84,11 +85,12 @@ static inline const char *trim_prefix(const char *path) } static struct { unsigned flag:8; char opt_char; } opt_array[] = { - { _DPRINTK_FLAGS_PRINT, 'p' }, + { _DPRINTK_FLAGS_PRINTK, 'p' }, { _DPRINTK_FLAGS_INCL_MODNAME, 'm' }, { _DPRINTK_FLAGS_INCL_FUNCNAME, 'f' }, { _DPRINTK_FLAGS_INCL_LINENO, 'l' }, { _DPRINTK_FLAGS_INCL_TID, 't' }, + { _DPRINTK_FLAGS_TRACE, 'x' }, { _DPRINTK_FLAGS_NONE, '_' }, }; @@ -600,6 +602,96 @@ static char *dynamic_emit_prefix(const struct _ddebug *desc, char *buf) return buf; } +/* + * This code is heavily based on __ftrace_trace_stack(). + * + * Allow 4 levels of nesting: normal, softirq, irq, NMI. + */ +#define DYNAMIC_TRACE_NESTING 4 + +struct dynamic_trace_buf { + char buf[256]; +}; + +struct dynamic_trace_bufs { + struct dynamic_trace_buf bufs[DYNAMIC_TRACE_N
Re: [PATCH] dynamic debug: allow printing to trace event
On Thu, Jul 23, 2020 at 05:26:44PM +0200, Steven Rostedt wrote: > On Thu, 23 Jul 2020 12:57:35 +0200 > Vincent Whitchurch wrote: > > > Would it be acceptable to just use a fixed size for the event? At least > > for my own debugging use cases it's preferable to just have to increase > > the trace buffer size in case it's insufficient, rather than to have to > > restort to one-off debugging code. > > There's two other options. > > Option 1, is to allocate 256 bytes times 4 (in case of interruption, > where you have a separate buffer for every context - normal, softirq, > irq, nmi), and use it like I do for stack traces in the latest kernel > (see __ftrace_stack_trace() in kernel/trace/trace.c) > > Option 2, would be to use trace_array_vprintk(), but you need to create > your own instance to do so to keep from messing with the top level buffer. Thanks for the suggestions, I've sent out a v2 implementing option 1: https://lore.kernel.org/lkml/20200814133151.7759-1-vincent.whitchu...@axis.com/
[PATCH v2] gpio: mockup: Allow probing from device tree
Allow the mockup driver to be probed via the device tree without any module parameters, allowing it to be used to configure and test higher level drivers like the leds-gpio driver and corresponding userspace before actual hardware is available. Signed-off-by: Vincent Whitchurch --- v2: Remove most of the added code, since the latest driver doesn't need it. Drop DT binding document, since Rob Herring was OK with not documenting this: https://lore.kernel.org/linux-devicetree/5baa1ae6.1c69fb81.847f2.3...@mx.google.com/ drivers/gpio/gpio-mockup.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/gpio/gpio-mockup.c b/drivers/gpio/gpio-mockup.c index 67ed4f238d43..c93892a6936a 100644 --- a/drivers/gpio/gpio-mockup.c +++ b/drivers/gpio/gpio-mockup.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -460,9 +461,18 @@ static int gpio_mockup_probe(struct platform_device *pdev) return 0; } +#ifdef CONFIG_OF +static const struct of_device_id gpio_mockup_of_match[] = { + { .compatible = "gpio-mockup", }, + {}, +}; +MODULE_DEVICE_TABLE(of, gpio_mockup_of_match); +#endif + static struct platform_driver gpio_mockup_driver = { .driver = { .name = "gpio-mockup", + .of_match_table = of_match_ptr(gpio_mockup_of_match), }, .probe = gpio_mockup_probe, }; @@ -556,8 +566,7 @@ static int __init gpio_mockup_init(void) { int i, num_chips, err; - if ((gpio_mockup_num_ranges < 2) || - (gpio_mockup_num_ranges % 2) || + if ((gpio_mockup_num_ranges % 2) || (gpio_mockup_num_ranges > GPIO_MOCKUP_MAX_RANGES)) return -EINVAL; -- 2.28.0
Re: [PATCH V3 2/4] misc: vop: do not allocate and reassign the used ring
On Tue, Oct 27, 2020 at 08:05:43AM +0100, Sherry Sun wrote: > Can you help test the patch about removing the codes of reassign used > ring, and comment on the impact for Intel MIC platform? Thanks for > any help. I don't have access to MIC hardware myself, either. But this patch is quite certainly going to break it since guests using a kernel without the patch will not work with hosts with a kernel with the patch.
[PATCH] of: Fix reserved-memory overlap detection
The reserved-memory overlap detection code fails to detect overlaps if either of the regions starts at address 0x0. For some reason the code explicitly checks for and ignores such regions, but this check looks invalid. Remove the check and fix this detection. For example, no overlap is currently reported for this case: foo@0 { reg = <0x 0x2000>; }; bar@1000 { reg = <0x1000 0x1000>; }; but it is after this patch: OF: reserved mem: OVERLAP DETECTED! foo@0 (0x--0x2000) overlaps with bar@1000 (0x1000--0x2000) Signed-off-by: Vincent Whitchurch --- drivers/of/of_reserved_mem.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c index 46b9371c8a33..1c5259e3e81f 100644 --- a/drivers/of/of_reserved_mem.c +++ b/drivers/of/of_reserved_mem.c @@ -217,8 +217,7 @@ static void __init __rmem_check_for_overlap(void) this = &reserved_mem[i]; next = &reserved_mem[i + 1]; - if (!(this->base && next->base)) - continue; + if (this->base + this->size > next->base) { phys_addr_t this_end, next_end; -- 2.28.0
Re: [PATCH] of: Fix reserved-memory overlap detection
On Tue, Oct 20, 2020 at 03:00:14PM +0200, Rob Herring wrote: > On Tue, Oct 20, 2020 at 2:36 AM Vincent Whitchurch > wrote: > > > > The reserved-memory overlap detection code fails to detect overlaps if > > either of the regions starts at address 0x0. For some reason the code > > explicitly checks for and ignores such regions, but this check looks > > invalid. Remove the check and fix this detection. > > Wouldn't 'base' be 0 for nodes that have a 'size' and no address? The > base in those cases isn't set until later when > __reserved_mem_alloc_size() is called. Ah, yes, I guess that's why the check was there. I see that those entries have both a zero address and a zero size, so this seems to work: diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts b/arch/arm/boot/dts/vexpress-v2p-ca9.dts index 623246f37448..6627e71c7283 100644 --- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts +++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts @@ -81,6 +81,18 @@ vram: vram@4c00 { reg = <0x4c00 0x0080>; no-map; }; + + foo@0 { + reg = <0x0 0x2000>; + }; + + bar@1000 { + reg = <0x0 0x1000>; + }; + + baz { + size = <0x1000>; + }; }; clcd@1002 { diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c index 46b9371c8a33..fea9433d942a 100644 --- a/drivers/of/of_reserved_mem.c +++ b/drivers/of/of_reserved_mem.c @@ -200,6 +200,16 @@ static int __init __rmem_cmp(const void *a, const void *b) if (ra->base > rb->base) return 1; + /* +* Put the dynamic allocations (address == 0, size == 0) before static +* allocations at address 0x0 so that overlap detection works +* correctly. +*/ + if (ra->size < rb->size) + return -1; + if (ra->size > rb->size) + return 1; + return 0; } @@ -212,13 +222,19 @@ static void __init __rmem_check_for_overlap(void) sort(reserved_mem, reserved_mem_count, sizeof(reserved_mem[0]), __rmem_cmp, NULL); + + for (i = 0; i < reserved_mem_count - 1; i++) { + struct reserved_mem *this = &reserved_mem[i]; + + pr_info("i %d base %x size %x\n", i, this->base, this->size); + } + for (i = 0; i < reserved_mem_count - 1; i++) { struct reserved_mem *this, *next; this = &reserved_mem[i]; next = &reserved_mem[i + 1]; - if (!(this->base && next->base)) - continue; + if (this->base + this->size > next->base) { phys_addr_t this_end, next_end;
[PATCH v2] of: Fix reserved-memory overlap detection
The reserved-memory overlap detection code fails to detect overlaps if either of the regions starts at address 0x0. The code explicitly checks for and ignores such regions, apparently in order to ignore dynamically allocated regions which have an address of 0x0 at this point. These dynamically allocated regions also have a size of 0x0 at this point, so fix this by removing the check and sorting the dynamically allocated regions ahead of any static regions at address 0x0. For example, there are two overlaps in this case but they are not currently reported: foo@0 { reg = <0x0 0x2000>; }; bar@0 { reg = <0x0 0x1000>; }; baz@1000 { reg = <0x1000 0x1000>; }; quux { size = <0x1000>; }; but they are after this patch: OF: reserved mem: OVERLAP DETECTED! bar@0 (0x--0x1000) overlaps with foo@0 (0x--0x2000) OF: reserved mem: OVERLAP DETECTED! foo@0 (0x--0x2000) overlaps with baz@1000 (0x1000--0x00002000) Signed-off-by: Vincent Whitchurch --- v2: Fix handling of dynamically allocated regions. drivers/of/of_reserved_mem.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c index 46b9371..6530b8b 100644 --- a/drivers/of/of_reserved_mem.c +++ b/drivers/of/of_reserved_mem.c @@ -200,6 +200,16 @@ static int __init __rmem_cmp(const void *a, const void *b) if (ra->base > rb->base) return 1; + /* +* Put the dynamic allocations (address == 0, size == 0) before static +* allocations at address 0x0 so that overlap detection works +* correctly. +*/ + if (ra->size < rb->size) + return -1; + if (ra->size > rb->size) + return 1; + return 0; } @@ -217,8 +227,7 @@ static void __init __rmem_check_for_overlap(void) this = &reserved_mem[i]; next = &reserved_mem[i + 1]; - if (!(this->base && next->base)) - continue; + if (this->base + this->size > next->base) { phys_addr_t this_end, next_end; base-commit: 270315b8235e3d10c2e360cff56c2f9e0915a252 -- git-series 0.9.1
Re: [PATCH] of: Fix reserved-memory overlap detection
On Tue, Oct 20, 2020 at 04:17:27PM +0200, Rob Herring wrote: > On Tue, Oct 20, 2020 at 8:46 AM Vincent Whitchurch > wrote: > > On Tue, Oct 20, 2020 at 03:00:14PM +0200, Rob Herring wrote: > > > On Tue, Oct 20, 2020 at 2:36 AM Vincent Whitchurch > > > wrote: > > > > > > > > The reserved-memory overlap detection code fails to detect overlaps if > > > > either of the regions starts at address 0x0. For some reason the code > > > > explicitly checks for and ignores such regions, but this check looks > > > > invalid. Remove the check and fix this detection. > > > > > > Wouldn't 'base' be 0 for nodes that have a 'size' and no address? The > > > base in those cases isn't set until later when > > > __reserved_mem_alloc_size() is called. > > > > Ah, yes, I guess that's why the check was there. I see that those > > entries have both a zero address and a zero size, so this seems to work: > > Yes, I think it should work. Thanks, I've tested it a bit more and sent it out as a v2 now. > > diff --git a/arch/arm/boot/dts/vexpress-v2p-ca9.dts > > b/arch/arm/boot/dts/vexpress-v2p-ca9.dts > > index 623246f37448..6627e71c7283 100644 > > --- a/arch/arm/boot/dts/vexpress-v2p-ca9.dts > > +++ b/arch/arm/boot/dts/vexpress-v2p-ca9.dts > > @@ -81,6 +81,18 @@ vram: vram@4c00 { > > reg = <0x4c00 0x0080>; > > no-map; > > }; > > + > > + foo@0 { > > + reg = <0x0 0x2000>; > > + }; > > + > > + bar@1000 { > > + reg = <0x0 0x1000>; > > 0x1000 base? I've corrected this in the example in the commit message for v2.
Re: [PATCH V3 2/4] misc: vop: do not allocate and reassign the used ring
On Wed, Oct 28, 2020 at 02:47:49AM +0100, Sherry Sun wrote: > > Subject: Re: [PATCH V3 2/4] misc: vop: do not allocate and reassign the used > > ring > > > > On Tue, Oct 27, 2020 at 08:05:43AM +0100, Sherry Sun wrote: > > > Can you help test the patch about removing the codes of reassign used > > > ring, and comment on the impact for Intel MIC platform? Thanks for > > > any help. > > > > I don't have access to MIC hardware myself, either. > > > > But this patch is quite certainly going to break it since guests using a > > kernel > > without the patch will not work with hosts with a kernel with the patch. > > Thanks for your reply. > This patch can be used by both guests and hosts. > I have tested it on imx8qm platform(both guest and host are ARM64 > architecture), and it works well. > So I guess Intel MIC won't meet big problems when both guest and host > apply this patch. But it is best if it can be tested. The guest and host are different systems and it should be possible to upgrade the host kernel without being forced to upgrade the guest kernel, and vice-versa. This patch breaks that. If MIC hardware has no users and the drivers are being deleted then of course backward compatibility doesn't matter.
Re: [PATCH V5 0/2] Change vring space from nomal memory to dma coherent memory
On Wed, Oct 28, 2020 at 06:58:36AM +0100, Greg KH wrote: > Have you all seen: > > https://lore.kernel.org/r/8c1443136563de34699d2c084df478181c205db4.1603854416.git.sudeep.d...@intel.com No, that link doesn't work and I can't find that email from Sudeep in any of the archives: https://lore.kernel.org/lkml/?q=f%3Asudeep.dutt%40intel.com Sudeep, perhaps you could try to resend the patch? Thank you.
Re: [PATCH v2] gpio: mockup: Allow probing from device tree
On Wed, Oct 28, 2020 at 12:43:22PM +0100, Andy Shevchenko wrote: > On Wed, Oct 28, 2020 at 10:00 AM Vincent Whitchurch > wrote: > > Allow the mockup driver to be probed via the device tree without any > > module parameters, allowing it to be used to configure and test higher > > level drivers like the leds-gpio driver and corresponding userspace > > before actual hardware is available. > > You have to officially announce a DT binding for that. Rob Herring has earlier said that was not required for this driver. As I mentioned a little further down in the email you are replying to: | Drop DT binding document, since Rob Herring was OK with not documenting | this: | https://lore.kernel.org/linux-devicetree/5baa1ae6.1c69fb81.847f2.3...@mx.google.com/
Re: [PATCH v2] gpio: mockup: Allow probing from device tree
On Tue, Oct 27, 2020 at 07:12:13PM +0100, Bartosz Golaszewski wrote: > On Tue, Oct 27, 2020 at 2:54 PM Vincent Whitchurch > wrote: > > diff --git a/drivers/gpio/gpio-mockup.c b/drivers/gpio/gpio-mockup.c > > index 67ed4f238d43..c93892a6936a 100644 > > --- a/drivers/gpio/gpio-mockup.c > > +++ b/drivers/gpio/gpio-mockup.c > > @@ -13,6 +13,7 @@ > > #include > > #include > > #include > > +#include > > Please keep the includes ordered alphabetically. Thanks, fixed in v3. > > #include > > #include > > #include > > @@ -460,9 +461,18 @@ static int gpio_mockup_probe(struct platform_device > > *pdev) > > return 0; > > } > > > > +#ifdef CONFIG_OF > > +static const struct of_device_id gpio_mockup_of_match[] = { > > + { .compatible = "gpio-mockup", }, > > + {}, > > +}; > > +MODULE_DEVICE_TABLE(of, gpio_mockup_of_match); > > +#endif > > You don't need this ifdef - of_match_ptr() will evaluate to NULL if > CONFIG_OF is disabled and the compiler will optimize this struct out. The compiler can't optimise out the struct in the case of a module build since there is a reference from the MODULE_DEVICE_TABLE: $ grep CONFIG_OF .config # CONFIG_OF is not set $ nm drivers/gpio/gpio-mockup.ko | grep of_ r gpio_mockup_of_match R __mod_of__gpio_mockup_of_match_device_table But these few wasted bytes don't matter so I removed the CONFIG_OF anyway as you suggested.
[PATCH v3] gpio: mockup: Allow probing from device tree
Allow the mockup driver to be probed via the device tree without any module parameters, allowing it to be used to configure and test higher level drivers like the leds-gpio driver and corresponding userspace before actual hardware is available. Signed-off-by: Vincent Whitchurch --- Notes: v3: - Keep includes sorted alphabetically - Drop CONFIG_OF ifdefs v2: - Remove most of the added code, since the latest driver doesn't need it. - Drop DT binding document, since Rob Herring was OK with not documenting this: https://lore.kernel.org/linux-devicetree/5baa1ae6.1c69fb81.847f2.3...@mx.google.com/ drivers/gpio/gpio-mockup.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpio/gpio-mockup.c b/drivers/gpio/gpio-mockup.c index 67ed4f238d43..ca87c590ef3f 100644 --- a/drivers/gpio/gpio-mockup.c +++ b/drivers/gpio/gpio-mockup.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -460,9 +461,16 @@ static int gpio_mockup_probe(struct platform_device *pdev) return 0; } +static const struct of_device_id gpio_mockup_of_match[] = { + { .compatible = "gpio-mockup", }, + {}, +}; +MODULE_DEVICE_TABLE(of, gpio_mockup_of_match); + static struct platform_driver gpio_mockup_driver = { .driver = { .name = "gpio-mockup", + .of_match_table = of_match_ptr(gpio_mockup_of_match), }, .probe = gpio_mockup_probe, }; @@ -556,8 +564,7 @@ static int __init gpio_mockup_init(void) { int i, num_chips, err; - if ((gpio_mockup_num_ranges < 2) || - (gpio_mockup_num_ranges % 2) || + if ((gpio_mockup_num_ranges % 2) || (gpio_mockup_num_ranges > GPIO_MOCKUP_MAX_RANGES)) return -EINVAL; -- 2.28.0
Re: [PATCH v2] gpio: mockup: Allow probing from device tree
On Wed, Oct 28, 2020 at 09:25:32PM +0100, Andy Shevchenko wrote: > On Wed, Oct 28, 2020 at 8:41 PM Bartosz Golaszewski wrote: > > On Tue, Oct 27, 2020 at 2:54 PM Vincent Whitchurch > > > +#ifdef CONFIG_OF > > > +static const struct of_device_id gpio_mockup_of_match[] = { > > > + { .compatible = "gpio-mockup", }, > > > + {}, > > > +}; > > > +MODULE_DEVICE_TABLE(of, gpio_mockup_of_match); > > > +#endif > > > > You don't need this ifdef - of_match_ptr() will evaluate to NULL if > > CONFIG_OF is disabled and the compiler will optimize this struct out. > > It's not so. If you drop ugly ifdeffery (and I vote for that, see also > above) the of_match_ptr() must be dropped as well. Otherwise the > compiler will issue the warning. So it is either all or none. Yes, you're right. I actually tested a !OF build before but it turns out that the warning is disabled by default and only enabled with W=1. I'll fix this and change the header in the next version. Thank you.
[PATCH v4] gpio: mockup: Allow probing from device tree
Allow the mockup driver to be probed via the device tree without any module parameters, allowing it to be used to configure and test higher level drivers like the leds-gpio driver and corresponding userspace before actual hardware is available. Signed-off-by: Vincent Whitchurch --- Notes: v4: - Remove of_match_ptr() to fix unused variable warning with W=1 - Include linux/mod_devicetable.h instead of linux/of.h v3: - Keep includes sorted alphabetically - Drop CONFIG_OF ifdefs v2: - Remove most of the added code, since the latest driver doesn't need it. - Drop DT binding document, since Rob Herring was OK with not documenting this: https://lore.kernel.org/linux-devicetree/5baa1ae6.1c69fb81.847f2.3...@mx.google.com/ Range-diff against v3: 1: 1e9b8f36676d ! 1: 4e8fdcfe1a47 gpio: mockup: Allow probing from device tree @@ Commit message ## Notes ## +v4: +- Remove of_match_ptr() to fix unused variable warning with W=1 +- Include linux/mod_devicetable.h instead of linux/of.h + v3: - Keep includes sorted alphabetically - Drop CONFIG_OF ifdefs @@ Notes ## drivers/gpio/gpio-mockup.c ## @@ + #include #include #include ++#include #include -+#include #include #include - #include @@ drivers/gpio/gpio-mockup.c: static int gpio_mockup_probe(struct platform_device *pdev) return 0; } @@ drivers/gpio/gpio-mockup.c: static int gpio_mockup_probe(struct platform_device static struct platform_driver gpio_mockup_driver = { .driver = { .name = "gpio-mockup", -+ .of_match_table = of_match_ptr(gpio_mockup_of_match), ++ .of_match_table = gpio_mockup_of_match, }, .probe = gpio_mockup_probe, }; drivers/gpio/gpio-mockup.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpio/gpio-mockup.c b/drivers/gpio/gpio-mockup.c index 67ed4f238d43..28b757d34046 100644 --- a/drivers/gpio/gpio-mockup.c +++ b/drivers/gpio/gpio-mockup.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -460,9 +461,16 @@ static int gpio_mockup_probe(struct platform_device *pdev) return 0; } +static const struct of_device_id gpio_mockup_of_match[] = { + { .compatible = "gpio-mockup", }, + {}, +}; +MODULE_DEVICE_TABLE(of, gpio_mockup_of_match); + static struct platform_driver gpio_mockup_driver = { .driver = { .name = "gpio-mockup", + .of_match_table = gpio_mockup_of_match, }, .probe = gpio_mockup_probe, }; @@ -556,8 +564,7 @@ static int __init gpio_mockup_init(void) { int i, num_chips, err; - if ((gpio_mockup_num_ranges < 2) || - (gpio_mockup_num_ranges % 2) || + if ((gpio_mockup_num_ranges % 2) || (gpio_mockup_num_ranges > GPIO_MOCKUP_MAX_RANGES)) return -EINVAL; -- 2.28.0
Re: [PATCH V3 2/4] misc: vop: do not allocate and reassign the used ring
On Wed, Oct 28, 2020 at 04:50:36PM +0100, Arnd Bergmann wrote: > I think we should try to do something on top of the PCIe endpoint subsystem > to make it work across arbitrary combinations of host and device > implementations, > and provide a superset of what the MIC driver, (out-of-tree) Bluefield > endpoint > driver, and the NTB subsystem as well as a couple of others used to do, > each of them tunneling block/network/serial/... over a PCIe link of some > sort, usually with virtio. VOP is not PCIe-specific (as demonstrated by the vop-loopback patches I posted a while ago [1]), and it would be a shame for a replacement to be tied to the PCIe endpoint subsystem. There are many SOCs out there which have multiple Linux-capable processors without cache-coherency between them. VOP is (or should I say was since I guess it's being deleted) the closest we have in mainline to easily get generic virtio (and not just rpmsg) running between these kind of Linux instances. If a new replacement framework were to be PCIe-exclusive then we'd have to invent one more framework for non-PCIe links to do pretty much the same thing. [1] https://lore.kernel.org/lkml/20190403104746.16063-1-vincent.whitchu...@axis.com/