Re: Re: BUG: Bad rss-counter state (4)
(trimmed off the batman/bpf Ccs)

On 2020-05-18 14:28, syzbot wrote:
> syzbot has bisected this bug to:
>
> commit 0d8dd67be013727ae57645ecd3ea2c36365d7da8
> Author: Song Liu
> Date: Wed Dec 6 22:45:14 2017 +
>
>     perf/headers: Sync new perf_event.h with the tools/include/uapi version
>
> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13240a0210
> start commit: ac935d22 Add linux-next specific files for 20200415
> git tree: linux-next
> final crash: https://syzkaller.appspot.com/x/report.txt?x=10a40a0210
> console output: https://syzkaller.appspot.com/x/log.txt?x=17240a0210
> kernel config: https://syzkaller.appspot.com/x/.config?x=bc498783097e9019
> dashboard link: https://syzkaller.appspot.com/bug?extid=347e2331d03d06ab0224
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12d18e6e10
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=104170d610
>
> Reported-by: syzbot+347e2331d03d06ab0...@syzkaller.appspotmail.com
> Fixes: 0d8dd67be013 ("perf/headers: Sync new perf_event.h with the tools/include/uapi version")
>
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

FWIW here's a nicer reproducer that more clearly shows what's really going on:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

// for compat with older perf headers
#define uprobe_path config1

int main(int argc, char *argv[])
{
	// Find out what type id we need for uprobes
	int perf_type_pmu_uprobe;
	{
		FILE *fp = fopen("/sys/bus/event_source/devices/uprobe/type", "r");
		fscanf(fp, "%d", &perf_type_pmu_uprobe);
		fclose(fp);
	}

	const char *filename = "./bus";

	int fd = open(filename, O_RDWR|O_CREAT, 0600);
	write(fd, "x", 1);

	void *addr = mmap(NULL, 4096,
		PROT_READ | PROT_WRITE | PROT_EXEC,
		MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	// Register a perf uprobe on "./bus"
	struct perf_event_attr attr = {};
	attr.type = perf_type_pmu_uprobe;
	attr.uprobe_path = (unsigned long) filename;
	syscall(__NR_perf_event_open, &attr, 0, 0, -1, 0);

	void *addr2 = mmap(NULL, 2 * 4096,
		PROT_NONE, MAP_PRIVATE, fd, 0);
	void *addr3 = mremap((void *) addr2, 4096, 2 * 4096, MREMAP_MAYMOVE);
	mremap(addr3, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, (void *) addr2);

	return 0;
}

This instantly reproduces this output on current mainline for me:

BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

AFAICT the worst thing about this bug is that it shows up on anything that parses logs for "BUG"; it doesn't seem to have any ill effects other than messing up the rss counters. Although maybe it points to some underlying problem in uprobes/mm interaction.

If I enable the "rss_stat" tracepoint and set ftrace_dump_on_oops=1, I see a trace roughly like this:

perf_event_open()

mmap(2 * 4096):
 - uprobe_mmap()
 - install_breakpoint()
 - __replace_page()
 - rss_stat: mm_id=0 curr=1 member=1 size=53248B

mremap(4096 => 2 * 4096):
 - install_breakpoint()
 - __replace_page()
 - rss_stat: mm_id=0 curr=1 member=1 size=57344B
 - unmap_page_range()
 - rss_stat: mm_id=0 curr=1 member=1 size=53248B

mremap(4096 => 4096):
 - move_vma()
 - copy_vma()
 - vma_merge()
 - install_breakpoint()
 - __replace_page()
 - rss_stat: mm_id=0 curr=1 member=1 size=57344B
 - do_munmap()
 - install_breakpoint():
 - __replace_page()
 - rss_stat: mm_id=0 curr=1 member=1 size=61440B
 - unmap_page_range():
 - rss_stat: mm_id=0 curr=1 member=1 size=57344B

exit()
 - exit_mmap()
 - unmap_page_range():
 - rss_stat: mm_id=0 curr=0 member=1 size=45056B
 - unmap_page_range():
 - rss_stat: mm_id=0 curr=0 member=1 size=32768B
 - unmap_page_range():
 - rss_stat: mm_id=0 curr=0 member=1 size=20480B
 - unmap_page_range():
 - rss_stat: mm_id=0 curr=0 member=1 size=16384B
 - unmap_page_range():
 - rss_stat: mm_id=0 curr=0 member=1 size=4096B

What strikes me here is that at the end of the first mremap(), we have size 53248B (13 pages), but at the end of the second mremap(), we have size 57344B (14 pages), even though the second mremap() is only moving 1 page. So the second mremap() is bumping it up twice, but then only bumping down once.
Vegard
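The page arithmetic in the message above can be checked with a small userspace sketch. It assumes 4 KiB pages (as the reproducer's hard-coded 4096 suggests); the names SKETCH_PAGE_SIZE, rss_bytes_to_pages() and rss_count_moves() are made up for illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages, matching the 4096 used in the reproducer. */
#define SKETCH_PAGE_SIZE 4096L

/* How many times a counter went up and down across a run of samples. */
struct rss_moves {
	int ups;
	int downs;
};

/* Convert an rss_stat "size=...B" value into a page count. */
static long rss_bytes_to_pages(long bytes)
{
	assert(bytes >= 0 && bytes % SKETCH_PAGE_SIZE == 0);
	return bytes / SKETCH_PAGE_SIZE;
}

/*
 * Walk the rss_stat sizes seen during one syscall, starting from the
 * size observed just before it, and count increments and decrements.
 */
static struct rss_moves rss_count_moves(long start_bytes,
					const long *samples, size_t n)
{
	struct rss_moves m = { 0, 0 };
	long prev = start_bytes;

	for (size_t i = 0; i < n; i++) {
		if (samples[i] > prev)
			m.ups++;
		else if (samples[i] < prev)
			m.downs++;
		prev = samples[i];
	}
	return m;
}
```

Fed with the sizes from the second mremap() in the trace (start 53248, then 57344, 61440, 57344), this counts two increments and one decrement, which is the one-page imbalance described in the message.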
Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018
(trimmed CCs) On 2021-03-23 19:32, Kirill A. Shutemov wrote: On Fri, Jun 12, 2020 at 02:26:58PM +0200, Rafael J. Wysocki wrote: On 6/11/2020 3:40 AM, Kaneda, Erik wrote: -Original Message- From: Vegard Nossum Sent: Friday, June 5, 2020 7:45 AM To: Vlastimil Babka ; Rafael J. Wysocki ; Moore, Robert ; Kaneda, Erik Cc: Kees Cook ; Wysocki, Rafael J ; Christoph Lameter ; Andrew Morton ; Marco Elver ; Waiman Long ; LKML ; Linux MM ; ACPI Devel Maling List ; Len Brown ; Steven Rostedt Subject: Re: slub freelist issue / BUG: unable to handle page fault for address: 3ffe0018 On 2020-06-05 16:08, Vlastimil Babka wrote: On 6/5/20 3:12 PM, Rafael J. Wysocki wrote: On Fri, Jun 5, 2020 at 2:48 PM Vegard Nossum wrote: On 2020-06-05 11:36, Vegard Nossum wrote: On 2020-06-05 11:11, Vlastimil Babka wrote: On 6/4/20 8:46 PM, Vlastimil Babka wrote: On 6/4/20 7:57 PM, Kees Cook wrote: On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote: On 2020-06-04 19:18, Vlastimil Babka wrote: On 6/4/20 7:14 PM, Vegard Nossum wrote: Hi all, I ran into a boot problem with latest linus/master (6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this: Hi, what's the .config you use? Pretty much x86_64 defconfig minus a few options (PCI, USB, ...) Oh yes indeed. I immediately crash in the same way with this config. I'll start digging... (defconfig finishes boot) This is funny, booting with slub_debug=F results in: I'm not sure if it's ACPI or ftrace wrong here, but looks like the changed free pointer offset merely exposes a bug in something else. So, with Kees' patch reverted, booting with slub_debug=F (or even more specific slub_debug=F,ftrace_event_field) also hits this bug below. I wanted to bisect it, but v5.7 was also bad, and also v5.6. Didn't try further in history. So it's not new at all, and likely very specific to your config+QEMU? (and related to the ACPI error messages that precede it?). I see it too, but not on v5.0. I can bisect it. 
commit 67a72420a326b45514deb3f212085fb2cd1595b5 Author: Bob Moore Date: Fri Aug 16 14:43:21 2019 -0700 ACPICA: Increase total number of possible Owner IDs ACPICA commit 1f1652dad88b9d767767bc1f7eb4f7d99e6b5324 From 255 to 4095 possible IDs. Link: https://github.com/acpica/acpica/commit/1f1652da Reported-by: Hedi Berriche Signed-off-by: Bob Moore Signed-off-by: Erik Schmauss Signed-off-by: Rafael J. Wysocki Bob, Erik, did we miss something in that patch? Maybe the patch just changes layout in a way that exposes the bug. Anyway the "ftrace_event_field" cache is not really involved, this is just because of slab merging. After adding "slub_nomerge" to "slub_debug=F", it starts making more sense, as the cache becomes Acpi-Namespace [0.140408] [ cut here ] [0.140837] cache_from_obj: Wrong slab cache. Acpi-Namespace but object is from kmalloc-64 [0.141406] WARNING: CPU: 0 PID: 1 at mm/slab.h:524 kmem_cache_free+0x1d3/0x250 [0.142105] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #45 [0.142393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 [0.142393] RIP: 0010:kmem_cache_free+0x1d3/0x250 [0.142393] Code: 18 4d 85 ed 0f 84 10 ff ff ff 4c 39 ed 74 2f 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 a1 ac 48 c7 c7 00 c2 b0 ac e8 b1 cc eb ff <0f> 0b 48 89 de 4c 89 ef e8 10 d7 ff ff 48 8b 15 59 36 9b 00 4c 89 [0.142393] RSP: 0018:b39cc0013dc0 EFLAGS: 00010282 [0.142393] RAX: RBX: 937287409e00 RCX: [0.142393] RDX: 0001 RSI: 0092 RDI: acfdd32c [0.142393] RBP: 93728742ef00 R08: b39cc0013c7d R09: 00fc [0.142393] R10: b39cc0013c78 R11: b39cc0013c7d R12: 937307409e00 [0.142393] R13: 937287401d00 R14: R15: [0.142393] FS: () GS:937287a0() knlGS: [0.142393] CS: 0010 DS: ES: CR0: 80050033 [0.142393] CR2: CR3: 03a0a000 CR4: 003406f0 [0.142393] Call Trace: [0.142393] acpi_os_release_object+0x5/0x10 [0.142393] acpi_ns_delete_children+0x46/0x59 [0.142393] acpi_ns_delete_namespace_subtree+0x5c/0x79 [0.142393] ? 
acpi_sleep_proc_init+0x1f/0x1f [0.142393] acpi_ns_terminate+0xc/0x31 [0.142393] acpi_ut_subsystem_shutdown+0x45/0xa3 [0.142393] ? acpi_sleep_proc_init+0x1f/0x1f [0.142393] acpi_terminate+0x5/0xf [0.142393] acpi_init+0x27b/0x308 [0.142393] ? video_setup+0x79/0x79 [0.142393] do_one_initcall+0x7b/0x160 [0.142393] kernel_init_freeable+0x190/0x1f2 [0.142393] ? rest_init+0x9a/0x9a [0.142393] kernel_init+0x5/0xf6 [0.142393] ret_from_fork+0x22/0x30 [0.142
Re: Minor RST rant
On 2020-08-06 08:48, Christoph Hellwig wrote:
> On Wed, Aug 05, 2020 at 05:12:30PM +0200, pet...@infradead.org wrote:
>> On Wed, Aug 05, 2020 at 04:49:50PM +0200, Vegard Nossum wrote:
>>> FWIW, I *really* like how the extra markup renders in a browser, and I don't think I'm the only one.
>>
>> The thing is, I write code in a text editor, not a browser. When a header file says:
>>
>>     read Documentation/foo
>>
>> I do 'gf' and that file gets opened in a buffer. Needing a browser is a fail.
>
> And that is my main problem with all the RST craze. It optimizes for shiny display in a browser, but completely messed up the typical developer flow.

If you are using vim, you can put this in ~/.vim/after/syntax/rst.vim:

    syn region rstInlineLiteral matchgroup=Special start="``" end="``" concealends
    syn region rstEmphasis matchgroup=Special start="\*\*" end="\*\*" concealends
    setlocal conceallevel=2

This will hide the ``foo`` and **bar** markup on lines that are not currently under the cursor.

Vegard
Re: Re: Minor RST rant
On 2020-07-29 14:44, pet...@infradead.org wrote:
> On Sat, Jul 25, 2020 at 09:46:55AM +1000, NeilBrown wrote:
>> Constant names stand out least effectively by themselves. In kernel-doc comments they are preceded by a '%'. Would that make the text more readable for you? Does our doc infrastructure honour that in .rst documents?
>
> It does not. It also still reads really weird.
>
> And for some reason firefox chokes on the HTML file I tried it with, and make htmldocs takes for bloody ever.
>
> Give me a plain text file, please. All this modern crap just doesn't work.

FWIW, I *really* like how the extra markup renders in a browser, and I don't think I'm the only one.

If you want to read .rst files in a terminal, I would suggest using something like this:

    $ pandoc -t plain Documentation/core-api/atomic_ops.rst | less

It looks pretty readable to me, things like lists and code are properly indented, the only thing it's missing as far as I'm concerned is marking headings more prominently.

The new online documentation is a great way to attract more people to kernel development (and just spread typical kernel knowledge to non-Linux/non-kernel programmers). The old Documentation/ was kind of hidden away and you only really came across it by accident if you did a treewide 'git grep'; the new online docs, on the other hand, are a pleasure to browse and explore and frequently show up in google searches for random kernel-related topics.

Vegard
Re: [merged] exec-open-code-copy_string_kernel.patch removed from -mm tree
On 2020-06-05 22:19, a...@linux-foundation.org wrote:
> The patch titled
>      Subject: exec: open code copy_string_kernel
> has been removed from the -mm tree. Its filename was
>      exec-open-code-copy_string_kernel.patch
>
> This patch was dropped because it was merged into mainline or a subsystem tree
>
> --
> From: Christoph Hellwig
> Subject: exec: open code copy_string_kernel
>
> Currently copy_string_kernel is just a wrapper around copy_strings that simplifies the calling conventions and uses set_fs to allow passing a kernel pointer. But due to the fact that we only need to handle a single kernel argument pointer, the logic can be significantly simplified while getting rid of the set_fs.
>
> Link: http://lkml.kernel.org/r/20200501104105.2621149-3-...@lst.de
> Signed-off-by: Christoph Hellwig
> Cc: Alexander Viro
> Signed-off-by: Andrew Morton
> ---
>
>  fs/exec.c | 45 +++--
>  1 file changed, 35 insertions(+), 10 deletions(-)
>
> --- a/fs/exec.c~exec-open-code-copy_string_kernel
> +++ a/fs/exec.c
> @@ -592,17 +592,42 @@ out:
>   */
>  int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
>  {
> -	int r;
> -	mm_segment_t oldfs = get_fs();
> -	struct user_arg_ptr argv = {
> -		.ptr.native = (const char __user *const __user *)&arg,
> -	};
> -
> -	set_fs(KERNEL_DS);
> -	r = copy_strings(1, argv, bprm);
> -	set_fs(oldfs);
> +	int len = strnlen(arg, MAX_ARG_STRLEN) + 1 /* terminating NUL */;
> +	unsigned long pos = bprm->p;
>
> -	return r;
> +	if (len == 0)
> +		return -EFAULT;

Just a quick question, how can len ever be 0 here when len was set to strnlen() + 1? Should the test be different?

The old version (i.e. copy_strings()) seems to return -EFAULT when strnlen() returns 0.

Vegard

> +	if (!valid_arg_len(bprm, len))
> +		return -E2BIG;
> +
> +	/* We're going to work our way backwards. */
> +	arg += len;
> +	bprm->p -= len;
> +	if (IS_ENABLED(CONFIG_MMU) && bprm->p < bprm->argmin)
> +		return -E2BIG;
> +
> +	while (len > 0) {
> +		unsigned int bytes_to_copy = min_t(unsigned int, len,
> +				min_not_zero(offset_in_page(pos), PAGE_SIZE));
> +		struct page *page;
> +		char *kaddr;
> +
> +		pos -= bytes_to_copy;
> +		arg -= bytes_to_copy;
> +		len -= bytes_to_copy;
> +
> +		page = get_arg_page(bprm, pos, 1);
> +		if (!page)
> +			return -E2BIG;
> +		kaddr = kmap_atomic(page);
> +		flush_arg_page(bprm, pos & PAGE_MASK, page);
> +		memcpy(kaddr + offset_in_page(pos), arg, bytes_to_copy);
> +		flush_kernel_dcache_page(page);
> +		kunmap_atomic(kaddr);
> +		put_arg_page(page);
> +	}
> +
> +	return 0;
>  }
>  EXPORT_SYMBOL(copy_string_kernel);
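The question about "len == 0" comes down to a simple property of strnlen(): it returns a value between 0 and its bound, so adding 1 gives at least 1. A minimal userspace sketch of that property follows. The helper name arg_len_with_nul() is hypothetical, it uses size_t where the quoted hunk uses int, and the bound is passed as a parameter rather than taken from the kernel's MAX_ARG_STRLEN.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Same shape as the length computation in the quoted hunk:
 * bounded string length plus one for the terminating NUL.
 * strnlen() never returns more than max_len, so the result
 * lies in [1, max_len + 1] and cannot be 0.
 */
static size_t arg_len_with_nul(const char *arg, size_t max_len)
{
	size_t len = strnlen(arg, max_len) + 1;

	assert(len >= 1);
	return len;
}
```

Under these assumptions even an empty string yields 1, so a check for 0 after the addition can never fire; a check meant to catch an empty or unreadable argument would have to look at the value before the "+ 1".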
Re: WARNING: CPU: 1 PID: 52 at mm/page_alloc.c:4826 __alloc_pages_nodemask (Re: [PATCH 5/5] sysctl: pass kernel pointers to ->proc_handler)
On 2020-06-08 08:51, Christoph Hellwig wrote:
> On Thu, Jun 04, 2020 at 10:22:21PM +0200, Vegard Nossum wrote:
>> It's easy to reproduce by just doing
>>
>>     read(open("/proc/sys/vm/swappiness", O_RDONLY), 0, 512UL * 1024 * 1024 * 1024);
>>
>> or so. Reverting the commit fixes the issue for me.
>
> Yes, doing giant allocations will fail and trace. We have two options here that both seem sensible:
>
>  - truncate sysctl calls to some sensible length
>  - (optionally) use vmalloc
>
> Is this a real application or just a test case trying to do the stupidmost possible thing?

Just a test case.

Allowing the kernel to allocate an unbounded amount of memory on behalf of userspace is an easy DOS.

All the length checks were already in there, e.g.

static int cmm_timeout_handler(struct ctl_table *ctl, int write,
			       void __user *buffer, size_t *lenp, loff_t *ppos)
{
	char buf[64], *p;
	[...]
	len = min(*lenp, sizeof(buf));
	if (copy_from_user(buf, buffer, len))
		return -EFAULT;

Vegard
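The "truncate to some sensible length" option discussed above can be sketched in userspace C as a clamp applied before allocating. This is only an illustration of the idea: the helper names and the 4096-byte cap are assumptions, not the kernel's actual limit or the fix that was eventually merged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cap on one request; the value is an assumption. */
#define SKETCH_SYSCTL_MAX_LEN 4096

/* Limit a caller-supplied length to at most cap bytes. */
static size_t clamp_request(size_t requested, size_t cap)
{
	return requested < cap ? requested : cap;
}

/*
 * Allocate a zeroed buffer for a request, never larger than cap.
 * *count is updated to the size actually granted, so the caller
 * copies back no more than that. Returns NULL on allocation failure.
 */
static void *alloc_bounded(size_t *count, size_t cap)
{
	assert(count != NULL);
	*count = clamp_request(*count, cap);
	/* Avoid a zero-sized allocation, whose result is implementation-defined. */
	return calloc(1, *count ? *count : 1);
}
```

With a clamp like this, a huge read length such as the one in the reproducer results in a small allocation and a short read instead of a failed high-order allocation.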
Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018
On 2020-06-05 17:44, Kees Cook wrote:
> On Fri, Jun 05, 2020 at 04:44:51PM +0200, Vegard Nossum wrote:
>> That's it :-) This fixes it for me:
>>
>> diff --git a/drivers/acpi/acpica/nsaccess.c b/drivers/acpi/acpica/nsaccess.c
>> index 2566e2d4c7803..b76bbab917941 100644
>> --- a/drivers/acpi/acpica/nsaccess.c
>> +++ b/drivers/acpi/acpica/nsaccess.c
>> @@ -98,14 +98,12 @@ acpi_status acpi_ns_root_initialize(void)
>>  		 * predefined names are at the root level. It is much easier to
>>  		 * just create and link the new node(s) here.
>>  		 */
>> -		new_node =
>> -		    ACPI_ALLOCATE_ZEROED(sizeof(struct acpi_namespace_node));
>> +		new_node = acpi_ns_create_node(*ACPI_CAST_PTR (u32, init_val->name));
>>  		if (!new_node) {
>>  			status = AE_NO_MEMORY;
>>  			goto unlock_and_exit;
>>  		}
>>
>> -		ACPI_COPY_NAMESEG(new_node->name.ascii, init_val->name);
>>  		new_node->descriptor_type = ACPI_DESC_TYPE_NAMED;
>>  		new_node->type = init_val->type;
>
> I'm a bit confused by the internals of acpi_ns_create_node(). It can still end up calling ACPI_ALLOCATE_ZEROED() via acpi_os_acquire_object(). Is this fix correct?

include/acpi/platform/aclinuxex.h:static inline void *acpi_os_acquire_object(acpi_cache_t * cache)
include/acpi/platform/aclinuxex.h-{
include/acpi/platform/aclinuxex.h-	return kmem_cache_zalloc(cache,
include/acpi/platform/aclinuxex.h-				 irqs_disabled()? GFP_ATOMIC : GFP_KERNEL);
include/acpi/platform/aclinuxex.h-}

No comment.

Vegard
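The "Wrong slab cache. Acpi-Namespace but object is from kmalloc-64" warning elsewhere in this thread describes an object being released to a cache other than the one it was allocated from, which is presumably why switching the allocation to acpi_ns_create_node() helps. The following is a toy userspace model of that kind of ownership check. Nothing in it is a kernel or ACPICA API; every name is invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a slab cache; identified only by its address. */
struct toy_cache {
	const char *name;
};

/* Each toy object records which cache handed it out. */
struct toy_obj {
	const struct toy_cache *owner;
	char payload[48];
};

/* Allocate a zeroed object and tag it with its owning cache. */
static struct toy_obj *toy_alloc(const struct toy_cache *c)
{
	struct toy_obj *o = calloc(1, sizeof(*o));

	if (o)
		o->owner = c;
	return o;
}

/*
 * Release an object through cache c.
 * Returns 0 if c really owns the object, -1 on a mismatch
 * (the situation the kernel warning reports). The memory is
 * freed either way to keep the sketch leak-free.
 */
static int toy_free(const struct toy_cache *c, struct toy_obj *o)
{
	int ret;

	assert(o != NULL);
	ret = (o->owner == c) ? 0 : -1;
	free(o);
	return ret;
}
```

In this model, allocating from a generic cache and freeing through a dedicated one returns -1, while allocating and freeing through the same cache returns 0, which mirrors the before and after of the quoted diff.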
Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018
On 2020-06-05 16:08, Vlastimil Babka wrote: On 6/5/20 3:12 PM, Rafael J. Wysocki wrote: On Fri, Jun 5, 2020 at 2:48 PM Vegard Nossum wrote: On 2020-06-05 11:36, Vegard Nossum wrote: On 2020-06-05 11:11, Vlastimil Babka wrote: On 6/4/20 8:46 PM, Vlastimil Babka wrote: On 6/4/20 7:57 PM, Kees Cook wrote: On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote: On 2020-06-04 19:18, Vlastimil Babka wrote: On 6/4/20 7:14 PM, Vegard Nossum wrote: Hi all, I ran into a boot problem with latest linus/master (6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this: Hi, what's the .config you use? Pretty much x86_64 defconfig minus a few options (PCI, USB, ...) Oh yes indeed. I immediately crash in the same way with this config. I'll start digging... (defconfig finishes boot) This is funny, booting with slub_debug=F results in: I'm not sure if it's ACPI or ftrace wrong here, but looks like the changed free pointer offset merely exposes a bug in something else. So, with Kees' patch reverted, booting with slub_debug=F (or even more specific slub_debug=F,ftrace_event_field) also hits this bug below. I wanted to bisect it, but v5.7 was also bad, and also v5.6. Didn't try further in history. So it's not new at all, and likely very specific to your config+QEMU? (and related to the ACPI error messages that precede it?). I see it too, but not on v5.0. I can bisect it. commit 67a72420a326b45514deb3f212085fb2cd1595b5 Author: Bob Moore Date: Fri Aug 16 14:43:21 2019 -0700 ACPICA: Increase total number of possible Owner IDs ACPICA commit 1f1652dad88b9d767767bc1f7eb4f7d99e6b5324 From 255 to 4095 possible IDs. Link: https://github.com/acpica/acpica/commit/1f1652da Reported-by: Hedi Berriche Signed-off-by: Bob Moore Signed-off-by: Erik Schmauss Signed-off-by: Rafael J. Wysocki Bob, Erik, did we miss something in that patch? Maybe the patch just changes layout in a way that exposes the bug. 
Anyway the "ftrace_event_field" cache is not really involved, this is just because of slab merging. After adding "slub_nomerge" to "slub_debug=F", it starts making more sense, as the cache becomes Acpi-Namespace [0.140408] [ cut here ] [0.140837] cache_from_obj: Wrong slab cache. Acpi-Namespace but object is from kmalloc-64 [0.141406] WARNING: CPU: 0 PID: 1 at mm/slab.h:524 kmem_cache_free+0x1d3/0x250 [0.142105] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #45 [0.142393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 [0.142393] RIP: 0010:kmem_cache_free+0x1d3/0x250 [0.142393] Code: 18 4d 85 ed 0f 84 10 ff ff ff 4c 39 ed 74 2f 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 a1 ac 48 c7 c7 00 c2 b0 ac e8 b1 cc eb ff <0f> 0b 48 89 de 4c 89 ef e8 10 d7 ff ff 48 8b 15 59 36 9b 00 4c 89 [0.142393] RSP: 0018:b39cc0013dc0 EFLAGS: 00010282 [0.142393] RAX: RBX: 937287409e00 RCX: [0.142393] RDX: 0001 RSI: 0092 RDI: acfdd32c [0.142393] RBP: 93728742ef00 R08: b39cc0013c7d R09: 00fc [0.142393] R10: b39cc0013c78 R11: b39cc0013c7d R12: 937307409e00 [0.142393] R13: 937287401d00 R14: R15: [0.142393] FS: () GS:937287a0() knlGS: [0.142393] CS: 0010 DS: ES: CR0: 80050033 [0.142393] CR2: CR3: 03a0a000 CR4: 003406f0 [0.142393] Call Trace: [0.142393] acpi_os_release_object+0x5/0x10 [0.142393] acpi_ns_delete_children+0x46/0x59 [0.142393] acpi_ns_delete_namespace_subtree+0x5c/0x79 [0.142393] ? acpi_sleep_proc_init+0x1f/0x1f [0.142393] acpi_ns_terminate+0xc/0x31 [0.142393] acpi_ut_subsystem_shutdown+0x45/0xa3 [0.142393] ? acpi_sleep_proc_init+0x1f/0x1f [0.142393] acpi_terminate+0x5/0xf [0.142393] acpi_init+0x27b/0x308 [0.142393] ? video_setup+0x79/0x79 [0.142393] do_one_initcall+0x7b/0x160 [0.142393] kernel_init_freeable+0x190/0x1f2 [0.142393] ? 
rest_init+0x9a/0x9a [0.142393] kernel_init+0x5/0xf6 [0.142393] ret_from_fork+0x22/0x30 [0.142393] ---[ end trace 3539f236ef812ba1 ]--- [0.142396] [ cut here ] I've also changed the warning so it's not printed just once, and also prints tracking info (see the hunk at the end of my mail, I'll turn this to a proper patch later). With "slub_debug=FU slub_nomerge" there are now multiple warnings, but they all look the same: [0.143815] [ cut here ] [0.144131] cache_from_obj: Wrong slab cache. Acpi-Namespace but object is from kmalloc-64 [0.144929] WARNING: CPU: 0 PID: 1 at mm/slab.h:524 kmem_cache_free+0x1d3/0x250 [0.145129] CPU: 0
Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018
On 2020-06-05 11:36, Vegard Nossum wrote: On 2020-06-05 11:11, Vlastimil Babka wrote: On 6/4/20 8:46 PM, Vlastimil Babka wrote: On 6/4/20 7:57 PM, Kees Cook wrote: On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote: On 2020-06-04 19:18, Vlastimil Babka wrote: On 6/4/20 7:14 PM, Vegard Nossum wrote: Hi all, I ran into a boot problem with latest linus/master (6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this: Hi, what's the .config you use? Pretty much x86_64 defconfig minus a few options (PCI, USB, ...) Oh yes indeed. I immediately crash in the same way with this config. I'll start digging... (defconfig finishes boot) This is funny, booting with slub_debug=F results in: I'm not sure if it's ACPI or ftrace wrong here, but looks like the changed free pointer offset merely exposes a bug in something else. So, with Kees' patch reverted, booting with slub_debug=F (or even more specific slub_debug=F,ftrace_event_field) also hits this bug below. I wanted to bisect it, but v5.7 was also bad, and also v5.6. Didn't try further in history. So it's not new at all, and likely very specific to your config+QEMU? (and related to the ACPI error messages that precede it?). I see it too, but not on v5.0. I can bisect it. commit 67a72420a326b45514deb3f212085fb2cd1595b5 Author: Bob Moore Date: Fri Aug 16 14:43:21 2019 -0700 ACPICA: Increase total number of possible Owner IDs ACPICA commit 1f1652dad88b9d767767bc1f7eb4f7d99e6b5324 From 255 to 4095 possible IDs. Link: https://github.com/acpica/acpica/commit/1f1652da Reported-by: Hedi Berriche Signed-off-by: Bob Moore Signed-off-by: Erik Schmauss Signed-off-by: Rafael J. Wysocki Vegard This would mean acpi_os_release_object() calling kmem_cache_free(ftrace_event_field, x) where x is actually from kmalloc-64? Both parts of that sounds wrong. 
Thread starts here: https://lore.kernel.org/linux-mm/4dc93ff8-f86e-f4c9-ebeb-6d3153a78...@oracle.com/ [ 0.144386] ACPI: Added _OSI(Module Device) [ 0.144496] ACPI: Added _OSI(Processor Device) [ 0.144956] ACPI: Added _OSI(3.0 _SCP Extensions) [ 0.145432] ACPI: Added _OSI(Processor Aggregator Device) [ 0.145501] ACPI: Added _OSI(Linux-Dell-Video) [ 0.145951] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio) [ 0.146522] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics) [ 0.147070] ACPI Error: AE_BAD_PARAMETER, During Region initialization (20200430/tbxfload-52) [ 0.147494] ACPI: Unable to load the System Description Tables [ 0.148104] ACPI Error: Could not remove SCI handler (20200430/evmisc-251) [ 0.148507] [ cut here ] [ 0.148985] cache_from_obj: Wrong slab cache. ftrace_event_field but object is from kmalloc-64 [ 0.149502] WARNING: CPU: 0 PID: 1 at mm/slab.h:523 kmem_cache_free+0x248/0x260 [ 0.150254] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #43 [ 0.150490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 [ 0.150490] RIP: 0010:kmem_cache_free+0x248/0x260 [ 0.150490] Code: ff 0f 0b e9 9d fe ff ff 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 c1 a4 48 c7 c7 f0 c1 d0 a4 c6 05 9f 05 b1 00 01 e8 bc cc eb ff <0f> 0b 48 8b 15 5f 36 9b 00 4c 89 ed e9 d6 fd ff ff 0f 1f 80 00 00 [ 0.150490] RSP: 0018:b4dac0013dc0 EFLAGS: 00010282 [ 0.150490] RAX: RBX: a38a07409e00 RCX: [ 0.150490] RDX: 0001 RSI: 0092 RDI: a51dd32c [ 0.150490] RBP: a38a07403900 R08: b4dac0013c7d R09: 00eb [ 0.150490] R10: b4dac0013c78 R11: b4dac0013c7d R12: a38a87409e00 [ 0.150490] R13: a38a07401d00 R14: R15: [ 0.150490] FS: () GS:a38a07a0() knlGS: [ 0.150490] CS: 0010 DS: ES: CR0: 80050033 [ 0.150490] CR2: CR3: 0560a000 CR4: 003406f0 [ 0.150490] Call Trace: [ 0.150490] acpi_os_release_object+0x5/0x10 [ 0.150490] acpi_ns_delete_children+0x46/0x59 [ 0.150490] acpi_ns_delete_namespace_subtree+0x5c/0x79 [ 0.150490] ? 
acpi_sleep_proc_init+0x1f/0x1f [ 0.150490] acpi_ns_terminate+0xc/0x31 [ 0.150490] acpi_ut_subsystem_shutdown+0x45/0xa3 [ 0.150490] ? acpi_sleep_proc_init+0x1f/0x1f [ 0.150490] acpi_terminate+0x5/0xf [ 0.150490] acpi_init+0x27b/0x308 [ 0.150490] ? video_setup+0x79/0x79 [ 0.150490] do_one_initcall+0x7b/0x160 [ 0.150490] kernel_init_freeable+0x190/0x1f2 [ 0.150490] ? rest_init+0x9a/0x9a [ 0.150490] kernel_init+0x5/0xf6 [ 0.150490] ret_from_fork+0x22/0x30 [ 0.150490] ---[ end trace 967e9fbc065d7911 ]---
Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018
On 2020-06-05 11:11, Vlastimil Babka wrote: On 6/4/20 8:46 PM, Vlastimil Babka wrote: On 6/4/20 7:57 PM, Kees Cook wrote: On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote: On 2020-06-04 19:18, Vlastimil Babka wrote: On 6/4/20 7:14 PM, Vegard Nossum wrote: Hi all, I ran into a boot problem with latest linus/master (6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this: Hi, what's the .config you use? Pretty much x86_64 defconfig minus a few options (PCI, USB, ...) Oh yes indeed. I immediately crash in the same way with this config. I'll start digging... (defconfig finishes boot) This is funny, booting with slub_debug=F results in: I'm not sure if it's ACPI or ftrace wrong here, but looks like the changed free pointer offset merely exposes a bug in something else. So, with Kees' patch reverted, booting with slub_debug=F (or even more specific slub_debug=F,ftrace_event_field) also hits this bug below. I wanted to bisect it, but v5.7 was also bad, and also v5.6. Didn't try further in history. So it's not new at all, and likely very specific to your config+QEMU? (and related to the ACPI error messages that precede it?). I see it too, but not on v5.0. I can bisect it. Also, panic_on_warn is apparently a core parameter, it should probably be __setup()... Vegard This would mean acpi_os_release_object() calling kmem_cache_free(ftrace_event_field, x) where x is actually from kmalloc-64? Both parts of that sounds wrong. 
Thread starts here: https://lore.kernel.org/linux-mm/4dc93ff8-f86e-f4c9-ebeb-6d3153a78...@oracle.com/ [0.144386] ACPI: Added _OSI(Module Device) [0.144496] ACPI: Added _OSI(Processor Device) [0.144956] ACPI: Added _OSI(3.0 _SCP Extensions) [0.145432] ACPI: Added _OSI(Processor Aggregator Device) [0.145501] ACPI: Added _OSI(Linux-Dell-Video) [0.145951] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio) [0.146522] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics) [0.147070] ACPI Error: AE_BAD_PARAMETER, During Region initialization (20200430/tbxfload-52) [0.147494] ACPI: Unable to load the System Description Tables [0.148104] ACPI Error: Could not remove SCI handler (20200430/evmisc-251) [0.148507] [ cut here ] [0.148985] cache_from_obj: Wrong slab cache. ftrace_event_field but object is from kmalloc-64 [0.149502] WARNING: CPU: 0 PID: 1 at mm/slab.h:523 kmem_cache_free+0x248/0x260 [0.150254] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #43 [0.150490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 [0.150490] RIP: 0010:kmem_cache_free+0x248/0x260 [0.150490] Code: ff 0f 0b e9 9d fe ff ff 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 c1 a4 48 c7 c7 f0 c1 d0 a4 c6 05 9f 05 b1 00 01 e8 bc cc eb ff <0f> 0b 48 8b 15 5f 36 9b 00 4c 89 ed e9 d6 fd ff ff 0f 1f 80 00 00 [0.150490] RSP: 0018:b4dac0013dc0 EFLAGS: 00010282 [0.150490] RAX: RBX: a38a07409e00 RCX: [0.150490] RDX: 0001 RSI: 0092 RDI: a51dd32c [0.150490] RBP: a38a07403900 R08: b4dac0013c7d R09: 00eb [0.150490] R10: b4dac0013c78 R11: b4dac0013c7d R12: a38a87409e00 [0.150490] R13: a38a07401d00 R14: R15: [0.150490] FS: () GS:a38a07a0() knlGS: [0.150490] CS: 0010 DS: ES: CR0: 80050033 [0.150490] CR2: CR3: 0560a000 CR4: 003406f0 [0.150490] Call Trace: [0.150490] acpi_os_release_object+0x5/0x10 [0.150490] acpi_ns_delete_children+0x46/0x59 [0.150490] acpi_ns_delete_namespace_subtree+0x5c/0x79 [0.150490] ? 
acpi_sleep_proc_init+0x1f/0x1f [0.150490] acpi_ns_terminate+0xc/0x31 [0.150490] acpi_ut_subsystem_shutdown+0x45/0xa3 [0.150490] ? acpi_sleep_proc_init+0x1f/0x1f [0.150490] acpi_terminate+0x5/0xf [0.150490] acpi_init+0x27b/0x308 [0.150490] ? video_setup+0x79/0x79 [0.150490] do_one_initcall+0x7b/0x160 [0.150490] kernel_init_freeable+0x190/0x1f2 [0.150490] ? rest_init+0x9a/0x9a [0.150490] kernel_init+0x5/0xf6 [0.150490] ret_from_fork+0x22/0x30 [0.150490] ---[ end trace 967e9fbc065d7911 ]---
WARNING: CPU: 1 PID: 52 at mm/page_alloc.c:4826 __alloc_pages_nodemask (Re: [PATCH 5/5] sysctl: pass kernel pointers to ->proc_handler)
(Trimmed original Ccs due to outgoing email policy.)

Hi,

On 2020-04-24 08:43, Christoph Hellwig wrote:
> Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer.
>
> As most handlers just pass through the data to one of the common handlers a lot of the changes are mechanical.
>
> Signed-off-by: Christoph Hellwig
> Acked-by: Andrey Ignatov

[snip]

> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index b6f5d459b087d..df2143e05c571 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -539,13 +539,13 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
>  	return err;
>  }
>
> -static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
> +static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
>  		size_t count, loff_t *ppos, int write)
>  {
>  	struct inode *inode = file_inode(filp);
>  	struct ctl_table_header *head = grab_header(inode);
>  	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
> -	void *new_buf = NULL;
> +	void *kbuf;
>  	ssize_t error;
>
>  	if (IS_ERR(head))
> @@ -564,27 +564,38 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
>  	if (!table->proc_handler)
>  		goto out;
>
> -	error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, &count,
> -					   ppos, &new_buf);
> +	if (write) {
> +		kbuf = memdup_user_nul(ubuf, count);
> +		if (IS_ERR(kbuf)) {
> +			error = PTR_ERR(kbuf);
> +			goto out;
> +		}
> +	} else {
> +		error = -ENOMEM;
> +		kbuf = kzalloc(count, GFP_KERNEL);
> +		if (!kbuf)
> +			goto out;
> +	}
> +
> +	error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, &kbuf, &count,
> +					   ppos);
>  	if (error)
> -		goto out;
> +		goto out_free_buf;
>
>  	/* careful: calling conventions are nasty here */
> -	if (new_buf) {
> -		mm_segment_t old_fs;
> -
> -		old_fs = get_fs();
> -		set_fs(KERNEL_DS);
> -		error = table->proc_handler(table, write, (void __user *)new_buf,
> -					    &count, ppos);
> -		set_fs(old_fs);
> -		kfree(new_buf);
> -	} else {
> -		error = table->proc_handler(table, write, buf, &count, ppos);
> +	error = table->proc_handler(table, write, kbuf, &count, ppos);
> +	if (error)
> +		goto out_free_buf;
> +
> +	if (!write) {
> +		error = -EFAULT;
> +		if (copy_to_user(ubuf, kbuf, count))
> +			goto out_free_buf;
>  	}
>
> -	if (!error)
> -		error = count;
> +	error = count;
> +out_free_buf:
> +	kfree(kbuf);
>  out:
>  	sysctl_head_finish(head);

This commit in recent linus/master (32927393dc1ccd60fb2bdc05b9e8e88753761469) causes a regression for me:

[ cut here ]
WARNING: CPU: 1 PID: 52 at mm/page_alloc.c:4826 __alloc_pages_nodemask+0x1cd/0x2a0
CPU: 1 PID: 52 Comm: init Not tainted 5.7.0+ #218
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:__alloc_pages_nodemask+0x1cd/0x2a0
Code: 0f 85 26 ff ff ff 65 48 8b 04 25 00 7d 01 00 48 05 88 07 00 00 41 bd 01 00 00 00 48 89 44 24 08 e9 07 ff ff ff 80 e7 20 75 02 <0f> 0b 45 31 ed eb 98 44 8b 64 24 18 65 8b 05 d0 25 e9 7e 89 c0 48
RSP: 0018:c90e7de0 EFLAGS: 00010246
RAX: RBX: 000400c0 RCX:
RDX: RSI: 0013 RDI: 00040dc0
RBP: 7000 R08: 820276c0 R09:
R10: R11: R12: c90e7f08
R13: 0013 R14: 0013 R15: 81c34ce0
FS: 006cf880() GS:88803ed0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 004a1dab CR3: 3e012002 CR4: 003606e0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 kmalloc_order+0x16/0x70
 kmalloc_order_trace+0x18/0xa0
 proc_sys_call_handler+0xf7/0x170
 vfs_read+0x98/0x120
 ksys_read+0x5a/0xd0
 do_syscall_64+0x43/0x140
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x43f910
Code: 01 f0 ff ff 0f 83 e0 57 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 19 f2 28 00 00 75 14 b8 00 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 b4 57 00 00 c3 48 83 ec 08 e8 4a 39 00 00
RSP: 002b:7fffeaa8 EFLAGS: 0246 ORIG_RAX:
RAX: ffda RBX: 004002c8 RCX:
Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018
On 2020-06-04 19:18, Vlastimil Babka wrote: On 6/4/20 7:14 PM, Vegard Nossum wrote: Hi all, I ran into a boot problem with latest linus/master (6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this: Hi, what's the .config you use? Pretty much x86_64 defconfig minus a few options (PCI, USB, ...) Attached. Vegard # # Automatically generated file; DO NOT EDIT. # Linux/x86 5.7.0 Kernel Configuration # # # Compiler: gcc-9 (Ubuntu 9.2.1-17ubuntu1~16.04) 9.2.1 20191102 # CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=90201 CONFIG_LD_VERSION=22601 CONFIG_CLANG_VERSION=0 CONFIG_CC_CAN_LINK=y CONFIG_CC_HAS_ASM_GOTO=y CONFIG_CC_HAS_ASM_INLINE=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_TABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_BUILD_SALT="" CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_USELIB=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_HARDIRQS_SW_RESEND=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y CONFIG_GENERIC_IRQ_RESERVATION_MODE=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set # end of IRQ subsystem CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_INIT=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y 
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # end of Timers subsystem # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_IRQ_TIME_ACCOUNTING is not set # CONFIG_SCHED_THERMAL_PRESSURE is not set CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_PSI is not set # end of CPU/Task time and stats accounting CONFIG_CPU_ISOLATION=y # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y CONFIG_TREE_SRCU=y CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEGCBLIST=y # end of RCU Subsystem # CONFIG_IKCONFIG is not set # CONFIG_IKHEADERS is not set CONFIG_LOG_BUF_SHIFT=18 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y # # Scheduler features # # CONFIG_UCLAMP_TASK is not set # end of Scheduler features CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y CONFIG_CC_HAS_INT128=y CONFIG_ARCH_SUPPORTS_INT128=y # CONFIG_NUMA_BALANCING is not set CONFIG_CGROUPS=y # CONFIG_MEMCG is not set # CONFIG_BLK_CGROUP is not set CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y # CONFIG_CFS_BANDWIDTH is not set # CONFIG_RT_GROUP_SCHED is not set # CONFIG_CGROUP_PIDS is not set # CONFIG_CGROUP_RDMA is not set CONFIG_CGROUP_FREEZER=y # CONFIG_CGROUP_HUGETLB is not set CONFIG_CPUSETS=y CONFIG_PROC_PID_CPUSET=y # CONFIG_CGROUP_DEVICE is not set CONFIG_CGROUP_CPUACCT=y # CONFIG_CGROUP_PERF is not set # CONFIG_CGROUP_DEBUG is not set CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_TIME_NS=y CONFIG_IPC_NS=y # CONFIG_USER_NS is not set CONFIG_PID_NS=y # 
CONFIG_CHECKPOINT_RESTORE is not set # CONFIG_SCHED_AUTOGROUP is not set # CONFIG_SYSFS_DEPRECATED is not set CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_RD_GZIP=y CONFIG_RD_BZIP2=y CONFIG_RD_LZMA=y CONFIG_RD_XZ=y CONFIG_RD_LZO=y CONFIG_RD_LZ4=y # CONFIG_BOOT_CONFIG is not set CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y CONFIG_HAVE_UID16=y CONFIG_SYSCTL_EXCEPTION_TRACE=y CONFIG_HAVE_PCSPKR_PLATFORM=y # CONFIG_EXPERT is not set CONFIG_UID16=y CONFIG_MULTIUSER=y CONFIG_SGETMASK_SYSCALL=y CONFIG_SYSFS_SYSCALL=y CONFIG_FHANDLE=y CONFIG_POSIX_TIMERS=y CONFIG_PRINTK=y CONFIG_PRINTK_NMI=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_PCSPKR_PLATFORM=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_FUTEX_PI=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_AIO=y CONFIG_IO_URING=y CONFIG_ADVISE_SYSCALLS=y CONFIG_MEMBARRIER=y CONFIG_KAL
slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018
Hi all,

I ran into a boot problem with latest linus/master
(6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this:

hpet0: 3 comparators, 64-bit 100.00 MHz counter
clocksource: Switched to clocksource tsc-early
BUG: unable to handle page fault for address: 3ffe0018
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0
Oops: [#1] SMP PTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0+ #211
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:kmem_cache_alloc+0x70/0x1d0
Code: 00 00 4c 8b 45 00 65 49 8b 50 08 65 4c 03 05 6f cc e7 7e 4d 8b 20 4d 85 e4 0f 84 3d 01 00 00 8b 45 20 48 8b 7d 00 48 8d 4a 01 <49> 8b 1c 04 4c 89 e0 65 48 0f c7 0f 0f 94 c0 84 c0 74 c5 8b 45 20
RSP: :c9013df8 EFLAGS: 00010206
RAX: 0018 RBX: 81c49200 RCX: 0002
RDX: 0001 RSI: 0dc0 RDI: 0002b300
RBP: 88803e403d00 R08: 88803ec2b300 R09: 0001
R10: 0dc0 R11: 0006 R12: 3ffe
R13: 8110a583 R14: 0dc0 R15: 81c49a80
FS: () GS:88803ec0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 3ffe0018 CR3: 01c0a001 CR4: 003606f0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
 __trace_define_field+0x33/0xa0
 event_trace_init+0xeb/0x2b4
 tracer_init_tracefs+0x60/0x195
 ? register_tracer+0x1e7/0x1e7
 do_one_initcall+0x74/0x160
 kernel_init_freeable+0x190/0x1f0
 ? rest_init+0x9a/0x9a
 kernel_init+0x5/0xf6
 ret_from_fork+0x35/0x40
CR2: 3ffe0018
---[ end trace 707efa023f2ee960 ]---
RIP: 0010:kmem_cache_alloc+0x70/0x1d0

Bisection gives me:

commit 3202fa62fb43087387c65bfa9c100feffac74aa6
Author: Kees Cook
Date: Wed Apr 1 21:04:27 2020 -0700

    slub: relocate freelist pointer to middle of object

Reverting these three commits fixes it:

3202fa62fb43087387c65bfa9c100feffac74aa6 slub: relocate freelist pointer to middle of object
89b83f282d8ba380cf2124f88106c57df49c538c slub: avoid redzone when choosing freepointer location
cbfc35a48609ceac978791e3ab9dde0c01f8cb20 mm/slub: fix incorrect interpretation of s->offset

Vegard
Re: [PATCH v10 00/18] Enable FSGSBASE instructions
On 5/10/20 10:09 AM, Vegard Nossum wrote:
> On 4/24/20 1:21 AM, Sasha Levin wrote:
>> Benefits:
>> Currently a user process that wishes to read or write the FS/GS base must
>> make a system call. But recent X86 processors have added new instructions
>> for use in 64-bit mode that allow direct access to the FS and GS segment
>> base addresses. The operating system controls whether applications can
>> use these instructions with a %cr4 control bit.
>>
>> [...]
>
> So FWIW I've done some overnight fuzz testing of this patch set and
> haven't seen any problems. Will try a couple of other kernel configs too.

I spoke a few minutes too soon. Just hit this, if anybody wants to have
a look:

[ 6402.786418] [ cut here ]
[ 6402.787769] WARNING: CPU: 0 PID: 13802 at arch/x86/kernel/traps.c:811 do_debug+0x16c/0x210
[ 6402.790042] CPU: 0 PID: 13802 Comm: init Not tainted 5.7.0-rc4+ #194
[ 6402.791779] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6402.793365] RIP: 0010:do_debug+0x16c/0x210
[ 6402.794496] Code: ef e8 f8 fb 00 00 f6 85 91 00 00 00 02 74 b9 fa 66 66 90 66 66 90 e8 c3 f5 11 00 eb ab f6 85 88 00 00 00 03 0f 85 6e ff ff ff <0f> 0b 80 e4 bf 49 89 84 24 58 0a 00 00 f0 41 80 0c 24 10 48 81 a5
[ 6402.799557] RSP: :fe011f20 EFLAGS: 00010046
[ 6402.800995] RAX: 4002 RBX: RCX:
[ 6402.802959] RDX: RSI: 0003 RDI: 82471e60
[ 6402.804891] RBP: fe011f58 R08: R09: 0005
[ 6402.806836] R10: R11: R12: 88803e739a00
[ 6402.808775] R13: R14: 3ce24000 R15:
[ 6402.810723] FS: 0097a8c0() GS:88803ec0() knlGS:
[ 6402.812933] CS: 0010 DS: ES: CR0: 80050033
[ 6402.814509] CR2: 4010 CR3: 3ce24000 CR4: 06f0
[ 6402.816468] DR0: 0001 DR1: 40006070 DR2: 77ffd000
[ 6402.818406] DR3: DR6: 0ff0 DR7: 03b3062a
[ 6402.820353] Call Trace:
[ 6402.821043] <#DB>
[ 6402.821622] debug+0x37/0x70
[ 6402.822449] RIP: 0010:arch_stack_walk_user+0x79/0x110
[ 6402.823851] Code: b8 f0 ff ff bf be f0 df ff ff 48 0f 44 c6 48 39 d0 0f 82 94 00 00 00 41 83 87 b8 09 00 00 01 66 66 90 0f ae e8 31 c0 48 8b 1a <66> 66 90 85 c0 75 72 66 66 90 0f ae e8 48 8b 72 08 66 66 90 85 c0
[ 6402.828923] RSP: :c90003807d80 EFLAGS: 0046
[ 6402.830346] RAX: RBX: 0040001000bf4800 RCX: 0001
[ 6402.832288] RDX: 40006073 RSI: 400060dd RDI: c90003807db8
[ 6402.834250] RBP: c90003807f58 R08: 0001 R09: 88803e00
[ 6402.836203] R10: 054c R11: 88803d2d955c R12: 88803e739a00
[ 6402.838139] R13: 810f16a0 R14: c90003807db8 R15: 88803e739a00
[ 6402.840083] ? profile_setup.cold+0xa1/0xa1
[ 6402.841235]
[ 6402.841836] stack_trace_save_user+0x8c/0xd4
[ 6402.843045] trace_buffer_unlock_commit_regs+0x122/0x1a0
[ 6402.844501] trace_event_buffer_commit+0x6d/0x240
[ 6402.845799] trace_event_raw_event_preemptirq_template+0x75/0xc0
[ 6402.847441] ? debug+0x53/0x70
[ 6402.848299] ? trace_hardirqs_off_thunk+0x1a/0x33
[ 6402.849593] trace_hardirqs_off_caller+0xa6/0xd0
[ 6402.850862] ? debug+0x4e/0x70
[ 6402.851727] trace_hardirqs_off_thunk+0x1a/0x33
[ 6402.852983] debug+0x53/0x70
[ 6402.853785] RIP: 0033:0x400060dd
[ 6402.854681] Code: 7a 1e 9e 91 de 4c 65 49 be 00 d0 ff f7 ff 7f 00 00 49 bf de a7 b3 e8 d7 21 3c 15 9c 48 81 0c 24 00 01 00 00 9d b8 62 00 00 00 <8e> c0 0f 05 66 8c c8 9c 48 81 24 24 ff fe ff ff 9d 48 89 04 25 40
[ 6402.859689] RSP: 002b:4000aea0 EFLAGS: 0317
[ 6402.861116] RAX: 0062 RBX: 40001000 RCX:
[ 6402.863097] RDX: 40003000 RSI: 40004000 RDI: 40001000
[ 6402.866199] RBP: 40006073 R08: 0001 R09: 0001
[ 6402.868142] R10: ef080df2 R11: 1000 R12: fdff
[ 6402.870083] R13: 654cde919e1e7ab5 R14: 77ffd000 R15: 153c21d7e8b3a7de
[ 6402.872049] ---[ end trace 91a3039d0fd63799 ]---

It might not be related to the patch set, mind.

Vegard
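For anyone trying to read the debug registers in the dump above, here is a minimal C sketch that decodes a DR7 value into its four breakpoint slots, following the architectural bit layout (enable bits in the low byte, R/W and LEN fields from bit 16 upwards). The struct and function names are made up for illustration and are not kernel APIs. It also assumes the DR7 and DR1 values printed in the log are complete, which may not hold since the archive appears to have dropped runs of digits from some registers.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of one of the four x86 debug-register breakpoint slots. */
struct dr7_slot {
	int enabled;		/* L<n> or G<n> enable bit is set */
	unsigned int rw;	/* 0 = execute, 1 = write, 2 = I/O, 3 = read/write */
	unsigned int bytes;	/* watched length in bytes */
};

/* Extract slot n (0..3) from a DR7 value. */
static struct dr7_slot dr7_decode_slot(uint32_t dr7, unsigned int n)
{
	/* LEN encoding: 00 = 1 byte, 01 = 2 bytes, 10 = 8 bytes, 11 = 4 bytes */
	static const unsigned int len_bytes[4] = { 1, 2, 8, 4 };
	struct dr7_slot s;

	assert(n < 4);
	s.enabled = ((dr7 >> (2 * n)) & 0x3) != 0;
	s.rw = (dr7 >> (16 + 4 * n)) & 0x3;
	s.bytes = len_bytes[(dr7 >> (18 + 4 * n)) & 0x3];
	return s;
}

/* Would a read of access_len bytes at access_addr overlap the watched range? */
static int dr7_slot_hits_read(struct dr7_slot s, uint64_t bp_addr,
			      uint64_t access_addr, unsigned int access_len)
{
	/* The low address bits are ignored according to the watched length. */
	uint64_t start = bp_addr & ~(uint64_t)(s.bytes - 1);

	return s.enabled && s.rw == 3 &&
	       access_addr < start + s.bytes &&
	       start < access_addr + access_len;
}
```

With DR7 = 0x03b3062a, slot 1 decodes as an enabled 8-byte read/write watchpoint. If DR1 really is 0x40006070, an 8-byte read at 0x40006073 (the RDX value in the arch_stack_walk_user frame) would overlap it. That is only a reading of the dump, not a confirmed diagnosis.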
Re: [PATCH v10 00/18] Enable FSGSBASE instructions
On 4/24/20 1:21 AM, Sasha Levin wrote:
> Benefits:
> Currently a user process that wishes to read or write the FS/GS base must
> make a system call. But recent X86 processors have added new instructions
> for use in 64-bit mode that allow direct access to the FS and GS segment
> base addresses. The operating system controls whether applications can
> use these instructions with a %cr4 control bit.
>
> In addition to benefits to applications, performance improvements to the
> OS context switch code are possible by making use of these instructions. A
> third party reported out promising performance numbers out of their
> initial benchmarking of the previous version of this patch series [9].
>
> Enablement check:
> The kernel provides information about the enabled state of FSGSBASE to
> applications using the ELF_AUX vector. If the HWCAP2_FSGSBASE bit is set in
> the AUX vector, the kernel has FSGSBASE instructions enabled and
> applications can use them.
>
> Kernel changes:
> Major changes made in the kernel are in context switch, paranoid path, and
> ptrace. In a context switch, a task's FS/GS base will be secured regardless
> of its selector. In the paranoid path, GS base is unconditionally
> overwritten to the kernel GS base on entry and the original GS base is
> restored on exit. Ptrace includes divergence of FS/GS index and base
> values.
>
> Security:
> For mitigating the Spectre v1 SWAPGS issue, LFENCE instructions were added
> on most kernel entries. Those patches are dependent on previous behaviors
> that users couldn't load a kernel address into the GS base. These patches
> change that assumption since the user can load any address into GS base.
> The changes to the kernel entry path in this patch series take account of
> the SWAPGS issue.
>
> Changes from v9:
> - Rebase on top of v5.7-rc1 and re-test.
> - Work around changes in 2fff071d28b5 ("x86/process: Unify
>   copy_thread_tls()").
> - Work around changes in c7ca0b614513 ("Revert "x86/ptrace: Prevent
>   ptrace from clearing the FS/GS selector" and fix the test").
>
> Andi Kleen (2):
>   x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
>   x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
>
> Andy Lutomirski (4):
>   x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
>   x86/entry/64: Clean up paranoid exit
>   x86/fsgsbase/64: Use FSGSBASE in switch_to() if available
>   x86/fsgsbase/64: Enable FSGSBASE on 64bit by default and add a chicken
>     bit
>
> Chang S. Bae (9):
>   x86/ptrace: Prevent ptrace from clearing the FS/GS selector
>   selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base
>     write
>   x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
>   x86/entry/64: Introduce the FIND_PERCPU_BASE macro
>   x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
>   x86/entry/64: Document GSBASE handling in the paranoid path
>   x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
>   x86/fsgsbase/64: Use FSGSBASE instructions on thread copy and ptrace
>   selftests/x86/fsgsbase: Test ptracer-induced GS base write with
>     FSGSBASE
>
> Sasha Levin (1):
>   x86/fsgsbase/64: move save_fsgs to header file
>
> Thomas Gleixner (1):
>   Documentation/x86/64: Add documentation for GS/FS addressing mode
>
> Tony Luck (1):
>   x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation
>
>  .../admin-guide/kernel-parameters.txt   |   2 +
>  Documentation/x86/entry_64.rst          |   9 +
>  Documentation/x86/x86_64/fsgs.rst       | 199 ++
>  Documentation/x86/x86_64/index.rst      |   1 +
>  arch/x86/entry/calling.h                |  40
>  arch/x86/entry/entry_64.S               | 131 +---
>  arch/x86/include/asm/fsgsbase.h         |  45 +++-
>  arch/x86/include/asm/inst.h             |  15 ++
>  arch/x86/include/uapi/asm/hwcap2.h      |   3 +
>  arch/x86/kernel/cpu/bugs.c              |   6 +-
>  arch/x86/kernel/cpu/common.c            |  22 ++
>  arch/x86/kernel/process.c               |  10 +-
>  arch/x86/kernel/process.h               |  69 ++
>  arch/x86/kernel/process_64.c            | 142 +++--
>  arch/x86/kernel/ptrace.c                |  17 +-
>  tools/testing/selftests/x86/fsgsbase.c  |  24 ++-
>  16 files changed, 606 insertions(+), 129 deletions(-)
>  create mode 100644 Documentation/x86/x86_64/fsgs.rst

So FWIW I've done some overnight fuzz testing of this patch set and
haven't seen any problems. Will try a couple of other kernel configs too.

Vegard
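The enablement check described in the quoted cover letter could look roughly like this from userspace. This is a minimal sketch, assuming x86-64 Linux and a libc that provides getauxval(); the fallback definition of HWCAP2_FSGSBASE as (1 << 1) is the value the x86 uapi header ended up using, and the two helper names are made up for illustration.

```c
#include <assert.h>
#include <elf.h>
#include <sys/auxv.h>

/* Fallback for toolchains whose headers do not define the bit yet. */
#ifndef HWCAP2_FSGSBASE
#define HWCAP2_FSGSBASE (1 << 1)
#endif

/* Pure bit test, kept separate so it can be checked on any machine. */
static int hwcap2_has_fsgsbase(unsigned long hwcap2)
{
	return (hwcap2 & HWCAP2_FSGSBASE) != 0;
}

/*
 * Returns 1 if the running kernel advertises FSGSBASE to this process
 * through AT_HWCAP2, 0 otherwise. getauxval() returns 0 when the entry
 * is absent, which reads as "not enabled" here.
 */
static int kernel_enables_fsgsbase(void)
{
	return hwcap2_has_fsgsbase(getauxval(AT_HWCAP2));
}
```

An application would call kernel_enables_fsgsbase() once at startup and only use the RDFSBASE/WRFSBASE family of instructions when it returns 1. Checking CPUID alone is not enough, because the kernel must also have set the %cr4 bit.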
Re: email as a bona fide git transport
On 10/22/19 3:53 PM, Theodore Y. Ts'o wrote:
> On Tue, Oct 22, 2019 at 02:11:22PM +0200, Vegard Nossum wrote:
>> As I wrote in there, we could already today start using
>>
>>   git am --message-id
>>
>> when applying patches and this would provide something that a bot could
>> annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
>> would already be a pretty nice improvement over today's situation.
>>
>> Sadly, since the beginning of 2018, this was only used for a measly
>> ~0.14% of all non-merge commits in the kernel:
>>
>> $ git rev-list --count --no-merges --since='2018-01-01' --grep 'Message-Id: ' linus/master
>> 178
>
> You might also want to count commits which have a link tag with a
> Message-Id:
>
>   Link: https://lore.kernel.org/r/c3438dad66a34a7d4e7509a5dd64c2326340a52a.1571647180.git.mbobrow...@mbobrowski.org
>
> That's because some kernel developers have been using a hook script
> like this:
>
> #!/bin/sh
> # For .git/hooks/applypatch-msg
> #
> # You must have the following in .git/config:
> # [am]
> #	messageid = true
>
> . git-sh-setup
> perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
> test -x "$GIT_DIR/hooks/commit-msg" &&
> 	exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
> :
>
> as we had reached rough consensus that this was the best way to
> incorporate the message id (since it could be made to be a clickable
> link in tools like gitk, for example).
>
> This rough consensus has only been in place since around the time of
> the Maintainer's Summit in Lisbon, so uptake is still probably a bit
> slow. I'd expect to see a lot more of this in the next merge window,
> though.

Thanks, I was not aware of this!

Seems like something that should go in Documentation/maintainer/, right?

The figure is much better, 16.7% on all non-merges since 2018-01-01.
This should help and we can maybe already do some interesting things
with git notes and lore/public-inbox.

Vegard
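To make the rewrite done by the quoted hook script concrete, here is a small C sketch of the same idea: turn a "Message-Id:" trailer into a lore "Link:" trailer. It is only a rough equivalent of the perl one-liner (it does not insist that the closing '>' is the last character on the line), and the function name and buffer handling are assumptions made for illustration, not part of git or any kernel tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LORE_PREFIX "https://lore.kernel.org/r/"

/*
 * Rewrite "Message-Id: <id>" into "Link: https://lore.kernel.org/r/id".
 * Returns 1 and fills 'out' on success; returns 0 if the line is not a
 * Message-Id trailer, has an empty id, or 'out' is too small.
 */
static int message_id_to_link(const char *line, char *out, size_t outlen)
{
	static const char tag[] = "Message-Id:";
	const char *p, *end;
	int n;

	assert(line && out && outlen > 0);

	if (strncmp(line, tag, sizeof(tag) - 1) != 0)
		return 0;

	/* Skip the tag, optional whitespace and the optional '<'. */
	p = line + sizeof(tag) - 1;
	while (*p == ' ' || *p == '\t')
		p++;
	if (*p == '<')
		p++;

	/* The id ends at '>' or at the end of the line, whichever is first. */
	end = p + strcspn(p, ">\r\n");
	if (end == p)
		return 0;

	n = snprintf(out, outlen, "Link: " LORE_PREFIX "%.*s",
		     (int)(end - p), p);
	return n > 0 && (size_t)n < outlen;
}
```

For the input line "Message-Id: <abc@example.org>" the helper produces "Link: https://lore.kernel.org/r/abc@example.org"; any other trailer is rejected and can be passed through unchanged by the caller.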
email as a bona fide git transport
sage commit_editmsg="$(git rev-parse --git-dir)/COMMIT_EDITMSG" ( if [ -z "$prev" ] then echo 'Patchset title' echo echo Commits: echo git log --oneline $start..HEAD else git show --format=format:%B --no-patch $prev echo Previous-version: $(git rev-parse $prev) fi ) > "${commit_editmsg}" ${EDITOR} "${commit_editmsg}" merge=$(git commit-tree -p $start -p HEAD -F "${commit_editmsg}" $(git rev-parse HEAD^{tree})) echo $merge } This will open the editor to edit the patchset description and create a merge commit that encompasses the patches in the patchset (use sha1^- to view the patches in it). >From 622a0469a4970c5daac0c0323e2d6a77b3bebbdb Mon Sep 17 00:00:00 2001 From: Vegard Nossum Date: Sat, 5 Oct 2019 16:15:59 +0200 Subject: [PATCH 1/3] format-patch: add --complete Include the raw commit data between the changelog and the diffstat. This will allow 'git am' to reconstruct the commit exactly to the point where the sha1 will be the same. Signed-off-by: Vegard Nossum --- commit 622a0469a4970c5daac0c0323e2d6a77b3bebbdb tree 8f09d9d6ed78f8617b2fe54fe9712990ba808546 parent 108b97dc372828f0e72e56bbb40cae8e1e83ece6 author Vegard Nossum 1570284959 +0200 committer Vegard Nossum 1571219301 +0200 --- builtin/log.c | 12 log-tree.c| 17 + revision.h| 3 ++- 3 files changed, 31 insertions(+), 1 deletion(-) diff --git a/builtin/log.c b/builtin/log.c index c4b35fdaf9..81c1164ae5 100644 --- a/builtin/log.c +++ b/builtin/log.c @@ -1545,6 +1545,7 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix) char *branch_name = NULL; char *base_commit = NULL; struct base_tree_info bases; + int complete = 0; int show_progress = 0; struct progress *progress = NULL; struct oid_array idiff_prev = OID_ARRAY_INIT; @@ -1622,6 +1623,8 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix) N_("add a signature")), OPT_STRING(0, "base", _commit, N_("base-commit"), N_("add prerequisite tree info to the patch series")), + OPT_BOOL(0, "complete", , + N_("include all the 
information necessary to reconstruct commit exactly")), OPT_FILENAME(0, "signature-file", _file, N_("add a signature from a file")), OPT__QUIET(, N_("don't print the patch filenames")), @@ -1905,6 +1908,15 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix) prepare_bases(, base, list, nr); } + if (complete) { + /* + * We need the commit buffer so that we can output the exact + * sequence of bytes that gets hashed as part of a commit. + */ + save_commit_buffer = 1; + rev.show_raw_buffer = 1; + } + if (in_reply_to || thread || cover_letter) rev.ref_message_ids = xcalloc(1, sizeof(struct string_list)); if (in_reply_to) { diff --git a/log-tree.c b/log-tree.c index 923a299e70..2c9788b25a 100644 --- a/log-tree.c +++ b/log-tree.c @@ -774,6 +774,22 @@ void show_log(struct rev_info *opt) memcpy(_queued_diff, , sizeof(diff_queued_diff)); } + + if (opt->show_raw_buffer) { + const char *buffer = get_commit_buffer(commit, NULL); + const char *subject; + + fprintf(opt->diffopt.file, "---\n"); + fprintf(opt->diffopt.file, "commit %s\n", oid_to_hex(>object.oid)); + + /* + * TODO: hex-encode to avoid mailer mangling? 
+ */ + if (find_commit_subject(buffer, )) + fprintf(opt->diffopt.file, "%.*s", (int) (subject - buffer), buffer); + else + fprintf(opt->diffopt.file, "%s", buffer); + } } int log_tree_diff_flush(struct rev_info *opt) @@ -791,6 +807,7 @@ int log_tree_diff_flush(struct rev_info *opt) if (opt->loginfo && !opt->no_commit_id) { show_log(opt); + if ((opt->diffopt.output_format & ~DIFF_FORMAT_NO_OUTPUT) && opt->verbose_header && opt->commit_format != CMIT_FMT_ONELINE && diff --git a/revision.h b/revision.h index 4134dc6029..5297dc9f3c 100644 --- a/revision.h +++ b/revision.h @@ -190,7 +190,8 @@ struct rev_info { use_terminator:1, missing_newline:1, date_mode_explicit:1, - preserve_subject:1; + preserve_subject:1, + show_raw_buffer:1; unsigned int disable_stdin:1; /* --show-linear-break */ unsigned int track_linear:1, -- 2.23.0.718.g3120370db8 >From 51bb531eb57320caf3761680ebf77c25b89b3719 Mon Sep 17 00:00:00 2001 From: Vegard Nossum Date: Wed, 16 Oct 2019 02:04:08 +0200 Subject: [PATCH 2/3] mailinfo: collect commit metadata from mail Signed-off-by: Vegard Nossum --- commit 51bb531eb57320caf3761680ebf77c25b89b3719 tree f3a3141f7d3f706d8ca60cdc1e1cde5aa2cc927a parent 622a0469a4970c5daac0c0323e2d6a77b3bebbdb author Vegard Nossum 1571184248 +0200 committer
Re: [PATCH v3 0/6] Tracing vs CR2
On 7/17/19 10:07 AM, Peter Zijlstra wrote:
> On Tue, Jul 16, 2019 at 09:33:50PM +0200, Vegard Nossum wrote:
>> [ cut here ]
>> General protection fault in user access. Non-canonical address?
>> WARNING: CPU: 0 PID: 5039 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x5d/0x70
>
> [...]
>
> https://lkml.kernel.org/r/57754f11-2c65-a2c8-2f6d-bfab0d2f8...@etsukata.com
>
> Does something like the below help?
>
> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
> index c8d0f05721a1..80ad4ccb7025 100644
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -226,12 +226,16 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size)
>  		.store	= store,
>  		.size	= size,
>  	};
> +	mm_segment_t fs;
>
>  	/* Trace user stack if not a kernel thread */
>  	if (current->flags & PF_KTHREAD)
>  		return 0;
>
> +	fs = get_fs();
> +	set_fs(USER_DS);
>  	arch_stack_walk_user(consume_entry, &c, task_pt_regs(current));
> +	set_fs(fs);
>  	return c.len;
>  }
>  #endif

Yes.

Vegard
Re: [PATCH v3 0/6] Tracing vs CR2
On 7/17/19 3:02 AM, Andy Lutomirski wrote: On Tue, Jul 16, 2019 at 2:53 PM Vegard Nossum wrote: On 7/16/19 9:33 PM, Vegard Nossum wrote: On 7/11/19 1:40 PM, Peter Zijlstra wrote: Hi, Here's the latest (and hopefully final) set of tracing vs CR2 patches. They are basically the same as v2, with only minor edits and tags collected from the last review. Please consider. Hi, I ran my own battery of tests on your patch set on top of 5ad18b2e60b75c7297a998dea702451d33a052ed and ran into this: On a different thread, Peter and I decided that the last patch in this series (the one that removes the _DEBUG stuff) is wrong. Can you see if these are reproducible with that patch removed? Yes, without the last patch I still get this: Run /init as init process init[711]: segfault at 4000 ip 400a sp 4ff8 error 7 [ cut here ] General protection fault in user access. Non-canonical address? WARNING: CPU: 0 PID: 711 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x5d/0x70 CPU: 0 PID: 711 Comm: init Not tainted 5.2.0+ #125 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 init[716]: segfault at 4000 ip 400a sp 4ff8 error 7 RIP: 0010:ex_handler_uaccess+0x5d/0x70 Code: 5d 41 5c c3 e8 c4 8e 0e 00 80 3d e5 74 1e 01 00 75 d3 e8 b6 8e 0e 00 48 c7 c7 10 a7 fb 81 c6 05 d0 74 1e 01 01 e8 d1 43 01 00 <0f> 0b eb b7 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 RSP: :c965fa18 EFLAGS: 00010086 RAX: RBX: 81c07dac RCX: 811a887c init[714]: segfault at 4000 ip 400a sp 4ff8 error 7 RDX: RSI: 8289f05f RDI: 0093 RBP: c965fa88 R08: 2e80b265 R09: 003f init[718]: segfault at 4000 ip 400a sp 4ff8 error 7 R10: R11: R12: 000d R13: 000d R14: R15: FS: 006ce880() GS:88803ec0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 3fe0 CR3: 3d2f6004 CR4: 003606f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: Code: Bad RIP value. 
fixup_exception+0x50/0x6a do_general_protection+0x40/0x160 general_protection+0x2d/0x40 RIP: 0010:arch_stack_walk_user+0x71/0x100 Code: 00 48 83 e8 10 49 39 c4 77 45 4c 8b 04 24 4c 89 e3 4d 89 fd 4c 89 fd 41 83 87 98 0a 00 00 01 0f 01 cb 0f ae e8 31 c0 4c 89 e2 <4c> 8b 33 4d 89 f4 85 c0 75 7a 48 8b 73 08 0f 01 ca 85 c0 74 1f 65 [...] This is my reproducer (as init): #include #include #include #include #include #include #include #include struct child_data { (*code)(); }; child_fn(void *arg) { child_data *data = arg; mprotect(data->code, PAGE_SIZE, PROT_EXEC); data->code(); } int main() { mkdir("/sys", 7); mount("nodev", "/sys", "sysfs", 0, ""); mount("nodev", "/sys/kernel/tracing", "tracefs", 0, ""); int tracing_options_userstacktrace = open("/sys/kernel/tracing/options/userstacktrace", O_RDWR); write(tracing_options_userstacktrace, "1\n", 2); int tracing_events_preemptirq_irq_disable = open("/sys/kernel/tracing/events/preemptirq/irq_disable/enable", O_RDWR); write(tracing_events_preemptirq_irq_disable, "1\n", 2); void *code = mmap(0, PAGE_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, 1, 0); { unsigned char *output = code; *output++ = 72; *output++ = 189; for (int i = 0; i < 8; ++i) *output++ = i; } void *child_stack = mmap(0, PAGE_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, 1, 0); while (1) { child_data data = { code }; clone(child_fn, child_stack, SIGCHLD, ); } } Compiled with -static and booted with "norandmaps" (for some reason that makes a difference), this is 100% reproducible for me, although the reproducer is somewhat sensitive to small changes that I don't quite understand. Vegard
Re: [PATCH v3 0/6] Tracing vs CR2
On 7/16/19 9:33 PM, Vegard Nossum wrote: On 7/11/19 1:40 PM, Peter Zijlstra wrote: Hi, Here's the latest (and hopefully final) set of tracing vs CR2 patches. They are basically the same as v2, with only minor edits and tags collected from the last review. Please consider. Hi, I ran my own battery of tests on your patch set on top of 5ad18b2e60b75c7297a998dea702451d33a052ed and ran into this: [ cut here ] General protection fault in user access. Non-canonical address? WARNING: CPU: 0 PID: 5039 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x5d/0x70 Got a different one: WARNING: CPU: 0 PID: 2150 at arch/x86/kernel/traps.c:791 do_debug+0xfe/0x240 CPU: 0 PID: 2150 Comm: init Not tainted 5.2.0+ #124 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 RIP: 0010:do_debug+0xfe/0x240 Code: 05 07 3d f3 7e f6 85 91 00 00 00 02 0f 85 d8 00 00 00 49 8b 84 24 18 0b 00 00 f6 44 24 01 40 74 2f f6 85 88 00 00 00 03 75 26 <0f> 0b 80 e4 bf 49 89 84 24 18 0b 00 00 f0 41 80 0c 24 10 48 81 a5 RSP: :fe00ff20 EFLAGS: 00010046 RAX: 4002 RBX: RCX: 810e2f72 RDX: RSI: 0003 RDI: 8201f090 RBP: fe00ff58 R08: R09: 0005 R10: R11: R12: 88803e0df040 R13: R14: 3d376001 R15: FS: 56dbc8c0() GS:88803ec0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 41f38010 CR3: 3d376001 CR4: 003606f0 DR0: 0001 DR1: 41a4f070 DR2: 7fff959ff000 DR3: DR6: fffe0ff0 DR7: 03b3062a Call Trace: <#DB> debug+0x2d/0x70 RIP: 0010:arch_stack_walk_user+0x74/0x100 Code: e8 10 49 39 c4 77 45 4c 8b 04 24 4c 89 e3 4d 89 fd 4c 89 fd 41 83 87 98 0a 00 00 01 0f 01 cb 0f ae e8 31 c0 4c 89 e2 4c 8b 33 <4d> 89 f4 85 c0 75 7a 48 8b 73 08 0f 01 ca 85 c0 74 1f 65 48 8b 04 RSP: :c900030dbd68 EFLAGS: 00040046 RAX: RBX: 41a4f073 RCX: 811ca27b RDX: 41a4f073 RSI: 41a4f0dd RDI: c900030dbdb8 RBP: 88803e0df040 R08: c900030dbf58 R09: R10: R11: R12: 41a4f073 R13: 88803e0df040 R14: 0041281000bf4800 R15: 88803e0df040 ? stack_trace_consume_entry+0x4b/0x80 ? 
profile_setup.cold+0xc1/0xc1 stack_trace_save_user+0x71/0x9c trace_buffer_unlock_commit_regs+0x1ae/0x270 trace_event_buffer_commit+0x90/0x240 trace_event_raw_event_preemptirq_template+0x9a/0x100 ? debug+0x49/0x70 ? perf_trace_preemptirq_template+0x120/0x120 ? trace_hardirqs_off_thunk+0x1a/0x1c trace_hardirqs_off_caller+0xf4/0x150 ? debug+0x44/0x70 trace_hardirqs_off_thunk+0x1a/0x1c debug+0x49/0x70 RIP: 0033:0x41a4f0dd Code: 47 11 b7 d2 36 45 6c 49 be 00 f0 9f 95 ff 7f 00 00 49 bf de a7 b3 e8 d7 21 3c 15 9c 48 81 0c 24 00 01 00 00 9d b8 62 00 00 00 <8e> c0 0f 05 66 8c c8 9c 48 81 24 24 ff fe ff ff 9d 48 89 04 25 40 RSP: 002b:40901ea0 EFLAGS: 0317 RAX: 0062 RBX: 41281000 RCX: RDX: 401c RSI: 41892000 RDI: 41281000 RBP: 41a4f073 R08: 0001 R09: 0001 R10: 917d7748 R11: 1000 R12: fdff R13: 6c4536d2b71147a5 R14: 7fff959ff000 R15: 153c21d7e8b3a7de ---[ end trace 0cd51ba690f12b47 ]--- The warning is this: if (WARN_ON_ONCE((dr6 & DR_STEP) && !user_mode(regs))) { /* * Historical junk that used to handle SYSENTER single-stepping. * This should be unreachable now. If we survive for a while * without anyone hitting this warning, we'll turn this into * an oops. */ tsk->thread.debugreg6 &= ~DR_STEP; set_tsk_thread_flag(tsk, TIF_SINGLESTEP); regs->flags &= ~X86_EFLAGS_TF; } Unfortunately DR6 from the register dump has already been cleared at the top of do_debug() and the local variable dr6 is on the stack and not loaded into any of the registers AFAICT. From the userspace Code: line you can clearly see it setting EFLAGS_TF, then it seems to be trapping on the next instruction: 1b: 9c pushfq 1c: 48 81 0c 24 00 01 00orq$0x100,(%rsp) 23: 00 24: 9d popfq 25: b8 62 00 00 00 mov$0x62,%eax 2a:* 8e c0 mov%eax,%es <-- trapping instruction You can see that DR1 points to 41a4f070, which is close to userspace RBP (41a4f073), which is perhaps being accessed by stack_trace_save_user() and causing the debug exception on a data breakpoint? The Code: line from stack_trace_save_user() is: 27: 4c 8b
Re: [PATCH v3 0/6] Tracing vs CR2
On 7/11/19 1:40 PM, Peter Zijlstra wrote:
> Hi,
>
> Here's the latest (and hopefully final) set of tracing vs CR2 patches. They
> are basically the same as v2, with only minor edits and tags collected from
> the last review.
>
> Please consider.

Hi,

I ran my own battery of tests on your patch set on top of
5ad18b2e60b75c7297a998dea702451d33a052ed and ran into this:

------------[ cut here ]------------
General protection fault in user access. Non-canonical address?
WARNING: CPU: 0 PID: 5039 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x5d/0x70
CPU: 0 PID: 5039 Comm: init Not tainted 5.2.0+ #124
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:ex_handler_uaccess+0x5d/0x70
Code: 5d 41 5c c3 e8 c4 8e 0e 00 80 3d e5 74 1e 01 00 75 d3 e8 b6 8e 0e 00 48 c7 c7 10 a7 fb 81 c6 05 d0 74 1e 01 01 e8 d1 43 01 00 <0f> 0b eb b7 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89
RSP: :fe00fc48 EFLAGS: 00010086
RAX: RBX: 81c07dac RCX: 811a887c
RDX: RSI: 8289f05f RDI: 0093
RBP: fe00fcb8 R08: 0036fe0f15d3 R09: 003f
R10: R11: R12: 000d
R13: 000d R14: R15:
FS: 563ab8c0() GS:88803ec0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 1ff7 CR3: 3c804002 CR4: 003606f0
DR0: 40209100 DR1: 402091a1 DR2:
DR3: DR6: 0ff1 DR7: 000b062a
Call Trace:
 <#DB>
 fixup_exception+0x50/0x6a
 do_general_protection+0x40/0x160
 general_protection+0x2d/0x40
RIP: 0010:arch_stack_walk_user+0x71/0x100
Code: 00 48 83 e8 10 49 39 c4 77 45 4c 8b 04 24 4c 89 e3 4d 89 fd 4c 89 fd 41 83 87 98 0a 00 00 01 0f 01 cb 0f ae e8 31 c0 4c 89 e2 <4c> 8b 33 4d 89 f4 85 c0 75 7a 48 8b 73 08 0f 01 ca 85 c0 74 1f 65
RSP: :fe00fd68 EFLAGS: 00050046
RAX: RBX: 854163717acc2789 RCX: 811ca27b
RDX: 854163717acc2789 RSI: 40209102 RDI: fe00fdb8
RBP: 88803d55d040 R08: c9000520bf58 R09:
R10: R11: R12: 854163717acc2789
R13: 88803d55d040 R14: 0093 R15: 88803d55d040
 ? stack_trace_consume_entry+0x4b/0x80
 ? arch_stack_walk_user+0x34/0x100
 ? profile_setup.cold+0xc1/0xc1
 stack_trace_save_user+0x71/0x9c
 trace_buffer_unlock_commit_regs+0x1ae/0x270
 trace_event_buffer_commit+0x90/0x240
 trace_event_raw_event_preemptirq_template+0x9a/0x100
 ? debug+0x16/0x70
 ? perf_trace_preemptirq_template+0x120/0x120
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 trace_hardirqs_off_caller+0xf4/0x150
 trace_hardirqs_off_thunk+0x1a/0x1c
 ? debug+0x11/0x70
 debug+0x16/0x70
RIP: 0010:copy_user_generic_unrolled+0xa0/0xc0
Code: 7f 40 ff c9 75 b6 89 d1 83 e2 07 c1 e9 03 74 12 4c 8b 06 4c 89 07 48 8d 76 08 48 8d 7f 08 ff c9 75 ee 21 d2 74 10 89 d1 8a 06 <88> 07 48 ff c6 48 ff c7 ff c9 75 f2 31 c0 0f 01 ca c3 0f 1f 40 00
RSP: :c9000520be38 EFLAGS: 00040202
RAX: 88803d55d09c RBX: 88803d55d040 RCX: 0001
RDX: 0001 RSI: 40209102 RDI: c9000520be76
RBP: 0001 R08: 0001 R09:
R10: R11: R12: 7000
R13: 40209102 R14: c9000520be76 R15:
 __probe_kernel_read+0x57/0x90
 is_prefetch.isra.0+0xb5/0x210
 ? tracer_hardirqs_on+0x53/0x1a0
 __bad_area_nosemaphore+0x9e/0x220
 __do_page_fault+0x483/0x630
 ? async_page_fault+0x8/0x40
 async_page_fault+0x36/0x40
RIP: 0033:0x40209102
Code: 00 00 49 bc 00 20 23 40 00 00 00 00 49 bd 00 00 d0 40 00 00 00 00 49 be ff ff ff ff ff ff ff ff 49 bf 00 50 80 40 00 00 00 00 <9c> 48 81 0c 24 00 04 00 00 48 81 0c 24 00 00 04 00 9d ff 2c 25 00
RSP: 002b:1fff EFLAGS: 00010217
RAX: RBX: 402090b0 RCX: 0001
RDX: 0001 RSI: RDI: 41ebb000
RBP: 854163717acc2789 R08: 0001 R09: b1f39cc399a61ebb
R10: 7ffeab175000 R11: 0360 R12: 40232000
R13: 40d0 R14: R15: 40805000
---[ end trace e5e49800ff5aa5ed ]---
PANIC: double fault, error_code: 0x0
CPU: 0 PID: 5039 Comm: init Tainted: GW 5.2.0+ #124
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50
Code: 82 e8 74 2d f8 ff 48 89 9d 10 01 00 00 48 89 ee 5b 4c 89 e7 5d 41 5c e9 8e 5d 12 00 5b b8 f4 ff ff ff 5d 41 5c c3 0f 1f 40 00 <65> 48 8b 04 25 c0 6c 01 00 65 8b 15 78 ba df 7e 81 e2 00 01 1f 00
RSP: :fe00f008 EFLAGS: 00010093
RAX: 00016cc0 RBX: 81a01436 RCX: 81a00b97
RDX: 00016cc0 RSI: 81a01428 RDI: 81a01436
RBP: fe00f088 R08:
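As a reading aid for the "Non-canonical address?" warning, here is a minimal sketch of the canonical-address rule on x86-64. It assumes 48 implemented virtual-address bits (4-level paging); `VADDR_BITS` and `is_canonical()` are names made up for this illustration, not kernel symbols. In the trace above, the value held in RBX/RDX/R12 at the faulting load in arch_stack_walk_user() is the same as the user-mode RBP (854163717acc2789), and that value fails this check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual-address bits (4-level paging). */
#define VADDR_BITS 48

static_assert(VADDR_BITS > 1 && VADDR_BITS < 64, "sketch expects a partial address width");

/*
 * An address is canonical when bits 63..(VADDR_BITS-1) are all equal,
 * i.e. the upper bits are a sign extension of the top implemented bit.
 */
static bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> (VADDR_BITS - 1);             /* bits 63..47 */
	uint64_t all_ones = UINT64_MAX >> (VADDR_BITS - 1);  /* 17 one-bits */

	return top == 0 || top == all_ones;
}
```

A stack walker that follows user-controlled frame pointers has to expect values like this, since user space can load anything into RBP.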
Re: [PATCH v8 4/5] x86/xsave: Make XSAVE check the base CPUID features before enabling
On 10/5/17 11:52 PM, Andi Kleen wrote:
> From: Andi Kleen
>
> Before enabling XSAVE, not only check the XSAVE specific CPUID bits,
> but also the base CPUID features of the respective XSAVE feature.
> This allows to disable individual XSAVE states using the existing
> clearcpuid= option, which can be useful for performance testing
> and debugging, and also in general avoids inconsistencies.
>
> v2:
> Add curly brackets (Thomas Gleixner)
> Signed-off-by: Andi Kleen
> ---
>  arch/x86/kernel/fpu/xstate.c | 23 +++
>  1 file changed, 23 insertions(+)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index f1d5476c9022..924bd895b5ee 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -15,6 +15,7 @@
>  #include
>  #include
> +#include
>
>  /*
>   * Although we spell it out in here, the Processor Trace
> @@ -36,6 +37,19 @@ static const char *xfeature_names[] =
>          "unknown xstate feature" ,
>  };
>
> +static short xsave_cpuid_features[] = {
> +        X86_FEATURE_FPU,
> +        X86_FEATURE_XMM,
> +        X86_FEATURE_AVX,
> +        X86_FEATURE_MPX,
> +        X86_FEATURE_MPX,
> +        X86_FEATURE_AVX512F,
> +        X86_FEATURE_AVX512F,
> +        X86_FEATURE_AVX512F,
> +        X86_FEATURE_INTEL_PT,
> +        X86_FEATURE_PKU,
> +};
> +
>  /*
>   * Mask of xstate features supported by the CPU and the kernel:
>   */
> @@ -726,6 +740,7 @@ void __init fpu__init_system_xstate(void)
>          unsigned int eax, ebx, ecx, edx;
>          static int on_boot_cpu __initdata = 1;
>          int err;
> +        int i;
>
>          WARN_ON_FPU(!on_boot_cpu);
>          on_boot_cpu = 0;
> @@ -759,6 +774,14 @@ void __init fpu__init_system_xstate(void)
>                  goto out_disable;
>          }
>
> +        /*
> +         * Clear XSAVE features that are disabled in the normal CPUID.
> +         */
> +        for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
> +                if (!boot_cpu_has(xsave_cpuid_features[i]))
> +                        xfeatures_mask &= ~BIT(i);
> +        }
> +
>          xfeatures_mask &= fpu__get_supported_xfeatures_mask();
>
>          /* Enable xstate instructions to be able to continue with initialization: */

Hi,

The commit for this patch in mainline
(ccb18db2ab9d923df07e7495123fe5fb02329713) causes the kernel to hang on
boot when passing the "nofxsr" option:

$ kvm -cpu host -kernel arch/x86/boot/bzImage -append "console=ttyS0 nofxsr earlyprintk=ttyS0" -serial stdio -display none -smp 2
early console in extract_kernel
input_data: 0x01dea276
input_len: 0x00500704
output: 0x0100
output_len: 0x012c79b4
kernel_total_size: 0x00f24000
booted via startup_32()
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

Decompressing Linux... Parsing ELF... Performing relocations... done.
Booting the kernel.
[..hang..]

If I revert it from Linus's tree (~5.2-rc6) then it boots again:

early console in extract_kernel
input_data: 0x024192e9
input_len: 0x005d8ea1
output: 0x0100
output_len: 0x019c7fa4
kernel_total_size: 0x0162c000
trampoline_32bit: 0x0009d000
booted via startup_32()
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

Decompressing Linux... Parsing ELF... Performing relocations... done.
Booting the kernel.
Linux version 5.2.0-rc6+ (vegard@t460) (gcc version 5.5.0 20171010 (Ubuntu 5.5.0-12ubuntu1~16.04)) #98 SMP PREEMPT Sat Jun 29 17:13:31 CEST 2019
Command line: console=ttyS0 nofxsr earlyprintk=ttyS0
[..normal boot..]

/proc/cpuinfo inside the VM is:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
stepping : 3
microcode : 0x1
cpu MHz : 2496.000
cache size : 4096 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
bogomips : 4992.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:


Vegard
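To make the effect of the loop in the quoted patch easier to follow, here is a userspace sketch of the same masking logic. The `FEAT_*` enum, the `cpu_features` bitmap and `filter_xfeatures()` are hypothetical stand-ins for the kernel's `X86_FEATURE_*` constants, `boot_cpu_has()` and the in-place update of `xfeatures_mask`; only the table order is taken from the patch. It shows what the mask looks like when a base feature is reported absent; it does not try to establish why the boot hangs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's X86_FEATURE_* constants. */
enum base_feature {
	FEAT_FPU, FEAT_XMM, FEAT_AVX, FEAT_MPX,
	FEAT_AVX512F, FEAT_INTEL_PT, FEAT_PKU,
};

/*
 * Index i is the XSAVE state-component bit, the value is the base CPUID
 * feature it depends on. Same order as the table in the quoted patch.
 */
static const enum base_feature xsave_cpuid_features[] = {
	FEAT_FPU, FEAT_XMM, FEAT_AVX, FEAT_MPX, FEAT_MPX,
	FEAT_AVX512F, FEAT_AVX512F, FEAT_AVX512F, FEAT_INTEL_PT, FEAT_PKU,
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static_assert(ARRAY_SIZE(xsave_cpuid_features) <= 64, "one mask bit per table entry");

/*
 * cpu_features is a bitmap indexed by enum base_feature (bit set means the
 * feature is present); it plays the role of boot_cpu_has() in this sketch.
 * Returns the mask with every component cleared whose base feature is absent.
 */
static uint64_t filter_xfeatures(uint64_t xfeatures_mask, uint32_t cpu_features)
{
	for (size_t i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
		if (!(cpu_features & (1u << xsave_cpuid_features[i])))
			xfeatures_mask &= ~(UINT64_C(1) << i);
	}
	return xfeatures_mask;
}
```

With all ten component bits set and only `FEAT_XMM` reported absent, the result keeps every bit except bit 1, so the mask still advertises AVX and the later components while the SSE component is gone.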
Re: [PATCH] mm, thp: Fix mlocking THP page with migration enabled
On Tue, 11 Sep 2018 at 12:34, Kirill A. Shutemov wrote:
>
> A transparent huge page is represented by a single entry on an LRU list.
> Therefore, we can only make unevictable an entire compound page, not
> individual subpages.
>
> If a user tries to mlock() part of a huge page, we want the rest of the
> page to be reclaimable.
>
> We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
> PMD on border of VM_LOCKED VMA will be split into PTE table.
>
> Introduction of THP migration breaks the rules around mlocking THP
> pages. If we had a single PMD mapping of the page in mlocked VMA, the
> page will get mlocked, regardless of PTE mappings of the page.
>
> For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
> remove_migration_pmd().
>
> Anon THP pages can only be shared between processes via fork(). Mlocked
> page can only be shared if parent mlocked it before forking, otherwise
> CoW will be triggered on mlock().
>
> For Anon-THP, we can fix the issue by munlocking the page on removing PTE
> migration entry for the page. PTEs for the page will always come after
> mlocked PMD: rmap walks VMAs from oldest to newest.
>
> Test-case:
>
> #include
> #include
> #include
> #include
> #include
>
> int main(void)
> {
>         unsigned long nodemask = 4;
>         void *addr;
>
>         addr = mmap((void *)0x2000UL, 2UL << 20, PROT_READ | PROT_WRITE,
>                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
>
>         if (fork()) {
>                 wait(NULL);
>                 return 0;
>         }
>
>         mlock(addr, 4UL << 10);
>         mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
>               &nodemask, 4, MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);

MPOL_MF_MOVE_ALL is actually not required to trigger the bug.

>
>         return 0;
> }
>
> Signed-off-by: Kirill A. Shutemov
> Reported-by: Vegard Nossum

Would you mind putting vegard.nos...@oracle.com instead?

> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")

The commit I bisected the problem to was actually a different one:

    commit c8633798497ce894c22ab083eb884c8294c537b2
    Author: Naoya Horiguchi
    Date: Fri Sep 8 16:11:08 2017 -0700

        mm: mempolicy: mbind and migrate_pages support thp migration

But maybe you had a good reason to choose the other one instead. They
are close together in any case, so I guess it would be hard to find a
kernel with one commit and not the other.

> Cc: [v4.14+]
> Cc: Zi Yan
> Cc: Naoya Horiguchi
> Cc: Vlastimil Babka
> Cc: Andrea Arcangeli

You could also add:

Link: https://lkml.org/lkml/2018/8/30/464

Thanks for debugging this.


Vegard
Re: v4.18.0+ WARNING: at mm/vmscan.c:1756 isolate_lru_page + bad page state
On Thu, 30 Aug 2018 at 15:31, Vegard Nossum wrote:
>
> Hi,
>
> Got this on a recent kernel (pretty sure it was
> 2ad0d52699700a91660a406a4046017a2d7f246a but annoyingly the oops
> itself doesn't tell me the exact version):
>
> ------------[ cut here ]------------
> trying to isolate tail page
> WARNING: CPU: 2 PID: 19156 at mm/vmscan.c:1756 isolate_lru_page+0x235/0x250
[...]
> I don't have the capacity to debug it atm and it may even have been
> fixed in mainline (though searching didn't yield any other reports
> AFAICT).
>
> I have .config and vmlinux (with DEBUG_INFO=y) if needed.
>
> It's not reproducible for the time being.

Just a quick follow-up: I have a reproducer and Kirill Shutemov has
identified the problem and provided a tentative patch.


Vegard
v4.18.0+ WARNING: at mm/vmscan.c:1756 isolate_lru_page + bad page state
Hi,

Got this on a recent kernel (pretty sure it was
2ad0d52699700a91660a406a4046017a2d7f246a but annoyingly the oops
itself doesn't tell me the exact version):

------------[ cut here ]------------
trying to isolate tail page
WARNING: CPU: 2 PID: 19156 at mm/vmscan.c:1756 isolate_lru_page+0x235/0x250
CPU: 2 PID: 19156 Comm: mmap Not tainted 4.18.0+ #493
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:isolate_lru_page+0x235/0x250
Code: fe ff ff 48 c7 c6 80 73 43 82 48 c7 c7 60 27 a9 82 e8 3f 40 c9 00 85 c0 0f 84 f4 fd ff ff 48 c7 c7 a5 ba 75 82 e8 6b 59 ed ff <0f> 0b e9 e1 fd ff ff 49 c7 c7 00 fe ff ff 44 89 7c 24 04 e9 ed fe
RSP: 0018:c90008edbc20 EFLAGS: 00010282
RAX: RBX: ea00082fd000 RCX: 0002
RDX: 8002 RSI: 0002 RDI:
RBP: 8803a157ea00 R08: 0001 R09:
R10: 82e456dc R11: 0001 R12: ea00082fd000
R13: 80020bf40805 R14: 7fe50f341000 R15: c90008edbdd8
FS: () GS:88042fb0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 00580fb8 CR3: 02a1e004 CR4: 000606e0
Call Trace:
 clear_page_mlock+0x73/0xb0
 page_remove_rmap+0x31e/0x370
 unmap_page_range+0x70b/0xa40
 unmap_vmas+0x47/0x90
 exit_mmap+0xb0/0x1c0
 mmput+0x5d/0x130
 do_exit+0x2c2/0xc20
 do_group_exit+0x42/0xb0
 __x64_sys_exit_group+0xf/0x10
 do_syscall_64+0x57/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x501ad8
Code: Bad RIP value.
RSP: 002b:7fff9bb8dee8 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX: RCX: 00501ad8
RDX: RSI: 003c RDI:
RBP: 0059b4a0 R08: 00e7 R09: ffc8
R10: R11: 0246 R12: 0001
R13: 007d7860 R14: 00027150 R15: 7fff9bb8e0c0
---[ end trace d3ada49968979043 ]---
------------[ cut here ]------------
list_del corruption, ea00082fd008->prev is LIST_POISON2 (dead0200)
WARNING: CPU: 2 PID: 19156 at lib/list_debug.c:50 __list_del_entry_valid+0x62/0x90
CPU: 2 PID: 19156 Comm: mmap Tainted: GW 4.18.0+ #493
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:__list_del_entry_valid+0x62/0x90
Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 f0 b3 79 82 e8 d2 84 b1 ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 28 b4 79 82 e8 be 84 b1 ff <0f> 0b 31 c0 c3 48 89 fe 48 c7 c7 60 b4 79 82 e8 aa 84 b1 ff 0f 0b
RSP: 0018:c90008edbc18 EFLAGS: 00010086
RAX: RBX: ea00082fd000 RCX: 0003
RDX: 0003 RSI: 0003 RDI:
RBP: 88043fff0d00 R08: 0001 R09:
R10: 8802794a60c8 R11: 0001 R12: 0004
R13: 88042f4ae800 R14: 0005 R15: c90008edbdd8
FS: () GS:88042fb0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 00501aae CR3: 02a1e004 CR4: 000606e0
Call Trace:
 isolate_lru_page+0xf3/0x250
 clear_page_mlock+0x73/0xb0
 page_remove_rmap+0x31e/0x370
 unmap_page_range+0x70b/0xa40
 unmap_vmas+0x47/0x90
 exit_mmap+0xb0/0x1c0
 mmput+0x5d/0x130
 do_exit+0x2c2/0xc20
 do_group_exit+0x42/0xb0
 __x64_sys_exit_group+0xf/0x10
 do_syscall_64+0x57/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x501ad8
Code: Bad RIP value.
RSP: 002b:7fff9bb8dee8 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX: RCX: 00501ad8
RDX: RSI: 003c RDI:
RBP: 0059b4a0 R08: 00e7 R09: ffc8
R10: R11: 0246 R12: 0001
R13: 007d7860 R14: 00027150 R15: 7fff9bb8e0c0
---[ end trace d3ada49968979044 ]---
BUG: Bad page state in process mmap pfn:20bf40
page:ea00082fd000 count:0 mapcount:0 mapping:dead0400 index:0x1
flags: 0x400()
raw: 0400 dead0100 dead0200 dead0400
raw: 0001
page dumped because: non-NULL mapping
CPU: 2 PID: 19156 Comm: mmap Tainted: GW 4.18.0+ #493
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x5c/0x7b
 bad_page+0xb3/0x110
 free_pcppages_bulk+0x17b/0x7e0
 free_unref_page+0x4a/0x60
 zap_huge_pmd+0x204/0x360
 unmap_page_range+0x970/0xa40
 unmap_vmas+0x47/0x90
 exit_mmap+0xb0/0x1c0
 mmput+0x5d/0x130
 do_exit+0x2c2/0xc20
 do_group_exit+0x42/0xb0
 __x64_sys_exit_group+0xf/0x10
 do_syscall_64+0x57/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x501ad8
Code: Bad RIP value.
RSP: 002b:7fff9bb8dee8 EFLAGS: 0246 ORIG_RAX:
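For readers who have not met the "list_del corruption ... is LIST_POISON2" message before, the following is a simplified userspace sketch of the idea behind that debug check. The poison values, the helper names and the boolean return are assumptions made for illustration (the kernel's real constants and reporting differ by architecture and version). The point is only that a deleted entry gets its pointers overwritten with recognisable values, so a second deletion of the same entry can be detected, which is presumably what happened to the list entry at ea00082fd008 in the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Illustrative poison values only; not the kernel's actual constants. */
#define POISON1 ((struct list_head *)0x100)
#define POISON2 ((struct list_head *)0x200)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlink entry, but refuse (return false) if it carries poison from an
 * earlier deletion or if its neighbours no longer point back at it.
 */
static bool list_del_checked(struct list_head *entry)
{
	assert(entry != NULL);
	if (entry->next == POISON1 || entry->prev == POISON2)
		return false;	/* already deleted once */
	if (entry->prev->next != entry || entry->next->prev != entry)
		return false;	/* neighbours disagree: list is corrupted */

	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = POISON1;	/* mark as deleted */
	entry->prev = POISON2;
	return true;
}
```

Deleting the same entry twice makes the second call fail without touching the list, which is the situation the warning reports.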
Re: Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
On 16 August 2018 at 17:42, Richard Weinberger wrote:
> On Thu, Aug 16, 2018 at 2:58 PM Sedat Dilek wrote:
>>
>> Hi Linus,
>>
>> I am here on Linux v4.18 and tried first to merge the l1tf-final Git-branch.
>> Unfortunately, this is no more available in the tip Git-tree.
>>
>> Then I saw Linux v4.18.1 which includes all the above stuff.
>>
>> I tried to 'git cherry-pick -m 1 958f338e96f874a0d29442396d6adf9c1e17aa2d'.
>> I know the commit-id is the hash of a merge.
>> Luckily, I could get the "diff" and applied it.
>> But the history misses.
>>
>> How can I get the history and subjects of all commits in your tree to
>> cherry-pick the single commits?
>>
>> Do you happen to know another solution to get easily all L1TF commits
>> with any other tricks?
>
> That should help:
> git log --oneline
> 958f338e96f874a0d29442396d6adf9c1e17aa2d^..958f338e96f874a0d29442396d6adf9c1e17aa2d

Hey,

As a shorthand for this, you can also use just:

git log --oneline 958f338e96f87^-

The syntax was made especially so that you can see all the commits
that arrived via a merge commit without having to write the rev of the
merge twice but is otherwise exactly equivalent to "rev^..rev".

It should work from git v2.13.

Just a tip :-)


Vegard
Re: [PATCH] fscache: fix a kernel BUG at fs/fscache/operation.c:69!
On 22 February 2018 at 08:33, wrote:
> From: Lei Xue
>
> There is a potential race in fscache operation enqueuing for reading and
> copying multiple pages from cachefiles to netfs.
> Under some heavy load system, it will happen very often.
>
> If this race occurs, an oops similar to the following is seen:
>
> kernel BUG at fs/fscache/operation.c:69!
> invalid opcode: [#1] SMP
> …
> #0 [883fff0838d8] machine_kexec at 81051beb
> #1 [883fff083938] crash_kexec at 810f2542
> #2 [883fff083a08] oops_end at 8163e1a8
> #3 [883fff083a30] die at 8101859b
> #4 [883fff083a60] do_trap at 8163d860
> #5 [883fff083ab0] do_invalid_op at 81015204
> #6 [883fff083b60] invalid_op at 8164701e
> [exception RIP: fscache_enqueue_operation+246]
> RIP: a0b793c6 RSP: 883fff083c18 RFLAGS: 00010046
> RAX: 0019 RBX: 8832ed1a9ec0 RCX: 0006
> RDX: RSI: 0046 RDI: 0046
> RBP: 883fff083c20 R8: 0086 R9: 178f
> R10: 816aeb00 R11: 883fff08392e R12: 8802f0525620
> R13: 88407ffc01d8 R14: R15: 0003
> ORIG_RAX: CS: 0010 SS:
> #7 [883fff083c10] fscache_enqueue_operation at a0b793c6
> #8 [883fff083c28] cachefiles_read_waiter at a0b15a48
> #9 [883fff083c48] __wake_up_common at 810af028
>
> Signed-off-by: Lei Xue
> ---
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index 883bc7bb12c5..9d5d13e150fb 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -58,9 +58,9 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait,
> unsigned mode,
>
>          spin_lock(&object->work_lock);
>          list_add_tail(&monitor->op_link, &monitor->op->to_do);
> +        fscache_enqueue_retrieval(monitor->op);
>          spin_unlock(&object->work_lock);
>
> -        fscache_enqueue_retrieval(monitor->op);
>          return 0;
>  }

Hi,

Just wondering what the status of this patch is? We've been hitting a
similar problem and arrived at the same patch as a potential fix for
it.

Our crashes look like this:

WARNING: CPU: 0 PID: 120693 at kernel/workqueue.c:618 insert_work+0x5f/0x70
Modules linked in: nbd
CPU: 0 PID: 120693 Comm: sh Not tainted 4.16.2-0 #1
Hardware name: Oracle Corporation Sun Fire X4800/20434, BIOS 11080200 08/12/2016
RIP: 0010:insert_work+0x5f/0x70
RSP: 0018:88103fa039b8 EFLAGS: 00010046
RAX: 88103f443f00 RBX: 880187c37c00 RCX: 0005
RDX: 880187c37c20 RSI: 8807c04dec00 RDI:
RBP: 88103fa039c8 R08: 0101 R09: 0001
R10: 887eee68fd40 R11: 0001 R12: 88503fafc600
R13: 0001cf60 R14: 880187c37c00 R15: 88103f443f00
FS: () GS:88103fa0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f394d2780a0 CR3: 000bcc936000 CR4: 06f0
Call Trace:
 __queue_work+0x11f/0x320
 queue_work_on+0x19/0x30
 fscache_enqueue_operation+0x83/0x160
 cachefiles_read_waiter+0xd2/0x130
 __wake_up_common+0x81/0x120
 __wake_up_locked_key_bookmark+0x16/0x20
 wake_up_page_bit+0x97/0xe0
 unlock_page+0x20/0x30
 page_endio+0x21/0xa0
 mpage_end_io+0x41/0x60
 bio_endio+0x78/0x90
 dec_pending+0x140/0x250
 ? linear_status+0x40/0x40
 clone_endio+0x86/0x100
 bio_endio+0x78/0x90
 blk_update_request+0x8d/0x2b0
 scsi_end_request+0x36/0x200
 scsi_io_completion+0x12a/0x5e0
 scsi_finish_command+0xf2/0x150
 scsi_softirq_done+0x13e/0x160
 __blk_mq_complete_request+0xb8/0x180
 blk_mq_complete_request+0x57/0x70
 scsi_mq_done+0x10/0x20
 megasas_complete_cmd+0xdf/0x620
 megasas_complete_cmd_dpc+0x8f/0x100
 tasklet_action+0x9a/0xb0
 __do_softirq+0xbf/0x1c8
 irq_exit+0x9c/0xb0
 do_IRQ+0x5b/0xe0
 common_interrupt+0xf/0xf
RIP: 0010:_raw_spin_unlock_irqrestore+0x9/0x10
RSP: 0018:c900309e3cf8 EFLAGS: 0296 ORIG_RAX: ffde
RAX: 0002 RBX: 0002 RCX: 0001
RDX: ea0006793fe0 RSI: 0296 RDI: 88107800
RBP: c900309e3cf8 R08: 0002 R09: 0011b912
R10: 00e7 R11: R12: ea0014baa000
R13: 88103fa1d120 R14: 88107fff6000 R15: 88107fff6000
 pagevec_lru_move_fn+0xb7/0xe0
 ? pagevec_move_tail_fn+0x350/0x350
 __pagevec_lru_add+0x12/0x20
 lru_add_drain_cpu+0xc4/0xe0
 lru_add_drain+0x10/0x20
 exit_mmap+0x58/0x190
 ? __handle_mm_fault+0x9a4/0x1540
 ? hrtimer_try_to_cancel+0x1b/0xa0
 mmput+0x4e/0x100
 do_exit+0x22f/0xa10
 do_group_exit+0x3a/0xa0
 SyS_exit_group+0x12/0x20
 do_syscall_64+0x61/0x110
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f394d325fa8
RSP: 002b:7ffda407e668
Re: [PATCH 00/45] C++: Convert the kernel to C++
On 1 April 2018 at 22:40, David Howells wrote:
>
> Here are a series of patches to start converting the kernel to C++. It
> requires g++ v8.

Nice!

I tried something similar a few years ago, but I don't think it was
nearly as neat. I did get RTTI and exceptions to work (using libcxxrt
+ libunwind), though.

Having noticed that a lot of really trivial kernel bugs are due to
control flow issues (e.g. when somebody adds a possibly-failing step
to a function but forgets to add a new label to clean it up), I really
wanted to see how/whether exceptions and RAII could help in that
space.

Just in case you want to compare notes, I've pushed my branch to:

https://github.com/vegard/linux-2.6/tree/cxx

I also started a little bit of work on converting a driver to use
RAII, and quickly ran into a few problems: C++ destructors don't take
arguments, which means that some objects would have to carry extra
state around because some of the information needed to destroy an
object resides with somebody else. This means that you would have to
do more refactoring work to avoid needing this in the first place,
i.e. mapping creation/destruction of various C-style structs to C++ is
_not_ straightforward.

Take dma_alloc_coherent() for example. It pairs up with
dma_free_coherent(), and that one needs to know the device and buffer
size that you passed, too:

void *dma_alloc_coherent(struct device *, size_t, dma_addr_t *, gfp_t);
void dma_free_coherent(struct device *, size_t, void *, dma_addr_t);

This means that if you have 1 device using 2 buffers of the same size
and the size is stored only by the device struct, then you must always
do the destruction from the device struct, since the individual
buffers don't know their size (unless you move the member there; but
it feels like a waste of memory if you could do it just fine in C).
Maybe there's a "proper" way to do it that I didn't see, but problems
like this turned me off the whole approach a little.
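To make the trade-off concrete, here is a small plain-C sketch of the two layouts. Everything in it is a hypothetical stand-in: `struct device`, `fake_alloc()` and `fake_free()` only mirror the shape of the DMA calls (the `dma_addr_t` handle is left out for brevity) and use malloc() underneath. Layout A stores the device and size once in the owner; layout B is what an argument-less destructor would need, with each buffer carrying its own copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device; counts outstanding buffers. */
struct device {
	int live_allocs;
};

static void *fake_alloc(struct device *dev, size_t size)
{
	void *p = malloc(size);

	if (p)
		dev->live_allocs++;
	return p;
}

/* Like the real free call, this needs the device and size again. */
static void fake_free(struct device *dev, size_t size, void *p)
{
	(void)size;	/* only mirrors the signature in this sketch */
	assert(dev->live_allocs > 0);
	free(p);
	dev->live_allocs--;
}

/* Layout A: the owner keeps dev and size once; buffers are bare pointers. */
struct ring_pair {
	struct device *dev;
	size_t buf_size;
	void *rx, *tx;
};

static void ring_pair_release(struct ring_pair *r)
{
	fake_free(r->dev, r->buf_size, r->rx);
	fake_free(r->dev, r->buf_size, r->tx);
	r->rx = r->tx = NULL;
}

/* Layout B: every buffer carries what it needs to free itself. */
struct owned_buf {
	struct device *dev;
	size_t size;
	void *cpu_addr;
};

static void owned_buf_release(struct owned_buf *b)
{
	if (b->cpu_addr)
		fake_free(b->dev, b->size, b->cpu_addr);
	b->cpu_addr = NULL;
}
```

Two `owned_buf` objects duplicate the device pointer and size that `ring_pair` stores once, which is the extra state referred to above; in exchange, each buffer can be released without help from its owner.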
Another real bummer is the size and complexity of the RTTI and unwinding support code. First of all, unwinding requires parsing and executing DWARF code on the fly, and that just makes everything very slow. Not to mention that it needs to be threading-aware and does a lot of memory allocations. IIRC handling out-of-memory conditions was extremely ugly (not that the kernel is perfect in this respect to start with) and involved the use of "reserve buffers". I didn't like it at all. Also, for reference, I found a few other projects doing similar things in the past: https://github.com/veltzer/kcpp http://www.drdobbs.com/cpp/c-exceptions-the-linux-kernel/229100146 https://pograph.wordpress.com/2009/04/05/porting-cpp-code-to-linux-kernel/ https://www.threatstack.com/blog/c-in-the-linux-kernel/ There's probably more, I seem to remember at least 1 commercial product using C++ for their out-of-tree module (albeit without RTTI/exceptions), but I can't find it right now. Vegard
Re: [PATCH 00/45] C++: Convert the kernel to C++
On 1 April 2018 at 22:40, David Howells wrote:
>
> Here are a series of patches to start converting the kernel to C++. It
> requires g++ v8.

Nice! I tried something similar a few years ago, but I don't think it was nearly as neat. I did get RTTI and exceptions to work (using libcxxrt + libunwind), though.

Having noticed that a lot of really trivial kernel bugs are due to control flow issues (e.g. when somebody adds a possibly-failing step to a function but forgets to add a new label to clean it up), I really wanted to see how/whether exceptions and RAII could help in that space.

Just in case you want to compare notes, I've pushed my branch to:

https://github.com/vegard/linux-2.6/tree/cxx

I also started a little bit of work on converting a driver to use RAII, and quickly ran into a few problems:

C++ destructors don't take arguments, which means that some objects would have to carry extra state around because some of the information needed to destroy an object resides with somebody else. This means that you would have to do more refactoring work to avoid needing this in the first place, i.e. mapping creation/destruction of various C-style structs to C++ is _not_ straightforward.

Take dma_alloc_coherent() for example. It pairs up with dma_free_coherent(), and that one needs to know the device and buffer size that you passed, too:

void *dma_alloc_coherent(struct device *, size_t, dma_addr_t *, gfp_t);
void dma_free_coherent(struct device *, size_t, void *, dma_addr_t);

This means that if you have 1 device using 2 buffers of the same size and the size is stored only by the device struct, then you must always do the destruction from the device struct, since the individual buffers don't know their size (unless you move the member there; but it feels like a waste of memory if you could do it just fine in C). Maybe there's a "proper" way to do it that I didn't see, but problems like this turned me off the whole approach a little.
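
To make the destructor problem concrete, here is a minimal userspace sketch, not code from the branch above. The names fake_dma_alloc(), fake_dma_free(), dma_buffer and demo_dma_buffers() are hypothetical stand-ins; the real dma_alloc_coherent()/dma_free_coherent() are only available in the kernel. The point it tries to show is that each RAII buffer has to store its own device pointer and size, because the destructor cannot be handed them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <utility>

// Hypothetical userspace stand-ins so the sketch compiles outside the kernel.
struct device { int live_buffers = 0; };
using dma_addr_t = std::uint64_t;

inline void *fake_dma_alloc(device *dev, std::size_t size, dma_addr_t *handle)
{
    void *p = std::malloc(size);
    if (p) {
        *handle = reinterpret_cast<std::uintptr_t>(p);
        ++dev->live_buffers;
    }
    return p;
}

// Like dma_free_coherent(), the free side wants the device and size back.
inline void fake_dma_free(device *dev, std::size_t size, void *cpu, dma_addr_t handle)
{
    (void)size;
    (void)handle;
    std::free(cpu);
    --dev->live_buffers;
}

// The destructor takes no arguments, so every buffer has to remember the
// device and size it was allocated with, even if the owner already knows them.
class dma_buffer {
public:
    dma_buffer(device *dev, std::size_t size)
        : dev_(dev), size_(size)
    {
        assert(dev != nullptr);
        cpu_ = fake_dma_alloc(dev_, size_, &handle_);
    }

    ~dma_buffer()
    {
        if (cpu_)
            fake_dma_free(dev_, size_, cpu_, handle_);
    }

    dma_buffer(const dma_buffer &) = delete;
    dma_buffer &operator=(const dma_buffer &) = delete;

    void *data() const { return cpu_; }

private:
    device *dev_;            // extra state carried only for the destructor
    std::size_t size_;       // extra state carried only for the destructor
    dma_addr_t handle_ = 0;
    void *cpu_ = nullptr;
};

// Returns the number of live buffers inside the scope and after leaving it.
inline std::pair<int, int> demo_dma_buffers()
{
    device dev;
    int inside = 0;
    {
        dma_buffer a(&dev, 4096), b(&dev, 4096);
        inside = dev.live_buffers;
    }
    return std::make_pair(inside, dev.live_buffers);
}
```

With two same-sized buffers on one device, both objects carry their own copy of the device pointer and the size, which is the memory overhead mentioned above.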
Another real bummer is the size and complexity of the RTTI and unwinding support code. First of all, unwinding requires parsing and executing DWARF code on the fly, and that just makes everything very slow. Not to mention that it needs to be threading-aware and does a lot of memory allocations. IIRC handling out-of-memory conditions was extremely ugly (not that the kernel is perfect in this respect to start with) and involved the use of "reserve buffers". I didn't like it at all.

Also, for reference, I found a few other projects doing similar things in the past:

https://github.com/veltzer/kcpp
http://www.drdobbs.com/cpp/c-exceptions-the-linux-kernel/229100146
https://pograph.wordpress.com/2009/04/05/porting-cpp-code-to-linux-kernel/
https://www.threatstack.com/blog/c-in-the-linux-kernel/

There are probably more. I seem to remember at least 1 commercial product using C++ for their out-of-tree module (albeit without RTTI/exceptions), but I can't find it right now.

Vegard
parallel make broken with ORC unwinder
Hi,

When I run make -j64 on a v4.14 kernel or newer with ORC_UNWINDER=y the kernel build breaks like this:

$ make -j64
  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  DESCEND objtool
  CC      scripts/mod/empty.o
[...]
security/smack/smack_lsm.o: warning: objtool: elf_update: cannot write data to file
[...]
drivers/atm/uPD98402.o: warning: objtool: elf_update: cannot write data to file
  AR      arch/x86/entry/vdso/built-in.o
  CC      security/keys/permission.o
  CC      arch/x86/entry/vsyscall/vsyscall_gtod.o
  CC      security/keys/process_keys.o
  CC [M]  arch/x86/kvm/../../../virt/kvm/irqchip.o
Segmentation fault
make[2]: *** [drivers/atm/uPD98402.o] Error 139
make[2]: *** Waiting for unfinished jobs

With FRAME_POINTER_UNWINDER=y everything seems to work fine.

A bisect points to:

ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67 is the first bad commit
commit ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67
Author: Josh Poimboeuf
Date:   Mon Jul 24 18:36:57 2017 -0500

    x86/unwind: Add the ORC unwinder

Grepping for smack_lsm.o in the build log gives the following output:

gcc -Wp,-MD,security/smack/.smack_lsm.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -Werror-implicit-function-declaration -Wno-format-security -std=gnu89 -fno-PIE -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -DCONFIG_X86_X32_ABI -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -O2 -Wno-maybe-uninitialized --param=allow-store-data-races=0 -DCC_HAVE_ASM_GOTO -Wframe-larger-than=1024 -fno-stack-protector -Wno-unused-but-set-variable -fno-var-tracking-assignments -g -gdwarf-4 -pg -mfentry -DCC_USING_FENTRY -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack -Werror=implicit-int -Werror=strict-prototypes -DKBUILD_BASENAME='"smack_lsm"' -DKBUILD_MODNAME='"smack"' -c -o security/smack/.tmp_smack_lsm.o security/smack/smack_lsm.c
./tools/objtool/objtool orc generate --no-fp "security/smack/smack_lsm.o";
security/smack/smack_lsm.o: warning: objtool: elf_update: cannot write data to file
if [ "-pg" = "-pg" ]; then if [ security/smack/smack_lsm.o != "scripts/mod/empty.o" ]; then ./scripts/recordmcount "security/smack/smack_lsm.o"; fi; fi;
rm -f security/smack/smack.o; ar rcSTPD security/smack/smack.o security/smack/smack_lsm.o security/smack/smack_access.o security/smack/smackfs.o security/smack/smack_netfilter.o

This line looks suspicious:

./tools/objtool/objtool orc generate --no-fp "security/smack/smack_lsm.o";

Is it really rewriting the file in place? That seems quite buggy to me.

Vegard
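
For comparison, the usual alternative to rewriting a file in place is to write the new contents to a temporary file and rename it over the target. The sketch below only illustrates that general pattern. It is not what objtool does, and replace_file_atomically(), read_file() and the fixed ".tmp" suffix are hypothetical choices made for this example (a fixed suffix would itself collide if two jobs handled the same file at once).

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Write the new contents to a temporary file next to the target, then rename
// it over the target. On POSIX, rename() replaces the destination atomically,
// so a concurrent reader sees either the old file or the new one, never a
// half-written one.
inline bool replace_file_atomically(const std::string &path, const std::string &contents)
{
    assert(!path.empty());
    const std::string tmp = path + ".tmp"; // hypothetical naming scheme
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out << contents;
        out.flush();
        if (!out)
            return false; // could not open or write the temporary file
    }
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}

// Helper for the sketch: read a whole file into a string.
inline std::string read_file(const std::string &path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Whether something like this would be appropriate for the kbuild rule above is a separate question; the sketch is only meant to show what "not in place" could look like.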
Re: [PATCH] mm: kill kmemcheck again
On 30 September 2017 at 11:48, Steven Rostedt wrote:
> On Wed, 27 Sep 2017 17:02:07 +0200
> Michal Hocko wrote:
>
>> > Now that 2 years have passed, and all distros provide gcc that supports
>> > KASAN, kill kmemcheck again for the very same reasons.
>>
>> This is just too large to review manually. How have you generated the
>> patch?
>
> I agree. This needs to be taken out piece by piece, not in one go,
> where there could be unexpected fallout.

I have a patch from earlier this year that starts by removing the core code and defining all the helpers/flags as no-ops so they can be removed bit by bit at a later time. See the attachment. Pekka signed off on it too.

I never actually submitted this because I was waiting for MSAN to be merged in the kernel.

It has been compile and run tested on x86_64.

Vegard

From b06e2b3b833b02ecb0afb9dd92422e89c7fbb6d9 Mon Sep 17 00:00:00 2001
From: Vegard Nossum
Date: Thu, 30 Mar 2017 13:26:15 +0200
Subject: [PATCH] kmemcheck: remove core (x86 + mm) code

With KASAN/KMSAN and compiler-based instrumentation, this code is way past its expiry date. There is zero reason to be using kmemcheck at this point, as KASAN/KMSAN will be much faster, support SMP, and catch any bug that kmemcheck would have caught.

See the additional rationale and past discussion at <https://lkml.org/lkml/2015/3/11/435>.

I take the approach of first removing all the core x86 and mm code, leaving behind only include/linux/kmemcheck.h which provides some helpers (now only dummies as for the !KMEMCHECK case previously) used in e.g. networking code for special annotations.

We can then send individual (smaller, more reviewable) patches for removing kmemcheck annotations in other subsystems. Once there are no users of the kmemcheck helpers, we can kill off the dummy helpers as well in a final patch.

Cc: Ingo Molnar
Cc: Andrew Morton
Cc: Sasha Levin
Cc: Steven Rostedt
Signed-off-by: Vegard Nossum
Signed-off-by: Pekka Enberg
---
 Documentation/admin-guide/kernel-parameters.txt | 7 -
 Documentation/dev-tools/index.rst | 1 -
 Documentation/dev-tools/kmemcheck.rst | 733
 MAINTAINERS | 10 -
 arch/arm/include/asm/dma-iommu.h | 1 -
 arch/openrisc/include/asm/dma-mapping.h | 1 -
 arch/x86/Kconfig | 3 +-
 arch/x86/Makefile | 5 -
 arch/x86/include/asm/dma-mapping.h | 1 -
 arch/x86/include/asm/kmemcheck.h | 42 --
 arch/x86/include/asm/pgtable_types.h | 8 +-
 arch/x86/include/asm/string_32.h | 9 -
 arch/x86/include/asm/string_64.h | 8 -
 arch/x86/include/asm/xor.h | 5 +-
 arch/x86/kernel/cpu/intel.c | 15 -
 arch/x86/kernel/traps.c | 5 -
 arch/x86/mm/Makefile | 2 -
 arch/x86/mm/fault.c | 6 -
 arch/x86/mm/init.c | 5 +-
 arch/x86/mm/kmemcheck/Makefile | 1 -
 arch/x86/mm/kmemcheck/error.c | 227
 arch/x86/mm/kmemcheck/error.h | 15 -
 arch/x86/mm/kmemcheck/kmemcheck.c | 658 -
 arch/x86/mm/kmemcheck/opcode.c | 106
 arch/x86/mm/kmemcheck/opcode.h | 9 -
 arch/x86/mm/kmemcheck/pte.c | 22 -
 arch/x86/mm/kmemcheck/pte.h | 10 -
 arch/x86/mm/kmemcheck/selftest.c | 70 ---
 arch/x86/mm/kmemcheck/selftest.h | 6 -
 arch/x86/mm/kmemcheck/shadow.c | 173 --
 arch/x86/mm/kmemcheck/shadow.h | 18 -
 include/linux/dma-mapping.h | 8 +-
 include/linux/gfp.h | 2 -
 include/linux/kmemcheck.h | 59 --
 include/linux/mm_types.h | 8 -
 include/linux/slab.h | 12 +-
 init/main.c | 1 -
 kernel/sysctl.c | 10 -
 lib/Kconfig.debug | 6 +-
 lib/Kconfig.kmemcheck | 94 ---
 mm/Kconfig.debug | 1 -
 mm/Makefile | 2 -
 mm/kmemcheck.c | 125
 mm/page_alloc.c | 14 -
 mm/slab.c | 14 -
 mm/slab.h | 2 -
 mm/slub.c | 25 +-
 47 files changed, 18 insertions(+), 2547 deletions(-)
 delete mode 100644 Documentation/dev-tools/kmemcheck.rst
 delete mode 100644 arch/x86/include/asm/kmemcheck.h
 delete mode 100644 arch/x86/mm/kmemchec
Re: [bisected] Re: tty lockdep trace
On 06/04/17 11:02, Mike Galbraith wrote:
On Sun, 2017-06-04 at 10:32 +0200, Greg Kroah-Hartman wrote:
On Sat, Jun 03, 2017 at 08:33:52AM +0200, Mike Galbraith wrote:
On Wed, 2017-05-31 at 13:21 -0400, Dave Jones wrote:

Just hit this during a trinity run.

925bb1ce47f429f69aad35876df7ecd8c53deb7e is the first bad commit
commit 925bb1ce47f429f69aad35876df7ecd8c53deb7e
Author: Vegard Nossum
Date:   Thu May 11 12:18:52 2017 +0200

    tty: fix port buffer locking

Now reverting this.

Oops, sorry, forgot to add Dave and your names to the patch revert. The list of people who reported this was really long, many thanks for this.

If flush_to_ldisc() is the problem, and taking atomic_write_lock in that path an acceptable solution, how about do that a bit differently instead. Lockdep stopped grumbling, vbox seems happy.

925bb1ce47f4 (tty: fix port buffer locking) upset lockdep by holding buf->lock while acquiring tty->atomic_write_lock. Move acquisition to flush_to_ldisc(), taking it prior to taking buf->lock. Costs a reference, but appeases lockdep.

Not-so-signed-off-by: /me
---
 drivers/tty/tty_buffer.c | 10 ++
 drivers/tty/tty_port.c   |  2 --
 2 files changed, 10 insertions(+), 2 deletions(-)

--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -465,7 +465,13 @@ static void flush_to_ldisc(struct work_s
 {
 	struct tty_port *port = container_of(work, struct tty_port, buf.work);
 	struct tty_bufhead *buf = &port->buf;
+	struct tty_struct *tty = READ_ONCE(port->itty);
+	struct tty_ldisc *disc = NULL;
 
+	if (tty)
+		disc = tty_ldisc_ref(tty);
+	if (disc)
+		mutex_lock(&tty->atomic_write_lock);
 	mutex_lock(&buf->lock);
 
 	while (1) {
@@ -501,6 +507,10 @@ static void flush_to_ldisc(struct work_s
 	}
 
 	mutex_unlock(&buf->lock);
+	if (disc) {
+		mutex_unlock(&tty->atomic_write_lock);
+		tty_ldisc_deref(disc);
+	}
 
 }
 
--- a/drivers/tty/tty_port.c
+++ b/drivers/tty/tty_port.c
@@ -34,9 +34,7 @@ static int tty_port_default_receive_buf(
 	if (!disc)
 		return 0;
 
-	mutex_lock(&tty->atomic_write_lock);
 	ret = tty_ldisc_receive_buf(disc, p, (char *)f, count);
-	mutex_unlock(&tty->atomic_write_lock);
 
 	tty_ldisc_deref(disc);
 

I don't know how you did it, but this passes my testing (reproducers for both the original issue and the lockdep splat/hang).

Although given the track record I'm not sure how much that's worth :-/

Vegard
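
As a rough illustration of why the acquisition order matters, here is a toy lock-order tracker. It is a sketch under stated assumptions, not lockdep: it runs in a single context, only detects direct pairwise inversions, and the order_tracker class and orders_conflict() function are hypothetical names. The "other path" that takes atomic_write_lock before buf->lock is also an assumption made for the example; the dependency chain in the actual reports involves more locks.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy lock-order tracker: records "A was held while B was taken" pairs and
// flags a direct inversion (B held while A is taken). Real lockdep tracks
// full dependency chains across contexts; this only checks pairs.
class order_tracker {
public:
    // Returns false if taking `name` now inverts a previously recorded order.
    bool acquire(const std::string &name)
    {
        bool ok = true;
        for (const std::string &h : held_) {
            if (edges_.count(std::make_pair(name, h)))
                ok = false; // earlier: name before h; now: h before name
            edges_.insert(std::make_pair(h, name));
        }
        held_.push_back(name);
        return ok;
    }

    void release(const std::string &name)
    {
        for (auto it = held_.begin(); it != held_.end(); ++it) {
            if (*it == name) {
                held_.erase(it);
                return;
            }
        }
        assert(false && "released a lock that was not held");
    }

private:
    std::vector<std::string> held_;
    std::set<std::pair<std::string, std::string>> edges_;
};

// Returns true if the two simulated paths take the locks in conflicting order.
inline bool orders_conflict(bool take_write_lock_first)
{
    order_tracker t;
    bool ok = true;

    // Assumed other path: atomic_write_lock, then buf->lock.
    ok &= t.acquire("atomic_write_lock");
    ok &= t.acquire("buf->lock");
    t.release("buf->lock");
    t.release("atomic_write_lock");

    // Simulated flush path: either the order from the patch above
    // (write lock first) or the order it replaces (buf->lock first).
    if (take_write_lock_first) {
        ok &= t.acquire("atomic_write_lock");
        ok &= t.acquire("buf->lock");
    } else {
        ok &= t.acquire("buf->lock");
        ok &= t.acquire("atomic_write_lock");
    }
    return !ok;
}
```

Calling orders_conflict(false) reports an inversion, while orders_conflict(true) does not, which mirrors the idea of taking atomic_write_lock prior to buf->lock.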
Re: [linux-next / tty] possible circular locking dependency detected
On 06/03/17 11:34, Greg Kroah-Hartman wrote:
On Mon, May 29, 2017 at 12:43:39PM +0200, Vegard Nossum wrote:
On 05/22/17 12:27, Vegard Nossum wrote:
On 05/22/17 12:24, Greg Kroah-Hartman wrote:
On Mon, May 22, 2017 at 04:39:43PM +0900, Sergey Senozhatsky wrote:

Hello,

[ 1274.378287] ==
[ 1274.378289] WARNING: possible circular locking dependency detected
[ 1274.378290] 4.12.0-rc1-next-20170522-dbg-7-gc09b2ab28b74-dirty #1317 Not tainted
[ 1274.378291] --
[ 1274.378293] kworker/u8:5/111 is trying to acquire lock:
[ 1274.378294] (&buf->lock){+.+...}, at: [] tty_buffer_flush+0x34/0x88
[ 1274.378300] but task is already holding lock:
[ 1274.378301] (_tty->termios_rwsem/1){..}, at: [] isig+0x47/0xd2
[ 1274.378307] which lock already depends on the new lock.

Any hint as to what you were doing when this happened? Does this also show up in 4.11?

It's my patch "tty: fix port buffer locking" :-/

At a glance, looks related to pty taking the lock on the other side in a different order. I'll have a closer look.

I can reproduce the lockdep report locally on v4.12-rc3. Looking at it now.

Any ideas? Or should I just revert the original patch?

I think we must revert it for now, as I can easily reproduce not just the lockdep warning but actual hangs. It seems I missed some code paths when I worked on the original patch.

I'm working on a fix.

Vegard
Re: linux-next 20170519 and later - ^S/^Q borkage on ttys.
On 05/31/17 05:48, valdis.kletni...@vt.edu wrote:

Pretty drastic. Hit ^S to pause scrolling, and instantly hung terminal. Seen on both urxvt and xterm under x11, and on virtual console screens.

This appears in dmesg:

[ 1844.182058] INFO: task kworker/u8:3:129 blocked for more than 120 seconds.
[ 1844.182073] Tainted: G OE 4.12.0-rc3-next-20170530 #489
[ 1844.182078] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1844.182085] kworker/u8:3    D11008   129      2 0x
[ 1844.182109] Workqueue: events_unbound flush_to_ldisc
[ 1844.182118] Call Trace:
[ 1844.182136] __schedule+0x43e/0x1020
[ 1844.182147] ? schedule_preempt_disabled+0x27/0xd0
[ 1844.182156] schedule+0x5d/0x1d0
[ 1844.182164] ? __mutex_lock+0x4c9/0x11c0
[ 1844.182172] schedule_preempt_disabled+0x27/0xd0
[ 1844.182179] __mutex_lock+0x4c9/0x11c0
[ 1844.182191] ? tty_port_default_receive_buf+0x58/0xc0
[ 1844.182204] ? ldsem_down_read_trylock+0xc3/0x130
[ 1844.182215] mutex_lock_nested+0x1b/0x20
[ 1844.18] ? mutex_lock_nested+0x1b/0x20
[ 1844.182230] tty_port_default_receive_buf+0x58/0xc0
[ 1844.182240] flush_to_ldisc+0xea/0x220
[ 1844.182249] ? trace_hardirqs_on_caller+0x16/0x290
[ 1844.182262] process_one_work+0x3d6/0xd00
[ 1844.182269] ? lock_acquire+0xae/0x2f0
[ 1844.182284] worker_thread+0x71/0x830
[ 1844.182297] kthread+0x1a9/0x270
[ 1844.182304] ? process_one_work+0xd00/0xd00
[ 1844.182310] ? kthread_create_on_node+0x70/0x70
[ 1844.182321] ret_from_fork+0x27/0x40
[ 1844.182608] INFO: lockdep is turned off.

Bisects down to this commit, and things work when it's reverted.

Commit 925bb1ce47f4.
Author: Vegard Nossum
Date:   Thu May 11 12:18:52 2017 +0200

    tty: fix port buffer locking

    tty_insert_flip_string_fixed_flag() is racy against itself when called from the ioctl(TCXONC, TCION/TCIOFF) path [1] and the flush_to_ldisc() workqueue path [2].

Gah, if it's that easy to trigger a deadlock (as opposed to just a lockdep warning), we should revert the patch until I have a better fix.

^S doesn't seem to reproduce it here, though. Too bad your stack trace doesn't show the process already holding the lock.

Vegard
Re: [linux-next / tty] possible circular locking dependency detected
On 05/22/17 12:27, Vegard Nossum wrote:
On 05/22/17 12:24, Greg Kroah-Hartman wrote:
On Mon, May 22, 2017 at 04:39:43PM +0900, Sergey Senozhatsky wrote:

Hello,

[ 1274.378287] ==
[ 1274.378289] WARNING: possible circular locking dependency detected
[ 1274.378290] 4.12.0-rc1-next-20170522-dbg-7-gc09b2ab28b74-dirty #1317 Not tainted
[ 1274.378291] --
[ 1274.378293] kworker/u8:5/111 is trying to acquire lock:
[ 1274.378294] (&buf->lock){+.+...}, at: [] tty_buffer_flush+0x34/0x88
[ 1274.378300] but task is already holding lock:
[ 1274.378301] (_tty->termios_rwsem/1){..}, at: [] isig+0x47/0xd2
[ 1274.378307] which lock already depends on the new lock.

Any hint as to what you were doing when this happened? Does this also show up in 4.11?

It's my patch "tty: fix port buffer locking" :-/

At a glance, looks related to pty taking the lock on the other side in a different order. I'll have a closer look.

I can reproduce the lockdep report locally on v4.12-rc3. Looking at it now.

Vegard
[PATCH] kthread: fix boot hang (regression) on MIPS/OpenRISC
This fixes a regression in commit 4d6501dce079 where I didn't notice
that MIPS and OpenRISC were reinitialising p->{set,clear}_child_tid to
NULL after our initialisation in copy_process().

We can simply get rid of the arch-specific initialisation here since it
is now always done in copy_process() before hitting copy_thread{,_tls}().

Review notes:

 - As far as I can tell, copy_process() is the only user of
   copy_thread_tls(), which is the only caller of copy_thread() for
   architectures that don't implement copy_thread_tls().

 - After this patch, there is no arch-specific code touching
   p->set_child_tid or p->clear_child_tid whatsoever.

 - It may look like MIPS/OpenRISC wanted to always have these fields be
   NULL, but that's not true, as copy_process() would unconditionally
   set them again _after_ calling copy_thread_tls() before commit
   4d6501dce079.

Fixes: 4d6501dce079c1eb6bf0b1d8f528a5e81770109e ("kthread: Fix use-after-free if kthread fork fails")
Reported-by: Guenter Roeck <li...@roeck-us.net>
Tested-by: Guenter Roeck <li...@roeck-us.net> # MIPS only
Cc: Ralf Baechle <r...@linux-mips.org>
Cc: linux-m...@linux-mips.org
Cc: Jonas Bonn <jo...@southpole.se>
Cc: Stefan Kristiansson <stefan.kristians...@saunalahti.fi>
Cc: Stafford Horne <sho...@gmail.com>
Cc: openr...@lists.librecores.org
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Jamie Iles <jamie.i...@oracle.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
Not sure who this should go through, the last patch went through
tglx/the core-urgent-for-linus tree, but it does touch arch code + fix a
mainline boot hang regression on at least MIPS (Guenter said OpenRISC
didn't seem affected in his boot tests, but the code looks wrong in any
case).

Maybe we could get acks/reviews by MIPS and OpenRISC maintainers?
---
 arch/mips/kernel/process.c     | 1 -
 arch/openrisc/kernel/process.c | 2 --
 2 files changed, 3 deletions(-)

diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 918d4c73e951..5351e1f3950d 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,6 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long usp,
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
 	unsigned long childksp;
-	p->set_child_tid = p->clear_child_tid = NULL;
 
 	childksp = (unsigned long)task_stack_page(p) + THREAD_SIZE - 32;
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index f8da545854f9..106859ae27ff 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -167,8 +167,6 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	top_of_kernel_stack = sp;
 
-	p->set_child_tid = p->clear_child_tid = NULL;
-
 	/* Locate userspace context on stack... */
 	sp -= STACK_FRAME_OVERHEAD;	/* redzone */
 	sp -= sizeof(struct pt_regs);
-- 
2.12.0.rc0
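For readers who want to see the ordering problem in isolation, here is a minimal userspace sketch. It is not kernel code: struct task_model and every function name below are hypothetical stand-ins, and only the two CLONE_* flag values are taken from the kernel's uapi headers. It models nothing more than "generic code initialises the two fields, then an arch hook runs afterwards", which is the order described in the changelog above.

```c
#include <stddef.h>

/* Flag values as defined in the kernel's uapi <linux/sched.h>. */
#define CLONE_CHILD_CLEARTID 0x00200000UL
#define CLONE_CHILD_SETTID   0x01000000UL

/* Hypothetical stand-in for the two task_struct fields under discussion. */
struct task_model {
	int *set_child_tid;
	int *clear_child_tid;
};

/* Models the early initialisation done in copy_process(). */
static void generic_init(struct task_model *p, unsigned long clone_flags,
			 int *child_tidptr)
{
	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
}

/* Models the MIPS/OpenRISC hook before the patch: it resets both fields. */
static void arch_copy_thread_old(struct task_model *p)
{
	p->set_child_tid = p->clear_child_tid = NULL;
}

/* Models the same hook after the patch: the fields are left alone. */
static void arch_copy_thread_fixed(struct task_model *p)
{
	(void)p;
}

/*
 * Runs the generic initialisation followed by the arch hook.
 * Returns 1 if both TID pointers survive, 0 if they were clobbered.
 */
static int tid_pointers_survive(int arch_resets_fields)
{
	static int tid_slot;
	struct task_model p = { NULL, NULL };

	generic_init(&p, CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID, &tid_slot);
	if (arch_resets_fields)
		arch_copy_thread_old(&p);
	else
		arch_copy_thread_fixed(&p);

	return p.set_child_tid == &tid_slot && p.clear_child_tid == &tid_slot;
}
```

Calling tid_pointers_survive(1) models the pre-patch MIPS/OpenRISC behaviour and tid_pointers_survive(0) models the patched one. The sketch says nothing about why losing the pointers leads to a boot hang; it only shows that the later assignment wins.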
Re: mips qemu test failures in -next due to "kthread: Fix use-after-free if kthread fork fails"
On 05/28/17 13:45, Vegard Nossum wrote:
> On 05/27/17 19:56, Guenter Roeck wrote:
>> Hi,
>>
>> my qemu tests of mips images are failing in -next. Symptom is a hang
>> during boot; see http://kerneltests.org/builders/qemu-mips-next for
>> some examples.
>>
>> I bisected the problem in next-20170526. It points to commit
>> 4d6501dce079c ("kthread: Fix use-after-free if kthread fork fails").
>> Reverting that patch fixes the problem. Bisect log is attached.
>
> Hi,
>
> Thanks for the report and sorry for the breakage :-/
>
> I can't immediately spot what's going wrong, but I am able to reproduce
> it on mips so I will try to debug.
>
> Are you sure it's this commit, though? I checked out linus/master and I
> get a boot hang even after reverting it.

My mistake; I ran into a different bug which made me think it was hanging
when it wasn't.

However, I think I found the problem; does this patch fix it for you too?

diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 918d4c73e951..5351e1f3950d 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,6 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long usp,
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
 	unsigned long childksp;
-	p->set_child_tid = p->clear_child_tid = NULL;
 
 	childksp = (unsigned long)task_stack_page(p) + THREAD_SIZE - 32;
 

The problem is that when we moved the p->{set,clear}_child_tid
assignments inside copy_process(), the above assignments would clear
them out. The assignments only exist on mips and openrisc (which would
need the same patch), which explains why I didn't see it in my x86
testing.

I think the patch above should be safe given that we're now always
setting these fields in copy_process() at an appropriate moment.

Looks like those assignments came from commit 3c37026d43c47 ("NPTL,
round one."); Ralf? Oleg?

Vegard
Re: mips qemu test failures in -next due to "kthread: Fix use-after-free if kthread fork fails"
On 05/27/17 19:56, Guenter Roeck wrote:
> Hi,
>
> my qemu tests of mips images are failing in -next. Symptom is a hang
> during boot; see http://kerneltests.org/builders/qemu-mips-next for
> some examples.
>
> I bisected the problem in next-20170526. It points to commit
> 4d6501dce079c ("kthread: Fix use-after-free if kthread fork fails").
> Reverting that patch fixes the problem. Bisect log is attached.

Hi,

Thanks for the report and sorry for the breakage :-/

I can't immediately spot what's going wrong, but I am able to reproduce
it on mips so I will try to debug.

Are you sure it's this commit, though? I checked out linus/master and I
get a boot hang even after reverting it.

Vegard
[tip:core/urgent] kthread: Fix use-after-free if kthread fork fails
Commit-ID:  4d6501dce079c1eb6bf0b1d8f528a5e81770109e
Gitweb:     http://git.kernel.org/tip/4d6501dce079c1eb6bf0b1d8f528a5e81770109e
Author:     Vegard Nossum <vegard.nos...@oracle.com>
AuthorDate: Tue, 9 May 2017 09:39:59 +0200
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Mon, 22 May 2017 22:21:16 +0200

kthread: Fix use-after-free if kthread fork fails

If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

    kthread()
     - worker_thread()
        - process_one_work()
        |  - call_usermodehelper_exec_work()
        |     - kernel_thread()
        |        - _do_fork()
        |           - copy_process()
        |              - dup_task_struct()
        |                 - arch_dup_task_struct()
        |                    - tsk->set_child_tid = current->set_child_tid // implied
        |              - ...
        |              - goto bad_fork_*
        |              - ...
        |              - free_task(tsk)
        |                 - free_kthread_struct(tsk)
        |                    - kfree(tsk->set_child_tid)
        - ...
        - schedule()
           - __schedule()
              - wq_worker_sleeping()
                 - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Debugged-by: Jamie Iles <jamie.i...@oracle.com>
Fixes: 1da5c46fa965 ("kthread: Make struct kthread kmalloc'ed")
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
Acked-by: Oleg Nesterov <o...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Jamie Iles <jamie.i...@oracle.com>
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nos...@oracle.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 kernel/fork.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d681f8f..b7cdea1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1553,6 +1553,18 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!p)
 		goto fork_out;
 
+	/*
+	 * This _must_ happen before we call free_task(), i.e. before we jump
+	 * to any of the bad_fork_* labels. This is to avoid freeing
+	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
+	 * kernel threads (PF_KTHREAD).
+	 */
+	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+	/*
+	 * Clear TID on mm_release()?
+	 */
+	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
@@ -1716,11 +1728,6 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
-	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
-	/*
-	 * Clear TID on mm_release()?
-	 */
-	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
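The inheritance problem in the call chain above can be sketched in userspace without actually freeing anything. This is only a model under stated assumptions: struct task_model, struct kthread_model and all function names are hypothetical, the struct copy stands in for dup_task_struct(), and pointer_freed_on_error() stands in for the free_task() -> free_kthread_struct() path by reporting which pointer would be passed to kfree(). Only the PF_KTHREAD value is taken from the kernel headers.

```c
#include <stddef.h>

#define PF_KTHREAD 0x00200000U	/* value as in the kernel's <linux/sched.h> */

/* Hypothetical stand-in for the per-kthread data. */
struct kthread_model {
	unsigned long flags;
};

/* Hypothetical stand-in for the relevant task_struct fields. */
struct task_model {
	unsigned int flags;
	void *set_child_tid;	/* doubles as the kthread data pointer */
};

/* Models the error path: returns the pointer that would be kfree()d. */
static void *pointer_freed_on_error(const struct task_model *tsk)
{
	return (tsk->flags & PF_KTHREAD) ? tsk->set_child_tid : NULL;
}

/*
 * Simulates a kthread whose fork fails right after the task copy.
 * Returns 1 if the error path would free the PARENT's kthread data.
 */
static int failed_fork_frees_parent_data(int init_early)
{
	struct kthread_model parent_data = { 0 };
	struct task_model parent = { PF_KTHREAD, &parent_data };
	struct task_model child;

	/* dup_task_struct(): the child starts as a plain copy of the parent. */
	child = parent;

	/*
	 * With the fix, the field is set from clone_flags before any failure
	 * can happen; this model assumes CLONE_CHILD_SETTID is not set.
	 */
	if (init_early)
		child.set_child_tid = NULL;

	/* Simulated failure: goto bad_fork_* -> free_task(child). */
	return pointer_freed_on_error(&child) == (void *)&parent_data;
}
```

failed_fork_frees_parent_data(0) corresponds to the old ordering, where the inherited pointer is still in place when the error path runs; failed_fork_frees_parent_data(1) corresponds to initialising the field early, as the patch does.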
Re: [linux-next / tty] possible circular locking dependency detected
On 05/22/17 12:24, Greg Kroah-Hartman wrote:
> On Mon, May 22, 2017 at 04:39:43PM +0900, Sergey Senozhatsky wrote:
>> Hello,
>>
>> [ 1274.378287] ======================================================
>> [ 1274.378289] WARNING: possible circular locking dependency detected
>> [ 1274.378290] 4.12.0-rc1-next-20170522-dbg-7-gc09b2ab28b74-dirty #1317 Not tainted
>> [ 1274.378291] ------------------------------------------------------
>> [ 1274.378293] kworker/u8:5/111 is trying to acquire lock:
>> [ 1274.378294]  (&buf->lock){+.+...}, at: [] tty_buffer_flush+0x34/0x88
>> [ 1274.378300] but task is already holding lock:
>> [ 1274.378301]  (&o_tty->termios_rwsem/1){..}, at: [] isig+0x47/0xd2
>> [ 1274.378307] which lock already depends on the new lock.
>
> Any hint as to what you were doing when this happened?
>
> Does this also show up in 4.11?

It's my patch "tty: fix port buffer locking" :-/

At a glance, looks related to pty taking the lock on the other side in a
different order. I'll have a closer look.

Vegard
[PATCH] tty: fix port buffer locking
tty_insert_flip_string_fixed_flag() is racy against itself when called
from the ioctl(TCXONC, TCION/TCIOFF) path [1] and the flush_to_ldisc()
workqueue path [2].

The problem is that port->buf.tail->used is modified without consistent
locking; the ioctl path takes tty->atomic_write_lock, whereas the
workqueue path takes ldata->output_lock.

We cannot simply take ldata->output_lock, since that is specific to the
N_TTY line discipline.

It might seem natural to try to take port->buf.lock inside
tty_insert_flip_string_fixed_flag() and friends (where port->buf is
actually used/modified), but this creates problems for flush_to_ldisc()
which takes it before grabbing tty->ldisc_sem, o_tty->termios_rwsem,
and ldata->output_lock.

Therefore, the simplest solution for now seems to be to take
tty->atomic_write_lock inside tty_port_default_receive_buf(). This lock
is also used in the write path [3] with a consistent ordering.

[1]: Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 tty_send_xchar                     // down_read(&o_tty->termios_rwsem)
                                    // mutex_lock(&tty->atomic_write_lock)
 n_tty_ioctl_helper
 n_tty_ioctl
 tty_ioctl                          // down_read(&tty->ldisc_sem)
 do_vfs_ioctl
 SyS_ioctl

[2]: Workqueue: events_unbound flush_to_ldisc
Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 tty_put_char
 __process_echoes
 commit_echoes                      // mutex_lock(&ldata->output_lock)
 n_tty_receive_buf_common
 n_tty_receive_buf2
 tty_ldisc_receive_buf              // down_read(&o_tty->termios_rwsem)
 tty_port_default_receive_buf       // down_read(&tty->ldisc_sem)
 flush_to_ldisc                     // mutex_lock(&port->buf.lock)
 process_one_work

[3]: Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 n_tty_write                        // mutex_lock(&ldata->output_lock)
                                    // down_read(&tty->termios_rwsem)
 do_tty_write (inline)              // mutex_lock(&tty->atomic_write_lock)
 tty_write                          // down_read(&tty->ldisc_sem)
 __vfs_write
 vfs_write
 SyS_write

The bug can result in about a dozen different crashes depending on what
exactly gets corrupted when port->buf.tail->used points outside the
buffer.

The patch passes my LOCKDEP/PROVE_LOCKING testing but more testing is
always welcome.

Found using syzkaller.

Cc: <sta...@vger.kernel.org>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 drivers/tty/tty_port.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/tty/tty_port.c b/drivers/tty/tty_port.c
index 1d21a9c1d33e..ef4dd596b864 100644
--- a/drivers/tty/tty_port.c
+++ b/drivers/tty/tty_port.c
@@ -34,7 +34,9 @@ static int tty_port_default_receive_buf(struct tty_port *port,
 	if (!disc)
 		return 0;
 
+	mutex_lock(&tty->atomic_write_lock);
 	ret = tty_ldisc_receive_buf(disc, p, (char *)f, count);
+	mutex_unlock(&tty->atomic_write_lock);
 
 	tty_ldisc_deref(disc);
 
-- 
2.12.0.rc0
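The "consistent ordering" argument in this changelog can be checked mechanically with a toy lock-order graph. The sketch below is an illustration only and is far simpler than lockdep: it treats each lock as a single class, deliberately merges tty and o_tty, and ignores lock subclasses. The enum, the function names and the path arrays are all hypothetical; the acquisition orders are my reading of traces [1] to [3] (bottom to top) with the new atomic_write_lock added to the workqueue path. The extra "reversed" path is an assumption used to show what a cycle looks like; it has the same shape as the lockdep report elsewhere in this thread (termios_rwsem held while taking the buffer lock).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lock classes; tty and o_tty are merged on purpose. */
enum lock_class {
	BUF_LOCK,
	LDISC_SEM,
	ATOMIC_WRITE_LOCK,
	TERMIOS_RWSEM,
	OUTPUT_LOCK,
	NR_LOCK_CLASSES
};

/* taken_before[a][b] is true if some path takes b while holding a. */
static bool taken_before[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

/* Record one code path, given as locks in acquisition order. */
static void record_path(const enum lock_class *path, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++)
		taken_before[path[i]][path[i + 1]] = true;
}

/* A cycle exists if any lock can reach itself in the transitive closure. */
static bool has_cycle(void)
{
	bool reach[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

	memcpy(reach, taken_before, sizeof(reach));
	for (int k = 0; k < NR_LOCK_CLASSES; k++)
		for (int i = 0; i < NR_LOCK_CLASSES; i++)
			for (int j = 0; j < NR_LOCK_CLASSES; j++)
				if (reach[i][k] && reach[k][j])
					reach[i][j] = true;

	for (int i = 0; i < NR_LOCK_CLASSES; i++)
		if (reach[i][i])
			return true;
	return false;
}

/* Builds the graph for the three patched paths, optionally plus a reversed one. */
static bool orders_have_cycle(bool add_reversed_path)
{
	static const enum lock_class ioctl_path[] = {
		LDISC_SEM, ATOMIC_WRITE_LOCK, TERMIOS_RWSEM
	};
	static const enum lock_class workqueue_path[] = {
		BUF_LOCK, LDISC_SEM, ATOMIC_WRITE_LOCK, TERMIOS_RWSEM, OUTPUT_LOCK
	};
	static const enum lock_class write_path[] = {
		LDISC_SEM, ATOMIC_WRITE_LOCK, TERMIOS_RWSEM, OUTPUT_LOCK
	};
	/* Assumed extra path: buffer lock taken while termios_rwsem is held. */
	static const enum lock_class reversed_path[] = {
		TERMIOS_RWSEM, BUF_LOCK
	};

	memset(taken_before, 0, sizeof(taken_before));
	record_path(ioctl_path, 3);
	record_path(workqueue_path, 5);
	record_path(write_path, 4);
	if (add_reversed_path)
		record_path(reversed_path, 2);

	return has_cycle();
}
```

orders_have_cycle(false) covers only the three paths listed in the changelog; orders_have_cycle(true) adds the assumed reversed path. Because the model collapses both ends of a pty into one tty, it cannot say whether a real deadlock is possible; it only shows how an ordering conflict shows up as a cycle.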
[PATCH] tracing: use %pF in trace_dump_stack()
When using trace_dump_stack() you currently just get a list of function
names. It can be very useful to know exactly where a call came from,
especially if there are multiple calls from one function to another.

By switching trace_dump_stack() to use %pF we get the function name and
the offset, which can also be further processed to give exact line
number information, like this:

    <...>-10873 3270529us :
 => pty_write+0x45/0x50
 => n_tty_write+0x358/0x470
 => tty_write+0x189/0x2f0
 => __vfs_write+0x23/0x120
 => vfs_write+0xb3/0x1b0
 => SyS_write+0x44/0xa0
 => entry_SYSCALL_64_fastpath+0x18/0xad

    $ scripts/faddr2line vmlinux tty_write+0x189/0x2f0
    tty_write+0x189/0x2f0:
    do_tty_write at drivers/tty/tty_io.c:1174
     (inlined by) tty_write at drivers/tty/tty_io.c:1257

Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 kernel/trace/trace_output.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 02a4aeb22c47..879909efed33 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1073,9 +1073,7 @@ static enum print_line_t trace_stack_print(struct trace_iterator *iter,
 		if (trace_seq_has_overflowed(s))
 			break;
 
-		trace_seq_puts(s, " => ");
-		seq_print_ip_sym(s, *p, flags);
-		trace_seq_putc(s, '\n');
+		trace_seq_printf(s, " => %pF\n", (void *) *p);
 	}
 
 	return trace_handle_return(s);
-- 
2.12.0.rc0
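To illustrate the "name+offset/size" form shown in the example output, here is a small userspace sketch. It does not reimplement %pF: in the kernel the lookup goes through the symbol table, whereas this model uses a made-up three-entry table. The struct, the function names and the start addresses are hypothetical; only the sizes and the 0x189 offset are chosen to match the sample trace above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical symbol table entry: name, start address, size in bytes. */
struct sym {
	const char *name;
	unsigned long start;
	unsigned long size;
};

/* Made-up addresses; sizes match the example trace. */
static const struct sym symtab[] = {
	{ "pty_write",   0x1000UL, 0x50UL  },
	{ "n_tty_write", 0x2000UL, 0x470UL },
	{ "tty_write",   0x3000UL, 0x2f0UL },
};

/* Formats addr as "name+0xoffset/0xsize", or as a bare address if unknown. */
static int format_sym_offset(char *buf, size_t len, unsigned long addr)
{
	for (size_t i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
		const struct sym *s = &symtab[i];

		if (addr >= s->start && addr - s->start < s->size)
			return snprintf(buf, len, "%s+0x%lx/0x%lx",
					s->name, addr - s->start, s->size);
	}
	return snprintf(buf, len, "0x%lx", addr);
}

/* Convenience helper: returns 1 if addr formats exactly as expected. */
static int formats_as(unsigned long addr, const char *expected)
{
	char buf[64];

	format_sym_offset(buf, sizeof(buf), addr);
	return strcmp(buf, expected) == 0;
}
```

With this table, address 0x3189 falls 0x189 bytes into the hypothetical tty_write entry, which is the kind of string scripts/faddr2line takes as input in the example above.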
Re: [PATCH 0/4] S390: Fine-tuning for six function implementations
On 05/07/17 19:12, SF Markus Elfring wrote:
> From: Markus Elfring
> Date: Sun, 7 May 2017 19:00:09 +0200
>
> A few update suggestions were taken into account
> from static source code analysis.
>
> Markus Elfring (4):
>   Combine two function calls into one in show_cacheinfo()
>   Use seq_putc() in show_cpu_summary()
>   Replace six seq_printf() calls by seq_puts()
>   Combine two function calls into one at four places
>
>  arch/s390/kernel/cache.c     |  4 ++--
>  arch/s390/kernel/processor.c |  2 +-
>  arch/s390/kernel/sysinfo.c   | 25 +++++++++++--------------
>  3 files changed, 14 insertions(+), 17 deletions(-)

I'm sorry, I wouldn't normally respond to this, but I was put on the Cc
after all so I'll give my feedback.

I think these patches are a waste of time and resources.

It would be different if your patches fixed actual bugs. These are just
mindless code transformations that MAY in the best case save a few bytes
of code here and there (I don't know; you didn't say).

But the potential gains from these incredibly numerous and tiny patches
that don't fix anything are so small, it's a waste of time, bandwidth,
and mental capacity for you and for everybody involved.

I just searched my inbox for patches from you and you sent literally
_hundreds_ over the past few days, all doing this crazy
printf/puts/putc transformation.

Another bit of searching and I see that I'm not the first one giving you
this response:

https://lkml.org/lkml/2017/1/23/383 - Jens Axboe
https://lkml.org/lkml/2017/1/23/262 - Johannes Thumshirn
https://lkml.org/lkml/2017/1/12/513 - Cyrille Pitchen
https://lkml.org/lkml/2016/10/24/491 - Theodore Ts'o
https://lkml.org/lkml/2016/10/7/148 - Dan Carpenter
https://lkml.org/lkml/2016/9/14/58 - Christian Borntraeger

...and I'm sure there are many more.

Vegard
[PATCH v2] kthread: fix use-after-free if kthread fork fails
If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

    kthread()
     - worker_thread()
        - process_one_work()
        |  - call_usermodehelper_exec_work()
        |     - kernel_thread()
        |        - _do_fork()
        |           - copy_process()
        |              - dup_task_struct()
        |                 - arch_dup_task_struct()
        |                    - tsk->set_child_tid = current->set_child_tid // implied
        |              - ...
        |              - goto bad_fork_*
        |              - ...
        |              - free_task(tsk)
        |                 - free_kthread_struct(tsk)
        |                    - kfree(tsk->set_child_tid)
        - ...
        - schedule()
           - __schedule()
              - wq_worker_sleeping()
                 - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Fixes: 1da5c46fa965ff90f5ffc080b6ab3fae5e227bc3 ("kthread: Make struct kthread kmalloc'ed")
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Andy Lutomirski <l...@kernel.org>
Debugged-by: Jamie Iles <jamie.i...@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 kernel/fork.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index dd5a371c392a..03b2f9606a54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1554,6 +1554,18 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!p)
 		goto fork_out;
 
+	/*
+	 * This _must_ happen before we call free_task(), i.e. before we jump
+	 * to any of the bad_fork_* labels. This is to avoid freeing
+	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
+	 * kernel threads (PF_KTHREAD).
+	 */
+	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+	/*
+	 * Clear TID on mm_release()?
+	 */
+	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
@@ -1720,11 +1732,6 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
-	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
-	/*
-	 * Clear TID on mm_release()?
-	 */
-	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
-- 
2.12.0.rc0
Re: [PATCH] kthread: fix use-after-free if kthread fork fails
On 05/05/17 18:44, Oleg Nesterov wrote:
> On 05/05, Vegard Nossum wrote:
>>
>> If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
>> fails in copy_process() between calling dup_task_struct() and setting
>> p->set_child_tid, then the value of p->set_child_tid will be inherited
>> from the parent and get prematurely freed by free_kthread_struct().
>
> Aaah... thanks!
>
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -518,6 +518,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>>  	atomic_set(&tsk->stack_refcount, 1);
>>  #endif
>>
>> +	/*
>> +	 * Forking kthreads (e.g. usermodehelper) should not inherit this
>> +	 * field since it's a pointer to a 'struct kthread' which is not
>> +	 * reference counted.
>> +	 */
>> +	tsk->set_child_tid = NULL;
>> +
>
> Can't we just move both
>
> 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> 	/*
> 	 * Clear TID on mm_release()?
> 	 */
> 	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
>
> lines here?

clone_flags is not available in dup_task_struct(), but we could move
those lines higher in copy_process().

The reason we didn't do it was that we thought it was a little
fragile/unobvious that this has to happen before free_task() is called
and that it was safer to clear it in dup_task_struct() (which also
contains zeroing of other fields).

The newly attached patch has been tested and seems to work, if you
prefer it.

Vegard

diff --git a/kernel/fork.c b/kernel/fork.c
index fbdc29365b83..c52e22fdf7ca 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1561,6 +1561,18 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!p)
 		goto fork_out;
 
+	/*
+	 * This _must_ happen before we call free_task(), i.e. before we jump
+	 * to any of the bad_fork_* labels. This is to avoid freeing
+	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
+	 * kernel threads (PF_KTHREAD).
+	 */
+	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+	/*
+	 * Clear TID on mm_release()?
+	 */
+	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
@@ -1727,11 +1739,6 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
-	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
-	/*
-	 * Clear TID on mm_release()?
-	 */
-	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
[PATCH] kthread: fix use-after-free if kthread fork fails
If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

    kthread()
     - worker_thread()
        - process_one_work()
        |  - call_usermodehelper_exec_work()
        |     - kernel_thread()
        |        - _do_fork()
        |           - copy_process()
        |              - dup_task_struct()
        |                 - arch_dup_task_struct()
        |                    - tsk->set_child_tid = current->set_child_tid // implied
        |              - ...
        |              - goto bad_fork_*
        |              - ...
        |              - free_task(tsk)
        |                 - free_kthread_struct(tsk)
        |                    - kfree(tsk->set_child_tid)
        - ...
        - schedule()
           - __schedule()
              - wq_worker_sleeping()
                 - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Fixes: 1da5c46fa965ff90f5ffc080b6ab3fae5e227bc3 ("kthread: Make struct kthread kmalloc'ed")
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Andy Lutomirski <l...@kernel.org>
Debugged-by: Jamie Iles <jamie.i...@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 kernel/fork.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index dd5a371c392a..fbdc29365b83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -518,6 +518,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	atomic_set(&tsk->stack_refcount, 1);
 #endif
 
+	/*
+	 * Forking kthreads (e.g. usermodehelper) should not inherit this
+	 * field since it's a pointer to a 'struct kthread' which is not
+	 * reference counted.
+	 */
+	tsk->set_child_tid = NULL;
+
 	if (err)
 		goto free_stack;
 
-- 
2.12.0.rc0
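For readers who want to see the failure mode outside the kernel, here is a minimal userspace sketch of the inherited-pointer problem described in the changelog. Everything in it is an assumption made for illustration: struct task, struct kthread_data, dup_task(), free_task_model() and failed_fork_frees_parent_data() are hypothetical stand-ins, not kernel code. The model only detects that the child would have freed the parent's data; it never actually frees it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct kthread_data {
	unsigned long flags;
};

struct task {
	int is_kthread;		/* models PF_KTHREAD */
	void *set_child_tid;	/* (ab)used as the kthread data pointer */
};

/*
 * Models dup_task_struct(): a plain struct copy, so every pointer in the
 * parent is inherited by the child. With clear_inherited set, it applies
 * the same idea as the patch: the child must not inherit the pointer.
 */
static struct task *dup_task(const struct task *parent, int clear_inherited)
{
	struct task *child = malloc(sizeof(*child));

	if (!child)
		return NULL;
	memcpy(child, parent, sizeof(*child));
	if (clear_inherited)
		child->set_child_tid = NULL;
	return child;
}

/*
 * Models free_task() -> free_kthread_struct() on the fork error path.
 * Returns 1 if the child still aliased the parent's data when it was
 * torn down; the real kernel would kfree() that data at this point.
 */
static int free_task_model(struct task *child, const struct task *parent)
{
	int aliased = child->is_kthread &&
		      child->set_child_tid != NULL &&
		      child->set_child_tid == parent->set_child_tid;

	/* The aliased pointer is deliberately not freed in this model. */
	free(child);
	return aliased;
}

/* Returns 1 if a failed fork would have freed the parent's data. */
static int failed_fork_frees_parent_data(int clear_inherited)
{
	struct kthread_data data = { 0 };
	struct task parent = { 1, &data };
	struct task *child = dup_task(&parent, clear_inherited);

	if (!child)
		return -1;
	/* Pretend copy_process() hit a bad_fork_* label right here. */
	return free_task_model(child, &parent);
}
```

Calling failed_fork_frees_parent_data(0) models the unpatched path, where the error path sees the parent's pointer; calling it with 1 models the patched path, where the child's field is already NULL by the time the error path runs.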
Re: [GIT PULL] TTY/Serial driver fixes for 4.11-rc4
On 2 May 2017 at 18:35, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Fri, Apr 14, 2017 at 2:30 PM, Greg KH <gre...@linuxfoundation.org> wrote:
>> On Fri, Apr 14, 2017 at 11:41:26AM +0200, Vegard Nossum wrote:
>>> On 13 April 2017 at 20:34, Greg KH <gre...@linuxfoundation.org> wrote:
>>> > On Thu, Apr 13, 2017 at 09:07:40AM -0700, Linus Torvalds wrote:
>>> >> On Thu, Apr 13, 2017 at 3:50 AM, Vegard Nossum <vegard.nos...@gmail.com>
>>> >> wrote:
>>>
>>> So the original problem is that the vmalloc() in n_tty_open() can
>>> fail, and that will panic in tty_set_ldisc()/tty_ldisc_restore()
>>> because of its unwillingness to proceed if the tty doesn't have an
>>> ldisc.
>>>
>>> Dmitry fixed this by allowing tty->ldisc == NULL in the case of memory
>>> allocation failure as we can see from the comment in tty_set_ldisc().
>>>
>>> Unfortunately, it would appear that some other bits of code do not
>>> like tty->ldisc == NULL (other than the crash in this thread, I saw
>>> 2-3 similar crashes in other functions, e.g. poll()). I see two
>>> possibilities:
>>>
>>> 1) make other code handle tty->ldisc == NULL.
>>>
>>> 2) don't close/free the old ldisc until the new one has been
>>> successfully created/initialised/opened/attached to the tty, and
>>> return an error to userspace if changing it failed.
>>>
>>> I'm leaning towards #2 as the more obviously correct fix, it makes
>>> tty_set_ldisc() transactional, the fix seems limited in scope to
>>> tty_set_ldisc() itself, and we don't need to make every other bit of
>>> code that uses tty->ldisc handle the NULL case.
>>
>> That sounds reasonable to me, care to work on a patch for this?
>
> Vegard, do you know how to do this?
> That was first thing that I tried, but I did not manage to make it
> work. disc is tied to tty, so it's not that one can create a fully
> initialized disc on the side and then simply swap pointers. Looking at
> the code now, there is at least TTY_LDISC_OPEN bit in tty. But as far
> as I remember there were more fundamental problems. Or maybe I just
> did not try too hard.

I had a look at it but like you said, the tty/ldisc relationship is
complicated :-/

Maybe we can split up ldisc initialisation into two methods so that
the first one (e.g. ->alloc) does all the allocation and is allowed to
fail and the second one (e.g. ->open) is not allowed to fail. Then you
can allocate a new ldisc without freeing the old one and only swap them
over if the allocation succeeded.

That would require fixing up ->open for all the ldisc drivers though,
I'm not sure how easy/feasible it is.

I'll think about possible solutions, but I have no prior experience
with the tty code.

In the meantime syzkaller also hit a couple of other fun tty/pty bugs
including a write/ioctl race that results in buffer overflow :-/

Vegard
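The ->alloc / ->open split suggested in the message above can be sketched in userspace as a two-phase swap. This is only an illustration under stated assumptions: struct ldisc, struct tty, ldisc_alloc(), ldisc_free() and tty_set_ldisc_model() are hypothetical names, the simulate_oom flag stands in for a failing allocation, and none of the real tty locking or the TTY_LDISC_OPEN state mentioned in the thread is modelled.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the kernel's tty structures. */
struct ldisc {
	int num;	/* line discipline number */
	void *data;	/* per-ldisc buffer */
};

struct tty {
	struct ldisc *ldisc;
};

/*
 * Phase 1 (the "->alloc" idea): does every allocation, is allowed to
 * fail, and never touches the tty. simulate_oom forces a failure.
 */
static struct ldisc *ldisc_alloc(int num, size_t data_size, int simulate_oom)
{
	struct ldisc *ld;

	if (simulate_oom)
		return NULL;
	ld = malloc(sizeof(*ld));
	if (!ld)
		return NULL;
	ld->data = calloc(1, data_size);
	if (!ld->data) {
		free(ld);
		return NULL;
	}
	ld->num = num;
	return ld;
}

static void ldisc_free(struct ldisc *ld)
{
	if (ld) {
		free(ld->data);
		free(ld);
	}
}

/*
 * Phase 2 (the "->open" idea): only pointer swaps and frees, so it
 * cannot fail. The old ldisc is released only after the new one exists,
 * which is what makes the change transactional.
 */
static int tty_set_ldisc_model(struct tty *tty, int num, int simulate_oom)
{
	struct ldisc *new_ld = ldisc_alloc(num, 4096, simulate_oom);
	struct ldisc *old_ld;

	if (!new_ld)
		return -1;	/* old ldisc untouched; caller sees an error */

	old_ld = tty->ldisc;
	tty->ldisc = new_ld;
	ldisc_free(old_ld);
	return 0;
}
```

The point of the design is that tty->ldisc is never NULL at any moment a reader could observe it: on failure the old pointer stays in place, on success it is replaced in a single assignment.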
Re: [git pull] vfs fixes
On 9 April 2017 at 07:40, Al Viro wrote:
>
> The following changes since commit a71c9a1c779f2499fb2afc0553e543f18aff6edf:
>
>   Linux 4.11-rc5 (2017-04-02 17:23:54 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus
>
> for you to fetch changes up to a8e28440016bfb23bec266c4c66eacca6ea2d48b:
>
>   Merge branch 'work.statx' into for-next (2017-04-03 01:06:59 -0400)
>
>
> Al Viro (2):
>       alpha: fix stack smashing in old_adjtimex(2)
>       Merge branch 'work.statx' into for-next

I'm seeing the same memfd_create/name_to_handle_at/path_lookupat
use-after-free that Dmitry was seeing here:

https://lkml.org/lkml/2017/3/4/118

I haven't tried the patch from that thread yet, but was there any
reason for it not to get merged so far?

Vegard
Re: [GIT PULL] TTY/Serial driver fixes for 4.11-rc4
On 13 April 2017 at 20:34, Greg KH <gre...@linuxfoundation.org> wrote:
> On Thu, Apr 13, 2017 at 09:07:40AM -0700, Linus Torvalds wrote:
>> On Thu, Apr 13, 2017 at 3:50 AM, Vegard Nossum <vegard.nos...@gmail.com>
>> wrote:
>> >
>> > I've bisected a syzkaller crash down to this commit
>> > (5362544bebe85071188dd9e479b5a5040841c895). The crash is:
>> >
>> > [ 25.137552] BUG: unable to handle kernel paging request at
>> > 2280
>> > [ 25.137579] IP: mutex_lock_interruptible+0xb/0x30
>>
>> It would seem to be the
>>
>>     if (mutex_lock_interruptible(&ldata->atomic_read_lock))
>>
>> call in n_tty_read(), the offset is about right for a NULL 'ldata'
>> pointer (it's a big structure, it has a couple of character buffers of
>> size N_TTY_BUF_SIZE).
>>
>> I don't see the obvious fix, so I suspect at this point we should just
>> revert, as that commit seems to introduce worse problems that it is
>> supposed to fix. Greg?
>
> Unless Dmitry has a better idea, I will just revert it and send you the
> pull request in a day or so.

I don't think we need to rush a revert, I'd hope there's a way to fix
it properly.

So the original problem is that the vmalloc() in n_tty_open() can
fail, and that will panic in tty_set_ldisc()/tty_ldisc_restore()
because of its unwillingness to proceed if the tty doesn't have an
ldisc.

Dmitry fixed this by allowing tty->ldisc == NULL in the case of memory
allocation failure as we can see from the comment in tty_set_ldisc().

Unfortunately, it would appear that some other bits of code do not
like tty->ldisc == NULL (other than the crash in this thread, I saw
2-3 similar crashes in other functions, e.g. poll()). I see two
possibilities:

1) make other code handle tty->ldisc == NULL.

2) don't close/free the old ldisc until the new one has been
successfully created/initialised/opened/attached to the tty, and
return an error to userspace if changing it failed.

I'm leaning towards #2 as the more obviously correct fix, it makes
tty_set_ldisc() transactional, the fix seems limited in scope to
tty_set_ldisc() itself, and we don't need to make every other bit of
code that uses tty->ldisc handle the NULL case.

Vegard
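The quoted remark that the fault address "is about right for a NULL 'ldata' pointer" rests on a general fact: dereferencing a member through a NULL struct pointer faults at that member's offset. A small sketch of the arithmetic follows. The layout is a hypothetical model and is not the real struct n_tty_data; SKETCH_BUF_SIZE is an assumed stand-in for N_TTY_BUF_SIZE, and the long member merely stands in for the mutex.

```c
#include <stddef.h>

#define SKETCH_BUF_SIZE 4096	/* assumption: stands in for N_TTY_BUF_SIZE */

/* Hypothetical layout for illustration; NOT the real struct n_tty_data. */
struct ldata_model {
	size_t read_head;
	char read_buf[SKETCH_BUF_SIZE];
	char echo_buf[SKETCH_BUF_SIZE];
	long atomic_read_lock;	/* stands in for the mutex */
};

/*
 * The address that an access to the lock member would touch if the
 * struct pointer were NULL: simply the member's offset in the struct.
 */
static size_t null_deref_fault_addr(void)
{
	return offsetof(struct ldata_model, atomic_read_lock);
}
```

With two buffers of 4096 bytes ahead of the lock, the offset in this model is already above 0x2000, which is why a small but clearly non-zero fault address is consistent with a NULL base pointer rather than a wild one.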
Re: [GIT PULL] TTY/Serial driver fixes for 4.11-rc4
On 26 March 2017 at 13:04, Greg KH wrote:
> The following changes since commit 4495c08e84729385774601b5146d51d9e5849f81:
>
>   Linux 4.11-rc2 (2017-03-12 14:47:08 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git/
> tags/tty-4.11-rc4
>
> for you to fetch changes up to a4a3e061149f09c075f108b6f1cf04d9739a6bc2:
>
>   tty: fix data race in tty_ldisc_ref_wait() (2017-03-17 14:07:10 +0900)
>
>
> TTY/Serial driver fixes for 4.11-rc4
>
> Here are some tty and serial driver fixes for 4.11-rc4. One of these
> fix a long-standing issue in the ldisc code that was found by Dmitry
> Vyukov with his great fuzzing work. The other fixes resolve other
> reported issues, and there is one revert of a patch in 4.11-rc1 that
> wasn't correct.
>
> All of these have been in linux-next for a while with no reported
> issues.
>
> Signed-off-by: Greg Kroah-Hartman
>
>
> Aleksey Makarov (1):
>       Revert "tty: serial: pl011: add ttyAMA for matching pl011 console"
>
> Dmitry Vyukov (2):
>       tty: don't panic on OOM in tty_set_ldisc()

I've bisected a syzkaller crash down to this commit
(5362544bebe85071188dd9e479b5a5040841c895). The crash is:

[ 25.137552] BUG: unable to handle kernel paging request at 2280
[ 25.137579] IP: mutex_lock_interruptible+0xb/0x30
[ 25.137589] PGD 3b0c067
[ 25.137593] PUD 3911067
[ 25.137597] PMD 0
[ 25.137601]
[ 25.137611] Oops: 0002 [#1] PREEMPT SMP KASAN
[ 25.137624] CPU: 1 PID: 3690 Comm: a.out Not tainted 4.11.0-rc2+ #145
[ 25.137631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 25.137639] task: 880003b96400 task.stack: 880004e98000
[ 25.137651] RIP: 0010:mutex_lock_interruptible+0xb/0x30
[ 25.137657] RSP: 0018:880004e9fae0 EFLAGS: 00010246
[ 25.137668] RAX: RBX: 880004e6c000 RCX: 817bb2a9
[ 25.137675] RDX: 880003b96400 RSI: 0015 RDI: 2280
[ 25.137696] RBP: 880004e9fca0 R08: 0003 R09: 0002
[ 25.137703] R10: 0002 R11: edc23fe9 R12: 880004e6c000
[ 25.137710] R13: 80045430 R14: 880004bac900 R15: 880004bacb60
[ 25.137720] FS: 7f7cac233700() GS:88000610() knlGS:
[ 25.137727] CS: 0010 DS: ES: CR0: 80050033
[ 25.137733] CR2: 2280 CR3: 03b67000 CR4: 06e0
[ 25.137746] DR0: DR1: DR2:
[ 25.137752] DR3: DR6: fffe0ff0 DR7: 0400
[ 25.137755] Call Trace:
[ 25.137769] ? n_tty_read+0x15f/0xc70
[ 25.137783] ? preempt_count_add+0xb2/0xe0
[ 25.137793] ? n_tty_flush_buffer+0x90/0x90
[ 25.137806] ? wait_woken+0x100/0x100
[ 25.137817] tty_read+0xd8/0x140
[ 25.137830] __vfs_read+0xd1/0x320
[ 25.137842] ? do_sendfile+0x6c0/0x6c0
[ 25.137853] ? __fsnotify_update_child_dentry_flags+0x30/0x30
[ 25.137864] ? selinux_file_permission+0x1c0/0x210
[ 25.137873] ? __fsnotify_parent+0x27/0x130
[ 25.137882] ? security_file_permission+0xce/0xf0
[ 25.137893] ? rw_verify_area+0x73/0x140
[ 25.137904] vfs_read+0xba/0x1b0
[ 25.137915] SyS_read+0xa0/0x120
[ 25.137926] ? vfs_write+0x260/0x260
[ 25.137938] ? preempt_count_sub+0x13/0xd0
[ 25.137949] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 25.137957] RIP: 0033:0x7f7caf61351d
[ 25.137963] RSP: 002b:7f7cac232f20 EFLAGS: 0293 ORIG_RAX:
[ 25.137974] RAX: ffda RBX: 7f7cac233700 RCX: 7f7caf61351d
[ 25.137980] RDX: 003e RSI: 80045430 RDI: 0004
[ 25.137987] RBP: 7fffb4f21250 R08: 7f7cac233700 R09: 7f7cac233700
[ 25.137993] R10: 7f7cac2339d0 R11: 0293 R12:
[ 25.137999] R13: 7fffb4f2124f R14: 7f7cac2339c0 R15:
[ 25.138002] Code: c7 43 20 00 00 00 00 48 89 df e8 91 ff ff ff 5b 41 5c 5d c3 83 e8 01 41 89 44 24 10 eb e1 66 90 65 48 8b 14 25 40 54 01 00 31 c0 48 0f b1 17 48 85 c0 74 0a 55 48 89 e5 e8 e2 f4 ff ff 5d f3
[ 25.138218] RIP: mutex_lock_interruptible+0xb/0x30 RSP: 880004e9fae0
[ 25.138221] CR2: 2280
[ 25.138301] ---[ end trace 242fd54c56b177b4 ]---

The syzkaller reproducer is:

# {Threaded:true Collide:true Repeat:true Procs:1 Sandbox:setuid Repro:false}
mmap(&(0x7f00/0x9f000)=nil, (0x9f000), 0x3, 0x32, 0x, 0x0)
r0 = openat$ptmx(0xff9c, &(0x7f001000-0xa)="2f6465762f70746d7800", 0x201, 0x0)
ioctl$TIOCSPTLCK(r0, 0x40045431, &(0x7f09a000)=0x0)
r1 = syz_open_pts(r0, 0x0)
read(r1,
Re: [PATCH] hugetlbfs: fix offset overflow in huegtlbfs mmap
On 12 April 2017 at 00:51, Mike Kravetz <mike.krav...@oracle.com> wrote:
> If mmap() maps a file, it can be passed an offset into the file at
> which the mapping is to start. Offset could be a negative value when
> represented as a loff_t. The offset plus length will be used to
> update the file size (i_size) which is also a loff_t. Validate the
> value of offset and offset + length to make sure they do not overflow
> and appear as negative.
>
> Found by syzcaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
> region_abort if region_chg fails") applied. Prior to this commit, the
> overflow would still occur but we would luckily return ENOMEM.
> To reproduce:
> mmap(0, 0x2000, 0, 0x40021, 0xULL, 0x8000ULL);
>
> Resulted in,
> kernel BUG at mm/hugetlb.c:742!
> Call Trace:
>  hugetlbfs_evict_inode+0x80/0xa0
>  ? hugetlbfs_setattr+0x3c0/0x3c0
>  evict+0x24a/0x620
>  iput+0x48f/0x8c0
>  dentry_unlink_inode+0x31f/0x4d0
>  __dentry_kill+0x292/0x5e0
>  dput+0x730/0x830
>  __fput+0x438/0x720
>  fput+0x1a/0x20
>  task_work_run+0xfe/0x180
>  exit_to_usermode_loop+0x133/0x150
>  syscall_return_slowpath+0x184/0x1c0
>  entry_SYSCALL_64_fastpath+0xab/0xad
>
> Reported-by: Vegard Nossum <vegard.nos...@gmail.com>

Please use <vegard.nos...@oracle.com> if possible :-)

> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> ---
>  fs/hugetlbfs/inode.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 7163fe0..dde8613 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
>  	vma->vm_ops = &hugetlb_vm_ops;
>
> +	/*
> +	 * Offset passed to mmap (before page shift) could have been
> +	 * negative when represented as a (l)off_t.
> +	 */
> +	if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> +		return -EINVAL;
> +

This is strictly speaking undefined behaviour in C and would get
flagged by e.g. UBSAN. The kernel does compile with
-fno-strict-overflow when supported, though, so maybe it's more of a
theoretical issue.

Another thing: wouldn't we want to detect all truncations, not just
the ones that happen to end up negative? For example (with
-fno-strict-overflow), (0x12345678 << 12) == 0x45678000, which is
still a positive integer, but obviously truncated.

We can easily avoid the UB by moving the cast out (since ->vm_pgoff is
unsigned and unsigned shifts are always defined IIRC), but that still
doesn't reliably detect the positive-result truncation/overflow.

>  	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
>  		return -EINVAL;
>
>  	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> +	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +	/* check for overflow */
> +	if (len < vma_len)
> +		return -EINVAL;

Also strictly speaking UB. You can avoid it by casting vma_len to
unsigned and dropping the loff_t cast, but it's admittedly somewhat
verbose. There also isn't an "unsigned loff_t" AFAIK, but don't we
have some helpers to safely check for overflows? Surely this isn't the
only place that does loff_t arithmetic.

>
>  	inode_lock(inode);
>  	file_accessed(file);
>
>  	ret = -ENOMEM;
> -	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> -
>  	if (hugetlb_reserve_pages(inode,
>  				vma->vm_pgoff >> huge_page_order(h),
>  				len >> huge_page_shift(h), vma,
> @@ -155,7 +164,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>
>  	ret = 0;
>  	if (vma->vm_flags & VM_WRITE && inode->i_size < len)
> -		inode->i_size = len;
> +		i_size_write(inode, len);
>  out:
>  	inode_unlock(inode);

This hunk seems a bit out of place in the sense that I don't see how
it relates to the overflow checking. I think this either belongs in a
separate patch or it deserves a mention in the changelog.

Vegard
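To make the two review points above concrete, here is a minimal userspace sketch of an overflow check that never shifts or adds a signed value past its range, and that rejects every truncating page offset rather than only the ones that come out negative. The names mapping_end and SKETCH_PAGE_SHIFT are made up for illustration, 4 KiB pages are assumed, and __builtin_add_overflow is a GCC/Clang builtin rather than a kernel helper mentioned in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical helper: compute (pgoff << shift) + vma_len as a signed
 * 64-bit byte offset. Returns false if either step would not fit.
 */
static bool mapping_end(uint64_t pgoff, uint64_t vma_len, int64_t *end)
{
	int64_t off;

	/*
	 * Compare before shifting, so any pgoff whose byte offset does
	 * not fit is rejected, including results that would have looked
	 * positive after truncation.
	 */
	if (pgoff > ((uint64_t)INT64_MAX >> SKETCH_PAGE_SHIFT))
		return false;
	off = (int64_t)(pgoff << SKETCH_PAGE_SHIFT);

	if (vma_len > (uint64_t)INT64_MAX)
		return false;

	/* The builtin returns true when the signed sum overflows. */
	if (__builtin_add_overflow(off, (int64_t)vma_len, end))
		return false;

	return true;
}
```

The shift is only performed once the range check has passed, so no signed overflow can occur at any step.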
Re: [PATCH] um: use KERN_CONT in stack dump
On 12 March 2017 at 10:47, Vegard Nossum <vegard.nos...@oracle.com> wrote:
> On 12/03/2017 10:45, Richard Weinberger wrote:
>> diff --git a/arch/um/kernel/sysrq.c b/arch/um/kernel/sysrq.c
>> index aa1b56f5ac68..18eddf677ec6 100644
>> --- a/arch/um/kernel/sysrq.c
>> +++ b/arch/um/kernel/sysrq.c
>> @@ -17,10 +17,8 @@
>>
>>  static void _print_addr(void *data, unsigned long address, int reliable)
>>  {
>> -	pr_info(" [<%08lx>]", address);
>> -	pr_cont(" %s", reliable ? "" : "? ");
>> -	print_symbol("%s", address);
>> -	pr_cont("\n");
>> +	pr_info(" [<%08lx>] %s%pB\n", address, reliable ? "" : "? ",
>> +		(void *)address);
>>  }
>
> Tested-by: Vegard Nossum <vegard.nos...@oracle.com>

Just a heads up, this still appears unfixed in Linus's repo.

Vegard
Re: [PATCH RESEND] mm/hugetlb: Don't call region_abort if region_chg fails
On 29 March 2017 at 23:08, Mike Kravetz wrote:
> Changes to hugetlbfs reservation maps is a two step process. The first
> step is a call to region_chg to determine what needs to be changed, and
> prepare that change. This should be followed by a call to call to
> region_add to commit the change, or region_abort to abort the change.
>
> The error path in hugetlb_reserve_pages called region_abort after a
> failed call to region_chg. As a result, the adds_in_progress counter
> in the reservation map is off by 1. This is caught by a VM_BUG_ON
> in resv_map_release when the reservation map is freed.
>
> syzkaller fuzzer found this bug, that resulted in the following:
>
> kernel BUG at mm/hugetlb.c:742!
> Call Trace:
>  hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
>  evict+0x481/0x920 fs/inode.c:553
>  iput_final fs/inode.c:1515 [inline]
>  iput+0x62b/0xa20 fs/inode.c:1542
>  hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
>  newseg+0x422/0xd30 ipc/shm.c:575
>  ipcget_new ipc/util.c:285 [inline]
>  ipcget+0x21e/0x580 ipc/util.c:639
>  SYSC_shmget ipc/shm.c:673 [inline]
>  SyS_shmget+0x158/0x230 ipc/shm.c:657
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
>
> Reported-by: Dmitry Vyukov
> Signed-off-by: Mike Kravetz
> Acked-by: Hillf Danton
> ---
>  mm/hugetlb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c7025c1..c65d45c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4233,7 +4233,9 @@ int hugetlb_reserve_pages(struct inode *inode,
>  	return 0;
>  out_err:
>  	if (!vma || vma->vm_flags & VM_MAYSHARE)
> -		region_abort(resv_map, from, to);
> +		/* Don't call region_abort if region_chg failed */
> +		if (chg >= 0)
> +			region_abort(resv_map, from, to);
>  	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>  		kref_put(&resv_map->refs, resv_map_release);
>  	return ret;

Hi guys,

I'm running into this on latest linus/master:

kernel BUG at mm/hugetlb.c:742!
invalid opcode: [#1] SMP KASAN
CPU: 3 PID: 20281 Comm: syz-executor0 Not tainted 4.11.0-rc6 #335
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: 880064f30dc0 task.stack: 880065b38000
RIP: 0010:resv_map_release+0x1cb/0x270
RSP: 0018:880065b3fc38 EFLAGS: 00010287
RAX: 0001 RBX: 88006b5fe418 RCX: c90001b52000
RDX: 05de RSI: 8172026b RDI: 88006b5fe410
RBP: 880065b3fc78 R08: 880065b3f958 R09:
R10: R11: R12: dc00
R13: 88006b5fe418 R14: 88006b5fe418 R15: 88006b5fe418
FS: 7f21647c5700() GS:88006d10() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 00460750 CR3: 5d123000 CR4: 06e0
Call Trace:
 hugetlbfs_evict_inode+0x80/0xa0
 ? hugetlbfs_setattr+0x3c0/0x3c0
 evict+0x24a/0x620
 iput+0x48f/0x8c0
 dentry_unlink_inode+0x31f/0x4d0
 __dentry_kill+0x292/0x5e0
 dput+0x730/0x830
 __fput+0x438/0x720
 fput+0x1a/0x20
 task_work_run+0xfe/0x180
 exit_to_usermode_loop+0x133/0x150
 syscall_return_slowpath+0x184/0x1c0
 entry_SYSCALL_64_fastpath+0xab/0xad

To reproduce:

mmap(0, 0x2000, 0, 0x40031, 0xULL, 0x8000ULL);

Curiously enough, it's the patch from this thread (i.e. commit
ff8c0c53c47530ffea82c22a0a6df6332b56c957) that introduces it,
according to git bisect. Reverting the commit from linus/master fixes
the problem.

Also found by syzcaller (no fault injections this time).

Vegard
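The two-step protocol described in the quoted changelog can be illustrated with a toy counter. This is only a sketch: struct toy_resv_map and the toy_* functions are invented stand-ins, not the kernel's resv_map code, and the model assumes that a negative return from the prepare step always means nothing was prepared.

```c
/* Toy stand-in for a reservation map; not the kernel's struct resv_map. */
struct toy_resv_map {
	long adds_in_progress;	/* changes prepared but not yet finished */
};

/* Step 1: prepare a change. On failure the counter is left untouched. */
static long toy_region_chg(struct toy_resv_map *map, long from, long to)
{
	if (to < from)
		return -1;
	map->adds_in_progress++;
	return to - from;
}

/* Step 2a: commit a prepared change. */
static void toy_region_add(struct toy_resv_map *map)
{
	map->adds_in_progress--;
}

/* Step 2b: abort a prepared change. */
static void toy_region_abort(struct toy_resv_map *map)
{
	map->adds_in_progress--;
}

/* Error path in the spirit of the quoted hunk: only undo what was prepared. */
static void toy_error_path(struct toy_resv_map *map, long chg)
{
	if (chg >= 0)
		toy_region_abort(map);
}
```

In this model an unconditional abort after a failed prepare leaves the counter at -1, which is the kind of imbalance a release-time assertion would catch. The guard is only as good as the assumption that the sign of chg faithfully reports whether the prepare step took effect; the report above suggests that may not hold for every input.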
Re: [PATCH] um: use KERN_CONT in stack dump
On 12/03/2017 10:45, Richard Weinberger wrote:
> On 12.03.2017 at 10:38, Vegard Nossum wrote:
>> Without KERN_CONT, the symbol will appear on a new line, making stack
>> traces completely unreadable:
>
> [snip]
>
> I think it is better to fix the root of the problem by using a single
> printk. i.e.
>
> diff --git a/arch/um/kernel/sysrq.c b/arch/um/kernel/sysrq.c
> index aa1b56f5ac68..18eddf677ec6 100644
> --- a/arch/um/kernel/sysrq.c
> +++ b/arch/um/kernel/sysrq.c
> @@ -17,10 +17,8 @@
>
>  static void _print_addr(void *data, unsigned long address, int reliable)
>  {
> -	pr_info(" [<%08lx>]", address);
> -	pr_cont(" %s", reliable ? "" : "? ");
> -	print_symbol("%s", address);
> -	pr_cont("\n");
> +	pr_info(" [<%08lx>] %s%pB\n", address, reliable ? "" : "? ",
> +		(void *)address);
>  }

Your patch is better.

Tested-by: Vegard Nossum <vegard.nos...@oracle.com>

Thanks,

Vegard
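As a rough userspace analogue of the "single printk" idea, the whole trace line can be built with one formatted call, so there is no point at which a stray line break or another writer can land between the address and the symbol. The function name format_addr_line is hypothetical, and the symbol is passed in as a string here, whereas the kernel patch lets the %pB format specifier resolve it.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format one stack-trace entry in a single call.
 * Returns what snprintf returns (the length the full line needs).
 */
static int format_addr_line(char *buf, size_t size, unsigned long address,
			    int reliable, const char *symbol)
{
	return snprintf(buf, size, " [<%08lx>] %s%s\n", address,
			reliable ? "" : "? ", symbol);
}
```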
[PATCH] um: use KERN_CONT in stack dump
Without KERN_CONT, the symbol will appear on a new line, making stack
traces completely unreadable:

Call Trace:
 [<6008e891>] ?
printk+0x0/0x94
 [<6001cce6>]
show_stack+0xfe/0x15b
 [<600666ec>] ?
dump_stack_print_info+0xe1/0xea
 [<6008e891>] ?
printk+0x0/0x94
 [<6023e826>] ?
bust_spinlocks+0x0/0x4f
 [<602343b8>]
dump_stack+0x2a/0x2c
 [<6008e662>]
panic+0x170/0x31e
 [<6008e4f2>] ?
panic+0x0/0x31e

This makes it readable again:

Call Trace:
 [<6008e891>] ? printk+0x0/0x94
 [<6001cce6>] show_stack+0xfe/0x15b
 [<600666ec>] ? dump_stack_print_info+0xe1/0xea
 [<6008e891>] ? printk+0x0/0x94
 [<6023e826>] ? bust_spinlocks+0x0/0x4f
 [<602343b8>] dump_stack+0x2a/0x2c
 [<6008e662>] panic+0x170/0x31e

Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 arch/um/kernel/sysrq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/um/kernel/sysrq.c b/arch/um/kernel/sysrq.c
index a76295f7ede9..edf1f80123e7 100644
--- a/arch/um/kernel/sysrq.c
+++ b/arch/um/kernel/sysrq.c
@@ -22,7 +22,7 @@ static void _print_addr(void *data, unsigned long address, int reliable)
 {
 	pr_info(" [<%08lx>]", address);
 	pr_cont(" %s", reliable ? "" : "? ");
-	print_symbol("%s", address);
+	print_symbol(KERN_CONT "%s", address);
 	pr_cont("\n");
 }
-- 
2.12.0.rc0
Re: [PATCH] locking/hung_task: Defer showing held locks
On 12/03/2017 06:33, Tetsuo Handa wrote:
> When I was running my testcase which may block hundreds of threads
> on fs locks, I got lockup due to output from debug_show_all_locks()
> added by commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks").
>
> For example, if 1000 threads were blocked in TASK_UNINTERRUPTIBLE state
> and 500 out of 1000 threads hold some lock, debug_show_all_locks() from
> for_each_process_thread() loop will report locks held by 500 threads for
> 1000 times. This is a too much noise.
>
> In order to make sure rcu_lock_break() is called frequently, we should
> avoid calling debug_show_all_locks() from for_each_process_thread() loop
> because debug_show_all_locks() effectively calls for_each_process_thread()
> loop. Let's defer calling debug_show_all_locks() till before panic() or
> leaving for_each_process_thread() loop.
>
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: Vegard Nossum <vegard.nos...@oracle.com>
> ---
>  kernel/hung_task.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index f0f8e2a..751593e 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -43,6 +43,7 @@ int __read_mostly sysctl_hung_task_warnings = 10;
>
>  static int __read_mostly did_panic;
> +static bool hung_task_show_lock;
>
>  static struct task_struct *watchdog_task;
>
> @@ -120,12 +121,14 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>  		pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
>  			" disables this message.\n");
>  		sched_show_task(t);
> -		debug_show_all_locks();
> +		hung_task_show_lock = true;
>  	}
>
>  	touch_nmi_watchdog();
>
>  	if (sysctl_hung_task_panic) {
> +		if (hung_task_show_lock)
> +			debug_show_all_locks();
>  		trigger_all_cpu_backtrace();
>  		panic("hung_task: blocked tasks");
>  	}
> @@ -172,6 +175,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>  	if (test_taint(TAINT_DIE) || did_panic)
>  		return;
>
> +	hung_task_show_lock = false;
>  	rcu_read_lock();
>  	for_each_process_thread(g, t) {
>  		if (!max_count--)
> @@ -187,6 +191,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>  	}
>   unlock:
>  	rcu_read_unlock();
> +	if (hung_task_show_lock)
> +		debug_show_all_locks();
>  }
>
>  static long hung_timeout_jiffies(unsigned long last_checked,

Reviewed/Acked-by: Vegard Nossum <vegard.nos...@oracle.com>

Thank you for fixing this.

Vegard
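The deferral in the patch above boils down to "remember inside the loop, report once after it". A small userspace sketch of that pattern follows; scan_tasks, expensive_report and report_calls are invented names, and expensive_report is only a placeholder for a costly dump such as debug_show_all_locks().

```c
#include <stdbool.h>
#include <stddef.h>

/* Counts how often the costly report ran; for demonstration only. */
static int report_calls;

/* Hypothetical placeholder for an expensive "dump everything" call. */
static void expensive_report(void)
{
	report_calls++;
}

/* Scan all tasks and run the report at most once per scan. */
static void scan_tasks(const bool *hung, size_t n)
{
	bool show = false;	/* plays the role of hung_task_show_lock */
	size_t i;

	for (i = 0; i < n; i++) {
		if (hung[i])
			show = true;	/* defer instead of reporting here */
	}

	/* The loop is finished, so the report runs once, not once per hit. */
	if (show)
		expensive_report();
}
```

With 500 flagged entries out of 1000, the report still runs a single time per scan, which is the noise reduction the changelog is after.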
Re: [PATCH] locking/hung_task: Defer showing held locks
On 13 December 2016 at 15:45, Tetsuo Handa wrote:
> When I was running my testcase which may block hundreds of threads
> on fs locks, I got lockup due to output from debug_show_all_locks()
> added by commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks").
>
> I think we don't need to call debug_show_all_locks() on each blocked
> thread. Let's defer calling debug_show_all_locks() till before panic()
> or leaving for_each_process_thread() loop.

First of all, sorry for not answering earlier.

I'm not sure I fully understand the problem, you say the "output from
debug_show_all_locks()" caused a lockup, but was the problem simply
that the amount of output caused it to stall for a long time?

Could we instead

1) move the debug_show_all_locks() into the if
(sysctl_hung_task_panic) bit unconditionally

2) call something (touch_nmi_watchdog()?) inside debug_show_all_locks()

3) in another way make debug_show_all_locks() more robust so it
doesn't "lockup"

?

Vegard
[PATCH 1/4] mm: add new mmgrab() helper
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Cc: Andrew Morton <a...@linux-foundation.org>
Acked-by: Michal Hocko <mho...@suse.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 arch/alpha/kernel/smp.c                  |  2 +-
 arch/arc/kernel/smp.c                    |  2 +-
 arch/arm/kernel/smp.c                    |  2 +-
 arch/arm64/kernel/smp.c                  |  2 +-
 arch/blackfin/mach-common/smp.c          |  2 +-
 arch/hexagon/kernel/smp.c                |  2 +-
 arch/ia64/kernel/setup.c                 |  2 +-
 arch/m32r/kernel/setup.c                 |  2 +-
 arch/metag/kernel/smp.c                  |  2 +-
 arch/mips/kernel/traps.c                 |  2 +-
 arch/mn10300/kernel/smp.c                |  2 +-
 arch/parisc/kernel/smp.c                 |  2 +-
 arch/powerpc/kernel/smp.c                |  2 +-
 arch/s390/kernel/processor.c             |  2 +-
 arch/score/kernel/traps.c                |  2 +-
 arch/sh/kernel/smp.c                     |  2 +-
 arch/sparc/kernel/leon_smp.c             |  2 +-
 arch/sparc/kernel/smp_64.c               |  2 +-
 arch/sparc/kernel/sun4d_smp.c            |  2 +-
 arch/sparc/kernel/sun4m_smp.c            |  2 +-
 arch/sparc/kernel/traps_32.c             |  2 +-
 arch/sparc/kernel/traps_64.c             |  2 +-
 arch/tile/kernel/smpboot.c               |  2 +-
 arch/x86/kernel/cpu/common.c             |  4 ++--
 arch/xtensa/kernel/smp.c                 |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  2 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c  |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c    |  2 +-
 fs/proc/base.c                           |  4 ++--
 fs/userfaultfd.c                         |  2 +-
 include/linux/sched.h                    | 22 ++++++++++++++++++++++
 kernel/exit.c                            |  2 +-
 kernel/futex.c                           |  2 +-
 kernel/sched/core.c                      |  4 ++--
 mm/khugepaged.c                          |  2 +-
 mm/ksm.c                                 |  2 +-
 mm/mmu_context.c                         |  2 +-
 mm/mmu_notifier.c                        |  2 +-
 mm/oom_kill.c                            |  4 ++--
 virt/kvm/kvm_main.c                      |  2 +-
 40 files changed, 65 insertions(+), 43 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
index 46bf263c3153..acb4b146a607 100644
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -144,7 +144,7 @@ smp_callin(void)
 	alpha_mv.smp_callin();

 	/* All kernel threads share the same mm context. */
-	atomic_inc(&init_mm.mm_count);
+	mmgrab(&init_mm);
 	current->active_mm = &init_mm;

 	/* inform the notifiers about the new cpu */
diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index 88674d972c9d..9cbc7aba3ede 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -125,7 +125,7 @@ void start_kernel_secondary(void)
 	setup_processor();

 	atomic_inc(&mm->mm_users);
-	atomic_inc(&mm->mm_count);
+	mmgrab(mm);
 	current->active_mm = mm;
 	cpumask_set_cpu(cpu, mm_cpumask(mm));

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 7dd14e8395e6..c6514ce0fcbc 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -371,7 +371,7 @@ asmlinkage void secondary_start_kernel(void)
 	 * reference and switch to it.
 	 */
 	cpu = smp_processor_id();
-	atomic_inc(&mm->mm_count);
+	mmgrab(mm);
 	current->active_mm = mm;

 	cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index cb87234cfcf2..959e41196cba 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -222,7 +222,7 @@ asmlinkage void secondary_start_kernel(void)
 	 * All kernel threads share the same mm context; grab a
 	 * reference and switch to it.
 	 */
-	atomic_inc(&mm->mm_count);
+	mmgrab(mm);
 	current->active_mm = mm;

 	/*
diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
index 23c4ef5f8bdc..bc5617ef7128 100644
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -308,7 +308,7 @@ void secondary_start_kernel(void)

 	/* Attach the new idle task to the global mm. */
 	atomic_inc(&mm->mm_users);
-	atomic_inc(&mm->mm_count);
+	mmgrab(mm);
 	current->active_mm = mm;

 	preempt_disable();
diff --git a/arch/hexagon/kernel/smp.c b/arch/hexagon/kernel/smp.c
index 983bae7d2665..c02a6455839e 100644
--- a/arch/hexagon/kernel/smp.c
[PATCH 4/4] mm: clarify mm_struct.mm_{users,count} documentation
Clarify documentation relating to mm_users and mm_count, and switch to
kernel-doc syntax.

Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 include/linux/mm_types.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 08d947fc4c59..316c3e1fc226 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -407,8 +407,27 @@ struct mm_struct {
 	unsigned long task_size;	/* size of task vm space */
 	unsigned long highest_vm_end;	/* highest vma end address */
 	pgd_t * pgd;
-	atomic_t mm_users;	/* How many users with user space? */
-	atomic_t mm_count;	/* How many references to "struct mm_struct" (users count as 1) */
+
+	/**
+	 * @mm_users: The number of users including userspace.
+	 *
+	 * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
+	 * to 0 (i.e. when the task exits and there are no other temporary
+	 * reference holders), we also release a reference on @mm_count
+	 * (which may then free the mm_struct if @mm_count also
+	 * drops to 0).
+	 */
+	atomic_t mm_users;
+
+	/**
+	 * @mm_count: The number of references to mm_struct
+	 * (@mm_users count as 1).
+	 *
+	 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+	 * mm_struct is freed.
+	 */
+	atomic_t mm_count;
+
 	atomic_long_t nr_ptes;		/* PTE page table pages */
 #if CONFIG_PGTABLE_LEVELS > 2
 	atomic_long_t nr_pmds;		/* PMD page table pages */
-- 
2.11.0.1.gaa10c3f
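The relationship between the two counters documented above can be modelled in userspace with C11 atomics. This is a toy sketch only: struct toy_mm and the toy_* functions are invented for illustration, the freed flag exists purely so the behaviour can be observed, and none of this is the kernel's implementation of mmgrab()/mmdrop()/mmget_not_zero()/mmput().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of the two counters; not the kernel's struct mm_struct. */
struct toy_mm {
	atomic_int users;	/* like mm_users */
	atomic_int count;	/* like mm_count; all users together hold 1 */
	bool *freed;		/* demonstration aid: set when released */
};

static struct toy_mm *toy_mm_alloc(bool *freed)
{
	struct toy_mm *mm = malloc(sizeof(*mm));

	if (!mm)
		return NULL;
	atomic_init(&mm->users, 1);
	atomic_init(&mm->count, 1);	/* the reference held on behalf of users */
	mm->freed = freed;
	*freed = false;
	return mm;
}

/* Pin the structure itself (analogue of mmgrab). */
static void toy_mmgrab(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->count, 1);
}

/* Unpin; the last reference frees the structure (analogue of mmdrop). */
static void toy_mmdrop(struct toy_mm *mm)
{
	if (atomic_fetch_sub(&mm->count, 1) == 1) {
		*mm->freed = true;
		free(mm);
	}
}

/* Take a user reference unless users already hit 0 (mmget_not_zero). */
static bool toy_mmget_not_zero(struct toy_mm *mm)
{
	int old = atomic_load(&mm->users);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a user reference; the last user releases one count (mmput). */
static void toy_mmput(struct toy_mm *mm)
{
	if (atomic_fetch_sub(&mm->users, 1) == 1)
		toy_mmdrop(mm);
}
```

In this model a holder of a toy_mmgrab() reference keeps the structure alive after the last user is gone, but toy_mmget_not_zero() then refuses to hand out new user references, which mirrors the split the kernel-doc comments describe.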
[PATCH 3/4] mm: use mmget_not_zero() helper
We already have the helper, we can convert the rest of the kernel
mechanically using:

  git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

Cc: Andrew Morton <a...@linux-foundation.org>
Acked-by: Michal Hocko <mho...@suse.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c | 2 +-
 drivers/iommu/intel-svm.c               | 2 +-
 fs/proc/base.c                          | 4 ++--
 fs/proc/task_mmu.c                      | 4 ++--
 fs/proc/task_nommu.c                    | 2 +-
 kernel/events/uprobes.c                 | 2 +-
 mm/swapfile.c                           | 2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 1f27529cb48e..89be48ed7c77 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -507,7 +507,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 			flags |= FOLL_WRITE;
 		ret = -EFAULT;
-		if (atomic_inc_not_zero(&mm->mm_users)) {
+		if (mmget_not_zero(mm)) {
 			down_read(&mm->mmap_sem);
 			while (pinned < npages) {
 				ret = get_user_pages_remote
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index cb72e0011310..51f2b228723f 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -579,7 +579,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		if (!svm->mm)
 			goto bad_req;
 		/* If the mm is already defunct, don't handle faults. */
-		if (!atomic_inc_not_zero(&svm->mm->mm_users))
+		if (!mmget_not_zero(svm->mm))
 			goto bad_req;
 		down_read(&svm->mm->mmap_sem);
 		vma = find_extend_vma(svm->mm, address);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 32f04999d930..ec7304f5117a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -845,7 +845,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
 		return -ENOMEM;
 	copied = 0;
-	if (!atomic_inc_not_zero(&mm->mm_users))
+	if (!mmget_not_zero(mm))
 		goto free;
 	/* Maybe we should limit FOLL_FORCE to actual ptrace users? */
@@ -953,7 +953,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
 		return -ENOMEM;
 	ret = 0;
-	if (!atomic_inc_not_zero(&mm->mm_users))
+	if (!mmget_not_zero(mm))
 		goto free;
 	down_read(&mm->mmap_sem);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 958f32545064..6c07c7813b26 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -167,7 +167,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return ERR_PTR(-ESRCH);
 	mm = priv->mm;
-	if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+	if (!mm || !mmget_not_zero(mm))
 		return NULL;
 	down_read(&mm->mmap_sem);
@@ -1352,7 +1352,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 	unsigned long end_vaddr;
 	int ret = 0, copied = 0;
-	if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+	if (!mm || !mmget_not_zero(mm))
 		goto out;
 	ret = -EINVAL;
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 37175621e890..1ef97cfcf422 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -219,7 +219,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 		return ERR_PTR(-ESRCH);
 	mm = priv->mm;
-	if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+	if (!mm || !mmget_not_zero(mm))
 		return NULL;
 	down_read(&mm->mmap_sem);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 215871bda3a2..f164fe8ca5ff 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -741,7 +741,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
 			continue;
 		}
-		if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+		if (!mmget_not_zero(vma->vm_mm))
 			continue;
 		info = prev;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 914c31cc143c..5502feef0a4a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1493,7 +1493,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
 		while (swap_count(*swap_map) && !retval &&
 				(p = p->next) != &start_mm->mmlist) {
 			mm = list_entry(p, struct mm_struct, mmlist);
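Every call site converted above relies on "increment unless already zero" semantics: a reference is only taken if at least one user still exists. As a rough userspace illustration of that idea, assuming C11 atomics rather than the kernel's atomic_t API, the check can be written as a compare-and-swap loop. The names toy_inc_not_zero() and toy_inc_not_zero_demo() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: increment *v unless it is 0.
 * Returns true if a reference was taken, false if the counter was
 * already 0 (i.e. the object is on its way out and must not be revived).
 */
static bool toy_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Exercises both outcomes; returns the final value of the live counter. */
static int toy_inc_not_zero_demo(void)
{
	atomic_int live, dead;

	atomic_init(&live, 2);
	atomic_init(&dead, 0);

	assert(toy_inc_not_zero(&live));	/* 2 -> 3 */
	assert(!toy_inc_not_zero(&dead));	/* stays at 0 */
	assert(atomic_load(&dead) == 0);

	return atomic_load(&live);
}
```

A plain increment would not do here, because it could bump a counter that has already reached 0 and whose owner is in the middle of tearing the object down.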
[PATCH 2/4] mm: add new mmget() helper
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Cc: Andrew Morton <a...@linux-foundation.org>
Acked-by: Michal Hocko <mho...@suse.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 arch/arc/kernel/smp.c           |  2 +-
 arch/blackfin/mach-common/smp.c |  2 +-
 arch/frv/mm/mmu-context.c       |  2 +-
 arch/metag/kernel/smp.c         |  2 +-
 arch/sh/kernel/smp.c            |  2 +-
 arch/xtensa/kernel/smp.c        |  2 +-
 include/linux/sched.h           | 21 +++++++++++++++++++++
 kernel/fork.c                   |  4 ++--
 mm/swapfile.c                   | 10 +++++-----
 virt/kvm/async_pf.c             |  2 +-
 10 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index 9cbc7aba3ede..eec70cb71db1 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -124,7 +124,7 @@ void start_kernel_secondary(void)
 	/* MMU, Caches, Vector Table, Interrupts etc */
 	setup_processor();
-	atomic_inc(&mm->mm_users);
+	mmget(mm);
 	mmgrab(mm);
 	current->active_mm = mm;
 	cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
index bc5617ef7128..a2e6db2ce811 100644
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -307,7 +307,7 @@ void secondary_start_kernel(void)
 	local_irq_disable();
 	/* Attach the new idle task to the global mm. */
-	atomic_inc(&mm->mm_users);
+	mmget(mm);
 	mmgrab(mm);
 	current->active_mm = mm;
diff --git a/arch/frv/mm/mmu-context.c b/arch/frv/mm/mmu-context.c
index 81757d55a5b5..3473bde77f56 100644
--- a/arch/frv/mm/mmu-context.c
+++ b/arch/frv/mm/mmu-context.c
@@ -188,7 +188,7 @@ int cxn_pin_by_pid(pid_t pid)
 		task_lock(tsk);
 		if (tsk->mm) {
 			mm = tsk->mm;
-			atomic_inc(&mm->mm_users);
+			mmget(mm);
 			ret = 0;
 		}
 		task_unlock(tsk);
diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c
index af9cff547a19..c622293254e4 100644
--- a/arch/metag/kernel/smp.c
+++ b/arch/metag/kernel/smp.c
@@ -344,7 +344,7 @@ asmlinkage void secondary_start_kernel(void)
 	 * All kernel threads share the same mm context; grab a
 	 * reference and switch to it.
 	 */
-	atomic_inc(&mm->mm_users);
+	mmget(mm);
 	mmgrab(mm);
 	current->active_mm = mm;
 	cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
index ee379c699c08..edc4769b047e 100644
--- a/arch/sh/kernel/smp.c
+++ b/arch/sh/kernel/smp.c
@@ -179,7 +179,7 @@ asmlinkage void start_secondary(void)
 	enable_mmu();
 	mmgrab(mm);
-	atomic_inc(&mm->mm_users);
+	mmget(mm);
 	current->active_mm = mm;
 #ifdef CONFIG_MMU
 	enter_lazy_tlb(mm, current);
diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
index 9bf5cea3bae4..fcea72019df7 100644
--- a/arch/xtensa/kernel/smp.c
+++ b/arch/xtensa/kernel/smp.c
@@ -135,7 +135,7 @@ void secondary_start_kernel(void)
 	/* All kernel threads share the same mm context. */
-	atomic_inc(&mm->mm_users);
+	mmget(mm);
 	mmgrab(mm);
 	current->active_mm = mm;
 	cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6ce46220bda2..9fc07aaf5c97 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2955,6 +2955,27 @@ static inline void mmdrop_async(struct mm_struct *mm)
 	}
 }
+/**
+ * mmget() - Pin the address space associated with a mm_struct.
+ * @mm: The address space to pin.
+ *
+ * Make sure that the address space of the given mm_struct doesn't
+ * go away. This does not protect against parts of the address space being
+ * modified or freed, however.
+ *
+ * Never use this function to pin this address space for an
+ * unbounded/indefinite amount of time.
+ *
+ * Use mmput() to release the reference acquired by mmget().
+ *
+ * See also for an in-depth explanation
+ * of &mm_struct.mm_count vs &mm_struct.mm_users.
+ */
+static inline void mmget(struct mm_struct *mm)
+{
+	atomic_inc(&mm->mm_users);
+}
+
 static inline bool mmget_not_zero(struct mm_struct *mm)
 {
 	return atomic_inc_not_zero(&mm->mm_users);
diff --git a/kernel/fork.c b/kernel/fork.c
index 869b8ccc00bf..0e2aaa1837b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -994,7 +994,7 @@ struct mm_struct *get_task_mm(struct tas
Re: crash during oom reaper
On 12/16/2016 03:32 PM, Michal Hocko wrote:
> On Fri 16-12-16 15:25:27, Vegard Nossum wrote:
>> On 12/16/2016 03:00 PM, Michal Hocko wrote:
>>> On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
>>> [...]
>>>> Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
>>>> Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB, file-rss:112kB, shmem-rss:112kB
>>>> BUG: unable to handle kernel NULL pointer dereference at 01e8
>>>> IP: [] copy_process.part.41+0x2150/0x5580
>>>> PGD c001067 PUD c67 PMD 0
>>>> Oops: 0002 [#1] PREEMPT SMP KASAN
>>>> Dumping ftrace buffer:
>>>>    (ftrace buffer empty)
>>>> CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317
>>>
>>> Hmm, so this was the oom victim initially but we have decided to kill
>>> its child 1724 instead.
>>>
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>> task: 88000f9bc440 task.stack: 88000c778000
>>>> RIP: 0010:[] [] copy_process.part.41+0x2150/0x5580
>>>
>>> Could you match this to the kernel source please?
>>
>> kernel/fork.c:629 dup_mmap()
>
> Ok, so this is before the child is made visible so the oom reaper
> couldn't have seen it.
>
>> it's atomic_dec(&inode->i_writecount), it matches up with
>> file_inode(file) == NULL:
>>
>> (gdb) p &((struct inode *)0)->i_writecount
>> $1 = (atomic_t *) 0x1e8 <irq_stack_union+488>
>
> is this a p9 inode?

When I looked at this before it always crashed in this spot for the very
first VMA in the mm (which happens to be the exe, which is on a 9p root
fs).

I added a trace_printk() to dup_mmap() to print inode->i_sb->s_type and
the last thing I see for a new crash in the same place is:

trinity--9280 28 136345090us : copy_process.part.41: 8485ec40

CPU: 0 PID: 9302 Comm: trinity-c0 Not tainted 4.9.0-rc8+ #332
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: 8807 task.stack: 8800099e
RIP: 0010:[] [] copy_process.part.41+0x22c9/0x55b0

As you can see, the addresses match:

(gdb) p &v9fs_fs_type
$1 = (struct file_system_type *) 0x8485ec40

So I think we can safely say that yes, it's a p9 inode.

Vegard
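For readers unfamiliar with the gdb trick used in this thread: taking the address of a member through a struct pointer at address 0 yields the member's byte offset, and a write through a NULL struct pointer faults at exactly that offset. The sketch below shows the same arithmetic with offsetof(). The struct toy_inode layout is hypothetical and padded only so that the member lands at 0x1e8; the real struct inode looks nothing like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; padded so i_writecount sits at offset 0x1e8. */
struct toy_inode {
	char other_fields[0x1e8];
	int i_writecount;
};

/*
 * Address that an access to ->i_writecount would touch if the
 * struct pointer had the value 'base'. With base == 0 this is the
 * faulting address reported for a NULL pointer dereference.
 */
static uintptr_t fault_address(uintptr_t base)
{
	return base + offsetof(struct toy_inode, i_writecount);
}
```

Calling fault_address(0) returns 0x1e8 for this toy layout, which is how a small faulting address in an oops can be mapped back to a specific member of a structure reached through a NULL pointer.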