Re: Re: BUG: Bad rss-counter state (4)

2021-04-11 Thread Vegard Nossum

(trimmed off the batman/bpf Ccs)

On 2020-05-18 14:28, syzbot wrote:

syzbot has bisected this bug to:

commit 0d8dd67be013727ae57645ecd3ea2c36365d7da8
Author: Song Liu 
Date:   Wed Dec 6 22:45:14 2017 +

 perf/headers: Sync new perf_event.h with the tools/include/uapi version

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13240a0210
start commit:   ac935d22 Add linux-next specific files for 20200415
git tree:   linux-next
final crash:    https://syzkaller.appspot.com/x/report.txt?x=10a40a0210
console output: https://syzkaller.appspot.com/x/log.txt?x=17240a0210
kernel config:  https://syzkaller.appspot.com/x/.config?x=bc498783097e9019
dashboard link: https://syzkaller.appspot.com/bug?extid=347e2331d03d06ab0224
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12d18e6e10
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=104170d610

Reported-by: syzbot+347e2331d03d06ab0...@syzkaller.appspotmail.com
Fixes: 0d8dd67be013 ("perf/headers: Sync new perf_event.h with the tools/include/uapi version")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection



FWIW here's a nicer reproducer that more clearly shows what's really
going on:

#define _GNU_SOURCE
// (header names were lost in the archive; this is the minimal set the
// code below needs)
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// for compat with older perf headers
#define uprobe_path config1

int main(int argc, char *argv[])
{
    // Find out what type id we need for uprobes
    int perf_type_pmu_uprobe;
    {
        FILE *fp = fopen("/sys/bus/event_source/devices/uprobe/type", "r");

        fscanf(fp, "%d", &perf_type_pmu_uprobe);
        fclose(fp);
    }

    const char *filename = "./bus";

    int fd = open(filename, O_RDWR|O_CREAT, 0600);
    write(fd, "x", 1);

    void *addr = mmap(NULL, 4096,
        PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Register a perf uprobe on "./bus"
    struct perf_event_attr attr = {};
    attr.type = perf_type_pmu_uprobe;
    attr.uprobe_path = (unsigned long) filename;
    syscall(__NR_perf_event_open, &attr, 0, 0, -1, 0);

    void *addr2 = mmap(NULL, 2 * 4096,
        PROT_NONE,
        MAP_PRIVATE, fd, 0);
    void *addr3 = mremap((void *) addr2, 4096, 2 * 4096, MREMAP_MAYMOVE);
    mremap(addr3, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED, (void *) addr2);

    return 0;
}

this instantly reproduces this output on current mainline for me:

BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

AFAICT the worst thing about this bug is that it shows up on anything
that parses logs for "BUG"; it doesn't seem to have any ill effects
other than messing up the rss counters. Although maybe it points to some
underlying problem in uprobes/mm interaction.

If I enable the "rss_stat" tracepoint and set ftrace_dump_on_oops=1, I
see a trace roughly like this:

perf_event_open()

mmap(2 * 4096):
 - uprobe_mmap()
    - install_breakpoint()
       - __replace_page()
          - rss_stat: mm_id=0 curr=1 member=1 size=53248B

mremap(4096 => 2 * 4096):
 - install_breakpoint()
    - __replace_page()
       - rss_stat: mm_id=0 curr=1 member=1 size=57344B
 - unmap_page_range()
    - rss_stat: mm_id=0 curr=1 member=1 size=53248B

mremap(4096 => 4096):
 - move_vma()
    - copy_vma()
       - vma_merge()
          - install_breakpoint()
             - __replace_page()
                - rss_stat: mm_id=0 curr=1 member=1 size=57344B
 - do_munmap()
    - install_breakpoint():
       - __replace_page()
          - rss_stat: mm_id=0 curr=1 member=1 size=61440B
    - unmap_page_range():
       - rss_stat: mm_id=0 curr=1 member=1 size=57344B

exit()
 - exit_mmap()
    - unmap_page_range():
       - rss_stat: mm_id=0 curr=0 member=1 size=45056B
    - unmap_page_range():
       - rss_stat: mm_id=0 curr=0 member=1 size=32768B
    - unmap_page_range():
       - rss_stat: mm_id=0 curr=0 member=1 size=20480B
    - unmap_page_range():
       - rss_stat: mm_id=0 curr=0 member=1 size=16384B
    - unmap_page_range():
       - rss_stat: mm_id=0 curr=0 member=1 size=4096B

What strikes me here is that at the end of the first mremap(), we have
size 53248B (13 pages), but at the end of the second mremap(), we have
size 57344B (14 pages), even though the second mremap() is only moving 1
page. So the second mremap() is bumping it up twice, but then only
bumping down once.
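To make the page arithmetic above easy to check, here is a minimal sketch that is not part of the original mail. It converts the rss_stat "size=" values into page counts and net deltas. The 4096-byte page size is taken from the reproducer; the names PAGE_SZ, pages() and net_delta_pages() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages, as used by the reproducer. */
#define PAGE_SZ 4096L

/* Convert an rss_stat "size=...B" value into a number of pages. */
static long pages(long size_bytes)
{
	assert(size_bytes % PAGE_SZ == 0);
	return size_bytes / PAGE_SZ;
}

/* Net change, in pages, between the counter before and after a call. */
static long net_delta_pages(long before_bytes, long after_bytes)
{
	return pages(after_bytes) - pages(before_bytes);
}
```

With the numbers from the trace, net_delta_pages(53248, 53248) is 0 for the first mremap() and net_delta_pages(53248, 57344) is 1 for the second. The last rss_stat line at exit (size=4096B) is one page, which appears consistent with the "val:1" in the BUG message.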


Vegard


Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2021-03-23 Thread Vegard Nossum

(trimmed CCs)

On 2021-03-23 19:32, Kirill A. Shutemov wrote:

On Fri, Jun 12, 2020 at 02:26:58PM +0200, Rafael J. Wysocki wrote:

On 6/11/2020 3:40 AM, Kaneda, Erik wrote:



-Original Message-
From: Vegard Nossum 
Sent: Friday, June 5, 2020 7:45 AM
To: Vlastimil Babka ; Rafael J. Wysocki
; Moore, Robert ; Kaneda,
Erik 
Cc: Kees Cook ; Wysocki, Rafael J
; Christoph Lameter ; Andrew
Morton ; Marco Elver ;
Waiman Long ; LKML ; Linux MM ; ACPI Devel
Maling List ; Len Brown ;
Steven Rostedt 
Subject: Re: slub freelist issue / BUG: unable to handle page fault for
address: 3ffe0018

On 2020-06-05 16:08, Vlastimil Babka wrote:

On 6/5/20 3:12 PM, Rafael J. Wysocki wrote:

On Fri, Jun 5, 2020 at 2:48 PM Vegard Nossum wrote:

On 2020-06-05 11:36, Vegard Nossum wrote:

On 2020-06-05 11:11, Vlastimil Babka wrote:

On 6/4/20 8:46 PM, Vlastimil Babka wrote:

On 6/4/20 7:57 PM, Kees Cook wrote:

On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote:

On 2020-06-04 19:18, Vlastimil Babka wrote:

On 6/4/20 7:14 PM, Vegard Nossum wrote:

Hi all,

I ran into a boot problem with latest linus/master
(6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this:

Hi, what's the .config you use?

Pretty much x86_64 defconfig minus a few options (PCI, USB, ...)

Oh yes indeed. I immediately crash in the same way with this config. I'll
start digging...

(defconfig finishes boot)

This is funny, booting with slub_debug=F results in:
I'm not sure if it's ACPI or ftrace wrong here, but looks like
the changed free pointer offset merely exposes a bug in something
else.

So, with Kees' patch reverted, booting with slub_debug=F (or even
more specific slub_debug=F,ftrace_event_field) also hits this bug
below. I wanted to bisect it, but v5.7 was also bad, and also
v5.6. Didn't try further in history. So it's not new at all, and
likely very specific to your config+QEMU? (and related to the ACPI
error messages that precede it?).

I see it too, but not on v5.0. I can bisect it.

commit 67a72420a326b45514deb3f212085fb2cd1595b5
Author: Bob Moore 
Date:   Fri Aug 16 14:43:21 2019 -0700

ACPICA: Increase total number of possible Owner IDs

ACPICA commit 1f1652dad88b9d767767bc1f7eb4f7d99e6b5324

From 255 to 4095 possible IDs.

Link: https://github.com/acpica/acpica/commit/1f1652da
Reported-by: Hedi Berriche 
Signed-off-by: Bob Moore 
Signed-off-by: Erik Schmauss 
Signed-off-by: Rafael J. Wysocki 

Bob, Erik, did we miss something in that patch?

Maybe the patch just changes layout in a way that exposes the bug.

Anyway the "ftrace_event_field" cache is not really involved, this is
just because of slab merging. After adding "slub_nomerge" to
"slub_debug=F", it starts making more sense, as the cache becomes
Acpi-Namespace

[0.140408] [ cut here ]
[0.140837] cache_from_obj: Wrong slab cache. Acpi-Namespace but object is from kmalloc-64
[0.141406] WARNING: CPU: 0 PID: 1 at mm/slab.h:524 kmem_cache_free+0x1d3/0x250
[0.142105] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #45
[0.142393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
[0.142393] RIP: 0010:kmem_cache_free+0x1d3/0x250
[0.142393] Code: 18 4d 85 ed 0f 84 10 ff ff ff 4c 39 ed 74 2f 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 a1 ac 48 c7 c7 00 c2 b0 ac e8 b1 cc eb ff <0f> 0b 48 89 de 4c 89 ef e8 10 d7 ff ff 48 8b 15 59 36 9b 00 4c 89
[0.142393] RSP: 0018:b39cc0013dc0 EFLAGS: 00010282
[0.142393] RAX:  RBX: 937287409e00 RCX: 
[0.142393] RDX: 0001 RSI: 0092 RDI: acfdd32c
[0.142393] RBP: 93728742ef00 R08: b39cc0013c7d R09: 00fc
[0.142393] R10: b39cc0013c78 R11: b39cc0013c7d R12: 937307409e00
[0.142393] R13: 937287401d00 R14:  R15: 
[0.142393] FS:  () GS:937287a0() knlGS:
[0.142393] CS:  0010 DS:  ES:  CR0: 80050033
[0.142393] CR2:  CR3: 03a0a000 CR4: 003406f0
[0.142393] Call Trace:
[0.142393]  acpi_os_release_object+0x5/0x10
[0.142393]  acpi_ns_delete_children+0x46/0x59
[0.142393]  acpi_ns_delete_namespace_subtree+0x5c/0x79
[0.142393]  ? acpi_sleep_proc_init+0x1f/0x1f
[0.142393]  acpi_ns_terminate+0xc/0x31
[0.142393]  acpi_ut_subsystem_shutdown+0x45/0xa3
[0.142393]  ? acpi_sleep_proc_init+0x1f/0x1f
[0.142393]  acpi_terminate+0x5/0xf
[0.142393]  acpi_init+0x27b/0x308
[0.142393]  ? video_setup+0x79/0x79
[0.142393]  do_one_initcall+0x7b/0x160
[0.142393]  kernel_init_freeable+0x190/0x1f2
[0.142393]  ? rest_init+0x9a/0x9a
[0.142393]  kernel_init+0x5/0xf6
[0.142393]  ret_from_fork+0x22/0x30
[0.142

Re: Minor RST rant

2020-08-06 Thread Vegard Nossum



On 2020-08-06 08:48, Christoph Hellwig wrote:

On Wed, Aug 05, 2020 at 05:12:30PM +0200, pet...@infradead.org wrote:

On Wed, Aug 05, 2020 at 04:49:50PM +0200, Vegard Nossum wrote:

FWIW, I *really* like how the extra markup renders in a browser, and I
don't think I'm the only one.


The thing is, I write code in a text editor, not a browser. When a
header file says: read Documentation/foo I do 'gf' and that file gets
opened in a buffer.

Needing a browser is a fail.


And that is my main problem with all the RST craze.  It optimizes for
shiny display in a browser, but completely messed up the typical
developer flow.



If you are using vim, you can put this in ~/.vim/after/syntax/rst.vim:

  syn region rstInlineLiteral matchgroup=Special start="``" end="``" concealends
  syn region rstEmphasis matchgroup=Special start="\*\*" end="\*\*" concealends

  setlocal conceallevel=2

This will hide the ``foo`` and **bar** markup on lines that are not
currently under the cursor.


Vegard


Re: Re: Minor RST rant

2020-08-05 Thread Vegard Nossum

On 2020-07-29 14:44, pet...@infradead.org wrote:

On Sat, Jul 25, 2020 at 09:46:55AM +1000, NeilBrown wrote:


  Constant names stand out least effectively by themselves.  In
  kernel-doc comments they are preceded by a '%'.  Would that make the
  text more readable for you?  Does our doc infrastructure honour that in
  .rst documents?


It does not. It also still reads really weird.

And for some reason firefox chokes on the HTML file I tried it with, and
make htmldocs takes for bloody ever.

Give me a plain text file, please. All this modern crap just doesn't
work.



FWIW, I *really* like how the extra markup renders in a browser, and I
don't think I'm the only one.

If you want to read .rst files in a terminal, I would suggest using
something like this:

$ pandoc -t plain Documentation/core-api/atomic_ops.rst | less

It looks pretty readable to me, things like lists and code are properly
indented, the only thing it's missing as far as I'm concerned is marking
headings more prominently.

The new online documentation is a great way to attract more people to
kernel development (and just spread typical kernel knowledge to
non-Linux/non-kernel programmers). The old Documentation/ was kind of
hidden away and you only really came across it by accident if you did a
treewide 'git grep'; the new online docs, on the other hand, are a
pleasure to browse and explore and frequently show up in google searches
for random kernel-related topics.


Vegard


Re: [merged] exec-open-code-copy_string_kernel.patch removed from -mm tree

2020-06-16 Thread Vegard Nossum



On 2020-06-05 22:19, a...@linux-foundation.org wrote:

The patch titled
  Subject: exec: open code copy_string_kernel
has been removed from the -mm tree.  Its filename was
  exec-open-code-copy_string_kernel.patch

This patch was dropped because it was merged into mainline or a subsystem tree

--
From: Christoph Hellwig 
Subject: exec: open code copy_string_kernel

Currently copy_string_kernel is just a wrapper around copy_strings that
simplifies the calling conventions and uses set_fs to allow passing a
kernel pointer.  But due to the fact that we only need to handle a single
kernel argument pointer, the logic can be significantly simplified while
getting rid of the set_fs.

Link: http://lkml.kernel.org/r/20200501104105.2621149-3-...@lst.de
Signed-off-by: Christoph Hellwig 
Cc: Alexander Viro 
Signed-off-by: Andrew Morton 
---

  fs/exec.c |   45 +++--
  1 file changed, 35 insertions(+), 10 deletions(-)

--- a/fs/exec.c~exec-open-code-copy_string_kernel
+++ a/fs/exec.c
@@ -592,17 +592,42 @@ out:
   */
  int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
  {
-   int r;
-   mm_segment_t oldfs = get_fs();
-   struct user_arg_ptr argv = {
-   .ptr.native = (const char __user *const  __user *)&arg,
-   };
-
-   set_fs(KERNEL_DS);
-   r = copy_strings(1, argv, bprm);
-   set_fs(oldfs);
+   int len = strnlen(arg, MAX_ARG_STRLEN) + 1 /* terminating NUL */;
+   unsigned long pos = bprm->p;
  
-	return r;

+   if (len == 0)
+   return -EFAULT;


Just a quick question, how can len ever be 0 here when len was set to
strnlen() + 1? Should the test be different?

The old version (i.e. copy_strings()) seems to return -EFAULT when
strnlen() returns 0.
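A userspace sketch of the point, not part of the original mail: strnlen() returns a value between 0 and its limit, so adding 1 can never give 0 and a "len == 0" test on the result cannot fire. arg_len() and MAX_LEN are hypothetical stand-ins for the kernel code and MAX_ARG_STRLEN.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's MAX_ARG_STRLEN. */
#define MAX_LEN 16

/* Same shape as the patch: string length plus the terminating NUL. */
static int arg_len(const char *arg)
{
	assert(arg != NULL);
	return (int)strnlen(arg, MAX_LEN) + 1;
}
```

The smallest possible result is 1 (empty string) and the largest is MAX_LEN + 1 (string at or over the limit), so only an upper-bound check can ever reject anything here.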


Vegard


+   if (!valid_arg_len(bprm, len))
+   return -E2BIG;
+
+   /* We're going to work our way backwards. */
+   arg += len;
+   bprm->p -= len;
+   if (IS_ENABLED(CONFIG_MMU) && bprm->p < bprm->argmin)
+   return -E2BIG;
+
+   while (len > 0) {
+   unsigned int bytes_to_copy = min_t(unsigned int, len,
+   min_not_zero(offset_in_page(pos), PAGE_SIZE));
+   struct page *page;
+   char *kaddr;
+
+   pos -= bytes_to_copy;
+   arg -= bytes_to_copy;
+   len -= bytes_to_copy;
+
+   page = get_arg_page(bprm, pos, 1);
+   if (!page)
+   return -E2BIG;
+   kaddr = kmap_atomic(page);
+   flush_arg_page(bprm, pos & PAGE_MASK, page);
+   memcpy(kaddr + offset_in_page(pos), arg, bytes_to_copy);
+   flush_kernel_dcache_page(page);
+   kunmap_atomic(kaddr);
+   put_arg_page(page);
+   }
+
+   return 0;
  }
  EXPORT_SYMBOL(copy_string_kernel);
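
The loop in the patch above copies the string from its end towards its start, never crossing a page boundary within one chunk. Below is a minimal userspace sketch of that chunking only, under stated assumptions: a flat buffer stands in for the argument pages, SKETCH_PAGE_SIZE is deliberately tiny, and copy_backwards() is a made-up name. It leaves out get_arg_page(), kmap and the size checks.

```c
#include <assert.h>
#include <string.h>

/* Tiny "page" so the chunking is visible (assumption, not the kernel's). */
#define SKETCH_PAGE_SIZE 8UL

/*
 * Copy len bytes of arg so that they end just below *p, one page-bounded
 * chunk at a time, working backwards. Updates *p to the new start and
 * returns the number of chunks used.
 */
static int copy_backwards(char *stack, unsigned long *p,
			  const char *arg, unsigned long len)
{
	unsigned long pos = *p;
	int chunks = 0;

	assert(len <= pos);
	arg += len;
	*p -= len;

	while (len > 0) {
		/* Bytes available below pos within the current page. */
		unsigned long room = pos % SKETCH_PAGE_SIZE;
		unsigned long n;

		if (room == 0)
			room = SKETCH_PAGE_SIZE;
		n = len < room ? len : room;

		pos -= n;
		arg -= n;
		len -= n;
		memcpy(stack + pos, arg, n);
		chunks++;
	}
	return chunks;
}
```

Copying the 6 bytes of "hello" (including the NUL) with *p starting at 12 takes two chunks: 4 bytes down to the page boundary at 8, then 2 more bytes ending at 6.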
  


Re: WARNING: CPU: 1 PID: 52 at mm/page_alloc.c:4826 __alloc_pages_nodemask (Re: [PATCH 5/5] sysctl: pass kernel pointers to ->proc_handler)

2020-06-08 Thread Vegard Nossum



On 2020-06-08 08:51, Christoph Hellwig wrote:

On Thu, Jun 04, 2020 at 10:22:21PM +0200, Vegard Nossum wrote:

It's easy to reproduce by just doing

 read(open("/proc/sys/vm/swappiness", O_RDONLY), 0, 512UL * 1024 * 1024 * 1024);

or so. Reverting the commit fixes the issue for me.


Yes, doing giant allocations will fail and trace.  We have two options
here that both seem sensible:

  - truncate sysctl calls to some sensible length
  - (optionally) use vmalloc

Is this a real application or just a test case trying to do the
stupidmost possible thing?



Just a test case.

Allowing the kernel to allocate an unbounded amount of memory on behalf
of userspace is an easy DOS.

All the length checks were already in there, e.g.

 static int cmm_timeout_handler(struct ctl_table *ctl, int write,
                                void __user *buffer, size_t *lenp, loff_t *ppos)
 {
         char buf[64], *p;
 [...]
         len = min(*lenp, sizeof(buf));
         if (copy_from_user(buf, buffer, len))
                 return -EFAULT;
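
The same clamping idea as a minimal userspace sketch, not taken from the kernel: whatever length the caller asks for, the copy is bounded by the destination buffer. clamp_len() and bounded_copy() are hypothetical names, and memcpy() stands in for copy_from_user().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Clamp a caller-supplied length to the capacity of a fixed buffer. */
static size_t clamp_len(size_t requested, size_t capacity)
{
	return requested < capacity ? requested : capacity;
}

/*
 * Copy at most dst_size bytes, regardless of what the caller asked for.
 * src must have at least min(requested, dst_size) readable bytes.
 * Returns the number of bytes actually copied.
 */
static size_t bounded_copy(char *dst, size_t dst_size,
			   const char *src, size_t requested)
{
	size_t len = clamp_len(requested, dst_size);

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);
	return len;
}
```

A request for a megabyte into a 64-byte buffer copies 64 bytes, so the size of the allocation or copy never depends on the user-supplied count.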


Vegard


Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2020-06-05 Thread Vegard Nossum

On 2020-06-05 17:44, Kees Cook wrote:

On Fri, Jun 05, 2020 at 04:44:51PM +0200, Vegard Nossum wrote:

That's it :-) This fixes it for me:

diff --git a/drivers/acpi/acpica/nsaccess.c b/drivers/acpi/acpica/nsaccess.c
index 2566e2d4c7803..b76bbab917941 100644
--- a/drivers/acpi/acpica/nsaccess.c
+++ b/drivers/acpi/acpica/nsaccess.c
@@ -98,14 +98,12 @@ acpi_status acpi_ns_root_initialize(void)
  * predefined names are at the root level. It is much easier to
  * just create and link the new node(s) here.
  */
-   new_node =
-   ACPI_ALLOCATE_ZEROED(sizeof(struct acpi_namespace_node));
+   new_node = acpi_ns_create_node(*ACPI_CAST_PTR (u32, init_val->name));
 if (!new_node) {
 status = AE_NO_MEMORY;
 goto unlock_and_exit;
 }

-   ACPI_COPY_NAMESEG(new_node->name.ascii, init_val->name);
 new_node->descriptor_type = ACPI_DESC_TYPE_NAMED;
 new_node->type = init_val->type;


I'm a bit confused by the internals of acpi_ns_create_node(). It can still
end up calling ACPI_ALLOCATE_ZEROED() via acpi_os_acquire_object(). Is
this fix correct?



include/acpi/platform/aclinuxex.h:static inline void *acpi_os_acquire_object(acpi_cache_t * cache)
include/acpi/platform/aclinuxex.h-{
include/acpi/platform/aclinuxex.h-  return kmem_cache_zalloc(cache,
include/acpi/platform/aclinuxex.h-          irqs_disabled()? GFP_ATOMIC : GFP_KERNEL);
include/acpi/platform/aclinuxex.h-}

No comment.
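
For readers following along, here is a rough userspace sketch of the class of bug under discussion: an object obtained from one allocator is handed back to a different cache, and the free path notices the mismatch. This is not how SLUB does it (the kernel derives the owning cache from the object's page, not from a per-object header), and every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A named "cache"; only its identity matters in this sketch. */
struct sketch_cache {
	const char *name;
};

/* Header placed in front of each object, padded to keep alignment. */
union sketch_hdr {
	const struct sketch_cache *owner;
	max_align_t align;
};

/* Allocate a zeroed object and remember which cache it came from. */
static void *sketch_alloc(const struct sketch_cache *cache, size_t size)
{
	union sketch_hdr *hdr = calloc(1, sizeof(*hdr) + size);

	if (!hdr)
		return NULL;
	hdr->owner = cache;
	return hdr + 1;
}

/*
 * Free obj on behalf of cache. Returns 0 on success, or -1 when the
 * object belongs to a different cache (the "Wrong slab cache" case).
 */
static int sketch_free(const struct sketch_cache *cache, void *obj)
{
	union sketch_hdr *hdr;
	int wrong_cache;

	assert(obj != NULL);
	hdr = (union sketch_hdr *)obj - 1;
	wrong_cache = (hdr->owner != cache);
	free(hdr);
	return wrong_cache ? -1 : 0;
}
```

Allocating from a "kmalloc-64" stand-in and freeing through an "Acpi-Namespace" stand-in returns -1, which mirrors the situation the patch above avoids by allocating the node from the namespace cache in the first place.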


Vegard


Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2020-06-05 Thread Vegard Nossum

On 2020-06-05 16:08, Vlastimil Babka wrote:

On 6/5/20 3:12 PM, Rafael J. Wysocki wrote:

On Fri, Jun 5, 2020 at 2:48 PM Vegard Nossum  wrote:


On 2020-06-05 11:36, Vegard Nossum wrote:


On 2020-06-05 11:11, Vlastimil Babka wrote:

On 6/4/20 8:46 PM, Vlastimil Babka wrote:

On 6/4/20 7:57 PM, Kees Cook wrote:

On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote:

On 2020-06-04 19:18, Vlastimil Babka wrote:

On 6/4/20 7:14 PM, Vegard Nossum wrote:


Hi all,

I ran into a boot problem with latest linus/master
(6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this:


Hi, what's the .config you use?


Pretty much x86_64 defconfig minus a few options (PCI, USB, ...)


Oh yes indeed. I immediately crash in the same way with this config. I'll
start digging...

(defconfig finishes boot)


This is funny, booting with slub_debug=F results in:
I'm not sure if it's ACPI or ftrace wrong here, but looks like the changed
free pointer offset merely exposes a bug in something else.


So, with Kees' patch reverted, booting with slub_debug=F (or even more
specific slub_debug=F,ftrace_event_field) also hits this bug below. I
wanted to bisect it, but v5.7 was also bad, and also v5.6. Didn't try
further in history. So it's not new at all, and likely very specific to
your config+QEMU? (and related to the ACPI error messages that precede
it?).


I see it too, but not on v5.0. I can bisect it.


commit 67a72420a326b45514deb3f212085fb2cd1595b5
Author: Bob Moore 
Date:   Fri Aug 16 14:43:21 2019 -0700

  ACPICA: Increase total number of possible Owner IDs

  ACPICA commit 1f1652dad88b9d767767bc1f7eb4f7d99e6b5324

  From 255 to 4095 possible IDs.

  Link: https://github.com/acpica/acpica/commit/1f1652da
  Reported-by: Hedi Berriche 
  Signed-off-by: Bob Moore 
  Signed-off-by: Erik Schmauss 
  Signed-off-by: Rafael J. Wysocki 


Bob, Erik, did we miss something in that patch?


Maybe the patch just changes layout in a way that exposes the bug.

Anyway the "ftrace_event_field" cache is not really involved, this is just
because of slab merging. After adding "slub_nomerge" to "slub_debug=F", it
starts making more sense, as the cache becomes Acpi-Namespace

[0.140408] [ cut here ]
[0.140837] cache_from_obj: Wrong slab cache. Acpi-Namespace but object is from kmalloc-64
[0.141406] WARNING: CPU: 0 PID: 1 at mm/slab.h:524 kmem_cache_free+0x1d3/0x250
[0.142105] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #45
[0.142393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
[0.142393] RIP: 0010:kmem_cache_free+0x1d3/0x250
[0.142393] Code: 18 4d 85 ed 0f 84 10 ff ff ff 4c 39 ed 74 2f 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 a1 ac 48 c7 c7 00 c2 b0 ac e8 b1 cc eb ff <0f> 0b 48 89 de 4c 89 ef e8 10 d7 ff ff 48 8b 15 59 36 9b 00 4c 89
[0.142393] RSP: 0018:b39cc0013dc0 EFLAGS: 00010282
[0.142393] RAX:  RBX: 937287409e00 RCX: 
[0.142393] RDX: 0001 RSI: 0092 RDI: acfdd32c
[0.142393] RBP: 93728742ef00 R08: b39cc0013c7d R09: 00fc
[0.142393] R10: b39cc0013c78 R11: b39cc0013c7d R12: 937307409e00
[0.142393] R13: 937287401d00 R14:  R15: 
[0.142393] FS:  () GS:937287a0() knlGS:
[0.142393] CS:  0010 DS:  ES:  CR0: 80050033
[0.142393] CR2:  CR3: 03a0a000 CR4: 003406f0
[0.142393] Call Trace:
[0.142393]  acpi_os_release_object+0x5/0x10
[0.142393]  acpi_ns_delete_children+0x46/0x59
[0.142393]  acpi_ns_delete_namespace_subtree+0x5c/0x79
[0.142393]  ? acpi_sleep_proc_init+0x1f/0x1f
[0.142393]  acpi_ns_terminate+0xc/0x31
[0.142393]  acpi_ut_subsystem_shutdown+0x45/0xa3
[0.142393]  ? acpi_sleep_proc_init+0x1f/0x1f
[0.142393]  acpi_terminate+0x5/0xf
[0.142393]  acpi_init+0x27b/0x308
[0.142393]  ? video_setup+0x79/0x79
[0.142393]  do_one_initcall+0x7b/0x160
[0.142393]  kernel_init_freeable+0x190/0x1f2
[0.142393]  ? rest_init+0x9a/0x9a
[0.142393]  kernel_init+0x5/0xf6
[0.142393]  ret_from_fork+0x22/0x30
[0.142393] ---[ end trace 3539f236ef812ba1 ]---
[0.142396] [ cut here ]

I've also changed the warning so it's not printed just once, and also prints
tracking info (see the hunk at the end of my mail, I'll turn this to a proper
patch later).

With "slub_debug=FU slub_nomerge" there are now multiple warnings, but they all
look the same:

[0.143815] [ cut here ]
[0.144131] cache_from_obj: Wrong slab cache. Acpi-Namespace but object is from kmalloc-64
[0.144929] WARNING: CPU: 0 PID: 1 at mm/slab.h:524 kmem_cache_free+0x1d3/0x250
[0.145129] CPU: 0

Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2020-06-05 Thread Vegard Nossum

On 2020-06-05 11:36, Vegard Nossum wrote:


On 2020-06-05 11:11, Vlastimil Babka wrote:

On 6/4/20 8:46 PM, Vlastimil Babka wrote:

On 6/4/20 7:57 PM, Kees Cook wrote:

On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote:

On 2020-06-04 19:18, Vlastimil Babka wrote:

On 6/4/20 7:14 PM, Vegard Nossum wrote:


Hi all,

I ran into a boot problem with latest linus/master
(6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this:


Hi, what's the .config you use?


Pretty much x86_64 defconfig minus a few options (PCI, USB, ...)


Oh yes indeed. I immediately crash in the same way with this config. I'll
start digging...

(defconfig finishes boot)


This is funny, booting with slub_debug=F results in:
I'm not sure if it's ACPI or ftrace wrong here, but looks like the changed
free pointer offset merely exposes a bug in something else.


So, with Kees' patch reverted, booting with slub_debug=F (or even more
specific slub_debug=F,ftrace_event_field) also hits this bug below. I
wanted to bisect it, but v5.7 was also bad, and also v5.6. Didn't try
further in history. So it's not new at all, and likely very specific to
your config+QEMU? (and related to the ACPI error messages that precede it?).


I see it too, but not on v5.0. I can bisect it.


commit 67a72420a326b45514deb3f212085fb2cd1595b5
Author: Bob Moore 
Date:   Fri Aug 16 14:43:21 2019 -0700

ACPICA: Increase total number of possible Owner IDs

ACPICA commit 1f1652dad88b9d767767bc1f7eb4f7d99e6b5324

From 255 to 4095 possible IDs.

Link: https://github.com/acpica/acpica/commit/1f1652da
Reported-by: Hedi Berriche 
Signed-off-by: Bob Moore 
Signed-off-by: Erik Schmauss 
Signed-off-by: Rafael J. Wysocki 


Vegard

This would mean acpi_os_release_object() calling
kmem_cache_free(ftrace_event_field, x)
where x is actually from kmalloc-64? Both parts of that sound wrong.

Thread starts here:
https://lore.kernel.org/linux-mm/4dc93ff8-f86e-f4c9-ebeb-6d3153a78...@oracle.com/

[    0.144386] ACPI: Added _OSI(Module Device)
[    0.144496] ACPI: Added _OSI(Processor Device)
[    0.144956] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.145432] ACPI: Added _OSI(Processor Aggregator Device)
[    0.145501] ACPI: Added _OSI(Linux-Dell-Video)
[    0.145951] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    0.146522] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.147070] ACPI Error: AE_BAD_PARAMETER, During Region initialization (20200430/tbxfload-52)
[    0.147494] ACPI: Unable to load the System Description Tables
[    0.148104] ACPI Error: Could not remove SCI handler (20200430/evmisc-251)
[    0.148507] [ cut here ]
[    0.148985] cache_from_obj: Wrong slab cache. ftrace_event_field but object is from kmalloc-64
[    0.149502] WARNING: CPU: 0 PID: 1 at mm/slab.h:523 kmem_cache_free+0x248/0x260
[    0.150254] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #43
[    0.150490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
[    0.150490] RIP: 0010:kmem_cache_free+0x248/0x260
[    0.150490] Code: ff 0f 0b e9 9d fe ff ff 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 c1 a4 48 c7 c7 f0 c1 d0 a4 c6 05 9f 05 b1 00 01 e8 bc cc eb ff <0f> 0b 48 8b 15 5f 36 9b 00 4c 89 ed e9 d6 fd ff ff 0f 1f 80 00 00
[    0.150490] RSP: 0018:b4dac0013dc0 EFLAGS: 00010282
[    0.150490] RAX:  RBX: a38a07409e00 RCX: 
[    0.150490] RDX: 0001 RSI: 0092 RDI: a51dd32c
[    0.150490] RBP: a38a07403900 R08: b4dac0013c7d R09: 00eb
[    0.150490] R10: b4dac0013c78 R11: b4dac0013c7d R12: a38a87409e00
[    0.150490] R13: a38a07401d00 R14:  R15: 
[    0.150490] FS:  () GS:a38a07a0() knlGS:
[    0.150490] CS:  0010 DS:  ES:  CR0: 80050033
[    0.150490] CR2:  CR3: 0560a000 CR4: 003406f0
[    0.150490] Call Trace:
[    0.150490]  acpi_os_release_object+0x5/0x10
[    0.150490]  acpi_ns_delete_children+0x46/0x59
[    0.150490]  acpi_ns_delete_namespace_subtree+0x5c/0x79
[    0.150490]  ? acpi_sleep_proc_init+0x1f/0x1f
[    0.150490]  acpi_ns_terminate+0xc/0x31
[    0.150490]  acpi_ut_subsystem_shutdown+0x45/0xa3
[    0.150490]  ? acpi_sleep_proc_init+0x1f/0x1f
[    0.150490]  acpi_terminate+0x5/0xf
[    0.150490]  acpi_init+0x27b/0x308
[    0.150490]  ? video_setup+0x79/0x79
[    0.150490]  do_one_initcall+0x7b/0x160
[    0.150490]  kernel_init_freeable+0x190/0x1f2
[    0.150490]  ? rest_init+0x9a/0x9a
[    0.150490]  kernel_init+0x5/0xf6
[    0.150490]  ret_from_fork+0x22/0x30
[    0.150490] ---[ end trace 967e9fbc065d7911 ]---











Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2020-06-05 Thread Vegard Nossum



On 2020-06-05 11:11, Vlastimil Babka wrote:

On 6/4/20 8:46 PM, Vlastimil Babka wrote:

On 6/4/20 7:57 PM, Kees Cook wrote:

On Thu, Jun 04, 2020 at 07:20:18PM +0200, Vegard Nossum wrote:

On 2020-06-04 19:18, Vlastimil Babka wrote:

On 6/4/20 7:14 PM, Vegard Nossum wrote:


Hi all,

I ran into a boot problem with latest linus/master
(6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this:


Hi, what's the .config you use?


Pretty much x86_64 defconfig minus a few options (PCI, USB, ...)


Oh yes indeed. I immediately crash in the same way with this config. I'll
start digging...

(defconfig finishes boot)


This is funny, booting with slub_debug=F results in:
I'm not sure if it's ACPI or ftrace wrong here, but looks like the changed
free pointer offset merely exposes a bug in something else.


So, with Kees' patch reverted, booting with slub_debug=F (or even more
specific slub_debug=F,ftrace_event_field) also hits this bug below. I
wanted to bisect it, but v5.7 was also bad, and also v5.6. Didn't try
further in history. So it's not new at all, and likely very specific to
your config+QEMU? (and related to the ACPI error messages that precede it?).


I see it too, but not on v5.0. I can bisect it.

Also, panic_on_warn is apparently a core parameter, it should probably 
be __setup()...



Vegard




This would mean acpi_os_release_object() calling
kmem_cache_free(ftrace_event_field, x)
where x is actually from kmalloc-64? Both parts of that sound wrong.

Thread starts here: 
https://lore.kernel.org/linux-mm/4dc93ff8-f86e-f4c9-ebeb-6d3153a78...@oracle.com/

[0.144386] ACPI: Added _OSI(Module Device)
[0.144496] ACPI: Added _OSI(Processor Device)
[0.144956] ACPI: Added _OSI(3.0 _SCP Extensions)
[0.145432] ACPI: Added _OSI(Processor Aggregator Device)
[0.145501] ACPI: Added _OSI(Linux-Dell-Video)
[0.145951] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[0.146522] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[0.147070] ACPI Error: AE_BAD_PARAMETER, During Region initialization (20200430/tbxfload-52)
[0.147494] ACPI: Unable to load the System Description Tables
[0.148104] ACPI Error: Could not remove SCI handler (20200430/evmisc-251)
[0.148507] [ cut here ]
[0.148985] cache_from_obj: Wrong slab cache. ftrace_event_field but object is from kmalloc-64
[0.149502] WARNING: CPU: 0 PID: 1 at mm/slab.h:523 kmem_cache_free+0x248/0x260
[0.150254] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #43
[0.150490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
[0.150490] RIP: 0010:kmem_cache_free+0x248/0x260
[0.150490] Code: ff 0f 0b e9 9d fe ff ff 49 8b 4d 58 48 8b 55 58 48 c7 c6 10 47 c1 a4 48 c7 c7 f0 c1 d0 a4 c6 05 9f 05 b1 00 01 e8 bc cc eb ff <0f> 0b 48 8b 15 5f 36 9b 00 4c 89 ed e9 d6 fd ff ff 0f 1f 80 00 00
[0.150490] RSP: 0018:b4dac0013dc0 EFLAGS: 00010282
[0.150490] RAX:  RBX: a38a07409e00 RCX: 
[0.150490] RDX: 0001 RSI: 0092 RDI: a51dd32c
[0.150490] RBP: a38a07403900 R08: b4dac0013c7d R09: 00eb
[0.150490] R10: b4dac0013c78 R11: b4dac0013c7d R12: a38a87409e00
[0.150490] R13: a38a07401d00 R14:  R15: 
[0.150490] FS:  () GS:a38a07a0() knlGS:
[0.150490] CS:  0010 DS:  ES:  CR0: 80050033
[0.150490] CR2:  CR3: 0560a000 CR4: 003406f0
[0.150490] Call Trace:
[0.150490]  acpi_os_release_object+0x5/0x10
[0.150490]  acpi_ns_delete_children+0x46/0x59
[0.150490]  acpi_ns_delete_namespace_subtree+0x5c/0x79
[0.150490]  ? acpi_sleep_proc_init+0x1f/0x1f
[0.150490]  acpi_ns_terminate+0xc/0x31
[0.150490]  acpi_ut_subsystem_shutdown+0x45/0xa3
[0.150490]  ? acpi_sleep_proc_init+0x1f/0x1f
[0.150490]  acpi_terminate+0x5/0xf
[0.150490]  acpi_init+0x27b/0x308
[0.150490]  ? video_setup+0x79/0x79
[0.150490]  do_one_initcall+0x7b/0x160
[0.150490]  kernel_init_freeable+0x190/0x1f2
[0.150490]  ? rest_init+0x9a/0x9a
[0.150490]  kernel_init+0x5/0xf6
[0.150490]  ret_from_fork+0x22/0x30
[0.150490] ---[ end trace 967e9fbc065d7911 ]---









WARNING: CPU: 1 PID: 52 at mm/page_alloc.c:4826 __alloc_pages_nodemask (Re: [PATCH 5/5] sysctl: pass kernel pointers to ->proc_handler)

2020-06-04 Thread Vegard Nossum



(Trimmed original Ccs due to outgoing email policy.)

Hi,

On 2020-04-24 08:43, Christoph Hellwig wrote:

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handlers just pass through the data to one of the common handlers
a lot of the changes are mechanical.

Signed-off-by: Christoph Hellwig 
Acked-by: Andrey Ignatov 


[snip]

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index b6f5d459b087d..df2143e05c571 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -539,13 +539,13 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
return err;
  }
  
-static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
+static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
size_t count, loff_t *ppos, int write)
  {
struct inode *inode = file_inode(filp);
struct ctl_table_header *head = grab_header(inode);
struct ctl_table *table = PROC_I(inode)->sysctl_entry;
-   void *new_buf = NULL;
+   void *kbuf;
ssize_t error;
  
  	if (IS_ERR(head))

@@ -564,27 +564,38 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf,
if (!table->proc_handler)
goto out;
  
-   error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, &count,
-  ppos, &new_buf);
+   if (write) {
+   kbuf = memdup_user_nul(ubuf, count);
+   if (IS_ERR(kbuf)) {
+   error = PTR_ERR(kbuf);
+   goto out;
+   }
+   } else {
+   error = -ENOMEM;
+   kbuf = kzalloc(count, GFP_KERNEL);
+   if (!kbuf)
+   goto out;
+   }
+
+   error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, &kbuf, &count,
+  ppos);
if (error)
-   goto out;
+   goto out_free_buf;
  
  	/* careful: calling conventions are nasty here */
-   if (new_buf) {
-   mm_segment_t old_fs;
-
-   old_fs = get_fs();
-   set_fs(KERNEL_DS);
-   error = table->proc_handler(table, write, (void __user *)new_buf,
-   &count, ppos);
-   set_fs(old_fs);
-   kfree(new_buf);
-   } else {
-   error = table->proc_handler(table, write, buf, &count, ppos);
+   error = table->proc_handler(table, write, kbuf, &count, ppos);
+   if (error)
+   goto out_free_buf;
+
+   if (!write) {
+   error = -EFAULT;
+   if (copy_to_user(ubuf, kbuf, count))
+   goto out_free_buf;
}
  
-   if (!error)
-   error = count;
+   error = count;
+out_free_buf:
+   kfree(kbuf);
  out:
sysctl_head_finish(head);
  


This commit in recent linus/master
(32927393dc1ccd60fb2bdc05b9e8e88753761469) causes a regression for me:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 52 at mm/page_alloc.c:4826 __alloc_pages_nodemask+0x1cd/0x2a0

CPU: 1 PID: 52 Comm: init Not tainted 5.7.0+ #218
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014

RIP: 0010:__alloc_pages_nodemask+0x1cd/0x2a0
Code: 0f 85 26 ff ff ff 65 48 8b 04 25 00 7d 01 00 48 05 88 07 00 00 41 
bd 01 00 00 00 48 89 44 24 08 e9 07 ff ff ff 80 e7 20 75 02 <0f> 0b 45 
31 ed eb 98 44 8b 64 24 18 65 8b 05 d0 25 e9 7e 89 c0 48

RSP: 0018:c90e7de0 EFLAGS: 00010246
RAX:  RBX: 000400c0 RCX: 
RDX:  RSI: 0013 RDI: 00040dc0
RBP: 7000 R08: 820276c0 R09: 
R10:  R11:  R12: c90e7f08
R13: 0013 R14: 0013 R15: 81c34ce0
FS:  006cf880() GS:88803ed0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 004a1dab CR3: 3e012002 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 kmalloc_order+0x16/0x70
 kmalloc_order_trace+0x18/0xa0
 proc_sys_call_handler+0xf7/0x170
 vfs_read+0x98/0x120
 ksys_read+0x5a/0xd0
 do_syscall_64+0x43/0x140
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x43f910
Code: 01 f0 ff ff 0f 83 e0 57 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 
1f 44 00 00 83 3d 19 f2 28 00 00 75 14 b8 00 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 0f 83 b4 57 00 00 c3 48 83 ec 08 e8 4a 39 00 00

RSP: 002b:7fffeaa8 EFLAGS: 0246 ORIG_RAX: 
RAX: ffda RBX: 004002c8 RCX: 

Re: slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2020-06-04 Thread Vegard Nossum

On 2020-06-04 19:18, Vlastimil Babka wrote:

On 6/4/20 7:14 PM, Vegard Nossum wrote:


Hi all,

I ran into a boot problem with latest linus/master
(6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this:


Hi, what's the .config you use?


Pretty much x86_64 defconfig minus a few options (PCI, USB, ...)

Attached.


Vegard
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 5.7.0 Kernel Configuration
#

#
# Compiler: gcc-9 (Ubuntu 9.2.1-17ubuntu1~16.04) 9.2.1 20191102
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=90201
CONFIG_LD_VERSION=22601
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_SCHED_THERMAL_PRESSURE is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

# CONFIG_IKCONFIG is not set
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
# CONFIG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_PERF is not set
# CONFIG_CGROUP_DEBUG is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
# CONFIG_CHECKPOINT_RESTORE is not set
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
# CONFIG_BOOT_CONFIG is not set
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KAL

slub freelist issue / BUG: unable to handle page fault for address: 000000003ffe0018

2020-06-04 Thread Vegard Nossum



Hi all,

I ran into a boot problem with latest linus/master
(6929f71e46bdddbf1c4d67c2728648176c67c555) that manifests like this:

hpet0: 3 comparators, 64-bit 100.00 MHz counter
clocksource: Switched to clocksource tsc-early
BUG: unable to handle page fault for address: 000000003ffe0018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0+ #211
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014

RIP: 0010:kmem_cache_alloc+0x70/0x1d0
Code: 00 00 4c 8b 45 00 65 49 8b 50 08 65 4c 03 05 6f cc e7 7e 4d 8b 20 
4d 85 e4 0f 84 3d 01 00 00 8b 45 20 48 8b 7d 00 48 8d 4a 01 <49> 8b 1c 
04 4c 89 e0 65 48 0f c7 0f 0f 94 c0 84 c0 74 c5 8b 45 20

RSP: :c9013df8 EFLAGS: 00010206
RAX: 0018 RBX: 81c49200 RCX: 0002
RDX: 0001 RSI: 0dc0 RDI: 0002b300
RBP: 88803e403d00 R08: 88803ec2b300 R09: 0001
R10: 0dc0 R11: 0006 R12: 3ffe
R13: 8110a583 R14: 0dc0 R15: 81c49a80
FS:  () GS:88803ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 3ffe0018 CR3: 01c0a001 CR4: 003606f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 __trace_define_field+0x33/0xa0
 event_trace_init+0xeb/0x2b4
 tracer_init_tracefs+0x60/0x195
 ? register_tracer+0x1e7/0x1e7
 do_one_initcall+0x74/0x160
 kernel_init_freeable+0x190/0x1f0
 ? rest_init+0x9a/0x9a
 kernel_init+0x5/0xf6
 ret_from_fork+0x35/0x40
CR2: 3ffe0018
---[ end trace 707efa023f2ee960 ]---
RIP: 0010:kmem_cache_alloc+0x70/0x1d0

Bisection gives me:

commit 3202fa62fb43087387c65bfa9c100feffac74aa6
Author: Kees Cook 
Date:   Wed Apr 1 21:04:27 2020 -0700

slub: relocate freelist pointer to middle of object

Reverting these three commits fixes it:

3202fa62fb43087387c65bfa9c100feffac74aa6 slub: relocate freelist pointer to middle of object
89b83f282d8ba380cf2124f88106c57df49c538c slub: avoid redzone when choosing freepointer location
cbfc35a48609ceac978791e3ab9dde0c01f8cb20 mm/slub: fix incorrect interpretation of s->offset



Vegard


Re: [PATCH v10 00/18] Enable FSGSBASE instructions

2020-05-10 Thread Vegard Nossum



On 5/10/20 10:09 AM, Vegard Nossum wrote:


On 4/24/20 1:21 AM, Sasha Levin wrote:

Benefits:
Currently a user process that wishes to read or write the FS/GS base must
make a system call. But recent X86 processors have added new instructions
for use in 64-bit mode that allow direct access to the FS and GS segment
base addresses.  The operating system controls whether applications can
use these instructions with a %cr4 control bit.

[...]

So FWIW I've done some overnight fuzz testing of this patch set and
haven't seen any problems. Will try a couple of other kernel configs too.


I spoke a few minutes too soon. Just hit this, if anybody wants to have
a look:

[ 6402.786418] ------------[ cut here ]------------
[ 6402.787769] WARNING: CPU: 0 PID: 13802 at arch/x86/kernel/traps.c:811 do_debug+0x16c/0x210

[ 6402.790042] CPU: 0 PID: 13802 Comm: init Not tainted 5.7.0-rc4+ #194
[ 6402.791779] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6402.793365] RIP: 0010:do_debug+0x16c/0x210
[ 6402.794496] Code: ef e8 f8 fb 00 00 f6 85 91 00 00 00 02 74 b9 fa 66 
66 90 66 66 90 e8 c3 f5 11 00 eb ab f6 85 88 00 00 00 03 0f 85 6e ff ff 
ff <0f> 0b 80 e4 bf 49 89 84 24 58 0a 00 00 f0 41 80 0c 24 10 48 81 a5

[ 6402.799557] RSP: :fe011f20 EFLAGS: 00010046
[ 6402.800995] RAX: 4002 RBX:  RCX: 

[ 6402.802959] RDX:  RSI: 0003 RDI: 
82471e60
[ 6402.804891] RBP: fe011f58 R08:  R09: 
0005
[ 6402.806836] R10:  R11:  R12: 
88803e739a00
[ 6402.808775] R13:  R14: 3ce24000 R15: 

[ 6402.810723] FS:  0097a8c0() GS:88803ec0() 
knlGS:

[ 6402.812933] CS:  0010 DS:  ES:  CR0: 80050033
[ 6402.814509] CR2: 4010 CR3: 3ce24000 CR4: 
06f0
[ 6402.816468] DR0: 0001 DR1: 40006070 DR2: 
77ffd000
[ 6402.818406] DR3:  DR6: 0ff0 DR7: 
03b3062a

[ 6402.820353] Call Trace:
[ 6402.821043]  <#DB>
[ 6402.821622]  debug+0x37/0x70
[ 6402.822449] RIP: 0010:arch_stack_walk_user+0x79/0x110
[ 6402.823851] Code: b8 f0 ff ff bf be f0 df ff ff 48 0f 44 c6 48 39 d0 
0f 82 94 00 00 00 41 83 87 b8 09 00 00 01 66 66 90 0f ae e8 31 c0 48 8b 
1a <66> 66 90 85 c0 75 72 66 66 90 0f ae e8 48 8b 72 08 66 66 90 85 c0

[ 6402.828923] RSP: :c90003807d80 EFLAGS: 0046
[ 6402.830346] RAX:  RBX: 0040001000bf4800 RCX: 
0001
[ 6402.832288] RDX: 40006073 RSI: 400060dd RDI: 
c90003807db8
[ 6402.834250] RBP: c90003807f58 R08: 0001 R09: 
88803e00
[ 6402.836203] R10: 054c R11: 88803d2d955c R12: 
88803e739a00
[ 6402.838139] R13: 810f16a0 R14: c90003807db8 R15: 
88803e739a00

[ 6402.840083]  ? profile_setup.cold+0xa1/0xa1
[ 6402.841235]  
[ 6402.841836]  stack_trace_save_user+0x8c/0xd4
[ 6402.843045]  trace_buffer_unlock_commit_regs+0x122/0x1a0
[ 6402.844501]  trace_event_buffer_commit+0x6d/0x240
[ 6402.845799]  trace_event_raw_event_preemptirq_template+0x75/0xc0
[ 6402.847441]  ? debug+0x53/0x70
[ 6402.848299]  ? trace_hardirqs_off_thunk+0x1a/0x33
[ 6402.849593]  trace_hardirqs_off_caller+0xa6/0xd0
[ 6402.850862]  ? debug+0x4e/0x70
[ 6402.851727]  trace_hardirqs_off_thunk+0x1a/0x33
[ 6402.852983]  debug+0x53/0x70
[ 6402.853785] RIP: 0033:0x400060dd
[ 6402.854681] Code: 7a 1e 9e 91 de 4c 65 49 be 00 d0 ff f7 ff 7f 00 00 
49 bf de a7 b3 e8 d7 21 3c 15 9c 48 81 0c 24 00 01 00 00 9d b8 62 00 00 
00 <8e> c0 0f 05 66 8c c8 9c 48 81 24 24 ff fe ff ff 9d 48 89 04 25 40

[ 6402.859689] RSP: 002b:4000aea0 EFLAGS: 0317
[ 6402.861116] RAX: 0062 RBX: 40001000 RCX: 

[ 6402.863097] RDX: 40003000 RSI: 40004000 RDI: 
40001000
[ 6402.866199] RBP: 40006073 R08: 0001 R09: 
0001
[ 6402.868142] R10: ef080df2 R11: 1000 R12: 
fdff
[ 6402.870083] R13: 654cde919e1e7ab5 R14: 77ffd000 R15: 
153c21d7e8b3a7de

[ 6402.872049] ---[ end trace 91a3039d0fd63799 ]---

It might not be related to the patch set, mind.


Vegard


Re: [PATCH v10 00/18] Enable FSGSBASE instructions

2020-05-10 Thread Vegard Nossum



On 4/24/20 1:21 AM, Sasha Levin wrote:

Benefits:
Currently a user process that wishes to read or write the FS/GS base must
make a system call. But recent X86 processors have added new instructions
for use in 64-bit mode that allow direct access to the FS and GS segment
base addresses.  The operating system controls whether applications can
use these instructions with a %cr4 control bit.

In addition to benefits to applications, performance improvements to the
OS context switch code are possible by making use of these instructions. A
third party reported promising performance numbers from their initial
benchmarking of the previous version of this patch series [9].

Enablement check:
The kernel provides information about the enabled state of FSGSBASE to
applications using the ELF_AUX vector. If the HWCAP2_FSGSBASE bit is set in
the AUX vector, the kernel has FSGSBASE instructions enabled and
applications can use them.
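
A minimal sketch of that enablement check from a shell. It assumes that
HWCAP2_FSGSBASE is bit 1 of AT_HWCAP2 (the value used in mainline; the
hwcap2.h hunk of this series is not quoted here) and that the glibc dynamic
loader honours LD_SHOW_AUXV. The helper names are made up for illustration.

```shell
#!/bin/sh
# Assumption: HWCAP2_FSGSBASE is bit 1 of AT_HWCAP2 on x86.
HWCAP2_FSGSBASE=$((1 << 1))

# has_fsgsbase VALUE: succeed if the FSGSBASE bit is set in an AT_HWCAP2 value.
has_fsgsbase() {
    [ $(( $1 & HWCAP2_FSGSBASE )) -ne 0 ]
}

# read_hwcap2: print AT_HWCAP2 as seen by a newly exec'ed process.
# Relies on the glibc loader dumping the aux vector when LD_SHOW_AUXV is set;
# prints nothing if the entry is absent or the loader ignores the variable.
read_hwcap2() {
    LD_SHOW_AUXV=1 sleep 0 2>/dev/null | awk '/^AT_HWCAP2:/ { print $2; exit }'
}

hwcap2=$(read_hwcap2)
if has_fsgsbase "${hwcap2:-0}"; then
    echo "kernel reports FSGSBASE enabled"
else
    echo "kernel does not report FSGSBASE"
fi
```

A program would normally do the same test with getauxval(AT_HWCAP2) before
executing any of the FSGSBASE instructions.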

Kernel changes:
Major changes made in the kernel are in context switch, paranoid path, and
ptrace. In a context switch, a task's FS/GS base will be secured regardless
of its selector. In the paranoid path, GS base is unconditionally
overwritten to the kernel GS base on entry and the original GS base is
restored on exit. Ptrace includes divergence of FS/GS index and base
values.

Security:
For mitigating the Spectre v1 SWAPGS issue, LFENCE instructions were added
on most kernel entries. Those patches depend on the previous behavior, in
which users could not load a kernel address into the GS base. These patches
change that assumption since the user can load any address into GS base.
The changes to the kernel entry path in this patch series take account of
the SWAPGS issue.

Changes from v9:

  - Rebase on top of v5.7-rc1 and re-test.
  - Work around changes in 2fff071d28b5 ("x86/process: Unify
copy_thread_tls()").
  - Work around changes in c7ca0b614513 ("Revert "x86/ptrace: Prevent
ptrace from clearing the FS/GS selector" and fix the test").

  


Andi Kleen (2):
   x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
   x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2

Andy Lutomirski (4):
   x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
   x86/entry/64: Clean up paranoid exit
   x86/fsgsbase/64: Use FSGSBASE in switch_to() if available
   x86/fsgsbase/64: Enable FSGSBASE on 64bit by default and add a chicken
 bit

Chang S. Bae (9):
   x86/ptrace: Prevent ptrace from clearing the FS/GS selector
   selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base
 write
   x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
   x86/entry/64: Introduce the FIND_PERCPU_BASE macro
   x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
   x86/entry/64: Document GSBASE handling in the paranoid path
   x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
   x86/fsgsbase/64: Use FSGSBASE instructions on thread copy and ptrace
   selftests/x86/fsgsbase: Test ptracer-induced GS base write with
 FSGSBASE

Sasha Levin (1):
   x86/fsgsbase/64: move save_fsgs to header file

Thomas Gleixner (1):
   Documentation/x86/64: Add documentation for GS/FS addressing mode

Tony Luck (1):
   x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation

  .../admin-guide/kernel-parameters.txt |   2 +
  Documentation/x86/entry_64.rst|   9 +
  Documentation/x86/x86_64/fsgs.rst | 199 ++
  Documentation/x86/x86_64/index.rst|   1 +
  arch/x86/entry/calling.h  |  40 
  arch/x86/entry/entry_64.S | 131 +---
  arch/x86/include/asm/fsgsbase.h   |  45 +++-
  arch/x86/include/asm/inst.h   |  15 ++
  arch/x86/include/uapi/asm/hwcap2.h|   3 +
  arch/x86/kernel/cpu/bugs.c|   6 +-
  arch/x86/kernel/cpu/common.c  |  22 ++
  arch/x86/kernel/process.c |  10 +-
  arch/x86/kernel/process.h |  69 ++
  arch/x86/kernel/process_64.c  | 142 +++--
  arch/x86/kernel/ptrace.c  |  17 +-
  tools/testing/selftests/x86/fsgsbase.c|  24 ++-
  16 files changed, 606 insertions(+), 129 deletions(-)
  create mode 100644 Documentation/x86/x86_64/fsgs.rst


So FWIW I've done some overnight fuzz testing of this patch set and
haven't seen any problems. Will try a couple of other kernel configs too.


Vegard


Re: email as a bona fide git transport

2019-10-22 Thread Vegard Nossum



On 10/22/19 3:53 PM, Theodore Y. Ts'o wrote:

On Tue, Oct 22, 2019 at 02:11:22PM +0200, Vegard Nossum wrote:


As I wrote in there, we could already today start using

   git am --message-id

when applying patches and this would provide something that a bot could
annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
would already be a pretty nice improvement over today's situation.

Sadly, since the beginning of 2018, this was only used for a measly
~0.14% of all non-merge commits in the kernel:

$ git rev-list --count --no-merges --since='2018-01-01' --grep 'Message-Id: ' linus/master
178


You might also want to count commits which have a link tag with a
Message-Id:

Link: https://lore.kernel.org/r/c3438dad66a34a7d4e7509a5dd64c2326340a52a.1571647180.git.mbobrow...@mbobrowski.org

That's because some kernel developers have been using a hook script like this:

#!/bin/sh
# For .git/hooks/applypatch-msg
#
# You must have the following in .git/config:
# [am]
#   messageid = true

. git-sh-setup
perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
test -x "$GIT_DIR/hooks/commit-msg" &&
exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
:

 as we had reached rough consensus that this was the best way to
incorporate the message id (since it could be made to be a clickable link
in tools like gitk, for example).  This rough consensus has only been
in place since around the time of the Maintainer's Summit in Lisbon,
so uptake is still probably a bit slow.  I'd expect to see a lot more
of this in the next merge window, though.


Thanks, I was not aware of this!

Seems like something that should go in Documentation/maintainer/,
right?

The figure is much better, 16.7% on all non-merges since 2018-01-01.
This should help and we can maybe already do some interesting things
with git notes and lore/public-inbox.
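
For anyone who wants to redo the count, here is a rough sketch that counts
commits carrying either tag and turns the result into a percentage. It is
not necessarily the exact invocation behind the figure above; the helper
names and the lore.kernel.org Link: pattern are assumptions.

```shell
#!/bin/sh
# count_commits REV SINCE [rev-list options...]
# Print the number of non-merge commits reachable from REV and newer than SINCE.
count_commits() {
    cc_rev=$1
    cc_since=$2
    shift 2
    git rev-list --count --no-merges --since="$cc_since" "$@" "$cc_rev"
}

# percent PART TOTAL: print PART as a percentage of TOTAL with two decimals.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

rev=linus/master    # remote-tracking ref used in this thread; adjust as needed
if git rev-parse --verify -q "$rev^{commit}" >/dev/null 2>&1; then
    total=$(count_commits "$rev" 2018-01-01)
    # Multiple --grep patterns are ORed together by default.
    tagged=$(count_commits "$rev" 2018-01-01 \
        --grep='^Message-Id:' --grep='^Link: https://lore\.kernel\.org/')
    echo "$tagged of $total non-merge commits: $(percent "$tagged" "$total")%"
fi
```

The git part only runs when the ref exists, so the script is harmless outside
a kernel checkout.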


Vegard


email as a bona fide git transport

2019-10-16 Thread Vegard Nossum
sage
commit_editmsg="$(git rev-parse --git-dir)/COMMIT_EDITMSG"
(
if [ -z "$prev" ]
then
echo 'Patchset title'
echo
echo Commits:
echo
git log --oneline $start..HEAD
else
git show --format=format:%B --no-patch $prev
echo Previous-version: $(git rev-parse $prev)
fi
) > "${commit_editmsg}"

${EDITOR} "${commit_editmsg}"

merge=$(git commit-tree -p $start -p HEAD -F "${commit_editmsg}" 
$(git rev-parse HEAD^{tree}))

echo $merge
}

This will open the editor to edit the patchset description and create a
merge commit that encompasses the patches in the patchset (use sha1^- to
view the patches in it).
From 622a0469a4970c5daac0c0323e2d6a77b3bebbdb Mon Sep 17 00:00:00 2001
From: Vegard Nossum 
Date: Sat, 5 Oct 2019 16:15:59 +0200
Subject: [PATCH 1/3] format-patch: add --complete

Include the raw commit data between the changelog and the diffstat.
This will allow 'git am' to reconstruct the commit exactly to the point
where the sha1 will be the same.

Signed-off-by: Vegard Nossum 
---
commit 622a0469a4970c5daac0c0323e2d6a77b3bebbdb
tree 8f09d9d6ed78f8617b2fe54fe9712990ba808546
parent 108b97dc372828f0e72e56bbb40cae8e1e83ece6
author Vegard Nossum  1570284959 +0200
committer Vegard Nossum  1571219301 +0200

---
 builtin/log.c | 12 
 log-tree.c| 17 +
 revision.h|  3 ++-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/builtin/log.c b/builtin/log.c
index c4b35fdaf9..81c1164ae5 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1545,6 +1545,7 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 	char *branch_name = NULL;
 	char *base_commit = NULL;
 	struct base_tree_info bases;
+	int complete = 0;
 	int show_progress = 0;
 	struct progress *progress = NULL;
 	struct oid_array idiff_prev = OID_ARRAY_INIT;
@@ -1622,6 +1623,8 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 			N_("add a signature")),
 		OPT_STRING(0, "base", &base_commit, N_("base-commit"),
 			   N_("add prerequisite tree info to the patch series")),
+		OPT_BOOL(0, "complete", &complete,
+			 N_("include all the information necessary to reconstruct commit exactly")),
 		OPT_FILENAME(0, "signature-file", &signature_file,
 				N_("add a signature from a file")),
 		OPT__QUIET(&quiet, N_("don't print the patch filenames")),
@@ -1905,6 +1908,15 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 		prepare_bases(&bases, base, list, nr);
 	}
 
+	if (complete) {
+		/*
+		 * We need the commit buffer so that we can output the exact
+		 * sequence of bytes that gets hashed as part of a commit.
+		 */
+		save_commit_buffer = 1;
+		rev.show_raw_buffer = 1;
+	}
+
 	if (in_reply_to || thread || cover_letter)
 		rev.ref_message_ids = xcalloc(1, sizeof(struct string_list));
 	if (in_reply_to) {
diff --git a/log-tree.c b/log-tree.c
index 923a299e70..2c9788b25a 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -774,6 +774,22 @@ void show_log(struct rev_info *opt)
 
 		memcpy(&diff_queued_diff, &dq, sizeof(diff_queued_diff));
 	}
+
+	if (opt->show_raw_buffer) {
+		const char *buffer = get_commit_buffer(commit, NULL);
+		const char *subject;
+
+		fprintf(opt->diffopt.file, "---\n");
+		fprintf(opt->diffopt.file, "commit %s\n", oid_to_hex(&commit->object.oid));
+
+		/*
+		 * TODO: hex-encode to avoid mailer mangling?
+		 */
+		if (find_commit_subject(buffer, &subject))
+			fprintf(opt->diffopt.file, "%.*s", (int) (subject - buffer), buffer);
+		else
+			fprintf(opt->diffopt.file, "%s", buffer);
+	}
 }
 
 int log_tree_diff_flush(struct rev_info *opt)
@@ -791,6 +807,7 @@ int log_tree_diff_flush(struct rev_info *opt)
 
 	if (opt->loginfo && !opt->no_commit_id) {
 		show_log(opt);
+
 		if ((opt->diffopt.output_format & ~DIFF_FORMAT_NO_OUTPUT) &&
 		opt->verbose_header &&
 		opt->commit_format != CMIT_FMT_ONELINE &&
diff --git a/revision.h b/revision.h
index 4134dc6029..5297dc9f3c 100644
--- a/revision.h
+++ b/revision.h
@@ -190,7 +190,8 @@ struct rev_info {
 			use_terminator:1,
 			missing_newline:1,
 			date_mode_explicit:1,
-			preserve_subject:1;
+			preserve_subject:1,
+			show_raw_buffer:1;
 	unsigned int	disable_stdin:1;
 	/* --show-linear-break */
 	unsigned int	track_linear:1,
-- 
2.23.0.718.g3120370db8

From 51bb531eb57320caf3761680ebf77c25b89b3719 Mon Sep 17 00:00:00 2001
From: Vegard Nossum 
Date: Wed, 16 Oct 2019 02:04:08 +0200
Subject: [PATCH 2/3] mailinfo: collect commit metadata from mail

Signed-off-by: Vegard Nossum 
---
commit 51bb531eb57320caf3761680ebf77c25b89b3719
tree f3a3141f7d3f706d8ca60cdc1e1cde5aa2cc927a
parent 622a0469a4970c5daac0c0323e2d6a77b3bebbdb
author Vegard Nossum  1571184248 +0200
committer 

Re: [PATCH v3 0/6] Tracing vs CR2

2019-07-17 Thread Vegard Nossum

On 7/17/19 10:07 AM, Peter Zijlstra wrote:

On Tue, Jul 16, 2019 at 09:33:50PM +0200, Vegard Nossum wrote:

------------[ cut here ]------------
General protection fault in user access. Non-canonical address?
WARNING: CPU: 0 PID: 5039 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x5d/0x70

[...]



   https://lkml.kernel.org/r/57754f11-2c65-a2c8-2f6d-bfab0d2f8...@etsukata.com

Does something like the below help?

diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index c8d0f05721a1..80ad4ccb7025 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -226,12 +226,16 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size)
.store  = store,
.size   = size,
};
+   mm_segment_t fs;
  
  	/* Trace user stack if not a kernel thread */
if (current->flags & PF_KTHREAD)
return 0;
  
+   fs = get_fs();
+   set_fs(USER_DS);
arch_stack_walk_user(consume_entry, &c, task_pt_regs(current));
+   set_fs(fs);
return c.len;
  }
  #endif



Yes.


Vegard


Re: [PATCH v3 0/6] Tracing vs CR2

2019-07-17 Thread Vegard Nossum



On 7/17/19 3:02 AM, Andy Lutomirski wrote:

On Tue, Jul 16, 2019 at 2:53 PM Vegard Nossum  wrote:



On 7/16/19 9:33 PM, Vegard Nossum wrote:


On 7/11/19 1:40 PM, Peter Zijlstra wrote:

Hi,

Here's the latest (and hopefully final) set of tracing vs CR2 patches.

They are basically the same as v2, with only minor edits and tags collected
from the last review.

Please consider.



Hi,

I ran my own battery of tests on your patch set on top of
5ad18b2e60b75c7297a998dea702451d33a052ed and ran into this:



On a different thread, Peter and I decided that the last patch in this
series (the one that removes the _DEBUG stuff) is wrong.  Can you see
if these are reproducible with that patch removed?


Yes, without the last patch I still get this:

Run /init as init process
init[711]: segfault at 4000 ip 400a sp 4ff8 
error 7

------------[ cut here ]------------
General protection fault in user access. Non-canonical address?
WARNING: CPU: 0 PID: 711 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x5d/0x70

CPU: 0 PID: 711 Comm: init Not tainted 5.2.0+ #125
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
init[716]: segfault at 4000 ip 400a sp 4ff8 
error 7

RIP: 0010:ex_handler_uaccess+0x5d/0x70
Code: 5d 41 5c c3 e8 c4 8e 0e 00 80 3d e5 74 1e 01 00 75 d3 e8 b6 8e 0e 
00 48 c7 c7 10 a7 fb 81 c6 05 d0 74 1e 01 01 e8 d1 43 01 00 <0f> 0b eb 
b7 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89

RSP: :c965fa18 EFLAGS: 00010086
RAX:  RBX: 81c07dac RCX: 811a887c
init[714]: segfault at 4000 ip 400a sp 4ff8 
error 7

RDX:  RSI: 8289f05f RDI: 0093
RBP: c965fa88 R08: 2e80b265 R09: 003f
init[718]: segfault at 4000 ip 400a sp 4ff8 
error 7

R10:  R11:  R12: 000d
R13: 000d R14:  R15: 
FS:  006ce880() GS:88803ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 3fe0 CR3: 3d2f6004 CR4: 003606f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
Code: Bad RIP value.
 fixup_exception+0x50/0x6a
 do_general_protection+0x40/0x160
 general_protection+0x2d/0x40
RIP: 0010:arch_stack_walk_user+0x71/0x100
Code: 00 48 83 e8 10 49 39 c4 77 45 4c 8b 04 24 4c 89 e3 4d 89 fd 4c 89 
fd 41 83 87 98 0a 00 00 01 0f 01 cb 0f ae e8 31 c0 4c 89 e2 <4c> 8b 33 
4d 89 f4 85 c0 75 7a 48 8b 73 08 0f 01 ca 85 c0 74 1f 65

[...]

This is my reproducer (as init):

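/*
 * The header names were stripped by the list archive; these are the
 * headers that the calls below need on glibc/x86-64.
 */
#define _GNU_SOURCE /* clone() */
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/user.h> /* PAGE_SIZE */
#include <unistd.h>

struct child_data {
  void (*code)();
};

static int child_fn(void *arg)
{
  struct child_data *data = arg;

  /* Make the page executable and jump into it. */
  mprotect(data->code, PAGE_SIZE, PROT_EXEC);
  data->code();
  return 0;
}

int main()
{
  mkdir("/sys", 7);
  mount("nodev", "/sys", "sysfs", 0, "");
  mount("nodev", "/sys/kernel/tracing", "tracefs", 0, "");

  /* Record a user stack trace with every trace event. */
  int tracing_options_userstacktrace =
    open("/sys/kernel/tracing/options/userstacktrace", O_RDWR);

  write(tracing_options_userstacktrace, "1\n", 2);

  /* Enable the preemptirq:irq_disable event. */
  int tracing_events_preemptirq_irq_disable =
    open("/sys/kernel/tracing/events/preemptirq/irq_disable/enable", O_RDWR);

  write(tracing_events_preemptirq_irq_disable, "1\n", 2);

  void *code = mmap(0, PAGE_SIZE, PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, 1, 0);

  {
    unsigned char *output = code;

    /* 48 bd imm64: movabs $0x0706050403020100,%rbp (non-canonical) */
    *output++ = 72;
    *output++ = 189;
    for (int i = 0; i < 8; ++i)
      *output++ = i;
  }

  void *child_stack = mmap(0, PAGE_SIZE, PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, 1, 0);

  while (1) {
    struct child_data data = { code };
    clone(child_fn, child_stack, SIGCHLD, &data);
  }
}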

Compiled with -static and booted with "norandmaps" (for some reason that
makes a difference), this is 100% reproducible for me, although the
reproducer is somewhat sensitive to small changes that I don't quite
understand.


Vegard


Re: [PATCH v3 0/6] Tracing vs CR2

2019-07-16 Thread Vegard Nossum



On 7/16/19 9:33 PM, Vegard Nossum wrote:


On 7/11/19 1:40 PM, Peter Zijlstra wrote:

Hi,

Here's the latest (and hopefully final) set of tracing vs CR2 patches.

They are basically the same as v2, with only minor edits and tags collected
from the last review.

Please consider.



Hi,

I ran my own battery of tests on your patch set on top of 
5ad18b2e60b75c7297a998dea702451d33a052ed and ran into this:


------------[ cut here ]------------
General protection fault in user access. Non-canonical address?
WARNING: CPU: 0 PID: 5039 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x5d/0x70


Got a different one:

WARNING: CPU: 0 PID: 2150 at arch/x86/kernel/traps.c:791 do_debug+0xfe/0x240
CPU: 0 PID: 2150 Comm: init Not tainted 5.2.0+ #124
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014

RIP: 0010:do_debug+0xfe/0x240
Code: 05 07 3d f3 7e f6 85 91 00 00 00 02 0f 85 d8 00 00 00 49 8b 84 24 
18 0b 00 00 f6 44 24 01 40 74 2f f6 85 88 00 00 00 03 75 26 <0f> 0b 80 
e4 bf 49 89 84 24 18 0b 00 00 f0 41 80 0c 24 10 48 81 a5

RSP: :fe00ff20 EFLAGS: 00010046
RAX: 4002 RBX:  RCX: 810e2f72
RDX:  RSI: 0003 RDI: 8201f090
RBP: fe00ff58 R08:  R09: 0005
R10:  R11:  R12: 88803e0df040
R13:  R14: 3d376001 R15: 
FS:  56dbc8c0() GS:88803ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 41f38010 CR3: 3d376001 CR4: 003606f0
DR0: 0001 DR1: 41a4f070 DR2: 7fff959ff000
DR3:  DR6: fffe0ff0 DR7: 03b3062a
Call Trace:
 <#DB>
 debug+0x2d/0x70
RIP: 0010:arch_stack_walk_user+0x74/0x100
Code: e8 10 49 39 c4 77 45 4c 8b 04 24 4c 89 e3 4d 89 fd 4c 89 fd 41 83 
87 98 0a 00 00 01 0f 01 cb 0f ae e8 31 c0 4c 89 e2 4c 8b 33 <4d> 89 f4 
85 c0 75 7a 48 8b 73 08 0f 01 ca 85 c0 74 1f 65 48 8b 04

RSP: :c900030dbd68 EFLAGS: 00040046
RAX:  RBX: 41a4f073 RCX: 811ca27b
RDX: 41a4f073 RSI: 41a4f0dd RDI: c900030dbdb8
RBP: 88803e0df040 R08: c900030dbf58 R09: 
R10:  R11:  R12: 41a4f073
R13: 88803e0df040 R14: 0041281000bf4800 R15: 88803e0df040
 ? stack_trace_consume_entry+0x4b/0x80
 
 ? profile_setup.cold+0xc1/0xc1
 stack_trace_save_user+0x71/0x9c
 trace_buffer_unlock_commit_regs+0x1ae/0x270
 trace_event_buffer_commit+0x90/0x240
 trace_event_raw_event_preemptirq_template+0x9a/0x100
 ? debug+0x49/0x70
 ? perf_trace_preemptirq_template+0x120/0x120
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 trace_hardirqs_off_caller+0xf4/0x150
 ? debug+0x44/0x70
 trace_hardirqs_off_thunk+0x1a/0x1c
 debug+0x49/0x70
RIP: 0033:0x41a4f0dd
Code: 47 11 b7 d2 36 45 6c 49 be 00 f0 9f 95 ff 7f 00 00 49 bf de a7 b3 
e8 d7 21 3c 15 9c 48 81 0c 24 00 01 00 00 9d b8 62 00 00 00 <8e> c0 0f 
05 66 8c c8 9c 48 81 24 24 ff fe ff ff 9d 48 89 04 25 40

RSP: 002b:40901ea0 EFLAGS: 0317
RAX: 0062 RBX: 41281000 RCX: 
RDX: 401c RSI: 41892000 RDI: 41281000
RBP: 41a4f073 R08: 0001 R09: 0001
R10: 917d7748 R11: 1000 R12: fdff
R13: 6c4536d2b71147a5 R14: 7fff959ff000 R15: 153c21d7e8b3a7de
---[ end trace 0cd51ba690f12b47 ]---

The warning is this:

if (WARN_ON_ONCE((dr6 & DR_STEP) && !user_mode(regs))) {
        /*
         * Historical junk that used to handle SYSENTER single-stepping.
         * This should be unreachable now.  If we survive for a while
         * without anyone hitting this warning, we'll turn this into
         * an oops.
         */
        tsk->thread.debugreg6 &= ~DR_STEP;
        set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
        regs->flags &= ~X86_EFLAGS_TF;
}

Unfortunately DR6 from the register dump has already been cleared at the
top of do_debug() and the local variable dr6 is on the stack and not
loaded into any of the registers AFAICT.
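
To make the DR_STEP test concrete, here is a minimal userspace sketch (not kernel code) of checking the single-step bit in a DR6 value. DR_STEP is 0x4000 (DR6 bit 14, "BS"), as in the kernel's debugreg definitions. The helper names are made up for illustration, and reading the dumped DR6 as 0xfffe0ff0 assumes the archive only dropped leading zeros.

```c
#include <stdint.h>

#define DR_STEP      0x4000u /* DR6.BS: single-step (EFLAGS.TF) trap */
#define DR_TRAP_BITS 0x000fu /* DR6.B0..B3: which of DR0..DR3 fired */

/* Nonzero if this DR6 value reports a single-step trap. */
static int dr6_is_single_step(uint64_t dr6)
{
	return (dr6 & DR_STEP) != 0;
}

/* Bitmask of the hardware breakpoints (DR0..DR3) reported in DR6. */
static unsigned dr6_breakpoints_hit(uint64_t dr6)
{
	return (unsigned)(dr6 & DR_TRAP_BITS);
}
```

Under that assumption, dr6_is_single_step(0xfffe0ff0) returns 0. That fits DR6 having been cleared before the dump, so the register dump alone cannot confirm which condition fired.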

From the userspace Code: line you can clearly see it setting EFLAGS_TF,
then it seems to be trapping on the next instruction:

  1b:   9c                      pushfq
  1c:   48 81 0c 24 00 01 00    orq    $0x100,(%rsp)
  23:   00
  24:   9d                      popfq
  25:   b8 62 00 00 00          mov    $0x62,%eax
  2a:*  8e c0                   mov    %eax,%es         <-- trapping instruction


You can see that DR1 points to 41a4f070, which is close to userspace RBP
(41a4f073), which is perhaps being accessed by stack_trace_save_user()
and causing the debug exception on a data breakpoint?
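
One way to sanity-check the data-breakpoint guess is to decode DR7 from the dump. What follows is a minimal sketch, assuming the dumped values are DR7 = 0x03b3062a and DR1 = 0x41a4f070 with only leading zeros dropped. It uses the architectural DR7 layout: L/G enable bits in bits 0-7, then a 2-bit R/W field and a 2-bit LEN field per slot starting at bit 16. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of one hardware breakpoint slot (DR0..DR3). */
struct hwbp {
	int enabled;   /* local or global enable bit set in DR7 */
	unsigned rw;   /* 0 = execute, 1 = write, 2 = I/O, 3 = read/write */
	unsigned len;  /* watched length in bytes: 1, 2, 4 or 8 */
};

/* Decode the DR7 fields that belong to breakpoint slot 0..3. */
static struct hwbp decode_dr7(uint64_t dr7, int slot)
{
	/* LEN encoding: 00 = 1 byte, 01 = 2 bytes, 10 = 8 bytes, 11 = 4 bytes */
	static const unsigned len_bytes[4] = { 1, 2, 8, 4 };
	struct hwbp bp;

	assert(slot >= 0 && slot < 4);
	bp.enabled = ((dr7 >> (slot * 2)) & 0x3) != 0;
	bp.rw = (unsigned)((dr7 >> (16 + slot * 4)) & 0x3);
	bp.len = len_bytes[(dr7 >> (18 + slot * 4)) & 0x3];
	return bp;
}

/* Nonzero if an access to [addr, addr + size) overlaps the watched range. */
static int bp_overlaps(uint64_t bp_addr, unsigned bp_len,
		       uint64_t addr, unsigned size)
{
	/* The CPU aligns the watched range down to the breakpoint length. */
	uint64_t base = bp_addr & ~(uint64_t)(bp_len - 1);

	return addr < base + bp_len && base < addr + size;
}
```

With these inputs, slot 1 decodes as an enabled 8-byte read/write breakpoint covering 0x41a4f070..0x41a4f077, so an 8-byte load from 0x41a4f073 (userspace RBP) would overlap it. That is consistent with the data-breakpoint guess, but it does not prove it.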

The Code: line from stack_trace_save_user() is:

  27:   4c 8b 

Re: [PATCH v3 0/6] Tracing vs CR2

2019-07-16 Thread Vegard Nossum



On 7/11/19 1:40 PM, Peter Zijlstra wrote:

Hi,

Here's the latest (and hopefully final) set of tracing vs CR2 patches.

They are basically the same as v2, with only minor edits and tags collected
from the last review.

Please consider.



Hi,

I ran my own battery of tests on your patch set on top of 
5ad18b2e60b75c7297a998dea702451d33a052ed and ran into this:


[ cut here ]
General protection fault in user access. Non-canonical address?
WARNING: CPU: 0 PID: 5039 at arch/x86/mm/extable.c:126 
ex_handler_uaccess+0x5d/0x70

CPU: 0 PID: 5039 Comm: init Not tainted 5.2.0+ #124
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014

RIP: 0010:ex_handler_uaccess+0x5d/0x70
Code: 5d 41 5c c3 e8 c4 8e 0e 00 80 3d e5 74 1e 01 00 75 d3 e8 b6 8e 0e 
00 48 c7 c7 10 a7 fb 81 c6 05 d0 74 1e 01 01 e8 d1 43 01 00 <0f> 0b eb 
b7 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89

RSP: :fe00fc48 EFLAGS: 00010086
RAX:  RBX: 81c07dac RCX: 811a887c
RDX:  RSI: 8289f05f RDI: 0093
RBP: fe00fcb8 R08: 0036fe0f15d3 R09: 003f
R10:  R11:  R12: 000d
R13: 000d R14:  R15: 
FS:  563ab8c0() GS:88803ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 1ff7 CR3: 3c804002 CR4: 003606f0
DR0: 40209100 DR1: 402091a1 DR2: 
DR3:  DR6: 0ff1 DR7: 000b062a
Call Trace:
 <#DB>
 fixup_exception+0x50/0x6a
 do_general_protection+0x40/0x160
 general_protection+0x2d/0x40
RIP: 0010:arch_stack_walk_user+0x71/0x100
Code: 00 48 83 e8 10 49 39 c4 77 45 4c 8b 04 24 4c 89 e3 4d 89 fd 4c 89 
fd 41 83 87 98 0a 00 00 01 0f 01 cb 0f ae e8 31 c0 4c 89 e2 <4c> 8b 33 
4d 89 f4 85 c0 75 7a 48 8b 73 08 0f 01 ca 85 c0 74 1f 65

RSP: :fe00fd68 EFLAGS: 00050046
RAX:  RBX: 854163717acc2789 RCX: 811ca27b
RDX: 854163717acc2789 RSI: 40209102 RDI: fe00fdb8
RBP: 88803d55d040 R08: c9000520bf58 R09: 
R10:  R11:  R12: 854163717acc2789
R13: 88803d55d040 R14: 0093 R15: 88803d55d040
 ? stack_trace_consume_entry+0x4b/0x80
 ? arch_stack_walk_user+0x34/0x100
 ? profile_setup.cold+0xc1/0xc1
 stack_trace_save_user+0x71/0x9c
 trace_buffer_unlock_commit_regs+0x1ae/0x270
 trace_event_buffer_commit+0x90/0x240
 trace_event_raw_event_preemptirq_template+0x9a/0x100
 ? debug+0x16/0x70
 ? perf_trace_preemptirq_template+0x120/0x120
 ? trace_hardirqs_off_thunk+0x1a/0x1c
 trace_hardirqs_off_caller+0xf4/0x150
 trace_hardirqs_off_thunk+0x1a/0x1c
 ? debug+0x11/0x70
 debug+0x16/0x70
RIP: 0010:copy_user_generic_unrolled+0xa0/0xc0
Code: 7f 40 ff c9 75 b6 89 d1 83 e2 07 c1 e9 03 74 12 4c 8b 06 4c 89 07 
48 8d 76 08 48 8d 7f 08 ff c9 75 ee 21 d2 74 10 89 d1 8a 06 <88> 07 48 
ff c6 48 ff c7 ff c9 75 f2 31 c0 0f 01 ca c3 0f 1f 40 00

RSP: :c9000520be38 EFLAGS: 00040202
RAX: 88803d55d09c RBX: 88803d55d040 RCX: 0001
RDX: 0001 RSI: 40209102 RDI: c9000520be76
RBP: 0001 R08: 0001 R09: 
R10:  R11:  R12: 7000
R13: 40209102 R14: c9000520be76 R15: 
 
 __probe_kernel_read+0x57/0x90
 is_prefetch.isra.0+0xb5/0x210
 ? tracer_hardirqs_on+0x53/0x1a0
 __bad_area_nosemaphore+0x9e/0x220
 __do_page_fault+0x483/0x630
 ? async_page_fault+0x8/0x40
 async_page_fault+0x36/0x40
RIP: 0033:0x40209102
Code: 00 00 49 bc 00 20 23 40 00 00 00 00 49 bd 00 00 d0 40 00 00 00 00 
49 be ff ff ff ff ff ff ff ff 49 bf 00 50 80 40 00 00 00 00 <9c> 48 81 
0c 24 00 04 00 00 48 81 0c 24 00 00 04 00 9d ff 2c 25 00

RSP: 002b:1fff EFLAGS: 00010217
RAX:  RBX: 402090b0 RCX: 0001
RDX: 0001 RSI:  RDI: 41ebb000
RBP: 854163717acc2789 R08: 0001 R09: b1f39cc399a61ebb
R10: 7ffeab175000 R11: 0360 R12: 40232000
R13: 40d0 R14:  R15: 40805000
---[ end trace e5e49800ff5aa5ed ]---
PANIC: double fault, error_code: 0x0
CPU: 0 PID: 5039 Comm: init Tainted: GW 5.2.0+ #124
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014

RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50
Code: 82 e8 74 2d f8 ff 48 89 9d 10 01 00 00 48 89 ee 5b 4c 89 e7 5d 41 
5c e9 8e 5d 12 00 5b b8 f4 ff ff ff 5d 41 5c c3 0f 1f 40 00 <65> 48 8b 
04 25 c0 6c 01 00 65 8b 15 78 ba df 7e 81 e2 00 01 1f 00

RSP: :fe00f008 EFLAGS: 00010093
RAX: 00016cc0 RBX: 81a01436 RCX: 81a00b97
RDX: 00016cc0 RSI: 81a01428 RDI: 81a01436
RBP: fe00f088 R08: 

Re: [PATCH v8 4/5] x86/xsave: Make XSAVE check the base CPUID features before enabling

2019-06-29 Thread Vegard Nossum



On 10/5/17 11:52 PM, Andi Kleen wrote:

From: Andi Kleen 

Before enabling XSAVE, not only check the XSAVE specific CPUID bits,
but also the base CPUID features of the respective XSAVE feature.
This allows to disable individual XSAVE states using the existing
clearcpuid= option, which can be useful for performance testing
and debugging, and also in general avoids inconsistencies.

v2:
Add curly brackets (Thomas Gleixner)
Signed-off-by: Andi Kleen 
---
  arch/x86/kernel/fpu/xstate.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index f1d5476c9022..924bd895b5ee 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -15,6 +15,7 @@
  #include 
  
  #include 
+#include 
  
  /*
   * Although we spell it out in here, the Processor Trace
@@ -36,6 +37,19 @@ static const char *xfeature_names[] =
"unknown xstate feature"  ,
  };
  
+static short xsave_cpuid_features[] = {
+   X86_FEATURE_FPU,
+   X86_FEATURE_XMM,
+   X86_FEATURE_AVX,
+   X86_FEATURE_MPX,
+   X86_FEATURE_MPX,
+   X86_FEATURE_AVX512F,
+   X86_FEATURE_AVX512F,
+   X86_FEATURE_AVX512F,
+   X86_FEATURE_INTEL_PT,
+   X86_FEATURE_PKU,
+};
+
  /*
   * Mask of xstate features supported by the CPU and the kernel:
   */
@@ -726,6 +740,7 @@ void __init fpu__init_system_xstate(void)
unsigned int eax, ebx, ecx, edx;
static int on_boot_cpu __initdata = 1;
int err;
+   int i;
  
  	WARN_ON_FPU(!on_boot_cpu);
on_boot_cpu = 0;
@@ -759,6 +774,14 @@ void __init fpu__init_system_xstate(void)
goto out_disable;
}
  
+	/*
+* Clear XSAVE features that are disabled in the normal CPUID.
+*/
+*/
+   for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
+   if (!boot_cpu_has(xsave_cpuid_features[i]))
+   xfeatures_mask &= ~BIT(i);
+   }
+
xfeatures_mask &= fpu__get_supported_xfeatures_mask();
  
  	/* Enable xstate instructions to be able to continue with initialization: */
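
The table-driven masking in the patch above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the enum stands in for the kernel's X86_FEATURE_* flags, the cpu_features bitmask stands in for boot_cpu_has(), and filter_xfeatures() is a made-up name. Only the table order is taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-ins for the X86_FEATURE_* flags used by the patch. */
enum base_feature {
	FEAT_FPU, FEAT_XMM, FEAT_AVX, FEAT_MPX,
	FEAT_AVX512F, FEAT_INTEL_PT, FEAT_PKU,
};

/*
 * Index i is the xstate component number; the value is the base CPUID
 * feature that component depends on (same order as the patch's table).
 */
static const enum base_feature xsave_base_feature[] = {
	FEAT_FPU, FEAT_XMM, FEAT_AVX, FEAT_MPX, FEAT_MPX,
	FEAT_AVX512F, FEAT_AVX512F, FEAT_AVX512F, FEAT_INTEL_PT, FEAT_PKU,
};

/* Clear every xstate bit whose base feature is absent from cpu_features. */
static uint64_t filter_xfeatures(uint64_t xfeatures_mask, uint32_t cpu_features)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(xsave_base_feature); i++) {
		if (!(cpu_features & (1u << xsave_base_feature[i])))
			xfeatures_mask &= ~(UINT64_C(1) << i);
	}
	return xfeatures_mask;
}
```

In this sketch, clearing the FEAT_XMM bit turns a mask of 0x3ff into 0x3fd, dropping only component 1. Clearing FEAT_AVX512F removes components 5 to 7 together.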




Hi,

The commit for this patch in mainline
(ccb18db2ab9d923df07e7495123fe5fb02329713) causes the kernel to hang on
boot when passing the "nofxsr" option:

$ kvm -cpu host -kernel arch/x86/boot/bzImage -append "console=ttyS0 nofxsr earlyprintk=ttyS0" -serial stdio -display none -smp 2

early console in extract_kernel
input_data: 0x01dea276
input_len: 0x00500704
output: 0x0100
output_len: 0x012c79b4
kernel_total_size: 0x00f24000
booted via startup_32()
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

Decompressing Linux... Parsing ELF... Performing relocations... done.
Booting the kernel.
[..hang..]

If I revert it from Linus's tree (~5.2-rc6) then it boots again:

early console in extract_kernel
input_data: 0x024192e9
input_len: 0x005d8ea1
output: 0x0100
output_len: 0x019c7fa4
kernel_total_size: 0x0162c000
trampoline_32bit: 0x0009d000
booted via startup_32()
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

Decompressing Linux... Parsing ELF... Performing relocations... done.
Booting the kernel.
Linux version 5.2.0-rc6+ (vegard@t460) (gcc version 5.5.0 20171010 
(Ubuntu 5.5.0-12ubuntu1~16.04)) #98 SMP PREEMPT Sat Jun 29 17:13:31 CEST 
2019

Command line: console=ttyS0 nofxsr earlyprintk=ttyS0
[..normal boot..]

/proc/cpuinfo inside the VM is:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 78
model name  : Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
stepping: 3
microcode   : 0x1
cpu MHz : 2496.000
cache size  : 4096 KB
physical id : 0
siblings: 1
core id : 0
cpu cores   : 1
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid tsc_known_freq 
pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb tpr_shadow 
vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep 
bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec 
xgetbv1 xsaves arat
bugs: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass 
l1tf mds

bogomips: 4992.00
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:


Vegard


Re: [PATCH] mm, thp: Fix mlocking THP page with migration enabled

2018-09-12 Thread Vegard Nossum
On Tue, 11 Sep 2018 at 12:34, Kirill A. Shutemov
 wrote:
>
> A transparent huge page is represented by a single entry on an LRU list.
> Therefore, we can only make unevictable an entire compound page, not
> individual subpages.
>
> If a user tries to mlock() part of a huge page, we want the rest of the
> page to be reclaimable.
>
> We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
> PMD on border of VM_LOCKED VMA will be split into PTE table.
>
> Introduction of THP migration breaks the rules around mlocking THP
> pages. If we had a single PMD mapping of the page in mlocked VMA, the
> page will get mlocked, regardless of PTE mappings of the page.
>
> For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
> remove_migration_pmd().
>
> Anon THP pages can only be shared between processes via fork(). Mlocked
> page can only be shared if parent mlocked it before forking, otherwise
> CoW will be triggered on mlock().
>
> For Anon-THP, we can fix the issue by munlocking the page on removing PTE
> migration entry for the page. PTEs for the page will always come after
> mlocked PMD: rmap walks VMAs from oldest to newest.
>
> Test-case:
>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/wait.h>
> #include <linux/mempolicy.h>
> #include <numaif.h>
>
> int main(void)
> {
> unsigned long nodemask = 4;
> void *addr;
>
> addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
>
> if (fork()) {
> wait(NULL);
> return 0;
> }
>
> mlock(addr, 4UL << 10);
> mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
> &nodemask, 4, MPOL_MF_MOVE | MPOL_MF_MOVE_ALL);

MPOL_MF_MOVE_ALL is actually not required to trigger the bug.

>
> return 0;
> }
>
> Signed-off-by: Kirill A. Shutemov 
> Reported-by: Vegard Nossum 

Would you mind putting vegard.nos...@oracle.com instead?

> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")

The commit I bisected the problem to was actually a different one:

commit c8633798497ce894c22ab083eb884c8294c537b2
Author: Naoya Horiguchi 
Date:   Fri Sep 8 16:11:08 2017 -0700

mm: mempolicy: mbind and migrate_pages support thp migration

But maybe you had a good reason to choose the other one instead. They
are close together in any case, so I guess it would be hard to find a
kernel with one commit and not the other.

> Cc:  [v4.14+]
> Cc: Zi Yan 
> Cc: Naoya Horiguchi 
> Cc: Vlastimil Babka 
> Cc: Andrea Arcangeli 

You could also add:

Link: https://lkml.org/lkml/2018/8/30/464

Thanks for debugging this.


Vegard


Re: v4.18.0+ WARNING: at mm/vmscan.c:1756 isolate_lru_page + bad page state

2018-09-10 Thread Vegard Nossum
On Thu, 30 Aug 2018 at 15:31, Vegard Nossum  wrote:
>
> Hi,
>
> Got this on a recent kernel (pretty sure it was
> 2ad0d52699700a91660a406a4046017a2d7f246a but annoyingly the oops
> itself doesn't tell me the exact version):
>
> [ cut here ]
> trying to isolate tail page
> WARNING: CPU: 2 PID: 19156 at mm/vmscan.c:1756 isolate_lru_page+0x235/0x250

[...]

> I don't have the capacity to debug it atm and it may even have been
> fixed in mainline (though searching didn't yield any other reports
> AFAICT).
>
> I have .config and vmlinux (with DEBUG_INFO=y) if needed.
>
> It's not reproducible for the time being.

Just a quick follow-up: I have a reproducer and Kirill Shutemov has
identified the problem and provided a tentative patch.


Vegard


v4.18.0+ WARNING: at mm/vmscan.c:1756 isolate_lru_page + bad page state

2018-08-30 Thread Vegard Nossum
Hi,

Got this on a recent kernel (pretty sure it was
2ad0d52699700a91660a406a4046017a2d7f246a but annoyingly the oops
itself doesn't tell me the exact version):

[ cut here ]
trying to isolate tail page
WARNING: CPU: 2 PID: 19156 at mm/vmscan.c:1756 isolate_lru_page+0x235/0x250
CPU: 2 PID: 19156 Comm: mmap Not tainted 4.18.0+ #493
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:isolate_lru_page+0x235/0x250
Code: fe ff ff 48 c7 c6 80 73 43 82 48 c7 c7 60 27 a9 82 e8 3f 40 c9
00 85 c0 0f 84 f4 fd ff ff 48 c7 c7 a5 ba 75 82 e8 6b 59 ed ff <0f> 0b
e9 e1 fd ff ff 49 c7 c7 00 fe ff ff 44 89 7c 24 04 e9 ed fe
RSP: 0018:c90008edbc20 EFLAGS: 00010282
RAX:  RBX: ea00082fd000 RCX: 0002
RDX: 8002 RSI: 0002 RDI: 
RBP: 8803a157ea00 R08: 0001 R09: 
R10: 82e456dc R11: 0001 R12: ea00082fd000
R13: 80020bf40805 R14: 7fe50f341000 R15: c90008edbdd8
FS:  () GS:88042fb0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00580fb8 CR3: 02a1e004 CR4: 000606e0
Call Trace:
 clear_page_mlock+0x73/0xb0
 page_remove_rmap+0x31e/0x370
 unmap_page_range+0x70b/0xa40
 unmap_vmas+0x47/0x90
 exit_mmap+0xb0/0x1c0
 mmput+0x5d/0x130
 do_exit+0x2c2/0xc20
 do_group_exit+0x42/0xb0
 __x64_sys_exit_group+0xf/0x10
 do_syscall_64+0x57/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x501ad8
Code: Bad RIP value.
RSP: 002b:7fff9bb8dee8 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX:  RCX: 00501ad8
RDX:  RSI: 003c RDI: 
RBP: 0059b4a0 R08: 00e7 R09: ffc8
R10:  R11: 0246 R12: 0001
R13: 007d7860 R14: 00027150 R15: 7fff9bb8e0c0
---[ end trace d3ada49968979043 ]---
[ cut here ]
list_del corruption, ea00082fd008->prev is LIST_POISON2 (dead0200)
WARNING: CPU: 2 PID: 19156 at lib/list_debug.c:50
__list_del_entry_valid+0x62/0x90
CPU: 2 PID: 19156 Comm: mmap Tainted: GW 4.18.0+ #493
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:__list_del_entry_valid+0x62/0x90
Code: 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 f0 b3 79 82 e8 d2 84 b1
ff 0f 0b 31 c0 c3 48 89 fe 48 c7 c7 28 b4 79 82 e8 be 84 b1 ff <0f> 0b
31 c0 c3 48 89 fe 48 c7 c7 60 b4 79 82 e8 aa 84 b1 ff 0f 0b
RSP: 0018:c90008edbc18 EFLAGS: 00010086
RAX:  RBX: ea00082fd000 RCX: 0003
RDX: 0003 RSI: 0003 RDI: 
RBP: 88043fff0d00 R08: 0001 R09: 
R10: 8802794a60c8 R11: 0001 R12: 0004
R13: 88042f4ae800 R14: 0005 R15: c90008edbdd8
FS:  () GS:88042fb0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00501aae CR3: 02a1e004 CR4: 000606e0
Call Trace:
 isolate_lru_page+0xf3/0x250
 clear_page_mlock+0x73/0xb0
 page_remove_rmap+0x31e/0x370
 unmap_page_range+0x70b/0xa40
 unmap_vmas+0x47/0x90
 exit_mmap+0xb0/0x1c0
 mmput+0x5d/0x130
 do_exit+0x2c2/0xc20
 do_group_exit+0x42/0xb0
 __x64_sys_exit_group+0xf/0x10
 do_syscall_64+0x57/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x501ad8
Code: Bad RIP value.
RSP: 002b:7fff9bb8dee8 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX:  RCX: 00501ad8
RDX:  RSI: 003c RDI: 
RBP: 0059b4a0 R08: 00e7 R09: ffc8
R10:  R11: 0246 R12: 0001
R13: 007d7860 R14: 00027150 R15: 7fff9bb8e0c0
---[ end trace d3ada49968979044 ]---
BUG: Bad page state in process mmap  pfn:20bf40
page:ea00082fd000 count:0 mapcount:0 mapping:dead0400 index:0x1
flags: 0x400()
raw: 0400 dead0100 dead0200 dead0400
raw: 0001   
page dumped because: non-NULL mapping
CPU: 2 PID: 19156 Comm: mmap Tainted: GW 4.18.0+ #493
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x5c/0x7b
 bad_page+0xb3/0x110
 free_pcppages_bulk+0x17b/0x7e0
 free_unref_page+0x4a/0x60
 zap_huge_pmd+0x204/0x360
 unmap_page_range+0x970/0xa40
 unmap_vmas+0x47/0x90
 exit_mmap+0xb0/0x1c0
 mmput+0x5d/0x130
 do_exit+0x2c2/0xc20
 do_group_exit+0x42/0xb0
 __x64_sys_exit_group+0xf/0x10
 do_syscall_64+0x57/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x501ad8
Code: Bad RIP value.
RSP: 002b:7fff9bb8dee8 EFLAGS: 0246 ORIG_RAX: 

Re: Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2018-08-17 Thread Vegard Nossum
On 16 August 2018 at 17:42, Richard Weinberger
 wrote:
> On Thu, Aug 16, 2018 at 2:58 PM Sedat Dilek  wrote:
>>
>> Hi Linus,
>>
>> I am here on Linux v4.18 and tried first to merge the l1tf-final Git-branch.
>> Unfortunately, this is no more available in the tip Git-tree.
>>
>> Then I saw Linux v4.18.1 which includes all the above stuff.
>>
>> I tried to 'git cherry-pick -m 1 958f338e96f874a0d29442396d6adf9c1e17aa2d'.
>> I know the commit-id is the hash of a merge.
>> Luckily, I could get the "diff" and applied it.
>> But the history misses.
>>
>> How can I get the history and subjects of all commits in your tree to
>> cherry-pick the single commits?
>>
>> Do you happen to know another solution to get easily all L1TF commits
>> with any other tricks?
>
> That should help:
> git log --oneline
> 958f338e96f874a0d29442396d6adf9c1e17aa2d^..958f338e96f874a0d29442396d6adf9c1e17aa2d

Hey,

As a shorthand for this, you can also use just:

git log --oneline 958f338e96f87^-

The syntax was made especially so that you can see all the commits
that arrived via a merge commit without having to write the rev of the
merge twice but is otherwise exactly equivalent to "rev^..rev".

It should work from git v2.13. Just a tip :-)


Vegard


Re: [PATCH] fscache: fix a kernel BUG at fs/fscache/operation.c:69!

2018-05-08 Thread Vegard Nossum
On 22 February 2018 at 08:33,   wrote:
> From: Lei Xue 
>
> There is a potential race in fscache operation enqueuing for reading and
> copying multiple pages from cachefiles to netfs.
> Under some heavy load system, it will happen very often.
>
> If this race occurs, an oops similar to the following is seen:
>
>  kernel BUG at fs/fscache/operation.c:69!
>  invalid opcode:  [#1] SMP
>  …
>  #0 [883fff0838d8] machine_kexec at 81051beb
>  #1 [883fff083938] crash_kexec at 810f2542
>  #2 [883fff083a08] oops_end at 8163e1a8
>  #3 [883fff083a30] die at 8101859b
>  #4 [883fff083a60] do_trap at 8163d860
>  #5 [883fff083ab0] do_invalid_op at 81015204
>  #6 [883fff083b60] invalid_op at 8164701e
> [exception RIP: fscache_enqueue_operation+246]
> RIP: a0b793c6  RSP: 883fff083c18  RFLAGS: 00010046
> RAX: 0019  RBX: 8832ed1a9ec0  RCX: 0006
> RDX:   RSI: 0046  RDI: 0046
> RBP: 883fff083c20   R8: 0086   R9: 178f
> R10: 816aeb00  R11: 883fff08392e  R12: 8802f0525620
> R13: 88407ffc01d8  R14:   R15: 0003
> ORIG_RAX:   CS: 0010  SS: 
>  #7 [883fff083c10] fscache_enqueue_operation at a0b793c6
>  #8 [883fff083c28] cachefiles_read_waiter at a0b15a48
>  #9 [883fff083c48] __wake_up_common at 810af028
>
> Signed-off-by: Lei Xue 
> ---
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index 883bc7bb12c5..9d5d13e150fb 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -58,9 +58,9 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
>
> spin_lock(&object->work_lock);
> list_add_tail(&monitor->op_link, &monitor->op->to_do);
> +   fscache_enqueue_retrieval(monitor->op);
> spin_unlock(>work_lock);
>
> -   fscache_enqueue_retrieval(monitor->op);
> return 0;
>  }

Hi,

Just wondering what the status of this patch is?

We've been hitting a similar problem and arrived at the same patch as
a potential fix for it.

Our crashes look like this:

WARNING: CPU: 0 PID: 120693 at kernel/workqueue.c:618 insert_work+0x5f/0x70
Modules linked in: nbd
CPU: 0 PID: 120693 Comm: sh Not tainted 4.16.2-0 #1
Hardware name: Oracle Corporation  Sun Fire X4800/20434, BIOS 11080200
   08/12/2016
RIP: 0010:insert_work+0x5f/0x70
RSP: 0018:88103fa039b8 EFLAGS: 00010046
RAX: 88103f443f00 RBX: 880187c37c00 RCX: 0005
RDX: 880187c37c20 RSI: 8807c04dec00 RDI: 
RBP: 88103fa039c8 R08: 0101 R09: 0001
R10: 887eee68fd40 R11: 0001 R12: 88503fafc600
R13: 0001cf60 R14: 880187c37c00 R15: 88103f443f00
FS:  () GS:88103fa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f394d2780a0 CR3: 000bcc936000 CR4: 06f0
Call Trace:
 
 __queue_work+0x11f/0x320
 queue_work_on+0x19/0x30
 fscache_enqueue_operation+0x83/0x160
 cachefiles_read_waiter+0xd2/0x130
 __wake_up_common+0x81/0x120
 __wake_up_locked_key_bookmark+0x16/0x20
 wake_up_page_bit+0x97/0xe0
 unlock_page+0x20/0x30
 page_endio+0x21/0xa0
 mpage_end_io+0x41/0x60
 bio_endio+0x78/0x90
 dec_pending+0x140/0x250
 ? linear_status+0x40/0x40
 clone_endio+0x86/0x100
 bio_endio+0x78/0x90
 blk_update_request+0x8d/0x2b0
 scsi_end_request+0x36/0x200
 scsi_io_completion+0x12a/0x5e0
 scsi_finish_command+0xf2/0x150
 scsi_softirq_done+0x13e/0x160
 __blk_mq_complete_request+0xb8/0x180
 blk_mq_complete_request+0x57/0x70
 scsi_mq_done+0x10/0x20
 megasas_complete_cmd+0xdf/0x620
 megasas_complete_cmd_dpc+0x8f/0x100
 tasklet_action+0x9a/0xb0
 __do_softirq+0xbf/0x1c8
 irq_exit+0x9c/0xb0
 do_IRQ+0x5b/0xe0
 common_interrupt+0xf/0xf
 
RIP: 0010:_raw_spin_unlock_irqrestore+0x9/0x10
RSP: 0018:c900309e3cf8 EFLAGS: 0296 ORIG_RAX: ffde
RAX: 0002 RBX: 0002 RCX: 0001
RDX: ea0006793fe0 RSI: 0296 RDI: 88107800
RBP: c900309e3cf8 R08: 0002 R09: 0011b912
R10: 00e7 R11:  R12: ea0014baa000
R13: 88103fa1d120 R14: 88107fff6000 R15: 88107fff6000
 pagevec_lru_move_fn+0xb7/0xe0
 ? pagevec_move_tail_fn+0x350/0x350
 __pagevec_lru_add+0x12/0x20
 lru_add_drain_cpu+0xc4/0xe0
 lru_add_drain+0x10/0x20
 exit_mmap+0x58/0x190
 ? __handle_mm_fault+0x9a4/0x1540
 ? hrtimer_try_to_cancel+0x1b/0xa0
 mmput+0x4e/0x100
 do_exit+0x22f/0xa10
 do_group_exit+0x3a/0xa0
 SyS_exit_group+0x12/0x20
 do_syscall_64+0x61/0x110
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f394d325fa8
RSP: 002b:7ffda407e668 

Re: [PATCH] fscache: fix a kernel BUG at fs/fscache/operation.c:69!

2018-05-08 Thread Vegard Nossum
On 22 February 2018 at 08:33,   wrote:
> From: Lei Xue 
>
> There is a potential race in fscache operation enqueuing for reading and
> copying multiple pages from cachefiles to netfs.
> Under some heavy load system, it will happen very often.
>
> If this race occurs, an oops similar to the following is seen:
>
>  kernel BUG at fs/fscache/operation.c:69!
>  invalid opcode:  [#1] SMP
>  …
>  #0 [883fff0838d8] machine_kexec at 81051beb
>  #1 [883fff083938] crash_kexec at 810f2542
>  #2 [883fff083a08] oops_end at 8163e1a8
>  #3 [883fff083a30] die at 8101859b
>  #4 [883fff083a60] do_trap at 8163d860
>  #5 [883fff083ab0] do_invalid_op at 81015204
>  #6 [883fff083b60] invalid_op at 8164701e
> [exception RIP: fscache_enqueue_operation+246]
> RIP: a0b793c6  RSP: 883fff083c18  RFLAGS: 00010046
> RAX: 0019  RBX: 8832ed1a9ec0  RCX: 0006
> RDX:   RSI: 0046  RDI: 0046
> RBP: 883fff083c20   R8: 0086   R9: 178f
> R10: 816aeb00  R11: 883fff08392e  R12: 8802f0525620
> R13: 88407ffc01d8  R14:   R15: 0003
> ORIG_RAX:   CS: 0010  SS: 
>  #7 [883fff083c10] fscache_enqueue_operation at a0b793c6
>  #8 [883fff083c28] cachefiles_read_waiter at a0b15a48
>  #9 [883fff083c48] __wake_up_common at 810af028
>
> Signed-off-by: Lei Xue 
> ---
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index 883bc7bb12c5..9d5d13e150fb 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -58,9 +58,9 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode,
>
>  	spin_lock(&object->work_lock);
>  	list_add_tail(&monitor->op_link, &monitor->op->to_do);
> +	fscache_enqueue_retrieval(monitor->op);
>  	spin_unlock(&object->work_lock);
>
> -	fscache_enqueue_retrieval(monitor->op);
>  	return 0;
>  }

Hi,

Just wondering what the status of this patch is?

We've been hitting a similar problem and arrived at the same patch as
a potential fix for it.

Our crashes look like this:

WARNING: CPU: 0 PID: 120693 at kernel/workqueue.c:618 insert_work+0x5f/0x70
Modules linked in: nbd
CPU: 0 PID: 120693 Comm: sh Not tainted 4.16.2-0 #1
Hardware name: Oracle Corporation  Sun Fire X4800/20434, BIOS 11080200 08/12/2016
RIP: 0010:insert_work+0x5f/0x70
RSP: 0018:88103fa039b8 EFLAGS: 00010046
RAX: 88103f443f00 RBX: 880187c37c00 RCX: 0005
RDX: 880187c37c20 RSI: 8807c04dec00 RDI: 
RBP: 88103fa039c8 R08: 0101 R09: 0001
R10: 887eee68fd40 R11: 0001 R12: 88503fafc600
R13: 0001cf60 R14: 880187c37c00 R15: 88103f443f00
FS:  () GS:88103fa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f394d2780a0 CR3: 000bcc936000 CR4: 06f0
Call Trace:
 
 __queue_work+0x11f/0x320
 queue_work_on+0x19/0x30
 fscache_enqueue_operation+0x83/0x160
 cachefiles_read_waiter+0xd2/0x130
 __wake_up_common+0x81/0x120
 __wake_up_locked_key_bookmark+0x16/0x20
 wake_up_page_bit+0x97/0xe0
 unlock_page+0x20/0x30
 page_endio+0x21/0xa0
 mpage_end_io+0x41/0x60
 bio_endio+0x78/0x90
 dec_pending+0x140/0x250
 ? linear_status+0x40/0x40
 clone_endio+0x86/0x100
 bio_endio+0x78/0x90
 blk_update_request+0x8d/0x2b0
 scsi_end_request+0x36/0x200
 scsi_io_completion+0x12a/0x5e0
 scsi_finish_command+0xf2/0x150
 scsi_softirq_done+0x13e/0x160
 __blk_mq_complete_request+0xb8/0x180
 blk_mq_complete_request+0x57/0x70
 scsi_mq_done+0x10/0x20
 megasas_complete_cmd+0xdf/0x620
 megasas_complete_cmd_dpc+0x8f/0x100
 tasklet_action+0x9a/0xb0
 __do_softirq+0xbf/0x1c8
 irq_exit+0x9c/0xb0
 do_IRQ+0x5b/0xe0
 common_interrupt+0xf/0xf
 
RIP: 0010:_raw_spin_unlock_irqrestore+0x9/0x10
RSP: 0018:c900309e3cf8 EFLAGS: 0296 ORIG_RAX: ffde
RAX: 0002 RBX: 0002 RCX: 0001
RDX: ea0006793fe0 RSI: 0296 RDI: 88107800
RBP: c900309e3cf8 R08: 0002 R09: 0011b912
R10: 00e7 R11:  R12: ea0014baa000
R13: 88103fa1d120 R14: 88107fff6000 R15: 88107fff6000
 pagevec_lru_move_fn+0xb7/0xe0
 ? pagevec_move_tail_fn+0x350/0x350
 __pagevec_lru_add+0x12/0x20
 lru_add_drain_cpu+0xc4/0xe0
 lru_add_drain+0x10/0x20
 exit_mmap+0x58/0x190
 ? __handle_mm_fault+0x9a4/0x1540
 ? hrtimer_try_to_cancel+0x1b/0xa0
 mmput+0x4e/0x100
 do_exit+0x22f/0xa10
 do_group_exit+0x3a/0xa0
 SyS_exit_group+0x12/0x20
 do_syscall_64+0x61/0x110
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f394d325fa8
RSP: 002b:7ffda407e668 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX: 

Re: [PATCH 00/45] C++: Convert the kernel to C++

2018-04-02 Thread Vegard Nossum
On 1 April 2018 at 22:40, David Howells  wrote:
>
> Here are a series of patches to start converting the kernel to C++.  It
> requires g++ v8.

Nice!

I tried something similar a few years ago, but I don't think it was
nearly as neat. I did get RTTI and exceptions to work (using libcxxrt
+ libunwind), though. Having noticed that a lot of really trivial
kernel bugs are due to control flow issues (e.g. when somebody adds a
possibly-failing step to a function but forgets to add a new label to
clean it up) I really wanted to see how/whether exceptions and RAII
could help in that space.

Just in case you want to compare notes, I've pushed my branch to:

https://github.com/vegard/linux-2.6/tree/cxx

I also started a little bit of work on converting a driver to use
RAII, and quickly ran into a few problems: C++ destructors don't take
arguments, which means that some objects would have to carry extra
state around because some of the information needed to destroy an
object resides with somebody else. This means that you would have to
do more refactoring work to avoid needing this in the first place,
i.e. mapping creation/destruction of various C-style structs to C++ is
_not_ straightforward.

Take dma_alloc_coherent() for example. It pairs up with
dma_free_coherent(), which also needs to know the device and buffer
size that you passed to the allocation call:

void *dma_alloc_coherent(struct device *, size_t, dma_addr_t *, gfp_t);
void dma_free_coherent(struct device *, size_t, void *, dma_addr_t);

This means that if you have 1 device using 2 buffers of the same size
and the size is stored only by the device struct, then you must always
do the destruction from the device struct, since the individual
buffers don't know their size (unless you move the member there; but
it feels like a waste of memory if you could do it just fine in C).
Maybe there's a "proper" way to do it that I didn't see, but problems
like this turned me off the whole approach a little.
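
To make the destructor problem concrete, here is a minimal userspace sketch
of an RAII wrapper. The names device, fake_dma_alloc(), fake_dma_free(),
dma_buffer and demo_scope() are stand-ins invented for this illustration
(they only mimic the argument shape of the DMA API and are not kernel code).
Since the destructor cannot take arguments, each buffer object ends up
storing the device pointer and the size itself:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <utility>

// Userspace stand-ins so the sketch compiles outside the kernel.
struct device { int live_buffers = 0; };
using dma_addr_t = std::uint64_t;

inline void *fake_dma_alloc(device *dev, std::size_t size, dma_addr_t *handle)
{
    void *cpu = std::malloc(size);
    if (cpu) {
        *handle = reinterpret_cast<std::uintptr_t>(cpu);
        ++dev->live_buffers;
    }
    return cpu;
}

inline void fake_dma_free(device *dev, std::size_t size, void *cpu,
                          dma_addr_t handle)
{
    (void)size;
    (void)handle;
    std::free(cpu);
    --dev->live_buffers;
}

// RAII wrapper: ~dma_buffer() takes no arguments, so the device pointer and
// the size needed for freeing must be carried by every buffer object.
class dma_buffer {
public:
    dma_buffer(device *dev, std::size_t size)
        : dev_(dev), size_(size), cpu_(fake_dma_alloc(dev, size, &handle_))
    {
        assert(dev != nullptr);
    }

    ~dma_buffer()
    {
        if (cpu_)
            fake_dma_free(dev_, size_, cpu_, handle_);
    }

    dma_buffer(const dma_buffer &) = delete;
    dma_buffer &operator=(const dma_buffer &) = delete;

    void *cpu_addr() const { return cpu_; }

private:
    device *dev_;            // extra state, duplicated per buffer
    std::size_t size_;       // extra state, duplicated per buffer
    dma_addr_t handle_ = 0;  // declared before cpu_, so it is set up first
    void *cpu_;
};

// Returns {buffers alive inside the scope, buffers alive after the scope}.
inline std::pair<int, int> demo_scope()
{
    device dev;
    int inside = 0;
    {
        dma_buffer a(&dev, 4096), b(&dev, 4096);
        inside = dev.live_buffers;
    }
    return {inside, dev.live_buffers};
}
```

Both buffers are released automatically at the end of the scope, but each
one is larger than the bare (pointer, handle) pair that the C version would
keep, which is the memory overhead described above.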

Another real bummer is the size and complexity of the RTTI and
unwinding support code. First of all, unwinding requires parsing and
executing DWARF code on the fly, and that just makes everything very
slow. Not to mention that it needs to be threading-aware and does a
lot of memory allocations. IIRC handling out-of-memory conditions was
extremely ugly (not that the kernel is perfect in this respect to
start with) and involved the use of "reserve buffers". I didn't like
it at all.

Also, for reference, I found a few other projects doing similar things
in the past:

https://github.com/veltzer/kcpp
http://www.drdobbs.com/cpp/c-exceptions-the-linux-kernel/229100146
https://pograph.wordpress.com/2009/04/05/porting-cpp-code-to-linux-kernel/
https://www.threatstack.com/blog/c-in-the-linux-kernel/

There are probably more; I seem to remember at least 1 commercial
product using C++ for their out-of-tree module (albeit without
RTTI/exceptions), but I can't find it right now.


Vegard


parallel make broken with ORC unwinder

2017-12-19 Thread Vegard Nossum
Hi,

When I run make -j64 on a v4.14 kernel or newer with ORC_UNWINDER=y
the kernel build breaks like this:

$ make -j64
  CHK include/config/kernel.release
  CHK include/generated/uapi/linux/version.h
  DESCEND  objtool
  CC  scripts/mod/empty.o
[...]
security/smack/smack_lsm.o: warning: objtool: elf_update: cannot write data to file
[...]
drivers/atm/uPD98402.o: warning: objtool: elf_update: cannot write data to file
  AR  arch/x86/entry/vdso/built-in.o
  CC  security/keys/permission.o
  CC  arch/x86/entry/vsyscall/vsyscall_gtod.o
  CC  security/keys/process_keys.o
  CC [M]  arch/x86/kvm/../../../virt/kvm/irqchip.o
Segmentation fault
make[2]: *** [drivers/atm/uPD98402.o] Error 139
make[2]: *** Waiting for unfinished jobs

With FRAME_POINTER_UNWINDER=y everything seems to work fine.

A bisect points to:

ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67 is the first bad commit
commit ee9f8fce99640811b2b8e79d0d1dbe8bab69ba67
Author: Josh Poimboeuf 
Date:   Mon Jul 24 18:36:57 2017 -0500

x86/unwind: Add the ORC unwinder

Grepping for smack_lsm.o in the build log gives the following output:

  gcc -Wp,-MD,security/smack/.smack_lsm.o.d  -nostdinc -isystem
/usr/lib/gcc/x86_64-linux-gnu/4.7/include -I./arch/x86/include
-I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi
-I./arch/x86/include/generated/uapi -I./include/uapi
-I./include/generated/uapi -include ./include/linux/kconfig.h
-D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs
-fno-strict-aliasing -fno-common -fshort-wchar
-Werror-implicit-function-declaration -Wno-format-security -std=gnu89
-fno-PIE -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64
-falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387
-mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time
-DCONFIG_X86_X32_ABI -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1
-DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1
-DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1
-DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare
-fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -O2
-Wno-maybe-uninitialized --param=allow-store-data-races=0
-DCC_HAVE_ASM_GOTO -Wframe-larger-than=1024 -fno-stack-protector
-Wno-unused-but-set-variable -fno-var-tracking-assignments -g
-gdwarf-4 -pg -mfentry -DCC_USING_FENTRY -Wdeclaration-after-statement
-Wno-pointer-sign -fno-strict-overflow -fconserve-stack
-Werror=implicit-int -Werror=strict-prototypes
-DKBUILD_BASENAME='"smack_lsm"'  -DKBUILD_MODNAME='"smack"' -c -o
security/smack/.tmp_smack_lsm.o security/smack/smack_lsm.c
   ./tools/objtool/objtool orc generate --no-fp  "security/smack/smack_lsm.o";
security/smack/smack_lsm.o: warning: objtool: elf_update: cannot write
data to file
  if [ "-pg" = "-pg" ]; then if [ security/smack/smack_lsm.o !=
"scripts/mod/empty.o" ]; then ./scripts/recordmcount
"security/smack/smack_lsm.o"; fi; fi;
  rm -f security/smack/smack.o; ar rcSTPD security/smack/smack.o
security/smack/smack_lsm.o security/smack/smack_access.o
security/smack/smackfs.o security/smack/smack_netfilter.o

This line looks suspicious:

   ./tools/objtool/objtool orc generate --no-fp  "security/smack/smack_lsm.o";

Is it really rewriting the file in place? That seems quite buggy to me.
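
For comparison, the usual way to avoid exposing a half-written file is to
write a temporary file and rename it over the target. The sketch below is
only an illustration of that pattern, not a description of what objtool
does; replace_file(), read_file() and the ".tmp" suffix are made up for the
example, and it assumes POSIX rename() semantics (atomic replacement within
one filesystem):

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Replace `path` with `contents` by writing a sibling temporary file and
// renaming it over the target, so readers see either the old or the new file.
inline bool replace_file(const std::string &path, const std::string &contents)
{
    assert(!path.empty());
    const std::string tmp = path + ".tmp";   // hypothetical temp-file naming
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out)
            return false;
        out << contents;
        out.flush();
        if (!out) {
            out.close();
            std::remove(tmp.c_str());
            return false;
        }
    }   // stream is closed here, before the rename
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}

// Read a whole file into a string (empty if it cannot be opened).
inline std::string read_file(const std::string &path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Note that two writers using the same temporary name would still collide, so
a real tool would need a unique name per writer.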


Vegard


Re: [PATCH] mm: kill kmemcheck again

2017-09-30 Thread Vegard Nossum
On 30 September 2017 at 11:48, Steven Rostedt <rost...@goodmis.org> wrote:
> On Wed, 27 Sep 2017 17:02:07 +0200
> Michal Hocko <mho...@kernel.org> wrote:
>
>> > Now that 2 years have passed, and all distros provide gcc that supports
>> > KASAN, kill kmemcheck again for the very same reasons.
>>
>> This is just too large to review manually. How have you generated the
>> patch?
>
> I agree. This needs to be taken out piece by piece, not in one go,
> where there could be unexpected fallout.

I have a patch from earlier this year that starts by removing the core
code and defining all the helpers/flags as no-ops so they can be
removed bit by bit at a later time. See the attachment. Pekka signed
off on it too.

I never actually submitted this because I was waiting for MSAN to be
merged in the kernel. It has been compile and run tested on x86_64.
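
As a rough illustration of what "defining all the helpers/flags as no-ops"
means for callers, here is a small sketch. The helper names are modeled on
include/linux/kmemcheck.h, but the definitions shown are simplified
assumptions, and struct packet and make_packet() are hypothetical callers:

```cpp
#include <cassert>
#include <cstring>

// Simplified no-op stubs; the real header may differ in detail.
#define kmemcheck_annotate_bitfield(ptr, name) do { } while (0)

static inline void kmemcheck_mark_initialized(void *address, unsigned int n)
{
    (void)address;
    (void)n;
}

// Hypothetical caller that still carries kmemcheck annotations.
struct packet {
    unsigned int flags;
    char payload[16];
};

inline packet make_packet(const char *text)
{
    assert(text != nullptr);
    packet p;
    std::memset(&p, 0, sizeof(p));
    kmemcheck_annotate_bitfield(&p, flags);      // expands to nothing
    std::strncpy(p.payload, text, sizeof(p.payload) - 1);
    kmemcheck_mark_initialized(p.payload, sizeof(p.payload));  // empty call
    return p;
}
```

The caller keeps compiling unchanged, so the annotations can be deleted
subsystem by subsystem afterwards.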


Vegard
From b06e2b3b833b02ecb0afb9dd92422e89c7fbb6d9 Mon Sep 17 00:00:00 2001
From: Vegard Nossum <vegard.nos...@oracle.com>
Date: Thu, 30 Mar 2017 13:26:15 +0200
Subject: [PATCH] kmemcheck: remove core (x86 + mm) code

With KASAN/KMSAN and compiler-based instrumentation, this code is way past
its expiry date. There is zero reason to be using kmemcheck at this point,
as KASAN/KMSAN will be much faster, support SMP, and catch any bug that
kmemcheck would have caught. See the additional rationale and past
discussion at <https://lkml.org/lkml/2015/3/11/435>.

I take the approach of first removing all the core x86 and mm code, leaving
behind only include/linux/kmemcheck.h which provides some helpers (now only
dummies as for the !KMEMCHECK case previously) used in e.g. networking code
for special annotations.

We can then send individual (smaller, more reviewable) patches for removing
kmemcheck annotations in other subsystems.

Once there are no users of the kmemcheck helpers, we can kill off the dummy
helpers as well in a final patch.

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Sasha Levin <alexander.le...@verizon.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
Signed-off-by: Pekka Enberg <penb...@kernel.org>
---
 Documentation/admin-guide/kernel-parameters.txt |   7 -
 Documentation/dev-tools/index.rst               |   1 -
 Documentation/dev-tools/kmemcheck.rst           | 733
 MAINTAINERS                                     |  10 -
 arch/arm/include/asm/dma-iommu.h                |   1 -
 arch/openrisc/include/asm/dma-mapping.h         |   1 -
 arch/x86/Kconfig                                |   3 +-
 arch/x86/Makefile                               |   5 -
 arch/x86/include/asm/dma-mapping.h              |   1 -
 arch/x86/include/asm/kmemcheck.h                |  42 --
 arch/x86/include/asm/pgtable_types.h            |   8 +-
 arch/x86/include/asm/string_32.h                |   9 -
 arch/x86/include/asm/string_64.h                |   8 -
 arch/x86/include/asm/xor.h                      |   5 +-
 arch/x86/kernel/cpu/intel.c                     |  15 -
 arch/x86/kernel/traps.c                         |   5 -
 arch/x86/mm/Makefile                            |   2 -
 arch/x86/mm/fault.c                             |   6 -
 arch/x86/mm/init.c                              |   5 +-
 arch/x86/mm/kmemcheck/Makefile                  |   1 -
 arch/x86/mm/kmemcheck/error.c                   | 227
 arch/x86/mm/kmemcheck/error.h                   |  15 -
 arch/x86/mm/kmemcheck/kmemcheck.c               | 658 -
 arch/x86/mm/kmemcheck/opcode.c                  | 106
 arch/x86/mm/kmemcheck/opcode.h                  |   9 -
 arch/x86/mm/kmemcheck/pte.c                     |  22 -
 arch/x86/mm/kmemcheck/pte.h                     |  10 -
 arch/x86/mm/kmemcheck/selftest.c                |  70 ---
 arch/x86/mm/kmemcheck/selftest.h                |   6 -
 arch/x86/mm/kmemcheck/shadow.c                  | 173 --
 arch/x86/mm/kmemcheck/shadow.h                  |  18 -
 include/linux/dma-mapping.h                     |   8 +-
 include/linux/gfp.h                             |   2 -
 include/linux/kmemcheck.h                       |  59 --
 include/linux/mm_types.h                        |   8 -
 include/linux/slab.h                            |  12 +-
 init/main.c                                     |   1 -
 kernel/sysctl.c                                 |  10 -
 lib/Kconfig.debug                               |   6 +-
 lib/Kconfig.kmemcheck                           |  94 ---
 mm/Kconfig.debug                                |   1 -
 mm/Makefile                                     |   2 -
 mm/kmemcheck.c                                  | 125
 mm/page_alloc.c                                 |  14 -
 mm/slab.c                                       |  14 -
 mm/slab.h                                       |   2 -
 mm/slub.c                                       |  25 +-
 47 files changed, 18 insertions(+), 2547 deletions(-)
 delete mode 100644 Documentation/dev-tools/kmemcheck.rst
 delete mode 100644 arch/x86/include/asm/kmemcheck.h
 [...]

Re: [bisected] Re: tty lockdep trace

2017-06-04 Thread Vegard Nossum

On 06/04/17 11:02, Mike Galbraith wrote:
> On Sun, 2017-06-04 at 10:32 +0200, Greg Kroah-Hartman wrote:
>> On Sat, Jun 03, 2017 at 08:33:52AM +0200, Mike Galbraith wrote:
>>> On Wed, 2017-05-31 at 13:21 -0400, Dave Jones wrote:
>>>> Just hit this during a trinity run.
>>>
>>> 925bb1ce47f429f69aad35876df7ecd8c53deb7e is the first bad commit
>>> commit 925bb1ce47f429f69aad35876df7ecd8c53deb7e
>>> Author: Vegard Nossum <vegard.nos...@oracle.com>
>>> Date:   Thu May 11 12:18:52 2017 +0200
>>>
>>>     tty: fix port buffer locking
>>
>> Now reverting this.  Oops, sorry, forgot to add Dave and your names to
>> the patch revert.  The list of people who reported this was really long,
>> many thanks for this.
>
> If flush_to_ldisc() is the problem, and taking atomic_write_lock in
> that path an acceptable solution, how about do that a bit differently
> instead.  Lockdep stopped grumbling, vbox seems happy.
>
> 925bb1ce47f4 (tty: fix port buffer locking) upset lockdep by holding buf->lock
> while acquiring tty->atomic_write_lock.  Move acquisition to flush_to_ldisc(),
> taking it prior to taking buf->lock.  Costs a reference, but appeases lockdep.
>
> Not-so-signed-off-by: /me
> ---
>  drivers/tty/tty_buffer.c |   10 ++++++++++
>  drivers/tty/tty_port.c   |    2 --
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> --- a/drivers/tty/tty_buffer.c
> +++ b/drivers/tty/tty_buffer.c
> @@ -465,7 +465,13 @@ static void flush_to_ldisc(struct work_s
>  {
>  	struct tty_port *port = container_of(work, struct tty_port, buf.work);
>  	struct tty_bufhead *buf = &port->buf;
> +	struct tty_struct *tty = READ_ONCE(port->itty);
> +	struct tty_ldisc *disc = NULL;
>
> +	if (tty)
> +		disc = tty_ldisc_ref(tty);
> +	if (disc)
> +		mutex_lock(&tty->atomic_write_lock);
>  	mutex_lock(&buf->lock);
>
>  	while (1) {
> @@ -501,6 +507,10 @@ static void flush_to_ldisc(struct work_s
>  	}
>
>  	mutex_unlock(&buf->lock);
> +	if (disc) {
> +		mutex_unlock(&tty->atomic_write_lock);
> +		tty_ldisc_deref(disc);
> +	}
>
>  }
>
> --- a/drivers/tty/tty_port.c
> +++ b/drivers/tty/tty_port.c
> @@ -34,9 +34,7 @@ static int tty_port_default_receive_buf(
>  	if (!disc)
>  		return 0;
>
> -	mutex_lock(&tty->atomic_write_lock);
>  	ret = tty_ldisc_receive_buf(disc, p, (char *)f, count);
> -	mutex_unlock(&tty->atomic_write_lock);
>
>  	tty_ldisc_deref(disc);
>


I don't know how you did it, but this passes my testing (reproducers for
both the original issue and the lockdep splat/hang). Although given the
track record I'm not sure how much that's worth :-/
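
For readers following along, the ordering problem can be shown with a toy
tracker. This is only a sketch: order_tracker and orders_consistent() are
invented for the illustration, the class detects nothing more than direct
A->B versus B->A inversions, and it is not how lockdep is implemented. The
lock names are just labels, and the example assumes that some other path
takes atomic_write_lock before buf->lock:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy lock-order tracker: remembers every "held -> newly taken" pair and
// flags an acquisition that contradicts a pair seen earlier.
class order_tracker {
public:
    // Returns false if taking `name` now inverts a previously seen order.
    bool acquire(const std::string &name)
    {
        bool ok = true;
        for (const auto &held : held_) {
            if (edges_.count(std::make_pair(name, held)))
                ok = false;   // earlier: name before held; now: the reverse
            edges_.emplace(held, name);
        }
        held_.push_back(name);
        return ok;
    }

    void release(const std::string &name)
    {
        for (auto it = held_.begin(); it != held_.end(); ++it) {
            if (*it == name) {
                held_.erase(it);
                return;
            }
        }
        assert(false && "released a lock that was not held");
    }

private:
    std::vector<std::string> held_;
    std::set<std::pair<std::string, std::string>> edges_;
};

// Path 1 takes atomic_write_lock, then buf->lock. Path 2 uses the same
// order unless `inverted` is set, in which case it takes buf->lock first.
inline bool orders_consistent(bool inverted)
{
    order_tracker t;
    bool ok = true;

    ok = t.acquire("atomic_write_lock") && ok;
    ok = t.acquire("buf->lock") && ok;
    t.release("buf->lock");
    t.release("atomic_write_lock");

    const char *first = inverted ? "buf->lock" : "atomic_write_lock";
    const char *second = inverted ? "atomic_write_lock" : "buf->lock";
    ok = t.acquire(first) && ok;
    ok = t.acquire(second) && ok;
    t.release(second);
    t.release(first);
    return ok;
}
```

With both paths using the same order the tracker stays quiet; the inverted
path is reported even though nothing actually deadlocks in a single thread,
which is also why lockdep can warn before a real hang is seen.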


Vegard


Re: [linux-next / tty] possible circular locking dependency detected

2017-06-03 Thread Vegard Nossum

On 06/03/17 11:34, Greg Kroah-Hartman wrote:
> On Mon, May 29, 2017 at 12:43:39PM +0200, Vegard Nossum wrote:
>> On 05/22/17 12:27, Vegard Nossum wrote:
>>> On 05/22/17 12:24, Greg Kroah-Hartman wrote:
>>>> On Mon, May 22, 2017 at 04:39:43PM +0900, Sergey Senozhatsky wrote:
>>>>> Hello,
>>>>>
>>>>> [ 1274.378287] ==
>>>>> [ 1274.378289] WARNING: possible circular locking dependency detected
>>>>> [ 1274.378290] 4.12.0-rc1-next-20170522-dbg-7-gc09b2ab28b74-dirty #1317 Not tainted
>>>>> [ 1274.378291] --
>>>>> [ 1274.378293] kworker/u8:5/111 is trying to acquire lock:
>>>>> [ 1274.378294]  (&buf->lock){+.+...}, at: [] tty_buffer_flush+0x34/0x88
>>>>> [ 1274.378300]
>>>>>  but task is already holding lock:
>>>>> [ 1274.378301]  (&o_tty->termios_rwsem/1){..}, at: [] isig+0x47/0xd2
>>>>> [ 1274.378307]
>>>>>  which lock already depends on the new lock.
>>>>
>>>> Any hint as to what you were doing when this happened?
>>>>
>>>> Does this also show up in 4.11?
>>>
>>> It's my patch "tty: fix port buffer locking" :-/
>>>
>>> At a glance, looks related to pty taking the lock on the other side in a
>>> different order. I'll have a closer look.
>>
>> I can reproduce the lockdep report locally on v4.12-rc3. Looking at it now.
>
> Any ideas?  Or should I just revert the original patch?

I think we must revert it for now, as I can easily reproduce not just
the lockdep warning but actual hangs. It seems I missed some code paths
when I worked on the original patch.

I'm working on a fix.


Vegard


Re: linux-next 20170519 and later - ^S/^Q borkage on ttys.

2017-05-31 Thread Vegard Nossum

On 05/31/17 05:48, valdis.kletni...@vt.edu wrote:

Pretty drastic.  Hit ^S to pause scrolling, and instantly hung terminal.
Seen on both urxvt and xterm under x11, and on virtual console screens.

This appears in dmesg:

  [ 1844.182058] INFO: task kworker/u8:3:129 blocked for more than 120 seconds.
  [ 1844.182073]   Tainted: G   OE   4.12.0-rc3-next-20170530 #489
  [ 1844.182078] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1844.182085] kworker/u8:3D11008   129  2 0x
  [ 1844.182109] Workqueue: events_unbound flush_to_ldisc
  [ 1844.182118] Call Trace:
  [ 1844.182136]  __schedule+0x43e/0x1020
  [ 1844.182147]  ? schedule_preempt_disabled+0x27/0xd0
  [ 1844.182156]  schedule+0x5d/0x1d0
  [ 1844.182164]  ? __mutex_lock+0x4c9/0x11c0
  [ 1844.182172]  schedule_preempt_disabled+0x27/0xd0
  [ 1844.182179]  __mutex_lock+0x4c9/0x11c0
  [ 1844.182191]  ? tty_port_default_receive_buf+0x58/0xc0
  [ 1844.182204]  ? ldsem_down_read_trylock+0xc3/0x130
  [ 1844.182215]  mutex_lock_nested+0x1b/0x20
  [ 1844.18]  ? mutex_lock_nested+0x1b/0x20
  [ 1844.182230]  tty_port_default_receive_buf+0x58/0xc0
  [ 1844.182240]  flush_to_ldisc+0xea/0x220
  [ 1844.182249]  ? trace_hardirqs_on_caller+0x16/0x290
  [ 1844.182262]  process_one_work+0x3d6/0xd00
  [ 1844.182269]  ? lock_acquire+0xae/0x2f0
  [ 1844.182284]  worker_thread+0x71/0x830
  [ 1844.182297]  kthread+0x1a9/0x270
  [ 1844.182304]  ? process_one_work+0xd00/0xd00
  [ 1844.182310]  ? kthread_create_on_node+0x70/0x70
  [ 1844.182321]  ret_from_fork+0x27/0x40
  [ 1844.182608] INFO: lockdep is turned off.

Bisects down to this commit, and things work when it's reverted.

Commit 925bb1ce47f4.
Author: Vegard Nossum <vegard.nos...@oracle.com>
Date:   Thu May 11 12:18:52 2017 +0200

 tty: fix port buffer locking

 tty_insert_flip_string_fixed_flag() is racy against itself when called
 from the ioctl(TCXONC, TCION/TCIOFF) path [1] and the flush_to_ldisc()
 workqueue path [2].


Gah, if it's that easy to trigger a deadlock (as opposed to just a
lockdep warning), we should revert the patch until I have a better fix.

^S doesn't seem to reproduce it here, though. Too bad your stack trace
doesn't show the process already holding the lock.


Vegard


Re: [linux-next / tty] possible circular locking dependency detected

2017-05-29 Thread Vegard Nossum

On 05/22/17 12:27, Vegard Nossum wrote:

On 05/22/17 12:24, Greg Kroah-Hartman wrote:

On Mon, May 22, 2017 at 04:39:43PM +0900, Sergey Senozhatsky wrote:

Hello,

[ 1274.378287] ======================================================
[ 1274.378289] WARNING: possible circular locking dependency detected
[ 1274.378290] 4.12.0-rc1-next-20170522-dbg-7-gc09b2ab28b74-dirty #1317 Not tainted
[ 1274.378291] ------------------------------------------------------
[ 1274.378293] kworker/u8:5/111 is trying to acquire lock:
[ 1274.378294]  (&buf->lock){+.+...}, at: [] tty_buffer_flush+0x34/0x88
[ 1274.378300]
but task is already holding lock:
[ 1274.378301]  (&o_tty->termios_rwsem/1){..}, at: [] isig+0x47/0xd2
[ 1274.378307]
which lock already depends on the new lock.




Any hint as to what you were doing when this happened?

Does this also show up in 4.11?


It's my patch "tty: fix port buffer locking" :-/

At a glance, looks related to pty taking the lock on the other side in a
different order. I'll have a closer look.


I can reproduce the lockdep report locally on v4.12-rc3. Looking at it now.


Vegard


[PATCH] kthread: fix boot hang (regression) on MIPS/OpenRISC

2017-05-29 Thread Vegard Nossum
This fixes a regression in commit 4d6501dce079 where I didn't notice
that MIPS and OpenRISC were reinitialising p->{set,clear}_child_tid to
NULL after our initialisation in copy_process().

We can simply get rid of the arch-specific initialisation here since it
is now always done in copy_process() before hitting copy_thread{,_tls}().

Review notes:

 - As far as I can tell, copy_process() is the only user of
   copy_thread_tls(), which is the only caller of copy_thread() for
   architectures that don't implement copy_thread_tls().

 - After this patch, there is no arch-specific code touching
   p->set_child_tid or p->clear_child_tid whatsoever.

 - It may look like MIPS/OpenRISC wanted to always have these fields be
   NULL, but that's not true, as copy_process() would unconditionally
   set them again _after_ calling copy_thread_tls() before commit
   4d6501dce079.

Fixes: 4d6501dce079c1eb6bf0b1d8f528a5e81770109e ("kthread: Fix use-after-free if kthread fork fails")
Reported-by: Guenter Roeck <li...@roeck-us.net>
Tested-by: Guenter Roeck <li...@roeck-us.net> # MIPS only
Cc: Ralf Baechle <r...@linux-mips.org>
Cc: linux-m...@linux-mips.org
Cc: Jonas Bonn <jo...@southpole.se>
Cc: Stefan Kristiansson <stefan.kristians...@saunalahti.fi>
Cc: Stafford Horne <sho...@gmail.com>
Cc: openr...@lists.librecores.org
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Jamie Iles <jamie.i...@oracle.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
Not sure who this should go through, the last patch went through tglx/the
core-urgent-for-linus tree, but it does touch arch code + fix a mainline
boot hang regression on at least MIPS (Guenter said OpenRISC didn't seem
affected in his boot tests, but the code looks wrong in any case). Maybe
we could get acks/reviews by MIPS and OpenRISC maintainers?
---
 arch/mips/kernel/process.c | 1 -
 arch/openrisc/kernel/process.c | 2 --
 2 files changed, 3 deletions(-)

diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 918d4c73e951..5351e1f3950d 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,6 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long usp,
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
 	unsigned long childksp;
-	p->set_child_tid = p->clear_child_tid = NULL;
 
 	childksp = (unsigned long)task_stack_page(p) + THREAD_SIZE - 32;
 
diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
index f8da545854f9..106859ae27ff 100644
--- a/arch/openrisc/kernel/process.c
+++ b/arch/openrisc/kernel/process.c
@@ -167,8 +167,6 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
 
 	top_of_kernel_stack = sp;
 
-	p->set_child_tid = p->clear_child_tid = NULL;
-
 	/* Locate userspace context on stack... */
 	sp -= STACK_FRAME_OVERHEAD; /* redzone */
 	sp -= sizeof(struct pt_regs);
-- 
2.12.0.rc0



Re: mips qemu test failures in -next due to "kthread: Fix use-after-free if kthread fork fails"

2017-05-28 Thread Vegard Nossum

On 05/28/17 13:45, Vegard Nossum wrote:

On 05/27/17 19:56, Guenter Roeck wrote:

Hi,

my qemu tests of mips images are failing in -next. Symptom is a hang during
boot; see http://kerneltests.org/builders/qemu-mips-next for some examples.

I bisected the problem in next-20170526. It points to commit 4d6501dce079c
("kthread: Fix use-after-free if kthread fork fails"). Reverting that patch
fixes the problem.

Bisect log is attached.


Hi,

Thanks for the report and sorry for the breakage :-/

I can't immediately spot what's going wrong, but I am able to reproduce
it on mips so I will try to debug.

Are you sure it's this commit, though? I checked out linus/master and
I get a boot hang even after reverting it.


My mistake; I ran into a different bug which made me think it was
hanging when it wasn't.

However, I think I found the problem; does this patch fix it for you too?

diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 918d4c73e951..5351e1f3950d 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -120,7 +120,6 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long usp,
 	struct thread_info *ti = task_thread_info(p);
 	struct pt_regs *childregs, *regs = current_pt_regs();
 	unsigned long childksp;
-	p->set_child_tid = p->clear_child_tid = NULL;
 
 	childksp = (unsigned long)task_stack_page(p) + THREAD_SIZE - 32;

The problem is that when we moved the p->{set,clear}_child_tid
assignments inside copy_process(), the above assignments would clear
them out. The assignments only exist on mips and openrisc (which would
need the same patch), which explains why I didn't see it in my x86
testing. I think the patch above should be safe given that we're now
always setting these fields in copy_process() at an appropriate moment.
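
To make the ordering problem concrete, here is a minimal user-space sketch of it. It is not kernel code: `struct task`, `generic_copy()` and the two `arch_hook_*()` functions are hypothetical stand-ins for copy_process() and copy_thread{,_tls}(), and the only point is that a hook which runs after the generic initialisation and NULLs the fields silently undoes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the two task fields discussed above. */
struct task {
	int *set_tid;
	int *clear_tid;
};

typedef void (*arch_hook_t)(struct task *);

/* An arch hook that leaves the fields alone (the patched behaviour). */
static void arch_hook_clean(struct task *t)
{
	(void)t;
}

/* An arch hook that re-clears the fields (the old MIPS/OpenRISC line). */
static void arch_hook_clobber(struct task *t)
{
	t->set_tid = t->clear_tid = NULL;
}

/*
 * Generic part: initialise the fields early, then call the arch hook.
 * Whatever the hook writes is what the caller ends up with.
 */
static void generic_copy(struct task *t, int *tidptr, arch_hook_t hook)
{
	assert(t && hook);
	t->set_tid = tidptr;
	t->clear_tid = tidptr;
	hook(t);
}
```

With `arch_hook_clobber` the caller gets NULL back in both fields even though it asked for `tidptr`; with `arch_hook_clean` the generic initialisation survives.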

Looks like those assignments came from commit 3c37026d43c47 ("NPTL,
round one."); Ralf?

Oleg?


Vegard


Re: mips qemu test failures in -next due to "kthread: Fix use-after-free if kthread fork fails"

2017-05-28 Thread Vegard Nossum

On 05/27/17 19:56, Guenter Roeck wrote:

Hi,

my qemu tests of mips images are failing in -next. Symptom is a hang during
boot; see http://kerneltests.org/builders/qemu-mips-next for some examples.

I bisected the problem in next-20170526. It points to commit 4d6501dce079c
("kthread: Fix use-after-free if kthread fork fails"). Reverting that patch
fixes the problem.

Bisect log is attached.


Hi,

Thanks for the report and sorry for the breakage :-/

I can't immediately spot what's going wrong, but I am able to reproduce
it on mips so I will try to debug.

Are you sure it's this commit, though? I checked out linus/master and
I get a boot hang even after reverting it.


Vegard


[tip:core/urgent] kthread: Fix use-after-free if kthread fork fails

2017-05-22 Thread tip-bot for Vegard Nossum
Commit-ID:  4d6501dce079c1eb6bf0b1d8f528a5e81770109e
Gitweb: http://git.kernel.org/tip/4d6501dce079c1eb6bf0b1d8f528a5e81770109e
Author: Vegard Nossum <vegard.nos...@oracle.com>
AuthorDate: Tue, 9 May 2017 09:39:59 +0200
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Mon, 22 May 2017 22:21:16 +0200

kthread: Fix use-after-free if kthread fork fails

If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

kthread()
 - worker_thread()
    - process_one_work()
    |  - call_usermodehelper_exec_work()
    |     - kernel_thread()
    |        - _do_fork()
    |           - copy_process()
    |              - dup_task_struct()
    |                 - arch_dup_task_struct()
    |                    - tsk->set_child_tid = current->set_child_tid // implied
    |              - ...
    |              - goto bad_fork_*
    |              - ...
    |              - free_task(tsk)
    |                 - free_kthread_struct(tsk)
    |                    - kfree(tsk->set_child_tid)
    - ...
    - schedule()
       - __schedule()
          - wq_worker_sleeping()
             - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Debugged-by: Jamie Iles <jamie.i...@oracle.com>
Fixes: 1da5c46fa965 ("kthread: Make struct kthread kmalloc'ed")
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
Acked-by: Oleg Nesterov <o...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Jamie Iles <jamie.i...@oracle.com>
Cc: sta...@vger.kernel.org
Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nos...@oracle.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 kernel/fork.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d681f8f..b7cdea1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1553,6 +1553,18 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!p)
 		goto fork_out;
 
+	/*
+	 * This _must_ happen before we call free_task(), i.e. before we jump
+	 * to any of the bad_fork_* labels. This is to avoid freeing
+	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
+	 * kernel threads (PF_KTHREAD).
+	 */
+	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+	/*
+	 * Clear TID on mm_release()?
+	 */
+	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
@@ -1716,11 +1728,6 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
-	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
-	/*
-	 * Clear TID on mm_release()?
-	 */
-	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
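
The pattern behind this fix can be shown outside the kernel. The following is a user-space sketch only, under the assumption that a byte-wise copy of a structure is followed by an error path that frees an owned pointer; `struct task`, `task_dup()` and `task_free()` are hypothetical names, not kernel APIs. The child's pointer is reset right after the copy, so bailing out early cannot free memory that still belongs to the parent.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a task with one privately owned pointer. */
struct task {
	int *data;	/* owned by exactly one task */
	int id;
};

/* Error-path cleanup: frees whatever t->data points at. */
static void task_free(struct task *t)
{
	free(t->data);
	free(t);
}

/*
 * Duplicate parent. If fail_early is set, bail out through the error
 * path the way a failed fork would. The child's data pointer is reset
 * immediately after the byte-wise copy, so the error path can never
 * free memory that still belongs to the parent.
 */
static struct task *task_dup(const struct task *parent, int fail_early)
{
	struct task *child;

	assert(parent && parent->data);
	child = malloc(sizeof(*child));
	if (!child)
		return NULL;
	memcpy(child, parent, sizeof(*child));	/* child->data aliases parent->data */
	child->data = NULL;			/* must precede any task_free(child) */

	if (fail_early) {
		task_free(child);		/* free(NULL) is a no-op */
		return NULL;
	}

	child->data = malloc(sizeof(*child->data));
	if (!child->data) {
		task_free(child);
		return NULL;
	}
	*child->data = *parent->data;
	child->id = parent->id + 1;
	return child;
}
```

If the `child->data = NULL` line were moved below the `fail_early` check, the early exit would free the parent's allocation and any later access through `parent->data` would be a use-after-free, which is the shape of the bug described in the commit message.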


Re: [linux-next / tty] possible circular locking dependency detected

2017-05-22 Thread Vegard Nossum

On 05/22/17 12:24, Greg Kroah-Hartman wrote:

On Mon, May 22, 2017 at 04:39:43PM +0900, Sergey Senozhatsky wrote:

Hello,

[ 1274.378287] ======================================================
[ 1274.378289] WARNING: possible circular locking dependency detected
[ 1274.378290] 4.12.0-rc1-next-20170522-dbg-7-gc09b2ab28b74-dirty #1317 Not tainted
[ 1274.378291] ------------------------------------------------------
[ 1274.378293] kworker/u8:5/111 is trying to acquire lock:
[ 1274.378294]  (&buf->lock){+.+...}, at: [] tty_buffer_flush+0x34/0x88
[ 1274.378300]
but task is already holding lock:
[ 1274.378301]  (&o_tty->termios_rwsem/1){..}, at: [] isig+0x47/0xd2
[ 1274.378307]
which lock already depends on the new lock.




Any hint as to what you were doing when this happened?

Does this also show up in 4.11?


It's my patch "tty: fix port buffer locking" :-/

At a glance, looks related to pty taking the lock on the other side in a
different order. I'll have a closer look.
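
For readers who have not run into this class of lockdep report: a circular dependency of this kind arises when two code paths take the same pair of locks in opposite orders. The sketch below is a user-space illustration only, using POSIX mutexes rather than the tty locks. `struct side`, `lock_both()`, `unlock_both()` and `transfer()` are hypothetical names, and ordering by address is just one common convention for avoiding the problem, not what the tty layer does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for one end of a pty pair. */
struct side {
	pthread_mutex_t lock;
	int bytes;
};

/*
 * Take both locks in a fixed global order (lowest address first), so
 * that two callers passing the same pair in opposite argument order
 * can never end up each holding one lock and waiting for the other.
 */
static void lock_both(struct side *a, struct side *b)
{
	assert(a != b);
	if ((uintptr_t)a > (uintptr_t)b) {
		struct side *tmp = a;

		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}

static void unlock_both(struct side *a, struct side *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}

/* Move n bytes from one side to the other while holding both locks. */
static void transfer(struct side *from, struct side *to, int n)
{
	lock_both(from, to);
	from->bytes -= n;
	to->bytes += n;
	unlock_both(from, to);
}
```

Calling `transfer(&m, &s, ...)` from one thread and `transfer(&s, &m, ...)` from another is safe here because both end up taking the two mutexes in the same order.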


Vegard


[PATCH] tty: fix port buffer locking

2017-05-11 Thread Vegard Nossum
tty_insert_flip_string_fixed_flag() is racy against itself when called
from the ioctl(TCXONC, TCION/TCIOFF) path [1] and the flush_to_ldisc()
workqueue path [2].

The problem is that port->buf.tail->used is modified without consistent
locking; the ioctl path takes tty->atomic_write_lock, whereas the workqueue
path takes ldata->output_lock.

We cannot simply take ldata->output_lock, since that is specific to the
N_TTY line discipline.

It might seem natural to try to take port->buf.lock inside
tty_insert_flip_string_fixed_flag() and friends (where port->buf is
actually used/modified), but this creates problems for flush_to_ldisc()
which takes it before grabbing tty->ldisc_sem, o_tty->termios_rwsem,
and ldata->output_lock.

Therefore, the simplest solution for now seems to be to take
tty->atomic_write_lock inside tty_port_default_receive_buf(). This lock
is also used in the write path [3] with a consistent ordering.

[1]: Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 tty_send_xchar                     // down_read(&o_tty->termios_rwsem)
                                    // mutex_lock(&tty->atomic_write_lock)
 n_tty_ioctl_helper
 n_tty_ioctl
 tty_ioctl                          // down_read(&tty->ldisc_sem)
 do_vfs_ioctl
 SyS_ioctl

[2]: Workqueue: events_unbound flush_to_ldisc
Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 tty_put_char
 __process_echoes
 commit_echoes                      // mutex_lock(&ldata->output_lock)
 n_tty_receive_buf_common
 n_tty_receive_buf2
 tty_ldisc_receive_buf              // down_read(&o_tty->termios_rwsem)
 tty_port_default_receive_buf       // down_read(&tty->ldisc_sem)
 flush_to_ldisc                     // mutex_lock(&port->buf.lock)
 process_one_work

[3]: Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 n_tty_write                        // mutex_lock(&ldata->output_lock)
                                    // down_read(&tty->termios_rwsem)
 do_tty_write (inline)              // mutex_lock(&tty->atomic_write_lock)
 tty_write                          // down_read(&tty->ldisc_sem)
 __vfs_write
 vfs_write
 SyS_write

The bug can result in about a dozen different crashes depending on what
exactly gets corrupted when port->buf.tail->used points outside the
buffer.

The patch passes my LOCKDEP/PROVE_LOCKING testing but more testing is
always welcome.

Found using syzkaller.

Cc: <sta...@vger.kernel.org>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 drivers/tty/tty_port.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/tty/tty_port.c b/drivers/tty/tty_port.c
index 1d21a9c1d33e..ef4dd596b864 100644
--- a/drivers/tty/tty_port.c
+++ b/drivers/tty/tty_port.c
@@ -34,7 +34,9 @@ static int tty_port_default_receive_buf(struct tty_port *port,
 	if (!disc)
 		return 0;
 
+	mutex_lock(&tty->atomic_write_lock);
 	ret = tty_ldisc_receive_buf(disc, p, (char *)f, count);
+	mutex_unlock(&tty->atomic_write_lock);
 
 	tty_ldisc_deref(disc);
 
-- 
2.12.0.rc0
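
As a rough user-space analogue of the idea in this patch (not the kernel code itself), the sketch below has two writers appending to one buffer and updating a shared `used` count. Both go through the same mutex, so the count cannot run past the end of the buffer. `struct flip_buf`, `flip_insert()`, `writer()` and `run_two_writers()` are hypothetical names; if the two writers each took a different lock, as the ioctl and workqueue paths did, the updates to `used` could interleave.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define BUF_SIZE 256

/* Hypothetical stand-in for the tail buffer; not the kernel's struct. */
struct flip_buf {
	pthread_mutex_t write_lock;	/* the one lock every writer takes */
	size_t used;
	char data[BUF_SIZE];
};

/* Append up to n bytes; returns the number actually copied. */
static size_t flip_insert(struct flip_buf *b, const char *src, size_t n)
{
	size_t i, room;

	pthread_mutex_lock(&b->write_lock);
	room = BUF_SIZE - b->used;
	if (n > room)
		n = room;
	for (i = 0; i < n; i++)
		b->data[b->used + i] = src[i];
	b->used += n;
	assert(b->used <= BUF_SIZE);
	pthread_mutex_unlock(&b->write_lock);
	return n;
}

struct writer_arg {
	struct flip_buf *buf;
	size_t rounds;
};

/* Thread body: insert one byte per round, like two racing write paths. */
static void *writer(void *p)
{
	struct writer_arg *w = p;
	size_t i;

	for (i = 0; i < w->rounds; i++)
		flip_insert(w->buf, "x", 1);
	return NULL;
}

/* Run two writers concurrently and return the final value of used. */
static size_t run_two_writers(size_t rounds)
{
	struct flip_buf b = { .write_lock = PTHREAD_MUTEX_INITIALIZER, .used = 0 };
	struct writer_arg w = { &b, rounds };
	pthread_t t1, t2;

	pthread_create(&t1, NULL, writer, &w);
	pthread_create(&t2, NULL, writer, &w);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return b.used;
}
```

With a single lock the final count is exactly the number of bytes accepted, and it saturates at `BUF_SIZE` instead of pointing outside the buffer.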



[PATCH] tty: fix port buffer locking

2017-05-11 Thread Vegard Nossum
tty_insert_flip_string_fixed_flag() is racy against itself when called
from the ioctl(TCXONC, TCION/TCIOFF) path [1] and the flush_to_ldisc()
workqueue path [2].

The problem is that port->buf.tail->used is modified without consistent
locking; the ioctl path takes tty->atomic_write_lock, whereas the workqueue
path takes ldata->output_lock.

We cannot simply take ldata->output_lock, since that is specific to the
N_TTY line discipline.

It might seem natural to try to take port->buf.lock inside
tty_insert_flip_string_fixed_flag() and friends (where port->buf is
actually used/modified), but this creates problems for flush_to_ldisc()
which takes it before grabbing tty->ldisc_sem, o_tty->termios_rwsem,
and ldata->output_lock.

Therefore, the simplest solution for now seems to be to take
tty->atomic_write_lock inside tty_port_default_receive_buf(). This lock
is also used in the write path [3] with a consistent ordering.

[1]: Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 tty_send_xchar // down_read(_tty->termios_rwsem)
// mutex_lock(>atomic_write_lock)
 n_tty_ioctl_helper
 n_tty_ioctl
 tty_ioctl  // down_read(>ldisc_sem)
 do_vfs_ioctl
 SyS_ioctl

[2]: Workqueue: events_unbound flush_to_ldisc
Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 tty_put_char
 __process_echoes
 commit_echoes  // mutex_lock(>output_lock)
 n_tty_receive_buf_common
 n_tty_receive_buf2
 tty_ldisc_receive_buf  // down_read(_tty->termios_rwsem)
 tty_port_default_receive_buf   // down_read(>ldisc_sem)
 flush_to_ldisc // mutex_lock(>buf.lock)
 process_one_work

[3]: Call Trace:
 tty_insert_flip_string_fixed_flag
 pty_write
 n_tty_write// mutex_lock(>output_lock)
// down_read(>termios_rwsem)
 do_tty_write (inline)  // mutex_lock(>atomic_write_lock)
 tty_write  // down_read(>ldisc_sem)
 __vfs_write
 vfs_write
 SyS_write

The bug can result in about a dozen different crashes depending on what
exactly gets corrupted when port->buf.tail->used points outside the
buffer.

The patch passes my LOCKDEP/PROVE_LOCKING testing but more testing is
always welcome.

Found using syzkaller.

Cc: 
Signed-off-by: Vegard Nossum 
---
 drivers/tty/tty_port.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/tty/tty_port.c b/drivers/tty/tty_port.c
index 1d21a9c1d33e..ef4dd596b864 100644
--- a/drivers/tty/tty_port.c
+++ b/drivers/tty/tty_port.c
@@ -34,7 +34,9 @@ static int tty_port_default_receive_buf(struct tty_port *port,
 	if (!disc)
 		return 0;
 
+	mutex_lock(&tty->atomic_write_lock);
 	ret = tty_ldisc_receive_buf(disc, p, (char *)f, count);
+	mutex_unlock(&tty->atomic_write_lock);
 
 	tty_ldisc_deref(disc);
 
-- 
2.12.0.rc0
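
The locking rule argued for in the patch above can be tried out in
userspace. What follows is only a minimal sketch, assuming pthread mutexes
as a stand-in for kernel mutexes; struct port_buf, insert_locked() and
run_two_writers() are hypothetical names for illustration, not kernel
interfaces. Both writers take the same outer lock before touching the
buffer, so 'used' can never be advanced by two paths at once or run past
the end of the buffer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define BUF_SIZE   4096
#define PER_THREAD 2048

/* Hypothetical stand-in for the flip buffer: 'used' must stay <= BUF_SIZE. */
struct port_buf {
	pthread_mutex_t atomic_write_lock; /* plays the role of tty->atomic_write_lock */
	size_t used;
	char data[BUF_SIZE];
};

/* Append one byte; every caller takes the same lock first. */
static int insert_locked(struct port_buf *b, char c)
{
	int ok = 0;

	pthread_mutex_lock(&b->atomic_write_lock);
	if (b->used < BUF_SIZE) {
		b->data[b->used++] = c;
		ok = 1;
	}
	assert(b->used <= BUF_SIZE); /* invariant the lock protects */
	pthread_mutex_unlock(&b->atomic_write_lock);
	return ok;
}

static void *writer(void *arg)
{
	struct port_buf *b = arg;
	int i;

	for (i = 0; i < PER_THREAD; i++)
		insert_locked(b, 'x');
	return NULL;
}

/* Run two concurrent writers (think: write path and receive path). */
static size_t run_two_writers(void)
{
	static struct port_buf b;
	pthread_t t1, t2;

	b.used = 0;
	pthread_mutex_init(&b.atomic_write_lock, NULL);
	pthread_create(&t1, NULL, writer, &b);
	pthread_create(&t2, NULL, writer, &b);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	pthread_mutex_destroy(&b.atomic_write_lock);
	return b.used;
}
```

With the lock in place the two writers always account for every byte they
insert; dropping the lock from one of the paths is what lets 'used' be
corrupted.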



[PATCH] tracing: use %pF in trace_dump_stack()

2017-05-09 Thread Vegard Nossum
When using trace_dump_stack() you currently just get a list of function
names.

It can be very useful to know exactly where a call came from, especially
if there are multiple calls from one function to another.

By switching trace_dump_stack() to use %pF we get the function name and
the offset, which can also be further processed to give exact line number
information, like this:

<...>-10873 3270529us : 
 => pty_write+0x45/0x50
 => n_tty_write+0x358/0x470
 => tty_write+0x189/0x2f0
 => __vfs_write+0x23/0x120
 => vfs_write+0xb3/0x1b0
 => SyS_write+0x44/0xa0
 => entry_SYSCALL_64_fastpath+0x18/0xad

$ scripts/faddr2line vmlinux tty_write+0x189/0x2f0
tty_write+0x189/0x2f0:
do_tty_write at drivers/tty/tty_io.c:1174
 (inlined by) tty_write at drivers/tty/tty_io.c:1257

Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 kernel/trace/trace_output.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 02a4aeb22c47..879909efed33 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1073,9 +1073,7 @@ static enum print_line_t trace_stack_print(struct trace_iterator *iter,
 		if (trace_seq_has_overflowed(s))
 			break;
 
-		trace_seq_puts(s, " => ");
-		seq_print_ip_sym(s, *p, flags);
-		trace_seq_putc(s, '\n');
+		trace_seq_printf(s, " => %pF\n", (void *) *p);
 	}
 
 	return trace_handle_return(s);
-- 
2.12.0.rc0
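
To make the "function name and the offset" format from the patch above
concrete, here is a small userspace sketch of how an address can be
rendered as symbol+0xoffset/0xsize. It is an illustration only: struct sym
and format_sym_offset() are hypothetical, and the kernel resolves symbols
through kallsyms rather than a table like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical symbol table entry: start address, size, name. */
struct sym {
	unsigned long start;
	unsigned long size;
	const char *name;
};

/*
 * Format 'addr' as "name+0xoff/0xsize" into 'out'.
 * Returns 0 on success, -1 if no symbol in the table covers the address.
 */
static int format_sym_offset(char *out, size_t len,
			     const struct sym *tab, size_t n,
			     unsigned long addr)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (addr >= tab[i].start && addr - tab[i].start < tab[i].size) {
			snprintf(out, len, "%s+0x%lx/0x%lx", tab[i].name,
				 addr - tab[i].start, tab[i].size);
			return 0;
		}
	}
	return -1;
}
```

The offset is what distinguishes several call sites inside one function,
and it is the input a tool like scripts/faddr2line needs to map the entry
back to a source line.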



Re: [PATCH 0/4] S390: Fine-tuning for six function implementations

2017-05-09 Thread Vegard Nossum

On 05/07/17 19:12, SF Markus Elfring wrote:

From: Markus Elfring 
Date: Sun, 7 May 2017 19:00:09 +0200

A few update suggestions were taken into account
from static source code analysis.

Markus Elfring (4):
  Combine two function calls into one in show_cacheinfo()
  Use seq_putc() in show_cpu_summary()
  Replace six seq_printf() calls by seq_puts()
  Combine two function calls into one at four places

 arch/s390/kernel/cache.c |  4 ++--
 arch/s390/kernel/processor.c |  2 +-
 arch/s390/kernel/sysinfo.c   | 25 +++--
 3 files changed, 14 insertions(+), 17 deletions(-)



I'm sorry, I wouldn't normally respond to this, but I was put on the Cc
after all so I'll give my feedback.

I think these patches are a waste of time and resources.

It would be different if your patches fixed actual bugs. This is just
mindless code transformations that MAY in the best case save a few bytes
of code here and there (I don't know; you didn't say).

But the potential gains from these incredibly numerous and tiny patches
that don't fix anything are so small, it's a waste of time, bandwidth,
and mental capacity for you and for everybody involved.

I just searched my inbox for patches from you and you sent literally
_hundreds_ over the past few days, all doing this crazy printf/puts/putc
transformation.

Another bit of searching and I see that I'm not the first one giving you
this response:

https://lkml.org/lkml/2017/1/23/383 - Jens Axboe
https://lkml.org/lkml/2017/1/23/262 - Johannes Thumshirn
https://lkml.org/lkml/2017/1/12/513 - Cyrille Pitchen
https://lkml.org/lkml/2016/10/24/491 - Theodore Ts'o
https://lkml.org/lkml/2016/10/7/148 - Dan Carpenter
https://lkml.org/lkml/2016/9/14/58 - Christian Borntraeger

...and I'm sure there are many more.


Vegard


[PATCH v2] kthread: fix use-after-free if kthread fork fails

2017-05-09 Thread Vegard Nossum
If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

kthread()
 - worker_thread()
    - process_one_work()
    |  - call_usermodehelper_exec_work()
    |     - kernel_thread()
    |        - _do_fork()
    |           - copy_process()
    |              - dup_task_struct()
    |                 - arch_dup_task_struct()
    |                    - tsk->set_child_tid = current->set_child_tid // implied
    |              - ...
    |              - goto bad_fork_*
    |              - ...
    |              - free_task(tsk)
    |                 - free_kthread_struct(tsk)
    |                    - kfree(tsk->set_child_tid)
    - ...
    - schedule()
       - __schedule()
          - wq_worker_sleeping()
             - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Fixes: 1da5c46fa965ff90f5ffc080b6ab3fae5e227bc3 ("kthread: Make struct kthread kmalloc'ed")
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Andy Lutomirski <l...@kernel.org>
Debugged-by: Jamie Iles <jamie.i...@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 kernel/fork.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index dd5a371c392a..03b2f9606a54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1554,6 +1554,18 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!p)
 		goto fork_out;
 
+	/*
+	 * This _must_ happen before we call free_task(), i.e. before we jump
+	 * to any of the bad_fork_* labels. This is to avoid freeing
+	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
+	 * kernel threads (PF_KTHREAD).
+	 */
+	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+	/*
+	 * Clear TID on mm_release()?
+	 */
+	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
@@ -1720,11 +1732,6 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
-	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
-	/*
-	 * Clear TID on mm_release()?
-	 */
-	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
-- 
2.12.0.rc0
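
The failure mode described in the patch above boils down to a copied
struct carrying a pointer it does not own into an error path that frees
it. The following is a minimal userspace sketch of that pattern and of the
ordering fix, not kernel code: struct task, dup_task(), free_task() and
copy_task() are hypothetical miniatures named after their kernel
counterparts purely for readability.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of task_struct: 'data' has exactly one owner. */
struct task {
	void *data; /* plays the role of ->set_child_tid on a kthread */
};

/* Like dup_task_struct(): a plain copy, so the child inherits the pointer. */
static struct task *dup_task(const struct task *parent)
{
	struct task *child = malloc(sizeof(*child));

	if (child)
		memcpy(child, parent, sizeof(*child));
	return child;
}

/* Like free_task(): frees whatever 'data' points to, then the task. */
static void free_task(struct task *t)
{
	free(t->data);
	free(t);
}

/*
 * Like copy_process() with the fix applied: the inherited pointer is
 * reset before any error path can reach free_task(). 'fail' simulates
 * a failure somewhere after the duplication step.
 */
static struct task *copy_task(const struct task *parent, int fail)
{
	struct task *child = dup_task(parent);

	if (!child)
		return NULL;

	child->data = NULL; /* must come before every error exit */

	if (fail) {
		free_task(child); /* frees NULL; parent->data is untouched */
		return NULL;
	}
	return child;
}
```

Without the reset, the failing branch would free the parent's data and
leave the parent with a dangling pointer, which is the use-after-free
shown in the call tree.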



Re: [PATCH] kthread: fix use-after-free if kthread fork fails

2017-05-05 Thread Vegard Nossum

On 05/05/17 18:44, Oleg Nesterov wrote:

> On 05/05, Vegard Nossum wrote:
>>
>> If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
>> fails in copy_process() between calling dup_task_struct() and setting
>> p->set_child_tid, then the value of p->set_child_tid will be inherited
>> from the parent and get prematurely freed by free_kthread_struct().
>
> Aaah... thanks!
>
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -518,6 +518,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>>  	atomic_set(&tsk->stack_refcount, 1);
>>  #endif
>>
>> +	/*
>> +	 * Forking kthreads (e.g. usermodehelper) should not inherit this
>> +	 * field since it's a pointer to a 'struct kthread' which is not
>> +	 * reference counted.
>> +	 */
>> +	tsk->set_child_tid = NULL;
>> +
>
> Can't we just move both
>
> 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> 	/*
> 	 * Clear TID on mm_release()?
> 	 */
> 	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
>
> lines here?


clone_flags is not available in dup_task_struct(), but we could move
those lines higher in copy_process(). The reason we didn't do it was
that we thought it was a little fragile/unobvious that this has to
happen before free_task() is called and that it was safer to clear it in
dup_task_struct() (which also contains zeroing of other fields).

The newly attached patch has been tested and seems to work, if you
prefer it.


Vegard
diff --git a/kernel/fork.c b/kernel/fork.c
index fbdc29365b83..c52e22fdf7ca 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1561,6 +1561,18 @@ static __latent_entropy struct task_struct *copy_process(
 	if (!p)
 		goto fork_out;
 
+	/*
+	 * This _must_ happen before we call free_task(), i.e. before we jump
+	 * to any of the bad_fork_* labels. This is to avoid freeing
+	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
+	 * kernel threads (PF_KTHREAD).
+	 */
+	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
+	/*
+	 * Clear TID on mm_release()?
+	 */
+	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
+
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
@@ -1727,11 +1739,6 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
-	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
-	/*
-	 * Clear TID on mm_release()?
-	 */
-	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif


[PATCH] kthread: fix use-after-free if kthread fork fails

2017-05-05 Thread Vegard Nossum
If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

kthread()
 - worker_thread()
    - process_one_work()
    |  - call_usermodehelper_exec_work()
    |     - kernel_thread()
    |        - _do_fork()
    |           - copy_process()
    |              - dup_task_struct()
    |                 - arch_dup_task_struct()
    |                    - tsk->set_child_tid = current->set_child_tid // implied
    |              - ...
    |              - goto bad_fork_*
    |              - ...
    |              - free_task(tsk)
    |                 - free_kthread_struct(tsk)
    |                    - kfree(tsk->set_child_tid)
    - ...
    - schedule()
       - __schedule()
          - wq_worker_sleeping()
             - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Fixes: 1da5c46fa965ff90f5ffc080b6ab3fae5e227bc3 ("kthread: Make struct kthread kmalloc'ed")
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Andy Lutomirski <l...@kernel.org>
Debugged-by: Jamie Iles <jamie.i...@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 kernel/fork.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index dd5a371c392a..fbdc29365b83 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -518,6 +518,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	atomic_set(&tsk->stack_refcount, 1);
 #endif
 
+	/*
+	 * Forking kthreads (e.g. usermodehelper) should not inherit this
+	 * field since it's a pointer to a 'struct kthread' which is not
+	 * reference counted.
+	 */
+	tsk->set_child_tid = NULL;
+
 	if (err)
 		goto free_stack;
 
-- 
2.12.0.rc0



Re: [GIT PULL] TTY/Serial driver fixes for 4.11-rc4

2017-05-02 Thread Vegard Nossum
On 2 May 2017 at 18:35, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Fri, Apr 14, 2017 at 2:30 PM, Greg KH <gre...@linuxfoundation.org> wrote:
>> On Fri, Apr 14, 2017 at 11:41:26AM +0200, Vegard Nossum wrote:
>>> On 13 April 2017 at 20:34, Greg KH <gre...@linuxfoundation.org> wrote:
>>> > On Thu, Apr 13, 2017 at 09:07:40AM -0700, Linus Torvalds wrote:
>>> >> On Thu, Apr 13, 2017 at 3:50 AM, Vegard Nossum <vegard.nos...@gmail.com> 
>>> >> wrote:
>>> So the original problem is that the vmalloc() in n_tty_open() can
>>> fail, and that will panic in tty_set_ldisc()/tty_ldisc_restore()
>>> because of its unwillingness to proceed if the tty doesn't have an
>>> ldisc.
>>>
>>> Dmitry fixed this by allowing tty->ldisc == NULL in the case of memory
>>> allocation failure as we can see from the comment in tty_set_ldisc().
>>>
>>> Unfortunately, it would appear that some other bits of code do not
>>> like tty->ldisc == NULL (other than the crash in this thread, I saw
>>> 2-3 similar crashes in other functions, e.g. poll()). I see two
>>> possibilities:
>>>
>>> 1) make other code handle tty->ldisc == NULL.
>>>
>>> 2) don't close/free the old ldisc until the new one has been
>>> successfully created/initialised/opened/attached to the tty, and
>>> return an error to userspace if changing it failed.
>>>
>>> I'm leaning towards #2 as the more obviously correct fix, it makes
>>> tty_set_ldisc() transactional, the fix seems limited in scope to
>>> tty_set_ldisc() itself, and we don't need to make every other bit of
>>> code that uses tty->ldisc handle the NULL case.
>>
>> That sounds reasonable to me, care to work on a patch for this?
>
> Vegard, do you know how to do this?
> That was first thing that I tried, but I did not manage to make it
> work. disc is tied to tty, so it's not that one can create a fully
> initialized disc on the side and then simply swap pointers. Looking at
> the code now, there is at least TTY_LDISC_OPEN bit in tty. But as far
> as I remember there were more fundamental problems. Or maybe I just
> did not try too hard.

I had a look at it but like you said, the tty/ldisc relationship is
complicated :-/

Maybe we can split up ldisc initialisation into two methods so that
the first one (e.g. ->alloc) does all the allocation and is allowed to
fail and the second one (e.g. ->open) is not allowed to fail. Then you
can allocate a new ldisc without freeing the old one and only swap
them over if the allocation succeeded.

That would require fixing up ->open for all the ldisc drivers though,
I'm not sure how easy/feasible it is.

I'll think about possible solutions, but I have no prior experience
with the tty code. In the meantime syzkaller also hit a couple of
other fun tty/pty bugs including a write/ioctl race that results in
buffer overflow :-/


Vegard
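
The ->alloc/->open split suggested above can be sketched in userspace.
This is only an illustration under stated assumptions: struct ldisc,
struct tty, ldisc_alloc(), ldisc_free() and tty_set_ldisc_sketch() are
hypothetical and much simpler than the real tty code, and the
'simulate_oom' flag exists purely to exercise the failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical line discipline state; not the kernel's struct tty_ldisc. */
struct ldisc {
	int num;
	char *buf;
};

struct tty {
	struct ldisc *ldisc; /* never NULL once the tty is set up */
};

/* Phase 1: everything that can fail (allocation) happens here. */
static struct ldisc *ldisc_alloc(int num, int simulate_oom)
{
	struct ldisc *ld;

	if (simulate_oom)
		return NULL;
	ld = malloc(sizeof(*ld));
	if (!ld)
		return NULL;
	ld->buf = malloc(4096);
	if (!ld->buf) {
		free(ld);
		return NULL;
	}
	ld->num = num;
	return ld;
}

static void ldisc_free(struct ldisc *ld)
{
	if (!ld)
		return;
	free(ld->buf);
	free(ld);
}

/*
 * Phase 2: swap the pointers. Nothing here can fail, so the tty either
 * keeps its old ldisc (on error) or ends up with the new one.
 */
static int tty_set_ldisc_sketch(struct tty *tty, int num, int simulate_oom)
{
	struct ldisc *new_ld = ldisc_alloc(num, simulate_oom);
	struct ldisc *old_ld;

	if (!new_ld)
		return -ENOMEM; /* old ldisc is still attached */

	old_ld = tty->ldisc;
	tty->ldisc = new_ld;
	ldisc_free(old_ld);
	return 0;
}
```

The point of the ordering is that the old ldisc is only released after the
new one is known to be usable, so no caller ever observes
tty->ldisc == NULL.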


Re: [git pull] vfs fixes

2017-04-15 Thread Vegard Nossum
On 9 April 2017 at 07:40, Al Viro  wrote:
>
> The following changes since commit a71c9a1c779f2499fb2afc0553e543f18aff6edf:
>
>   Linux 4.11-rc5 (2017-04-02 17:23:54 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus
>
> for you to fetch changes up to a8e28440016bfb23bec266c4c66eacca6ea2d48b:
>
>   Merge branch 'work.statx' into for-next (2017-04-03 01:06:59 -0400)
>
> 
> Al Viro (2):
>   alpha: fix stack smashing in old_adjtimex(2)
>   Merge branch 'work.statx' into for-next

I'm seeing the same memfd_create/name_to_handle_at/path_lookupat
use-after-free that Dmitry was seeing here:

https://lkml.org/lkml/2017/3/4/118

I haven't tried the patch from that thread yet, but was there any
reason for it not to get merged so far?


Vegard


Re: [GIT PULL] TTY/Serial driver fixes for 4.11-rc4

2017-04-14 Thread Vegard Nossum
On 13 April 2017 at 20:34, Greg KH <gre...@linuxfoundation.org> wrote:
> On Thu, Apr 13, 2017 at 09:07:40AM -0700, Linus Torvalds wrote:
>> On Thu, Apr 13, 2017 at 3:50 AM, Vegard Nossum <vegard.nos...@gmail.com> 
>> wrote:
>> >
>> > I've bisected a syzkaller crash down to this commit
>> > (5362544bebe85071188dd9e479b5a5040841c895). The crash is:
>> >
>> > [   25.137552] BUG: unable to handle kernel paging request at 
>> > 2280
>> > [   25.137579] IP: mutex_lock_interruptible+0xb/0x30
>>
>> It would seem to be the
>>
>>         if (mutex_lock_interruptible(&ldata->atomic_read_lock))
>>
>> call in n_tty_read(), the offset is about right for a NULL 'ldata'
>> pointer (it's a big structure, it has a couple of character buffers of
>> size N_TTY_BUF_SIZE).
>>
>> I don't see the obvious fix, so I suspect at this point we should just
>> revert, as that commit seems to introduce worse problems than it is
>> supposed to fix. Greg?
>
> Unless Dmitry has a better idea, I will just revert it and send you the
> pull request in a day or so.

I don't think we need to rush a revert, I'd hope there's a way to fix
it properly.

So the original problem is that the vmalloc() in n_tty_open() can
fail, and that will panic in tty_set_ldisc()/tty_ldisc_restore()
because of its unwillingness to proceed if the tty doesn't have an
ldisc.

Dmitry fixed this by allowing tty->ldisc == NULL in the case of memory
allocation failure as we can see from the comment in tty_set_ldisc().

Unfortunately, it would appear that some other bits of code do not
like tty->ldisc == NULL (other than the crash in this thread, I saw
2-3 similar crashes in other functions, e.g. poll()). I see two
possibilities:

1) make other code handle tty->ldisc == NULL.

2) don't close/free the old ldisc until the new one has been
successfully created/initialised/opened/attached to the tty, and
return an error to userspace if changing it failed.

I'm leaning towards #2 as the more obviously correct fix, it makes
tty_set_ldisc() transactional, the fix seems limited in scope to
tty_set_ldisc() itself, and we don't need to make every other bit of
code that uses tty->ldisc handle the NULL case.


Vegard
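
The reasoning quoted above (a small non-zero fault address pointing at a
member of a NULL struct pointer) can be illustrated with offsetof(). This
is a rough sketch with a made-up layout: struct ldata_sketch is NOT the
real struct n_tty_data, BUF_SIZE is only assumed to mirror
N_TTY_BUF_SIZE, and member_address() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 4096 /* assumed to mirror N_TTY_BUF_SIZE */

/* Hypothetical layout: two large buffers ahead of the lock. */
struct ldata_sketch {
	unsigned long head;
	char read_buf[BUF_SIZE];
	char echo_buf[BUF_SIZE];
	int atomic_read_lock; /* placeholder for a struct mutex */
};

/* Address that &base->atomic_read_lock evaluates to for a given base. */
static uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct ldata_sketch, atomic_read_lock);
}
```

With a NULL base the resulting address is just the member's offset, which
lands a little past the two buffers. That is why a fault a few kilobytes
above zero suggests a NULL struct pointer rather than a wild one.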


Re: [GIT PULL] TTY/Serial driver fixes for 4.11-rc4

2017-04-13 Thread Vegard Nossum
On 26 March 2017 at 13:04, Greg KH  wrote:
> The following changes since commit 4495c08e84729385774601b5146d51d9e5849f81:
>
>   Linux 4.11-rc2 (2017-03-12 14:47:08 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git/ tags/tty-4.11-rc4
>
> for you to fetch changes up to a4a3e061149f09c075f108b6f1cf04d9739a6bc2:
>
>   tty: fix data race in tty_ldisc_ref_wait() (2017-03-17 14:07:10 +0900)
>
> 
> TTY/Serial driver fixes for 4.11-rc4
>
> Here are some tty and serial driver fixes for 4.11-rc4.  One of these
> fix a long-standing issue in the ldisc code that was found by Dmitry
> Vyukov with his great fuzzing work.  The other fixes resolve other
> reported issues, and there is one revert of a patch in 4.11-rc1 that
> wasn't correct.
>
> All of these have been in linux-next for a while with no reported
> issues.
>
> Signed-off-by: Greg Kroah-Hartman 
>
> 
> Aleksey Makarov (1):
>   Revert "tty: serial: pl011: add ttyAMA for matching pl011 console"
>
> Dmitry Vyukov (2):
>   tty: don't panic on OOM in tty_set_ldisc()

I've bisected a syzkaller crash down to this commit
(5362544bebe85071188dd9e479b5a5040841c895). The crash is:

[   25.137552] BUG: unable to handle kernel paging request at 2280
[   25.137579] IP: mutex_lock_interruptible+0xb/0x30
[   25.137589] PGD 3b0c067
[   25.137593] PUD 3911067
[   25.137597] PMD 0
[   25.137601]
[   25.137611] Oops: 0002 [#1] PREEMPT SMP KASAN
[   25.137624] CPU: 1 PID: 3690 Comm: a.out Not tainted 4.11.0-rc2+ #145
[   25.137631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.137639] task: 880003b96400 task.stack: 880004e98000
[   25.137651] RIP: 0010:mutex_lock_interruptible+0xb/0x30
[   25.137657] RSP: 0018:880004e9fae0 EFLAGS: 00010246
[   25.137668] RAX:  RBX: 880004e6c000 RCX: 817bb2a9
[   25.137675] RDX: 880003b96400 RSI: 0015 RDI: 2280
[   25.137696] RBP: 880004e9fca0 R08: 0003 R09: 0002
[   25.137703] R10: 0002 R11: edc23fe9 R12: 880004e6c000
[   25.137710] R13: 80045430 R14: 880004bac900 R15: 880004bacb60
[   25.137720] FS:  7f7cac233700() GS:88000610()
knlGS:
[   25.137727] CS:  0010 DS:  ES:  CR0: 80050033
[   25.137733] CR2: 2280 CR3: 03b67000 CR4: 06e0
[   25.137746] DR0:  DR1:  DR2: 
[   25.137752] DR3:  DR6: fffe0ff0 DR7: 0400
[   25.137755] Call Trace:
[   25.137769]  ? n_tty_read+0x15f/0xc70
[   25.137783]  ? preempt_count_add+0xb2/0xe0
[   25.137793]  ? n_tty_flush_buffer+0x90/0x90
[   25.137806]  ? wait_woken+0x100/0x100
[   25.137817]  tty_read+0xd8/0x140
[   25.137830]  __vfs_read+0xd1/0x320
[   25.137842]  ? do_sendfile+0x6c0/0x6c0
[   25.137853]  ? __fsnotify_update_child_dentry_flags+0x30/0x30
[   25.137864]  ? selinux_file_permission+0x1c0/0x210
[   25.137873]  ? __fsnotify_parent+0x27/0x130
[   25.137882]  ? security_file_permission+0xce/0xf0
[   25.137893]  ? rw_verify_area+0x73/0x140
[   25.137904]  vfs_read+0xba/0x1b0
[   25.137915]  SyS_read+0xa0/0x120
[   25.137926]  ? vfs_write+0x260/0x260
[   25.137938]  ? preempt_count_sub+0x13/0xd0
[   25.137949]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[   25.137957] RIP: 0033:0x7f7caf61351d
[   25.137963] RSP: 002b:7f7cac232f20 EFLAGS: 0293 ORIG_RAX:

[   25.137974] RAX: ffda RBX: 7f7cac233700 RCX: 7f7caf61351d
[   25.137980] RDX: 003e RSI: 80045430 RDI: 0004
[   25.137987] RBP: 7fffb4f21250 R08: 7f7cac233700 R09: 7f7cac233700
[   25.137993] R10: 7f7cac2339d0 R11: 0293 R12: 
[   25.137999] R13: 7fffb4f2124f R14: 7f7cac2339c0 R15: 
[   25.138002] Code: c7 43 20 00 00 00 00 48 89 df e8 91 ff ff ff 5b
41 5c 5d c3 83 e8 01 41 89 44 24 10 eb e1 66 90 65 48 8b 14 25 40 54
01 00 31 c0  48 0f b1 17 48 85 c0 74 0a 55 48 89 e5 e8 e2 f4 ff ff
5d f3
[   25.138218] RIP: mutex_lock_interruptible+0xb/0x30 RSP: 880004e9fae0
[   25.138221] CR2: 2280
[   25.138301] ---[ end trace 242fd54c56b177b4 ]---
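
For reading the "Oops: 0002" line in the trace: on x86 that number is the
page-fault error code. The decoder below is only an illustration
(decode_pf_error() and struct pf_info are made-up names; the bit meanings
are the architectural ones). It shows that 0002 means a kernel-mode write
to a not-present page, which is consistent with a cmpxchg in
mutex_lock_interruptible() on the address in RDI.

```c
#include <stdbool.h>

/* x86 page-fault error code bits (architectural definitions). */
#define PF_PROT  (1u << 0)	/* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1)	/* 0: read access, 1: write access */
#define PF_USER  (1u << 2)	/* 0: kernel mode, 1: user mode */
#define PF_RSVD  (1u << 3)	/* reserved bit set in a page-table entry */
#define PF_INSTR (1u << 4)	/* fault was an instruction fetch */

struct pf_info {
	bool present;
	bool write;
	bool user;
	bool rsvd;
	bool instr;
};

/* Split an error code, as printed after "Oops:", into its flags. */
static struct pf_info decode_pf_error(unsigned int code)
{
	struct pf_info info = {
		.present = (code & PF_PROT) != 0,
		.write   = (code & PF_WRITE) != 0,
		.user    = (code & PF_USER) != 0,
		.rsvd    = (code & PF_RSVD) != 0,
		.instr   = (code & PF_INSTR) != 0,
	};

	return info;
}
```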

The syzkaller reproducer is:

# {Threaded:true Collide:true Repeat:true Procs:1 Sandbox:setuid Repro:false}
mmap(&(0x7f00/0x9f000)=nil, (0x9f000), 0x3, 0x32,
0x, 0x0)
r0 = openat$ptmx(0xff9c,
&(0x7f001000-0xa)="2f6465762f70746d7800", 0x201, 0x0)
ioctl$TIOCSPTLCK(r0, 0x40045431, &(0x7f09a000)=0x0)
r1 = syz_open_pts(r0, 0x0)
read(r1, 

Re: [PATCH] hugetlbfs: fix offset overflow in huegtlbfs mmap

2017-04-12 Thread Vegard Nossum
On 12 April 2017 at 00:51, Mike Kravetz <mike.krav...@oracle.com> wrote:
> If mmap() maps a file, it can be passed an offset into the file at
> which the mapping is to start.  Offset could be a negative value when
> represented as a loff_t.  The offset plus length will be used to
> update the file size (i_size) which is also a loff_t.  Validate the
> value of offset and offset + length to make sure they do not overflow
> and appear as negative.
>
> Found by syzkaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
> region_abort if region_chg fails") applied.  Prior to this commit, the
> overflow would still occur but we would luckily return ENOMEM.
> To reproduce:
> mmap(0, 0x2000, 0, 0x40021, 0xULL, 0x8000ULL);
>
> Resulted in,
> kernel BUG at mm/hugetlb.c:742!
> Call Trace:
>  hugetlbfs_evict_inode+0x80/0xa0
>  ? hugetlbfs_setattr+0x3c0/0x3c0
>  evict+0x24a/0x620
>  iput+0x48f/0x8c0
>  dentry_unlink_inode+0x31f/0x4d0
>  __dentry_kill+0x292/0x5e0
>  dput+0x730/0x830
>  __fput+0x438/0x720
>  fput+0x1a/0x20
>  task_work_run+0xfe/0x180
>  exit_to_usermode_loop+0x133/0x150
>  syscall_return_slowpath+0x184/0x1c0
>  entry_SYSCALL_64_fastpath+0xab/0xad
>
> Reported-by: Vegard Nossum <vegard.nos...@gmail.com>

Please use <vegard.nos...@oracle.com> if possible :-)

> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> ---
>  fs/hugetlbfs/inode.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 7163fe0..dde8613 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -136,17 +136,26 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
> vma->vm_ops = &hugetlb_vm_ops;
>
> +   /*
> +* Offset passed to mmap (before page shift) could have been
> +* negative when represented as a (l)off_t.
> +*/
> +   if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> +   return -EINVAL;
> +

This is strictly speaking undefined behaviour in C and would get
flagged by e.g. UBSAN. The kernel does compile with
-fno-strict-overflow when supported, though, so maybe it's more of a
theoretical issue.

Another thing: wouldn't we want to detect all truncations, not just
the ones that happen to end up negative?

For example (with -fno-strict-overflow), (0x12345678 << 12) ==
0x45678000, which is still a positive integer, but obviously
truncated.

We can easily avoid the UB by moving the cast out (since ->vm_pgoff is
unsigned and unsigned shifts are always defined IIRC), but that still
doesn't reliably detect the positive-result truncation/overflow.
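
One way to catch every truncation, sketched in userspace C under stated
assumptions: pgoff_to_offset() is a hypothetical helper, PAGE_SHIFT is
fixed at 12, and long long stands in for loff_t. The page offset is
compared against the largest value that still fits after the shift, and
only then shifted, so no signed overflow ever happens.

```c
#include <errno.h>
#include <limits.h>

#define PAGE_SHIFT 12

/*
 * Convert a page offset to a byte offset.
 * Returns 0 and stores the result in *out, or -EINVAL if the byte
 * offset would not fit in a non-negative long long.
 */
static int pgoff_to_offset(unsigned long pgoff, long long *out)
{
	/* Largest page offset whose byte offset is still <= LLONG_MAX. */
	const unsigned long long max_pgoff =
		(unsigned long long)LLONG_MAX >> PAGE_SHIFT;

	if (pgoff > max_pgoff)
		return -EINVAL;

	/* Shift in the unsigned domain; the result is known to fit. */
	*out = (long long)((unsigned long long)pgoff << PAGE_SHIFT);
	return 0;
}
```

Because the comparison happens before the shift, results that wrap to a
positive value are rejected just like the ones that would end up negative.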

> if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
> return -EINVAL;
>
> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> +   len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +   /* check for overflow */
> +   if (len < vma_len)
> +   return -EINVAL;

Also strictly speaking UB. You can avoid it by casting vma_len to
unsigned and dropping the loff_t cast, but it's admittedly somewhat
verbose. There also isn't an "unsigned loff_t" AFAIK, but don't we
have some helpers to safely check for overflows? Surely this isn't the
only place that does loff_t arithmetic.
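
For the addition, GCC (5 and later) and Clang provide
__builtin_add_overflow(), which reports overflow without relying on wrapped
signed arithmetic. A sketch, with long long standing in for loff_t and
mapping_end() as a made-up helper name; it makes no claim about which
wrapper, if any, the kernel offers for this spot:

```c
#include <errno.h>

/*
 * Compute offset + vma_len.
 * Returns 0 and stores the sum in *len, or -EINVAL if either input is
 * negative or the sum does not fit in a long long.
 */
static int mapping_end(long long offset, long long vma_len, long long *len)
{
	long long sum;

	if (offset < 0 || vma_len < 0)
		return -EINVAL;

	/* The builtin returns true when the exact result does not fit. */
	if (__builtin_add_overflow(offset, vma_len, &sum))
		return -EINVAL;

	*len = sum;
	return 0;
}
```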

>
> inode_lock(inode);
> file_accessed(file);
>
> ret = -ENOMEM;
> -   len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> -
> if (hugetlb_reserve_pages(inode,
> vma->vm_pgoff >> huge_page_order(h),
> len >> huge_page_shift(h), vma,
> @@ -155,7 +164,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>
> ret = 0;
> if (vma->vm_flags & VM_WRITE && inode->i_size < len)
> -   inode->i_size = len;
> +   i_size_write(inode, len);
>  out:
> inode_unlock(inode);

This hunk seems a bit out of place in the sense that I don't see how
it relates to the overflow checking. I think this either belongs in a
separate patch or it deserves a mention in the changelog.


Vegard


Re: [PATCH] um: use KERN_CONT in stack dump

2017-04-11 Thread Vegard Nossum
On 12 March 2017 at 10:47, Vegard Nossum <vegard.nos...@oracle.com> wrote:
> On 12/03/2017 10:45, Richard Weinberger wrote:
>> diff --git a/arch/um/kernel/sysrq.c b/arch/um/kernel/sysrq.c
>> index aa1b56f5ac68..18eddf677ec6 100644
>> --- a/arch/um/kernel/sysrq.c
>> +++ b/arch/um/kernel/sysrq.c
>> @@ -17,10 +17,8 @@
>>
>>  static void _print_addr(void *data, unsigned long address, int reliable)
>>  {
>> -   pr_info(" [<%08lx>]", address);
>> -   pr_cont(" %s", reliable ? "" : "? ");
>> -   print_symbol("%s", address);
>> -   pr_cont("\n");
>> +   pr_info(" [<%08lx>] %s%pB\n", address, reliable ? "" : "? ",
>> +   (void *)address);
>>  }
>
> Tested-by: Vegard Nossum <vegard.nos...@oracle.com>

Just a heads up, this still appears unfixed in Linus's repo.


Vegard


Re: [PATCH RESEND] mm/hugetlb: Don't call region_abort if region_chg fails

2017-04-10 Thread Vegard Nossum
On 29 March 2017 at 23:08, Mike Kravetz  wrote:
> Changes to hugetlbfs reservation maps is a two step process.  The first
> step is a call to region_chg to determine what needs to be changed, and
> prepare that change.  This should be followed by a call to
> region_add to commit the change, or region_abort to abort the change.
>
> The error path in hugetlb_reserve_pages called region_abort after a
> failed call to region_chg.  As a result, the adds_in_progress counter
> in the reservation map is off by 1.  This is caught by a VM_BUG_ON
> in resv_map_release when the reservation map is freed.
>
> syzkaller fuzzer found this bug, that resulted in the following:
>
>  kernel BUG at mm/hugetlb.c:742!
>  Call Trace:
>   hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
>   evict+0x481/0x920 fs/inode.c:553
>   iput_final fs/inode.c:1515 [inline]
>   iput+0x62b/0xa20 fs/inode.c:1542
>   hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
>   newseg+0x422/0xd30 ipc/shm.c:575
>   ipcget_new ipc/util.c:285 [inline]
>   ipcget+0x21e/0x580 ipc/util.c:639
>   SYSC_shmget ipc/shm.c:673 [inline]
>   SyS_shmget+0x158/0x230 ipc/shm.c:657
>   entry_SYSCALL_64_fastpath+0x1f/0xc2
>  RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
>
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Mike Kravetz 
> Acked-by: Hillf Danton 
> ---
>  mm/hugetlb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c7025c1..c65d45c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4233,7 +4233,9 @@ int hugetlb_reserve_pages(struct inode *inode,
> return 0;
>  out_err:
> if (!vma || vma->vm_flags & VM_MAYSHARE)
> -   region_abort(resv_map, from, to);
> +   /* Don't call region_abort if region_chg failed */
> +   if (chg >= 0)
> +   region_abort(resv_map, from, to);
> if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> kref_put(&resv_map->refs, resv_map_release);
> return ret;
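
To make the counter accounting described in the quoted changelog concrete,
here is a toy model, not the mm/hugetlb.c code: struct resv, chg(), add(),
abort_chg() and reserve() are hypothetical stand-ins that track nothing but
an in-progress counter, and fail_chg is a hook for simulating a failed
first step. It only illustrates the rule that an abort must be paired with
a successful prepare.

```c
#include <errno.h>

/* Toy reservation map: only the in-progress counter is modelled. */
struct resv {
	long adds_in_progress;
	int fail_chg;	/* test hook: make the next chg() fail */
};

/* Step 1: prepare a change. Only a successful call bumps the counter. */
static long chg(struct resv *r, long from, long to)
{
	if (r->fail_chg) {
		r->fail_chg = 0;
		return -ENOMEM;
	}
	r->adds_in_progress++;
	return to - from;
}

/* Step 2a: commit a prepared change. */
static void add(struct resv *r)
{
	r->adds_in_progress--;
}

/* Step 2b: abandon a prepared change. */
static void abort_chg(struct resv *r)
{
	r->adds_in_progress--;
}

/* Error handling that only aborts what was actually prepared. */
static int reserve(struct resv *r, long from, long to, int later_failure)
{
	long c = chg(r, from, to);

	if (c < 0)
		return (int)c;	/* nothing was prepared, so nothing to abort */
	if (later_failure) {
		abort_chg(r);
		return -ENOSPC;
	}
	add(r);
	return 0;
}
```

In this model the counter returns to zero on every path, which is the
condition the VM_BUG_ON in the changelog checks when the map is released.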

Hi guys,

I'm running into this on latest linus/master:

kernel BUG at mm/hugetlb.c:742!
invalid opcode:  [#1] SMP KASAN
CPU: 3 PID: 20281 Comm: syz-executor0 Not tainted 4.11.0-rc6 #335
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: 880064f30dc0 task.stack: 880065b38000
RIP: 0010:resv_map_release+0x1cb/0x270
RSP: 0018:880065b3fc38 EFLAGS: 00010287
RAX: 0001 RBX: 88006b5fe418 RCX: c90001b52000
RDX: 05de RSI: 8172026b RDI: 88006b5fe410
RBP: 880065b3fc78 R08: 880065b3f958 R09: 
R10:  R11:  R12: dc00
R13: 88006b5fe418 R14: 88006b5fe418 R15: 88006b5fe418
FS:  7f21647c5700() GS:88006d10() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00460750 CR3: 5d123000 CR4: 06e0
Call Trace:
 hugetlbfs_evict_inode+0x80/0xa0
 ? hugetlbfs_setattr+0x3c0/0x3c0
 evict+0x24a/0x620
 iput+0x48f/0x8c0
 dentry_unlink_inode+0x31f/0x4d0
 __dentry_kill+0x292/0x5e0
 dput+0x730/0x830
 __fput+0x438/0x720
 fput+0x1a/0x20
 task_work_run+0xfe/0x180
 exit_to_usermode_loop+0x133/0x150
 syscall_return_slowpath+0x184/0x1c0
 entry_SYSCALL_64_fastpath+0xab/0xad

To reproduce:

mmap(0, 0x2000, 0, 0x40031, 0xULL, 0x8000ULL);

Curiously enough, it's the patch from this thread (i.e. commit
ff8c0c53c47530ffea82c22a0a6df6332b56c957) that introduces it,
according to git bisect. Reverting the commit from linus/master fixes
the problem.

Also found by syzkaller (no fault injections this time).


Vegard


Re: [PATCH] um: use KERN_CONT in stack dump

2017-03-12 Thread Vegard Nossum

On 12/03/2017 10:45, Richard Weinberger wrote:

Am 12.03.2017 um 10:38 schrieb Vegard Nossum:

Without KERN_CONT, the symbol will appear on a new line, making stack
traces completely unreadable:

[snip]

I think it is better to fix the root of the problem by using a single printk.
i.e.

diff --git a/arch/um/kernel/sysrq.c b/arch/um/kernel/sysrq.c
index aa1b56f5ac68..18eddf677ec6 100644
--- a/arch/um/kernel/sysrq.c
+++ b/arch/um/kernel/sysrq.c
@@ -17,10 +17,8 @@

 static void _print_addr(void *data, unsigned long address, int reliable)
 {
-   pr_info(" [<%08lx>]", address);
-   pr_cont(" %s", reliable ? "" : "? ");
-   print_symbol("%s", address);
-   pr_cont("\n");
+   pr_info(" [<%08lx>] %s%pB\n", address, reliable ? "" : "? ",
+   (void *)address);
 }


Your patch is better.
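
The same idea applies to any line-oriented logger: build the whole line in
one call so that nothing can split it. A userspace sketch with snprintf();
format_frame() is a made-up helper, and the symbol text is passed in as a
string because %pB only exists in the kernel's printk:

```c
#include <stdio.h>

/*
 * Format one stack-trace entry into buf with a single call.
 * Returns the number of characters snprintf() would have written.
 */
static int format_frame(char *buf, size_t size, unsigned long address,
			int reliable, const char *symbol)
{
	return snprintf(buf, size, " [<%08lx>] %s%s\n",
			address, reliable ? "" : "? ", symbol);
}
```

Called with address 0x6008e891, reliable set to 0 and the symbol string
"printk+0x0/0x94", this fills buf with a single complete trace line.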

Tested-by: Vegard Nossum <vegard.nos...@oracle.com>

Thanks,


Vegard


[PATCH] um: use KERN_CONT in stack dump

2017-03-12 Thread Vegard Nossum
Without KERN_CONT, the symbol will appear on a new line, making stack
traces completely unreadable:

Call Trace:
 [<6008e891>] ?
printk+0x0/0x94
 [<6001cce6>]
show_stack+0xfe/0x15b
 [<600666ec>] ?
dump_stack_print_info+0xe1/0xea
 [<6008e891>] ?
printk+0x0/0x94
 [<6023e826>] ?
bust_spinlocks+0x0/0x4f
 [<602343b8>]
dump_stack+0x2a/0x2c
 [<6008e662>]
panic+0x170/0x31e
 [<6008e4f2>] ?
panic+0x0/0x31e

This makes it readable again:

Call Trace:
 [<6008e891>] ? printk+0x0/0x94
 [<6001cce6>] show_stack+0xfe/0x15b
 [<600666ec>] ? dump_stack_print_info+0xe1/0xea
 [<6008e891>] ? printk+0x0/0x94
 [<6023e826>] ? bust_spinlocks+0x0/0x4f
 [<602343b8>] dump_stack+0x2a/0x2c
 [<6008e662>] panic+0x170/0x31e

Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 arch/um/kernel/sysrq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/um/kernel/sysrq.c b/arch/um/kernel/sysrq.c
index a76295f7ede9..edf1f80123e7 100644
--- a/arch/um/kernel/sysrq.c
+++ b/arch/um/kernel/sysrq.c
@@ -22,7 +22,7 @@ static void _print_addr(void *data, unsigned long address, int reliable)
 {
pr_info(" [<%08lx>]", address);
pr_cont(" %s", reliable ? "" : "? ");
-   print_symbol("%s", address);
+   print_symbol(KERN_CONT "%s", address);
pr_cont("\n");
 }
 
-- 
2.12.0.rc0



Re: [PATCH] locking/hung_task: Defer showing held locks

2017-03-12 Thread Vegard Nossum

On 12/03/2017 06:33, Tetsuo Handa wrote:

When I was running my testcase which may block hundreds of threads
on fs locks, I got lockup due to output from debug_show_all_locks()
added by commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks").

For example, if 1000 threads were blocked in TASK_UNINTERRUPTIBLE state
and 500 out of 1000 threads hold some lock, debug_show_all_locks() from
for_each_process_thread() loop will report locks held by 500 threads for
1000 times. This is too much noise.

In order to make sure rcu_lock_break() is called frequently, we should
avoid calling debug_show_all_locks() from for_each_process_thread() loop
because debug_show_all_locks() effectively calls for_each_process_thread()
loop. Let's defer calling debug_show_all_locks() till before panic() or
leaving for_each_process_thread() loop.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Cc: Vegard Nossum <vegard.nos...@oracle.com>
---
 kernel/hung_task.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index f0f8e2a..751593e 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -43,6 +43,7 @@
 int __read_mostly sysctl_hung_task_warnings = 10;

 static int __read_mostly did_panic;
+static bool hung_task_show_lock;

 static struct task_struct *watchdog_task;

@@ -120,12 +121,14 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
" disables this message.\n");
sched_show_task(t);
-   debug_show_all_locks();
+   hung_task_show_lock = true;
}

touch_nmi_watchdog();

if (sysctl_hung_task_panic) {
+   if (hung_task_show_lock)
+   debug_show_all_locks();
trigger_all_cpu_backtrace();
panic("hung_task: blocked tasks");
}
@@ -172,6 +175,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
if (test_taint(TAINT_DIE) || did_panic)
return;

+   hung_task_show_lock = false;
rcu_read_lock();
for_each_process_thread(g, t) {
if (!max_count--)
@@ -187,6 +191,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
}
  unlock:
rcu_read_unlock();
+   if (hung_task_show_lock)
+   debug_show_all_locks();
 }

 static long hung_timeout_jiffies(unsigned long last_checked,



Reviewed/Acked-by: Vegard Nossum <vegard.nos...@oracle.com>

Thank you for fixing this.


Vegard
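
(For readers skimming the archive: below is a rough user-space sketch of the
deferral pattern in the patch quoted above. It is not kernel code. All names
are hypothetical stand-ins: show_locks_pending plays the role of
hung_task_show_lock, and dump_all_locks() stands in for
debug_show_all_locks(). The point is only that each hung task sets a flag,
and the expensive report runs at most once per scan, after the loop.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct task {
	int id;
	bool blocked;
};

static int expensive_dumps;      /* how many times the costly report ran */
static bool show_locks_pending;  /* plays the role of hung_task_show_lock */

/* Stand-in for debug_show_all_locks(): here it only counts invocations. */
static void dump_all_locks(void)
{
	expensive_dumps++;
}

/* Per-task check: report the task, but only *remember* that locks
 * should be shown instead of dumping them right away. */
static void check_task(const struct task *t)
{
	if (!t->blocked)
		return;
	printf("task %d looks hung\n", t->id);
	show_locks_pending = true;
}

/* One scan over all tasks; the expensive dump is deferred to the end. */
void check_all_tasks(const struct task *tasks, int n)
{
	assert(n >= 0);
	show_locks_pending = false;
	for (int i = 0; i < n; i++)
		check_task(&tasks[i]);
	if (show_locks_pending)
		dump_all_locks();
}
```

With three blocked tasks out of four, a scan calls dump_all_locks() once
rather than three times, and a scan with no blocked tasks does not call it
at all.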


Re: [PATCH] locking/hung_task: Defer showing held locks

2016-12-20 Thread Vegard Nossum
On 13 December 2016 at 15:45, Tetsuo Handa wrote:
> When I was running my testcase which may block hundreds of threads
> on fs locks, I got lockup due to output from debug_show_all_locks()
> added by commit b2d4c2edb2e4f89a ("locking/hung_task: Show all locks").
>
> I think we don't need to call debug_show_all_locks() on each blocked
> thread. Let's defer calling debug_show_all_locks() till before panic()
> or leaving for_each_process_thread() loop.

First of all, sorry for not answering earlier.

I'm not sure I fully understand the problem. You say the "output from
debug_show_all_locks()" caused a lockup, but was the problem simply
that the amount of output caused it to stall for a long time?

Could we instead

1) move the debug_show_all_locks() into the if
(sysctl_hung_task_panic) bit unconditionally

2) call something (touch_nmi_watchdog()?) inside debug_show_all_locks()

3) make debug_show_all_locks() more robust in some other way so it doesn't "lockup"

?


Vegard


[PATCH 1/4] mm: add new mmgrab() helper

2016-12-18 Thread Vegard Nossum
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Cc: Andrew Morton <a...@linux-foundation.org>
Acked-by: Michal Hocko <mho...@suse.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 arch/alpha/kernel/smp.c  |  2 +-
 arch/arc/kernel/smp.c|  2 +-
 arch/arm/kernel/smp.c|  2 +-
 arch/arm64/kernel/smp.c  |  2 +-
 arch/blackfin/mach-common/smp.c  |  2 +-
 arch/hexagon/kernel/smp.c|  2 +-
 arch/ia64/kernel/setup.c |  2 +-
 arch/m32r/kernel/setup.c |  2 +-
 arch/metag/kernel/smp.c  |  2 +-
 arch/mips/kernel/traps.c |  2 +-
 arch/mn10300/kernel/smp.c|  2 +-
 arch/parisc/kernel/smp.c |  2 +-
 arch/powerpc/kernel/smp.c|  2 +-
 arch/s390/kernel/processor.c |  2 +-
 arch/score/kernel/traps.c|  2 +-
 arch/sh/kernel/smp.c |  2 +-
 arch/sparc/kernel/leon_smp.c |  2 +-
 arch/sparc/kernel/smp_64.c   |  2 +-
 arch/sparc/kernel/sun4d_smp.c|  2 +-
 arch/sparc/kernel/sun4m_smp.c|  2 +-
 arch/sparc/kernel/traps_32.c |  2 +-
 arch/sparc/kernel/traps_64.c |  2 +-
 arch/tile/kernel/smpboot.c   |  2 +-
 arch/x86/kernel/cpu/common.c |  4 ++--
 arch/xtensa/kernel/smp.c |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  2 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c  |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c|  2 +-
 fs/proc/base.c   |  4 ++--
 fs/userfaultfd.c |  2 +-
 include/linux/sched.h| 22 ++++++++++++++++++++++
 kernel/exit.c|  2 +-
 kernel/futex.c   |  2 +-
 kernel/sched/core.c  |  4 ++--
 mm/khugepaged.c  |  2 +-
 mm/ksm.c |  2 +-
 mm/mmu_context.c |  2 +-
 mm/mmu_notifier.c|  2 +-
 mm/oom_kill.c|  4 ++--
 virt/kvm/kvm_main.c  |  2 +-
 40 files changed, 65 insertions(+), 43 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
index 46bf263c3153..acb4b146a607 100644
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -144,7 +144,7 @@ smp_callin(void)
alpha_mv.smp_callin();
 
/* All kernel threads share the same mm context.  */
-   atomic_inc(&init_mm.mm_count);
+   mmgrab(&init_mm);
current->active_mm = &init_mm;
 
/* inform the notifiers about the new cpu */
diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index 88674d972c9d..9cbc7aba3ede 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -125,7 +125,7 @@ void start_kernel_secondary(void)
setup_processor();
 
atomic_inc(&mm->mm_users);
-   atomic_inc(&mm->mm_count);
+   mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
 
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 7dd14e8395e6..c6514ce0fcbc 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -371,7 +371,7 @@ asmlinkage void secondary_start_kernel(void)
 * reference and switch to it.
 */
cpu = smp_processor_id();
-   atomic_inc(&mm->mm_count);
+   mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
 
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index cb87234cfcf2..959e41196cba 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -222,7 +222,7 @@ asmlinkage void secondary_start_kernel(void)
 * All kernel threads share the same mm context; grab a
 * reference and switch to it.
 */
-   atomic_inc(&mm->mm_count);
+   mmgrab(mm);
current->active_mm = mm;
 
/*
diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
index 23c4ef5f8bdc..bc5617ef7128 100644
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -308,7 +308,7 @@ void secondary_start_kernel(void)
 
/* Attach the new idle task to the global mm. */
atomic_inc(&mm->mm_users);
-   atomic_inc(&mm->mm_count);
+   mmgrab(mm);
current->active_mm = mm;
 
preempt_disable();
diff --git a/arch/hexagon/kernel/smp.c b/arch/hexagon/kernel/smp.c
index 983bae7d2665..c02a6455839e 100644
--- a/arch/hexagon/kernel/smp.c

[PATCH 4/4] mm: clarify mm_struct.mm_{users,count} documentation

2016-12-18 Thread Vegard Nossum
Clarify documentation relating to mm_users and mm_count, and switch to
kernel-doc syntax.

Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 include/linux/mm_types.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 08d947fc4c59..316c3e1fc226 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -407,8 +407,27 @@ struct mm_struct {
unsigned long task_size;/* size of task vm space */
unsigned long highest_vm_end;   /* highest vma end address */
pgd_t * pgd;
-   atomic_t mm_users;  /* How many users with user space? */
-   atomic_t mm_count;  /* How many references to "struct mm_struct" (users count as 1) */
+
+   /**
+* @mm_users: The number of users including userspace.
+*
+* Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
+* to 0 (i.e. when the task exits and there are no other temporary
+* reference holders), we also release a reference on @mm_count
+* (which may then free the &struct mm_struct if @mm_count also
+* drops to 0).
+*/
+   atomic_t mm_users;
+
+   /**
+* @mm_count: The number of references to &struct mm_struct
+* (@mm_users count as 1).
+*
+* Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+* &struct mm_struct is freed.
+*/
+   atomic_t mm_count;
+
atomic_long_t nr_ptes;  /* PTE page table pages */
 #if CONFIG_PGTABLE_LEVELS > 2
atomic_long_t nr_pmds;  /* PMD page table pages */
-- 
2.11.0.1.gaa10c3f
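
(For readers skimming the archive: the two counters documented in the patch
above can be modelled in user space roughly as follows. This is only a
sketch under stated assumptions, not the kernel implementation. Every
"toy_" name is hypothetical, C11 atomics stand in for the kernel's
atomic_t, and address_space_live is an invented flag standing in for
tearing down the mappings. The part taken from the kernel-doc is the
relationship itself: all users together hold one reference on the struct,
and that reference is dropped when the user count reaches zero.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model of the two counters. */
struct toy_mm {
	atomic_int users;         /* models mm_users */
	atomic_int count;         /* models mm_count */
	bool address_space_live;  /* invented: are the mappings still there? */
};

static int toy_frees;             /* how many structs have been freed */

struct toy_mm *toy_mm_alloc(void)
{
	struct toy_mm *mm = malloc(sizeof(*mm));

	if (!mm)
		return NULL;
	atomic_init(&mm->users, 1);
	atomic_init(&mm->count, 1);  /* all users together count as 1 */
	mm->address_space_live = true;
	return mm;
}

/* Pin the struct itself (analogue of mmgrab()). */
void toy_mmgrab(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->count, 1);
}

/* Drop a struct reference; free on the last one (analogue of mmdrop()). */
void toy_mmdrop(struct toy_mm *mm)
{
	int old = atomic_fetch_sub(&mm->count, 1);

	assert(old > 0);
	if (old == 1) {
		free(mm);
		toy_frees++;
	}
}

/* Pin the address space (analogue of mmget()). */
void toy_mmget(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->users, 1);
}

/* Pin the address space only if it is still in use
 * (analogue of mmget_not_zero()), via a compare-and-swap loop. */
bool toy_mmget_not_zero(struct toy_mm *mm)
{
	int old = atomic_load(&mm->users);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a user reference (analogue of mmput()). The last user tears the
 * address space down and releases the users' shared struct reference. */
void toy_mmput(struct toy_mm *mm)
{
	int old = atomic_fetch_sub(&mm->users, 1);

	assert(old > 0);
	if (old == 1) {
		mm->address_space_live = false;
		toy_mmdrop(mm);
	}
}
```

In this model, a holder that called toy_mmgrab() can keep the struct alive
after the last toy_mmput(), but toy_mmget_not_zero() then fails, so the
holder cannot get the address space back.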



[PATCH 3/4] mm: use mmget_not_zero() helper

2016-12-18 Thread Vegard Nossum
We already have the helper, so we can convert the rest of the kernel
mechanically using:

  git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

Cc: Andrew Morton <a...@linux-foundation.org>
Acked-by: Michal Hocko <mho...@suse.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c | 2 +-
 drivers/iommu/intel-svm.c   | 2 +-
 fs/proc/base.c  | 4 ++--
 fs/proc/task_mmu.c  | 4 ++--
 fs/proc/task_nommu.c| 2 +-
 kernel/events/uprobes.c | 2 +-
 mm/swapfile.c   | 2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 1f27529cb48e..89be48ed7c77 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -507,7 +507,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
flags |= FOLL_WRITE;
 
ret = -EFAULT;
-   if (atomic_inc_not_zero(&mm->mm_users)) {
+   if (mmget_not_zero(mm)) {
down_read(&mm->mmap_sem);
while (pinned < npages) {
ret = get_user_pages_remote
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index cb72e0011310..51f2b228723f 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -579,7 +579,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (!svm->mm)
goto bad_req;
/* If the mm is already defunct, don't handle faults. */
-   if (!atomic_inc_not_zero(&svm->mm->mm_users))
+   if (!mmget_not_zero(svm->mm))
goto bad_req;
down_read(&svm->mm->mmap_sem);
vma = find_extend_vma(svm->mm, address);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 32f04999d930..ec7304f5117a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -845,7 +845,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
return -ENOMEM;
 
copied = 0;
-   if (!atomic_inc_not_zero(&mm->mm_users))
+   if (!mmget_not_zero(mm))
goto free;
 
/* Maybe we should limit FOLL_FORCE to actual ptrace users? */
@@ -953,7 +953,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
return -ENOMEM;
 
ret = 0;
-   if (!atomic_inc_not_zero(&mm->mm_users))
+   if (!mmget_not_zero(mm))
goto free;
 
down_read(&mm->mmap_sem);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 958f32545064..6c07c7813b26 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -167,7 +167,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
return ERR_PTR(-ESRCH);
 
mm = priv->mm;
-   if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+   if (!mm || !mmget_not_zero(mm))
return NULL;
 
down_read(&mm->mmap_sem);
@@ -1352,7 +1352,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
unsigned long end_vaddr;
int ret = 0, copied = 0;
 
-   if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+   if (!mm || !mmget_not_zero(mm))
goto out;
 
ret = -EINVAL;
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 37175621e890..1ef97cfcf422 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -219,7 +219,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ESRCH);
 
mm = priv->mm;
-   if (!mm || !atomic_inc_not_zero(&mm->mm_users))
+   if (!mm || !mmget_not_zero(mm))
return NULL;
 
down_read(&mm->mmap_sem);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 215871bda3a2..f164fe8ca5ff 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -741,7 +741,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
continue;
}
 
-   if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+   if (!mmget_not_zero(vma->vm_mm))
continue;
 
info = prev;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 914c31cc143c..5502feef0a4a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1493,7 +1493,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
while (swap_count(*swap_map) && !retval &&
(p = p->next) != &start_mm->mmlist) {
mm = list_entry(p, struct mm_struct, mmlist);

[PATCH 2/4] mm: add new mmget() helper

2016-12-18 Thread Vegard Nossum
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Cc: Andrew Morton <a...@linux-foundation.org>
Acked-by: Michal Hocko <mho...@suse.com>
Signed-off-by: Vegard Nossum <vegard.nos...@oracle.com>
---
 arch/arc/kernel/smp.c   |  2 +-
 arch/blackfin/mach-common/smp.c |  2 +-
 arch/frv/mm/mmu-context.c   |  2 +-
 arch/metag/kernel/smp.c |  2 +-
 arch/sh/kernel/smp.c|  2 +-
 arch/xtensa/kernel/smp.c|  2 +-
 include/linux/sched.h   | 21 +++++++++++++++++++++
 kernel/fork.c   |  4 ++--
 mm/swapfile.c   | 10 +++++-----
 virt/kvm/async_pf.c |  2 +-
 10 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index 9cbc7aba3ede..eec70cb71db1 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -124,7 +124,7 @@ void start_kernel_secondary(void)
/* MMU, Caches, Vector Table, Interrupts etc */
setup_processor();
 
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
index bc5617ef7128..a2e6db2ce811 100644
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -307,7 +307,7 @@ void secondary_start_kernel(void)
local_irq_disable();
 
/* Attach the new idle task to the global mm. */
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
 
diff --git a/arch/frv/mm/mmu-context.c b/arch/frv/mm/mmu-context.c
index 81757d55a5b5..3473bde77f56 100644
--- a/arch/frv/mm/mmu-context.c
+++ b/arch/frv/mm/mmu-context.c
@@ -188,7 +188,7 @@ int cxn_pin_by_pid(pid_t pid)
task_lock(tsk);
if (tsk->mm) {
mm = tsk->mm;
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
ret = 0;
}
task_unlock(tsk);
diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c
index af9cff547a19..c622293254e4 100644
--- a/arch/metag/kernel/smp.c
+++ b/arch/metag/kernel/smp.c
@@ -344,7 +344,7 @@ asmlinkage void secondary_start_kernel(void)
 * All kernel threads share the same mm context; grab a
 * reference and switch to it.
 */
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
index ee379c699c08..edc4769b047e 100644
--- a/arch/sh/kernel/smp.c
+++ b/arch/sh/kernel/smp.c
@@ -179,7 +179,7 @@ asmlinkage void start_secondary(void)
 
enable_mmu();
mmgrab(mm);
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
current->active_mm = mm;
 #ifdef CONFIG_MMU
enter_lazy_tlb(mm, current);
diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
index 9bf5cea3bae4..fcea72019df7 100644
--- a/arch/xtensa/kernel/smp.c
+++ b/arch/xtensa/kernel/smp.c
@@ -135,7 +135,7 @@ void secondary_start_kernel(void)
 
/* All kernel threads share the same mm context. */
 
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6ce46220bda2..9fc07aaf5c97 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2955,6 +2955,27 @@ static inline void mmdrop_async(struct mm_struct *mm)
}
 }
 
+/**
+ * mmget() - Pin the address space associated with a &struct mm_struct.
+ * @mm: The address space to pin.
+ *
+ * Make sure that the address space of the given &struct mm_struct doesn't
+ * go away. This does not protect against parts of the address space being
+ * modified or freed, however.
+ *
+ * Never use this function to pin this address space for an
+ * unbounded/indefinite amount of time.
+ *
+ * Use mmput() to release the reference acquired by mmget().
+ *
+ * See also  for an in-depth explanation
+ * of &mm_struct.mm_count vs &mm_struct.mm_users.
+ */
+static inline void mmget(struct mm_struct *mm)
+{
+   atomic_inc(&mm->mm_users);
+}
+
 static inline bool mmget_not_zero(struct mm_struct *mm)
 {
return atomic_inc_not_zero(&mm->mm_users);
diff --git a/kernel/fork.c b/kernel/fork.c
index 869b8ccc00bf..0e2aaa1837b3 100644
--- a/kernel/fork.c
+++ b/kern

[PATCH 2/4] mm: add new mmget() helper

2016-12-18 Thread Vegard Nossum
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 
's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
  git grep -l 'atomic_inc.*mm_users' | xargs sed -i 
's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might be
a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Cc: Andrew Morton 
Acked-by: Michal Hocko 
Signed-off-by: Vegard Nossum 
---
 arch/arc/kernel/smp.c   |  2 +-
 arch/blackfin/mach-common/smp.c |  2 +-
 arch/frv/mm/mmu-context.c   |  2 +-
 arch/metag/kernel/smp.c |  2 +-
 arch/sh/kernel/smp.c|  2 +-
 arch/xtensa/kernel/smp.c|  2 +-
 include/linux/sched.h   | 21 +
 kernel/fork.c   |  4 ++--
 mm/swapfile.c   | 10 +-
 virt/kvm/async_pf.c |  2 +-
 10 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index 9cbc7aba3ede..eec70cb71db1 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -124,7 +124,7 @@ void start_kernel_secondary(void)
/* MMU, Caches, Vector Table, Interrupts etc */
setup_processor();
 
-   atomic_inc(>mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/blackfin/mach-common/smp.c b/arch/blackfin/mach-common/smp.c
index bc5617ef7128..a2e6db2ce811 100644
--- a/arch/blackfin/mach-common/smp.c
+++ b/arch/blackfin/mach-common/smp.c
@@ -307,7 +307,7 @@ void secondary_start_kernel(void)
local_irq_disable();
 
/* Attach the new idle task to the global mm. */
-   atomic_inc(>mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
 
diff --git a/arch/frv/mm/mmu-context.c b/arch/frv/mm/mmu-context.c
index 81757d55a5b5..3473bde77f56 100644
--- a/arch/frv/mm/mmu-context.c
+++ b/arch/frv/mm/mmu-context.c
@@ -188,7 +188,7 @@ int cxn_pin_by_pid(pid_t pid)
task_lock(tsk);
if (tsk->mm) {
mm = tsk->mm;
-   atomic_inc(>mm_users);
+   mmget(mm);
ret = 0;
}
task_unlock(tsk);
diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c
index af9cff547a19..c622293254e4 100644
--- a/arch/metag/kernel/smp.c
+++ b/arch/metag/kernel/smp.c
@@ -344,7 +344,7 @@ asmlinkage void secondary_start_kernel(void)
 * All kernel threads share the same mm context; grab a
 * reference and switch to it.
 */
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
index ee379c699c08..edc4769b047e 100644
--- a/arch/sh/kernel/smp.c
+++ b/arch/sh/kernel/smp.c
@@ -179,7 +179,7 @@ asmlinkage void start_secondary(void)
 
enable_mmu();
mmgrab(mm);
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
current->active_mm = mm;
 #ifdef CONFIG_MMU
enter_lazy_tlb(mm, current);
diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
index 9bf5cea3bae4..fcea72019df7 100644
--- a/arch/xtensa/kernel/smp.c
+++ b/arch/xtensa/kernel/smp.c
@@ -135,7 +135,7 @@ void secondary_start_kernel(void)
 
/* All kernel threads share the same mm context. */
 
-   atomic_inc(&mm->mm_users);
+   mmget(mm);
mmgrab(mm);
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6ce46220bda2..9fc07aaf5c97 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2955,6 +2955,27 @@ static inline void mmdrop_async(struct mm_struct *mm)
}
 }
 
+/**
+ * mmget() - Pin the address space associated with a &struct mm_struct.
+ * @mm: The address space to pin.
+ *
+ * Make sure that the address space of the given &struct mm_struct doesn't
+ * go away. This does not protect against parts of the address space being
+ * modified or freed, however.
+ *
+ * Never use this function to pin this address space for an
+ * unbounded/indefinite amount of time.
+ *
+ * Use mmput() to release the reference acquired by mmget().
+ *
+ * See also  for an in-depth explanation
+ * of &mm_struct.mm_count vs &mm_struct.mm_users.
+ */
+static inline void mmget(struct mm_struct *mm)
+{
+   atomic_inc(&mm->mm_users);
+}
+
 static inline bool mmget_not_zero(struct mm_struct *mm)
 {
return atomic_inc_not_zero(&mm->mm_users);
diff --git a/kernel/fork.c b/kernel/fork.c
index 869b8ccc00bf..0e2aaa1837b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -994,7 +994,7 @@ struct mm_struct *get_task_mm(struct tas
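
For readers who want the mm_users vs mm_count distinction from the kerneldoc above spelled out, here is a rough userspace model. It is only a sketch: struct mm_model, the two "live" flags and every *_model function are made up for illustration, the bodies use C11 atomics instead of the kernel's atomic_t, and the assert in mmget_model() reflects my assumption that the caller already holds a reference. It is not the kernel implementation.

```c
/* Userspace model only: NOT kernel code. Names ending in _model are
 * hypothetical and exist just to illustrate the two reference counts. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mm_model {
	atomic_int mm_users;		/* users of the address space itself */
	atomic_int mm_count;		/* references to the struct; all users share one */
	bool address_space_live;	/* stands in for "mappings still exist" */
	bool struct_live;		/* stands in for "struct not yet freed" */
};

static void mm_model_init(struct mm_model *mm)
{
	atomic_init(&mm->mm_users, 1);
	atomic_init(&mm->mm_count, 1);
	mm->address_space_live = true;
	mm->struct_live = true;
}

/* Pin the structure only; the address space may still go away. */
static void mmgrab_model(struct mm_model *mm)
{
	atomic_fetch_add(&mm->mm_count, 1);
}

static void mmdrop_model(struct mm_model *mm)
{
	if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
		mm->struct_live = false;	/* last reference: struct would be freed */
}

/* Pin the address space. Assumption: the caller already holds a user reference. */
static void mmget_model(struct mm_model *mm)
{
	assert(atomic_load(&mm->mm_users) > 0);
	atomic_fetch_add(&mm->mm_users, 1);
}

/* Pin the address space only if it still has users. */
static bool mmget_not_zero_model(struct mm_model *mm)
{
	int old = atomic_load(&mm->mm_users);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&mm->mm_users, &old, old + 1))
			return true;
	}
	return false;
}

/* Release a user reference; the last user tears down the address space
 * and drops the single mm_count reference held on behalf of all users. */
static void mmput_model(struct mm_model *mm)
{
	if (atomic_fetch_sub(&mm->mm_users, 1) == 1) {
		mm->address_space_live = false;
		mmdrop_model(mm);
	}
}
```

In this model a holder of only an mmgrab_model() reference can still look at the struct after the last mmput_model(), but mmget_not_zero_model() will refuse to hand the address space back to it.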

Re: crash during oom reaper

2016-12-16 Thread Vegard Nossum

On 12/16/2016 03:32 PM, Michal Hocko wrote:

On Fri 16-12-16 15:25:27, Vegard Nossum wrote:

On 12/16/2016 03:00 PM, Michal Hocko wrote:

On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
[...]

Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
file-rss:112kB, shmem-rss:112kB
BUG: unable to handle kernel NULL pointer dereference at 01e8
IP: [] copy_process.part.41+0x2150/0x5580
PGD c001067 PUD c67
PMD 0
Oops: 0002 [#1] PREEMPT SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317


Hmm, so this was the oom victim initially but we have decided to kill
its child 1724 instead.


Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: 88000f9bc440 task.stack: 88000c778000
RIP: 0010:[]  []
copy_process.part.41+0x2150/0x5580


Could you match this to the kernel source please?


kernel/fork.c:629 dup_mmap()


Ok, so this is before the child is made visible so the oom reaper
couldn't have seen it.


it's atomic_dec(&inode->i_writecount), it matches up with
file_inode(file) == NULL:

(gdb) p &((struct inode *)0)->i_writecount
$1 = (atomic_t *) 0x1e8 <irq_stack_union+488>
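
The same arithmetic as the gdb expression above can be sketched in plain C: the faulting address of a field access through a NULL struct pointer is just the field's offset. struct inode_model and field_fault_address() below are hypothetical; the padding is picked so the field lands at 0x1e8, whereas the real struct inode layout depends on the kernel version and config.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct inode: only the offset matters here. */
struct inode_model {
	unsigned char pad[0x1e8];	/* everything that precedes the field */
	int i_writecount;		/* stand-in for the atomic_t counter */
};

/* Compile-time check of the layout assumption made above. */
static_assert(offsetof(struct inode_model, i_writecount) == 0x1e8,
	      "model layout must place i_writecount at 0x1e8");

/* Address that an access to base->i_writecount would touch. */
static uintptr_t field_fault_address(uintptr_t base)
{
	return base + offsetof(struct inode_model, i_writecount);
}
```

With base == 0 this gives 0x1e8, which is why a fault address that small points at a NULL struct pointer rather than at a wild one.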


is this a p9 inode?


When I looked at this before it always crashed in this spot for the very
first VMA in the mm (which happens to be the exe, which is on a 9p root fs).

I added a trace_printk() to dup_mmap() to print inode->i_sb->s_type and
the last thing I see for a new crash in the same place is:

trinity--9280   28 136345090us : copy_process.part.41: 8485ec40
-
CPU: 0 PID: 9302 Comm: trinity-c0 Not tainted 4.9.0-rc8+ #332
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014

task: 8807 task.stack: 8800099e
RIP: 0010:[]  [] 
copy_process.part.41+0x22c9/0x55b0


As you can see, the addresses match:

(gdb) p &v9fs_fs_type
$1 = (struct file_system_type *) 0x8485ec40 

So I think we can safely say that yes, it's a p9 inode.


Vegard

