[PATCH] module: invoke kobject uevent before sending LIVE notification

2021-01-12 Thread Adam Zabrocki
The recent change "module: delay kobject uevent until after module init
call", while helping avoid a race between udev/systemd and the module
loader, made it unnecessarily more difficult to monitor kernel module
integrity by out-of-tree projects such as Linux Kernel Runtime Guard.

Specifically, that change delayed the kobject uevent too far, until after
sending the MODULE_STATE_LIVE notification.  As the uevent modifies internal
state of the KOBJ itself, this violated the assumption (not guaranteed, yet
handy while we can maintain it) that the KOBJ remains consistent and can be
integrity-checked as soon as the module is LIVE.

To make all of these projects happy at once, move the kobject KOBJ_ADD
uevent to just before sending the MODULE_STATE_LIVE notification.

Fixes: 38dc717e9715 ("module: delay kobject uevent until after module init call")
Signed-off-by: Adam Zabrocki 
---
 kernel/module.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 4bf30e4b3eaa..7d56b1b07237 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3681,14 +3681,14 @@ static noinline int do_init_module(struct module *mod)
dump_stack();
}

+   /* Delay uevent until module has finished its init routine */
+   kobject_uevent(&mod->mkobj.kobj, KOBJ_ADD);
+
    /* Now it's a first class citizen! */
    mod->state = MODULE_STATE_LIVE;
    blocking_notifier_call_chain(&module_notify_list,
                                 MODULE_STATE_LIVE, mod);

-   /* Delay uevent until module has finished its init routine */
-   kobject_uevent(&mod->mkobj.kobj, KOBJ_ADD);
-
/*
 * We need to finish all async code before the module init sequence
 * is done.  This has potential to deadlock.  For example, a newly


Re: Linux Kernel module notification bug

2021-01-11 Thread Adam Zabrocki
Hello,

On Mon, Jan 11, 2021 at 03:20:48PM +0100, Jessica Yu wrote:
> +++ Adam Zabrocki [10/01/21 18:54 +0100]:
> > Hello,
> > 
> > I believe that the following commit introduces a subtle "functionality
> > bug":
> > 
> > "module: delay kobject uevent until after module init call":
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/kernel/module.c?id=38dc717e97153e46375ee21797aa54777e5498f3
> > 
> > The official Linux Kernel API for the kernel module activities notification 
> > has
> > been divided based on the readiness 'stage' for such module. We have the
> > following stages:
> > 
> >MODULE_STATE_LIVE,  /* Normal state. */
> >MODULE_STATE_COMING,/* Full formed, running module_init. */
> >MODULE_STATE_GOING, /* Going away. */
> >MODULE_STATE_UNFORMED,  /* Still setting it up. */
> > 
> > LIVE means that the kernel module is correctly running and all 
> > initialization
> > work has been already done. Otherwise, we have event 'COMING' or 'UNFORMED'.
> > 
> > In the described commit, creation of the KOBJECT has been moved after 
> > invoking
> > a notification of the newly formed kernel module state (LIVE). That's 
> > somehow
> > inconsistent from my understanding of the kernel modules notifiers logic.
> > Creation of the new objects (like KOBJ) should be done before notification 
> > of
> > the stage LIVE is invoked.
> 
> I'm confused. We're not creating any kobjects here. That is all done
> in mod_sysfs_setup(), which is called while the module is still
> COMING.  What that commit does is delay telling userspace about the
> module (specifically, systemd/udev) until the module is basically
> ready. Systemd was basically receiving the uevent too early, before
> the module has initialized, hence we decided to delay the uevent [1].
> 

Sorry for the confusion on my side. I was referring to the internal state of 
the KOBJ itself, which is actively modified when the uevent is sent. During 
module creation, KOBJ_ADD modifies 'kobj->state_add_uevent_sent'. Until the 
recent commit, the kernel didn't modify the KOBJ after sending the LIVE 
notification.

> > This commit breaks some of the projects that rely on the LIVE notification 
> > to
> > monitor when the newly loaded module is ready.
> 
> Hm, could you please explain specifically what is the issue you're seeing?
> What projects is it breaking?
> 

I'm specifically referring to projects which track kernel modules for 
integrity purposes (e.g. anti-rootkit tools) like Linux Kernel Runtime 
Guard.
It is possible to adapt these tools to take into account modification of the 
KOBJ after the LIVE state notification. However, from my understanding of the 
kernel module notifiers logic, the KOBJ should be fully formed at this stage.

Thanks,
Adam

> Thanks,
> 
> Jessica
> 
> [1] https://github.com/systemd/systemd/issues/17586

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl



Linux Kernel module notification bug

2021-01-10 Thread Adam Zabrocki
Hello,

I believe that the following commit introduces a subtle "functionality
bug":

"module: delay kobject uevent until after module init call":
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/kernel/module.c?id=38dc717e97153e46375ee21797aa54777e5498f3

The official Linux kernel API for kernel module activity notifications is
divided based on the readiness 'stage' of the module. We have the
following stages:

MODULE_STATE_LIVE,  /* Normal state. */
MODULE_STATE_COMING,/* Full formed, running module_init. */
MODULE_STATE_GOING, /* Going away. */
MODULE_STATE_UNFORMED,  /* Still setting it up. */

LIVE means that the kernel module is running correctly and all initialization
work has already been done. Otherwise, we have the 'COMING' or 'UNFORMED' stage.

In the described commit, creation of the KOBJECT has been moved to after
invoking a notification of the newly formed kernel module state (LIVE). That's
somewhat inconsistent with my understanding of the kernel module notifiers
logic. Creation of new objects (like the KOBJ) should be done before the
notification for the LIVE stage is invoked.

This commit breaks some of the projects that rely on the LIVE notification to
monitor when the newly loaded module is ready.

I believe that the latest point at which the 'kobject_uevent' function can be
called is just before invoking 'blocking_notifier_call_chain', e.g.:

diff --git a/kernel/module.c b/kernel/module.c
index 4bf30e4b3eaa..7d56b1b07237 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3681,14 +3681,14 @@ static noinline int do_init_module(struct module *mod)
dump_stack();
}

+   /* Delay uevent until module has finished its init routine */
+   kobject_uevent(&mod->mkobj.kobj, KOBJ_ADD);
+
    /* Now it's a first class citizen! */
    mod->state = MODULE_STATE_LIVE;
    blocking_notifier_call_chain(&module_notify_list,
                                 MODULE_STATE_LIVE, mod);

-   /* Delay uevent until module has finished its init routine */
-   kobject_uevent(&mod->mkobj.kobj, KOBJ_ADD);
-
/*
 * We need to finish all async code before the module init sequence
 * is done.  This has potential to deadlock.  For example, a newly

Thanks,
Adam

--
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl



Re: [PATCH] x86/kprobes: Fix optprobe to detect padding int3 correctly

2020-12-11 Thread Adam Zabrocki
Hi,

I've applied this patch on top of kernel 5.9.7 and verified it works 
fine:

813050e0 <ftrace_enable_sysctl>:
...
...
813052f1:   e9 fe fe ff ff  jmpq   813051f4 <ftrace_enable_sysctl+0x114>

813052f6:   cc  int3
813052f7:   cc  int3
813052f8:   cc  int3
813052f9:   cc  int3
813052fa:   cc  int3
813052fb:   cc  int3
813052fc:   cc  int3
813052fd:   cc  int3
813052fe:   cc  int3
813052ff:   cc  int3

When I install KRETPROBE on this function, it is correctly optimized:

root@pi3:~/off-debug/git/lkrg# cat /sys/kernel/debug/kprobes/list|grep 
ftrace_enable_sysctl
813050e0  r  ftrace_enable_sysctl+0x0[OPTIMIZED]
root@pi3:~/off-debug/git/lkrg# 

gef➤  x/2i ftrace_enable_sysctl
   0x813050e0 <ftrace_enable_sysctl>:     jmp    0xc062807a
   0x813050e5 <ftrace_enable_sysctl+5>:   push   r14

Thanks,
Adam

On Fri, Dec 11, 2020 at 11:39:15AM -0800, Kees Cook wrote:
> On Fri, Dec 11, 2020 at 04:04:17PM +0900, Masami Hiramatsu wrote:
> > Fix optprobe to detect padding int3 correctly.
> > 
> > Since commit 7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP
> > for linker fill bytes") changed the padding bytes between functions
> > from nop to int3, when optprobe decodes a target function it finds
> > int3 and gives up the jump optimization.
> > 
> > Instead of giving up any int3 detection, this checks whether the
> > rest of bytes to the end of the function are int3 or not. If all
> > of those are int3, those come from the linker. In that case,
> > optprobe continues jump optimization.
> > 
> > Fixes: 7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill 
> > bytes")
> > Cc: sta...@vger.kernel.org
> > Reported-by: Adam Zabrocki 
> > Signed-off-by: Masami Hiramatsu 
> 
> Reviewed-by: Kees Cook 
> 
> -- 
> Kees Cook

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl



Re: KRETPROBES are broken since kernel 5.8

2020-12-10 Thread Adam Zabrocki
Hi,

> > However, there might be another issue which I wanted to bring up / discuss - 
> > problem with optimizer. Until kernel 5.9 KRETPROBE on e.g. 
> > 'ftrace_enable_sysctl' function was correctly optimized without any 
> > problems. 
> 
> Did you check it on other functions? Did you see it only on the 
> "ftrace_enable_sysctl"?
> 

Yes, I see it in most of the functions with padding.

> > Since 5.9 it can't be optimized anymore. I didn't see any changes in the 
> > sources regarding the optimizer, neither function itself.
> > When I looked at the generated vmlinux binary, I can see that GCC generated 
> > padding at the end of this function using INT3 opcode:
> > 
> > ...
> > 8130528b:   41 bd f0 ff ff ff   mov$0xfff0,%r13d
> > 81305291:   e9 fe fe ff ff  jmpq   81305194 
> > 
> > 81305296:   cc  int3
> > 81305297:   cc  int3
> > 81305298:   cc  int3
> > 81305299:   cc  int3
> > 8130529a:   cc  int3
> > 8130529b:   cc  int3
> > 8130529c:   cc  int3
> > 8130529d:   cc  int3
> > 8130529e:   cc  int3
> > 8130529f:   cc  int3
> 
> So these int3 is generated by GCC for padding, right?
> 

I've just browsed a few commits and I've found that one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e

It looks like INT3 is now a default padding used by linker.

> > However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
> > the function as padding (and protect from NOP-padding problems).
> > 
> > I wonder, if optimizer should take this special case into account? Is it 
> > worth 
> > to still optimize such functions when they are padded with INT3?
> 
> Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and,
> in that case the optimization can confuse them.

Right. The same can happen when the text section is being actively modified. 
However, this case could be covered by running the optimizer logic under 
text_mutex.

> But if the gcc uses int3 to pad the room between functions, it should be
> reconsidered. 
> 

Looks like it's a default behavior now.

> Thank you,
>
> > If it is OK, we should backport those to stable tree.
> 
> Agreed.

It is also important to make sure that distro kernels pick up such a 
backported fix.

Thanks,
Adam

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl



Re: KRETPROBES are broken since kernel 5.8

2020-12-09 Thread Adam Zabrocki
Hi,

On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote:
> Hi Adam,
> 
> Thank you for reporting and debugging, actually we had investigated this
> issue in Aug. Please read this thread.
> 
> https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad282...@trendmicro.com
> 

Thanks for the link. I've read the entire history of the proposed fix - very 
informative :)

> We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI
> context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about
> kprobe nesting"). Yeah, it will be in the v5.10.
> 
> Could you try to test whether these commits can solve the issue?

I've applied these commits on top of kernel 5.9.7 and verified that it 
works well. Non-optimized KRETPROBES are back to life.

However, there might be another issue which I wanted to bring up / discuss - a 
problem with the optimizer. Until kernel 5.9, a KRETPROBE on e.g. the 
'ftrace_enable_sysctl' function was correctly optimized without any problems. 
Since 5.9 it can't be optimized anymore. I didn't see any relevant changes in 
the sources, neither in the optimizer nor in the function itself.
When I looked at the generated vmlinux binary, I could see that GCC generated 
padding at the end of this function using the INT3 opcode:

...
8130528b:   41 bd f0 ff ff ff   mov$0xfff0,%r13d
81305291:   e9 fe fe ff ff  jmpq   81305194 

81305296:   cc  int3
81305297:   cc  int3
81305298:   cc  int3
81305299:   cc  int3
8130529a:   cc  int3
8130529b:   cc  int3
8130529c:   cc  int3
8130529d:   cc  int3
8130529e:   cc  int3
8130529f:   cc  int3

Such padding didn't exist in this function in images generated for older 
kernels. Nevertheless, such padding is not unusual - it's pretty common.

Optimizer logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():

	/* Decode instructions */
	addr = paddr - offset;
	while (addr < paddr - offset + size) { /* Decode until function end */
		unsigned long recovered_insn;
		if (search_exception_tables(addr))
			/*
			 * Since some fixup code will jumps into this function,
			 * we can't optimize kprobe in this function.
			 */
			return 0;
		recovered_insn = recover_probed_instruction(buf, addr);
		if (!recovered_insn)
			return 0;
		kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
		insn_get_length(&insn);
		/* Another subsystem puts a breakpoint */
		if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
			return 0;
		/* Recover address */
		insn.kaddr = (void *)addr;
		insn.next_byte = (void *)(addr + insn.length);
		/* Check any instructions don't jump into target */
		if (insn_is_indirect_jump(&insn) ||
		    insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
					 DISP32_SIZE))
			return 0;
		addr += insn.length;
	}

One of the check tries to protect from the situation when another subsystem 
puts a breakpoint there as well:

/* Another subsystem puts a breakpoint */
if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
return 0;

However, that's not the case here. INT3_INSN_OPCODE is placed at the end of 
the function as padding (which also protects from NOP-padding problems).

I wonder if the optimizer should take this special case into account? Is it 
worth still optimizing such functions when they are padded with INT3?

> If it is OK, we should backport those to stable tree.

Agreed.

Thanks,
Adam

> Thank you,
> 
> On Wed, 9 Dec 2020 22:50:01 +0100
> Adam Zabrocki  wrote:
> 
> > Hello,
> > 
> > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
> > when #DB exception was raised, entry to the NMI was not fully performed. 
> > Among
> > others, the following logic was executed:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589
> > 
> > if (!user_mode(regs)) {
> > rcu_nmi_enter();
> > preempt_disable();
> > }
> > 
> > In some older kernels function ist_enter() was called instead. Inside this
> > function we can see the following logic:
> > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91
> > 
> > if (user_mode(regs)) {
> > RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");

KRETPROBES are broken since kernel 5.8

2020-12-09 Thread Adam Zabrocki
Hello,

Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
when #DB exception was raised, entry to the NMI was not fully performed. Among
others, the following logic was executed:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589

if (!user_mode(regs)) {
rcu_nmi_enter();
preempt_disable();
}

In some older kernels function ist_enter() was called instead. Inside this
function we can see the following logic:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91

if (user_mode(regs)) {
RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
} else {
/*
 * We might have interrupted pretty much anything.  In
 * fact, if we're a machine check, we can even interrupt
 * NMI processing.  We don't want in_nmi() to return true,
 * but we need to notify RCU.
 */
rcu_nmi_enter();
}

preempt_disable();

As the comment says "We don't want in_nmi() to return true, but we need to
notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled
was modified and currently we have this (function "exc_int3"):
https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630

/*
 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
 * can trigger INT3, hence poke_int3_handler() must be done
 * before. If the entry came from kernel mode, then use nmi_enter()
 * because the INT3 could have been hit in any context including
 * NMI.
 */
if (user_mode(regs)) {
idtentry_enter_user(regs);
instrumentation_begin();
do_int3_user(regs);
instrumentation_end();
idtentry_exit_user(regs);
} else {
nmi_enter();
instrumentation_begin();
trace_hardirqs_off_finish();
if (!do_int3(regs))
die("int3", regs, 0);
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
nmi_exit();
}

The root cause of this unfortunate change is this commit:

https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5

which later was modified by this commit:

https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2

and this is what we currently have in all kernels since 5.8. Essentially,
KRETPROBES are not working since these commits. We have the following logic:

asm_exc_int3() -> exc_int3():
        |
        v
        ...
        nmi_enter();
        ...
        if (!do_int3(regs))
                |
                v
do_int3() -> kprobe_int3_handler():
        |
        v
        ...
        if (!p->pre_handler || !p->pre_handler(p, regs))
                |
                v
pre_handler_kretprobe():
        ...
        if (unlikely(in_nmi())) {
                rp->nmissed++;
                return 0;
        }

Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before
invoking any registered kprobe verifies if it is not in NMI via in_nmi() call.

This bug was masked by the OPTIMIZER, which was turning most KPROBES into
FTRACE-based ones, so essentially if someone registered a KRETPROBE, the buggy
code was not invoked (the FTRACE code was executed instead). However, the
optimizer was changed and can't optimize as many functions anymore (probably
another bug which might be worth investigating), so this bug is more visible.

Thanks,
Adam

-- 
pi3 (pi3ki31ny) - pi3 (at) itsec pl
http://pi3.com.pl