On Thu, Jan 23, 2025 at 11:55 PM Petr Mladek <[email protected]> wrote:
>
> On Wed 2025-01-22 16:56:31, Petr Mladek wrote:
> > Anyway, I am working on a POC which would allow tracking
> > to-be-released processes. It would finish the transition only
> > when all the to-be-released processes are already using the new code.
> > It would not allow removing the disabled livepatch prematurely.
>
> Here is the POC. I am not sure if it is the right path, see
> the warning and open questions in the commit message.
>
> I am going to wait for some feedback before investing more time
> into this.
>
> The patch is against the current Linus master, aka v6.13-rc7.
>
> From ac7287d99aaeca7a4536e8ade61b9bd0c8ec7fdc Mon Sep 17 00:00:00 2001
> From: Petr Mladek <[email protected]>
> Date: Thu, 23 Jan 2025 09:04:09 +0100
> Subject: [PATCH] livepatching: Block transition until
> delayed_put_task_struct()
>
> WARNING: This patch is just a POC. It will blow up the system because
> RCU callbacks are handled in softirq context, which is by default
> invoked on exit from IRQ handlers, and it is not allowed
> to take sleeping locks there. See the backtrace at the end
> of the commit message.
>
> We would need to synchronize the counting of the exiting
> processes with the livepatch transition in another way.
>
> Hmm, I guess that a spin_lock is legal in softirq context.
> It might be the easiest approach.
>
> In the worst case, we would need to use a lockless
> algorithm, which might make things even more complicated.
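A spin lock should indeed be legal there: plain spin_lock() from the
RCU callback (softirq context), and spin_lock_bh() from process
context so that the softirq cannot deadlock against it on the same
CPU. A minimal sketch of that idea; the klp_exit_lock and
klp_exiting_count names below are hypothetical, not from the POC:

/*
 * Sketch only: a count of exiting tasks protected by a spin lock
 * which is safe to take from the RCU callback (softirq context).
 */
static DEFINE_SPINLOCK(klp_exit_lock);
static unsigned int klp_exiting_count;

/* Softirq context, e.g. from the delayed_put_task_struct() callback. */
static void klp_exiting_task_put(void)
{
        spin_lock(&klp_exit_lock);
        klp_exiting_count--;
        spin_unlock(&klp_exit_lock);
}

/* Process context, e.g. when trying to complete the transition. */
static bool klp_exiting_tasks_done(void)
{
        bool done;

        spin_lock_bh(&klp_exit_lock);
        done = (klp_exiting_count == 0);
        spin_unlock_bh(&klp_exit_lock);

        return done;
}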
>
> Here is the description of the problem and the solution:
>
> The livepatching core code uses a for_each_process_thread() loop for
> setting and checking the state of processes on the system. It works
> well as long as the livepatch touches only code which is used by
> processes in the task list.
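For reference, the loop in question, slightly simplified from
klp_try_complete_transition() in kernel/livepatch/transition.c:

read_lock(&tasklist_lock);
for_each_process_thread(g, task)
        if (!klp_try_switch_task(task))
                complete = false;
read_unlock(&tasklist_lock);

Any task which has already left the task list is simply invisible to
this loop.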
>
> The problem starts when the livepatch replaces code which might be
> used by a process which is no longer in the task list. It is
> mostly about processes which are being removed. They
> disappear from the list here:
>
> + release_task()
> + __exit_signal()
> + __unhash_process()
>
> There are basically two groups of problems:
>
> 1. The livepatching core no longer updates TIF_PATCH_PENDING
> and p->patch_state. In this case, the ftrace handler
> klp_ftrace_handler() might make the wrong decision and
> use an incompatible variant of the code.
I believe I was able to reproduce this issue while attempting to
trigger the panic. The warning was triggered at the following
location:
klp_ftrace_handler()
        if (unlikely(func->transition)) {
                WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
        }
[58063.291589] livepatch: enabling patch 'livepatch_61_release12'
[58063.297580] livepatch: 'livepatch_61_release12': starting patching transition
[58063.323340] ------------[ cut here ]------------
[58063.323343] WARNING: CPU: 58 PID: 3851051 at
kernel/livepatch/patch.c:98 klp_ftrace_handler+0x136/0x150
[58063.323349] Modules linked in: livepatch_61_release12(OK)
livepatch_61_release6(OK) ebtable_filter ebtables af_packet_diag
netlink_diag xt_DSCP xt_owner iptable_mangle raw_diag unix_diag
udp_diag iptable_raw xt_CT tcp_diag inet_diag cls_bpf sch_ingress
bpf_preload binfmt_misc iptable_filter bpfilter xt_conntrack nf_nat
nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 overlay af_packet bonding tls intel_rapl_msr
intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common
isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp
coretemp kvm_intel kvm irqbypass rapl vfat intel_cstate fat iTCO_wdt
xfs intel_uncore pcspkr ses enclosure input_leds i2c_i801 i2c_smbus
mei_me lpc_ich ioatdma mei mfd_core intel_pch_thermal dca acpi_ipmi
wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq acpi_pad
acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod sg mpt3sas
raid_class scsi_transport_sas megaraid_sas crct10dif_pclmul
crc32_pclmul crc32c_intel
[58063.323402] polyval_clmulni polyval_generic ghash_clmulni_intel
sha512_ssse3 aesni_intel nvme crypto_simd cryptd nvme_core t10_pi i40e
ptp pps_core ahci libahci libata deflate zlib_deflate
[58063.323413] Unloaded tainted modules:
livepatch_61_release6(OK):3369 livepatch_61_release12(OK):3370 [last
unloaded: livepatch_61_release12(OK)]
[58063.323418] CPU: 58 PID: 3851051 Comm: docker Kdump: loaded
Tainted: G S O K 6.1.52-3
[58063.323421] Hardware name: Inspur SA5212M5/SA5212M5, BIOS 4.1.20 05/05/2021
[58063.323423] RIP: 0010:klp_ftrace_handler+0x136/0x150
[58063.323425] Code: eb b3 0f 1f 44 00 00 eb b5 8b 89 f4 23 00 00 83
f9 ff 74 16 85 c9 75 89 48 8b 00 48 8d 50 90 48 39 c6 0f 85 79 ff ff
ff eb 8b <0f> 0b e9 70 ff ff ff e8 ae 24 9c 00 66 66 2e 0f 1f 84 00 00
00 00
[58063.323428] RSP: 0018:ffffa87b2367fbb8 EFLAGS: 00010046
[58063.323429] RAX: ffff8b0ee59229a0 RBX: ffff8b26fa47b000 RCX: 00000000ffffffff
[58063.323432] RDX: ffff8b0ee5922930 RSI: ffff8b2d41e53f10 RDI: ffffa87b2367fbd8
[58063.323433] RBP: ffffa87b2367fbc8 R08: 0000000000000000 R09: fffffffffffffff7
[58063.323434] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b2520eaf240
[58063.323435] R13: ffff8b26fa47b000 R14: 000000000002f240 R15: 0000000000000000
[58063.323436] FS: 0000000000000000(0000) GS:ffff8b2520e80000(0000)
knlGS:0000000000000000
[58063.323438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58063.323440] CR2: 00007f19d7eaf000 CR3: 000000187a676004 CR4: 00000000007706e0
[58063.323441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58063.323442] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58063.323444] PKRU: 55555554
[58063.323445] Call Trace:
[58063.323445] <TASK>
[58063.323449] ? show_regs.cold+0x1a/0x1f
[58063.323454] ? klp_ftrace_handler+0x136/0x150
[58063.323455] ? __warn+0x84/0xd0
[58063.323457] ? klp_ftrace_handler+0x136/0x150
[58063.323459] ? report_bug+0x105/0x180
[58063.323463] ? handle_bug+0x40/0x70
[58063.323467] ? exc_invalid_op+0x19/0x70
[58063.323469] ? asm_exc_invalid_op+0x1b/0x20
[58063.323474] ? klp_ftrace_handler+0x136/0x150
[58063.323476] ? kmem_cache_free+0x155/0x470
[58063.323479] 0xffffffffc0876099
[58063.323495] ? update_rq_clock+0x5/0x250
[58063.323498] update_rq_clock+0x5/0x250
[58063.323500] __schedule+0xed/0x8f0
[58063.323504] ? update_rq_clock+0x5/0x250
[58063.323506] ? __schedule+0xed/0x8f0
[58063.323508] ? trace_hardirqs_off+0x36/0xf0
[58063.323512] do_task_dead+0x44/0x50
[58063.323515] do_exit+0x7cd/0xb90 [livepatch_61_release6]
[58063.323525] ? xfs_inode_mark_reclaimable+0x320/0x320 [livepatch_61_release6]
[58063.323533] do_group_exit+0x35/0x90
[58063.323536] get_signal+0x909/0x950
[58063.323539] ? get_futex_key+0xa4/0x4f0
[58063.323543] arch_do_signal_or_restart+0x34/0x2a0
[58063.323547] exit_to_user_mode_prepare+0x149/0x1b0
[58063.323551] syscall_exit_to_user_mode+0x1e/0x50
[58063.323555] do_syscall_64+0x48/0x90
[58063.323557] entry_SYSCALL_64_after_hwframe+0x64/0xce
[58063.323560] RIP: 0033:0x5601df8122f3
[58063.323563] Code: Unable to access opcode bytes at 0x5601df8122c9.
[58063.323564] RSP: 002b:00007f9ce6ffccc0 EFLAGS: 00000286 ORIG_RAX:
00000000000000ca
[58063.323566] RAX: fffffffffffffe00 RBX: 000000c42053e000 RCX: 00005601df8122f3
[58063.323567] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c42053e148
[58063.323568] RBP: 00007f9ce6ffcd08 R08: 0000000000000000 R09: 0000000000000000
[58063.323569] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
[58063.323570] R13: 0000000000801000 R14: 0000000000000000 R15: 00007f9ce6ffd700
[58063.323573] </TASK>
[58063.323574] ---[ end trace 0000000000000000 ]---
>
> This might become a real problem only when the code modifies
> the semantics.
>
> 2. The livepatching core no longer checks the stack and
> could finish the livepatch transition even when these
> to-be-removed processes have not been transitioned yet.
>
> This might even cause an Oops when the to-be-removed processes
> are running code from a disabled livepatch which might
> be removed in the meantime.
>
> This patch tries to address the 2nd problem which most likely caused
> the following crash:
> [...]
>
> This patch tries to avoid the crash by tracking the number of
> to-be-released processes. They block the current transition
> until delayed_put_task_struct() is called.
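To make the discussion concrete, here is a minimal sketch of such
counting. The hook points and the klp_*_exiting_task() helper names
below are my assumptions, not necessarily what the POC does:

/* Tasks that left the task list but may still run patched code. */
static atomic_t klp_exiting_tasks = ATOMIC_INIT(0);

/* Hook in __unhash_process(), right before the task disappears
 * from for_each_process_thread().
 */
void klp_track_exiting_task(void)
{
        atomic_inc(&klp_exiting_tasks);
}

/* Hook in delayed_put_task_struct(), once the task can no longer
 * run any code.
 */
void klp_untrack_exiting_task(void)
{
        atomic_dec(&klp_exiting_tasks);
}

/* And klp_try_complete_transition() would additionally check: */
if (atomic_read(&klp_exiting_tasks))
        complete = false;

Counting unconditionally keeps the inc/dec pairs balanced even for
tasks which start exiting before a transition begins. An atomic
counter would also sidestep the sleeping-lock problem from the
WARNING above, though the ordering between the final atomic_dec()
and the transition completion would still need care.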
>
> It is just a POC. There are many open questions:
>
> 1. Does it help at all?
>
> It looks to me that release_task() is always called from another
> task. For example, exit_notify() seems to call it for dead
> children. It is not clear to me whether the released task
> is still running do_exit() at this point.
If the exiting task is a thread (but not the thread group leader), it
releases itself via release_task(), correct? Below is a trace from our
production server; it only counts the case where the exiting task
releases itself ($tsk->pid == tid), and it does fire:
$ bpftrace -e 'k:release_task {
      $tsk = (struct task_struct *)arg0;
      if ($tsk->pid == tid) {
          @stack[kstack()] = count();
      }
  }'
@stack[
    release_task+1
    kthread_exit+41
    kthread+200
    ret_from_fork+31
]: 1
@stack[
    release_task+1
    do_group_exit+53
    __x64_sys_exit_group+24
    do_syscall_64+56
    entry_SYSCALL_64_after_hwframe+100
]: 2
@stack[
    release_task+1
    do_group_exit+53
    get_signal+2313
    arch_do_signal_or_restart+52
    exit_to_user_mode_prepare+329
    syscall_exit_to_user_mode+30
    do_syscall_64+72
    entry_SYSCALL_64_after_hwframe+100
]: 20
@stack[
    release_task+1
    __x64_sys_exit+27
    do_syscall_64+56
    entry_SYSCALL_64_after_hwframe+100
]: 26
>
> Well, for example, wait_task_zombie() calls release_task()
> in some special cases.
>
> 2. Is it enough to block the transition until delayed_put_task_struct()?
>
> I do not fully understand the maze of code. It might still be too
> early.
>
> It seems that put_task_struct() can delay the release even more.
After delayed_put_task_struct(), there is still some code that needs
to be executed. It appears that blocking the transition only until
this point reduces the likelihood of the issue occurring, but does
not completely prevent it.
> Maybe we should drop the klp reference count on the final
> put_task_struct().
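If so, the natural place would seem to be __put_task_struct(), which
runs only once the last reference is gone. Roughly, and again only a
sketch reusing the hypothetical helper from above:

/* kernel/fork.c, sketch: drop the livepatch "reference" only
 * together with the final task_struct reference.
 */
void __put_task_struct(struct task_struct *tsk)
{
        /* ... existing teardown ... */
        klp_untrack_exiting_task();
        free_task(tsk);
}

Note that the final put may itself happen from atomic context (e.g.
from the delayed_put_task_struct() RCU callback), so the same locking
constraints as above would apply here too.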
>
> 3. Is this worth the effort?
I believe it’s worth the effort. If we can find a solution to mitigate
this limitation, we should definitely try to implement it.
>
> It is probably a bad idea to livepatch do_exit() in the first place.
>
> But it is not obvious why it is a problem. It would be nice to at
> least detect the problem and warn about it.
--
Regards
Yafang