On Wed 2025-01-22 16:56:31, Petr Mladek wrote:
> Anyway, I am working on a POC which would allow to track
> to-be-released processes. It would finish the transition only
> when all the to-be-released processes already use the new code.
> It won't allow to remove the disabled livepatch prematurely.

Here is the POC. I am not sure if it is the right path, see
the warning and open questions in the commit message.

I am going to wait for some feedback before investing more time
into this.

The patch is against current Linus' master, aka, v6.13-rc7.

>From ac7287d99aaeca7a4536e8ade61b9bd0c8ec7fdc Mon Sep 17 00:00:00 2001
From: Petr Mladek <[email protected]>
Date: Thu, 23 Jan 2025 09:04:09 +0100
Subject: [PATCH] livepatching: Block transition until
 delayed_put_task_struct()

WARNING: This patch is just a POC. It will blow up the system because
        RCU callbacks are handled by softirq context which are handled
        by default on exit from IRQ handlers. And it is not allowed
        to take sleeping locks here, see the backtrace at the end
        of the commit message.

        We would need to synchronize the counting of the exiting
        processes with the livepatch transition another way.

        Hmm, I guess that spin_lock is legal in softirq context.
        It might be the easiest approach.

        In the worst case, we would need to use a lock less
        algorithm which might make things even more complicated.

Here is the description of the problem and the solution:

The livepatching core code uses for_each_process_thread() cycle for setting
and checking the state of processes on the system. It works well as long
as the livepatch touches only code which is used only by processes in
the task list.

The problem starts when the livepatch replaces code which might be
used by a process which is not longer in the task list. It is
mostly about the processes which are being removed. They
disapper from the list here:

        + release_task()
          + __exit_signal()
            + __unhash_process()

There are basically two groups of problems:

1. The livepatching core does not longer updates TIF_PATCH_PENDING
   and p->patch_state. In this case, the ftrace handler
   klp_check_stack_func() might do wrong decision and
   use an incompatible variant of the code.

   This might become a real problem only when the code modifies
   the semantic.

2. The livepatching core does not longer check the stack and
   could finish the livepatch transition even when these
   to-be-removed processes have not been transitioned yet.

   This might even cause Oops when the to-be-removed processes
   are running a code from a disabled livepatch which might
   be removed in the meantime.

This patch tries to address the 2nd problem which most likely caused
the following crash:

[156821.048318] livepatch: enabling patch 'livepatch_61_release12'
[156821.061580] livepatch: 'livepatch_61_release12': starting patching
transition
[156821.122212] livepatch: 'livepatch_61_release12': patching complete
[156821.175871] kernel tried to execute NX-protected page - exploit
attempt? (uid: 10524)
[156821.176011] BUG: unable to handle page fault for address: ffffffffc0ded7fa
[156821.176121] #PF: supervisor instruction fetch in kernel mode
[156821.176211] #PF: error_code(0x0011) - permissions violation
[156821.176302] PGD 986c15067 P4D 986c15067 PUD 986c17067 PMD
184f53b067 PTE 800000194c08e163
[156821.176435] Oops: 0011 [#1] PREEMPT SMP NOPTI
[156821.176506] CPU: 70 PID: 783972 Comm: java Kdump: loaded Tainted:
G S      W  O  K    6.1.52-3 #3.pdd
[156821.176654] Hardware name: Inspur SA5212M5/SA5212M5, BIOS 4.1.20 05/05/2021
[156821.176766] RIP: 0010:0xffffffffc0ded7fa
[156821.176841] Code: 0a 00 00 48 89 42 08 48 89 10 4d 89 a6 08 0a 00
00 4c 89 f7 4d 89 a6 10 0a 00 00 4d 8d a7 08 0a 00 00 4d 89 fe e8 00
00 00 00 <49> 8b 87 08 0a 00 00 48 2d 08 0a 00 00 4d 39 ec 75 aa 48 89
df e8
[156821.177138] RSP: 0018:ffffba6f273dbd30 EFLAGS: 00010282
[156821.177222] RAX: 0000000000000000 RBX: ffff94cd316f0000 RCX:
000000008020000d
[156821.177338] RDX: 000000008020000e RSI: 000000008020000d RDI:
ffff94cd316f0000
[156821.177452] RBP: ffffba6f273dbd88 R08: ffff94cd316f13f8 R09:
0000000000000001
[156821.177567] R10: 0000000000000000 R11: 0000000000000000 R12:
ffffba6f273dbd48
[156821.177682] R13: ffffba6f273dbd48 R14: ffffba6f273db340 R15:
ffffba6f273db340
[156821.177797] FS:  0000000000000000(0000) GS:ffff94e321180000(0000)
knlGS:0000000000000000
[156821.177926] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[156821.178019] CR2: ffffffffc0ded7fa CR3: 000000015909c006 CR4:
00000000007706e0
[156821.178133] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[156821.178248] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[156821.178363] PKRU: 55555554
[156821.178407] Call Trace:
[156821.178449]  <TASK>
[156821.178492]  ? show_regs.cold+0x1a/0x1f
[156821.178559]  ? __die_body+0x20/0x70
[156821.178617]  ? __die+0x2b/0x37
[156821.178669]  ? page_fault_oops+0x136/0x2b0
[156821.178734]  ? search_bpf_extables+0x63/0x90
[156821.178805]  ? search_exception_tables+0x5f/0x70
[156821.178881]  ? kernelmode_fixup_or_oops+0xa2/0x120
[156821.178957]  ? __bad_area_nosemaphore+0x176/0x1b0
[156821.179034]  ? bad_area_nosemaphore+0x16/0x20
[156821.179105]  ? do_kern_addr_fault+0x77/0x90
[156821.179175]  ? exc_page_fault+0xc6/0x160
[156821.179239]  ? asm_exc_page_fault+0x27/0x30
[156821.179310]  do_group_exit+0x35/0x90
[156821.179371]  get_signal+0x909/0x950
[156821.179429]  ? wake_up_q+0x50/0x90
[156821.179486]  arch_do_signal_or_restart+0x34/0x2a0
[156821.183207]  exit_to_user_mode_prepare+0x149/0x1b0
[156821.186963]  syscall_exit_to_user_mode+0x1e/0x50
[156821.190723]  do_syscall_64+0x48/0x90
[156821.194500]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[156821.198195] RIP: 0033:0x7f967feb5a35
[156821.201769] Code: Unable to access opcode bytes at 0x7f967feb5a0b.
[156821.205283] RSP: 002b:00007f96664ee670 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[156821.208790] RAX: fffffffffffffe00 RBX: 00007f967808a650 RCX:
00007f967feb5a35
[156821.212305] RDX: 000000000000000f RSI: 0000000000000080 RDI:
00007f967808a654
[156821.215785] RBP: 00007f96664ee6c0 R08: 00007f967808a600 R09:
0000000000000007
[156821.219273] R10: 0000000000000000 R11: 0000000000000246 R12:
00007f967808a600
[156821.222727] R13: 00007f967808a628 R14: 00007f967f691220 R15:
00007f96664ee750
[156821.226155]  </TASK>
[156821.229470] Modules linked in: livepatch_61_release12(OK)
ebtable_filter ebtables af_packet_diag netlink_diag xt_DSCP xt_owner
iptable_mangle iptable_raw xt_CT cls_bpf sch_ingress bpf_preload
binfmt_misc raw_diag unix_diag tcp_diag udp_diag inet_diag
iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack_netlink
nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay af_packet
bonding tls intel_rapl_msr intel_rapl_common intel_uncore_frequency
intel_uncore_frequency_common isst_if_common skx_edac nfit
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
rapl vfat fat intel_cstate iTCO_wdt xfs intel_uncore pcspkr ses
enclosure mei_me i2c_i801 input_leds lpc_ich acpi_ipmi ioatdma
i2c_smbus mei mfd_core dca wmi ipmi_si intel_pch_thermal ipmi_devintf
ipmi_msghandler acpi_cpufreq acpi_pad acpi_power_meter ip_tables ext4
mbcache jbd2 sd_mod sg mpt3sas raid_class scsi_transport_sas
megaraid_sas crct10dif_pclmul crc32_pclmul crc32c_intel
polyval_clmulni polyval_generic
[156821.229555]  ghash_clmulni_intel sha512_ssse3 aesni_intel
crypto_simd cryptd nvme nvme_core t10_pi i40e ptp pps_core ahci
libahci libata deflate zlib_deflate
[156821.259012] Unloaded tainted modules:
livepatch_61_release6(OK):14089 livepatch_61_release12(OK):14088 [last
unloaded: livepatch_61_release6(OK)]
[156821.275421] CR2: ffffffffc0ded7fa

This patch tries to avoid the crash by tracking the number of
to-be-released processes. They block the current transition
until delayed_put_task_struct() is called.

It is just a POC. There are many open questions:

1. Does it help at all?

   It looks to me that release_task() is always called from another
   task. For example, exit_notify() seems to call it for a dead
   childern. It is not clear to me whether the released task
   is still running do_exit() at this point.

   Well, for example, wait_task_zombie() calls release_task()
   in some special cases.

2. Is it enough to block the transition until delayed_put_task_struct()?

   I do not fully understand the maze of code. It might still be too
   early.

   It seems that put_task_struct() can delay the release even more.
   Mabye, we should drop the klp reference count when it is final
   put_task_struct().

3. Is this worth the effort?

   It is probably a bad idea to livepatch do_exit() in the first place.

   But it is not obvious why it is a problem. It would be nice to at
   least detect the problem and warn about it.

Finally, here is the backtrace showing the problem with taking klp_mutex in
the RCU handler.

[    0.986614][    C3] =============================
[    0.987234][    C3] [ BUG: Invalid wait context ]
[    0.987882][    C3] 6.13.0-rc7-default+ #234 Tainted: G        W
[    0.988733][    C3] -----------------------------
[    0.989219][    C3] swapper/3/0 is trying to lock:
[    0.989698][    C3] ffffffff86447030 (klp_mutex){+.+.}-{4:4}, at: 
klp_put_releasing_task+0x1b/0x90
[    0.990283][    C3] other info that might help us debug this:
[    0.990283][    C3] context-{3:3}
[    0.990283][    C3] 1 lock held by swapper/3/0:
[    0.990283][    C3]  #0: ffffffff86377fc0 (rcu_callback){....}-{0:0}, at: 
rcu_do_batch+0x1a8/0xa40
[    0.990283][    C3] stack backtrace:
[    0.990283][    C3] CPU: 3 UID: 0 PID: 0 Comm: swapper/3 Tainted: G        W 
         6.13.0-rc7-default+ #234 bfeca4b35f98fc672d5ef9f6b720fe908580ae2c
[    0.990283][    C3] Tainted: [W]=WARN
[    0.990283][    C3] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
[    0.990283][    C3] Call Trace:
[    0.990283][    C3]  <IRQ>
[    0.990283][    C3]  dump_stack_lvl+0x6c/0xa0
[    0.990283][    C3]  __lock_acquire+0x919/0xb70
[    0.990283][    C3]  lock_acquire.part.0+0xad/0x220
[    0.990283][    C3]  ? klp_put_releasing_task+0x1b/0x90
[    0.990283][    C3]  ? rcu_is_watching+0x11/0x50
[    0.990283][    C3]  ? lock_acquire+0x107/0x140
[    0.990283][    C3]  ? klp_put_releasing_task+0x1b/0x90
[    0.990283][    C3]  __mutex_lock+0xb5/0xe00
[    0.990283][    C3]  ? klp_put_releasing_task+0x1b/0x90
[    0.990283][    C3]  ? __lock_acquire+0x551/0xb70
[    0.990283][    C3]  ? klp_put_releasing_task+0x1b/0x90
[    0.990283][    C3]  ? lock_acquire.part.0+0xbd/0x220
[    0.990283][    C3]  ? klp_put_releasing_task+0x1b/0x90
[    0.990283][    C3]  klp_put_releasing_task+0x1b/0x90
[    0.990283][    C3]  delayed_put_task_struct+0x4b/0x150
[    0.990283][    C3]  ? rcu_do_batch+0x1d2/0xa40
[    0.990283][    C3]  rcu_do_batch+0x1d4/0xa40
[    0.990283][    C3]  ? rcu_do_batch+0x1a8/0xa40
[    0.990283][    C3]  ? lock_is_held_type+0xd9/0x130
[    0.990283][    C3]  rcu_core+0x3bb/0x4f0
[    0.990283][    C3]  handle_softirqs+0xe2/0x400
[    0.990283][    C3]  __irq_exit_rcu+0xd9/0x150
[    0.990283][    C3]  irq_exit_rcu+0xe/0x30
[    0.990283][    C3]  sysvec_apic_timer_interrupt+0x8d/0xb0
[    0.990283][    C3]  </IRQ>
[    0.990283][    C3]  <TASK>
[    0.990283][    C3]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[    0.990283][    C3] RIP: 0010:pv_native_safe_halt+0xf/0x20
[    0.990283][    C3] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d 03 92 2f 00 fb f4 <c3> 
cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[    0.990283][    C3] RSP: 0000:ffffb95bc00c3eb8 EFLAGS: 00000246
[    0.990283][    C3] RAX: 0000000000000111 RBX: 00000000001fd2cc RCX: 
0000000000000000
[    0.990283][    C3] RDX: 0000000000000000 RSI: ffffffff85f06d6a RDI: 
ffffffff85ed7907
[    0.990283][    C3] RBP: ffff9cd980883740 R08: 0000000000000001 R09: 
0000000000000000
[    0.990283][    C3] R10: 0000000000000001 R11: 0000000000000001 R12: 
0000000000000000
[    0.990283][    C3] R13: 0000000000000000 R14: 0000000000000000 R15: 
0000000000000000
[    0.990283][    C3]  default_idle+0x9/0x20
[    0.990283][    C3]  default_idle_call+0x84/0x1e0
[    0.990283][    C3]  cpuidle_idle_call+0x134/0x170
[    0.990283][    C3]  ? tsc_verify_tsc_adjust+0x45/0xd0
[    0.990283][    C3]  do_idle+0x93/0xf0
[    0.990283][    C3]  cpu_startup_entry+0x29/0x30
[    0.990283][    C3]  start_secondary+0x121/0x140
[    0.990283][    C3]  common_startup_64+0x13e/0x141
[    0.990283][    C3]  </TASK>

Signed-off-by: Petr Mladek <[email protected]>
---
 include/linux/livepatch.h     |  3 ++
 include/linux/sched.h         |  1 +
 kernel/exit.c                 |  5 ++
 kernel/livepatch/transition.c | 89 +++++++++++++++++++++++++++++------
 4 files changed, 83 insertions(+), 15 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 51a258c24ff5..63e9e56ca6fe 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -201,6 +201,9 @@ void klp_module_going(struct module *mod);
 void klp_copy_process(struct task_struct *child);
 void klp_update_patch_state(struct task_struct *task);
 
+void klp_get_releasing_task(struct task_struct *task);
+void klp_put_releasing_task(struct task_struct *task);
+
 static inline bool klp_patch_pending(struct task_struct *task)
 {
        return test_tsk_thread_flag(task, TIF_PATCH_PENDING);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64934e0830af..d8a587208212 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1542,6 +1542,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_LIVEPATCH
        int patch_state;
+       bool klp_exit_counted;
 #endif
 #ifdef CONFIG_SECURITY
        /* Used by LSM modules for access restriction: */
diff --git a/kernel/exit.c b/kernel/exit.c
index 1dcddfe537ee..a2a9672077d5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -224,6 +224,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
        rethook_flush_task(tsk);
        perf_event_delayed_put(tsk);
        trace_sched_process_free(tsk);
+       klp_put_releasing_task(tsk);
        put_task_struct(tsk);
 }
 
@@ -242,6 +243,10 @@ void release_task(struct task_struct *p)
        struct task_struct *leader;
        struct pid *thread_pid;
        int zap_leader;
+
+       /* Block the transition until the very end. */
+       klp_get_releasing_task(p);
+
 repeat:
        /* don't need to get the RCU readlock here - the process is dead and
         * can't be modifying its own credentials. But shut RCU-lockdep up */
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index ba069459c101..6403af34f231 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -25,6 +25,14 @@ struct klp_patch *klp_transition_patch;
 
 static int klp_target_state = KLP_TRANSITION_IDLE;
 
+/*
+ * Allow to livepatch do_exit() function by counting processes which
+ * are being removed from the task list. They will block the transition
+ * almost until the task struct is released.
+ */
+unsigned int klp_releasing_tasks_cnt;
+bool klp_track_releasing_tasks;
+
 static unsigned int klp_signals_cnt;
 
 /*
@@ -87,7 +95,7 @@ static void klp_synchronize_transition(void)
  * The transition to the target patch state is complete.  Clean up the data
  * structures.
  */
-static void klp_complete_transition(void)
+static void __klp_complete_transition(void)
 {
        struct klp_object *obj;
        struct klp_func *func;
@@ -156,6 +164,25 @@ static void klp_complete_transition(void)
        klp_transition_patch = NULL;
 }
 
+static void klp_complete_transition(void)
+{
+       struct klp_patch *patch;
+
+       klp_cond_resched_disable();
+       patch = klp_transition_patch;
+       __klp_complete_transition();
+
+       /*
+        * It would make more sense to free the unused patches in
+        * klp_complete_transition() but it is called also
+        * from klp_cancel_transition().
+        */
+       if (!patch->enabled)
+               klp_free_patch_async(patch);
+       else if (patch->replace)
+               klp_free_replaced_patches_async(patch);
+}
+
 /*
  * This is called in the error path, to cancel a transition before it has
  * started, i.e. klp_init_transition() has been called but
@@ -171,7 +198,7 @@ void klp_cancel_transition(void)
                 klp_transition_patch->mod->name);
 
        klp_target_state = KLP_TRANSITION_UNPATCHED;
-       klp_complete_transition();
+       __klp_complete_transition();
 }
 
 /*
@@ -452,7 +479,6 @@ void klp_try_complete_transition(void)
 {
        unsigned int cpu;
        struct task_struct *g, *task;
-       struct klp_patch *patch;
        bool complete = true;
 
        WARN_ON_ONCE(klp_target_state == KLP_TRANSITION_IDLE);
@@ -507,20 +533,14 @@ void klp_try_complete_transition(void)
                return;
        }
 
-       /* Done!  Now cleanup the data structures. */
-       klp_cond_resched_disable();
-       patch = klp_transition_patch;
-       klp_complete_transition();
-
        /*
-        * It would make more sense to free the unused patches in
-        * klp_complete_transition() but it is called also
-        * from klp_cancel_transition().
+        * All tasks in the task list are migrated. Stop counting releasing
+        * processes. The last one would finish the transition when any.
         */
-       if (!patch->enabled)
-               klp_free_patch_async(patch);
-       else if (patch->replace)
-               klp_free_replaced_patches_async(patch);
+       klp_track_releasing_tasks = false;
+
+       /* Done!  Now cleanup the data structures. */
+       klp_complete_transition();
 }
 
 /*
@@ -582,6 +602,8 @@ void klp_init_transition(struct klp_patch *patch, int state)
 
        klp_transition_patch = patch;
 
+       klp_track_releasing_tasks = true;
+
        /*
         * Set the global target patch state which tasks will switch to.  This
         * has no effect until the TIF_PATCH_PENDING flags get set later.
@@ -715,6 +737,43 @@ void klp_copy_process(struct task_struct *child)
        child->patch_state = current->patch_state;
 }
 
+void klp_get_releasing_task(struct task_struct* p)
+{
+       mutex_lock(&klp_mutex);
+
+       if (klp_track_releasing_tasks) {
+               klp_releasing_tasks_cnt++;
+               p->klp_exit_counted = true;
+       }
+
+       mutex_unlock(&klp_mutex);
+}
+
+void klp_put_releasing_task(struct task_struct *p)
+{
+       mutex_lock(&klp_mutex);
+
+       if (!p->klp_exit_counted)
+               goto out;
+
+       if (WARN_ON_ONCE(!klp_releasing_tasks_cnt))
+               goto out;
+
+       if (--klp_releasing_tasks_cnt)
+               goto out;
+
+       /*
+        * Do not finish the transition when there are still non-migrated
+        * processes in the task list.
+        */
+       if (klp_track_releasing_tasks)
+               goto out;
+
+       klp_complete_transition();
+out:
+       mutex_unlock(&klp_mutex);
+}
+
 /*
  * Drop TIF_PATCH_PENDING of all tasks on admin's request. This forces an
  * existing transition to finish.
-- 
2.48.1


Reply via email to