On Wed, Feb 12, 2025 at 04:42:39PM +0100, Petr Mladek wrote:
> CPU1                          CPU1
> 
>                               klp_try_complete_transition()
> 
> 
> taskA:        
>  + fork()
>    + klp_copy_process()
>       child->patch_state = KLP_PATCH_UNPATCHED
> 
>                                 klp_try_switch_task(taskA)
>                                   // safe
> 
>                               child->patch_state = KLP_PATCH_PATCHED
> 
>                               all processes patched
> 
>                               klp_finish_transition()
> 
> 
>       list_add_tail_rcu(&p->thread_node,
>                         &p->signal->thread_head);
> 
> 
> BANG: The forked task has KLP_PATCH_UNPATCHED so that
>       klp_ftrace_handler() will redirect it to the old code.
> 
>       But CPU1 thinks that all tasks are migrated and is going
>       to finish the transition


Maybe klp_try_complete_transition() could iterate the tasks in two
passes?  The first pass would use rcu_read_lock().  Then if all tasks
appear to be patched, try again with tasklist_lock.

Or, we could do something completely different.  There's no need for
klp_copy_process() to copy the parent's state: a newly forked task can
be patched immediately because it has no stack.

So instead, just initialize it to KLP_TRANSITION_IDLE with
TIF_PATCH_PENDING cleared.  Then when klp_ftrace_handler() encounters a
KLP_TRANSITION_IDLE task, it considers its state to be 'klp_target_state'.

// called from copy_process()
void klp_init_task(struct task_struct *child)
{
        /* klp_ftrace_handler() will transition the task immediately */
        child->patch_state = KLP_TRANSITION_IDLE;
        clear_tsk_thread_flag(child, TIF_PATCH_PENDING);
}


klp_ftrace_handler():

                patch_state = current->patch_state;

                if (patch_state == KLP_TRANSITION_IDLE)
                        patch_state = klp_target_state;
                ...

Hm?

> I would first like to understand how exactly the stall happens.
> It is possible that even rcu_read_lock() won't help here!
> 
> If the it takes too long time to check backtraces of all pending
> processes then even rcu_read_lock() might trigger the RCU stall
> warning as well.

Yeah, based on Yafang's reply it appears there are RCU stalls either
way.

-- 
Josh

Reply via email to