Hi Jon,

I believe the real deadlock scenario is as follows:

CPU0                                   CPU1
====                                   ====
spin_lock_bh(&tn->node_list_lock);
                                       <SOFTIRQ>
                                       call_timer_fn()
                                       <IRQ IN>
                                       irq_fn()
                                       <IRQ OUT>
del_timer_sync()
// wait until tipc_node_timeout() is completed
                                       <SOFTIRQ>
                                       tipc_node_timeout()
                                       tipc_node_cleanup()
                                       spin_lock_bh(&tn->node_list_lock);

tn->node_list_lock is already held on CPU0, but del_timer_sync() must
wait until tipc_node_timeout() has completed on the other CPU, otherwise
it will never return. More precisely, as long as tipc_node_timeout() has
not finished, del_timer_sync() cannot exit from the loop below:

del_timer_sync()
        for (;;) {
                /* try_to_del_timer_sync() returns -1 as long as the
                 * timer callback is still running on another CPU,
                 * and 0 or 1 once the timer has been deactivated.
                 */
                int ret = try_to_del_timer_sync(timer);

                if (ret >= 0)
                        return ret;
                cpu_relax();
        }

However, tipc_node_timeout() on CPU1 tries to grab tn->node_list_lock,
which was already taken on CPU0 before del_timer_sync() was called. As a
result, the system locks up forever.

Regarding the root cause: as long as we don't synchronously wait for
tipc_node_timeout() to complete on other CPUs, the deadlock can never
happen. That means that if we replace del_timer_sync() with del_timer()
in tipc_node_delete(), the deadlock is eliminated. But we then need to
ensure the node instance remains valid, especially for
tipc_node_timeout(), after tipc_node_put() is called in
tipc_node_delete(). For that, we have to carefully increase and decrease
the node reference count.

In your solution, tn->node_list_lock is no longer taken in
tipc_node_timeout(); instead, grabbing the lock is moved to workqueue
context, which also eliminates the deadlock.

If we use the solution I suggested above, it would be a bit more natural
and we wouldn't need to introduce an additional workqueue. But as I
said, we would have to carefully check that the node reference count is
properly taken and released.

Actually, we originally used del_timer() to delete the node timer
asynchronously, but after some issues were found, we later changed the
deletion to a synchronous manner.

Anyway, your solution is also fine with me, so you can decide which
approach you prefer.

Thanks,
Ying

On 11/11/2018 03:05 AM, Jon Maloy wrote:
> We see the following lockdep warning:
> 
> [ 2284.078521] ======================================================
> [ 2284.078604] WARNING: possible circular locking dependency detected
> [ 2284.078604] 4.19.0+ #42 Tainted: G            E
> [ 2284.078604] ------------------------------------------------------
> [ 2284.078604] rmmod/254 is trying to acquire lock:
> [ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: 
> del_timer_sync+0x5/0xa0
> [ 2284.078604]
> [ 2284.078604] but task is already holding lock:
> [ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: 
> tipc_node_stop+0xac/0x190 [tipc]
> [ 2284.078604]
> [ 2284.078604] which lock already depends on the new lock.
> [ 2284.078604]
> [ 2284.078604]
> [ 2284.078604] the existing dependency chain (in reverse order) is:
> [ 2284.078604]
> [ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
> [ 2284.078604]        tipc_node_timeout+0x20a/0x330 [tipc]
> [ 2284.078604]        call_timer_fn+0xa1/0x280
> [ 2284.078604]        run_timer_softirq+0x1f2/0x4d0
> [ 2284.078604]        __do_softirq+0xfc/0x413
> [ 2284.078604]        irq_exit+0xb5/0xc0
> [ 2284.078604]        smp_apic_timer_interrupt+0xac/0x210
> [ 2284.078604]        apic_timer_interrupt+0xf/0x20
> [ 2284.078604]        default_idle+0x1c/0x140
> [ 2284.078604]        do_idle+0x1bc/0x280
> [ 2284.078604]        cpu_startup_entry+0x19/0x20
> [ 2284.078604]        start_secondary+0x187/0x1c0
> [ 2284.078604]        secondary_startup_64+0xa4/0xb0
> [ 2284.078604]
> [ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
> [ 2284.078604]        del_timer_sync+0x34/0xa0
> [ 2284.078604]        tipc_node_delete+0x1a/0x40 [tipc]
> [ 2284.078604]        tipc_node_stop+0xcb/0x190 [tipc]
> [ 2284.078604]        tipc_net_stop+0x154/0x170 [tipc]
> [ 2284.078604]        tipc_exit_net+0x16/0x30 [tipc]
> [ 2284.078604]        ops_exit_list.isra.8+0x36/0x70
> [ 2284.078604]        unregister_pernet_operations+0x87/0xd0
> [ 2284.078604]        unregister_pernet_subsys+0x1d/0x30
> [ 2284.078604]        tipc_exit+0x11/0x6f2 [tipc]
> [ 2284.078604]        __x64_sys_delete_module+0x1df/0x240
> [ 2284.078604]        do_syscall_64+0x66/0x460
> [ 2284.078604]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 2284.078604]
> [ 2284.078604] other info that might help us debug this:
> [ 2284.078604]
> [ 2284.078604]  Possible unsafe locking scenario:
> [ 2284.078604]
> [ 2284.078604]        CPU0                    CPU1
> [ 2284.078604]        ----                    ----
> [ 2284.078604]   lock(&(&tn->node_list_lock)->rlock);
> [ 2284.078604]                                lock((&n->timer)#2);
> [ 2284.078604]                                
> lock(&(&tn->node_list_lock)->rlock);
> [ 2284.078604]   lock((&n->timer)#2);
> [ 2284.078604]
> [ 2284.078604]  *** DEADLOCK ***
> [ 2284.078604]
> [ 2284.078604] 3 locks held by rmmod/254:
> [ 2284.078604]  #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: 
> unregister_pernet_subsys+0x15/0x30
> [ 2284.078604]  #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: 
> tipc_net_stop+0x144/0x170 [tipc]
> [ 2284.078604]  #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, 
> at: tipc_node_stop+0xac/0x19
> [...]
> 
> The reason is that the node timer handler sometimes needs to delete a
> node which has been disconnected for too long. To do this, it grabs
> the lock  'node_list_lock', which may at the same time be held by the
> generic node cleanup function, tipc_node_stop(), during module removal.
> Since the latter is calling del_timer_sync() inside the same lock, we
> have a potential deadlock.
> 
> We fix this by moving the timer-triggered delete function to work queue
> context.
> 
> Fixes: 6a939f365bdb ("tipc: Auto removal of peer down node instance")
> Signed-off-by: Jon Maloy <[email protected]>
> ---
>  net/tipc/node.c | 74 
> ++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 58 insertions(+), 16 deletions(-)
> 
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index 2afc4f8..40527e6 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -75,6 +75,12 @@ struct tipc_bclink_entry {
>       struct sk_buff_head namedq;
>  };
>  
> +struct tipc_del_node_work {
> +     struct work_struct work;
> +     struct net *net;
> +     u32 addr;
> +};
> +
>  /**
>   * struct tipc_node - TIPC node structure
>   * @addr: network address of node
> @@ -581,25 +587,61 @@ static void  tipc_node_clear_links(struct tipc_node 
> *node)
>       }
>  }
>  
> -/* tipc_node_cleanup - delete nodes that does not
> - * have active links for NODE_CLEANUP_AFTER time
> +static void tipc_exec_node_delete(struct work_struct *work)
> +{
> +     struct tipc_del_node_work *dwork;
> +     struct tipc_net *tn;
> +     struct tipc_node *n;
> +     struct net *net;
> +     u32 addr;
> +
> +     dwork = container_of(work, struct tipc_del_node_work, work);
> +     addr = dwork->addr;
> +     net = dwork->net;
> +     kfree(dwork);
> +
> +     /* In case the module has been removed: */
> +     tn = tipc_net(net);
> +     if (!tn)
> +             return;
> +     n = tipc_node_find(net, addr);
> +     if (!n || node_is_up(n))
> +             return;
> +
> +     spin_lock_bh(&tn->node_list_lock);
> +     tipc_node_write_lock(n);
> +     tipc_node_clear_links(n);
> +     tipc_node_delete_from_list(n);
> +     tipc_node_write_unlock(n);
> +     spin_unlock_bh(&tn->node_list_lock);
> +
> +     /* For work queue item */
> +     tipc_node_put(n);
> +}
> +
> +/* tipc_node_cleanup - delete node if it didn't have active
> + * links for  NODE_CLEANUP_AFTER millisecods
>   */
>  static int tipc_node_cleanup(struct tipc_node *peer)
>  {
> -     struct tipc_net *tn = tipc_net(peer->net);
> -     bool deleted = false;
> +     struct tipc_del_node_work *dwork;
>  
> -     spin_lock_bh(&tn->node_list_lock);
> -     tipc_node_write_lock(peer);
> +     if (node_is_up(peer))
> +             return false;
> +     if (!time_after(jiffies, peer->delete_at))
> +             return false;
>  
> -     if (!node_is_up(peer) && time_after(jiffies, peer->delete_at)) {
> -             tipc_node_clear_links(peer);
> -             tipc_node_delete_from_list(peer);
> -             deleted = true;
> -     }
> -     tipc_node_write_unlock(peer);
> -     spin_unlock_bh(&tn->node_list_lock);
> -     return deleted;
> +     /* Delete in work queue context to avoid potential deadlock */
> +     dwork = kzalloc(sizeof(*dwork), GFP_ATOMIC);
> +     if (!dwork)
> +             return false;
> +
> +     INIT_WORK(&dwork->work, tipc_exec_node_delete);
> +     dwork->addr = peer->addr;
> +     dwork->net = peer->net;
> +     tipc_node_get(peer);
> +     schedule_work(&dwork->work);
> +     return true;
>  }
>  
>  /* tipc_node_timeout - handle expiration of node timer
> @@ -613,8 +655,8 @@ static void tipc_node_timeout(struct timer_list *t)
>       int bearer_id;
>       int rc = 0;
>  
> -     if (!node_is_up(n) && tipc_node_cleanup(n)) {
> -             /*Removing the reference of Timer*/
> +     if (tipc_node_cleanup(n)) {
> +             /* Remove reference to this timer */
>               tipc_node_put(n);
>               return;
>       }
> 


_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
