On Sat, Feb 7, 2026 at 6:35 AM Michal Luczaj <[email protected]> wrote:
>
> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe
> that the socket is properly set up, which would include having a defined
> peer. IOW, there's a window when unix_stream_bpf_update_proto() can be
> called on a socket which still has unix_peer(sk) == NULL.
>
> T0 bpf                           T1 connect
> ------                           ----------
>
>                                  WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
>                                  sock_hold(newsk)
>                                  smp_mb__after_atomic()
>                                  unix_peer(sk) = newsk
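>
> For reference, the faulting read is in unix_stream_bpf_update_proto()
> (net/unix/unix_bpf.c), which grabs the peer without checking for NULL.
> A simplified sketch, not the verbatim source:
>
>         struct sock *sk_pair;
>
>         /* Racy: connect() may not have published the peer yet. */
>         sk_pair = unix_peer(sk);
>         sock_hold(sk_pair);     /* NULL pointer dereference when sk_pair == NULL */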
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
> sock_map_link+0x564/0x8b0
> sock_map_update_common+0x6e/0x340
> sock_map_update_elem_sys+0x17d/0x240
> __sys_bpf+0x26db/0x3250
> __x64_sys_bpf+0x21/0x30
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The initial idea was to move the peer assignment _before_ the sk_state
> update[1], but that involved an additional memory barrier, and changing
> the hot path was rejected. Then a check during the proto update was
> considered[2], but a follow-up discussion[3] concluded that the root
> cause is sockmap taking the wrong lock.
>
> Thus, teach sockmap about the af_unix-specific locking: instead of the
> usual lock_sock() involving sock::sk_lock, af_unix protects critical
> sections under unix_state_lock() operating on unix_sock::lock.
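>
> For reference, unix_state_lock() is just a spinlock on the unix_sock;
> roughly, from include/net/af_unix.h:
>
>         #define unix_state_lock(s)      spin_lock(&unix_sk(s)->lock)
>         #define unix_state_unlock(s)    spin_unlock(&unix_sk(s)->lock)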
>
> [1]: https://lore.kernel.org/netdev/[email protected]/
> [2]: https://lore.kernel.org/netdev/[email protected]/
> [3]: https://lore.kernel.org/netdev/[email protected]/
>
> This patch also happens to fix a deadlock that may occur when
> bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
> iter prog attempts to update a sockmap, which then ends up spinning at
> sock_map_update_elem()'s bh_lock_sock() (see also the sketch after the
> lockdep splat below):
Hmm.. this seems to be a more general problem for bpf iter vs
sockmap: bpf_iter_{tcp,udp}_seq_show() also hold lock_sock(), so
this patch's solution does not help there. We need to resolve this
regardless of the socket family.
Also, I feel lock_sock() should be taken outside of
unix_state_lock(), since the former is usually sleepable. If some
future path ended up taking them in the reverse order, it would
trigger an ABBA deadlock and we would have to revisit this problem.
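
Roughly, the ordering I'm worried about (both call sites are
hypothetical, A = sk->sk_lock, B = unix_sk(sk)->lock):

        /* the order that works: sleepable lock taken first */
        lock_sock(sk);          /* A */
        unix_state_lock(sk);    /* B */

        /* a future path built on top of this patch's locking */
        unix_state_lock(sk);    /* B */
        lock_sock(sk);          /* A: may sleep under a spinlock, ABBA vs the above */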
>
> WARNING: possible recursive locking detected
> --------------------------------------------
> test_progs/1393 is trying to acquire lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
>
> but task is already holding lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(slock-AF_UNIX);
> lock(slock-AF_UNIX);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 4 locks held by test_progs/1393:
> #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
> #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
> #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
> #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
>
> Call Trace:
> dump_stack_lvl+0x5d/0x80
> print_deadlock_bug.cold+0xc0/0xce
> __lock_acquire+0x130f/0x2590
> lock_acquire+0x14e/0x2b0
> _raw_spin_lock+0x30/0x40
> sock_map_update_elem+0xdb/0x1f0
> bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
> bpf_iter_run_prog+0x5b9/0xb00
> bpf_iter_unix_seq_show+0x1f7/0x2e0
> bpf_seq_read+0x42c/0x10d0
> vfs_read+0x171/0xb20
> ksys_read+0xff/0x200
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
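>
> To spell out the deadlock: lock_sock_fast() on an uncontended socket keeps
> sk->sk_lock.slock held while the iter prog runs, and sock_map_update_elem()
> then tries to grab that same slock via bh_lock_sock(). A rough sketch of the
> nesting, simplified and not the exact call chain:
>
>         /* bpf_iter_unix_seq_show(): fast path keeps the slock held */
>         bool slow = lock_sock_fast(sk);
>
>         /* iter prog runs and calls bpf_map_update_elem() on a sockmap */
>         bh_lock_sock(sk);       /* spins on the slock this CPU already holds */
>
>         unlock_sock_fast(sk, slow);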
>
> Suggested-by: Kuniyuki Iwashima <[email protected]>
> Suggested-by: Martin KaFai Lau <[email protected]>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
> Signed-off-by: Michal Luczaj <[email protected]>
> ---
> Keeping sparse annotations in sock_map_sk_{acquire,release}() required some
> hackery I'm not proud of. Is there a better way?
> ---
> net/core/sock_map.c | 47 +++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index b6586d9590b7..0c638b1f363a 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -12,6 +12,7 @@
> #include <linux/list.h>
> #include <linux/jhash.h>
> #include <linux/sock_diag.h>
> +#include <net/af_unix.h>
> #include <net/udp.h>
>
> struct bpf_stab {
> @@ -115,17 +116,49 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
> }
>
> static void sock_map_sk_acquire(struct sock *sk)
> -        __acquires(&sk->sk_lock.slock)
> +        __acquires(sock_or_unix_lock)
> {
> -        lock_sock(sk);
> +        if (sk_is_unix(sk)) {
> +                unix_state_lock(sk);
> +                __release(sk); /* Silence sparse. */
> +        } else {
> +                lock_sock(sk);
> +        }
> +
>         rcu_read_lock();
> }
>
> static void sock_map_sk_release(struct sock *sk)
> -        __releases(&sk->sk_lock.slock)
> +        __releases(sock_or_unix_lock)
> {
>         rcu_read_unlock();
> -        release_sock(sk);
> +
> +        if (sk_is_unix(sk)) {
> +                unix_state_unlock(sk);
> +                __acquire(sk); /* Silence sparse. */
> +        } else {
> +                release_sock(sk);
> +        }
> +}
> +
> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
> +{
> +        local_bh_disable();
> +
> +        if (sk_is_unix(sk))
> +                unix_state_lock(sk);
> +        else
> +                bh_lock_sock(sk);
> +}
> +
> +static inline void sock_map_sk_release_fast(struct sock *sk)
> +{
> +        if (sk_is_unix(sk))
> +                unix_state_unlock(sk);
> +        else
> +                bh_unlock_sock(sk);
> +
> +        local_bh_enable();
> }
>
> static void sock_map_add_link(struct sk_psock *psock,
> @@ -604,16 +637,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
>         if (!sock_map_sk_is_suitable(sk))
>                 return -EOPNOTSUPP;
>
> -        local_bh_disable();
> -        bh_lock_sock(sk);
> +        sock_map_sk_acquire_fast(sk);
>         if (!sock_map_sk_state_allowed(sk))
>                 ret = -EOPNOTSUPP;
>         else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
>                 ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
>         else
>                 ret = sock_hash_update_common(map, key, sk, flags);
> -        bh_unlock_sock(sk);
> -        local_bh_enable();
> +        sock_map_sk_release_fast(sk);
>         return ret;
> }
>
>
> --
> 2.52.0
>