On Mon, Jun 25, 2018 at 08:34:17AM -0700, John Fastabend wrote:
> First, in tcp_close, reduce the scope of sk_callback_lock(): the
> lock is only needed to protect the maps list; the ingress and cork
> lists are protected by the sock lock. Having the lock in wider scope
> is harmless but may confuse the reader into inferring it is in fact
> needed.
>
> Next, in sock_hash_delete_elem() the pattern is as follows,
>
> sock_hash_delete_elem()
> [...]
> spin_lock(bucket_lock)
> l = lookup_elem_raw()
> if (l)
> hlist_del_rcu()
> write_lock(sk_callback_lock)
> .... destroy psock ...
> write_unlock(sk_callback_lock)
> spin_unlock(bucket_lock)
>
> The ordering is necessary because we only know the {p}sock after
> dereferencing the hash table, which we can't do unless we hold the
> bucket lock. Once we have the bucket lock and the psock element, we
> delete it from the hashmap to ensure any other path doing a lookup
> will fail. Finally, the refcnt is decremented and, if zero, the
> psock is destroyed.
>
> In parallel with the above (or with free'ing the map) a tcp close
> event may trigger tcp_close(), which at the moment omits the bucket
> lock altogether (oops!). The flow looks like this,
>
> bpf_tcp_close()
> [...]
> write_lock(sk_callback_lock)
> for each psock->maps // list of maps this sock is part of
> hlist_del_rcu(ref_hash_node);
> .... destroy psock ...
> write_unlock(sk_callback_lock)
>
> Obviously, and demonstrated by syzbot, this is broken because
> we can have multiple threads deleting entries via hlist_del_rcu().
>
> To fix this we might be tempted to wrap the hlist operation in a
> bucket lock, but that would create a lock inversion problem. In
> summary, to follow the locking rules the psock's maps list needs the
> sk_callback_lock, but we need the bucket lock to do the
> hlist_del_rcu. To resolve the lock inversion problem, pop the head
> of the maps list repeatedly and remove the reference until no more
> are left. If a delete happens in parallel from the BPF API that is
> OK as well, because it will do a similar action: look up the element
> in the map/hash, delete it from the map/hash, and decrement the
> refcnt. We check for this case before doing a destroy on the psock
> to ensure we don't have two threads tearing down a psock. The new
> logic is as follows,
>
> bpf_tcp_close()
> e = psock_map_pop(psock->maps) // done with sk_callback_lock
> bucket_lock() // lock hash list bucket
> l = lookup_elem_raw(head, hash, key, key_size);
> if (l) {
> // only get here if element was not already removed
> hlist_del_rcu()
> ... destroy psock...
> }
> bucket_unlock()
>
> And finally, for all the above to work, add the missing
> sk_callback_lock around smap_list_remove() in
> sock_hash_ctx_update_elem(). Otherwise delete and update may corrupt
> the maps list. Then add RCU annotations and use
> rcu_dereference/rcu_assign_pointer to manage values relying on RCU,
> so that the object is not free'd from sock_hash_free() while it is
> being referenced in bpf_tcp_close().
>
> (As an aside, the sk_callback_lock serves two purposes. The first is
> to update the sock callbacks sk_data_ready, sk_write_space, etc.
> The second is to protect the psock 'maps' list. The 'maps' list is
> used, as shown above, to delete all map/hash references to a sock
> when the sock is closed.)
>
> Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend <john.fastab...@gmail.com>
Acked-by: Martin KaFai Lau <ka...@fb.com>