From: Jason Baron <jba...@akamai.com>

The unix_dgram_poll() routine calls sock_poll_wait() not only on the wait
queue of the socket s that poll() was called on, but also on the peer_wait
queue of the remote peer socket, if s is connected. Thus, if we call
poll()/select()/epoll() on s, there are a couple of code paths in which the
remote peer socket s2 and its peer_wait queue can be freed before
poll()/select()/epoll() have a chance to remove themselves from s2's wait
queue.

The remote peer's socket and associated wait queues can be freed via:

1. If s calls connect() to connect to a new socket other than s2, it will
drop its reference on s2, and thus a close() on s2 will free it.

2. If we call close() on s2, then a subsequent sendmsg() from s will drop
the final reference to s2, allowing it to be freed.
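
For illustration, the two sequences above can be sketched from userspace
roughly as follows. This is only a sketch under assumptions, not the test
code referenced later in this message: the helper names (bound_dgram, repro)
and the abstract-namespace addresses are hypothetical, and EPOLLOUT is
requested on the assumption that this is what makes unix_dgram_poll() reach
the write-side checks shown in the diff. On a kernel with the fix applied it
should simply return 0; on an affected kernel the final close() of the epoll
descriptor may touch freed memory.

```c
/* Userspace sketch only; helper names are illustrative, not from the patch. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Create a SOCK_DGRAM socket bound to an abstract (Linux-only) address. */
static int bound_dgram(const char *tag, struct sockaddr_un *a, socklen_t *len)
{
	int fd, n;

	memset(a, 0, sizeof(*a));
	a->sun_family = AF_UNIX;
	/* sun_path[0] stays '\0', which selects the abstract namespace. */
	n = snprintf(a->sun_path + 1, sizeof(a->sun_path) - 1,
		     "pollrepro-%s-%ld", tag, (long)getpid());
	*len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);

	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)a, *len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * which == 1: s reconnects to s3, then s2 is closed.
 * which == 2: s2 is closed, then s sends to its (now dead) peer.
 * Returns 0 if the whole sequence could be issued.
 */
static int repro(int which)
{
	struct sockaddr_un a2, a3;
	socklen_t l2, l3;
	struct epoll_event ev = { .events = EPOLLOUT };
	int s, s2, s3, ep, ret = -1;

	s = socket(AF_UNIX, SOCK_DGRAM, 0);
	s2 = bound_dgram(which == 1 ? "1-s2" : "2-s2", &a2, &l2);
	s3 = bound_dgram(which == 1 ? "1-s3" : "2-s3", &a3, &l3);
	ep = epoll_create1(0);
	if (s < 0 || s2 < 0 || s3 < 0 || ep < 0)
		goto out;

	if (connect(s, (struct sockaddr *)&a2, l2) < 0)
		goto out;
	/* Adding s to the epoll set polls it once and registers the waiter. */
	ev.data.fd = s;
	if (epoll_ctl(ep, EPOLL_CTL_ADD, s, &ev) < 0)
		goto out;

	if (which == 1) {
		if (connect(s, (struct sockaddr *)&a3, l3) < 0)
			goto out;
		close(s2);
		s2 = -1;
	} else {
		close(s2);
		s2 = -1;
		/* Expected to fail; the point is the reference drop on s2. */
		(void)send(s, "x", 1, MSG_DONTWAIT | MSG_NOSIGNAL);
	}
	ret = 0;
out:
	/* Closing the epoll fd is where the waiter is unregistered. */
	if (ep >= 0)
		close(ep);
	if (s >= 0)
		close(s);
	if (s2 >= 0)
		close(s2);
	if (s3 >= 0)
		close(s3);
	return ret;
}
```

Calling repro(1) and repro(2) from a small main() exercises each case once;
a slab debugging option in the kernel would presumably make a stale wait
queue entry easier to observe.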

Address this issue by reverting unix_dgram_poll() to register only with
the wait queue associated with s, dropping the second sock_poll_wait()
registration on the remote peer socket's wait queue. This presents the
expected semantics to poll()/select()/epoll().

This works because we will continue to get POLLOUT wakeups from
unix_write_space(), which is called via sock_wfree(). In fact, we avoid having
two wakeup calls here for every buffer we read, since unix_dgram_recvmsg()
unconditionally calls wake_up_interruptible_sync_poll() on its 'peer_wait' queue
and we will no longer be polling on that queue. So I think this should be
more performant than the current code. We also avoid the second poll
registration here.
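
As a rough userspace illustration of the POLLOUT behaviour relied on here,
the sketch below fills the peer's receive queue until a non-blocking send
fails, checks that poll() no longer reports POLLOUT on the sender, then
drains the peer and checks that POLLOUT is reported again. The function
names and the abstract address are hypothetical, and the check only looks at
the readiness state returned by poll(); it does not prove which kernel path
delivered a wakeup.

```c
/* Userspace sketch only; names are illustrative, not from the patch. */
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Return 1 if poll() reports POLLOUT on fd within timeout_ms, else 0. */
static int pollout_ready(int fd, int timeout_ms)
{
	struct pollfd p = { .fd = fd, .events = POLLOUT };
	int n = poll(&p, 1, timeout_ms);

	return n > 0 && (p.revents & POLLOUT) != 0;
}

/* Returns 0 if s is not writable while full and writable after a drain. */
static int check_pollout_cycle(void)
{
	struct sockaddr_un a;
	socklen_t len;
	char buf[16];
	int s, s2, n, i, full_writable, drained_writable, ret = -1;

	memset(&a, 0, sizeof(a));
	a.sun_family = AF_UNIX;
	/* Leading '\0' in sun_path selects the Linux abstract namespace. */
	n = snprintf(a.sun_path + 1, sizeof(a.sun_path) - 1,
		     "pollout-%ld", (long)getpid());
	len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);

	s = socket(AF_UNIX, SOCK_DGRAM, 0);
	s2 = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (s < 0 || s2 < 0)
		goto out;
	if (bind(s2, (struct sockaddr *)&a, len) < 0)
		goto out;
	if (connect(s, (struct sockaddr *)&a, len) < 0)
		goto out;

	/* Queue datagrams on s2 until the kernel refuses another one. */
	for (i = 0; i < 100000; i++)
		if (send(s, "x", 1, MSG_DONTWAIT | MSG_NOSIGNAL) < 0)
			break;
	if (i == 100000 || errno != EAGAIN)
		goto out;
	full_writable = pollout_ready(s, 0);

	/* Reader side: consume everything that was queued. */
	while (recv(s2, buf, sizeof(buf), MSG_DONTWAIT) >= 0)
		;
	drained_writable = pollout_ready(s, 1000);

	ret = (!full_writable && drained_writable) ? 0 : -1;
out:
	if (s >= 0)
		close(s);
	if (s2 >= 0)
		close(s2);
	return ret;
}
```

check_pollout_cycle() drains the whole queue rather than a single datagram
so that the result does not depend on the configured queue length or send
buffer size.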

unix_write_space() should probably be enhanced such that it checks for the
unix_recvq_full() condition as well. In fact, it should probably look for
some fraction of that buffer being free, as is done in unix_writable(). But I'm
considering that a separate enhancement from fixing this issue.

I've tested this by specifically reproducing cases #1 and #2 above as well as
by running the test code here: https://lkml.org/lkml/2015/9/13/195

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 net/unix/af_unix.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..c1ae595 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2441,7 +2441,6 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
        other = unix_peer_get(sk);
        if (other) {
                if (unix_peer(other) != sk) {
-                       sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
                        if (unix_recvq_full(other))
                                writable = 0;
                }
-- 
1.8.2.rc2
