Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-14 Thread Rainer Weikusat
Jason Baron  writes:
> On 10/12/2015 04:41 PM, Rainer Weikusat wrote:
>> Jason Baron  writes:
>>> On 10/05/2015 12:31 PM, Rainer Weikusat wrote:

[...]

>> OTOH, something I seriously dislike about your relaying implementation
>> is the unconditional add_wait_queue upon connect as this builds up a
>> possibly large wait queue of entities entirely uninterested in the
>> event which will need to be traversed even if peer_wait wakeups will
>> only happen if at least someone actually wants to be notified. This
>> could be changed such that the struct unix_sock member is only put onto
>> the peer's wait queue in poll and only if it hasn't already been put
>> onto it. The connection could then be severed if some kind of
>> 'disconnect' occurs.

[...]


> What about the following race?
>
> 1) thread A does poll() on f, finds the wakeup condition low, and adds
> itself to the remote peer_wait queue.
>
> 2) thread B sets the wake up condition in dgram_recvmsg(), but does not
> execute the wakeup of threads yet.
>
> 3) thread C also does a poll() on f, finds now that the wakeup condition
> is set, and hence removes f from the remote peer_wait queue.
>
> Then, thread A misses the POLLOUT, and hangs.

Indeed. This starts to turn into an interesting little problem. Based
on the code I posted, there are (AFAICT) four possible situations a
thread looking for 'peer will accept data' can find itself in:

A - supposed to mean 'it connected to the peer_wait queue'
B - the peer socket is ready to accept messages

A && B
------

Since A, no other thread can be waiting for this event and since B, A
was undone. Fine to return.


!A && B
-------

!A means 'was enqueued to peer_wait during an earlier call'. This means
any number of threads may be waiting. Because of this, do an additional
wake up after disconnecting since the current thread may never have
slept, ie, possibly, no wake up occurred yet.


A && !B
-------

Now connected to receive peer wake ups. Go to sleep.


!A && !B
--------

Someone else connected. Go to sleep.

> I understand your concern about POLLIN only waiters-I think we
> could add the 'relay callback' in dgram_poll() only for those who are
> looking for POLLOUT, and simply avoid the de-registration,

Whether or not the currently running thread wanted to write or to read
can only be determined (AIUI) if a poll_table was passed. If this was
the case and the thread didn't look for 'write space', it will never
execute the code in question, cf

/* No write status requested, avoid expensive OUT tests. */
if (wait && !(wait->key & POLL_OUT_ALL))
return mask;

As to 'concerns', I'll try to word that somewhat differently: The usual
way to handle a 'may sleep' situation is

1. Enqueue on the associated wait_queue.

2. Test the condition

2a) No sleep, dequeue and return

2b) Sleep until woken up

This means the wait queue is used to 'register' threads which want to be
woken up if a certain event happens. It's not used such that a thread
first enqueues itself on all wait queues it could ever need to be
enqueued on and an associated flag is then used to inform the code on the
way to doing a wake up whether or not any of the entries on the list
actually correspond to someone waiting for one. Even if such a flag
exists, the time needed to do the wake up is still proportional to the
size of the list.

For the given case, take a usual socketpair: It's absolutely pointless
to put these on the peer_wait queues of each other because all datagrams
sent by one go to the other, ie, a situation where a thread goes to
sleep because it has to wait until data written via unrelated sockets
has been consumed can never happen.

I'll include the patch I'm currently using below, including an
X-Signed-Off-By line to signify that this is the result of kernel
development sponsored by my employer and made available under the usual
terms. Changes since the last version:

- fixed the inverted (rather accidentally uninverted) test in
  _connect

- simplified the peer-checking logic: instead of getting the
  peer with the state locked, then doing all kinds of checks in
  the hope of avoiding a second state lock, and then possibly
  locking it again anyway, just lock the socket, do the peer
  checks followed by whatever else is necessary, then unlock it

- addressed the race described above by means of doing a wake_up
  on sk_sleep(sk) if a thread which didn't also establish the
  connection breaks it

- a comment (yohoho) explaining all of this -- I don't usually
  use comments in my own code because I cons

Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-18 Thread Rainer Weikusat
Jason Baron  writes:
> 
>> 
>> X-Signed-Off-By: Rainer Weikusat 
>> 

Sorry for the delayed reply but I had to do some catching up on what the
people who pay me consider 'useful work' :-).

> So the patches I've posted and yours both use the idea of a relaying
> the remote peer wakeup via callbacks that are internal to the net/unix,
> such that we avoid exposing the remote peer wakeup to the external
> poll()/select()/epoll(). They differ in when and how those callbacks
> are registered/unregistered.

Yes. In my opinion, that's the next best idea considering that the need
to handle this particular situation is presumably specific to the
af_unix.c code and thus doesn't warrant changes to all I/O multiplexing
implementations.

> So I think your approach here will generally keep the peer wait wakeup
> queue to its absolute minimum, by removing from that queue when
> we set POLLOUT, however it requires taking the peer waitqueue lock on
> every poll() call.
>
> So I think there are tradeoffs here vs. what I've
> posted. So for example, if there are a lot of writers against one 'server'
> socket, there is going to be a lot of lock contention with your approach
> here. So I think the performance is going to depend on the workload that
> is tested.

This approach is really completely run-of-the-mill "possibly sleep until
some event occurs" code, eg, you'll find a description of this exact
procedure on p. 157 of chapter 6 of

https://lwn.net/Kernel/LDD3/

The idea behind 'the wait queue' (insofar as I'm aware of it) is that it
will be used as a list of threads which need to be notified when the
associated event occurs. Since you seem to argue that the run-of-the-mill
algorithm is too slow for this particular case, is there anything to
back this up?

This is also sort-of a zero-sum game as the idea to enqueue on a wait
queue because the event could possibly become interesting in future
really just shifts (for n:1 connected sockets) work from the (possibly
many) clients to the single server. And considering that 'the wait'
became necessary (if it became necessary) because flow-control kicked in
to stop clients from sending more data to the server until it had time
to catch up with the already sent data, this doesn't seem like a good
idea to me.

Using a flag to signal that at least some of the members of this queue
actually want to be woken up will also only save work if it is assumed
that most threads won't ever really be waiting for this, ie, won't
execute the corresponding unix_dgram_poll code. But if that code isn't
going to be executed most of the time, anyway, why optimize it?

It's also not really necessary to take the waitqueue lock on every poll,
eg, the code in unix_dgram_poll could be changed like this:

	need_wakeup = 0;
	unix_state_lock(sk);

	other = unix_peer(sk);
	if (other && unix_peer(other) != sk) {
		if (unix_recvq_full(other)) {
			need_wakeup = !unix_peer_wake_connect(sk, other);

			if (unix_recvq_full(other)) {
				writable = 0;
				need_wakeup = 0;
			} else
				unix_peer_wake_disconnect(sk, other);
		} else
			need_wakeup = unix_peer_wake_disconnect(sk, other);
	}

	unix_state_unlock(sk);
	if (need_wakeup)
		wake_up_interruptible_poll(sk_sleep(sk), POLL_OUT_ALL);

I'm no big fan of adding complications like this based on the conjecture
that they might be useful, but this one seems justified to me. IMHO,
there are two interesting cases here:

1. The current thread won't have to wait. It should detect this quickly
   so that it can get back to work, hence, check for this before
   'preparing to wait'.

2. Nothing to be done here right now, hence, performing an additional
   check for 'still not writeable' shouldn't matter much.

Yet another 3.2.54 patch with the change above

---
--- linux-2-6.b/net/unix/af_unix.c  2015-10-18 20:41:10.829404899 +0100
+++ linux-2-6/net/unix/af_unix.c   2015-10-18 20:17:34.819134482 +0100
@@ -115,6 +115,8 @@
 #include 
 #include 
 
+#define POLL_OUT_ALL   (POLLOUT | POLLWRNORM | POLLWRBAND)
+
 static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 static DEFINE_SPINLOCK(unix_table_lock);
 static atomic_long_t unix_nr_socks;
@@ -303,6 +305,53 @@ found:
return s;
 }
 
+static int unix_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+   void *key)
+{
+   struct unix_sock 

Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-20 Thread Rainer Weikusat
Jason Baron  writes:
> On 10/18/2015 04:58 PM, Rainer Weikusat wrote:
>
> [...]
>
>> 
>> The idea behind 'the wait queue' (insofar as I'm aware of it) is that it
>> will be used as a list of threads which need to be notified when the
>> associated event occurs. Since you seem to argue that the run-of-the-mill
>> algorithm is too slow for this particular case, is there anything to
>> back this up?
>>
>
> Generally the poll() routines only add to a wait queue once at the
> beginning, and all subsequent calls to poll() simply check the wakeup
> conditions. So here you are proposing to add/remove to the wait queue on
> subsequent invocations of poll(). So the initial patch I did, continued
> in the usual pattern and only added once on registration or connect().

The code uses the private member of a wait_queue_t to record whether the
add_wait_queue was already executed, so the add/remove will only happen
if the wakeup condition changed in the meantime (which usually ought to
be the case, though). As far as I understand this, this really only
makes a difference for epoll as only epoll will keep everything on the
wait queues managed by it between 'polling calls'. In order to support
epoll-style wait queue management outside of epoll, the poll management
code would need to execute a cleanup callback instead of just the setup
callback it already executes.

> 1)
>
> In unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and thus
> it requires proper dereferencing. Something like:
>
> struct unix_sock *u;
> struct socket_wq *wq;
>
> u = container_of(wait, struct unix_sock, wait);
> rcu_read_lock();
> wq = rcu_dereference(u->sk.sk_wq);
> if (wq_has_sleeper(wq))
>   wake_up_interruptible_sync_poll(&wq->wait, key);
> rcu_read_unlock();

I think this may be unnecessary but I need more time to check this than
the odd "half an hour after work after 11pm [UK time]" I could put into
this today.

> 2)
>
> For the case of epoll() in edge triggered mode we need to ensure that
> when we return -EAGAIN from unix_dgram_sendmsg() when unix_recvq_full()
> is true, we need to add a unix_peer_wake_connect() call to guarantee a
> wakeup. Otherwise, we are going to potentially hang there.

I consider this necessary.


Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-21 Thread Rainer Weikusat
Rainer Weikusat  writes:
> Jason Baron  writes:
>> On 10/18/2015 04:58 PM, Rainer Weikusat wrote:

[...]

>> 1)
>>
>> In unix_peer_wake_relay() function, 'sk_wq' is an __rcu pointer and thus
>> it requires proper dereferencing. Something like:
>>
>> struct unix_sock *u;
>> struct socket_wq *wq;
>>
>> u = container_of(wait, struct unix_sock, wait);
>> rcu_read_lock();
>> wq = rcu_dereference(u->sk.sk_wq);
>> if (wq_has_sleeper(wq))
>>  wake_up_interruptible_sync_poll(&wq->wait, key);
>> rcu_read_unlock();
>
> I think this may be unnecessary

I consider this unnecessary now.

Rationale: The wait queue is allocated and freed in tandem with the
socket inode, which means it will remain allocated until after the
protocol release function (unix_release, with the bulk of the
implementation being in unix_release_sock) has returned. As the connection
with the other socket is broken in unix_release_sock, any relayed wake
up must have completed before this time (since both operations take
place while holding the same wait queue lock).


[RFC] unix: fix use-after-free in unix_dgram_poll()

2015-10-28 Thread Rainer Weikusat
Rainer Weikusat  writes:
> Jason Baron  writes:

[...]

>> 2)
>>
>> For the case of epoll() in edge triggered mode we need to ensure that
>> when we return -EAGAIN from unix_dgram_sendmsg() when unix_recvq_full()
>> is true, we need to add a unix_peer_wake_connect() call to guarantee a
>> wakeup. Otherwise, we are going to potentially hang there.
>
> I consider this necessary.

(As already discussed privately) just doing this would open up another
way for sockets to be enqueued on the peer_wait queue of the peer
forever even though no one wants to be notified of write space
availability. Here's another RFC patch addressing the issues so far plus
this one by breaking the connection to the peer socket from the wake up
relaying function. This has the nice additional property that the
dgram_poll code becomes somewhat simpler as the "dequeued where we
didn't enqueue" situation can no longer occur and the not-so-nice
additional property that the connect and disconnect functions need to
take the peer_wait.lock spinlock explicitly so that this lock can be used
to ensure that no two threads modify the private pointer of the client
wait_queue_t.

I've also moved the check, possibly enqueue then recheck and possibly
dequeue dance into a pair of functions as this code would be identical
for both unix_dgram_poll and unix_dgram_sendmsg (I'm not really happy
with the names, though).

---
--- linux-2-6.b/net/unix/af_unix.c  2015-10-28 16:06:29.581960497 +
+++ linux-2-6/net/unix/af_unix.c   2015-10-28 16:14:55.326065483 +
@@ -115,6 +115,8 @@
 #include 
 #include 
 
+#define POLL_OUT_ALL   (POLLOUT | POLLWRNORM | POLLWRBAND)
+
 static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 static DEFINE_SPINLOCK(unix_table_lock);
 static atomic_long_t unix_nr_socks;
@@ -303,6 +305,117 @@ found:
return s;
 }
 
+/*
+ * Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writability condition poll
+ * and sendmsg need to test. The dgram recv code will do a wake up on
+ * the peer_wait wait queue of a socket upon reception of a datagram
+ * which needs to be propagated to sleeping writers since these might
+ * not yet have sent anything. This can't be accomplished via
+ * poll_wait because the lifetime of the server socket might be less
+ * than that of its clients if these break their association with it
+ * or if the server socket is closed while clients are still connected
+ * to it and there's no way to inform "a polling implementation" that
+ * it should let go of a certain wait queue
+ *
+ * In order to achieve wake up propagation, a wait_queue_t of the
+ * client socket is thus enqueued on the peer_wait queue of the server
+ * socket whose wake function does a wake_up on the ordinary client
+ * socket wait queue. This connection is established whenever a write
+ * (or poll for write) hit the flow control condition and broken when
+ * the connection to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+   void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   &u->peer_wake);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = unix_sk(sk);
+   u_other = unix_sk(other);
+   rc = 0;
+
+   spin_lock(&u_other->peer_wait.lock);
+
+   if (!u->peer_wake.private) {
+   u->peer_wake.private = other;
+   __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+   rc = 1;
+   }
+
+   spin_unlock(&u_other->peer_wait.lock);
+   return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = unix_sk(sk);
+   u_other = unix_sk(other);
+   rc = 0;
+
+   spin_lock(&u_other->peer_wait.lock);
+
+   if (u->peer_wake.private == other) {
+   __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+   u->

Re: [RFC] unix: fix use-after-free in unix_dgram_poll()

2015-10-29 Thread Rainer Weikusat
Jason Baron  writes:
> On 10/28/2015 12:46 PM, Rainer Weikusat wrote:
>> Rainer Weikusat  writes:
>>> Jason Baron  writes:

[...]

>> and the not-so-nice additional property that the connect and
>> disconnect functions need to take the peer_wait.lock spinlock
>> explicitly so that this lock is used to ensure that no two threads
>> modify the private pointer of the client wait_queue_t.
>
> Hmmm...I thought these were already all guarded by unix_state_lock(sk).
> In any case, rest of the patch overall looks good to me.

All but the one in unix_dgram_peer_wake_relay: That's
called with the wait queue lock of the peer_wait queue held and even if
it was desirable to acquire the state lock of the involved socket, it
wouldn't easily be possible because 'other code' (in poll and sendmsg)
acquires the wait queue lock while holding the state lock.

The idea behind this is that the state lock ensures that the 'connection
state' doesn't change while some code is examining it and that the wait
queue lock of the peer_wait queue is (slightly ab-)used to ensure
exclusive access to the private pointer which guards against concurrent
inserts or removes of the same wait_queue_t.

Poking around in the implementation of abstract interfaces like this
isn't something I'm overly fond of but 'other code' does this too, eg,
in eventfd.c



Re: [RFC] unix: fix use-after-free in unix_dgram_poll()/ 4.2.5

2015-10-30 Thread Rainer Weikusat
Same changes ported to 4.2.5 with some minor improvements (I hope),
namely,

- applied a round of DeMorgan to the 'quick' check function in
  order to simplify the condition

- fixed a (minor) error in the dgram_sendmsg change: In case the
  2nd check resulted in 'can send now', the code would continue
  with 'wait until timeout expired' (since timeo was 0 in that
  case, this shouldn't make much of a practical difference)

- (hopefully) more intelligible function names and better
  explanation

- dropped the POLL_OUT_ALL macro again as that's really
  unrelated

---
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(struct sock *sk)
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   &u->peer_wake);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = unix_sk(sk);
+   u_other = unix_sk(other);
+   rc = 0;
+
+   spin_lock(&u_other->peer_wait.lock);
+
+   if (!u->peer_wake.private) {
+   u->peer_wake.private = other;
+   __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+   rc = 1;
+   }
+
+   spin_unlock(&u_other->peer_wait.lock);
+   return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = unix_sk(sk);
+   u_other = unix_sk(other);
+   rc = 0;
+
+   spin_lock(&u_other->peer_wait.lock);
+
+   if (u->peer_wake.private == other) {
+   __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+   u->peer_wake.private = NULL;
+
+   rc = 1;
+   }
+
+   spin_unlock(&u_other->peer_wait.lock);
+   return rc;
+}
+
+/* Needs 'this' unix state lock. Lockless check if data can (likely)
+ * be sent.
+ */
+static inline int unix_dgram_peer_recv_ready(struct sock *sk,
+struct sock *other)
+{
+   return unix_peer(other) == sk || !unix_recvq_full(other);
+}
+
+/* Needs 'this' unix state lock. After recv_ready indicated not ready,
+ * establish peer_wait connection if still needed.
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+   int conned;
+
+   conned = unix_dgram_peer_wake_connect(sk, other);
+
+   if (unix_recvq_full(other))
+   return 1;
+
+   if (conned)
+   unix_dgram_peer_wake_disconnect(sk, other);
+
+   return 0;
+}
+
 static inline int unix_writable(struct sock *sk)
 {
return (atomic_read(&sk->sk_wmem_alloc) << 2) 

Re: [RFC] unix: fix use-after-free in unix_dgram_poll()/ 4.2.5

2015-11-02 Thread Rainer Weikusat
Olivier Mauras  writes:

[...]

> I've encountered issues with Jason's patch ported to 3.14.x which would break
> openldap, rendering it unable to answer any query - Here's a strace of the
> slapd process in this state http://pastebin.ca/3226383
> Just ported Rainer's patch to 3.14 and so far I can't reproduce the issue -

I may be missing something here but the final state according to the
trace is that thread 775 of the process blocks in epoll_wait with a
descriptor set containing only a listening TCP socket (8) and waiting
for new connections. I don't think this can execute any code
changed by my patch and I'm fairly certain of this for Jason's, too:
Both are about AF_UNIX datagram sockets and the specific case where
either a write couldn't complete because the backlog of the receive
queue of the '1' side of an n:1 datagram socket arrangement was considered
too large or where a 'poll for write' check returned 'not writeable' for
the same reason.

Judging from the 2.4.42 sources, OpenLDAP doesn't use AF_UNIX datagram
sockets at all so it shouldn't ever be affected by any changes to the
code handling them.


Re: Use-after-free in ep_remove_wait_queue

2015-11-06 Thread Rainer Weikusat
Jason Baron  writes:
> On 11/06/2015 08:06 AM, Dmitry Vyukov wrote:
>> On Mon, Oct 12, 2015 at 2:17 PM, Dmitry Vyukov  wrote:
>>> On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet  
>>> wrote:
 On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:

> Probably the issue discussed in
>
>   http://thread.gmane.org/gmane.linux.kernel/2057497/
>
> and previous related threads.
>

 Same issue, but Dmitry apparently did not trust me.
>>>
>>> Just wanted to make sure. Let's consider this as fixed.
>> 
>> 
>> Is it meant to be fixed now?
>> 
>> I am still seeing use-after-free reports in ep_unregister_pollwait on
>> d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).
>> 
>
> Indeed - No formal patch has been posted for this yet - we've agreed
> that Rainer is going to post the patch for this issue.

I should be able to put some more work into this on Sunday or possibly
(but unlikely) later today (supposed to mean that I'll also actually do
that).


Re: [V4.4-rc6 Regression] af_unix: Revert 'lock_interruptible' in stream receive code

2016-02-05 Thread Rainer Weikusat
Joseph Salisbury  writes:
> Hi Rainer,
>
> A kernel bug report was opened against Ubuntu [0].  After a kernel
> bisect, it was found that reverting the following commit resolved this bug:
>
> commit 3822b5c2fc62e3de8a0f33806ff279fb7df92432
> Author: Rainer Weikusat 
> Date:   Wed Dec 16 20:09:25 2015 +
>
> af_unix: Revert 'lock_interruptible' in stream receive code
>
>   
> The regression was introduced as of v4.4-rc6.
>
> I was hoping to get your feedback, since you are the patch author.  Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request?

Funny little problem :-). The code using the interruptible lock cleared
err as a side effect, hence the

out:
return copied ? : err;

at the end of unix_stream_read_generic didn't return the -ENOTSUP put
into err at the start of the function if copied was zero after the loop
because the size of the passed data buffer was zero.
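For illustration, a standalone userspace sketch of that bleed-through
(made-up example, not the kernel code; -95 simply stands in for the error
value recorded early on):

#include <stdio.h>

static int fake_recv(int buffer_size)
{
	int err = -95;		/* error recorded early "just in case" */
	int copied = 0;

	/* ... all checks pass, so the error never applies, but err keeps its value ... */

	copied = buffer_size;	/* zero-sized buffer => nothing copied */

	return copied ? : err;	/* GNU ?: returns copied unless it is 0 */
}

int main(void)
{
	printf("recv with empty buffer returned %d\n", fake_recv(0));
	return 0;
}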

The following patch should fix this:

-
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..c3e1a08 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2300,6 +2300,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
else
skip = 0;
 
+   err = 0;
do {
int chunk;
bool drop_skb;
--

I was just about to go to the supermarket to buy an apple when I
received the mail. I didn't even compile the change above yet, however,
I'll do so once I'm back and then submit something formal.

Here's a test program which can be compiled with a C compiler:

#define _GNU_SOURCE

#include <sys/types.h>
#include <sys/socket.h>
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	enum { server, client, size };
	int socket_fd[size];
	int const opt = 1;

	assert(socketpair(AF_LOCAL, SOCK_STREAM, 0, socket_fd) == 0);

	char const msg[] = "A random message";
	send(socket_fd[client], msg, sizeof msg, MSG_DONTWAIT | MSG_NOSIGNAL);

	assert(setsockopt(socket_fd[server], SOL_SOCKET, SO_PASSCRED, &opt,
			  sizeof(opt)) != -1);

	union {
		struct cmsghdr cmh;
		char control[CMSG_SPACE(sizeof(struct ucred))];
	} control_un;

	control_un.cmh.cmsg_len = CMSG_LEN(sizeof(struct ucred));
	control_un.cmh.cmsg_level = SOL_SOCKET;
	control_un.cmh.cmsg_type = SCM_CREDENTIALS;

	struct msghdr msgh;
	msgh.msg_name = NULL;
	msgh.msg_namelen = 0;
	msgh.msg_iov = NULL;
	msgh.msg_iovlen = 0;
	msgh.msg_control = control_un.control;
	msgh.msg_controllen = sizeof(control_un.control);

	errno = 0;

	if (recvmsg(socket_fd[server], &msgh, MSG_PEEK) == -1)
	{
		printf("Error: %s\n", strerror(errno));
		exit(EXIT_FAILURE);
	}
	else
	{
		printf("Success!\n");
		exit(EXIT_SUCCESS);
	}
}


Re: [V4.4-rc6 Regression] af_unix: Revert 'lock_interruptible' in stream receive code

2016-02-05 Thread Rainer Weikusat
Joseph Salisbury  writes:
> On 02/05/2016 02:59 PM, Rainer Weikusat wrote:

[recvmsg w/o iovecs returning ENOTSUP for CMSG requests]

>> Funny little problem :-). The code using the interruptible lock cleared
>> err as side effect hence the
>>
>> out:
>>  return copied ? : err;
>>
>> at the end of unix_stream_read_generic didn't return the -ENOTSUP put
>> into err at the start of the function if copied was zero after the loop
>> because the size of the passed data buffer was zero.

There are more problems wrt handling control-message only reads in this
code. In particular, changing the test program as follows:

if (fork() == 0) {
	sleep(1);
	send(socket_fd[client], msg, sizeof msg, MSG_DONTWAIT | MSG_NOSIGNAL);

	_exit(0);
}

makes the recvmsg fail with EAGAIN and judging from the code (I didn't
test this yet), it will return without an error but also without
credentials if the

err = -EAGAIN;
if (!timeo)
	break;

is changed to

if (!timeo) {
	err = -EAGAIN;
	break;
}

because the following

mutex_lock(&u->readlock);
continue;

will cause the

do {
} while (size)

loop condition to be evaluated and since size is 0 (AIUI), the loop will
terminate immediately.
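A standalone userspace sketch of why that happens (made-up example,
nothing af_unix-specific): a continue in a do/while jumps straight to the
controlling expression, so with a zero size the rest of the body never
runs again.

#include <stdio.h>

int main(void)
{
	size_t size = 0;		/* caller supplied no data buffer */
	int got_credentials = 0;
	int had_to_wait = 1;		/* first pass: no message queued yet */

	do {
		if (had_to_wait) {
			had_to_wait = 0;
			continue;	/* jumps to the while (size) test */
		}

		got_credentials = 1;	/* never reached: loop already ended */
	} while (size);

	printf("credentials copied: %d\n", got_credentials);	/* prints 0 */
	return 0;
}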





Re: [V4.4-rc6 Regression] af_unix: Revert 'lock_interruptible' in stream receive code

2016-02-05 Thread Rainer Weikusat
The present unix_stream_read_generic contains various code sequences of
the form

err = -EDISASTER;
if ()
goto out;

This has the unfortunate side effect of possibly causing the error code
to bleed through to the final

out:
return copied ? : err;

and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at

http://pad.lv/1540731

Change it such that err is only set if an error condition was detected.

Signed-off-by: Rainer Weikusat 
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..138787d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2277,13 +2277,15 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
size_t size = state->size;
unsigned int last_len;
 
-   err = -EINVAL;
-   if (sk->sk_state != TCP_ESTABLISHED)
+   if (sk->sk_state != TCP_ESTABLISHED) {
+   err = -EINVAL;
goto out;
+   }
 
-   err = -EOPNOTSUPP;
-   if (flags & MSG_OOB)
+   if (flags & MSG_OOB) {
+   err = -EOPNOTSUPP;
goto out;
+   }
 
target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, noblock);
@@ -2329,9 +2331,11 @@ again:
goto unlock;
 
unix_state_unlock(sk);
-   err = -EAGAIN;
-   if (!timeo)
+   if (!timeo) {
+   err = -EAGAIN;
break;
+   }
+
mutex_unlock(&u->readlock);
 
timeo = unix_stream_data_wait(sk, timeo, last,


Re: [V4.4-rc6 Regression] af_unix: Revert 'lock_interruptible' in stream receive code

2016-02-05 Thread Rainer Weikusat
Rainer Weikusat  writes:
> Joseph Salisbury  writes:
>> On 02/05/2016 02:59 PM, Rainer Weikusat wrote:
>
> [recvmsg w/o iovecs returning ENOTSUP for CMSG requests]

[...]

> There are more problems wrt handling control-message only reads in this
> code.

[...]

> it will return without an error but also without credentials if the

[...]

> because the following
>
> mutex_lock(&u->readlock);
> continue;
>
> will cause the
>
> do {
> } while (size)
>
> loop condition to be evaluated and since size is 0 (AIUI), the loop will
> terminate immediately.

As I suspected, the test program included below doesn't really receive
the credentials (tested with a 4.5.0-rc2-net w/ the previous patch
applied). As that's a minor, additional problem, I'll fix that, too.

---
#define _GNU_SOURCE

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	enum { server, client, size };
	int socket_fd[size];
	int const opt = 1;

	assert(socketpair(AF_LOCAL, SOCK_STREAM, 0, socket_fd) == 0);
	assert(setsockopt(socket_fd[server], SOL_SOCKET, SO_PASSCRED, &opt,
			  sizeof(opt)) != -1);

	char const msg[] = "A random message";

	if (fork() == 0) {
		sleep(1);
		send(socket_fd[client], msg, sizeof msg, MSG_DONTWAIT | MSG_NOSIGNAL);

		_exit(0);
	}

	union {
		struct cmsghdr cmh;
		char control[CMSG_SPACE(sizeof(struct ucred))];
	} control_un;

	control_un.cmh.cmsg_len = CMSG_LEN(sizeof(struct ucred));
	control_un.cmh.cmsg_level = SOL_SOCKET;
	control_un.cmh.cmsg_type = SCM_CREDENTIALS;

	struct msghdr msgh;
	msgh.msg_name = NULL;
	msgh.msg_namelen = 0;
	msgh.msg_iov = NULL;
	msgh.msg_iovlen = 0;
	msgh.msg_control = control_un.control;
	msgh.msg_controllen = sizeof(control_un.control);

	if (recvmsg(socket_fd[server], &msgh, MSG_PEEK) == -1)
	{
		printf("Error: %s\n", strerror(errno));
		exit(EXIT_FAILURE);
	}
	else
	{
		struct ucred *ucred;

		printf("Success?\n");

		ucred = (void *)CMSG_DATA(&control_un.cmh);
		printf("...  pid %ld, uid %d, gid %d\n",
		       (long)ucred->pid, ucred->uid, ucred->gid);
	}

	return 0;
}


[PATCH] af_unix: Don't set err in unix_stream_read_generic unless there was an error

2016-02-05 Thread Rainer Weikusat
The present unix_stream_read_generic contains various code sequences of
the form

err = -EDISASTER;
if ()
goto out;

This has the unfortunate side effect of possibly causing the error code
to bleed through to the final

out:
return copied ? : err;

and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at

http://pad.lv/1540731

Change it such that err is only set if an error condition was detected.

Signed-off-by: Rainer Weikusat 
---

With proper subject this time (at least I hope so).

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..138787d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2277,13 +2277,15 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
size_t size = state->size;
unsigned int last_len;
 
-   err = -EINVAL;
-   if (sk->sk_state != TCP_ESTABLISHED)
+   if (sk->sk_state != TCP_ESTABLISHED) {
+   err = -EINVAL;
goto out;
+   }
 
-   err = -EOPNOTSUPP;
-   if (flags & MSG_OOB)
+   if (flags & MSG_OOB) {
+   err = -EOPNOTSUPP;
goto out;
+   }
 
target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, noblock);
@@ -2329,9 +2331,11 @@ again:
goto unlock;
 
unix_state_unlock(sk);
-   err = -EAGAIN;
-   if (!timeo)
+   if (!timeo) {
+   err = -EAGAIN;
break;
+   }
+
mutex_unlock(&u->readlock);
 
timeo = unix_stream_data_wait(sk, timeo, last,


[PATCH] af_unix: Don't use continue to re-execute unix_stream_read_generic loop

2016-02-05 Thread Rainer Weikusat
The unix_stream_read_generic function tries to use a continue statement
to restart the receive loop after waiting for a message. This may not
work as intended as the caller might use a recvmsg call to peek at
control messages without specifying a message buffer. If this is the
case and the function had to wait for a message, the continue will cause
it to return without an error and without the credential information,
whereas it would have returned the credentials otherwise. Change to
using goto to restart the loop without checking the condition first in
this case so that credentials are returned either way.

Signed-off-by: Rainer Weikusat 
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..3b73bd7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2305,6 +2305,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
bool drop_skb;
struct sk_buff *skb, *last;
 
+redo:
unix_state_lock(sk);
if (sock_flag(sk, SOCK_DEAD)) {
err = -ECONNRESET;
@@ -2344,7 +2345,7 @@ again:
}
 
mutex_lock(&u->readlock);
-   continue;
+   goto redo;
 unlock:
unix_state_unlock(sk);
break;


Re: [V4.4-rc6 Regression] af_unix: Revert 'lock_interruptible' in stream receive code

2016-02-07 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Fri, 2016-02-05 at 21:44 +0000, Rainer Weikusat wrote:
>> The present unix_stream_read_generic contains various code sequences of
>> the form
>> 
>> err = -EDISASTER;
>> if ()
>>  goto out;
>> 
>> This has the unfortunate side effect of possibly causing the error code
>> to bleed through to the final
>> 
>> out:
>>  return copied ? : err;
>> 
>> and then to be wrongly returned if no data was copied because the caller
>> didn't supply a data buffer, as demonstrated by the program available at
>> 
>> http://pad.lv/1540731
>> 
>> Change it such that err is only set if an error condition was detected.
>
>
> Well, if you replace the traditional flow
>
> err = -;
> if (test)
>  goto out;
>
> Then please add unlikely() to at least give a hint to the compiler.

There are really four of these, the two leading ones,

if (unlikely(sk->sk_state != TCP_ESTABLISHED)) {
err = -EINVAL;
goto out;
}

if (unlikely(flags & MSG_OOB)) {
err = -EOPNOTSUPP;
goto out;
}

one in between which was already in the function,

unix_state_lock(sk);
if (sock_flag(sk, SOCK_DEAD)) {
err = -ECONNRESET;
goto unlock;
}
[at the beginning of the loop]

and lastly,

unix_state_unlock(sk);
if (!timeo) {
err = -EAGAIN;
break;
}

mutex_unlock(&u->readlock);

timeo = unix_stream_data_wait(sk, timeo, last,
  last_len);

As can be seen here, I've added unlikely() to the first two, left the
middle one alone and didn't change the last one: That's not really an
error but a non-blocking read. My gut feeling about that would be that
it's likely rather than unlikely but since I have no real information on
this, I don't know how to annotate it correctly (I'll gladly add
whatever annotation somebody else considers sensible).

> And please add a 'Fixes:  ' tag for bug fixes.

Similarly, if you think this should be considered as 'fixing' anything
in particular, I'll also gladly add that. In my opinion, the real
problem is that the function disagrees with itself on how to use the err
variable: The start uses that to record an error which might need to be
reported, the return statement uses it to indicate that an error has
occurred. Hence, some kind of in-between translation must occur.
The mutex_lock_interruptible happened to do that but that was never its
intended purpose.

NB: This is not an attempt to start an argument about that, it just
summarizes my understanding of the situation and I don't insist on this
viewpoint.

I'll send an updated patch with the changes so far, ie, the two
unlikelys.


Re: [PATCH] af_unix: Don't set err in unix_stream_read_generic unless there was an error

2016-02-07 Thread Rainer Weikusat
The present unix_stream_read_generic contains various code sequences of
the form

err = -EDISASTER;
if ()
goto out;

This has the unfortunate side effect of possibly causing the error code
to bleed through to the final

out:
return copied ? : err;

and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at

http://pad.lv/1540731

Change it such that err is only set if an error condition was detected.

Signed-off-by: Rainer Weikusat 
---

With unlikely() added to the two leading error checks. I've also checked
that this actually generates better code (for me at least). Without an
annotation, the non-blocking read case is treated like the others, ie, a
jump to some code at the end of the function loading the error code into
a register, followed by a 2nd jump to the termination sequence.

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..c1e4dd7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2277,13 +2277,15 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
size_t size = state->size;
unsigned int last_len;
 
-   err = -EINVAL;
-   if (sk->sk_state != TCP_ESTABLISHED)
+   if (unlikely(sk->sk_state != TCP_ESTABLISHED)) {
+   err = -EINVAL;
goto out;
+   }
 
-   err = -EOPNOTSUPP;
-   if (flags & MSG_OOB)
+   if (unlikely(flags & MSG_OOB)) {
+   err = -EOPNOTSUPP;
goto out;
+   }
 
target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, noblock);
@@ -2329,9 +2331,11 @@ again:
goto unlock;
 
unix_state_unlock(sk);
-   err = -EAGAIN;
-   if (!timeo)
+   if (!timeo) {
+   err = -EAGAIN;
break;
+   }
+
mutex_unlock(&u->readlock);
 
timeo = unix_stream_data_wait(sk, timeo, last,


Re: [V4.4-rc6 Regression] af_unix: Revert 'lock_interruptible' in stream receive code

2016-02-07 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> the real problem is that the function disagrees with itself on how to
> use the err variable: The start uses that to record an error which
> might need to be reported, the return statement uses it to indicate
> that an error has occurred.

This should have been "as an indication that an error has occurred" (if no
data was copied).



Re: [V4.4-rc6 Regression] af_unix: Revert 'lock_interruptible' in stream receive code

2016-02-07 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> The start uses that to record an error which might need to be
> reported, the return statement uses it to indicate that an error has
> occurred. Hence, some kind of in-between translation must occur.  The
> mutex_lock_interruptible happened to do that but that was never it's
> intended purpose.

Additional information: The 'trick' of using recvmsg w/o a receive
buffer in order to retrieve control messages in fact wouldn't have
worked with the unix_stream_recvmsg prior to introduction of the
interruptible lock as that (judging from the git source) would have
triggered all the same issues,

- -EOPNOTSUP if a msg was available

- -EAGAIN if the code had to wait

- not receiving the creds if the -EAGAIN hadn't happened because
  of the continue (that's the other patch)

IOW, that's a feature inadvertently added by an otherwise useless code
change (mea culpa).


[PATCH] af_unix: Don't set err in unix_stream_read_generic unless there was an error

2016-02-08 Thread Rainer Weikusat
The present unix_stream_read_generic contains various code sequences of
the form

err = -EDISASTER;
if ()
goto out;

This has the unfortunate side effect of possibly causing the error code
to bleed through to the final

out:
return copied ? : err;

and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at

http://pad.lv/1540731

Change it such that err is only set if an error condition was detected.

Fixes: 3822b5c2fc62
Signed-off-by: Rainer Weikusat 
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..c1e4dd7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2277,13 +2277,15 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
size_t size = state->size;
unsigned int last_len;
 
-   err = -EINVAL;
-   if (sk->sk_state != TCP_ESTABLISHED)
+   if (unlikely(sk->sk_state != TCP_ESTABLISHED)) {
+   err = -EINVAL;
goto out;
+   }
 
-   err = -EOPNOTSUPP;
-   if (flags & MSG_OOB)
+   if (unlikely(flags & MSG_OOB)) {
+   err = -EOPNOTSUPP;
goto out;
+   }
 
target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, noblock);
@@ -2329,9 +2331,11 @@ again:
goto unlock;
 
unix_state_unlock(sk);
-   err = -EAGAIN;
-   if (!timeo)
+   if (!timeo) {
+   err = -EAGAIN;
break;
+   }
+
mutex_unlock(&u->readlock);
 
timeo = unix_stream_data_wait(sk, timeo, last,


Re: [PATCH] af_unix: Don't set err in unix_stream_read_generic unless there was an error

2016-02-08 Thread Rainer Weikusat
The present unix_stream_read_generic contains various code sequences of
the form

err = -EDISASTER;
if ()
goto out;

This has the unfortunate side effect of possibly causing the error code
to bleed through to the final

out:
return copied ? : err;

and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at

http://pad.lv/1540731

Change it such that err is only set if an error condition was detected.

Fixes: 3822b5c2fc62 ("af_unix: Revert 'lock_interruptible' in stream receive code")
Signed-off-by: Rainer Weikusat 
---

With Fixes: fixed.

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..c1e4dd7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2277,13 +2277,15 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
size_t size = state->size;
unsigned int last_len;
 
-   err = -EINVAL;
-   if (sk->sk_state != TCP_ESTABLISHED)
+   if (unlikely(sk->sk_state != TCP_ESTABLISHED)) {
+   err = -EINVAL;
goto out;
+   }
 
-   err = -EOPNOTSUPP;
-   if (flags & MSG_OOB)
+   if (unlikely(flags & MSG_OOB)) {
+   err = -EOPNOTSUPP;
goto out;
+   }
 
target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, noblock);
@@ -2329,9 +2331,11 @@ again:
goto unlock;
 
unix_state_unlock(sk);
-   err = -EAGAIN;
-   if (!timeo)
+   if (!timeo) {
+   err = -EAGAIN;
break;
+   }
+
mutex_unlock(&u->readlock);
 
timeo = unix_stream_data_wait(sk, timeo, last,


[PATCH] af_unix: Don't set err in unix_stream_read_generic unless there was an error

2016-02-08 Thread Rainer Weikusat
The present unix_stream_read_generic contains various code sequences of
the form

err = -EDISASTER;
if ()
goto out;

This has the unfortunate side effect of possibly causing the error code
to bleed through to the final

out:
return copied ? : err;

and then to be wrongly returned if no data was copied because the caller
didn't supply a data buffer, as demonstrated by the program available at

http://pad.lv/1540731

Change it such that err is only set if an error condition was detected.

Fixes: 3822b5c2fc62 ("af_unix: Revert 'lock_interruptible' in stream receive code")
Reported-by: Joseph Salisbury 
Signed-off-by: Rainer Weikusat 
---

And the subject fixed again and, since another correction was necessary
anyway, a Reported-by added.

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 49d5093..c1e4dd7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2277,13 +2277,15 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
size_t size = state->size;
unsigned int last_len;
 
-   err = -EINVAL;
-   if (sk->sk_state != TCP_ESTABLISHED)
+   if (unlikely(sk->sk_state != TCP_ESTABLISHED)) {
+   err = -EINVAL;
goto out;
+   }
 
-   err = -EOPNOTSUPP;
-   if (flags & MSG_OOB)
+   if (unlikely(flags & MSG_OOB)) {
+   err = -EOPNOTSUPP;
goto out;
+   }
 
target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, noblock);
@@ -2329,9 +2331,11 @@ again:
goto unlock;
 
unix_state_unlock(sk);
-   err = -EAGAIN;
-   if (!timeo)
+   if (!timeo) {
+   err = -EAGAIN;
break;
+   }
+
mutex_unlock(&u->readlock);
 
timeo = unix_stream_data_wait(sk, timeo, last,


[PATCH net] af_unix: Guard against other == sk in unix_dgram_sendmsg

2016-02-11 Thread Rainer Weikusat
The unix_dgram_sendmsg routine uses the following test

if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {

to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself, or
this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add an other != sk
check to guard against this.

Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue")
Reported-By: Philipp Hahn 
Signed-off-by: Rainer Weikusat 
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 29be035..f1ca279 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1781,7 +1781,12 @@ restart_locked:
goto out_unlock;
}
 
-   if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+   /* other == sk && unix_peer(other) != sk if
+* - unix_peer(sk) == NULL, destination address bound to sk
+* - unix_peer(sk) == sk by time of get but disconnected before lock
+*/
+   if (other != sk &&
+   unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
if (timeo) {
timeo = unix_wait_for_peer(other, timeo);
 


Re: [PATCH net] af_unix: Guard against other == sk in unix_dgram_sendmsg

2016-02-12 Thread Rainer Weikusat
Philipp Hahn  writes:

> Hello Rainer,
>
> Am 11.02.2016 um 20:37 schrieb Rainer Weikusat:
>> The unix_dgram_sendmsg routine use the following test
>> 
>> if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {

[...]

>> This isn't correct as the specified address could have been bound to
>> the sending socket itself

[...]

> After applying that patch at least my machine running the samba test no
> longer crashes.

There's a possible gotcha in there: Send-to-self used to be limited by
the queue limit. But the rationale for that (IIRC) was that someone
could keep using newly created sockets to queue ever more data to a
single, unrelated receiver. I don't think this should apply when
receiving and sending sockets are identical. But that's just my
opinion. The other option would be to avoid the unix_state_double_lock
for sk == other. I'd be willing to change this accordingly if someone
thinks the queue limit should apply to send-to-self.



Re: [PATCH] af_unix: Don't set err in unix_stream_read_generic unless there was an error

2016-02-12 Thread Rainer Weikusat
Joseph Salisbury  writes:
> On 02/05/2016 05:30 PM, Rainer Weikusat wrote:
>> The present unix_stream_read_generic contains various code sequences of
>> the form
>>
>> err = -EDISASTER;
>> if ()
>>  goto out;

[...]

>> Change it such that err is only set if an error condition was detected.

[...]

> I tested your patch, Rainer.  I can confirm that it fixes the reported
> bug[0].

This is only a partial fix: The launchpad test case will no longer fail
with EOPNOTSUPP and it will actually receive the credentials because the
message they're attached to was available by the time of the
recvmsg. But if this isn't the case, ie, if the receiver has to wait for
a message, the continue in the do { } while (size) loop will cause the
loop to be terminated without copying the credential information as the
continue will cause the size (of the remaining 'receive area' which is
zero for this case) to be checked before any of the other loop code is
executed again.

I posted a test case for this and a patch addressing that elsewhere in
this thread.


Re: [PATCH net] af_unix: Guard against other == sk in unix_dgram_sendmsg

2016-02-12 Thread Rainer Weikusat
Ben Hutchings  writes:
> On Fri, 2016-02-12 at 13:25 +0000, Rainer Weikusat wrote:
>> Philipp Hahn  writes:
>> > Hello Rainer,
>> > 
>> > Am 11.02.2016 um 20:37 schrieb Rainer Weikusat:
>> > > The unix_dgram_sendmsg routine use the following test
>> > > 
>> > > if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
>> 
>> [...]
>> 
>> > > This isn't correct as the specified address could have been bound to
>> > > the sending socket itself
>> 
>> [...]
>> 
>> > After applying that patch at least my machine running the samba test no
>> > longer crashes.
>> 
>> There's a possible gotcha in there: Send-to-self used to be limited by
>> the queue limit. But the rationale for that (IIRC) was that someone
>> could keep using newly created sockets to queue ever more data to a
>> single, unrelated receiver. I don't think this should apply when
>> receiving and sending sockets are identical. But that's just my
>> opinion. The other option would be to avoid the unix_state_double_lock
>> for sk == other.
>
> Given that unix_state_double_lock() already handles sk == other, I'm
> not sure why you think it needs to be avoided.

Because the whole complication of restarting the operation after locking
both sk and other (other had to be unlocked before calling
unix_state_double_lock) is useless for this case: as other == sk, there's
no reason to drop the lock on it, which guarantees that the result of all
the earlier checks is still valid. If the -EAGAIN condition is not true,
execution can just continue.

>> I'd be willing to change this accordingly if someone
>> thinks the queue limit should apply to send-to-self.
>
> If we don't check the queue limit here, does anything else prevent the
> queue growing to the point it's a DoS?

The max_dgram_qlen limit exists specifically to prevent someone sending
'a lot' of messages to a socket unrelated to it by repeatedly creating a
socket, sending as many messages as the send buffer size will allow,
closing the socket, creating a new socket, ..., cf

http://netdev.vger.kernel.narkive.com/tcZIFJeC/get-rid-of-proc-sys-net-unix-max-dgram-qlen#post4
(first copy I found)

This 'attack' will obviously not work very well when the sending and
receiving sockets are identical.
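For illustration, a hedged userspace sketch of that pattern (path,
message size and iteration count are made up): each iteration gets a
fresh sender socket and thus a fresh send-buffer allowance, so without
the qlen limit only the receiver's consumption rate bounds its queue.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	struct sockaddr_un victim = { .sun_family = AF_UNIX };
	char buf[512] = { 0 };
	int i, fd;

	strcpy(victim.sun_path, "/tmp/victim.sock");	/* assumed receiver */

	for (i = 0; i < 1000; i++) {
		fd = socket(AF_UNIX, SOCK_DGRAM, 0);
		if (fd < 0)
			return 1;

		/* queue as much as this socket's send buffer allows ... */
		while (sendto(fd, buf, sizeof buf, MSG_DONTWAIT,
			      (struct sockaddr *)&victim, sizeof victim) > 0)
			;

		close(fd);	/* ... then start over with a new socket */
	}
	return 0;
}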


Re: [PATCH net] af_unix: Guard against other == sk in unix_dgram_sendmsg

2016-02-12 Thread Rainer Weikusat
Ben Hutchings  writes:
> On Fri, 2016-02-12 at 20:17 +0000, Rainer Weikusat wrote:

[...]

>>>> I don't think this should apply when
>>>> receiving and sending sockets are identical. But that's just my
>>>> opinion. The other option would be to avoid the unix_state_double_lock
>>>> for sk == other.
>>> 
>>> Given that unix_state_double_lock() already handles sk == other, I'm
>>> not sure why you think it needs to be avoided.
>> 
>> Because the whole complication of restarting the operation after locking
>> both sk and other because other had to be unlocked before calling
>> unix_state_double_lock is useless for this case:

[...]

> Well of course it's useless, but it's also harmless.  

As is adding a

for (i = 0; i < 100; ++i);

between any two statements. And this isn't even entirely true as the
pointless double-lock will then require "did we pointlessly
doube-lock" checks elsewhere. I think it should be possible to do this
in a simpler way by not pointlessly double-locking (this may be
wrong but it's worth a try).

> If we really wanted to optimise this we could also skip unlocking if
> other < sk.

I wouldn't want to hardcode assumptions about the unix_state_double_lock
algorithm in functions using it. 


https://patchwork.ozlabs.org/patch/579654?

2016-02-16 Thread Rainer Weikusat
https://patchwork.ozlabs.org/patch/579654

lists this as 'superseded', along with the older versions of the patch
which changed the error handling. But at least I couldn't find anything
superseding it. This was supposed to address the different-but-related
problem demonstrated by the following (slightly modified) test program:

-
#define _GNU_SOURCE

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	enum { server, client, size };
	int socket_fd[size];
	int const opt = 1;

	assert(socketpair(AF_LOCAL, SOCK_STREAM, 0, socket_fd) == 0);
	assert(setsockopt(socket_fd[server], SOL_SOCKET, SO_PASSCRED, &opt,
			  sizeof(opt)) != -1);

	char const msg[] = "A random message";

	if (fork() == 0) {
		sleep(1);
		send(socket_fd[client], msg, sizeof msg, 0);

		_exit(0);
	}

	union {
		struct cmsghdr cmh;
		char control[CMSG_SPACE(sizeof(struct ucred))];
	} control_un;

	control_un.cmh.cmsg_len = CMSG_LEN(sizeof(struct ucred));
	control_un.cmh.cmsg_level = SOL_SOCKET;
	control_un.cmh.cmsg_type = SCM_CREDENTIALS;

	struct msghdr msgh;
	msgh.msg_name = NULL;
	msgh.msg_namelen = 0;
	msgh.msg_iov = NULL;
	msgh.msg_iovlen = 0;
	msgh.msg_control = control_un.control;
	msgh.msg_controllen = sizeof(control_un.control);

	recvmsg(socket_fd[server], &msgh, MSG_PEEK);

	printf("Success?\n");

	struct ucred *ucred;
	ucred = (void *)CMSG_DATA(&control_un.cmh);
	printf("...  pid %ld, uid %d, gid %d\n",
	       (long)ucred->pid, ucred->uid, ucred->gid);

	return 0;
}


Because the receiver has to wait for the message, it will hit the
continue in unix_stream_read_generic. This causes the size check of the
do-while loop to be evaluated which, as the size is zero, terminates the
loop without the credential information having been copied.

Just wondering if this might have been lost in the noise ...
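
Schematically (simplified, not the literal kernel code), the receive
loop looks like this, with "size" being the amount of data the caller
asked for, ie, zero in the scenario above:

	do {
		/* ... look for a queued skb ... */
		if (!skb) {
			/* ... sleep in unix_stream_data_wait(), then */
			continue;	/* jumps straight to the loop condition */
		}

		/* ... copy credentials and data ... */
	} while (size);			/* size == 0: the loop exits here */

So a caller which only peeks at the credentials never reaches the
copying part once it had to wait for a message.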


[PATCH net] af_unix: Don't use continue to re-execute unix_stream_read_generic loop

2016-02-18 Thread Rainer Weikusat
The unix_stream_read_generic function tries to use a continue statement
to restart the receive loop after waiting for a message. This may not
work as intended because the caller might use a recvmsg call to peek at
control messages without specifying a message buffer. In that case, if
the function had to wait for a message, the continue will cause it to
return without an error and without the credential information, whereas
it would have returned the credentials had a message already been
queued. Change the code to restart the loop via goto, without checking
the loop condition first, so that credentials are returned either way.

Signed-off-by: Rainer Weikusat 
Acked-by: Hannes Frederic Sowa 
---

I'm resending this as the original patch seems to have been classified
as superseded without anything actually superseding it. I hope the net
tree is appropriate. I consider this a bugfix.

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index c51e283..f75f847 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2312,6 +2312,7 @@ static int unix_stream_read_generic(struct 
unix_stream_read_state *state)
bool drop_skb;
struct sk_buff *skb, *last;
 
+redo:
unix_state_lock(sk);
if (sock_flag(sk, SOCK_DEAD)) {
err = -ECONNRESET;
@@ -2353,7 +2354,7 @@ again:
}
 
mutex_lock(&u->readlock);
-   continue;
+   goto redo;
 unlock:
unix_state_unlock(sk);
break;


[PATCH] af_unix: Fix splice-bind deadlock

2015-12-27 Thread Rainer Weikusat
On 2015/11/06, Dmitry Vyukov reported a deadlock involving the splice
system call and AF_UNIX sockets, 

http://lists.openwall.net/netdev/2015/11/06/24

The situation was analyzed as

(a while ago) A: socketpair()
B: splice() from a pipe to /mnt/regular_file
does sb_start_write() on /mnt
C: try to freeze /mnt
wait for B to finish with /mnt
A: bind() try to bind our socket to /mnt/new_socket_name
lock our socket, see it not bound yet
decide that it needs to create something in /mnt
try to do sb_start_write() on /mnt, block (it's
waiting for C).
D: splice() from the same pipe to our socket
lock the pipe, see that socket is connected
try to lock the socket, block waiting for A
B:  get around to actually feeding a chunk from
pipe to file, try to lock the pipe.  Deadlock.

on 2015/11/10 by Al Viro,

http://lists.openwall.net/netdev/2015/11/10/4

The patch fixes this by removing the kern_path_create related code from
unix_mknod and executing it as part of unix_bind prior to acquiring the
readlock of the socket in question. This means that A (as used above)
will sb_start_write on /mnt before it acquires the readlock, hence, it
won't indirectly block B which first did a sb_start_write and then
waited for a thread trying to acquire the readlock. Consequently, A
being blocked by C waiting for B won't cause a deadlock anymore
(effectively, both A and B acquire two locks in opposite order in the
situation described above).

Signed-off-by: Rainer Weikusat 
Tested-by: Dmitry Vyukov 
---

I also think this is a better (or at least more correct) solution than
the pretty obvious idea to record that the socket is in the process of
being bound and performing the mknod without the lock. Assuming the
first bind fails with -EADDRINUSE, a concurrent bind which might have
succeeded had it waited for the ultimate outcome of the first will
meanwhile have failed with -EINVAL even though the socket ends up
unbound.

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index b1314c0..9b3d268 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -953,32 +953,20 @@ fail:
return NULL;
 }
 
-static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
+static int unix_mknod(struct dentry *dentry, struct path *path, umode_t mode,
+ struct path *res)
 {
-   struct dentry *dentry;
-   struct path path;
-   int err = 0;
-   /*
-* Get the parent directory, calculate the hash for last
-* component.
-*/
-   dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
-   err = PTR_ERR(dentry);
-   if (IS_ERR(dentry))
-   return err;
+   int err;
 
-   /*
-* All right, let's create it.
-*/
-   err = security_path_mknod(&path, dentry, mode, 0);
+   err = security_path_mknod(path, dentry, mode, 0);
if (!err) {
-   err = vfs_mknod(d_inode(path.dentry), dentry, mode, 0);
+   err = vfs_mknod(d_inode(path->dentry), dentry, mode, 0);
if (!err) {
-   res->mnt = mntget(path.mnt);
+   res->mnt = mntget(path->mnt);
res->dentry = dget(dentry);
}
}
-   done_path_create(&path, dentry);
+
return err;
 }
 
@@ -993,6 +981,8 @@ static int unix_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
unsigned int hash;
struct unix_address *addr;
struct hlist_head *list;
+   struct path path;
+   struct dentry *dentry;
 
err = -EINVAL;
if (sunaddr->sun_family != AF_UNIX)
@@ -1008,9 +998,21 @@ static int unix_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
goto out;
addr_len = err;
 
+   dentry = NULL;
+   if (sun_path[0]) {
+   /* Get the parent directory, calculate the hash for last
+* component.
+*/
+   dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
+
+   err = PTR_ERR(dentry);
+   if (IS_ERR(dentry))
+   goto out;
+   }
+
err = mutex_lock_interruptible(&u->readlock);
if (err)
-   goto out;
+   goto out_path;
 
err = -EINVAL;
if (u->addr)
@@ -1026,11 +1028,11 @@ static int unix_bind(struct socket *sock, struct 
sockaddr *uaddr, int addr_len)
addr->hash = hash ^ sk->sk_type;
atomic_set(&addr->refcnt, 1);
 
-   if (sun_path[0]) {
-   struct path path;
+   if (dentry) {
+   struct path u_path;
umode_t mode = S_IFSOCK |
   (SOCK_INODE(sock)->i_mode & ~current_umask());
-   err = unix_mknod(sun_path, mode, &path);
+   err = unix_mknod(dentry, &

Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram

2015-12-29 Thread Rainer Weikusat
Jacob Siverskog  writes:
> This should fix a NULL pointer dereference I encountered (dump
> below). Since __skb_unlink is called while walking,
> skb_queue_walk_safe should be used.

The code in question is:

skb_queue_walk(queue, skb) {
*last = skb;
*peeked = skb->peeked;
if (flags & MSG_PEEK) {
if (_off >= skb->len && (skb->len || _off ||
 skb->peeked)) {
_off -= skb->len;
continue;
}

skb = skb_set_peeked(skb);
error = PTR_ERR(skb);
if (IS_ERR(skb)) {
spin_unlock_irqrestore(&queue->lock,
   cpu_flags);
goto no_packet;
}

atomic_inc(&skb->users);
}  else
__skb_unlink(skb, queue);

spin_unlock_irqrestore(&queue->lock, cpu_flags);
*off = _off;
return skb;
}

__skb_unlink is only called prior to returning from the function.
Consequently, it won't affect the skb_queue_walk code.


Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram

2015-12-30 Thread Rainer Weikusat
Jacob Siverskog  writes:
> On Tue, Dec 29, 2015 at 9:08 PM, David Miller  wrote:
>> From: Rainer Weikusat 
>> Date: Tue, 29 Dec 2015 19:42:36 +
>>
>>> Jacob Siverskog  writes:
>>>> This should fix a NULL pointer dereference I encountered (dump
>>>> below). Since __skb_unlink is called while walking,
>>>> skb_queue_walk_safe should be used.
>>>
>>> The code in question is:
>>  ...
>>> __skb_unlink is only called prior to returning from the function.
>>> Consequently, it won't affect the skb_queue_walk code.
>>
>> Agreed, this patch doesn't fix anything.
>
> Ok. Thanks for your feedback. How do you believe the issue could be
> solved? Investigating it gives:
>
> static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head 
> *list)
> {

[...]

> next->prev = prev;
>  530: e5823004 str r3, [r2, #4]  <--
> trapping instruction (r2 NULL)
>
> Register contents:
> r7 : c58cfe1c  r6 : c06351d0  r5 : c77810ac  r4 : c583eac0
> r3 :   r2 :   r1 :   r0 : 2013
>
> If I understand this correctly, then r4 = skb, r2 = next, r3 = prev.

Some additional information which may be helpful: The next->prev = prev
was pretty obvious from the original error message alone: The invalid
access happened at address 4 but no register contained 4. Considering that this
is for ARM, this must have been caused by an instruction using an
address of the form

[Rx, #4]

ie, value of register x + 4. And the next->prev = prev is the only
access to something located 4 bytes beyond something else.
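
Or, as a sketch of the relevant part of the layout (simplified, only the
fact that the list pointers come first matters here):

	struct sk_buff {
		struct sk_buff *next;	/* offset 0 */
		struct sk_buff *prev;	/* offset 4 on a 32-bit machine */
		/* ... */
	};

With next == NULL, "next->prev = prev" thus stores to address 0 + 4.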

> Should there be a check for this in __skb_try_recv_datagram?

These lists are supposed to be circular, ie, the next pointer of the
last element should point to the first and the prev pointer of the first
to the last. If there's an element with ->next == NULL on the list,
something either didn't do inserts correctly or corrupted an originally
intact list.

General advice: The original error occurred with 4.3.0. Had this
happened to me, I'd either have tried to locate the error in the same kernel
version or to reproduce the bug with the one I was planning to
modify. Trying to fix a 'strange memory access' error which was observed
with version x.y by modifying version x.z is IMHO needlessly moving on
shaky ground.



Re: [PATCH] af_unix: Fix splice-bind deadlock

2015-12-31 Thread Rainer Weikusat
Hannes Frederic Sowa  writes:
> On 27.12.2015 21:13, Rainer Weikusat wrote:
>> -static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
>> +static int unix_mknod(struct dentry *dentry, struct path *path, umode_t 
>> mode,
>> +  struct path *res)
>>   {
>> -struct dentry *dentry;
>> -struct path path;
>> -int err = 0;
>> -/*
>> - * Get the parent directory, calculate the hash for last
>> - * component.
>> - */
>> -dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
>> -err = PTR_ERR(dentry);
>> -if (IS_ERR(dentry))
>> -return err;
>> +int err;
>>
>> -/*
>> - * All right, let's create it.
>> - */
>> -err = security_path_mknod(&path, dentry, mode, 0);
>> +err = security_path_mknod(path, dentry, mode, 0);
>>  if (!err) {
>> -err = vfs_mknod(d_inode(path.dentry), dentry, mode, 0);
>> +err = vfs_mknod(d_inode(path->dentry), dentry, mode, 0);
>>  if (!err) {
>> -res->mnt = mntget(path.mnt);
>> +res->mnt = mntget(path->mnt);
>>  res->dentry = dget(dentry);
>>  }
>>  }
>> -done_path_create(&path, dentry);
>> +
>
> The reordered call to done_path_create will change the locking
> ordering between the i_mutexes and the unix readlock. Can you comment
> on this? On a first sight this looks like a much more dangerous change
> than the original deadlock report. Can't this also conflict with
> splice code deep down in vfs layer?

Practical consideration
---

kern_path_create acquires the i_mutex of the parent directory of the
to-be-created directory entry (via filename_create/ namei.c), as
required for reading a directory or creating a new entry in a directory
(as per Documentation/filesystems/directory-locking). A deadlock was
possible here if the thread doing the bind then blocked when trying to
acquire the readlock while the thread holding the readlock is blocked on
another lock held by a thread trying to perform an operation on the same
directory as the bind (possibly with some indirection). The only 'other
lock' which could come into play here is the pipe lock of a pipe
partaking in a splice_to_pipe from the same AF_UNIX socket. But the idea
that some thread would need to take a pipe lock prior to performing a
directory operation is quite odd (splice_from_pipe_to_directory?
openatparentoffifo?). I've also checked all existing users
of pipe_lock and at least I didn't find one performing a directory
operation.


Theoretical consideration
-

NB: The text below represents my opinion on this after spending a few
days thinking about it (on and off, of course). Making an argument for
the opposite position is also possible.

The filesystem (namespace) is a shared namespace accessible to all
currently running threads/ processes. Whoever uses the filesystem may
have to wait for other filesystem users but threads not using it
shouldn't have to. Because of this and because the filesystem is a
pretty central facility, an operation needing 'some filesystem lock' and
also some other lock (or locks) should always acquire the filesystem
ones before any more specialized locks (as do_splice does when splicing
to a file). If 'filesystem locks' are always acquired first, there's
also no risk of a deadlock caused by code holding a filesystem lock being
blocked on a more specialized lock (eg, a pipe lock or the readlock
mutex) while some other thread holding the/a more specialized lock wants
the already held filesystem lock.


Re: [PATCH] af_unix: Fix splice-bind deadlock

2016-01-03 Thread Rainer Weikusat
Rainer Weikusat  writes:

> Hannes Frederic Sowa  writes:
>> On 27.12.2015 21:13, Rainer Weikusat wrote:
>>> -static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
>>> +static int unix_mknod(struct dentry *dentry, struct path *path, umode_t 
>>> mode,
>>> + struct path *res)
>>>   {
>>> -   struct dentry *dentry;
>>> -   struct path path;
>>> -   int err = 0;
>>> -   /*
>>> -* Get the parent directory, calculate the hash for last
>>> -* component.
>>> -*/
>>> -   dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
>>> -   err = PTR_ERR(dentry);
>>> -   if (IS_ERR(dentry))
>>> -   return err;
>>> +   int err;
>>>
>>> -   /*
>>> -* All right, let's create it.
>>> -*/
>>> -   err = security_path_mknod(&path, dentry, mode, 0);
>>> +   err = security_path_mknod(path, dentry, mode, 0);
>>> if (!err) {
>>> -   err = vfs_mknod(d_inode(path.dentry), dentry, mode, 0);
>>> +   err = vfs_mknod(d_inode(path->dentry), dentry, mode, 0);
>>> if (!err) {
>>> -   res->mnt = mntget(path.mnt);
>>> +   res->mnt = mntget(path->mnt);
>>> res->dentry = dget(dentry);
>>> }
>>> }
>>> -   done_path_create(&path, dentry);
>>> +
>>
>> The reordered call to done_path_create will change the locking
>> ordering between the i_mutexes and the unix readlock. Can you comment
>> on this? On a first sight this looks like a much more dangerous change
>> than the original deadlock report. Can't this also conflict with
>> splice code deep down in vfs layer?
>
> Practical consideration

[...]

> A deadlock was possible here if the thread doing the bind then blocked
> when trying to acquire the readlock while the thread holding the
> readlock is blocked on another lock held by a thread trying to perform
> an operation on the same directory as the bind (possibly with some
> indirection).

Since this was probably pretty much a "write only" sentence, I think I
should try this again (with apologies in case I now err on the other
side and rather explain too much --- my abilities to express myself such
that people understand what I mean to express instead of just getting
mad at me are not great).

For a deadlock to happen here, there needs to be a cycle (circle?) of
threads each holding one lock and blocking while trying to acquire
another lock which ultimately ends with a thread trying to acquire the
i_mutex of the directory where the socket name is to be created. The
binding thread would need to block when trying to acquire the
readlock. But (contrary to what I originally wrote[*]) this cannot happen
because the af_unix code doesn't lock anything non-socket related while
holding the readlock. The only instance of that was in _bind and caused
the deadlock.

[*] I misread

static ssize_t skb_unix_socket_splice(struct sock *sk,
  struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd)
{
int ret;
struct unix_sock *u = unix_sk(sk);

mutex_unlock(&u->readlock);
ret = splice_to_pipe(pipe, spd);
mutex_lock(&u->readlock);

return ret;
}

as 'lock followed by unlock' instead of 'unlock followed by lock'.


Re: [PATCH] af_unix: Fix splice-bind deadlock

2016-01-03 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> + dentry = NULL;
> + if (sun_path[0]) {
> + /* Get the parent directory, calculate the hash for last
> +  * component.
> +  */
> + dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
> +
> + err = PTR_ERR(dentry);
> + if (IS_ERR(dentry))
> + goto out;
> + }
> +

This is wrong because kern_path_create can return with -EEXIST which
needs to be translated to -EADDRINUSE for this case. I'll fix that.


Re: [PATCH] af_unix: Fix splice-bind deadlock

2016-01-03 Thread Rainer Weikusat
On 2015/11/06, Dmitry Vyukov reported a deadlock involving the splice
system call and AF_UNIX sockets, 

http://lists.openwall.net/netdev/2015/11/06/24

The situation was analyzed as

(a while ago) A: socketpair()
B: splice() from a pipe to /mnt/regular_file
does sb_start_write() on /mnt
C: try to freeze /mnt
wait for B to finish with /mnt
A: bind() try to bind our socket to /mnt/new_socket_name
lock our socket, see it not bound yet
decide that it needs to create something in /mnt
try to do sb_start_write() on /mnt, block (it's
waiting for C).
D: splice() from the same pipe to our socket
lock the pipe, see that socket is connected
try to lock the socket, block waiting for A
B:  get around to actually feeding a chunk from
pipe to file, try to lock the pipe.  Deadlock.

on 2015/11/10 by Al Viro,

http://lists.openwall.net/netdev/2015/11/10/4

The patch fixes this by removing the kern_path_create related code from
unix_mknod and executing it as part of unix_bind prior to acquiring the
readlock of the socket in question. This means that A (as used above)
will sb_start_write on /mnt before it acquires the readlock, hence, it
won't indirectly block B which first did a sb_start_write and then
waited for a thread trying to acquire the readlock. Consequently, A
being blocked by C waiting for B won't cause a deadlock anymore
(effectively, both A and B acquire two locks in opposite order in the
situation described above).

Dmitry Vyukov tested the original patch.

Signed-off-by: Rainer Weikusat 
---

Compared to the previous version, this fixes two 'wrong' error returns:
return -EADDRINUSE if kern_path_create returned -EEXIST, and delay
returning an error from kern_path_create until after the u->addr check
as the -EINVAL should IMO take precedence here.

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index b1314c0..e6d3556 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -953,32 +953,20 @@ fail:
return NULL;
 }
 
-static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
+static int unix_mknod(struct dentry *dentry, struct path *path, umode_t mode,
+ struct path *res)
 {
-   struct dentry *dentry;
-   struct path path;
-   int err = 0;
-   /*
-* Get the parent directory, calculate the hash for last
-* component.
-*/
-   dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
-   err = PTR_ERR(dentry);
-   if (IS_ERR(dentry))
-   return err;
+   int err;
 
-   /*
-* All right, let's create it.
-*/
-   err = security_path_mknod(&path, dentry, mode, 0);
+   err = security_path_mknod(path, dentry, mode, 0);
if (!err) {
-   err = vfs_mknod(d_inode(path.dentry), dentry, mode, 0);
+   err = vfs_mknod(d_inode(path->dentry), dentry, mode, 0);
if (!err) {
-   res->mnt = mntget(path.mnt);
+   res->mnt = mntget(path->mnt);
res->dentry = dget(dentry);
}
}
-   done_path_create(&path, dentry);
+
return err;
 }
 
@@ -989,10 +977,12 @@ static int unix_bind(struct socket *sock, struct sockaddr 
*uaddr, int addr_len)
struct unix_sock *u = unix_sk(sk);
struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
char *sun_path = sunaddr->sun_path;
-   int err;
+   int err, name_err;
unsigned int hash;
struct unix_address *addr;
struct hlist_head *list;
+   struct path path;
+   struct dentry *dentry;
 
err = -EINVAL;
if (sunaddr->sun_family != AF_UNIX)
@@ -1008,14 +998,34 @@ static int unix_bind(struct socket *sock, struct 
sockaddr *uaddr, int addr_len)
goto out;
addr_len = err;
 
+   name_err = 0;
+   dentry = NULL;
+   if (sun_path[0]) {
+   /* Get the parent directory, calculate the hash for last
+* component.
+*/
+   dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
+
+   if (IS_ERR(dentry)) {
+   /* delay report until after 'already bound' check */
+   name_err = PTR_ERR(dentry);
+   dentry = NULL;
+   }
+   }
+
err = mutex_lock_interruptible(&u->readlock);
if (err)
-   goto out;
+   goto out_path;
 
err = -EINVAL;
if (u->addr)
goto out_up;
 
+   if (name_err) {
+   err = name_err == -EEXIST ? -EADDRINUSE : name_err;
+   goto out_up;
+   }
+
err = -ENOMEM;
addr = kmalloc(sizeof(*addr)+addr_len, GFP_KERNEL);
if (!addr)
@@ -1026,11 +1036,11 @@ static int unix_bind(struct socket *sock, struct 
sockaddr *uadd

Re: [PATCH] net: Fix potential NULL pointer dereference in __skb_try_recv_datagram

2016-01-04 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Mon, 2016-01-04 at 10:10 +0100, Jacob Siverskog wrote:

[...]

>> I believe the crash occurred between these two actions. I just saw
>> that there are some interesting events in the log prior to the crash:
>> kernel: Bluetooth: Unable to push skb to HCI core(-6)
>> kernel: (stc):  proto stack 4's ->recv failed
>> kernel: (stc): remove_channel_from_table: id 3
>> kernel: (stc): remove_channel_from_table: id 2
>> kernel: (stc): remove_channel_from_table: id 4
>> kernel: (stc):  all chnl_ids unregistered
>> kernel: (stk) :ldisc_install = 0(stc): st_tty_close
>> 
>> The first print is from btwilink.c. However, I can't see the
>> connection between Bluetooth (BLE) and UDP/IPv6 (we're not using
>> 6LoWPAN or anything similar).
>> 
>> Thanks, Jacob
>
> Definitely these details are useful ;)
>
> Could you try :
>
> diff --git a/drivers/misc/ti-st/st_core.c b/drivers/misc/ti-st/st_core.c
> index 6e3af8b42cdd..0c99a74fb895 100644
> --- a/drivers/misc/ti-st/st_core.c
> +++ b/drivers/misc/ti-st/st_core.c
> @@ -912,7 +912,9 @@ void st_core_exit(struct st_data_s *st_gdata)
>   skb_queue_purge(&st_gdata->txq);
>   skb_queue_purge(&st_gdata->tx_waitq);
>   kfree_skb(st_gdata->rx_skb);
> + st_gdata->rx_skb = NULL;
>   kfree_skb(st_gdata->tx_skb);
> + st_gdata->tx_skb = NULL;
>   /* TTY ldisc cleanup */
>   err = tty_unregister_ldisc(N_TI_WL);
>   if (err)

Hmm ... the code continues with

err = tty_unregister_ldisc(N_TI_WL);
if (err)
pr_err("unable to un-register ldisc %ld", err);
/* free the global data pointer */
kfree(st_gdata);

So who would ever see that the rx_skb and tx_skb pointers were cleared
prior to freeing the data structure containing them?


Re: [PATCH] af_unix: Fix splice-bind deadlock

2016-01-06 Thread Rainer Weikusat
Hannes Frederic Sowa  writes:
> On Sun, Jan 3, 2016, at 19:03, Rainer Weikusat wrote:

[reorder i_mutex and readlock locking]

> I was concerned because of the comment in skb_socket_splice:
>
> /* Drop the socket lock, otherwise we have reverse
>  * locking dependencies between sk_lock and i_mutex
>  * here as compared to sendfile(). We enter here
>  * with the socket lock held, and splice_to_pipe() will
>  * grab the pipe inode lock. For sendfile() emulation,
>  * we call into ->sendpage() with the i_mutex lock held
>  * and networking will grab the socket lock.
>  */

AFAICT, this comment is "a bit misleading": sendfile (from file to
socket) is internally implemented as 'splice from file to pipe' +
'splice from pipe to socket'. The latter acquires the pipe lock of the
pipe and then invokes the sendpage method of the socket which acquires
the appropriate socket lock (for an AF_UNIX socket, it's
u->readlock). But 'pipe lock' and 'i_mutex' are two completely
different things: The former is the mutex in a struct pipe_inode_info
(pipe_fs_u.h), the latter is the i_mutex in a struct inode (fs.h).

"Code explains comment" :-)


[PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-09 Thread Rainer Weikusat
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 

---
"Why do things always end up messy and complicated"? Patch is against
4.3.0.


diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4f263e3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int 
flags,
+ void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   &u->peer_wake);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = unix_sk(sk);
+   u_other = unix_sk(other);
+   rc = 0;
+
+   spin_lock(&u_other->peer_wait.lock);
+
+   if (!u->peer_wake.private) {
+   u->peer_wake.private = other;
+   __add_wait_queue(&u_other

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-10 Thread Rainer Weikusat
David Miller  writes:
> From: Rainer Weikusat 
> Date: Mon, 09 Nov 2015 14:40:48 +
>
>> +__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
>> +&u->peer_wake);
>
> This is more simply:
>
>   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);

Thank you for pointing this out.



Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-10 Thread Rainer Weikusat
Jason Baron  writes:
> On 11/09/2015 09:40 AM, Rainer Weikusat wrote:

[...]

>> -if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> +if (!unix_dgram_peer_recv_ready(sk, other)) {
>>  if (!timeo) {
>> -err = -EAGAIN;
>> -goto out_unlock;
>> +if (unix_dgram_peer_wake_me(sk, other)) {
>> +err = -EAGAIN;
>> +goto out_unlock;
>> +}
>> +
>> +goto restart;
>>  }
>
>
> So this will cause 'unix_state_lock(other) to be called twice in a
> row if we 'goto restart' (and hence will softlock the box). It just
> needs a 'unix_state_unlock(other);' before the 'goto restart'.

The goto restart was nonsense to begin with in this code path:
Restarting something is necessary after sleeping for some time but for
the case above, execution just continues. I've changed that (updated
patch should follow 'soon') to

if (!unix_dgram_peer_recv_ready(sk, other)) {
if (timeo) {
timeo = unix_wait_for_peer(other, timeo);

err = sock_intr_errno(timeo);
if (signal_pending(current))
goto out_free;

goto restart;
}

if (unix_dgram_peer_wake_me(sk, other)) {
err = -EAGAIN;
goto out_unlock;
}
}

> I also tested this patch with a single unix server and 200 client
> threads doing roughly epoll() followed by write() until -EAGAIN in a
> loop. The throughput for the test was roughly the same as current
> upstream, but the cpu usage was a lot higher. I think its b/c this patch
> takes the server wait queue lock in the _poll() routine. This causes a
> lot of contention. The previous patch you posted for this where you did
> not clear the wait queue on every wakeup and thus didn't need the queue
> lock in poll() (unless we were adding to it), performed much better.

I'm somewhat unsure what to make of that: The previous patch would also
take the wait queue lock whenever poll was about to return 'not
writable' because of the length of the server receive queue unless
another thread using the same client socket also noticed this and
enqueued this same socket already. And "hundreds of clients using a
single client socket in order to send data to a single server socket"
doesn't seem very realistic to me.

Also, this code shouldn't usually be executed as the server should
usually be capable of keeping up with the data sent by clients. If it's
permanently incapable of that, you're effectively performing a
(successful) DDOS against it, which should result in "high CPU
utilization" in either case. It may be possible to improve this by
tuning/changing the flow control mechanism. Off the top of my head, I'd suggest
making the queue longer (the default value is 10) and delaying wake ups
until the server actually did catch up, IOW, the receive queue is empty
or almost empty. But this ought to be done with a different patch.


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-10 Thread Rainer Weikusat
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 
---

- use wait_queue_t passed as argument to _relay

- fix possible deadlock and logic error in _dgram_sendmsg by
  straightening the control flow ("spaghetti code considered
  confusing")

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4297d8e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int 
flags,
+ void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   q);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = unix_sk(sk);
+   u_othe

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-11 Thread Rainer Weikusat
Hannes Frederic Sowa  writes:
> On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.

[...]

> This whole patch seems pretty complicated to me.
>
> Can't we just remove the unix_recvq_full checks alltogether and unify
> unix_dgram_poll with unix_poll?
>
> If we want to be cautious we could simply make unix_max_dgram_qlen limit
> the number of skbs which are in flight from a sending socket. The skb
> destructor can then decrement this. This seems much simpler.
>
> Would this work?

In the way this is intended to work, cf

http://marc.info/?t=11562760602&r=1&w=2

only if the limit would also apply to sockets which didn't send anything
so far. Which means it'll end up in the exact same situation as before:
Sending something using a certain socket may not be possible because of
data sent by other sockets, so either code trying to send using this
socket ends up busy-waiting for "space again available" despite
trying to use select/ poll/ epoll/ $whatnot to get notified of this
condition and sleep until then or this notification needs to be
propagated to sleeping threads which didn't get to send anything yet.


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-12 Thread Rainer Weikusat
Jason Baron  writes:
>> +
>> +/* Needs sk unix state lock. After recv_ready indicated not ready,
>> + * establish peer_wait connection if still needed.
>> + */
>> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
>> +{
>> +int connected;
>> +
>> +connected = unix_dgram_peer_wake_connect(sk, other);
>> +
>> +if (unix_recvq_full(other))
>> +return 1;
>> +
>> +if (connected)
>> +unix_dgram_peer_wake_disconnect(sk, other);
>> +
>> +return 0;
>> +}
>> +
>
> So the comment above this function says 'needs unix state lock', however
> the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
> in unix_dgram_poll() has the 'sk' lock. So this looks racy.

That's one thing which is broken with this patch. Judging from a 'quick'
look at the _dgram_sendmsg code, the unix_state_lock(other) will need to
be turned into a unix_state_double_lock(sk, other) and the remaining
code changed accordingly (since all of the checks must be done without
unlocking other). 

There's also something else seriously wrong with the present patch: Some
code in unix_dgram_connect presently (with this change) looks like this:

/*
 * If it was connected, reconnect.
 */
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;

if (unix_dgram_peer_wake_disconnect(sk, other))
wake_up_interruptible_poll(sk_sleep(sk),
   POLLOUT |
   POLLWRNORM |
   POLLWRBAND);

unix_state_double_unlock(sk, other);

if (other != old_peer)
unix_dgram_disconnected(sk, old_peer);
sock_put(old_peer);

and trying to disconnect from a peer the socket is just being
connected to is - of course - "flowering tomfoolery" (literal
translation of the German "bluehender Bloedsinn") --- it needs to
disconnect from old_peer instead.
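
A sketch of what the corrected fragment would look like (only the
_wake_disconnect call changes, it now refers to old_peer):

	if (unix_peer(sk)) {
		struct sock *old_peer = unix_peer(sk);
		unix_peer(sk) = other;

		if (unix_dgram_peer_wake_disconnect(sk, old_peer))
			wake_up_interruptible_poll(sk_sleep(sk),
						   POLLOUT |
						   POLLWRNORM |
						   POLLWRBAND);

		unix_state_double_unlock(sk, other);

		if (other != old_peer)
			unix_dgram_disconnected(sk, old_peer);
		sock_put(old_peer);
	}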

I'll address the suggestion and send an updated patch "later today" (may
become "early tomorrow"). I have some code addressing both issues but
that's part of a release of 'our' (3.2.54-based) kernel fork I'll
need to do 'soon'.


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-13 Thread Rainer Weikusat
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 
---
"Believed to be least buggy version"

- disconnect from former peer in _dgram_connect

- use unix_state_double_lock in _dgram_sendmsg to ensure
  recv_ready/ wake_me preconditions are met (noted by Jason
  Baron)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..30e7c56 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int 
flags,
+ void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   q);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+   struct unix_sock *u, *u_other;
+   int rc;
+
+   u = un

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-13 Thread Rainer Weikusat
Hannes Frederic Sowa  writes:
> On Wed, Nov 11, 2015, at 17:12, Rainer Weikusat wrote:
>> Hannes Frederic Sowa  writes:
>> > On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
>> >> An AF_UNIX datagram socket being the client in an n:1 association with
>> >> some server socket is only allowed to send messages to the server if the
>> >> receive queue of this socket contains at most sk_max_ack_backlog
>> >> datagrams.
>> 
>> [...]
>> 
>> > This whole patch seems pretty complicated to me.
>> >
>> > Can't we just remove the unix_recvq_full checks alltogether and unify
>> > unix_dgram_poll with unix_poll?
>> >
>> > If we want to be cautious we could simply make unix_max_dgram_qlen limit
>> > the number of skbs which are in flight from a sending socket. The skb
>> > destructor can then decrement this. This seems much simpler.
>> >
>> > Would this work?
>> 
>> In the way this is intended to work, cf
>> 
>> http://marc.info/?t=11562760602&r=1&w=2
>
> Oh, I see, we don't limit closed but still referenced sockets. This
> actually makes sense on how fd handling is implemented, just as a range
> check.
>
> Have you checked if we can somehow deregister the socket in the poll
> event framework? You wrote that it does not provide such a function but
> maybe it would be easy to add?

I thought about this but this would amount to adding a general interface
for the sole purpose of enabling the af_unix code to talk to the
eventpoll code and I don't really like this idea: IMHO, there should be
at least two users (preferably three) before creating any kind of
'abstract interface'. An even more ideal "castle in the air"
(hypothetical) solution would be "change the eventpoll.c code such that
it won't be affected if a wait queue just goes away". That's at least
theoretically possible (although it might not be in practice).

I wouldn't mind doing that (assuming it was possible) if it was just
for the kernels my employer uses because I'm aware of the uses these
will be put to and in control of the corresponding userland code. But
for "general Linux code", changing epoll in order to help the af_unix
code is more potential trouble than it's worth: Exchanging a relatively
unimportant bug in some module for a much more visibly damaging bug in a
central facility would be a bad tradeoff.


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-15 Thread Rainer Weikusat
Jason Baron  writes:
> On 11/13/2015 01:51 PM, Rainer Weikusat wrote:
>
> [...]
>
>>  
>> -if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> -if (!timeo) {
>> -err = -EAGAIN;
>> -goto out_unlock;
>> -}
>> +if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
>
> Remind me why the 'unix_peer(sk) == other' is added here? If the remote
> is not connected we still want to make sure that we don't overflow the
> the remote rcv queue, right?

Good point. The check is actually wrong there as the original code would
also check the limit in case of an unconnected send to a socket found
via address lookup. It belongs into the 2nd if (were I originally put
it).

>
> In terms of this added 'double' lock for both sk and other, where
> previously we just held the 'other' lock. I think we could continue to
> just hold the 'other' lock unless the remote queue is full, so something
> like:
>
> if (unix_peer(other) != sk && unix_recvq_full(other)) {
> bool need_wakeup = false;
>
>   skipping the blocking case...
>
> err = -EAGAIN;
> if (!other_connected)
> goto out_unlock;
> unix_state_unlock(other);
> unix_state_lock(sk);

That was my original idea. The problem with this is that the code
starting after the _lock and running until the main code path unlock has
to be executed in one go with the other lock held as the results of the
tests above this one may become invalid as soon as the other lock is
released. This means instead of continuing execution with the send code
proper after the block in case other became receive-ready between the
first and the second test (possible because _dgram_recvmsg does not
take the unix state lock), the whole procedure starting with acquiring
the other lock would need to be restarted. Given sufficiently unfavorable
circumstances, this could even turn into an endless loop which couldn't
be interrupted (unless code for this was added).

[...]

> we currently wake the entire queue on every remote read even when we
> have room in the rcv buffer. So this patch will cut down on ctxt
> switching rate dramatically from what we currently have.

In my opinion, this could be improved by making the throttling mechanism
work like a flip flop: If the queue length hits the limit after a
_sendmsg, set a "no more applicants" flag blocking future sends until
cleared (checking the flag would replace the present
check). After the receive queue ran empty (or almost empty),
_dgram_recvmsg would clear the flag and do a wakeup. But this should be
an independent patch (if implemented).
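
A very rough sketch of that flip-flop idea (the backlog_full member is
hypothetical, nothing like it exists in the current code):

	/* unix_dgram_sendmsg, after enqueuing a datagram: */
	if (unix_recvq_full(other))
		unix_sk(other)->backlog_full = 1;

	/* would replace the present unix_recvq_full checks in
	 * sendmsg and poll: */
	if (unix_sk(other)->backlog_full)
		/* block or return -EAGAIN */ ;

	/* unix_dgram_recvmsg, after dequeuing a datagram: */
	if (u->backlog_full && skb_queue_empty(&sk->sk_receive_queue)) {
		u->backlog_full = 0;
		wake_up_interruptible_sync_poll(&u->peer_wait,
						POLLOUT | POLLWRNORM |
						POLLWRBAND);
	}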


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-16 Thread Rainer Weikusat
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 
---
- fix logic in _dgram_sendmsg: queue limit also needs to be
  enforced for unconnected sends

- drop _recv_ready helper function: I'm usually a big fan of
  functional decomposition but in this case, the abstraction
  seemed to obscure things rather than making them easier to
  understand 

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int 
flags,
+ void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   q);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, s

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-16 Thread Rainer Weikusat
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via a wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll()
behavior for EPOLLOUT event": This shouldn't be an issue anymore with
this change even though it restores the "only when writable" behaviour, as
the wake up relay will also be set up once _dgram_sendmsg returned
EAGAIN for a send attempt on an n:1 connected socket.
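
As a sketch only (not part of the patch and not a reliable reproducer),
the epoll flavour of the scenario from the commit message can be written
down in plain C roughly as follows; the socket path is an arbitrary
example and error checking is omitted:

#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_un sun = { .sun_family = AF_UNIX };
	struct epoll_event ev = { .events = EPOLLOUT };
	int srv, cli, ep;
	char c = 0;

	strcpy(sun.sun_path, "/tmp/uaf-demo.sock");	/* arbitrary example path */
	unlink(sun.sun_path);

	srv = socket(AF_UNIX, SOCK_DGRAM, 0);
	bind(srv, (struct sockaddr *)&sun, sizeof(sun));

	cli = socket(AF_UNIX, SOCK_DGRAM, 0);
	connect(cli, (struct sockaddr *)&sun, sizeof(sun));	/* n:1 association */

	ep = epoll_create1(0);
	ev.data.fd = cli;
	/* on an unfixed kernel, unix_dgram_poll registers on the server's
	   peer_wait queue here */
	epoll_ctl(ep, EPOLL_CTL_ADD, cli, &ev);

	close(srv);		/* drop the file's reference to the server socket */
	write(cli, &c, 1);	/* dead peer detected, last reference dropped */

	close(ep);		/* ep_remove_wait_queue now operates on freed memory
				   on a kernel without this patch */
	close(cli);
	return 0;
}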


diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   q);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_int

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-17 Thread Rainer Weikusat
Jason Baron  writes:
> On 11/15/2015 01:32 PM, Rainer Weikusat wrote:
>
>> 
>> That was my original idea. The problem with this is that the code
>> starting after the _lock and running until the main code path unlock has
>> to be executed in one go with the other lock held as the results of the
>> tests above this one may become invalid as soon as the other lock is
>> released. This means instead of continuing execution with the send code
>> proper after the block in case other became receive-ready between the
>> first and the second test (possible because _dgram_recvmsg does not
>> take the unix state lock), the whole procedure starting with acquiring
>> the other lock would need to be restarted. Given sufficiently unfavorable
>> circumstances, this could even turn into an endless loop which couldn't
>> be interrupted. (unless code for this was added). 
>> 
>
> hmmm - I think we can avoid it by doing the wakeup from the write path
> in the rare case that the queue has emptied - and avoid the double lock. IE:
>
> unix_state_unlock(other);
> unix_state_lock(sk);
> err = -EAGAIN;
> if (unix_peer(sk) == other) {
>unix_dgram_peer_wake_connect(sk, other);
>if (skb_queue_len(&other->sk_receive_queue) == 0)
>need_wakeup = true;
> }
> unix_state_unlock(sk);
> if (need_wakeup)
> wake_up_interruptible_poll(sk_sleep(sk), POLLOUT
> | POLLWRNORM | POLLWRBAND);
> goto out_free;

This should probably rather be

if (unix_dgram_peer_wake_connect(sk, other) &&
skb_queue_len(&other->sk_receive_queue) == 0)
need_wakeup = 1;

as there's no need to do the wake up if someone else already connected.
And then, the double lock could be avoided at the expense of returning a
gratuitous EAGAIN to the caller and throwing away all of the work
_dgram_sendmsg did so far, eg, allocating an skb, copying the data into
the kernel and doing all the other checks.

This would enable another thread to do one of the following things in
parallel with the 'locked' part of _dgram_sendmsg

1) connect sk to a socket != other
2) use sk to send to a socket != other
3) do a shutdown on sk
4) determine write-readiness of sk via poll callback

IMHO, the only thing which could possibly matter is 2) and my suggestion
for this would rather be "use a send socket per sending thread if this
matters to you" than "cause something to fail which could as well
have succeeded".
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-17 Thread Rainer Weikusat
David Miller  writes:
> From: Rainer Weikusat 
> Date: Mon, 16 Nov 2015 22:28:40 +
>
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.

[...]

>> Signed-off-by: Rainer Weikusat 
>> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
>
> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

This leaves only the option of a somewhat incorrect solution and what is
or isn't acceptable in this respect is somewhat difficult to decide. The
basic options would be

- return EAGAIN even if sending became possible (Jason's most
  recent suggestions)

- retry sending a limited number of times, eg, once, before
  returning EAGAIN, on the grounds that this is nicer to the
  application and that redoing all the stuff up to the _lock in
  dgram_sendmsg can possibly/ likely be avoided

Which one do you prefer?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-17 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> The basic options would be
>
>   - return EAGAIN even if sending became possible (Jason's most
>   recent suggestions)
>
>   - retry sending a limited number of times, eg, once, before
>   returning EAGAIN, on the grounds that this is nicer to the
>   application and that redoing all the stuff up to the _lock in
>   dgram_sendmsg can possibly/ likely be avoided

A third option: Use trylock to acquire the sk lock. If this succeeds,
there's no risk of deadlocking anyone even if acquiring the locks in the
wrong order. This could look as follows (NB: I didn't even compile this,
I just wrote the code to get an idea how complicated it would be):

int need_wakeup;

[...]

need_wakeup = 0;
err = 0;
if (spin_trylock(&unix_sk(sk)->lock)) {
if (unix_peer(sk) != other ||
unix_dgram_peer_wake_me(sk, other))
err = -EAGAIN;
} else {
err = -EAGAIN;

unix_state_unlock(other);
unix_state_lock(sk);

need_wakeup = unix_peer(sk) != other &&
  unix_dgram_peer_wake_connect(sk, other) &&
  sk_receive_queue_len(other) == 0;
}

unix_state_unlock(sk);

if (err) {
if (need_wakeup)
wake_up_interruptible_poll(sk_sleep(sk),
   POLLOUT |
   POLLWRNORM |
   POLLWRBAND);

goto out_free;
}
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-17 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> This leaves only the option of a somewhat incorrect solution and what is
> or isn't acceptable in this respect is somewhat difficult to decide. The
> basic options would be

[...]
>   - retry sending a limited number of times, eg, once, before
>   returning EAGAIN, on the grounds that this is nicer to the
>   application and that redoing all the stuff up to the _lock in
>   dgram_sendmsg can possibly/ likely be avoided

Since it's better to have a specific example of something: Here's
another 'code sketch' of this option (hopefully with less errors this
time, there's an int restart = 0 above):

if (unix_peer(other) != sk && unix_recvq_full(other)) {
int need_wakeup;


[...]

need_wakeup = 0;
err = 0;
unix_state_unlock(other);
unix_state_lock(sk);

if (unix_peer(sk) == other) {
if (++restart == 2) {
need_wakeup = unix_dgram_peer_wake_connect(sk, other) &&
	      sk_receive_queue_len(other) == 0;
err = -EAGAIN;
} else if (unix_dgram_peer_wake_me(sk, other))
err = -EAGAIN;
} else
err = -EAGAIN;

unix_state_unlock(sk);

if (err || !restart) {
if (need_wakeup)
wake_up_interruptible_poll(sk_sleep(sk),
   POLLOUT |
   POLLWRNORM |
   POLLWRBAND);

goto out_free;
}

goto restart;
}

I don't particularly like that, either, and to me, the best option seems
to be to return the spurious EAGAIN if taking both locks unconditionally
is not an option as that's the simplest choice.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-18 Thread Rainer Weikusat
David Miller  writes:
> From: Rainer Weikusat 
> Date: Mon, 16 Nov 2015 22:28:40 +
>
>> An AF_UNIX datagram socket being the client in an n:1

[...]

> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

Some more information on this: Running the test program included below
20 times on my 'work' system (otherwise idle, after logging in via VT
with no GUI running), a quadcore AMD A10-5700 at 3393.984 MHz, with a
patched 4.3 kernel, resulted in the following throughput statistics[*]:

avg 13.617  M/s
median  13.393  M/s
max 17.14   M/s
min 13.047  M/s
deviation   0.85

I'll try to post the results for 'unpatched' later as I'm also working
on a couple of other things.

[*] I do not use my fingers for counting, hence, these are binary and
not decimal units.


#include 
#include 
#include 
#include 
#include 
#include 
#include 

enum {
	MSG_SZ = 16,
	MSGS = 1000000
};

static char msg[MSG_SZ];

static uint64_t tv2u(struct timeval *tv)
{
uint64_t u;

u = tv->tv_sec;
u *= 1000000;
return u + tv->tv_usec;
}

int main(void)
{
struct timeval start, stop;
uint64_t t_diff;
double rate;
int sks[2];
unsigned remain;
char buf[MSG_SZ];

socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

if (fork() == 0) {
close(*sks);

gettimeofday(&start, 0);
while (read(sks[1], buf, sizeof(buf)) > 0);
gettimeofday(&stop, 0);

t_diff = tv2u(&stop);
t_diff -= tv2u(&start);
rate = MSG_SZ * MSGS;
rate /= t_diff;
rate *= 1000000;
printf("rate %fM/s\n", rate / (1 << 20));

fflush(stdout);
_exit(0);
}

close(sks[1]);

remain = MSGS;
do write(*sks, msg, sizeof(msg)); while (--remain);
close(*sks);

wait(NULL);
return 0;
}
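
Not part of the original mail, but for completeness: the summary numbers
above (avg/median/max/min/deviation) can be produced from the per-run
outputs with a small helper along these lines. It assumes one plain number
per line on stdin (ie, the "rate " prefix stripped) and needs -lm:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

int main(void)
{
	static double v[4096];
	double sum = 0, var = 0, avg;
	size_t n = 0;

	while (n < 4096 && scanf("%lf", &v[n]) == 1)
		sum += v[n++];
	if (!n)
		return 1;

	avg = sum / n;
	for (size_t i = 0; i < n; i++)
		var += (v[i] - avg) * (v[i] - avg);

	qsort(v, n, sizeof(*v), cmp);

	printf("avg\t\t%f\n", avg);
	printf("median\t\t%f\n", v[n / 2]);
	printf("max\t\t%f\n", v[n - 1]);
	printf("min\t\t%f\n", v[0]);
	printf("deviation\t%f\n", sqrt(var / n));
	return 0;
}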
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


more statistics (was: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:))

2015-11-18 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> Some more information on this: Running the test program included below
> on my 'work' system (otherwise idle, after logging in via VT with no GUI
> running)/ quadcore AMD A10-5700, 3393.984 for 20 times/ patched 4.3 resulted 
> in the
> following throughput statistics[*]:

Since the results were too variable with only 20 runs, I've also tested
this with 100 runs for three kernels: stock 4.3, 4.3 plus the published
patch, and 4.3 plus the published patch plus the "just return EAGAIN"
modification. The 1st and the 3rd perform about identically for the
test program I used (slightly modified version included below), the 2nd
is markedly slower. This is most easily visible when grouping the
printed data rates (B/s) 'by millions':

stock 4.3
---------
13,000,000-13,999,999	3	(3%)
14,000,000-14,999,999	82	(82%)
15,000,000-15,999,999	15	(15%)


4.3 + patch
-----------
13,000,000-13,999,999	54	(54%)
14,000,000-14,999,999	35	(35%)
15,000,000-15,999,999	7	(7%)
16,000,000-16,999,999	1	(1%)
18,000,000-18,999,999	1	(1%)
22,000,000-22,999,999	2	(2%)


4.3 + modified patch
--------------------
13,000,000-13,999,999	3	(3%)
14,000,000-14,999,999	82	(82%)
15,000,000-15,999,999	14	(14%)
24,000,000-24,999,999	1	(1%)


IMHO, the 3rd option would be the way to go if this was considered an
acceptable option (ie, even though it returns spurious errors in 'rare
cases').


modified test program
=
#include 
#include 
#include 
#include 
#include 
#include 
#include 

enum {
	MSG_SZ = 16,
	MSGS = 1000000
};

static char msg[MSG_SZ];

static uint64_t tv2u(struct timeval *tv)
{
uint64_t u;

u = tv->tv_sec;
u *= 1000000;
return u + tv->tv_usec;
}

int main(void)
{
struct timeval start, stop;
uint64_t t_diff;
double rate;
int sks[2];
unsigned remain;
char buf[MSG_SZ];

socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

if (fork() == 0) {
close(*sks);

gettimeofday(&start, 0);
while (read(sks[1], buf, sizeof(buf)) > 0);
gettimeofday(&stop, 0);

t_diff = tv2u(&stop);
t_diff -= tv2u(&start);
rate = MSG_SZ * MSGS;
rate /= t_diff;
rate *= 1000000;
printf("%f\n", rate);

fflush(stdout);
_exit(0);
}

close(sks[1]);

remain = MSGS;
do write(*sks, msg, sizeof(msg)); while (--remain);
close(*sks);

wait(NULL);
return 0;
}
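
Again not part of the original mail: the 'by millions' grouping above can
be reproduced from the printed rates with a trivial helper like this (one
number per line on stdin):

#include <stdio.h>

int main(void)
{
	unsigned long count[64] = { 0 }, total = 0;
	double rate;

	while (scanf("%lf", &rate) == 1) {
		unsigned bin = (unsigned)(rate / 1000000.0);

		if (bin < 64) {
			count[bin]++;
			total++;
		}
	}
	if (!total)
		return 1;

	for (unsigned bin = 0; bin < 64; bin++)
		if (count[bin])
			printf("%u,000,000-%u,999,999\t%lu\t(%lu%%)\n",
			       bin, bin, count[bin],
			       count[bin] * 100 / total);
	return 0;
}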
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-19 Thread Rainer Weikusat
Rainer Weikusat  writes:
> Rainer Weikusat  writes:
>
> [...]
>
>> The basic options would be
>>
>>  - return EAGAIN even if sending became possible (Jason's most
>>   recent suggestions)
>>
>>  - retry sending a limited number of times, eg, once, before
>>   returning EAGAIN, on the grounds that this is nicer to the
>>   application and that redoing all the stuff up to the _lock in
>>   dgram_sendmsg can possibly/ likely be avoided
>
> A third option:

A fourth option, and even one that's reasonably simple to implement: In case
other became ready during the checks, drop the other lock, do a double-lock
of sk and other, set a flag variable indicating this and restart the procedure
after the unix_state_lock on other, using the value of the flag to lock/
unlock sk as needed. Should other still be ready to receive data,
execution can then continue with the 'queue it' code as the other lock
was held all the time this time. Combined with a few unlikely
annotations in place where they're IMHO appropriate, this is speed-wise
comparable to the stock kernel.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-19 Thread Rainer Weikusat
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via a wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

This has been created around midnight at the end of a work day. I've
read through it a couple of times and found no errors. But I'm not
"ideally awake and attentive" right now.


diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..6962ff1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   q);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+   struct unix_

Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue (w/ Fixes:)

2015-11-20 Thread Rainer Weikusat
Jason Baron  writes:
> On 11/19/2015 06:52 PM, Rainer Weikusat wrote:
>
> [...]
>
>> @@ -1590,21 +1718,35 @@ restart:
>>  goto out_unlock;
>>  }
>>  
>> -if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> -if (!timeo) {
>> +if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
>> +if (timeo) {
>> +timeo = unix_wait_for_peer(other, timeo);
>> +
>> +err = sock_intr_errno(timeo);
>> +if (signal_pending(current))
>> +goto out_free;
>> +
>> +goto restart;
>> +}
>> +
>> +if (unix_peer(sk) != other ||
>> +unix_dgram_peer_wake_me(sk, other)) {
>>  err = -EAGAIN;
>>  goto out_unlock;
>>  }
>
> Hi,
>
> So here we are calling unix_dgram_peer_wake_me() without the sk lock the 
> first time
> through - right?

Yes. And this is obviously wrong. I spent most of the 'evening time'
(some people would call that 'night time') testing this and didn't
get to read through it again yet. Thank you for pointing this out. I'll
send an updated patch shortly.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-20 Thread Rainer Weikusat
Rainer Weikusat  writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed even though "the
polling mechanism" (in particular, epoll) still holds a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via a wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

- uninvert the lock/ check code in _dgram_sendmsg

- introduce a unix_dgram_peer_wake_disconnect_wakeup helper
  function as there were two calls with a wakeup immediately
  following and two without
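
(The patch body below is truncated in this archive. As a sketch only,
assuming the unix_dgram_peer_wake_disconnect primitive introduced by the
patch, such a helper would presumably amount to no more than:)

static void unix_dgram_peer_wake_disconnect_wakeup(struct sock *sk,
						   struct sock *other)
{
	/* break the peer_wait association and wake our own waiters so
	   nobody keeps sleeping on a stale registration */
	unix_dgram_peer_wake_disconnect(sk, other);
	wake_up_interruptible_poll(sk_sleep(sk),
				   POLLOUT | POLLWRNORM | POLLWRBAND);
}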

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE  0
 #define UNIX_GC_MAYBE_CYCLE1
struct socket_wqpeer_wq;
+   wait_queue_tpeer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3d93b0d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,118 @@ found:
return s;
 }
 
+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hit the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+				      void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+   q);
+   u->peer_wake.private = NULL;
+
+   /* relaying can only happen while the wq still exists */
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static int unix_dgram_peer_wake_connect(str

Re: Use-after-free in ppoll

2015-11-22 Thread Rainer Weikusat
Dmitry Vyukov  writes:
> Hello,
>
> On commit f2d10565b9bdbb722bd43e6e1a759eeddb9645c8 (Nov 20).
>
> The following program triggers use-after-free:
>
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include 
> #include 
> #include 
> #include 
>
> void *thread(void *p)
> {
> syscall(SYS_write, (long)p, 0x2000278ful, 0x1ul, 0, 0, 0);
> return 0;
> }

[...]


> long r1 = syscall(SYS_socketpair, 0x1ul, 0x3ul, 0x0ul,

[...]

> long r5 = syscall(SYS_close, r2, 0, 0, 0, 0, 0);
> pthread_t th;
> pthread_create(&th, 0, thread, (void*)(long)r3);

[...]

> long r21 = syscall(SYS_ppoll, 0x2ffful, 0x3ul, 0x2ffcul, 
> 0x2ffdul, 0x8ul, 0);
> return 0;
> }

That's one of the already known sequences for triggering this issue: The
close will clear the peer pointer of the closed socket, hence, the 2nd
sock_poll_wait will be called by unix_dgram_poll. The write will
execute unix_dgram_sendmsg which detects that the peer is dead and
disconnects from it, causing the corresponding structures to be freed
even though they're still in use.

NB: I didn't execute this but I spent a fair amount of time with the
af_unix.c code during the last couple of weeks and consider myself
"reasonably familiar" with it and that's IMO what should happen here.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Use-after-free in ppoll

2015-11-22 Thread Rainer Weikusat
Dmitry Vyukov  writes:
> On Sun, Nov 22, 2015 at 3:32 PM, Rainer Weikusat
>  wrote:
>> Dmitry Vyukov  writes:
>>> Hello,
>>>
>>> On commit f2d10565b9bdbb722bd43e6e1a759eeddb9645c8 (Nov 20).
>>>
>>> The following program triggers use-after-free:
>>>
>>> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>>> #include 
>>> #include 
>>> #include 
>>> #include 
>>>
>>> void *thread(void *p)
>>> {
>>> syscall(SYS_write, (long)p, 0x2000278ful, 0x1ul, 0, 0, 0);
>>> return 0;
>>> }
>>
>> [...]
>>
>>
>>> long r1 = syscall(SYS_socketpair, 0x1ul, 0x3ul, 0x0ul,
>>
>> [...]
>>
>>> long r5 = syscall(SYS_close, r2, 0, 0, 0, 0, 0);
>>> pthread_t th;
>>> pthread_create(&th, 0, thread, (void*)(long)r3);
>>
>> [...]
>>
>>> long r21 = syscall(SYS_ppoll, 0x2ffful, 0x3ul, 0x2ffcul, 
>>> 0x2ffdul, 0x8ul, 0);
>>> return 0;
>>> }
>>
>> That's one of the already known sequences for triggering this issue:

[...]

> I have not read the code. But I just want to point out that all 3
> reports are different. For example, in the first one, ppoll both frees
> the object and then accesses it. That is, it is not write that frees
> the object.

The call trace is always the same:

[ 2672.994366]  [] __asan_load4+0x6a/0x70
[ 2672.994366]  [] do_raw_spin_lock+0x22/0x220
[ 2672.994366]  [] _raw_spin_lock_irqsave+0x51/0x60
[ 2672.994366]  [] remove_wait_queue+0x18/0x80
[ 2672.994366]  [] poll_freewait+0x7b/0x130
[ 2672.994366]  [] do_sys_poll+0x4dc/0x860
[ 2672.994366]  [] SyS_ppoll+0x1a9/0x310

And if you look at the poll implementation, the important part is this
(fs/select.c, do_sys_poll):

fdcount = do_poll(nfds, head, &table, end_time);
poll_freewait(&table);

do_poll calls the poll routine of the file descriptors which cause
"enqueuing of something" via poll wait callback. For poll, that's the
__pollwait routine in select.c:

static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
poll_table *p)
{
struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
struct poll_table_entry *entry = poll_get_entry(pwq);
if (!entry)
return;
entry->filp = get_file(filp);
entry->wait_address = wait_address;
entry->key = p->_key;
init_waitqueue_func_entry(&entry->wait, pollwake);
entry->wait.private = pwq;
add_wait_queue(wait_address, &entry->wait);
}

because of the close, this routine will be called with the peer_wait
wait_queue_head of the non-closed socket of the socket pair as
wait_address argument. And poll_freewait calls free_poll_entry for all
entries on the poll table which is

static void free_poll_entry(struct poll_table_entry *entry)
{
remove_wait_queue(entry->wait_address, &entry->wait);
fput(entry->filp);
}

but by this time, the wait_address points to freed memory because the
only thing which kept the socket it belonged to alive after the
corresponding file descriptor was closed was the reference the other
socket held. But that was dropped by unix_dgram_sendmsg upon detecting a
dead peer.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Use-after-free in ppoll

2015-11-22 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]


> because of the close, this routine will be called with the peer_wait
> wait_queue_head of the non-closed socket of the socket pair as
> wait_address argument.

This should have been "peer_wait wait_queue_head of the peer of the
non-closed socket, ie, that of the closed socket"...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


alternate queueing mechanism (was: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue)

2015-11-22 Thread Rainer Weikusat
Rainer Weikusat  writes:

[AF_UNIX SOCK_DGRAM throughput]

> It may be possible to improve this by tuning/ changing the flow
> control mechanism. Out of my head, I'd suggest making the queue longer
> (the default value is 10) and delaying wake ups until the server
> actually did catch up, IOW, the receive queue is empty or almost
> empty. But this ought to be done with a different patch.

Because I was curious about the effects, I implemented this using a
design slightly modified from the one I originally suggested, to account
for the different uses of the 'is the receive queue full' check. The
code uses a datagram-specific checking function,

static int unix_dgram_recvq_full(struct sock const *sk)
{
struct unix_sock *u;

u = unix_sk(sk);
if (test_bit(UNIX_DG_FULL, &u->flags))
return 1;

if (!unix_recvq_full(sk))
return 0;

__set_bit(UNIX_DG_FULL, &u->flags);
return 1;
}

which gets called instead of the other for the n:1 datagram checks and a

if (test_bit(UNIX_DG_FULL, &u->flags) &&
!skb_queue_len(&sk->sk_receive_queue)) {
__clear_bit(UNIX_DG_FULL, &u->flags);
wake_up_interruptible_sync_poll(&u->peer_wait,
POLLOUT | POLLWRNORM |
POLLWRBAND);
}

in unix_dgram_recvmsg to delay wakeups until the queued datagrams have
been consumed if the queue overflowed before. This has the additional,
nice side effect that wakeups won't ever be done for 1:1 connected
datagram sockets (both SOCK_DGRAM and SOCK_SEQPACKET) where they're of
no use, anyway.

Compared to a 'stock' 4.3 running the test program I posted (supposed to
make the overhead noticeable by sending lots of small messages), the
average number of bytes sent per second increased by about 782,961.79
(ca 764.61K), about 5.32% of the 4.3 number (14,714,579.91), with a
fairly simple code change.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-23 Thread Rainer Weikusat
David Miller  writes:
> From: Rainer Weikusat 
>> Rainer Weikusat  writes:
>> An AF_UNIX datagram socket being the client in an n:1 association

[...]

> Applied and queued up for -stable,

I'm sorry for this 13th hour request/ suggestion but while thinking
about a reply to Dmitry, it occurred to me that the restart_locked/
sk_locked logic could be avoided by moving the test for this condition
in front of all the others while leaving the 'act on it' code at its
back, ie, reorganize unix_dgram_sendmsg such that it looks like this:

unix_state_lock(other);

if (unix_peer(other) != sk && unix_recvq_full(other)) {
need_wait = 1;

if (!timeo) {
unix_state_unlock(other);
unix_state_double_lock(sk, other);

if (unix_peer(other) == sk ||
(unix_peer(sk) == other &&
 !unix_dgram_peer_wake_me(sk, other)))
need_wait = 0;

unix_state_unlock(sk);
}
}

/* original code here */

if (need_wait) {
if (timeo) {
timeo = unix_wait_for_peer(other, timeo);

err = sock_intr_errno(timeo);
if (signal_pending(current))
goto out_free;

goto restart;
}

err = -EAGAIN;
goto out_unlock;
}

/* original tail here */

This might cause a socket to be enqueued to the peer even though it's not
allowed to send to it but I don't think this matters much. This is a
less conservative modification but one which results in simpler code
overall. The kernel I'm currently running has been modified like this
and 'survived' the usual tests.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-23 Thread Rainer Weikusat
Rainer Weikusat  writes:
> David Miller  writes:

[...]

> I'm sorry for this 13th hour request/ suggestion but while thinking
> about a reply to Dmitry, it occurred to me that the restart_locked/
> sk_locked logic could be avoided by moving the test for this condition
> in front of all the others while leaving the 'act on it' code at its
> back, ie, reorganize unix_dgram_sendmsg such that it looks like this:

[...]

Just in case this is unclear on its own: If this was considered an
improvement by someone other than me, I could supply either a "complete"
patch with this re-arrangement or a cleanup delta patch changing the
previous change.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: use-after-free in sock_wake_async

2015-11-24 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Tue, Nov 24, 2015 at 6:18 AM, Dmitry Vyukov  wrote:
>> Hello,
>>
>> The following program triggers use-after-free in sock_wake_async:

[...]

>> void *thr1(void *arg)
>> {
>> syscall(SYS_close, r2, 0, 0, 0, 0, 0);
>> return 0;
>> }
>>
>> void *thr2(void *arg)
>> {
>> syscall(SYS_write, r3, 0x20003000ul, 0xe7ul, 0, 0, 0);
>> return 0;
>> }

[...]

>> pthread_t th[3];
>> pthread_create(&th[0], 0, thr0, 0);
>> pthread_create(&th[1], 0, thr1, 0);
>> pthread_create(&th[2], 0, thr2, 0);
>> pthread_join(th[0], 0);
>> pthread_join(th[1], 0);
>> pthread_join(th[2], 0);
>> return 0;
>> }

[...]

> Looks like commit 830a1e5c212fb3fdc83b66359c780c3b3a294897 should be reverted 
> ?
>
> commit 830a1e5c212fb3fdc83b66359c780c3b3a294897
> Author: Benjamin LaHaise 
> Date:   Tue Dec 13 23:22:32 2005 -0800
>
> [AF_UNIX]: Remove superfluous reference counting in unix_stream_sendmsg
>
> AF_UNIX stream socket performance on P4 CPUs tends to suffer due to a
> lot of pipeline flushes from atomic operations.  The patch below
> removes the sock_hold() and sock_put() in unix_stream_sendmsg().  This
> should be safe as the socket still holds a reference to its peer which
> is only released after the file descriptor's final user invokes
> unix_release_sock().  The only consideration is that we must add a
> memory barrier before setting the peer initially.
>
> Signed-off-by: Benjamin LaHaise 
> Signed-off-by: David S. Miller 

JFTR: This seems to be unrelated. (As far as I understand this), the
problem is that sk_wake_async accesses sk->sk_socket. That's invoked via
the

other->sk_data_ready(other)

in unix_stream_sendmsg after an

unix_state_unlock(other);

because of this, it can race with the code in unix_release_sock clearing
this pointer (via sock_orphan). The structure this pointer points to is
freed via iput in sock_release (net/socket.c) after the af_unix release
routine returned (it's really one part of a "twin structure" with the
socket inode being the other).
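
For reference, sock_orphan (include/net/sock.h) looks roughly like this in
kernels of that vintage; it is where sk->sk_socket and sk->sk_wq get
cleared:

static inline void sock_orphan(struct sock *sk)
{
	write_lock_bh(&sk->sk_callback_lock);
	sock_set_flag(sk, SOCK_DEAD);
	sk_set_socket(sk, NULL);
	sk->sk_wq = NULL;
	write_unlock_bh(&sk->sk_callback_lock);
}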

A quick way to test if this was true would be to swap the

unix_state_unlock(other);
other->sk_data_ready(other);

in unix_stream_sendmsg and in case it is, a very 'hacky' fix could be to
put a pointer to the socket inode into the struct unix_sock, do an iget
on that in unix_create1 and a corresponding iput in
unix_sock_destructor.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: use-after-free in sock_wake_async

2015-11-24 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> Swap the unix_state_lock and

s/lock/unlock/ :-(
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: use-after-free in sock_wake_async

2015-11-24 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Tue, Nov 24, 2015 at 3:34 PM, Rainer Weikusat
>  wrote:
>> Eric Dumazet  writes:
>>> On Tue, Nov 24, 2015 at 6:18 AM, Dmitry Vyukov  wrote:
>>>> Hello,
>>>>
>>>> The following program triggers use-after-free in sock_wake_async:
>>
>> [...]
>>
>>>> void *thr1(void *arg)
>>>> {
>>>> syscall(SYS_close, r2, 0, 0, 0, 0, 0);
>>>> return 0;
>>>> }
>>>>
>>>> void *thr2(void *arg)
>>>> {
>>>> syscall(SYS_write, r3, 0x20003000ul, 0xe7ul, 0, 0, 0);
>>>> return 0;
>>>> }
>>
>> [...]
>>
>>>> pthread_t th[3];
>>>> pthread_create(&th[0], 0, thr0, 0);
>>>> pthread_create(&th[1], 0, thr1, 0);
>>>> pthread_create(&th[2], 0, thr2, 0);
>>>> pthread_join(th[0], 0);
>>>> pthread_join(th[1], 0);
>>>> pthread_join(th[2], 0);
>>>> return 0;
>>>> }
>>
>> [...]
>>
>>> Looks like commit 830a1e5c212fb3fdc83b66359c780c3b3a294897 should be 
>>> reverted ?
>>>
>>> commit 830a1e5c212fb3fdc83b66359c780c3b3a294897
>>> Author: Benjamin LaHaise 
>>> Date:   Tue Dec 13 23:22:32 2005 -0800
>>>
>>> [AF_UNIX]: Remove superfluous reference counting in unix_stream_sendmsg
>>>
>>> AF_UNIX stream socket performance on P4 CPUs tends to suffer due to a
>>> lot of pipeline flushes from atomic operations.  The patch below
>>> removes the sock_hold() and sock_put() in unix_stream_sendmsg().  This
>>> should be safe as the socket still holds a reference to its peer which
>>> is only released after the file descriptor's final user invokes
>>> unix_release_sock().  The only consideration is that we must add a
>>> memory barrier before setting the peer initially.
>>>
>>> Signed-off-by: Benjamin LaHaise 
>>> Signed-off-by: David S. Miller 
>>
>> JFTR: This seems to be unrelated. (As far as I understand this), the
>> problem is that sk_wake_async accesses sk->sk_socket. That's invoked via
>> the
>>
>> other->sk_data_ready(other)
>>
>> in unix_stream_sendmsg after an
>>
>> unix_state_unlock(other);
>>
>> because of this, it can race with the code in unix_release_sock clearing
>> this pointer (via sock_orphan). The structure this pointer points to is
>> freed via iput in sock_release (net/socket.c) after the af_unix release
>> routine returned (it's really one part of a "twin structure" with the
>> socket inode being the other).
>>
>> A quick way to test if this was true would be to swap the
>>
>> unix_state_unlock(other);
>> other->sk_data_ready(other);
>>
>> in unix_stream_sendmsg and in case it is, a very 'hacky' fix could be to
>> put a pointer to the socket inode into the struct unix_sock, do an iget
>> on that in unix_create1 and a corresponding iput in
>> unix_sock_destructor.
>
> This is interesting, but is not the problem or/and the fix.
>
> We are supposed to own a reference on the 'other' socket or make sure
> it cannot disappear under us.

The af_unix part of this, yes, ie, what gets allocated in
unix_create1. But neither the socket inode nor the struct sock
originally passed to unix_create. Since these are part of the same
umbrella structure, they'll both be freed as consequence of the
sock_release iput. As far as I can tell (I don't claim that I'm
necessarily right on this, this is just the result of spending ca 2h
reading the code with the problem report in mind and looking for
something which could cause it), doing a sock_hold on the unix peer of
the socket in unix_stream_sendmsg is indeed not needed, however, there's
no additional reference to the inode or the struct sock accompanying it,
ie, both of these will be freed by unix_release_sock. This also affects
unix_dgram_sendmsg.

It's also easy to verify: Swap the unix_state_lock and
other->sk_data_ready and see if the issue still occurs. Right now (this
may change after I had some sleep as it's pretty late for me), I don't
think there's another local fix: The ->sk_data_ready accesses a
pointer after the lock taken by the code which will clear and
then later free it has been released.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: use-after-free in sock_wake_async

2015-11-25 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Tue, Nov 24, 2015 at 5:10 PM, Rainer Weikusat
>  wrote:

[...]

>> It's also easy to verify: Swap the unix_state_lock and
>> other->sk_data_ready and see if the issue still occurs. Right now (this
>> may change after I had some sleep as it's pretty late for me), I don't
>> think there's another local fix: The ->sk_data_ready accesses a
>> pointer after the lock taken by the code which will clear and
>> then later free it was released.
>
> It seems that :
>
> int sock_wake_async(struct socket *sock, int how, int band)
>
> should really be changed to
>
> int sock_wake_async(struct socket_wq *wq, int how, int band)
>
> So that RCU rules (already present) apply safely.
>
> sk->sk_socket is inherently racy (that is : racy without using
> sk_callback_lock rwlock )

The comment above sock_wake_async states that

/* This function may be called only under socket lock or callback_lock or 
rcu_lock */

In this case, it's called via sk_wake_async (include/net/sock.h) which
is - in turn - called via sock_def_readable (the 'default' data ready
routine/ net/core/sock.c) which looks like this:

static void sock_def_readable(struct sock *sk)
{
struct socket_wq *wq;

rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
if (wq_has_sleeper(wq))
wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
POLLRDNORM | POLLRDBAND);
sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
rcu_read_unlock();
}

and should thus satisfy the constraint documented by the comment (I
didn't verify if the comment is actually correct, though).

Further - sorry about that - I think changing code in "half of the
network stack" in order to avoid calling a certain routine which will
only ever do something in case someone's using signal-driven I/O with an
already acquired lock held is a terrifying idea. Because of this, I
propose the following alternate patch which should also solve the
problem by ensuring that the ->sk_data_ready activity happens before
unix_release_sock/ sock_release get a chance to clear or free anything
which will be needed.

In case this demonstrably causes other issues, a more complicated
alternate idea (still restricting itself to changes to the af_unix code)
would be to move the socket_wq structure to a dummy struct socket
allocated by unix_release_sock and freed by the destructor.

---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4e95bdf..5c87ea6 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1754,8 +1754,8 @@ restart_locked:
skb_queue_tail(&other->sk_receive_queue, skb);
if (max_level > unix_sk(other)->recursion_level)
unix_sk(other)->recursion_level = max_level;
-   unix_state_unlock(other);
other->sk_data_ready(other);
+   unix_state_unlock(other);
sock_put(other);
scm_destroy(&scm);
return len;
@@ -1860,8 +1860,8 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
skb_queue_tail(&other->sk_receive_queue, skb);
if (max_level > unix_sk(other)->recursion_level)
unix_sk(other)->recursion_level = max_level;
-   unix_state_unlock(other);
other->sk_data_ready(other);
+   unix_state_unlock(other);
sent += size;
}
 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: use-after-free in sock_wake_async

2015-11-25 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Wed, 2015-11-25 at 16:43 +0000, Rainer Weikusat wrote:
>> Eric Dumazet  writes:
>> > On Tue, Nov 24, 2015 at 5:10 PM, Rainer Weikusat
>> >  wrote:
>> 
>> [...]
>> 
>> >> It's also easy to verify: Swap the unix_state_lock and
>> >> other->sk_data_ready and see if the issue still occurs. Right now (this
>> >> may change after I had some sleep as it's pretty late for me), I don't
>> >> think there's another local fix: The ->sk_data_ready accesses a
>> >> pointer after the lock taken by the code which will clear and
>> >> then later free it was released.
>> >
>> > It seems that :
>> >
>> > int sock_wake_async(struct socket *sock, int how, int band)
>> >
>> > should really be changed to
>> >
>> > int sock_wake_async(struct socket_wq *wq, int how, int band)
>> >
>> > So that RCU rules (already present) apply safely.
>> >
>> > sk->sk_socket is inherently racy (that is : racy without using
>> > sk_callback_lock rwlock )
>> 
>> The comment above sock_wake_async states that
>> 
>> /* This function may be called only under socket lock or callback_lock or 
>> rcu_lock */
>> 
>> In this case, it's called via sk_wake_async (include/net/sock.h) which
>> is - in turn - called via sock_def_readable (the 'default' data ready
>> routine/ net/core/sock.c) which looks like this:
>> 
>> static void sock_def_readable(struct sock *sk)
>> {
>>  struct socket_wq *wq;
>> 
>>  rcu_read_lock();
>>  wq = rcu_dereference(sk->sk_wq);
>>  if (wq_has_sleeper(wq))
>>  wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
>>  POLLRDNORM | POLLRDBAND);
>>  sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
>>  rcu_read_unlock();
>> }
>> 
>> and should thus satisfy the constraint documented by the comment (I
>> didn't verify if the comment is actually correct, though).
>> 
>> Further - sorry about that - I think changing code in "half of the
>> network stack" in order to avoid calling a certain routine which will
>> only ever do something in case someone's using signal-driven I/O with an
>> already acquired lock held is a terrifying idea. Because of this, I
>> propose the following alternate patch which should also solve the
>> problem by ensuring that the ->sk_data_ready activity happens before
>> unix_release_sock/ sock_release get a chance to clear or free anything
>> which will be needed.
>> 
>> In case this demonstrably causes other issues, a more complicated
>> alternate idea (still restricting itself to changes to the af_unix code)
>> would be to move the socket_wq structure to a dummy struct socket
>> allocated by unix_release_sock and freed by the destructor.
>> 
>> ---
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index 4e95bdf..5c87ea6 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -1754,8 +1754,8 @@ restart_locked:
>> skb_queue_tail(&other->sk_receive_queue, skb);
>> if (max_level > unix_sk(other)->recursion_level)
>> unix_sk(other)->recursion_level = max_level;
>> -   unix_state_unlock(other);
>> other->sk_data_ready(other);
>> +   unix_state_unlock(other);
>> sock_put(other);
>> scm_destroy(&scm);
>> return len;
>> @@ -1860,8 +1860,8 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
>> skb_queue_tail(&other->sk_receive_queue, skb);
>> if (max_level > unix_sk(other)->recursion_level)
>> unix_sk(other)->recursion_level = max_level;
>> -   unix_state_unlock(other);
>> other->sk_data_ready(other);
>> +   unix_state_unlock(other);
>> sent += size;
>> }
>>  
>
>
> The issue is way more complex than that.
>
> We cannot prevent inode from disappearing.
> We can not safely dereference "(struct socket *)->flags"
>
> locking the 'struct sock' wont help at all.

The inode can't disappear unless it's freed which is done by
sock_release,

void sock_release(struct socket *sock)
{
if (sock->ops) {
struct module *owner = sock->ops->owner;

soc

Re: use-after-free in sock_wake_async

2015-11-25 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Wed, 2015-11-25 at 17:30 +0000, Rainer Weikusat wrote:
>
>> In case this is wrong, it obviously implies that sk_sleep(sk) must not
>> be used anywhere as it accesses the same struct sock, hence, when that
>> can "suddenly" disappear even though locks are used in the way indicated
>> above, there is no safe way to invoke that, either, as it just does a
>> rcu_dereference_raw based on the assumption that the caller knows that
>> the i-node (and the corresponding wait queue) still exist.
>> 
>
> Oh well.
>
> sk_sleep() is not used if the return is NULL

static long unix_stream_data_wait(struct sock *sk, long timeo,
  struct sk_buff *last, unsigned int last_len)
{
struct sk_buff *tail;
DEFINE_WAIT(wait);

unix_state_lock(sk);

for (;;) {
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);

tail = skb_peek_tail(&sk->sk_receive_queue);
if (tail != last ||
(tail && tail->len != last_len) ||
sk->sk_err ||
(sk->sk_shutdown & RCV_SHUTDOWN) ||
signal_pending(current) ||
!timeo)
break;

set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
unix_state_unlock(sk);
timeo = freezable_schedule_timeout(timeo);
unix_state_lock(sk);

if (sock_flag(sk, SOCK_DEAD))
break;

clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
}

finish_wait(sk_sleep(sk), &wait);
unix_state_unlock(sk);
return timeo;
}

Neither prepare_to_wait nor finish_wait checks if the pointer is
null. For the finish_wait case, it shouldn't be null because if
SOCK_DEAD is not found to be set after the unix_state_lock was acquired,
unix_release_sock didn't execute the corresponding code yet, hence,
inode etc will remain available until after the corresponding unlock.

But this isn't true anymore if the inode can go away even though
sock_release couldn't complete yet.
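
For reference, prepare_to_wait (kernel/sched/wait.c) is roughly the
following; the wait queue head handed to it is dereferenced (its spinlock
is taken) without any check:

void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
	unsigned long flags;

	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&q->lock, flags);
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);
}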
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: use-after-free in sock_wake_async

2015-11-25 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Wed, 2015-11-25 at 18:24 +0000, Rainer Weikusat wrote:
>> Eric Dumazet  writes:
>> > On Wed, 2015-11-25 at 17:30 +0000, Rainer Weikusat wrote:
>> >
>> >> In case this is wrong, it obviously implies that sk_sleep(sk) must not
>> >> be used anywhere as it accesses the same struct sock, hence, when that
>> >> can "suddenly" disappear even though locks are used in the way indicated
>> >> above, there is no safe way to invoke that, either, as it just does a
>> >> rcu_dereference_raw based on the assumption that the caller knows that
>> >> the i-node (and the corresponding wait queue) still exist.
>> >> 
>> >
>> > Oh well.
>> >
>> > sk_sleep() is not used if the return is NULL

[...]

>>  finish_wait(sk_sleep(sk), &wait);
>>  unix_state_unlock(sk);
>>  return timeo;
>> }
>> 
>> Neither prepare_to_wait nor finish_wait check if the pointer is
>> null.

[...]

> You are looking at the wrong side.
>
> Of course, the thread 'owning' a socket has a reference on it, so it
> knows sk->sk_socket and sk->sk_wq are not NULL.

I could continue arguing about this but IMHO, it just leads away from
the actual issue. Taking a couple of steps back, therefore,


diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4e95bdf..5c87ea6 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1754,8 +1754,8 @@ restart_locked:
skb_queue_tail(&other->sk_receive_queue, skb);
if (max_level > unix_sk(other)->recursion_level)
unix_sk(other)->recursion_level = max_level;
-   unix_state_unlock(other);
other->sk_data_ready(other);
+   unix_state_unlock(other);
sock_put(other);
scm_destroy(&scm);
return len;
@@ -1860,8 +1860,8 @@ static int unix_stream_sendmsg(struct socket *sock, 
struct msghdr *msg,
skb_queue_tail(&other->sk_receive_queue, skb);
if (max_level > unix_sk(other)->recursion_level)
unix_sk(other)->recursion_level = max_level;
-   unix_state_unlock(other);
other->sk_data_ready(other);
+   unix_state_unlock(other);
sent += size;
}
 
-

I'm convinced this will work for the given problem (I don't claim that
it's technically superior to the larger change in any aspect except that
it's simpler and localized) because

1) The use-after-free occurred while accessing the struct sock allocated
   together with the socket inode.

2) This happened because a close was racing with a write.   

3) The socket inode, struct sock and struct socket_wq are freed by
   sock_destroy_inode.

4) sock_destroy_inode can only be called as consequence of the iput in
   sock_release.

5) sock_release invokes the per-protocol/ family release function before
   doing the iput.

6) unix_release_sock has to acquire the unix_state_lock on the socket
   referred to as other in the code above before it can do anything, in
   particular, before it calls sock_orphan which resets the struct sock and wq
   pointers and also sets the SOCK_DEAD flag.

7) the unix_stream_sendmsg code acquires the unix_state_lock on other
   and then checks the SOCK_DEAD flag. The code above only runs if it
   was not set, hence, the iput in sock_release can't have happened yet
   because a concurrent unix_release_sock must still be blocked on the
   unix_state_lock of other.

If there's an error in this reasoning, I'd very much like to know where
and what it is, not least because the unix_dgram_peer_wake_relay
function I wrote also relies on it being correct (wrt using the
result of sk_sleep outside of rcu_read_lock/rcu_read_unlock).
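To put the reasoning above into a form that can actually be executed, here's
a minimal userspace model of it (a sketch only, not kernel code): a pthread
mutex stands in for unix_state_lock(other), an int for the SOCK_DEAD flag and
a heap buffer for the inode/socket_wq freed by the final iput. Since the
"closer" may only free the buffer after it has taken the lock and marked the
structure dead, a "sender" that finds the flag clear while holding the lock
can safely use the buffer until it unlocks. Building it with
-fsanitize=address -pthread and running it should not report a
use-after-free.

---
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* unix_state_lock(other) */
static int dead;                                          /* SOCK_DEAD */
static char *wq;                                          /* inode/socket_wq freed by iput */

static void *closer(void *unused)
{
	pthread_mutex_lock(&lock);  /* unix_release_sock() */
	dead = 1;                   /* sock_orphan() sets SOCK_DEAD under the lock */
	pthread_mutex_unlock(&lock);
	free(wq);                   /* iput()/sock_destroy_inode only happen afterwards */
	return unused;
}

static void *sender(void *unused)
{
	pthread_mutex_lock(&lock);  /* unix_stream_sendmsg() */
	if (!dead)                  /* sock_flag(other, SOCK_DEAD) test */
		memset(wq, 0, 64);  /* queue skb, other->sk_data_ready(other) */
	pthread_mutex_unlock(&lock);
	return unused;
}

int main(void)
{
	int i;

	for (i = 0; i < 100000; i++) {
		pthread_t a, b;

		wq = malloc(64);
		dead = 0;
		pthread_create(&a, NULL, sender, NULL);
		pthread_create(&b, NULL, closer, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
	}
	puts("no use-after-free observed");
	return 0;
}
---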
--


Re: use-after-free in sock_wake_async

2015-11-25 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Wed, 2015-11-25 at 11:50 -0800, Eric Dumazet wrote:
>
>> > other->sk_data_ready(other);
>> > +   unix_state_unlock(other);
>
>
> Also, problem with such construct is that we wakeup a thread that will
> block on the lock we hold.
>
> Beauty of sk_data_ready() is to call it once we hold no lock any more,
> to enable another cpu to immediately proceed.
>
> In this case, 'other' can not disappear, so it should be safe.

I do agree that keeping the ->sk_data_ready outside of the lock will
very likely have performance advantages. That's just something I
wouldn't have undertaken because I'd be reluctant to make a fairly
complicated change to a lot of code in order to improve performance
unless performance was actually found to be lacking and because it would
step on too many different people's turf.
--


unix_dgram_recvmsg wakeups [test results & programs]

2015-11-26 Thread Rainer Weikusat
[to be followed by a patch]

Currently, unix_dgram_recvmsg does a wake up on the peer_wait queue of a
socket after every received datagram. This seems excessive, especially
considering that only SOCK_DGRAM client sockets in an n:1 association
with a server socket ever wait for the associated condition.

Values are based on 500 runs of either test program.

SOCK_SEQPACKET throughput (B/s)

                stock        patched
        -------------------------------------
avg     14714579.91  17898665.25  +26.64%
median  14409539.95  16429584.38  +14.02%
dev      1582631.52   3475160.01

min     13588814.36  14864694.12
max     25960203     27264811.18


SOCK_DGRAM 100 clients/ 1 server (B/s)

               stock     patched
        -------------------------------
avg     57394.06  58222.93  +1.44%
median  55878.12  56993.55  +2%
dev      3540.02   3488.49

min     54167.33  53165.51
max     67971.15  66466.99


SOCK_SEQPACKET test program
---
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
MSG_SZ =16,
MSGS =  100
};

static char msg[MSG_SZ];

static uint64_t tv2u(struct timeval *tv)
{
uint64_t u;

u = tv->tv_sec;
u *= 1000000;
return u + tv->tv_usec;
}

int main(void)
{
struct timeval start, stop;
uint64_t t_diff;
double rate;
int sks[2];
unsigned remain;
char buf[MSG_SZ];

socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

if (fork() == 0) {
close(*sks);

gettimeofday(&start, 0);
while (read(sks[1], buf, sizeof(buf)) > 0);
gettimeofday(&stop, 0);

t_diff = tv2u(&stop);
t_diff -= tv2u(&start);
rate = MSG_SZ * MSGS;
rate /= t_diff;
rate *= 1000000;
printf("%f\n", rate);

fflush(stdout);
_exit(0);
}

close(sks[1]);

remain = MSGS;
do write(*sks, msg, sizeof(msg)); while (--remain);
close(*sks);

wait(NULL);
return 0;
}
--


SOCK_DGRAM test program
---
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

#define SERVER_ADDR "\0dgt-2"

enum {
MSG_SZ =16,
MSGS =  1000
};

static char msg[MSG_SZ];
static struct sockaddr_un server_sun;

static uint64_t tv2u(struct timeval *tv)
{
uint64_t u;

u = tv->tv_sec;
u *= 1000000;
return u + tv->tv_usec;
}

static void server(int sk, unsigned want)
{
struct timeval start, stop;
char buf[MSG_SZ];
uint64_t t_diff;
double rate;

gettimeofday(&start, 0);
do read(sk, buf, sizeof(buf)); while (--want);
gettimeofday(&stop, 0);

t_diff = tv2u(&stop);
t_diff -= tv2u(&start);
rate = MSG_SZ * MSGS;
rate /= t_diff;
rate *= 1000000;
printf("%f\n", rate);
}

static void client(void)
{
unsigned remain;
int sk;

sk = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk, (struct sockaddr *)&server_sun, sizeof(server_sun));

remain = MSGS;
do write(sk, msg, sizeof(msg)); while (--remain);

_exit(0);
}

int main(int argc, char **argv)
{
unsigned n_clients, want;
int sk;

n_clients = atoi(argv[1]);

server_sun.sun_family = AF_UNIX;
memcpy(server_sun.sun_path, SERVER_ADDR, sizeof(SERVER_ADDR));
sk = socket(AF_UNIX, SOCK_DGRAM, 0);
bind(sk, (struct sockaddr *)&server_sun, sizeof(server_sun));

want = n_clients * MSGS;
do if (fork() == 0) client(); while (--n_clients);

server(sk, want);
return 0;
}
--


[PATCH] unix: use wq_has_sleeper in unix_dgram_recvmsg

2015-11-26 Thread Rainer Weikusat
The current unix_dgram_recvmsg does a wake up for every received
datagram. This seems wasteful as only SOCK_DGRAM client sockets in an
n:1 association with a server socket will ever wait because of the
associated condition. The patch below changes the function such that the
wake up only happens if wq_has_sleeper indicates that someone actually
wants to be notified. Testing with SOCK_SEQPACKET and SOCK_DGRAM sockets
seems to confirm that this is an improvement.

Signed-Off-By: Rainer Weikusat 
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4e95bdf..7aba73e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2057,8 +2057,10 @@ static int unix_dgram_recvmsg(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
-   wake_up_interruptible_sync_poll(&u->peer_wait,
-   POLLOUT | POLLWRNORM | POLLWRBAND);
+   if (wq_has_sleeper(&u->peer_wq))
+   wake_up_interruptible_sync_poll(&u->peer_wait,
+   POLLOUT | POLLWRNORM |
+   POLLWRBAND);
 
if (msg->msg_name)
unix_copy_addr(msg, skb->sk);
--


[RFC PATCH] af_unix: fix entry locking in unix_dgram_recvmsg

2015-11-29 Thread Rainer Weikusat
This will probably earn me a reputation as the most single-minded
monomaniac on this planet (insofar there's still anything to earn in
this respect) but this issue has been irking me "ever since".

NB: This is somewhat more loosely formatted than a proper patch submission in
order to explain the background of the issue. I'd split that up into two
proper patches and do some code cleanups (move/change the
__skb_recv_datagram comment) if there was a chance to get this accepted.

The following little program doesn't work on Linux as the documentation
says it should:

--
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define SERVER_ADDR "\0nonblocked-forever"

int main(void)
{
struct sockaddr_un sun;
int sk, dummy;

sun.sun_family = AF_UNIX;
memcpy(sun.sun_path, SERVER_ADDR, sizeof(SERVER_ADDR));
sk = socket(AF_UNIX, SOCK_DGRAM, 0);
bind(sk, (struct sockaddr *)&sun, sizeof(sun));

if (fork() == 0) {
sleep(5);
recv(sk, &dummy, sizeof(dummy), MSG_DONTWAIT);
kill(getppid(), SIGTERM);

_exit(0);
}

read(sk, &dummy, sizeof(dummy));
return 0;
}
---

It should sleep for 5s and then terminate but instead, it blocks until
either a message is received on the socket or the process is terminated by
a signal. The reason for this is that the blocking read goes to sleep
'forever' in the kernel with the u->readlock mutex held while the later
non-blocking read blocks on this mutex until the first read releases it.
Insofar as I understand the comment in this code block correctly,

err = mutex_lock_interruptible(&u->readlock);
if (unlikely(err)) {
/* recvmsg() in non blocking mode is supposed to return -EAGAIN
 * sk_rcvtimeo is not honored by mutex_lock_interruptible()
 */
err = noblock ? -EAGAIN : -ERESTARTSYS;
goto out;
}

setting a receive timeout for an AF_UNIX datagram socket also doesn't
work as intended because of this: In case of n readers with the same
timeout, the nth reader will end up blocking n times the timeout.

Somewhat annoyingly, this program works as it should:

---
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define SERVER_ADDR "\0working"

int main(void)
{
struct sockaddr_un sun;
int sk, ska, dummy;

sun.sun_family = AF_UNIX;
memcpy(sun.sun_path, SERVER_ADDR, sizeof(SERVER_ADDR));
sk = socket(AF_UNIX, SOCK_STREAM, 0);
bind(sk, (struct sockaddr *)&sun, sizeof(sun));
listen(sk, 10);

if (fork() == 0) {
ska = socket(AF_UNIX, SOCK_STREAM, 0);
connect(ska, (struct sockaddr *)&sun, sizeof(sun));
read(ska, &dummy, sizeof(dummy));

_exit(0);
}

ska = accept(sk, NULL, 0);

if (fork() == 0) {
sleep(5);
recv(ska, &dummy, sizeof(dummy), MSG_DONTWAIT);
kill(getppid(), SIGTERM);

_exit(0);
}

read(ska, &dummy, sizeof(dummy));
return 0;
}


because the SOCK_STREAM receive code releases the mutex before going to
sleep.

Even more annoyingly, the UNIX(*) equivalent of the first program works
(in case a filesystem name is used instead of the 'abstract' one) on
both FreeBSD and Solaris (or did work some time last year when some
people with access to such systems kindly tested this).

-
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

#define SERVER_ADDR "\0nonblocked-forever"

int main(void)
{
struct sockaddr_un sun;
pid_t pid;
int sk, dummy;

sun.sun_family = AF_UNIX;
memcpy(sun.sun_path, SERVER_ADDR, sizeof(SERVER_ADDR));
sk = socket(AF_UNIX, SOCK_DGRAM, 0);
bind(sk, (struct sockaddr *)&sun, sizeof(sun));

if ((pid = fork()) == 0) {
read(sk, &dummy, sizeof(dummy));
_exit(0);
}

sleep(5);

fcntl(sk, F_SETFL, fcntl(sk, F_GETFL) | O_NONBLOCK);
read(sk, &dummy, sizeof(dummy));
kill(pid, SIGTERM);
wait(NULL);

return 0;
}
-

In order to fix this, I propose to split the __skb_recv_datagram routine
into two, one __skb_try_recv_datagram executing the receive logic and
another __skb_wait_for_more_packets (identical to the existing wait
routine). __skb_recv_datagram can then be reimplemented using these two
helper routines and the code in unix_dgram_recvmsg can use a similar
loop but with releasing the readlock as appropriate while still
retaining the "single function with actual datagram receive logic"
property.

This could also help other future protocols which also need to use locks
for protecting access to some per-socket state information the
core/datagram.c code is unaware of.

Signed-Off-By: Rainer Weikusat 
---

Patch is against 4.4.0-rc1-net.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4355129..dce91ac 100644
--- a/include/linux/skbuff.h
+

Re: [RFC PATCH] af_unix: fix entry locking in unix_dgram_recvmsg

2015-12-01 Thread Rainer Weikusat
Rainer Weikusat  writes:

[...]

> Insofar I understand the comment in this code block correctly,
>
> err = mutex_lock_interruptible(&u->readlock);
> if (unlikely(err)) {
> /* recvmsg() in non blocking mode is supposed to return 
> -EAGAIN
>  * sk_rcvtimeo is not honored by mutex_lock_interruptible()
>  */
> err = noblock ? -EAGAIN : -ERESTARTSYS;
> goto out;
> }
>
> setting a receive timeout for an AF_UNIX datagram socket also doesn't
> work as intended because of this: In case of n readers with the same
> timeout, the nth reader will end up blocking n times the timeout.

Test program which confirms this. It starts four concurrent reads on the
same socket with a receive timeout of 3s. This means the whole program
should take a little more than 3s to execute as each read should time
out at about the same time. But it takes 12s instead as the reads
pile up on the readlock mutex and each then gets its own timeout once it
could enter the receive loop.

---
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

#define SERVER_ADDR "\0multi-timeout"
#define RCV_TIMEO   3

static void set_rcv_timeo(int sk)
{
struct timeval tv;

tv.tv_sec = RCV_TIMEO;
tv.tv_usec = 0;
setsockopt(sk, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

int main(void)
{
struct sockaddr_un sun;
struct timeval tv_start, tv_end;
int sk, dummy;

sun.sun_family = AF_UNIX;
memcpy(sun.sun_path, SERVER_ADDR, sizeof(SERVER_ADDR));
sk = socket(AF_UNIX, SOCK_DGRAM, 0);
bind(sk, (struct sockaddr *)&sun, sizeof(sun));
set_rcv_timeo(sk);

gettimeofday(&tv_start, NULL);

if (fork() == 0) {
read(sk, &dummy, sizeof(dummy));
_exit(0);
}

if (fork() == 0) {
read(sk, &dummy, sizeof(dummy));
_exit(0);
}

if (fork() == 0) {
read(sk, &dummy, sizeof(dummy));
_exit(0);
}

read(sk, &dummy, sizeof(dummy));

while (waitpid(-1, NULL, 0) > 0);

gettimeofday(&tv_end, NULL);
printf("Waited for %u timeouts\n",
   (unsigned)((tv_end.tv_sec - tv_start.tv_sec) / RCV_TIMEO));

return 0;
}
--


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-30 Thread Rainer Weikusat
Mathias Krause  writes:
> On 29 September 2015 at 21:09, Jason Baron  wrote:
>> However, if we call connect on socket 's', to connect to a new socket 'o2', 
>> we
>> drop the reference on the original socket 'o'. Thus, we can now close socket
>> 'o' without unregistering from epoll. Then, when we either close the ep
>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>> race per se, but can be triggered sequentially.
>
> Sounds profound, but the reproducers calls connect only once per
> socket. So there is no "connect to a new socket", no?
> But w/e, see below.

In case you want some information on this: This is a kernel warning I
could trigger (more than once) on the single day I could so far spend
looking into this (3.2.54 kernel):

Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 
list_del+0x9/0x30()
Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should be 
88022c38f078, but was dead00100100
Sep 15 19:37:19 doppelsaurus kernel: Modules linked in: snd_hrtimer binfmt_misc 
af_packet nf_conntrack loop snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel 
snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm sg snd_page_alloc 
snd_seq_du
mmy sr_mod snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_timer 
ath9k snd cdrom ath9k_common ath9k_hw r8169 mii ath usb_storage unix
Sep 15 19:37:19 doppelsaurus kernel: Pid: 3340, comm: a.out Tainted: GW 
   3.2.54-saurus-vesa #9
Sep 15 19:37:19 doppelsaurus kernel: Call Trace:
Sep 15 19:37:19 doppelsaurus kernel: [] ? 
__list_del_entry+0x80/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [] ? 
warn_slowpath_common+0x79/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [] ? 
warn_slowpath_fmt+0x45/0x50
Sep 15 19:37:19 doppelsaurus kernel: [] ? list_del+0x9/0x30
Sep 15 19:37:19 doppelsaurus kernel: [] ? 
remove_wait_queue+0x29/0x50
Sep 15 19:37:19 doppelsaurus kernel: [] ? 
ep_unregister_pollwait.isra.9+0x32/0x50
Sep 15 19:37:19 doppelsaurus kernel: [] ? ep_remove+0x2a/0xc0
Sep 15 19:37:19 doppelsaurus kernel: [] ? 
eventpoll_release_file+0x5e/0x90
Sep 15 19:37:19 doppelsaurus kernel: [] ? fput+0x1c6/0x220
Sep 15 19:37:19 doppelsaurus kernel: [] ? filp_close+0x5f/0x90
Sep 15 19:37:19 doppelsaurus kernel: [] ? sys_close+0x86/0xd0
Sep 15 19:37:19 doppelsaurus kernel: [] ? 
system_call_fastpath+0x16/0x1b

The dead00100100 is one of the list poison values a linkage pointer
is set to during/after removal from a list. The particular warning
means that entry->prev (entry being the item to be removed) pointed to
another entry whose next pointer was not the address of entry but
dead00100100. Most likely, there's a list insert racing with a list
remove somewhere here: the insert picks up the pointer to the previous
item while it is still on the list and uses it while the delete removes
it, with the delete having the last word and thus setting prev->next to
dead00100100 after the insert set it to the address of the item to be
inserted.
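For reference, here's a minimal userspace sketch of where that value comes
from (assuming the x86-64 poison constants, ie, 0x00100100/0x00200200 plus
POISON_POINTER_DELTA); it only models the poisoning done by list_del, not
the race itself:

---
#include <stdio.h>

#define LIST_POISON1 ((struct list_head *)0xdead000000100100UL)
#define LIST_POISON2 ((struct list_head *)0xdead000000200200UL)

struct list_head {
	struct list_head *next, *prev;
};

/* kernel-style list_del: unlink the entry, then poison its pointers so
 * that any later use of the stale entry is detectable */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

int main(void)
{
	struct list_head head, item;

	head.next = head.prev = &item;	/* head <-> item */
	item.next = item.prev = &head;

	list_del(&item);

	/* a racing insert which still holds a pointer to 'item' can later
	 * leave some entry's ->prev referring to it; the list debug code then
	 * finds prev->next == LIST_POISON1 instead of the entry being removed */
	printf("item.next after list_del: %p\n", (void *)item.next);
	return 0;
}
---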
--


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-30 Thread Rainer Weikusat
Mathias Krause  writes:
> On 30 September 2015 at 12:56, Rainer Weikusat 
>  wrote:
>> Mathias Krause  writes:
>>> On 29 September 2015 at 21:09, Jason Baron  wrote:
>>>> However, if we call connect on socket 's', to connect to a new socket 
>>>> 'o2', we
>>>> drop the reference on the original socket 'o'. Thus, we can now close 
>>>> socket
>>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>>> race per se, but can be triggered sequentially.
>>>
>>> Sounds profound, but the reproducers calls connect only once per
>>> socket. So there is no "connect to a new socket", no?
>>> But w/e, see below.
>>
>> In case you want some information on this: This is a kernel warning I
>> could trigger (more than once) on the single day I could so far spend
>> looking into this (3.2.54 kernel):
>>
>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 
>> list_del+0x9/0x30()
>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next should 
>> be 88022c38f078, but was dead00100100
>> [snip]
>
> Is that with Jason's patch or a vanilla v3.2.54?

That's a kernel warning which occurred repeatedly (among other "link 
pointer disorganization" warnings) when I tested the "program with
unknown behaviour" you wrote with the kernel I'm currently supporting a
while ago (as I already wrote in the original mail).
--


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-30 Thread Rainer Weikusat
Mathias Krause  writes:
> On 30 September 2015 at 15:25, Rainer Weikusat
>  wrote:
>> Mathias Krause  writes:
>>> On 30 September 2015 at 12:56, Rainer Weikusat 
>>>  wrote:
>>>> In case you want some information on this: This is a kernel warning I
>>>> could trigger (more than once) on the single day I could so far spend
>>>> looking into this (3.2.54 kernel):
>>>>
>>>> Sep 15 19:37:19 doppelsaurus kernel: WARNING: at lib/list_debug.c:53 
>>>> list_del+0x9/0x30()
>>>> Sep 15 19:37:19 doppelsaurus kernel: Hardware name: 500-330nam
>>>> Sep 15 19:37:19 doppelsaurus kernel: list_del corruption. prev->next 
>>>> should be 88022c38f078, but was dead00100100
>>>> [snip]
>>>
>>> Is that with Jason's patch or a vanilla v3.2.54?
>>
>> That's a kernel warning which occurred repeatedly (among other "link
>> pointer disorganization" warnings) when I tested the "program with
>> unknown behaviour" you wrote with the kernel I'm currently supporting a
>> while ago (as I already wrote in the original mail).
>
> So I assume Jason's patch is not included in your kernel.  Then those
> messages are expected; expected even on kernels as old as v2.6.27.
> Can you re-try with Jason's patch applied?

If something needs to be unregistered here (which I don't claim to
know but someone should), the corresponding code would need to be added
to the unix_dgram_disconnected function (as was already pointed
out).
--


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-10-01 Thread Rainer Weikusat
Jason Baron  writes:
> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>> On 29 September 2015 at 21:09, Jason Baron  wrote:
>>> However, if we call connect on socket 's', to connect to a new socket 'o2', 
>>> we
>>> drop the reference on the original socket 'o'. Thus, we can now close socket
>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>> race per se, but can be triggered sequentially.
>> 
>> Sounds profound, but the reproducers calls connect only once per
>> socket. So there is no "connect to a new socket", no?
>> But w/e, see below.
>
> Yes, but it can be reproduced this way too. It can also happen with a
> close() on the remote peer 'o', and a send to 'o' from 's', which the
> reproducer can do as pointed out Michal. The patch I sent deals with
> both cases.

As Michal also pointed out, there's a unix_dgram_disconnected routine
being called in both cases and, insofar as "deregistering" anything beyond
what unix_dgram_disconnected (and - insofar as I can tell -
unix_release_sock) already do is actually required, this would be the
obvious place to add it. A good step on the way to that would be to
write (and post) some test code which actually reproduces the problem in
a predictable way.
--


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-10-01 Thread Rainer Weikusat
Rainer Weikusat  writes:
> Jason Baron  writes:
>> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>>> On 29 September 2015 at 21:09, Jason Baron  wrote:
>>>> However, if we call connect on socket 's', to connect to a new socket 
>>>> 'o2', we
>>>> drop the reference on the original socket 'o'. Thus, we can now close 
>>>> socket
>>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>>> or unregister 'o', we end up with this list corruption. Thus, this is not a
>>>> race per se, but can be triggered sequentially.
>>> 
>>> Sounds profound, but the reproducers calls connect only once per
>>> socket. So there is no "connect to a new socket", no?
>>> But w/e, see below.
>>
>> Yes, but it can be reproduced this way too. It can also happen with a
>> close() on the remote peer 'o', and a send to 'o' from 's', which the
>> reproducer can do as pointed out Michal. The patch I sent deals with
>> both cases.
>
> As Michal also pointed out, there's a unix_dgram_disconnected routine
> being called in both cases and insofar "deregistering" anything beyond
> what unix_dgram_disconnected (and - insofar I can tell this -
> unix_release_sock) already do is actually required, this would be the
> obvious place to add it. A good step on the way to that would be to
> write (and post) some test code which actually reproduces the problem in
> a predictable way.

Test program (assumes that it can execute itself as ./a.out):

-
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int sk;

static void *epoller(void *unused)
{
struct epoll_event epev;
int epfd;

epfd = epoll_create(1);

epev.events = EPOLLOUT;
epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
epoll_wait(epfd, &epev, 1, 5000);

execl("./a.out", "./a.out", (void *)0);

return NULL;
}

int main(void)
{
struct sockaddr_un sun;
pthread_t tid;
int tg0, tg1, rc;

sun.sun_family = AF_UNIX;

tg0 = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg0, (struct sockaddr *)&sun, sizeof(sun));

tg1 = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg1", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg1, (struct sockaddr *)&sun, sizeof(sun));

sk = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk, (struct sockaddr *)&sun, sizeof(sun));

fcntl(sk, F_SETFL, fcntl(sk, F_GETFL) | O_NONBLOCK);

while ((rc = write(sk, "bla", 3)) != -1);

pthread_create(&tid, NULL, epoller, NULL);

usleep(5);

strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
connect(sk, (struct sockaddr *)&sun, sizeof(sun));
close(tg1);

pause();

return 0;
}
-

Inserting a

struct list_head *p;

p = &u->peer_wait.task_list;
WARN_ON(p->next != p || p->prev != p);

into unix_sock_destructor triggers a warning and after running for a
while, one also gets the list corruption warnings, example

Oct  1 12:53:38 doppelsaurus kernel: WARNING: at lib/list_debug.c:54 
list_del+0x9/0x30()
Oct  1 12:53:38 doppelsaurus kernel: Hardware name: 500-330nam
Oct  1 12:53:38 doppelsaurus kernel: list_del corruption. prev->next should be 
880232634420, but was 880224f878c8
Oct  1 12:53:38 doppelsaurus kernel: Modules linked in: snd_hrtimer binfmt_misc 
af_packet nf_conntrack loop snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel 
snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm ath9k ath9k_common 
ath9k_hw snd_seq_dummy snd_page_alloc snd_seq_oss snd_seq_midi_event snd_seq sg 
ath sr_mod snd_seq_device cdrom snd_timer snd r8169 mii usb_storage unix
Oct  1 12:53:38 doppelsaurus kernel: Pid: 4619, comm: a.out Tainted: GW 
   3.2.54-saurus-vesa #18
Oct  1 12:53:38 doppelsaurus kernel: Call Trace:
Oct  1 12:53:38 doppelsaurus kernel: [] ? 
__list_del_entry+0x80/0x100
Oct  1 12:53:38 doppelsaurus kernel: [] ? 
warn_slowpath_common+0x79/0xc0
Oct  1 12:53:38 doppelsaurus kernel: [] ? 
warn_slowpath_fmt+0x45/0x50
Oct  1 12:53:38 doppelsaurus kernel: [] ? list_del+0x9/0x30
Oct  1 12:53:38 doppelsaurus kernel: [] ? 
remove_wait_queue+0x29/0x50
Oct  1 12:53:38 doppelsaurus kernel: [] ? 
ep_unregister_pollwait.isra.9+0x32/0x50
Oct  1 12:53:38 doppelsaurus kernel: [] ? ep_remove+0x2a/0xc0
Oct  1 12:53:38 doppelsaurus kernel: [] ? 
eventpoll_release_file+0x5e/0x90
Oct  1 12:53:38 doppelsaurus kernel: [] ? fput+0x1c6/0x220
Oct  1 12:53:38 doppelsaurus kernel: [] ? filp_close+0x5f/0x90
Oct  1 12:53:38 doppelsaurus kernel: [] ? 
p

Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-10-01 Thread Rainer Weikusat
Rainer Weikusat  writes:
> Rainer Weikusat  writes:
>> Jason Baron  writes:
>>> On 09/30/2015 01:54 AM, Mathias Krause wrote:
>>>> On 29 September 2015 at 21:09, Jason Baron  wrote:
>>>>> However, if we call connect on socket 's', to connect to a new socket 
>>>>> 'o2', we
>>>>> drop the reference on the original socket 'o'. Thus, we can now close 
>>>>> socket
>>>>> 'o' without unregistering from epoll. Then, when we either close the ep
>>>>> or unregister 'o', we end up with this list corruption. Thus, this is not 
>>>>> a
>>>>> race per se, but can be triggered sequentially.

[...]


> Test program (assumes that it can execute itself as ./a.out):
>
> -
> #include <fcntl.h>
> #include <pthread.h>
> #include <string.h>
> #include <sys/epoll.h>
> #include <sys/socket.h>
> #include <sys/un.h>
> #include <unistd.h>
>
> static int sk;
>
> static void *epoller(void *unused)
> {
> struct epoll_event epev;
> int epfd;
> 
> epfd = epoll_create(1);
>
> epev.events = EPOLLOUT;
> epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
> epoll_wait(epfd, &epev, 1, 5000);
>
> execl("./a.out", "./a.out", (void *)0);
>
> return NULL;
> }

[...]

Possibly interesting additional bit of information: The list corruption
warnings appear only if the 2nd connect is there and both the sk and
epfd file descriptors are left open across the exec. Closing either of
them still triggers the _destructor warnings but nothing else (until the
process runs out of file descriptors).
--


Re: [PATCH] unix: fix use-after-free with unix_dgram_poll()

2015-10-02 Thread Rainer Weikusat
Jason Baron  writes:
> From: Jason Baron 
>
> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
> queue associated with the socket s that we've called poll() on, but it also
> calls sock_poll_wait() for a remote peer socket's wait queue, if it's 
> connected.
> Thus, if we call poll()/select()/epoll() for the socket s, there are then
> a couple of code paths in which the remote peer socket s2 and its associated
> peer_wait queue can be freed before poll()/select()/epoll() have a chance
> to remove themselves from this remote peer socket s2's wait queue.

[...]

> This works because we will continue to get POLLOUT wakeups from
> unix_write_space(), which is called via sock_wfree().

As pointed out in my original comment, this doesn't work (as far as I
can/ could tell) because it will only wake up sockets which had a chance
to enqueue datagrams to the queue of the receiving socket as only
skbuffs enqueued there will be consumed. A socket which is really
waiting for space in the receiving queue won't ever be woken up in this
way.

Further, considering that you're demonstrably not interested in
debugging and fixing this issue (as you haven't even bothered to post
one of the test programs you claim to have), I'm beginning to wonder why
this tripe is being sent to me at all --- it's not "git on autopilot"
this time as someone took the time to dig up my current e-mail address
as the one in the original commit is not valid anymore. Could you please
refrain from such exercises in future unless a discussion is actually
intended?


--


Re: [PATCH] unix: fix use-after-free with unix_dgram_poll()

2015-10-02 Thread Rainer Weikusat
Rainer Weikusat  writes:
> Jason Baron  writes:
>> From: Jason Baron 
>>
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we've called poll() on, but it also
>> calls sock_poll_wait() for a remote peer socket's wait queue, if it's 
>> connected.
>> Thus, if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket s2 and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from this remote peer socket s2's wait queue.
>
> [...]
>
>> This works because we will continue to get POLLOUT wakeups from
>> unix_write_space(), which is called via sock_wfree().
>
> As pointed out in my original comment, this doesn't work (as far as I
> can/ could tell) because it will only wake up sockets which had a chance
> to enqueue datagrams to the queue of the receiving socket as only
> skbuffs enqueued there will be consumed. A socket which is really
> waiting for space in the receiving queue won't ever be woken up in this
> way.

Program which shows that (on 3.2.54 + "local modification", with the 2nd
sock_poll_wait commented out):

---
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
struct sockaddr_un sun;
struct pollfd pfd;
int tg, sk0, sk1, rc;
char buf[16];

sun.sun_family = AF_UNIX;

tg = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg, (struct sockaddr *)&sun, sizeof(sun));

sk0 = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk0, (struct sockaddr *)&sun, sizeof(sun));

sk1 = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk1, (struct sockaddr *)&sun, sizeof(sun));

fcntl(sk0, F_SETFL, fcntl(sk0, F_GETFL) | O_NONBLOCK);
fcntl(sk1, F_SETFL, fcntl(sk1, F_GETFL) | O_NONBLOCK);

while (write(sk0, "bla", 3) != -1);

if (fork() == 0) {
pfd.fd = sk1;
pfd.events = POLLOUT;
rc = poll(&pfd, 1, -1);

_exit(0);
}

sleep(3);
read(tg, buf, sizeof(buf));
wait(&rc);

return 0;
}


For me, this blocks forever while it should terminate as soon as the
datagram was read. Something else may have changed this behaviour in the
meantime, though.
--


Re: [PATCH] unix: fix use-after-free with unix_dgram_poll()

2015-10-02 Thread Rainer Weikusat
Jason Baron  writes:
> On 10/02/2015 03:30 PM, Rainer Weikusat wrote:
>> Jason Baron  writes:
>>> From: Jason Baron 
>>>
>>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>>> queue associated with the socket s that we've called poll() on, but it also
>>> calls sock_poll_wait() for a remote peer socket's wait queue, if it's 
>>> connected.
>>> Thus, if we call poll()/select()/epoll() for the socket s, there are then
>>> a couple of code paths in which the remote peer socket s2 and its associated
>>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>>> to remove themselves from this remote peer socket s2's wait queue.
>> 
>> [...]
>> 
>>> This works because we will continue to get POLLOUT wakeups from
>>> unix_write_space(), which is called via sock_wfree().
>> 
>> As pointed out in my original comment, this doesn't work (as far as I
>> can/ could tell) because it will only wake up sockets which had a chance
>> to enqueue datagrams to the queue of the receiving socket as only
>> skbuffs enqueued there will be consumed. A socket which is really
>> waiting for space in the receiving queue won't ever be woken up in this
>> way.
>
> Ok, good point. I was hoping to avoid a more complex approach here. I think
> then that the patch I posted in the previous thread on this would address
> this concern. I will post it for review.

Some comments on that: From what I remember, this introduced another
wait queue solely for "peer events" in the connecting socket and
enqueued it there on connect. I think this should use the peer_wait
queue because that's what its purpose seems to be and it should also
only be put onto this wait queue if it's actually interested, similar to
the way this is handled in unix_dgram_sendmsg (via
unix_wait_for_peer). But this (likely) implies it would be necessary to
get rid of the second registration in unix_dgram_disconnected (which
gets called if a datagram socket disconnects from another) which may not
be feasible.

Insofar as this stays an issue, I'll put more work into this but right now,
my "work" (as in "stuff I'm supposed to do for the people who pay me")
priorities are something rather different.
--


Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-03 Thread Rainer Weikusat
Mathias Krause  writes:
> On 2 October 2015 at 22:43, Jason Baron  wrote:
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also 
>> calls

[useless full-quote removed]

> My reproducer runs on this patch for more than 3 days now without
> triggering anything anymore.

Since the behaviour of your program is random, using it to "test"
anything doesn't really provide any insight: It could have been
executing the same codepath which doesn't happen to trigger any problems
for all of these three days. Nobody can tell.
--


Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-04 Thread Rainer Weikusat
Rainer Weikusat  writes:

> Mathias Krause  writes:
>> On 2 October 2015 at 22:43, Jason Baron  wrote:
>>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>>> queue associated with the socket s that we are poll'ing against, but also 
>>> calls
>
> [useless full-quote removed]
>
>> My reproducer runs on this patch for more than 3 days now without
>> triggering anything anymore.
>
> Since the behaviour of your program is random, using it to "test"
> anything doesn't really provide any insight: It could have been
> executing the same codepath which doesn't happen to trigger any problems
> for all of these three days. Nobody can tell.

Since this "strangely" seems to have been lost in the thread: Here's the
test program showing that the reconnect while in epoll actually causes a
problem (at least I think so):


#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int sk, tg0, tg1;

static void *epoller(void *unused)
{
struct epoll_event epev;
int epfd;

epfd = epoll_create(1);
if (epfd == -1) exit(0);

epev.events = EPOLLOUT;
epoll_ctl(epfd, EPOLL_CTL_ADD, sk, &epev);
epoll_wait(epfd, &epev, 1, 5000);

close(sk);

execl("./a.out", "./a.out", (void *)0);

return NULL;
}

int main(void)
{
struct sockaddr_un sun;
pthread_t tid;
int rc;

sun.sun_family = AF_UNIX;

tg0 = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg0, (struct sockaddr *)&sun, sizeof(sun));

tg1 = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg1", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg1, (struct sockaddr *)&sun, sizeof(sun));

sk = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk, (struct sockaddr *)&sun, sizeof(sun));

fcntl(sk, F_SETFL, fcntl(sk, F_GETFL) | O_NONBLOCK);

while ((rc = write(sk, "bla", 3)) != -1);

pthread_create(&tid, NULL, epoller, NULL);

usleep(5);

strncpy(sun.sun_path, "/tmp/tg0", sizeof(sun.sun_path));
connect(sk, (struct sockaddr *)&sun, sizeof(sun));
close(tg1);

pause();

return 0;
}
--

And here the other demonstrating the poller not being woken up despite
it could write something:

--
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
struct sockaddr_un sun;
struct pollfd pfd;
int tg, sk0, sk1, rc;
char buf[16];

sun.sun_family = AF_UNIX;

tg = socket(AF_UNIX, SOCK_DGRAM, 0);
strncpy(sun.sun_path, "/tmp/tg", sizeof(sun.sun_path));
unlink(sun.sun_path);
bind(tg, (struct sockaddr *)&sun, sizeof(sun));

sk0 = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk0, (struct sockaddr *)&sun, sizeof(sun));

sk1 = socket(AF_UNIX, SOCK_DGRAM, 0);
connect(sk1, (struct sockaddr *)&sun, sizeof(sun));

fcntl(sk0, F_SETFL, fcntl(sk0, F_GETFL) | O_NONBLOCK);
fcntl(sk1, F_SETFL, fcntl(sk1, F_GETFL) | O_NONBLOCK);

while (write(sk0, "bla", 3) != -1);

if (fork() == 0) {
pfd.fd = sk1;
pfd.events = POLLOUT;
rc = poll(&pfd, 1, -1);

_exit(0);
}

sleep(3);
read(tg, buf, sizeof(buf));
wait(&rc);

return 0;
}
---
--


Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-05 Thread Rainer Weikusat
Jason Baron  writes:
> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
> queue associated with the socket s that we are poll'ing against, but also 
> calls
> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
> if we call poll()/select()/epoll() for the socket s, there are then
> a couple of code paths in which the remote peer socket p and its associated
> peer_wait queue can be freed before poll()/select()/epoll() have a chance
> to remove themselves from the remote peer socket.
>
> The way that remote peer socket can be freed are:
>
> 1. If s calls connect() to a connect to a new socket other than p, it will
> drop its reference on p, and thus a close() on p will free it.
>
> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
> the final reference to p, allowing it to be freed.

Here's a more simple idea which _might_ work. The underlying problem
seems to be that the second sock_poll_wait introduces a covert reference
to the peer socket which isn't accounted for. The basic idea behind this
is to execute an additional sock_hold for the peer whenever the
sock_poll_wait is called for it and store it (the struct sock *) in a
new struct unix_sock member. Upon entering unix_dgram_poll, if the
member is not NULL, it's cleared and a sock_put for its former value is
done. The 'poll peer not NULL -> sock_put it' code is also added to the
destructor, although I'm unsure if this is really necessary. The patch
below also includes the additional SOCK_DEAD test suggested by Martin as
that seems generally sensible to me.

NB: This has survived both Martin's and my test programs for a number
of executions/ longer periods of time than was common before without
generating list corruption warnings. The patch below is against 'my'
3.2.54 and is here provided as a suggestion in the hope that it will be
useful for something, not as a patch submission, as I spent less time
thinking through this than I should ideally have but despite this,
it's another 2.5 hours of my life spent on something completely
different than what I should be working on at the moment.

--
diff -pru linux-2-6/include/net/af_unix.h linux-2-6.p/include/net/af_unix.h
--- linux-2-6/include/net/af_unix.h 2014-01-20 21:52:53.0 +
+++ linux-2-6.p/include/net/af_unix.h   2015-10-05 15:11:20.270958297 +0100
@@ -50,6 +50,7 @@ struct unix_sock {
struct vfsmount *mnt;
struct mutexreadlock;
struct sock *peer;
+   struct sock *poll_peer;
struct sock *other;
struct list_headlink;
atomic_long_t   inflight;
diff -pru linux-2-6/net/unix/af_unix.c linux-2-6.p/net/unix/af_unix.c
--- linux-2-6/net/unix/af_unix.c2014-01-22 16:51:52.0 +
+++ linux-2-6.p/net/unix/af_unix.c  2015-10-05 17:05:28.358273567 +0100
@@ -361,6 +361,9 @@ static void unix_sock_destructor(struct
if (u->addr)
unix_release_addr(u->addr);
 
+   if (u->poll_peer)
+   sock_put(u->poll_peer);
+
atomic_long_dec(&unix_nr_socks);
local_bh_disable();
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
@@ -625,6 +628,7 @@ static struct sock *unix_create1(struct
u = unix_sk(sk);
u->dentry = NULL;
u->mnt= NULL;
+   u->poll_peer = NULL;
spin_lock_init(&u->lock);
atomic_long_set(&u->inflight, 0);
INIT_LIST_HEAD(&u->link);
@@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
poll_table *wait)
 {
-   struct sock *sk = sock->sk, *other;
-   unsigned int mask, writable;
+   struct sock *sk = sock->sk, *other, *pp;
+   struct unix_sock *u;
+   unsigned int mask, writable, dead;
+
+   u = unix_sk(sk);
+   pp = u->poll_peer;
+   if (pp) {
+   u->poll_peer = NULL;
+   sock_put(pp);
+   }
 
sock_poll_wait(file, sk_sleep(sk), wait);
mask = 0;
@@ -2170,7 +2182,20 @@ static unsigned int unix_dgram_poll(stru
other = unix_peer_get(sk);
if (other) {
if (unix_peer(other) != sk) {
-   sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
+   unix_state_lock(other);
+
+   dead = sock_flag(other, SOCK_DEAD);
+   if (!dead)
+   sock_poll_wait(file, &unix_sk(other)->peer_wait,
+  wait);
+
+   unix_state_unlock(other);
+
+   if (!dead) {
+   u->poll_peer = other;
+   sock_hold(other);
+   }
+
if (unix_recvq_full(other))
writable =

Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-05 Thread Rainer Weikusat
Eric Dumazet  writes:
> On Mon, 2015-10-05 at 17:31 +0100, Rainer Weikusat wrote:
>
>>  atomic_long_set(&u->inflight, 0);
>>  INIT_LIST_HEAD(&u->link);
>> @@ -2135,8 +2139,16 @@ static unsigned int unix_poll(struct fil
>>  static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>>  poll_table *wait)
>>  {
>> -struct sock *sk = sock->sk, *other;
>> -unsigned int mask, writable;
>> +struct sock *sk = sock->sk, *other, *pp;
>> +struct unix_sock *u;
>> +unsigned int mask, writable, dead;
>> +
>> +u = unix_sk(sk);
>> +pp = u->poll_peer;
>> +if (pp) {
>> +u->poll_peer = NULL;
>> +sock_put(pp);
>> +}
>
>
> This looks racy.
> Multiple threads could use poll() at the same time,
> and you would have too many sock_put()

That's one of the reasons why I wrote "might work": The use of a single
structure member without any locking for the sock_poll_wait suggests
that this is taken care of in some other way, as does the absence of any
comment about that in the 'public' LDDs ("Linux Device Drivers").
However, I don't really know if this is true. If not, this simple idea
can't work.
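For illustration, the interleaving meant here, simulated sequentially in
userspace (names loosely modelled on the draft above, not actual kernel
code): two poll calls both load poll_peer before either clears it, so the
one reference accounted to it gets dropped twice.

---
#include <stdio.h>

struct sock {
	int refcount;
};

static struct sock peer = { .refcount = 2 };	/* one ref held via poll_peer, one held elsewhere */
static struct sock *poll_peer = &peer;		/* u->poll_peer from the draft patch */

static void sock_put(struct sock *sk)
{
	sk->refcount--;
}

int main(void)
{
	/* both threads execute the 'pp = u->poll_peer' load first ... */
	struct sock *pp_a = poll_peer;
	struct sock *pp_b = poll_peer;

	/* ... and only then the 'u->poll_peer = NULL; sock_put(pp);' part */
	if (pp_a) {
		poll_peer = NULL;
		sock_put(pp_a);
	}
	if (pp_b) {
		poll_peer = NULL;
		sock_put(pp_b);
	}

	/* the reference held via poll_peer has now been dropped twice */
	printf("peer.refcount = %d (one sock_put too many)\n", peer.refcount);
	return 0;
}
---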
--


Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()

2015-10-11 Thread Rainer Weikusat
Hannes Frederic Sowa  writes:
> Jason Baron  writes:
>
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also 
>> calls
>> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
>> if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket p and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from the remote peer socket.
>>
>> The way that remote peer socket can be freed are:
>>
>> 1. If s calls connect() to a connect to a new socket other than p, it will
>> drop its reference on p, and thus a close() on p will free it.
>>
>> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
>> the final reference to p, allowing it to be freed.
>>
>> Address this issue, by reverting unix_dgram_poll() to only register with
>> the wait queue associated with s and register a callback with the remote peer
>> socket on connect() that will wake up the wait queue associated with s. If
>> scenarios 1 or 2 occur above we then simply remove the callback from the
>> remote peer. This then presents the expected semantics to poll()/select()/
>> epoll().
>>
>> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
>> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
>>
>> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
>> DGRAM sockets").
>>
>> Tested-by: Mathias Krause 
>> Signed-off-by: Jason Baron 
>
> While I think this approach works, I haven't seen where the current code
> leaks a reference.

It doesn't "leak a reference" (strictly). It possibly registers a wait
queue belonging to the peer socket of the socket poll was called on with
whatever invoked the poll routine. And the inherent problem with that is
that the lifetime of the peer socket is not necessarily the same as
the lifetime of the polled socket. If the polled socket is disconnected
from its peer while still being polled (or registered for being
polled), the former peer may be freed even though the polling code (of
whatever provenience) still references the peer_wait member of the unix
socket structure of this socket.

As pointed out in the original mail, two ways for this to happen are to
call connect on the polled socket or to cause a unix_dgram_sendmsg call
on it after the peer socket was closed.
--


Re: [PATCH v4 0/3] net: unix: fix use-after-free

2015-10-12 Thread Rainer Weikusat
David Miller  writes:
> From: Jason Baron 
> Date: Fri,  9 Oct 2015 00:15:59 -0400
>
>> These patches are against mainline, I can re-base to net-next, please
>> let me know.
>> 
>> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
>> which causes the use-after-free quite quickly and here:
>> https://lkml.org/lkml/2015/10/2/693.
>
> I'd like to understand how patches that don't even compile can be
> "tested"?
>
> net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this 
> function)
> net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only 
> once for each function it appears in
>
> Could you explain how that works, I'm having a hard time understanding
> this?

This is basically a workaround for the problem that it's not possible
to tell epoll to let go of a certain wait queue: Instead of registering
the peer_wait queue via sock_poll_wait, a wait_queue_t under control of
the af_unix.c code is linked onto it which relays a wake up on the
peer_wait queue to the 'ordinary' wait queue associated with the polled
socket via a custom wake function. But (at least the code I looked at) it
enqueues a unix socket on connect which has certain side effects (in
particular, /dev/log will have a seriously large wait queue of entirely
uninterested peers) and in many cases, this is simply not necessary, as
the additional peer_wait event is only interesting in case a peer of a
fan-in socket (like /dev/log) happens to be waiting for writability via
poll/ select/ epoll/ ...

Since the wait queue handling code is now under control of the af_unix.c
code, it can remove itself from the peer_wait queue prior to dropping
its reference to a peer on disconnect or on detecting a dead peer in
unix_dgram_sendmsg.
--


Re: [PATCH v2 1/3] unix: fix use-after-free in unix_dgram_poll()

2015-10-12 Thread Rainer Weikusat
Jason Baron  writes:
> On 10/05/2015 12:31 PM, Rainer Weikusat wrote:

[...]

>> Here's a more simple idea which _might_ work. The underlying problem
>> seems to be that the second sock_poll_wait introduces a covert reference
>> to the peer socket which isn't accounted for. The basic idea behind this
>> is to execute an additional sock_hold for the peer whenever the
>> sock_poll_wait is called for it and store it (the struct sock *) in a
>> new struct unix_sock member.

[...]

> Interesting - will this work for the test case you supplied where we are in
> epoll_wait() and another thread does a connect() to a new target? In that
> case, even if we issue a wakeup to the epoll thread, its not going to have
> a non-NULL poll_table, so it wouldn't be added to the new target. IE
> the first test case here:
>
> https://lkml.org/lkml/2015/10/4/154

"Indeed it would not." I've also meanwhile found the time to check what
is and isn't locked here and found that Eric's "this looks racy" was
also justified. In theory, a clean solution could be based on modifying
the various polling implementations to keep a piece of data for a polled
something and provide that again on each subsequent poll call. This
could then be used to keep the peer socket alive for as long as
necessary were it possible to change the set of wait queues with every
poll call. Since this also isn't the case, the idea to increment the
reference count of the peer socket won't fly.

OTOH, something I seriously dislike about your relaying implementation
is the unconditional add_wait_queue upon connect as this builds up a
possibly large wait queue of entities entirely uninterested in the
event which will need to be traversed even if peer_wait wakeups will
only happen if at least someone actually wants to be notified. This
could be changed such that the struct unix_sock member is only put onto
the peer's wait queue in poll and only if it hasn't already been put
onto it. The connection could then be severed if some kind of
'disconnect' occurs.

The code below (again based on 3.2.54) is what I'm currently running and
it has survived testing during the day (without trying the exercise in
hexadecimal as that doesn't cause failure for me, anyway). The wakeup
relaying function checks that a socket wait queue actually still exists
because I used to get null pointer oopses without it every now and then
(I've also tested this with an additional printk printing 'no q' in case
the pointer was actually null to verify that this really occurs here).

-
--- linux-2-6.b/net/unix/af_unix.c  2015-10-12 21:07:27.237102277 +0100
+++ linux-2-6/net/unix/af_unix.c2015-10-12 21:10:26.319384298 +0100
@@ -303,6 +303,51 @@ found:
return s;
 }
 
+static int unix_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+   void *key)
+{
+   struct unix_sock *u;
+   wait_queue_head_t *u_sleep;
+
+   u = container_of(q, struct unix_sock, peer_wake);
+
+   u_sleep = sk_sleep(&u->sk);
+   if (u_sleep)
+   wake_up_interruptible_poll(u_sleep, key);
+
+   return 0;
+}
+
+static void unix_peer_wake_connect(struct sock *me, struct sock *peer)
+{
+   struct unix_sock *u, *u_peer;
+
+   u = unix_sk(me);
+   u_peer = unix_sk(peer);
+
+   if (!u->peer_wake.private)
+   return;
+
+   u->peer_wake.private = peer;
+   add_wait_queue(&u_peer->peer_wait, &u->peer_wake);
+}
+
+static int unix_peer_wake_disconnect(struct sock *me, struct sock *peer)
+{
+   struct unix_sock *u, *u_peer;
+
+   u = unix_sk(me);
+   u_peer = unix_sk(peer);
+
+   if (u->peer_wake.private != peer)
+   return 0;
+
+   remove_wait_queue(&u_peer->peer_wait, &u->peer_wake);
+   u->peer_wake.private = NULL;
+
+   return 1;
+}
+
 static inline int unix_writable(struct sock *sk)
 {
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -409,6 +454,8 @@ static void unix_release_sock(struct soc
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+   unix_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -630,6 +677,7 @@ static struct sock *unix_create1(struct
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+   init_waitqueue_func_entry(&u->peer_wake, unix_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound, sk);
 out:
if (sk == NULL)
@@ -953,7 +1001,7 @@ static

Re: [RFC PATCH] af_unix: fix entry locking in unix_dgram_recvmsg

2015-12-03 Thread Rainer Weikusat
David Miller  writes:
> From: Rainer Weikusat 
>> Rainer Weikusat  writes:
>> 
>> [...]
>> 
>>> Insofar I understand the comment in this code block correctly,

[...]

>>> /* recvmsg() in non blocking mode is supposed to return 
>>> -EAGAIN
>>>  * sk_rcvtimeo is not honored by mutex_lock_interruptible()
>>>
>>> setting a receive timeout for an AF_UNIX datagram socket also doesn't
>>> work as intended because of this: In case of n readers with the same
>>> timeout, the nth reader will end up blocking n times the timeout.

[...]

> So with your patch, the "N * timeout" behavior, where N is the number
> of queues reading threads, no longer occurs?  Do they all now properly
> get released at the appropriate timeout?

As far as I can tell, yes. With the change, unix_dgram_recvmsg has a
read loop looking like this:

last = NULL; /* not really necessary */
timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);

do {
mutex_lock(&u->readlock);

skip = sk_peek_offset(sk, flags);
skb = __skb_try_recv_datagram(sk, flags, &peeked, &skip, &err,
  &last);
if (skb)
break;

mutex_unlock(&u->readlock);

if (err != -EAGAIN)
break;
} while (timeo &&
 !__skb_wait_for_more_packets(sk, &err, &timeo, last));

u->readlock is only used to enforce serialized access while running code
dealing with the peek offset. If there's currently nothing to receive,
the mutex is dropped. Afterwards, non-blocking readers return with
-EAGAIN and blocking readers go to sleep waiting for 'interesting
events' via __skb_wait_for_more_packets without stuffing the mutex into
a pocket and taking it with them: All blocking readers of a certain
socket end up going to sleep via the schedule_timeout call in the wait
function, hence, each of them will be woken up once its own timeout expires.
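
A user-space way to check this is to let several threads block in recv() on
the same (empty) datagram socket with SO_RCVTIMEO set and time when each of
them returns; with the loop above, all of them should report roughly the
configured timeout instead of multiples of it. Hypothetical test program
(thread count and the 2s timeout are arbitrary), to be built with -pthread:

/* Each thread blocks in recv() on the same empty AF_UNIX datagram
 * socket with a 2s receive timeout; with serialized readers the i-th
 * thread returns only after about i * 2s, with concurrent sleepers
 * they all return after about 2s with EAGAIN. */
#include <sys/socket.h>
#include <sys/time.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <errno.h>

static int fd;

static void *reader(void *arg)
{
    struct timespec t0, t1;
    char buf[64];

    clock_gettime(CLOCK_MONOTONIC, &t0);
    recv(fd, buf, sizeof(buf), 0);    /* expected to time out */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("reader %ld: %.2fs (%s)\n", (long)(intptr_t)arg,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
           strerror(errno));
    return NULL;
}

int main(void)
{
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    pthread_t tid[4];
    long i;

    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    for (i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, reader, (void *)(intptr_t)i);
    for (i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}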


[PATCH 01/02] core: enable more fine-grained datagram reception control

2015-12-06 Thread Rainer Weikusat
The __skb_recv_datagram routine in core/datagram.c provides a general
skb reception facility supposed to be utilized by protocol modules
providing datagram sockets. It encompasses both the actual recvmsg code
and a surrounding 'sleep until data is available' loop. This is
inconvenient if a protocol module has to use additional locking in order
to maintain some per-socket state the generic datagram socket code is
unaware of (as the af_unix code does). The patch below moves the recvmsg
proper code into a new __skb_try_recv_datagram routine which doesn't
sleep and renames wait_for_more_packets to
__skb_wait_for_more_packets, both routines being exported interfaces. The
original __skb_recv_datagram routine is reimplemented on top of these
two functions such that its user-visible behaviour remains unchanged.

Signed-Off-By: Rainer Weikusat 
---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4355129..dce91ac 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2788,6 +2788,12 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
 #define skb_walk_frags(skb, iter)  \
for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)
 
+
+int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
+   const struct sk_buff *skb);
+struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
+   int *peeked, int *off, int *err,
+   struct sk_buff **last);
 struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
int *peeked, int *off, int *err);
 struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int noblock,
diff --git a/net/core/datagram.c b/net/core/datagram.c
index d62af69..7daff66 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -83,8 +83,8 @@ static int receiver_wake_function(wait_queue_t *wait, unsigned int mode, int syn
 /*
  * Wait for the last received packet to be different from skb
  */
-static int wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
-const struct sk_buff *skb)
+int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
+   const struct sk_buff *skb)
 {
int error;
DEFINE_WAIT_FUNC(wait, receiver_wake_function);
@@ -130,6 +130,7 @@ out_noerr:
error = 1;
goto out;
 }
+EXPORT_SYMBOL(__skb_wait_for_more_packets);
 
 static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
 {
@@ -161,13 +162,15 @@ done:
 }
 
 /**
- * __skb_recv_datagram - Receive a datagram skbuff
+ * __skb_try_recv_datagram - Receive a datagram skbuff
  * @sk: socket
  * @flags: MSG_ flags
  * @peeked: returns non-zero if this packet has been seen before
  * @off: an offset in bytes to peek skb from. Returns an offset
  *   within an skb where data actually starts
  * @err: error code returned
+ * @last: set to last peeked message to inform the wait function
+ *what to look for when peeking
  *
  * Get a datagram skbuff, understands the peeking, nonblocking wakeups
  * and possible races. This replaces identical code in packet, raw and
@@ -175,9 +178,11 @@ done:
  * the long standing peek and read race for datagram sockets. If you
  * alter this routine remember it must be re-entrant.
  *
- * This function will lock the socket if a skb is returned, so the caller
- * needs to unlock the socket in that case (usually by calling
- * skb_free_datagram)
+ * This function will lock the socket if a skb is returned, so
+ * the caller needs to unlock the socket in that case (usually by
+ * calling skb_free_datagram). Returns NULL with *err set to
+ * -EAGAIN if no data was available or to some other value if an
+ * error was detected.
  *
  * * It does not lock socket since today. This function is
  * * free of race conditions. This measure should/can improve
@@ -191,13 +196,13 @@ done:
  * quite explicitly by POSIX 1003.1g, don't change them without having
  * the standard around please.
  */
-struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
-   int *peeked, int *off, int *err)
+struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
+   int *peeked, int *off, int *err,
+   struct sk_buff **last)
 {
struct sk_buff_head *queue = &sk->sk_receive_queue;
-   struct sk_buff *skb, *last;
+   struct sk_buff *skb;
unsigned long cpu_flags;
-   long timeo;
/*
 * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
 */
@@ -206,8 +211,6 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, 
u

[PATCH 02/02] af_unix: fix unix_dgram_recvmsg entry locking

2015-12-06 Thread Rainer Weikusat
The current unix_dgram_recvmsg code acquires the u->readlock mutex in
order to protect access to the peek offset prior to calling
__skb_recv_datagram for actually receiving data. This implies that a
blocking reader will go to sleep with this mutex held if there's
presently no data to return to userspace. Two undesirable side effects
of this are that a later non-blocking read call on the same socket will
block on the ->readlock mutex until the earlier blocking call releases it
(or the reader is interrupted) and that later blocking read calls
will wait longer than the effective socket read timeout says they
should: The timeout will only start 'ticking' once such a reader hits
the schedule_timeout in wait_for_more_packets (net/core/datagram.c) while
the time it already had to wait until it could acquire the mutex is
unaccounted for.
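
(Illustration only, not part of the patch: the first effect can be timed
from user space with something like the program below; the 2s receive
timeout and the 100ms delay are arbitrary. With the readlock held across
the sleep, the MSG_DONTWAIT recv() only returns once the blocked reader's
timeout expires; with the reworked loop it returns -EAGAIN right away.
Build with -pthread.)

/* Thread A blocks in recv() on an empty datagram socket (2s receive
 * timeout), the main thread then issues a MSG_DONTWAIT recv() on the
 * same socket and measures how long it takes to come back. */
#include <sys/socket.h>
#include <sys/time.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <errno.h>

static int sv[2];

static void *blocking_reader(void *arg)
{
    char buf[16];

    (void)arg;
    recv(sv[0], buf, sizeof(buf), 0);    /* sleeps until the timeout hits */
    return NULL;
}

int main(void)
{
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    struct timespec t0, t1;
    pthread_t tid;
    char buf[16];
    ssize_t rc;

    socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
    setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    pthread_create(&tid, NULL, blocking_reader, NULL);
    usleep(100000);    /* give the other thread time to block first */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rc = recv(sv[0], buf, sizeof(buf), MSG_DONTWAIT);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("MSG_DONTWAIT recv: rc %zd (%s), took %.3fs\n", rc, strerror(errno),
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    pthread_join(tid, NULL);
    return 0;
}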

The patch avoids both by using the __skb_try_recv_datagram and
__skb_wait_for_more_packets functions created by the first patch to
implement a unix_dgram_recvmsg read loop which releases the readlock
mutex prior to going to sleep and reacquires it as needed
afterwards. Non-blocking readers will thus immediately return with
-EAGAIN if there's no data available regardless of any concurrent
blocking readers and all blocking readers will end up sleeping via
schedule_timeout, thus honouring the configured socket receive timeout.

Signed-Off-By: Rainer Weikusat 
---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 45aebd9..47dfa97 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2078,8 +2078,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
struct scm_cookie scm;
struct sock *sk = sock->sk;
struct unix_sock *u = unix_sk(sk);
-   int noblock = flags & MSG_DONTWAIT;
-   struct sk_buff *skb;
+   struct sk_buff *skb, *last;
+   long timeo;
int err;
int peeked, skip;
 
@@ -2087,26 +2087,32 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
if (flags&MSG_OOB)
goto out;
 
-   err = mutex_lock_interruptible(&u->readlock);
-   if (unlikely(err)) {
-   /* recvmsg() in non blocking mode is supposed to return -EAGAIN
-* sk_rcvtimeo is not honored by mutex_lock_interruptible()
-*/
-   err = noblock ? -EAGAIN : -ERESTARTSYS;
-   goto out;
-   }
+   timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
-   skip = sk_peek_offset(sk, flags);
+   do {
+   mutex_lock(&u->readlock);
 
-   skb = __skb_recv_datagram(sk, flags, &peeked, &skip, &err);
-   if (!skb) {
+   skip = sk_peek_offset(sk, flags);
+   skb = __skb_try_recv_datagram(sk, flags, &peeked, &skip, &err,
+ &last);
+   if (skb)
+   break;
+
+   mutex_unlock(&u->readlock);
+
+   if (err != -EAGAIN)
+   break;
+   } while (timeo &&
+!__skb_wait_for_more_packets(sk, &err, &timeo, last));
+
+   if (!skb) { /* implies readlock unlocked */
unix_state_lock(sk);
/* Signal EOF on disconnected non-blocking SEQPACKET socket. */
if (sk->sk_type == SOCK_SEQPACKET && err == -EAGAIN &&
(sk->sk_shutdown & RCV_SHUTDOWN))
err = 0;
unix_state_unlock(sk);
-   goto out_unlock;
+   goto out;
}
 
wake_up_interruptible_sync_poll(&u->peer_wait,
@@ -2162,7 +2168,6 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 
 out_free:
skb_free_datagram(sk, skb);
-out_unlock:
mutex_unlock(&u->readlock);
 out:
return err;


breaks blocking receive for other users (was: [PATCH 01/02] core: enable more fine-grained datagram reception control)

2015-12-07 Thread Rainer Weikusat
David Miller  writes:
> From: Rainer Weikusat 
> Date: Sun, 06 Dec 2015 21:11:34 +
>
>> The __skb_recv_datagram routine in core/datagram.c provides a general
>> skb reception facility supposed to be utilized by protocol modules
>> providing datagram sockets. It encompasses both the actual recvmsg code
>> and a surrounding 'sleep until data is available' loop. This is
>> inconvenient if a protocol module has to use additional locking in order
>> to maintain some per-socket state the generic datagram socket code is
>> unaware of (as the af_unix code does). The patch below moves the recvmsg
>> proper code into a new __skb_try_recv_datagram routine which doesn't
>> sleep and renames wait_for_more_packets to
>> __skb_wait_for_more_packets, both routines being exported interfaces. The
>> original __skb_recv_datagram routine is reimplemented on top of these
>> two functions such that its user-visible behaviour remains unchanged.
>> 
>> Signed-Off-By: Rainer Weikusat 
>
> Applied to net-next.

Because of an oversight, this change breaks blocking datagram reception
for anyone but the AF_UNIX SOCK_DGRAM code. I've noticed this by chance
while running a test program for SOCK_STREAM sockets. The problem seems
to be (fix not yet tested) that a '-' is missing in the *err check in
__skb_recv_datagram.
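
(To spell out why the unnegated comparison can never match: EAGAIN is a
positive constant (11 on Linux) while the in-kernel convention is to pass
around its negation. Trivial user-space illustration:)

/* EAGAIN is positive, the error stored by __skb_try_recv_datagram is
 * negative, so 'err != EAGAIN' is always true and the rebuilt loop in
 * __skb_recv_datagram gives up instead of sleeping. */
#include <errno.h>
#include <stdio.h>

int main(void)
{
    int err = -EAGAIN;    /* 'no data available' as reported in-kernel */

    printf("EAGAIN=%d err=%d  err != EAGAIN: %d  err != -EAGAIN: %d\n",
           EAGAIN, err, err != EAGAIN, err != -EAGAIN);
    return 0;
}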

Proposed fix:


diff --git a/net/core/datagram.c b/net/core/datagram.c
index 7daff66..fa9dc64 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -275,7 +275,7 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
if (skb)
return skb;
 
-   if (*err != EAGAIN)
+   if (*err != -EAGAIN)
break;
} while (timeo &&
!__skb_wait_for_more_packets(sk, err, &timeo, last));
-

Provided this works (very likely), I'll send a real patch for that ASAP.


[PATCH] fix inverted test in __skb_recv_datagram

2015-12-07 Thread Rainer Weikusat
As the kernel generally uses negated error numbers, *err needs to be
compared with -EAGAIN (d'oh).

Signed-off-by: Rainer Weikusat 
Fixes: ea3793ee29d3
---
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 7daff66..fa9dc64 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -275,7 +275,7 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
if (skb)
return skb;
 
-   if (*err != EAGAIN)
+   if (*err != -EAGAIN)
break;
} while (timeo &&
!__skb_wait_for_more_packets(sk, err, &timeo, last));



Re: [PATCH] fix inverted test in __skb_recv_datagram

2015-12-08 Thread Rainer Weikusat
David Miller  writes:
> From: Rainer Weikusat 
> Date: Mon, 07 Dec 2015 23:30:58 +
>
>> As the kernel generally uses negated error numbers, *err needs to be
>> compared with -EAGAIN (d'oh).
>> 
>> Signed-off-by: Rainer Weikusat 
>> Fixes: ea3793ee29d3
>
> Improperly formatted Fixes: tag, you must also include the commit
> header line, in parentheses and double quotes, after the SHA_ID.
>
> Furthermore this is the wrong SHA_ID.

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=ea3793ee29d3

displays the commit I was referring to, namely, the one containing this

+ timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+ do {
+ skb = __skb_try_recv_datagram(sk, flags, peeked, off, err,
+ &last);
+ if (skb)
+ return skb;
+
+ if (*err != EAGAIN)
+ break;
+ } while (timeo &&
+ !__skb_wait_for_more_packets(sk, err, &timeo, last));

which added the inverted test, IOW, if this is the wrong hash, I have no
idea what the right one could be. I'll resubmit this with the 'one line
summary' added. After noticing the issue around 23:10 UK time yesterday,
I was in a bit of a hurry and stopped reading the 'Fixes' text in
SubmittingPatches after the part about "the first 12 characters".


Re: [PATCH] fix inverted test in __skb_recv_datagram

2015-12-08 Thread Rainer Weikusat
As the kernel generally uses negated error numbers, *err needs to be
compared with -EAGAIN (d'oh).

Signed-off-by: Rainer Weikusat 
Fixes: ea3793ee29d3 ("core: enable more fine-grained datagram reception control")
---
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 7daff66..fa9dc64 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -275,7 +275,7 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
if (skb)
return skb;
 
-   if (*err != EAGAIN)
+   if (*err != -EAGAIN)
break;
} while (timeo &&
!__skb_wait_for_more_packets(sk, err, &timeo, last));

