Re: [PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-14 Thread Davidlohr Bueso

On Fri, 11 Oct 2019, Manfred Spraul wrote:

But you are right, there are two different scenarios:

1) thread already in another wake_q, wakeup happens immediately after 
the cmpxchg_relaxed().


This scenario is safe, due to the smp_mb__before_atomic() in wake_q_add()

2) thread woken up but e.g. a timeout, see ->state=STATE_READY, 
returns to user space, calls sys_exit.


This must not happen before get_task_struct acquired a reference.

And this appears to be unsafe: get_task_struct() is refcount_inc(), 
which is refcount_inc_checked(), which is according to lib/refcount.c 
fully unordered.


Thus: ->state=STATE_READY can execute before the refcount increase.

Thus: ->state=STATE_READY needs a smp_store_release(), correct?


What if we did the reference count explicitly, and then just use
wake_q_add_safe()? That would avoid the extra barrier, __pipelined_op()
would become:

 list_del();
 get_task_struct();
 wake_q_add_safe();
 WRITE_ONCE(->state, STATE_READY);

Thanks,
Davidlohr


Re: [PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul

On 10/11/19 6:55 PM, Davidlohr Bueso wrote:

On Fri, 11 Oct 2019, Manfred Spraul wrote:


Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.


In general we relied on the barrier for not needing READ/WRITE_ONCE,
but I agree this scenario should be better documented with them.


After reading core-api/atomic_ops.rst:

> _ONCE() should be used. [...] Alternatively, you can place a barrier.

So both approaches are ok.

Let's follow the "should", i.e.: all operations on the ->state variables 
to READ_ONCE()/WRITE_ONCE().


Then we have a standard, and since we can follow the "should", we should 
do that.



Similarly imo, the 'state' should also need them for write, even if
under the lock -- consistency and documentation, for example.

Ok, so let's convert everything to _ONCE. (assuming that my analysis 
below is incorrect)

In addition, I think it makes sense to encapsulate some of the
pipelined send/recv operations, that also can allow us to keep
the barrier comments in pipelined_send(), which I wonder why
you chose to remove. Something like so, before your changes:

I thought that the simple "memory barrier is provided" is enough, so I 
had removed the comment.



But you are right, there are two different scenarios:

1) thread already in another wake_q, wakeup happens immediately after 
the cmpxchg_relaxed().


This scenario is safe, due to the smp_mb__before_atomic() in wake_q_add()

2) thread woken up but e.g. a timeout, see ->state=STATE_READY, returns 
to user space, calls sys_exit.


This must not happen before get_task_struct acquired a reference.

And this appears to be unsafe: get_task_struct() is refcount_inc(), 
which is refcount_inc_checked(), which is according to lib/refcount.c 
fully unordered.


Thus: ->state=STATE_READY can execute before the refcount increase.

Thus: ->state=STATE_READY needs a smp_store_release(), correct?


diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..be48c0ba92f7 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, 
u_name)

 * The same algorithm is used for senders.
 */

-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
-  struct msg_msg *message,
-  struct ext_wait_queue *receiver)
+  struct ext_wait_queue *this)
{
-    receiver->msg = message;
-    list_del(>list);
-    wake_q_add(wake_q, receiver->task);
+    list_del(>list);
+    wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct 
wake_q_head *wake_q,

 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-    receiver->state = STATE_READY;
+    this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+  struct mqueue_inode_info *info,
+  struct msg_msg *message,
+  struct ext_wait_queue *receiver)
+{
+    receiver->msg = message;
+    __pipelined_op(wake_q, info, receiver);
}

/* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct 
wake_q_head *wake_q,

if (msg_insert(sender->msg, info))
    return;

-    list_del(>list);
-    wake_q_add(wake_q, sender->task);
-    sender->state = STATE_READY;
+    __pipelined_op(wake_q, info, sender);
}

static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,


I would merge that into the series, ok?

--

    Manfred



Re: [PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Davidlohr Bueso

On Fri, 11 Oct 2019, Manfred Spraul wrote:


Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.


In general we relied on the barrier for not needing READ/WRITE_ONCE,
but I agree this scenario should be better documented with them.
Similarly imo, the 'state' should also need them for write, even if
under the lock -- consistency and documentation, for example.

In addition, I think it makes sense to encapsulate some of the
pipelined send/recv operations, that also can allow us to keep
the barrier comments in pipelined_send(), which I wonder why
you chose to remove. Something like so, before your changes:

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..be48c0ba92f7 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
 * The same algorithm is used for senders.
 */

-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
- struct msg_msg *message,
- struct ext_wait_queue *receiver)
+ struct ext_wait_queue *this)
{
-   receiver->msg = message;
-   list_del(>list);
-   wake_q_add(wake_q, receiver->task);
+   list_del(>list);
+   wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct wake_q_head 
*wake_q,
 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-   receiver->state = STATE_READY;
+this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+ struct mqueue_inode_info *info,
+ struct msg_msg *message,
+ struct ext_wait_queue *receiver)
+{
+   receiver->msg = message;
+   __pipelined_op(wake_q, info, receiver);
}

/* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct wake_q_head 
*wake_q,
if (msg_insert(sender->msg, info))
return;

-   list_del(>list);
-   wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+   __pipelined_op(wake_q, info, sender);
}

static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,


[PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_aquire__after_ctrl_dep() after the RAED_ONCE, we need
  acquire semantics if the value is STATE_READY.

- document that the code relies on the barrier inside wake_q_add()

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/mqueue.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..902167407737 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -646,17 +646,25 @@ static int wq_sleep(struct mqueue_inode_info *info, int 
sr,
wq_add(info, sr, ewp);
 
for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
 
spin_unlock(>lock);
time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-   if (ewp->state == STATE_READY) {
+   if (READ_ONCE(ewp->state) == STATE_READY) {
+   /*
+* Pairs, together with READ_ONCE(), with
+* the barrier in wake_q_add().
+*/
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
spin_lock(>lock);
+
+   /* we hold info->lock, so no memory barrier required */
if (ewp->state == STATE_READY) {
retval = 0;
goto out_unlock;
@@ -928,16 +936,11 @@ static inline void pipelined_send(struct wake_q_head 
*wake_q,
 {
receiver->msg = message;
list_del(>list);
+
wake_q_add(wake_q, receiver->task);
-   /*
-* Rely on the implicit cmpxchg barrier from wake_q_add such
-* that we can ensure that updating receiver->state is the last
-* write operation: As once set, the receiver can continue,
-* and if we don't have the reference count from the wake_q,
-* yet, at that point we can later have a use-after-free
-* condition and bogus wakeup.
-*/
-   receiver->state = STATE_READY;
+
+   /* The memory barrier is provided by wake_q_add(). */
+   WRITE_ONCE(receiver->state, STATE_READY);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -956,8 +959,11 @@ static inline void pipelined_receive(struct wake_q_head 
*wake_q,
return;
 
list_del(>list);
+
wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+
+   /* The memory barrier is provided by wake_q_add(). */
+   WRITE_ONCE(sender->state, STATE_READY);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
@@ -1044,6 +1050,8 @@ static int do_mq_timedsend(mqd_t mqdes, const char __user 
*u_msg_ptr,
} else {
wait.task = current;
wait.msg = (void *) msg_ptr;
+
+   /* memory barrier not required, we hold info->lock */
wait.state = STATE_NONE;
ret = wq_sleep(info, SEND, timeout, );
/*
@@ -1147,6 +1155,8 @@ static int do_mq_timedreceive(mqd_t mqdes, char __user 
*u_msg_ptr,
ret = -EAGAIN;
} else {
wait.task = current;
+
+   /* memory barrier not required, we hold info->lock */
wait.state = STATE_NONE;
ret = wq_sleep(info, RECV, timeout, );
msg_ptr = wait.msg;
-- 
2.21.0