[RFC] [PATCH] ipc/util.c: Use binary search for max_idx

2021-04-07 Thread Manfred Spraul
If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the index of the highest
used entry in the kernel's internal array recording information about
all SysV objects of the requested type for the current namespace.
(This information can be used with repeated ..._STAT or ..._STAT_ANY
operations to obtain information about all SysV objects on the system.)

If the current highest used entry is destroyed, then the new highest
used entry is determined by looping over all possible values.
With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries.

As there is no get_last() function for idr structures:
Implement a "get_last()" using a binary search.

As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.

Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 44 +++-
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index cfa0045e748d..0121bf6b2617 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -64,6 +64,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -450,6 +451,40 @@ static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 			       ipc_kht_params);
 }
 
+/**
+ * ipc_get_maxusedidx - get highest in-use index
+ * @ids: ipc identifier set
+ * @limit: highest possible index.
+ *
+ * The function determines the highest in use index value.
+ * ipc_ids.rwsem needs to be owned by the caller.
+ * If no ipc object is allocated, then -1 is returned.
+ */
+static int ipc_get_maxusedidx(struct ipc_ids *ids, int limit)
+{
+	void *val;
+	int tmpidx;
+	int i;
+	int retval;
+
+	i = ilog2(limit+1);
+
+	retval = 0;
+	for (; i >= 0; i--) {
+		tmpidx = retval | (1<<i);
+		/*
+		 * "0" is a possible index value, thus search using
+		 * e.g. 15,7,3,1,0 instead of 16,8,4,2,1.
+		 */
+		tmpidx = tmpidx-1;
+		if (tmpidx > limit)
+			continue;
+		val = idr_get_next(&ids->ipcs_idr, &tmpidx);
+		if (val)
+			retval |= (1<<i);
+	}
+	return retval-1;
+}
+
@@ ... @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 	ipcp->deleted = true;
 
 	if (unlikely(idx == ids->max_idx)) {
-		do {
-			idx--;
-			if (idx == -1)
-				break;
-		} while (!idr_find(&ids->ipcs_idr, idx));
+
+		idx = ids->max_idx-1;
+		if (idx >= 0)
+			idx = ipc_get_maxusedidx(ids, idx);
 		ids->max_idx = idx;
 	}
 }
-- 
2.29.2
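
For context (not part of the patch): the highest-used-index return value is consumed from user space roughly as in the sketch below, which enumerates all SysV semaphore sets via SEM_INFO/SEM_STAT. Illustration only; error handling trimmed.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* semctl(2) requires the caller to define union semun */
union semun {
	int val;
	struct semid_ds *buf;
	unsigned short *array;
	struct seminfo *__buf;
};

int main(void)
{
	struct seminfo info;
	struct semid_ds ds;
	union semun arg;
	int max_idx, i, id;

	arg.__buf = &info;
	max_idx = semctl(0, 0, SEM_INFO, arg);	/* index of the highest used entry */
	if (max_idx < 0)
		return 1;

	for (i = 0; i <= max_idx; i++) {
		arg.buf = &ds;
		id = semctl(i, 0, SEM_STAT, arg);	/* i is a kernel index, not an id */
		if (id < 0)
			continue;			/* index i is currently unused */
		printf("index %d: semid %d, nsems %lu\n",
		       i, id, (unsigned long)ds.sem_nsems);
	}
	return 0;
}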



Re: [PATCH] ipc/msg: add msgsnd_timed and msgrcv_timed syscall for system V message queue

2021-03-04 Thread Manfred Spraul

Hi Eric,


On 3/4/21 2:12 AM, Andrew Morton wrote:

On Tue, 23 Feb 2021 23:11:43 +0800 Eric Gao  wrote:


Sometimes we need the msgsnd or msgrcv syscall to return after a limited
time, so that the business thread is not blocked here indefinitely. For this
case, I add the msgsnd_timed and msgrcv_timed syscalls, which take a timeout
parameter with a unit of ms.

Please cc Manfred and Davidlohr on ipc/ changes.

The above is a very brief description for a new syscall!  Please go to
great lengths to tell us why this is considered useful - what are the
use cases?

Also, please fully describe the proposed syscall interface right here
in the changelog.  Please be prepared to later prepare a full manpage.


...
+SYSCALL_DEFINE5(msgsnd_timed, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
+		int, msgflg, long, timeoutms)

Specifying the timeout in milliseconds is problematic - it's very
coarse.  See sys_epoll_pwait2()'s use of timespecs.


What about using an absolute timeout, like in mq_timedsend()?

That makes restart handling after signals far simpler.
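
For illustration, the caller side of an absolute deadline as mq_timedsend() already offers it: a retried call after a signal simply reuses the same timespec. Sketch only, with an assumed 5 second deadline and a hypothetical helper name.

#include <mqueue.h>
#include <time.h>
#include <errno.h>

/* Sketch: send with an absolute CLOCK_REALTIME deadline. Because the
 * deadline is absolute, restarting after EINTR does not extend the
 * total waiting time. */
static int send_with_deadline(mqd_t mq, const char *msg, size_t len,
			      unsigned int prio)
{
	struct timespec deadline;
	int ret;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 5;			/* give up 5 seconds from now */

	do {
		ret = mq_timedsend(mq, msg, len, prio, &deadline);
	} while (ret == -1 && errno == EINTR);	/* same deadline after a signal */

	return ret;	/* -1 with errno == ETIMEDOUT once the deadline passes */
}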


> -   schedule();
> +
> +   /* sometimes, we need msgsnd syscall return after a given time */
> +   if (timeoutms <= 0) {
> +   schedule();
> +   } else {
> +   timeoutms = schedule_timeout(timeoutms);
> +   if (timeoutms == 0)
> +   timeoutflag = true;
> +   }

I wonder if this should be schedule_timeout_interruptible() or at least
schedule_timeout_killable() instead of schedule_timeout(). If it should,
this should probably be done as a separate change.
No. schedule_timeout_interruptible() just means that 
__set_current_state() is called before the schedule_timeout().


The __set_current_state() is done directly in msg.c, before dropping the 
lock.


--

    Manfred



Re: [PATCH] ipc/msg.c: wake up senders until there is a queue empty capacity

2020-06-01 Thread Manfred Spraul

Hi Artur,

On 6/1/20 4:02 PM, Artur Barsegyan wrote:

Hi, Manfred.

Did you get my last message?


Yes, I'm just too busy right now.

My plan/backlog is:

- the xarray patch from Matthew

- improve finding max_id in ipc_rmid(). Perhaps even remove max_id 
entirely and instead calculate it on demand.


- your patch to avoid waking up too many tasks, including my bugfix.



On Wed, May 27, 2020 at 02:22:57PM +0300, Artur Barsegyan wrote:

[sorry for the duplicates — I have changed my email client]

About your case:

The new receiver is put at the end of the receivers list.
pipelined_send() starts from the beginning of the list and iterates until the
end.

If our queue is always full, each receiver should get a message, because new
receivers are appended at the end.
In my vision: we waste some time in that loop, but in general it should increase
the throughput. But it should be tested.

Yes, I'm gonna implement it and make a benchmark. But maybe it should be done 
in another patch thread?


My biggest problem is always realistic benchmarks:

Do we optimize for code size/small amount of branches, or add special 
cases for things that we think could be common?


Avoiding thundering herds is always good, avoiding schedule() is always 
good.


Thus I would start with pipelined_receive, and then we would need 
feedback from apps that use sysv msg.


(old fakeroot is what I remember as test app)


On Wed, May 27, 2020 at 08:03:17AM +0200, Manfred Spraul wrote:

Hello Artur,

On 5/26/20 9:56 AM, Artur Barsegyan wrote:

Hello, Manfred!

Thank you, for your review. I've reviewed your patch.

I forgot about the case with different message types. At now with your patch,
a sender will force message consuming if that doesn't hold own capacity.

I have measured queue throughput and have pushed the results to:
https://github.com/artur-barsegyan/systemv_queue_research

But I'm confused about the next thought: in the general loop in the do_msgsnd()
function, we don't check the pipeline sending availability. Your case would be
optimized if we checked for pipeline sending inside the loop.

I don't get your concern, or perhaps this is a feature that I had always
assumed as "normal":

"msg_fits_inqueue(msq, msgsz)" is in the loop, this ensures progress.

The rationale is a design decision:

The check for pipeline sending is only done if there would be space to store
the message in the queue.

I was afraid that performing the pipeline send immediately, without checking
queue availability, could break apps:

Some messages would arrive immediately (if there is a waiting receiver),
other messages are stuck forever (since the queue is full).

Initial patch: https://lkml.org/lkml/1999/10/3/5 (without any remarks about
the design decision)

The risk that I had seen was theoretical, I do not have any real bug
reports. So we could change it.

Perhaps: Go in the same direction as it was done for POSIX mqueue: implement
pipelined receive.


On Sun, May 24, 2020 at 03:21:31PM +0200, Manfred Spraul wrote:

Hello Artur,

On 5/23/20 10:34 PM, Artur Barsegyan wrote:

Take into account the total size of the already enqueued messages of
previously handled senders before another one.

Otherwise, we have serious degradation of receiver throughput for
case with multiple senders because another sender wakes up,
checks the queue capacity and falls into sleep again.

Each round-trip wastes CPU time a lot and leads to perceptible
throughput degradation.

Source code of:
- sender/receiver
- benchmark script
- ready graphics of before/after results

is located here: https://github.com/artur-barsegyan/systemv_queue_research

Thanks for analyzing the issue!


Signed-off-by: Artur Barsegyan 
---
ipc/msg.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index caca67368cb5..52d634b0a65a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -214,6 +214,7 @@ static void ss_wakeup(struct msg_queue *msq,
struct msg_sender *mss, *t;
struct task_struct *stop_tsk = NULL;
	struct list_head *h = &msq->q_senders;
+   size_t msq_quota_used = 0;
list_for_each_entry_safe(mss, t, h, list) {
if (kill)
@@ -233,7 +234,7 @@ static void ss_wakeup(struct msg_queue *msq,
 * move the sender to the tail on behalf of the
 * blocked task.
 */
-   else if (!msg_fits_inqueue(msq, mss->msgsz)) {
+   else if (!msg_fits_inqueue(msq, msq_quota_used + mss->msgsz)) {
if (!stop_tsk)
stop_tsk = mss->tsk;
@@ -241,6 +242,7 @@ static void ss_wakeup(struct msg_queue *msq,
continue;
}
+   msq_quota_used += mss->msgsz;
wake_q_add(wake_q, mss->tsk);

You have missed the case of a do_msgsnd() that doesn't enqueue the message:

Situation:

- 2 messages of type 1 in the queue (

Re: [PATCH] ipc/msg.c: wake up senders until there is a queue empty capacity

2020-05-27 Thread Manfred Spraul

Hello Artur,

On 5/26/20 9:56 AM, Artur Barsegyan wrote:

Hello, Manfred!

Thank you, for your review. I've reviewed your patch.

I forgot about the case with different message types. At now with your patch,
a sender will force message consuming if that doesn't hold own capacity.

I have measured queue throughput and have pushed the results to:
https://github.com/artur-barsegyan/systemv_queue_research

But I'm confused about the next thought: in the general loop in the do_msgsnd()
function, we don't check the pipeline sending availability. Your case would be
optimized if we checked for pipeline sending inside the loop.


I don't get your concern, or perhaps this is a feature that I had always 
assumed as "normal":


"msg_fits_inqueue(msq, msgsz)" is in the loop, this ensures progress.

The rationale is a design decision:

The check for pipeline sending is only done if there would be space to 
store the message in the queue.


I was afraid that performing the pipeline send immediately, without 
checking queue availability, could break apps:


Some messages would arrive immediately (if there is a waiting receiver), 
other messages are stuck forever (since the queue is full).


Initial patch: https://lkml.org/lkml/1999/10/3/5 (without any remarks 
about the design decision)


The risk that I had seen was theoretical, I do not have any real bug 
reports. So we could change it.


Perhaps: Go in the same direction as it was done for POSIX mqueue: 
implement pipelined receive.



On Sun, May 24, 2020 at 03:21:31PM +0200, Manfred Spraul wrote:

Hello Artur,

On 5/23/20 10:34 PM, Artur Barsegyan wrote:

Take into account the total size of the already enqueued messages of
previously handled senders before another one.

Otherwise, we have serious degradation of receiver throughput for
case with multiple senders because another sender wakes up,
checks the queue capacity and falls into sleep again.

Each round-trip wastes CPU time a lot and leads to perceptible
throughput degradation.

Source code of:
- sender/receiver
- benchmark script
- ready graphics of before/after results

is located here: https://github.com/artur-barsegyan/systemv_queue_research

Thanks for analyzing the issue!


Signed-off-by: Artur Barsegyan 
---
   ipc/msg.c | 4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index caca67368cb5..52d634b0a65a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -214,6 +214,7 @@ static void ss_wakeup(struct msg_queue *msq,
struct msg_sender *mss, *t;
struct task_struct *stop_tsk = NULL;
	struct list_head *h = &msq->q_senders;
+   size_t msq_quota_used = 0;
list_for_each_entry_safe(mss, t, h, list) {
if (kill)
@@ -233,7 +234,7 @@ static void ss_wakeup(struct msg_queue *msq,
 * move the sender to the tail on behalf of the
 * blocked task.
 */
-   else if (!msg_fits_inqueue(msq, mss->msgsz)) {
+   else if (!msg_fits_inqueue(msq, msq_quota_used + mss->msgsz)) {
if (!stop_tsk)
stop_tsk = mss->tsk;
@@ -241,6 +242,7 @@ static void ss_wakeup(struct msg_queue *msq,
continue;
}
+   msq_quota_used += mss->msgsz;
wake_q_add(wake_q, mss->tsk);

You have missed the case of a do_msgsnd() that doesn't enqueue the message:

Situation:

- 2 messages of type 1 in the queue (2x8192 bytes, queue full)

- 6 senders waiting to send messages of type 2

- 6 receivers waiting to get messages of type 2.

If now a receiver reads one message of type 1, then all 6 senders can send.

With your patch applied, only one sender sends the message to one receiver,
and the remaining 10 tasks continue to sleep.


Could you please check and (assuming that you agree) run your benchmarks
with the patch applied?

--

     Manfred



 From fe2f257b1950a19bf5c6f67e71aa25c2f13bcdc3 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Sun, 24 May 2020 14:47:31 +0200
Subject: [PATCH 2/2] ipc/msg.c: Handle case of senders not enqueuing the
  message

The patch "ipc/msg.c: wake up senders until there is a queue empty
capacity" avoids the thundering herd problem by wakeing up
only as many potential senders as there is free space in the queue.

This patch is a fix: If one of the senders doesn't enqueue its message,
then a search for further potential senders must be performed.

Signed-off-by: Manfred Spraul 
---
  ipc/msg.c | 21 +
  1 file changed, 21 insertions(+)

diff --git a/ipc/msg.c b/ipc/msg.c
index 52d634b0a65a..f6d5188db38a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -208,6 +208,12 @@ static inline void ss_del(struct msg_sender *mss)
		list_del(&mss->list);
  }
  
+/*

+ * ss_wakeup() assumes that the stored senders will enqueue the pending 
message.
+ * Thus: If a woken up task doesn'

Re: [PATCH] ipc/msg.c: wake up senders until there is a queue empty capacity

2020-05-24 Thread Manfred Spraul

Hello Artur,

On 5/23/20 10:34 PM, Artur Barsegyan wrote:

Take into account the total size of the already enqueued messages of
previously handled senders before another one.

Otherwise, we have serious degradation of receiver throughput for
case with multiple senders because another sender wakes up,
checks the queue capacity and falls into sleep again.

Each round-trip wastes CPU time a lot and leads to perceptible
throughput degradation.

Source code of:
- sender/receiver
- benchmark script
- ready graphics of before/after results

is located here: https://github.com/artur-barsegyan/systemv_queue_research


Thanks for analyzing the issue!


Signed-off-by: Artur Barsegyan 
---
  ipc/msg.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index caca67368cb5..52d634b0a65a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -214,6 +214,7 @@ static void ss_wakeup(struct msg_queue *msq,
struct msg_sender *mss, *t;
struct task_struct *stop_tsk = NULL;
	struct list_head *h = &msq->q_senders;
+   size_t msq_quota_used = 0;
  
  	list_for_each_entry_safe(mss, t, h, list) {

if (kill)
@@ -233,7 +234,7 @@ static void ss_wakeup(struct msg_queue *msq,
 * move the sender to the tail on behalf of the
 * blocked task.
 */
-   else if (!msg_fits_inqueue(msq, mss->msgsz)) {
+   else if (!msg_fits_inqueue(msq, msq_quota_used + mss->msgsz)) {
if (!stop_tsk)
stop_tsk = mss->tsk;
  
@@ -241,6 +242,7 @@ static void ss_wakeup(struct msg_queue *msq,

continue;
}
  
+		msq_quota_used += mss->msgsz;

wake_q_add(wake_q, mss->tsk);


You have missed the case of a do_msgsnd() that doesn't enqueue the message:

Situation:

- 2 messages of type 1 in the queue (2x8192 bytes, queue full)

- 6 senders waiting to send messages of type 2

- 6 receivers waiting to get messages of type 2.

If now a receiver reads one message of type 1, then all 6 senders can send.

With your patch applied, only one sender sends the message to one
receiver, and the remaining 10 tasks continue to sleep.



Could you please check and (assuming that you agree) run your benchmarks 
with the patch applied?


--

    Manfred



>From fe2f257b1950a19bf5c6f67e71aa25c2f13bcdc3 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Sun, 24 May 2020 14:47:31 +0200
Subject: [PATCH 2/2] ipc/msg.c: Handle case of senders not enqueuing the
 message

The patch "ipc/msg.c: wake up senders until there is a queue empty
capacity" avoids the thundering herd problem by wakeing up
only as many potential senders as there is free space in the queue.

This patch is a fix: If one of the senders doesn't enqueue its message,
then a search for further potential senders must be performed.

Signed-off-by: Manfred Spraul 
---
 ipc/msg.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/ipc/msg.c b/ipc/msg.c
index 52d634b0a65a..f6d5188db38a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -208,6 +208,12 @@ static inline void ss_del(struct msg_sender *mss)
 		list_del(&mss->list);
 }
 
+/*
+ * ss_wakeup() assumes that the stored senders will enqueue the pending message.
+ * Thus: If a woken up task doesn't send the enqueued message for whatever
+ * reason, then that task must call ss_wakeup() again, to ensure that no
+ * wakeup is lost.
+ */
 static void ss_wakeup(struct msg_queue *msq,
 		  struct wake_q_head *wake_q, bool kill)
 {
@@ -843,6 +849,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 	struct msg_queue *msq;
 	struct msg_msg *msg;
 	int err;
+	bool need_wakeup;
 	struct ipc_namespace *ns;
 	DEFINE_WAKE_Q(wake_q);
 
@@ -869,6 +876,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 
 	ipc_lock_object(&msq->q_perm);
 
+	need_wakeup = false;
 	for (;;) {
 		struct msg_sender s;
 
@@ -898,6 +906,13 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 		/* enqueue the sender and prepare to block */
 		ss_add(msq, &s, msgsz);
 
+		/* Enqueuing a sender is actually an obligation:
+		 * The sender must either enqueue the message, or call
+		 * ss_wakeup(). Thus track that we have added our message
+		 * to the candidates for the message queue.
+		 */
+		need_wakeup = true;
+
 		if (!ipc_rcu_getref(&msq->q_perm)) {
 			err = -EIDRM;
 			goto out_unlock0;
@@ -935,12 +950,18 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 		msq->q_qnum++;
 		atomic_add(msgsz, &ns->msg_bytes);
 		atomic_inc(&ns->msg_hdrs);
+
+		/* we have fulfilled our obligation, no need for wakeup */
+		need_wakeup = false;
 	}
 
 	err = 0;
 	msg = NULL;
 
 out_unlock0:
+	if (need_wakeup)
+		ss_wakeup(msq, &wake_q, false);
+
 	ipc_unlock_object(&msq->q_perm);
 	wake_up_q(&wake_q);
 out_unlock1:
-- 
2.26.2
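
For reference, the scenario from the mail above as a rough, untested user-space sketch (illustration only, error handling omitted): a queue filled by two 8192-byte messages of type 1, six blocked senders and six blocked receivers of type 2. One msgrcv() of a type-1 message should then allow all six type-2 transfers.

#include <stdio.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define MSGSZ	8192	/* 2 * MSGSZ == default MSGMNB, i.e. two messages fill the queue */

struct my_msg {
	long mtype;
	char mtext[MSGSZ];
};

int main(void)
{
	struct my_msg m = { .mtype = 1 };
	int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
	int i;

	/* fill the queue with two messages of type 1 */
	msgsnd(id, &m, MSGSZ, 0);
	msgsnd(id, &m, MSGSZ, 0);

	for (i = 0; i < 6; i++) {
		if (fork() == 0) {		/* blocked sender of type 2 */
			m.mtype = 2;
			msgsnd(id, &m, MSGSZ, 0);
			_exit(0);
		}
		if (fork() == 0) {		/* blocked receiver of type 2 */
			msgrcv(id, &m, MSGSZ, 2, 0);
			_exit(0);
		}
	}

	sleep(1);
	/* receiving one type-1 message frees space; ideally all six senders
	 * can now hand their messages to the six waiting receivers */
	msgrcv(id, &m, MSGSZ, 1, 0);

	sleep(1);
	msgctl(id, IPC_RMID, NULL);
	return 0;
}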



[PATCH] xarray.h: Correct return code for xa_store_{bh,irq}()

2020-04-30 Thread Manfred Spraul
__xa_store() and xa_store() document that the functions can fail, and
that the return code can be an xa_err() encoded error code.

xa_store_bh() and xa_store_irq() do not document that the functions
can fail and that they can also return xa_err() encoded error codes.

Thus: Update the documentation.

Signed-off-by: Manfred Spraul 
---
 include/linux/xarray.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index d79b8e3aa08d..2815c4ec89b1 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -576,7 +576,7 @@ void __xa_clear_mark(struct xarray *, unsigned long index, 
xa_mark_t);
  *
  * Context: Any context.  Takes and releases the xa_lock while
  * disabling softirqs.
- * Return: The entry which used to be at this index.
+ * Return: The old entry at this index or xa_err() if an error happened.
  */
 static inline void *xa_store_bh(struct xarray *xa, unsigned long index,
void *entry, gfp_t gfp)
@@ -602,7 +602,7 @@ static inline void *xa_store_bh(struct xarray *xa, unsigned 
long index,
  *
  * Context: Process context.  Takes and releases the xa_lock while
  * disabling interrupts.
- * Return: The entry which used to be at this index.
+ * Return: The old entry at this index or xa_err() if an error happened.
  */
 static inline void *xa_store_irq(struct xarray *xa, unsigned long index,
void *entry, gfp_t gfp)
-- 
2.26.2
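
A usage sketch to go with the updated documentation (hypothetical cache_insert() example, illustration only): the return value carries either the old entry or an xa_err()-encoded error pointer, so callers should check it.

#include <linux/xarray.h>

/* Illustration only: insert @entry at @index and propagate a possible
 * allocation failure. xa_store_bh()/xa_store_irq() behave like xa_store()
 * in this regard. */
static int cache_insert(struct xarray *cache, unsigned long index, void *entry)
{
	void *old;

	old = xa_store_bh(cache, index, entry, GFP_KERNEL);
	if (xa_is_err(old))
		return xa_err(old);	/* e.g. -ENOMEM */

	return 0;
}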



Re: [PATCH -next] ipc: use GFP_ATOMIC under spin lock

2020-04-28 Thread Manfred Spraul

Hello together,

On 4/28/20 1:14 PM, Matthew Wilcox wrote:

On Tue, Apr 28, 2020 at 03:47:36AM +, Wei Yongjun wrote:

The function ipc_id_alloc() is called from ipc_addid(), in which
a spin lock is held, so we should use GFP_ATOMIC instead.

Fixes: de5738d1c364 ("ipc: convert ipcs_idr to XArray")
Signed-off-by: Wei Yongjun 

I see why you think that, but it's not true.  Yes, we hold a spinlock, but
the spinlock is in an object which is not reachable from any other CPU.


Is it really allowed that spin_lock()/spin_unlock may happen on 
different cpus?


CPU1: spin_lock()

CPU1: schedule() -> sleeps

CPU2: -> schedule() returns

CPU2: spin_unlock().



Converting to GFP_ATOMIC is completely wrong.


What is your solution proposal?

xa_store() also gets a gfp_ flag. Thus even splitting _alloc() and 
_store() won't help


    xa_alloc(,entry=NULL,)
    new->seq = ...
    spin_lock();
    xa_store(,entry=new,GFP_KERNEL);

--

    Manfred




Re: [ipc/sem.c] 6394de3b86: BUG:kernel_NULL_pointer_dereference,address

2019-10-23 Thread Manfred Spraul

Hello,

On 10/21/19 10:35 AM, kernel test robot wrote:

FYI, we noticed the following commit (built with gcc-7):

commit: 6394de3b868537a90dd9128607192b0e97109f6b ("[PATCH 4/5] ipc/sem.c: Document 
and update memory barriers")
url: 
https://github.com/0day-ci/linux/commits/Manfred-Spraul/wake_q-Cleanup-Documentation-update/20191014-055627


Yes, known issue:

@@ -2148,9 +2176,11 @@ static long do_semtimedop(int semid, struct 
sembuf __user *tsops,

    }

    do {
-   WRITE_ONCE(queue.status, -EINTR);
+   /* memory ordering ensured by the lock in sem_lock() */
+   queue.status = EINTR;
    queue.sleeper = current;

+   /* memory ordering is ensured by the lock in sem_lock() */
    __set_current_state(TASK_INTERRUPTIBLE);
    sem_unlock(sma, locknum);
    rcu_read_unlock();

It must be "-EINTR", not "EINTR".

If there is a timeout or a spurious wakeup, then the do_semtimedop() 
returns to user space without unlinking everything properly.


I was able to reproduce the issue: V1 of the series ends up with the 
shown error.


V3 as now merged doesn't fail.

--

    Manfred




[PATCH 3/5] ipc/mqueue.c: Update/document memory barriers

2019-10-20 Thread Manfred Spraul
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_acquire__after_ctrl_dep() after the READ_ONCE, we need
  acquire semantics if the value is STATE_READY.

- use wake_q_add_safe()

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock. Thus the spin_unlock()
  is the RELEASE, and the spin_lock() is the ACQUIRE.

For completeness: there is also a 3 CPU scenario, if the task to be woken
up is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE

Signed-off-by: Manfred Spraul 
Reviewed-by: Davidlohr Bueso 
Cc: Waiman Long 
---
 ipc/mqueue.c | 92 
 1 file changed, 78 insertions(+), 14 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 270456530f6a..49a05ba3000d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -63,6 +63,66 @@ struct posix_msg_tree_node {
int priority;
 };
 
+/*
+ * Locking:
+ *
+ * Accesses to a message queue are synchronized by acquiring info->lock.
+ *
+ * There are two notable exceptions:
+ * - The actual wakeup of a sleeping task is performed using the wake_q
+ *   framework. info->lock is already released when wake_up_q is called.
+ * - The exit codepaths after sleeping check ext_wait_queue->state without
+ *   any locks. If it is STATE_READY, then the syscall is completed without
+ *   acquiring info->lock.
+ *
+ * MQ_BARRIER:
+ * To achieve proper release/acquire memory barrier pairing, the state is set to
+ * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
+ * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
+ *
+ * This prevents the following races:
+ *
+ * 1) With the simple wake_q_add(), the task could be gone already before
+ *    the increase of the reference happens
+ * Thread A
+ *                              Thread B
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *                              wake_q_add(A)
+ *                              if (cmpxchg()) // success
+ *                                 ->state = STATE_READY (reordered)
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * sysret to user space
+ * sys_exit()
+ *                              get_task_struct() // UaF
+ *
+ * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
+ * the smp_store_release() that does ->state = STATE_READY.
+ *
+ * 2) Without proper _release/_acquire barriers, the woken up task
+ *    could read stale data
+ *
+ * Thread A
+ *                              Thread B
+ * do_mq_timedreceive
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *                              state = STATE_READY;
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * msg_ptr = wait.msg;          // Access to stale data!
+ *                              receiver->msg = message; (reordered)
+ *
+ * Solution: use _release and _acquire barriers.
+ *
+ * 3) There is intentionally no barrier when setting current->state
+ *    to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
+ *    release memory barrier, and the wakeup is triggered when holding
+ *    info->lock, i.e. spin_lock(&info->lock) provided a pairing
+ *    acquire memory barrier.
+ */
+
 struct ext_wait_queue {/* queue of sleeping tasks */
struct task_struct *task;
struct list_head list;
@@ -646,18 +706,23 @@ static int wq_sleep(struct mqueue_inode_info *info, int 
sr,
wq_add(info, sr, ewp);
 
for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
 
 		spin_unlock(&info->lock);
time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-   if (ewp->state == STATE_READY) {
+   if (READ_ONCE(ewp->state) == STATE_READY) {
+   /* see MQ_BARRIER for purpose/pairing */
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
 		spin_lock(&info->lock);
-   if (ewp->state == STATE_READY) {
+
+   /* we hold info->lock, so no memory barrier required */
+   if (READ_ONCE(ewp->state) == STATE_READY) {
retval = 0;
goto out_unlock;
}
@@ -923,16 +988,11 @@ static inline void __pipelined_op(struct wake_q_head 
*wake_q,
  struct ext_wait_queue *this)

[PATCH 5/5] ipc/sem.c: Document and update memory barriers

2019-10-20 Thread Manfred Spraul
The patch documents and updates the memory barriers in ipc/sem.c:
- Add smp_store_release() to wake_up_sem_queue_prepare() and
  document why it is needed.

- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
  as the pair for the barrier inside wake_up_sem_queue_prepare().

- Add comments to all barriers, and mention the rules in the block
  regarding locking.

- Switch to using wake_q_add_safe().

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/sem.c | 66 ++-
 1 file changed, 41 insertions(+), 25 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ec97a7072413..c89734b200c6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -205,15 +205,38 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
  *
  * Memory ordering:
  * Most ordering is enforced by using spin_lock() and spin_unlock().
- * The special case is use_global_lock:
+ *
+ * Exceptions:
+ * 1) use_global_lock: (SEM_BARRIER_1)
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
- * using smp_store_release().
+ * using smp_store_release(): Immediately after setting it to 0,
+ * a simple op can start.
  * Testing if it is non-zero is an ACQUIRE, this is ensured by using
  * smp_load_acquire().
  * Setting it from 0 to non-zero must be ordered with regards to
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ *
+ * 2) queue.status: (SEM_BARRIER_2)
+ * Initialization is done while holding sem_lock(), so no further barrier is
+ * required.
+ * Setting it to a result code is a RELEASE, this is ensured by both a
+ * smp_store_release() (for case a) and while holding sem_lock()
+ * (for case b).
+ * The ACQUIRE when reading the result code without holding sem_lock() is
+ * achieved by using READ_ONCE() + smp_acquire__after_ctrl_dep().
+ * (case a above).
+ * Reading the result code while holding sem_lock() needs no further barriers,
+ * the locks inside sem_lock() enforce ordering (case b above)
+ *
+ * 3) current->state:
+ * current->state is set to TASK_INTERRUPTIBLE while holding sem_lock().
+ * The wakeup is handled using the wake_q infrastructure. wake_q wakeups may
+ * happen immediately after calling wake_q_add. As wake_q_add_safe() is called
+ * when holding sem_lock(), no further barriers are required.
+ *
+ * See also ipc/mqueue.c for more details on the covered races.
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -344,12 +367,8 @@ static void complexmode_tryleave(struct sem_array *sma)
return;
}
if (sma->use_global_lock == 1) {
-   /*
-* Immediately after setting use_global_lock to 0,
-* a simple op can start. Thus: all memory writes
-* performed by the current operation must be visible
-* before we set use_global_lock to 0.
-*/
+
+   /* See SEM_BARRIER_1 for purpose/pairing */
 		smp_store_release(&sma->use_global_lock, 0);
} else {
sma->use_global_lock--;
@@ -400,7 +419,7 @@ static inline int sem_lock(struct sem_array *sma, struct 
sembuf *sops,
 */
 	spin_lock(&sem->lock);
 
-   /* pairs with smp_store_release() */
+   /* see SEM_BARRIER_1 for purpose/pairing */
 	if (!smp_load_acquire(&sma->use_global_lock)) {
/* fast path successful! */
return sops->sem_num;
@@ -766,15 +785,12 @@ static int perform_atomic_semop(struct sem_array *sma, 
struct sem_queue *q)
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 struct wake_q_head *wake_q)
 {
-   wake_q_add(wake_q, q->sleeper);
-   /*
-* Rely on the above implicit barrier, such that we can
-* ensure that we hold reference to the task before setting
-* q->status. Otherwise we could race with do_exit if the
-* task is awoken by an external event before calling
-* wake_up_process().
-*/
-	WRITE_ONCE(q->status, error);
+	get_task_struct(q->sleeper);
+
+	/* see SEM_BARRIER_2 for purpose/pairing */
+	smp_store_release(&q->status, error);
+
+	wake_q_add_safe(wake_q, q->sleeper);
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -2148,9 +2164,11 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
}
 
do {
+   /* memory ordering ensured by the lock in sem_lock() */
WRITE_ONCE(queue.status, -EINTR);
queue.sleeper = current;
 
+   /* memory ordering is ensured by the lock in sem_lock() */
__set_current_state(TASK_INTERRUPTIBLE);
sem_unlock(s

[PATCH 2/5] ipc/mqueue.c: Remove duplicated code

2019-10-20 Thread Manfred Spraul
Patch from Davidlohr, I just added this change log.
pipelined_send() and pipelined_receive() are identical, so merge them.

Signed-off-by: Davidlohr Bueso 
Signed-off-by: Manfred Spraul 
---
 ipc/mqueue.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..270456530f6a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
  * The same algorithm is used for senders.
  */
 
-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
- struct msg_msg *message,
- struct ext_wait_queue *receiver)
+ struct ext_wait_queue *this)
 {
-   receiver->msg = message;
-	list_del(&receiver->list);
-	wake_q_add(wake_q, receiver->task);
+	list_del(&this->list);
+   wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct wake_q_head 
*wake_q,
 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-   receiver->state = STATE_READY;
+   this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+ struct mqueue_inode_info *info,
+ struct msg_msg *message,
+ struct ext_wait_queue *receiver)
+{
+   receiver->msg = message;
+   __pipelined_op(wake_q, info, receiver);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct wake_q_head 
*wake_q,
if (msg_insert(sender->msg, info))
return;
 
-	list_del(&sender->list);
-   wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+   __pipelined_op(wake_q, info, sender);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
-- 
2.21.0



[PATCH 4/5] ipc/msg.c: Update and document memory barriers.

2019-10-20 Thread Manfred Spraul
Transfer findings from ipc/mqueue.c:
- A control barrier was missing for the lockless receive case
  So in theory, not yet initialized data may have been copied
  to user space - obviously only for architectures where
  control barriers are not NOP.

- use smp_store_release(). In theory, the refcount
  may have been decreased to 0 already when wake_q_add()
  tries to get a reference.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 43 ---
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 8dec945fa030..192a9291a8ab 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -61,6 +61,16 @@ struct msg_queue {
struct list_head q_senders;
 } __randomize_layout;
 
+/*
+ * MSG_BARRIER Locking:
+ *
+ * Similar to the optimization used in ipc/mqueue.c, one syscall return path
+ * does not acquire any locks when it sees that a message exists in
+ * msg_receiver.r_msg. Therefore r_msg is set using smp_store_release()
+ * and accessed using READ_ONCE()+smp_acquire__after_ctrl_dep(). In addition,
+ * wake_q_add_safe() is used. See ipc/mqueue.c for more details
+ */
+
 /* one msg_receiver structure for each sleeping receiver */
 struct msg_receiver {
	struct list_head	r_list;
@@ -184,6 +194,10 @@ static inline void ss_add(struct msg_queue *msq,
 {
mss->tsk = current;
mss->msgsz = msgsz;
+   /*
+* No memory barrier required: we did ipc_lock_object(),
+* and the waker obtains that lock before calling wake_q_add().
+*/
__set_current_state(TASK_INTERRUPTIBLE);
 	list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -237,8 +251,11 @@ static void expunge_all(struct msg_queue *msq, int res,
struct msg_receiver *msr, *t;
 
 	list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
-		wake_q_add(wake_q, msr->r_tsk);
-		WRITE_ONCE(msr->r_msg, ERR_PTR(res));
+		get_task_struct(msr->r_tsk);
+
+		/* see MSG_BARRIER for purpose/pairing */
+		smp_store_release(&msr->r_msg, ERR_PTR(res));
+		wake_q_add_safe(wake_q, msr->r_tsk);
 	}
 }
 
@@ -798,13 +815,17 @@ static inline int pipelined_send(struct msg_queue *msq, 
struct msg_msg *msg,
list_del(>r_list);
if (msr->r_maxsize < msg->m_ts) {
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
+
+   /* See expunge_all regarding memory barrier */
+			smp_store_release(&msr->r_msg, ERR_PTR(-E2BIG));
 		} else {
 			ipc_update_pid(&msq->q_lrpid, task_pid(msr->r_tsk));
 			msq->q_rtime = ktime_get_real_seconds();
 
 			wake_q_add(wake_q, msr->r_tsk);
-			WRITE_ONCE(msr->r_msg, msg);
+
+			/* See expunge_all regarding memory barrier */
+			smp_store_release(&msr->r_msg, msg);
return 1;
}
}
@@ -1154,7 +1175,11 @@ static long do_msgrcv(int msqid, void __user *buf, 
size_t bufsz, long msgtyp, in
msr_d.r_maxsize = INT_MAX;
else
msr_d.r_maxsize = bufsz;
-   msr_d.r_msg = ERR_PTR(-EAGAIN);
+
+	/* memory barrier not required due to ipc_lock_object() */
+   WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));
+
+   /* memory barrier not required, we own ipc_lock_object() */
__set_current_state(TASK_INTERRUPTIBLE);
 
 	ipc_unlock_object(&msq->q_perm);
@@ -1183,8 +1208,12 @@ static long do_msgrcv(int msqid, void __user *buf, 
size_t bufsz, long msgtyp, in
 * signal) it will either see the message and continue ...
 */
msg = READ_ONCE(msr_d.r_msg);
-   if (msg != ERR_PTR(-EAGAIN))
+   if (msg != ERR_PTR(-EAGAIN)) {
+   /* see MSG_BARRIER for purpose/pairing */
+   smp_acquire__after_ctrl_dep();
+
goto out_unlock1;
+   }
 
 /*
  * ... or see -EAGAIN, acquire the lock to check the message
@@ -1192,7 +1221,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t 
bufsz, long msgtyp, in
  */
 	ipc_lock_object(&msq->q_perm);
 
-   msg = msr_d.r_msg;
+   msg = READ_ONCE(msr_d.r_msg);
if (msg != ERR_PTR(-EAGAIN))
goto out_unlock0;
 
-- 
2.21.0



[PATCH 0/5] V3: Clarify/standardize memory barriers for ipc

2019-10-20 Thread Manfred Spraul
Hi,

Updated series, based on input from Davidlohr and Peter Zijlstra:

- I've dropped the documentation update for wake_q_add, as what it
  states is normal: When you call a function and pass a parameter
  to a structure, you as caller are responsible to ensure that the 
  parameter is valid, and remains valid for the duration of the
  function call, including any tearing due to memory reordering.
  In addition, I've switched ipc to wake_q_add_safe().

- The patch to Documentation/memory_barriers.txt now as first change.
  @Davidlohr: You proposed to have 2 paragraphs: First, one for
  add/subtract, then one for failed cmpxchg. I didn't like that:
  We have one rule (can be combined with non-mb RMW ops), and then
  examples what are non-mb RMW ops. Listing special cases just ask
  for issues later.
  What I don't know is if there should be examples at all in
  Documentation/memory_barriers, or just
  "See Documentation/atomic_t.txt for examples of RMW ops that
  do not contain a memory barrier"

- For the memory barrier pairs in ipc/, I have now added
  /* See ABC_BARRIER for purpose/pairing */ as standard comment,
  and then a block near the relevant structure where purpose, pairing
  races, ... are explained. I think this makes it easier to read,
  compared to adding it to both the _release and _acquire branches.

Description/purpose:

The memory barriers in ipc are not properly documented, and at least
for some architectures insufficient:
Reading the xyz->status is only a control barrier, thus
smp_acquire__after_ctrl_dep() was missing in mqueue.c and msg.c
sem.c contained a full smp_mb(), which is not required.

Patches:
Patch 1: Documentation for smp_mb__{before,after}_atomic().

Patch 2: Remove code duplication inside ipc/mqueue.c

Patch 3-5: Update the ipc code, especially add missing
   smp_acquire__after_ctrl_dep() and switch to wake_q_add_safe().

Clarify that smp_mb__{before,after}_atomic() are compatible with all
RMW atomic operations, not just the operations that do not return a value.

Open issues:
- More testing. I did some tests, but doubt that the tests would be
  sufficient to show issues with regards to incorrect memory barriers.

What do you think?

--
Manfred


[PATCH 1/5] smp_mb__{before,after}_atomic(): Update Documentation

2019-10-20 Thread Manfred Spraul
When adding the _{acquire|release|relaxed}() variants of some atomic
operations, it was forgotten to update Documentation/memory_barrier.txt:

smp_mb__{before,after}_atomic() is now intended for all RMW operations
that do not imply a memory barrier.

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

In addition, the patch splits the long sentence into multiple shorter
sentences.

Fixes: 654672d4ba1a ("locking/atomics: Add _{acquire|release|relaxed}() 
variants of some atomic operations")

Signed-off-by: Manfred Spraul 
Acked-by: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
Cc: Will Deacon 
---
 Documentation/memory-barriers.txt | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 1adbb8a371c7..fe43f4b30907 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,16 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
-
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are for use with atomic RMW functions that do not imply memory
+ barriers, but where the code needs a memory barrier. Examples for atomic
+ RMW functions that do not imply a memory barrier are e.g. add,
+ subtract, (failed) conditional operations, _relaxed functions,
+ but not atomic_read or atomic_set. A common example where a memory
+ barrier may be required is when atomic ops are used for reference
+ counting.
+
+ These are also used for atomic RMW bitop functions that do not imply a
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



Re: [PATCH 3/6] ipc/mqueue.c: Update/document memory barriers

2019-10-14 Thread Manfred Spraul

Hi Peter,

On 10/14/19 3:58 PM, Peter Zijlstra wrote:

On Mon, Oct 14, 2019 at 02:59:11PM +0200, Peter Zijlstra wrote:

On Sat, Oct 12, 2019 at 07:49:55AM +0200, Manfred Spraul wrote:


for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
  
  		spin_unlock(&info->lock);

time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
  
+		if (READ_ONCE(ewp->state) == STATE_READY) {

+   /*
+* Pairs, together with READ_ONCE(), with
+* the barrier in __pipelined_op().
+*/
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
 		spin_lock(&info->lock);
+
+   /* we hold info->lock, so no memory barrier required */
+   if (READ_ONCE(ewp->state) == STATE_READY) {
retval = 0;
goto out_unlock;
}
@@ -925,14 +933,12 @@ static inline void __pipelined_op(struct wake_q_head 
*wake_q,
list_del(>list);
wake_q_add(wake_q, this->task);
/*
+* The barrier is required to ensure that the refcount increase
+* inside wake_q_add() is completed before the state is updated.

fails to explain *why* this is important.


+*
+* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
 */
+smp_store_release(&this->state, STATE_READY);

You retained the whitespace damage.

And I'm terribly confused by this code, probably due to the lack of
'why' as per the above. What is this trying to do?

Are we worried about something like:

A   B   C


wq_sleep()
  schedule_...();

					/* spurious wakeup */

wake_up_process(B)

wake_q_add(A)
  if (cmpxchg()) // success

->state = STATE_READY (reordered)

  if (READ_ONCE() == STATE_READY)
goto out;

exit();


get_task_struct() // UaF


Can we put the exact and full race in the comment please?


Yes, I'll do that. Actually, two threads are sufficient:

A                                      B

WRITE_ONCE(wait.state, STATE_NONE);
schedule_hrtimeout()
                                       wake_q_add(A)
                                       if (cmpxchg()) // success
                                          ->state = STATE_READY (reordered)

if (wait.state == STATE_READY) return;
sysret to user space
sys_exit()
                                       get_task_struct() // UaF



Like Davidlohr already suggested, elsewhere we write it like so:


--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -930,15 +930,10 @@ static inline void __pipelined_op(struct
  struct mqueue_inode_info *info,
  struct ext_wait_queue *this)
  {
+   get_task_struct(this->task);
 	list_del(&this->list);
-   wake_q_add(wake_q, this->task);
-   /*
-* The barrier is required to ensure that the refcount increase
-* inside wake_q_add() is completed before the state is updated.
-*
-* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
-*/
-smp_store_release(&this->state, STATE_READY);
+	smp_store_release(&this->state, STATE_READY);
+   wake_q_add_safe(wake_q, this->task);
  }
  
  /* pipelined_send() - send a message directly to the task waiting in


Much better, I'll rewrite it and then resend the series.

--

    Manfred



Re: [PATCH 6/6] Documentation/memory-barriers.txt: Clarify cmpxchg()

2019-10-14 Thread Manfred Spraul

Hello Peter,

On 10/14/19 3:03 PM, Peter Zijlstra wrote:

On Sat, Oct 12, 2019 at 07:49:58AM +0200, Manfred Spraul wrote:

The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may be, and is, used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

The value return of atomic ops is relevant in so far that
(traditionally) all value returning atomic ops already implied full
barriers. That of course changed when we added
_release/_acquire/_relaxed variants.

I've updated the Change description accordingly

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
  Documentation/memory-barriers.txt | 11 ++-
  1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
   (*) smp_mb__before_atomic();
   (*) smp_mb__after_atomic();
  
- These are for use with atomic (such as add, subtract, increment and

- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
  
- These are also used for atomic bitop functions that do not return a

- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do imply a full

s/do/do not/ ?

Sorry, yes, of course

+ memory barrier (such as set_bit and clear_bit).



>From 61c85a56994e32ea393af9debef4cccd9cd24abd Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Fri, 11 Oct 2019 10:33:26 +0200
Subject: [PATCH] Update Documentation for _{acquire|release|relaxed}()

When adding the _{acquire|release|relaxed}() variants of some atomic
operations, it was forgotten to update Documentation/memory_barrier.txt:

smp_mb__{before,after}_atomic() is now intended for all RMW operations
that do not imply a full memory barrier.

1)
	smp_mb__before_atomic();
	atomic_add();

2)
	smp_mb__before_atomic();
	atomic_xchg_relaxed();

3)
	smp_mb__before_atomic();
	atomic_fetch_add_relaxed();

Invalid would be:
	smp_mb__before_atomic();
	atomic_set();

Fixes: 654672d4ba1a ("locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations")

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
Cc: Will Deacon 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..08090eea3751 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do not imply a
+ full memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



[PATCH 2/6] ipc/mqueue.c: Remove duplicated code

2019-10-11 Thread Manfred Spraul
Patch entirely from Davidlohr:
pipelined_send() and pipelined_receive() are identical, so merge them.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 ipc/mqueue.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..be48c0ba92f7 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
  * The same algorithm is used for senders.
  */
 
-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
- struct msg_msg *message,
- struct ext_wait_queue *receiver)
+ struct ext_wait_queue *this)
 {
-   receiver->msg = message;
-	list_del(&receiver->list);
-	wake_q_add(wake_q, receiver->task);
+	list_del(&this->list);
+   wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct wake_q_head 
*wake_q,
 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-   receiver->state = STATE_READY;
+this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+ struct mqueue_inode_info *info,
+ struct msg_msg *message,
+ struct ext_wait_queue *receiver)
+{
+   receiver->msg = message;
+   __pipelined_op(wake_q, info, receiver);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct wake_q_head 
*wake_q,
if (msg_insert(sender->msg, info))
return;
 
-	list_del(&sender->list);
-   wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+   __pipelined_op(wake_q, info, sender);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
-- 
2.21.0



[PATCH 4/6] ipc/msg.c: Update and document memory barriers.

2019-10-11 Thread Manfred Spraul
Transfer findings from ipc/mqueue.c:
- A control barrier was missing for the lockless receive case
  So in theory, not yet initialized data may have been copied
  to user space - obviously only for architectures where
  control barriers are not NOP.

- use smp_store_release(). In theory, the refcount
  may have been decreased to 0 already when wake_q_add()
  tries to get a reference.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 44 ++--
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 8dec945fa030..e6b20a7e6341 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -184,6 +184,10 @@ static inline void ss_add(struct msg_queue *msq,
 {
mss->tsk = current;
mss->msgsz = msgsz;
+   /*
+* No memory barrier required: we did ipc_lock_object(),
+* and the waker obtains that lock before calling wake_q_add().
+*/
__set_current_state(TASK_INTERRUPTIBLE);
 	list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -238,7 +242,14 @@ static void expunge_all(struct msg_queue *msq, int res,
 
 	list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, ERR_PTR(res));
+
+   /*
+* The barrier is required to ensure that the refcount increase
+* inside wake_q_add() is completed before the state is updated.
+*
+* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
+*/
+		smp_store_release(&msr->r_msg, ERR_PTR(res));
}
 }
 
@@ -798,13 +809,17 @@ static inline int pipelined_send(struct msg_queue *msq, 
struct msg_msg *msg,
 	list_del(&msr->r_list);
if (msr->r_maxsize < msg->m_ts) {
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
+
+   /* See expunge_all regarding memory barrier */
+			smp_store_release(&msr->r_msg, ERR_PTR(-E2BIG));
} else {
 			ipc_update_pid(&msq->q_lrpid, task_pid(msr->r_tsk));
msq->q_rtime = ktime_get_real_seconds();
 
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, msg);
+
+   /* See expunge_all regarding memory barrier */
+			smp_store_release(&msr->r_msg, msg);
return 1;
}
}
@@ -1154,7 +1169,11 @@ static long do_msgrcv(int msqid, void __user *buf, 
size_t bufsz, long msgtyp, in
msr_d.r_maxsize = INT_MAX;
else
msr_d.r_maxsize = bufsz;
-   msr_d.r_msg = ERR_PTR(-EAGAIN);
+
+	/* memory barrier not required due to ipc_lock_object() */
+   WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));
+
+   /* memory barrier not required, we own ipc_lock_object() */
__set_current_state(TASK_INTERRUPTIBLE);
 
 	ipc_unlock_object(&msq->q_perm);
@@ -1183,8 +1202,21 @@ static long do_msgrcv(int msqid, void __user *buf, 
size_t bufsz, long msgtyp, in
 * signal) it will either see the message and continue ...
 */
msg = READ_ONCE(msr_d.r_msg);
-   if (msg != ERR_PTR(-EAGAIN))
+   if (msg != ERR_PTR(-EAGAIN)) {
+   /*
+* Memory barrier for msr_d.r_msg
+* The smp_acquire__after_ctrl_dep(), together with the
+* READ_ONCE() above pairs with the barrier inside
+* wake_q_add().
+* The barrier protects the accesses to the message in
+* do_msg_fill(). In addition, the barrier protects user
+* space, too: User space may assume that all data from
+* the CPU that sent the message is visible.
+*/
+   smp_acquire__after_ctrl_dep();
+
goto out_unlock1;
+   }
 
 /*
  * ... or see -EAGAIN, acquire the lock to check the message
@@ -1192,7 +1224,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t 
bufsz, long msgtyp, in
  */
 	ipc_lock_object(&msq->q_perm);
 
-   msg = msr_d.r_msg;
+   msg = READ_ONCE(msr_d.r_msg);
if (msg != ERR_PTR(-EAGAIN))
goto out_unlock0;
 
-- 
2.21.0



[PATCH 5/6] ipc/sem.c: Document and update memory barriers

2019-10-11 Thread Manfred Spraul
The patch documents and updates the memory barriers in ipc/sem.c:
- Add smp_store_release() to wake_up_sem_queue_prepare() and
  document why it is needed.

- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
  as the pair for the barrier inside wake_up_sem_queue_prepare().

- Add comments to all barriers, and mention the rules in the block
  regarding locking.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/sem.c | 63 ---
 1 file changed, 51 insertions(+), 12 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ec97a7072413..c6c5954a2030 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -205,7 +205,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
  *
  * Memory ordering:
  * Most ordering is enforced by using spin_lock() and spin_unlock().
- * The special case is use_global_lock:
+ *
+ * Exceptions:
+ * 1) use_global_lock:
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
  * using smp_store_release().
  * Testing if it is non-zero is an ACQUIRE, this is ensured by using
@@ -214,6 +216,24 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ *
+ * 2) queue.status:
+ * Initialization is done while holding sem_lock(), so no further barrier is
+ * required.
+ * Setting it to a result code is a RELEASE, this is ensured by both a
+ * smp_store_release() (for case a) and while holding sem_lock()
+ * (for case b).
+ * The ACQUIRE when reading the result code without holding sem_lock() is
+ * achieved by using READ_ONCE() + smp_acquire__after_ctrl_dep().
+ * (case a above).
+ * Reading the result code while holding sem_lock() needs no further barriers,
+ * the locks inside sem_lock() enforce ordering (case b above)
+ *
+ * 3) current->state:
+ * current->state is set to TASK_INTERRUPTIBLE while holding sem_lock().
+ * The wakeup is handled using the wake_q infrastructure. wake_q wakeups may
+ * happen immediately after calling wake_q_add. As wake_q_add() is called
+ * when holding sem_lock(), no further barriers are required.
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -766,15 +786,24 @@ static int perform_atomic_semop(struct sem_array *sma, 
struct sem_queue *q)
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 struct wake_q_head *wake_q)
 {
+   /*
+* When the wakeup is performed, q->sleeper->state is read and later
+* set to TASK_RUNNING. This may happen at any time, even before
+* wake_q_add() returns. Memory ordering for q->sleeper->state is
+* enforced by sem_lock(): we own sem_lock now (that was the ACQUIRE),
+* and q->sleeper wrote q->sleeper->state before calling sem_unlock()
+* (->RELEASE).
+*/
wake_q_add(wake_q, q->sleeper);
/*
-* Rely on the above implicit barrier, such that we can
-* ensure that we hold reference to the task before setting
-* q->status. Otherwise we could race with do_exit if the
-* task is awoken by an external event before calling
-* wake_up_process().
+* Here, we need a barrier to protect the refcount increase inside
+* wake_q_add().
+* case a: The barrier inside wake_q_add() pairs with
+* READ_ONCE(q->status) + smp_acquire__after_ctrl_dep() in
+* do_semtimedop().
+* case b: nothing, ordering is enforced by the locks in sem_lock().
 */
-   WRITE_ONCE(q->status, error);
+   smp_store_release(&q->status, error);
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -2148,9 +2177,11 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
}
 
do {
+   /* memory ordering ensured by the lock in sem_lock() */
WRITE_ONCE(queue.status, -EINTR);
queue.sleeper = current;
 
+   /* memory ordering is ensured by the lock in sem_lock() */
__set_current_state(TASK_INTERRUPTIBLE);
sem_unlock(sma, locknum);
rcu_read_unlock();
@@ -2174,12 +2205,16 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
error = READ_ONCE(queue.status);
if (error != -EINTR) {
/*
-* User space could assume that semop() is a memory
-* barrier: Without the mb(), the cpu could
-* speculatively read in userspace stale data that was
-* overwritten by the previous owner of the semaphore.
+* Memory barrier for queue.status, case a):
+ 

[PATCH 6/6] Documentation/memory-barriers.txt: Clarify cmpxchg()

2019-10-11 Thread Manfred Spraul
The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may and is used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do imply a full
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



[PATCH 3/6] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_acquire__after_ctrl_dep() after the READ_ONCE, we need
  acquire semantics if the value is STATE_READY.

- add an explicit memory barrier to __pipelined_op(), the
  refcount must have been increased before the updated state becomes
  visible

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock. Thus the spin_unlock()
  is the RELEASE, and the spin_lock() is the ACQUIRE.

For completeness: there is also a 3-CPU scenario, if the to-be-woken-up
task is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/mqueue.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index be48c0ba92f7..b80574822f0a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -646,18 +646,26 @@ static int wq_sleep(struct mqueue_inode_info *info, int 
sr,
wq_add(info, sr, ewp);
 
for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
 
spin_unlock(&info->lock);
time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-   if (ewp->state == STATE_READY) {
+   if (READ_ONCE(ewp->state) == STATE_READY) {
+   /*
+* Pairs, together with READ_ONCE(), with
+* the barrier in __pipelined_op().
+*/
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
spin_lock(&info->lock);
-   if (ewp->state == STATE_READY) {
+
+   /* we hold info->lock, so no memory barrier required */
+   if (READ_ONCE(ewp->state) == STATE_READY) {
retval = 0;
goto out_unlock;
}
@@ -925,14 +933,12 @@ static inline void __pipelined_op(struct wake_q_head 
*wake_q,
list_del(&this->list);
wake_q_add(wake_q, this->task);
/*
-* Rely on the implicit cmpxchg barrier from wake_q_add such
-* that we can ensure that updating receiver->state is the last
-* write operation: As once set, the receiver can continue,
-* and if we don't have the reference count from the wake_q,
-* yet, at that point we can later have a use-after-free
-* condition and bogus wakeup.
+* The barrier is required to ensure that the refcount increase
+* inside wake_q_add() is completed before the state is updated.
+*
+* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
 */
-this->state = STATE_READY;
+smp_store_release(&this->state, STATE_READY);
 }
 
 /* pipelined_send() - send a message directly to the task waiting in
@@ -1049,7 +1055,9 @@ static int do_mq_timedsend(mqd_t mqdes, const char __user 
*u_msg_ptr,
} else {
wait.task = current;
wait.msg = (void *) msg_ptr;
-   wait.state = STATE_NONE;
+
+   /* memory barrier not required, we hold info->lock */
+   WRITE_ONCE(wait.state, STATE_NONE);
ret = wq_sleep(info, SEND, timeout, &wait);
/*
 * wq_sleep must be called with info->lock held, and
@@ -1152,7 +1160,9 @@ static int do_mq_timedreceive(mqd_t mqdes, char __user 
*u_msg_ptr,
ret = -EAGAIN;
} else {
wait.task = current;
-   wait.state = STATE_NONE;
+
+   /* memory barrier not required, we hold info->lock */
+   WRITE_ONCE(wait.state, STATE_NONE);
ret = wq_sleep(info, RECV, timeout, &wait);
msg_ptr = wait.msg;
}
-- 
2.21.0



[PATCH 1/6] wake_q: Cleanup + Documentation update.

2019-10-11 Thread Manfred Spraul
1) wake_q_add() contains a memory barrier, and callers such as
ipc/mqueue.c rely on this barrier.
Unfortunately, this is documented in ipc/mqueue.c, and not in the
description of wake_q_add().
Therefore: Update the documentation.
Removing/updating ipc/mqueue.c will happen with the next patch in the
series.

2) wake_q_add() ends with get_task_struct(), which is an
unordered refcount increase. Add a clear comment that the callers
are responsible for a barrier: most likely spin_unlock() or
smp_store_release().

3) wake_up_q() relies on the memory barrier in try_to_wake_up().
Add a comment, to simplify searching.

4) wake_q.next is accessed without synchronization by wake_q_add(),
using cmpxchg_relaxed(), and by wake_up_q().
Therefore: Use WRITE_ONCE in wake_up_q(), to ensure that the
compiler doesn't perform any tricks.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 kernel/sched/core.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd05a378631a..60ae574317fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -440,8 +440,16 @@ static bool __wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
  * @task: the task to queue for 'later' wakeup
  *
  * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
+ * same context, _HOWEVER_ this is not guaranteed. Especially, the wakeup
+ * may happen before the function returns.
+ *
+ * What is guaranteed is that there is a memory barrier before the wakeup,
+ * callers may rely on this barrier.
+ *
+ * On the other hand, the caller must guarantee that @task does not disappear
+ * before wake_q_add() completed. wake_q_add() does not contain any memory
+ * barrier to ensure ordering, thus the caller may need to use
+ * smp_store_release().
  *
  * This function must be used as-if it were wake_up_process(); IOW the task
  * must be ready to be woken at this location.
@@ -486,11 +494,14 @@ void wake_up_q(struct wake_q_head *head)
BUG_ON(!task);
/* Task can safely be re-inserted now: */
node = node->next;
-   task->wake_q.next = NULL;
+
+   WRITE_ONCE(task->wake_q.next, NULL);
 
/*
 * wake_up_process() executes a full barrier, which pairs with
 * the queueing in wake_q_add() so as not to miss wakeups.
+* The barrier is the smp_mb__after_spinlock() in
+* try_to_wake_up().
 */
wake_up_process(task);
put_task_struct(task);
-- 
2.21.0



[PATCH 0/6] V2: Clarify/standardize memory barriers for ipc

2019-10-11 Thread Manfred Spraul
Hi,

Updated series, based on input from Davidlohr:

- Mixing WRITE_ONCE(), when not holding a lock, and "normal" writes,
  when holding a lock, makes the code less readable.
  Thus use _ONCE() everywhere, for both WRITE_ONCE() and READ_ONCE().

- According to my understanding, wake_q_add() does not contain a barrier
  that protects the refcount increase. Document that, and add the barrier
  to the ipc code

- and, based on patch review: The V1 patch for ipc/sem.c is incorrect,
  ->state must be set to "-EINTR", not EINTR.

From V1:

The memory barriers in ipc are not properly documented, and at least
for some architectures insufficient:
Reading the xyz->status is only a control barrier, thus
smp_acquire__after_ctrl_dep() was missing in mqueue.c and msg.c
sem.c contained a full smp_mb(), which is not required.

Patches:
Patch 1: Document the barrier rules for wake_q_add().

Patch 2: remove code duplication
@Davidlohr: There is no "Signed-off-by" in your mail, otherwise I would
list you as author.

Patch 3-5: Update the ipc code, especially add missing
   smp_acquire__after_ctrl_dep().

Clarify that smp_mb__{before,after}_atomic() are compatible with all
RMW atomic operations, not just the operations that do not return a value.

Patch 6: Documentation for smp_mb__{before,after}_atomic().

Open issues:
- Is my analysis regarding the refcount correct?

- Review other users of wake_q_add().

- More testing. I did some tests, but doubt that the tests would be
  sufficient to show issues with regards to incorrect memory barriers.

- Should I add a "Fixes:" or "Cc:stable"? The issues that I see are
  the missing smp_acquire__after_ctrl_dep(), and WRITE_ONCE() vs.
  "ptr = NULL", and a risk regarding the refcount that I can't evaluate.


What do you think?

--
Manfred


Re: [PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul

On 10/11/19 6:55 PM, Davidlohr Bueso wrote:

On Fri, 11 Oct 2019, Manfred Spraul wrote:


Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.


In general we relied on the barrier for not needing READ/WRITE_ONCE,
but I agree this scenario should be better documented with them.


After reading core-api/atomic_ops.rst:

> _ONCE() should be used. [...] Alternatively, you can place a barrier.

So both approaches are ok.

Let's follow the "should", i.e.: convert all operations on the ->state
variables to READ_ONCE()/WRITE_ONCE().


Then we have a standard, and since we can follow the "should", we should 
do that.



Similarly imo, the 'state' should also need them for write, even if
under the lock -- consistency and documentation, for example.

Ok, so let's convert everything to _ONCE. (assuming that my analysis 
below is incorrect)

In addition, I think it makes sense to encapsulate some of the
pipelined send/recv operations, that also can allow us to keep
the barrier comments in pipelined_send(), which I wonder why
you chose to remove. Something like so, before your changes:

I thought that the simple "memory barrier is provided" was enough, so I 
had removed the comment.



But you are right, there are two different scenarios:

1) thread already in another wake_q, wakeup happens immediately after 
the cmpxchg_relaxed().


This scenario is safe, due to the smp_mb__before_atomic() in wake_q_add()

2) thread woken up by e.g. a timeout, sees ->state == STATE_READY, returns 
to user space, calls sys_exit.


This must not happen before get_task_struct acquired a reference.

And this appears to be unsafe: get_task_struct() is refcount_inc(), 
which is refcount_inc_checked(), which, according to lib/refcount.c, is 
fully unordered.


Thus: ->state=STATE_READY can execute before the refcount increase.

Thus: ->state=STATE_READY needs a smp_store_release(), correct?
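
To make the pairing concrete, a simplified sketch of what that would look 
like (not the final patch; "receiver" and "ewp" are the ext_wait_queue 
entries already used in mqueue.c):

    /* waker: pipelined_send(), holds info->lock */
    wake_q_add(wake_q, receiver->task);        /* get_task_struct() inside,
                                                  unordered refcount_inc() */
    smp_store_release(&receiver->state, STATE_READY);
                                               /* RELEASE: the refcount
                                                  increase is visible before
                                                  STATE_READY              */

    /* sleeper: wq_sleep(), info->lock dropped */
    if (READ_ONCE(ewp->state) == STATE_READY) {
            smp_acquire__after_ctrl_dep();     /* pairs with the RELEASE   */
            retval = 0;
            goto out;                          /* may exit; the waker already
                                                  holds a task reference   */
    }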


diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..be48c0ba92f7 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, 
u_name)

 * The same algorithm is used for senders.
 */

-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
-  struct msg_msg *message,
-  struct ext_wait_queue *receiver)
+  struct ext_wait_queue *this)
{
-    receiver->msg = message;
-    list_del(&receiver->list);
-    wake_q_add(wake_q, receiver->task);
+    list_del(&this->list);
+    wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct 
wake_q_head *wake_q,

 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-    receiver->state = STATE_READY;
+    this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+  struct mqueue_inode_info *info,
+  struct msg_msg *message,
+  struct ext_wait_queue *receiver)
+{
+    receiver->msg = message;
+    __pipelined_op(wake_q, info, receiver);
}

/* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct 
wake_q_head *wake_q,

if (msg_insert(sender->msg, info))
    return;

-    list_del(&sender->list);
-    wake_q_add(wake_q, sender->task);
-    sender->state = STATE_READY;
+    __pipelined_op(wake_q, info, sender);
}

static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,


I would merge that into the series, ok?

--

    Manfred



[PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_acquire__after_ctrl_dep() after the READ_ONCE, we need
  acquire semantics if the value is STATE_READY.

- document that the code relies on the barrier inside wake_q_add()

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/mqueue.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..902167407737 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -646,17 +646,25 @@ static int wq_sleep(struct mqueue_inode_info *info, int 
sr,
wq_add(info, sr, ewp);
 
for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
 
spin_unlock(&info->lock);
time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-   if (ewp->state == STATE_READY) {
+   if (READ_ONCE(ewp->state) == STATE_READY) {
+   /*
+* Pairs, together with READ_ONCE(), with
+* the barrier in wake_q_add().
+*/
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
spin_lock(&info->lock);
+
+   /* we hold info->lock, so no memory barrier required */
if (ewp->state == STATE_READY) {
retval = 0;
goto out_unlock;
@@ -928,16 +936,11 @@ static inline void pipelined_send(struct wake_q_head 
*wake_q,
 {
receiver->msg = message;
list_del(&receiver->list);
+
wake_q_add(wake_q, receiver->task);
-   /*
-* Rely on the implicit cmpxchg barrier from wake_q_add such
-* that we can ensure that updating receiver->state is the last
-* write operation: As once set, the receiver can continue,
-* and if we don't have the reference count from the wake_q,
-* yet, at that point we can later have a use-after-free
-* condition and bogus wakeup.
-*/
-   receiver->state = STATE_READY;
+
+   /* The memory barrier is provided by wake_q_add(). */
+   WRITE_ONCE(receiver->state, STATE_READY);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -956,8 +959,11 @@ static inline void pipelined_receive(struct wake_q_head 
*wake_q,
return;
 
list_del(&sender->list);
+
wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+
+   /* The memory barrier is provided by wake_q_add(). */
+   WRITE_ONCE(sender->state, STATE_READY);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
@@ -1044,6 +1050,8 @@ static int do_mq_timedsend(mqd_t mqdes, const char __user 
*u_msg_ptr,
} else {
wait.task = current;
wait.msg = (void *) msg_ptr;
+
+   /* memory barrier not required, we hold info->lock */
wait.state = STATE_NONE;
ret = wq_sleep(info, SEND, timeout, &wait);
/*
@@ -1147,6 +1155,8 @@ static int do_mq_timedreceive(mqd_t mqdes, char __user 
*u_msg_ptr,
ret = -EAGAIN;
} else {
wait.task = current;
+
+   /* memory barrier not required, we hold info->lock */
wait.state = STATE_NONE;
ret = wq_sleep(info, RECV, timeout, &wait);
msg_ptr = wait.msg;
-- 
2.21.0



[PATCH 3/5] ipc/msg.c: Update and document memory barriers.

2019-10-11 Thread Manfred Spraul
Transfer findings from ipc/sem.c:
- A control barrier was missing for the lockless receive case
  So in theory, not yet initialized data may have been copied
  to user space - obviously only for architectures where
  control barriers are not NOP.

- Add documentation. Especially, document that the code relies
  on the barrier inside wake_q_add().

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 39 ++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 8dec945fa030..1e2c0a3d4998 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -184,6 +184,10 @@ static inline void ss_add(struct msg_queue *msq,
 {
mss->tsk = current;
mss->msgsz = msgsz;
+   /*
+* No memory barrier required: we did ipc_lock_object(),
+* and the waker obtains that lock before calling wake_q_add().
+*/
__set_current_state(TASK_INTERRUPTIBLE);
list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -238,6 +242,12 @@ static void expunge_all(struct msg_queue *msq, int res,
 
list_for_each_entry_safe(msr, t, >q_receivers, r_list) {
wake_q_add(wake_q, msr->r_tsk);
+
+   /*
+* A memory barrier is required that pairs with the
+* READ_ONCE()+smp_mb__after_ctrl_dep(). It is provided by
+* wake_q_add().
+*/
WRITE_ONCE(msr->r_msg, ERR_PTR(res));
}
 }
@@ -798,12 +808,24 @@ static inline int pipelined_send(struct msg_queue *msq, 
struct msg_msg *msg,
list_del(&msr->r_list);
if (msr->r_maxsize < msg->m_ts) {
wake_q_add(wake_q, msr->r_tsk);
+
+   /*
+* A memory barrier is required that pairs with
+* the READ_ONCE()+smp_mb__after_ctrl_dep().
+* It is provided by wake_q_add().
+*/
WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
} else {
ipc_update_pid(&msq->q_lrpid, 
task_pid(msr->r_tsk));
msq->q_rtime = ktime_get_real_seconds();
 
wake_q_add(wake_q, msr->r_tsk);
+
+   /*
+* A memory barrier is required that pairs with
+* the READ_ONCE()+smp_mb__after_ctrl_dep().
+* It is provided by wake_q_add().
+*/
WRITE_ONCE(msr->r_msg, msg);
return 1;
}
@@ -1155,6 +1177,8 @@ static long do_msgrcv(int msqid, void __user *buf, size_t 
bufsz, long msgtyp, in
else
msr_d.r_maxsize = bufsz;
msr_d.r_msg = ERR_PTR(-EAGAIN);
+
+   /* memory barrier not required, we own ipc_lock_object() */
__set_current_state(TASK_INTERRUPTIBLE);
 
ipc_unlock_object(&msq->q_perm);
@@ -1183,8 +1207,21 @@ static long do_msgrcv(int msqid, void __user *buf, 
size_t bufsz, long msgtyp, in
 * signal) it will either see the message and continue ...
 */
msg = READ_ONCE(msr_d.r_msg);
-   if (msg != ERR_PTR(-EAGAIN))
+   if (msg != ERR_PTR(-EAGAIN)) {
+   /*
+* Memory barrier for msr_d.r_msg
+* The smp_acquire__after_ctrl_dep(), together with the
+* READ_ONCE() above pairs with the barrier inside
+* wake_q_add().
+* The barrier protects the accesses to the message in
+* do_msg_fill(). In addition, the barrier protects user
+* space, too: User space may assume that all data from
+* the CPU that sent the message is visible.
+*/
+   smp_acquire__after_ctrl_dep();
+
goto out_unlock1;
+   }
 
 /*
  * ... or see -EAGAIN, acquire the lock to check the message
-- 
2.21.0



[PATCH 4/5] ipc/sem.c: Document and update memory barriers

2019-10-11 Thread Manfred Spraul
The patch documents and updates the memory barriers in ipc/sem.c:
- Document that the WRITE_ONCE for q->status relies on a barrier
  inside wake_q_add().

- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
  as the pair for the barrier inside wake_q_add()

- Remove READ_ONCE & WRITE_ONCE for the situations where spinlocks
  provide exclusion.

- Add comments to all barriers, and mention the rules in the block
  regarding locking.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/sem.c | 64 ---
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ec97a7072413..53d970c4e60d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -205,7 +205,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
  *
  * Memory ordering:
  * Most ordering is enforced by using spin_lock() and spin_unlock().
- * The special case is use_global_lock:
+ *
+ * Exceptions:
+ * 1) use_global_lock:
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
  * using smp_store_release().
  * Testing if it is non-zero is an ACQUIRE, this is ensured by using
@@ -214,6 +216,24 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ *
+ * 2) queue.status:
+ * Initialization is done while holding sem_lock(), so no further barrier is
+ * required.
+ * Setting it to a result code is a RELEASE, this is ensured by both the
+ * barrier inside wake_q_add() (for case a) and while holding sem_lock()
+ * (for case b).
+ * The ACQUIRE when reading the result code without holding sem_lock() is
+ * achieved by using READ_ONCE() + smp_acquire__after_ctrl_dep().
+ * (case a above).
+ * Reading the result code while holding sem_lock() needs no further barriers,
+ * the locks inside sem_lock() enforce ordering (case b above)
+ *
+ * 3) current->state:
+ * current->state is set to TASK_INTERRUPTIBLE while holding sem_lock().
+ * The wakeup is handled using the wake_q infrastructure. wake_q wakeups may
+ * happen immediately after calling wake_q_add. As wake_q_add() is called
+ * when holding sem_lock(), no further barriers are required.
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -766,13 +786,21 @@ static int perform_atomic_semop(struct sem_array *sma, 
struct sem_queue *q)
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 struct wake_q_head *wake_q)
 {
+   /*
+* When the wakeup is performed, q->sleeper->state is read and later
+* set to TASK_RUNNING. This may happen at any time, even before
+* wake_q_add() returns. Memory ordering for q->sleeper->state is
+* enforced by sem_lock(): we own sem_lock now (that was the ACQUIRE),
+* and q->sleeper wrote q->sleeper->state before calling sem_unlock()
+* (->RELEASE).
+*/
wake_q_add(wake_q, q->sleeper);
/*
-* Rely on the above implicit barrier, such that we can
-* ensure that we hold reference to the task before setting
-* q->status. Otherwise we could race with do_exit if the
-* task is awoken by an external event before calling
-* wake_up_process().
+* Memory barrier pairing:
+* case a: The barrier inside wake_q_add() pairs with
+* READ_ONCE(q->status) + smp_acquire__after_ctrl_dep() in
+* do_semtimedop().
+* case b: nothing, ordering is enforced by the locks in sem_lock().
 */
WRITE_ONCE(q->status, error);
 }
@@ -2148,9 +2176,11 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
}
 
do {
-   WRITE_ONCE(queue.status, -EINTR);
+   /* memory ordering ensured by the lock in sem_lock() */
+   queue.status = EINTR;
queue.sleeper = current;
 
+   /* memory ordering is ensured by the lock in sem_lock() */
__set_current_state(TASK_INTERRUPTIBLE);
sem_unlock(sma, locknum);
rcu_read_unlock();
@@ -2174,12 +2204,16 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
error = READ_ONCE(queue.status);
if (error != -EINTR) {
/*
-* User space could assume that semop() is a memory
-* barrier: Without the mb(), the cpu could
-* speculatively read in userspace stale data that was
-* overwritten by the previous owner of the semaphore.
+* Memory barrier for queue.status, case a):
+* The smp_acquire__after_ctrl_dep(), together with th

[PATCH 5/5] Documentation/memory-barriers.txt: Clarify cmpxchg()

2019-10-11 Thread Manfred Spraul
The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may and is used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do imply a full
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



[PATCH 0/3] Clarify/standardize memory barriers for ipc

2019-10-11 Thread Manfred Spraul
Hi,

Partially based on the findings from Waiman Long:

a) The memory barriers in ipc are not properly documented, and at least
for some architectures insufficient:
Reading the xyz->status is only a control barrier, thus
smp_acquire__after_ctrl_dep() was missing in mqueue.c and msg.c
sem.c contained a full smp_mb(), which is not required.

Patch 1: Document that wake_q_add() contains a barrier.

b) wake_q_add() provides a memory barrier, ipc/mqueue.c relies on this.
Move the documentation to wake_q_add(), instead of documenting it in ipc/mqueue.c.

Patch 2-4: Update the ipc code, especially add missing
   smp_acquire__after_ctrl_dep().

c) [optional]
Clarify that smp_mb__{before,after}_atomic() are compatible with all
RMW atomic operations, not just the operations that do not return a value.

Patch 5: Documentation for smp_mb__{before,after}_atomic().

From my point of view, patch 1 is a prerequisite for patches 2-4:
If the barrier is not part of the documented API, then ipc should not rely
on it, i.e. then I would propose to replace the WRITE_ONCE with
smp_store_release().

Open issues:
- More testing. I did some tests, but doubt that the tests would be
  sufficient to show issues with regards to incorrect memory barriers.

- Should I add a "Fixes:" or "Cc:stable"? The only issues that I see are
  the missing smp_acquire__after_ctrl_dep(), and WRITE_ONCE() vs.
  "ptr = NULL".

What do you think?

--
Manfred


[PATCH 1/5] wake_q: Cleanup + Documentation update.

2019-10-11 Thread Manfred Spraul
1) wake_q_add() contains a memory barrier, and callers such as
ipc/mqueue.c rely on this barrier.
Unfortunately, this is documented in ipc/mqueue.c, and not in the
description of wake_q_add().
Therefore: Update the documentation.
Removing/updating ipc/mqueue.c will happen with the next patch in the
series.

2) wake_up_q() relies on the memory barrier in try_to_wake_up().
Add a comment, to simplify searching.

3) wake_q.next is accessed without synchronization by wake_q_add(),
using cmpxchg_relaxed(), and by wake_up_q().
Therefore: Use WRITE_ONCE in wake_up_q(), to ensure that the
compiler doesn't perform any tricks.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 kernel/sched/core.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd05a378631a..2cf3f7321303 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -440,8 +440,11 @@ static bool __wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
  * @task: the task to queue for 'later' wakeup
  *
  * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
+ * same context, _HOWEVER_ this is not guaranteed. Especially, the wakeup
+ * may happen before the function returns.
+ *
+ * What is guaranteed is that there is a memory barrier before the wakeup,
+ * callers may rely on this barrier.
  *
  * This function must be used as-if it were wake_up_process(); IOW the task
  * must be ready to be woken at this location.
@@ -486,11 +489,14 @@ void wake_up_q(struct wake_q_head *head)
BUG_ON(!task);
/* Task can safely be re-inserted now: */
node = node->next;
-   task->wake_q.next = NULL;
+
+   WRITE_ONCE(task->wake_q.next, NULL);
 
/*
 * wake_up_process() executes a full barrier, which pairs with
 * the queueing in wake_q_add() so as not to miss wakeups.
+* The barrier is the smp_mb__after_spinlock() in
+* try_to_wake_up().
 */
wake_up_process(task);
put_task_struct(task);
-- 
2.21.0



Re: wake_q memory ordering

2019-10-11 Thread Manfred Spraul

Hi Davidlohr,

On 10/10/19 9:25 PM, Davidlohr Bueso wrote:

On Thu, 10 Oct 2019, Peter Zijlstra wrote:


On Thu, Oct 10, 2019 at 02:13:47PM +0200, Manfred Spraul wrote:


Therefore smp_mb__{before,after}_atomic() may be combined with
cmpxchg_relaxed, to form a full memory barrier, on all archs.


Just so.


We might want something like this?

8<-

From: Davidlohr Bueso 
Subject: [PATCH] Documentation/memory-barriers.txt: Mention 
smp_mb__{before,after}_atomic() and CAS


Explicitly mention possible usages to guarantee serialization even upon
failed cmpxchg (or similar) calls along with 
smp_mb__{before,after}_atomic().


Signed-off-by: Davidlohr Bueso 
---
Documentation/memory-barriers.txt | 12 
1 file changed, 12 insertions(+)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt

index 1adbb8a371c7..5d2873d4b442 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1890,6 +1890,18 @@ There are some more advanced barrier functions:
 This makes sure that the death mark on the object is perceived to 
be set

 *before* the reference counter is decremented.

+ Similarly, these barriers can be used to guarantee serialization 
for atomic
+ RMW calls on architectures which may not imply memory barriers 
upon failure.

+
+    obj->next = NULL;
+    smp_mb__before_atomic()
+    if (cmpxchg(&obj->ptr, NULL, val))
+    return;
+
+ This makes sure that the store to the next pointer always has 
smp_store_mb()
+ semantics. As such, smp_mb__{before,after}_atomic() calls allow 
optimizing

+ the barrier usage by finer grained serialization.
+
 See Documentation/atomic_{t,bitops}.txt for more information.


I don't know. The new documentation would not have answered my question 
(is it ok to combine smp_mb__before_atomic() with atomic_relaxed()?). 
And it copies content already present in atomic_t.txt.


Thus: I would prefer if the first sentence of the paragraph is replaced: 
The list of operations should end with "...", and it should match what 
is in atomic_t.txt


Ok?

--

    Manfred


From 8df60211228042672ba0cd89c3566c5145e8b203 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Fri, 11 Oct 2019 10:33:26 +0200
Subject: [PATCH 4/4] Documentation/memory-barriers.txt:  Clarify cmpxchg()

The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may and is used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

1)
	smp_mb__before_atomic();
	atomic_add();

2)
	smp_mb__before_atomic();
	atomic_xchg_relaxed();

3)
	smp_mb__before_atomic();
	atomic_fetch_add_relaxed();

Invalid would be:
	smp_mb__before_atomic();
	atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do imply a full
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



Re: wake_q memory ordering

2019-10-10 Thread Manfred Spraul

Hi Peter,

On 10/10/19 1:42 PM, Peter Zijlstra wrote:

On Thu, Oct 10, 2019 at 12:41:11PM +0200, Manfred Spraul wrote:

Hi,

Waiman Long noticed that the memory barriers in sem_lock() are not really
documented, and while adding documentation, I ended up with one case where
I'm not certain about the wake_q code:

Questions:
- Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
   ordering guarantee?

Yep. Either the atomic instruction implies ordering (eg. x86 LOCK
prefix) or it doesn't (most RISC LL/SC), if it does,
smp_mb__{before,after}_atomic() are a NO-OP and the ordering is
unconditional, if it does not, then smp_mb__{before,after}_atomic() are
unconditional barriers.


And _relaxed() differs from "normal" cmpxchg only for LL/SC 
architectures, correct?


Therefore smp_mb__{before,after}_atomic() may be combined with 
cmpxchg_relaxed, to form a full memory barrier, on all archs.
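
For reference, the pattern in question, modeled on __wake_q_add() (slightly 
simplified; node = &task->wake_q):

    /*
     * On x86 the LOCK'ed cmpxchg already orders and the helper is a no-op;
     * on LL/SC archs the helper emits the smp_mb(). Either way the result
     * is a full barrier, whether the cmpxchg succeeds or fails.
     */
    smp_mb__before_atomic();
    if (unlikely(cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL)))
            return false;   /* already queued; the ordering still holds */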


[...]



- Is it ok that wake_up_q just writes wake_q->next, shouldn't
   smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
   happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
   provides any ordering.

There is no such thing as store_acquire, it is either load_acquire or
store_release. But just like how we can write load-aquire like
load+smp_mb(), so too I suppose we could write store-acquire like
store+smp_mb(), and that is exactly what is there (through the implied
barrier of wake_up_process()).


Thanks for confirming my assumption:
The code is correct, due to the implied barrier inside wake_up_process().
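
In code, a condensed sketch of the wake_up_q() loop body (not a change 
proposal, just restating the conclusion):

    WRITE_ONCE(task->wake_q.next, NULL);  /* the plain store ...            */
    wake_up_process(task);                /* ... is ordered before the read
                                             of task->state by the full
                                             barrier implied here, i.e.
                                             store + smp_mb() instead of the
                                             missing store-acquire          */
    put_task_struct(task);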

[...]

rewritten:

start condition: A = 1; B = 0;

CPU1:
     B = 1;
     RELEASE, unlock LockX;

CPU2:
     lock LockX, ACQUIRE
     if (LOAD A == 1) return; /* using cmp_xchg_relaxed */

CPU2:
     A = 0;
     ACQUIRE, lock LockY
     smp_mb__after_spinlock();
     READ B

Question: is A = 1, B = 0 possible?

Your example is incomplete (there is no A=1 assignment for example), but
I'm thinking I can guess where that should go given the earlier text.


A=1 is listed as start condition. Way before, someone did wake_q_add().



I don't think this is broken.


Thanks.

--

    Manfred



wake_q memory ordering

2019-10-10 Thread Manfred Spraul

Hi,

Waiman Long noticed that the memory barriers in sem_lock() are not 
really documented, and while adding documentation, I ended up with one 
case where I'm not certain about the wake_q code:


Questions:
- Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
  ordering guarantee?
- Is it ok that wake_up_q just writes wake_q->next, shouldn't
  smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
  happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
  provides any ordering.

Example:
- CPU2 never touches lock a. It is just an unrelated wake_q user that also
  wants to wake up task 1234.
- I've noticed already that smp_store_acquire() doesn't exist.
  So smp_store_mb() is required. But from semantical point of view, we 
would

  need an ACQUIRE: the wake_up_process() must happen after cmpxchg().
- May wake_up_q() rely on the spinlocks/memory barriers in try_to_wake_up,
  or should the function be safe by itself?

CPU1: /current=1234, inside do_semtimedop()/
    g_wakee = current;
    current->state = TASK_INTERRUPTIBLE;
    spin_unlock(a);

CPU2: / arbitrary kernel thread that uses wake_q /
    wake_q_add(_q, 1234);
    wake_up_q(_q);
    <...ongoing>

CPU3: / do_semtimedop() + wake_up_sem_queue_prepare() /
    spin_lock(a);
    wake_q_add(,g_wakee);
    < within wake_q_add() >:
  smp_mb__before_atomic();
  if (unlikely(cmpxchg_relaxed(>next, 
NULL, WAKE_Q_TAIL)))

  return false; /* -> this happens */

CPU2:
    
    1234->wake_q.next = NULL; < Ok? Is 
store_acquire() missing? 

    wake_up_process(1234);
    < within wake_up_process/try_to_wake_up():
    raw_spin_lock_irqsave()
    smp_mb__after_spinlock()
    if(1234->state = TASK_RUNNING) return;
 >


rewritten:

start condition: A = 1; B = 0;

CPU1:
    B = 1;
    RELEASE, unlock LockX;

CPU2:
    lock LockX, ACQUIRE
    if (LOAD A == 1) return; /* using cmp_xchg_relaxed */

CPU2:
    A = 0;
    ACQUIRE, lock LockY
    smp_mb__after_spinlock();
    READ B

Question: is A = 1, B = 0 possible?

--

    Manfred



Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker

2019-09-29 Thread Manfred Spraul

Hi Waiman,

I have now written the mail 3 times:
Twice I thought that I found a race, but during further analysis, it 
always turned out that the spin_lock() is sufficient.


First, to avoid any obvious things: Until the series with e.g. 
27d7be1801a4824e, there was a race inside sem_lock().


Thus it was possible that multiple threads were operating on the same 
semaphore array, with obviously arbitrary impact.


On 9/20/19 5:54 PM, Waiman Long wrote:

  
+		/*

+* A spurious wakeup at the right moment can cause race
+* between the to-be-woken task and the waker leading to
+* missed wakeup. Setting state back to TASK_INTERRUPTIBLE
+* before checking queue.status will ensure that the race
+* won't happen.
+*
+*  CPU0CPU1
+*
+*   wake_up_sem_queue_prepare():
+*  state = TASK_INTERRUPTIBLEstatus = error
+*  try_to_wake_up():
+*  smp_mb()  smp_mb()
+*  if (status == -EINTR) if (!(p->state & state))
+*schedule()goto out
+*/
+   set_current_state(TASK_INTERRUPTIBLE);
+


So the hypothesis is that we have a race due to the optimization 
within try_to_wake_up():

If the task state is already TASK_RUNNING, then the wakeup is a nop.

Correct?

The waker wants to use:

    lock();
    set_conditions();
    unlock();

As the wake_q is a shared list, this will happen completely asynchronously:

    smp_mb();  ***1
    if (current->state == TASK_INTERRUPTIBLE) current->state = TASK_RUNNING;

The only guarantee is that this will happen after lock(), it may happen 
before set_conditions().


The task that goes to sleep uses:

    lock();
    check_conditions();
    __set_current_state();
    unlock();  ***2
    schedule();

You propose to change that to:

    lock();
    set_current_state();
    check_conditions();
    unlock();
    schedule();

I don't see a race anymore, and I don't see how the proposed change will 
help.
e.g.: __set_current_state() and smp_mb() have paired memory barriers 
***1 and ***2 above.
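
Condensed into one picture (a sketch, reusing the markers from above):

    CPU1 (sleeper):
        lock();
        check_conditions();
        __set_current_state(TASK_INTERRUPTIBLE);
        unlock();                          /* ***2 */
        schedule();

    CPU2 (wake_q user, asynchronous, any time after CPU1's lock()):
        smp_mb();                          /* ***1 */
        if (current->state == TASK_INTERRUPTIBLE)
                current->state = TASK_RUNNING;

i.e. the smp_mb() at ***1 pairs with the unlock() at ***2, which already 
orders __set_current_state() against the wakeup.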


--

    Manfred



Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker

2019-09-26 Thread Manfred Spraul

Hi,
On 9/26/19 8:12 PM, Waiman Long wrote:

On 9/26/19 5:34 AM, Peter Zijlstra wrote:

On Fri, Sep 20, 2019 at 11:54:02AM -0400, Waiman Long wrote:

While looking at a customr bug report about potential missed wakeup in
the system V semaphore code, I spot a potential problem.  The fact that
semaphore waiter stays in TASK_RUNNING state while checking queue status
may lead to missed wakeup if a spurious wakeup happens in the right
moment as try_to_wake_up() will do nothing if the task state isn't right.

To eliminate this possibility, the task state is now reset to
TASK_INTERRUPTIBLE immediately after wakeup before checking the queue
status. This should eliminate the race condition on the interaction
between the queue status and the task state and fix the potential missed
wakeup problem.

You are obviously right, there is a huge race condition.

Bah, this code always makes my head hurt.

Yes, AFAICT the pattern it uses has been broken since 0a2b9d4c7967,
since that removed doing the actual wakeup from under the sem_lock(),
which is what it relies on.


Correct - I've overlooked that.

First, theory:

setting queue->status, reading queue->status, setting 
current->state=TASK_INTERRUPTIBLE are all under the correct spinlock.


(there is an opportunistic read of queue->status without locks, but it 
is retried when the lock got acquired)


setting current->state=RUNNING is outside of any lock.

So as far as current->state is concerned, the lock doesn't exist. And if 
the lock doesn't exist, we must follow the rules applicable for 
set_current_state().
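
For reference, the canonical pattern from the set_current_state() 
documentation (a sketch of the generic rules, not the ipc/sem.c code):

    /* sleeper */
    for (;;) {
            set_current_state(TASK_INTERRUPTIBLE); /* implies smp_mb() */
            if (CONDITION)
                    break;
            schedule();
    }
    __set_current_state(TASK_RUNNING);

    /* waker */
    CONDITION = 1;
    wake_up_process(p);   /* implies a full barrier before reading p->state */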


I'll try to check the code this week.

And we should check the remaining wake-queue users, the logic is 
everywhere identical.



After having a second look at the code again, I probably misread the
code the first time around. In the sleeping path, there is a check of
queue.status and the setting of the task state, both under the sem lock.
So as long as the setting of the queue status is under the lock, they
should synchronize properly.

It looks like queue status setting is under lock, but I can't use
lockdep to confirm that as the locking can be done by either the array
lock or in one of the spinlocks in the array. Are you aware of a way of
doing that?


For testing? Have you considered just always using the global lock?

(untested):

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -370,7 +370,7 @@ static inline int sem_lock(struct sem_array *sma, 
struct sembuf *sops,

    struct sem *sem;
    int idx;

-   if (nsops != 1) {
+   if (nsops != 1 || 1) {
    /* Complex operation - acquire a full lock */
    ipc_lock_object(&sma->sem_perm);



Anyway, I do think we need to add some comment to clarify the situation
to avoid future confusion.


Around line 190 is the comment that explains locking & memory ordering.

I have only documented the content of sem_undo and sem_array, but 
neither queue nor current->state :-(



--

    Manfred




Re: [PATCH v11 2/3] ipc: Conserve sequence numbers in ipcmni_extend mode

2019-03-10 Thread Manfred Spraul

On 2/27/19 9:30 PM, Waiman Long wrote:

On 11/20/2018 02:41 PM, Manfred Spraul wrote:

 From 6bbade73d21884258a995698f21ad3128df8e98a Mon Sep 17 00:00:00 2001
From: Manfred Spraul
Date: Sat, 29 Sep 2018 15:43:28 +0200
Subject: [PATCH 2/2] ipc/util.c: use idr_alloc_cyclic() for ipc allocations

A bit related to the patch that increases IPC_MNI, and
partially based on the mail from wi...@infradead.org:

(User space) id reuse creates the risk of data corruption:

Process A: calls ipc function
Process A: sleeps just at the beginning of the syscall
Process B: Frees the ipc object (i.e.: calls ...ctl(IPC_RMID)
Process B: Creates a new ipc object (i.e.: calls ...get())

Process A: is woken up, and accesses the new object

To reduce the probability that the new and the old object have the
same id, the current implementation adds a sequence number to the
index of the object in the idr tree.

To further reduce the probability for a reuse, perform a cyclic
allocation, and increase the sequence number only when there is
a wrap-around. Unfortunately, idr_alloc_cyclic cannot be used,
because the sequence number must be increased when a wrap-around
occurs.

The patch cycles over at least RADIX_TREE_MAP_SIZE, i.e.
if there is only a small number of objects, the accesses
continue to be direct.

Signed-off-by: Manfred Spraul
---
  ipc/util.c | 48 
  1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 07ae117ccdc0..fa7b8fa7a14c 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -216,10 +216,49 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 */
  
  	if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */

-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
-   idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   int idx_max;
+
+   /*
+* If a user space visible id is reused, then this creates a
+* risk for data corruption. To reduce the probability that
+* a number is reused, three approaches are used:
+* 1) the idr index is allocated cyclically.
+* 2) the user space id is built by concatenating the
+*internal idr index with a sequence number.
+* 3) The sequence number is only increased when the index
+*wraps around.
+* Note that this code cannot use idr_alloc_cyclic:
+* new->seq must be set before the entry is inserted in the
+* idr.


I don't think that is true. The IDR code just needs to associate a 
pointer with the given ID. It is not going to access anything inside. So 
we don't need to set the seq number before calling idr_alloc().



We must, sorry - there is even a CVE associated with that bug:

CVE-2015-7613, 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b9a532277938798b53178d5a66af6e2915cb27cf


The problem is not the IDR code, the problem is that 
ipc_obtain_object_check() calls ipc_checkid(), and ipc_checkid() 
accesses ipcp->seq.


And since the ipc_checkid() is called before acquiring any locks, 
everything must be fully initialized before idr_alloc().
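
A sketch of the ordering requirement (simplified, not the exact functions):

    /* writer: ipc_idr_alloc(), ids->rwsem held for writing */
    new->seq = ids->seq;                        /* must be set first ...    */
    idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
                                                /* ... because this makes
                                                   'new' visible to lockless
                                                   readers                  */

    /* lockless reader: ipc_obtain_object_check(), under rcu_read_lock() */
    ipcp = idr_find(&ids->ipcs_idr, ipcid_to_idx(id));
    if (!ipcp || ipc_checkid(ipcp, id))         /* ipc_checkid() reads
                                                   ipcp->seq                */
            return ERR_PTR(-EINVAL);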



+*/
+   idx_max = ids->in_use*2;
+   if (idx_max < RADIX_TREE_MAP_SIZE)
+   idx_max = RADIX_TREE_MAP_SIZE;
+   if (idx_max > ipc_mni)
+   idx_max = ipc_mni;
+
+   if (ids->ipcs_idr.idr_next <= idx_max) {
+   new->seq = ids->seq;
+   idx = idr_alloc(&ids->ipcs_idr, new,
+   ids->ipcs_idr.idr_next,
+   idx_max, GFP_NOWAIT);
+   }
+
+   if ((idx == -ENOSPC) && (ids->ipcs_idr.idr_next > 0)) {
+   /*
+* A wrap around occurred.
+* Increase ids->seq, update new->seq
+*/
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   new->seq = ids->seq;
+
+   idx = idr_alloc(&ids->ipcs_idr, new, 0, idx_max,
+   GFP_NOWAIT);
+   }
+   if (idx >= 0)
+   ids->ipcs_idr.idr_next = idx+1;


This code depends on the internal implementation of the IDR code. So if 
the IDR code is changed and whoever does it forgets to update the IPC 
code, we may have a problem. Using idr_alloc_cyclic() for all will likely 
increase the memory footprint, which can be a problem 
on IoT devi

Re: general protection fault in put_pid

2019-01-07 Thread Manfred Spraul

On 1/3/19 11:18 PM, Shakeel Butt wrote:

Hi Manfred,

On Sun, Dec 23, 2018 at 4:26 AM Manfred Spraul  wrote:

Hello Dmitry,

On 12/23/18 10:57 AM, Dmitry Vyukov wrote:

I can reproduce this infinite memory consumption with the C program:
https://gist.githubusercontent.com/dvyukov/03ec54b3429ade16fa07bf8b2379aff3/raw/ae4f654e279810de2505e8fa41b73dc1d8e6/gistfile1.txt

But this is working as intended, right? It just creates infinite
number of large semaphore sets, which reasonably consumes infinite
amount of memory.
Except that it also violates the memcg bound and a process can have
effectively unlimited amount of such "drum memory" in semaphores.

Yes, this is as intended:

If you call semget(), then you can use memory, up to the limits in
/proc/sys/kernel/sem.

Memcg is not taken into account, an admin must set /proc/sys/kernel/sem.

The default is "infinite amount of memory allowed", as this is the most
sane default: we had logic that tried to autotune (i.e.: a new
namespace "inherits" a fraction of the parent namespace's memory limits),
but this was more or less always wrong.



What's the disadvantage of setting the limits in /proc/sys/kernel/sem
high and let the task's memcg limits the number of semaphore a process
can create? Please note that the memory underlying shmget and msgget
is already accounted to memcg.


Nothing, it is just a question of implementing it.

I'll try to look at it.
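
Just to sketch the direction (untested, and only for the sem_array 
allocation itself): switching sem_alloc() to an accounted allocation would 
let memcg charge the memory to the calling task:

    /* in sem_alloc(); size = sizeof(*sma) + nsems * sizeof(sma->sems[0]) */
    sma = kvzalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
    if (unlikely(!sma))
            return NULL;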

--

    Manfred



Re: general protection fault in put_pid

2018-12-23 Thread Manfred Spraul

Hi Dmitry,

let's simplify the mail, otherwise no one can follow:

On 12/23/18 11:42 AM, Dmitry Vyukov wrote:



My naive attempts to re-reproduce this failed so far.
But I noticed that _all_ logs for these 3 crashes:
https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909
https://syzkaller.appspot.com/bug?extid=1145ec2e23165570c3ac
https://syzkaller.appspot.com/bug?extid=9d8b6fa6ee7636f350c1
involve low memory conditions. My gut feeling says this is not a
coincidence. This is also probably the reason why all reproducers
create large sem sets. There must be some bad interaction between low
memory condition and semaphores/ipc namespaces.


Actually was able to reproduce this with a syzkaller program:

./syz-execprog -repeat=0 -procs=10 prog
...
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8788 Comm: syz-executor8 Not tainted 4.20.0-rc7+ #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:__list_del_entry_valid+0x7e/0x150 lib/list_debug.c:51
Code: ad de 4c 8b 26 49 39 c4 74 66 48 b8 00 02 00 00 00 00 ad de 48
89 da 48 39 c3 74 65 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c
02 00 75 7b 48 8b 13 48 39 f2 75 57 49 8d 7c 24 08 48 b8 00
RSP: 0018:88804faef210 EFLAGS: 00010a02
RAX: dc00 RBX: f817edba555e1f00 RCX: 831bad5f
RDX: 1f02fdb74aabc3e0 RSI: 88801b8a0720 RDI: 88801b8a0728
RBP: 88804faef228 R08: f52001055401 R09: f52001055401
R10: 0001 R11: f52001055400 R12: 88802d52cc98
R13: 88801b8a0728 R14: 88801b8a0720 R15: dc00
FS:  00d24940() GS:88802d50() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 004bb580 CR3: 11177005 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
  __list_del_entry include/linux/list.h:117 [inline]
  list_del include/linux/list.h:125 [inline]
  unlink_queue ipc/sem.c:786 [inline]
  freeary+0xddb/0x1c90 ipc/sem.c:1164
  free_ipcs+0xf0/0x160 ipc/namespace.c:112
  sem_exit_ns+0x20/0x40 ipc/sem.c:237
  free_ipc_ns ipc/namespace.c:120 [inline]
  put_ipc_ns+0x55/0x160 ipc/namespace.c:152
  free_nsproxy+0xc0/0x1f0 kernel/nsproxy.c:180
  switch_task_namespaces+0xa5/0xc0 kernel/nsproxy.c:229
  exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
  do_exit+0x19e5/0x27d0 kernel/exit.c:866
  do_group_exit+0x151/0x410 kernel/exit.c:970
  __do_sys_exit_group kernel/exit.c:981 [inline]
  __se_sys_exit_group kernel/exit.c:979 [inline]
  __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:979
  do_syscall_64+0x192/0x770 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4570e9
Code: 5d af fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 0f 83 2b af fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7ffe35f12018 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX: 0001 RCX: 004570e9
RDX: 00410540 RSI: 00a34c00 RDI: 0045
RBP: 004a43a4 R08: 000c R09: 
R10: 00d24940 R11: 0246 R12: 
R13: 0001 R14:  R15: 0008
Modules linked in:
Dumping ftrace buffer:
(ftrace buffer empty)
---[ end trace 17829b0f00569a59 ]---
RIP: 0010:__list_del_entry_valid+0x7e/0x150 lib/list_debug.c:51
Code: ad de 4c 8b 26 49 39 c4 74 66 48 b8 00 02 00 00 00 00 ad de 48
89 da 48 39 c3 74 65 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c
02 00 75 7b 48 8b 13 48 39 f2 75 57 49 8d 7c 24 08 48 b8 00
RSP: 0018:88804faef210 EFLAGS: 00010a02
RAX: dc00 RBX: f817edba555e1f00 RCX: 831bad5f
RDX: 1f02fdb74aabc3e0 RSI: 88801b8a0720 RDI: 88801b8a0728
RBP: 88804faef228 R08: f52001055401 R09: f52001055401
R10: 0001 R11: f52001055400 R12: 88802d52cc98
R13: 88801b8a0728 R14: 88801b8a0720 R15: dc00
FS:  00d24940() GS:88802d50() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 004bb580 CR3: 11177005 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400


The prog is:
unshare(0x802)
semget$private(0x0, 0x4007, 0x0)
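
A rough user-space equivalent of this two-line program, as a minimal sketch
(it assumes that CLONE_NEWIPC is the unshare flag that matters here -- cf. the
question about unshare(CLONE_NEWIPC) later in the thread -- and it keeps the
0x4007 semaphores per set; it is not the generated syzkaller executor code):

/* repro-sketch.c: repeatedly create a fresh ipc namespace and allocate a
 * large semaphore set in it; needs root (CLONE_NEWIPC wants CAP_SYS_ADMIN).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        pid_t pid = fork();

        if (pid == 0) {
            /* child: new ipc namespace, one large sem set */
            if (unshare(CLONE_NEWIPC) != 0)
                perror("unshare");
            if (semget(IPC_PRIVATE, 0x4007, IPC_CREAT | 0600) < 0)
                perror("semget");
            /* the namespace (and its sem set) should be torn down
             * when the only task in it exits */
            _exit(0);
        }
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }
    return 0;
}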

kernel is on 9105b8aa50c182371533fc97db64fc8f26f051b3

and again it involved lots of oom kills, the repro eats all memory, a
process getting killed, frees some memory and the process repeats.


Ok, thus the above program triggers two bugs:

- a huge memory leak with semaphore arrays

- under OOM pressure, an oops.


1) I can reproduce the memory leak, it happens all the 

Re: general protection fault in put_pid

2018-12-23 Thread Manfred Spraul

Hello Dmitry,

On 12/23/18 10:57 AM, Dmitry Vyukov wrote:


I can reproduce this infinite memory consumption with the C program:
https://gist.githubusercontent.com/dvyukov/03ec54b3429ade16fa07bf8b2379aff3/raw/ae4f654e279810de2505e8fa41b73dc1d8e6/gistfile1.txt

But this is working as intended, right? It just creates infinite
number of large semaphore sets, which reasonably consumes infinite
amount of memory.
Except that it also violates the memcg bound and a process can have
effectively unlimited amount of such "drum memory" in semaphores.


Yes, this is as intended:

If you call semget(), then you can use memory, up to the limits in 
/proc/sys/kernel/sem.


Memcg is not taken into account, an admin must set /proc/sys/kernel/sem.

The default is "infinite amount of memory allowed", as this is the most
sane default: we had logic that tried to autotune (i.e.: a new
namespace "inherits" a fraction of the parent namespace's memory limits),
but this was more or less always wrong.
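
To make that concrete, a small sketch that just reads the four values in
/proc/sys/kernel/sem (SEMMSL, SEMMNS, SEMOPM, SEMMNI) and prints a rough upper
bound; the ~128 bytes per semaphore is an assumption taken from the kmalloc
sizes quoted further down in this thread, not an exact per-kernel figure:

/* sem-limits.c: show the sysv semaphore limits of the current namespace
 * and a rough worst case for the memory they allow to be pinned.
 */
#include <stdio.h>

int main(void)
{
    unsigned long semmsl, semmns, semopm, semmni;
    FILE *f = fopen("/proc/sys/kernel/sem", "r");

    if (!f || fscanf(f, "%lu %lu %lu %lu",
                     &semmsl, &semmns, &semopm, &semmni) != 4) {
        perror("/proc/sys/kernel/sem");
        return 1;
    }
    fclose(f);
    printf("SEMMSL=%lu SEMMNS=%lu SEMOPM=%lu SEMMNI=%lu\n",
           semmsl, semmns, semopm, semmni);
    /* at most SEMMNS semaphores in total, assumed ~128 bytes each */
    printf("worst case: roughly %lu MiB per ipc namespace\n",
           semmns * 128UL >> 20);
    return 0;
}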



--

    Manfred



Re: general protection fault in put_pid

2018-12-22 Thread Manfred Spraul

Hi Dmitry,

On 12/20/18 4:36 PM, Dmitry Vyukov wrote:

On Wed, Dec 19, 2018 at 10:04 AM Manfred Spraul
 wrote:

Hello Dmitry,

On 12/12/18 11:55 AM, Dmitry Vyukov wrote:

On Tue, Dec 11, 2018 at 9:23 PM syzbot
 wrote:

Hello,

syzbot found the following crash on:

HEAD commit:f5d582777bcb Merge branch 'for-linus' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=135bc54740
kernel config:  https://syzkaller.appspot.com/x/.config?x=c8970c89a0efbb23
dashboard link: https://syzkaller.appspot.com/bug?extid=1145ec2e23165570c3ac
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16803afb40

+Manfred, this looks similar to the other few crashes related to
semget$private(0x0, 0x4000, 0x3f) that you looked at.

I found one unexpected (incorrect?) locking, see the attached patch.

But I doubt that this is the root cause of the crashes.


But why? These one-off sporadic crashes reported by syzbot look
exactly like a subtle race and your patch touches sem_exit_ns, which is involved
in all reports.
So if you don't spot anything else, I would say close these 3 reports
with this patch (I see you already included Reported-by tags which is
great!) and then wait for syzbot reaction. Since we got 3 of them, if
it's still not fixed I would expect that syzbot will be able to
retrigger this later again.


As I wrote, unless semop() is used, sma->use_global_lock is always 9 and 
nothing can happen.


Every single-operation semop() reduces use_global_lock by one, i.e. a
single semop call as done here cannot trigger the bug:


https://syzkaller.appspot.com/text?tag=ReproSyz=16803afb40
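
For readers not familiar with the hysteresis: use_global_lock only reaches 0
(and thereby enables per-semaphore locking) after a whole series of simple,
single-sembuf semop() calls. A hedged sketch of such a series -- the count of
10 mirrors the USE_GLOBAL_LOCK_HYSTERESIS value visible in the sem.c hunk
quoted further down and is an assumption about this particular kernel:

/* hysteresis-sketch.c: more than USE_GLOBAL_LOCK_HYSTERESIS (10) simple
 * semop() calls walk sma->use_global_lock down to 0; a single call, as in
 * the reproducer above, cannot.
 */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    int i, id = semget(IPC_PRIVATE, 16, IPC_CREAT | 0600);
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };

    if (id < 0) {
        perror("semget");
        return 1;
    }
    for (i = 0; i < 12; i++)    /* a few more than the hysteresis */
        if (semop(id, &op, 1) != 0)
            perror("semop");
    semctl(id, 0, IPC_RMID);
    return 0;
}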


But, one more finding:

https://syzkaller.appspot.com/bug?extid=1145ec2e23165570c3ac

https://syzkaller.appspot.com/text?tag=CrashLog=109ecf6e40

The log file contain 1080 lines like these:


semget$private(..., 0x4003, ...)

semget$private(..., 0x4006, ...)

semget$private(..., 0x4007, ...)


It ends up as kmalloc(128*0x400x), i.e. slightly more than 2 MB, an 
allocation in the 4 MB kmalloc buffer:



[ 1201.210245] kmalloc-4194304  4698112KB4698112KB

i.e.: 1147 4 MB kmalloc blocks --> are we leaking nearly 100% of the 
semaphore arrays??



This one looks similar:

https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909

except that the array sizes are mixed, and thus there are kmalloc-1M and 
kmalloc-2M as well.


(and I did not count the number of semget calls)


The test apps use unshare(CLONE_NEWNS) and unshare(CLONE_NEWIPC), correct?

I.e. no CLONE_NEWUSER.

https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L1523


--

    Manfred




Re: general protection fault in put_pid

2018-12-19 Thread Manfred Spraul

Hello Dmitry,

On 12/12/18 11:55 AM, Dmitry Vyukov wrote:

On Tue, Dec 11, 2018 at 9:23 PM syzbot
 wrote:

Hello,

syzbot found the following crash on:

HEAD commit:f5d582777bcb Merge branch 'for-linus' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=135bc54740
kernel config:  https://syzkaller.appspot.com/x/.config?x=c8970c89a0efbb23
dashboard link: https://syzkaller.appspot.com/bug?extid=1145ec2e23165570c3ac
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16803afb40

+Manfred, this looks similar to the other few crashes related to
semget$private(0x0, 0x4000, 0x3f) that you looked at.


I found one unexpected (incorrect?) locking, see the attached patch.

But I doubt that this is the root cause of the crashes.

Any remarks on the patch?

I would continue to search, and then send a series with all findings.

--

    Manfred

>From 733e888993b71fb3c139f71de61534bc603a2bcb Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Wed, 19 Dec 2018 09:26:48 +0100
Subject: [PATCH] ipc/sem.c: ensure proper locking during namespace teardown

free_ipcs() only calls ipc_lock_object() before calling the free callback.

This means:
- There is no exclusion against parallel simple semop() calls.
- sma->use_global_lock may underflow (i.e. jump to UINT_MAX) when
  freeary() calls sem_unlock(,,-1).

The patch fixes that, by adding complexmode_enter() before calling
freeary().

There are multiple syzbot crashes in this code area, but I don't see yet
how a missing complexmode_enter() may cause a crash:
- 1) simple semop() calls are not used by these syzbot tests,
  and 2) we are in namespace teardown, no one may run in parallel.

- 1) freeary() is the last call (except parallel operations, which
  are impossible due to namespace teardown)
  and 2) the underflow of use_global_lock merely delays switching to
  parallel simple semop handling for the next UINT_MAX semop() calls.

Thus I think the patch is "only" a cleanup, and does not fix
the observed crashes.

Signed-off-by: Manfred Spraul 
Reported-by: syzbot+1145ec2e23165570c...@syzkaller.appspotmail.com
Reported-by: syzbot+c92d3646e35bc5d1a...@syzkaller.appspotmail.com
Reported-by: syzbot+9d8b6fa6ee7636f35...@syzkaller.appspotmail.com
Cc: dvyu...@google.com
Cc: dbu...@suse.de
Cc: Andrew Morton 
---
 ipc/sem.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 745dc6187e84..8ccacd11fb15 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -184,6 +184,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
  */
 #define USE_GLOBAL_LOCK_HYSTERESIS	10
 
+static void complexmode_enter(struct sem_array *sma);
+static void complexmode_tryleave(struct sem_array *sma);
+
 /*
  * Locking:
  * a) global sem_lock() for read/write
@@ -232,9 +235,24 @@ void sem_init_ns(struct ipc_namespace *ns)
 }
 
 #ifdef CONFIG_IPC_NS
+
+static void freeary_lock(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+{
+	struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm);
+
+	/*
+	 * free_ipcs() isn't aware of sem_lock(), it calls ipc_lock_object()
+	 * directly. In order to stay compatible with sem_lock(), we must
+	 * upgrade from "simple" ipc_lock_object() to sem_lock(,,-1).
+	 */
+	complexmode_enter(sma);
+
+	freeary(ns, ipcp);
+}
+
 void sem_exit_ns(struct ipc_namespace *ns)
 {
-	free_ipcs(ns, _ids(ns), freeary);
+	free_ipcs(ns, _ids(ns), freeary_lock);
 	idr_destroy(>ids[IPC_SEM_IDS].ipcs_idr);
 	rhashtable_destroy(>ids[IPC_SEM_IDS].key_ht);
 }
@@ -374,7 +392,9 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Complex operation - acquire a full lock */
 		ipc_lock_object(>sem_perm);
 
-		/* Prevent parallel simple ops */
+		/* Prevent parallel simple ops.
+		 * This must be identical to freeary_lock().
+		 */
 		complexmode_enter(sma);
 		return SEM_GLOBAL_LOCK;
 	}
-- 
2.17.2



Re: BUG: corrupted list in freeary

2018-12-01 Thread Manfred Spraul

Hi Dmitry,

On 11/30/18 6:58 PM, Dmitry Vyukov wrote:

On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
 wrote:

Hello together,

On 11/27/18 4:52 PM, syzbot wrote:

Hello,

syzbot found the following crash on:

HEAD commit:e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a340

[...]

Isn't this a kernel stack overrun?

RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
used up everything.

I don't have an exact answer, that's just the kernel output that we captured
from the console.

FWIW with KASAN stacks are 16K:
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
Ok, thanks. And stack overrun detection is enabled as well -> a real 
stack overrun is unlikely.

Well, generally everything except for kernel crashes is expected.

We actually sandbox it with memcg quite aggressively:
https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
But it seems to manage to either break the limits, or cause some
massive memory leaks. The nature of that is yet unknown.


Is it possible to start from that side?

Are there other syzkaller runs where the OOM killer triggers that much?




- Which stress tests are enabled? By chance, I found:

[  433.304586] FAULT_INJECTION: forcing a failure.^M
[  433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
times 0^M
[  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
#348^M
[  433.323841] Hardware name: Google Google Compute Engine/Google Compute
Engine, BIOS Google 01/01/2011^M

I need some more background, then I can review the code.

What exactly do you mean by "Which stress tests"?
Fault injection is enabled. Also random workload from userspace.



Right now, I would put it into my "unknown syzkaller finding" folder.


One more idea: Are there further syzkaller runs that end up with 
0x01 in a pointer?


From what I see, the sysv sem code that is used is trivial, I don't see 
that it could cause the observed behavior.



--

    Manfred



Re: [RFC, PATCH] ipc/util.c: use idr_alloc_cyclic() for ipc allocations

2018-10-03 Thread Manfred Spraul

On 10/2/18 8:27 PM, Waiman Long wrote:

On 10/02/2018 12:19 PM, Manfred Spraul wrote:

A bit related to the patch series that increases IPC_MNI:

(User space) id reuse create the risk of data corruption:

Process A: calls ipc function
Process A: sleeps just at the beginning of the syscall
Process B: Frees the ipc object (i.e.: calls ...ctl(IPC_RMID)
Process B: Creates a new ipc object (i.e.: calls ...get())

Process A: is woken up, and accesses the new object

To reduce the probability that the new and the old object
have the same id, the current implementation adds a
sequence number to the index of the object in the idr tree.

To further reduce the probability for a reuse, switch from
idr_alloc to idr_alloc_cyclic.

The patch cycles over at least RADIX_TREE_MAP_SIZE, i.e.
if there is only a small number of objects, the accesses
continue to be direct.

As an option, this could be made dependent on the extended
mode: In extended mode, cycle over e.g. at least 16k ids.

Signed-off-by: Manfred Spraul 
---

Open questions:
- Is there a significant performance advantage, especially
   there are many ipc ids?
- Over how many ids should the code cycle always?
- Further review remarks?

  ipc/util.c | 22 +-
  1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/ipc/util.c b/ipc/util.c
index 0af05752969f..6f83841f6761 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -216,10 +216,30 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 */
  
  	if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */

+   int idr_max;
+
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-   idx = idr_alloc(>ipcs_idr, new, 0, 0, GFP_NOWAIT);
+
+   /*
+* If a user space visible id is reused, then this creates a
+* risk for data corruption. To reduce the probability that
+* a number is reduced, two approaches are used:

   reduced -> reused?

Of course.



+* 1) the idr index is allocated cyclically.
+* 2) the use space id is build by concatenating the
+*internal idr index with a sequence number
+* To avoid that both numbers have the same cycle time, try
+* to set the size for the cyclic alloc to an odd number.
+*/
+   idr_max = ids->in_use*2+1;
+   if (idr_max < RADIX_TREE_MAP_SIZE-1)
+   idr_max = RADIX_TREE_MAP_SIZE-1;
+   if (idr_max > IPCMNI)
+   idr_max = IPCMNI;
+
+   idx = idr_alloc_cyclic(>ipcs_idr, new, 0, idr_max,
+   GFP_NOWAIT);
} else {
new->seq = ipcid_to_seqx(next_id);
idx = idr_alloc(>ipcs_idr, new, ipcid_to_idx(next_id),


Each of the IPC components has its own sysctl parameters limiting the max
number of objects that can be allocated. With cyclic allocation, you
will have to make sure that idr_max is not larger than the corresponding
IPC sysctl parameters. That may require moving the limits to the
corresponding ipc_ids structure so that it can be used in ipc_idr_alloc().


First, I would disagree:

the sysctl limits specify how many objects can exist.

idr_max is the maximum index in the radix tree that can exist. There is 
a hard limit of IPCMNI, but that's it.



But:

The name is wrong, I will rename the variable to idx_max


What is the point of comparing idr_max against RADIX_TREE_MAP_SIZE-1? Is
it for performance reasons?


Let's assume you have only 1 ipc object, and you alloc/release that object.

At alloc time, ids->in_use is 0 -> idr_max 1 -> every object will end up 
with idx=0.


This would defeat the whole purpose of using a cyclic alloc.

Thus: cycle over at least 63 ids -> 5 additional bits to avoid collisions.


--

    Manfred
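
To make the idx/sequence-number split in the discussion above concrete: in the
classic (non-IPCMNI_EXTEND) layout the user-visible id is built as
seq * SEQ_MULTIPLIER + idx, with SEQ_MULTIPLIER == IPCMNI == 32768. The sketch
below is illustrative only -- the constants are the classic ones and are an
assumption about the configuration, not part of the patch:

/* id-layout-sketch.c: a recycled idr slot still yields a different
 * user-visible id as long as either the idx (cyclic allocation) or the
 * sequence number recorded in the new object differs.
 */
#include <stdio.h>

#define IPCMNI          32768
#define SEQ_MULTIPLIER  IPCMNI

static int build_id(int idx, int seq)
{
    return seq * SEQ_MULTIPLIER + idx;
}

int main(void)
{
    int old_id = build_id(5, 7);    /* object created earlier */
    int new_id = build_id(5, 8);    /* same slot reused, seq bumped */

    printf("old=%d new=%d, stale id still valid: %s\n",
           old_id, new_id, old_id == new_id ? "yes" : "no");
    printf("idx(old)=%d seq(old)=%d\n",
           old_id % SEQ_MULTIPLIER, old_id / SEQ_MULTIPLIER);
    return 0;
}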



[RFC, PATCH] ipc/util.c: use idr_alloc_cyclic() for ipc allocations

2018-10-02 Thread Manfred Spraul
A bit related to the patch series that increases IPC_MNI:

(User space) id reuse create the risk of data corruption:

Process A: calls ipc function
Process A: sleeps just at the beginning of the syscall
Process B: Frees the ipc object (i.e.: calls ...ctl(IPC_RMID)
Process B: Creates a new ipc object (i.e.: calls ...get())

Process A: is woken up, and accesses the new object

To reduce the probability that the new and the old object
have the same id, the current implementation adds a
sequence number to the index of the object in the idr tree.

To further reduce the probability for a reuse, switch from
idr_alloc to idr_alloc_cyclic.

The patch cycles over at least RADIX_TREE_MAP_SIZE, i.e.
if there is only a small number of objects, the accesses
continue to be direct.

As an option, this could be made dependent on the extended
mode: In extended mode, cycle over e.g. at least 16k ids.

Signed-off-by: Manfred Spraul 
---

Open questions:
- Is there a significant performance advantage, especially
  there are many ipc ids?
- Over how many ids should the code cycle always?
- Further review remarks?

 ipc/util.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/ipc/util.c b/ipc/util.c
index 0af05752969f..6f83841f6761 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -216,10 +216,30 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 */
 
if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */
+   int idr_max;
+
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-   idx = idr_alloc(>ipcs_idr, new, 0, 0, GFP_NOWAIT);
+
+   /*
+* If a user space visible id is reused, then this creates a
+* risk for data corruption. To reduce the probability that
+* a number is reduced, two approaches are used:
+* 1) the idr index is allocated cyclically.
+* 2) the use space id is build by concatenating the
+*internal idr index with a sequence number
+* To avoid that both numbers have the same cycle time, try
+* to set the size for the cyclic alloc to an odd number.
+*/
+   idr_max = ids->in_use*2+1;
+   if (idr_max < RADIX_TREE_MAP_SIZE-1)
+   idr_max = RADIX_TREE_MAP_SIZE-1;
+   if (idr_max > IPCMNI)
+   idr_max = IPCMNI;
+
+   idx = idr_alloc_cyclic(>ipcs_idr, new, 0, idr_max,
+   GFP_NOWAIT);
} else {
new->seq = ipcid_to_seqx(next_id);
idx = idr_alloc(>ipcs_idr, new, ipcid_to_idx(next_id),
-- 
2.17.1



Re: [PATCH -next] ipc/sem: prevent queue.status tearing in semop

2018-07-17 Thread Manfred Spraul

Hello Davidlohr,

On 07/17/2018 07:26 AM, Davidlohr Bueso wrote:

In order for load/store tearing to work, _all_ accesses to
the variable in question need to be done around READ and
WRITE_ONCE() macros. Ensure everyone does so for q->status
variable for semtimedop().

What is the background of the above rule?

sma->use_global_lock is sometimes used with smp_load_acquire(), 
sometimes without.

So far, I assumed that this is safe.

The same applies for nf_conntrack_locks_all, in nf_conntrack_all_lock()

Signed-off-by: Davidlohr Bueso 
---
  ipc/sem.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 6cbbf34a44ac..ccab4e51d351 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2125,7 +2125,7 @@ static long do_semtimedop(int semid, struct sembuf __user 
*tsops,
}
  
  	do {

-   queue.status = -EINTR;
+   WRITE_ONCE(queue.status, -EINTR);
queue.sleeper = current;
  
  		__set_current_state(TASK_INTERRUPTIBLE);
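
For readers who have not met the rule Davidlohr is referring to: READ_ONCE()
and WRITE_ONCE() force the compiler to perform exactly one, full-width access,
so a racing reader can never observe a half-written ("torn") q->status. A
rough user-space analogue using C11 relaxed atomics -- a sketch of the idea
only, not the kernel macros:

/* tearing-sketch.c: relaxed atomics make the store and the load single,
 * full-width accesses, which is the user-space cousin of what
 * READ_ONCE/WRITE_ONCE guarantee for plain variables in the kernel.
 */
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long status;

static void set_status(long value)
{
    /* analogous to WRITE_ONCE(queue.status, value) */
    atomic_store_explicit(&status, value, memory_order_relaxed);
}

static long get_status(void)
{
    /* analogous to READ_ONCE(queue.status) */
    return atomic_load_explicit(&status, memory_order_relaxed);
}

int main(void)
{
    set_status(-EINTR);
    printf("status = %ld\n", get_status());
    return 0;
}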





[PATCH 01/12] ipc: ipc: compute kern_ipc_perm.id under the ipc lock.

2018-07-12 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.id after having called
ipc_idr_alloc().

Thus a parallel semctl() or msgctl() that uses e.g. MSG_STAT may use
this uninitialized value as the return code.

The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with kern_ipc_perm.seq

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Reviewed-by: Davidlohr Bueso 
---
 ipc/msg.c | 19 ++-
 ipc/sem.c | 18 +-
 ipc/shm.c | 19 ++-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 3b6545302598..49358f474fc9 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -491,7 +491,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 int cmd, struct msqid64_ds *p)
 {
struct msg_queue *msq;
-   int id = 0;
int err;
 
memset(p, 0, sizeof(*p));
@@ -503,7 +502,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
err = PTR_ERR(msq);
goto out_unlock;
}
-   id = msq->q_perm.id;
} else { /* IPC_STAT */
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
@@ -548,10 +546,21 @@ static int msgctl_stat(struct ipc_namespace *ns, int 
msqid,
p->msg_lspid  = pid_vnr(msq->q_lspid);
p->msg_lrpid  = pid_vnr(msq->q_lrpid);
 
-   ipc_unlock_object(>q_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* MSG_STAT and MSG_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence number
+*/
+   err = msq->q_perm.id;
+   }
 
+   ipc_unlock_object(>q_perm);
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/sem.c b/ipc/sem.c
index 5af1943ad782..d89ce69b2613 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1222,7 +1222,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 {
struct sem_array *sma;
time64_t semotime;
-   int id = 0;
int err;
 
memset(semid64, 0, sizeof(*semid64));
@@ -1234,7 +1233,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
err = PTR_ERR(sma);
goto out_unlock;
}
-   id = sma->sem_perm.id;
} else { /* IPC_STAT */
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
@@ -1274,10 +1272,20 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 #endif
semid64->sem_nsems = sma->sem_nsems;
 
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SEM_STAT and SEM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence number
+*/
+   err = sma->sem_perm.id;
+   }
ipc_unlock_object(>sem_perm);
-   rcu_read_unlock();
-   return id;
-
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/shm.c b/ipc/shm.c
index 051a3e1fb8df..f3bae59bed08 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -949,7 +949,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
int cmd, struct shmid64_ds *tbuf)
 {
struct shmid_kernel *shp;
-   int id = 0;
int err;
 
memset(tbuf, 0, sizeof(*tbuf));
@@ -961,7 +960,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
err = PTR_ERR(shp);
goto out_unlock;
}
-   id = shp->shm_perm.id;
} else { /* IPC_STAT */
shp = shm_obtain_object_check(ns, shmid);
if (IS_ERR(shp)) {
@@ -1011,10 +1009,21 @@ static int shmctl_stat(struct ipc_namespace *ns, int 
shmid,
tbuf->shm_lpid  = pid_vnr(shp->shm_lprid);
tbuf->shm_nattch = shp->shm_nattch;
 
-   ipc_unlock_object(>shm_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SHM_STAT and SHM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence number
+*/
+   err = shp->shm_perm.id;
+   }
 
+   ipc_un
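
The user-visible contract that the hunks above preserve, as a small sketch:
MSG_STAT (and MSG_STAT_ANY) take an index into the kernel's internal array and
return the full id, while IPC_STAT takes an id and returns 0 on success. The
program below only illustrates that contract; it is not part of the patch:

/* stat-sketch.c: contrast the return values of IPC_STAT and MSG_STAT. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int main(void)
{
    struct msqid_ds ds;
    int idx, id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    if (id < 0) {
        perror("msgget");
        return 1;
    }
    /* IPC_STAT: pass the id, 0 means success */
    printf("IPC_STAT(%d) -> %d\n", id, msgctl(id, IPC_STAT, &ds));

    /* MSG_STAT: pass an index; the return value is the full id
     * (idx plus sequence number) of whatever object sits in that slot.
     * Search a few low slots, enough on an otherwise idle system. */
    for (idx = 0; idx < 16; idx++) {
        int ret = msgctl(idx, MSG_STAT, &ds);

        if (ret == id) {
            printf("MSG_STAT(%d) -> %d (our queue)\n", idx, ret);
            break;
        }
    }
    msgctl(id, IPC_RMID, NULL);
    return 0;
}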

[PATCH 03/12] ipc/util.c: Use ipc_rcu_putref() for failues in ipc_addid()

2018-07-12 Thread Manfred Spraul
ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
  because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
  because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early,
and by modifying all callers.

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with reading kern_ipc_perm.seq,
here both read and write to already released memory could happen.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
---
 ipc/msg.c  |  2 +-
 ipc/sem.c  |  2 +-
 ipc/shm.c  |  2 ++
 ipc/util.c | 10 --
 4 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 49358f474fc9..38119c1f0da3 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -162,7 +162,7 @@ static int newque(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks msq upon success. */
retval = ipc_addid(_ids(ns), >q_perm, ns->msg_ctlmni);
if (retval < 0) {
-   call_rcu(>q_perm.rcu, msg_rcu_free);
+   ipc_rcu_putref(>q_perm, msg_rcu_free);
return retval;
}
 
diff --git a/ipc/sem.c b/ipc/sem.c
index d89ce69b2613..8a0a1eb05765 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -556,7 +556,7 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks sma upon success. */
retval = ipc_addid(_ids(ns), >sem_perm, ns->sc_semmni);
if (retval < 0) {
-   call_rcu(>sem_perm.rcu, sem_rcu_free);
+   ipc_rcu_putref(>sem_perm, sem_rcu_free);
return retval;
}
ns->used_sems += nsems;
diff --git a/ipc/shm.c b/ipc/shm.c
index f3bae59bed08..92d71abe9e8f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -671,6 +671,8 @@ static int newseg(struct ipc_namespace *ns, struct 
ipc_params *params)
if (is_file_hugepages(file) && shp->mlock_user)
user_shm_unlock(size, shp->mlock_user);
fput(file);
+   ipc_rcu_putref(>shm_perm, shm_rcu_free);
+   return error;
 no_file:
call_rcu(>shm_perm.rcu, shm_rcu_free);
return error;
diff --git a/ipc/util.c b/ipc/util.c
index 4998f8fa8ce0..f3447911c81e 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -250,7 +250,9 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, struct 
kern_ipc_perm *new)
  * Add an entry 'new' to the ipc ids idr. The permissions object is
  * initialised and the first free entry is set up and the id assigned
  * is returned. The 'new' entry is returned in a locked state on success.
+ *
  * On failure the entry is not locked and a negative err-code is returned.
+ * The caller must use ipc_rcu_putref() to free the identifier.
  *
  * Called with writer ipc_ids.rwsem held.
  */
@@ -260,6 +262,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
kgid_t egid;
int idx, err;
 
+   /* 1) Initialize the refcount so that ipc_rcu_putref works */
+   refcount_set(>refcount, 1);
+
if (limit > IPCMNI)
limit = IPCMNI;
 
@@ -268,9 +273,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
 
idr_preload(GFP_KERNEL);
 
-   refcount_set(>refcount, 1);
spin_lock_init(>lock);
-   new->deleted = false;
rcu_read_lock();
spin_lock(>lock);
 
@@ -278,6 +281,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;
 
+   new->deleted = false;
+
idx = ipc_idr_alloc(ids, new);
idr_preload_end();
 
@@ -290,6 +295,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
}
}
if (idx < 0) {
+   new->deleted = true;
spin_unlock(>lock);
rcu_read_unlock();
return idx;
-- 
2.17.1



[PATCH 04/12] ipc: Rename ipcctl_pre_down_nolock().

2018-07-12 Thread Manfred Spraul
Both the comment and the name of ipcctl_pre_down_nolock()
are misleading: The function must be called while holding
the rw semaphore.
Therefore the patch renames the function to ipcctl_obtain_check():
This name matches the other names used in util.c:
- "obtain" function look up a pointer in the idr, without
  acquiring the object lock.
- The caller is responsible for locking.
- _check means that the sequence number is checked.

Signed-off-by: Manfred Spraul 
Reviewed-by: Davidlohr Bueso 
---
 ipc/msg.c  | 2 +-
 ipc/sem.c  | 2 +-
 ipc/shm.c  | 2 +-
 ipc/util.c | 8 
 ipc/util.h | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 38119c1f0da3..4aca0ce363b5 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -385,7 +385,7 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, 
int cmd,
down_write(_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, _ids(ns), msqid, cmd,
+   ipcp = ipcctl_obtain_check(ns, _ids(ns), msqid, cmd,
  >msg_perm, msqid64->msg_qbytes);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/sem.c b/ipc/sem.c
index 8a0a1eb05765..da1626984083 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1595,7 +1595,7 @@ static int semctl_down(struct ipc_namespace *ns, int 
semid,
down_write(_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, _ids(ns), semid, cmd,
+   ipcp = ipcctl_obtain_check(ns, _ids(ns), semid, cmd,
  >sem_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/shm.c b/ipc/shm.c
index 92d71abe9e8f..0a509befb558 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -868,7 +868,7 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, 
int cmd,
down_write(_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, _ids(ns), shmid, cmd,
+   ipcp = ipcctl_obtain_check(ns, _ids(ns), shmid, cmd,
  >shm_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/util.c b/ipc/util.c
index f3447911c81e..cffd12240f67 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -687,7 +687,7 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
 }
 
 /**
- * ipcctl_pre_down_nolock - retrieve an ipc and check permissions for some 
IPC_XXX cmd
+ * ipcctl_obtain_check - retrieve an ipc object and check permissions
  * @ns:  ipc namespace
  * @ids:  the table of ids where to look for the ipc
  * @id:   the id of the ipc to retrieve
@@ -697,16 +697,16 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
  *
  * This function does some common audit and permissions check for some IPC_XXX
  * cmd and is called from semctl_down, shmctl_down and msgctl_down.
- * It must be called without any lock held and:
  *
- *   - retrieves the ipc with the given id in the given table.
+ * It:
+ *   - retrieves the ipc object with the given id in the given table.
  *   - performs some audit and permission check, depending on the given cmd
  *   - returns a pointer to the ipc object or otherwise, the corresponding
  * error.
  *
  * Call holding the both the rwsem and the rcu read lock.
  */
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
struct ipc_ids *ids, int id, int cmd,
struct ipc64_perm *perm, int extra_perm)
 {
diff --git a/ipc/util.h b/ipc/util.h
index 0aba3230d007..fcf81425ae98 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -148,7 +148,7 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id);
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
 int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
 struct ipc_ids *ids, int id, int 
cmd,
 struct ipc64_perm *perm, int 
extra_perm);
 
-- 
2.17.1



[PATCH 05/12] ipc/util.c: correct comment in ipc_obtain_object_check

2018-07-12 Thread Manfred Spraul
The comment that explains ipc_obtain_object_check is wrong:
The function checks the sequence number, not the reference
counter.
Note that checking the reference counter would be meaningless:
The reference counter is decreased without holding any locks,
thus an object with kern_ipc_perm.deleted=true may disappear at
the end of the next rcu grace period.

Signed-off-by: Manfred Spraul 
Reviewed-by: Davidlohr Bueso 
---
 ipc/util.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index cffd12240f67..5cc37066e659 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -628,8 +628,8 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
  * @ids: ipc identifier set
  * @id: ipc id to look for
  *
- * Similar to ipc_obtain_object_idr() but also checks
- * the ipc object reference counter.
+ * Similar to ipc_obtain_object_idr() but also checks the ipc object
+ * sequence number.
  *
  * Call inside the RCU critical section.
  * The ipc object is *not* locked on exit.
-- 
2.17.1



[PATCH 06/12] ipc: drop ipc_lock()

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

ipc/util.c contains multiple functions to get the ipc object
pointer given an id number.

There are two sets of function: One set verifies the sequence
counter part of the id number, other functions do not check
the sequence counter.

The standard for function names in ipc/util.c is
- ..._check() functions verify the sequence counter
- ..._idr() functions do not verify the sequence counter

ipc_lock() is an exception: It does not verify the sequence
counter value, but this is not obvious from the function name.

Furthermore, shm.c is the only user of this helper. Thus, we
can simply move the logic into shm_lock() and get rid of the
function altogether.

[changelog mostly by manfred]
Signed-off-by: Davidlohr Bueso 
Signed-off-by: Manfred Spraul 
---
 ipc/shm.c  | 29 +++--
 ipc/util.c | 36 
 ipc/util.h |  1 -
 3 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 0a509befb558..22afb98363ff 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -179,16 +179,33 @@ static inline struct shmid_kernel 
*shm_obtain_object_check(struct ipc_namespace
  */
 static inline struct shmid_kernel *shm_lock(struct ipc_namespace *ns, int id)
 {
-   struct kern_ipc_perm *ipcp = ipc_lock(_ids(ns), id);
+   struct kern_ipc_perm *ipcp;
+
+   rcu_read_lock();
+   ipcp = ipc_obtain_object_idr(_ids(ns), id);
+   if (IS_ERR(ipcp))
+   goto err;
 
+   ipc_lock_object(ipcp);
+   /*
+* ipc_rmid() may have already freed the ID while ipc_lock_object()
+* was spinning: here verify that the structure is still valid.
+* Upon races with RMID, return -EIDRM, thus indicating that
+* the ID points to a removed identifier.
+*/
+   if (ipc_valid_object(ipcp)) {
+   /* return a locked ipc object upon success */
+   return container_of(ipcp, struct shmid_kernel, shm_perm);
+   }
+
+   ipc_unlock_object(ipcp);
+err:
+   rcu_read_unlock();
/*
 * Callers of shm_lock() must validate the status of the returned ipc
-* object pointer (as returned by ipc_lock()), and error out as
-* appropriate.
+* object pointer and error out as appropriate.
 */
-   if (IS_ERR(ipcp))
-   return (void *)ipcp;
-   return container_of(ipcp, struct shmid_kernel, shm_perm);
+   return (void *)ipcp;
 }
 
 static inline void shm_lock_by_ptr(struct shmid_kernel *ipcp)
diff --git a/ipc/util.c b/ipc/util.c
index 5cc37066e659..234f6d781df3 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -587,42 +587,6 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id)
return out;
 }
 
-/**
- * ipc_lock - lock an ipc structure without rwsem held
- * @ids: ipc identifier set
- * @id: ipc id to look for
- *
- * Look for an id in the ipc ids idr and lock the associated ipc object.
- *
- * The ipc object is locked on successful exit.
- */
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
-{
-   struct kern_ipc_perm *out;
-
-   rcu_read_lock();
-   out = ipc_obtain_object_idr(ids, id);
-   if (IS_ERR(out))
-   goto err;
-
-   spin_lock(>lock);
-
-   /*
-* ipc_rmid() may have already freed the ID while ipc_lock()
-* was spinning: here verify that the structure is still valid.
-* Upon races with RMID, return -EIDRM, thus indicating that
-* the ID points to a removed identifier.
-*/
-   if (ipc_valid_object(out))
-   return out;
-
-   spin_unlock(>lock);
-   out = ERR_PTR(-EIDRM);
-err:
-   rcu_read_unlock();
-   return out;
-}
-
 /**
  * ipc_obtain_object_check
  * @ids: ipc identifier set
diff --git a/ipc/util.h b/ipc/util.h
index fcf81425ae98..e3c47b21db93 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -142,7 +142,6 @@ int ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
void (*func)(struct rcu_head *head));
 
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
 struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
-- 
2.17.1



[PATCH 08/12] lib/rhashtable: guarantee initial hashtable allocation

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

rhashtable_init() may fail due to -ENOMEM, thus making the
entire api unusable. This patch removes this scenario,
however unlikely. In order to guarantee memory allocation,
this patch always ends up doing GFP_KERNEL|__GFP_NOFAIL
for both the tbl as well as alloc_bucket_spinlocks().

Upon the first table allocation failure, we shrink the
size to the smallest value that makes sense and retry with
__GFP_NOFAIL semantics. With the defaults, this means that
from 64 buckets, we retry with only 4. Any later issues
regarding performance due to collisions or larger table
resizing (when more memory becomes available) are the least
of our problems.

Signed-off-by: Davidlohr Bueso 
Acked-by: Herbert Xu 
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 083f871491a1..0026cf3e3f27 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -179,10 +179,11 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
 
size = nbuckets;
 
-   if (tbl == NULL && gfp != GFP_KERNEL) {
+   if (tbl == NULL && (gfp & ~__GFP_NOFAIL) != GFP_KERNEL) {
tbl = nested_bucket_table_alloc(ht, nbuckets, gfp);
nbuckets = 0;
}
+
if (tbl == NULL)
return NULL;
 
@@ -1065,9 +1066,16 @@ int rhashtable_init(struct rhashtable *ht,
}
}
 
+   /*
+* This is api initialization and thus we need to guarantee the
+* initial rhashtable allocation. Upon failure, retry with the
+* smallest possible size with __GFP_NOFAIL semantics.
+*/
tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
-   if (tbl == NULL)
-   return -ENOMEM;
+   if (unlikely(tbl == NULL)) {
+   size = max_t(u16, ht->p.min_size, HASH_MIN_SIZE);
+   tbl = bucket_table_alloc(ht, size, GFP_KERNEL | __GFP_NOFAIL);
+   }
 
atomic_set(>nelems, 0);
 
-- 
2.17.1



[PATCH 07/12] lib/rhashtable: simplify bucket_table_alloc()

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

As of commit ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
incompatible gfp flags") we can simplify the caller and trust kvzalloc() to
just do the right thing. For the case of the GFP_ATOMIC context, we can
drop the __GFP_NORETRY flag for obvious reasons, and for the __GFP_NOWARN
case, however, it is changed such that the caller passes the flag instead
of making bucket_table_alloc() handle it.

This slightly changes the gfp flags passed on to nested_table_alloc() as
it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this a
positive consequence as for the same reasons we want nowarn semantics in
bucket_table_alloc().

Signed-off-by: Davidlohr Bueso 
Acked-by: Michal Hocko 

(commit id extended to 12 digits, line wraps updated)
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9427b5766134..083f871491a1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -175,10 +175,7 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
int i;
 
size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
-   if (gfp != GFP_KERNEL)
-   tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
-   else
-   tbl = kvzalloc(size, gfp);
+   tbl = kvzalloc(size, gfp);
 
size = nbuckets;
 
@@ -459,7 +456,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 
err = -ENOMEM;
 
-   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
+   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC | __GFP_NOWARN);
if (new_tbl == NULL)
goto fail;
 
-- 
2.17.1




[PATCH 09/12] ipc: get rid of ids->tables_initialized hack

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

In sysvipc we have an ids->tables_initialized regarding the
rhashtable, introduced in:

commit 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots of keys")

It's there, specifically, to prevent nil pointer dereferences
from using an uninitialized api. Considering how rhashtable_init()
can fail (probably due to ENOMEM, if anything), this made the
overall ipc initialization capable of failure as well. That alone
is ugly, but fine. However, I've spotted a few issues regarding the
semantics of tables_initialized (however unlikely they may be):

- There is inconsistency in what we return to userspace: ipc_addid()
returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
returns EINVAL.

- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc structure
isn't found, and can therefore call into ipcget() callbacks.

Now that rhashtable initialization cannot fail, we can properly
get rid of the hack altogether.

Signed-off-by: Davidlohr Bueso 

(commit id extended to 12 digits)
Signed-off-by: Manfred Spraul 
---
 include/linux/ipc_namespace.h |  1 -
 ipc/util.c| 23 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8eb2f3..37f3a4b7c637 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,7 +16,6 @@ struct user_namespace;
 struct ipc_ids {
int in_use;
unsigned short seq;
-   bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
int max_id;
diff --git a/ipc/util.c b/ipc/util.c
index 234f6d781df3..f620778b11d2 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -125,7 +125,6 @@ int ipc_init_ids(struct ipc_ids *ids)
if (err)
return err;
idr_init(>ipcs_idr);
-   ids->tables_initialized = true;
ids->max_id = -1;
 #ifdef CONFIG_CHECKPOINT_RESTORE
ids->next_id = -1;
@@ -178,19 +177,16 @@ void __init ipc_init_proc_interface(const char *path, 
const char *header,
  */
 static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
 {
-   struct kern_ipc_perm *ipcp = NULL;
+   struct kern_ipc_perm *ipcp;
 
-   if (likely(ids->tables_initialized))
-   ipcp = rhashtable_lookup_fast(>key_ht, ,
+   ipcp = rhashtable_lookup_fast(>key_ht, ,
  ipc_kht_params);
+   if (!ipcp)
+   return NULL;
 
-   if (ipcp) {
-   rcu_read_lock();
-   ipc_lock_object(ipcp);
-   return ipcp;
-   }
-
-   return NULL;
+   rcu_read_lock();
+   ipc_lock_object(ipcp);
+   return ipcp;
 }
 
 /*
@@ -268,7 +264,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
if (limit > IPCMNI)
limit = IPCMNI;
 
-   if (!ids->tables_initialized || ids->in_use >= limit)
+   if (ids->in_use >= limit)
return -ENOSPC;
 
idr_preload(GFP_KERNEL);
@@ -577,9 +573,6 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id)
struct kern_ipc_perm *out;
int lid = ipcid_to_idx(id);
 
-   if (unlikely(!ids->tables_initialized))
-   return ERR_PTR(-EINVAL);
-
out = idr_find(>ipcs_idr, lid);
if (!out)
return ERR_PTR(-EINVAL);
-- 
2.17.1



[PATCH 10/12] ipc: simplify ipc initialization

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

Now that we know that rhashtable_init() will not fail, we
can get rid of a lot of the unnecessary cleanup paths that
handled its failure.

Signed-off-by: Davidlohr Bueso 

(variable name added to util.h to resolve checkpatch warning)
Signed-off-by: Manfred Spraul 
---
 ipc/msg.c   |  9 -
 ipc/namespace.c | 20 
 ipc/sem.c   | 10 --
 ipc/shm.c   |  9 -
 ipc/util.c  | 18 +-
 ipc/util.h  | 18 +-
 6 files changed, 30 insertions(+), 54 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 4aca0ce363b5..130e12e6a8c6 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -1237,7 +1237,7 @@ COMPAT_SYSCALL_DEFINE5(msgrcv, int, msqid, compat_uptr_t, 
msgp,
 }
 #endif
 
-int msg_init_ns(struct ipc_namespace *ns)
+void msg_init_ns(struct ipc_namespace *ns)
 {
ns->msg_ctlmax = MSGMAX;
ns->msg_ctlmnb = MSGMNB;
@@ -1245,7 +1245,7 @@ int msg_init_ns(struct ipc_namespace *ns)
 
atomic_set(>msg_bytes, 0);
atomic_set(>msg_hdrs, 0);
-   return ipc_init_ids(>ids[IPC_MSG_IDS]);
+   ipc_init_ids(>ids[IPC_MSG_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -1286,12 +1286,11 @@ static int sysvipc_msg_proc_show(struct seq_file *s, 
void *it)
 }
 #endif
 
-int __init msg_init(void)
+void __init msg_init(void)
 {
-   const int err = msg_init_ns(_ipc_ns);
+   msg_init_ns(_ipc_ns);
 
ipc_init_proc_interface("sysvipc/msg",
"   key  msqid perms  cbytes   
qnum lspid lrpid   uid   gid  cuid  cgid  stime  rtime  ctime\n",
IPC_MSG_IDS, sysvipc_msg_proc_show);
-   return err;
 }
diff --git a/ipc/namespace.c b/ipc/namespace.c
index f59a89966f92..21607791d62c 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -55,28 +55,16 @@ static struct ipc_namespace *create_ipc_ns(struct 
user_namespace *user_ns,
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;
 
-   err = sem_init_ns(ns);
+   err = mq_init_ns(ns);
if (err)
goto fail_put;
-   err = msg_init_ns(ns);
-   if (err)
-   goto fail_destroy_sem;
-   err = shm_init_ns(ns);
-   if (err)
-   goto fail_destroy_msg;
 
-   err = mq_init_ns(ns);
-   if (err)
-   goto fail_destroy_shm;
+   sem_init_ns(ns);
+   msg_init_ns(ns);
+   shm_init_ns(ns);
 
return ns;
 
-fail_destroy_shm:
-   shm_exit_ns(ns);
-fail_destroy_msg:
-   msg_exit_ns(ns);
-fail_destroy_sem:
-   sem_exit_ns(ns);
 fail_put:
put_user_ns(ns->user_ns);
ns_free_inum(>ns);
diff --git a/ipc/sem.c b/ipc/sem.c
index da1626984083..671d8703b130 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -220,14 +220,14 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define sc_semopm  sem_ctls[2]
 #define sc_semmni  sem_ctls[3]
 
-int sem_init_ns(struct ipc_namespace *ns)
+void sem_init_ns(struct ipc_namespace *ns)
 {
ns->sc_semmsl = SEMMSL;
ns->sc_semmns = SEMMNS;
ns->sc_semopm = SEMOPM;
ns->sc_semmni = SEMMNI;
ns->used_sems = 0;
-   return ipc_init_ids(>ids[IPC_SEM_IDS]);
+   ipc_init_ids(>ids[IPC_SEM_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -239,14 +239,12 @@ void sem_exit_ns(struct ipc_namespace *ns)
 }
 #endif
 
-int __init sem_init(void)
+void __init sem_init(void)
 {
-   const int err = sem_init_ns(_ipc_ns);
-
+   sem_init_ns(_ipc_ns);
ipc_init_proc_interface("sysvipc/sem",
"   key  semid perms  nsems   uid   
gid  cuid  cgid  otime  ctime\n",
IPC_SEM_IDS, sysvipc_sem_proc_show);
-   return err;
 }
 
 /**
diff --git a/ipc/shm.c b/ipc/shm.c
index 22afb98363ff..d388d6e744c0 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -95,14 +95,14 @@ static void shm_destroy(struct ipc_namespace *ns, struct 
shmid_kernel *shp);
 static int sysvipc_shm_proc_show(struct seq_file *s, void *it);
 #endif
 
-int shm_init_ns(struct ipc_namespace *ns)
+void shm_init_ns(struct ipc_namespace *ns)
 {
ns->shm_ctlmax = SHMMAX;
ns->shm_ctlall = SHMALL;
ns->shm_ctlmni = SHMMNI;
ns->shm_rmid_forced = 0;
ns->shm_tot = 0;
-   return ipc_init_ids(_ids(ns));
+   ipc_init_ids(_ids(ns));
 }
 
 /*
@@ -135,9 +135,8 @@ void shm_exit_ns(struct ipc_namespace *ns)
 
 static int __init ipc_ns_init(void)
 {
-   const int err = shm_init_ns(_ipc_ns);
-   WARN(err, "ipc: sysv shm_init_ns failed: %d\n", err);
-   return err;
+   shm_init_ns(_ipc_ns);
+   return 0;
 }
 
 pure_initcall(ipc_ns_init);
diff --git a/ipc/util.c b/ipc/util.c
index f620778b11d2..35621be0d945 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -87,16 +87,12 @@ struct ipc_proc_

[PATCH 11/12] ipc/util.c: Further variable name cleanups

2018-07-12 Thread Manfred Spraul
The variable names have become a mess, thus standardize them again:

id: user space id. Called semid, shmid, msgid if the type is known.
Most functions use "id" already.
idx: "index" for the idr lookup
Right now, some functions use lid, ipc_addid() already uses idx as
the variable name.
seq: sequence number, to avoid quick collisions of the user space id
key: user space key, used for the rhash tree
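
As a rough sketch of how these values relate (illustration only: the
example_ helpers below stand in for the ipcid_to_idx()/ipcid_to_seqx()
macros, assuming the classic SEQ_MULTIPLIER based encoding):

/*
 * Illustration: decomposition of a user space id. The inverse is what
 * ipc_idr_alloc() computes: id = SEQ_MULTIPLIER * seq + idx.
 */
static inline int example_ipcid_to_idx(int id)
{
	return id % SEQ_MULTIPLIER;	/* idx: position in the idr */
}

static inline int example_ipcid_to_seqx(int id)
{
	return id / SEQ_MULTIPLIER;	/* seq: sequence counter part */
}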

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
---
 include/linux/ipc_namespace.h |  2 +-
 ipc/msg.c |  6 +++---
 ipc/sem.c |  6 +++---
 ipc/shm.c |  4 ++--
 ipc/util.c| 26 +-
 ipc/util.h| 10 +-
 6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 37f3a4b7c637..3098d275a29d 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -18,7 +18,7 @@ struct ipc_ids {
unsigned short seq;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
-   int max_id;
+   int max_idx;
 #ifdef CONFIG_CHECKPOINT_RESTORE
int next_id;
 #endif
diff --git a/ipc/msg.c b/ipc/msg.c
index 130e12e6a8c6..1892bec0f1c8 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -455,7 +455,7 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
 int cmd, struct msginfo *msginfo)
 {
int err;
-   int max_id;
+   int max_idx;
 
/*
 * We must not return kernel stack data.
@@ -482,9 +482,9 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
msginfo->msgpool = MSGPOOL;
msginfo->msgtql = MSGTQL;
}
-   max_id = ipc_get_maxid(_ids(ns));
+   max_idx = ipc_get_maxidx(_ids(ns));
up_read(_ids(ns).rwsem);
-   return (max_id < 0) ? 0 : max_id;
+   return (max_idx < 0) ? 0 : max_idx;
 }
 
 static int msgctl_stat(struct ipc_namespace *ns, int msqid,
diff --git a/ipc/sem.c b/ipc/sem.c
index 671d8703b130..f98962b06024 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1293,7 +1293,7 @@ static int semctl_info(struct ipc_namespace *ns, int 
semid,
 int cmd, void __user *p)
 {
struct seminfo seminfo;
-   int max_id;
+   int max_idx;
int err;
 
err = security_sem_semctl(NULL, cmd);
@@ -1317,11 +1317,11 @@ static int semctl_info(struct ipc_namespace *ns, int 
semid,
seminfo.semusz = SEMUSZ;
seminfo.semaem = SEMAEM;
}
-   max_id = ipc_get_maxid(_ids(ns));
+   max_idx = ipc_get_maxidx(_ids(ns));
up_read(_ids(ns).rwsem);
if (copy_to_user(p, , sizeof(struct seminfo)))
return -EFAULT;
-   return (max_id < 0) ? 0 : max_id;
+   return (max_idx < 0) ? 0 : max_idx;
 }
 
 static int semctl_setval(struct ipc_namespace *ns, int semid, int semnum,
diff --git a/ipc/shm.c b/ipc/shm.c
index d388d6e744c0..a4e9a1b34595 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -935,7 +935,7 @@ static int shmctl_ipc_info(struct ipc_namespace *ns,
shminfo->shmall = ns->shm_ctlall;
shminfo->shmmin = SHMMIN;
down_read(_ids(ns).rwsem);
-   err = ipc_get_maxid(_ids(ns));
+   err = ipc_get_maxidx(_ids(ns));
up_read(_ids(ns).rwsem);
if (err < 0)
err = 0;
@@ -955,7 +955,7 @@ static int shmctl_shm_info(struct ipc_namespace *ns,
shm_info->shm_tot = ns->shm_tot;
shm_info->swap_attempts = 0;
shm_info->swap_successes = 0;
-   err = ipc_get_maxid(_ids(ns));
+   err = ipc_get_maxidx(_ids(ns));
up_read(_ids(ns).rwsem);
if (err < 0)
err = 0;
diff --git a/ipc/util.c b/ipc/util.c
index 35621be0d945..fb69c911655a 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -118,7 +118,7 @@ void ipc_init_ids(struct ipc_ids *ids)
init_rwsem(>rwsem);
rhashtable_init(>key_ht, _kht_params);
idr_init(>ipcs_idr);
-   ids->max_id = -1;
+   ids->max_idx = -1;
 #ifdef CONFIG_CHECKPOINT_RESTORE
ids->next_id = -1;
 #endif
@@ -236,7 +236,7 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, struct 
kern_ipc_perm *new)
  * @limit: limit for the number of used ids
  *
  * Add an entry 'new' to the ipc ids idr. The permissions object is
- * initialised and the first free entry is set up and the id assigned
+ * initialised and the first free entry is set up and the index assigned
  * is returned. The 'new' entry is returned in a locked state on success.
  *
  * On failure the entry is not locked and a negative err-code is returned.
@@ -290,8 +290,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
}
 

[PATCH 12/12] ipc/util.c: update return value of ipc_getref from int to bool

2018-07-12 Thread Manfred Spraul
ipc_getref still has a return value of type "int", matching the atomic_t
interface of atomic_inc_not_zero()/atomic_add_unless().

ipc_getref now uses refcount_inc_not_zero, which has a return value of
type "bool".

Therefore: Update the return code to avoid implicit conversions.

Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 2 +-
 ipc/util.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index fb69c911655a..6306eb25180b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -461,7 +461,7 @@ void ipc_set_key_private(struct ipc_ids *ids, struct 
kern_ipc_perm *ipcp)
ipcp->key = IPC_PRIVATE;
 }
 
-int ipc_rcu_getref(struct kern_ipc_perm *ptr)
+bool ipc_rcu_getref(struct kern_ipc_perm *ptr)
 {
return refcount_inc_not_zero(>refcount);
 }
diff --git a/ipc/util.h b/ipc/util.h
index e74564fe3375..0a159f69b3bb 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -138,7 +138,7 @@ static inline int ipc_get_maxidx(struct ipc_ids *ids)
  * refcount is initialized by ipc_addid(), before that point call_rcu()
  * must be used.
  */
-int ipc_rcu_getref(struct kern_ipc_perm *ptr);
+bool ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
void (*func)(struct rcu_head *head));
 
-- 
2.17.1



[PATCH 0/12 V3] ipc: cleanups & bugfixes, rhashtable update

2018-07-12 Thread Manfred Spraul
Hi,

I have added all review findings and rediffed the patches

- patch #1-#6: Fix syzcall findings & further race cleanups
patch #1 has an updated subject/comment
patch #2 contains the squashed result of Dmitry's change and my
own updates.
patch #6 is replaced by the proposal from Davidlohr
- patch #7-#10: rhashtable improvement from Davidlohr
- patch #11: A variable rename patch: id/lid/idx/uid were a mess
- patch #12: change a return code from int to bool, side effect of the
refcount_t introduction.

@Andrew:
Can you merge the patches into -mm/next?

I have not seen any issues in my tests.

--
Manfred


[PATCH 02/12] ipc: reorganize initialization of kern_ipc_perm.seq

2018-07-12 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.seq after having called
idr_alloc() (within ipc_idr_alloc()).

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
may see an uninitialized value.

The patch moves the initialization of kern_ipc_perm.seq before the
calls of idr_alloc().

Notes:
1) This patch has a user space visible side effect:
If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
if semget()/msgget()/shmget() fails in the final step of adding the id
to the rhash tree, then .._next_id is cleared. Before the patch, it
remained unmodified. (A small usage sketch of these files follows below.)

There is no change of the behavior after a successful ..get() call:
It always clears .._next_id, there is no impact to non checkpoint/restore
code as that code does not use .._next_id.

2) The patch correctly documents that after a call to ipc_idr_alloc(),
the full tear-down sequence must be used. The callers of ipc_addid()
do not fullfill that, i.e. more bugfixes are required.

The patch is a squash of a patch from Dmitry and my own changes.
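
For reference, a minimal sketch of how checkpoint/restore style code
uses the *_next_id files mentioned in note 1 (illustration only:
restore_sem_id() is a made-up helper, error handling is trimmed):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * Illustration: ask the kernel for a specific semaphore id.
 * The kernel does not guarantee the requested id, so the caller must
 * compare the id returned by semget() against the wanted one.
 */
static int restore_sem_id(key_t key, int nsems, int wanted_id)
{
	FILE *f = fopen("/proc/sys/kernel/sem_next_id", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", wanted_id);
	fclose(f);

	return semget(key, nsems, IPC_CREAT | 0600);
}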

Reported-by: syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com
Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
Cc: Michael Kerrisk 
---
 Documentation/sysctl/kernel.txt |  3 +-
 ipc/util.c  | 91 +
 2 files changed, 50 insertions(+), 44 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index eded671d55eb..b2d4a8f8fe97 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -440,7 +440,8 @@ Notes:
 1) kernel doesn't guarantee, that new object will have desired id. So,
 it's up to userspace, how to handle an object with "wrong" id.
 2) Toggle with non-default value will be set back to -1 by kernel after
-successful IPC object allocation.
+successful IPC object allocation. If an IPC object allocation syscall
+fails, it is undefined if the value remains unmodified or is reset to -1.
 
 ==
 
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182fa0ac..4998f8fa8ce0 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -193,46 +193,54 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
return NULL;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
 /*
- * Specify desired id for next allocated IPC object.
+ * Insert new IPC object into idr tree, and set sequence number and id
+ * in the correct order.
+ * Especially:
+ * - the sequence number must be set before inserting the object into the idr,
+ *   because the sequence number is accessed without a lock.
+ * - the id can/must be set after inserting the object into the idr.
+ *   All accesses must be done after getting kern_ipc_perm.lock.
+ *
+ * The caller must own kern_ipc_perm.lock.of the new object.
+ * On error, the function returns a (negative) error code.
  */
-#define ipc_idr_alloc(ids, new)
\
-   idr_alloc(&(ids)->ipcs_idr, (new),  \
- (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
- 0, GFP_NOWAIT)
-
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
- struct kern_ipc_perm *new)
+static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
-   if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
+   int idx, next_id = -1;
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   next_id = ids->next_id;
+   ids->next_id = -1;
+#endif
+
+   /*
+* As soon as a new object is inserted into the idr,
+* ipc_obtain_object_idr() or ipc_obtain_object_check() can find it,
+* and the lockless preparations for ipc operations can start.
+* This means especially: permission checks, audit calls, allocation
+* of undo structures, ...
+*
+* Thus the object must be fully initialized, and if something fails,
+* then the full tear-down sequence must be followed.
+* (i.e.: set new->deleted, reduce refcount, call_rcu())
+*/
+
+   if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
+   idx = idr_alloc(>ipcs_idr, new, 0, 0, GFP_NOWAIT);
} else {
-   new->seq = ipcid_to_seqx(ids->next_id);
-   ids->next_id = -1;
+   new->seq = ipcid_to_seqx(next_id);
+   idx = idr_alloc(>ipcs_idr, new, ipcid_to_idx(next_id),
+   0, GFP_NOWAIT);
}
-
-   return SEQ_MULTIPLIER * new->seq + id;
+   if (idx >= 0)
+   new->id = SEQ_MULTIPLIER * new->seq + idx;
+   return idx;
 }
 
-#else
-#define ipc_idr_alloc(ids, new)

Re: [PATCH 0/12 V2] ipc: cleanups & bugfixes, rhashtable update

2018-07-09 Thread Manfred Spraul

Hi Davidlohr,

On 07/09/2018 10:09 PM, Davidlohr Bueso wrote:

On Mon, 09 Jul 2018, Manfred Spraul wrote:


@Davidlohr:
Please double check that I have taken the correct patches, and
that I didn't break anything.


Everything seems ok.

Patch 8 had an alternative patch that didn't change nowarn semantics for
the rhashtable resizing operations (https://lkml.org/lkml/2018/6/22/732),
but nobody complained about the one you picked up (and also has 
Michal's ack).



Which patch do you prefer?
I have seen two versions, and if I have picked up the wrong one, then I 
can change it.


--
    Manfred


Re: [PATCH 12/12] ipc/util.c: Further ipc_idr_alloc cleanups.

2018-07-09 Thread Manfred Spraul

Hello Dmitry,

On 07/09/2018 07:05 PM, Dmitry Vyukov wrote:

On Mon, Jul 9, 2018 at 5:10 PM, Manfred Spraul  wrote:

If idr_alloc within ipc_idr_alloc fails, then the return value (-ENOSPC)
is used to calculate new->id.
Technically, this is not a bug, because new->id is never accessed.

But: Clean it up anyways: On error, just return, do not set new->id.
And improve the documentation.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
---
  ipc/util.c | 22 --
  1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index d474f2b3b299..302c18fc846b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -182,11 +182,20 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
  }

  /*
- * Specify desired id for next allocated IPC object.
+ * Insert new IPC object into idr tree, and set sequence number and id
+ * in the correct order.
+ * Especially:
+ * - the sequence number must be set before inserting the object into the idr,
+ *   because the sequence number is accessed without a lock.
+ * - the id can/must be set after inserting the object into the idr.
+ *   All accesses must be done after getting kern_ipc_perm.lock.
+ *
+ * The caller must own kern_ipc_perm.lock.of the new object.
+ * On error, the function returns a (negative) error code.
   */
  static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm 
*new)
  {
-   int key, next_id = -1;
+   int id, next_id = -1;

/\/\/\/\
Looks good to me. I was also confused by how key transforms into id,
and then key name is used for something else.
Let's see if there are further findings; perhaps I'll rework the series.
It may make sense to standardize the variable names:


id: user space id. Called semid, shmid, msgid if the type is known.
    Most functions use "id" already.
    Exception: ipc_checkid(), where the variable is called uid.
idx: "index" for the idr lookup
    Right now, ipc_rmid() uses lid, ipc_addid() uses id as the variable name
seq: sequence counter, to avoid quick collisions of the user space id
    In the comments, it appears as a mixture of "sequence counter" and
"sequence number".

key: user space key, used for the rhash tree


  #ifdef CONFIG_CHECKPOINT_RESTORE
 next_id = ids->next_id;
@@ -197,14 +206,15 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 new->seq = ids->seq++;
 if (ids->seq > IPCID_SEQ_MAX)
 ids->seq = 0;
-   key = idr_alloc(>ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   id = idr_alloc(>ipcs_idr, new, 0, 0, GFP_NOWAIT);
 } else {
 new->seq = ipcid_to_seqx(next_id);
-   key = idr_alloc(>ipcs_idr, new, ipcid_to_idx(next_id),
+   id = idr_alloc(>ipcs_idr, new, ipcid_to_idx(next_id),
 0, GFP_NOWAIT);
 }
-   new->id = SEQ_MULTIPLIER * new->seq + key;
-   return key;
+   if (id >= 0)
+   new->id = SEQ_MULTIPLIER * new->seq + id;

We still initialize seq in this case. I guess it's ok because the
object is not published at all. But if we are doing this, then perhaps
store seq into a local var first and then:

   if (id >= 0) {
   new->id = SEQ_MULTIPLIER * seq + id;
    new->seq = seq;
   }

?

No!!!
We must initialize ->seq before publication. Otherwise we end up with 
the syzcall findings, or in the worst case a strange rare failure of an 
ipc operation.
The difference between ->id and ->seq is that we have the valid number 
for ->seq.
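
To make that concrete, a sketch of the lockless reader side (not the
exact kernel code; example_lookup() is a made-up name, the caller is
assumed to hold rcu_read_lock()):

/*
 * Illustration: the object becomes findable as soon as idr_alloc()
 * publishes it, therefore ->seq must already hold its final value at
 * that point; the reader compares it against the id it was given.
 */
static struct kern_ipc_perm *example_lookup(struct ipc_ids *ids, int id)
{
	struct kern_ipc_perm *ipcp;

	ipcp = idr_find(&ids->ipcs_idr, id % SEQ_MULTIPLIER);
	if (ipcp && id / SEQ_MULTIPLIER != ipcp->seq)
		return NULL;	/* seq mismatch: stale or not yet valid id */

	return ipcp;
}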


For the user space ID we cannot have the valid number unless the 
idr_alloc is successful.

The patch only avoids that this line is executed:


new->id = SEQ_MULTIPLIER * new->seq + (-ENOSPC)


As I wrote, the line shouldn't cause any damage, the code is more or less:

new->id = SEQ_MULTIPLIER * new->seq + (-ENOSPC)
kfree(new);

But this is ugly, it asks for problems.

--
Manfred



Re: [PATCH 12/12] ipc/util.c: Further ipc_idr_alloc cleanups.

2018-07-09 Thread Manfred Spraul

Hello Dmitry,

On 07/09/2018 07:05 PM, Dmitry Vyukov wrote:

On Mon, Jul 9, 2018 at 5:10 PM, Manfred Spraul  wrote:

If idr_alloc within ipc_idr_alloc fails, then the return value (-ENOSPC)
is used to calculate new->id.
Technically, this is not a bug, because new->id is never accessed.

But: Clean it up anyways: On error, just return, do not set new->id.
And improve the documentation.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
---
  ipc/util.c | 22 --
  1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index d474f2b3b299..302c18fc846b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -182,11 +182,20 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
  }

  /*
- * Specify desired id for next allocated IPC object.
+ * Insert new IPC object into idr tree, and set sequence number and id
+ * in the correct order.
+ * Especially:
+ * - the sequence number must be set before inserting the object into the idr,
+ *   because the sequence number is accessed without a lock.
+ * - the id can/must be set after inserting the object into the idr.
+ *   All accesses must be done after getting kern_ipc_perm.lock.
+ *
+ * The caller must own kern_ipc_perm.lock.of the new object.
+ * On error, the function returns a (negative) error code.
   */
  static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm 
*new)
  {
-   int key, next_id = -1;
+   int id, next_id = -1;

/\/\/\/\
Looks good to me. I was also confused by how key transforms into id,
and then key name is used for something else.
Let's see if there are further findings, perhaps I'll rework the series, 
it may make sense to standardize the variable names:


id: user space id. Called semid, shmid, msgid if the type is known.
    Most functions use "id" already.
    Exception: ipc_checkid(), the function calls is uid.
idx: "index" for the idr lookup
    Right now, ipc_rmid() use lid, ipc_addid() use id as variable name
seq: sequence counter, to avoid quick collisions of the user space id
    In the comments, it got a mixture of sequence counter and sequence 
number.

key: user space key, used for the rhash tree


  #ifdef CONFIG_CHECKPOINT_RESTORE
 next_id = ids->next_id;
@@ -197,14 +206,15 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 new->seq = ids->seq++;
 if (ids->seq > IPCID_SEQ_MAX)
 ids->seq = 0;
-   key = idr_alloc(>ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   id = idr_alloc(>ipcs_idr, new, 0, 0, GFP_NOWAIT);
 } else {
 new->seq = ipcid_to_seqx(next_id);
-   key = idr_alloc(>ipcs_idr, new, ipcid_to_idx(next_id),
+   id = idr_alloc(>ipcs_idr, new, ipcid_to_idx(next_id),
 0, GFP_NOWAIT);
 }
-   new->id = SEQ_MULTIPLIER * new->seq + key;
-   return key;
+   if (id >= 0)
+   new->id = SEQ_MULTIPLIER * new->seq + id;

We still initialize seq in this case. I guess it's ok because the
object is not published at all. But if we are doing this, then perhaps
store seq into a local var first and then:

   if (id >= 0) {
   new->id = SEQ_MULTIPLIER * seq + id;
   new->seq = seq:
   }

?

No!!!
We must initialize ->seq before publication. Otherwise we end up with 
the syzcall findings, or in the worst case a strange rare failure of an 
ipc operation.
The difference between ->id and ->seq is that we have the valid number 
for ->seq.


For the user space ID we cannot have the valid number unless the 
idr_alloc is successful.

The patch only avoids that this line is executed:


new->id = SEQ_MULTIPLIER * new->seq + (-ENOSPC)


As I wrote, the line shouldn't cause any damage, the code is more or less:

new->id = SEQ_MULTIPLIER * new->seq + (-ENOSPC)
kfree(new);

But this is ugly; it invites problems.

--
Manfred



[PATCH 01/12] ipc: reorganize initialization of kern_ipc_perm.id

2018-07-09 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.id after having called
ipc_idr_alloc().

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_idr()
may see an uninitialized value.

The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with kern_ipc_perm.seq

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 19 ++-
 ipc/sem.c | 18 +-
 ipc/shm.c | 19 ++-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 3b6545302598..829c2062ded4 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -491,7 +491,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 int cmd, struct msqid64_ds *p)
 {
struct msg_queue *msq;
-   int id = 0;
int err;
 
memset(p, 0, sizeof(*p));
@@ -503,7 +502,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
err = PTR_ERR(msq);
goto out_unlock;
}
-   id = msq->q_perm.id;
} else { /* IPC_STAT */
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
@@ -548,10 +546,21 @@ static int msgctl_stat(struct ipc_namespace *ns, int 
msqid,
p->msg_lspid  = pid_vnr(msq->q_lspid);
p->msg_lrpid  = pid_vnr(msq->q_lrpid);
 
-   ipc_unlock_object(&msq->q_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* MSG_STAT and MSG_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = msq->q_perm.id;
+   }
 
+   ipc_unlock_object(&msq->q_perm);
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/sem.c b/ipc/sem.c
index 5af1943ad782..e8971fa1d847 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1222,7 +1222,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 {
struct sem_array *sma;
time64_t semotime;
-   int id = 0;
int err;
 
memset(semid64, 0, sizeof(*semid64));
@@ -1234,7 +1233,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
err = PTR_ERR(sma);
goto out_unlock;
}
-   id = sma->sem_perm.id;
} else { /* IPC_STAT */
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
@@ -1274,10 +1272,20 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 #endif
semid64->sem_nsems = sma->sem_nsems;
 
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SEM_STAT and SEM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = sma->sem_perm.id;
+   }
ipc_unlock_object(&sma->sem_perm);
-   rcu_read_unlock();
-   return id;
-
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/shm.c b/ipc/shm.c
index 051a3e1fb8df..59fe8b3b3794 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -949,7 +949,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
int cmd, struct shmid64_ds *tbuf)
 {
struct shmid_kernel *shp;
-   int id = 0;
int err;
 
memset(tbuf, 0, sizeof(*tbuf));
@@ -961,7 +960,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
err = PTR_ERR(shp);
goto out_unlock;
}
-   id = shp->shm_perm.id;
} else { /* IPC_STAT */
shp = shm_obtain_object_check(ns, shmid);
if (IS_ERR(shp)) {
@@ -1011,10 +1009,21 @@ static int shmctl_stat(struct ipc_namespace *ns, int 
shmid,
tbuf->shm_lpid  = pid_vnr(shp->shm_lprid);
tbuf->shm_nattch = shp->shm_nattch;
 
-   ipc_unlock_object(&shp->shm_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SHM_STAT and SHM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = shp->shm_perm.id;
+   }
 
+   ipc_unlock_object(>shm_pe
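
A minimal user-space sketch of the return value convention the hunks above
implement (illustration only, not part of the patch; error handling kept
minimal):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int main(void)
{
	struct msqid_ds ds;
	int id, ret;

	id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
	if (id < 0) {
		perror("msgget");
		return 1;
	}

	/* IPC_STAT: as defined in SUS, success is reported as 0 */
	ret = msgctl(id, IPC_STAT, &ds);
	printf("IPC_STAT returned %d (expected: 0)\n", ret);

	/*
	 * MSG_STAT and MSG_STAT_ANY (both Linux specific) instead take the
	 * kernel-internal index and return the full id, sequence counter
	 * included, i.e. the same value msgget() handed back here.
	 */
	printf("full id from msgget(): %d\n", id);

	msgctl(id, IPC_RMID, NULL);
	return 0;
}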

[PATCH 08/12] lib/rhashtable: simplify bucket_table_alloc()

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

As of commit ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
incompatible gfp flags") we can simplify the caller and trust kvzalloc() to
just do the right thing. For the case of the GFP_ATOMIC context, we can
drop the __GFP_NORETRY flag for obvious reasons; for the __GFP_NOWARN
case, the caller now passes the flag instead of having
bucket_table_alloc() add it.

This slightly changes the gfp flags passed on to nested_table_alloc() as
it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this a
positive consequence as for the same reasons we want nowarn semantics in
bucket_table_alloc().

Signed-off-by: Davidlohr Bueso 
Acked-by: Michal Hocko 

(commit id extended to 12 digits, line wraps updated)
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9427b5766134..083f871491a1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -175,10 +175,7 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
int i;
 
size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
-   if (gfp != GFP_KERNEL)
-   tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
-   else
-   tbl = kvzalloc(size, gfp);
+   tbl = kvzalloc(size, gfp);
 
size = nbuckets;
 
@@ -459,7 +456,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 
err = -ENOMEM;
 
-   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
+   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC | __GFP_NOWARN);
if (new_tbl == NULL)
goto fail;
 
-- 
2.17.1



[PATCH 07/12] ipc_idr_alloc refactoring

2018-07-09 Thread Manfred Spraul
From: Dmitry Vyukov 

ipc_idr_alloc refactoring

Signed-off-by: Dmitry Vyukov 
Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 51 +--
 1 file changed, 13 insertions(+), 38 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 8bc166bb4981..a41b8a69de13 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -193,52 +193,32 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
return NULL;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * Specify desired id for next allocated IPC object.
  */
-static inline int ipc_idr_alloc(struct ipc_ids *ids,
-   struct kern_ipc_perm *new)
+static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
-   int key;
+   int key, next_id = -1;
 
-   if (ids->next_id < 0) {
-   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
-   } else {
-   key = idr_alloc(&ids->ipcs_idr, new,
-   ipcid_to_idx(ids->next_id),
-   0, GFP_NOWAIT);
-   ids->next_id = -1;
-   }
-   return key;
-}
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   next_id = ids->next_id;
+   ids->next_id = -1;
+#endif
 
-static inline void ipc_set_seq(struct ipc_ids *ids,
-   struct kern_ipc_perm *new)
-{
-   if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
+   if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
+   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
} else {
-   new->seq = ipcid_to_seqx(ids->next_id);
+   new->seq = ipcid_to_seqx(next_id);
+   key = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
+   0, GFP_NOWAIT);
}
+   new->id = SEQ_MULTIPLIER * new->seq + key;
+   return key;
 }
 
-#else
-#define ipc_idr_alloc(ids, new)\
-   idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
-
-static inline void ipc_set_seq(struct ipc_ids *ids,
- struct kern_ipc_perm *new)
-{
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
-}
-
-#endif /* CONFIG_CHECKPOINT_RESTORE */
-
 /**
  * ipc_addid - add an ipc identifier
  * @ids: ipc identifier set
@@ -278,8 +258,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
current_euid_egid(&euid, &egid);
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;
-
-   ipc_set_seq(ids, new);
new->deleted = false;
 
/*
@@ -317,9 +295,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
ids->in_use++;
if (id > ids->max_id)
ids->max_id = id;
-
-   new->id = SEQ_MULTIPLIER * new->seq + id;
-
return id;
 }
 
-- 
2.17.1



[PATCH 09/12] lib/rhashtable: guarantee initial hashtable allocation

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

rhashtable_init() may fail due to -ENOMEM, thus making the
entire api unusable. This patch removes this scenario,
however unlikely. In order to guarantee memory allocation,
this patch always ends up doing GFP_KERNEL|__GFP_NOFAIL
for both the tbl as well as alloc_bucket_spinlocks().

Upon the first table allocation failure, we shrink the
size to the smallest value that makes sense and retry with
__GFP_NOFAIL semantics. With the defaults, this means that
from 64 buckets, we retry with only 4. Any later issues
regarding performance due to collisions or larger table
resizing (when more memory becomes available) are the least
of our problems.

Signed-off-by: Davidlohr Bueso 
Acked-by: Herbert Xu 
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 083f871491a1..0026cf3e3f27 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -179,10 +179,11 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
 
size = nbuckets;
 
-   if (tbl == NULL && gfp != GFP_KERNEL) {
+   if (tbl == NULL && (gfp & ~__GFP_NOFAIL) != GFP_KERNEL) {
tbl = nested_bucket_table_alloc(ht, nbuckets, gfp);
nbuckets = 0;
}
+
if (tbl == NULL)
return NULL;
 
@@ -1065,9 +1066,16 @@ int rhashtable_init(struct rhashtable *ht,
}
}
 
+   /*
+* This is api initialization and thus we need to guarantee the
+* initial rhashtable allocation. Upon failure, retry with the
+* smallest possible size with __GFP_NOFAIL semantics.
+*/
tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
-   if (tbl == NULL)
-   return -ENOMEM;
+   if (unlikely(tbl == NULL)) {
+   size = max_t(u16, ht->p.min_size, HASH_MIN_SIZE);
+   tbl = bucket_table_alloc(ht, size, GFP_KERNEL | __GFP_NOFAIL);
+   }
 
atomic_set(&ht->nelems, 0);
 
-- 
2.17.1
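
The shrink-and-retry idea is simple enough to show in isolation; a
hypothetical user-space analogue (user space has no __GFP_NOFAIL, so only
the "fall back to the minimum size" half is illustrated; names made up):

#include <stdio.h>
#include <stdlib.h>

#define PREFERRED_BUCKETS 64	/* default initial size in the patch */
#define MIN_BUCKETS 4		/* smallest size that still makes sense */

/* Try the preferred table size first; on failure retry with the minimum. */
static void *table_alloc(size_t *nbuckets)
{
	void *tbl = calloc(PREFERRED_BUCKETS, sizeof(void *));

	if (tbl) {
		*nbuckets = PREFERRED_BUCKETS;
		return tbl;
	}
	*nbuckets = MIN_BUCKETS;
	return calloc(MIN_BUCKETS, sizeof(void *));
}

int main(void)
{
	size_t n;
	void *tbl = table_alloc(&n);

	printf("allocated a table with %zu buckets\n", tbl ? n : 0);
	free(tbl);
	return 0;
}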



[PATCH 06/12] ipc: rename ipc_lock() to ipc_lock_idr()

2018-07-09 Thread Manfred Spraul
ipc/util.c contains multiple functions to get the ipc object
pointer given an id number.

There are two sets of functions: one set verifies the sequence
counter part of the id number, the other does not check
the sequence counter.

The standard for function names in ipc/util.c is
- ..._check() functions verify the sequence counter
- ..._idr() functions do not verify the sequence counter

ipc_lock() is an exception: It does not verify the sequence
counter value, but this is not obvious from the function name.

Therefore: Rename the function to ipc_lock_idr(), to make it
obvious that it does not check the sequence counter.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 ipc/shm.c  |  4 ++--
 ipc/util.c | 10 ++
 ipc/util.h |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 426ba1039a7b..cd8655c7bb77 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -179,11 +179,11 @@ static inline struct shmid_kernel 
*shm_obtain_object_check(struct ipc_namespace
  */
 static inline struct shmid_kernel *shm_lock(struct ipc_namespace *ns, int id)
 {
-   struct kern_ipc_perm *ipcp = ipc_lock(&shm_ids(ns), id);
+   struct kern_ipc_perm *ipcp = ipc_lock_idr(&shm_ids(ns), id);
 
/*
 * Callers of shm_lock() must validate the status of the returned ipc
-* object pointer (as returned by ipc_lock()), and error out as
+* object pointer (as returned by ipc_lock_idr()), and error out as
 * appropriate.
 */
if (IS_ERR(ipcp))
diff --git a/ipc/util.c b/ipc/util.c
index 8133f10832a9..8bc166bb4981 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -604,15 +604,17 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct 
ipc_ids *ids, int id)
 }
 
 /**
- * ipc_lock - lock an ipc structure without rwsem held
+ * ipc_lock_idr - lock an ipc structure without rwsem held
  * @ids: ipc identifier set
  * @id: ipc id to look for
  *
  * Look for an id in the ipc ids idr and lock the associated ipc object.
+ * The function does not check if the sequence counter matches the
+ * found ipc object.
  *
  * The ipc object is locked on successful exit.
  */
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
+struct kern_ipc_perm *ipc_lock_idr(struct ipc_ids *ids, int id)
 {
struct kern_ipc_perm *out;
 
@@ -624,8 +626,8 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
spin_lock(&out->lock);
 
/*
-* ipc_rmid() may have already freed the ID while ipc_lock()
-* was spinning: here verify that the structure is still valid.
+* ipc_rmid() may have already freed the ID while waiting for
+* the lock. Here verify that the structure is still valid.
 * Upon races with RMID, return -EIDRM, thus indicating that
 * the ID points to a removed identifier.
 */
diff --git a/ipc/util.h b/ipc/util.h
index fcf81425ae98..25d8ee052ac9 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -142,7 +142,7 @@ int ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
void (*func)(struct rcu_head *head));
 
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
+struct kern_ipc_perm *ipc_lock_idr(struct ipc_ids *ids, int id);
 struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
-- 
2.17.1



[PATCH 10/12] ipc: get rid of ids->tables_initialized hack

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

In sysvipc we have an ids->tables_initialized regarding the
rhashtable, introduced in:

commit 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots of keys")

It's there, specifically, to prevent nil pointer dereferences
from using an uninitialized API. Considering how rhashtable_init()
can fail (probably due to ENOMEM, if anything), this made the
overall ipc initialization capable of failure as well. That alone
is ugly but fine; however, I've spotted a few issues regarding the
semantics of tables_initialized (however unlikely they may be):

- There is inconsistency in what we return to userspace: ipc_addid()
returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
returns EINVAL.

- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc structure
isn't found, and can therefore call into ipcget() callbacks.

Now that rhashtable initialization cannot fail, we can properly
get rid of the hack altogether.

Signed-off-by: Davidlohr Bueso 

(commit id extended to 12 digits)
Signed-off-by: Manfred Spraul 
---
 include/linux/ipc_namespace.h |  1 -
 ipc/util.c| 23 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8eb2f3..37f3a4b7c637 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,7 +16,6 @@ struct user_namespace;
 struct ipc_ids {
int in_use;
unsigned short seq;
-   bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
int max_id;
diff --git a/ipc/util.c b/ipc/util.c
index a41b8a69de13..ae485b41ea0b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -125,7 +125,6 @@ int ipc_init_ids(struct ipc_ids *ids)
if (err)
return err;
idr_init(&ids->ipcs_idr);
-   ids->tables_initialized = true;
ids->max_id = -1;
 #ifdef CONFIG_CHECKPOINT_RESTORE
ids->next_id = -1;
@@ -178,19 +177,16 @@ void __init ipc_init_proc_interface(const char *path, 
const char *header,
  */
 static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
 {
-   struct kern_ipc_perm *ipcp = NULL;
+   struct kern_ipc_perm *ipcp;
 
-   if (likely(ids->tables_initialized))
-   ipcp = rhashtable_lookup_fast(&ids->key_ht, &key,
+   ipcp = rhashtable_lookup_fast(&ids->key_ht, &key,
  ipc_kht_params);
+   if (!ipcp)
+   return NULL;
 
-   if (ipcp) {
-   rcu_read_lock();
-   ipc_lock_object(ipcp);
-   return ipcp;
-   }
-
-   return NULL;
+   rcu_read_lock();
+   ipc_lock_object(ipcp);
+   return ipcp;
 }
 
 /*
@@ -246,7 +242,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
if (limit > IPCMNI)
limit = IPCMNI;
 
-   if (!ids->tables_initialized || ids->in_use >= limit)
+   if (ids->in_use >= limit)
return -ENOSPC;
 
idr_preload(GFP_KERNEL);
@@ -568,9 +564,6 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id)
struct kern_ipc_perm *out;
int lid = ipcid_to_idx(id);
 
-   if (unlikely(!ids->tables_initialized))
-   return ERR_PTR(-EINVAL);
-
out = idr_find(&ids->ipcs_idr, lid);
if (!out)
return ERR_PTR(-EINVAL);
-- 
2.17.1



[PATCH 11/12] ipc: simplify ipc initialization

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

Now that we know that rhashtable_init() will not fail, we
can get rid of a lot of the unnecessary cleanup paths when
the call errored out.

Signed-off-by: Davidlohr Bueso 

(variable name added to util.h to resolve checkpatch warning)
Signed-off-by: Manfred Spraul 
---
 ipc/msg.c   |  9 -
 ipc/namespace.c | 20 
 ipc/sem.c   | 10 --
 ipc/shm.c   |  9 -
 ipc/util.c  | 18 +-
 ipc/util.h  | 18 +-
 6 files changed, 30 insertions(+), 54 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index ba85d8849e8d..346230712259 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -1237,7 +1237,7 @@ COMPAT_SYSCALL_DEFINE5(msgrcv, int, msqid, compat_uptr_t, 
msgp,
 }
 #endif
 
-int msg_init_ns(struct ipc_namespace *ns)
+void msg_init_ns(struct ipc_namespace *ns)
 {
ns->msg_ctlmax = MSGMAX;
ns->msg_ctlmnb = MSGMNB;
@@ -1245,7 +1245,7 @@ int msg_init_ns(struct ipc_namespace *ns)
 
atomic_set(&ns->msg_bytes, 0);
atomic_set(&ns->msg_hdrs, 0);
-   return ipc_init_ids(&ns->ids[IPC_MSG_IDS]);
+   ipc_init_ids(&ns->ids[IPC_MSG_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -1286,12 +1286,11 @@ static int sysvipc_msg_proc_show(struct seq_file *s, 
void *it)
 }
 #endif
 
-int __init msg_init(void)
+void __init msg_init(void)
 {
-   const int err = msg_init_ns(&init_ipc_ns);
+   msg_init_ns(&init_ipc_ns);
 
ipc_init_proc_interface("sysvipc/msg",
"   key  msqid perms  cbytes   
qnum lspid lrpid   uid   gid  cuid  cgid  stime  rtime  ctime\n",
IPC_MSG_IDS, sysvipc_msg_proc_show);
-   return err;
 }
diff --git a/ipc/namespace.c b/ipc/namespace.c
index f59a89966f92..21607791d62c 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -55,28 +55,16 @@ static struct ipc_namespace *create_ipc_ns(struct 
user_namespace *user_ns,
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;
 
-   err = sem_init_ns(ns);
+   err = mq_init_ns(ns);
if (err)
goto fail_put;
-   err = msg_init_ns(ns);
-   if (err)
-   goto fail_destroy_sem;
-   err = shm_init_ns(ns);
-   if (err)
-   goto fail_destroy_msg;
 
-   err = mq_init_ns(ns);
-   if (err)
-   goto fail_destroy_shm;
+   sem_init_ns(ns);
+   msg_init_ns(ns);
+   shm_init_ns(ns);
 
return ns;
 
-fail_destroy_shm:
-   shm_exit_ns(ns);
-fail_destroy_msg:
-   msg_exit_ns(ns);
-fail_destroy_sem:
-   sem_exit_ns(ns);
 fail_put:
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
diff --git a/ipc/sem.c b/ipc/sem.c
index 9742e9a1c0c2..f3de2f5e7b9b 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -220,14 +220,14 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define sc_semopm  sem_ctls[2]
 #define sc_semmni  sem_ctls[3]
 
-int sem_init_ns(struct ipc_namespace *ns)
+void sem_init_ns(struct ipc_namespace *ns)
 {
ns->sc_semmsl = SEMMSL;
ns->sc_semmns = SEMMNS;
ns->sc_semopm = SEMOPM;
ns->sc_semmni = SEMMNI;
ns->used_sems = 0;
-   return ipc_init_ids(&ns->ids[IPC_SEM_IDS]);
+   ipc_init_ids(&ns->ids[IPC_SEM_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -239,14 +239,12 @@ void sem_exit_ns(struct ipc_namespace *ns)
 }
 #endif
 
-int __init sem_init(void)
+void __init sem_init(void)
 {
-   const int err = sem_init_ns(&init_ipc_ns);
-
+   sem_init_ns(&init_ipc_ns);
ipc_init_proc_interface("sysvipc/sem",
"   key  semid perms  nsems   uid   
gid  cuid  cgid  otime  ctime\n",
IPC_SEM_IDS, sysvipc_sem_proc_show);
-   return err;
 }
 
 /**
diff --git a/ipc/shm.c b/ipc/shm.c
index cd8655c7bb77..1db4cf91f676 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -95,14 +95,14 @@ static void shm_destroy(struct ipc_namespace *ns, struct 
shmid_kernel *shp);
 static int sysvipc_shm_proc_show(struct seq_file *s, void *it);
 #endif
 
-int shm_init_ns(struct ipc_namespace *ns)
+void shm_init_ns(struct ipc_namespace *ns)
 {
ns->shm_ctlmax = SHMMAX;
ns->shm_ctlall = SHMALL;
ns->shm_ctlmni = SHMMNI;
ns->shm_rmid_forced = 0;
ns->shm_tot = 0;
-   return ipc_init_ids(&shm_ids(ns));
+   ipc_init_ids(&shm_ids(ns));
 }
 
 /*
@@ -135,9 +135,8 @@ void shm_exit_ns(struct ipc_namespace *ns)
 
 static int __init ipc_ns_init(void)
 {
-   const int err = shm_init_ns(&init_ipc_ns);
-   WARN(err, "ipc: sysv shm_init_ns failed: %d\n", err);
-   return err;
+   shm_init_ns(&init_ipc_ns);
+   return 0;
 }
 
 pure_initcall(ipc_ns_init);
diff --git a/ipc/util.c b/ipc/util.c
index ae485b41ea0b..d474f2b3b299 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -87,16 +87,12 @@ struct ipc_proc_

[PATCH 04/12] ipc: Rename ipcctl_pre_down_nolock().

2018-07-09 Thread Manfred Spraul
Both the comment and the name of ipcctl_pre_down_nolock()
are misleading: the function must be called while holding
the rw semaphore.
Therefore the patch renames the function to ipcctl_obtain_check():
This name matches the other names used in util.c:
- "obtain" functions look up a pointer in the idr, without
  acquiring the object lock.
- The caller is responsible for locking.
- _check means that the sequence number is checked.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 ipc/msg.c  | 2 +-
 ipc/sem.c  | 2 +-
 ipc/shm.c  | 2 +-
 ipc/util.c | 8 
 ipc/util.h | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 5bf5cb8017ea..ba85d8849e8d 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -385,7 +385,7 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, 
int cmd,
down_write(_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &msg_ids(ns), msqid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &msg_ids(ns), msqid, cmd,
  &msqid64->msg_perm, msqid64->msg_qbytes);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/sem.c b/ipc/sem.c
index 9d49efeac2e5..9742e9a1c0c2 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1595,7 +1595,7 @@ static int semctl_down(struct ipc_namespace *ns, int 
semid,
down_write(_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &sem_ids(ns), semid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &sem_ids(ns), semid, cmd,
  &semid64->sem_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/shm.c b/ipc/shm.c
index 06b7bf11a011..426ba1039a7b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -868,7 +868,7 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, 
int cmd,
down_write(_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &shm_ids(ns), shmid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &shm_ids(ns), shmid, cmd,
  &shmid64->shm_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/util.c b/ipc/util.c
index 8b09496ed720..bbb1ce212a0d 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -703,7 +703,7 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
 }
 
 /**
- * ipcctl_pre_down_nolock - retrieve an ipc and check permissions for some 
IPC_XXX cmd
+ * ipcctl_obtain_check - retrieve an ipc object and check permissions
  * @ns:  ipc namespace
  * @ids:  the table of ids where to look for the ipc
  * @id:   the id of the ipc to retrieve
@@ -713,16 +713,16 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
  *
  * This function does some common audit and permissions check for some IPC_XXX
  * cmd and is called from semctl_down, shmctl_down and msgctl_down.
- * It must be called without any lock held and:
  *
- *   - retrieves the ipc with the given id in the given table.
+ * It:
+ *   - retrieves the ipc object with the given id in the given table.
  *   - performs some audit and permission check, depending on the given cmd
  *   - returns a pointer to the ipc object or otherwise, the corresponding
  * error.
  *
  * Call holding the both the rwsem and the rcu read lock.
  */
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
struct ipc_ids *ids, int id, int cmd,
struct ipc64_perm *perm, int extra_perm)
 {
diff --git a/ipc/util.h b/ipc/util.h
index 0aba3230d007..fcf81425ae98 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -148,7 +148,7 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id);
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
 int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
 struct ipc_ids *ids, int id, int 
cmd,
 struct ipc64_perm *perm, int 
extra_perm);
 
-- 
2.17.1



[PATCH 12/12] ipc/util.c: Further ipc_idr_alloc cleanups.

2018-07-09 Thread Manfred Spraul
If idr_alloc within ipc_idr_alloc fails, then the return value (-ENOSPC)
is used to calculate new->id.
Technically, this is not a bug, because new->id is never accessed.

But: Clean it up anyways: On error, just return, do not set new->id.
And improve the documentation.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
---
 ipc/util.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index d474f2b3b299..302c18fc846b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -182,11 +182,20 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
 }
 
 /*
- * Specify desired id for next allocated IPC object.
+ * Insert new IPC object into idr tree, and set sequence number and id
+ * in the correct order.
+ * Especially:
+ * - the sequence number must be set before inserting the object into the idr,
+ *   because the sequence number is accessed without a lock.
+ * - the id can/must be set after inserting the object into the idr.
+ *   All accesses must be done after getting kern_ipc_perm.lock.
+ *
+ * The caller must own kern_ipc_perm.lock of the new object.
+ * On error, the function returns a (negative) error code.
  */
 static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
-   int key, next_id = -1;
+   int id, next_id = -1;
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
next_id = ids->next_id;
@@ -197,14 +206,15 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   id = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
} else {
new->seq = ipcid_to_seqx(next_id);
-   key = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
+   id = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
0, GFP_NOWAIT);
}
-   new->id = SEQ_MULTIPLIER * new->seq + key;
-   return key;
+   if (id >= 0)
+   new->id = SEQ_MULTIPLIER * new->seq + id;
+   return id;
 }
 
 /**
-- 
2.17.1



[PATCH 03/12] ipc/util.c: Use ipc_rcu_putref() for failues in ipc_addid()

2018-07-09 Thread Manfred Spraul
ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
  because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
  because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early,
and by modifying all callers.

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with reading kern_ipc_perm.seq,
here both read and write to already released memory could happen.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
---
 ipc/msg.c  |  2 +-
 ipc/sem.c  |  2 +-
 ipc/shm.c  |  2 ++
 ipc/util.c | 12 ++--
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 829c2062ded4..5bf5cb8017ea 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -162,7 +162,7 @@ static int newque(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks msq upon success. */
retval = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
if (retval < 0) {
-   call_rcu(&msq->q_perm.rcu, msg_rcu_free);
+   ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
return retval;
}
 
diff --git a/ipc/sem.c b/ipc/sem.c
index e8971fa1d847..9d49efeac2e5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -556,7 +556,7 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks sma upon success. */
retval = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
if (retval < 0) {
-   call_rcu(&sma->sem_perm.rcu, sem_rcu_free);
+   ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
return retval;
}
ns->used_sems += nsems;
diff --git a/ipc/shm.c b/ipc/shm.c
index 59fe8b3b3794..06b7bf11a011 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -671,6 +671,8 @@ static int newseg(struct ipc_namespace *ns, struct 
ipc_params *params)
if (is_file_hugepages(file) && shp->mlock_user)
user_shm_unlock(size, shp->mlock_user);
fput(file);
+   ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+   return error;
 no_file:
call_rcu(&shp->shm_perm.rcu, shm_rcu_free);
return error;
diff --git a/ipc/util.c b/ipc/util.c
index 662c28c6c9fa..8b09496ed720 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -248,7 +248,9 @@ static inline void ipc_set_seq(struct ipc_ids *ids,
  * Add an entry 'new' to the ipc ids idr. The permissions object is
  * initialised and the first free entry is set up and the id assigned
  * is returned. The 'new' entry is returned in a locked state on success.
+ *
  * On failure the entry is not locked and a negative err-code is returned.
+ * The caller must use ipc_rcu_putref() to free the identifier.
  *
  * Called with writer ipc_ids.rwsem held.
  */
@@ -258,6 +260,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
kgid_t egid;
int id, err;
 
+   /* 1) Initialize the refcount so that ipc_rcu_putref works */
+   refcount_set(&new->refcount, 1);
+
if (limit > IPCMNI)
limit = IPCMNI;
 
@@ -266,9 +271,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
 
idr_preload(GFP_KERNEL);
 
-   refcount_set(&new->refcount, 1);
	spin_lock_init(&new->lock);
-   new->deleted = false;
	rcu_read_lock();
	spin_lock(&new->lock);
 
@@ -277,6 +280,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
new->gid = new->cgid = egid;
 
ipc_set_seq(ids, new);
+   new->deleted = false;
 
/*
 * As soon as a new object is inserted into the idr,
@@ -288,6 +292,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
 * Thus the object must be fully initialized, and if something fails,
 * then the full tear-down sequence must be followed.
 * (i.e.: set new->deleted, reduce refcount, call_rcu())
+*
+* This function sets new->deleted; the caller must use ipc_rcu_putref()
+* for the remaining steps.
 */
id = ipc_idr_alloc(ids, new);
idr_preload_end();
@@ -301,6 +308,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
}
}
if (id < 0) {
+   new->deleted = true;
	spin_unlock(&new->lock);
rcu_read_unlock();
return id;
-- 
2.17.1
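
With this patch the convention is uniform: ipc_addid() initializes the refcount before anything can fail and marks the object as deleted itself when the idr allocation fails, so every caller can use one and the same error path. A minimal caller-side sketch, following the msg.c hunk above (simplified for illustration, not a verbatim copy of the kernel sources):

	/* ipc_addid() locks msq upon success; on failure the refcount
	 * is already valid and the object is already marked deleted ... */
	retval = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
	if (retval < 0) {
		/* ... so the caller only drops its reference; once the
		 * refcount reaches zero, ipc_rcu_putref() ends in call_rcu() */
		ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
		return retval;
	}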



[PATCH 0/12 V2] ipc: cleanups & bugfixes, rhashtable update

2018-07-09 Thread Manfred Spraul
Hi,

I have merged the patches from Dmitry, Davidlohr and myself:

- patch #1-#6: Fix syzkaller findings & further race cleanups
- patch #7: Cleanup from Dmitry for ipc_idr_alloc.
- patch #8-#11: rhashtable improvement from Davidlohr
- patch #12: Another cleanup for ipc_idr_alloc.

@Davidlohr:
Please double check that I have taken the correct patches, and
that I didn't break anything.
In particular, I had to reformat the commit ids, otherwise checkpatch
complained.

@Dmitry: Patch #12 reworks your ipc_idr_alloc patch.
Ok?

@Andrew:
Can you merge the patches into -mm/next?

I have not seen any issues in my tests.

--
Manfred

