[RFC] [PATCH] ipc/util.c: Use binary search for max_idx

2021-04-07 Thread Manfred Spraul
If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the index of the highest
used entry in the kernel's internal array recording information about
all SysV objects of the requested type for the current namespace.
(This information can be used with repeated ..._STAT or ..._STAT_ANY
operations to obtain information about all SysV objects on the system.)
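
For illustration, the enumeration that relies on this return value typically
looks like the following minimal user-space sketch (hypothetical example, not
part of this patch; error handling omitted):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    int main(void)
    {
        struct msginfo info;
        struct msqid_ds ds;
        int maxidx, i, id;

        /* MSG_INFO: the return value is the highest used index */
        maxidx = msgctl(0, MSG_INFO, (struct msqid_ds *)&info);

        for (i = 0; i <= maxidx; i++) {
            /* MSG_STAT takes an index (not an id) and returns the queue id */
            id = msgctl(i, MSG_STAT, &ds);
            if (id >= 0)
                printf("index %d: id %d, %lu message(s)\n",
                       i, id, (unsigned long)ds.msg_qnum);
        }
        return 0;
    }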

If the current highest used entry is destroyed, then the new highest
used entry is determined by looping over all possible values.
With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries.

As there is no get_last() function for idr structures:
Implement a "get_last()" using a binary search.

As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.

Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 44 +++-
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index cfa0045e748d..0121bf6b2617 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -64,6 +64,7 @@
 #include 
 #include 
 #include 
+#include <linux/log2.h>
 
 #include 
 
@@ -450,6 +451,40 @@ static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
   ipc_kht_params);
 }
 
+/**
+ * ipc_get_maxusedidx - get highest in-use index
+ * @ids: ipc identifier set
+ * @limit: highest possible index.
+ *
+ * The function determines the highest in use index value.
+ * ipc_ids.rwsem needs to be owned by the caller.
+ * If no ipc object is allocated, then -1 is returned.
+ */
+static int ipc_get_maxusedidx(struct ipc_ids *ids, int limit)
+{
+   void *val;
+   int tmpidx;
+   int i;
+   int retval;
+
+   i = ilog2(limit+1);
+
+   retval = 0;
+   for (; i >= 0; i--) {
+       tmpidx = retval | (1<<i);
+       /* bias by one: "0" is a valid index, and -1 must be
+        * returned when no object is allocated at all */
+       tmpidx = tmpidx - 1;
+       val = idr_get_next(&ids->ipcs_idr, &tmpidx);
+       if (val)
+           retval |= (1<<i);
+   }
+
+   return retval - 1;
+}
+
@@ ... @@ static void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)

   ipcp->deleted = true;

if (unlikely(idx == ids->max_idx)) {
-   do {
-   idx--;
-   if (idx == -1)
-   break;
-   } while (!idr_find(&ids->ipcs_idr, idx));
+
+   idx = ids->max_idx-1;
+   if (idx >= 0)
+   idx = ipc_get_maxusedidx(ids, idx);
ids->max_idx = idx;
}
 }
-- 
2.29.2
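
As an aside, the bisection done by ipc_get_maxusedidx() above can be
illustrated with a small self-contained user-space sketch; exists_at_or_above()
is a stand-in for idr_get_next(), and all names here are made up for the
example:

    #include <stdbool.h>
    #include <stdio.h>

    #define LIMIT 15

    static bool used[LIMIT + 1] = { [3] = true, [9] = true };

    /* stand-in for idr_get_next(): is any index >= idx in use? */
    static bool exists_at_or_above(int idx)
    {
        for (int i = idx; i <= LIMIT; i++)
            if (used[i])
                return true;
        return false;
    }

    static int get_max_used_idx(int limit)
    {
        int retval = 0;

        /* determine (max used index + 1) bit by bit, from the top bit down */
        for (int i = 31 - __builtin_clz(limit + 1); i >= 0; i--) {
            int candidate = retval | (1 << i);

            if (exists_at_or_above(candidate - 1))
                retval |= (1 << i);
        }
        return retval - 1;  /* -1 if nothing is in use */
    }

    int main(void)
    {
        printf("max used idx: %d\n", get_max_used_idx(LIMIT)); /* prints 9 */
        return 0;
    }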



Re: [PATCH] ipc/msg: add msgsnd_timed and msgrcv_timed syscall for system V message queue

2021-03-04 Thread Manfred Spraul

Hi Eric,


On 3/4/21 2:12 AM, Andrew Morton wrote:

On Tue, 23 Feb 2021 23:11:43 +0800 Eric Gao  wrote:


Sometimes we need the msgsnd or msgrcv syscall to return after a limited
time, so that the business thread is not blocked here indefinitely. For this
case, I add the msgsnd_timed and msgrcv_timed syscalls, which take a time
parameter with a unit of ms.

Please cc Manfred and Davidlohr on ipc/ changes.

The above is a very brief description for a new syscall!  Please go to
great lengths to tell us why this is considered useful - what are the
use cases?

Also, please fully describe the proposed syscall interface right here
in the changelog.  Please be prepared to later prepare a full manpage.


...
+SYSCALL_DEFINE5(msgsnd_timed, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz,
+   int, msgflg, long, timeoutms)

Specifying the timeout in milliseconds is problematic - it's very
coarse.  See sys_epoll_pwait2()'s use of timespecs.


What about using an absolute timeout, like in mq_timedsend()?

That makes restart handling after signals far simpler.
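
For reference, this is how the existing POSIX interface expresses it; because
the deadline is absolute, a restart after a signal can simply reuse the same
timespec (user-space sketch, error handling omitted):

    #include <mqueue.h>
    #include <time.h>

    int send_with_timeout(mqd_t mq, const char *msg, size_t len, long timeout_ms)
    {
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec  += timeout_ms / 1000;
        deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }
        /* absolute deadline: it stays valid even if the call is restarted */
        return mq_timedsend(mq, msg, len, 0, &deadline);
    }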


> -   schedule();
> +
> +   /* sometimes, we need msgsnd syscall return after a given time */
> +   if (timeoutms <= 0) {
> +   schedule();
> +   } else {
> +   timeoutms = schedule_timeout(timeoutms);
> +   if (timeoutms == 0)
> +   timeoutflag = true;
> +   }

I wonder if this should be schedule_timeout_interruptible() or at least
schedule_timeout_killable() instead of schedule_timeout(). If it should,
this should probably be done as a separate change.
No. schedule_timeout_interruptible() just means that 
__set_current_state() is called before the schedule_timeout().


The __set_current_state() is done directly in msg.c, before dropping the 
lock.
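
For reference, the pattern being discussed is roughly the following (a generic
sketch of the idiom, not the actual msg.c code; "msq_lock", "timeout_jiffies"
and "remaining" are placeholders):

    /* prepare to sleep while still holding the lock ... */
    __set_current_state(TASK_INTERRUPTIBLE);
    spin_unlock(&msq_lock);

    /* ... then sleep; schedule_timeout() returns the remaining jiffies, 0 on timeout */
    if (timeout_jiffies)
        remaining = schedule_timeout(timeout_jiffies);
    else
        schedule();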


--

    Manfred



Dringende Antwort!

2021-02-16 Thread Manfred Koch




My name is Manfred Koch, I am from Germany, and I have a donation of
750,000.00 euros for each of 3 people, either in Germany, Switzerland
or Austria.
Contact me and I will tell you the reason for my offer.


MK



Re: [PATCH] ipc/msg.c: wake up senders until there is a queue empty capacity

2020-06-01 Thread Manfred Spraul

Hi Artur,

On 6/1/20 4:02 PM, Artur Barsegyan wrote:

Hi, Manfred.

Did you get my last message?


Yes, I'm just too busy right now.

My plan/backlog is:

- the xarray patch from Matthew

- improve finding max_id in ipc_rmid(). Perhaps even remove max_id 
entirely and instead calculate it on demand.


- your patch to avoid waking up too many tasks, including my bugfix.



On Wed, May 27, 2020 at 02:22:57PM +0300, Artur Barsegyan wrote:

[sorry for the duplicates — I have changed my email client]

About your case:

The new receiver is put at the end of the receivers list.
pipelined_send() starts from the beginning of the list and iterates until the
end.

If our queue is always full, each receiver should get a message because new
receivers are appended at the end.
In my view, we waste some time in that loop, but in general it should increase
the throughput. But it should be tested.

Yes, I'm gonna implement it and make a benchmark. But maybe it should be done 
in another patch thread?


My biggest problem is always realistic benchmarks:

Do we optimize for code size/a small number of branches, or add special
cases for things that we think could be common?


Avoiding thundering herds is always good, avoiding schedule() is always 
good.


Thus I would start with pipelined_receive, and then we would need 
feedback from apps that use sysv msg.


(old fakeroot is what I remember as test app)


On Wed, May 27, 2020 at 08:03:17AM +0200, Manfred Spraul wrote:

Hello Artur,

On 5/26/20 9:56 AM, Artur Barsegyan wrote:

Hello, Manfred!

Thank you, for your review. I've reviewed your patch.

I forgot about the case with different message types. Right now, with your patch,
a sender will force message consuming if the queue doesn't hold its own capacity.

I have measured queue throughput and have pushed the results to:
https://github.com/artur-barsegyan/systemv_queue_research

But I'm confused about another thought: in the general loop in the do_msgsnd()
function, we don't check pipeline sending availability. Your case would be
optimized if we checked pipeline sending inside the loop.

I don't get your concern, or perhaps this is a feature that I had always
assumed as "normal":

"msg_fits_inqueue(msq, msgsz)" is in the loop, this ensures progress.

The rationale is a design decision:

The check for pipeline sending is only done if there would be space to store
the message in the queue.

I was afraid that performing the pipeline send immediately, without checking
queue availability, could break apps:

Some messages would arrive immediately (if there is a waiting receiver),
other messages are stuck forever (since the queue is full).

Initial patch: https://lkml.org/lkml/1999/10/3/5 (without any remarks about
the design decision)

The risk that I had seen was theoretical, I do not have any real bug
reports. So we could change it.

Perhaps: Go in the same direction as it was done for POSIX mqueue: implement
pipelined receive.


On Sun, May 24, 2020 at 03:21:31PM +0200, Manfred Spraul wrote:

Hello Artur,

On 5/23/20 10:34 PM, Artur Barsegyan wrote:

Take into account the total size of the already enqueued messages of
previously handled senders before another one.

Otherwise, we have serious degradation of receiver throughput for
the case with multiple senders, because another sender wakes up,
checks the queue capacity and falls asleep again.

Each round-trip wastes a lot of CPU time and leads to perceptible
throughput degradation.

Source code of:
- sender/receiver
- benchmark script
- ready graphics of before/after results

is located here: https://github.com/artur-barsegyan/systemv_queue_research

Thanks for analyzing the issue!


Signed-off-by: Artur Barsegyan 
---
ipc/msg.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index caca67368cb5..52d634b0a65a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -214,6 +214,7 @@ static void ss_wakeup(struct msg_queue *msq,
struct msg_sender *mss, *t;
struct task_struct *stop_tsk = NULL;
struct list_head *h = &msq->q_senders;
+   size_t msq_quota_used = 0;
list_for_each_entry_safe(mss, t, h, list) {
if (kill)
@@ -233,7 +234,7 @@ static void ss_wakeup(struct msg_queue *msq,
 * move the sender to the tail on behalf of the
 * blocked task.
 */
-   else if (!msg_fits_inqueue(msq, mss->msgsz)) {
+   else if (!msg_fits_inqueue(msq, msq_quota_used + mss->msgsz)) {
if (!stop_tsk)
stop_tsk = mss->tsk;
@@ -241,6 +242,7 @@ static void ss_wakeup(struct msg_queue *msq,
continue;
}
+   msq_quota_used += mss->msgsz;
wake_q_add(wake_q, mss->tsk);

You have missed the case of a do_msgsnd() that doesn't enqueue the message:

Situa

Re: [PATCH] ipc/msg.c: wake up senders until there is a queue empty capacity

2020-05-26 Thread Manfred Spraul

Hello Artur,

On 5/26/20 9:56 AM, Artur Barsegyan wrote:

Hello, Manfred!

Thank you, for your review. I've reviewed your patch.

I forgot about the case with different message types. Right now, with your patch,
a sender will force message consuming if the queue doesn't hold its own capacity.

I have measured queue throughput and have pushed the results to:
https://github.com/artur-barsegyan/systemv_queue_research

But I'm confused about another thought: in the general loop in the do_msgsnd()
function, we don't check pipeline sending availability. Your case would be
optimized if we checked pipeline sending inside the loop.


I don't get your concern, or perhaps this is a feature that I had always 
assumed as "normal":


"msg_fits_inqueue(msq, msgsz)" is in the loop, this ensures progress.

The rationale is a design decision:

The check for pipeline sending is only done if there would be space to 
store the message in the queue.


I was afraid that performing the pipeline send immediately, without 
checking queue availability, could break apps:


Some messages would arrive immediately (if there is a waiting receiver), 
other messages are stuck forever (since the queue is full).


Initial patch: https://lkml.org/lkml/1999/10/3/5 (without any remarks 
about the design decision)


The risk that I had seen was theoretical, I do not have any real bug 
reports. So we could change it.


Perhaps: Go in the same direction as it was done for POSIX mqueue: 
implement pipelined receive.
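
Schematically, the send side described above looks like this (a simplified
sketch; sleep_as_sender() and enqueue_message() are made-up names standing in
for the real code):

    for (;;) {
        if (msg_fits_inqueue(msq, msgsz))
            break;                          /* enough room: progress is possible */
        if (msgflg & IPC_NOWAIT)
            return -EAGAIN;
        sleep_as_sender(msq, msgsz);        /* block until a receiver makes room */
    }

    /* only reached when the message would fit: now try the direct hand-off */
    if (!pipelined_send(msq, msg, &wake_q)) /* no receiver is waiting ... */
        enqueue_message(msq, msg);          /* ... so store it in the queue */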



On Sun, May 24, 2020 at 03:21:31PM +0200, Manfred Spraul wrote:

Hello Artur,

On 5/23/20 10:34 PM, Artur Barsegyan wrote:

Take into account the total size of the already enqueued messages of
previously handled senders before another one.

Otherwise, we have serious degradation of receiver throughput for
the case with multiple senders, because another sender wakes up,
checks the queue capacity and falls asleep again.

Each round-trip wastes a lot of CPU time and leads to perceptible
throughput degradation.

Source code of:
- sender/receiver
- benchmark script
- ready graphics of before/after results

is located here: https://github.com/artur-barsegyan/systemv_queue_research

Thanks for analyzing the issue!


Signed-off-by: Artur Barsegyan 
---
   ipc/msg.c | 4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index caca67368cb5..52d634b0a65a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -214,6 +214,7 @@ static void ss_wakeup(struct msg_queue *msq,
struct msg_sender *mss, *t;
struct task_struct *stop_tsk = NULL;
struct list_head *h = &msq->q_senders;
+   size_t msq_quota_used = 0;
list_for_each_entry_safe(mss, t, h, list) {
if (kill)
@@ -233,7 +234,7 @@ static void ss_wakeup(struct msg_queue *msq,
 * move the sender to the tail on behalf of the
 * blocked task.
 */
-   else if (!msg_fits_inqueue(msq, mss->msgsz)) {
+   else if (!msg_fits_inqueue(msq, msq_quota_used + mss->msgsz)) {
if (!stop_tsk)
stop_tsk = mss->tsk;
@@ -241,6 +242,7 @@ static void ss_wakeup(struct msg_queue *msq,
continue;
}
+   msq_quota_used += mss->msgsz;
wake_q_add(wake_q, mss->tsk);

You have missed the case of a do_msgsnd() that doesn't enqueue the message:

Situation:

- 2 messages of type 1 in the queue (2x8192 bytes, queue full)

- 6 senders waiting to send messages of type 2

- 6 receivers waiting to get messages of type 2.

If now a receiver reads one message of type 1, then all 6 senders can send.

With your patch applied, only one sender sends the message to one receiver,
and the remaining 10 tasks continue to sleep.


Could you please check and (assuming that you agree) run your benchmarks
with the patch applied?

--

     Manfred



 From fe2f257b1950a19bf5c6f67e71aa25c2f13bcdc3 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Sun, 24 May 2020 14:47:31 +0200
Subject: [PATCH 2/2] ipc/msg.c: Handle case of senders not enqueuing the
  message

The patch "ipc/msg.c: wake up senders until there is a queue empty
capacity" avoids the thundering herd problem by wakeing up
only as many potential senders as there is free space in the queue.

This patch is a fix: If one of the senders doesn't enqueue its message,
then a search for further potential senders must be performed.

Signed-off-by: Manfred Spraul 
---
  ipc/msg.c | 21 +
  1 file changed, 21 insertions(+)

diff --git a/ipc/msg.c b/ipc/msg.c
index 52d634b0a65a..f6d5188db38a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -208,6 +208,12 @@ static inline void ss_del(struct msg_sender *mss)
list_del(&mss->list);
  }
  
+/*

+ * ss_wakeup() assumes that the stored senders will enqueue the

Re: [PATCH] ipc/msg.c: wake up senders until there is a queue empty capacity

2020-05-24 Thread Manfred Spraul

Hello Artur,

On 5/23/20 10:34 PM, Artur Barsegyan wrote:

Take into account the total size of the already enqueued messages of
previously handled senders before another one.

Otherwise, we have serious degradation of receiver throughput for
the case with multiple senders, because another sender wakes up,
checks the queue capacity and falls asleep again.

Each round-trip wastes a lot of CPU time and leads to perceptible
throughput degradation.

Source code of:
- sender/receiver
- benchmark script
- ready graphics of before/after results

is located here: https://github.com/artur-barsegyan/systemv_queue_research


Thanks for analyzing the issue!


Signed-off-by: Artur Barsegyan 
---
  ipc/msg.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index caca67368cb5..52d634b0a65a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -214,6 +214,7 @@ static void ss_wakeup(struct msg_queue *msq,
struct msg_sender *mss, *t;
struct task_struct *stop_tsk = NULL;
struct list_head *h = &msq->q_senders;
+   size_t msq_quota_used = 0;
  
  	list_for_each_entry_safe(mss, t, h, list) {

if (kill)
@@ -233,7 +234,7 @@ static void ss_wakeup(struct msg_queue *msq,
 * move the sender to the tail on behalf of the
 * blocked task.
 */
-   else if (!msg_fits_inqueue(msq, mss->msgsz)) {
+   else if (!msg_fits_inqueue(msq, msq_quota_used + mss->msgsz)) {
if (!stop_tsk)
stop_tsk = mss->tsk;
  
@@ -241,6 +242,7 @@ static void ss_wakeup(struct msg_queue *msq,

continue;
}
  
+		msq_quota_used += mss->msgsz;

wake_q_add(wake_q, mss->tsk);


You have missed the case of a do_msgsnd() that doesn't enqueue the message:

Situation:

- 2 messages of type 1 in the queue (2x8192 bytes, queue full)

- 6 senders waiting to send messages of type 2

- 6 receivers waiting to get messages of type 2.

If now a receiver reads one message of type 1, then all 6 senders can send.

With your patch applied, only one sender sends the message to one 
receiver, and the remaining 10 tasks continue to sleep.



Could you please check and (assuming that you agree) run your benchmarks 
with the patch applied?


--

    Manfred



From fe2f257b1950a19bf5c6f67e71aa25c2f13bcdc3 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Sun, 24 May 2020 14:47:31 +0200
Subject: [PATCH 2/2] ipc/msg.c: Handle case of senders not enqueuing the
 message

The patch "ipc/msg.c: wake up senders until there is a queue empty
capacity" avoids the thundering herd problem by wakeing up
only as many potential senders as there is free space in the queue.

This patch is a fix: If one of the senders doesn't enqueue its message,
then a search for further potential senders must be performed.

Signed-off-by: Manfred Spraul 
---
 ipc/msg.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/ipc/msg.c b/ipc/msg.c
index 52d634b0a65a..f6d5188db38a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -208,6 +208,12 @@ static inline void ss_del(struct msg_sender *mss)
 		list_del(&mss->list);
 }
 
+/*
+ * ss_wakeup() assumes that the stored senders will enqueue the pending message.
+ * Thus: If a woken up task doesn't send the enqueued message for whatever
+ * reason, then that task must call ss_wakeup() again, to ensure that no
+ * wakeup is lost.
+ */
 static void ss_wakeup(struct msg_queue *msq,
 		  struct wake_q_head *wake_q, bool kill)
 {
@@ -843,6 +849,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 	struct msg_queue *msq;
 	struct msg_msg *msg;
 	int err;
+	bool need_wakeup;
 	struct ipc_namespace *ns;
 	DEFINE_WAKE_Q(wake_q);
 
@@ -869,6 +876,7 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 
 	ipc_lock_object(&msq->q_perm);
 
+	need_wakeup = false;
 	for (;;) {
 		struct msg_sender s;
 
@@ -898,6 +906,13 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 		/* enqueue the sender and prepare to block */
 		ss_add(msq, &s, msgsz);
 
+		/* Enqueuing a sender is actually an obligation:
+		 * The sender must either enqueue the message, or call
+		 * ss_wakeup(). Thus track that we have added our message
+		 * to the candidates for the message queue.
+		 */
+		need_wakeup = true;
+
 		if (!ipc_rcu_getref(&msq->q_perm)) {
 			err = -EIDRM;
 			goto out_unlock0;
@@ -935,12 +950,18 @@ static long do_msgsnd(int msqid, long mtype, void __user *mtext,
 		msq->q_qnum++;
 		atomic_add(msgsz, &ns->msg_bytes);
 		atomic_inc(&ns->msg_hdrs);
+
+		/* we have fulfilled our obligation, no need for wakeup */
+		need_wakeup = false;
 	}
 
 	err = 0;
 	msg = NULL;
 
 out_unlock0:
+	if (need_wakeup)
+		ss_wakeup(msq, &wake_q, false);
+
 	ipc_unlock_object(&msq->q_perm);
 	wake_up_q(&wake_q);
 out_unlock1:
-- 
2.26.2



[PATCH] xarray.h: Correct return code for xa_store_{bh,irq}()

2020-04-30 Thread Manfred Spraul
__xa_store() and xa_store() document that the functions can fail, and
that the return code can be an xa_err() encoded error code.

xa_store_bh() and xa_store_irq() do not document that the functions
can fail and that they can also return xa_err() encoded error codes.

Thus: Update the documentation.

Signed-off-by: Manfred Spraul 
---
 include/linux/xarray.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index d79b8e3aa08d..2815c4ec89b1 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -576,7 +576,7 @@ void __xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t);
  *
  * Context: Any context.  Takes and releases the xa_lock while
  * disabling softirqs.
- * Return: The entry which used to be at this index.
+ * Return: The old entry at this index or xa_err() if an error happened.
  */
 static inline void *xa_store_bh(struct xarray *xa, unsigned long index,
void *entry, gfp_t gfp)
@@ -602,7 +602,7 @@ static inline void *xa_store_bh(struct xarray *xa, unsigned long index,
  *
  * Context: Process context.  Takes and releases the xa_lock while
  * disabling interrupts.
- * Return: The entry which used to be at this index.
+ * Return: The old entry at this index or xa_err() if an error happened.
  */
 static inline void *xa_store_irq(struct xarray *xa, unsigned long index,
void *entry, gfp_t gfp)
-- 
2.26.2
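
A hypothetical caller that acts on the now-documented error return could look
like this (sketch; my_xa, index and new_entry are made-up names):

    void *old;
    int err;

    old = xa_store_bh(&my_xa, index, new_entry, GFP_KERNEL);
    err = xa_err(old);      /* 0, or the negative errno encoded in the pointer */
    if (err)
        return err;         /* e.g. -ENOMEM */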



Re: [PATCH -next] ipc: use GFP_ATOMIC under spin lock

2020-04-28 Thread Manfred Spraul

Hello together,

On 4/28/20 1:14 PM, Matthew Wilcox wrote:

On Tue, Apr 28, 2020 at 03:47:36AM +, Wei Yongjun wrote:

The function ipc_id_alloc() is called from ipc_addid(), in which
a spin lock is held, so we should use GFP_ATOMIC instead.

Fixes: de5738d1c364 ("ipc: convert ipcs_idr to XArray")
Signed-off-by: Wei Yongjun 

I see why you think that, but it's not true.  Yes, we hold a spinlock, but
the spinlock is in an object which is not reachable from any other CPU.


Is it really allowed that spin_lock()/spin_unlock() may happen on 
different CPUs?


CPU1: spin_lock()

CPU1: schedule() -> sleeps

CPU2: -> schedule() returns

CPU2: spin_unlock().



Converting to GFP_ATOMIC is completely wrong.


What is your solution proposal?

xa_store() also takes a gfp_t flag. Thus even splitting _alloc() and 
_store() won't help:


    xa_alloc(,entry=NULL,)
    new->seq = ...
    spin_lock();
    xa_store(,entry=new,GFP_KERNEL);

--

    Manfred




Re: [ipc/sem.c] 6394de3b86: BUG:kernel_NULL_pointer_dereference,address

2019-10-23 Thread Manfred Spraul

Hello,

On 10/21/19 10:35 AM, kernel test robot wrote:

FYI, we noticed the following commit (built with gcc-7):

commit: 6394de3b868537a90dd9128607192b0e97109f6b ("[PATCH 4/5] ipc/sem.c: Document and update memory barriers")
url: https://github.com/0day-ci/linux/commits/Manfred-Spraul/wake_q-Cleanup-Documentation-update/20191014-055627


Yes, known issue:

@@ -2148,9 +2176,11 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,

    }

    do {
-   WRITE_ONCE(queue.status, -EINTR);
+   /* memory ordering ensured by the lock in sem_lock() */
+   queue.status = EINTR;
    queue.sleeper = current;

+   /* memory ordering is ensured by the lock in sem_lock() */
    __set_current_state(TASK_INTERRUPTIBLE);
    sem_unlock(sma, locknum);
    rcu_read_unlock();

It must be "-EINTR", not "EINTR".

If there is a timeout or a spurious wakeup, then the do_semtimedop() 
returns to user space without unlinking everything properly.


I was able to reproduce the issue: V1 of the series ends up with the 
shown error.


V3 as now merged doesn't fail.

--

    Manfred




[PATCH 3/5] ipc/mqueue.c: Update/document memory barriers

2019-10-20 Thread Manfred Spraul
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_acquire__after_ctrl_dep() after the READ_ONCE(), we need
  acquire semantics if the value is STATE_READY.

- use wake_q_add_safe()

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock. Thus the spin_unlock()
  is the RELEASE, and the spin_lock() is the ACQUIRE.

For completeness: there is also a 3 CPU scenario, if the to be woken
up task is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE
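
The wakeup-side ordering that results from the above, as a generic sketch
(simplified, list handling omitted; field names as in ipc/mqueue.c's
ext_wait_queue):

    get_task_struct(this->task);                  /* take the reference first */
    smp_store_release(&this->state, STATE_READY); /* RELEASE: publish the state */
    wake_q_add_safe(wake_q, this->task);          /* consumes that reference */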

Signed-off-by: Manfred Spraul 
Reviewed-by: Davidlohr Bueso 
Cc: Waiman Long 
---
 ipc/mqueue.c | 92 
 1 file changed, 78 insertions(+), 14 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 270456530f6a..49a05ba3000d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -63,6 +63,66 @@ struct posix_msg_tree_node {
int priority;
 };
 
+/*
+ * Locking:
+ *
+ * Accesses to a message queue are synchronized by acquiring info->lock.
+ *
+ * There are two notable exceptions:
+ * - The actual wakeup of a sleeping task is performed using the wake_q
+ *   framework. info->lock is already released when wake_up_q is called.
+ * - The exit codepaths after sleeping check ext_wait_queue->state without
+ *   any locks. If it is STATE_READY, then the syscall is completed without
+ *   acquiring info->lock.
+ *
+ * MQ_BARRIER:
+ * To achieve proper release/acquire memory barrier pairing, the state is set to
+ * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
+ * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
+ *
+ * This prevents the following races:
+ *
+ * 1) With the simple wake_q_add(), the task could be gone already before
+ *    the increase of the reference happens
+ * Thread A
+ *                              Thread B
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *                              wake_q_add(A)
+ *                              if (cmpxchg()) // success
+ *                              ->state = STATE_READY (reordered)
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * sysret to user space
+ * sys_exit()
+ *                              get_task_struct() // UaF
+ *
+ * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
+ * the smp_store_release() that does ->state = STATE_READY.
+ *
+ * 2) Without proper _release/_acquire barriers, the woken up task
+ *    could read stale data
+ *
+ * Thread A
+ *                              Thread B
+ * do_mq_timedreceive
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *                              state = STATE_READY;
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * msg_ptr = wait.msg; // Access to stale data!
+ *                              receiver->msg = message; (reordered)
+ *
+ * Solution: use _release and _acquire barriers.
+ *
+ * 3) There is intentionally no barrier when setting current->state
+ *to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
+ *release memory barrier, and the wakeup is triggered when holding
+ *info->lock, i.e. spin_lock(&info->lock) provided a pairing
+ *acquire memory barrier.
+ */
+
 struct ext_wait_queue {/* queue of sleeping tasks */
struct task_struct *task;
struct list_head list;
@@ -646,18 +706,23 @@ static int wq_sleep(struct mqueue_inode_info *info, int sr,
wq_add(info, sr, ewp);
 
for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
 
spin_unlock(&info->lock);
time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-   if (ewp->state == STATE_READY) {
+   if (READ_ONCE(ewp->state) == STATE_READY) {
+   /* see MQ_BARRIER for purpose/pairing */
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
spin_lock(&info->lock);
-   if (ewp->state == STATE_READY) {
+
+   /* we hold info->lock, so no memory barrier required */
+   if (READ_ONCE(ewp->state) == STATE_READY) {
retval = 0;
goto out_unlock;
}
@@ -923,16 +988,11 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,

[PATCH 5/5] ipc/sem.c: Document and update memory barriers

2019-10-20 Thread Manfred Spraul
The patch documents and updates the memory barriers in ipc/sem.c:
- Add smp_store_release() to wake_up_sem_queue_prepare() and
  document why it is needed.

- Read q->status using READ_ONCE()+smp_acquire__after_ctrl_dep()
  as the pair for the barrier inside wake_up_sem_queue_prepare().

- Add comments to all barriers, and mention the rules in the block
  regarding locking.

- Switch to using wake_q_add_safe().

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/sem.c | 66 ++-
 1 file changed, 41 insertions(+), 25 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ec97a7072413..c89734b200c6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -205,15 +205,38 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
  *
  * Memory ordering:
  * Most ordering is enforced by using spin_lock() and spin_unlock().
- * The special case is use_global_lock:
+ *
+ * Exceptions:
+ * 1) use_global_lock: (SEM_BARRIER_1)
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
- * using smp_store_release().
+ * using smp_store_release(): Immediately after setting it to 0,
+ * a simple op can start.
  * Testing if it is non-zero is an ACQUIRE, this is ensured by using
  * smp_load_acquire().
  * Setting it from 0 to non-zero must be ordered with regards to
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ *
+ * 2) queue.status: (SEM_BARRIER_2)
+ * Initialization is done while holding sem_lock(), so no further barrier is
+ * required.
+ * Setting it to a result code is a RELEASE, this is ensured by both a
+ * smp_store_release() (for case a) and while holding sem_lock()
+ * (for case b).
+ * The ACQUIRE when reading the result code without holding sem_lock() is
+ * achieved by using READ_ONCE() + smp_acquire__after_ctrl_dep().
+ * (case a above).
+ * Reading the result code while holding sem_lock() needs no further barriers,
+ * the locks inside sem_lock() enforce ordering (case b above)
+ *
+ * 3) current->state:
+ * current->state is set to TASK_INTERRUPTIBLE while holding sem_lock().
+ * The wakeup is handled using the wake_q infrastructure. wake_q wakeups may
+ * happen immediately after calling wake_q_add. As wake_q_add_safe() is called
+ * when holding sem_lock(), no further barriers are required.
+ *
+ * See also ipc/mqueue.c for more details on the covered races.
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -344,12 +367,8 @@ static void complexmode_tryleave(struct sem_array *sma)
return;
}
if (sma->use_global_lock == 1) {
-   /*
-* Immediately after setting use_global_lock to 0,
-* a simple op can start. Thus: all memory writes
-* performed by the current operation must be visible
-* before we set use_global_lock to 0.
-*/
+
+   /* See SEM_BARRIER_1 for purpose/pairing */
smp_store_release(&sma->use_global_lock, 0);
} else {
sma->use_global_lock--;
@@ -400,7 +419,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 */
spin_lock(&sem->lock);
 
-   /* pairs with smp_store_release() */
+   /* see SEM_BARRIER_1 for purpose/pairing */
if (!smp_load_acquire(&sma->use_global_lock)) {
/* fast path successful! */
return sops->sem_num;
@@ -766,15 +785,12 @@ static int perform_atomic_semop(struct sem_array *sma, struct sem_queue *q)
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 struct wake_q_head *wake_q)
 {
-   wake_q_add(wake_q, q->sleeper);
-   /*
-* Rely on the above implicit barrier, such that we can
-* ensure that we hold reference to the task before setting
-* q->status. Otherwise we could race with do_exit if the
-* task is awoken by an external event before calling
-* wake_up_process().
-*/
-   WRITE_ONCE(q->status, error);
+   get_task_struct(q->sleeper);
+
+   /* see SEM_BARRIER_2 for purpose/pairing */
+   smp_store_release(&q->status, error);
+
+   wake_q_add_safe(wake_q, q->sleeper);
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -2148,9 +2164,11 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
}
 
do {
+   /* memory ordering ensured by the lock in sem_lock() */
WRITE_ONCE(queue.status, -EINTR);
queue.sleeper = current;
 
+   /* memory ordering is ensured by the lock in sem_lock() */
__set_current_state

[PATCH 2/5] ipc/mqueue.c: Remove duplicated code

2019-10-20 Thread Manfred Spraul
Patch from Davidlohr, I just added this change log.
pipelined_send() and pipelined_receive() are identical, so merge them.

Signed-off-by: Davidlohr Bueso 
Signed-off-by: Manfred Spraul 
---
 ipc/mqueue.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..270456530f6a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
  * The same algorithm is used for senders.
  */
 
-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
- struct msg_msg *message,
- struct ext_wait_queue *receiver)
+ struct ext_wait_queue *this)
 {
-   receiver->msg = message;
-   list_del(&receiver->list);
-   wake_q_add(wake_q, receiver->task);
+   list_del(&this->list);
+   wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct wake_q_head *wake_q,
 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-   receiver->state = STATE_READY;
+   this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+ struct mqueue_inode_info *info,
+ struct msg_msg *message,
+ struct ext_wait_queue *receiver)
+{
+   receiver->msg = message;
+   __pipelined_op(wake_q, info, receiver);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct wake_q_head *wake_q,
if (msg_insert(sender->msg, info))
return;
 
-   list_del(&sender->list);
-   wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+   __pipelined_op(wake_q, info, sender);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
-- 
2.21.0



[PATCH 4/5] ipc/msg.c: Update and document memory barriers.

2019-10-20 Thread Manfred Spraul
Transfer findings from ipc/mqueue.c:
- A control barrier was missing for the lockless receive case.
  So in theory, not yet initialized data may have been copied
  to user space - obviously only for architectures where
  control barriers are not a NOP.

- use smp_store_release(). In theory, the refcount
  may have been decreased to 0 already when wake_q_add()
  tries to get a reference.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 43 ---
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 8dec945fa030..192a9291a8ab 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -61,6 +61,16 @@ struct msg_queue {
struct list_head q_senders;
 } __randomize_layout;
 
+/*
+ * MSG_BARRIER Locking:
+ *
+ * Similar to the optimization used in ipc/mqueue.c, one syscall return path
+ * does not acquire any locks when it sees that a message exists in
+ * msg_receiver.r_msg. Therefore r_msg is set using smp_store_release()
+ * and accessed using READ_ONCE()+smp_acquire__after_ctrl_dep(). In addition,
+ * wake_q_add_safe() is used. See ipc/mqueue.c for more details
+ */
+
 /* one msg_receiver structure for each sleeping receiver */
 struct msg_receiver {
struct list_head   r_list;
@@ -184,6 +194,10 @@ static inline void ss_add(struct msg_queue *msq,
 {
mss->tsk = current;
mss->msgsz = msgsz;
+   /*
+* No memory barrier required: we did ipc_lock_object(),
+* and the waker obtains that lock before calling wake_q_add().
+*/
__set_current_state(TASK_INTERRUPTIBLE);
list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -237,8 +251,11 @@ static void expunge_all(struct msg_queue *msq, int res,
struct msg_receiver *msr, *t;
 
list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
-   wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, ERR_PTR(res));
+   get_task_struct(msr->r_tsk);
+
+   /* see MSG_BARRIER for purpose/pairing */
+   smp_store_release(&msr->r_msg, ERR_PTR(res));
+   wake_q_add_safe(wake_q, msr->r_tsk);
}
 }
 
@@ -798,13 +815,17 @@ static inline int pipelined_send(struct msg_queue *msq, struct msg_msg *msg,
list_del(&msr->r_list);
if (msr->r_maxsize < msg->m_ts) {
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
+
+   /* See expunge_all regarding memory barrier */
+   smp_store_release(&msr->r_msg, ERR_PTR(-E2BIG));
} else {
ipc_update_pid(&msq->q_lrpid, task_pid(msr->r_tsk));
msq->q_rtime = ktime_get_real_seconds();
 
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, msg);
+
+   /* See expunge_all regarding memory barrier */
+   smp_store_release(&msr->r_msg, msg);
return 1;
}
}
@@ -1154,7 +1175,11 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
msr_d.r_maxsize = INT_MAX;
else
msr_d.r_maxsize = bufsz;
-   msr_d.r_msg = ERR_PTR(-EAGAIN);
+
+   /* memory barrier not required due to ipc_lock_object() */
+   WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));
+
+   /* memory barrier not required, we own ipc_lock_object() */
__set_current_state(TASK_INTERRUPTIBLE);
 
ipc_unlock_object(&msq->q_perm);
@@ -1183,8 +1208,12 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
 * signal) it will either see the message and continue ...
 */
msg = READ_ONCE(msr_d.r_msg);
-   if (msg != ERR_PTR(-EAGAIN))
+   if (msg != ERR_PTR(-EAGAIN)) {
+   /* see MSG_BARRIER for purpose/pairing */
+   smp_acquire__after_ctrl_dep();
+
goto out_unlock1;
+   }
 
 /*
  * ... or see -EAGAIN, acquire the lock to check the message
@@ -1192,7 +1221,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
  */
ipc_lock_object(&msq->q_perm);
 
-   msg = msr_d.r_msg;
+   msg = READ_ONCE(msr_d.r_msg);
if (msg != ERR_PTR(-EAGAIN))
goto out_unlock0;
 
-- 
2.21.0



[PATCH 0/5] V3: Clarify/standardize memory barriers for ipc

2019-10-20 Thread Manfred Spraul
Hi,

Updated series, based on input from Davidlohr and Peter Zijlstra:

- I've dropped the documentation update for wake_q_add, as what it
  states is normal: When you call a function and pass a parameter
  to a structure, you as caller are responsible to ensure that the 
  parameter is valid, and remains valid for the duration of the
  function call, including any tearing due to memory reordering.
  In addition, I've switched ipc to wake_q_add_safe().

- The patch to Documentation/memory_barriers.txt now as first change.
  @Davidlohr: You proposed to have 2 paragraphs: First, one for
  add/subtract, then one for failed cmpxchg. I didn't like that:
  We have one rule (can be combined with non-mb RMW ops), and then
  examples what are non-mb RMW ops. Listing special cases just ask
  for issues later.
  What I don't know is if there should be examples at all in
  Documentation/memory_barriers, or just
  "See Documentation/atomic_t.txt for examples of RMW ops that
  do not contain a memory barrier"

- For the memory barrier pairs in ipc/, I have now added
  /* See ABC_BARRIER for purpose/pairing */ as standard comment,
  and then a block near the relevant structure where purpose, pairing
  races, ... are explained. I think this makes it easier to read,
  compared to adding it to both the _release and _acquire branches.

Description/purpose:

The memory barriers in ipc are not properly documented, and at least
for some architectures insufficient:
Reading the xyz->status is only a control barrier, thus
smp_acquire__after_ctrl_dep() was missing in mqueue.c and msg.c.
sem.c contained a full smp_mb(), which is not required.
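
The reader-side pattern that the missing barrier completes, as a generic
sketch (field names follow ipc/mqueue.c):

    if (READ_ONCE(ewp->state) == STATE_READY) {
        /*
         * A control dependency only orders this load against later
         * stores; smp_acquire__after_ctrl_dep() upgrades it to an
         * ACQUIRE so that later loads are ordered as well.
         */
        smp_acquire__after_ctrl_dep();

        /* safe: pairs with the smp_store_release() that set STATE_READY */
        msg_ptr = ewp->msg;
    }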

Patches:
Patch 1: Documentation for smp_mb__{before,after}_atomic().

Patch 2: Remove code duplication inside ipc/mqueue.c

Patches 3-5: Update the ipc code, especially add the missing
   smp_acquire__after_ctrl_dep() and switch to wake_q_add_safe().

Clarify that smp_mb__{before,after}_atomic() are compatible with all
RMW atomic operations, not just the operations that do not return a value.

Open issues:
- More testing. I did some tests, but doubt that the tests would be
  sufficient to show issues with regards to incorrect memory barriers.

What do you think?

--
Manfred


[PATCH 1/5] smp_mb__{before,after}_atomic(): Update Documentation

2019-10-20 Thread Manfred Spraul
When adding the _{acquire|release|relaxed}() variants of some atomic
operations, it was forgotten to update Documentation/memory-barriers.txt:

smp_mb__{before,after}_atomic() is now intended for all RMW operations
that do not imply a memory barrier.

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

In addition, the patch splits the long sentence into multiple shorter
sentences.

Fixes: 654672d4ba1a ("locking/atomics: Add _{acquire|release|relaxed}() 
variants of some atomic operations")

Signed-off-by: Manfred Spraul 
Acked-by: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
Cc: Will Deacon 
---
 Documentation/memory-barriers.txt | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..fe43f4b30907 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,16 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
-
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are for use with atomic RMW functions that do not imply memory
+ barriers, but where the code needs a memory barrier. Examples for atomic
+ RMW functions that do not imply a memory barrier are e.g. add,
+ subtract, (failed) conditional operations, _relaxed functions,
+ but not atomic_read or atomic_set. A common example where a memory
+ barrier may be required is when atomic ops are used for reference
+ counting.
+
+ These are also used for atomic RMW bitop functions that do not imply a
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0
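
For context, the example that this section of memory-barriers.txt goes on to
show (referred to in the last context line above) is the dead-flag/refcount
pattern:

    obj->dead = 1;
    smp_mb__before_atomic();
    atomic_dec(&obj->ref_count);

The barrier makes sure the death mark is visible before the reference count is
decremented.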



Re: [PATCH 3/6] ipc/mqueue.c: Update/document memory barriers

2019-10-14 Thread Manfred Spraul

Hi Peter,

On 10/14/19 3:58 PM, Peter Zijlstra wrote:

On Mon, Oct 14, 2019 at 02:59:11PM +0200, Peter Zijlstra wrote:

On Sat, Oct 12, 2019 at 07:49:55AM +0200, Manfred Spraul wrote:


for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
  
  		spin_unlock(&info->lock);

time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
  
+		if (READ_ONCE(ewp->state) == STATE_READY) {

+   /*
+* Pairs, together with READ_ONCE(), with
+* the barrier in __pipelined_op().
+*/
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
spin_lock(&info->lock);
+
+   /* we hold info->lock, so no memory barrier required */
+   if (READ_ONCE(ewp->state) == STATE_READY) {
retval = 0;
goto out_unlock;
}
@@ -925,14 +933,12 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
list_del(&this->list);
wake_q_add(wake_q, this->task);
/*
+* The barrier is required to ensure that the refcount increase
+* inside wake_q_add() is completed before the state is updated.

fails to explain *why* this is important.


+*
+* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
 */
+smp_store_release(&this->state, STATE_READY);

You retained the whitespace damage.

And I'm terribly confused by this code, probably due to the lack of
'why' as per the above. What is this trying to do?

Are we worried about something like:

A   B   C


wq_sleep()
  schedule_...();

/* spurious wakeup */

wake_up_process(B)

wake_q_add(A)
  if (cmpxchg()) // success

->state = STATE_READY (reordered)

  if (READ_ONCE() == STATE_READY)
goto out;

exit();


get_task_struct() // UaF


Can we put the exact and full race in the comment please?


Yes, I'll do that. Actually, two threads are sufficient:

A    B

WRITE_ONCE(wait.state, STATE_NONE);
schedule_hrtimeout()

  wake_q_add(A)
  if (cmpxchg()) // success
  ->state = STATE_READY (reordered)


if (wait.state == STATE_READY) return;
sysret to user space
sys_exit()

  get_task_struct() // UaF



Like Davidlohr already suggested, elsewhere we write it like so:


--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -930,15 +930,10 @@ static inline void __pipelined_op(struct
  struct mqueue_inode_info *info,
  struct ext_wait_queue *this)
  {
+   get_task_struct(this->task);
list_del(&this->list);
-   wake_q_add(wake_q, this->task);
-   /*
-* The barrier is required to ensure that the refcount increase
-* inside wake_q_add() is completed before the state is updated.
-*
-* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
-*/
-smp_store_release(&this->state, STATE_READY);
+   smp_store_release(&this->state, STATE_READY);
+   wake_q_add_safe(wake_q, this->task);
  }
  
  /* pipelined_send() - send a message directly to the task waiting in


Much better, I'll rewrite it and then resend the series.

--

    Manfred



Re: [PATCH 6/6] Documentation/memory-barriers.txt: Clarify cmpxchg()

2019-10-14 Thread Manfred Spraul

Hello Peter,

On 10/14/19 3:03 PM, Peter Zijlstra wrote:

On Sat, Oct 12, 2019 at 07:49:58AM +0200, Manfred Spraul wrote:

The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may and is used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

The return value of atomic ops is relevant insofar as
(traditionally) all value-returning atomic ops already implied full
barriers. That of course changed when we added the
_release/_acquire/_relaxed variants.

I've updated the Change description accordingly

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
  Documentation/memory-barriers.txt | 11 ++-
  1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
   (*) smp_mb__before_atomic();
   (*) smp_mb__after_atomic();
  
- These are for use with atomic (such as add, subtract, increment and

- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
  
- These are also used for atomic bitop functions that do not return a

- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do imply a full

s/do/do not/ ?

Sorry, yes, of course

+ memory barrier (such as set_bit and clear_bit).



From 61c85a56994e32ea393af9debef4cccd9cd24abd Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Fri, 11 Oct 2019 10:33:26 +0200
Subject: [PATCH] Update Documentation for _{acquire|release|relaxed}()

When adding the _{acquire|release|relaxed}() variants of some atomic
operations, it was forgotten to update Documentation/memory-barriers.txt:

smp_mb__{before,after}_atomic() is now intended for all RMW operations
that do not imply a full memory barrier.

1)
	smp_mb__before_atomic();
	atomic_add();

2)
	smp_mb__before_atomic();
	atomic_xchg_relaxed();

3)
	smp_mb__before_atomic();
	atomic_fetch_add_relaxed();

Invalid would be:
	smp_mb__before_atomic();
	atomic_set();

Fixes: 654672d4ba1a ("locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations")

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
Cc: Will Deacon 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..08090eea3751 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do not imply a
+ full memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



[PATCH 2/6] ipc/mqueue.c: Remove duplicated code

2019-10-11 Thread Manfred Spraul
Patch entirely from Davidlohr:
pipelined_send() and pipelined_receive() are identical, so merge them.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 ipc/mqueue.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..be48c0ba92f7 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
  * The same algorithm is used for senders.
  */
 
-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
- struct msg_msg *message,
- struct ext_wait_queue *receiver)
+ struct ext_wait_queue *this)
 {
-   receiver->msg = message;
-   list_del(&receiver->list);
-   wake_q_add(wake_q, receiver->task);
+   list_del(&this->list);
+   wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct wake_q_head *wake_q,
 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-   receiver->state = STATE_READY;
+this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+ struct mqueue_inode_info *info,
+ struct msg_msg *message,
+ struct ext_wait_queue *receiver)
+{
+   receiver->msg = message;
+   __pipelined_op(wake_q, info, receiver);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct wake_q_head *wake_q,
if (msg_insert(sender->msg, info))
return;
 
-   list_del(&sender->list);
-   wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+   __pipelined_op(wake_q, info, sender);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
-- 
2.21.0



[PATCH 4/6] ipc/msg.c: Update and document memory barriers.

2019-10-11 Thread Manfred Spraul
Transfer findings from ipc/mqueue.c:
- A control barrier was missing for the lockless receive case.
  So in theory, not yet initialized data may have been copied
  to user space - obviously only for architectures where
  control barriers are not a NOP.

- use smp_store_release(). In theory, the refcount
  may have been decreased to 0 already when wake_q_add()
  tries to get a reference.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 44 ++--
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 8dec945fa030..e6b20a7e6341 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -184,6 +184,10 @@ static inline void ss_add(struct msg_queue *msq,
 {
mss->tsk = current;
mss->msgsz = msgsz;
+   /*
+* No memory barrier required: we did ipc_lock_object(),
+* and the waker obtains that lock before calling wake_q_add().
+*/
__set_current_state(TASK_INTERRUPTIBLE);
list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -238,7 +242,14 @@ static void expunge_all(struct msg_queue *msq, int res,
 
list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, ERR_PTR(res));
+
+   /*
+* The barrier is required to ensure that the refcount increase
+* inside wake_q_add() is completed before the state is updated.
+*
+* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
+*/
+   smp_store_release(&msr->r_msg, ERR_PTR(res));
}
 }
 
@@ -798,13 +809,17 @@ static inline int pipelined_send(struct msg_queue *msq, struct msg_msg *msg,
list_del(&msr->r_list);
if (msr->r_maxsize < msg->m_ts) {
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
+
+   /* See expunge_all regarding memory barrier */
+   smp_store_release(&msr->r_msg, ERR_PTR(-E2BIG));
} else {
ipc_update_pid(&msq->q_lrpid, task_pid(msr->r_tsk));
msq->q_rtime = ktime_get_real_seconds();
 
wake_q_add(wake_q, msr->r_tsk);
-   WRITE_ONCE(msr->r_msg, msg);
+
+   /* See expunge_all regarding memory barrier */
+   smp_store_release(&msr->r_msg, msg);
return 1;
}
}
@@ -1154,7 +1169,11 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
msr_d.r_maxsize = INT_MAX;
else
msr_d.r_maxsize = bufsz;
-   msr_d.r_msg = ERR_PTR(-EAGAIN);
+
+   /* memory barrier not required due to ipc_lock_object() */
+   WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));
+
+   /* memory barrier not required, we own ipc_lock_object() */
__set_current_state(TASK_INTERRUPTIBLE);
 
ipc_unlock_object(&msq->q_perm);
@@ -1183,8 +1202,21 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
 * signal) it will either see the message and continue ...
 */
msg = READ_ONCE(msr_d.r_msg);
-   if (msg != ERR_PTR(-EAGAIN))
+   if (msg != ERR_PTR(-EAGAIN)) {
+   /*
+* Memory barrier for msr_d.r_msg
+* The smp_acquire__after_ctrl_dep(), together with the
+* READ_ONCE() above pairs with the barrier inside
+* wake_q_add().
+* The barrier protects the accesses to the message in
+* do_msg_fill(). In addition, the barrier protects user
+* space, too: User space may assume that all data from
+* the CPU that sent the message is visible.
+*/
+   smp_acquire__after_ctrl_dep();
+
goto out_unlock1;
+   }
 
 /*
  * ... or see -EAGAIN, acquire the lock to check the message
@@ -1192,7 +1224,7 @@ static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, in
  */
ipc_lock_object(&msq->q_perm);
 
-   msg = msr_d.r_msg;
+   msg = READ_ONCE(msr_d.r_msg);
if (msg != ERR_PTR(-EAGAIN))
goto out_unlock0;
 
-- 
2.21.0



[PATCH 5/6] ipc/sem.c: Document and update memory barriers

2019-10-11 Thread Manfred Spraul
The patch documents and updates the memory barriers in ipc/sem.c:
- Add smp_store_release() to wake_up_sem_queue_prepare() and
  document why it is needed.

- Read q->status using READ_ONCE()+smp_acquire__after_ctrl_dep()
  as the pair for the barrier inside wake_up_sem_queue_prepare().

- Add comments to all barriers, and mention the rules in the block
  regarding locking.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/sem.c | 63 ---
 1 file changed, 51 insertions(+), 12 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ec97a7072413..c6c5954a2030 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -205,7 +205,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
  *
  * Memory ordering:
  * Most ordering is enforced by using spin_lock() and spin_unlock().
- * The special case is use_global_lock:
+ *
+ * Exceptions:
+ * 1) use_global_lock:
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
  * using smp_store_release().
  * Testing if it is non-zero is an ACQUIRE, this is ensured by using
@@ -214,6 +216,24 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ *
+ * 2) queue.status:
+ * Initialization is done while holding sem_lock(), so no further barrier is
+ * required.
+ * Setting it to a result code is a RELEASE, this is ensured by both a
+ * smp_store_release() (for case a) and while holding sem_lock()
+ * (for case b).
+ * The ACQUIRE when reading the result code without holding sem_lock() is
+ * achieved by using READ_ONCE() + smp_acquire__after_ctrl_dep().
+ * (case a above).
+ * Reading the result code while holding sem_lock() needs no further barriers,
+ * the locks inside sem_lock() enforce ordering (case b above)
+ *
+ * 3) current->state:
+ * current->state is set to TASK_INTERRUPTIBLE while holding sem_lock().
+ * The wakeup is handled using the wake_q infrastructure. wake_q wakeups may
+ * happen immediately after calling wake_q_add. As wake_q_add() is called
+ * when holding sem_lock(), no further barriers are required.
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -766,15 +786,24 @@ static int perform_atomic_semop(struct sem_array *sma, 
struct sem_queue *q)
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 struct wake_q_head *wake_q)
 {
+   /*
+* When the wakeup is performed, q->sleeper->state is read and later
+* set to TASK_RUNNING. This may happen at any time, even before
+* wake_q_add() returns. Memory ordering for q->sleeper->state is
+* enforced by sem_lock(): we own sem_lock now (that was the ACQUIRE),
+* and q->sleeper wrote q->sleeper->state before calling sem_unlock()
+* (->RELEASE).
+*/
wake_q_add(wake_q, q->sleeper);
/*
-* Rely on the above implicit barrier, such that we can
-* ensure that we hold reference to the task before setting
-* q->status. Otherwise we could race with do_exit if the
-* task is awoken by an external event before calling
-* wake_up_process().
+* Here, we need a barrier to protect the refcount increase inside
+* wake_q_add().
+* case a: The barrier inside wake_q_add() pairs with
+* READ_ONCE(q->status) + smp_acquire__after_ctrl_dep() in
+* do_semtimedop().
+* case b: nothing, ordering is enforced by the locks in sem_lock().
 */
-   WRITE_ONCE(q->status, error);
+   smp_store_release(&q->status, error);
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -2148,9 +2177,11 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
}
 
do {
+   /* memory ordering ensured by the lock in sem_lock() */
WRITE_ONCE(queue.status, -EINTR);
queue.sleeper = current;
 
+   /* memory ordering is ensured by the lock in sem_lock() */
__set_current_state(TASK_INTERRUPTIBLE);
sem_unlock(sma, locknum);
rcu_read_unlock();
@@ -2174,12 +2205,16 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
error = READ_ONCE(queue.status);
if (error != -EINTR) {
/*
-* User space could assume that semop() is a memory
-* barrier: Without the mb(), the cpu could
-* speculatively read in userspace stale data that was
-* overwritten by the previous owner of the semaphore.
+* Memory barrier for queue.status, case a):
+ 

[PATCH 6/6] Documentation/memory-barriers.txt: Clarify cmpxchg()

2019-10-11 Thread Manfred Spraul
The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may and is used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do not imply a
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



[PATCH 3/6] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_acquire__after_ctrl_dep() after the READ_ONCE, we need
  acquire semantics if the value is STATE_READY.

- add an explicit memory barrier to __pipelined_op(), the
  refcount must have been increased before the updated state becomes
  visible

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock. Thus the spin_unlock()
  is the RELEASE, and the spin_lock() is the ACQUIRE.

For completeness: there is also a 3 CPU scenario, if the to be woken
up task is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/mqueue.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index be48c0ba92f7..b80574822f0a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -646,18 +646,26 @@ static int wq_sleep(struct mqueue_inode_info *info, int 
sr,
wq_add(info, sr, ewp);
 
for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
 
spin_unlock(&info->lock);
time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-   if (ewp->state == STATE_READY) {
+   if (READ_ONCE(ewp->state) == STATE_READY) {
+   /*
+* Pairs, together with READ_ONCE(), with
+* the barrier in __pipelined_op().
+*/
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
spin_lock(&info->lock);
-   if (ewp->state == STATE_READY) {
+
+   /* we hold info->lock, so no memory barrier required */
+   if (READ_ONCE(ewp->state) == STATE_READY) {
retval = 0;
goto out_unlock;
}
@@ -925,14 +933,12 @@ static inline void __pipelined_op(struct wake_q_head 
*wake_q,
list_del(&this->list);
wake_q_add(wake_q, this->task);
/*
-* Rely on the implicit cmpxchg barrier from wake_q_add such
-* that we can ensure that updating receiver->state is the last
-* write operation: As once set, the receiver can continue,
-* and if we don't have the reference count from the wake_q,
-* yet, at that point we can later have a use-after-free
-* condition and bogus wakeup.
+* The barrier is required to ensure that the refcount increase
+* inside wake_q_add() is completed before the state is updated.
+*
+* The barrier pairs with READ_ONCE()+smp_mb__after_ctrl_dep().
 */
-this->state = STATE_READY;
+smp_store_release(&this->state, STATE_READY);
 }
 
 /* pipelined_send() - send a message directly to the task waiting in
@@ -1049,7 +1055,9 @@ static int do_mq_timedsend(mqd_t mqdes, const char __user 
*u_msg_ptr,
} else {
wait.task = current;
wait.msg = (void *) msg_ptr;
-   wait.state = STATE_NONE;
+
+   /* memory barrier not required, we hold info->lock */
+   WRITE_ONCE(wait.state, STATE_NONE);
ret = wq_sleep(info, SEND, timeout, &wait);
/*
 * wq_sleep must be called with info->lock held, and
@@ -1152,7 +1160,9 @@ static int do_mq_timedreceive(mqd_t mqdes, char __user 
*u_msg_ptr,
ret = -EAGAIN;
} else {
wait.task = current;
-   wait.state = STATE_NONE;
+
+   /* memory barrier not required, we hold info->lock */
+   WRITE_ONCE(wait.state, STATE_NONE);
ret = wq_sleep(info, RECV, timeout, &wait);
msg_ptr = wait.msg;
}
-- 
2.21.0



[PATCH 1/6] wake_q: Cleanup + Documentation update.

2019-10-11 Thread Manfred Spraul
1) wake_q_add() contains a memory barrier, and callers such as
ipc/mqueue.c rely on this barrier.
Unfortunately, this is documented in ipc/mqueue.c, and not in the
description of wake_q_add().
Therefore: Update the documentation.
Removing/updating ipc/mqueue.c will happen with the next patch in the
series.

2) wake_q_add() ends with get_task_struct(), which is an
unordered refcount increase. Add a clear comment that the callers
are responsible for a barrier: most likely spin_unlock() or
smp_store_release().

3) wake_up_q() relies on the memory barrier in try_to_wake_up().
Add a comment, to simplify searching.

4) wake_q.next is accessed without synchronization by wake_q_add(),
using cmpxchg_relaxed(), and by wake_up_q().
Therefore: Use WRITE_ONCE in wake_up_q(), to ensure that the
compiler doesn't perform any tricks.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 kernel/sched/core.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd05a378631a..60ae574317fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -440,8 +440,16 @@ static bool __wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
  * @task: the task to queue for 'later' wakeup
  *
  * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
+ * same context, _HOWEVER_ this is not guaranteed. Especially, the wakeup
+ * may happen before the function returns.
+ *
+ * What is guaranteed is that there is a memory barrier before the wakeup,
+ * callers may rely on this barrier.
+ *
+ * On the other hand, the caller must guarantee that @task does not disappear
+ * before wake_q_add() completed. wake_q_add() does not contain any memory
+ * barrier to ensure ordering, thus the caller may need to use
+ * smp_store_release().
  *
  * This function must be used as-if it were wake_up_process(); IOW the task
  * must be ready to be woken at this location.
@@ -486,11 +494,14 @@ void wake_up_q(struct wake_q_head *head)
BUG_ON(!task);
/* Task can safely be re-inserted now: */
node = node->next;
-   task->wake_q.next = NULL;
+
+   WRITE_ONCE(task->wake_q.next, NULL);
 
/*
 * wake_up_process() executes a full barrier, which pairs with
 * the queueing in wake_q_add() so as not to miss wakeups.
+* The barrier is the smp_mb__after_spinlock() in
+* try_to_wake_up().
 */
wake_up_process(task);
put_task_struct(task);
-- 
2.21.0



[PATCH 0/6] V2: Clarify/standardize memory barriers for ipc

2019-10-11 Thread Manfred Spraul
Hi,

Updated series, based on input from Davidlohr:

- Mixing WRITE_ONCE(), when not holding a lock, and "normal" writes,
  when holding a lock, makes the code less readable.
  Thus use _ONCE() everywhere, for both WRITE_ONCE() and READ_ONCE().

- According to my understanding, wake_q_add() does not contain a barrier
  that protects the refcount increase. Document that, and add the barrier
  to the ipc code

- and, based on patch review: The V1 patch for ipc/sem.c is incorrect,
  ->state must be set to "-EINTR", not EINTR.

From V1:

The memory barriers in ipc are not properly documented, and at least
for some architectures insufficient:
Reading the xyz->status is only a control barrier, thus
smp_acquire__after_ctrl_dep() was missing in mqueue.c and msg.c
sem.c contained a full smp_mb(), which is not required.
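
As an illustration only (a simplified sketch, not taken verbatim from any
patch below), the sleeper/waker pairing that the series converges on is:

        /* waker, after filling in the result under the ipc lock */
        smp_store_release(&q->status, error);           /* RELEASE */

        /* sleeper, lockless fast path after schedule() */
        error = READ_ONCE(q->status);
        if (error != -EINTR) {
                /* upgrade the control dependency to an ACQUIRE */
                smp_acquire__after_ctrl_dep();
                /* now safe to read data written before the release */
        }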

Patches:
Patch 1: Document the barrier rules for wake_q_add().

Patch 2: remove code duplication
@Davidlohr: There is no "Signed-off-by" in your mail, otherwise I would
list you as author.

Patch 3-5: Update the ipc code, especially add missing
   smp_mb__after_ctrl_dep().

Clarify that smp_mb__{before,after}_atomic() are compatible with all
RMW atomic operations, not just the operations that do not return a value.

Patch 6: Documentation for smp_mb__{before,after}_atomic().

Open issues:
- Is my analysis regarding the refcount correct?

- Review other users of wake_q_add().

- More testing. I did some tests, but doubt that the tests would be
  sufficient to show issues with regards to incorrect memory barriers.

- Should I add a "Fixes:" or "Cc:stable"? The issues that I see are
  the missing smp_mb__after_ctrl_dep(), and WRITE_ONCE() vs.
  "ptr = NULL", and a risk regarding the refcount that I can't evaluate.


What do you think?

--
Manfred


Re: [PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul

On 10/11/19 6:55 PM, Davidlohr Bueso wrote:

On Fri, 11 Oct 2019, Manfred Spraul wrote:


Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.


In general we relied on the barrier for not needing READ/WRITE_ONCE,
but I agree this scenario should be better documented with them.


After reading core-api/atomic_ops.rst:

> _ONCE() should be used. [...] Alternatively, you can place a barrier.

So both approaches are ok.

Let's follow the "should", i.e.: all operations on the ->state variables 
to READ_ONCE()/WRITE_ONCE().


Then we have a standard, and since we can follow the "should", we should 
do that.



Similarly imo, the 'state' should also need them for write, even if
under the lock -- consistency and documentation, for example.

Ok, so let's convert everything to _ONCE. (assuming that my analysis 
below is incorrect)

In addition, I think it makes sense to encapsulate some of the
pipelined send/recv operations, that also can allow us to keep
the barrier comments in pipelined_send(), which I wonder why
you chose to remove. Something like so, before your changes:

I thought that the simple "memory barrier is provided" is enough, so I 
had removed the comment.



But you are right, there are two different scenarios:

1) thread already in another wake_q, wakeup happens immediately after 
the cmpxchg_relaxed().


This scenario is safe, due to the smp_mb__before_atomic() in wake_q_add()

2) thread woken up by e.g. a timeout, sees ->state=STATE_READY, returns 
to user space, calls sys_exit.


This must not happen before get_task_struct acquired a reference.

And this appears to be unsafe: get_task_struct() is refcount_inc(), 
which is refcount_inc_checked(), which is according to lib/refcount.c 
fully unordered.


Thus: ->state=STATE_READY can execute before the refcount increase.

Thus: ->state=STATE_READY needs a smp_store_release(), correct?


diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..be48c0ba92f7 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -918,17 +918,12 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, 
u_name)

 * The same algorithm is used for senders.
 */

-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
  struct mqueue_inode_info *info,
-  struct msg_msg *message,
-  struct ext_wait_queue *receiver)
+  struct ext_wait_queue *this)
{
-    receiver->msg = message;
-    list_del(&receiver->list);
-    wake_q_add(wake_q, receiver->task);
+    list_del(&this->list);
+    wake_q_add(wake_q, this->task);
/*
 * Rely on the implicit cmpxchg barrier from wake_q_add such
 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct 
wake_q_head *wake_q,

 * yet, at that point we can later have a use-after-free
 * condition and bogus wakeup.
 */
-    receiver->state = STATE_READY;
+    this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+  struct mqueue_inode_info *info,
+  struct msg_msg *message,
+  struct ext_wait_queue *receiver)
+{
+    receiver->msg = message;
+    __pipelined_op(wake_q, info, receiver);
}

/* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(struct 
wake_q_head *wake_q,

if (msg_insert(sender->msg, info))
    return;

-    list_del(&sender->list);
-    wake_q_add(wake_q, sender->task);
-    sender->state = STATE_READY;
+    __pipelined_op(wake_q, info, sender);
}

static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,


I would merge that into the series, ok?

--

    Manfred



[PATCH 2/5] ipc/mqueue.c: Update/document memory barriers

2019-10-11 Thread Manfred Spraul
Update and document memory barriers for mqueue.c:
- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_acquire__after_ctrl_dep() after the READ_ONCE, we need
  acquire semantics if the value is STATE_READY.

- document that the code relies on the barrier inside wake_q_add()

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/mqueue.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3d920ff15c80..902167407737 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -646,17 +646,25 @@ static int wq_sleep(struct mqueue_inode_info *info, int 
sr,
wq_add(info, sr, ewp);
 
for (;;) {
+   /* memory barrier not required, we hold info->lock */
__set_current_state(TASK_INTERRUPTIBLE);
 
spin_unlock(&info->lock);
time = schedule_hrtimeout_range_clock(timeout, 0,
HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-   if (ewp->state == STATE_READY) {
+   if (READ_ONCE(ewp->state) == STATE_READY) {
+   /*
+* Pairs, together with READ_ONCE(), with
+* the barrier in wake_q_add().
+*/
+   smp_acquire__after_ctrl_dep();
retval = 0;
goto out;
}
spin_lock(&info->lock);
+
+   /* we hold info->lock, so no memory barrier required */
if (ewp->state == STATE_READY) {
retval = 0;
goto out_unlock;
@@ -928,16 +936,11 @@ static inline void pipelined_send(struct wake_q_head 
*wake_q,
 {
receiver->msg = message;
list_del(&receiver->list);
+
wake_q_add(wake_q, receiver->task);
-   /*
-* Rely on the implicit cmpxchg barrier from wake_q_add such
-* that we can ensure that updating receiver->state is the last
-* write operation: As once set, the receiver can continue,
-* and if we don't have the reference count from the wake_q,
-* yet, at that point we can later have a use-after-free
-* condition and bogus wakeup.
-*/
-   receiver->state = STATE_READY;
+
+   /* The memory barrier is provided by wake_q_add(). */
+   WRITE_ONCE(receiver->state, STATE_READY);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -956,8 +959,11 @@ static inline void pipelined_receive(struct wake_q_head 
*wake_q,
return;
 
list_del(&sender->list);
+
wake_q_add(wake_q, sender->task);
-   sender->state = STATE_READY;
+
+   /* The memory barrier is provided by wake_q_add(). */
+   WRITE_ONCE(sender->state, STATE_READY);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
@@ -1044,6 +1050,8 @@ static int do_mq_timedsend(mqd_t mqdes, const char __user 
*u_msg_ptr,
} else {
wait.task = current;
wait.msg = (void *) msg_ptr;
+
+   /* memory barrier not required, we hold info->lock */
wait.state = STATE_NONE;
ret = wq_sleep(info, SEND, timeout, &wait);
/*
@@ -1147,6 +1155,8 @@ static int do_mq_timedreceive(mqd_t mqdes, char __user 
*u_msg_ptr,
ret = -EAGAIN;
} else {
wait.task = current;
+
+   /* memory barrier not required, we hold info->lock */
wait.state = STATE_NONE;
ret = wq_sleep(info, RECV, timeout, &wait);
msg_ptr = wait.msg;
-- 
2.21.0



[PATCH 3/5] ipc/msg.c: Update and document memory barriers.

2019-10-11 Thread Manfred Spraul
Transfer findings from ipc/sem.c:
- A control barrier was missing for the lockless receive case
  So in theory, not yet initialized data may have been copied
  to user space - obviously only for architectures where
  control barriers are not NOP.

- Add documentation. Especially, document that the code relies
  on the barrier inside wake_q_add().

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 39 ++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 8dec945fa030..1e2c0a3d4998 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -184,6 +184,10 @@ static inline void ss_add(struct msg_queue *msq,
 {
mss->tsk = current;
mss->msgsz = msgsz;
+   /*
+* No memory barrier required: we did ipc_lock_object(),
+* and the waker obtains that lock before calling wake_q_add().
+*/
__set_current_state(TASK_INTERRUPTIBLE);
list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -238,6 +242,12 @@ static void expunge_all(struct msg_queue *msq, int res,
 
list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
wake_q_add(wake_q, msr->r_tsk);
+
+   /*
+* A memory barrier is required that pairs with the
+* READ_ONCE()+smp_mb__after_ctrl_dep(). It is provided by
+* wake_q_add().
+*/
WRITE_ONCE(msr->r_msg, ERR_PTR(res));
}
 }
@@ -798,12 +808,24 @@ static inline int pipelined_send(struct msg_queue *msq, 
struct msg_msg *msg,
list_del(&msr->r_list);
if (msr->r_maxsize < msg->m_ts) {
wake_q_add(wake_q, msr->r_tsk);
+
+   /*
+* A memory barrier is required that pairs with
+* the READ_ONCE()+smp_mb__after_ctrl_dep().
+* It is provided by wake_q_add().
+*/
WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
} else {
ipc_update_pid(&msq->q_lrpid, 
task_pid(msr->r_tsk));
msq->q_rtime = ktime_get_real_seconds();
 
wake_q_add(wake_q, msr->r_tsk);
+
+   /*
+* A memory barrier is required that pairs with
+* the READ_ONCE()+smp_mb__after_ctrl_dep().
+* It is provided by wake_q_add().
+*/
WRITE_ONCE(msr->r_msg, msg);
return 1;
}
@@ -1155,6 +1177,8 @@ static long do_msgrcv(int msqid, void __user *buf, size_t 
bufsz, long msgtyp, in
else
msr_d.r_maxsize = bufsz;
msr_d.r_msg = ERR_PTR(-EAGAIN);
+
+   /* memory barrier not required, we own ipc_lock_object() */
__set_current_state(TASK_INTERRUPTIBLE);
 
ipc_unlock_object(&msq->q_perm);
@@ -1183,8 +1207,21 @@ static long do_msgrcv(int msqid, void __user *buf, 
size_t bufsz, long msgtyp, in
 * signal) it will either see the message and continue ...
 */
msg = READ_ONCE(msr_d.r_msg);
-   if (msg != ERR_PTR(-EAGAIN))
+   if (msg != ERR_PTR(-EAGAIN)) {
+   /*
+* Memory barrier for msr_d.r_msg
+* The smp_acquire__after_ctrl_dep(), together with the
+* READ_ONCE() above pairs with the barrier inside
+* wake_q_add().
+* The barrier protects the accesses to the message in
+* do_msg_fill(). In addition, the barrier protects user
+* space, too: User space may assume that all data from
+* the CPU that sent the message is visible.
+*/
+   smp_acquire__after_ctrl_dep();
+
goto out_unlock1;
+   }
 
 /*
  * ... or see -EAGAIN, acquire the lock to check the message
-- 
2.21.0



[PATCH 4/5] ipc/sem.c: Document and update memory barriers

2019-10-11 Thread Manfred Spraul
The patch documents and updates the memory barriers in ipc/sem.c:
- Document that the WRITE_ONCE for q->status relies on a barrier
  inside wake_q_add().

- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
  as the pair for the barrier inside wake_q_add()

- Remove READ_ONCE & WRITE_ONCE for the situations where spinlocks
  provide exclusion.

- Add comments to all barriers, and mention the rules in the block
  regarding locking.

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
---
 ipc/sem.c | 64 ---
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ec97a7072413..53d970c4e60d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -205,7 +205,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
  *
  * Memory ordering:
  * Most ordering is enforced by using spin_lock() and spin_unlock().
- * The special case is use_global_lock:
+ *
+ * Exceptions:
+ * 1) use_global_lock:
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
  * using smp_store_release().
  * Testing if it is non-zero is an ACQUIRE, this is ensured by using
@@ -214,6 +216,24 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ *
+ * 2) queue.status:
+ * Initialization is done while holding sem_lock(), so no further barrier is
+ * required.
+ * Setting it to a result code is a RELEASE, this is ensured by both the
+ * barrier inside wake_q_add() (for case a) and while holding sem_lock()
+ * (for case b).
+ * The ACQUIRE when reading the result code without holding sem_lock() is
+ * achieved by using READ_ONCE() + smp_acquire__after_ctrl_dep().
+ * (case a above).
+ * Reading the result code while holding sem_lock() needs no further barriers,
+ * the locks inside sem_lock() enforce ordering (case b above)
+ *
+ * 3) current->state:
+ * current->state is set to TASK_INTERRUPTIBLE while holding sem_lock().
+ * The wakeup is handled using the wake_q infrastructure. wake_q wakeups may
+ * happen immediately after calling wake_q_add. As wake_q_add() is called
+ * when holding sem_lock(), no further barriers are required.
  */
 
 #define sc_semmsl  sem_ctls[0]
@@ -766,13 +786,21 @@ static int perform_atomic_semop(struct sem_array *sma, 
struct sem_queue *q)
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 struct wake_q_head *wake_q)
 {
+   /*
+* When the wakeup is performed, q->sleeper->state is read and later
+* set to TASK_RUNNING. This may happen at any time, even before
+* wake_q_add() returns. Memory ordering for q->sleeper->state is
+* enforced by sem_lock(): we own sem_lock now (that was the ACQUIRE),
+* and q->sleeper wrote q->sleeper->state before calling sem_unlock()
+* (->RELEASE).
+*/
wake_q_add(wake_q, q->sleeper);
/*
-* Rely on the above implicit barrier, such that we can
-* ensure that we hold reference to the task before setting
-* q->status. Otherwise we could race with do_exit if the
-* task is awoken by an external event before calling
-* wake_up_process().
+* Memory barrier pairing:
+* case a: The barrier inside wake_q_add() pairs with
+* READ_ONCE(q->status) + smp_acquire__after_ctrl_dep() in
+* do_semtimedop().
+* case b: nothing, ordering is enforced by the locks in sem_lock().
 */
WRITE_ONCE(q->status, error);
 }
@@ -2148,9 +2176,11 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
}
 
do {
-   WRITE_ONCE(queue.status, -EINTR);
+   /* memory ordering ensured by the lock in sem_lock() */
+   queue.status = EINTR;
queue.sleeper = current;
 
+   /* memory ordering is ensured by the lock in sem_lock() */
__set_current_state(TASK_INTERRUPTIBLE);
sem_unlock(sma, locknum);
rcu_read_unlock();
@@ -2174,12 +2204,16 @@ static long do_semtimedop(int semid, struct sembuf 
__user *tsops,
error = READ_ONCE(queue.status);
if (error != -EINTR) {
/*
-* User space could assume that semop() is a memory
-* barrier: Without the mb(), the cpu could
-* speculatively read in userspace stale data that was
-* overwritten by the previous owner of the semaphore.
+* Memory barrier for queue.status, case a):
+* The smp_acquire__after_ctrl_dep(), together with th

[PATCH 5/5] Documentation/memory-barriers.txt: Clarify cmpxchg()

2019-10-11 Thread Manfred Spraul
The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may and is used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

1)
smp_mb__before_atomic();
atomic_add();

2)
smp_mb__before_atomic();
atomic_xchg_relaxed();

3)
smp_mb__before_atomic();
atomic_fetch_add_relaxed();

Invalid would be:
smp_mb__before_atomic();
atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do not imply a
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



[PATCH 0/3] Clarify/standardize memory barriers for ipc

2019-10-11 Thread Manfred Spraul
Hi,

Partially based on the findings from Waiman Long:

a) The memory barriers in ipc are not properly documented, and at least
for some architectures insufficient:
Reading the xyz->status is only a control barrier, thus
smp_acquire__after_ctrl_dep() was missing in mqueue.c and msg.c
sem.c contained a full smp_mb(), which is not required.

Patch 1: Document that wake_q_add() contains a barrier.

b) wake_q_add() provides a memory barrier, ipc/mqueue.c relies on this.
Move the documentation to wake_q_add(), instead writing it in ipc/mqueue.c

Patch 2-4: Update the ipc code, especially add missing
   smp_mb__after_ctrl_dep().

c) [optional]
Clarify that smp_mb__{before,after}_atomic() are compatible with all
RMW atomic operations, not just the operations that do not return a value.

Patch 5: Documentation for smp_mb__{before,after}_atomic().

From my point of view, patch 1 is a prerequisite for patches 2-4:
If the barrier is not part of the documented API, then ipc should not rely
on it, i.e. then I would propose to replace the WRITE_ONCE with
smp_store_release().
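
For illustration, the two variants weighed here look roughly like this,
using the mqueue receiver as an example (a sketch, not a patch):

        /* variant 1: rely on the barrier documented inside wake_q_add() */
        wake_q_add(wake_q, receiver->task);
        WRITE_ONCE(receiver->state, STATE_READY);

        /* variant 2: self-contained release, independent of wake_q_add() */
        wake_q_add(wake_q, receiver->task);
        smp_store_release(&receiver->state, STATE_READY);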

Open issues:
- More testing. I did some tests, but doubt that the tests would be
  sufficient to show issues with regards to incorrect memory barriers.

- Should I add a "Fixes:" or "Cc:stable"? The only issues that I see are
  the missing smp_mb__after_ctrl_dep(), and WRITE_ONCE() vs.
  "ptr = NULL".

What do you think?

--
Manfred


[PATCH 1/5] wake_q: Cleanup + Documentation update.

2019-10-11 Thread Manfred Spraul
1) wake_q_add() contains a memory barrier, and callers such as
ipc/mqueue.c rely on this barrier.
Unfortunately, this is documented in ipc/mqueue.c, and not in the
description of wake_q_add().
Therefore: Update the documentation.
Removing/updating ipc/mqueue.c will happen with the next patch in the
series.

2) wake_up_q() relies on the memory barrier in try_to_wake_up().
Add a comment, to simplify searching.

3) wake_q.next is accessed without synchronization by wake_q_add(),
using cmpxchg_relaxed(), and by wake_up_q().
Therefore: Use WRITE_ONCE in wake_up_q(), to ensure that the
compiler doesn't perform any tricks.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 kernel/sched/core.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd05a378631a..2cf3f7321303 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -440,8 +440,11 @@ static bool __wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
  * @task: the task to queue for 'later' wakeup
  *
  * Queue a task for later wakeup, most likely by the wake_up_q() call in the
- * same context, _HOWEVER_ this is not guaranteed, the wakeup can come
- * instantly.
+ * same context, _HOWEVER_ this is not guaranteed. Especially, the wakeup
+ * may happen before the function returns.
+ *
+ * What is guaranteed is that there is a memory barrier before the wakeup,
+ * callers may rely on this barrier.
  *
  * This function must be used as-if it were wake_up_process(); IOW the task
  * must be ready to be woken at this location.
@@ -486,11 +489,14 @@ void wake_up_q(struct wake_q_head *head)
BUG_ON(!task);
/* Task can safely be re-inserted now: */
node = node->next;
-   task->wake_q.next = NULL;
+
+   WRITE_ONCE(task->wake_q.next, NULL);
 
/*
 * wake_up_process() executes a full barrier, which pairs with
 * the queueing in wake_q_add() so as not to miss wakeups.
+* The barrier is the smp_mb__after_spinlock() in
+* try_to_wake_up().
 */
wake_up_process(task);
put_task_struct(task);
-- 
2.21.0



Re: wake_q memory ordering

2019-10-11 Thread Manfred Spraul

Hi Davidlohr,

On 10/10/19 9:25 PM, Davidlohr Bueso wrote:

On Thu, 10 Oct 2019, Peter Zijlstra wrote:


On Thu, Oct 10, 2019 at 02:13:47PM +0200, Manfred Spraul wrote:


Therefore smp_mb__{before,after}_atomic() may be combined with
cmpxchg_relaxed, to form a full memory barrier, on all archs.


Just so.


We might want something like this?

8<-

From: Davidlohr Bueso 
Subject: [PATCH] Documentation/memory-barriers.txt: Mention 
smp_mb__{before,after}_atomic() and CAS


Explicitly mention possible usages to guarantee serialization even upon
failed cmpxchg (or similar) calls along with 
smp_mb__{before,after}_atomic().


Signed-off-by: Davidlohr Bueso 
---
Documentation/memory-barriers.txt | 12 
1 file changed, 12 insertions(+)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt

index 1adbb8a371c7..5d2873d4b442 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1890,6 +1890,18 @@ There are some more advanced barrier functions:
 This makes sure that the death mark on the object is perceived to 
be set

 *before* the reference counter is decremented.

+ Similarly, these barriers can be used to guarantee serialization 
for atomic
+ RMW calls on architectures which may not imply memory barriers 
upon failure.

+
+    obj->next = NULL;
+    smp_mb__before_atomic()
+    if (cmpxchg(&obj->ptr, NULL, val))
+    return;
+
+ This makes sure that the store to the next pointer always has 
smp_store_mb()
+ semantics. As such, smp_mb__{before,after}_atomic() calls allow 
optimizing

+ the barrier usage by finer grained serialization.
+
 See Documentation/atomic_{t,bitops}.txt for more information.


I don't know. The new documentation would not have answered my question 
(is it ok to combine smp_mb__before_atomic() with atomic_relaxed()?). 
And it copies content already present in atomic_t.txt.


Thus: I would prefer if the first sentence of the paragraph is replaced: 
The list of operations should end with "...", and it should match what 
is in atomic_t.txt


Ok?

--

    Manfred


From 8df60211228042672ba0cd89c3566c5145e8b203 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Fri, 11 Oct 2019 10:33:26 +0200
Subject: [PATCH 4/4] Documentation/memory-barriers.txt:  Clarify cmpxchg()

The documentation in memory-barriers.txt claims that
smp_mb__{before,after}_atomic() are for atomic ops that do not return a
value.

This is misleading and doesn't match the example in atomic_t.txt,
and e.g. smp_mb__before_atomic() may and is used together with
cmpxchg_relaxed() in the wake_q code.

The purpose of e.g. smp_mb__before_atomic() is to "upgrade" a following
RMW atomic operation to a full memory barrier.
The return code of the atomic operation has no impact, so all of the
following examples are valid:

1)
	smp_mb__before_atomic();
	atomic_add();

2)
	smp_mb__before_atomic();
	atomic_xchg_relaxed();

3)
	smp_mb__before_atomic();
	atomic_fetch_add_relaxed();

Invalid would be:
	smp_mb__before_atomic();
	atomic_set();

Signed-off-by: Manfred Spraul 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Peter Zijlstra 
---
 Documentation/memory-barriers.txt | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..52076b057400 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1873,12 +1873,13 @@ There are some more advanced barrier functions:
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
- These are for use with atomic (such as add, subtract, increment and
- decrement) functions that don't return a value, especially when used for
- reference counting.  These functions do not imply memory barriers.
+ These are for use with atomic RMW functions (such as add, subtract,
+ increment, decrement, failed conditional operations, ...) that do
+ not imply memory barriers, but where the code needs a memory barrier,
+ for example when used for reference counting.
 
- These are also used for atomic bitop functions that do not return a
- value (such as set_bit and clear_bit).
+ These are also used for atomic RMW bitop functions that do not imply a
+ memory barrier (such as set_bit and clear_bit).
 
  As an example, consider a piece of code that marks an object as being dead
  and then decrements the object's reference count:
-- 
2.21.0



Re: wake_q memory ordering

2019-10-10 Thread Manfred Spraul

Hi Peter,

On 10/10/19 1:42 PM, Peter Zijlstra wrote:

On Thu, Oct 10, 2019 at 12:41:11PM +0200, Manfred Spraul wrote:

Hi,

Waiman Long noticed that the memory barriers in sem_lock() are not really
documented, and while adding documentation, I ended up with one case where
I'm not certain about the wake_q code:

Questions:
- Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
   ordering guarantee?

Yep. Either the atomic instruction implies ordering (eg. x86 LOCK
prefix) or it doesn't (most RISC LL/SC), if it does,
smp_mb__{before,after}_atomic() are a NO-OP and the ordering is
unconditional, if it does not, then smp_mb__{before,after}_atomic() are
unconditional barriers.


And _relaxed() differs from "normal" cmpxchg only for LL/SC 
architectures, correct?


Therefore smp_mb__{before,after}_atomic() may be combined with 
cmpxchg_relaxed, to form a full memory barrier, on all archs.
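
As a concrete sketch, that is exactly the pattern in __wake_q_add():

        smp_mb__before_atomic();        /* upgrade the RMW below to a full barrier */
        if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
                return false;           /* already queued elsewhere */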


[...]



- Is it ok that wake_up_q just writes wake_q->next, shouldn't
   smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
   happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
   provides any ordering.

There is no such thing as store_acquire, it is either load_acquire or
store_release. But just like how we can write load-aquire like
load+smp_mb(), so too I suppose we could write store-acquire like
store+smp_mb(), and that is exactly what is there (through the implied
barrier of wake_up_process()).


Thanks for confirming my assumption:
The code is correct, due to the implied barrier inside wake_up_process().

[...]

rewritten:

start condition: A = 1; B = 0;

CPU1:
     B = 1;
     RELEASE, unlock LockX;

CPU2:
     lock LockX, ACQUIRE
     if (LOAD A == 1) return; /* using cmp_xchg_relaxed */

CPU2:
     A = 0;
     ACQUIRE, lock LockY
     smp_mb__after_spinlock();
     READ B

Question: is A = 1, B = 0 possible?

Your example is incomplete (there is no A=1 assignment for example), but
I'm thinking I can guess where that should go given the earlier text.


A=1 is listed as start condition. Way before, someone did wake_q_add().



I don't think this is broken.


Thanks.

--

    Manfred



wake_q memory ordering

2019-10-10 Thread Manfred Spraul

Hi,

Waiman Long noticed that the memory barriers in sem_lock() are not 
really documented, and while adding documentation, I ended up with one 
case where I'm not certain about the wake_q code:


Questions:
- Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
  ordering guarantee?
- Is it ok that wake_up_q just writes wake_q->next, shouldn't
  smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
  happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
  provides any ordering.

Example:
- CPU2 never touches lock a. It is just an unrelated wake_q user that also
  wants to wake up task 1234.
- I've noticed already that smp_store_acquire() doesn't exist.
  So smp_store_mb() is required. But from semantical point of view, we 
would

  need an ACQUIRE: the wake_up_process() must happen after cmpxchg().
- May wake_up_q() rely on the spinlocks/memory barriers in try_to_wake_up,
  or should the function be safe by itself?

CPU1: /current=1234, inside do_semtimedop()/
    g_wakee = current;
    current->state = TASK_INTERRUPTIBLE;
    spin_unlock(a);

CPU2: / arbitrary kernel thread that uses wake_q /
    wake_q_add(&unrelated_q, 1234);
    wake_up_q(&unrelated_q);
    <...ongoing>

CPU3: / do_semtimedop() + wake_up_sem_queue_prepare() /
    spin_lock(a);
    wake_q_add(,g_wakee);
    < within wake_q_add() >:
  smp_mb__before_atomic();
  if (unlikely(cmpxchg_relaxed(&node->next, 
NULL, WAKE_Q_TAIL)))

  return false; /* -> this happens */

CPU2:
    
    1234->wake_q.next = NULL; <<<<<<<<< Ok? Is 
store_acquire() missing? >>>>>>>>>>>>

    wake_up_process(1234);
    < within wake_up_process/try_to_wake_up():
    raw_spin_lock_irqsave()
    smp_mb__after_spinlock()
    if(1234->state = TASK_RUNNING) return;
 >


rewritten:

start condition: A = 1; B = 0;

CPU1:
    B = 1;
    RELEASE, unlock LockX;

CPU2:
    lock LockX, ACQUIRE
    if (LOAD A == 1) return; /* using cmp_xchg_relaxed */

CPU2:
    A = 0;
    ACQUIRE, lock LockY
    smp_mb__after_spinlock();
    READ B

Question: is A = 1, B = 0 possible?

--

    Manfred



Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker

2019-09-29 Thread Manfred Spraul

Hi Waiman,

I have now written the mail 3 times:
Twice I thought that I found a race, but during further analysis, it 
always turned out that the spin_lock() is sufficient.


First, to avoid any obvious things: Until the series with e.g. 
27d7be1801a4824e, there was a race inside sem_lock().


Thus it was possible that multiple threads were operating on the same 
semaphore array, with obviously arbitrary impact.


On 9/20/19 5:54 PM, Waiman Long wrote:

  
+		/*

+* A spurious wakeup at the right moment can cause race
+* between the to-be-woken task and the waker leading to
+* missed wakeup. Setting state back to TASK_INTERRUPTIBLE
+* before checking queue.status will ensure that the race
+* won't happen.
+*
+*  CPU0CPU1
+*
+*   wake_up_sem_queue_prepare():
+*  state = TASK_INTERRUPTIBLEstatus = error
+*  try_to_wake_up():
+*  smp_mb()  smp_mb()
+*  if (status == -EINTR) if (!(p->state & state))
+*schedule()goto out
+*/
+   set_current_state(TASK_INTERRUPTIBLE);
+


So the hypothesis is that we have a race due to the optimization 
within try_to_wake_up():

If the task's state is already TASK_RUNNING, then the wakeup is a nop.

Correct?

The waker wants to use:

    lock();
    set_conditions();
    unlock();

as the wake_q is a shared list, completely asynchronously this will happen:

    smp_mb();  ***1
    if (current->state == TASK_INTERRUPTIBLE) current->state = TASK_RUNNING;

The only guarantee is that this will happen after lock(), it may happen 
before set_conditions().


The task that goes to sleep uses:

    lock();
    check_conditions();
    __set_current_state();
    unlock();  ***2
    schedule();

You propose to change that to:

    lock();
    set_current_state();
    check_conditions();
    unlock();
    schedule();

I don't see a race anymore, and I don't see how the proposed change will 
help.
e.g.: __set_current_state() and smp_mb() have paired memory barriers 
***1 and ***2 above.


--

    Manfred



Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker

2019-09-26 Thread Manfred Spraul

Hi,
On 9/26/19 8:12 PM, Waiman Long wrote:

On 9/26/19 5:34 AM, Peter Zijlstra wrote:

On Fri, Sep 20, 2019 at 11:54:02AM -0400, Waiman Long wrote:

While looking at a customer bug report about potential missed wakeup in
the system V semaphore code, I spot a potential problem.  The fact that
semaphore waiter stays in TASK_RUNNING state while checking queue status
may lead to missed wakeup if a spurious wakeup happens in the right
moment as try_to_wake_up() will do nothing if the task state isn't right.

To eliminate this possibility, the task state is now reset to
TASK_INTERRUPTIBLE immediately after wakeup before checking the queue
status. This should eliminate the race condition on the interaction
between the queue status and the task state and fix the potential missed
wakeup problem.

You are obviously right, there is a huge race condition.

Bah, this code always makes my head hurt.

Yes, AFAICT the pattern it uses has been broken since 0a2b9d4c7967,
since that removed doing the actual wakeup from under the sem_lock(),
which is what it relies on.


Correct - I've overlooked that.

First, theory:

setting queue->status, reading queue->status, setting 
current->state=TASK_INTERRUPTIBLE are all under the correct spinlock.


(there is an opportunistic read of queue->status without locks, but it 
is retried when the lock got acquired)


setting current->state=RUNNING is outside of any lock.

So as far as current->state is concerned, the lock doesn't exist. And if 
the lock doesn't exist, we must follow the rules applicable for 
set_current_state().
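
For reference, the canonical lockless pattern (a sketch; CONDITION stands
for whatever the sleeper is waiting on):

        /* sleeper */
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE); /* includes smp_mb() */
                if (CONDITION)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);

        /* waker */
        CONDITION = 1;
        wake_up_process(p);     /* full barrier before reading p->state */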


I'll try to check the code this week.

And we should check the remaining wake-queue users, the logic is 
everywhere identical.



After having a second look at the code again, I probably misread the
code the first time around. In the sleeping path, there is a check of
queue.status and setting of task state both under the sem lock in the
sleeping path. So as long as setting of queue status is under lock, they
should synchronize properly.

It looks like queue status setting is under lock, but I can't use
lockdep to confirm that as the locking can be done by either the array
lock or in one of the spinlocks in the array. Are you aware of a way of
doing that?


For testing? Have you considered just always using the global lock?

(untested):

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -370,7 +370,7 @@ static inline int sem_lock(struct sem_array *sma, 
struct sembuf *sops,

    struct sem *sem;
    int idx;

-   if (nsops != 1) {
+   if (nsops != 1 || 1) {
    /* Complex operation - acquire a full lock */
    ipc_lock_object(&sma->sem_perm);



Anyway, I do think we need to add some comment to clarify the situation
to avoid future confusion.


Around line 190 is the comment that explains locking & memory ordering.

I have only documented the content of sem_undo and sem_array, but 
neither queue nor current->state :-(



--

    Manfred




Re: [PATCH v11 2/3] ipc: Conserve sequence numbers in ipcmni_extend mode

2019-03-10 Thread Manfred Spraul

On 2/27/19 9:30 PM, Waiman Long wrote:

On 11/20/2018 02:41 PM, Manfred Spraul wrote:

 From 6bbade73d21884258a995698f21ad3128df8e98a Mon Sep 17 00:00:00 2001
From: Manfred Spraul
Date: Sat, 29 Sep 2018 15:43:28 +0200
Subject: [PATCH 2/2] ipc/util.c: use idr_alloc_cyclic() for ipc allocations

A bit related to the patch that increases IPC_MNI, and
partially based on the mail fromwi...@infradead.org:

(User space) id reuse creates the risk of data corruption:

Process A: calls ipc function
Process A: sleeps just at the beginning of the syscall
Process B: Frees the ipc object (i.e.: calls ...ctl(IPC_RMID)
Process B: Creates a new ipc object (i.e.: calls ...get())

Process A: is woken up, and accesses the new object

To reduce the probability that the new and the old object have the
same id, the current implementation adds a sequence number to the
index of the object in the idr tree.

To further reduce the probability for a reuse, perform a cyclic
allocation, and increase the sequence number only when there is
a wrap-around. Unfortunately, idr_alloc_cyclic cannot be used,
because the sequence number must be increased when a wrap-around
occurs.

The patch cycles over at least RADIX_TREE_MAP_SIZE, i.e.
if there is only a small number of objects, the accesses
continue to be direct.

Signed-off-by: Manfred Spraul
---
  ipc/util.c | 48 
  1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 07ae117ccdc0..fa7b8fa7a14c 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -216,10 +216,49 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 */
  
  	if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */

-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
-   idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   int idx_max;
+
+   /*
+* If a user space visible id is reused, then this creates a
+* risk for data corruption. To reduce the probability that
+* a number is reused, three approaches are used:
+* 1) the idr index is allocated cyclically.
+*2) the user space id is built by concatenating the
+*internal idr index with a sequence number.
+* 3) The sequence number is only increased when the index
+*wraps around.
+* Note that this code cannot use idr_alloc_cyclic:
+* new->seq must be set before the entry is inserted in the
+* idr.


I don't think that is true. The IDR code just needs to associate a 
pointer to the given ID. It is not going to access anything inside. So 
we don't need to set the seq number first before calling idr_alloc().



We must, sorry - there is even a CVE associated with that bug:

CVE-2015-7613, 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b9a532277938798b53178d5a66af6e2915cb27cf


The problem is not the IDR code, the problem is that 
ipc_obtain_object_check() calls ipc_checkid(), and ipc_checkid() 
accesses ipcp->seq.


And since the ipc_checkid() is called before acquiring any locks, 
everything must be fully initialized before idr_alloc().
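
To make the dependency explicit - roughly, simplified from ipc/util.h and
ipc/util.c from memory, so the exact names and details may differ:

        /* id handed to user space, ipc_buildid() */
        id = SEQ_MULTIPLIER * new->seq + idx;   /* idx = slot in the idr */

        /* lockless lookup, ipc_obtain_object_check(), simplified */
        ipcp = idr_find(&ids->ipcs_idr, idx);
        if (ipcid_to_seqx(id) != ipcp->seq)     /* ipc_checkid() */
                return ERR_PTR(-EINVAL);        /* stale id */

i.e. ipcp->seq is read as soon as the object is visible in the idr, which is
why new->seq must be assigned before idr_alloc() publishes the object.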



+*/
+   idx_max = ids->in_use*2;
+   if (idx_max < RADIX_TREE_MAP_SIZE)
+   idx_max = RADIX_TREE_MAP_SIZE;
+   if (idx_max > ipc_mni)
+   idx_max = ipc_mni;
+
+   if (ids->ipcs_idr.idr_next <= idx_max) {
+   new->seq = ids->seq;
+   idx = idr_alloc(&ids->ipcs_idr, new,
+   ids->ipcs_idr.idr_next,
+   idx_max, GFP_NOWAIT);
+   }
+
+   if ((idx == -ENOSPC) && (ids->ipcs_idr.idr_next > 0)) {
+   /*
+* A wrap around occurred.
+* Increase ids->seq, update new->seq
+*/
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   new->seq = ids->seq;
+
+   idx = idr_alloc(&ids->ipcs_idr, new, 0, idx_max,
+   GFP_NOWAIT);
+   }
+   if (idx >= 0)
+   ids->ipcs_idr.idr_next = idx+1;


This code has dependence on the internal implementation of the IDR 
code. So if the IDR code is changed and the one who does it forgets to 
update the IPC code, we may have a problem. Using idr_alloc_cyclic() 
for all will likely increase memory footprint w

Re: general protection fault in put_pid

2019-01-07 Thread Manfred Spraul

On 1/3/19 11:18 PM, Shakeel Butt wrote:

Hi Manfred,

On Sun, Dec 23, 2018 at 4:26 AM Manfred Spraul  wrote:

Hello Dmitry,

On 12/23/18 10:57 AM, Dmitry Vyukov wrote:

I can reproduce this infinite memory consumption with the C program:
https://gist.githubusercontent.com/dvyukov/03ec54b3429ade16fa07bf8b2379aff3/raw/ae4f654e279810de2505e8fa41b73dc1d8e6/gistfile1.txt

But this is working as intended, right? It just creates infinite
number of large semaphore sets, which reasonably consumes infinite
amount of memory.
Except that it also violates the memcg bound and a process can have
effectively unlimited amount of such "drum memory" in semaphores.

Yes, this is as intended:

If you call semget(), then you can use memory, up to the limits in
/proc/sys/kernel/sem.

Memcg is not taken into account, an admin must set /proc/sys/kernel/sem.

The default are "infinite amount of memory allowed", as this is the most
sane default: We had a logic that tried to autotune (i.e.: a new
namespace "inherits" a fraction of the parent namespaces memory limits),
but this we more or less always wrong.



What's the disadvantage of setting the limits in /proc/sys/kernel/sem
high and let the task's memcg limits the number of semaphore a process
can create? Please note that the memory underlying shmget and msgget
is already accounted to memcg.


Nothing, it is just a question of implementing it.

I'll try to look at it.

--

    Manfred



Re: [PATCH] Revert "can: dev: __can_get_echo_skb(): print error message, if trying to echo non existing skb"

2019-01-07 Thread Manfred Schlaegl


Manfred Schlaegl | Leitung Entwicklung Linz 

GINZINGER ELECTRONIC SYSTEMS GMBH

Tel.: +43 7723 5422 153
Mobil: +43 676 841 208 253
Mail: manfred.schla...@ginzinger.com
Web: www.ginzinger.com




On 04.01.19 16:23, Marc Kleine-Budde wrote:
> On 12/19/18 7:39 PM, Manfred Schlaegl wrote:
>> This reverts commit 7da11ba5c5066dadc2e96835a6233d56d7b7764a.
>>
>> After introduction of this change we encountered following new error
>> message on various i.MX plattforms (flexcan)
>> flexcan 53fc8000.can can0: __can_get_echo_skb: BUG! Trying to echo non
>> existing skb: can_priv::echo_skb[0]
> 
> Doh! I should have tested more extensive. Sorry.
> 
>> The introduction of the message was a mistake because
>> priv->echo_skb[idx] = NULL is a perfectly valid in following case:
>> If CAN_RAW_LOOPBACK is disabled (setsockopt) in applications, the
>> pkt_type of the tx skb's given to can_put_echo_skb is set to
>> PACKET_LOOPBACK. In this case can_put_echo_skb will not set
>> priv->echo_skb[idx]. It is therefore kept NULL.
>>
>> (As additional argument for revert: The order of check and usage of idx
>> was changed. idx is used to access an array element before checking it's
>> boundaries)
>>
>> Signed-off-by: Manfred Schlaegl 
> 
> Applied to linux-can.

Great, thanks!

> 
> Tnx,
> Marc
> 








Re: general protection fault in put_pid

2018-12-23 Thread Manfred Spraul
eak, it happens all the time :-(

I must look at what is wrong.

2) regarding the crash:

What differs under oom pressure?

- kvmalloc can fall back to vmalloc()

- the 2nd or 3rd of multiple allocations can fail, and that triggers a 
rare codepath/race condition.


- rcu callback can happen earlier than expected


So far, I didn't notice anything unexpected :-(


--

    Manfred



Re: general protection fault in put_pid

2018-12-23 Thread Manfred Spraul

Hello Dmitry,

On 12/23/18 10:57 AM, Dmitry Vyukov wrote:


I can reproduce this infinite memory consumption with the C program:
https://gist.githubusercontent.com/dvyukov/03ec54b3429ade16fa07bf8b2379aff3/raw/ae4f654e279810de2505e8fa41b73dc1d8e6/gistfile1.txt

But this is working as intended, right? It just creates infinite
number of large semaphore sets, which reasonably consumes infinite
amount of memory.
Except that it also violates the memcg bound and a process can have
effectively unlimited amount of such "drum memory" in semaphores.


Yes, this is as intended:

If you call semget(), then you can use memory up to the limits in
/proc/sys/kernel/sem.
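
A minimal user-space sketch of that pattern (not the actual gist linked
above; capped at 100 sets here so it stays harmless):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
	int i;

	/* 0x4000 semaphores per set, matching the semget$private(..., 0x4000, ...)
	 * calls in the syzbot reports; the sets are intentionally not removed. */
	for (i = 0; i < 100; i++) {
		if (semget(IPC_PRIVATE, 0x4000, IPC_CREAT | 0600) < 0) {
			perror("semget");
			break;
		}
	}
	printf("created %d semaphore sets, none removed\n", i);
	return 0;
}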


Memcg is not taken into account, an admin must set /proc/sys/kernel/sem.

The default is "infinite amount of memory allowed", as this is the most
sane default: We had logic that tried to autotune (i.e.: a new
namespace "inherits" a fraction of the parent namespace's memory limits),
but this was more or less always wrong.



--

    Manfred



Re: general protection fault in put_pid

2018-12-22 Thread Manfred Spraul

Hi Dmitry,

On 12/20/18 4:36 PM, Dmitry Vyukov wrote:

On Wed, Dec 19, 2018 at 10:04 AM Manfred Spraul
 wrote:

Hello Dmitry,

On 12/12/18 11:55 AM, Dmitry Vyukov wrote:

On Tue, Dec 11, 2018 at 9:23 PM syzbot
 wrote:

Hello,

syzbot found the following crash on:

HEAD commit:f5d582777bcb Merge branch 'for-linus' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=135bc54740
kernel config:  https://syzkaller.appspot.com/x/.config?x=c8970c89a0efbb23
dashboard link: https://syzkaller.appspot.com/bug?extid=1145ec2e23165570c3ac
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16803afb40

+Manfred, this looks similar to the other few crashes related to
semget$private(0x0, 0x4000, 0x3f) that you looked at.

I found one unexpected (incorrect?) locking, see the attached patch.

But I doubt that this is the root cause of the crashes.


But why? These one-off sporadic crashes reported by syzbot look
exactly like a subtle race, and your patch touches sem_exit_ns, which is
involved in all reports.
So if you don't spot anything else, I would say close these 3 reports
with this patch (I see you already included Reported-by tags which is
great!) and then wait for syzbot reaction. Since we got 3 of them, if
it's still not fixed I would expect that syzbot will be able to
retrigger this later again.


As I wrote, unless semop() is used, sma->use_global_lock is always 9 and 
nothing can happen.


Every single-operation semop() reduces use_global_lock by one, i.e. a
single semop call as done here cannot trigger the bug:


https://syzkaller.appspot.com/text?tag=ReproSyz&x=16803afb40


But, one more finding:

https://syzkaller.appspot.com/bug?extid=1145ec2e23165570c3ac

https://syzkaller.appspot.com/text?tag=CrashLog&x=109ecf6e40

The log file contain 1080 lines like these:


semget$private(..., 0x4003, ...)

semget$private(..., 0x4006, ...)

semget$private(..., 0x4007, ...)


It ends up as kmalloc(128*0x400x), i.e. slightly more than 2 MB, an
allocation served from the 4 MB kmalloc cache:



[ 1201.210245] kmalloc-4194304  4698112KB4698112KB

i.e.: 4698112 KB / 4096 KB = 1147 4 MB kmalloc blocks --> are we leaking
nearly 100% of the semaphore arrays??



This one looks similar:

https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909

except that the array sizes are mixed, and thus there are kmalloc-1M and 
kmalloc-2M as well.


(and I did not count the number of semget calls)


The test apps use unshare(CLONE_NEWNS) and unshare(CLONE_NEWIPC), correct?

I.e. no CLONE_NEWUSER.

https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L1523


--

    Manfred




[PATCH] Revert "can: dev: __can_get_echo_skb(): print error message, if trying to echo non existing skb"

2018-12-19 Thread Manfred Schlaegl
This reverts commit 7da11ba5c5066dadc2e96835a6233d56d7b7764a.

After the introduction of this change we encountered the following new error
message on various i.MX platforms (flexcan):
flexcan 53fc8000.can can0: __can_get_echo_skb: BUG! Trying to echo non
existing skb: can_priv::echo_skb[0]

The introduction of the message was a mistake because
priv->echo_skb[idx] = NULL is perfectly valid in the following case:
If CAN_RAW_LOOPBACK is disabled (setsockopt) in applications, the
pkt_type of the tx skb's given to can_put_echo_skb is set to
PACKET_LOOPBACK. In this case can_put_echo_skb will not set
priv->echo_skb[idx]. It is therefore kept NULL.
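
For reference, the application-side setting that leads to this case looks
roughly as follows (sketch only, not from the original report; error
handling trimmed, the helper name and interface are placeholders):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/can.h>
#include <linux/can/raw.h>

static int open_can_without_loopback(const char *ifname)
{
	struct sockaddr_can addr = { .can_family = AF_CAN };
	struct ifreq ifr;
	int loopback = 0;	/* disable local loopback of sent frames */
	int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);

	if (s < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ioctl(s, SIOCGIFINDEX, &ifr);
	addr.can_ifindex = ifr.ifr_ifindex;

	setsockopt(s, SOL_CAN_RAW, CAN_RAW_LOOPBACK,
		   &loopback, sizeof(loopback));
	bind(s, (struct sockaddr *)&addr, sizeof(addr));
	return s;
}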

(As additional argument for revert: The order of check and usage of idx
was changed. idx is used to access an array element before checking its
boundaries)

Signed-off-by: Manfred Schlaegl 
---
 drivers/net/can/dev.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
index 3b3f88ffab53..c05e4d50d43d 100644
--- a/drivers/net/can/dev.c
+++ b/drivers/net/can/dev.c
@@ -480,8 +480,6 @@ EXPORT_SYMBOL_GPL(can_put_echo_skb);
 struct sk_buff *__can_get_echo_skb(struct net_device *dev, unsigned int idx, 
u8 *len_ptr)
 {
struct can_priv *priv = netdev_priv(dev);
-   struct sk_buff *skb = priv->echo_skb[idx];
-   struct canfd_frame *cf;
 
if (idx >= priv->echo_skb_max) {
netdev_err(dev, "%s: BUG! Trying to access can_priv::echo_skb 
out of bounds (%u/max %u)\n",
@@ -489,20 +487,21 @@ struct sk_buff *__can_get_echo_skb(struct net_device 
*dev, unsigned int idx, u8
return NULL;
}
 
-   if (!skb) {
-   netdev_err(dev, "%s: BUG! Trying to echo non existing skb: 
can_priv::echo_skb[%u]\n",
-  __func__, idx);
-   return NULL;
-   }
+   if (priv->echo_skb[idx]) {
+   /* Using "struct canfd_frame::len" for the frame
+* length is supported on both CAN and CANFD frames.
+*/
+   struct sk_buff *skb = priv->echo_skb[idx];
+   struct canfd_frame *cf = (struct canfd_frame *)skb->data;
+   u8 len = cf->len;
 
-   /* Using "struct canfd_frame::len" for the frame
-* length is supported on both CAN and CANFD frames.
-*/
-   cf = (struct canfd_frame *)skb->data;
-   *len_ptr = cf->len;
-   priv->echo_skb[idx] = NULL;
+   *len_ptr = len;
+   priv->echo_skb[idx] = NULL;
 
-   return skb;
+   return skb;
+   }
+
+   return NULL;
 }
 
 /*
-- 
2.11.0





Re: general protection fault in put_pid

2018-12-19 Thread Manfred Spraul

Hello Dmitry,

On 12/12/18 11:55 AM, Dmitry Vyukov wrote:

On Tue, Dec 11, 2018 at 9:23 PM syzbot
 wrote:

Hello,

syzbot found the following crash on:

HEAD commit:f5d582777bcb Merge branch 'for-linus' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=135bc54740
kernel config:  https://syzkaller.appspot.com/x/.config?x=c8970c89a0efbb23
dashboard link: https://syzkaller.appspot.com/bug?extid=1145ec2e23165570c3ac
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16803afb40

+Manfred, this looks similar to the other few crashes related to
semget$private(0x0, 0x4000, 0x3f) that you looked at.


I found one unexpected (incorrect?) locking, see the attached patch.

But I doubt that this is the root cause of the crashes.

Any remarks on the patch?

I would continue to search, and then send a series with all findings.

--

    Manfred

>From 733e888993b71fb3c139f71de61534bc603a2bcb Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Wed, 19 Dec 2018 09:26:48 +0100
Subject: [PATCH] ipc/sem.c: ensure proper locking during namespace teardown

free_ipcs() only calls ipc_lock_object() before calling the free callback.

This means:
- There is no exclusion against parallel simple semop() calls.
- sma->use_global_lock may underflow (i.e. jump to UINT_MAX) when
  freeary() calls sem_unlock(,,-1).

The patch fixes that, by adding complexmode_enter() before calling
freeary().

There are multiple syzbot crashes in this code area, but I don't see yet
how a missing complexmode_enter() may cause a crash:
- 1) simple semop() calls are not used by these syzbot tests,
  and 2) we are in namespace teardown, no one may run in parallel.

- 1) freeary() is the last call (except parallel operations, which
  are impossible due to namespace teardown)
  and 2) the underflow of use_global_lock merely delays switching to
  parallel simple semop handling for the next UINT_MAX semop() calls.

Thus I think the patch is "only" a cleanup, and does not fix
the observed crashes.

Signed-off-by: Manfred Spraul 
Reported-by: syzbot+1145ec2e23165570c...@syzkaller.appspotmail.com
Reported-by: syzbot+c92d3646e35bc5d1a...@syzkaller.appspotmail.com
Reported-by: syzbot+9d8b6fa6ee7636f35...@syzkaller.appspotmail.com
Cc: dvyu...@google.com
Cc: dbu...@suse.de
Cc: Andrew Morton 
---
 ipc/sem.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 745dc6187e84..8ccacd11fb15 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -184,6 +184,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
  */
 #define USE_GLOBAL_LOCK_HYSTERESIS	10
 
+static void complexmode_enter(struct sem_array *sma);
+static void complexmode_tryleave(struct sem_array *sma);
+
 /*
  * Locking:
  * a) global sem_lock() for read/write
@@ -232,9 +235,24 @@ void sem_init_ns(struct ipc_namespace *ns)
 }
 
 #ifdef CONFIG_IPC_NS
+
+static void freeary_lock(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+{
+	struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm);
+
+	/*
+	 * free_ipcs() isn't aware of sem_lock(), it calls ipc_lock_object()
+	 * directly. In order to stay compatible with sem_lock(), we must
+	 * upgrade from "simple" ipc_lock_object() to sem_lock(,,-1).
+	 */
+	complexmode_enter(sma);
+
+	freeary(ns, ipcp);
+}
+
 void sem_exit_ns(struct ipc_namespace *ns)
 {
-	free_ipcs(ns, &sem_ids(ns), freeary);
+	free_ipcs(ns, &sem_ids(ns), freeary_lock);
 	idr_destroy(&ns->ids[IPC_SEM_IDS].ipcs_idr);
 	rhashtable_destroy(&ns->ids[IPC_SEM_IDS].key_ht);
 }
@@ -374,7 +392,9 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Complex operation - acquire a full lock */
 		ipc_lock_object(&sma->sem_perm);
 
-		/* Prevent parallel simple ops */
+		/* Prevent parallel simple ops.
+		 * This must be identical to freeary_lock().
+		 */
 		complexmode_enter(sma);
 		return SEM_GLOBAL_LOCK;
 	}
-- 
2.17.2



Re: BUG: corrupted list in freeary

2018-12-01 Thread Manfred Spraul

Hi Dmitry,

On 11/30/18 6:58 PM, Dmitry Vyukov wrote:

On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
 wrote:

Hello together,

On 11/27/18 4:52 PM, syzbot wrote:

Hello,

syzbot found the following crash on:

HEAD commit:e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a340

[...]

Isn't this a kernel stack overrun?

RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
used up everything.

I don't have an exact answer, that's just the kernel output that we captured
from the console.

FWIW with KASAN stacks are 16K:
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
Ok, thanks. And stack overrun detection is enabled as well -> a real 
stack overrun is unlikely.

Well, generally everything except for kernel crashes is expected.

We actually sandbox it with memcg quite aggressively:
https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
But it seems to manage to either break the limits, or cause some
massive memory leaks. The nature of that is yet unknown.


Is it possible to start from that side?

Are there other syzkaller runs where the OOM killer triggers that much?




- Which stress tests are enabled? By chance, I found:

[  433.304586] FAULT_INJECTION: forcing a failure.^M
[  433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
times 0^M
[  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
#348^M
[  433.323841] Hardware name: Google Google Compute Engine/Google Compute
Engine, BIOS Google 01/01/2011^M

I need some more background, then I can review the code.

What exactly do you mean by "Which stress tests"?
Fault injection is enabled. Also random workload from userspace.



Right now, I would put it into my "unknown syzkaller finding" folder.


One more idea: Are there further syzkaller runs that end up with
0x01 in a pointer?


From what I see, the sysv sem code that is used is trivial, I don't see 
that it could cause the observed behavior.



--

    Manfred



RE: [patch 8/9] posix-clocks: Remove license boiler plate

2018-11-22 Thread Manfred Rudigier
Acked-by: Manfred Rudigier 

Regards,
Manfred

> -Original Message-
> From: Richard Cochran 
> Sent: Thursday, November 1, 2018 3:12 AM
> To: Thomas Gleixner 
> Cc: LKML ; Cristian Marinescu
> ; Manfred Rudigier
> 
> Subject: Re: [patch 8/9] posix-clocks: Remove license boiler plate
> 
> On Wed, Oct 31, 2018 at 07:21:15PM +0100, Thomas Gleixner wrote:
> > The SPDX identifier defines the license of the file already. No need
> > for the boilerplate.
> >
> > Signed-off-by: Thomas Gleixner 
> > Cc: Richard Cochran 
> > ---
> >
> > @Richard: This file is (C) OMICRON, but I don't have a contact
> > anymore. That Cochran dude is no longer working there :)
> >
> > Do you have a contact? If so, can you please reply to this mail and Cc
> > him/her.
> 
> @Cristian and Manfred:
> 
> We want to replace the license boilerplate with SPDX tags.  The file,
> kernel/time/posix-clock.c, is copyrighted by omicron, and so we need your
> okay.  All that is needed is a reply to this email with an omicron Acked-by 
> tag,
> like this:
> 
> Acked-by: Richard Cochran 
> 
> Thanks,
> Richard
> 
> >
> > ---
> >  kernel/time/posix-clock.c |   14 --
> >  1 file changed, 14 deletions(-)
> >
> > --- a/kernel/time/posix-clock.c
> > +++ b/kernel/time/posix-clock.c
> > @@ -3,20 +3,6 @@
> >   * Support for dynamic clock devices
> >   *
> >   * Copyright (C) 2010 OMICRON electronics GmbH
> > - *
> > - *  This program is free software; you can redistribute it and/or
> > modify
> > - *  it under the terms of the GNU General Public License as published
> > by
> > - *  the Free Software Foundation; either version 2 of the License, or
> > - *  (at your option) any later version.
> > - *
> > - *  This program is distributed in the hope that it will be useful,
> > - *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> > - *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > - *  GNU General Public License for more details.
> > - *
> > - *  You should have received a copy of the GNU General Public License
> > - *  along with this program; if not, write to the Free Software
> > - *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> >   */
> >  #include 
> >  #include 
> >
> >


Re: [RFC, PATCH] ipc/util.c: use idr_alloc_cyclic() for ipc allocations

2018-10-03 Thread Manfred Spraul

On 10/2/18 8:27 PM, Waiman Long wrote:

On 10/02/2018 12:19 PM, Manfred Spraul wrote:

A bit related to the patch series that increases IPC_MNI:

(User space) id reuse creates the risk of data corruption:

Process A: calls ipc function
Process A: sleeps just at the beginning of the syscall
Process B: Frees the ipc object (i.e.: calls ...ctl(IPC_RMID)
Process B: Creates a new ipc object (i.e.: calls ...get())

Process A: is woken up, and accesses the new object

To reduce the probability that the new and the old object
have the same id, the current implementation adds a
sequence number to the index of the object in the idr tree.

To further reduce the probability for a reuse, switch from
idr_alloc to idr_alloc_cyclic.

The patch cycles over at least RADIX_TREE_MAP_SIZE, i.e.
if there is only a small number of objects, the accesses
continue to be direct.

As an option, this could be made dependent on the extended
mode: In extended mode, cycle over e.g. at least 16k ids.

Signed-off-by: Manfred Spraul 
---

Open questions:
- Is there a significant performance advantage, especially
   there are many ipc ids?
- Over how many ids should the code cycle always?
- Further review remarks?

  ipc/util.c | 22 +-
  1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/ipc/util.c b/ipc/util.c
index 0af05752969f..6f83841f6761 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -216,10 +216,30 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 */
  
  	if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */

+   int idr_max;
+
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-   idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+
+   /*
+* If a user space visible id is reused, then this creates a
+* risk for data corruption. To reduce the probability that
+* a number is reduced, two approaches are used:

   reduced -> reused?

Of course.



+* 1) the idr index is allocated cyclically.
+* 2) the use space id is build by concatenating the
+*internal idr index with a sequence number
+* To avoid that both numbers have the same cycle time, try
+* to set the size for the cyclic alloc to an odd number.
+*/
+   idr_max = ids->in_use*2+1;
+   if (idr_max < RADIX_TREE_MAP_SIZE-1)
+   idr_max = RADIX_TREE_MAP_SIZE-1;
+   if (idr_max > IPCMNI)
+   idr_max = IPCMNI;
+
+   idx = idr_alloc_cyclic(&ids->ipcs_idr, new, 0, idr_max,
+   GFP_NOWAIT);
} else {
new->seq = ipcid_to_seqx(next_id);
idx = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),


Each of IPC components have their own sysctl parameters limiting the max
number of objects that can be allocated. With cyclic allocation, you
will have to make sure that idr_max is not larger than the corresponding
IPC sysctl parameters. That may require moving the limits to the
corresponding ipc_ids structure so that it can be used in ipc_idr_alloc().


First, I would disagree:

the sysctl limits specify how many objects can exist.

idr_max is the maximum index in the radix tree that can exist. There is 
a hard limit of IPCMNI, but that's it.



But:

The name is wrong, I will rename the variable to idx_max


What is the point of comparing idr_max against RADIX_TREE_MAP_SIZE-1? Is
it for performance reasons?


Let's assume you have only 1 ipc object, and you alloc/release that object.

At alloc time, ids->in_use is 0 -> idr_max 1 -> every object will end up 
with idx=0.


This would defeat the whole purpose of using a cyclic alloc.

Thus: cycle over at least 63 ids -> 5 additional bits to avoid collisions.


--

    Manfred



[RFC, PATCH] ipc/util.c: use idr_alloc_cyclic() for ipc allocations

2018-10-02 Thread Manfred Spraul
A bit related to the patch series that increases IPC_MNI:

(User space) id reuse creates the risk of data corruption:

Process A: calls ipc function
Process A: sleeps just at the beginning of the syscall
Process B: Frees the ipc object (i.e.: calls ...ctl(IPC_RMID)
Process B: Creates a new ipc object (i.e.: calls ...get())

Process A: is woken up, and accesses the new object

To reduce the probability that the new and the old object
have the same id, the current implementation adds a
sequence number to the index of the object in the idr tree.

To further reduce the probability for a reuse, switch from
idr_alloc to idr_alloc_cyclic.

The patch cycles over at least RADIX_TREE_MAP_SIZE, i.e.
if there is only a small number of objects, the accesses
continue to be direct.

As an option, this could be made dependent on the extended
mode: In extended mode, cycle over e.g. at least 16k ids.

Signed-off-by: Manfred Spraul 
---

Open questions:
- Is there a significant performance advantage, especially
  there are many ipc ids?
- Over how many ids should the code cycle always?
- Further review remarks?

 ipc/util.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/ipc/util.c b/ipc/util.c
index 0af05752969f..6f83841f6761 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -216,10 +216,30 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 */
 
if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */
+   int idr_max;
+
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-   idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+
+   /*
+* If a user space visible id is reused, then this creates a
+* risk for data corruption. To reduce the probability that
+* a number is reduced, two approaches are used:
+* 1) the idr index is allocated cyclically.
+* 2) the use space id is build by concatenating the
+*internal idr index with a sequence number
+* To avoid that both numbers have the same cycle time, try
+* to set the size for the cyclic alloc to an odd number.
+*/
+   idr_max = ids->in_use*2+1;
+   if (idr_max < RADIX_TREE_MAP_SIZE-1)
+   idr_max = RADIX_TREE_MAP_SIZE-1;
+   if (idr_max > IPCMNI)
+   idr_max = IPCMNI;
+
+   idx = idr_alloc_cyclic(&ids->ipcs_idr, new, 0, idr_max,
+   GFP_NOWAIT);
} else {
new->seq = ipcid_to_seqx(next_id);
idx = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
-- 
2.17.1



Re: [PATCH -next] ipc/sem: prevent queue.status tearing in semop

2018-07-17 Thread Manfred Spraul

Hello Davidlohr,

On 07/17/2018 07:26 AM, Davidlohr Bueso wrote:

In order for protection against load/store tearing to work, _all_
accesses to the variable in question need to be done through the
READ_ONCE() and WRITE_ONCE() macros. Ensure everyone does so for the
q->status variable for semtimedop().

What is the background of the above rule?

sma->use_global_lock is sometimes used with smp_load_acquire(), 
sometimes without.

So far, I assumed that this is safe.

The same applies for nf_conntrack_locks_all, in nf_conntrack_all_lock()
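
For readers following along, the rule being quoted above boils down to
roughly this pairing (schematic only, not the actual ipc/sem.c code):

	/* waker: publish the result for the sleeping task */
	WRITE_ONCE(q->status, error);

	/* sleeper: pick up the result without taking the sem lock */
	error = READ_ONCE(q->status);

i.e. once one side of such a lockless handshake uses WRITE_ONCE(), the
other side is expected to use READ_ONCE() so the compiler can neither
tear nor re-load the access.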

Signed-off-by: Davidlohr Bueso 
---
  ipc/sem.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 6cbbf34a44ac..ccab4e51d351 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2125,7 +2125,7 @@ static long do_semtimedop(int semid, struct sembuf __user 
*tsops,
}
  
  	do {

-   queue.status = -EINTR;
+   WRITE_ONCE(queue.status, -EINTR);
queue.sleeper = current;
  
  		__set_current_state(TASK_INTERRUPTIBLE);





[PATCH 01/12] ipc: ipc: compute kern_ipc_perm.id under the ipc lock.

2018-07-12 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.id after having called
ipc_idr_alloc().

Thus a parallel semctl() or msgctl() that uses e.g. MSG_STAT may use
this uninitialized value as the return code.

The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with kern_ipc_perm.seq

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Reviewed-by: Davidlohr Bueso 
---
 ipc/msg.c | 19 ++-
 ipc/sem.c | 18 +-
 ipc/shm.c | 19 ++-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 3b6545302598..49358f474fc9 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -491,7 +491,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 int cmd, struct msqid64_ds *p)
 {
struct msg_queue *msq;
-   int id = 0;
int err;
 
memset(p, 0, sizeof(*p));
@@ -503,7 +502,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
err = PTR_ERR(msq);
goto out_unlock;
}
-   id = msq->q_perm.id;
} else { /* IPC_STAT */
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
@@ -548,10 +546,21 @@ static int msgctl_stat(struct ipc_namespace *ns, int 
msqid,
p->msg_lspid  = pid_vnr(msq->q_lspid);
p->msg_lrpid  = pid_vnr(msq->q_lrpid);
 
-   ipc_unlock_object(&msq->q_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* MSG_STAT and MSG_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence number
+*/
+   err = msq->q_perm.id;
+   }
 
+   ipc_unlock_object(&msq->q_perm);
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/sem.c b/ipc/sem.c
index 5af1943ad782..d89ce69b2613 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1222,7 +1222,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 {
struct sem_array *sma;
time64_t semotime;
-   int id = 0;
int err;
 
memset(semid64, 0, sizeof(*semid64));
@@ -1234,7 +1233,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
err = PTR_ERR(sma);
goto out_unlock;
}
-   id = sma->sem_perm.id;
} else { /* IPC_STAT */
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
@@ -1274,10 +1272,20 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 #endif
semid64->sem_nsems = sma->sem_nsems;
 
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SEM_STAT and SEM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence number
+*/
+   err = sma->sem_perm.id;
+   }
ipc_unlock_object(&sma->sem_perm);
-   rcu_read_unlock();
-   return id;
-
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/shm.c b/ipc/shm.c
index 051a3e1fb8df..f3bae59bed08 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -949,7 +949,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
int cmd, struct shmid64_ds *tbuf)
 {
struct shmid_kernel *shp;
-   int id = 0;
int err;
 
memset(tbuf, 0, sizeof(*tbuf));
@@ -961,7 +960,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
err = PTR_ERR(shp);
goto out_unlock;
}
-   id = shp->shm_perm.id;
} else { /* IPC_STAT */
shp = shm_obtain_object_check(ns, shmid);
if (IS_ERR(shp)) {
@@ -1011,10 +1009,21 @@ static int shmctl_stat(struct ipc_namespace *ns, int 
shmid,
tbuf->shm_lpid  = pid_vnr(shp->shm_lprid);
tbuf->shm_nattch = shp->shm_nattch;
 
-   ipc_unlock_object(&shp->shm_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SHM_STAT and SHM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence number
+*/
+   err = shp->shm

[PATCH 03/12] ipc/util.c: Use ipc_rcu_putref() for failues in ipc_addid()

2018-07-12 Thread Manfred Spraul
ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
  because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
  because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early,
and by modifying all callers.

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with reading kern_ipc_perm.seq,
here both read and write to already released memory could happen.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
---
 ipc/msg.c  |  2 +-
 ipc/sem.c  |  2 +-
 ipc/shm.c  |  2 ++
 ipc/util.c | 10 --
 4 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 49358f474fc9..38119c1f0da3 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -162,7 +162,7 @@ static int newque(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks msq upon success. */
retval = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
if (retval < 0) {
-   call_rcu(&msq->q_perm.rcu, msg_rcu_free);
+   ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
return retval;
}
 
diff --git a/ipc/sem.c b/ipc/sem.c
index d89ce69b2613..8a0a1eb05765 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -556,7 +556,7 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks sma upon success. */
retval = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
if (retval < 0) {
-   call_rcu(&sma->sem_perm.rcu, sem_rcu_free);
+   ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
return retval;
}
ns->used_sems += nsems;
diff --git a/ipc/shm.c b/ipc/shm.c
index f3bae59bed08..92d71abe9e8f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -671,6 +671,8 @@ static int newseg(struct ipc_namespace *ns, struct 
ipc_params *params)
if (is_file_hugepages(file) && shp->mlock_user)
user_shm_unlock(size, shp->mlock_user);
fput(file);
+   ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+   return error;
 no_file:
call_rcu(&shp->shm_perm.rcu, shm_rcu_free);
return error;
diff --git a/ipc/util.c b/ipc/util.c
index 4998f8fa8ce0..f3447911c81e 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -250,7 +250,9 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, struct 
kern_ipc_perm *new)
  * Add an entry 'new' to the ipc ids idr. The permissions object is
  * initialised and the first free entry is set up and the id assigned
  * is returned. The 'new' entry is returned in a locked state on success.
+ *
  * On failure the entry is not locked and a negative err-code is returned.
+ * The caller must use ipc_rcu_putref() to free the identifier.
  *
  * Called with writer ipc_ids.rwsem held.
  */
@@ -260,6 +262,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
kgid_t egid;
int idx, err;
 
+   /* 1) Initialize the refcount so that ipc_rcu_putref works */
+   refcount_set(&new->refcount, 1);
+
if (limit > IPCMNI)
limit = IPCMNI;
 
@@ -268,9 +273,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
 
idr_preload(GFP_KERNEL);
 
-   refcount_set(&new->refcount, 1);
spin_lock_init(&new->lock);
-   new->deleted = false;
rcu_read_lock();
spin_lock(&new->lock);
 
@@ -278,6 +281,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;
 
+   new->deleted = false;
+
idx = ipc_idr_alloc(ids, new);
idr_preload_end();
 
@@ -290,6 +295,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
}
}
if (idx < 0) {
+   new->deleted = true;
spin_unlock(&new->lock);
rcu_read_unlock();
return idx;
-- 
2.17.1



[PATCH 04/12] ipc: Rename ipcctl_pre_down_nolock().

2018-07-12 Thread Manfred Spraul
Both the comment and the name of ipcctl_pre_down_nolock()
are misleading: The function must be called while holding
the rw semaphore.
Therefore the patch renames the function to ipcctl_obtain_check():
This name matches the other names used in util.c:
- "obtain" function look up a pointer in the idr, without
  acquiring the object lock.
- The caller is responsible for locking.
- _check means that the sequence number is checked.

Signed-off-by: Manfred Spraul 
Reviewed-by: Davidlohr Bueso 
---
 ipc/msg.c  | 2 +-
 ipc/sem.c  | 2 +-
 ipc/shm.c  | 2 +-
 ipc/util.c | 8 
 ipc/util.h | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 38119c1f0da3..4aca0ce363b5 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -385,7 +385,7 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, 
int cmd,
down_write(&msg_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &msg_ids(ns), msqid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &msg_ids(ns), msqid, cmd,
  &msqid64->msg_perm, msqid64->msg_qbytes);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/sem.c b/ipc/sem.c
index 8a0a1eb05765..da1626984083 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1595,7 +1595,7 @@ static int semctl_down(struct ipc_namespace *ns, int 
semid,
down_write(&sem_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &sem_ids(ns), semid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &sem_ids(ns), semid, cmd,
  &semid64->sem_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/shm.c b/ipc/shm.c
index 92d71abe9e8f..0a509befb558 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -868,7 +868,7 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, 
int cmd,
down_write(&shm_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &shm_ids(ns), shmid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &shm_ids(ns), shmid, cmd,
  &shmid64->shm_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/util.c b/ipc/util.c
index f3447911c81e..cffd12240f67 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -687,7 +687,7 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
 }
 
 /**
- * ipcctl_pre_down_nolock - retrieve an ipc and check permissions for some 
IPC_XXX cmd
+ * ipcctl_obtain_check - retrieve an ipc object and check permissions
  * @ns:  ipc namespace
  * @ids:  the table of ids where to look for the ipc
  * @id:   the id of the ipc to retrieve
@@ -697,16 +697,16 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
  *
  * This function does some common audit and permissions check for some IPC_XXX
  * cmd and is called from semctl_down, shmctl_down and msgctl_down.
- * It must be called without any lock held and:
  *
- *   - retrieves the ipc with the given id in the given table.
+ * It:
+ *   - retrieves the ipc object with the given id in the given table.
  *   - performs some audit and permission check, depending on the given cmd
  *   - returns a pointer to the ipc object or otherwise, the corresponding
  * error.
  *
  * Call holding the both the rwsem and the rcu read lock.
  */
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
struct ipc_ids *ids, int id, int cmd,
struct ipc64_perm *perm, int extra_perm)
 {
diff --git a/ipc/util.h b/ipc/util.h
index 0aba3230d007..fcf81425ae98 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -148,7 +148,7 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id);
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
 int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
 struct ipc_ids *ids, int id, int 
cmd,
 struct ipc64_perm *perm, int 
extra_perm);
 
-- 
2.17.1



[PATCH 05/12] ipc/util.c: correct comment in ipc_obtain_object_check

2018-07-12 Thread Manfred Spraul
The comment that explains ipc_obtain_object_check is wrong:
The function checks the sequence number, not the reference
counter.
Note that checking the reference counter would be meaningless:
The reference counter is decreased without holding any locks,
thus an object with kern_ipc_perm.deleted=true may disappear at
the end of the next rcu grace period.

Signed-off-by: Manfred Spraul 
Reviewed-by: Davidlohr Bueso 
---
 ipc/util.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index cffd12240f67..5cc37066e659 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -628,8 +628,8 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
  * @ids: ipc identifier set
  * @id: ipc id to look for
  *
- * Similar to ipc_obtain_object_idr() but also checks
- * the ipc object reference counter.
+ * Similar to ipc_obtain_object_idr() but also checks the ipc object
+ * sequence number.
  *
  * Call inside the RCU critical section.
  * The ipc object is *not* locked on exit.
-- 
2.17.1



[PATCH 06/12] ipc: drop ipc_lock()

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

ipc/util.c contains multiple functions to get the ipc object
pointer given an id number.

There are two sets of functions: One set verifies the sequence
counter part of the id number, the other functions do not check
the sequence counter.

The standard for function names in ipc/util.c is
- ..._check() functions verify the sequence counter
- ..._idr() functions do not verify the sequence counter

ipc_lock() is an exception: It does not verify the sequence
counter value, but this is not obvious from the function name.

Furthermore, shm.c is the only user of this helper. Thus, we
can simply move the logic into shm_lock() and get rid of the
function altogether.

[changelog mostly by manfred]
Signed-off-by: Davidlohr Bueso 
Signed-off-by: Manfred Spraul 
---
 ipc/shm.c  | 29 +++--
 ipc/util.c | 36 
 ipc/util.h |  1 -
 3 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 0a509befb558..22afb98363ff 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -179,16 +179,33 @@ static inline struct shmid_kernel 
*shm_obtain_object_check(struct ipc_namespace
  */
 static inline struct shmid_kernel *shm_lock(struct ipc_namespace *ns, int id)
 {
-   struct kern_ipc_perm *ipcp = ipc_lock(&shm_ids(ns), id);
+   struct kern_ipc_perm *ipcp;
+
+   rcu_read_lock();
+   ipcp = ipc_obtain_object_idr(&shm_ids(ns), id);
+   if (IS_ERR(ipcp))
+   goto err;
 
+   ipc_lock_object(ipcp);
+   /*
+* ipc_rmid() may have already freed the ID while ipc_lock_object()
+* was spinning: here verify that the structure is still valid.
+* Upon races with RMID, return -EIDRM, thus indicating that
+* the ID points to a removed identifier.
+*/
+   if (ipc_valid_object(ipcp)) {
+   /* return a locked ipc object upon success */
+   return container_of(ipcp, struct shmid_kernel, shm_perm);
+   }
+
+   ipc_unlock_object(ipcp);
+err:
+   rcu_read_unlock();
/*
 * Callers of shm_lock() must validate the status of the returned ipc
-* object pointer (as returned by ipc_lock()), and error out as
-* appropriate.
+* object pointer and error out as appropriate.
 */
-   if (IS_ERR(ipcp))
-   return (void *)ipcp;
-   return container_of(ipcp, struct shmid_kernel, shm_perm);
+   return (void *)ipcp;
 }
 
 static inline void shm_lock_by_ptr(struct shmid_kernel *ipcp)
diff --git a/ipc/util.c b/ipc/util.c
index 5cc37066e659..234f6d781df3 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -587,42 +587,6 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id)
return out;
 }
 
-/**
- * ipc_lock - lock an ipc structure without rwsem held
- * @ids: ipc identifier set
- * @id: ipc id to look for
- *
- * Look for an id in the ipc ids idr and lock the associated ipc object.
- *
- * The ipc object is locked on successful exit.
- */
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
-{
-   struct kern_ipc_perm *out;
-
-   rcu_read_lock();
-   out = ipc_obtain_object_idr(ids, id);
-   if (IS_ERR(out))
-   goto err;
-
-   spin_lock(&out->lock);
-
-   /*
-* ipc_rmid() may have already freed the ID while ipc_lock()
-* was spinning: here verify that the structure is still valid.
-* Upon races with RMID, return -EIDRM, thus indicating that
-* the ID points to a removed identifier.
-*/
-   if (ipc_valid_object(out))
-   return out;
-
-   spin_unlock(&out->lock);
-   out = ERR_PTR(-EIDRM);
-err:
-   rcu_read_unlock();
-   return out;
-}
-
 /**
  * ipc_obtain_object_check
  * @ids: ipc identifier set
diff --git a/ipc/util.h b/ipc/util.h
index fcf81425ae98..e3c47b21db93 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -142,7 +142,6 @@ int ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
void (*func)(struct rcu_head *head));
 
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
 struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
-- 
2.17.1



[PATCH 08/12] lib/rhashtable: guarantee initial hashtable allocation

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

rhashtable_init() may fail due to -ENOMEM, thus making the
entire api unusable. This patch removes this scenario,
however unlikely. In order to guarantee memory allocation,
this patch always ends up doing GFP_KERNEL|__GFP_NOFAIL
for both the tbl as well as alloc_bucket_spinlocks().

Upon the first table allocation failure, we shrink the
size to the smallest value that makes sense and retry with
__GFP_NOFAIL semantics. With the defaults, this means that
from 64 buckets, we retry with only 4. Any later issues
regarding performance due to collisions or larger table
resizing (when more memory becomes available) is the least
of our problems.

Signed-off-by: Davidlohr Bueso 
Acked-by: Herbert Xu 
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 083f871491a1..0026cf3e3f27 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -179,10 +179,11 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
 
size = nbuckets;
 
-   if (tbl == NULL && gfp != GFP_KERNEL) {
+   if (tbl == NULL && (gfp & ~__GFP_NOFAIL) != GFP_KERNEL) {
tbl = nested_bucket_table_alloc(ht, nbuckets, gfp);
nbuckets = 0;
}
+
if (tbl == NULL)
return NULL;
 
@@ -1065,9 +1066,16 @@ int rhashtable_init(struct rhashtable *ht,
}
}
 
+   /*
+* This is api initialization and thus we need to guarantee the
+* initial rhashtable allocation. Upon failure, retry with the
+* smallest possible size with __GFP_NOFAIL semantics.
+*/
tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
-   if (tbl == NULL)
-   return -ENOMEM;
+   if (unlikely(tbl == NULL)) {
+   size = max_t(u16, ht->p.min_size, HASH_MIN_SIZE);
+   tbl = bucket_table_alloc(ht, size, GFP_KERNEL | __GFP_NOFAIL);
+   }
 
atomic_set(&ht->nelems, 0);
 
-- 
2.17.1



[PATCH 07/12] lib/rhashtable: simplify bucket_table_alloc()

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

As of commit ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
incompatible gfp flags") we can simplify the caller and trust kvzalloc() to
just do the right thing. For the case of the GFP_ATOMIC context, we can
drop the __GFP_NORETRY flag for obvious reasons, and for the __GFP_NOWARN
case, however, it is changed such that the caller passes the flag instead
of making bucket_table_alloc() handle it.

This slightly changes the gfp flags passed on to nested_table_alloc() as
it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this a
positive consequence as for the same reasons we want nowarn semantics in
bucket_table_alloc().

Signed-off-by: Davidlohr Bueso 
Acked-by: Michal Hocko 

(commit id extended to 12 digits, line wraps updated)
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9427b5766134..083f871491a1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -175,10 +175,7 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
int i;
 
size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
-   if (gfp != GFP_KERNEL)
-   tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
-   else
-   tbl = kvzalloc(size, gfp);
+   tbl = kvzalloc(size, gfp);
 
size = nbuckets;
 
@@ -459,7 +456,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 
err = -ENOMEM;
 
-   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
+   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC | __GFP_NOWARN);
if (new_tbl == NULL)
goto fail;
 
-- 
2.17.1



[PATCH 09/12] ipc: get rid of ids->tables_initialized hack

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

In sysvipc we have an ids->tables_initialized regarding the
rhashtable, introduced in:

commit 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots of keys")

It's there, specifically, to prevent nil pointer dereferences
from using an uninitialized api. Considering how rhashtable_init()
can fail (probably due to ENOMEM, if anything), this made the
overall ipc initialization capable of failure as well. That alone
is ugly, but fine, however I've spotted a few issues regarding the
semantics of tables_initialized (however unlikely they may be):

- There is inconsistency in what we return to userspace: ipc_addid()
returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
returns EINVAL.

- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc structure
isn't found, and can therefore call into ipcget() callbacks.

Now that rhashtable initialization cannot fail, we can properly
get rid of the hack altogether.

Signed-off-by: Davidlohr Bueso 

(commit id extended to 12 digits)
Signed-off-by: Manfred Spraul 
---
 include/linux/ipc_namespace.h |  1 -
 ipc/util.c| 23 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8eb2f3..37f3a4b7c637 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,7 +16,6 @@ struct user_namespace;
 struct ipc_ids {
int in_use;
unsigned short seq;
-   bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
int max_id;
diff --git a/ipc/util.c b/ipc/util.c
index 234f6d781df3..f620778b11d2 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -125,7 +125,6 @@ int ipc_init_ids(struct ipc_ids *ids)
if (err)
return err;
idr_init(&ids->ipcs_idr);
-   ids->tables_initialized = true;
ids->max_id = -1;
 #ifdef CONFIG_CHECKPOINT_RESTORE
ids->next_id = -1;
@@ -178,19 +177,16 @@ void __init ipc_init_proc_interface(const char *path, 
const char *header,
  */
 static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
 {
-   struct kern_ipc_perm *ipcp = NULL;
+   struct kern_ipc_perm *ipcp;
 
-   if (likely(ids->tables_initialized))
-   ipcp = rhashtable_lookup_fast(&ids->key_ht, &key,
+   ipcp = rhashtable_lookup_fast(&ids->key_ht, &key,
  ipc_kht_params);
+   if (!ipcp)
+   return NULL;
 
-   if (ipcp) {
-   rcu_read_lock();
-   ipc_lock_object(ipcp);
-   return ipcp;
-   }
-
-   return NULL;
+   rcu_read_lock();
+   ipc_lock_object(ipcp);
+   return ipcp;
 }
 
 /*
@@ -268,7 +264,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
if (limit > IPCMNI)
limit = IPCMNI;
 
-   if (!ids->tables_initialized || ids->in_use >= limit)
+   if (ids->in_use >= limit)
return -ENOSPC;
 
idr_preload(GFP_KERNEL);
@@ -577,9 +573,6 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id)
struct kern_ipc_perm *out;
int lid = ipcid_to_idx(id);
 
-   if (unlikely(!ids->tables_initialized))
-   return ERR_PTR(-EINVAL);
-
out = idr_find(&ids->ipcs_idr, lid);
if (!out)
return ERR_PTR(-EINVAL);
-- 
2.17.1



[PATCH 10/12] ipc: simplify ipc initialization

2018-07-12 Thread Manfred Spraul
From: Davidlohr Bueso 

Now that we know that rhashtable_init() will not fail, we
can get rid of a lot of the unnecessary cleanup paths when
the call errored out.

Signed-off-by: Davidlohr Bueso 

(variable name added to util.h to resolve checkpatch warning)
Signed-off-by: Manfred Spraul 
---
 ipc/msg.c   |  9 -
 ipc/namespace.c | 20 
 ipc/sem.c   | 10 --
 ipc/shm.c   |  9 -
 ipc/util.c  | 18 +-
 ipc/util.h  | 18 +-
 6 files changed, 30 insertions(+), 54 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 4aca0ce363b5..130e12e6a8c6 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -1237,7 +1237,7 @@ COMPAT_SYSCALL_DEFINE5(msgrcv, int, msqid, compat_uptr_t, 
msgp,
 }
 #endif
 
-int msg_init_ns(struct ipc_namespace *ns)
+void msg_init_ns(struct ipc_namespace *ns)
 {
ns->msg_ctlmax = MSGMAX;
ns->msg_ctlmnb = MSGMNB;
@@ -1245,7 +1245,7 @@ int msg_init_ns(struct ipc_namespace *ns)
 
atomic_set(&ns->msg_bytes, 0);
atomic_set(&ns->msg_hdrs, 0);
-   return ipc_init_ids(&ns->ids[IPC_MSG_IDS]);
+   ipc_init_ids(&ns->ids[IPC_MSG_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -1286,12 +1286,11 @@ static int sysvipc_msg_proc_show(struct seq_file *s, 
void *it)
 }
 #endif
 
-int __init msg_init(void)
+void __init msg_init(void)
 {
-   const int err = msg_init_ns(&init_ipc_ns);
+   msg_init_ns(&init_ipc_ns);
 
ipc_init_proc_interface("sysvipc/msg",
"   key  msqid perms  cbytes   
qnum lspid lrpid   uid   gid  cuid  cgid  stime  rtime  ctime\n",
IPC_MSG_IDS, sysvipc_msg_proc_show);
-   return err;
 }
diff --git a/ipc/namespace.c b/ipc/namespace.c
index f59a89966f92..21607791d62c 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -55,28 +55,16 @@ static struct ipc_namespace *create_ipc_ns(struct 
user_namespace *user_ns,
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;
 
-   err = sem_init_ns(ns);
+   err = mq_init_ns(ns);
if (err)
goto fail_put;
-   err = msg_init_ns(ns);
-   if (err)
-   goto fail_destroy_sem;
-   err = shm_init_ns(ns);
-   if (err)
-   goto fail_destroy_msg;
 
-   err = mq_init_ns(ns);
-   if (err)
-   goto fail_destroy_shm;
+   sem_init_ns(ns);
+   msg_init_ns(ns);
+   shm_init_ns(ns);
 
return ns;
 
-fail_destroy_shm:
-   shm_exit_ns(ns);
-fail_destroy_msg:
-   msg_exit_ns(ns);
-fail_destroy_sem:
-   sem_exit_ns(ns);
 fail_put:
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
diff --git a/ipc/sem.c b/ipc/sem.c
index da1626984083..671d8703b130 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -220,14 +220,14 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define sc_semopm  sem_ctls[2]
 #define sc_semmni  sem_ctls[3]
 
-int sem_init_ns(struct ipc_namespace *ns)
+void sem_init_ns(struct ipc_namespace *ns)
 {
ns->sc_semmsl = SEMMSL;
ns->sc_semmns = SEMMNS;
ns->sc_semopm = SEMOPM;
ns->sc_semmni = SEMMNI;
ns->used_sems = 0;
-   return ipc_init_ids(&ns->ids[IPC_SEM_IDS]);
+   ipc_init_ids(&ns->ids[IPC_SEM_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -239,14 +239,12 @@ void sem_exit_ns(struct ipc_namespace *ns)
 }
 #endif
 
-int __init sem_init(void)
+void __init sem_init(void)
 {
-   const int err = sem_init_ns(&init_ipc_ns);
-
+   sem_init_ns(&init_ipc_ns);
ipc_init_proc_interface("sysvipc/sem",
"   key  semid perms  nsems   uid   
gid  cuid  cgid  otime  ctime\n",
IPC_SEM_IDS, sysvipc_sem_proc_show);
-   return err;
 }
 
 /**
diff --git a/ipc/shm.c b/ipc/shm.c
index 22afb98363ff..d388d6e744c0 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -95,14 +95,14 @@ static void shm_destroy(struct ipc_namespace *ns, struct 
shmid_kernel *shp);
 static int sysvipc_shm_proc_show(struct seq_file *s, void *it);
 #endif
 
-int shm_init_ns(struct ipc_namespace *ns)
+void shm_init_ns(struct ipc_namespace *ns)
 {
ns->shm_ctlmax = SHMMAX;
ns->shm_ctlall = SHMALL;
ns->shm_ctlmni = SHMMNI;
ns->shm_rmid_forced = 0;
ns->shm_tot = 0;
-   return ipc_init_ids(&shm_ids(ns));
+   ipc_init_ids(&shm_ids(ns));
 }
 
 /*
@@ -135,9 +135,8 @@ void shm_exit_ns(struct ipc_namespace *ns)
 
 static int __init ipc_ns_init(void)
 {
-   const int err = shm_init_ns(&init_ipc_ns);
-   WARN(err, "ipc: sysv shm_init_ns failed: %d\n", err);
-   return err;
+   shm_init_ns(&init_ipc_ns);
+   return 0;
 }
 
 pure_initcall(ipc_ns_init);
diff --git a/ipc

[PATCH 11/12] ipc/util.c: Further variable name cleanups

2018-07-12 Thread Manfred Spraul
The variable names became a mess, thus standardize them again (see the
sketch after the list):

id: user space id. Called semid, shmid, msgid if the type is known.
Most functions use "id" already.
idx: "index" for the idr lookup
Right now, some functions use lid, ipc_addid() already uses idx as
the variable name.
seq: sequence number, to avoid quick collisions of the user space id
key: user space key, used for the rhash tree
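
For orientation, the relationship between these names, as implemented by
ipc_idr_alloc() elsewhere in this series (rough sketch, illustration only):

	/*
	 *	id  = SEQ_MULTIPLIER * seq + idx;	user space visible id
	 *	idx = ipcid_to_idx(id);			position in the idr tree
	 *	seq = ipcid_to_seqx(id);		sequence number part
	 */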

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
---
 include/linux/ipc_namespace.h |  2 +-
 ipc/msg.c |  6 +++---
 ipc/sem.c |  6 +++---
 ipc/shm.c |  4 ++--
 ipc/util.c| 26 +-
 ipc/util.h| 10 +-
 6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 37f3a4b7c637..3098d275a29d 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -18,7 +18,7 @@ struct ipc_ids {
unsigned short seq;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
-   int max_id;
+   int max_idx;
 #ifdef CONFIG_CHECKPOINT_RESTORE
int next_id;
 #endif
diff --git a/ipc/msg.c b/ipc/msg.c
index 130e12e6a8c6..1892bec0f1c8 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -455,7 +455,7 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
 int cmd, struct msginfo *msginfo)
 {
int err;
-   int max_id;
+   int max_idx;
 
/*
 * We must not return kernel stack data.
@@ -482,9 +482,9 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
msginfo->msgpool = MSGPOOL;
msginfo->msgtql = MSGTQL;
}
-   max_id = ipc_get_maxid(&msg_ids(ns));
+   max_idx = ipc_get_maxidx(&msg_ids(ns));
up_read(&msg_ids(ns).rwsem);
-   return (max_id < 0) ? 0 : max_id;
+   return (max_idx < 0) ? 0 : max_idx;
 }
 
 static int msgctl_stat(struct ipc_namespace *ns, int msqid,
diff --git a/ipc/sem.c b/ipc/sem.c
index 671d8703b130..f98962b06024 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1293,7 +1293,7 @@ static int semctl_info(struct ipc_namespace *ns, int 
semid,
 int cmd, void __user *p)
 {
struct seminfo seminfo;
-   int max_id;
+   int max_idx;
int err;
 
err = security_sem_semctl(NULL, cmd);
@@ -1317,11 +1317,11 @@ static int semctl_info(struct ipc_namespace *ns, int 
semid,
seminfo.semusz = SEMUSZ;
seminfo.semaem = SEMAEM;
}
-   max_id = ipc_get_maxid(&sem_ids(ns));
+   max_idx = ipc_get_maxidx(&sem_ids(ns));
up_read(&sem_ids(ns).rwsem);
if (copy_to_user(p, &seminfo, sizeof(struct seminfo)))
return -EFAULT;
-   return (max_id < 0) ? 0 : max_id;
+   return (max_idx < 0) ? 0 : max_idx;
 }
 
 static int semctl_setval(struct ipc_namespace *ns, int semid, int semnum,
diff --git a/ipc/shm.c b/ipc/shm.c
index d388d6e744c0..a4e9a1b34595 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -935,7 +935,7 @@ static int shmctl_ipc_info(struct ipc_namespace *ns,
shminfo->shmall = ns->shm_ctlall;
shminfo->shmmin = SHMMIN;
down_read(&shm_ids(ns).rwsem);
-   err = ipc_get_maxid(&shm_ids(ns));
+   err = ipc_get_maxidx(&shm_ids(ns));
up_read(&shm_ids(ns).rwsem);
if (err < 0)
err = 0;
@@ -955,7 +955,7 @@ static int shmctl_shm_info(struct ipc_namespace *ns,
shm_info->shm_tot = ns->shm_tot;
shm_info->swap_attempts = 0;
shm_info->swap_successes = 0;
-   err = ipc_get_maxid(&shm_ids(ns));
+   err = ipc_get_maxidx(&shm_ids(ns));
up_read(&shm_ids(ns).rwsem);
if (err < 0)
err = 0;
diff --git a/ipc/util.c b/ipc/util.c
index 35621be0d945..fb69c911655a 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -118,7 +118,7 @@ void ipc_init_ids(struct ipc_ids *ids)
init_rwsem(&ids->rwsem);
rhashtable_init(&ids->key_ht, &ipc_kht_params);
idr_init(&ids->ipcs_idr);
-   ids->max_id = -1;
+   ids->max_idx = -1;
 #ifdef CONFIG_CHECKPOINT_RESTORE
ids->next_id = -1;
 #endif
@@ -236,7 +236,7 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, struct 
kern_ipc_perm *new)
  * @limit: limit for the number of used ids
  *
  * Add an entry 'new' to the ipc ids idr. The permissions object is
- * initialised and the first free entry is set up and the id assigned
+ * initialised and the first free entry is set up and the index assigned
  * is returned. The 'new' entry is returned in a locked state on success.
  *
  * On failure the entry i

[PATCH 12/12] ipc/util.c: update return value of ipc_getref from int to bool

2018-07-12 Thread Manfred Spraul
ipc_getref still has a return value of type "int", matching the atomic_t
interface of atomic_inc_not_zero()/atomic_add_unless().

ipc_getref now uses refcount_inc_not_zero, which has a return value of
type "bool".

Therefore: Update the return code to avoid implicit conversions.

Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 2 +-
 ipc/util.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index fb69c911655a..6306eb25180b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -461,7 +461,7 @@ void ipc_set_key_private(struct ipc_ids *ids, struct 
kern_ipc_perm *ipcp)
ipcp->key = IPC_PRIVATE;
 }
 
-int ipc_rcu_getref(struct kern_ipc_perm *ptr)
+bool ipc_rcu_getref(struct kern_ipc_perm *ptr)
 {
return refcount_inc_not_zero(&ptr->refcount);
 }
diff --git a/ipc/util.h b/ipc/util.h
index e74564fe3375..0a159f69b3bb 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -138,7 +138,7 @@ static inline int ipc_get_maxidx(struct ipc_ids *ids)
  * refcount is initialized by ipc_addid(), before that point call_rcu()
  * must be used.
  */
-int ipc_rcu_getref(struct kern_ipc_perm *ptr);
+bool ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
void (*func)(struct rcu_head *head));
 
-- 
2.17.1



[PATCH 0/12 V3] ipc: cleanups & bugfixes, rhashtable update

2018-07-12 Thread Manfred Spraul
Hi,

I have added all review findings and rediffed the patches

- patch #1-#6: Fix syzkaller findings & further race cleanups
patch #1 has an updated subject/comment
patch #2 contains the squashed result of Dmitry's change and my
own updates.
patch #6 is replaced by the proposal from Davidlohr
- patch #7-#10: rhashtable improvement from Davidlohr
- patch #11: A variable rename patch: id/lid/idx/uid were a mess
- patch #12: change a return code from int to bool, side effect of the
refcount_t introduction.

@Andrew:
Can you merge the patches into -mm/next?

I have not seen any issues in my tests.

--
Manfred


[PATCH 02/12] ipc: reorganize initialization of kern_ipc_perm.seq

2018-07-12 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.seq after having called
idr_alloc() (within ipc_idr_alloc()).

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
may see an uninitialized value.

The patch moves the initialization of kern_ipc_perm.seq before the
calls of idr_alloc().

Notes:
1) This patch has a user space visible side effect:
If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
if semget()/msgget()/shmget() fails in the final step of adding the id
to the rhash tree, then .._next_id is cleared. Before the patch, it
remained unmodified.

There is no change of the behavior after a successful ..get() call:
It always clears .._next_id; there is no impact on non checkpoint/restore
code, as that code does not use .._next_id (see the sketch after these notes).

2) The patch correctly documents that after a call to ipc_idr_alloc(),
the full tear-down sequence must be used. The callers of ipc_addid()
do not fulfill that, i.e. more bugfixes are required.

The patch is a squash of a patch from Dmitry and my own changes.
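
As a side note on 1): a user-space sketch of how the .._next_id files are
typically used (checkpoint/restore style). It assumes CONFIG_CHECKPOINT_RESTORE,
root privileges, and uses msg_next_id; the requested value is arbitrary and the
kernel does not guarantee that the new object actually gets it:

/* Sketch only: request a specific msg id via /proc/sys/kernel/msg_next_id. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int main(void)
{
	int desired = 4711;	/* arbitrary example: seq 0, idx 4711 */
	FILE *f = fopen("/proc/sys/kernel/msg_next_id", "w");

	if (!f || fprintf(f, "%d\n", desired) < 0 || fclose(f) != 0)
		return 1;

	/* msgget() consumes msg_next_id; on success the kernel resets it to -1. */
	int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
	if (id < 0)
		return 1;

	printf("requested %d, got %d\n", desired, id);
	msgctl(id, IPC_RMID, NULL);	/* clean up the example queue */
	return 0;
}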

Reported-by: syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com
Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
Cc: Michael Kerrisk 
---
 Documentation/sysctl/kernel.txt |  3 +-
 ipc/util.c  | 91 +
 2 files changed, 50 insertions(+), 44 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index eded671d55eb..b2d4a8f8fe97 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -440,7 +440,8 @@ Notes:
 1) kernel doesn't guarantee, that new object will have desired id. So,
 it's up to userspace, how to handle an object with "wrong" id.
 2) Toggle with non-default value will be set back to -1 by kernel after
-successful IPC object allocation.
+successful IPC object allocation. If an IPC object allocation syscall
+fails, it is undefined if the value remains unmodified or is reset to -1.
 
 ==
 
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182fa0ac..4998f8fa8ce0 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -193,46 +193,54 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
return NULL;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
 /*
- * Specify desired id for next allocated IPC object.
+ * Insert new IPC object into idr tree, and set sequence number and id
+ * in the correct order.
+ * Especially:
+ * - the sequence number must be set before inserting the object into the idr,
+ *   because the sequence number is accessed without a lock.
+ * - the id can/must be set after inserting the object into the idr.
+ *   All accesses must be done after getting kern_ipc_perm.lock.
+ *
+ * The caller must own kern_ipc_perm.lock of the new object.
+ * On error, the function returns a (negative) error code.
  */
-#define ipc_idr_alloc(ids, new)
\
-   idr_alloc(&(ids)->ipcs_idr, (new),  \
- (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
- 0, GFP_NOWAIT)
-
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
- struct kern_ipc_perm *new)
+static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
-   if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
+   int idx, next_id = -1;
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   next_id = ids->next_id;
+   ids->next_id = -1;
+#endif
+
+   /*
+* As soon as a new object is inserted into the idr,
+* ipc_obtain_object_idr() or ipc_obtain_object_check() can find it,
+* and the lockless preparations for ipc operations can start.
+* This means especially: permission checks, audit calls, allocation
+* of undo structures, ...
+*
+* Thus the object must be fully initialized, and if something fails,
+* then the full tear-down sequence must be followed.
+* (i.e.: set new->deleted, reduce refcount, call_rcu())
+*/
+
+   if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
+   idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
} else {
-   new->seq = ipcid_to_seqx(ids->next_id);
-   ids->next_id = -1;
+   new->seq = ipcid_to_seqx(next_id);
+   idx = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
+   0, GFP_NOWAIT);
}
-
-   return SEQ_MULTIPLIER * new->seq + id;
+   if (idx >= 0)
+   new->id = SEQ_MULTIPLIER * new->seq + idx;
+   return idx;
 }
 
-#else
-#define ipc_i

Re: [PATCH 0/12 V2] ipc: cleanups & bugfixes, rhashtable update

2018-07-09 Thread Manfred Spraul

Hi Davidlohr,

On 07/09/2018 10:09 PM, Davidlohr Bueso wrote:

On Mon, 09 Jul 2018, Manfred Spraul wrote:


@Davidlohr:
Please double check that I have taken the correct patches, and
that I didn't break anything.


Everything seems ok.

Patch 8 had an alternative patch that didn't change nowarn semantics for
the rhashtable resizing operations (https://lkml.org/lkml/2018/6/22/732),
but nobody complained about the one you picked up (which also has
Michal's ack).



Which patch do you prefer?
I have seen two versions, and if I have picked up the wrong one, then I 
can change it.


--
    Manfred


Re: [PATCH 12/12] ipc/util.c: Further ipc_idr_alloc cleanups.

2018-07-09 Thread Manfred Spraul

Hello Dmitry,

On 07/09/2018 07:05 PM, Dmitry Vyukov wrote:

On Mon, Jul 9, 2018 at 5:10 PM, Manfred Spraul  wrote:

If idr_alloc within ipc_idr_alloc fails, then the return value (-ENOSPC)
is used to calculate new->id.
Technically, this is not a bug, because new->id is never accessed.

But: Clean it up anyway: On error, just return, do not set new->id.
And improve the documentation.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
---
  ipc/util.c | 22 --
  1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index d474f2b3b299..302c18fc846b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -182,11 +182,20 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
  }

  /*
- * Specify desired id for next allocated IPC object.
+ * Insert new IPC object into idr tree, and set sequence number and id
+ * in the correct order.
+ * Especially:
+ * - the sequence number must be set before inserting the object into the idr,
+ *   because the sequence number is accessed without a lock.
+ * - the id can/must be set after inserting the object into the idr.
+ *   All accesses must be done after getting kern_ipc_perm.lock.
+ *
+ * The caller must own kern_ipc_perm.lock.of the new object.
+ * On error, the function returns a (negative) error code.
   */
  static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm 
*new)
  {
-   int key, next_id = -1;
+   int id, next_id = -1;

/\/\/\/\
Looks good to me. I was also confused by how key transforms into id,
and then key name is used for something else.
Let's see if there are further findings; perhaps I'll rework the series.
It may make sense to standardize the variable names (a short sketch of the
id encoding follows the list below):


id: user space id. Called semid, shmid, msgid if the type is known.
    Most functions use "id" already.
    Exception: ipc_checkid(), that function calls it uid.
idx: "index" for the idr lookup.
    Right now, ipc_rmid() uses lid and ipc_addid() uses id as the variable name.
seq: sequence counter, to avoid quick collisions of the user space id.
    In the comments, it is referred to as both "sequence counter" and
    "sequence number".

key: user space key, used for the rhash tree
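
To make the id/idx/seq relationship concrete, here is a small self-contained
sketch. It assumes the classic IPCMNI-based layout, and the example_ helpers
only mirror the in-kernel SEQ_MULTIPLIER / ipcid_to_idx() / ipcid_to_seqx()
definitions for illustration:

/* Sketch only: how a user space id encodes the idr index and the seq counter. */
#define EXAMPLE_SEQ_MULTIPLIER	32768	/* classic IPCMNI */

static inline int example_build_id(int seq, int idx)
{
	return EXAMPLE_SEQ_MULTIPLIER * seq + idx;	/* what ends up in kern_ipc_perm.id */
}

static inline int example_id_to_idx(int id)
{
	return id % EXAMPLE_SEQ_MULTIPLIER;	/* index used for idr_find()/ipc_rmid() */
}

static inline int example_id_to_seq(int id)
{
	return id / EXAMPLE_SEQ_MULTIPLIER;	/* sequence counter part */
}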


  #ifdef CONFIG_CHECKPOINT_RESTORE
 next_id = ids->next_id;
@@ -197,14 +206,15 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
 new->seq = ids->seq++;
 if (ids->seq > IPCID_SEQ_MAX)
 ids->seq = 0;
-   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   id = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
 } else {
 new->seq = ipcid_to_seqx(next_id);
-   key = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
+   id = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
 0, GFP_NOWAIT);
 }
-   new->id = SEQ_MULTIPLIER * new->seq + key;
-   return key;
+   if (id >= 0)
+   new->id = SEQ_MULTIPLIER * new->seq + id;

We still initialize seq in this case. I guess it's ok because the
object is not published at all. But if we are doing this, then perhaps
store seq into a local var first and then:

   if (id >= 0) {
   new->id = SEQ_MULTIPLIER * seq + id;
   new->seq = seq:
   }

?

No!!!
We must initialize ->seq before publication. Otherwise we end up with 
the syzcall findings, or in the worst case a strange rare failure of an 
ipc operation.
The difference between ->id and ->seq is that we have the valid number 
for ->seq.


For the user space ID we cannot have the valid number unless the 
idr_alloc is successful.

The patch only avoids that this line is executed:


new->id = SEQ_MULTIPLIER * new->seq + (-ENOSPC)


As I wrote, the line shouldn't cause any damage, the code is more or less:

new->id = SEQ_MULTIPLIER * new->seq + (-ENOSPC)
kfree(new);

But this is ugly, it asks for problems.

--
Manfred



[PATCH 01/12] ipc: reorganize initialization of kern_ipc_perm.id

2018-07-09 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.id after having called
ipc_idr_alloc().

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_idr()
may see an uninitialized value.

The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with kern_ipc_perm.seq

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
---
 ipc/msg.c | 19 ++-
 ipc/sem.c | 18 +-
 ipc/shm.c | 19 ++-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 3b6545302598..829c2062ded4 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -491,7 +491,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 int cmd, struct msqid64_ds *p)
 {
struct msg_queue *msq;
-   int id = 0;
int err;
 
memset(p, 0, sizeof(*p));
@@ -503,7 +502,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
err = PTR_ERR(msq);
goto out_unlock;
}
-   id = msq->q_perm.id;
} else { /* IPC_STAT */
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
@@ -548,10 +546,21 @@ static int msgctl_stat(struct ipc_namespace *ns, int 
msqid,
p->msg_lspid  = pid_vnr(msq->q_lspid);
p->msg_lrpid  = pid_vnr(msq->q_lrpid);
 
-   ipc_unlock_object(&msq->q_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* MSG_STAT and MSG_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = msq->q_perm.id;
+   }
 
+   ipc_unlock_object(&msq->q_perm);
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/sem.c b/ipc/sem.c
index 5af1943ad782..e8971fa1d847 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1222,7 +1222,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 {
struct sem_array *sma;
time64_t semotime;
-   int id = 0;
int err;
 
memset(semid64, 0, sizeof(*semid64));
@@ -1234,7 +1233,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
err = PTR_ERR(sma);
goto out_unlock;
}
-   id = sma->sem_perm.id;
} else { /* IPC_STAT */
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
@@ -1274,10 +1272,20 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 #endif
semid64->sem_nsems = sma->sem_nsems;
 
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SEM_STAT and SEM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = sma->sem_perm.id;
+   }
ipc_unlock_object(&sma->sem_perm);
-   rcu_read_unlock();
-   return id;
-
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/shm.c b/ipc/shm.c
index 051a3e1fb8df..59fe8b3b3794 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -949,7 +949,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
int cmd, struct shmid64_ds *tbuf)
 {
struct shmid_kernel *shp;
-   int id = 0;
int err;
 
memset(tbuf, 0, sizeof(*tbuf));
@@ -961,7 +960,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
err = PTR_ERR(shp);
goto out_unlock;
}
-   id = shp->shm_perm.id;
} else { /* IPC_STAT */
shp = shm_obtain_object_check(ns, shmid);
if (IS_ERR(shp)) {
@@ -1011,10 +1009,21 @@ static int shmctl_stat(struct ipc_namespace *ns, int 
shmid,
tbuf->shm_lpid  = pid_vnr(shp->shm_lprid);
tbuf->shm_nattch = shp->shm_nattch;
 
-   ipc_unlock_object(&shp->shm_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SHM_STAT and SHM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = shp->shm_perm.id;
+   }
 
+  

[PATCH 07/12] ipc_idr_alloc refactoring

2018-07-09 Thread Manfred Spraul
From: Dmitry Vyukov 

ipc_idr_alloc refactoring

Signed-off-by: Dmitry Vyukov 
Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 51 +--
 1 file changed, 13 insertions(+), 38 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 8bc166bb4981..a41b8a69de13 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -193,52 +193,32 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
return NULL;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * Specify desired id for next allocated IPC object.
  */
-static inline int ipc_idr_alloc(struct ipc_ids *ids,
-   struct kern_ipc_perm *new)
+static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
-   int key;
+   int key, next_id = -1;
 
-   if (ids->next_id < 0) {
-   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
-   } else {
-   key = idr_alloc(&ids->ipcs_idr, new,
-   ipcid_to_idx(ids->next_id),
-   0, GFP_NOWAIT);
-   ids->next_id = -1;
-   }
-   return key;
-}
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   next_id = ids->next_id;
+   ids->next_id = -1;
+#endif
 
-static inline void ipc_set_seq(struct ipc_ids *ids,
-   struct kern_ipc_perm *new)
-{
-   if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
+   if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
+   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
} else {
-   new->seq = ipcid_to_seqx(ids->next_id);
+   new->seq = ipcid_to_seqx(next_id);
+   key = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
+   0, GFP_NOWAIT);
}
+   new->id = SEQ_MULTIPLIER * new->seq + key;
+   return key;
 }
 
-#else
-#define ipc_idr_alloc(ids, new)\
-   idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
-
-static inline void ipc_set_seq(struct ipc_ids *ids,
- struct kern_ipc_perm *new)
-{
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
-}
-
-#endif /* CONFIG_CHECKPOINT_RESTORE */
-
 /**
  * ipc_addid - add an ipc identifier
  * @ids: ipc identifier set
@@ -278,8 +258,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
current_euid_egid(&euid, &egid);
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;
-
-   ipc_set_seq(ids, new);
new->deleted = false;
 
/*
@@ -317,9 +295,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
ids->in_use++;
if (id > ids->max_id)
ids->max_id = id;
-
-   new->id = SEQ_MULTIPLIER * new->seq + id;
-
return id;
 }
 
-- 
2.17.1



[PATCH 09/12] lib/rhashtable: guarantee initial hashtable allocation

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

rhashtable_init() may fail due to -ENOMEM, thus making the
entire api unusable. This patch removes this scenario,
however unlikely. In order to guarantee memory allocation,
this patch always ends up doing GFP_KERNEL|__GFP_NOFAIL
for both the tbl as well as alloc_bucket_spinlocks().

Upon the first table allocation failure, we shrink the
size to the smallest value that makes sense and retry with
__GFP_NOFAIL semantics. With the defaults, this means that
from 64 buckets, we retry with only 4. Any later issues
regarding performance due to collisions or larger table
resizing (when more memory becomes available) is the least
of our problems.

Signed-off-by: Davidlohr Bueso 
Acked-by: Herbert Xu 
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 083f871491a1..0026cf3e3f27 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -179,10 +179,11 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
 
size = nbuckets;
 
-   if (tbl == NULL && gfp != GFP_KERNEL) {
+   if (tbl == NULL && (gfp & ~__GFP_NOFAIL) != GFP_KERNEL) {
tbl = nested_bucket_table_alloc(ht, nbuckets, gfp);
nbuckets = 0;
}
+
if (tbl == NULL)
return NULL;
 
@@ -1065,9 +1066,16 @@ int rhashtable_init(struct rhashtable *ht,
}
}
 
+   /*
+* This is api initialization and thus we need to guarantee the
+* initial rhashtable allocation. Upon failure, retry with the
+* smallest possible size with __GFP_NOFAIL semantics.
+*/
tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
-   if (tbl == NULL)
-   return -ENOMEM;
+   if (unlikely(tbl == NULL)) {
+   size = max_t(u16, ht->p.min_size, HASH_MIN_SIZE);
+   tbl = bucket_table_alloc(ht, size, GFP_KERNEL | __GFP_NOFAIL);
+   }
 
atomic_set(&ht->nelems, 0);
 
-- 
2.17.1



[PATCH 08/12] lib/rhashtable: simplify bucket_table_alloc()

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

As of commit ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
incompatible gfp flags") we can simplify the caller and trust kvzalloc() to
just do the right thing. For the case of the GFP_ATOMIC context, we can
drop the __GFP_NORETRY flag for obvious reasons, and for the __GFP_NOWARN
case, however, it is changed such that the caller passes the flag instead
of making bucket_table_alloc() handle it.

This slightly changes the gfp flags passed on to nested_table_alloc() as
it will now also use GFP_ATOMIC | __GFP_NOWARN. However, I consider this a
positive consequence as for the same reasons we want nowarn semantics in
bucket_table_alloc().

Signed-off-by: Davidlohr Bueso 
Acked-by: Michal Hocko 

(commit id extended to 12 digits, line wraps updated)
Signed-off-by: Manfred Spraul 
---
 lib/rhashtable.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9427b5766134..083f871491a1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -175,10 +175,7 @@ static struct bucket_table *bucket_table_alloc(struct 
rhashtable *ht,
int i;
 
size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
-   if (gfp != GFP_KERNEL)
-   tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
-   else
-   tbl = kvzalloc(size, gfp);
+   tbl = kvzalloc(size, gfp);
 
size = nbuckets;
 
@@ -459,7 +456,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 
err = -ENOMEM;
 
-   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC);
+   new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC | __GFP_NOWARN);
if (new_tbl == NULL)
goto fail;
 
-- 
2.17.1



[PATCH 06/12] ipc: rename ipc_lock() to ipc_lock_idr()

2018-07-09 Thread Manfred Spraul
ipc/util.c contains multiple functions to get the ipc object
pointer given an id number.

There are two sets of function: One set verifies the sequence
counter part of the id number, other functions do not check
the sequence counter.

The standard for function names in ipc/util.c is
- ..._check() functions verify the sequence counter
- ..._idr() functions do not verify the sequence counter

ipc_lock() is an exception: It does not verify the sequence
counter value, but this is not obvious from the function name.

Therefore: Rename the function to ipc_lock_idr(), to make it
obvious that it does not check the sequence counter.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 ipc/shm.c  |  4 ++--
 ipc/util.c | 10 ++
 ipc/util.h |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 426ba1039a7b..cd8655c7bb77 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -179,11 +179,11 @@ static inline struct shmid_kernel 
*shm_obtain_object_check(struct ipc_namespace
  */
 static inline struct shmid_kernel *shm_lock(struct ipc_namespace *ns, int id)
 {
-   struct kern_ipc_perm *ipcp = ipc_lock(&shm_ids(ns), id);
+   struct kern_ipc_perm *ipcp = ipc_lock_idr(&shm_ids(ns), id);
 
/*
 * Callers of shm_lock() must validate the status of the returned ipc
-* object pointer (as returned by ipc_lock()), and error out as
+* object pointer (as returned by ipc_lock_idr()), and error out as
 * appropriate.
 */
if (IS_ERR(ipcp))
diff --git a/ipc/util.c b/ipc/util.c
index 8133f10832a9..8bc166bb4981 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -604,15 +604,17 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct 
ipc_ids *ids, int id)
 }
 
 /**
- * ipc_lock - lock an ipc structure without rwsem held
+ * ipc_lock_idr - lock an ipc structure without rwsem held
  * @ids: ipc identifier set
  * @id: ipc id to look for
  *
  * Look for an id in the ipc ids idr and lock the associated ipc object.
+ * The function does not check if the sequence counter matches the
+ * found ipc object.
  *
  * The ipc object is locked on successful exit.
  */
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
+struct kern_ipc_perm *ipc_lock_idr(struct ipc_ids *ids, int id)
 {
struct kern_ipc_perm *out;
 
@@ -624,8 +626,8 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
spin_lock(&out->lock);
 
/*
-* ipc_rmid() may have already freed the ID while ipc_lock()
-* was spinning: here verify that the structure is still valid.
+* ipc_rmid() may have already freed the ID while waiting for
+* the lock. Here verify that the structure is still valid.
 * Upon races with RMID, return -EIDRM, thus indicating that
 * the ID points to a removed identifier.
 */
diff --git a/ipc/util.h b/ipc/util.h
index fcf81425ae98..25d8ee052ac9 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -142,7 +142,7 @@ int ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
void (*func)(struct rcu_head *head));
 
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
+struct kern_ipc_perm *ipc_lock_idr(struct ipc_ids *ids, int id);
 struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
-- 
2.17.1



[PATCH 10/12] ipc: get rid of ids->tables_initialized hack

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

In sysvipc we have an ids->tables_initialized regarding the
rhashtable, introduced in:

commit 0cfb6aee70bd ("ipc: optimize semget/shmget/msgget for lots of keys")

It's there, specifically, to prevent nil pointer dereferences
from using an uninitialized api. Considering how rhashtable_init()
can fail (probably due to ENOMEM, if anything), this made the
overall ipc initialization capable of failure as well. That alone
is ugly, but fine, however I've spotted a few issues regarding the
semantics of tables_initialized (however unlikely they may be):

- There is inconsistency in what we return to userspace: ipc_addid()
returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr()
returns EINVAL.

- After we started using rhashtables, ipc_findkey() can return nil upon
!tables_initialized, but the caller expects nil for when the ipc structure
isn't found, and can therefore call into ipcget() callbacks.

Now that rhashtable initialization cannot fail, we can properly
get rid of the hack altogether.

Signed-off-by: Davidlohr Bueso 

(commit id extended to 12 digits)
Signed-off-by: Manfred Spraul 
---
 include/linux/ipc_namespace.h |  1 -
 ipc/util.c| 23 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8eb2f3..37f3a4b7c637 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,7 +16,6 @@ struct user_namespace;
 struct ipc_ids {
int in_use;
unsigned short seq;
-   bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
int max_id;
diff --git a/ipc/util.c b/ipc/util.c
index a41b8a69de13..ae485b41ea0b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -125,7 +125,6 @@ int ipc_init_ids(struct ipc_ids *ids)
if (err)
return err;
idr_init(&ids->ipcs_idr);
-   ids->tables_initialized = true;
ids->max_id = -1;
 #ifdef CONFIG_CHECKPOINT_RESTORE
ids->next_id = -1;
@@ -178,19 +177,16 @@ void __init ipc_init_proc_interface(const char *path, 
const char *header,
  */
 static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
 {
-   struct kern_ipc_perm *ipcp = NULL;
+   struct kern_ipc_perm *ipcp;
 
-   if (likely(ids->tables_initialized))
-   ipcp = rhashtable_lookup_fast(&ids->key_ht, &key,
+   ipcp = rhashtable_lookup_fast(&ids->key_ht, &key,
  ipc_kht_params);
+   if (!ipcp)
+   return NULL;
 
-   if (ipcp) {
-   rcu_read_lock();
-   ipc_lock_object(ipcp);
-   return ipcp;
-   }
-
-   return NULL;
+   rcu_read_lock();
+   ipc_lock_object(ipcp);
+   return ipcp;
 }
 
 /*
@@ -246,7 +242,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
if (limit > IPCMNI)
limit = IPCMNI;
 
-   if (!ids->tables_initialized || ids->in_use >= limit)
+   if (ids->in_use >= limit)
return -ENOSPC;
 
idr_preload(GFP_KERNEL);
@@ -568,9 +564,6 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id)
struct kern_ipc_perm *out;
int lid = ipcid_to_idx(id);
 
-   if (unlikely(!ids->tables_initialized))
-   return ERR_PTR(-EINVAL);
-
out = idr_find(&ids->ipcs_idr, lid);
if (!out)
return ERR_PTR(-EINVAL);
-- 
2.17.1



[PATCH 11/12] ipc: simplify ipc initialization

2018-07-09 Thread Manfred Spraul
From: Davidlohr Bueso 

Now that we know that rhashtable_init() will not fail, we
can get rid of a lot of the unnecessary cleanup paths when
the call errored out.

Signed-off-by: Davidlohr Bueso 

(variable name added to util.h to resolve checkpatch warning)
Signed-off-by: Manfred Spraul 
---
 ipc/msg.c   |  9 -
 ipc/namespace.c | 20 
 ipc/sem.c   | 10 --
 ipc/shm.c   |  9 -
 ipc/util.c  | 18 +-
 ipc/util.h  | 18 +-
 6 files changed, 30 insertions(+), 54 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index ba85d8849e8d..346230712259 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -1237,7 +1237,7 @@ COMPAT_SYSCALL_DEFINE5(msgrcv, int, msqid, compat_uptr_t, 
msgp,
 }
 #endif
 
-int msg_init_ns(struct ipc_namespace *ns)
+void msg_init_ns(struct ipc_namespace *ns)
 {
ns->msg_ctlmax = MSGMAX;
ns->msg_ctlmnb = MSGMNB;
@@ -1245,7 +1245,7 @@ int msg_init_ns(struct ipc_namespace *ns)
 
atomic_set(&ns->msg_bytes, 0);
atomic_set(&ns->msg_hdrs, 0);
-   return ipc_init_ids(&ns->ids[IPC_MSG_IDS]);
+   ipc_init_ids(&ns->ids[IPC_MSG_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -1286,12 +1286,11 @@ static int sysvipc_msg_proc_show(struct seq_file *s, 
void *it)
 }
 #endif
 
-int __init msg_init(void)
+void __init msg_init(void)
 {
-   const int err = msg_init_ns(&init_ipc_ns);
+   msg_init_ns(&init_ipc_ns);
 
ipc_init_proc_interface("sysvipc/msg",
"   key  msqid perms  cbytes   
qnum lspid lrpid   uid   gid  cuid  cgid  stime  rtime  ctime\n",
IPC_MSG_IDS, sysvipc_msg_proc_show);
-   return err;
 }
diff --git a/ipc/namespace.c b/ipc/namespace.c
index f59a89966f92..21607791d62c 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -55,28 +55,16 @@ static struct ipc_namespace *create_ipc_ns(struct 
user_namespace *user_ns,
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;
 
-   err = sem_init_ns(ns);
+   err = mq_init_ns(ns);
if (err)
goto fail_put;
-   err = msg_init_ns(ns);
-   if (err)
-   goto fail_destroy_sem;
-   err = shm_init_ns(ns);
-   if (err)
-   goto fail_destroy_msg;
 
-   err = mq_init_ns(ns);
-   if (err)
-   goto fail_destroy_shm;
+   sem_init_ns(ns);
+   msg_init_ns(ns);
+   shm_init_ns(ns);
 
return ns;
 
-fail_destroy_shm:
-   shm_exit_ns(ns);
-fail_destroy_msg:
-   msg_exit_ns(ns);
-fail_destroy_sem:
-   sem_exit_ns(ns);
 fail_put:
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
diff --git a/ipc/sem.c b/ipc/sem.c
index 9742e9a1c0c2..f3de2f5e7b9b 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -220,14 +220,14 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void 
*it);
 #define sc_semopm  sem_ctls[2]
 #define sc_semmni  sem_ctls[3]
 
-int sem_init_ns(struct ipc_namespace *ns)
+void sem_init_ns(struct ipc_namespace *ns)
 {
ns->sc_semmsl = SEMMSL;
ns->sc_semmns = SEMMNS;
ns->sc_semopm = SEMOPM;
ns->sc_semmni = SEMMNI;
ns->used_sems = 0;
-   return ipc_init_ids(&ns->ids[IPC_SEM_IDS]);
+   ipc_init_ids(&ns->ids[IPC_SEM_IDS]);
 }
 
 #ifdef CONFIG_IPC_NS
@@ -239,14 +239,12 @@ void sem_exit_ns(struct ipc_namespace *ns)
 }
 #endif
 
-int __init sem_init(void)
+void __init sem_init(void)
 {
-   const int err = sem_init_ns(&init_ipc_ns);
-
+   sem_init_ns(&init_ipc_ns);
ipc_init_proc_interface("sysvipc/sem",
"   key  semid perms  nsems   uid   
gid  cuid  cgid  otime  ctime\n",
IPC_SEM_IDS, sysvipc_sem_proc_show);
-   return err;
 }
 
 /**
diff --git a/ipc/shm.c b/ipc/shm.c
index cd8655c7bb77..1db4cf91f676 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -95,14 +95,14 @@ static void shm_destroy(struct ipc_namespace *ns, struct 
shmid_kernel *shp);
 static int sysvipc_shm_proc_show(struct seq_file *s, void *it);
 #endif
 
-int shm_init_ns(struct ipc_namespace *ns)
+void shm_init_ns(struct ipc_namespace *ns)
 {
ns->shm_ctlmax = SHMMAX;
ns->shm_ctlall = SHMALL;
ns->shm_ctlmni = SHMMNI;
ns->shm_rmid_forced = 0;
ns->shm_tot = 0;
-   return ipc_init_ids(&shm_ids(ns));
+   ipc_init_ids(&shm_ids(ns));
 }
 
 /*
@@ -135,9 +135,8 @@ void shm_exit_ns(struct ipc_namespace *ns)
 
 static int __init ipc_ns_init(void)
 {
-   const int err = shm_init_ns(&init_ipc_ns);
-   WARN(err, "ipc: sysv shm_init_ns failed: %d\n", err);
-   return err;
+   shm_init_ns(&init_ipc_ns);
+   return 0;
 }
 
 pure_initcall(ipc_ns_init);
diff --git a/ipc

[PATCH 04/12] ipc: Rename ipcctl_pre_down_nolock().

2018-07-09 Thread Manfred Spraul
Both the comment and the name of ipcctl_pre_down_nolock()
are misleading: The function must be called while holding
the rw semaphore.
Therefore the patch renames the function to ipcctl_obtain_check():
This name matches the other names used in util.c:
- "obtain" function look up a pointer in the idr, without
  acquiring the object lock.
- The caller is responsible for locking.
- _check means that the sequence number is checked.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 ipc/msg.c  | 2 +-
 ipc/sem.c  | 2 +-
 ipc/shm.c  | 2 +-
 ipc/util.c | 8 
 ipc/util.h | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 5bf5cb8017ea..ba85d8849e8d 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -385,7 +385,7 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, 
int cmd,
down_write(&msg_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &msg_ids(ns), msqid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &msg_ids(ns), msqid, cmd,
  &msqid64->msg_perm, msqid64->msg_qbytes);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/sem.c b/ipc/sem.c
index 9d49efeac2e5..9742e9a1c0c2 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1595,7 +1595,7 @@ static int semctl_down(struct ipc_namespace *ns, int 
semid,
down_write(&sem_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &sem_ids(ns), semid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &sem_ids(ns), semid, cmd,
  &semid64->sem_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/shm.c b/ipc/shm.c
index 06b7bf11a011..426ba1039a7b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -868,7 +868,7 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, 
int cmd,
down_write(&shm_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &shm_ids(ns), shmid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &shm_ids(ns), shmid, cmd,
  &shmid64->shm_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/util.c b/ipc/util.c
index 8b09496ed720..bbb1ce212a0d 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -703,7 +703,7 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
 }
 
 /**
- * ipcctl_pre_down_nolock - retrieve an ipc and check permissions for some 
IPC_XXX cmd
+ * ipcctl_obtain_check - retrieve an ipc object and check permissions
  * @ns:  ipc namespace
  * @ids:  the table of ids where to look for the ipc
  * @id:   the id of the ipc to retrieve
@@ -713,16 +713,16 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
  *
  * This function does some common audit and permissions check for some IPC_XXX
  * cmd and is called from semctl_down, shmctl_down and msgctl_down.
- * It must be called without any lock held and:
  *
- *   - retrieves the ipc with the given id in the given table.
+ * It:
+ *   - retrieves the ipc object with the given id in the given table.
  *   - performs some audit and permission check, depending on the given cmd
  *   - returns a pointer to the ipc object or otherwise, the corresponding
  * error.
  *
  * Call holding the both the rwsem and the rcu read lock.
  */
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
struct ipc_ids *ids, int id, int cmd,
struct ipc64_perm *perm, int extra_perm)
 {
diff --git a/ipc/util.h b/ipc/util.h
index 0aba3230d007..fcf81425ae98 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -148,7 +148,7 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id);
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
 int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
 struct ipc_ids *ids, int id, int 
cmd,
 struct ipc64_perm *perm, int 
extra_perm);
 
-- 
2.17.1



[PATCH 12/12] ipc/util.c: Further ipc_idr_alloc cleanups.

2018-07-09 Thread Manfred Spraul
If idr_alloc within ipc_idr_alloc fails, then the return value (-ENOSPC)
is used to calculate new->id.
Technically, this is not a bug, because new->id is never accessed.

But: Clean it up anyway: On error, just return, do not set new->id.
And improve the documentation.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
---
 ipc/util.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index d474f2b3b299..302c18fc846b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -182,11 +182,20 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
 }
 
 /*
- * Specify desired id for next allocated IPC object.
+ * Insert new IPC object into idr tree, and set sequence number and id
+ * in the correct order.
+ * Especially:
+ * - the sequence number must be set before inserting the object into the idr,
+ *   because the sequence number is accessed without a lock.
+ * - the id can/must be set after inserting the object into the idr.
+ *   All accesses must be done after getting kern_ipc_perm.lock.
+ *
+ * The caller must own kern_ipc_perm.lock of the new object.
+ * On error, the function returns a (negative) error code.
  */
 static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
-   int key, next_id = -1;
+   int id, next_id = -1;
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
next_id = ids->next_id;
@@ -197,14 +206,15 @@ static inline int ipc_idr_alloc(struct ipc_ids *ids, 
struct kern_ipc_perm *new)
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   id = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
} else {
new->seq = ipcid_to_seqx(next_id);
-   key = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
+   id = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id),
0, GFP_NOWAIT);
}
-   new->id = SEQ_MULTIPLIER * new->seq + key;
-   return key;
+   if (id >= 0)
+   new->id = SEQ_MULTIPLIER * new->seq + id;
+   return id;
 }
 
 /**
-- 
2.17.1



[PATCH 03/12] ipc/util.c: Use ipc_rcu_putref() for failues in ipc_addid()

2018-07-09 Thread Manfred Spraul
ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
  because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
  because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early,
and by modifying all callers.

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with reading kern_ipc_perm.seq,
here both read and write to already released memory could happen.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
---
 ipc/msg.c  |  2 +-
 ipc/sem.c  |  2 +-
 ipc/shm.c  |  2 ++
 ipc/util.c | 12 ++--
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 829c2062ded4..5bf5cb8017ea 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -162,7 +162,7 @@ static int newque(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks msq upon success. */
retval = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
if (retval < 0) {
-   call_rcu(&msq->q_perm.rcu, msg_rcu_free);
+   ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
return retval;
}
 
diff --git a/ipc/sem.c b/ipc/sem.c
index e8971fa1d847..9d49efeac2e5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -556,7 +556,7 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks sma upon success. */
retval = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
if (retval < 0) {
-   call_rcu(&sma->sem_perm.rcu, sem_rcu_free);
+   ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
return retval;
}
ns->used_sems += nsems;
diff --git a/ipc/shm.c b/ipc/shm.c
index 59fe8b3b3794..06b7bf11a011 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -671,6 +671,8 @@ static int newseg(struct ipc_namespace *ns, struct 
ipc_params *params)
if (is_file_hugepages(file) && shp->mlock_user)
user_shm_unlock(size, shp->mlock_user);
fput(file);
+   ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+   return error;
 no_file:
call_rcu(&shp->shm_perm.rcu, shm_rcu_free);
return error;
diff --git a/ipc/util.c b/ipc/util.c
index 662c28c6c9fa..8b09496ed720 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -248,7 +248,9 @@ static inline void ipc_set_seq(struct ipc_ids *ids,
  * Add an entry 'new' to the ipc ids idr. The permissions object is
  * initialised and the first free entry is set up and the id assigned
  * is returned. The 'new' entry is returned in a locked state on success.
+ *
  * On failure the entry is not locked and a negative err-code is returned.
+ * The caller must use ipc_rcu_putref() to free the identifier.
  *
  * Called with writer ipc_ids.rwsem held.
  */
@@ -258,6 +260,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
kgid_t egid;
int id, err;
 
+   /* 1) Initialize the refcount so that ipc_rcu_putref works */
+   refcount_set(&new->refcount, 1);
+
if (limit > IPCMNI)
limit = IPCMNI;
 
@@ -266,9 +271,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
 
idr_preload(GFP_KERNEL);
 
-   refcount_set(&new->refcount, 1);
spin_lock_init(&new->lock);
-   new->deleted = false;
rcu_read_lock();
spin_lock(&new->lock);
 
@@ -277,6 +280,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
new->gid = new->cgid = egid;
 
ipc_set_seq(ids, new);
+   new->deleted = false;
 
/*
 * As soon as a new object is inserted into the idr,
@@ -288,6 +292,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
 * Thus the object must be fully initialized, and if something fails,
 * then the full tear-down sequence must be followed.
 * (i.e.: set new->deleted, reduce refcount, call_rcu())
+*
+* This function sets new->deleted; the caller must use ipc_rcu_putref()
+* for the remaining steps.
 */
id = ipc_idr_alloc(ids, new);
idr_preload_end();
@@ -301,6 +308,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
}
}
if (id < 0) {
+   new->deleted = true;
spin_unlock(&new->lock);
rcu_read_unlock();
return id;
-- 
2.17.1



[PATCH 0/12 V2] ipc: cleanups & bugfixes, rhashtable update

2018-07-09 Thread Manfred Spraul
Hi,

I have merged the patches from Dmitry, Davidlohr and myself:

- patch #1-#6: Fix syzcall findings & further race cleanups
- patch #7: Cleanup from Dmitry for ipc_idr_alloc.
- patch #8-#11: rhashtable improvement from Davidlohr
- patch #12: Another cleanup for ipc_idr_alloc.

@Davidlohr:
Please double check that I have taken the correct patches, and
that I didn't break anything.
Especially, I had to reformat the commit ids, otherwise checkpatch
complained.

@Dmitry: Patch #12 reworks your ipc_idr_alloc patch.
Ok?

@Andrew:
Can you merge the patches into -mm/next?

I have not seen any issues in my tests.

--
    Manfred


[PATCH 05/12] ipc/util.c: correct comment in ipc_obtain_object_check

2018-07-09 Thread Manfred Spraul
The comment that explains ipc_obtain_object_check is wrong:
The function checks the sequence number, not the reference
counter.
Note that checking the reference counter would be meaningless:
The reference counter is decreased without holding any locks,
thus an object with kern_ipc_perm.deleted=true may disappear at
the end of the next rcu grace period.

Signed-off-by: Manfred Spraul 
Cc: Davidlohr Bueso 
---
 ipc/util.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index bbb1ce212a0d..8133f10832a9 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -644,8 +644,8 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
  * @ids: ipc identifier set
  * @id: ipc id to look for
  *
- * Similar to ipc_obtain_object_idr() but also checks
- * the ipc object reference counter.
+ * Similar to ipc_obtain_object_idr() but also checks the ipc object
+ * sequence number.
  *
  * Call inside the RCU critical section.
  * The ipc object is *not* locked on exit.
-- 
2.17.1



[PATCH 02/12] ipc: reorganize initialization of kern_ipc_perm.seq

2018-07-09 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.seq after having called
ipc_idr_alloc().

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
may see an uninitialized value.

The patch moves the initialization of kern_ipc_perm.seq before the
calls of ipc_idr_alloc().

Notes:
1) This patch has a user space visible side effect:
If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
if semget()/msgget()/shmget() fails in the final step of adding the id
to the rhash tree, then .._next_id is cleared. Before the patch, it
remained unmodified.

There is no change of the behavior after a successful ..get() call:
It always clears .._next_id, there is no impact to non checkpoint/restore
code as that code does not use .._next_id.

2) The patch correctly documents that after a call to ipc_idr_alloc(),
the full tear-down sequence must be used. The callers of ipc_addid()
do not fulfill that, i.e. more bugfixes are required.

Reported-by: syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com
Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
Cc: Michael Kerrisk 
---
 Documentation/sysctl/kernel.txt |  3 ++-
 ipc/util.c  | 45 +++--
 2 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index eded671d55eb..b2d4a8f8fe97 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -440,7 +440,8 @@ Notes:
 1) kernel doesn't guarantee, that new object will have desired id. So,
 it's up to userspace, how to handle an object with "wrong" id.
 2) Toggle with non-default value will be set back to -1 by kernel after
-successful IPC object allocation.
+successful IPC object allocation. If an IPC object allocation syscall
+fails, it is undefined if the value remains unmodified or is reset to -1.
 
 ==
 
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182fa0ac..662c28c6c9fa 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -197,13 +197,24 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
 /*
  * Specify desired id for next allocated IPC object.
  */
-#define ipc_idr_alloc(ids, new)
\
-   idr_alloc(&(ids)->ipcs_idr, (new),  \
- (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
- 0, GFP_NOWAIT)
+static inline int ipc_idr_alloc(struct ipc_ids *ids,
+   struct kern_ipc_perm *new)
+{
+   int key;
 
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
- struct kern_ipc_perm *new)
+   if (ids->next_id < 0) {
+   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   } else {
+   key = idr_alloc(&ids->ipcs_idr, new,
+   ipcid_to_idx(ids->next_id),
+   0, GFP_NOWAIT);
+   ids->next_id = -1;
+   }
+   return key;
+}
+
+static inline void ipc_set_seq(struct ipc_ids *ids,
+   struct kern_ipc_perm *new)
 {
if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
new->seq = ids->seq++;
@@ -211,24 +222,19 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
ids->seq = 0;
} else {
new->seq = ipcid_to_seqx(ids->next_id);
-   ids->next_id = -1;
}
-
-   return SEQ_MULTIPLIER * new->seq + id;
 }
 
 #else
 #define ipc_idr_alloc(ids, new)\
idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
 
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
+static inline void ipc_set_seq(struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-
-   return SEQ_MULTIPLIER * new->seq + id;
 }
 
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -270,6 +276,19 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;
 
+   ipc_set_seq(ids, new);
+
+   /*
+* As soon as a new object is inserted into the idr,
+* ipc_obtain_object_idr() or ipc_obtain_object_check() can find it,
+* and the lockless preparations for ipc operations can start.
+* This means especially: permission checks, audit calls, allocation
+* of undo structures, ...
+*
+* Thus the object must be fully initialized, and if something fails,
+* then the full tear-down sequence must be followed.
+* (i.e.: set new->dele

Re: [PATCH 2/6] ipc: reorganize initialization of kern_ipc_perm.seq

2018-07-05 Thread Manfred Spraul

Hi Dmitry,

On 07/05/2018 10:36 AM, Dmitry Vyukov wrote:

[...]
Hi Manfred,

The series looks like a significant improvement to me. Thanks!

I feel that this code can be further simplified (unless I am missing
something here). Please take a look at this version:

https://github.com/dvyukov/linux/commit/f77aeaf80f3c4ab524db92184d874b03063fea3a?diff=split

This is on top of your patches. It basically does the same as your
code, but consolidates all id/seq assignment and dealing with next_id,
and deduplicates code re CONFIG_CHECKPOINT_RESTORE. Currently it's a
bit tricky to follow e.g. where exactly next_id is consumed and where
it needs to be left intact.
The only difference is that my code assigns new->id earlier. Not sure
if it can lead to anything bad. But if yes, then it seems that
currently uninitialized new->id is exposed. If necessary (?) we could
reset new->id in the same place where we set new->deleted.

Everything looks correct to me; it is better than the current code.
Except that you didn't sign off your last patch.

As a next step: who can merge the patches towards linux-next?
The only open point that I see is stress tests of the error code paths.

And:
I don't think that the patches are relevant for linux-stable, correct?

--
    Manfred


[PATCH 3/6] ipc/util.c: Use ipc_rcu_putref() for failues in ipc_addid()

2018-07-04 Thread Manfred Spraul
ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
  because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
  because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early,
and by modifying all callers.

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with reading kern_ipc_perm.seq,
here both read and write to already released memory could happen.

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
---
 ipc/msg.c  |  2 +-
 ipc/sem.c  |  2 +-
 ipc/shm.c  |  2 ++
 ipc/util.c | 12 ++--
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 829c2062ded4..5bf5cb8017ea 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -162,7 +162,7 @@ static int newque(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks msq upon success. */
retval = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
if (retval < 0) {
-   call_rcu(&msq->q_perm.rcu, msg_rcu_free);
+   ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
return retval;
}
 
diff --git a/ipc/sem.c b/ipc/sem.c
index e8971fa1d847..9d49efeac2e5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -556,7 +556,7 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
/* ipc_addid() locks sma upon success. */
retval = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
if (retval < 0) {
-   call_rcu(&sma->sem_perm.rcu, sem_rcu_free);
+   ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
return retval;
}
ns->used_sems += nsems;
diff --git a/ipc/shm.c b/ipc/shm.c
index 59fe8b3b3794..06b7bf11a011 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -671,6 +671,8 @@ static int newseg(struct ipc_namespace *ns, struct 
ipc_params *params)
if (is_file_hugepages(file) && shp->mlock_user)
user_shm_unlock(size, shp->mlock_user);
fput(file);
+   ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
+   return error;
 no_file:
call_rcu(&shp->shm_perm.rcu, shm_rcu_free);
return error;
diff --git a/ipc/util.c b/ipc/util.c
index 662c28c6c9fa..8b09496ed720 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -248,7 +248,9 @@ static inline void ipc_set_seq(struct ipc_ids *ids,
  * Add an entry 'new' to the ipc ids idr. The permissions object is
  * initialised and the first free entry is set up and the id assigned
  * is returned. The 'new' entry is returned in a locked state on success.
+ *
  * On failure the entry is not locked and a negative err-code is returned.
+ * The caller must use ipc_rcu_putref() to free the identifier.
  *
  * Called with writer ipc_ids.rwsem held.
  */
@@ -258,6 +260,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
kgid_t egid;
int id, err;
 
+   /* 1) Initialize the refcount so that ipc_rcu_putref works */
+   refcount_set(&new->refcount, 1);
+
if (limit > IPCMNI)
limit = IPCMNI;
 
@@ -266,9 +271,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
 
idr_preload(GFP_KERNEL);
 
-   refcount_set(&new->refcount, 1);
spin_lock_init(&new->lock);
-   new->deleted = false;
rcu_read_lock();
spin_lock(&new->lock);
 
@@ -277,6 +280,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
new->gid = new->cgid = egid;
 
ipc_set_seq(ids, new);
+   new->deleted = false;
 
/*
 * As soon as a new object is inserted into the idr,
@@ -288,6 +292,9 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
 * Thus the object must be fully initialized, and if something fails,
 * then the full tear-down sequence must be followed.
 * (i.e.: set new->deleted, reduce refcount, call_rcu())
+*
+* This function sets new->deleted; the caller must use ipc_rcu_putref()
+* for the remaining steps.
 */
id = ipc_idr_alloc(ids, new);
idr_preload_end();
@@ -301,6 +308,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
}
}
if (id < 0) {
+   new->deleted = true;
spin_unlock(&new->lock);
rcu_read_unlock();
return id;
-- 
2.17.1



[PATCH 2/6] ipc: reorganize initialization of kern_ipc_perm.seq

2018-07-04 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.seq after having called
ipc_idr_alloc().

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check()
may see an uninitialized value.

The patch moves the initialization of kern_ipc_perm.seq before the
calls of ipc_idr_alloc().

Notes:
1) This patch has a user space visible side effect:
If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and
if semget()/msgget()/shmget() fails in the final step of adding the id
to the rhash tree, then .._next_id is cleared. Before the patch, it
remained unmodified.

There is no change of the behavior after a successful ..get() call:
It always clears .._next_id, there is no impact to non checkpoint/restore
code as that code does not use .._next_id.

2) The patch correctly documents that after a call to ipc_idr_alloc(),
the full tear-down sequence must be used. The callers of ipc_addid()
do not fulfill that, i.e. more bugfixes are required.

Reported-by: syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com
Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
Cc: Michael Kerrisk 
Signed-off-by: Manfred Spraul 
---
 Documentation/sysctl/kernel.txt |  3 ++-
 ipc/util.c  | 45 +++--
 2 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index eded671d55eb..b2d4a8f8fe97 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -440,7 +440,8 @@ Notes:
 1) kernel doesn't guarantee, that new object will have desired id. So,
 it's up to userspace, how to handle an object with "wrong" id.
 2) Toggle with non-default value will be set back to -1 by kernel after
-successful IPC object allocation.
+successful IPC object allocation. If an IPC object allocation syscall
+fails, it is undefined if the value remains unmodified or is reset to -1.
 
 ==
 
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182fa0ac..662c28c6c9fa 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -197,13 +197,24 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
 /*
  * Specify desired id for next allocated IPC object.
  */
-#define ipc_idr_alloc(ids, new)
\
-   idr_alloc(&(ids)->ipcs_idr, (new),  \
- (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
- 0, GFP_NOWAIT)
+static inline int ipc_idr_alloc(struct ipc_ids *ids,
+   struct kern_ipc_perm *new)
+{
+   int key;
 
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
- struct kern_ipc_perm *new)
+   if (ids->next_id < 0) {
+   key = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
+   } else {
+   key = idr_alloc(&ids->ipcs_idr, new,
+   ipcid_to_idx(ids->next_id),
+   0, GFP_NOWAIT);
+   ids->next_id = -1;
+   }
+   return key;
+}
+
+static inline void ipc_set_seq(struct ipc_ids *ids,
+   struct kern_ipc_perm *new)
 {
if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
new->seq = ids->seq++;
@@ -211,24 +222,19 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
ids->seq = 0;
} else {
new->seq = ipcid_to_seqx(ids->next_id);
-   ids->next_id = -1;
}
-
-   return SEQ_MULTIPLIER * new->seq + id;
 }
 
 #else
 #define ipc_idr_alloc(ids, new)\
idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
 
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
+static inline void ipc_set_seq(struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
new->seq = ids->seq++;
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
-
-   return SEQ_MULTIPLIER * new->seq + id;
 }
 
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -270,6 +276,19 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;
 
+   ipc_set_seq(ids, new);
+
+   /*
+* As soon as a new object is inserted into the idr,
+* ipc_obtain_object_idr() or ipc_obtain_object_check() can find it,
+* and the lockless preparations for ipc operations can start.
+* This means especially: permission checks, audit calls, allocation
+* of undo structures, ...
+*
+* Thus the object must be fully initialized, and if something fails,
+* then the full tear-down sequence must

[PATCH 6/6] ipc/util.c: correct comment in ipc_obtain_object_check

2018-07-04 Thread Manfred Spraul
The comment that explains ipc_obtain_object_check is wrong:
The function checks the sequence number, not the reference
counter.
Note that checking the reference counter would be meaningless:
The reference counter is decreased without holding any locks,
thus an object with kern_ipc_perm.deleted=true may disappear at
the end of the next rcu grace period.

Signed-off-by: Manfred Spraul 
---
 ipc/util.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 4f2db913acf9..776a9ce2905f 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -646,8 +646,8 @@ struct kern_ipc_perm *ipc_lock_idr(struct ipc_ids *ids, int 
id)
  * @ids: ipc identifier set
  * @id: ipc id to look for
  *
- * Similar to ipc_obtain_object_idr() but also checks
- * the ipc object reference counter.
+ * Similar to ipc_obtain_object_idr() but also checks the ipc object
+ * sequence number.
  *
  * Call inside the RCU critical section.
  * The ipc object is *not* locked on exit.
-- 
2.17.1



[PATCH 4/6] ipc: Rename ipcctl_pre_down_nolock().

2018-07-04 Thread Manfred Spraul
Both the comment and the name of ipcctl_pre_down_nolock()
are misleading: The function must be called while holding
the rw semaphore.
Therefore the patch renames the function to ipcctl_obtain_check():
This name matches the other names used in util.c:
- "obtain" function look up a pointer in the idr, without
  acquiring the object lock.
- The caller is responsible for locking.
- _check means that some checks are made.

Signed-off-by: Manfred Spraul 
---
 ipc/msg.c  | 2 +-
 ipc/sem.c  | 2 +-
 ipc/shm.c  | 2 +-
 ipc/util.c | 6 +++---
 ipc/util.h | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 5bf5cb8017ea..ba85d8849e8d 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -385,7 +385,7 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, 
int cmd,
down_write(&msg_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &msg_ids(ns), msqid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &msg_ids(ns), msqid, cmd,
  &msqid64->msg_perm, msqid64->msg_qbytes);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/sem.c b/ipc/sem.c
index 9d49efeac2e5..9742e9a1c0c2 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1595,7 +1595,7 @@ static int semctl_down(struct ipc_namespace *ns, int 
semid,
down_write(&sem_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &sem_ids(ns), semid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &sem_ids(ns), semid, cmd,
  &semid64->sem_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/shm.c b/ipc/shm.c
index 06b7bf11a011..426ba1039a7b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -868,7 +868,7 @@ static int shmctl_down(struct ipc_namespace *ns, int shmid, 
int cmd,
down_write(&shm_ids(ns).rwsem);
rcu_read_lock();
 
-   ipcp = ipcctl_pre_down_nolock(ns, &shm_ids(ns), shmid, cmd,
+   ipcp = ipcctl_obtain_check(ns, &shm_ids(ns), shmid, cmd,
  &shmid64->shm_perm, 0);
if (IS_ERR(ipcp)) {
err = PTR_ERR(ipcp);
diff --git a/ipc/util.c b/ipc/util.c
index 8b09496ed720..751d39baaf38 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -703,7 +703,7 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
 }
 
 /**
- * ipcctl_pre_down_nolock - retrieve an ipc and check permissions for some 
IPC_XXX cmd
+ * ipcctl_obtain_check - retrieve an ipc and check permissions for some 
IPC_XXX cmd
  * @ns:  ipc namespace
  * @ids:  the table of ids where to look for the ipc
  * @id:   the id of the ipc to retrieve
@@ -713,7 +713,7 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
  *
  * This function does some common audit and permissions check for some IPC_XXX
  * cmd and is called from semctl_down, shmctl_down and msgctl_down.
- * It must be called without any lock held and:
+ * It:
  *
  *   - retrieves the ipc with the given id in the given table.
  *   - performs some audit and permission check, depending on the given cmd
@@ -722,7 +722,7 @@ int ipc_update_perm(struct ipc64_perm *in, struct 
kern_ipc_perm *out)
  *
  * Call holding the both the rwsem and the rcu read lock.
  */
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
struct ipc_ids *ids, int id, int cmd,
struct ipc64_perm *perm, int extra_perm)
 {
diff --git a/ipc/util.h b/ipc/util.h
index 0aba3230d007..fcf81425ae98 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -148,7 +148,7 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids 
*ids, int id);
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
 int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
-struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace *ns,
 struct ipc_ids *ids, int id, int 
cmd,
 struct ipc64_perm *perm, int 
extra_perm);
 
-- 
2.17.1



[PATCH 5/6] ipc: rename ipc_lock() to ipc_lock_idr()

2018-07-04 Thread Manfred Spraul
ipc/util.c contains multiple functions to get the ipc object
pointer given an id number.

There are two sets of functions: One set verifies the sequence
counter part of the id number, the other set does not check
the sequence counter.

The standard for function names in ipc/util.c is
- ..._check() functions verify the sequence counter
- ..._idr() functions do not verify the sequence counter

ipc_lock() is an exception: It does not verify the sequence
counter value, but this is not obvious from the function name.

Therefore: Rename the function to ipc_lock_idr(), to make it
obvious that it does not check the sequence counter.
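
For illustration only, a minimal sketch of the difference between the two
flavours (not the actual util.c code; error handling trimmed, both lookups
run under rcu_read_lock()):

	/* _idr flavour: plain lookup by index, no sequence check */
	ipcp = idr_find(&ids->ipcs_idr, ipcid_to_idx(id));

	/* _check flavour: additionally verify the sequence counter */
	ipcp = idr_find(&ids->ipcs_idr, ipcid_to_idx(id));
	if (ipcp && ipc_checkid(ipcp, id))
		ipcp = ERR_PTR(-EINVAL);	/* stale id: seq does not match */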

Signed-off-by: Manfred Spraul 
---
 ipc/shm.c  |  4 ++--
 ipc/util.c | 10 ++
 ipc/util.h |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 426ba1039a7b..cd8655c7bb77 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -179,11 +179,11 @@ static inline struct shmid_kernel 
*shm_obtain_object_check(struct ipc_namespace
  */
 static inline struct shmid_kernel *shm_lock(struct ipc_namespace *ns, int id)
 {
-   struct kern_ipc_perm *ipcp = ipc_lock(&shm_ids(ns), id);
+   struct kern_ipc_perm *ipcp = ipc_lock_idr(&shm_ids(ns), id);
 
/*
 * Callers of shm_lock() must validate the status of the returned ipc
-* object pointer (as returned by ipc_lock()), and error out as
+* object pointer (as returned by ipc_lock_idr()), and error out as
 * appropriate.
 */
if (IS_ERR(ipcp))
diff --git a/ipc/util.c b/ipc/util.c
index 751d39baaf38..4f2db913acf9 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -604,15 +604,17 @@ struct kern_ipc_perm *ipc_obtain_object_idr(struct 
ipc_ids *ids, int id)
 }
 
 /**
- * ipc_lock - lock an ipc structure without rwsem held
+ * ipc_lock_idr - lock an ipc structure without rwsem held
  * @ids: ipc identifier set
  * @id: ipc id to look for
  *
  * Look for an id in the ipc ids idr and lock the associated ipc object.
+ * The function does not check if the sequence counter matches the
+ * found ipc object.
  *
  * The ipc object is locked on successful exit.
  */
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
+struct kern_ipc_perm *ipc_lock_idr(struct ipc_ids *ids, int id)
 {
struct kern_ipc_perm *out;
 
@@ -624,8 +626,8 @@ struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
spin_lock(&out->lock);
 
/*
-* ipc_rmid() may have already freed the ID while ipc_lock()
-* was spinning: here verify that the structure is still valid.
+* ipc_rmid() may have already freed the ID while waiting for
+* the lock. Here verify that the structure is still valid.
 * Upon races with RMID, return -EIDRM, thus indicating that
 * the ID points to a removed identifier.
 */
diff --git a/ipc/util.h b/ipc/util.h
index fcf81425ae98..ed74b0fc68c9 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -142,7 +142,7 @@ int ipc_rcu_getref(struct kern_ipc_perm *ptr);
 void ipc_rcu_putref(struct kern_ipc_perm *ptr,
void (*func)(struct rcu_head *head));
 
-struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
+struct kern_ipc_perm *ipc_lock_idr(struct ipc_ids *, int);
 struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
-- 
2.17.1



[PATCH 1/6] ipc: reorganize initialization of kern_ipc_perm.id

2018-07-04 Thread Manfred Spraul
ipc_addid() initializes kern_ipc_perm.id after having called
ipc_idr_alloc().

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_idr()
may see an uninitialized value.

The patch moves all accesses to kern_ipc_perm.id under the spin_lock().

The issue is related to the finding of
syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com:
syzbot found an issue with kern_ipc_perm.seq

Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
Signed-off-by: Manfred Spraul 
---
 ipc/msg.c | 19 ++-
 ipc/sem.c | 18 +-
 ipc/shm.c | 19 ++-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 3b6545302598..829c2062ded4 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -491,7 +491,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 int cmd, struct msqid64_ds *p)
 {
struct msg_queue *msq;
-   int id = 0;
int err;
 
memset(p, 0, sizeof(*p));
@@ -503,7 +502,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
err = PTR_ERR(msq);
goto out_unlock;
}
-   id = msq->q_perm.id;
} else { /* IPC_STAT */
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
@@ -548,10 +546,21 @@ static int msgctl_stat(struct ipc_namespace *ns, int 
msqid,
p->msg_lspid  = pid_vnr(msq->q_lspid);
p->msg_lrpid  = pid_vnr(msq->q_lrpid);
 
-   ipc_unlock_object(&msq->q_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* MSG_STAT and MSG_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = msq->q_perm.id;
+   }
 
+   ipc_unlock_object(&msq->q_perm);
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/sem.c b/ipc/sem.c
index 5af1943ad782..e8971fa1d847 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1222,7 +1222,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 {
struct sem_array *sma;
time64_t semotime;
-   int id = 0;
int err;
 
memset(semid64, 0, sizeof(*semid64));
@@ -1234,7 +1233,6 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
err = PTR_ERR(sma);
goto out_unlock;
}
-   id = sma->sem_perm.id;
} else { /* IPC_STAT */
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
@@ -1274,10 +1272,20 @@ static int semctl_stat(struct ipc_namespace *ns, int 
semid,
 #endif
semid64->sem_nsems = sma->sem_nsems;
 
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SEM_STAT and SEM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = sma->sem_perm.id;
+   }
ipc_unlock_object(&sma->sem_perm);
-   rcu_read_unlock();
-   return id;
-
 out_unlock:
rcu_read_unlock();
return err;
diff --git a/ipc/shm.c b/ipc/shm.c
index 051a3e1fb8df..59fe8b3b3794 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -949,7 +949,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
int cmd, struct shmid64_ds *tbuf)
 {
struct shmid_kernel *shp;
-   int id = 0;
int err;
 
memset(tbuf, 0, sizeof(*tbuf));
@@ -961,7 +960,6 @@ static int shmctl_stat(struct ipc_namespace *ns, int shmid,
err = PTR_ERR(shp);
goto out_unlock;
}
-   id = shp->shm_perm.id;
} else { /* IPC_STAT */
shp = shm_obtain_object_check(ns, shmid);
if (IS_ERR(shp)) {
@@ -1011,10 +1009,21 @@ static int shmctl_stat(struct ipc_namespace *ns, int 
shmid,
tbuf->shm_lpid  = pid_vnr(shp->shm_lprid);
tbuf->shm_nattch = shp->shm_nattch;
 
-   ipc_unlock_object(&shp->shm_perm);
-   rcu_read_unlock();
-   return id;
+   if (cmd == IPC_STAT) {
+   /*
+* As defined in SUS:
+* Return 0 on success
+*/
+   err = 0;
+   } else {
+   /*
+* SHM_STAT and SHM_STAT_ANY (both Linux specific)
+* Return the full id, including the sequence counter
+*/
+   err = shp->shm

[PATCH 0/5] ipc: cleanups & bugfixes

2018-07-04 Thread Manfred Spraul
Hi,

Dmitry convinced me that I should properly review the initialization
of new ipc objects, and I found another issue.

The series corrects 3 issues with ipc_addid(), and also renames
two functions and corrects a wrong comment.

0001-ipc-reorganize-initialization-of-kern_ipc_perm.id:
Access kern_ipc_perm.id under the IPC spinlock.
My original idea of removing kern_ipc_perm.id entirely
is not possible, e.g. the proc interface needs the id.

0002-ipc-reorganize-initialization-of-kern_ipc_perm.seq:
Bugfix for the syzbot finding

0003-ipc-util.c-Use-ipc_rcu_putref-for-failues-in-ipc_add:
Bugfix from code review

0004-ipc-Rename-ipcctl_pre_down_nolock.patch:
Comment update & function rename from code review

0005-ipc-rename-ipc_lock-to-ipc_lock_idr:
Function rename from code review

0006-ipc-util.c-correct-comment-in-ipc_obtain_object_che
Comment correction from code review

The patches are lightly tested; in particular I have not tested
the checkpoint/restore code or the failure cases.

--
Manfred


Re: ipc/msg: zalloc struct msg_queue when creating a new msq

2018-07-04 Thread Manfred Spraul

Hello Dmitry,
On 07/04/2018 12:03 PM, Dmitry Vyukov wrote:

On Wed, Jul 4, 2018 at 11:18 AM, Manfred Spraul
 wrote:


There are 2 relevant values: kern_ipc_perm.id and kern_ipc_perm.seq.

For kern_ipc_perm.id, it is possible to move the access to the codepath that
holds the lock.

For kern_ipc_perm.seq, there are two options:
1) set it before publication.
2) initialize to an invalid value, and correct that at the end.

I'm in favor of option 2: it avoids having to decide whether or not to
reduce the next sequence number:

The purpose of the sequence counter is to minimize the risk that e.g. a
semop() will write into a newly created array.
I intentionally write "minimize the risk", as it is by design impossible to
guarantee that this cannot happen, e.g. if semop() sleeps at the instruction
before the syscall.

Therefore, we can set seq to ULONG_MAX, then ipc_checkid() will always fail
and the corruption is avoided.

What do you think?

And, obviously:
Just setting seq to 0 is dangerous, as the first allocated sequence number is 0,
and if that object is destroyed, then the newly created object temporarily
has sequence number 0 as well.

Hi Manfred,

It still looks fishy to me. This code published uninitialized uid's
for years (which led not only to accidentally accessing wrong
objects, but also to privilege escalation). Now it publishes uninit
id/seq. The first proposed fix still did not make it correct. I can't
say that I see a bug in your patch, but initializing id/seq in a racy
manner rings bells for me. Say, if we write/read seq ahead of id, can
a reader still get access to a wrong object?
It all suggests some design flaw to me. Could ipc_idr_alloc() do full
initialization, i.e. also do what ipc_buildid() does? This would
ensure that we publish a fully constructed object in the first place.
We already have cleanup for ipc_idr_alloc(), which is idr_remove(), so
if we care about seq space conservation even in error conditions
(ENOMEM?), idr_remove() could accept an additional flag saying "this
object should not have been used by sane users yet, so retake its
seq". Did I get your concern about seq properly?

You have convinced me, I'll rewrite the patch:

1) kern_ipc_perm.seq should be accessible under rcu_read_lock(); this
means replacing ipc_buildid() with two functions:
one that initializes kern_ipc_perm.seq, and one that sets
kern_ipc_perm.id.
2) the accesses to kern_ipc_perm.id must be moved to the position where 
the lock is held. This is trivial.
3) we need a clear table that describes which variables can be accessed 
under rcu_read_lock() and which need ipc_lock_object().
  e.g.: kern_ipc_perm.id would end up under ipc_lock_object, 
kern_ipc_perm.seq or the xuid fields can be read under rcu_read_lock().
  Everything that can be accessed without ipc_lock_object must be 
initialized before publication of a new object.


Or, as all accesses to kern_ipc_perm.id are in rare codepaths:
I'll remove kern_ipc_perm.id entirely, and build the id on demand.

Ok?
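
For reference, "building the id on demand" would just mean recomputing what
ipc_buildid() computes today, roughly (a sketch; the helper name is
illustrative):

	/* sketch: derive the user-visible id from seq and the idr index */
	static inline int ipc_get_id(struct kern_ipc_perm *ipcp, int idx)
	{
		return SEQ_MULTIPLIER * ipcp->seq + idx;
	}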

--
    Manfred


[RFC] ipc: refcounting / use after free?

2018-07-04 Thread Manfred Spraul
The ipc code uses the equivalent of

    rcu_read_lock();
    kfree_rcu(a, rcu);
    if (a->deleted) {
        rcu_read_unlock();
        return FAILURE;
    }
    <...>

Is this safe, or is dereferencing "a" after having called call_rcu()
a use-after-free?

According to rcupdate.h, the kfree is only deferred until the
other CPUs exit their critical sections:

include/linux/rcupdate.h:
> * Similarly, if call_rcu() is invoked
> * on one CPU while other CPUs are within RCU read-side critical
> * sections, invocation of the corresponding RCU callback is deferred
> * until after the all the other CPUs exit their critical sections.
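
Spelled out with the RCU-protected lookup that produces "a" made explicit
(a sketch of the pattern in question; lookup and field names are
illustrative, NULL handling added for completeness):

	rcu_read_lock();			/* read-side critical section */
	a = idr_find(&ids->ipcs_idr, idx);	/* RCU-protected lookup */
	if (!a) {
		rcu_read_unlock();
		return -EINVAL;
	}
	kfree_rcu(a, rcu);			/* only schedules the free */
	if (a->deleted) {			/* dereference after kfree_rcu() */
		rcu_read_unlock();
		return -EIDRM;
	}
	<...>
	rcu_read_unlock();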


---
 ipc/msg.c  | 11 ---
 ipc/sem.c  | 42 ++
 ipc/util.c | 35 ---
 ipc/util.h | 18 --
 4 files changed, 86 insertions(+), 20 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 3b6545302598..724000c15296 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -805,7 +805,7 @@ static long do_msgsnd(int msqid, long mtype, void __user 
*mtext,
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
err = PTR_ERR(msq);
-   goto out_unlock1;
+   goto out_unlock2;
}
 
ipc_lock_object(&msq->q_perm);
@@ -851,8 +851,12 @@ static long do_msgsnd(int msqid, long mtype, void __user 
*mtext,
rcu_read_lock();
ipc_lock_object(&msq->q_perm);
 
-   ipc_rcu_putref(&msq->q_perm, msg_rcu_free);
/* raced with RMID? */
+   if (!__ipc_rcu_putref(&msq->q_perm)) {
+   ipc_unlock_object(&msq->q_perm);
+   call_rcu(&msq->q_perm.rcu, msg_rcu_free);
+   goto out_unlock1;
+   }
if (!ipc_valid_object(&msq->q_perm)) {
err = -EIDRM;
goto out_unlock0;
@@ -883,8 +887,9 @@ static long do_msgsnd(int msqid, long mtype, void __user 
*mtext,
 
 out_unlock0:
ipc_unlock_object(&msq->q_perm);
-   wake_up_q(&wake_q);
 out_unlock1:
+   wake_up_q(&wake_q);
+out_unlock2:
rcu_read_unlock();
if (msg != NULL)
free_msg(msg);
diff --git a/ipc/sem.c b/ipc/sem.c
index 5af1943ad782..c269fae05b24 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -475,10 +475,16 @@ static inline struct sem_array 
*sem_obtain_object_check(struct ipc_namespace *ns
return container_of(ipcp, struct sem_array, sem_perm);
 }
 
-static inline void sem_lock_and_putref(struct sem_array *sma)
+static int __must_check sem_lock_and_putref(struct sem_array *sma)
 {
sem_lock(sma, NULL, -1);
-   ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
+
+   if (!__ipc_rcu_putref(&sma->sem_perm)) {
+   sem_unlock(sma, -1);
+   call_rcu(&sma->sem_perm.rcu, sem_rcu_free);
+   return 0;
+   }
+   return 1;
 }
 
 static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
@@ -1434,7 +1440,10 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
}
 
rcu_read_lock();
-   sem_lock_and_putref(sma);
+   if (!sem_lock_and_putref(sma)) {
+   goto out_rcu_wakeup;
+   }
+
if (!ipc_valid_object(&sma->sem_perm)) {
err = -EIDRM;
goto out_unlock;
@@ -1483,7 +1492,11 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
}
}
rcu_read_lock();
-   sem_lock_and_putref(sma);
+   if (!sem_lock_and_putref(sma)) {
+   err = -EIDRM;
+   goto out_rcu_wakeup;
+   }
+
if (!ipc_valid_object(&sma->sem_perm)) {
err = -EIDRM;
goto out_unlock;
@@ -1898,14 +1911,12 @@ static struct sem_undo *find_alloc_undo(struct 
ipc_namespace *ns, int semid)
 
/* step 3: Acquire the lock on semaphore array */
rcu_read_lock();
-   sem_lock_and_putref(sma);
-   if (!ipc_valid_object(&sma->sem_perm)) {
-   sem_unlock(sma, -1);
-   rcu_read_unlock();
-   kfree(new);
-   un = ERR_PTR(-EIDRM);
-   goto out;
-   }
+   if (!sem_lock_and_putref(sma))
+   goto out_EIDRM_free;
+
+   if (!ipc_valid_object(&sma->sem_perm))
+   goto out_EIDRM_unlock;
+
spin_lock(&ulp->lock);
 
/*
@@ -1931,6 +1942,13 @@ static struct sem_undo *find_alloc_undo(struct 
ipc_namespace *ns, int semid)
sem_unlock(sma, -1);
 out:
return un;
+
+out_EIDRM_unlock:
+   sem_unlock(sma, -1);
+out_EIDRM_free:
+   rcu_read_unlock();
+   kfree

Re: ipc/msg: zalloc struct msg_queue when creating a new msq

2018-07-04 Thread Manfred Spraul

Hello everyone,

On 06/25/2018 11:21 AM, Dmitry Vyukov wrote:

On Sun, Jun 24, 2018 at 4:56 AM, Davidlohr Bueso  wrote:

The following splat was reported around the msg_queue structure
which can have uninitialized fields left over after newque().
Future syscalls which make use of the msq id (now valid) can thus
make KMSAN complain because not all fields are explicitly initialized
and we have the padding as well. This is internal to the kernel,
hence no bogus leaks.

Hi Davidlohr,

As far as I understand, the root problem is that (1) we publish a
not-fully initialized object and (2) finish its initialization in a
racy manner when other threads already have access to it. As a
result other threads can act on a wrong object. I am not sure that
zeroing the object really solves these problems. It will surely get rid
of the report at hand (but probably not of a KTSAN, data race detector,
report), other threads can still see a wrong 0 id and the id is still
initialized in a racy way. I would expect that a proper fix would be to
publish a fully initialized object with the proper, final id. Am I missing
something?

There are 2 relevant values: kern_ipc_perm.id and kern_ipc_perm.seq.

For kern_ipc_perm.id, it is possible to move the access to the codepath 
that holds the lock.


For kern_ipc_perm.seq, there are two options:
1) set it before publication.
2) initialize to an invalid value, and correct that at the end.

I'm in favor of option 2: it avoids having to decide whether or not to
reduce the next sequence number:


The purpose of the sequence counter is to minimize the risk that e.g. a 
semop() will write into a newly created array.
I intentionally write "minimize the risk", as it is by design impossible 
to guarantee that this cannot happen, e.g. if semop() sleeps at the 
instruction before the syscall.


Therefore, we can set seq to ULONG_MAX, then ipc_checkid() will always 
fail and the corruption is avoided.


What do you think?

And, obviously:
Just setting seq to 0 is dangerous, as the first allocated sequence number
is 0, and if that object is destroyed, then the newly created object
temporarily has sequence number 0 as well.
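
A minimal sketch of option 2 (illustrative only, not the patch that was
eventually merged): publish with a seq value that ipc_checkid() can never
match, and set the final seq and id only once the object is fully
initialized (seq wrap handling omitted):

	new->seq = ULONG_MAX;		/* invalid: ipc_checkid() always fails */
	spin_lock(&new->lock);
	idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
	if (idx >= 0) {
		/* ... remaining initialization ... */
		new->seq = ids->seq++;	/* from here on ipc_checkid() can pass */
		new->id = SEQ_MULTIPLIER * new->seq + idx;
	}
	spin_unlock(&new->lock);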


--
    Manfred
>From 4791e604dcb618ed7ea1f42b2f6ca9cfe3c113c3 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Wed, 4 Jul 2018 10:04:49 +0200
Subject: [PATCH] ipc: fix races with kern_ipc_perm.id and .seq

ipc_addid() initializes kern_ipc_perm.id and kern_ipc_perm.seq after
having called ipc_idr_alloc().

Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_idr()
may see an uninitialized value.

The simple solution cannot be used, as the correct id is only known
after ipc_idr_alloc().

Therefore:
- Initialize kern_ipc_perm.seq to an invalid value, so that
  ipc_checkid() is guaranteed to fail.
  This fulfills the purpose of the sequence counter: If e.g. semget() and
  semop() run in parallel, then the semop() should not write into the
  newly created array.
- Move the accesses to kern_ipc_perm.id into the code that is protected
  by kern_ipc_perm.lock.

The patch also fixes a use-after-free that can be triggered by concurrent
semget() and semctl(IPC_RMID): reading kern_ipc_perm.id must happen
before dropping the locks.

Reported-by: syzbot+2827ef6b3385deb07...@syzkaller.appspotmail.com
Signed-off-by: Manfred Spraul 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Davidlohr Bueso 
Signed-off-by: Manfred Spraul 
---
 ipc/msg.c  | 23 +--
 ipc/sem.c  | 23 ---
 ipc/shm.c  | 19 ++-
 ipc/util.c |  8 +++-
 4 files changed, 54 insertions(+), 19 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 724000c15296..551c10be8d06 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -166,10 +166,12 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 		return retval;
 	}
 
+	retval = msq->q_perm.id;
+
 	ipc_unlock_object(&msq->q_perm);
 	rcu_read_unlock();
 
-	return msq->q_perm.id;
+	return retval;
 }
 
 static inline bool msg_fits_inqueue(struct msg_queue *msq, size_t msgsz)
@@ -491,7 +493,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 			 int cmd, struct msqid64_ds *p)
 {
 	struct msg_queue *msq;
-	int id = 0;
 	int err;
 
 	memset(p, 0, sizeof(*p));
@@ -503,7 +504,6 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 			err = PTR_ERR(msq);
 			goto out_unlock;
 		}
-		id = msq->q_perm.id;
 	} else { /* IPC_STAT */
 		msq = msq_obtain_object_check(ns, msqid);
 		if (IS_ERR(msq)) {
@@ -548,10 +548,21 @@ static int msgctl_stat(struct ipc_namespace *ns, int msqid,
 	p->msg_lspid  = pid_vnr(msq->q_lspid);
 	p->msg_lrpid  = pid_vnr(msq->q_lrpid);
 
-	ipc_unlock_object(&msq->q_perm);
-	rcu_read_unlock();
-	return id;
+	if (cmd == IPC_STAT) {
+		/*
+		 * As defined in SUS:
+		 * Return 0 on success
+		 */
+		err = 0;
+	} else {
+		/*
+		 * MSG_STAT and MSG_STAT_ANY (both Linux specific)
+		 * Return the full id, including the sequence counter
+		 

Re: [REVIEW][PATCH 11/11] ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces

2018-04-02 Thread Manfred Spraul

Hi,

On 03/30/2018 09:09 PM, Davidlohr Bueso wrote:

On Wed, 28 Mar 2018, Davidlohr Bueso wrote:


On Fri, 23 Mar 2018, Eric W. Biederman wrote:


Today the last process to update a semaphore is remembered and
reported in the pid namespace of that process.  If there are processes
in any other pid namespace querying that process id with GETPID the
result will be unusable nonsense as it does not make any
sense in your own pid namespace.


Yeah that sounds pretty wrong.



Due to ipc_update_pid I don't think you will be able to get System V
ipc semaphores into a troublesome cache line ping-pong.  Using struct
pids from separate processes is not a problem because they do not share
a cache line.  Using struct pid from different threads of the same
process is unlikely to be a problem as the reference count update
can be avoided.

Further linux futexes are a much better tool for the job of mutual
exclusion between processes than System V semaphores.  So I expect
programs that  are performance limited by their interprocess mutual
exclusion primitive will be using futexes.


The performance of sysv sem and futexes for the contended case is more
or less identical; which one is faster depends on the CONFIG_ options.


And this is obvious: both primitives must do the same tasks:
sleep:
- lookup a kernel pointer from a user space reference
- acquire a lock, do some housekeeping, unlock and sleep
wakeup:
- lookup a kernel pointer from a user space reference
- acquire a lock, do some housekeeping, especially unlink the to be 
woken up task, unlock and wakeup


The woken-up task has nothing to do; it returns immediately to user space.

IIRC for the uncontended case, sysvsem was at ~300 cpu cycles, but that
number is a few years old, and I don't know what the impact of Spectre is.

The futex code is obviously faster.
But I don't know which real-world applications do their own 
optimizations for the uncontended case before using sysvsem.


Thus the only "real" challenge is to minimize cache line trashing.


You would be wrong. There are plenty of real workloads out there
that do not use futexes and care about performance; in the end
futexes are only good for the uncontended cases, and they can also
destroy numa boxes if you consider the global hash table. Experience
has shown me that sysvipc sems are still very much in use.



So while it is possible that enhancing the storage of the last
process of a System V semaphore from an integer to a struct pid
will cause a performance regression because of the effect
of frequently updating the pid reference count, I don't expect
that to happen in practice.


How's that? Now thanks to ipc_update_pid() for each semop the user
passes, perform_atomic_semop() will do two atomic updates for the
cases where there are multiple processes updating the sem. This is
not uncommon.

Could you please provide some numbers.



[...]

So at least for a large box this patch hurts the cases where there is low
to medium cpu usage (no more than ~8 processes on a 40 core box) in a
non-trivial way. For more processes it doesn't matter. We can confirm that
the case for threads is irrelevant. While I'm not happy about the 30%
regression, I guess we can live with this.

Manfred, any thoughts?

Bugfixing always has first priority, and a 30% regression in one 
microbenchmark doesn't seem to be that bad.


Thus I would propose that we fix SEMPID first, and _if_ someone notices 
a noticeable regression, then we must improve the code.


--
    Manfred


Re: [RFC][PATCH] ipc: Remove IPCMNI

2018-03-29 Thread Manfred Spraul

Hello Matthew,

On 03/29/2018 12:56 PM, Matthew Wilcox wrote:

On Thu, Mar 29, 2018 at 10:47:45AM +0200, Manfred Spraul wrote:

This can be implemented trivially with the current code
using idr_alloc_cyclic.

Is there a performance impact?
Right now, the idr tree is only large if there are lots of objects.
What happens if we have only 1 object, with id=INT_MAX-1?

The radix tree uses a branching factor of 64 entries (6 bits) per level.
The maximum ID is 31 bits (positive signed 32-bit integer).  So the
worst case for a single object is 6 pointer dereferences to find the
object anywhere in the range (INT_MAX/2 - INT_MAX].  That will read 12
cachelines.  If we were to constrain ourselves to a maximum of INT_MAX/2
(30 bits), we'd reduce that to 5 pointer dereferences and 10 cachelines.
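
The arithmetic behind those numbers, as a quick sketch (6 bits of the id
are consumed per radix-tree level, roughly 2 cachelines are touched per
level):

	#include <linux/kernel.h>

	/* sketch: radix-tree levels needed to cover an id of the given bit width */
	static int radix_levels(int id_bits)
	{
		return DIV_ROUND_UP(id_bits, 6);
	}
	/* radix_levels(31) == 6 -> ~12 cachelines, radix_levels(30) == 5 -> ~10 */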

I'm concerned about the up to 6 branches.
But this is just guessing, we need a test with a realistic workload.

--
    Manfred


Re: [RFC][PATCH] ipc: Remove IPCMNI

2018-03-29 Thread Manfred Spraul

Hello everyone,

On 03/29/2018 04:14 AM, Davidlohr Bueso wrote:

Cc'ing mtk, Manfred and linux-api.

See below.

On Thu, 15 Mar 2018, Waiman Long wrote:


On 03/15/2018 03:00 PM, Eric W. Biederman wrote:

Waiman Long  writes:


On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
The define IPCMNI was originally the size of a statically sized array in
the kernel and that has long since been removed. Therefore there is no
fundamental reason for IPCMNI.

The only remaining use IPCMNI serves is as a convoluted way to format
the ipc id to userspace.  It does not appear that anything except for
the CHECKPOINT_RESTORE code even cares about this variety of assignment
and the CHECKPOINT_RESTORE code only cares about this weirdness because
it has to restore these peculiar ids.

My assumption is that if an array is recreated, it should get a 
different id.

    a=semget(1234,,);
    semctl(a,,IPC_RMID);
    b=semget(1234,,);
now a!=b.

Rationale: semop() calls only refer to the array by the id.
If there is a stale process in the system that tries to access the "old" 
array and the new array has the same id, then the locking gets corrupted.
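
Spelled out as a small user-space sketch (key and flags chosen purely for
illustration):

	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>
	#include <assert.h>

	int main(void)
	{
		int a = semget(1234, 1, IPC_CREAT | 0600);

		semctl(a, 0, IPC_RMID);
		int b = semget(1234, 1, IPC_CREAT | 0600);

		/* a stale semop() on "a" must not hit the recreated array,
		 * hence the new array should get a different id */
		assert(a != b);
		semctl(b, 0, IPC_RMID);
		return 0;
	}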

Therefore make the assignment of ipc ids match the description in
Advanced Programming in the Unix Environment and assign the next id
until INT_MAX is hit then loop around to the lower ids.


Ok, sounds good.
That way we really cycle through INT_MAX; right now a==b would already
happen after 128k RMID calls.
This can be implemented trivially with the current code using 
idr_alloc_cyclic.
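
A sketch of what that could look like inside ipc_idr_alloc() (illustrative
only; the CHECKPOINT_RESTORE case and error handling are ignored):

	/* hand out indices cyclically over the whole positive int range,
	 * so a just-removed id is not reused immediately */
	idx = idr_alloc_cyclic(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);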



Is there a performance impact?
Right now, the idr tree is only large if there are lots of objects.
What happens if we have only 1 object, with id=INT_MAX-1?

semop() calls that do not sleep are fairly fast.
The same applies for msgsnd/msgrcv, if the message is small enough.

@Davidlohr:
Do you know if there are applications that frequently call semop() without
having to sleep?
From the scalability work that was pushed into the kernel, I assume that
such applications exist.


I have myself only checked postgresql, and postgresql always sleeps.
(and this was long ago)

To make it possible to keep checkpoint/restore working I have renamed
the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
a smart CRIU implementation can see that what is exported has changed,
and act accordingly.  New kernels will be able to restore the old id's.

This code still needs some real world testing to verify my assumptions.

And some work with the CRIU implementations to actually add the code
that deals with the new form of id assignment.

It means that all existing checkpoint/restore applications will not work
with a new kernel.
Everyone must first update the checkpoint/restore application, then 
update the kernel.


Is this acceptable?

--
    Manfred


[PATCH] mtd: nand: gpmi: fix edo mode for non fully ONFI compliant flashes

2018-02-20 Thread Manfred Schlaegl
In enable_edo_mode the timing mode feature is set according to previously
read capabilities of the parameter page ("Timing mode support"). After
the value was set, it is read back to provide a "double-check".
If the "double check" fails, the whole function returns with an error,
which leads to a very slow (non-edo) fallback timing.

The problem here is that there seem to be some NAND flashes which are
not fully ONFI 1.0 compliant.
One of these is Winbond W29N04GV. According to datasheet and parameter
page, the flash supports timing mode 4 (edo), but the timing mode feature
is simply missing.

It seems that setting a non-existing feature is simply ignored. The real
problem occurs when the feature is read back: W29N04GV always delivers
zero, which causes the "double-check" to fail. This leads to very slow
timing and therefore to poor performance.

To solve this, we simply remove the double-check, which is a paranoia
check anyway.

The modification was intensively tested on i.MX6 with linux-4.1, Winbond
W29N04GV and Micron MT29F4G08ABADAH4.

Signed-off-by: Manfred Schlaegl 
---
 drivers/mtd/nand/raw/gpmi-nand/gpmi-lib.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/mtd/nand/raw/gpmi-nand/gpmi-lib.c 
b/drivers/mtd/nand/raw/gpmi-nand/gpmi-lib.c
index 97787246af41..40fba96df215 100644
--- a/drivers/mtd/nand/raw/gpmi-nand/gpmi-lib.c
+++ b/drivers/mtd/nand/raw/gpmi-nand/gpmi-lib.c
@@ -939,16 +939,9 @@ static int enable_edo_mode(struct gpmi_nand_data *this, 
int mode)
if (ret)
goto err_out;
 
-   /* [2] send GET FEATURE command to double-check the timing mode */
-   memset(feature, 0, ONFI_SUBFEATURE_PARAM_LEN);
-   ret = nand->onfi_get_features(mtd, nand,
-   ONFI_FEATURE_ADDR_TIMING_MODE, feature);
-   if (ret || feature[0] != mode)
-   goto err_out;
-
nand->select_chip(mtd, -1);
 
-   /* [3] set the main IO clock, 100MHz for mode 5, 80MHz for mode 4. */
+   /* [2] set the main IO clock, 100MHz for mode 5, 80MHz for mode 4. */
rate = (mode == 5) ? 1 : 8000;
clk_set_rate(r->clock[0], rate);
 
-- 
2.11.0


Re: stable/linux-3.16.y build: 178 builds: 1 failed, 177 passed, 2 errors, 57 warnings (v3.16.52)

2018-01-13 Thread Manfred Spraul

Hi Arnd,

On 01/03/2018 12:15 AM, Arnd Bergmann wrote:



2 ipc/sem.c:377:6: warning: '___p1' may be used uninitialized in this function 
[-Wmaybe-uninitialized]

This code was last touched in 3.16 by the backport of commit
5864a2fd3088 ("ipc/sem.c: fix complex_count vs. simple op race")

The warning is in "smp_load_acquire(&sma->complex_mode))", and I suspect
that commit 27d7be1801a4 ("ipc/sem.c: avoid using spin_unlock_wait()")
avoided the warning upstream by removing the smp_mb() before it.

The smp_mb() pairs with spin_unlock_wait() in complexmode_enter().
It is removed by commit 27d7be1801a4 ("ipc/sem.c: avoid using 
spin_unlock_wait()").


From what I see, it doesn't exist in any of the stable kernels 
(intentionally, the above commit is a rewrite for better performance).


___p1 is from smp_load_acquire()
>    typeof(*p) ___p1 = READ_ONCE(*p);   \

I don't see how ___p1 could be used uninitialized. Perhaps a compiler issue?
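
For reference, the generic definition looks roughly like this (the exact
barrier and the READ_ONCE/ACCESS_ONCE spelling vary by architecture and
kernel version), i.e. ___p1 is assigned before any use:

	#define smp_load_acquire(p)					\
	({								\
		typeof(*p) ___p1 = READ_ONCE(*p);			\
		compiletime_assert_atomic_type(*p);			\
		smp_mb();						\
		___p1;							\
	})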

--
    Manfred



Re: BUG: unable to handle kernel paging request in ipcget

2017-12-23 Thread Manfred Spraul

Hi,

On 12/23/2017 08:33 AM, syzbot wrote:

Hello,

syzkaller hit the following crash on 
6084b576dca2e898f5c101baef151f7bfdbb606d

git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
compiler: gcc (GCC) 7.1.1 20170620
.config is attached
Raw console output is attached.

Unfortunately, I don't have any reproducer for this bug yet.


Is one of the recent issues reproducible?
Either something is wrong with the faster ipc_get, or the improved 
ipc_get makes issues in other areas visible.


--
    Manfred


Re: shmctl(SHM_STAT) vs. /proc/sysvipc/shm permissions discrepancies

2017-12-20 Thread Dr. Manfred Spraul

Hi Michal,

On 12/19/2017 10:48 AM, Michal Hocko wrote:

Hi,
we have been contacted by our partner about the following permission
discrepancy
1. Create a shared memory segment with permissions 600 with user A using
shmget(key, 1024, 0600 | IPC_CREAT)
2. ipcs -m should return an output as follows:

------ Shared Memory Segments --------
key        shmid      owner  perms  bytes  nattch  status
0x58b74326 759562241  A      600    1024   0

3. Try to read the metadata with shmctl(0, SHM_STAT,...) as user B.
4. shmctl will return -EACCES

A superset of the information provided by shmctl can be retrieved by
reading /proc/sysvipc/shm, which does not require read permissions
because it is mode 444.

It seems that the discrepancy has been there since ae7817745eef ("[PATCH] ipc:
add generic struct ipc_ids seq_file iteration"), when the proc interface
was introduced. The changelog is really modest on information or
intention but I suspect this just got overlooked during review. SHM_STAT
has always been about read permission and it is explicitly documented
that way.
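
For illustration, step 3 as user B boils down to something like the sketch
below (note that SHM_STAT takes an index into the kernel's internal array,
not a shmid):

	#define _GNU_SOURCE
	#include <sys/ipc.h>
	#include <sys/shm.h>
	#include <errno.h>
	#include <stdio.h>

	int main(void)
	{
		struct shmid_ds ds;

		/* index 0: first slot of the internal array */
		if (shmctl(0, SHM_STAT, &ds) < 0 && errno == EACCES)
			printf("SHM_STAT: EACCES, but /proc/sysvipc/shm is world-readable\n");
		return 0;
	}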

Are you sure that this patch changed the behavior?
The proc interface is much older.

--
    Manfred


Re: [PATCH 2/2] ipc: Fix ipc data structures inconsistency

2017-12-01 Thread Manfred Spraul

Hi,

On 12/01/2017 06:20 PM, Davidlohr Bueso wrote:

On Thu, 30 Nov 2017, Philippe Mikoyan wrote:


As described in the title, this patch fixes id_ds inconsistency
when ctl_stat runs concurrently with some ds-changing function,
e.g. shmat, msgsnd or whatever.

For instance, if shmctl(IPC_STAT) is running concurrently with shmat,
following data structure can be returned:
{... shm_lpid = 0, shm_nattch = 1, ...}


The patch appears to be good. I'll try to perform some tests, but I'm 
not sure when I will be able to.
Especially: I don't know the shm code well enough to immediately check
the change you make to nattach.


And, perhaps as side information:
There appears to be a use-after-free in shm; I now got a 2nd mail from
syzbot:

http://lkml.iu.edu/hypermail/linux/kernel/1702.3/02480.html



Hmm yeah that's pretty fishy, also shm_atime = 0, no?

So I think this patch is fine as we can obviously race at a user level.
This is another justification for converting the ipc lock to rwlock;
performance-wise they are pretty much the same (being queued)...
but that's irrelevant to this patch. I like that you manage to do
security and such checks still only under rcu, like all ipc calls
work; *_stat() is no longer special.

I don't like rwlocks; they add complexity without reducing the cache line
pressure.


What I would like to try is to create a mutex_lock_rcu() function, and 
then convert everything to a mutex.


As pseudocode::
    rcu_lock();
    idr_lookup();
    mutex_trylock();
    if (failed) {
        getref();
        rcu_unlock();
        mutex_lock();
        putref();
    } else {
        rcu_unlock();
    }

Obviously, the getref would then be done within the mutex framework, i.e.
only if mutex_lock() really sleeps.
If the code in ipc gets significantly simpler, then perhaps convert it 
to an rw mutex.
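
A slightly more concrete sketch of that idea; mutex_lock_rcu() does not
exist and kern_ipc_perm has no mutex member today, so all names below are
hypothetical:

	rcu_read_lock();
	ipcp = idr_find(&ids->ipcs_idr, idx);
	if (!ipcp) {
		rcu_read_unlock();
		return ERR_PTR(-EINVAL);
	}
	if (!mutex_trylock(&ipcp->mutex)) {	/* hypothetical member */
		ipc_rcu_getref(ipcp);		/* keep the object alive ...  */
		rcu_read_unlock();
		mutex_lock(&ipcp->mutex);	/* ... while we may sleep     */
		ipc_rcu_putref(ipcp, rcu_free_fn); /* type-specific callback  */
	} else {
		rcu_read_unlock();
	}
	/* the caller still has to re-check ipcp->deleted before use */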


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-10 Thread Manfred Spraul

Hi Alan,

On 07/08/2017 06:21 PM, Alan Stern wrote:

Pardon me for barging in, but I found this whole interchange extremely
confusing...

On Sat, 8 Jul 2017, Ingo Molnar wrote:


* Paul E. McKenney  wrote:


On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:

* Manfred Spraul  wrote:


Hi Ingo,

On 07/07/2017 10:31 AM, Ingo Molnar wrote:

There's another, probably just as significant advantage:
queued_spin_unlock_wait() is 'read-only', while spin_lock()+spin_unlock()
dirties the lock cache line. On any bigger system this should make a very
measurable difference - if spin_unlock_wait() is ever used in a performance
critical code path.

At least for ipc/sem:
Dirtying the cacheline (in the slow path) allows removing an smp_mb() in the
hot path.
So for sem_lock(), I either need a primitive that dirties the cacheline or
sem_lock() must continue to use spin_lock()/spin_unlock().

This statement doesn't seem to make sense.  Did Manfred mean to write
"smp_mb()" instead of "spin_lock()/spin_unlock()"?

Option 1:
fastpath:
spin_lock(local_lock)
smp_mb(); [[1]]
smp_load_acquire(global_flag);
slow path:
global_flag = 1;
smp_mb();


Option 2:
fastpath:
spin_lock(local_lock);
smp_load_acquire(global_flag)
slow path:
global_flag = 1;
spin_lock(local_lock);spin_unlock(local_lock).

Rationale:
The ACQUIRE from spin_lock is at the read of local_lock, not at the write.
i.e.: Without the smp_mb() at [[1]], the CPU can do:
read local_lock;
read global_flag;
write local_lock;
For Option 2, the smp_mb() is not required, because fast path and slow 
path acquire the same lock.



Technically you could use spin_trylock()+spin_unlock() and avoid the lock 
acquire
spinning on spin_unlock() and get very close to the slow path performance of a
pure cacheline-dirtying behavior.

This is even more confusing.  Did Ingo mean to suggest using
"spin_trylock()+spin_unlock()" in place of "spin_lock()+spin_unlock()"
could provide the desired ordering guarantee without delaying other
CPUs that may try to acquire the lock?  That seems highly questionable.

I agree :-)

--
Manfred


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Manfred Spraul

Hi Ingo,

On 07/07/2017 10:31 AM, Ingo Molnar wrote:


There's another, probably just as significant advantage:
queued_spin_unlock_wait() is 'read-only', while spin_lock()+spin_unlock()
dirties the lock cache line. On any bigger system this should make a very
measurable difference - if spin_unlock_wait() is ever used in a performance
critical code path.

At least for ipc/sem:
Dirtying the cacheline (in the slow path) allows removing an smp_mb() in
the hot path.
So for sem_lock(), I either need a primitive that dirties the cacheline 
or sem_lock() must continue to use spin_lock()/spin_unlock().


--
Manfred


Re: [PATCH v2 1/9] net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

2017-07-06 Thread Manfred Spraul

Hi Paul,

On 07/06/2017 01:31 AM, Paul E. McKenney wrote:

From: Manfred Spraul 

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().


If you want:
Attached would be V2, with adapted comments.

--
Manfred
>From e3562faa1bc96e883108505e05deecaf38c87a26 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Sun, 21 Aug 2016 07:17:55 +0200
Subject: [PATCH 1/2] net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

V2: With improved comments, to clearly show how the barriers
pair.

Fixes: b16c29191dc8
Signed-off-by: Manfred Spraul 
Cc: 
Cc: Alan Stern 
Cc: Sasha Levin 
Cc: Pablo Neira Ayuso 
Cc: netfilter-de...@vger.kernel.org
---
 net/netfilter/nf_conntrack_core.c | 52 ++-
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 9979f46..51390fe 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -96,19 +96,26 @@ static struct conntrack_gc_work conntrack_gc_work;
 
 void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
 {
+	/* 1) Acquire the lock */
 	spin_lock(lock);
-	while (unlikely(nf_conntrack_locks_all)) {
-		spin_unlock(lock);
 
-		/*
-		 * Order the 'nf_conntrack_locks_all' load vs. the
-		 * spin_unlock_wait() loads below, to ensure
-		 * that 'nf_conntrack_locks_all_lock' is indeed held:
-		 */
-		smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
-		spin_unlock_wait(&nf_conntrack_locks_all_lock);
-		spin_lock(lock);
-	}
+	/* 2) read nf_conntrack_locks_all, with ACQUIRE semantics
+	 * It pairs with the smp_store_release() in nf_conntrack_all_unlock()
+	 */
+	if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
+		return;
+
+	/* fast path failed, unlock */
+	spin_unlock(lock);
+
+	/* Slow path 1) get global lock */
+	spin_lock(&nf_conntrack_locks_all_lock);
+
+	/* Slow path 2) get the lock we want */
+	spin_lock(lock);
+
+	/* Slow path 3) release the global lock */
+	spin_unlock(&nf_conntrack_locks_all_lock);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_lock);
 
@@ -149,28 +156,27 @@ static void nf_conntrack_all_lock(void)
 	int i;
 
 	spin_lock(&nf_conntrack_locks_all_lock);
-	nf_conntrack_locks_all = true;
 
-	/*
-	 * Order the above store of 'nf_conntrack_locks_all' against
-	 * the spin_unlock_wait() loads below, such that if
-	 * nf_conntrack_lock() observes 'nf_conntrack_locks_all'
-	 * we must observe nf_conntrack_locks[] held:
-	 */
-	smp_mb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
+	nf_conntrack_locks_all = true;
 
 	for (i = 0; i < CONNTRACK_LOCKS; i++) {
-		spin_unlock_wait(&nf_conntrack_locks[i]);
+		spin_lock(&nf_conntrack_locks[i]);
+
+		/* This spin_unlock provides the "release" to ensure that
+		 * nf_conntrack_locks_all==true is visible to everyone that
+		 * acquired spin_lock(&nf_conntrack_locks[]).
+		 */
+		spin_unlock(&nf_conntrack_locks[i]);
 	}
 }
 
 static void nf_conntrack_all_unlock(void)
 {
-	/*
-	 * All prior stores must be complete before we clear
+	/* All prior stores must be complete before we clear
 	 * 'nf_conntrack_locks_all'. Otherwise nf_conntrack_lock()
 	 * might observe the false value but not the entire
-	 * critical section:
+	 * critical section.
+	 * It pairs with the smp_load_acquire() in nf_conntrack_lock()
 	 */
 	smp_store_release(&nf_conntrack_locks_all, false);
 	spin_unlock(&nf_conntrack_locks_all_lock);
-- 
2.9.4



Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-06 Thread Manfred Spraul

Hi Alan,

On 07/03/2017 09:57 PM, Alan Stern wrote:


(Alternatively, you could make nf_conntrack_all_unlock() do a
lock+unlock on all the locks in the array, just like
nf_conntrack_all_lock().  But of course, that would be a lot less
efficient.)

Hmm.

Is there someone with a weakly ordered system who can test this?
semop() has a very short hotpath.

Either with aim9.shared_memory.ops_per_sec or

#sem-scalebench -t 10 -m 0
https://github.com/manfred-colorfu/ipcscale/blob/master/sem-scalebench.cpp
--
Manfred
>From b549e0281b66124b62aa94543f91b0e616abaf52 Mon Sep 17 00:00:00 2001
From: Manfred Spraul 
Date: Thu, 6 Jul 2017 20:05:44 +0200
Subject: [PATCH 2/2] ipc/sem.c: avoid smp_load_acuqire() in the hot-path

Alan Stern came up with an interesting idea:
If we perform a spin_lock()/spin_unlock() pair in the slow path, then
we can skip the smp_load_acquire() in the hot path.

What do you think?

* When we removed the smp_mb() from the hot path, it was a user space
  visible speed-up of 11%:

  https://lists.01.org/pipermail/lkp/2017-February/005520.html

* On x86, there is no improvement - as smp_load_acquire is READ_ONCE().

* Slowing down the slow path should not hurt:
  Due to the hysteresis code, the slow path is at least a factor of 10
  rarer than it was before.

Especially: Who is able to test it?

Signed-off-by: Manfred Spraul 
Cc: Alan Stern 
---
 ipc/sem.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 947dc23..75a4358 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -186,16 +186,15 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
  *	* either local or global sem_lock() for read.
  *
  * Memory ordering:
- * Most ordering is enforced by using spin_lock() and spin_unlock().
+ * All ordering is enforced by using spin_lock() and spin_unlock().
  * The special case is use_global_lock:
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
- * using smp_store_release().
- * Testing if it is non-zero is an ACQUIRE, this is ensured by using
- * smp_load_acquire().
- * Setting it from 0 to non-zero must be ordered with regards to
- * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
- * is inside a spin_lock() and after a write from 0 to non-zero a
- * spin_lock()+spin_unlock() is done.
+ * performing spin_lock()/spin_lock() on every semaphore before setting to
+ * non-zero.
+ * Setting it from 0 to non-zero is an ACQUIRE, this is ensured by
+ * performing spin_lock()/spin_lock() on every semaphore after setting to
+ * non-zero.
+ * Testing if it is non-zero is within spin_lock(), no need for a barrier.
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -325,13 +324,20 @@ static void complexmode_tryleave(struct sem_array *sma)
 		return;
 	}
 	if (sma->use_global_lock == 1) {
+		int i;
+		struct sem *sem;
 		/*
 		 * Immediately after setting use_global_lock to 0,
-		 * a simple op can start. Thus: all memory writes
-		 * performed by the current operation must be visible
-		 * before we set use_global_lock to 0.
+		 * a simple op can start.
+		 * Perform a full lock/unlock, to guarantee memory
+		 * ordering.
 		 */
-		smp_store_release(&sma->use_global_lock, 0);
+		for (i = 0; i < sma->sem_nsems; i++) {
+			sem = sma->sem_base + i;
+			spin_lock(&sem->lock);
+			spin_unlock(&sem->lock);
+		}
+		sma->use_global_lock = 0;
 	} else {
 		sma->use_global_lock--;
 	}
@@ -379,8 +385,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 */
 		spin_lock(&sem->lock);
 
-		/* pairs with smp_store_release() */
-		if (!smp_load_acquire(&sma->use_global_lock)) {
+		if (!sma->use_global_lock) {
 			/* fast path successful! */
 			return sops->sem_num;
 		}
-- 
2.9.4


