Re: ipc,sem: sysv semaphore scalability

2013-05-06 Thread Jörn Engel
On Sat, 4 May 2013 20:12:45 +0200, Borislav Petkov wrote:
> On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
> > Blockconsole currently lives here:
> > https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/
> 
> Tja, if only that were upstream...

Linus has a pull request.  If he prefers to change code until the
relevant bits fit into 80x25, that is his choice. ;)

Jörn

--
The only good bug is a dead bug.
-- Starship Troopers
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-05-04 Thread Borislav Petkov
On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
> Blockconsole currently lives here:
> https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Tja, if only that were upstream...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-05-04 Thread Jörn Engel
On Sat, 23 March 2013 10:19:16 +0700, Emmanuel Benisty wrote:
> 
> I could reproduce it but could you please let me know what would be
> the right tools I should use to catch the original oops?
> This is what I got but I doubt it will be helpful:
> http://i.imgur.com/Mewi1hC.jpg

You could use either netconsole or blockconsole.  Netconsole requires
a second machine to capture the information, blockconsole requires a
usb key or something similar to write to.  In both cases you will get
the entire console output in a file, often including very helpful
messages leading up to the crash.

Blockconsole currently lives here:
https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Jörn

--
Unless something dramatically changes, by 2015 we'll be largely
wondering what all the fuss surrounding Linux was really about.
-- Rob Enderle
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-05-03 Thread Peter Hurley

On 03/29/2013 03:01 PM, Dave Jones wrote:

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
  > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:
  >
  > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
  > >
  > >  > Whichever way we go, we should get a wiggle on - this has been hanging
  > >  > around for too long.  Dave, do you have time to determine whether
  > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
  > >  > than max") fixes things up?
  > >
  > > Ok, with that reverted it's been grinding away for a few hours without 
incident.
  > > Normally I see the oops within a minute or so.
  > >
  >
  > OK, thanks, I queued a revert:
  >
  > From: Andrew Morton 
  > Subject: revert "ipc: don't allocate a copy larger than max"
  >
  > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
  > during trinity testing.

I owe Peter an apology. I just hit it again with that backed out.
Andrew, might as well drop that revert.

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [] testmsg.isra.5+0x1a/0x60
[...snip...]

I think I wasn't seeing that this last week because I had inadvertantly 
disabled DEBUG_PAGEALLOC

and.. we're back to square one.

Dave


Andrew,

I just realized you're still carrying

commit 4bea54c91bcc5451f237e6b721b0b35eccd01d17
Author: Andrew Morton 
Date:   Fri Apr 26 10:55:12 2013 +1000

revert "ipc: don't allocate a copy larger than max"

Revert 88b9e456b164.  Dave has confirmed that this was causing 
oopses
during trinity testing.

Cc: Peter Hurley 
Cc: Stanislav Kinsbursky 
Reported-by: Dave Jones 
Cc: 
Signed-off-by: Andrew Morton 

Please drop.

As quoted above, the testing on mainline that attributed
the observed oops to the reverted patch was due to a config
error.

As Linus pointed out below, this patch fixes real bugs.

On 04/02/2013 03:53 PM, Sasha Levin wrote:
> On 04/02/2013 01:52 PM, Linus Torvalds wrote:
>> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>>>
>>> By just playing with the 'msgsz' parameter with MSG_COPY set.
>>
>> Hmm. Looking closer, I suspect you're testing without commit
>> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
>> should limit the size passed in to prepare_copy -> load_copy to
>> msg_ctlmax.
>
> That commit has a revert in the -next trees, do we need a revert
> of the revert?
>
>commit ff6577a3e714ccae02d4400e989762c19c37b0b3
>Author: Andrew Morton 
>Date:   Wed Mar 27 10:24:02 2013 +1100
>
>revert "ipc: don't allocate a copy larger than max"
>
>Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
>during trinity testing.
>
>Cc: Peter Hurley 
>Cc: Stanislav Kinsbursky 
>Reported-by: Dave Jones 
>Cc: 
>Signed-off-by: Andrew Morton 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-04-16 Thread Andrew Morton
On Tue, 26 Mar 2013 11:35:33 -0700 Andrew Morton  
wrote:

> Do we need the locking at all?  What does it actually do?
> 
>   sem_lock_and_putref(sma);
>   if (sma->sem_perm.deleted) {
>   sem_unlock(sma, -1);
>   err = -EIDRM;
>   goto out_free;
>   }
>   sem_unlock(sma, -1);
> 
> We're taking the lock, testing an int and then dropping the lock. 
> What's the point in that?

Rikpoke.

The new semctl_main() is now taking a lock, testing
sma->sem_perm.deleted then dropping that lock.  It looks wrong.  What
is that lock testing against?  What prevents .deleted from changing
value 1ns after we dropped that lock?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Dave Jones
On Tue, Apr 02, 2013 at 03:53:01PM -0400, Sasha Levin wrote:
 > On 04/02/2013 01:52 PM, Linus Torvalds wrote:
 > > On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
 > >>
 > >> By just playing with the 'msgsz' parameter with MSG_COPY set.
 > > 
 > > Hmm. Looking closer, I suspect you're testing without commit
 > > 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
 > > should limit the size passed in to prepare_copy -> load_copy to
 > > msg_ctlmax.
 > 
 > That commit has a revert in the -next trees, do we need a revert
 > of the revert?

Yeah, I told Andrew to drop that, but I think he's travelling.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Sasha Levin
On 04/02/2013 01:52 PM, Linus Torvalds wrote:
> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>>
>> By just playing with the 'msgsz' parameter with MSG_COPY set.
> 
> Hmm. Looking closer, I suspect you're testing without commit
> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
> should limit the size passed in to prepare_copy -> load_copy to
> msg_ctlmax.

That commit has a revert in the -next trees, do we need a revert
of the revert?

commit ff6577a3e714ccae02d4400e989762c19c37b0b3
Author: Andrew Morton 
Date:   Wed Mar 27 10:24:02 2013 +1100

revert "ipc: don't allocate a copy larger than max"

Revert 88b9e456b164.  Dave has confirmed that this was causing 
oopses
during trinity testing.

Cc: Peter Hurley 
Cc: Stanislav Kinsbursky 
Reported-by: Dave Jones 
Cc: 
Signed-off-by: Andrew Morton 



Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>
> By just playing with the 'msgsz' parameter with MSG_COPY set.

Hmm. Looking closer, I suspect you're testing without commit
88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
should limit the size passed in to prepare_copy -> load_copy to
msg_ctlmax.

Now, I think it's possibly still a good idea to limit bufsz to INT_MAX
regardless, but as far as I can see that prepare_copy -> load_copy
path is the only place that can get confused. Everybody else uses
size_t (or "long" in the case of r_maxsize) as far as I can tell.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Linus Torvalds
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin  wrote:
>
> If you guys are already looking at this, the conversions between size_t,
> long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
> trigger anything from:

Good catch.

Let's just change the "(long)bufsz < 0" into "bufsz > INT_MAX".

I suspect we should change some of the "int" arguments to "size_t" too
so that we don't have these kinds of odd "different routines see
different values due to subtle casting errors", but in the end we
don't really want to ever help people have these kinds of potential
overflow issues. We already limit normal read/write/sendmsg etc to
INT_MAX (although we tend to *truncate* it to INT_MAX rather than
return an error, but I think the simpler patch here is preferable
unless somebody complains).

Comments?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-04-02 Thread Sasha Levin
On 03/29/2013 03:36 PM, Peter Hurley wrote:
> On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
>> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
>>>
>>> Here's an oops I just hit..
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 000f
>>> IP: [] testmsg.isra.5+0x1a/0x60
>>
>> Btw, looking at the code leading up to this, what the f*ck is wrong
>> with the IPC stuff?
>>
>> It's using the generic list stuff for most of the lists, but then it
>> open-codes the accesses.
>>
>> So instead of using
>>
>>for_each_entry(walk_msg, &msq->q_messages, m_list) {
>>   ..
>>}
>>
>> the ipc/msg.c code does all that by hand, with
>>
>>tmp = msq->q_messages.next;
>>while (tmp != &msq->q_messages) {
>>   struct msg_msg *walk_msg;
>>
>>   walk_msg = list_entry(tmp, struct msg_msg, m_list);
>>   ...
>>   tmp = tmp->next;
>>}
>>
>> Ugh. The code is near unreadable. And then it has magic memory
>> barriers etc, implying that it doesn't lock the data structures, but
>> no comments about them. See expunge_all() and pipelined_send().
>>
>> The code seems entirely random, and it's badly set up (annoyance of
>> the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
>> you have to spend more effort looking for them, and (b) they won't get
>> inlined).
>>
>> Clearly nobody has cared for the crazy IPC message code in a long time.
> 
> Exactly that's what my patch series does; clean this mess up.
> 
> This is what I wrote to Andrew a couple of days ago.
> 
> On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
> I just figured out how the queue is being corrupted and why my series
>> fixes it.
>>
>>
>> With MSG_COPY set, the queue scan can exit with the local variable
> 'msg'
>> pointing to a real msg if the msg_counter never reaches the
> copy_number.
>>
>> The remaining execution looks like this:
>>
>>  if (!IS_ERR(msg)) {
>>  
>>  if (msgflg & MSG_COPY)
>>  goto out_unlock;
>>  
>>
>> out_unlock:
>>  msg_unlock(msq);
>>  break;
>>  }
>>  }
>>  if (IS_ERR(msg))
>>  
>>
>>  bufsz = msg_handler();
>>  free_msg(msg);  << msg never unlinked
>>
>>
>> Since the msg should not have been found (because it failed the match
>> criteria), the if (!IS_ERR(msg)) clause should never have executed.
>>
>> That's why my refactor fixes resolve this; because msg is not
>> inadvertently treated as a found msg.
>>
>> But let's be honest; the real bug here is the poor structure of this
>> function that even made this possible. The deepest nesting executes a
>> goto to a label in the middle of an if clause. Yuck! No wonder this
>> thing's fragile.
>>
>> So my recommendation still stands. The series that fixes this has been
>> getting tested in linux-next for a month. Fixing this some other way
> is
>> just asking for more trouble.
>>
>> But why not just revert MSG_COPY altogether for 3.9?

If you guys are already looking at this, the conversions between size_t,
long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
trigger anything from:

[   33.046572] BUG: unable to handle kernel paging request at 88003c2c7000
[   33.047721] IP: [] bad_from_user+0x4/0x6
[   33.048528] PGD 7232067 PUD 7233067 PMD 3067 PTE 80003c2c7060
[   33.049506] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   33.050029] Modules linked in:
[   33.050029] CPU 0
[   33.050029] Pid: 6885, comm: a.out Tainted: GW
3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321
[   33.050029] RIP: 0010:[]  [] 
bad_from_user+0x4/0x6
[   33.050029] RSP: 0018:88003462be40  EFLAGS: 00010246
[   33.050029] RAX:  RBX: fffb RCX: ff06ae2b
[   33.050029] RDX: fffb RSI: 7fffed36d2a0 RDI: 88003c2c7000
[   33.050029] RBP: 88003462be88 R08: 0280 R09: 
[   33.050029] R10:  R11:  R12: fffb
[   33.050029] R13: 7fffed36d2a0 R14:  R15: 
[   33.050029] FS:  7f6990044700() GS:88003dc0() 
knlGS:
[   33.050029] CS:  0010 DS:  ES:  CR0: 80050033
[   33.050029] CR2: 88003c2c7000 CR3: 347c8000 CR4: 000406f0
[   33.050029] DR0:  DR1:  DR2: 
[   33.050029] DR3:  DR6: 0ff0 DR7: 0400
[   33.050029] Process a.out (pid: 6885, threadinfo 88003462a000, task 
880034cb3000)
[   33.050029] Stack:
[   33.050029]  8192a6a9 88003462be98 88003b331e00 
88003ddd01e0
[   33.050029]    0001 

[   33.050029]   88003462bf68 8192bb34 

[   33.050029] Call Trace:
[   33.050029]  [] ? load_

Re: ipc,sem: sysv semaphore scalability

2013-04-01 Thread Stanislav Kinsbursky

29.03.2013 22:43, Linus Torvalds пишет:

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:

>
>Now that I have that reverted, I'm not seeing msgrcv traces any more, but
>I've started seeing this..
>
>general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC

Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
disable it?

I think I foud at least one bug in the MSG_COPY stuff: it leaks the
"copy" allocation if

 mode == SEARCH_LESSEQUAL



Hello, Linus.
Sorry, but I don't see copy allocation leak.
Dummy message is allocated always in msgflg has MSG_COPY flag being set.
Also prepare_copy() use msgtyp as a message number to copy and thus set it to 0.


but maybe I'm misreading it. And that shouldn't cause the problem you
see, but it's indicative of how badly tested and thought through the
MSG_COPY code is.

Totally UNTESTED leak fix appended. Stanislav?



I don't see, how this patch can help. And we should not release it until copy is
done in msg_handler, because msg is equal to copy.

Dummy copy message is release either by free_copy() (if msg is error) or
free_msg().

But there are two issues here definitely:

1) Poor SELinux support for message
copying. This issue was addressed by Stephen Smalley here:

https://lkml.org/lkml/2013/2/6/663

But look like he didn't sent the patch to Andrew.

2) Copy leak and queue corruption in case of
copy message wasn't found (this was mentioned by Peter in another thread; 
thanks for
catching this, Peter!), because msg will be a valid pointer and all the "message 
copy"
clean up logic doesn't work.

I like Peter's cleanup and fix series.  But if it looks like too much changes
for this old code, I have another small patch, which should fix the issue:

ipc: set msg back to -EAGAIN if copy wasn't performed

Make sure, that msg pointer is set back to error value in case of MSG_COPY
flag is set and desired message to copy wasn't found. This garantees, that 
msg
is either a error pointer or a copy address.
Otherwise last message in queue will be freed without unlinking from queue
(which leads to memory corruption) plus dummy allocated copy won't be 
released.

Signed-off-by: Stanislav Kinsbursky 

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf..fede1d0 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -872,6 +872,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, 
long msgtyp,
goto out_unlock;
break;
}
+   msg = ERR_PTR(-EAGAIN);
} else
break;
msg_counter++;


  Linus


patch.diff


  ipc/msg.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..b841508556cb 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, 
long msgtyp,
msg = copy_msg(msg, copy);
if (IS_ERR(msg))
goto out_unlock;
+   copy = NULL;
break;
}
} else
@@ -976,10 +977,9 @@ out_unlock:
break;
}
}
-   if (IS_ERR(msg)) {
-   free_copy(copy);
+   free_copy(copy);
+   if (IS_ERR(msg))
return PTR_ERR(msg);
-   }

bufsz = msg_handler(buf, msg, bufsz);
free_msg(msg);




--
Best regards,
Stanislav Kinsbursky
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Linus Torvalds
On Sun, Mar 31, 2013 at 6:45 AM, Rik van Riel  wrote:
>
> Should we use "semid" here, like Linus suggested, instead of "un->semid"?

As Davidlohr noted, in linux-next the rcu read-lock is held over the
whole thing, so no, un->semid should be stable once "un" has been
re-looked-up under the semaphore lock.

In mainline, the problem is that the "sem_lock_check()" is done with
"un->semid" *after* we've dropped the RCU read-lock, so "un" at that
point is not reliable (it could be free'd at any time underneath us).

That said, I really *really* hate what both mainline and linux-next do
with the RCU read lock, and linux-next is arguably worse.

The whole "take the RCU lock in one place, and release it in another"
is confusing and bug-prone as hell. And linux-next made it worse: now
sem_lock() no longer takes the read-lock (it expects the caller to
take it), but sem_unlock() still drops the read-lock. This is all just
f*cking crazy.

The rule should be that the rcu read-lock is always and released at
the same "level". For example, find_alloc_undo() should just be called
with (and unconditionaly return with) the rcu read-lock held, and if
it needs to actually do an allocation, it can drop the rcu lock for
the duration of the allocation.

This whole "conditional locking" depending on error returns and on
whether we have undo's etc is bug-prone and confusing. And when you
have totally different locking rules for "sem_lock()" vs
"sem_unlock()", you know you're confused.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Emmanuel Benisty
Hi Davidlohr,

On Sun, Mar 31, 2013 at 12:01 PM, Davidlohr Bueso
 wrote:
> Specially dropping the rcu read lock before the continue statement
> (sorry for not mentioning this in the last email).

I was missing this indeed, thanks. Still the same issues however...
I'll do some more testing on the same machine but with a totally
different environment, within tomorrow hopefully.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-31 Thread Rik van Riel

On 03/31/2013 01:01 AM, Davidlohr Bueso wrote:


diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
struct sem_array *sma;
struct sem_undo *un;
struct list_head tasks;
-   int semid;
-   int i;
+   int semid, i;

rcu_read_lock();
un = list_entry_rcu(ulp->list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
}

sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);


Should we use "semid" here, like Linus suggested, instead of "un->semid"?


-   sem_lock(sma, NULL, -1);
-
/* exit_sem raced with IPC_RMID, nothing to do */
-   if (IS_ERR(sma))
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
continue;
+   }

+   sem_lock(sma, NULL, -1);
un = __lookup_undo(ulp, semid);
if (un == NULL) {
/* exit_sem raced with IPC_RMID+semget() that created



--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Davidlohr Bueso
On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote:
> On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
>  wrote:
> > On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty  
> > wrote:
> >>
> >> Then I start building a random package and the problems start. They
> >> may also happen without compiling but this seems to trigger the bug
> >> quite quickly.
> >
> > I suspect it's about preemption, and the build just results in enough
> > scheduling load that you start hitting whatever race there is.
> >
> >> Anyway, some progress here, I hope: dmesg seems to be
> >> willing to reveal some secrets (using some pastebin service since this
> >> is pretty big):
> >>
> >> https://gist.github.com/anonymous/5275120
> >
> > That looks like exactly the exit_sem() bug that Davidlohr was talking
> > about, where the
> >
> > /* exit_sem raced with IPC_RMID, nothing to do */
> > if (IS_ERR(sma))
> > continue;
> >
> > should be moved to *before* the
> >
> > sem_lock(sma, NULL, -1);
> >
> > call. And apparently the bug I had found is already fixed in -next.
> 
> I just tried the 7 original patches + the 2 one liners from -next +
> modified Linus' patch (attached) on the top of 3.9-rc4 using
> PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
> above. I was building two packages at the same time, went away for 30
> seconds, came back and everything froze as soon as I touched the
> laptop's touchpad. Maybe a coincidence but anyway... Another shot in
> the dark, I had this weird message when trying to build gcc:
> semop(2): encountered an error: Identifier removed

*sigh*. I had high hopes for this being the bug triggering your issue,
specially after seeing exit_sem() in the trace. 

Emmanuel, just to be sure, does your changes reflect the patch below?
Specially dropping the rcu read lock before the continue statement
(sorry for not mentioning this in the last email).

Anyway, this is still a bug. Andrew, the patch below applies to
linux-next, please queue this up if you don't have any objections. 

Thanks,
Davidlohr

---8<---
From: Davidlohr Bueso 
Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma

In exit_sem() we attempt to acquire the sma->sem_perm.lock by calling
sem_lock() immediately after obtaining sma. However, if sma isn't valid,
then calling sem_lock() will tend to do bad things.

Move the sma error check right after the sem_obtain_object_check() call instead.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
struct sem_array *sma;
struct sem_undo *un;
struct list_head tasks;
-   int semid;
-   int i;
+   int semid, i;
 
rcu_read_lock();
un = list_entry_rcu(ulp->list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
}
 
sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
-   sem_lock(sma, NULL, -1);
-
/* exit_sem raced with IPC_RMID, nothing to do */
-   if (IS_ERR(sma))
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
continue;
+   }
 
+   sem_lock(sma, NULL, -1);
un = __lookup_undo(ulp, semid);
if (un == NULL) {
/* exit_sem raced with IPC_RMID+semget() that created
-- 
1.7.11.7



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Emmanuel Benisty
Hi Linus,

On Sun, Mar 31, 2013 at 12:22 AM, Linus Torvalds
 wrote:
> On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty  
> wrote:
>> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
>>>
>>> This came from the gcc build?
>>
>> yes, very early in the build process, IIRC this line was repeated a
>> few times and the build just stalled.
>
> Ok, we're bringing out the crazy hacks now.
>
> The attached patch is just insane, doesn't really even work in
> general, and only even compiles on 64-bit. But it should work in
> *practice* to find if somebody adds the same RCU head to the RCU lists
> twice, and ignore the second time it happens (and give a warning that
> hopefully pinpoints the backtrace).
>
> It's ugly. It's broken. It may not work. In other words, I'm not proud
> of it. But you seem to be the only one able to trigger the issue
> easily, willing to try crazy crap, so "tag, you're it". Maybe this
> gives us more information. And maybe it doesn't, and I'm totally wrong
> about the whole "rcu head added twice" theory.

That's all I could get so far:
https://gist.github.com/anonymous/5279255

Losing wireless is generally the start signal of controlled demolition
of the machine.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-30 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty  wrote:
> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
>>
>> This came from the gcc build?
>
> yes, very early in the build process, IIRC this line was repeated a
> few times and the build just stalled.

Ok, we're bringing out the crazy hacks now.

The attached patch is just insane, doesn't really even work in
general, and only even compiles on 64-bit. But it should work in
*practice* to find if somebody adds the same RCU head to the RCU lists
twice, and ignore the second time it happens (and give a warning that
hopefully pinpoints the backtrace).

It's ugly. It's broken. It may not work. In other words, I'm not proud
of it. But you seem to be the only one able to trigger the issue
easily, willing to try crazy crap, so "tag, you're it". Maybe this
gives us more information. And maybe it doesn't, and I'm totally wrong
about the whole "rcu head added twice" theory.

Linus


patch.diff
Description: Binary data


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty
On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
 wrote:
>> Another shot in
>> the dark, I had this weird message when trying to build gcc:
>> semop(2): encountered an error: Identifier removed
>
> This came from the gcc build?

yes, very early in the build process, IIRC this line was repeated a
few times and the build just stalled.

> That's just crazy. No normal app uses sysv semaphores. I have an older
> gcc build environment, and some grepping shows it has some ipc
> semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
> but that should be very very minor.

> I notice that your dmesg says that your kernel is compiled by
> gcc-4.8.1 prerelease. Is there any chance that you could try to
> install a known-stable gcc, like 4.7.2 or something. It's entirely
> possible that it's a kernel bug even if it's triggered by some more
> aggressive compiler optimization or something, but it would be really
> good to try to see if this might be gcc-specific.

I could build a kernel on another machine on which 4.7.2 is installed,
kernel oops'd as well:
http://i.imgur.com/uk6gmq1.jpg

FWIW, I have a few things disabled in my config so here is the one I
used, maybe I'm missing something (but again, everything works
perfectly with your tree):
https://gist.github.com/anonymous/5275566

Lastly, just FTR, I tried Andrew's 3.9-rc4-mm1 and got the same
issues, unsurprisingly I guess.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 9:33 PM, Emmanuel Benisty  wrote:
>
> I just tried the 7 original patches + the 2 one liners from -next +
> modified Linus' patch (attached)

.. that patch looks fine.

> on the top of 3.9-rc4 using
> PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
> above. I was building two packages at the same time, went away for 30
> seconds, came back and everything froze as soon as I touched the
> laptop's touchpad. Maybe a coincidence but anyway... Another shot in
> the dark, I had this weird message when trying to build gcc:
> semop(2): encountered an error: Identifier removed

This came from the gcc build?

That's just crazy. No normal app uses sysv semaphores. I have an older
gcc build environment, and some grepping shows it has some ipc
semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
but that should be very very minor.

You seem to trigger errors really trivially easily, which is really
odd. It's sounding less and less like some subtle race, and more like
the error just happens all the time. If you can make even the gcc
build generate errors, I don't think they can be some rare blue-moon
thing.

I notice that your dmesg says that your kernel is compiled by
gcc-4.8.1 prerelease. Is there any chance that you could try to
install a known-stable gcc, like 4.7.2 or something. It's entirely
possible that it's a kernel bug even if it's triggered by some more
aggressive compiler optimization or something, but it would be really
good to try to see if this might be gcc-specific.

For example, I wonder if your gcc might miscompile idr_alloc() or
something, so that we get the same ID for different ipc objects. That
would certainly potentially cause chaos.

Hmm?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty
On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
 wrote:
> On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty  wrote:
>>
>> Then I start building a random package and the problems start. They
>> may also happen without compiling but this seems to trigger the bug
>> quite quickly.
>
> I suspect it's about preemption, and the build just results in enough
> scheduling load that you start hitting whatever race there is.
>
>> Anyway, some progress here, I hope: dmesg seems to be
>> willing to reveal some secrets (using some pastebin service since this
>> is pretty big):
>>
>> https://gist.github.com/anonymous/5275120
>
> That looks like exactly the exit_sem() bug that Davidlohr was talking
> about, where the
>
> /* exit_sem raced with IPC_RMID, nothing to do */
> if (IS_ERR(sma))
> continue;
>
> should be moved to *before* the
>
> sem_lock(sma, NULL, -1);
>
> call. And apparently the bug I had found is already fixed in -next.

I just tried the 7 original patches + the 2 one liners from -next +
modified Linus' patch (attached) on the top of 3.9-rc4 using
PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
above. I was building two packages at the same time, went away for 30
seconds, came back and everything froze as soon as I touched the
laptop's touchpad. Maybe a coincidence but anyway... Another shot in
the dark, I had this weird message when trying to build gcc:
semop(2): encountered an error: Identifier removed


patch.diff
Description: Binary data


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty  wrote:
>
> Then I start building a random package and the problems start. They
> may also happen without compiling but this seems to trigger the bug
> quite quickly.

I suspect it's about preemption, and the build just results in enough
scheduling load that you start hitting whatever race there is.

> Anyway, some progress here, I hope: dmesg seems to be
> willing to reveal some secrets (using some pastebin service since this
> is pretty big):
>
> https://gist.github.com/anonymous/5275120

That looks like exactly the exit_sem() bug that Davidlohr was talking
about, where the

/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma))
continue;

should be moved to *before* the

sem_lock(sma, NULL, -1);

call. And apparently the bug I had found is already fixed in -next.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty
Hi Davidlohr,

On Sat, Mar 30, 2013 at 9:08 AM, Davidlohr Bueso  wrote:
> Not sure which one liner you refer to, but, if you haven't already done
> so, please try with these fixes (queued in linux-next):
>
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31
>
> I've been trying to reproduce your twilight zone problem on five
> different machines now without any luck. Is there anything you're doing
> to trigger the issue? Does the machine boot ok and then do weird things,
> say after X starts, open some program, etc?

I was missing a9cead0, thanks. What I usually do is starting a
standard session, which looks like this:

init-+-5*[agetty]
 |-bash---startx---xinit-+-X---2*[{X}]
 |   `-dwm---sh---sleep
 |-bash---chromium-+-chromium
 | |-chromium-+-chromium
 | |  `-2*[{chromium}]
 |
|-chromium-sandbo---chromium---chromium---4*[chromium---4*[{chromium}]]
 | `-65*[{chromium}]
 |-crond
 |-dbus-daemon
 |-klogd
 |-syslogd
 |-tmux-+-alsamixer
 |  |-bash---bash
 |  |-bash
 |  |-htop
 |  `-newsbeuter---{newsbeuter}
 |-udevd
 |-urxvtd-+-bash---pstree
 |`-bash---tmux
 `-wpa_supplicant

Then I start building a random package and the problems start. They
may also happen without compiling but this seems to trigger the bug
quite quickly. Anyway, some progress here, I hope: dmesg seems to be
willing to reveal some secrets (using some pastebin service since this
is pretty big):

https://gist.github.com/anonymous/5275120

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso
On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote:
> On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty  wrote:
> >
> > I had to slightly modify the patch since it wouldn't match the changes
> > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> > hope that was the right thing to do. So, what I tried was: original 7
> > patches + the one liner + your patch blindly modified by me on the top
> > of 3.9-rc4 and I'm still having twilight zone issues.
> 
> Ok, please send your patch so that I can double-check what you did,
> but it was simple enough that you probably did the right thing.
> 
> Sad. Your case definitely looks like a double rcu-free, as shown by
> the fact that when you enabled SLUB debugging the oops happened with
> the use-after-free pattern (it's __rcu_reclaim() doing the
> "head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head"
> has already been free'd once).
> 
> So ipc_rcu_putref() and a refcounting error looked very promising.as a
> potential explanation.
> 
> The 'un' undo structure is also free'd with rcu, but the locking
> around that seems much more robust. The undo entry is on two lists
> (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under
> ulp->lock). But those locks are actually tested with
> assert_spin_locked() in all the relevant places, and the code actually
> looks sane. So I had high hopes for ipc_rcu_putref()...
> 
> Hmm. Except for exit_sem() that does odd things. You have preemption
> enabled, don't you? exit_sem() does a lookup of the first list_proc
> entry under tcy_read_lock to lookup un->semid, and then it drops the
> rcu read lock. At which point "un" is no longer reliable, I think. But
> then it still uses "un->semid", rather than the stable value it looked
> up under the rcu read lock. Which looks bogus.
> 
> So I'd like you to test a few more things:
> 
>  (a) In exit_sem(), can you change the
> 
>  sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);
> 
>  to use just "semid" rather than "un->semid", because I don't
> think "un" is stable here.

Well that's not really the case in the new code. We don't drop the rcu
read lock until the end of the loop, in sem_unlock(). However, I just
noticed that we're checking sma for error after trying to acquire
sma->sem_perm.lock:

sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
sem_lock(sma, NULL, -1);

/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma))
continue;

The IS_ERR(sma) check should be right after the sem_obtain_object_check() call 
instead.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty  wrote:
>
> I had to slightly modify the patch since it wouldn't match the changes
> introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> hope that was the right thing to do. So, what I tried was: original 7
> patches + the one liner + your patch blindly modified by me on the top
> of 3.9-rc4 and I'm still having twilight zone issues.

Ok, please send your patch so that I can double-check what you did,
but it was simple enough that you probably did the right thing.

Sad. Your case definitely looks like a double rcu-free, as shown by
the fact that when you enabled SLUB debugging the oops happened with
the use-after-free pattern (it's __rcu_reclaim() doing the
"head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head"
has already been free'd once).

So ipc_rcu_putref() and a refcounting error looked very promising.as a
potential explanation.

The 'un' undo structure is also free'd with rcu, but the locking
around that seems much more robust. The undo entry is on two lists
(sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under
ulp->lock). But those locks are actually tested with
assert_spin_locked() in all the relevant places, and the code actually
looks sane. So I had high hopes for ipc_rcu_putref()...

Hmm. Except for exit_sem() that does odd things. You have preemption
enabled, don't you? exit_sem() does a lookup of the first list_proc
entry under tcy_read_lock to lookup un->semid, and then it drops the
rcu read lock. At which point "un" is no longer reliable, I think. But
then it still uses "un->semid", rather than the stable value it looked
up under the rcu read lock. Which looks bogus.

So I'd like you to test a few more things:

 (a) In exit_sem(), can you change the

 sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);

 to use just "semid" rather than "un->semid", because I don't
think "un" is stable here.

 (b) does the problem go away if you change disable CONFIG_PREEMPT
(perhaps to PREEMPT_NONE or PREEMPT_VOLUNTARY?)

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Davidlohr Bueso
On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote:
> Hi Linus,
> 
> On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
>  wrote:
> > Emmanuel, can you try the attached patch? I think it applies cleanly
> > on top of the scalability series too without any changes, but I didn't
> > check if the patches perhaps changed some of the naming or something.
> 
> I had to slightly modify the patch since it wouldn't match the changes
> introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> hope that was the right thing to do. So, what I tried was: original 7
> patches + the one liner + your patch blindly modified by me on the top
> of 3.9-rc4 and I'm still having twilight zone issues.

Not sure which one liner you refer to, but, if you haven't already done
so, please try with these fixes (queued in linux-next):

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31

I've been trying to reproduce your twilight zone problem on five
different machines now without any luck. Is there anything you're doing
to trigger the issue? Does the machine boot ok and then do weird things,
say after X starts, open some program, etc?

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Emmanuel Benisty
Hi Linus,

On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
 wrote:
> Emmanuel, can you try the attached patch? I think it applies cleanly
> on top of the scalability series too without any changes, but I didn't
> check if the patches perhaps changed some of the naming or something.

I had to slightly modify the patch since it wouldn't match the changes
introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
hope that was the right thing to do. So, what I tried was: original 7
patches + the one liner + your patch blindly modified by me on the top
of 3.9-rc4 and I'm still having twilight zone issues.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 2:12 PM, Linus Torvalds
 wrote:
>
> I dunno. I'm still not sure this is triggerable, but it looks bad. But
> both the semaphore case and the msg cases seem to be solvable by
> moving the unlock down, and shm seem to have no getref/putref users to
> race with, so this (whitespace-damaged) patch *may* be sufficient:

Well, the patch doesn't seem to cause any problems, at least neither
lockdep nor spinlock sleep debugging complains. I have no idea whether
it actually fixes any problems, though.

I do wonder if this might explain the problem Emmanuel saw. A double
free of a RCU-freeable object would possibly result in exactly the
kind of mess that Emmanuel reported with the semaphore scalability
patches.

Emmanuel, can you try the attached patch? I think it applies cleanly
on top of the scalability series too without any changes, but I didn't
check if the patches perhaps changed some of the naming or something.

  Linus


patch.diff
Description: Binary data


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds
 wrote:
>
> The alternative would be to make sure the thing is always locked (and
> in a rcu-read-safe region) before putref/getref. The only place (apart
> from the initial allocation, which doesn't matter, because nothing can
> see if itf that path fails) seems to be that freeque(), but I didn't
> check everything.
>
> Moving the
>
> msg_unlock(msq);
>
> to the end of freeque() might be the way to go. It's going to cause us
> to hold the lock for longer, but I'm not sure we care for that path.

Uhhuh. I think shm_destroy() has the same pattern. And I think that
one has locking reasons why it has to do the shm_unlock() before
tearing some things down, although I'm not sure..

The good news is that shm doesn't seem to have any users of
ipc_rcu_get/putref(), so I don't see anything to race against. So it
has the same buggy pattern, but I don't think it can trigger anything.

And ipc/sem.c has the same bug-pattern in freeary(). It does
"sem_unlock(sma)" followed by "ipc_rcu_putref(sma);", and it *does*
seem to have code that that can race against (sem_lock_and_putref()).
The whole "sem_putref()" tries to be careful and gets the lock for the
last ref, but getting the lock doesn't help when freeary() does the
refcount access without it.

The semaphore case seems to argue fairly strongly for an "atomic_t
refcount", because right now it does a lot of "sem_putref()" stuff
that just gets the lock for the putref. So not only is that
insufficient (if it races against freeary(), but it's also more
expensive than just an atomic refcount would have been.

I dunno. I'm still not sure this is triggerable, but it looks bad. But
both the semaphore case and the msg cases seem to be solvable by
moving the unlock down, and shm seem to have no getref/putref users to
race with, so this (whitespace-damaged) patch *may* be sufficient:

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..338d8e2b589b 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
expunge_all(msq, -EIDRM);
ss_wakeup(&msq->q_senders, 1);
msg_rmid(ns, msq);
-   msg_unlock(msq);

tmp = msq->q_messages.next;
while (tmp != &msq->q_messages) {
@@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
atomic_sub(msq->q_cbytes, &ns->msg_bytes);
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
+   msg_unlock(msq);
 }

 /*
diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1c1eb5..1cf024b9eac0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)

/* Remove the semaphore set from the IDR */
sem_rmid(ns, sma);
-   sem_unlock(sma);

wake_up_sem_queue_do(&tasks);
ns->used_sems -= sma->sem_nsems;
security_sem_free(sma);
ipc_rcu_putref(sma);
+   sem_unlock(sma);
 }

 static unsigned long copy_semid_to_user(void __user *buf, struct
semid64_ds *in, int version)

(I didn't check very carefully, it's possible that we end up having
some locking problem if we move the unlocks down later, but it *looks*
fine)

Anybody?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> RIP: 0010:[]  [] free_msg+0x2b/0x40
> Call Trace:
>  [] freeque+0xcf/0x140
>  [] msgctl_down.constprop.9+0x183/0x200
>  [] sys_msgctl+0x139/0x400
>  [] system_call_fastpath+0x16/0x1b
>
> Looks like seg was already kfree'd.

Hmm.

I have a suspicion.

The use of ipc_rcu_getref/ipc_rcu_putref() seems a bit iffy.

In particular, the refcount is not an atomic variable, so we
absolutely *depend* on the spinlock for it.

However, looking at "freeque()", that's not actually the case. It
releases the message queue spinlock *before* it does the
ipc_rcu_putref(), and it does that because the thing has become
unreachable (it did a msg_rmid(), which will set ->deleted, which in
turn will mean that nobody should successfully look it up any more).

HOWEVER.

While the "deleted" flag is serialized, the actual refcount is not. So
in *parallel* with the freeque() call, we may have some other user
that does something like

...
ipc_rcu_getref(msq);
msg_unlock(msq);
schedule();

ipc_lock_by_ptr(&msq->q_perm);
ipc_rcu_putref(msq);
if (msq->q_perm.deleted) {
err = -EIDRM;
goto out_unlock_free;
}
...


which got the lock for the "deleted" test, so that part is all fine,
but notice the "ipc_rcu_putref()". It can happen at the same time that
freeque() also does its own ipc_rcu_putref().

So now refcount may get buggered, resulting in random early reuse,
double free's or leaking of the msq.

There may be some reason I don't see why this cannot happen, but it
does look suspicious. I wonder if the refcount should be an atomic
value.

The alternative would be to make sure the thing is always locked (and
in a rcu-read-safe region) before putref/getref. The only place (apart
from the initial allocation, which doesn't matter, because nothing can
see if itf that path fails) seems to be that freeque(), but I didn't
check everything.

Moving the

msg_unlock(msq);

to the end of freeque() might be the way to go. It's going to cause us
to hold the lock for longer, but I'm not sure we care for that path.

Guys, am I missing something? This kind of refcounting problem might
well explain the rcu-freeing-time bugs reported with the scalability
patches: long-time race that just got *much* easier to expose with the
higher level of parallelism?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 12:33 PM, Peter Hurley  wrote:
> On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
>> I think I foud at least one bug in the MSG_COPY stuff: it leaks the
>> "copy" allocation if
>>
>> mode == SEARCH_LESSEQUAL
>>
>> but maybe I'm misreading it.
>
> Yes, you're misreading it. copy_msg() returns the 'copy' address when
> copying is successful. So this patch double-frees 'copy'.

No it doesn't.

The case where "copy_msg()" *is* called and is successful, msg gets
set to "copy", my patch sets "copy" to NULL, so no, it doesn't
double-free.

That said, it looks like if MSG_COPY is set, we cannot trigger the
"mode == SEARCH_LESSEQUAL" case, because it forces msgtyp to 0, which
in turn forces mode = SEARCH_ANY.

So the actual leak case seems to not be possible, although that's
*really* hard to see from looking at the code. The code is just an
unreadable mess.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Peter Hurley
On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
> >
> > Here's an oops I just hit..
> >
> > BUG: unable to handle kernel NULL pointer dereference at 000f
> > IP: [] testmsg.isra.5+0x1a/0x60
> 
> Btw, looking at the code leading up to this, what the f*ck is wrong
> with the IPC stuff?
> 
> It's using the generic list stuff for most of the lists, but then it
> open-codes the accesses.
> 
> So instead of using
> 
>for_each_entry(walk_msg, &msq->q_messages, m_list) {
>   ..
>}
> 
> the ipc/msg.c code does all that by hand, with
> 
>tmp = msq->q_messages.next;
>while (tmp != &msq->q_messages) {
>   struct msg_msg *walk_msg;
> 
>   walk_msg = list_entry(tmp, struct msg_msg, m_list);
>   ...
>   tmp = tmp->next;
>}
> 
> Ugh. The code is near unreadable. And then it has magic memory
> barriers etc, implying that it doesn't lock the data structures, but
> no comments about them. See expunge_all() and pipelined_send().
> 
> The code seems entirely random, and it's badly set up (annoyance of
> the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
> you have to spend more effort looking for them, and (b) they won't get
> inlined).
> 
> Clearly nobody has cared for the crazy IPC message code in a long time.

Exactly that's what my patch series does; clean this mess up.

This is what I wrote to Andrew a couple of days ago.

On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
I just figured out how the queue is being corrupted and why my series
> fixes it.
> 
> 
> With MSG_COPY set, the queue scan can exit with the local variable
'msg'
> pointing to a real msg if the msg_counter never reaches the
copy_number.
> 
> The remaining execution looks like this:
> 
>   if (!IS_ERR(msg)) {
>   
>   if (msgflg & MSG_COPY)
>   goto out_unlock;
>   
> 
> out_unlock:
>   msg_unlock(msq);
>   break;
>   }
>   }
>   if (IS_ERR(msg))
>   
> 
>   bufsz = msg_handler();
>   free_msg(msg);  << msg never unlinked
> 
> 
> Since the msg should not have been found (because it failed the match
> criteria), the if (!IS_ERR(msg)) clause should never have executed.
> 
> That's why my refactor fixes resolve this; because msg is not
> inadvertently treated as a found msg.
> 
> But let's be honest; the real bug here is the poor structure of this
> function that even made this possible. The deepest nesting executes a
> goto to a label in the middle of an if clause. Yuck! No wonder this
> thing's fragile.
> 
> So my recommendation still stands. The series that fixes this has been
> getting tested in linux-next for a month. Fixing this some other way
is
> just asking for more trouble.
> 
> But why not just revert MSG_COPY altogether for 3.9?

Regards,
Peter Hurley

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Peter Hurley
On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
> I think I foud at least one bug in the MSG_COPY stuff: it leaks the
> "copy" allocation if
> 
> mode == SEARCH_LESSEQUAL
> 
> but maybe I'm misreading it.

Yes, you're misreading it. copy_msg() returns the 'copy' address when
copying is successful. So this patch double-frees 'copy'.

Andrew has had the patches that fix this for a month but because they're
fairly extensive he didn't want to apply them for 3.9.

Regards,
Peter


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
>
> Here's an oops I just hit..
>
> BUG: unable to handle kernel NULL pointer dereference at 000f
> IP: [] testmsg.isra.5+0x1a/0x60

Btw, looking at the code leading up to this, what the f*ck is wrong
with the IPC stuff?

It's using the generic list stuff for most of the lists, but then it
open-codes the accesses.

So instead of using

   for_each_entry(walk_msg, &msq->q_messages, m_list) {
  ..
   }

the ipc/msg.c code does all that by hand, with

   tmp = msq->q_messages.next;
   while (tmp != &msq->q_messages) {
  struct msg_msg *walk_msg;

  walk_msg = list_entry(tmp, struct msg_msg, m_list);
  ...
  tmp = tmp->next;
   }

Ugh. The code is near unreadable. And then it has magic memory
barriers etc, implying that it doesn't lock the data structures, but
no comments about them. See expunge_all() and pipelined_send().

The code seems entirely random, and it's badly set up (annoyance of
the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
you have to spend more effort looking for them, and (b) they won't get
inlined).

Clearly nobody has cared for the crazy IPC message code in a long time.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones  wrote:
>
> Btw, something that's really bothering me is just how much bogus
> 'follow-on' spew we have lately. I'm not sure what changed, but it
> seems to have gotten worse.

.. we have many more sanity checks, and they tend to trigger in the
case of us killing somebody holding locks etc.

> Related: is there any value in printing out all the ? symbols on
> kernels with frame pointers enabled ?

I've found them useful. You can often see what happened just before.

Of course, I'd still prefer gcc not generate unnecessarily big stack
frames (which is one of the bigger reasons for old stuff shining
through) but as things are, they often give hints about what happened
just before.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones
On Fri, Mar 29, 2013 at 11:43:25AM -0700, Linus Torvalds wrote:
 > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
 > >
 > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 > > I've started seeing this..
 > >
 > > general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 > 
 > Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
 > disable it?
 > 
 > I think I foud at least one bug in the MSG_COPY stuff: it leaks the
 > "copy" allocation if
 > 
 > mode == SEARCH_LESSEQUAL
 > 
 > but maybe I'm misreading it. And that shouldn't cause the problem you
 > see, but it's indicative of how badly tested and thought through the
 > MSG_COPY code is.
 > 
 > Totally UNTESTED leak fix appended. Stanislav?

I'll give it a shot.

Btw, something that's really bothering me is just how much bogus
'follow-on' spew we have lately. I'm not sure what changed, but it
seems to have gotten worse.

Here's an oops I just hit..

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter 
ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec 
bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill 
snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm 
radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology 
Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[]  [] testmsg.isra.5+0x1a/0x60
RSP: 0018:880117bb5e88  EFLAGS: 00010246
RAX:  RBX: 0004 RCX: 0078
RDX: 0004 RSI: fffe RDI: 000f
RBP: 880117bb5e88 R08: 0004 R09: 0001
R10: 880117bb8000 R11: 0001 R12: fffe
R13: 88010fd10308 R14: 88010fd10258 R15: 
FS:  7fa89c256740() GS:88012aa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 000f CR3: 00010f76c000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 
880117bb8000)
Stack:
 880117bb5f68 812c3746  880117bb8000
 880117bb8000 880117bb8000 81c7ace0 812c2430
 0004   652a928e
Call Trace:
 [] do_msgrcv+0x1a6/0x5f0
 [] ? msg_security+0x10/0x10
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] sys_msgrcv+0x15/0x20
 [] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 
fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 
00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [] testmsg.isra.5+0x1a/0x60
 RSP 
CR2: 000f
---[ end trace 8f0713d2aacb6aa3 ]---
BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic(): 1, irqs_disabled(): 0, pid: 958, name: trinity-child20
INFO: lockdep is turned off.
Pid: 958, comm: trinity-child20 Tainted: G  D  3.9.0-rc4+ #7
Call Trace:
 [] __might_sleep+0x145/0x200
 [] down_read+0x2a/0xa0
 [] exit_signals+0x24/0x130
 [] do_exit+0xbc/0xd10
 [] ? kmsg_dump+0x1b5/0x230
 [] ? kmsg_dump+0x25/0x230
 [] oops_end+0x9c/0xe0
 [] no_context+0x268/0x275
 [] __bad_area_nosemaphore+0x78/0x1d1
 [] bad_area_nosemaphore+0x13/0x15
 [] __do_page_fault+0x366/0x5b0
 [] ? __lock_acquire+0x2e5/0x1a00
 [] ? trace_hardirqs_off_caller+0x28/0xc0
 [] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [] do_page_fault+0xe/0x10
 [] page_fault+0x22/0x30
 [] ? testmsg.isra.5+0x1a/0x60
 [] ? security_msg_queue_msgrcv+0x16/0x20
 [] do_msgrcv+0x1a6/0x5f0
 [] ? msg_security+0x10/0x10
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] sys_msgrcv+0x15/0x20
 [] system_call_fastpath+0x16/0x1b
note: trinity-child20[958] exited with preempt_count 1
BUG: scheduling while atomic: trinity-child20/958/0x1002
INFO: lockdep is turned off.
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filte

Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
 > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:
 > 
 > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
 > > 
 > >  > Whichever way we go, we should get a wiggle on - this has been hanging
 > >  > around for too long.  Dave, do you have time to determine whether
 > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > >  > than max") fixes things up?
 > > 
 > > Ok, with that reverted it's been grinding away for a few hours without 
 > > incident.
 > > Normally I see the oops within a minute or so.
 > > 
 > 
 > OK, thanks, I queued a revert:
 > 
 > From: Andrew Morton 
 > Subject: revert "ipc: don't allocate a copy larger than max"
 > 
 > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
 > during trinity testing.

I owe Peter an apology. I just hit it again with that backed out.
Andrew, might as well drop that revert.

BUG: unable to handle kernel NULL pointer dereference at 000f
IP: [] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif 
can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi 
af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox 
ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter 
ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec 
bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill 
snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm 
radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology 
Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[]  [] testmsg.isra.5+0x1a/0x60
RSP: 0018:880117bb5e88  EFLAGS: 00010246
RAX:  RBX: 0004 RCX: 0078
RDX: 0004 RSI: fffe RDI: 000f
RBP: 880117bb5e88 R08: 0004 R09: 0001
R10: 880117bb8000 R11: 0001 R12: fffe
R13: 88010fd10308 R14: 88010fd10258 R15: 
FS:  7fa89c256740() GS:88012aa0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 000f CR3: 00010f76c000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 
880117bb8000)
Stack:
 880117bb5f68 812c3746  880117bb8000
 880117bb8000 880117bb8000 81c7ace0 812c2430
 0004   652a928e
Call Trace:
 [] do_msgrcv+0x1a6/0x5f0
 [] ? msg_security+0x10/0x10
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] sys_msgrcv+0x15/0x20
 [] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 
fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 
00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [] testmsg.isra.5+0x1a/0x60

I think I wasn't seeing that this last week because I had inadvertantly 
disabled DEBUG_PAGEALLOC

and.. we're back to square one.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC

Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
disable it?

I think I foud at least one bug in the MSG_COPY stuff: it leaks the
"copy" allocation if

mode == SEARCH_LESSEQUAL

but maybe I'm misreading it. And that shouldn't cause the problem you
see, but it's indicative of how badly tested and thought through the
MSG_COPY code is.

Totally UNTESTED leak fix appended. Stanislav?

 Linus


patch.diff
Description: Binary data


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 11:04 AM, Dave Jones  wrote:
>
> mainline. Your current tree.

Ok, that's what I thought you were generally testing, just wanted to
verify. The subject kind of implied otherwise..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones
On Fri, Mar 29, 2013 at 11:00:27AM -0700, Linus Torvalds wrote:
 > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
 > >
 > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 > > I've started seeing this..
 > >
 > > general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 > >
 > > Looks like seg was already kfree'd.
 > 
 > Just to clarify: is this you testing -mm that has Davidlohr's and
 > Rik's scalability patches? Or mainline?

mainline. Your current tree.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Linus Torvalds
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones  wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
>
> Looks like seg was already kfree'd.

Just to clarify: is this you testing -mm that has Davidlohr's and
Rik's scalability patches? Or mainline?

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-29 Thread Dave Jones
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
 > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:
 > 
 > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
 > > 
 > >  > Whichever way we go, we should get a wiggle on - this has been hanging
 > >  > around for too long.  Dave, do you have time to determine whether
 > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > >  > than max") fixes things up?
 > > 
 > > Ok, with that reverted it's been grinding away for a few hours without 
 > > incident.
 > > Normally I see the oops within a minute or so.
 > 
 > OK, thanks, I queued a revert:
 > 
 > From: Andrew Morton 
 > Subject: revert "ipc: don't allocate a copy larger than max"
 > 
 > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
 > during trinity testing.

Now that I have that reverted, I'm not seeing msgrcv traces any more, but 
I've started seeing this..

general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: l2tp_ppp l2tp_netlink l2tp_core llc2 phonet netrom rose 
af_key af_rxrpc caif_socket caif can_raw cmtp kernelcapi can_bcm can nfnetlink 
ipt_ULOG scsi_transport_iscsi af_802154 ax25 atm ipx pppoe pppox x25 nfc irda 
ppp_generic p8023 slhc p8022 appletalk decnet crc_ccitt rds psnap llc lockd 
sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack 
ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth 
snd_hda_codec raid0 snd_pcm rfkill microcode serio_raw pcspkr snd_page_alloc 
edac_core snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan 
kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 3 
Pid: 1850, comm: trinity-child37 Tainted: GB3.9.0-rc4+ #7 Gigabyte 
Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[]  [] free_msg+0x2b/0x40
RSP: 0018:8800a1d3bdd0  EFLAGS: 00010202
RAX:  RBX: 6b6b6b6b6b6b6b6b RCX: 
RDX:  RSI:  RDI: 810b6ced
RBP: 8800a1d3bde0 R08:  R09: 0001
R10: 0001 R11: 0001 R12: 88009997e620
R13: 81c7ace0 R14: 8800caf359d8 R15: 81c7b024
FS:  7f2d7be64740() GS:88012ac0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 7f8bd7bb6000 CR3: a1f0a000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process trinity-child37 (pid: 1850, threadinfo 8800a1d3a000, task 
8800a1e62490)
Stack:
 6b6b6b6b6b6b6b6b 8800caf35928 8800a1d3be18 812c289f
  81c7ace0 8800caf35928 
 8800a1d3be28 8800a1d3bec8 812c2a93 8800a1d3be40
Call Trace:
 [] freeque+0xcf/0x140
 [] msgctl_down.constprop.9+0x183/0x200
 [] ? up_read+0x1f/0x40
 [] ? __do_page_fault+0x214/0x5b0
 [] ? lock_release_non_nested+0x23e/0x320
 [] sys_msgctl+0x139/0x400
 [] ? retint_swapgs+0xe/0x13
 [] ? trace_hardirqs_on_caller+0x115/0x1a0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] system_call_fastpath+0x16/0x1b
Code: 66 66 66 66 90 55 48 89 e5 41 54 49 89 fc 53 e8 fc 5e 01 00 49 8b 5c 24 
20 4c 89 e7 e8 8f af ed ff 48 85 db 75 05 eb 13 4c 89 e3 <4c> 8b 23 48 89 df e8 
7a af ed ff 4d 85 e4 75 ed 5b 41 5c 5d c3 


(Taint is from an ext4 double-free I just reported in a separate thread)

decoded..

   0:   66 66 66 66 90  data32 data32 data32 xchg %ax,%ax
   5:   55  push   %rbp
   6:   48 89 e5mov%rsp,%rbp
   9:   41 54   push   %r12
   b:   49 89 fcmov%rdi,%r12
   e:   53  push   %rbx
   f:   e8 fc 5e 01 00  callq  0x15f10
  14:   49 8b 5c 24 20  mov0x20(%r12),%rbx
  19:   4c 89 e7mov%r12,%rdi
  1c:   e8 8f af ed ff  callq  0xffedafb0
  21:   48 85 dbtest   %rbx,%rbx
  24:   75 05   jne0x2b
  26:   eb 13   jmp0x3b
  28:   4c 89 e3mov%r12,%rbx
  2b:*  4c 8b 23mov(%rbx),%r12 <-- trapping instruction
  2e:   48 89 dfmov%rbx,%rdi
  31:   e8 7a af ed ff  callq  0xffedafb0
  36:   4d 85 e4test   %r12,%r12
  39:   75 ed   jne0x28
  3b:   5b  pop%rbx
  3c:   41 5c   pop%r12
  3e:   5d  pop%rbp
  3f:   c3  retq   

Disassembly of free_msg shows..

seg = msg->next;
kfree(msg);
while (seg != NULL) {
struct msg_msgseg *tmp = seg->next;
 30b:   4c 8b 23mov(%rbx),%r12
kfree(seg);
 30e:   48 89 dfmov%rbx,%rdi
 311:   e8 00 00 00 00  callq  316 


Looks like seg was a

Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Andrew Morton
On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones  wrote:

> On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
> 
>  > Whichever way we go, we should get a wiggle on - this has been hanging
>  > around for too long.  Dave, do you have time to determine whether
>  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
>  > than max") fixes things up?
> 
> Ok, with that reverted it's been grinding away for a few hours without 
> incident.
> Normally I see the oops within a minute or so.
> 

OK, thanks, I queued a revert:

From: Andrew Morton 
Subject: revert "ipc: don't allocate a copy larger than max"

Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
during trinity testing.

Cc: Peter Hurley 
Cc: Stanislav Kinsbursky 
Reported-by: Dave Jones 
Cc: 
Signed-off-by: Andrew Morton 
---

 ipc/msg.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff -puN ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max ipc/msg.c
--- a/ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max
+++ a/ipc/msg.c
@@ -820,17 +820,15 @@ long do_msgrcv(int msqid, void __user *b
struct msg_msg *copy = NULL;
unsigned long copy_number = 0;
 
-   ns = current->nsproxy->ipc_ns;
-
if (msqid < 0 || (long) bufsz < 0)
return -EINVAL;
if (msgflg & MSG_COPY) {
-   copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax),
-   msgflg, &msgtyp, ©_number);
+   copy = prepare_copy(buf, bufsz, msgflg, &msgtyp, ©_number);
if (IS_ERR(copy))
return PTR_ERR(copy);
}
mode = convert_mode(&msgtyp, msgflg);
+   ns = current->nsproxy->ipc_ns;
 
msq = msg_lock_check(ns, msqid);
if (IS_ERR(msq)) {
_

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Dave Jones
On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:

 > Whichever way we go, we should get a wiggle on - this has been hanging
 > around for too long.  Dave, do you have time to determine whether
 > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > than max") fixes things up?

Ok, with that reverted it's been grinding away for a few hours without incident.
Normally I see the oops within a minute or so.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Andrew Morton
On Tue, 26 Mar 2013 10:59:27 -0700 Davidlohr Bueso  
wrote:

> On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
> > On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
> >  wrote:
> > > And you never see this problem without Rik's patches?
> > 
> > No, never.
> > 
> > > Could you bisect
> > > *which* patch it starts with? Are the first four ones ok (the moving
> > > of the locking around, but without the fine-grained ones), for
> > > example?
> > 
> > With the first four patches only, I got some X server freeze (just tried 
> > once).
> 
> Going over the code again, I found a potential recursive spinlock scenario. 
> Andrew, if you have no objections, please queue this up.
> 
> Thanks.
> 
> ---8<---
> 
> From: Davidlohr Bueso 
> Subject: [PATCH] ipc, sem: prevent possible deadlock
> 
> In semctl_main(), when cmd == GETALL, we're locking
> sma->sem_perm.lock (through sem_lock_and_putref), yet
> after the conditional, we lock it again.
> Unlock sma right after exiting the conditional.
> 
> Signed-off-by: Davidlohr Bueso 
> ---
>  ipc/sem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 1a2913d..f257afe 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int 
> semid, int semnum,
>   err = -EIDRM;
>   goto out_free;
>   }
> + sem_unlock(sma, -1);
>   }
>  
>   sem_lock(sma, NULL, -1);

Looks right. 

Do we need the locking at all?  What does it actually do?

sem_lock_and_putref(sma);
if (sma->sem_perm.deleted) {
sem_unlock(sma, -1);
err = -EIDRM;
goto out_free;
}
sem_unlock(sma, -1);

We're taking the lock, testing an int and then dropping the lock. 
What's the point in that?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Rik van Riel

On 03/26/2013 02:07 PM, Sasha Levin wrote:

On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:

On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:

On 03/20/2013 03:55 PM, Rik van Riel wrote:

This series makes the sysv semaphore code more scalable,
by reducing the time the semaphore lock is held, and making
the locking more scalable for semaphore arrays with multiple
semaphores.


Hi Rik,

Another issue that came up is:

[   96.347341] 
[   96.348085] [ BUG: lock held when returning to user space! ]
[   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G 
   W
[   96.360300] 
[   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
[   96.362019] 1 lock held by trinity-child9/7583:
[   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
SYSC_semtimedop+0x1fb/0xec0

It seems that we can leave semtimedop without releasing the rcu read lock.

I'm a bit confused by what's going on in semtimedop with regards to rcu read 
lock, it
seems that this behaviour is actually intentional?

 rcu_read_lock();
 sma = sem_obtain_object_check(ns, semid);
 if (IS_ERR(sma)) {
 if (un)
 rcu_read_unlock();
 error = PTR_ERR(sma);
 goto out_free;
 }

When I've looked at that it seems that not releasing the read lock was (very)
intentional.


This logic was from the original code, which I also found to be quite
confusing.


I wasn't getting this warning with the old code, so there was probably something
else that triggers this now.



After that, the only code path that would release the lock starts with:

 if (un) {
...

So we won't release the lock at all if un is NULL?



Not necessarily, we do release everything at the end of the function:

out_unlock_free:
sem_unlock(sma, locknum);


Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?


It is uglier than you think...

On success, find_alloc_undo will call rcu_read_lock, so we have the
rcu_read_lock held twice :(

Some of the ipc code is quite ugly, but making too many large changes
at once is just asking for trouble. I suspect we're going to have to
untangle this one bit at a time...


--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Rik van Riel

On 03/26/2013 01:59 PM, Davidlohr Bueso wrote:


From: Davidlohr Bueso 
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma->sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso 


Reviewed-by: Rik van Riel 

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Sasha Levin
On 03/20/2013 03:55 PM, Rik van Riel wrote:
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

Hi Rik,

Another issue that came up is:

[   96.347341] 
[   96.348085] [ BUG: lock held when returning to user space! ]
[   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G 
   W
[   96.360300] 
[   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
[   96.362019] 1 lock held by trinity-child9/7583:
[   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
SYSC_semtimedop+0x1fb/0xec0

It seems that we can leave semtimedop without releasing the rcu read lock.

I'm a bit confused by what's going on in semtimedop with regards to rcu read 
lock, it
seems that this behaviour is actually intentional?

rcu_read_lock();
sma = sem_obtain_object_check(ns, semid);
if (IS_ERR(sma)) {
if (un)
rcu_read_unlock();
error = PTR_ERR(sma);
goto out_free;
}

When I've looked at that it seems that not releasing the read lock was (very)
intentional.

After that, the only code path that would release the lock starts with:

if (un) {
...

So we won't release the lock at all if un is NULL?


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Sasha Levin
On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:
> On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
>> On 03/20/2013 03:55 PM, Rik van Riel wrote:
>>> This series makes the sysv semaphore code more scalable,
>>> by reducing the time the semaphore lock is held, and making
>>> the locking more scalable for semaphore arrays with multiple
>>> semaphores.
>>
>> Hi Rik,
>>
>> Another issue that came up is:
>>
>> [   96.347341] 
>> [   96.348085] [ BUG: lock held when returning to user space! ]
>> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G  
>>   W
>> [   96.360300] 
>> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
>> held!
>> [   96.362019] 1 lock held by trinity-child9/7583:
>> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
>> SYSC_semtimedop+0x1fb/0xec0
>>
>> It seems that we can leave semtimedop without releasing the rcu read lock.
>>
>> I'm a bit confused by what's going on in semtimedop with regards to rcu read 
>> lock, it
>> seems that this behaviour is actually intentional?
>>
>> rcu_read_lock();
>> sma = sem_obtain_object_check(ns, semid);
>> if (IS_ERR(sma)) {
>> if (un)
>> rcu_read_unlock();
>> error = PTR_ERR(sma);
>> goto out_free;
>> }
>>
>> When I've looked at that it seems that not releasing the read lock was (very)
>> intentional.
> 
> This logic was from the original code, which I also found to be quite
> confusing.

I wasn't getting this warning with the old code, so there was probably something
else that triggers this now.

>>
>> After that, the only code path that would release the lock starts with:
>>
>> if (un) {
>>  ...
>>
>> So we won't release the lock at all if un is NULL?
>>
> 
> Not necessarily, we do release everything at the end of the function: 
> 
> out_unlock_free:
>   sem_unlock(sma, locknum);

Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?

if (un->semid == -1) {
rcu_read_unlock();
goto out_unlock_free;
}
[...]
out_unlock_free:
sem_unlock(sma, locknum);


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso
On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
>  wrote:
> > And you never see this problem without Rik's patches?
> 
> No, never.
> 
> > Could you bisect
> > *which* patch it starts with? Are the first four ones ok (the moving
> > of the locking around, but without the fine-grained ones), for
> > example?
> 
> With the first four patches only, I got some X server freeze (just tried 
> once).

Going over the code again, I found a potential recursive spinlock scenario. 
Andrew, if you have no objections, please queue this up.

Thanks.

---8<---

From: Davidlohr Bueso 
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma->sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ipc/sem.c b/ipc/sem.c
index 1a2913d..f257afe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int 
semid, int semnum,
err = -EIDRM;
goto out_free;
}
+   sem_unlock(sma, -1);
}
 
sem_lock(sma, NULL, -1);
-- 
1.7.11.7



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Paul E. McKenney
On Tue, Mar 26, 2013 at 01:33:07PM -0400, Sasha Levin wrote:
> On 03/20/2013 03:55 PM, Rik van Riel wrote:
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> Hi Rik,
> 
> Another issue that came up is:
> 
> [   96.347341] 
> [   96.348085] [ BUG: lock held when returning to user space! ]
> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G   
>  W
> [   96.360300] 
> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
> held!
> [   96.362019] 1 lock held by trinity-child9/7583:
> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
> SYSC_semtimedop+0x1fb/0xec0
> 
> It seems that we can leave semtimedop without releasing the rcu read lock.
> 
> I'm a bit confused by what's going on in semtimedop with regards to rcu read 
> lock, it
> seems that this behaviour is actually intentional?
> 
> rcu_read_lock();
> sma = sem_obtain_object_check(ns, semid);
> if (IS_ERR(sma)) {
> if (un)
> rcu_read_unlock();
> error = PTR_ERR(sma);
> goto out_free;
> }
> 
> When I've looked at that it seems that not releasing the read lock was (very)
> intentional.
> 
> After that, the only code path that would release the lock starts with:
> 
> if (un) {
>   ...
> 
> So we won't release the lock at all if un is NULL?

Intentions notwithstanding, it is absolutely required to exit any and
all RCU read-side critical sections prior to going into user mode.

I suggest removing the "if (un)".

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-26 Thread Davidlohr Bueso
On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
> On 03/20/2013 03:55 PM, Rik van Riel wrote:
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> Hi Rik,
> 
> Another issue that came up is:
> 
> [   96.347341] 
> [   96.348085] [ BUG: lock held when returning to user space! ]
> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G   
>  W
> [   96.360300] 
> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still 
> held!
> [   96.362019] 1 lock held by trinity-child9/7583:
> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [] 
> SYSC_semtimedop+0x1fb/0xec0
> 
> It seems that we can leave semtimedop without releasing the rcu read lock.
> 
> I'm a bit confused by what's going on in semtimedop with regards to rcu read 
> lock, it
> seems that this behaviour is actually intentional?
> 
> rcu_read_lock();
> sma = sem_obtain_object_check(ns, semid);
> if (IS_ERR(sma)) {
> if (un)
> rcu_read_unlock();
> error = PTR_ERR(sma);
> goto out_free;
> }
> 
> When I've looked at that it seems that not releasing the read lock was (very)
> intentional.

This logic was from the original code, which I also found to be quite
confusing.

> 
> After that, the only code path that would release the lock starts with:
> 
> if (un) {
>   ...
> 
> So we won't release the lock at all if un is NULL?
> 

Not necessarily, we do release everything at the end of the function: 

out_unlock_free:
sem_unlock(sma, locknum);

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Sasha Levin
On 03/20/2013 03:55 PM, Rik van Riel wrote:
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

Hi Rik,

I'm getting the following false positives from lockdep:

[   80.492995] =
[   80.494052] [ INFO: possible recursive locking detected ]
[   80.494878] 3.9.0-rc4-next-20130325-sasha-00044-gcb6ef58 #315 Tainted: G 
   W
[   80.496228] -
[   80.497171] trinity-child9/7210 is trying to acquire lock:
[   80.497934]  (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: 
[] newary+0x1c7/0x2a0
[   80.499202]
[   80.499202] but task is already holding lock:
[   80.500031]  (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: 
[] newary+0x1c7/0x2a0
[   80.500031]
[   80.500031] other info that might help us debug this:
[   80.500031]  Possible unsafe locking scenario:
[   80.500031]
[   80.500031]CPU0
[   80.500031]
[   80.500031]   lock(&(&sma->sem_base[i].lock)->rlock);
[   80.500031]   lock(&(&sma->sem_base[i].lock)->rlock);
[   80.500031]
[   80.500031]  *** DEADLOCK ***
[   80.500031]
[   80.500031]  May be due to missing lock nesting notation
[   80.500031]
[   80.500031] 4 locks held by trinity-child9/7210:
[   80.500031]  #0:  (&ids->rw_mutex){+.}, at: [] 
ipcget+0x72/0x340
[   80.500031]  #1:  (rcu_read_lock){.+.+..}, at: [] 
ipc_addid+0x35/0x230
[   80.500031]  #2:  (&(&new->lock)->rlock){+.+...}, at: [] 
ipc_addid+0xd0/0x230
[   80.500031]  #3:  (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: 
[] newary+0x1c7/0x2a0
[   80.500031]
[   80.500031] stack backtrace:
[   80.500031] Pid: 7210, comm: trinity-child9 Tainted: GW
3.9.0-rc4-next-20130325-sasha-00044-gcb6ef58 #315
[   80.500031] Call Trace:
[   80.500031]  [] __lock_acquire+0xc6e/0x1e50
[   80.500031]  [] ? idr_get_empty_slot+0x255/0x3c0
[   80.500031]  [] ? mark_held_locks+0x12e/0x150
[   80.500031]  [] lock_acquire+0x1aa/0x240
[   80.500031]  [] ? newary+0x1c7/0x2a0
[   80.500031]  [] _raw_spin_lock+0x3b/0x70
[   80.500031]  [] ? newary+0x1c7/0x2a0
[   80.500031]  [] newary+0x1c7/0x2a0
[   80.500031]  [] ? ipcget+0x72/0x340
[   80.500031]  [] ipcget+0x226/0x340
[   80.500031]  [] SyS_semget+0x65/0x80
[   80.500031]  [] ? semctl_down.constprop.8+0x320/0x320
[   80.500031]  [] ? wake_up_sem_queue_do+0xa0/0xa0
[   80.500031]  [] ? SyS_msgrcv+0x20/0x20
[   80.500031]  [] tracesys+0xe1/0xe6

The code is:

for (i = 0; i < nsems; i++) {
INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
spin_lock_init(&sma->sem_base[i].lock);
spin_lock(&sma->sem_base[i].lock);  < here
}


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Emmanuel Benisty
On Mon, Mar 25, 2013 at 10:53 PM, Rik van Riel  wrote:
> On 03/25/2013 11:20 AM, Emmanuel Benisty wrote:
>>
>> On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel  wrote:
>
> With the first four patches only, I got some X server freeze (just
> tried once).



 Could you try booting with panic=1 so the kernel panics on the first
 oops?
>>>
>>>
>>>
>>> Sorry that should be "oops=panic"
>>>
>>>
 Maybe that way (if we are lucky) we will be able to capture the first
 oops, and maybe get an idea of what causes the problem.
>>
>>
>> Sorry Rik, I get all kind of weird behaviors (wireless dies, compiling
>> gets stuck and is impossible to kill, can't kill X) with the 4
>> patches+oops=panic but no trace. Here after is 7+1 patches with
>> oops=panic boot: http://i.imgur.com/1jep1qx.jpg
>
>
> This may be a stupid question, but you re-compile and re-install
> the kernel modules every time you changed the kernel?
>
> The behaviour you report with just the first four patches is so
> random, it sounds almost like a mismatched data structure between
> compiles...

Yes it's OK. I even started from scratch just in case but I'm still
getting the same weird things. Everything works just fine with a build
from Linus' tree which I normally use. I'll try the patches on another
machine tomorrow.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Rik van Riel

On 03/25/2013 11:20 AM, Emmanuel Benisty wrote:

On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel  wrote:

With the first four patches only, I got some X server freeze (just
tried once).



Could you try booting with panic=1 so the kernel panics on the first
oops?



Sorry that should be "oops=panic"



Maybe that way (if we are lucky) we will be able to capture the first
oops, and maybe get an idea of what causes the problem.


Sorry Rik, I get all kind of weird behaviors (wireless dies, compiling
gets stuck and is impossible to kill, can't kill X) with the 4
patches+oops=panic but no trace. Here after is 7+1 patches with
oops=panic boot: http://i.imgur.com/1jep1qx.jpg


This may be a stupid question, but you re-compile and re-install
the kernel modules every time you changed the kernel?

The behaviour you report with just the first four patches is so
random, it sounds almost like a mismatched data structure between
compiles...

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Emmanuel Benisty
On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel  wrote:
>>> With the first four patches only, I got some X server freeze (just
>>> tried once).
>>
>>
>> Could you try booting with panic=1 so the kernel panics on the first
>> oops?
>
>
> Sorry that should be "oops=panic"
>
>
>> Maybe that way (if we are lucky) we will be able to capture the first
>> oops, and maybe get an idea of what causes the problem.

Sorry Rik, I get all kind of weird behaviors (wireless dies, compiling
gets stuck and is impossible to kill, can't kill X) with the 4
patches+oops=panic but no trace. Here after is 7+1 patches with
oops=panic boot: http://i.imgur.com/1jep1qx.jpg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Emmanuel Benisty
On Mon, Mar 25, 2013 at 9:01 PM, Rik van Riel  wrote:
> On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:
>>
>> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
>>  wrote:
>>>
>>> And you never see this problem without Rik's patches?
>>
>>
>> No, never.
>>
>>> Could you bisect
>>> *which* patch it starts with? Are the first four ones ok (the moving
>>> of the locking around, but without the fine-grained ones), for
>>> example?
>>
>>
>> With the first four patches only, I got some X server freeze (just tried
>> once).
>
>
> Another random question, just to rule something out. What video driver
> are you using on your system?

intel on i3 330m
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Rik van Riel

On 03/25/2013 10:00 AM, Rik van Riel wrote:

On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:

On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
 wrote:

And you never see this problem without Rik's patches?


No, never.


Could you bisect
*which* patch it starts with? Are the first four ones ok (the moving
of the locking around, but without the fine-grained ones), for
example?


With the first four patches only, I got some X server freeze (just
tried once).


Could you try booting with panic=1 so the kernel panics on the first
oops?


Sorry that should be "oops=panic"


Maybe that way (if we are lucky) we will be able to capture the first
oops, and maybe get an idea of what causes the problem.


--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Rik van Riel

On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:

On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
 wrote:

And you never see this problem without Rik's patches?


No, never.


Could you bisect
*which* patch it starts with? Are the first four ones ok (the moving
of the locking around, but without the fine-grained ones), for
example?


With the first four patches only, I got some X server freeze (just tried once).


Another random question, just to rule something out. What video driver
are you using on your system?

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Rik van Riel

On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:

On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
 wrote:

And you never see this problem without Rik's patches?


No, never.


Could you bisect
*which* patch it starts with? Are the first four ones ok (the moving
of the locking around, but without the fine-grained ones), for
example?


With the first four patches only, I got some X server freeze (just tried once).


Could you try booting with panic=1 so the kernel panics on the first
oops?

Maybe that way (if we are lucky) we will be able to capture the first
oops, and maybe get an idea of what causes the problem.

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-25 Thread Emmanuel Benisty
On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
 wrote:
> And you never see this problem without Rik's patches?

No, never.

> Could you bisect
> *which* patch it starts with? Are the first four ones ok (the moving
> of the locking around, but without the fine-grained ones), for
> example?

With the first four patches only, I got some X server freeze (just tried once).

> Another thing to try might be to enable SLUB debugging (ie make sure that all 
> of
>
>   CONFIG_SLUB_DEBUG=y
>   CONFIG_SLUB=y
>   CONFIG_SLUB_DEBUG_ON=y
>
> are set in your kernel config)

My bad, I forgot to enable this in my former build, sorry. Here's what
I get now:
http://i.imgur.com/G6H8KgD.jpg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-24 Thread Linus Torvalds
On Sun, Mar 24, 2013 at 6:46 AM, Emmanuel Benisty  wrote:
>
> Thanks Linus. I hope I got this right, here's the result (3.9-rc4, 7+1
> patches): http://i.imgur.com/BebGZxV.jpg

Ok, that's *slightly* more informative, but not much. At least now we
see the actual page fault information, and see what the bad
dereference was.

It seems to be a branch through the rcu list "->func" pointer in the
rcu callbacks, and the ->func pointer has been corrupted. Instead of
being a valid kernel pointer (or a "kfree_rcu_offset" marker, which is
a small number between 0-4096), it has the odd value
"0x00640064". Two words of decimal "100", in other words.

That's not one of the usual "use-after-free" patters or anything like
that, so I don't see what it would be. So I have to admit to not
really having any better clue about what is going on. Sometimes the
corruption pattern give a hint of what it was that overwrote it, but
not here..

And you never see this problem without Rik's patches? Could you bisect
*which* patch it starts with? Are the first four ones ok (the moving
of the locking around, but without the fine-grained ones), for
example?

Another thing to try might be to enable SLUB debugging (ie make sure that all of

  CONFIG_SLUB_DEBUG=y
  CONFIG_SLUB=y
  CONFIG_SLUB_DEBUG_ON=y

are set in your kernel config), which might help pin things down a
bit. Sometimes that makes any allocation problems show up earlier in
the path, so that it's more obvious who screwed up.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-24 Thread Emmanuel Benisty
On Sun, Mar 24, 2013 at 2:45 AM, Linus Torvalds
 wrote:
> On Fri, Mar 22, 2013 at 8:19 PM, Emmanuel Benisty  wrote:
>>
>> I could reproduce it but could you please let me know what would be
>> the right tools I should use to catch the original oops?
>> This is what I got but I doubt it will be helpful:
>> http://i.imgur.com/Mewi1hC.jpg
>
> In this case, I think the best thing to do would be to comment out all
> of drm_warn_on_modeset_not_all_locked(), because those warnings make
> the original problem (that probably caused the lock problem in the
> first place that it is warning about) scroll away.
>
> That said, you should also take the oneliner fix that Rik posted to
> patch 7 (subject line: "[PATCH 7/7 part3] fix for sem_lock") and apply
> that, just to make sure that you aren't possibly hitting a bug with
> the shared-memory locking introduced by that (unusual) case.

Thanks Linus. I hope I got this right, here's the result (3.9-rc4, 7+1
patches): http://i.imgur.com/BebGZxV.jpg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-23 Thread Linus Torvalds
On Fri, Mar 22, 2013 at 8:19 PM, Emmanuel Benisty  wrote:
>
> I could reproduce it but could you please let me know what would be
> the right tools I should use to catch the original oops?
> This is what I got but I doubt it will be helpful:
> http://i.imgur.com/Mewi1hC.jpg

In this case, I think the best thing to do would be to comment out all
of drm_warn_on_modeset_not_all_locked(), because those warnings make
the original problem (that probably caused the lock problem in the
first place that it is warning about) scroll away.

That said, you should also take the oneliner fix that Rik posted to
patch 7 (subject line: "[PATCH 7/7 part3] fix for sem_lock") and apply
that, just to make sure that you aren't possibly hitting a bug with
the shared-memory locking introduced by that (unusual) case.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-22 Thread Emmanuel Benisty
Hi Linus,

On Fri, Mar 22, 2013 at 10:37 PM, Linus Torvalds
 wrote:
> On Fri, Mar 22, 2013 at 4:04 AM, Emmanuel Benisty  wrote:
>>
>> I was trying your patchset and my machine died while building a
>> package. I could reproduce the bug the (only) two times I tried.
>> There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg
>
> Hmm. The original oops may well have scrolled off the window, what
> remains is just indicating something is corrupted in slab. Which isn't
> likely to give much hints..
>
> Can you try to reproduce this with CONFIG_SLUB_DEBUG=y and
> CONFIG_SLUB_DEBUG_ON=y, and see if there are any earlier messages?

I could reproduce it but could you please let me know what would be
the right tools I should use to catch the original oops?
This is what I got but I doubt it will be helpful:
http://i.imgur.com/Mewi1hC.jpg

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-22 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Buesso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
>   vanilla Davidlohr's Davidlohr's +   Davidlohr's +
> threads   patches rwlock patches  v3 patches
> 10610652  726325  1783589 2142206
> 20341570  365699  1520453 1977878
> 30288102  307037  1498167 2037995
> 40290714  305955  1612665 2256484
> 50288620  312890  1733453 2650292
> 60289987  306043  1649360 2388008
> 70291298  306347  1723167 2717486
> 80290948  305662  1729545 2763582
> 90290996  306680  1736021 2757524
> 100   292243  306700  1773700 3059159
> 

Some results with semop-multi on my 4 core laptop:

vanilla v3 patchset
threads
10   5094473 10289146
20   5079946 10187923
30   5041258 10660635
40   4942786 10876009
50   5076437 10759434
60   5139024 10797032
70   5103811 10698323
80   5094850  9959675
90   5085774 10054844
100  4939547  9798291




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-22 Thread Linus Torvalds
On Fri, Mar 22, 2013 at 4:04 AM, Emmanuel Benisty  wrote:
>
> I was trying your patchset and my machine died while building a
> package. I could reproduce the bug the (only) two times I tried.
> There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg

Hmm. The original oops may well have scrolled off the window, what
remains is just indicating something is corrupted in slab. Which isn't
likely to give much hints..

Can you try to reproduce this with CONFIG_SLUB_DEBUG=y and
CONFIG_SLUB_DEBUG_ON=y, and see if there are any earlier messages?

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-22 Thread Emmanuel Benisty
Hi Rik,

On Thu, Mar 21, 2013 at 2:55 AM, Rik van Riel  wrote:
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

I was trying your patchset and my machine died while building a
package. I could reproduce the bug the (only) two times I tried.
There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg

Thanks.
-- Emmanuel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-22 Thread Mike Galbraith
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote: 
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Buesso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
>   vanilla Davidlohr's Davidlohr's +   Davidlohr's +
> threads   patches rwlock patches  v3 patches
> 10610652  726325  1783589 2142206
> 20341570  365699  1520453 1977878
> 30288102  307037  1498167 2037995
> 40290714  305955  1612665 2256484
> 50288620  312890  1733453 2650292
> 60289987  306043  1649360 2388008
> 70291298  306347  1723167 2717486
> 80290948  305662  1729545 2763582
> 90290996  306680  1736021 2757524
> 100   292243  306700  1773700 3059159

Hi Rik,

I plugged this set into enterprise -rt kernel, and beat on four boxen.
I ran into no trouble while giving boxen a generic drubbing fwtw.

Some semop-multi -rt numbers for an abby-normal 8 node box, and a
mundane 4 node box below.


32 cores+SMT, 3.0.68-rt92

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 32733 MB
node 0 free: 29910 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 32768 MB
node 1 free: 30396 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 32768 MB
node 2 free: 30568 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 32767 MB
node 3 free: 28706 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10 

SCHED_OTHER
   -v3 set+v3 set
threads
10  4384851744599
20  3764111580813
30  3966391546841
40  4490622152861
50  4775942431344
60  4464531874476
70  5787052047884
80  6079282144898
90  6621362171074
100 6228892295879
200 7096682867273
300 6619023008695
400 6417583273250
500 6141173403775

SCHED_FIFO
   -v3 set+v3 set
threads
10 158656  914343
20  99784 1133775
30  84245 1099604
40  89725 1756577
50  85697 1607893
60  84033 1467272
70  86833 1979405
80  91451 1922250
90  92484 1990699
100 90691 2067705
200103692 2308652
300101161 2376921
400103722 2417580
500108838 2443349


64 core box (poor thing, smi free zone though), 3.0.68-rt92

numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 8181 MB
node 0 free: 6309 MB
node distances:
node   0
  0:  10

SCHED_OTHER
   -v3 set+v3 set
threads
10  6775342304417
20  4515071873474
30  3568761542688
40  3295851500392
50  4157611318676
60  4034821380644
70  3940891185165
80  4074081191834
90  4457641249156
100 4308231245573
200 4254701421686
300 4270921480379
400 4979001516599
500 4219271551309

SCHED_FIFO
10  3235601882543
20  2262571806862
30  1878511263163
40  205337 881331
50  196806 76
60  193218 612709
70  2094321241797
80  2404451269146
90  2198651482649
100 2279701473038
200 2013541719977
300 1837281823074
400 1750511808792
500 2437081849803


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Rik van Riel

On 03/21/2013 09:23 PM, Linus Torvalds wrote:

On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso  wrote:


ipc lock contention:
100 users:  8,74%  (vanilla)3.17% (v3 patchset)
400 users:  21,86% (vanilla)5.23% (v3 patchset)
800 users   84,35% (vanilla)7.39% (v3 patchset)


Ok, I'd call that pretty much "solved". Sure, it's still visible, but
for being a benchmark that apparently does little else than pound on
those sysv semaphores, I think we can consider it pretty much fine.
I'm going to assume that anybody who actually then does any real work
(ie a database) is never going to see even close to this bad
contention.

Good job, Rik. I'm assuming we'll be merging this during the 3.10
merge window, and hopefully the merge conflicts will be sorted out
too. Rik, Peter, can you look at each others patches and see if you
can get that sorted out for Andrew?


Will do.

I will rebase this series on top of what is in linux-next.

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Rik van Riel

On 03/21/2013 06:01 PM, Andrew Morton wrote:

On Thu, 21 Mar 2013 17:50:05 -0400 Peter Hurley  
wrote:


On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:

On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel  wrote:


This series makes the sysv semaphore code more scalable,
by reducing the time the semaphore lock is held, and making
the locking more scalable for semaphore arrays with multiple
semaphores.

The first four patches were written by Davidlohr Buesso, and
reduce the hold time of the semaphore lock.

The last three patches change the sysv semaphore code locking
to be more fine grained, providing a performance boost when
multiple semaphores in a semaphore array are being manipulated
simultaneously.


These patches conflict pretty badly with Peter's:


On one point I'm a little confused: my series has been in linux-next for
a while. On what tree is this series based?


It'll be based on mainline.  People often forget to peek into
linux-next when preparing patches.  In the great majority of cases
that's OK.  Occasionally, we lose...


I'll be happy to rebase the series onto linux-next, to make
merging easier for you.

--
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Linus Torvalds
On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso  wrote:
>
> ipc lock contention:
> 100 users:  8,74%  (vanilla)3.17% (v3 patchset)
> 400 users:  21,86% (vanilla)5.23% (v3 patchset)
> 800 users   84,35% (vanilla)7.39% (v3 patchset)

Ok, I'd call that pretty much "solved". Sure, it's still visible, but
for being a benchmark that apparently does little else than pound on
those sysv semaphores, I think we can consider it pretty much fine.
I'm going to assume that anybody who actually then does any real work
(ie a database) is never going to see even close to this bad
contention.

Good job, Rik. I'm assuming we'll be merging this during the 3.10
merge window, and hopefully the merge conflicts will be sorted out
too. Rik, Peter, can you look at each others patches and see if you
can get that sorted out for Andrew?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Buesso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
>   vanilla Davidlohr's Davidlohr's +   Davidlohr's +
> threads   patches rwlock patches  v3 patches
> 10610652  726325  1783589 2142206
> 20341570  365699  1520453 1977878
> 30288102  307037  1498167 2037995
> 40290714  305955  1612665 2256484
> 50288620  312890  1733453 2650292
> 60289987  306043  1649360 2388008
> 70291298  306347  1723167 2717486
> 80290948  305662  1729545 2763582
> 90290996  306680  1736021 2757524
> 100   292243  306700  1773700 3059159
> 

After testing these patches with my Oracle Swingbench DSS workload, I
can say that there are significant improvements. The ipc lock contention
was reduced drastically, specially with higher amounts of benchmark
users. As a result, the overall %sys time went down as well.
Furthermore, throughput (in transactions per second) was increased.

TPS:
100 users: 1257.21 (vanilla)2805.06 (v3 patchset)
400 users: 1437.57 (vanilla)2664.67 (v3 patchset)
800 users: 1236.89 (vanilla)2750.73 (v3 patchset)

ipc lock contention:
100 users:  8,74%  (vanilla)3.17% (v3 patchset)
400 users:  21,86% (vanilla)5.23% (v3 patchset)
800 users   84,35% (vanilla)7.39% (v3 patchset) 

As seen with perf, the ipc lock isn't even the main source of contention
anymore. Also, no matter how many benchmark users,  the lock's user is
mostly semctl_main() .

100 users:
3.17%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock  
  
 |
 --- _raw_spin_lock
|  
|--50.53%-- sem_lock
|  |  
|  |--82.60%-- semctl_main
|   --17.40%-- sys_semtimedop

400 users:
5.23%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock  
  
 |
 --- _raw_spin_lock
|  
|--75.81%-- sem_lock
|  |  
|  |--94.09%-- semctl_main
|   --5.91%-- sys_semtimedop


800 users:
 7.39%   oracle  [kernel.kallsyms]   [k] _raw_spin_lock 
   
 |
 --- _raw_spin_lock
|  
|--81.71%-- sem_lock
|  |  
|  |--64.98%-- semctl_main
|   --35.02%-- sys_semtimedop


Thanks,
Davidlohr


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Andrew Morton
On Thu, 21 Mar 2013 17:50:05 -0400 Peter Hurley  
wrote:

> On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:
> > On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel  wrote:
> > 
> > > This series makes the sysv semaphore code more scalable,
> > > by reducing the time the semaphore lock is held, and making
> > > the locking more scalable for semaphore arrays with multiple
> > > semaphores.
> > > 
> > > The first four patches were written by Davidlohr Buesso, and
> > > reduce the hold time of the semaphore lock.
> > > 
> > > The last three patches change the sysv semaphore code locking
> > > to be more fine grained, providing a performance boost when
> > > multiple semaphores in a semaphore array are being manipulated
> > > simultaneously.
> > 
> > These patches conflict pretty badly with Peter's:
> 
> On one point I'm a little confused: my series has been in linux-next for
> a while. On what tree is this series based?

It'll be based on mainline.  People often forget to peek into
linux-next when preparing patches.  In the great majority of cases
that's OK.  Occasionally, we lose...

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Peter Hurley
On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:
> On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel  wrote:
> 
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> > 
> > The first four patches were written by Davidlohr Buesso, and
> > reduce the hold time of the semaphore lock.
> > 
> > The last three patches change the sysv semaphore code locking
> > to be more fine grained, providing a performance boost when
> > multiple semaphores in a semaphore array are being manipulated
> > simultaneously.
> 
> These patches conflict pretty badly with Peter's:

On one point I'm a little confused: my series has been in linux-next for
a while. On what tree is this series based?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Peter Hurley
On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:
> On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel  wrote:
> 
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> > 
> > The first four patches were written by Davidlohr Buesso, and
> > reduce the hold time of the semaphore lock.
> > 
> > The last three patches change the sysv semaphore code locking
> > to be more fine grained, providing a performance boost when
> > multiple semaphores in a semaphore array are being manipulated
> > simultaneously.
> 
> These patches conflict pretty badly with Peter's:
> 
> ipc-clamp-with-min.patch
> ipc-separate-msg-allocation-from-userspace-copy.patch
> ipc-tighten-msg-copy-loops.patch
> ipc-set-efault-as-default-error-in-load_msg.patch
> ipc-remove-msg-handling-from-queue-scan.patch
> ipc-implement-msg_copy-as-a-new-receive-mode.patch
> ipc-simplify-msg-list-search.patch
> ipc-refactor-msg-list-search-into-separate-function.patch
> ipc-refactor-msg-list-search-into-separate-function-fix.patch
> 
> (they're at http://ozlabs.org/~akpm/mmots/broken-out/)
> 
> 
> 
> We're in a bit of a mess at present.
> 
> Last month Peter sent a ten-patch series which fixed an oops (Subject:
> "ipc MSG_COPY fixes").  The series did other stuff, so we merged into
> mainline just two bugfix patches.  Then davej hit a trinity oops
> (Subject: "ipc/testmsg GPF.") and testing confirmed that the remaining
> eight patches in Peter's series fixed that up.

Just to clarify the history.

A while back on linux-next both Dave and I were hitting ipc oopses on
trinity, which was fallout from the MSG_COPY implementation.

One of those oopses was the ipc/testmsg oops.

I was trying to push out the new tty ldisc patchset and trinity is a
decent tool for exploring race conditions (which the tty layer was full
of), so the oopses were in my way.

Because it was nearly impossible to static analyze the existing ipc
code, I just started cleaning up. As I did, I found the two
by-then-obvious bug fixes. Also, assigning the 'msg' variable in the
search loop looked suspicious so I factored that search loop out as
well. That was the whole 10-patch "ipc MSG_COPY fixes" series.

I wasn't getting the ipc/testmsg bug anymore and I let Dave know, so all
was good.

When Andrew asked me if the whole series needed to go into 3.9, I said I
didn't know.

So when just the 2 of 10 patches went in, Dave started to hit the
ipc/testmsg bug again on 3.9.

> Nobody has yet
> identified which of the eight does the good deed.  So we need to
> either:
> 
> a) revert the original two fixes "ipc: fix potential oops when src
>msg > 4k w/ MSG_COPY" and "ipc: don't allocate a copy larger than
>max" or 
> 
> b) try reverting "ipc: don't allocate a copy larger than max", as
>"ipc: fix potential oops when src msg > 4k w/ MSG_COPY" looks pretty
>harmless or
> 
> c) work out which of the remaining 8 fixes the new oops and merge that or
> 
> d) merge all 8 fixes or
> 
> e) something else.


The fix is in the other 8 patches. But I have to say, the base code was
such a mess I have no idea which of the other 8 patches fixes the
ipc/testmsg oops or how.

I guarantee reverting those 2 fixes from my series will not make the
ipc/testmsg oops go away.


> Whichever way we go, we should get a wiggle on - this has been hanging
> around for too long.  Dave, do you have time to determine whether
> reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
> than max") fixes things up?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-21 Thread Andrew Morton
On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel  wrote:

> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Buesso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.

These patches conflict pretty badly with Peter's:

ipc-clamp-with-min.patch
ipc-separate-msg-allocation-from-userspace-copy.patch
ipc-tighten-msg-copy-loops.patch
ipc-set-efault-as-default-error-in-load_msg.patch
ipc-remove-msg-handling-from-queue-scan.patch
ipc-implement-msg_copy-as-a-new-receive-mode.patch
ipc-simplify-msg-list-search.patch
ipc-refactor-msg-list-search-into-separate-function.patch
ipc-refactor-msg-list-search-into-separate-function-fix.patch

(they're at http://ozlabs.org/~akpm/mmots/broken-out/)



We're in a bit of a mess at present.

Last month Peter sent a ten-patch series which fixed an oops (Subject:
"ipc MSG_COPY fixes").  The series did other stuff, so we merged into
mainline just two bugfix patches.  Then davej hit a trinity oops
(Subject: "ipc/testmsg GPF.") and testing confirmed that the remaining
eight patches in Peter's series fixed that up.  Nobody has yet
identified which of the eight does the good deed.  So we need to
either:

a) revert the original two fixes "ipc: fix potential oops when src
   msg > 4k w/ MSG_COPY" and "ipc: don't allocate a copy larger than
   max" or 

b) try reverting "ipc: don't allocate a copy larger than max", as
   "ipc: fix potential oops when src msg > 4k w/ MSG_COPY" looks pretty
   harmless or

c) work out which of the remaining 8 fixes the new oops and merge that or

d) merge all 8 fixes or

e) something else.

Whichever way we go, we should get a wiggle on - this has been hanging
around for too long.  Dave, do you have time to determine whether
reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
than max") fixes things up?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-20 Thread Davidlohr Bueso
On Wed, 2013-03-20 at 13:49 -0700, Linus Torvalds wrote:
> On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel  wrote:
> >
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> The series looks sane to me, and I like how each individual step is
> pretty small and makes sense.
> 
> It *would* be lovely to see this run with the actual Swingbench
> numbers. The microbenchmark always looked much nicer. Do the
> additional multi-semaphore scalability patches on top of Davidlohr's
> patches help with the swingbench issue, or are we still totally
> swamped by the ipc lock there?

Yes, I'm testing this patchset with my swingbench workloads. I should
have some numbers by today or tomorrow.

> 
> Maybe there were already numbers for that, but the last swingbench
> numbers I can actually recall was from before the finer-grained
> locking..

Right, I couldn't get Oracle to run on the with the previous patches,
hopefully the bug(s) are now addressed.

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-20 Thread Linus Torvalds
On Wed, Mar 20, 2013 at 1:49 PM, Linus Torvalds
 wrote:
>
> It *would* be lovely to see this run with the actual Swingbench
> numbers. The microbenchmark always looked much nicer. Do the
> additional multi-semaphore scalability patches on top of Davidlohr's
> patches help with the swingbench issue, or are we still totally
> swamped by the ipc lock there?
>
> Maybe there were already numbers for that, but the last swingbench
> numbers I can actually recall was from before the finer-grained
> locking..

Ok, and if the spinlock is still a big deal even with the finer
granularity, it might be interesting to hear if Michel's fast locks
make a difference. I'm guessing that this series might actually make
it easier/cleaner to do the semaphore locking using another lock,
since the ipc_lock got split up and out...

I think Michel did it for some socket code too. I think that was
fairly independent and was for netperf.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ipc,sem: sysv semaphore scalability

2013-03-20 Thread Linus Torvalds
On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel  wrote:
>
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

The series looks sane to me, and I like how each individual step is
pretty small and makes sense.

It *would* be lovely to see this run with the actual Swingbench
numbers. The microbenchmark always looked much nicer. Do the
additional multi-semaphore scalability patches on top of Davidlohr's
patches help with the swingbench issue, or are we still totally
swamped by the ipc lock there?

Maybe there were already numbers for that, but the last swingbench
numbers I can actually recall was from before the finer-grained
locking..

And obviously, getting this tested so that there aren't any more
missed wakeups etc would be lovely. I'm assuming the plan is that this
all goes through Andrew? Do we have big semop users who could test it
on real loads? Considering that I *suspect* the main users are things
like Oracle etc, I'd assume that there's some RH lab or partner or
similar that is interested in making sure this not only helps, but
also that it doesn't break anything ;)

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


ipc,sem: sysv semaphore scalability

2013-03-20 Thread Rik van Riel
Include lkml in the CC: this time... *sigh*
---8<---

This series makes the sysv semaphore code more scalable,
by reducing the time the semaphore lock is held, and making
the locking more scalable for semaphore arrays with multiple
semaphores.

The first four patches were written by Davidlohr Buesso, and
reduce the hold time of the semaphore lock.

The last three patches change the sysv semaphore code locking
to be more fine grained, providing a performance boost when
multiple semaphores in a semaphore array are being manipulated
simultaneously.

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores, look like this:

vanilla Davidlohr's Davidlohr's +   Davidlohr's +
threads patches rwlock patches  v3 patches
10  610652  726325  1783589 2142206
20  341570  365699  1520453 1977878
30  288102  307037  1498167 2037995
40  290714  305955  1612665 2256484
50  288620  312890  1733453 2650292
60  289987  306043  1649360 2388008
70  291298  306347  1723167 2717486
80  290948  305662  1729545 2763582
90  290996  306680  1736021 2757524
100 292243  306700  1773700 3059159

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/