Re: ipc,sem: sysv semaphore scalability
On Sat, 4 May 2013 20:12:45 +0200, Borislav Petkov wrote: > On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote: > > Blockconsole currently lives here: > > https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ > > Tja, if only that were upstream... Linus has a pull request. If he prefers to change code until the relevant bits fit into 80x25, that is his choice. ;) Jörn -- The only good bug is a dead bug. -- Starship Troopers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote: > Blockconsole currently lives here: > https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ Tja, if only that were upstream... -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 23 March 2013 10:19:16 +0700, Emmanuel Benisty wrote: > > I could reproduce it but could you please let me know what would be > the right tools I should use to catch the original oops? > This is what I got but I doubt it will be helpful: > http://i.imgur.com/Mewi1hC.jpg You could use either netconsole or blockconsole. Netconsole requires a second machine to capture the information, blockconsole requires a usb key or something similar to write to. In both cases you will get the entire console output in a file, often including very helpful messages leading up to the crash. Blockconsole currently lives here: https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/ Jörn -- Unless something dramatically changes, by 2015 we'll be largely wondering what all the fuss surrounding Linux was really about. -- Rob Enderle -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/29/2013 03:01 PM, Dave Jones wrote: On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > > > Whichever way we go, we should get a wiggle on - this has been hanging > > > around for too long. Dave, do you have time to determine whether > > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > > than max") fixes things up? > > > > Ok, with that reverted it's been grinding away for a few hours without incident. > > Normally I see the oops within a minute or so. > > > > OK, thanks, I queued a revert: > > From: Andrew Morton > Subject: revert "ipc: don't allocate a copy larger than max" > > Revert 88b9e456b164. Dave has confirmed that this was causing oopses > during trinity testing. I owe Peter an apology. I just hit it again with that backed out. Andrew, might as well drop that revert. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [] testmsg.isra.5+0x1a/0x60 [...snip...] I think I wasn't seeing that this last week because I had inadvertantly disabled DEBUG_PAGEALLOC and.. we're back to square one. Dave Andrew, I just realized you're still carrying commit 4bea54c91bcc5451f237e6b721b0b35eccd01d17 Author: Andrew Morton Date: Fri Apr 26 10:55:12 2013 +1000 revert "ipc: don't allocate a copy larger than max" Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley Cc: Stanislav Kinsbursky Reported-by: Dave Jones Cc: Signed-off-by: Andrew Morton Please drop. As quoted above, the testing on mainline that attributed the observed oops to the reverted patch was due to a config error. As Linus pointed out below, this patch fixes real bugs. On 04/02/2013 03:53 PM, Sasha Levin wrote: > On 04/02/2013 01:52 PM, Linus Torvalds wrote: >> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: >>> >>> By just playing with the 'msgsz' parameter with MSG_COPY set. >> >> Hmm. Looking closer, I suspect you're testing without commit >> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That >> should limit the size passed in to prepare_copy -> load_copy to >> msg_ctlmax. > > That commit has a revert in the -next trees, do we need a revert > of the revert? > >commit ff6577a3e714ccae02d4400e989762c19c37b0b3 >Author: Andrew Morton >Date: Wed Mar 27 10:24:02 2013 +1100 > >revert "ipc: don't allocate a copy larger than max" > >Revert 88b9e456b164. Dave has confirmed that this was causing oopses >during trinity testing. > >Cc: Peter Hurley >Cc: Stanislav Kinsbursky >Reported-by: Dave Jones >Cc: >Signed-off-by: Andrew Morton -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 11:35:33 -0700 Andrew Morton wrote: > Do we need the locking at all? What does it actually do? > > sem_lock_and_putref(sma); > if (sma->sem_perm.deleted) { > sem_unlock(sma, -1); > err = -EIDRM; > goto out_free; > } > sem_unlock(sma, -1); > > We're taking the lock, testing an int and then dropping the lock. > What's the point in that? Rikpoke. The new semctl_main() is now taking a lock, testing sma->sem_perm.deleted then dropping that lock. It looks wrong. What is that lock testing against? What prevents .deleted from changing value 1ns after we dropped that lock? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 02, 2013 at 03:53:01PM -0400, Sasha Levin wrote: > On 04/02/2013 01:52 PM, Linus Torvalds wrote: > > On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: > >> > >> By just playing with the 'msgsz' parameter with MSG_COPY set. > > > > Hmm. Looking closer, I suspect you're testing without commit > > 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That > > should limit the size passed in to prepare_copy -> load_copy to > > msg_ctlmax. > > That commit has a revert in the -next trees, do we need a revert > of the revert? Yeah, I told Andrew to drop that, but I think he's travelling. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 04/02/2013 01:52 PM, Linus Torvalds wrote: > On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: >> >> By just playing with the 'msgsz' parameter with MSG_COPY set. > > Hmm. Looking closer, I suspect you're testing without commit > 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That > should limit the size passed in to prepare_copy -> load_copy to > msg_ctlmax. That commit has a revert in the -next trees, do we need a revert of the revert? commit ff6577a3e714ccae02d4400e989762c19c37b0b3 Author: Andrew Morton Date: Wed Mar 27 10:24:02 2013 +1100 revert "ipc: don't allocate a copy larger than max" Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley Cc: Stanislav Kinsbursky Reported-by: Dave Jones Cc: Signed-off-by: Andrew Morton Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: > > By just playing with the 'msgsz' parameter with MSG_COPY set. Hmm. Looking closer, I suspect you're testing without commit 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That should limit the size passed in to prepare_copy -> load_copy to msg_ctlmax. Now, I think it's possibly still a good idea to limit bufsz to INT_MAX regardless, but as far as I can see that prepare_copy -> load_copy path is the only place that can get confused. Everybody else uses size_t (or "long" in the case of r_maxsize) as far as I can tell. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin wrote: > > If you guys are already looking at this, the conversions between size_t, > long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could > trigger anything from: Good catch. Let's just change the "(long)bufsz < 0" into "bufsz > INT_MAX". I suspect we should change some of the "int" arguments to "size_t" too so that we don't have these kinds of odd "different routines see different values due to subtle casting errors", but in the end we don't really want to ever help people have these kinds of potential overflow issues. We already limit normal read/write/sendmsg etc to INT_MAX (although we tend to *truncate* it to INT_MAX rather than return an error, but I think the simpler patch here is preferable unless somebody complains). Comments? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/29/2013 03:36 PM, Peter Hurley wrote: > On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote: >> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: >>> >>> Here's an oops I just hit.. >>> >>> BUG: unable to handle kernel NULL pointer dereference at 000f >>> IP: [] testmsg.isra.5+0x1a/0x60 >> >> Btw, looking at the code leading up to this, what the f*ck is wrong >> with the IPC stuff? >> >> It's using the generic list stuff for most of the lists, but then it >> open-codes the accesses. >> >> So instead of using >> >>for_each_entry(walk_msg, &msq->q_messages, m_list) { >> .. >>} >> >> the ipc/msg.c code does all that by hand, with >> >>tmp = msq->q_messages.next; >>while (tmp != &msq->q_messages) { >> struct msg_msg *walk_msg; >> >> walk_msg = list_entry(tmp, struct msg_msg, m_list); >> ... >> tmp = tmp->next; >>} >> >> Ugh. The code is near unreadable. And then it has magic memory >> barriers etc, implying that it doesn't lock the data structures, but >> no comments about them. See expunge_all() and pipelined_send(). >> >> The code seems entirely random, and it's badly set up (annoyance of >> the day: crazy helper functions in ipc/msgutil.c to make sure that (a) >> you have to spend more effort looking for them, and (b) they won't get >> inlined). >> >> Clearly nobody has cared for the crazy IPC message code in a long time. > > Exactly that's what my patch series does; clean this mess up. > > This is what I wrote to Andrew a couple of days ago. > > On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote: > I just figured out how the queue is being corrupted and why my series >> fixes it. >> >> >> With MSG_COPY set, the queue scan can exit with the local variable > 'msg' >> pointing to a real msg if the msg_counter never reaches the > copy_number. >> >> The remaining execution looks like this: >> >> if (!IS_ERR(msg)) { >> >> if (msgflg & MSG_COPY) >> goto out_unlock; >> >> >> out_unlock: >> msg_unlock(msq); >> break; >> } >> } >> if (IS_ERR(msg)) >> >> >> bufsz = msg_handler(); >> free_msg(msg); << msg never unlinked >> >> >> Since the msg should not have been found (because it failed the match >> criteria), the if (!IS_ERR(msg)) clause should never have executed. >> >> That's why my refactor fixes resolve this; because msg is not >> inadvertently treated as a found msg. >> >> But let's be honest; the real bug here is the poor structure of this >> function that even made this possible. The deepest nesting executes a >> goto to a label in the middle of an if clause. Yuck! No wonder this >> thing's fragile. >> >> So my recommendation still stands. The series that fixes this has been >> getting tested in linux-next for a month. Fixing this some other way > is >> just asking for more trouble. >> >> But why not just revert MSG_COPY altogether for 3.9? If you guys are already looking at this, the conversions between size_t, long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could trigger anything from: [ 33.046572] BUG: unable to handle kernel paging request at 88003c2c7000 [ 33.047721] IP: [] bad_from_user+0x4/0x6 [ 33.048528] PGD 7232067 PUD 7233067 PMD 3067 PTE 80003c2c7060 [ 33.049506] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 33.050029] Modules linked in: [ 33.050029] CPU 0 [ 33.050029] Pid: 6885, comm: a.out Tainted: GW 3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321 [ 33.050029] RIP: 0010:[] [] bad_from_user+0x4/0x6 [ 33.050029] RSP: 0018:88003462be40 EFLAGS: 00010246 [ 33.050029] RAX: RBX: fffb RCX: ff06ae2b [ 33.050029] RDX: fffb RSI: 7fffed36d2a0 RDI: 88003c2c7000 [ 33.050029] RBP: 88003462be88 R08: 0280 R09: [ 33.050029] R10: R11: R12: fffb [ 33.050029] R13: 7fffed36d2a0 R14: R15: [ 33.050029] FS: 7f6990044700() GS:88003dc0() knlGS: [ 33.050029] CS: 0010 DS: ES: CR0: 80050033 [ 33.050029] CR2: 88003c2c7000 CR3: 347c8000 CR4: 000406f0 [ 33.050029] DR0: DR1: DR2: [ 33.050029] DR3: DR6: 0ff0 DR7: 0400 [ 33.050029] Process a.out (pid: 6885, threadinfo 88003462a000, task 880034cb3000) [ 33.050029] Stack: [ 33.050029] 8192a6a9 88003462be98 88003b331e00 88003ddd01e0 [ 33.050029] 0001 [ 33.050029] 88003462bf68 8192bb34 [ 33.050029] Call Trace: [ 33.050029] [] ? load_
Re: ipc,sem: sysv semaphore scalability
29.03.2013 22:43, Linus Torvalds пишет: On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > >Now that I have that reverted, I'm not seeing msgrcv traces any more, but >I've started seeing this.. > >general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you disable it? I think I foud at least one bug in the MSG_COPY stuff: it leaks the "copy" allocation if mode == SEARCH_LESSEQUAL Hello, Linus. Sorry, but I don't see copy allocation leak. Dummy message is allocated always in msgflg has MSG_COPY flag being set. Also prepare_copy() use msgtyp as a message number to copy and thus set it to 0. but maybe I'm misreading it. And that shouldn't cause the problem you see, but it's indicative of how badly tested and thought through the MSG_COPY code is. Totally UNTESTED leak fix appended. Stanislav? I don't see, how this patch can help. And we should not release it until copy is done in msg_handler, because msg is equal to copy. Dummy copy message is release either by free_copy() (if msg is error) or free_msg(). But there are two issues here definitely: 1) Poor SELinux support for message copying. This issue was addressed by Stephen Smalley here: https://lkml.org/lkml/2013/2/6/663 But look like he didn't sent the patch to Andrew. 2) Copy leak and queue corruption in case of copy message wasn't found (this was mentioned by Peter in another thread; thanks for catching this, Peter!), because msg will be a valid pointer and all the "message copy" clean up logic doesn't work. I like Peter's cleanup and fix series. But if it looks like too much changes for this old code, I have another small patch, which should fix the issue: ipc: set msg back to -EAGAIN if copy wasn't performed Make sure, that msg pointer is set back to error value in case of MSG_COPY flag is set and desired message to copy wasn't found. This garantees, that msg is either a error pointer or a copy address. Otherwise last message in queue will be freed without unlinking from queue (which leads to memory corruption) plus dummy allocated copy won't be released. Signed-off-by: Stanislav Kinsbursky diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf..fede1d0 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -872,6 +872,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, goto out_unlock; break; } + msg = ERR_PTR(-EAGAIN); } else break; msg_counter++; Linus patch.diff ipc/msg.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf6af27..b841508556cb 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, msg = copy_msg(msg, copy); if (IS_ERR(msg)) goto out_unlock; + copy = NULL; break; } } else @@ -976,10 +977,9 @@ out_unlock: break; } } - if (IS_ERR(msg)) { - free_copy(copy); + free_copy(copy); + if (IS_ERR(msg)) return PTR_ERR(msg); - } bufsz = msg_handler(buf, msg, bufsz); free_msg(msg); -- Best regards, Stanislav Kinsbursky -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sun, Mar 31, 2013 at 6:45 AM, Rik van Riel wrote: > > Should we use "semid" here, like Linus suggested, instead of "un->semid"? As Davidlohr noted, in linux-next the rcu read-lock is held over the whole thing, so no, un->semid should be stable once "un" has been re-looked-up under the semaphore lock. In mainline, the problem is that the "sem_lock_check()" is done with "un->semid" *after* we've dropped the RCU read-lock, so "un" at that point is not reliable (it could be free'd at any time underneath us). That said, I really *really* hate what both mainline and linux-next do with the RCU read lock, and linux-next is arguably worse. The whole "take the RCU lock in one place, and release it in another" is confusing and bug-prone as hell. And linux-next made it worse: now sem_lock() no longer takes the read-lock (it expects the caller to take it), but sem_unlock() still drops the read-lock. This is all just f*cking crazy. The rule should be that the rcu read-lock is always and released at the same "level". For example, find_alloc_undo() should just be called with (and unconditionaly return with) the rcu read-lock held, and if it needs to actually do an allocation, it can drop the rcu lock for the duration of the allocation. This whole "conditional locking" depending on error returns and on whether we have undo's etc is bug-prone and confusing. And when you have totally different locking rules for "sem_lock()" vs "sem_unlock()", you know you're confused. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Davidlohr, On Sun, Mar 31, 2013 at 12:01 PM, Davidlohr Bueso wrote: > Specially dropping the rcu read lock before the continue statement > (sorry for not mentioning this in the last email). I was missing this indeed, thanks. Still the same issues however... I'll do some more testing on the same machine but with a totally different environment, within tomorrow hopefully. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/31/2013 01:01 AM, Davidlohr Bueso wrote: diff --git a/ipc/sem.c b/ipc/sem.c index f257afe..74cedfe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk) struct sem_array *sma; struct sem_undo *un; struct list_head tasks; - int semid; - int i; + int semid, i; rcu_read_lock(); un = list_entry_rcu(ulp->list_proc.next, @@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk) } sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid); Should we use "semid" here, like Linus suggested, instead of "un->semid"? - sem_lock(sma, NULL, -1); - /* exit_sem raced with IPC_RMID, nothing to do */ - if (IS_ERR(sma)) + if (IS_ERR(sma)) { + rcu_read_unlock(); continue; + } + sem_lock(sma, NULL, -1); un = __lookup_undo(ulp, semid); if (un == NULL) { /* exit_sem raced with IPC_RMID+semget() that created -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote: > On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds > wrote: > > On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty > > wrote: > >> > >> Then I start building a random package and the problems start. They > >> may also happen without compiling but this seems to trigger the bug > >> quite quickly. > > > > I suspect it's about preemption, and the build just results in enough > > scheduling load that you start hitting whatever race there is. > > > >> Anyway, some progress here, I hope: dmesg seems to be > >> willing to reveal some secrets (using some pastebin service since this > >> is pretty big): > >> > >> https://gist.github.com/anonymous/5275120 > > > > That looks like exactly the exit_sem() bug that Davidlohr was talking > > about, where the > > > > /* exit_sem raced with IPC_RMID, nothing to do */ > > if (IS_ERR(sma)) > > continue; > > > > should be moved to *before* the > > > > sem_lock(sma, NULL, -1); > > > > call. And apparently the bug I had found is already fixed in -next. > > I just tried the 7 original patches + the 2 one liners from -next + > modified Linus' patch (attached) on the top of 3.9-rc4 using > PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained > above. I was building two packages at the same time, went away for 30 > seconds, came back and everything froze as soon as I touched the > laptop's touchpad. Maybe a coincidence but anyway... Another shot in > the dark, I had this weird message when trying to build gcc: > semop(2): encountered an error: Identifier removed *sigh*. I had high hopes for this being the bug triggering your issue, specially after seeing exit_sem() in the trace. Emmanuel, just to be sure, does your changes reflect the patch below? Specially dropping the rcu read lock before the continue statement (sorry for not mentioning this in the last email). Anyway, this is still a bug. Andrew, the patch below applies to linux-next, please queue this up if you don't have any objections. Thanks, Davidlohr ---8<--- From: Davidlohr Bueso Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma In exit_sem() we attempt to acquire the sma->sem_perm.lock by calling sem_lock() immediately after obtaining sma. However, if sma isn't valid, then calling sem_lock() will tend to do bad things. Move the sma error check right after the sem_obtain_object_check() call instead. Signed-off-by: Davidlohr Bueso --- ipc/sem.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index f257afe..74cedfe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk) struct sem_array *sma; struct sem_undo *un; struct list_head tasks; - int semid; - int i; + int semid, i; rcu_read_lock(); un = list_entry_rcu(ulp->list_proc.next, @@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk) } sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid); - sem_lock(sma, NULL, -1); - /* exit_sem raced with IPC_RMID, nothing to do */ - if (IS_ERR(sma)) + if (IS_ERR(sma)) { + rcu_read_unlock(); continue; + } + sem_lock(sma, NULL, -1); un = __lookup_undo(ulp, semid); if (un == NULL) { /* exit_sem raced with IPC_RMID+semget() that created -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Linus, On Sun, Mar 31, 2013 at 12:22 AM, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty > wrote: >> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds >>> >>> This came from the gcc build? >> >> yes, very early in the build process, IIRC this line was repeated a >> few times and the build just stalled. > > Ok, we're bringing out the crazy hacks now. > > The attached patch is just insane, doesn't really even work in > general, and only even compiles on 64-bit. But it should work in > *practice* to find if somebody adds the same RCU head to the RCU lists > twice, and ignore the second time it happens (and give a warning that > hopefully pinpoints the backtrace). > > It's ugly. It's broken. It may not work. In other words, I'm not proud > of it. But you seem to be the only one able to trigger the issue > easily, willing to try crazy crap, so "tag, you're it". Maybe this > gives us more information. And maybe it doesn't, and I'm totally wrong > about the whole "rcu head added twice" theory. That's all I could get so far: https://gist.github.com/anonymous/5279255 Losing wireless is generally the start signal of controlled demolition of the machine. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty wrote: > On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds >> >> This came from the gcc build? > > yes, very early in the build process, IIRC this line was repeated a > few times and the build just stalled. Ok, we're bringing out the crazy hacks now. The attached patch is just insane, doesn't really even work in general, and only even compiles on 64-bit. But it should work in *practice* to find if somebody adds the same RCU head to the RCU lists twice, and ignore the second time it happens (and give a warning that hopefully pinpoints the backtrace). It's ugly. It's broken. It may not work. In other words, I'm not proud of it. But you seem to be the only one able to trigger the issue easily, willing to try crazy crap, so "tag, you're it". Maybe this gives us more information. And maybe it doesn't, and I'm totally wrong about the whole "rcu head added twice" theory. Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds wrote: >> Another shot in >> the dark, I had this weird message when trying to build gcc: >> semop(2): encountered an error: Identifier removed > > This came from the gcc build? yes, very early in the build process, IIRC this line was repeated a few times and the build just stalled. > That's just crazy. No normal app uses sysv semaphores. I have an older > gcc build environment, and some grepping shows it has some ipc > semaphore use in the libstdc++ testsuite, and some libmudflap hooks, > but that should be very very minor. > I notice that your dmesg says that your kernel is compiled by > gcc-4.8.1 prerelease. Is there any chance that you could try to > install a known-stable gcc, like 4.7.2 or something. It's entirely > possible that it's a kernel bug even if it's triggered by some more > aggressive compiler optimization or something, but it would be really > good to try to see if this might be gcc-specific. I could build a kernel on another machine on which 4.7.2 is installed, kernel oops'd as well: http://i.imgur.com/uk6gmq1.jpg FWIW, I have a few things disabled in my config so here is the one I used, maybe I'm missing something (but again, everything works perfectly with your tree): https://gist.github.com/anonymous/5275566 Lastly, just FTR, I tried Andrew's 3.9-rc4-mm1 and got the same issues, unsurprisingly I guess. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:33 PM, Emmanuel Benisty wrote: > > I just tried the 7 original patches + the 2 one liners from -next + > modified Linus' patch (attached) .. that patch looks fine. > on the top of 3.9-rc4 using > PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained > above. I was building two packages at the same time, went away for 30 > seconds, came back and everything froze as soon as I touched the > laptop's touchpad. Maybe a coincidence but anyway... Another shot in > the dark, I had this weird message when trying to build gcc: > semop(2): encountered an error: Identifier removed This came from the gcc build? That's just crazy. No normal app uses sysv semaphores. I have an older gcc build environment, and some grepping shows it has some ipc semaphore use in the libstdc++ testsuite, and some libmudflap hooks, but that should be very very minor. You seem to trigger errors really trivially easily, which is really odd. It's sounding less and less like some subtle race, and more like the error just happens all the time. If you can make even the gcc build generate errors, I don't think they can be some rare blue-moon thing. I notice that your dmesg says that your kernel is compiled by gcc-4.8.1 prerelease. Is there any chance that you could try to install a known-stable gcc, like 4.7.2 or something. It's entirely possible that it's a kernel bug even if it's triggered by some more aggressive compiler optimization or something, but it would be really good to try to see if this might be gcc-specific. For example, I wonder if your gcc might miscompile idr_alloc() or something, so that we get the same ID for different ipc objects. That would certainly potentially cause chaos. Hmm? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty wrote: >> >> Then I start building a random package and the problems start. They >> may also happen without compiling but this seems to trigger the bug >> quite quickly. > > I suspect it's about preemption, and the build just results in enough > scheduling load that you start hitting whatever race there is. > >> Anyway, some progress here, I hope: dmesg seems to be >> willing to reveal some secrets (using some pastebin service since this >> is pretty big): >> >> https://gist.github.com/anonymous/5275120 > > That looks like exactly the exit_sem() bug that Davidlohr was talking > about, where the > > /* exit_sem raced with IPC_RMID, nothing to do */ > if (IS_ERR(sma)) > continue; > > should be moved to *before* the > > sem_lock(sma, NULL, -1); > > call. And apparently the bug I had found is already fixed in -next. I just tried the 7 original patches + the 2 one liners from -next + modified Linus' patch (attached) on the top of 3.9-rc4 using PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained above. I was building two packages at the same time, went away for 30 seconds, came back and everything froze as soon as I touched the laptop's touchpad. Maybe a coincidence but anyway... Another shot in the dark, I had this weird message when trying to build gcc: semop(2): encountered an error: Identifier removed patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty wrote: > > Then I start building a random package and the problems start. They > may also happen without compiling but this seems to trigger the bug > quite quickly. I suspect it's about preemption, and the build just results in enough scheduling load that you start hitting whatever race there is. > Anyway, some progress here, I hope: dmesg seems to be > willing to reveal some secrets (using some pastebin service since this > is pretty big): > > https://gist.github.com/anonymous/5275120 That looks like exactly the exit_sem() bug that Davidlohr was talking about, where the /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; should be moved to *before* the sem_lock(sma, NULL, -1); call. And apparently the bug I had found is already fixed in -next. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Davidlohr, On Sat, Mar 30, 2013 at 9:08 AM, Davidlohr Bueso wrote: > Not sure which one liner you refer to, but, if you haven't already done > so, please try with these fixes (queued in linux-next): > > http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51 > http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31 > > I've been trying to reproduce your twilight zone problem on five > different machines now without any luck. Is there anything you're doing > to trigger the issue? Does the machine boot ok and then do weird things, > say after X starts, open some program, etc? I was missing a9cead0, thanks. What I usually do is starting a standard session, which looks like this: init-+-5*[agetty] |-bash---startx---xinit-+-X---2*[{X}] | `-dwm---sh---sleep |-bash---chromium-+-chromium | |-chromium-+-chromium | | `-2*[{chromium}] | |-chromium-sandbo---chromium---chromium---4*[chromium---4*[{chromium}]] | `-65*[{chromium}] |-crond |-dbus-daemon |-klogd |-syslogd |-tmux-+-alsamixer | |-bash---bash | |-bash | |-htop | `-newsbeuter---{newsbeuter} |-udevd |-urxvtd-+-bash---pstree |`-bash---tmux `-wpa_supplicant Then I start building a random package and the problems start. They may also happen without compiling but this seems to trigger the bug quite quickly. Anyway, some progress here, I hope: dmesg seems to be willing to reveal some secrets (using some pastebin service since this is pretty big): https://gist.github.com/anonymous/5275120 Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty wrote: > > > > I had to slightly modify the patch since it wouldn't match the changes > > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, > > hope that was the right thing to do. So, what I tried was: original 7 > > patches + the one liner + your patch blindly modified by me on the top > > of 3.9-rc4 and I'm still having twilight zone issues. > > Ok, please send your patch so that I can double-check what you did, > but it was simple enough that you probably did the right thing. > > Sad. Your case definitely looks like a double rcu-free, as shown by > the fact that when you enabled SLUB debugging the oops happened with > the use-after-free pattern (it's __rcu_reclaim() doing the > "head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head" > has already been free'd once). > > So ipc_rcu_putref() and a refcounting error looked very promising.as a > potential explanation. > > The 'un' undo structure is also free'd with rcu, but the locking > around that seems much more robust. The undo entry is on two lists > (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under > ulp->lock). But those locks are actually tested with > assert_spin_locked() in all the relevant places, and the code actually > looks sane. So I had high hopes for ipc_rcu_putref()... > > Hmm. Except for exit_sem() that does odd things. You have preemption > enabled, don't you? exit_sem() does a lookup of the first list_proc > entry under tcy_read_lock to lookup un->semid, and then it drops the > rcu read lock. At which point "un" is no longer reliable, I think. But > then it still uses "un->semid", rather than the stable value it looked > up under the rcu read lock. Which looks bogus. > > So I'd like you to test a few more things: > > (a) In exit_sem(), can you change the > > sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid); > > to use just "semid" rather than "un->semid", because I don't > think "un" is stable here. Well that's not really the case in the new code. We don't drop the rcu read lock until the end of the loop, in sem_unlock(). However, I just noticed that we're checking sma for error after trying to acquire sma->sem_perm.lock: sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid); sem_lock(sma, NULL, -1); /* exit_sem raced with IPC_RMID, nothing to do */ if (IS_ERR(sma)) continue; The IS_ERR(sma) check should be right after the sem_obtain_object_check() call instead. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty wrote: > > I had to slightly modify the patch since it wouldn't match the changes > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, > hope that was the right thing to do. So, what I tried was: original 7 > patches + the one liner + your patch blindly modified by me on the top > of 3.9-rc4 and I'm still having twilight zone issues. Ok, please send your patch so that I can double-check what you did, but it was simple enough that you probably did the right thing. Sad. Your case definitely looks like a double rcu-free, as shown by the fact that when you enabled SLUB debugging the oops happened with the use-after-free pattern (it's __rcu_reclaim() doing the "head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head" has already been free'd once). So ipc_rcu_putref() and a refcounting error looked very promising.as a potential explanation. The 'un' undo structure is also free'd with rcu, but the locking around that seems much more robust. The undo entry is on two lists (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under ulp->lock). But those locks are actually tested with assert_spin_locked() in all the relevant places, and the code actually looks sane. So I had high hopes for ipc_rcu_putref()... Hmm. Except for exit_sem() that does odd things. You have preemption enabled, don't you? exit_sem() does a lookup of the first list_proc entry under tcy_read_lock to lookup un->semid, and then it drops the rcu read lock. At which point "un" is no longer reliable, I think. But then it still uses "un->semid", rather than the stable value it looked up under the rcu read lock. Which looks bogus. So I'd like you to test a few more things: (a) In exit_sem(), can you change the sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid); to use just "semid" rather than "un->semid", because I don't think "un" is stable here. (b) does the problem go away if you change disable CONFIG_PREEMPT (perhaps to PREEMPT_NONE or PREEMPT_VOLUNTARY?) Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote: > Hi Linus, > > On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds > wrote: > > Emmanuel, can you try the attached patch? I think it applies cleanly > > on top of the scalability series too without any changes, but I didn't > > check if the patches perhaps changed some of the naming or something. > > I had to slightly modify the patch since it wouldn't match the changes > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, > hope that was the right thing to do. So, what I tried was: original 7 > patches + the one liner + your patch blindly modified by me on the top > of 3.9-rc4 and I'm still having twilight zone issues. Not sure which one liner you refer to, but, if you haven't already done so, please try with these fixes (queued in linux-next): http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51 http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065a9ec31 I've been trying to reproduce your twilight zone problem on five different machines now without any luck. Is there anything you're doing to trigger the issue? Does the machine boot ok and then do weird things, say after X starts, open some program, etc? Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Linus, On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds wrote: > Emmanuel, can you try the attached patch? I think it applies cleanly > on top of the scalability series too without any changes, but I didn't > check if the patches perhaps changed some of the naming or something. I had to slightly modify the patch since it wouldn't match the changes introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch, hope that was the right thing to do. So, what I tried was: original 7 patches + the one liner + your patch blindly modified by me on the top of 3.9-rc4 and I'm still having twilight zone issues. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 2:12 PM, Linus Torvalds wrote: > > I dunno. I'm still not sure this is triggerable, but it looks bad. But > both the semaphore case and the msg cases seem to be solvable by > moving the unlock down, and shm seem to have no getref/putref users to > race with, so this (whitespace-damaged) patch *may* be sufficient: Well, the patch doesn't seem to cause any problems, at least neither lockdep nor spinlock sleep debugging complains. I have no idea whether it actually fixes any problems, though. I do wonder if this might explain the problem Emmanuel saw. A double free of a RCU-freeable object would possibly result in exactly the kind of mess that Emmanuel reported with the semaphore scalability patches. Emmanuel, can you try the attached patch? I think it applies cleanly on top of the scalability series too without any changes, but I didn't check if the patches perhaps changed some of the naming or something. Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds wrote: > > The alternative would be to make sure the thing is always locked (and > in a rcu-read-safe region) before putref/getref. The only place (apart > from the initial allocation, which doesn't matter, because nothing can > see if itf that path fails) seems to be that freeque(), but I didn't > check everything. > > Moving the > > msg_unlock(msq); > > to the end of freeque() might be the way to go. It's going to cause us > to hold the lock for longer, but I'm not sure we care for that path. Uhhuh. I think shm_destroy() has the same pattern. And I think that one has locking reasons why it has to do the shm_unlock() before tearing some things down, although I'm not sure.. The good news is that shm doesn't seem to have any users of ipc_rcu_get/putref(), so I don't see anything to race against. So it has the same buggy pattern, but I don't think it can trigger anything. And ipc/sem.c has the same bug-pattern in freeary(). It does "sem_unlock(sma)" followed by "ipc_rcu_putref(sma);", and it *does* seem to have code that that can race against (sem_lock_and_putref()). The whole "sem_putref()" tries to be careful and gets the lock for the last ref, but getting the lock doesn't help when freeary() does the refcount access without it. The semaphore case seems to argue fairly strongly for an "atomic_t refcount", because right now it does a lot of "sem_putref()" stuff that just gets the lock for the putref. So not only is that insufficient (if it races against freeary(), but it's also more expensive than just an atomic refcount would have been. I dunno. I'm still not sure this is triggerable, but it looks bad. But both the semaphore case and the msg cases seem to be solvable by moving the unlock down, and shm seem to have no getref/putref users to race with, so this (whitespace-damaged) patch *may* be sufficient: diff --git a/ipc/msg.c b/ipc/msg.c index 31cd1bf6af27..338d8e2b589b 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) expunge_all(msq, -EIDRM); ss_wakeup(&msq->q_senders, 1); msg_rmid(ns, msq); - msg_unlock(msq); tmp = msq->q_messages.next; while (tmp != &msq->q_messages) { @@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) atomic_sub(msq->q_cbytes, &ns->msg_bytes); security_msg_queue_free(msq); ipc_rcu_putref(msq); + msg_unlock(msq); } /* diff --git a/ipc/sem.c b/ipc/sem.c index 58d31f1c1eb5..1cf024b9eac0 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) /* Remove the semaphore set from the IDR */ sem_rmid(ns, sma); - sem_unlock(sma); wake_up_sem_queue_do(&tasks); ns->used_sems -= sma->sem_nsems; security_sem_free(sma); ipc_rcu_putref(sma); + sem_unlock(sma); } static unsigned long copy_semid_to_user(void __user *buf, struct semid64_ds *in, int version) (I didn't check very carefully, it's possible that we end up having some locking problem if we move the unlocks down later, but it *looks* fine) Anybody? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > I've started seeing this.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > RIP: 0010:[] [] free_msg+0x2b/0x40 > Call Trace: > [] freeque+0xcf/0x140 > [] msgctl_down.constprop.9+0x183/0x200 > [] sys_msgctl+0x139/0x400 > [] system_call_fastpath+0x16/0x1b > > Looks like seg was already kfree'd. Hmm. I have a suspicion. The use of ipc_rcu_getref/ipc_rcu_putref() seems a bit iffy. In particular, the refcount is not an atomic variable, so we absolutely *depend* on the spinlock for it. However, looking at "freeque()", that's not actually the case. It releases the message queue spinlock *before* it does the ipc_rcu_putref(), and it does that because the thing has become unreachable (it did a msg_rmid(), which will set ->deleted, which in turn will mean that nobody should successfully look it up any more). HOWEVER. While the "deleted" flag is serialized, the actual refcount is not. So in *parallel* with the freeque() call, we may have some other user that does something like ... ipc_rcu_getref(msq); msg_unlock(msq); schedule(); ipc_lock_by_ptr(&msq->q_perm); ipc_rcu_putref(msq); if (msq->q_perm.deleted) { err = -EIDRM; goto out_unlock_free; } ... which got the lock for the "deleted" test, so that part is all fine, but notice the "ipc_rcu_putref()". It can happen at the same time that freeque() also does its own ipc_rcu_putref(). So now refcount may get buggered, resulting in random early reuse, double free's or leaking of the msq. There may be some reason I don't see why this cannot happen, but it does look suspicious. I wonder if the refcount should be an atomic value. The alternative would be to make sure the thing is always locked (and in a rcu-read-safe region) before putref/getref. The only place (apart from the initial allocation, which doesn't matter, because nothing can see if itf that path fails) seems to be that freeque(), but I didn't check everything. Moving the msg_unlock(msq); to the end of freeque() might be the way to go. It's going to cause us to hold the lock for longer, but I'm not sure we care for that path. Guys, am I missing something? This kind of refcounting problem might well explain the rcu-freeing-time bugs reported with the scalability patches: long-time race that just got *much* easier to expose with the higher level of parallelism? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:33 PM, Peter Hurley wrote: > On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote: >> I think I foud at least one bug in the MSG_COPY stuff: it leaks the >> "copy" allocation if >> >> mode == SEARCH_LESSEQUAL >> >> but maybe I'm misreading it. > > Yes, you're misreading it. copy_msg() returns the 'copy' address when > copying is successful. So this patch double-frees 'copy'. No it doesn't. The case where "copy_msg()" *is* called and is successful, msg gets set to "copy", my patch sets "copy" to NULL, so no, it doesn't double-free. That said, it looks like if MSG_COPY is set, we cannot trigger the "mode == SEARCH_LESSEQUAL" case, because it forces msgtyp to 0, which in turn forces mode = SEARCH_ANY. So the actual leak case seems to not be possible, although that's *really* hard to see from looking at the code. The code is just an unreadable mess. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: > > > > Here's an oops I just hit.. > > > > BUG: unable to handle kernel NULL pointer dereference at 000f > > IP: [] testmsg.isra.5+0x1a/0x60 > > Btw, looking at the code leading up to this, what the f*ck is wrong > with the IPC stuff? > > It's using the generic list stuff for most of the lists, but then it > open-codes the accesses. > > So instead of using > >for_each_entry(walk_msg, &msq->q_messages, m_list) { > .. >} > > the ipc/msg.c code does all that by hand, with > >tmp = msq->q_messages.next; >while (tmp != &msq->q_messages) { > struct msg_msg *walk_msg; > > walk_msg = list_entry(tmp, struct msg_msg, m_list); > ... > tmp = tmp->next; >} > > Ugh. The code is near unreadable. And then it has magic memory > barriers etc, implying that it doesn't lock the data structures, but > no comments about them. See expunge_all() and pipelined_send(). > > The code seems entirely random, and it's badly set up (annoyance of > the day: crazy helper functions in ipc/msgutil.c to make sure that (a) > you have to spend more effort looking for them, and (b) they won't get > inlined). > > Clearly nobody has cared for the crazy IPC message code in a long time. Exactly that's what my patch series does; clean this mess up. This is what I wrote to Andrew a couple of days ago. On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote: I just figured out how the queue is being corrupted and why my series > fixes it. > > > With MSG_COPY set, the queue scan can exit with the local variable 'msg' > pointing to a real msg if the msg_counter never reaches the copy_number. > > The remaining execution looks like this: > > if (!IS_ERR(msg)) { > > if (msgflg & MSG_COPY) > goto out_unlock; > > > out_unlock: > msg_unlock(msq); > break; > } > } > if (IS_ERR(msg)) > > > bufsz = msg_handler(); > free_msg(msg); << msg never unlinked > > > Since the msg should not have been found (because it failed the match > criteria), the if (!IS_ERR(msg)) clause should never have executed. > > That's why my refactor fixes resolve this; because msg is not > inadvertently treated as a found msg. > > But let's be honest; the real bug here is the poor structure of this > function that even made this possible. The deepest nesting executes a > goto to a label in the middle of an if clause. Yuck! No wonder this > thing's fragile. > > So my recommendation still stands. The series that fixes this has been > getting tested in linux-next for a month. Fixing this some other way is > just asking for more trouble. > > But why not just revert MSG_COPY altogether for 3.9? Regards, Peter Hurley -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote: > I think I foud at least one bug in the MSG_COPY stuff: it leaks the > "copy" allocation if > > mode == SEARCH_LESSEQUAL > > but maybe I'm misreading it. Yes, you're misreading it. copy_msg() returns the 'copy' address when copying is successful. So this patch double-frees 'copy'. Andrew has had the patches that fix this for a month but because they're fairly extensive he didn't want to apply them for 3.9. Regards, Peter -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: > > Here's an oops I just hit.. > > BUG: unable to handle kernel NULL pointer dereference at 000f > IP: [] testmsg.isra.5+0x1a/0x60 Btw, looking at the code leading up to this, what the f*ck is wrong with the IPC stuff? It's using the generic list stuff for most of the lists, but then it open-codes the accesses. So instead of using for_each_entry(walk_msg, &msq->q_messages, m_list) { .. } the ipc/msg.c code does all that by hand, with tmp = msq->q_messages.next; while (tmp != &msq->q_messages) { struct msg_msg *walk_msg; walk_msg = list_entry(tmp, struct msg_msg, m_list); ... tmp = tmp->next; } Ugh. The code is near unreadable. And then it has magic memory barriers etc, implying that it doesn't lock the data structures, but no comments about them. See expunge_all() and pipelined_send(). The code seems entirely random, and it's badly set up (annoyance of the day: crazy helper functions in ipc/msgutil.c to make sure that (a) you have to spend more effort looking for them, and (b) they won't get inlined). Clearly nobody has cared for the crazy IPC message code in a long time. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones wrote: > > Btw, something that's really bothering me is just how much bogus > 'follow-on' spew we have lately. I'm not sure what changed, but it > seems to have gotten worse. .. we have many more sanity checks, and they tend to trigger in the case of us killing somebody holding locks etc. > Related: is there any value in printing out all the ? symbols on > kernels with frame pointers enabled ? I've found them useful. You can often see what happened just before. Of course, I'd still prefer gcc not generate unnecessarily big stack frames (which is one of the bigger reasons for old stuff shining through) but as things are, they often give hints about what happened just before. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:43:25AM -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > > I've started seeing this.. > > > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you > disable it? > > I think I foud at least one bug in the MSG_COPY stuff: it leaks the > "copy" allocation if > > mode == SEARCH_LESSEQUAL > > but maybe I'm misreading it. And that shouldn't cause the problem you > see, but it's indicative of how badly tested and thought through the > MSG_COPY code is. > > Totally UNTESTED leak fix appended. Stanislav? I'll give it a shot. Btw, something that's really bothering me is just how much bogus 'follow-on' spew we have lately. I'm not sure what changed, but it seems to have gotten worse. Here's an oops I just hit.. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [] testmsg.isra.5+0x1a/0x60 PGD 10fd95067 PUD 10f767067 PMD 0 Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 2 Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[] [] testmsg.isra.5+0x1a/0x60 RSP: 0018:880117bb5e88 EFLAGS: 00010246 RAX: RBX: 0004 RCX: 0078 RDX: 0004 RSI: fffe RDI: 000f RBP: 880117bb5e88 R08: 0004 R09: 0001 R10: 880117bb8000 R11: 0001 R12: fffe R13: 88010fd10308 R14: 88010fd10258 R15: FS: 7fa89c256740() GS:88012aa0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 000f CR3: 00010f76c000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 880117bb8000) Stack: 880117bb5f68 812c3746 880117bb8000 880117bb8000 880117bb8000 81c7ace0 812c2430 0004 652a928e Call Trace: [] do_msgrcv+0x1a6/0x5f0 [] ? msg_security+0x10/0x10 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] sys_msgrcv+0x15/0x20 [] system_call_fastpath+0x16/0x1b Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 RIP [] testmsg.isra.5+0x1a/0x60 RSP CR2: 000f ---[ end trace 8f0713d2aacb6aa3 ]--- BUG: sleeping function called from invalid context at kernel/rwsem.c:20 in_atomic(): 1, irqs_disabled(): 0, pid: 958, name: trinity-child20 INFO: lockdep is turned off. Pid: 958, comm: trinity-child20 Tainted: G D 3.9.0-rc4+ #7 Call Trace: [] __might_sleep+0x145/0x200 [] down_read+0x2a/0xa0 [] exit_signals+0x24/0x130 [] do_exit+0xbc/0xd10 [] ? kmsg_dump+0x1b5/0x230 [] ? kmsg_dump+0x25/0x230 [] oops_end+0x9c/0xe0 [] no_context+0x268/0x275 [] __bad_area_nosemaphore+0x78/0x1d1 [] bad_area_nosemaphore+0x13/0x15 [] __do_page_fault+0x366/0x5b0 [] ? __lock_acquire+0x2e5/0x1a00 [] ? trace_hardirqs_off_caller+0x28/0xc0 [] ? trace_hardirqs_off_thunk+0x3a/0x3c [] do_page_fault+0xe/0x10 [] page_fault+0x22/0x30 [] ? testmsg.isra.5+0x1a/0x60 [] ? security_msg_queue_msgrcv+0x16/0x20 [] do_msgrcv+0x1a6/0x5f0 [] ? msg_security+0x10/0x10 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] sys_msgrcv+0x15/0x20 [] system_call_fastpath+0x16/0x1b note: trinity-child20[958] exited with preempt_count 1 BUG: scheduling while atomic: trinity-child20/958/0x1002 INFO: lockdep is turned off. Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filte
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > > > Whichever way we go, we should get a wiggle on - this has been hanging > > > around for too long. Dave, do you have time to determine whether > > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > > than max") fixes things up? > > > > Ok, with that reverted it's been grinding away for a few hours without > > incident. > > Normally I see the oops within a minute or so. > > > > OK, thanks, I queued a revert: > > From: Andrew Morton > Subject: revert "ipc: don't allocate a copy larger than max" > > Revert 88b9e456b164. Dave has confirmed that this was causing oopses > during trinity testing. I owe Peter an apology. I just hit it again with that backed out. Andrew, might as well drop that revert. BUG: unable to handle kernel NULL pointer dereference at 000f IP: [] testmsg.isra.5+0x1a/0x60 PGD 10fd95067 PUD 10f767067 PMD 0 Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 2 Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[] [] testmsg.isra.5+0x1a/0x60 RSP: 0018:880117bb5e88 EFLAGS: 00010246 RAX: RBX: 0004 RCX: 0078 RDX: 0004 RSI: fffe RDI: 000f RBP: 880117bb5e88 R08: 0004 R09: 0001 R10: 880117bb8000 R11: 0001 R12: fffe R13: 88010fd10308 R14: 88010fd10258 R15: FS: 7fa89c256740() GS:88012aa0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 000f CR3: 00010f76c000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child20 (pid: 958, threadinfo 880117bb4000, task 880117bb8000) Stack: 880117bb5f68 812c3746 880117bb8000 880117bb8000 880117bb8000 81c7ace0 812c2430 0004 652a928e Call Trace: [] do_msgrcv+0x1a6/0x5f0 [] ? msg_security+0x10/0x10 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] sys_msgrcv+0x15/0x20 [] system_call_fastpath+0x16/0x1b Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 RIP [] testmsg.isra.5+0x1a/0x60 I think I wasn't seeing that this last week because I had inadvertantly disabled DEBUG_PAGEALLOC and.. we're back to square one. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > I've started seeing this.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you disable it? I think I foud at least one bug in the MSG_COPY stuff: it leaks the "copy" allocation if mode == SEARCH_LESSEQUAL but maybe I'm misreading it. And that shouldn't cause the problem you see, but it's indicative of how badly tested and thought through the MSG_COPY code is. Totally UNTESTED leak fix appended. Stanislav? Linus patch.diff Description: Binary data
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:04 AM, Dave Jones wrote: > > mainline. Your current tree. Ok, that's what I thought you were generally testing, just wanted to verify. The subject kind of implied otherwise.. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 11:00:27AM -0700, Linus Torvalds wrote: > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > > I've started seeing this.. > > > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > > > Looks like seg was already kfree'd. > > Just to clarify: is this you testing -mm that has Davidlohr's and > Rik's scalability patches? Or mainline? mainline. Your current tree. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones wrote: > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but > I've started seeing this.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > Looks like seg was already kfree'd. Just to clarify: is this you testing -mm that has Davidlohr's and Rik's scalability patches? Or mainline? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote: > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > > > Whichever way we go, we should get a wiggle on - this has been hanging > > > around for too long. Dave, do you have time to determine whether > > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > > than max") fixes things up? > > > > Ok, with that reverted it's been grinding away for a few hours without > > incident. > > Normally I see the oops within a minute or so. > > OK, thanks, I queued a revert: > > From: Andrew Morton > Subject: revert "ipc: don't allocate a copy larger than max" > > Revert 88b9e456b164. Dave has confirmed that this was causing oopses > during trinity testing. Now that I have that reverted, I'm not seeing msgrcv traces any more, but I've started seeing this.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: l2tp_ppp l2tp_netlink l2tp_core llc2 phonet netrom rose af_key af_rxrpc caif_socket caif can_raw cmtp kernelcapi can_bcm can nfnetlink ipt_ULOG scsi_transport_iscsi af_802154 ax25 atm ipx pppoe pppox x25 nfc irda ppp_generic p8023 slhc p8022 appletalk decnet crc_ccitt rds psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec raid0 snd_pcm rfkill microcode serio_raw pcspkr snd_page_alloc edac_core snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm CPU 3 Pid: 1850, comm: trinity-child37 Tainted: GB3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[] [] free_msg+0x2b/0x40 RSP: 0018:8800a1d3bdd0 EFLAGS: 00010202 RAX: RBX: 6b6b6b6b6b6b6b6b RCX: RDX: RSI: RDI: 810b6ced RBP: 8800a1d3bde0 R08: R09: 0001 R10: 0001 R11: 0001 R12: 88009997e620 R13: 81c7ace0 R14: 8800caf359d8 R15: 81c7b024 FS: 7f2d7be64740() GS:88012ac0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 7f8bd7bb6000 CR3: a1f0a000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process trinity-child37 (pid: 1850, threadinfo 8800a1d3a000, task 8800a1e62490) Stack: 6b6b6b6b6b6b6b6b 8800caf35928 8800a1d3be18 812c289f 81c7ace0 8800caf35928 8800a1d3be28 8800a1d3bec8 812c2a93 8800a1d3be40 Call Trace: [] freeque+0xcf/0x140 [] msgctl_down.constprop.9+0x183/0x200 [] ? up_read+0x1f/0x40 [] ? __do_page_fault+0x214/0x5b0 [] ? lock_release_non_nested+0x23e/0x320 [] sys_msgctl+0x139/0x400 [] ? retint_swapgs+0xe/0x13 [] ? trace_hardirqs_on_caller+0x115/0x1a0 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] system_call_fastpath+0x16/0x1b Code: 66 66 66 66 90 55 48 89 e5 41 54 49 89 fc 53 e8 fc 5e 01 00 49 8b 5c 24 20 4c 89 e7 e8 8f af ed ff 48 85 db 75 05 eb 13 4c 89 e3 <4c> 8b 23 48 89 df e8 7a af ed ff 4d 85 e4 75 ed 5b 41 5c 5d c3 (Taint is from an ext4 double-free I just reported in a separate thread) decoded.. 0: 66 66 66 66 90 data32 data32 data32 xchg %ax,%ax 5: 55 push %rbp 6: 48 89 e5mov%rsp,%rbp 9: 41 54 push %r12 b: 49 89 fcmov%rdi,%r12 e: 53 push %rbx f: e8 fc 5e 01 00 callq 0x15f10 14: 49 8b 5c 24 20 mov0x20(%r12),%rbx 19: 4c 89 e7mov%r12,%rdi 1c: e8 8f af ed ff callq 0xffedafb0 21: 48 85 dbtest %rbx,%rbx 24: 75 05 jne0x2b 26: eb 13 jmp0x3b 28: 4c 89 e3mov%r12,%rbx 2b:* 4c 8b 23mov(%rbx),%r12 <-- trapping instruction 2e: 48 89 dfmov%rbx,%rdi 31: e8 7a af ed ff callq 0xffedafb0 36: 4d 85 e4test %r12,%r12 39: 75 ed jne0x28 3b: 5b pop%rbx 3c: 41 5c pop%r12 3e: 5d pop%rbp 3f: c3 retq Disassembly of free_msg shows.. seg = msg->next; kfree(msg); while (seg != NULL) { struct msg_msgseg *tmp = seg->next; 30b: 4c 8b 23mov(%rbx),%r12 kfree(seg); 30e: 48 89 dfmov%rbx,%rdi 311: e8 00 00 00 00 callq 316 Looks like seg was a
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones wrote: > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > > > Whichever way we go, we should get a wiggle on - this has been hanging > > around for too long. Dave, do you have time to determine whether > > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > > than max") fixes things up? > > Ok, with that reverted it's been grinding away for a few hours without > incident. > Normally I see the oops within a minute or so. > OK, thanks, I queued a revert: From: Andrew Morton Subject: revert "ipc: don't allocate a copy larger than max" Revert 88b9e456b164. Dave has confirmed that this was causing oopses during trinity testing. Cc: Peter Hurley Cc: Stanislav Kinsbursky Reported-by: Dave Jones Cc: Signed-off-by: Andrew Morton --- ipc/msg.c |6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff -puN ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max ipc/msg.c --- a/ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max +++ a/ipc/msg.c @@ -820,17 +820,15 @@ long do_msgrcv(int msqid, void __user *b struct msg_msg *copy = NULL; unsigned long copy_number = 0; - ns = current->nsproxy->ipc_ns; - if (msqid < 0 || (long) bufsz < 0) return -EINVAL; if (msgflg & MSG_COPY) { - copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax), - msgflg, &msgtyp, ©_number); + copy = prepare_copy(buf, bufsz, msgflg, &msgtyp, ©_number); if (IS_ERR(copy)) return PTR_ERR(copy); } mode = convert_mode(&msgtyp, msgflg); + ns = current->nsproxy->ipc_ns; msq = msg_lock_check(ns, msqid); if (IS_ERR(msq)) { _ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote: > Whichever way we go, we should get a wiggle on - this has been hanging > around for too long. Dave, do you have time to determine whether > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > than max") fixes things up? Ok, with that reverted it's been grinding away for a few hours without incident. Normally I see the oops within a minute or so. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 26 Mar 2013 10:59:27 -0700 Davidlohr Bueso wrote: > On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote: > > On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds > > wrote: > > > And you never see this problem without Rik's patches? > > > > No, never. > > > > > Could you bisect > > > *which* patch it starts with? Are the first four ones ok (the moving > > > of the locking around, but without the fine-grained ones), for > > > example? > > > > With the first four patches only, I got some X server freeze (just tried > > once). > > Going over the code again, I found a potential recursive spinlock scenario. > Andrew, if you have no objections, please queue this up. > > Thanks. > > ---8<--- > > From: Davidlohr Bueso > Subject: [PATCH] ipc, sem: prevent possible deadlock > > In semctl_main(), when cmd == GETALL, we're locking > sma->sem_perm.lock (through sem_lock_and_putref), yet > after the conditional, we lock it again. > Unlock sma right after exiting the conditional. > > Signed-off-by: Davidlohr Bueso > --- > ipc/sem.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/ipc/sem.c b/ipc/sem.c > index 1a2913d..f257afe 100644 > --- a/ipc/sem.c > +++ b/ipc/sem.c > @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int > semid, int semnum, > err = -EIDRM; > goto out_free; > } > + sem_unlock(sma, -1); > } > > sem_lock(sma, NULL, -1); Looks right. Do we need the locking at all? What does it actually do? sem_lock_and_putref(sma); if (sma->sem_perm.deleted) { sem_unlock(sma, -1); err = -EIDRM; goto out_free; } sem_unlock(sma, -1); We're taking the lock, testing an int and then dropping the lock. What's the point in that? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 02:07 PM, Sasha Levin wrote: On 03/26/2013 01:51 PM, Davidlohr Bueso wrote: On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: On 03/20/2013 03:55 PM, Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. This logic was from the original code, which I also found to be quite confusing. I wasn't getting this warning with the old code, so there was probably something else that triggers this now. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Not necessarily, we do release everything at the end of the function: out_unlock_free: sem_unlock(sma, locknum); Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even more I suspect. If un is non-NULL we'll be unlocking rcu lock twice? It is uglier than you think... On success, find_alloc_undo will call rcu_read_lock, so we have the rcu_read_lock held twice :( Some of the ipc code is quite ugly, but making too many large changes at once is just asking for trouble. I suspect we're going to have to untangle this one bit at a time... -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 01:59 PM, Davidlohr Bueso wrote: From: Davidlohr Bueso Subject: [PATCH] ipc, sem: prevent possible deadlock In semctl_main(), when cmd == GETALL, we're locking sma->sem_perm.lock (through sem_lock_and_putref), yet after the conditional, we lock it again. Unlock sma right after exiting the conditional. Signed-off-by: Davidlohr Bueso Reviewed-by: Rik van Riel -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/20/2013 03:55 PM, Rik van Riel wrote: > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. Hi Rik, Another issue that came up is: [ 96.347341] [ 96.348085] [ BUG: lock held when returning to user space! ] [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G W [ 96.360300] [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still held! [ 96.362019] 1 lock held by trinity-child9/7583: [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] SYSC_semtimedop+0x1fb/0xec0 It seems that we can leave semtimedop without releasing the rcu read lock. I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it seems that this behaviour is actually intentional? rcu_read_lock(); sma = sem_obtain_object_check(ns, semid); if (IS_ERR(sma)) { if (un) rcu_read_unlock(); error = PTR_ERR(sma); goto out_free; } When I've looked at that it seems that not releasing the read lock was (very) intentional. After that, the only code path that would release the lock starts with: if (un) { ... So we won't release the lock at all if un is NULL? Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/26/2013 01:51 PM, Davidlohr Bueso wrote: > On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: >> On 03/20/2013 03:55 PM, Rik van Riel wrote: >>> This series makes the sysv semaphore code more scalable, >>> by reducing the time the semaphore lock is held, and making >>> the locking more scalable for semaphore arrays with multiple >>> semaphores. >> >> Hi Rik, >> >> Another issue that came up is: >> >> [ 96.347341] >> [ 96.348085] [ BUG: lock held when returning to user space! ] >> [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G >> W >> [ 96.360300] >> [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still >> held! >> [ 96.362019] 1 lock held by trinity-child9/7583: >> [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] >> SYSC_semtimedop+0x1fb/0xec0 >> >> It seems that we can leave semtimedop without releasing the rcu read lock. >> >> I'm a bit confused by what's going on in semtimedop with regards to rcu read >> lock, it >> seems that this behaviour is actually intentional? >> >> rcu_read_lock(); >> sma = sem_obtain_object_check(ns, semid); >> if (IS_ERR(sma)) { >> if (un) >> rcu_read_unlock(); >> error = PTR_ERR(sma); >> goto out_free; >> } >> >> When I've looked at that it seems that not releasing the read lock was (very) >> intentional. > > This logic was from the original code, which I also found to be quite > confusing. I wasn't getting this warning with the old code, so there was probably something else that triggers this now. >> >> After that, the only code path that would release the lock starts with: >> >> if (un) { >> ... >> >> So we won't release the lock at all if un is NULL? >> > > Not necessarily, we do release everything at the end of the function: > > out_unlock_free: > sem_unlock(sma, locknum); Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even more I suspect. If un is non-NULL we'll be unlocking rcu lock twice? if (un->semid == -1) { rcu_read_unlock(); goto out_unlock_free; } [...] out_unlock_free: sem_unlock(sma, locknum); Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote: > On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds > wrote: > > And you never see this problem without Rik's patches? > > No, never. > > > Could you bisect > > *which* patch it starts with? Are the first four ones ok (the moving > > of the locking around, but without the fine-grained ones), for > > example? > > With the first four patches only, I got some X server freeze (just tried > once). Going over the code again, I found a potential recursive spinlock scenario. Andrew, if you have no objections, please queue this up. Thanks. ---8<--- From: Davidlohr Bueso Subject: [PATCH] ipc, sem: prevent possible deadlock In semctl_main(), when cmd == GETALL, we're locking sma->sem_perm.lock (through sem_lock_and_putref), yet after the conditional, we lock it again. Unlock sma right after exiting the conditional. Signed-off-by: Davidlohr Bueso --- ipc/sem.c | 1 + 1 file changed, 1 insertion(+) diff --git a/ipc/sem.c b/ipc/sem.c index 1a2913d..f257afe 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum, err = -EIDRM; goto out_free; } + sem_unlock(sma, -1); } sem_lock(sma, NULL, -1); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, Mar 26, 2013 at 01:33:07PM -0400, Sasha Levin wrote: > On 03/20/2013 03:55 PM, Rik van Riel wrote: > > This series makes the sysv semaphore code more scalable, > > by reducing the time the semaphore lock is held, and making > > the locking more scalable for semaphore arrays with multiple > > semaphores. > > Hi Rik, > > Another issue that came up is: > > [ 96.347341] > [ 96.348085] [ BUG: lock held when returning to user space! ] > [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G > W > [ 96.360300] > [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still > held! > [ 96.362019] 1 lock held by trinity-child9/7583: > [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] > SYSC_semtimedop+0x1fb/0xec0 > > It seems that we can leave semtimedop without releasing the rcu read lock. > > I'm a bit confused by what's going on in semtimedop with regards to rcu read > lock, it > seems that this behaviour is actually intentional? > > rcu_read_lock(); > sma = sem_obtain_object_check(ns, semid); > if (IS_ERR(sma)) { > if (un) > rcu_read_unlock(); > error = PTR_ERR(sma); > goto out_free; > } > > When I've looked at that it seems that not releasing the read lock was (very) > intentional. > > After that, the only code path that would release the lock starts with: > > if (un) { > ... > > So we won't release the lock at all if un is NULL? Intentions notwithstanding, it is absolutely required to exit any and all RCU read-side critical sections prior to going into user mode. I suggest removing the "if (un)". Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote: > On 03/20/2013 03:55 PM, Rik van Riel wrote: > > This series makes the sysv semaphore code more scalable, > > by reducing the time the semaphore lock is held, and making > > the locking more scalable for semaphore arrays with multiple > > semaphores. > > Hi Rik, > > Another issue that came up is: > > [ 96.347341] > [ 96.348085] [ BUG: lock held when returning to user space! ] > [ 96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G > W > [ 96.360300] > [ 96.361084] trinity-child9/7583 is leaving the kernel with locks still > held! > [ 96.362019] 1 lock held by trinity-child9/7583: > [ 96.362610] #0: (rcu_read_lock){.+.+..}, at: [] > SYSC_semtimedop+0x1fb/0xec0 > > It seems that we can leave semtimedop without releasing the rcu read lock. > > I'm a bit confused by what's going on in semtimedop with regards to rcu read > lock, it > seems that this behaviour is actually intentional? > > rcu_read_lock(); > sma = sem_obtain_object_check(ns, semid); > if (IS_ERR(sma)) { > if (un) > rcu_read_unlock(); > error = PTR_ERR(sma); > goto out_free; > } > > When I've looked at that it seems that not releasing the read lock was (very) > intentional. This logic was from the original code, which I also found to be quite confusing. > > After that, the only code path that would release the lock starts with: > > if (un) { > ... > > So we won't release the lock at all if un is NULL? > Not necessarily, we do release everything at the end of the function: out_unlock_free: sem_unlock(sma, locknum); Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/20/2013 03:55 PM, Rik van Riel wrote: > Include lkml in the CC: this time... *sigh* > ---8<--- > > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. Hi Rik, I'm getting the following false positives from lockdep: [ 80.492995] = [ 80.494052] [ INFO: possible recursive locking detected ] [ 80.494878] 3.9.0-rc4-next-20130325-sasha-00044-gcb6ef58 #315 Tainted: G W [ 80.496228] - [ 80.497171] trinity-child9/7210 is trying to acquire lock: [ 80.497934] (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: [] newary+0x1c7/0x2a0 [ 80.499202] [ 80.499202] but task is already holding lock: [ 80.500031] (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: [] newary+0x1c7/0x2a0 [ 80.500031] [ 80.500031] other info that might help us debug this: [ 80.500031] Possible unsafe locking scenario: [ 80.500031] [ 80.500031]CPU0 [ 80.500031] [ 80.500031] lock(&(&sma->sem_base[i].lock)->rlock); [ 80.500031] lock(&(&sma->sem_base[i].lock)->rlock); [ 80.500031] [ 80.500031] *** DEADLOCK *** [ 80.500031] [ 80.500031] May be due to missing lock nesting notation [ 80.500031] [ 80.500031] 4 locks held by trinity-child9/7210: [ 80.500031] #0: (&ids->rw_mutex){+.}, at: [] ipcget+0x72/0x340 [ 80.500031] #1: (rcu_read_lock){.+.+..}, at: [] ipc_addid+0x35/0x230 [ 80.500031] #2: (&(&new->lock)->rlock){+.+...}, at: [] ipc_addid+0xd0/0x230 [ 80.500031] #3: (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: [] newary+0x1c7/0x2a0 [ 80.500031] [ 80.500031] stack backtrace: [ 80.500031] Pid: 7210, comm: trinity-child9 Tainted: GW 3.9.0-rc4-next-20130325-sasha-00044-gcb6ef58 #315 [ 80.500031] Call Trace: [ 80.500031] [] __lock_acquire+0xc6e/0x1e50 [ 80.500031] [] ? idr_get_empty_slot+0x255/0x3c0 [ 80.500031] [] ? mark_held_locks+0x12e/0x150 [ 80.500031] [] lock_acquire+0x1aa/0x240 [ 80.500031] [] ? newary+0x1c7/0x2a0 [ 80.500031] [] _raw_spin_lock+0x3b/0x70 [ 80.500031] [] ? newary+0x1c7/0x2a0 [ 80.500031] [] newary+0x1c7/0x2a0 [ 80.500031] [] ? ipcget+0x72/0x340 [ 80.500031] [] ipcget+0x226/0x340 [ 80.500031] [] SyS_semget+0x65/0x80 [ 80.500031] [] ? semctl_down.constprop.8+0x320/0x320 [ 80.500031] [] ? wake_up_sem_queue_do+0xa0/0xa0 [ 80.500031] [] ? SyS_msgrcv+0x20/0x20 [ 80.500031] [] tracesys+0xe1/0xe6 The code is: for (i = 0; i < nsems; i++) { INIT_LIST_HEAD(&sma->sem_base[i].sem_pending); spin_lock_init(&sma->sem_base[i].lock); spin_lock(&sma->sem_base[i].lock); < here } Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Mon, Mar 25, 2013 at 10:53 PM, Rik van Riel wrote: > On 03/25/2013 11:20 AM, Emmanuel Benisty wrote: >> >> On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel wrote: > > With the first four patches only, I got some X server freeze (just > tried once). Could you try booting with panic=1 so the kernel panics on the first oops? >>> >>> >>> >>> Sorry that should be "oops=panic" >>> >>> Maybe that way (if we are lucky) we will be able to capture the first oops, and maybe get an idea of what causes the problem. >> >> >> Sorry Rik, I get all kind of weird behaviors (wireless dies, compiling >> gets stuck and is impossible to kill, can't kill X) with the 4 >> patches+oops=panic but no trace. Here after is 7+1 patches with >> oops=panic boot: http://i.imgur.com/1jep1qx.jpg > > > This may be a stupid question, but you re-compile and re-install > the kernel modules every time you changed the kernel? > > The behaviour you report with just the first four patches is so > random, it sounds almost like a mismatched data structure between > compiles... Yes it's OK. I even started from scratch just in case but I'm still getting the same weird things. Everything works just fine with a build from Linus' tree which I normally use. I'll try the patches on another machine tomorrow. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/25/2013 11:20 AM, Emmanuel Benisty wrote: On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel wrote: With the first four patches only, I got some X server freeze (just tried once). Could you try booting with panic=1 so the kernel panics on the first oops? Sorry that should be "oops=panic" Maybe that way (if we are lucky) we will be able to capture the first oops, and maybe get an idea of what causes the problem. Sorry Rik, I get all kind of weird behaviors (wireless dies, compiling gets stuck and is impossible to kill, can't kill X) with the 4 patches+oops=panic but no trace. Here after is 7+1 patches with oops=panic boot: http://i.imgur.com/1jep1qx.jpg This may be a stupid question, but you re-compile and re-install the kernel modules every time you changed the kernel? The behaviour you report with just the first four patches is so random, it sounds almost like a mismatched data structure between compiles... -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel wrote: >>> With the first four patches only, I got some X server freeze (just >>> tried once). >> >> >> Could you try booting with panic=1 so the kernel panics on the first >> oops? > > > Sorry that should be "oops=panic" > > >> Maybe that way (if we are lucky) we will be able to capture the first >> oops, and maybe get an idea of what causes the problem. Sorry Rik, I get all kind of weird behaviors (wireless dies, compiling gets stuck and is impossible to kill, can't kill X) with the 4 patches+oops=panic but no trace. Here after is 7+1 patches with oops=panic boot: http://i.imgur.com/1jep1qx.jpg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Mon, Mar 25, 2013 at 9:01 PM, Rik van Riel wrote: > On 03/25/2013 09:47 AM, Emmanuel Benisty wrote: >> >> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds >> wrote: >>> >>> And you never see this problem without Rik's patches? >> >> >> No, never. >> >>> Could you bisect >>> *which* patch it starts with? Are the first four ones ok (the moving >>> of the locking around, but without the fine-grained ones), for >>> example? >> >> >> With the first four patches only, I got some X server freeze (just tried >> once). > > > Another random question, just to rule something out. What video driver > are you using on your system? intel on i3 330m -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/25/2013 10:00 AM, Rik van Riel wrote: On 03/25/2013 09:47 AM, Emmanuel Benisty wrote: On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds wrote: And you never see this problem without Rik's patches? No, never. Could you bisect *which* patch it starts with? Are the first four ones ok (the moving of the locking around, but without the fine-grained ones), for example? With the first four patches only, I got some X server freeze (just tried once). Could you try booting with panic=1 so the kernel panics on the first oops? Sorry that should be "oops=panic" Maybe that way (if we are lucky) we will be able to capture the first oops, and maybe get an idea of what causes the problem. -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/25/2013 09:47 AM, Emmanuel Benisty wrote: On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds wrote: And you never see this problem without Rik's patches? No, never. Could you bisect *which* patch it starts with? Are the first four ones ok (the moving of the locking around, but without the fine-grained ones), for example? With the first four patches only, I got some X server freeze (just tried once). Another random question, just to rule something out. What video driver are you using on your system? -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/25/2013 09:47 AM, Emmanuel Benisty wrote: On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds wrote: And you never see this problem without Rik's patches? No, never. Could you bisect *which* patch it starts with? Are the first four ones ok (the moving of the locking around, but without the fine-grained ones), for example? With the first four patches only, I got some X server freeze (just tried once). Could you try booting with panic=1 so the kernel panics on the first oops? Maybe that way (if we are lucky) we will be able to capture the first oops, and maybe get an idea of what causes the problem. -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds wrote: > And you never see this problem without Rik's patches? No, never. > Could you bisect > *which* patch it starts with? Are the first four ones ok (the moving > of the locking around, but without the fine-grained ones), for > example? With the first four patches only, I got some X server freeze (just tried once). > Another thing to try might be to enable SLUB debugging (ie make sure that all > of > > CONFIG_SLUB_DEBUG=y > CONFIG_SLUB=y > CONFIG_SLUB_DEBUG_ON=y > > are set in your kernel config) My bad, I forgot to enable this in my former build, sorry. Here's what I get now: http://i.imgur.com/G6H8KgD.jpg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sun, Mar 24, 2013 at 6:46 AM, Emmanuel Benisty wrote: > > Thanks Linus. I hope I got this right, here's the result (3.9-rc4, 7+1 > patches): http://i.imgur.com/BebGZxV.jpg Ok, that's *slightly* more informative, but not much. At least now we see the actual page fault information, and see what the bad dereference was. It seems to be a branch through the rcu list "->func" pointer in the rcu callbacks, and the ->func pointer has been corrupted. Instead of being a valid kernel pointer (or a "kfree_rcu_offset" marker, which is a small number between 0-4096), it has the odd value "0x00640064". Two words of decimal "100", in other words. That's not one of the usual "use-after-free" patters or anything like that, so I don't see what it would be. So I have to admit to not really having any better clue about what is going on. Sometimes the corruption pattern give a hint of what it was that overwrote it, but not here.. And you never see this problem without Rik's patches? Could you bisect *which* patch it starts with? Are the first four ones ok (the moving of the locking around, but without the fine-grained ones), for example? Another thing to try might be to enable SLUB debugging (ie make sure that all of CONFIG_SLUB_DEBUG=y CONFIG_SLUB=y CONFIG_SLUB_DEBUG_ON=y are set in your kernel config), which might help pin things down a bit. Sometimes that makes any allocation problems show up earlier in the path, so that it's more obvious who screwed up. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Sun, Mar 24, 2013 at 2:45 AM, Linus Torvalds wrote: > On Fri, Mar 22, 2013 at 8:19 PM, Emmanuel Benisty wrote: >> >> I could reproduce it but could you please let me know what would be >> the right tools I should use to catch the original oops? >> This is what I got but I doubt it will be helpful: >> http://i.imgur.com/Mewi1hC.jpg > > In this case, I think the best thing to do would be to comment out all > of drm_warn_on_modeset_not_all_locked(), because those warnings make > the original problem (that probably caused the lock problem in the > first place that it is warning about) scroll away. > > That said, you should also take the oneliner fix that Rik posted to > patch 7 (subject line: "[PATCH 7/7 part3] fix for sem_lock") and apply > that, just to make sure that you aren't possibly hitting a bug with > the shared-memory locking introduced by that (unusual) case. Thanks Linus. I hope I got this right, here's the result (3.9-rc4, 7+1 patches): http://i.imgur.com/BebGZxV.jpg -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 22, 2013 at 8:19 PM, Emmanuel Benisty wrote: > > I could reproduce it but could you please let me know what would be > the right tools I should use to catch the original oops? > This is what I got but I doubt it will be helpful: > http://i.imgur.com/Mewi1hC.jpg In this case, I think the best thing to do would be to comment out all of drm_warn_on_modeset_not_all_locked(), because those warnings make the original problem (that probably caused the lock problem in the first place that it is warning about) scroll away. That said, you should also take the oneliner fix that Rik posted to patch 7 (subject line: "[PATCH 7/7 part3] fix for sem_lock") and apply that, just to make sure that you aren't possibly hitting a bug with the shared-memory locking introduced by that (unusual) case. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Linus, On Fri, Mar 22, 2013 at 10:37 PM, Linus Torvalds wrote: > On Fri, Mar 22, 2013 at 4:04 AM, Emmanuel Benisty wrote: >> >> I was trying your patchset and my machine died while building a >> package. I could reproduce the bug the (only) two times I tried. >> There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg > > Hmm. The original oops may well have scrolled off the window, what > remains is just indicating something is corrupted in slab. Which isn't > likely to give much hints.. > > Can you try to reproduce this with CONFIG_SLUB_DEBUG=y and > CONFIG_SLUB_DEBUG_ON=y, and see if there are any earlier messages? I could reproduce it but could you please let me know what would be the right tools I should use to catch the original oops? This is what I got but I doubt it will be helpful: http://i.imgur.com/Mewi1hC.jpg Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote: > Include lkml in the CC: this time... *sigh* > ---8<--- > > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. > > The first four patches were written by Davidlohr Buesso, and > reduce the hold time of the semaphore lock. > > The last three patches change the sysv semaphore code locking > to be more fine grained, providing a performance boost when > multiple semaphores in a semaphore array are being manipulated > simultaneously. > > On a 24 CPU system, performance numbers with the semop-multi > test with N threads and N semaphores, look like this: > > vanilla Davidlohr's Davidlohr's + Davidlohr's + > threads patches rwlock patches v3 patches > 10610652 726325 1783589 2142206 > 20341570 365699 1520453 1977878 > 30288102 307037 1498167 2037995 > 40290714 305955 1612665 2256484 > 50288620 312890 1733453 2650292 > 60289987 306043 1649360 2388008 > 70291298 306347 1723167 2717486 > 80290948 305662 1729545 2763582 > 90290996 306680 1736021 2757524 > 100 292243 306700 1773700 3059159 > Some results with semop-multi on my 4 core laptop: vanilla v3 patchset threads 10 5094473 10289146 20 5079946 10187923 30 5041258 10660635 40 4942786 10876009 50 5076437 10759434 60 5139024 10797032 70 5103811 10698323 80 5094850 9959675 90 5085774 10054844 100 4939547 9798291 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Fri, Mar 22, 2013 at 4:04 AM, Emmanuel Benisty wrote: > > I was trying your patchset and my machine died while building a > package. I could reproduce the bug the (only) two times I tried. > There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg Hmm. The original oops may well have scrolled off the window, what remains is just indicating something is corrupted in slab. Which isn't likely to give much hints.. Can you try to reproduce this with CONFIG_SLUB_DEBUG=y and CONFIG_SLUB_DEBUG_ON=y, and see if there are any earlier messages? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
Hi Rik, On Thu, Mar 21, 2013 at 2:55 AM, Rik van Riel wrote: > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. I was trying your patchset and my machine died while building a package. I could reproduce the bug the (only) two times I tried. There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg Thanks. -- Emmanuel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote: > Include lkml in the CC: this time... *sigh* > ---8<--- > > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. > > The first four patches were written by Davidlohr Buesso, and > reduce the hold time of the semaphore lock. > > The last three patches change the sysv semaphore code locking > to be more fine grained, providing a performance boost when > multiple semaphores in a semaphore array are being manipulated > simultaneously. > > On a 24 CPU system, performance numbers with the semop-multi > test with N threads and N semaphores, look like this: > > vanilla Davidlohr's Davidlohr's + Davidlohr's + > threads patches rwlock patches v3 patches > 10610652 726325 1783589 2142206 > 20341570 365699 1520453 1977878 > 30288102 307037 1498167 2037995 > 40290714 305955 1612665 2256484 > 50288620 312890 1733453 2650292 > 60289987 306043 1649360 2388008 > 70291298 306347 1723167 2717486 > 80290948 305662 1729545 2763582 > 90290996 306680 1736021 2757524 > 100 292243 306700 1773700 3059159 Hi Rik, I plugged this set into enterprise -rt kernel, and beat on four boxen. I ran into no trouble while giving boxen a generic drubbing fwtw. Some semop-multi -rt numbers for an abby-normal 8 node box, and a mundane 4 node box below. 32 cores+SMT, 3.0.68-rt92 numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39 node 0 size: 32733 MB node 0 free: 29910 MB node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47 node 1 size: 32768 MB node 1 free: 30396 MB node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 node 2 size: 32768 MB node 2 free: 30568 MB node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63 node 3 size: 32767 MB node 3 free: 28706 MB node distances: node 0 1 2 3 0: 10 21 21 21 1: 21 10 21 21 2: 21 21 10 21 3: 21 21 21 10 SCHED_OTHER -v3 set+v3 set threads 10 4384851744599 20 3764111580813 30 3966391546841 40 4490622152861 50 4775942431344 60 4464531874476 70 5787052047884 80 6079282144898 90 6621362171074 100 6228892295879 200 7096682867273 300 6619023008695 400 6417583273250 500 6141173403775 SCHED_FIFO -v3 set+v3 set threads 10 158656 914343 20 99784 1133775 30 84245 1099604 40 89725 1756577 50 85697 1607893 60 84033 1467272 70 86833 1979405 80 91451 1922250 90 92484 1990699 100 90691 2067705 200103692 2308652 300101161 2376921 400103722 2417580 500108838 2443349 64 core box (poor thing, smi free zone though), 3.0.68-rt92 numactl --hardware available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 0 size: 8181 MB node 0 free: 6309 MB node distances: node 0 0: 10 SCHED_OTHER -v3 set+v3 set threads 10 6775342304417 20 4515071873474 30 3568761542688 40 3295851500392 50 4157611318676 60 4034821380644 70 3940891185165 80 4074081191834 90 4457641249156 100 4308231245573 200 4254701421686 300 4270921480379 400 4979001516599 500 4219271551309 SCHED_FIFO 10 3235601882543 20 2262571806862 30 1878511263163 40 205337 881331 50 196806 76 60 193218 612709 70 2094321241797 80 2404451269146 90 2198651482649 100 2279701473038 200 2013541719977 300 1837281823074 400 1750511808792 500 2437081849803 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/21/2013 09:23 PM, Linus Torvalds wrote: On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso wrote: ipc lock contention: 100 users: 8,74% (vanilla)3.17% (v3 patchset) 400 users: 21,86% (vanilla)5.23% (v3 patchset) 800 users 84,35% (vanilla)7.39% (v3 patchset) Ok, I'd call that pretty much "solved". Sure, it's still visible, but for being a benchmark that apparently does little else than pound on those sysv semaphores, I think we can consider it pretty much fine. I'm going to assume that anybody who actually then does any real work (ie a database) is never going to see even close to this bad contention. Good job, Rik. I'm assuming we'll be merging this during the 3.10 merge window, and hopefully the merge conflicts will be sorted out too. Rik, Peter, can you look at each others patches and see if you can get that sorted out for Andrew? Will do. I will rebase this series on top of what is in linux-next. -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On 03/21/2013 06:01 PM, Andrew Morton wrote: On Thu, 21 Mar 2013 17:50:05 -0400 Peter Hurley wrote: On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote: On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel wrote: This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. The first four patches were written by Davidlohr Buesso, and reduce the hold time of the semaphore lock. The last three patches change the sysv semaphore code locking to be more fine grained, providing a performance boost when multiple semaphores in a semaphore array are being manipulated simultaneously. These patches conflict pretty badly with Peter's: On one point I'm a little confused: my series has been in linux-next for a while. On what tree is this series based? It'll be based on mainline. People often forget to peek into linux-next when preparing patches. In the great majority of cases that's OK. Occasionally, we lose... I'll be happy to rebase the series onto linux-next, to make merging easier for you. -- All rights reversed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso wrote: > > ipc lock contention: > 100 users: 8,74% (vanilla)3.17% (v3 patchset) > 400 users: 21,86% (vanilla)5.23% (v3 patchset) > 800 users 84,35% (vanilla)7.39% (v3 patchset) Ok, I'd call that pretty much "solved". Sure, it's still visible, but for being a benchmark that apparently does little else than pound on those sysv semaphores, I think we can consider it pretty much fine. I'm going to assume that anybody who actually then does any real work (ie a database) is never going to see even close to this bad contention. Good job, Rik. I'm assuming we'll be merging this during the 3.10 merge window, and hopefully the merge conflicts will be sorted out too. Rik, Peter, can you look at each others patches and see if you can get that sorted out for Andrew? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote: > Include lkml in the CC: this time... *sigh* > ---8<--- > > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. > > The first four patches were written by Davidlohr Buesso, and > reduce the hold time of the semaphore lock. > > The last three patches change the sysv semaphore code locking > to be more fine grained, providing a performance boost when > multiple semaphores in a semaphore array are being manipulated > simultaneously. > > On a 24 CPU system, performance numbers with the semop-multi > test with N threads and N semaphores, look like this: > > vanilla Davidlohr's Davidlohr's + Davidlohr's + > threads patches rwlock patches v3 patches > 10610652 726325 1783589 2142206 > 20341570 365699 1520453 1977878 > 30288102 307037 1498167 2037995 > 40290714 305955 1612665 2256484 > 50288620 312890 1733453 2650292 > 60289987 306043 1649360 2388008 > 70291298 306347 1723167 2717486 > 80290948 305662 1729545 2763582 > 90290996 306680 1736021 2757524 > 100 292243 306700 1773700 3059159 > After testing these patches with my Oracle Swingbench DSS workload, I can say that there are significant improvements. The ipc lock contention was reduced drastically, specially with higher amounts of benchmark users. As a result, the overall %sys time went down as well. Furthermore, throughput (in transactions per second) was increased. TPS: 100 users: 1257.21 (vanilla)2805.06 (v3 patchset) 400 users: 1437.57 (vanilla)2664.67 (v3 patchset) 800 users: 1236.89 (vanilla)2750.73 (v3 patchset) ipc lock contention: 100 users: 8,74% (vanilla)3.17% (v3 patchset) 400 users: 21,86% (vanilla)5.23% (v3 patchset) 800 users 84,35% (vanilla)7.39% (v3 patchset) As seen with perf, the ipc lock isn't even the main source of contention anymore. Also, no matter how many benchmark users, the lock's user is mostly semctl_main() . 100 users: 3.17% oracle [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--50.53%-- sem_lock | | | |--82.60%-- semctl_main | --17.40%-- sys_semtimedop 400 users: 5.23% oracle [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--75.81%-- sem_lock | | | |--94.09%-- semctl_main | --5.91%-- sys_semtimedop 800 users: 7.39% oracle [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--81.71%-- sem_lock | | | |--64.98%-- semctl_main | --35.02%-- sys_semtimedop Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Thu, 21 Mar 2013 17:50:05 -0400 Peter Hurley wrote: > On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote: > > On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel wrote: > > > > > This series makes the sysv semaphore code more scalable, > > > by reducing the time the semaphore lock is held, and making > > > the locking more scalable for semaphore arrays with multiple > > > semaphores. > > > > > > The first four patches were written by Davidlohr Buesso, and > > > reduce the hold time of the semaphore lock. > > > > > > The last three patches change the sysv semaphore code locking > > > to be more fine grained, providing a performance boost when > > > multiple semaphores in a semaphore array are being manipulated > > > simultaneously. > > > > These patches conflict pretty badly with Peter's: > > On one point I'm a little confused: my series has been in linux-next for > a while. On what tree is this series based? It'll be based on mainline. People often forget to peek into linux-next when preparing patches. In the great majority of cases that's OK. Occasionally, we lose... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote: > On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel wrote: > > > This series makes the sysv semaphore code more scalable, > > by reducing the time the semaphore lock is held, and making > > the locking more scalable for semaphore arrays with multiple > > semaphores. > > > > The first four patches were written by Davidlohr Buesso, and > > reduce the hold time of the semaphore lock. > > > > The last three patches change the sysv semaphore code locking > > to be more fine grained, providing a performance boost when > > multiple semaphores in a semaphore array are being manipulated > > simultaneously. > > These patches conflict pretty badly with Peter's: On one point I'm a little confused: my series has been in linux-next for a while. On what tree is this series based? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote: > On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel wrote: > > > This series makes the sysv semaphore code more scalable, > > by reducing the time the semaphore lock is held, and making > > the locking more scalable for semaphore arrays with multiple > > semaphores. > > > > The first four patches were written by Davidlohr Buesso, and > > reduce the hold time of the semaphore lock. > > > > The last three patches change the sysv semaphore code locking > > to be more fine grained, providing a performance boost when > > multiple semaphores in a semaphore array are being manipulated > > simultaneously. > > These patches conflict pretty badly with Peter's: > > ipc-clamp-with-min.patch > ipc-separate-msg-allocation-from-userspace-copy.patch > ipc-tighten-msg-copy-loops.patch > ipc-set-efault-as-default-error-in-load_msg.patch > ipc-remove-msg-handling-from-queue-scan.patch > ipc-implement-msg_copy-as-a-new-receive-mode.patch > ipc-simplify-msg-list-search.patch > ipc-refactor-msg-list-search-into-separate-function.patch > ipc-refactor-msg-list-search-into-separate-function-fix.patch > > (they're at http://ozlabs.org/~akpm/mmots/broken-out/) > > > > We're in a bit of a mess at present. > > Last month Peter sent a ten-patch series which fixed an oops (Subject: > "ipc MSG_COPY fixes"). The series did other stuff, so we merged into > mainline just two bugfix patches. Then davej hit a trinity oops > (Subject: "ipc/testmsg GPF.") and testing confirmed that the remaining > eight patches in Peter's series fixed that up. Just to clarify the history. A while back on linux-next both Dave and I were hitting ipc oopses on trinity, which was fallout from the MSG_COPY implementation. One of those oopses was the ipc/testmsg oops. I was trying to push out the new tty ldisc patchset and trinity is a decent tool for exploring race conditions (which the tty layer was full of), so the oopses were in my way. Because it was nearly impossible to static analyze the existing ipc code, I just started cleaning up. As I did, I found the two by-then-obvious bug fixes. Also, assigning the 'msg' variable in the search loop looked suspicious so I factored that search loop out as well. That was the whole 10-patch "ipc MSG_COPY fixes" series. I wasn't getting the ipc/testmsg bug anymore and I let Dave know, so all was good. When Andrew asked me if the whole series needed to go into 3.9, I said I didn't know. So when just the 2 of 10 patches went in, Dave started to hit the ipc/testmsg bug again on 3.9. > Nobody has yet > identified which of the eight does the good deed. So we need to > either: > > a) revert the original two fixes "ipc: fix potential oops when src >msg > 4k w/ MSG_COPY" and "ipc: don't allocate a copy larger than >max" or > > b) try reverting "ipc: don't allocate a copy larger than max", as >"ipc: fix potential oops when src msg > 4k w/ MSG_COPY" looks pretty >harmless or > > c) work out which of the remaining 8 fixes the new oops and merge that or > > d) merge all 8 fixes or > > e) something else. The fix is in the other 8 patches. But I have to say, the base code was such a mess I have no idea which of the other 8 patches fixes the ipc/testmsg oops or how. I guarantee reverting those 2 fixes from my series will not make the ipc/testmsg oops go away. > Whichever way we go, we should get a wiggle on - this has been hanging > around for too long. Dave, do you have time to determine whether > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger > than max") fixes things up? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel wrote: > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. > > The first four patches were written by Davidlohr Buesso, and > reduce the hold time of the semaphore lock. > > The last three patches change the sysv semaphore code locking > to be more fine grained, providing a performance boost when > multiple semaphores in a semaphore array are being manipulated > simultaneously. These patches conflict pretty badly with Peter's: ipc-clamp-with-min.patch ipc-separate-msg-allocation-from-userspace-copy.patch ipc-tighten-msg-copy-loops.patch ipc-set-efault-as-default-error-in-load_msg.patch ipc-remove-msg-handling-from-queue-scan.patch ipc-implement-msg_copy-as-a-new-receive-mode.patch ipc-simplify-msg-list-search.patch ipc-refactor-msg-list-search-into-separate-function.patch ipc-refactor-msg-list-search-into-separate-function-fix.patch (they're at http://ozlabs.org/~akpm/mmots/broken-out/) We're in a bit of a mess at present. Last month Peter sent a ten-patch series which fixed an oops (Subject: "ipc MSG_COPY fixes"). The series did other stuff, so we merged into mainline just two bugfix patches. Then davej hit a trinity oops (Subject: "ipc/testmsg GPF.") and testing confirmed that the remaining eight patches in Peter's series fixed that up. Nobody has yet identified which of the eight does the good deed. So we need to either: a) revert the original two fixes "ipc: fix potential oops when src msg > 4k w/ MSG_COPY" and "ipc: don't allocate a copy larger than max" or b) try reverting "ipc: don't allocate a copy larger than max", as "ipc: fix potential oops when src msg > 4k w/ MSG_COPY" looks pretty harmless or c) work out which of the remaining 8 fixes the new oops and merge that or d) merge all 8 fixes or e) something else. Whichever way we go, we should get a wiggle on - this has been hanging around for too long. Dave, do you have time to determine whether reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger than max") fixes things up? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Wed, 2013-03-20 at 13:49 -0700, Linus Torvalds wrote: > On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel wrote: > > > > This series makes the sysv semaphore code more scalable, > > by reducing the time the semaphore lock is held, and making > > the locking more scalable for semaphore arrays with multiple > > semaphores. > > The series looks sane to me, and I like how each individual step is > pretty small and makes sense. > > It *would* be lovely to see this run with the actual Swingbench > numbers. The microbenchmark always looked much nicer. Do the > additional multi-semaphore scalability patches on top of Davidlohr's > patches help with the swingbench issue, or are we still totally > swamped by the ipc lock there? Yes, I'm testing this patchset with my swingbench workloads. I should have some numbers by today or tomorrow. > > Maybe there were already numbers for that, but the last swingbench > numbers I can actually recall was from before the finer-grained > locking.. Right, I couldn't get Oracle to run on the with the previous patches, hopefully the bug(s) are now addressed. Thanks, Davidlohr -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Wed, Mar 20, 2013 at 1:49 PM, Linus Torvalds wrote: > > It *would* be lovely to see this run with the actual Swingbench > numbers. The microbenchmark always looked much nicer. Do the > additional multi-semaphore scalability patches on top of Davidlohr's > patches help with the swingbench issue, or are we still totally > swamped by the ipc lock there? > > Maybe there were already numbers for that, but the last swingbench > numbers I can actually recall was from before the finer-grained > locking.. Ok, and if the spinlock is still a big deal even with the finer granularity, it might be interesting to hear if Michel's fast locks make a difference. I'm guessing that this series might actually make it easier/cleaner to do the semaphore locking using another lock, since the ipc_lock got split up and out... I think Michel did it for some socket code too. I think that was fairly independent and was for netperf. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ipc,sem: sysv semaphore scalability
On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel wrote: > > This series makes the sysv semaphore code more scalable, > by reducing the time the semaphore lock is held, and making > the locking more scalable for semaphore arrays with multiple > semaphores. The series looks sane to me, and I like how each individual step is pretty small and makes sense. It *would* be lovely to see this run with the actual Swingbench numbers. The microbenchmark always looked much nicer. Do the additional multi-semaphore scalability patches on top of Davidlohr's patches help with the swingbench issue, or are we still totally swamped by the ipc lock there? Maybe there were already numbers for that, but the last swingbench numbers I can actually recall was from before the finer-grained locking.. And obviously, getting this tested so that there aren't any more missed wakeups etc would be lovely. I'm assuming the plan is that this all goes through Andrew? Do we have big semop users who could test it on real loads? Considering that I *suspect* the main users are things like Oracle etc, I'd assume that there's some RH lab or partner or similar that is interested in making sure this not only helps, but also that it doesn't break anything ;) Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
ipc,sem: sysv semaphore scalability
Include lkml in the CC: this time... *sigh* ---8<--- This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. The first four patches were written by Davidlohr Buesso, and reduce the hold time of the semaphore lock. The last three patches change the sysv semaphore code locking to be more fine grained, providing a performance boost when multiple semaphores in a semaphore array are being manipulated simultaneously. On a 24 CPU system, performance numbers with the semop-multi test with N threads and N semaphores, look like this: vanilla Davidlohr's Davidlohr's + Davidlohr's + threads patches rwlock patches v3 patches 10 610652 726325 1783589 2142206 20 341570 365699 1520453 1977878 30 288102 307037 1498167 2037995 40 290714 305955 1612665 2256484 50 288620 312890 1733453 2650292 60 289987 306043 1649360 2388008 70 291298 306347 1723167 2717486 80 290948 305662 1729545 2763582 90 290996 306680 1736021 2757524 100 292243 306700 1773700 3059159 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/