Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sat, Feb 15, 2014 at 03:50:29PM -0800, Linus Torvalds wrote: > [ Added linux-mm to the participants list ] > > On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner wrote: > > > > Dave, the patch below should chop off the stack usage from > > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue. > > Can you given this a run? > > Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn, > people, your parents were some seriously boring people, were they not? > We've got too many Dave's around), It's an exclusive club - we have 'kernel hacker Dave' reunions in bars around the world. We should get some tshirts made up :) > but DaveC earlier pointed out that > pretty much any memory allocation path can end up using 3kB of stack > even without XFS being involved. > > Which does bring up the question whether we should look (once more) at > the VM direct-reclaim path, and try to avoid GFP_FS/IO direct > reclaim.. We do that mostly already, but GFP_KERNEL allows swap IO and that's where the deepest stack I saw came from. Even if we don't allow IO at all, we're still going to see stack usage of 2-2.5k in direct reclaim. e.g. invalidate a page and enter the rmap code. The rmap is protected by a mutex, so if we fail to get that we have about 1.2k of stack consumed from there and that is on top of the the allocator/reclaim that has already consumes ~1k of stack... > Direct reclaim historically used to be an important throttling > mechanism, and I used to not be a fan of trying to avoid direct > reclaim. But the stack depth issue really looks to be pretty bad, and > I think we've gotten better at throttling explicitly, so.. > > I *think* we already limit filesystem writeback to just kswapd (in > shrink_page_list()), but DaveC posted a backtrace that goes through > do_try_to_free_pages() to shrink_slab(), and through there to the > filesystem and then IO. That looked like a disaster. Right, that's an XFS problem, and I'm working on fixing it. The Patch I sent to DaveJ fixes the worst case, but I need to make it completely IO-less while still retaining the throttling the IO gives us. > And that's because (if I read things right) shrink_page_list() limits > filesystem page writeback to kswapd, but not swap pages. Which I think > probably made more sense back in the days than it does now (I > certainly *hope* that swapping is less important today than it was, > say, ten years ago) > > So I'm wondering whether we should remove that page_is_file_cache() > check from shrink_page_list()? The thing is, the stack usage from the swap IO path is pretty well bound - it's just the worst case stack of issuing IO. We know it won't recurse into direct reclaim, so mempool allocation and blocking is all we need to consider. Compare that to a filesystem which may need to allocate extents and hence do transactions and split btrees and read metadata and allocate large amounts of memory even before it gets to the IO layers. Hence I suspect that we could do a simple thing like only allow swap if there's more than half the stack available in the current reclaim context. Because, let's face it, if the submit_bio path is consuming more than half the available stack then we're totally screwed from a filesystem perspective > And then there is that whole shrink_slab() case... I think with shrinkers we just need to be more careful. The XFS behaviour is all my fault, and I should have known better that to design code that requires IO in the direct reclaim path. :/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Mon, Feb 17, 2014 at 06:46:48PM +0100, Oleg Nesterov wrote: > > pulling the logics *out* of kernel/signal.c and into arch/*. > > Not really, I think. How so? You propose to make all architectures call do_coredump() themselves, instead of having it done centrally. It's more boilerplate for all of them to get wrong. And IME any boilerplate in that are *will* be gotten wrong - I've gone through signal handling on all architectures more than a few times and every time it caught a pile of bugs... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/17, Al Viro wrote: > > On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote: > > > Looks like, this is all is really nasty. Actually, I think siginfo on > > stack is not that bad if we are going to do handle_signal() or restart, > > perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump(). > > Something like below. > > Yecc... You've just broken every architecture other than x86, Of course, this is only to explain what I meant. > and to > fix them you'll need to massage every get_signal()/get_signal_to_deliver() > user out there, Yes. > pulling the logics *out* of kernel/signal.c and into arch/*. Not really, I think. Of course this change should be cleanuped. And it should not require to change all architectures at once. > This is just plain wrong. I agree, this change is also ugly. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote: > Looks like, this is all is really nasty. Actually, I think siginfo on > stack is not that bad if we are going to do handle_signal() or restart, > perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump(). > Something like below. Yecc... You've just broken every architecture other than x86, and to fix them you'll need to massage every get_signal()/get_signal_to_deliver() user out there, pulling the logics *out* of kernel/signal.c and into arch/*. This is just plain wrong. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/15, Oleg Nesterov wrote: > > On 02/15, Al Viro wrote: > > > > On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote: > > > > So basically we want a different condition for "can we just go ahead and > > > > free that sucker", right? Instead of "it's on the list, shan't free it" > > > > it ought to be something like "it's on the list or it is referenced by > > > > ksiginfo". Locking will be interesting, though... ;-/ > > > > > > I guess yes... send_sigqueue() checks list_empty() too, probably nobody > > > else. > > > > The trouble being, we might end up with > > Q picked by collect_signal and and stuff into ksiginfo > > Q resubmitted by timer code > > In this case the timer code should simply inc ->si_overrun and do nothing. > > IOW, list_empty() should be turned into is_queued(), and is_queued() > should be true until dismiss_siginfo() which should also do > do_schedule_next_timer(). I think. No, this is even more complicated. We also need do_schedule_next_timer() to calculate ->si_overrun we are going to report, I missed this... Looks like, this is all is really nasty. Actually, I think siginfo on stack is not that bad if we are going to do handle_signal() or restart, perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump(). Something like below. Oleg. diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 9e5de68..52f16f9 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -684,6 +684,52 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs) test_thread_flag(TIF_IA32) ? __NR_ia32_restart_syscall : __NR_restart_syscall #endif /* CONFIG_X86_32 */ +static noinline int process_signal(struct pt_regs *regs, siginfo_t **pinfo) +{ + struct ksignal ksig; + + switch (get_signal(&ksig)) { + case SIG_DUMP: + *pinfo = kmalloc(sizeof(siginfo_t), GFP_KERNEL); + if (*pinfo) + copy_siginfo(*pinfo, &ksig.info); + + case SIG_EXIT: + return ksig.info.si_signo; + + default: + handle_signal(&ksig, regs); + break; + + case 0: + /* Did we come from a system call? */ + if (syscall_get_nr(current, regs) >= 0) { + /* Restart the system call - no handlers present */ + switch (syscall_get_error(current, regs)) { + case -ERESTARTNOHAND: + case -ERESTARTSYS: + case -ERESTARTNOINTR: + regs->ax = regs->orig_ax; + regs->ip -= 2; + break; + + case -ERESTART_RESTARTBLOCK: + regs->ax = NR_restart_syscall; + regs->ip -= 2; + break; + } + } + + /* +* If there's no signal to deliver, we just put the saved sigmask +* back. +*/ + restore_saved_sigmask(); + } + + return 0; +} + /* * Note that 'init' is a special process: it doesn't get signals it doesn't * want to handle. Thus you cannot kill init even with a SIGKILL even by @@ -691,37 +737,16 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs) */ static void do_signal(struct pt_regs *regs) { - struct ksignal ksig; + siginfo_t *info = NULL; + int sig = process_signal(regs, &info); - if (get_signal(&ksig)) { - /* Whee! Actually deliver the signal. */ - handle_signal(&ksig, regs); - return; - } - - /* Did we come from a system call? */ - if (syscall_get_nr(current, regs) >= 0) { - /* Restart the system call - no handlers present */ - switch (syscall_get_error(current, regs)) { - case -ERESTARTNOHAND: - case -ERESTARTSYS: - case -ERESTARTNOINTR: - regs->ax = regs->orig_ax; - regs->ip -= 2; - break; - - case -ERESTART_RESTARTBLOCK: - regs->ax = NR_restart_syscall; - regs->ip -= 2; - break; + if (sig) { + if (info) { + do_coredump(info); + kfree(info); } + do_group_exit(sig); } - - /* -* If there's no signal to deliver, we just put the saved sigmask -* back. -*/ - restore_saved_sigmask(); } /* diff --git a/include/linux/signal.h b/include/linux/signal.h index 2ac423b..33b5e04 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -285,6 +285,9 @@ struct ksignal { int sig; }; +#define SIG_EXIT -1 +#define SIG_DUMP -2 + extern int get_signa
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
[ Added linux-mm to the participants list ] On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner wrote: > > Dave, the patch below should chop off the stack usage from > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue. > Can you given this a run? Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn, people, your parents were some seriously boring people, were they not? We've got too many Dave's around), but DaveC earlier pointed out that pretty much any memory allocation path can end up using 3kB of stack even without XFS being involved. Which does bring up the question whether we should look (once more) at the VM direct-reclaim path, and try to avoid GFP_FS/IO direct reclaim.. Direct reclaim historically used to be an important throttling mechanism, and I used to not be a fan of trying to avoid direct reclaim. But the stack depth issue really looks to be pretty bad, and I think we've gotten better at throttling explicitly, so.. I *think* we already limit filesystem writeback to just kswapd (in shrink_page_list()), but DaveC posted a backtrace that goes through do_try_to_free_pages() to shrink_slab(), and through there to the filesystem and then IO. That looked like a disaster. And that's because (if I read things right) shrink_page_list() limits filesystem page writeback to kswapd, but not swap pages. Which I think probably made more sense back in the days than it does now (I certainly *hope* that swapping is less important today than it was, say, ten years ago) So I'm wondering whether we should remove that page_is_file_cache() check from shrink_page_list()? And then there is that whole shrink_slab() case... Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sat, Feb 15, 2014 at 2:28 PM, Dave Jones wrote: > > I've got a shitload of debug options enabled, which may explain it. > Or perhaps that new STACK_PROTECTOR_STRONG stuff ? Well, a lot of it is just the callee-saved registers. The compiler will tend to preferentially allocate registers in the callee-trashed registers, but if the function isn't a leaf function, any registers that are live around a function call will have to be saved somewhere - either explicitly around the function call, or - more likely - in callee-saved registers that then get saved in the prologue/epilogue of the function. And this will happen even in leaf functions when there is enough register pressure that the callee-trashed registers aren't sufficient (which is pretty common). So saving 5-6 registers on the stack (in addition to any actual stack frame) is pretty much the norm for anything but the very simplest cases. But yeah, I'm sure some config options make it worse. STACK_PROTECTOR_STRONG could easily be one of those. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sun, Feb 16, 2014 at 09:23:56AM +1100, Dave Chinner wrote: > There's a pretty massive difference between the actual stack usage > of the local variables and the amount of stack being used by the > compiled code. > > What it appears to be is that the compiler is pushing 6-10 registers > to the stack on every function call. So a function that only has 3 > local variables and does very little but allocate a structure and > call other functions saves an 6 registers to the stack before it > starts: I've got a shitload of debug options enabled, which may explain it. Or perhaps that new STACK_PROTECTOR_STRONG stuff ? Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Fri, Feb 14, 2014 at 11:01:23AM -0500, Dave Jones wrote: > On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote: > > > > I can fix this one easily - we already have a workqueue for doing > > > async log pushes (will split the stack between xlog_cil_force_lsn > > > and xlog_cil_push), but the reason we haven't used it for synchronous > > > log forces is that screws up fsync performance on CFQ. We don't > > > recommend CFQ with XFS anyway, so I think I'll make this change > > > anyway. > > > > Dave, the patch below should chop off the stack usage from > > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue. > > Can you given this a run? > > Looks like it's survived an overnight run.. Great. One thing that has been puzzling me is why I'm seeing stack usage reported that is way above what we declare locally on the stack. e.g. 29) 2048 224 xfs_da_grow_inode_int+0xbb/0x340 30) 1824 96 xfs_dir2_grow_inode+0x63/0x110 31) 1728 208 xfs_dir2_sf_to_block+0xe7/0x5e0 32) 1520 144 xfs_dir2_sf_addname+0x115/0x5c0 33) 1376 96 xfs_dir_createname+0x164/0x1a0 34) 1280 224 xfs_create+0x536/0x660 35) 1056 128 xfs_vn_mknod+0xc8/0x1d0 I pulled 128 bytes out of xfs_dir_createname by allocating the xfs_da_args structure, but I still couldn't reconcile the amount of stack use being reported with the amount used by locally declared variables (even considering in-lined leaf functions). For example: Locally DeclaredUsed function 88 224 xfs_da_grow_inode_int 32 96 xfs_dir2_grow_inode 168 208 xfs_dir2_sf_to_block 56 144 xfs_dir2_sf_addname 16 96 xfs_dir_createname 120 224 xfs_create 52 128 xfs_vn_mknod There's a pretty massive difference between the actual stack usage of the local variables and the amount of stack being used by the compiled code. What it appears to be is that the compiler is pushing 6-10 registers to the stack on every function call. So a function that only has 3 local variables and does very little but allocate a structure and call other functions saves an 6 registers to the stack before it starts: Dump of assembler code for function xfs_dir_createname: 214 { 0x814d7380 <+0>: callq 0x81cf0940 0x814d7385 <+5>: push %rbp 0x814d7386 <+6>: mov%rsp,%rbp 0x814d7389 <+9>: sub$0x50,%rsp 0x814d738d <+13>:mov%rbx,-0x28(%rbp) 0x814d7391 <+17>:mov%rdi,%rbx 0x814d7394 <+20>:mov%r12,-0x20(%rbp) 0x814d7398 <+24>:mov%rcx,%r12 0x814d739b <+27>:mov%r13,-0x18(%rbp) 0x814d739f <+31>:mov%rsi,%r13 0x814d73a5 <+37>:mov%r14,-0x10(%rbp) 0x814d73a9 <+41>:mov%rdx,%r14 0x814d73ac <+44>:mov%r15,-0x8(%rbp) 0x814d73b0 <+48>:mov%r8,%r15 0x814d73b7 <+55>:mov%r9,-0x48(%rbp) . If this is typical across the call chain (appears to be from my quick survey) then we are averaging 40-50 bytes of stack per call for saved registers. That's a lot of stack space we can't directly control the usage of, especially when we are talking about call chains that can get 50+ functions deep... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/15, Al Viro wrote: > > On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote: > > > So basically we want a different condition for "can we just go ahead and > > > free that sucker", right? Instead of "it's on the list, shan't free it" > > > it ought to be something like "it's on the list or it is referenced by > > > ksiginfo". Locking will be interesting, though... ;-/ > > > > I guess yes... send_sigqueue() checks list_empty() too, probably nobody > > else. > > The trouble being, we might end up with > Q picked by collect_signal and and stuff into ksiginfo > Q resubmitted by timer code In this case the timer code should simply inc ->si_overrun and do nothing. IOW, list_empty() should be turned into is_queued(), and is_queued() should be true until dismiss_siginfo() which should also do do_schedule_next_timer(). I think. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote: > > So basically we want a different condition for "can we just go ahead and > > free that sucker", right? Instead of "it's on the list, shan't free it" > > it ought to be something like "it's on the list or it is referenced by > > ksiginfo". Locking will be interesting, though... ;-/ > > I guess yes... send_sigqueue() checks list_empty() too, probably nobody else. The trouble being, we might end up with Q picked by collect_signal and and stuff into ksiginfo Q resubmitted by timer code Q picked by *another* thread into its ksiginfo the first thread finally done with signal and at that point we still have a reference in the second thread's ksiginfo. Hell knows - my first reflex in that kind of situation is to replace that flag with refcount, so that timer code would hold a reference from timer_create(2) to timer_delete(2), send_sigqueue() would bump it and dismiss_siginfo() - drop the sucker. But that means either grabbing siglock in dismiss_siginfo() or making the counter atomic; either way it's a cacheline ping-pong. Atomic counter is less painful in that respect - it would be right next to the list, so we dirty that cacheline anyway... I'm still trying another approach (slightly bigger ksiginfo used to store all variants with si_code >= 0), but it has messiness of its own; we'll see how it goes... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/15, Al Viro wrote: > > > Ouch... I think I see what you mean. Let me see if I got it right: > > timer->sigq is *not* freed by collect_signal(); it's done by > > release_posix_timer() instead, under siglock. Frankly, this > > /* > > * If it is queued it will be freed when dequeued, > > * like the "regular" sigqueue. > > */ > > if (!list_empty(&q->list)) > > q = NULL; > > in sigqueue_free() smells like it's asking for races. Sigh... This is protected by ->siglock, should be safe... > So basically we want a different condition for "can we just go ahead and > free that sucker", right? Instead of "it's on the list, shan't free it" > it ought to be something like "it's on the list or it is referenced by > ksiginfo". Locking will be interesting, though... ;-/ I guess yes... send_sigqueue() checks list_empty() too, probably nobody else. > BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO. > What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to > something blocked, shoves it back with PTRACE_SETSIGINFO and does > PTRACE_CONT with that new signal number? Would we get two sigqueue instances > with the same ->si_tid, one of them matching the timer->sigq and another > - not? Or the task sends a SI_TIMER info to itself via sys_rt_sigqueueinfo(). Afaics, nothing really bad can happen, I mean the kernel should not crash or something like this. do_schedule_next_timer() can be fooled, but at least lock_timer() can only succeed if this process actually has a timer with the same timer_id. This sigqueue != timer->sigq, but I think this doesn't matter, posix_timer_event() will use timer->sigq anyway. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sat, Feb 15, 2014 at 03:58:38PM +, Al Viro wrote: > BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO. > What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to > something blocked, shoves it back with PTRACE_SETSIGINFO and does > PTRACE_CONT with that new signal number? Would we get two sigqueue instances > with the same ->si_tid, one of them matching the timer->sigq and another > - not? I wonder if it would be simpler to use the pointer *only* for si_code < 0 case. It still gives us ksiginfo_t much smaller than siginfo_t, avoids all the mess with timers and AFAICS results in less intrusive patch. IOW, the split between "we know what that sucker is" and "completely opaque shit the userland has given us". I'll try to put together something along those lines and see how well does that work... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sat, Feb 15, 2014 at 03:36:31PM +, Al Viro wrote: > On Sat, Feb 15, 2014 at 03:22:51PM +, Al Viro wrote: > > On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote: > > > > > 1. info->q can be already freed if SIGQUEUE_PREALLOC. > > > > > >Once get_signal_to_deliver() or any other caller drops ->siglock > > >another thread can do sys_timer_delete()->sigqueue_free(). > > > > How the devil would it find the sucker? It's off the list already. > > Ouch... I think I see what you mean. Let me see if I got it right: > timer->sigq is *not* freed by collect_signal(); it's done by > release_posix_timer() instead, under siglock. Frankly, this > /* > * If it is queued it will be freed when dequeued, > * like the "regular" sigqueue. > */ > if (!list_empty(&q->list)) > q = NULL; > in sigqueue_free() smells like it's asking for races. Sigh... So basically we want a different condition for "can we just go ahead and free that sucker", right? Instead of "it's on the list, shan't free it" it ought to be something like "it's on the list or it is referenced by ksiginfo". Locking will be interesting, though... ;-/ BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO. What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to something blocked, shoves it back with PTRACE_SETSIGINFO and does PTRACE_CONT with that new signal number? Would we get two sigqueue instances with the same ->si_tid, one of them matching the timer->sigq and another - not? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sat, Feb 15, 2014 at 03:22:51PM +, Al Viro wrote: > On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote: > > > 1. info->q can be already freed if SIGQUEUE_PREALLOC. > > > >Once get_signal_to_deliver() or any other caller drops ->siglock > >another thread can do sys_timer_delete()->sigqueue_free(). > > How the devil would it find the sucker? It's off the list already. Ouch... I think I see what you mean. Let me see if I got it right: timer->sigq is *not* freed by collect_signal(); it's done by release_posix_timer() instead, under siglock. Frankly, this /* * If it is queued it will be freed when dequeued, * like the "regular" sigqueue. */ if (!list_empty(&q->list)) q = NULL; in sigqueue_free() smells like it's asking for races. Sigh... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/15, Al Viro wrote: > > On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote: > > > 1. info->q can be already freed if SIGQUEUE_PREALLOC. > > > >Once get_signal_to_deliver() or any other caller drops ->siglock > >another thread can do sys_timer_delete()->sigqueue_free(). > > How the devil would it find the sucker? It simply frees the SIGQUEUE_PREALLOC sigqueue, k_itimer->sigq. > It's off the list already. Exactly, list_empty(q->list) == T. So release_posix_timer()->sigqueue_free() assumes we can safely free it. > > 2. We need to move do_schedule_next_timer() from dequeue_signal() > >here. > > > >Otherwise ->q can be reused/overwritten by the next send_sigqueue() > >right affter ->siglock is dropped. > > Ditto. We rip them out of queue on collect_signal(); Yes, and dequeue_signal()->do_schedule_next_timer() can trigger another send_sigqueue() which uses the same SIGQUEUE_PREALLOC sigqueue once we drop ->siglock. This is not that bad, but at least ->si_overrun can be overwritten before __setup_rt_frame()->copy_siginfo_to_user(). Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote: > 1. info->q can be already freed if SIGQUEUE_PREALLOC. > >Once get_signal_to_deliver() or any other caller drops ->siglock >another thread can do sys_timer_delete()->sigqueue_free(). How the devil would it find the sucker? It's off the list already. > 2. We need to move do_schedule_next_timer() from dequeue_signal() >here. > >Otherwise ->q can be reused/overwritten by the next send_sigqueue() >right affter ->siglock is dropped. Ditto. We rip them out of queue on collect_signal(); the only thing we do not do is actual __sigqueue_free(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/14, Christoph Hellwig wrote: > > Might aswell stick the discmiss into what was dequeue_signal_lock(). > Which at that point should get a saner name (maybe thread_dequeue_signal ?) > and lose all argument except maybe task_struct No, task_struct argument should die, I think. It is misleading. spin_lock(tsk->sighand->siglock) is simply wrong unless tsk == current. And dequeue_signal() assumes that tsk == current too, otherwise recalc_sigpending() is wrong. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/15, Al Viro wrote: > > OK, _very_ preliminary patch follows. It's uglier than it has to And I'm afraid it needs more uglifications... > +void dismiss_siginfo(ksiginfo_t *info) > +{ > + if (info->q) > + __sigqueue_free(info->q); > + info->q = NULL; > +} 1. info->q can be already freed if SIGQUEUE_PREALLOC. Once get_signal_to_deliver() or any other caller drops ->siglock another thread can do sys_timer_delete()->sigqueue_free(). 2. We need to move do_schedule_next_timer() from dequeue_signal() here. Otherwise ->q can be reused/overwritten by the next send_sigqueue() right affter ->siglock is dropped. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Thu, Feb 13, 2014 at 09:58:47AM -0800, Linus Torvalds wrote: > On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov wrote: > > > > And we should be careful with SIGQUEUE_PREALLOC, at least > > collect_signal() should not do list_del_init()... Plus we need to > > handle the SEND_SIG_FORCED-like case. > > I don't think the users need to care. They'd just call > "sigqueue_free()" not knowing about our preallocations etc. That kind > of detail should be confined to inside signal.c. > > But there really aren't that many users. There's a couple of > "dequeue_signal_lock()" users, but they don't actually *want* the > siginfo at all (they're kernel threads), so we can just make that > function free the siginfo immediately (and get rid of the totally > unnecessay kernel stack allocation). And outside of signal.c only > signalfd uses "dequeue_signal()" itself, and that would be the only > one that would need to be taught to use (in signalfd_copyinfo()) and > then free the sigqueue entry. > > So it really looks like the right thing to do, and fairly > straightforward. But I'm leaving the coding proof to Al, since he > already offered ;) OK, _very_ preliminary patch follows. It's uglier than it has to and it's not in the form I would consider suitable for merge. It doesn't depend on conversions of architectures to use of ksignal as prereqs; instead of that it has config symbol (ARCH_USES_KSIGNAL) that can be selected on architectures already through that conversion (in this patch - just on x86, since that was all I could test on at the moment; e.g. alpha and arm could also get such selects right now). If it isn't selected, everything builds as usual. With large siginfo on stack for signal path, as before. We get ksiginfo_t as an alias for siginfo_t in that case, dismiss_siginfo() is a no-op, etc. If it _is_ selected, ksiginfo_t is much smaller than siginfo_t. Of course, we pay for that with ifdefs in places that need to understand [k]siginfo_t guts; signalfd_copyinfo(), copy_siginfo_from_user{,32}() and copy_ksiginfo_to_user{,32}(), tracepoint for deliver_signal(), collect_signal(), trivial ones in dequeue_signal() and an ugly bugger in ptrace_signal(). No way around that until all architectures are converted. Other places get away with some siginfo_t replaced with ksiginfo_t and dismiss_siginfo() added in some places. arch/x86/ia32/ia32_signal.c needed a bit of change (we probably ought to unify copy_siginfo_from_user32() on all architectures that have it, actually). The sucker survived LTP syscall tests and hasn't died on xfstests yet either (== doesn't seem to leak horribly). Consider it a proof-of-concept and no more than that. One thing I'm really not satisfied with is signal sending pathway - it would probably make a lot of sense to use ksiginfo there as well, and have insertion into queue simply steal info->q instead of allocating and copying. Another is that ifdefs like that are tolerable only if we plan to convert everything during this cycle; I believe it to be doable, actually. Having that sucker go before the conversions would make for simpler logistics wrt architecture trees, but it may be better to get all conversions merged first. Not sure... Anyway, here is it; comments are welcome. Signed-off-by: Al Viro --- diff --git a/arch/Kconfig b/arch/Kconfig index 80bbb8c..f753561 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -522,4 +522,7 @@ config OLD_SIGACTION config COMPAT_OLD_SIGACTION bool +config ARCH_USES_KSIGNAL + bool + source "kernel/gcov/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 0af5250..4ec468d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -127,6 +127,7 @@ config X86 select HAVE_DEBUG_STACKOVERFLOW select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64 select HAVE_CC_STACKPROTECTOR + select ARCH_USES_KSIGNAL config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c index 2206757..611a03e 100644 --- a/arch/x86/ia32/ia32_signal.c +++ b/arch/x86/ia32/ia32_signal.c @@ -34,7 +34,7 @@ #include #include -int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from) +int __copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from) { int err = 0; bool ia32 = test_thread_flag(TIF_IA32); @@ -105,6 +105,25 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from) return err; } +int copy_siginfo_to_user32(compat_siginfo_t __user *to, const ksiginfo_t *from) +{ + int err = 0; + if (!from->q) { + if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t))) + return -EFAULT; + put_user_try { + /* fast case */ + put_user_ex(from->si_signo, &to->si_signo); + put_user_ex(0, &to->si_errno); + put_user_ex((short)from->si_code, &to->
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Fri, Feb 14, 2014 at 04:16:24PM +, Al Viro wrote: > > All of these have in common that they try to handle signals in a kernel > > thread (which we don't even allow by default), and that they ignore the > > siginfo. I think they could mostly be replaced by an addition to the > > kthread API to allow a kthread to be killed by signals for legacy > > reasons. > > FWIW, there's a funny situation - all users of dequeue_signal_lock() > actually ignore info completely. I'm not saying that we ought to > stop returning it, but e.g. jbd part of that patch is simply Might aswell stick the discmiss into what was dequeue_signal_lock(). Which at that point should get a saner name (maybe thread_dequeue_signal ?) and lose all argument except maybe task_struct - not that it's nessecary, but it would mirror the other functions usually used around it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Fri, Feb 14, 2014 at 04:16:24PM +, Al Viro wrote: > FWIW, there's a funny situation - all users of dequeue_signal_lock() > actually ignore info completely. I'm not saying that we ought to > stop returning it, but e.g. jbd part of that patch is simply s/jbd/jffs2/, obviously. Sorry... And yes, nbd and usb_storage are the same story. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Fri, Feb 14, 2014 at 08:13:02AM -0800, Christoph Hellwig wrote: > On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote: > > We'd have to teach each user of "dequeue_signal()" to free the siginfo > > thing. Which shouldn't be too bad - I think we've collected all of > > that into generic code, and there isn't the mass or architecture code > > that knows about these things any more. But there are a few odd > > drivers etc and signalfd. > > The few odd drivers are nbd, jffs2 and the usb mass storage gadget. > All of these have in common that they try to handle signals in a kernel > thread (which we don't even allow by default), and that they ignore the > siginfo. I think they could mostly be replaced by an addition to the > kthread API to allow a kthread to be killed by signals for legacy > reasons. FWIW, there's a funny situation - all users of dequeue_signal_lock() actually ignore info completely. I'm not saying that we ought to stop returning it, but e.g. jbd part of that patch is simply diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c index 2b60ce1..aefdff2 100644 --- a/fs/jffs2/background.c +++ b/fs/jffs2/background.c @@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c) /* Put_super will send a SIGKILL and then wait on the sem. */ while (signal_pending(current) || freezing(current)) { - siginfo_t info; + ksiginfo_t info; unsigned long signr; if (try_to_freeze()) goto again; signr = dequeue_signal_lock(current, ¤t->blocked, &info); + dismiss_siginfo(&info); switch(signr) { case SIGSTOP: Not complicated at all. Where it does get complicated is ->last_siginfo and PTRACE_SETSIGINFO - getting that reasonably clean is what I'm still fighting right now... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote: > We'd have to teach each user of "dequeue_signal()" to free the siginfo > thing. Which shouldn't be too bad - I think we've collected all of > that into generic code, and there isn't the mass or architecture code > that knows about these things any more. But there are a few odd > drivers etc and signalfd. The few odd drivers are nbd, jffs2 and the usb mass storage gadget. All of these have in common that they try to handle signals in a kernel thread (which we don't even allow by default), and that they ignore the siginfo. I think they could mostly be replaced by an addition to the kthread API to allow a kthread to be killed by signals for legacy reasons. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/14, Al Viro wrote: > > BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code > separately? IOW, why do we want the upper 16 bits of ->si_code exposed? > It used to be a strictly internal thing IIRC (it's been what, 2.3.late?) Yes, but checkpoint/restart tools want to dump/restore the internal part of ->si_code too. See also the "task_pid_vnr(current) == pid" hack in do_rt_sigqueueinfo(), this allows to queue a SI_FROMKERNEL() siginfo. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote: > > I can fix this one easily - we already have a workqueue for doing > > async log pushes (will split the stack between xlog_cil_force_lsn > > and xlog_cil_push), but the reason we haven't used it for synchronous > > log forces is that screws up fsync performance on CFQ. We don't > > recommend CFQ with XFS anyway, so I think I'll make this change > > anyway. > > Dave, the patch below should chop off the stack usage from > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue. > Can you given this a run? Looks like it's survived an overnight run.. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Fri, Feb 14, 2014 at 02:29:30PM +0100, Richard Weinberger wrote: > Am 14.02.2014 14:25, schrieb Christoph Hellwig: > > On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote: > >> On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote: > >> > >>> I'll try to put something along those lines together, if you or Oleg don't > >>> do it first. > >> > >> OK, having looked at that stuff... > >> > >> 1) things become much more compact if we finish conversion to get_signal() > >> first. > > > > I have vague memories that Richard sent out a series to convert over all > > architectures a while ago. Hopefully he has better memory than I do. > > Yeah. Sending v2 of that series is on my overflowing TODO list. :-\ > I think this is a good reason for me to start working on that series again. > Stay tuned. Would be great. I have several done here, but I'll be glad to replace them with something tested... BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code separately? IOW, why do we want the upper 16 bits of ->si_code exposed? It used to be a strictly internal thing IIRC (it's been what, 2.3.late?) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
Am 14.02.2014 14:25, schrieb Christoph Hellwig: > On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote: >> On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote: >> >>> I'll try to put something along those lines together, if you or Oleg don't >>> do it first. >> >> OK, having looked at that stuff... >> >> 1) things become much more compact if we finish conversion to get_signal() >> first. > > I have vague memories that Richard sent out a series to convert over all > architectures a while ago. Hopefully he has better memory than I do. Yeah. Sending v2 of that series is on my overflowing TODO list. :-\ I think this is a good reason for me to start working on that series again. Stay tuned. Thanks, //richard -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote: > On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote: > > > I'll try to put something along those lines together, if you or Oleg don't > > do it first. > > OK, having looked at that stuff... > > 1) things become much more compact if we finish conversion to get_signal() > first. I have vague memories that Richard sent out a series to convert over all architectures a while ago. Hopefully he has better memory than I do. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 06:18:29PM +1100, Dave Chinner wrote: > > It looks like just "do_signal()" has a stack frame that is about 230 > > bytes even under normal circumstancs (largely due to "struct ksignal" > > - which in turn is largely due to the insane 128-byte padding in > > siginfo_t). Add a few other frames in there, and I guess that if it > > was close before, the coredump path just makes it go off. > > Yup. But it's when you see this sort of thing you realise that > almost no GFP_KERNEL memory allocation is really safe from stack > overruns, though: > > 0) 6064 64 update_group_power+0x2c/0x270 > 1) 6000 384 find_busiest_group+0x10a/0x8b0 > 2) 5616 288 load_balance+0x165/0x870 > 3) 5328 96 idle_balance+0x106/0x1b0 > 4) 5232 112 __schedule+0x7b6/0x7e0 > 5) 5120 16 schedule+0x29/0x70 > 6) 5104 176 percpu_ida_alloc+0x1b3/0x3d0 > 7) 4928 32 blk_mq_wait_for_tags+0x1f/0x40 > 8) 4896 80 blk_mq_alloc_request_pinned+0x4e/0x110 > 9) 4816 128 blk_mq_make_request+0x42b/0x510 > 10) 4688 48 generic_make_request+0xc0/0x110 > 11) 4640 96 submit_bio+0x71/0x120 > 12) 4544 192 _xfs_buf_ioapply+0x2cc/0x3b0 > 13) 4352 48 xfs_buf_iorequest+0x6f/0xc0 > 14) 4304 32 xlog_bdstrat+0x23/0x60 > 15) 4272 96 xlog_sync+0x3a4/0x5c0 > 16) 4176 48 xlog_state_release_iclog+0x121/0x130 > 17) 4128 192 xlog_write+0x700/0x7c0 > 18) 3936 192 xlog_cil_push+0x2ae/0x3d0 > 19) 3744 128 xlog_cil_force_lsn+0x1b8/0x1e0 > 20) 3616 144 _xfs_log_force_lsn+0x7c/0x300 > 21) 3472 48 xfs_log_force_lsn+0x3b/0xa0 > 22) 3424 112 xfs_iunpin_wait+0xd7/0x160 > 23) 3312 80 xfs_reclaim_inode+0xd0/0x350 > 24) 3232 432 xfs_reclaim_inodes_ag+0x277/0x3d0 > 25) 2800 48 xfs_reclaim_inodes_nr+0x33/0x40 > 26) 2752 16 xfs_fs_free_cached_objects+0x15/0x20 > 27) 2736 80 super_cache_scan+0x169/0x170 > 28) 2656 160 shrink_slab_node+0x133/0x290 > 29) 2496 80 shrink_slab+0x122/0x160 > 30) 2416 112 do_try_to_free_pages+0x1de/0x360 > 31) 2304 192 try_to_free_pages+0x110/0x190 > 32) 2112 256 __alloc_pages_nodemask+0x5a2/0x8e0 > 33) 1856 80 alloc_pages_current+0xb2/0x170 > 34) 1776 64 new_slab+0x265/0x2e0 > 35) 1712 240 __slab_alloc+0x2fb/0x4c4 > 36) 1472 80 kmem_cache_alloc+0x10b/0x140 > > That's almost 4700 bytes of stack usage from the > kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path. > Yes, it's an extreme case, but if you're looking for a smoking > gun :/ > > I can fix this one easily - we already have a workqueue for doing > async log pushes (will split the stack between xlog_cil_force_lsn > and xlog_cil_push), but the reason we haven't used it for synchronous > log forces is that screws up fsync performance on CFQ. We don't > recommend CFQ with XFS anyway, so I think I'll make this change > anyway. Dave, the patch below should chop off the stack usage from xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue. Can you given this a run? Cheers, Dave. -- Dave Chinner da...@fromorbit.com xfs: always do log forces via the workqueue From: Dave Chinner Log forces can occur deep in the call chain when we have relatively little stack free. Log forces can also happen at close to the call chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from places where we really don't want to add more stack overhead. This stack overhead occurs because log forces do foreground CIL pushes (xlog_cil_push_foreground()) rather than waking the background push wq and waiting for the for the push to complete. This foreground push was done to avoid confusing the CFQ Io scheduler when fsync()s were issued, as it has trouble dealing with dependent IOs being issued from different process contexts. Avoiding blowing the stack is much more critical than performance optimisations for CFQ, especially as we've been recommending against the use of CFQ for XFS since 3.2 kernels were release because of it's problems with multi-threaded IO workloads. Hence convert xlog_cil_push_foreground() to move the push work to the CIL workqueue. We already do the waiting for the push to complete in xlog_cil_force_lsn(), so there's nothing else we need to modify to make this work. Signed-off-by: Dave Chinner --- fs/xfs/xfs_log_cil.c | 52 +++- 1 file changed, 39 insertions(+), 13 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index b57a8e0..7e54553 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -499,13 +499,6 @@ xlog_cil_push( cil->xc_ctx = new_ctx; /* -* mirror the new sequence into the cil structure so that we can do -
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote: > On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote: > > > I'll try to put something along those lines together, if you or Oleg don't > > do it first. > > OK, having looked at that stuff... > > 1) things become much more compact if we finish conversion to get_signal() > first. Callers of get_signal_to_deliver() have k_sigaction and siginfo in > pair of local variables; switching to ksignal will be neutral wrt stack > footprint (it just gathers those two in one struct) *and* we are getting > rid of passing struct siginfo * around. With that done, we can change > struct ksignal ->info with zero impact on the code in arch/*, and conversion > makes sense on its own. In the mainline we have it done for alpha, arm, > openrisc, sparc and x86. I've just put together preliminary (and completely > untested) patches for arm64, m68k and um; doing the rest won't take long, but > they'll obviously need to be tested. It's a fairly safe conversion; > I'd expect the worst bugs to be typos. OK, there's a couple of tricks that allow to reorder that. First of all, temporary config symbol (ARCH_USES_KSIGNAL) selecting that stuff; if it's not selected, we just have #define small_siginfo siginfo and #define assign_sigqueue(info, q) WARN_ON((q) != NULL) and that's it - most of the changes in core kernel consist of s/siginfo/small_&/ and non-trivial ones are under ifdef CONFIG_ARCH_USES_KSIGNAL for the time being. Small ifdefs, at that... Once all architectures get converted, we'll just kill that config symbol and make ifdefs unconditional. Another is that we don't need to bother with task_work_add() at all; on converted architectures we have signal_setup_done(..., ksig, ...) called after each successful get_signal() (the first argument of signal_setup_done() is "has sigframe setup failed" flag; we get there regardless of success or failure of setup_...frame()). So we can just have that sucker do assign_sigqueue(&ksig->info, NULL) and be done with that - all this delayed freeing is explicit now (and we don't get locking overhead, etc. of task_work_add()). If what I've got survives the local beating, I'll put it into signal.git, throw the arch conversions I could test in there and ask on linux-arch for help with the missing ones. With any luck we'll get the full set by the next merge window, at which point we'll be able to kill ifdefs. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote: > I'll try to put something along those lines together, if you or Oleg don't > do it first. OK, having looked at that stuff... 1) things become much more compact if we finish conversion to get_signal() first. Callers of get_signal_to_deliver() have k_sigaction and siginfo in pair of local variables; switching to ksignal will be neutral wrt stack footprint (it just gathers those two in one struct) *and* we are getting rid of passing struct siginfo * around. With that done, we can change struct ksignal ->info with zero impact on the code in arch/*, and conversion makes sense on its own. In the mainline we have it done for alpha, arm, openrisc, sparc and x86. I've just put together preliminary (and completely untested) patches for arm64, m68k and um; doing the rest won't take long, but they'll obviously need to be tested. It's a fairly safe conversion; I'd expect the worst bugs to be typos. 2) struct small_siginfo ought to go into linux/signal.h. Contains signal number, si_code (usually 0), uid, pid and (often NULL) pointer to struct sigqueue. Overall: 20 or 24 bytes. In struct coredump_params we should replace ->siginfo with pointer to struct small_siginfo. Ditto for task_struct ->last_siginfo. In struct ksignal ->info becomes struct small_siginfo. copy_siginfo_to_user{,32}() gets switched to struct small_siginfo *, so does do_coredump(), ptrace_signal(), ptrace_stop(), ptrace_getsiginfo(), ptrace_setsiginfo(), collect_signal(), __dequeue_signal(), dequeue_signal(), signalfd_dequeue() and do_sigtimedwait(). New primitive: assign_sigqueue(small_info, sigqueue). Frees small_info->q and assigns a new value to it. When ptrace_setsiginfo() is given a non-plain siginfo (si_code != SI_USER, that is), it should a allocate struct sigqueue, copy the sucker there and give it to assign_sigqueue(). Plain ones just get uid/pid/signo copied and NULL passed to assign_sigqueue(). code in ptrace_signal() under if (signr != info->si_signo) should start with assign_sigqueue(info, NULL). specific_send_sig_info() in ptrace_signal() is rather clumsy to handle; not sure what's the best way to deal with that. In any case, ptrace_signal() deciding to return 0 should take care of info->q - either copy and free the original, or actually try to reuse the sucker. Either way, info->q should become NULL. get_signal_to_deliver() should do assign_sigqueue(info, NULL) before the call of do_group_exit(). arch_ptrace_stop_needed() can be left as-is; it's a macro and none of the instances look at the info argument at all. The same goes for arch_ptrace_stop(). signalfd_read() should do assign_sigqueue(&info, NULL) right after signalfd_copyinfo(). rt_sigtimedwait(2) and its compat variant should do the same right after copy_siginfo_to_user{,32}(). TP'ed stuff is a mess, as usual. AFAICS, TP_STORE_SIGINFO() needs to be split into a variant taking siginfo (on trace_signal_generate side) and one taking small_siginfo (for trace_signal_deliver). get_signal() should do assign_sigqueue(&ksig->info, NULL) if it returns false and use task_work_add() to schedule info->q for freeing otherwise. Overall, aside of the need to complete arch/* conversion, the only unfinished part is this: /* If the (new) signal is now blocked, requeue it. */ if (sigismember(¤t->blocked, signr)) { specific_send_sig_info(signr, info, current); signr = 0; } in the end of ptrace_signal(). Sure, we can add a local struct siginfo in that if, fill it if needed and pass its address or &info->q->info to specific_send_sig_info(), but that feels kludgy... Hell knows, I'll look a bit more into that. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/13, Oleg Nesterov wrote: > > On 02/13, Linus Torvalds wrote: > > > > On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov wrote: > > > > > > And we should be careful with SIGQUEUE_PREALLOC, at least > > > collect_signal() should not do list_del_init()... Plus we need to > > > handle the SEND_SIG_FORCED-like case. > > > > I don't think the users need to care. They'd just call > > "sigqueue_free()" not knowing about our preallocations etc. > > Yes, but we need to be careful to avoid the races with > release_posix_timer(). Plus we need to delay do_schedule_next_timer() as well. But this is probably good because we can avoid unlock/lock(siglock) in dequeue_signal(). Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/13, Linus Torvalds wrote: > > On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov wrote: > > > > And we should be careful with SIGQUEUE_PREALLOC, at least > > collect_signal() should not do list_del_init()... Plus we need to > > handle the SEND_SIG_FORCED-like case. > > I don't think the users need to care. They'd just call > "sigqueue_free()" not knowing about our preallocations etc. Yes, but we need to be careful to avoid the races with release_posix_timer(). > That kind > of detail should be confined to inside signal.c. Yes, sure. > But there really aren't that many users. There's a couple of > "dequeue_signal_lock()" users, but they don't actually *want* the > siginfo at all (they're kernel threads), so we can just make that > function free the siginfo immediately (and get rid of the totally > unnecessay kernel stack allocation). Yes. Perhaps this helper should be changed/renamed. And perhaps we can even change __send_signal() to avoid __sigqueue_alloc() if PF_KTHREAD. Or sig == SIGKILL. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov wrote: > > And we should be careful with SIGQUEUE_PREALLOC, at least > collect_signal() should not do list_del_init()... Plus we need to > handle the SEND_SIG_FORCED-like case. I don't think the users need to care. They'd just call "sigqueue_free()" not knowing about our preallocations etc. That kind of detail should be confined to inside signal.c. But there really aren't that many users. There's a couple of "dequeue_signal_lock()" users, but they don't actually *want* the siginfo at all (they're kernel threads), so we can just make that function free the siginfo immediately (and get rid of the totally unnecessay kernel stack allocation). And outside of signal.c only signalfd uses "dequeue_signal()" itself, and that would be the only one that would need to be taught to use (in signalfd_copyinfo()) and then free the sigqueue entry. So it really looks like the right thing to do, and fairly straightforward. But I'm leaving the coding proof to Al, since he already offered ;) Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 02/12, Linus Torvalds wrote: > > On Wed, Feb 12, 2014 at 1:14 PM, Al Viro wrote: > > > > Umm... What if we delay __sigqueue_free()? After all, that's where the > > fat sucker normally comes from. That way we might get away with much > > smaller structure on stack... > > Sounds like the RightThing(tm) to do to me, and I don't see why it > wouldn't work. Probably... I'll try to reply tomorrow. > We'd have to teach each user of "dequeue_signal()" to free the siginfo > thing. And we should be careful with SIGQUEUE_PREALLOC, at least collect_signal() should not do list_del_init()... Plus we need to handle the SEND_SIG_FORCED-like case. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote: > On Wed, Feb 12, 2014 at 1:14 PM, Al Viro wrote: > > > > Umm... What if we delay __sigqueue_free()? After all, that's where the > > fat sucker normally comes from. That way we might get away with much > > smaller structure on stack... > > Sounds like the RightThing(tm) to do to me, and I don't see why it > wouldn't work. > > We'd have to teach each user of "dequeue_signal()" to free the siginfo > thing. Which shouldn't be too bad - I think we've collected all of > that into generic code, and there isn't the mass or architecture code > that knows about these things any more. But there are a few odd > drivers etc and signalfd. I didn't look at what the lifetimes were. Only signalfd, AFAICS. And there we'd want to use the same small structure - it's used in do { ret = signalfd_dequeue(ctx, &info, nonblock); if (unlikely(ret <= 0)) break; ret = signalfd_copyinfo(siginfo, &info); if (ret < 0) break; siginfo++; total += ret; nonblock = 1; } while (--count); and using a smaller struct would actually speed the things up - skips one copying. sigqueue would be freed as soon as we'd done signalfd_copyinfo() (if not by signalfd_copyinfo() itself). I'll try to put something along those lines together, if you or Oleg don't do it first. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 1:14 PM, Al Viro wrote: > > Umm... What if we delay __sigqueue_free()? After all, that's where the > fat sucker normally comes from. That way we might get away with much > smaller structure on stack... Sounds like the RightThing(tm) to do to me, and I don't see why it wouldn't work. We'd have to teach each user of "dequeue_signal()" to free the siginfo thing. Which shouldn't be too bad - I think we've collected all of that into generic code, and there isn't the mass or architecture code that knows about these things any more. But there are a few odd drivers etc and signalfd. I didn't look at what the lifetimes were. Adding Oleg to the cc, since any time we touch any of that code, he should be involved. Oleg - the issue is the biggish size of 'struct ksignal' on stack, brought on by the silly "put a whole siginfo_t in it". Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 12:13:19PM -0800, Linus Torvalds wrote: > On Wed, Feb 12, 2014 at 3:39 AM, Al Viro wrote: > > On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote: > > > >> It looks like just "do_signal()" has a stack frame that is about 230 > >> bytes even under normal circumstancs (largely due to "struct ksignal" > >> - which in turn is largely due to the insane 128-byte padding in > >> siginfo_t). Add a few other frames in there, and I guess that if it > >> was close before, the coredump path just makes it go off. > > > > We could, in principle, put it into task_struct and make get_signal() > > return its address - do_signal() is called only in the code that does > > assorted returns to userland... > > We have better uses for random buffers in "struct task_struct", I'd > hate to put a siginfo_t there. *nod* > The thing is, siginfo_t has that idiotic 128-byte area, but it's all > "for future expansion". I think it's some damn glibc disease - we've > seen these kinds of insane paddings before. > > The actual *useful* part of siginfo_t is on the order of 32 bytes. If that. > > Sad. Umm... What if we delay __sigqueue_free()? After all, that's where the fat sucker normally comes from. That way we might get away with much smaller structure on stack... Just introduce a small structure that would contain signr, uid, pid and pointer to struct sigqueue. And pass a pointer to _that_ all the way down to collect_signal(). Pointer's NULL == it's SI_USER with signr/uid/pid from the small struct and all other fields are zero. Pointer isn't NULL - use &small_struct->p->info. And have struct sigqueue actually freed via task_work_add() in that case. Do you see any fundamental problems with that? Looks like it would be faster as well - less copying involved... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 09:25:38AM -0500, Dave Jones wrote: > On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote: > > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote: > > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote: > > > > > > > None of the XFS code disables interrupts in that path, not does is > > > > call outside XFS except to dispatch IO. The stack is pretty deep at > > > > this point and I know that the standard (non stacked) IO stack can > > > > consume >3kb of stack space when it gets down to having to do memory > > > > reclaim during GFP_NOIO allocation at the lowest level of SCSI > > > > drivers. Stack overruns typically show up with symptoms like we are > > > > seeing. > > > > .. > > > > > > > > Dave, before chasing ghosts, can you (like Eric originally asked) > > > > turn on stack overrun detection? > > > > > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on. > > > > That only checks stack usage when an interrupt is taken. If no > > interrupts are taken when stack usage is within 128 bytes of > > overflow, then it doesn't catch it. > > > > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum > > stack usage of a process via canary overwrites and it records it in > > do_exit(). > > I had that on too. The only message from it came from quite a while > before the trace that happened overnight.. Right, it won't capture an overrun at the point in time an overrun occurs, either, because it only checks when the process exits. But it does tell you what stack usage is being seen, as this: > [ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left > [12900.804230] BUG: sleeping function called from invalid context at > mm/mempool.c:203 is a pretty a good indication that trinity is at risk of stack overuns... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 3:39 AM, Al Viro wrote: > On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote: > >> It looks like just "do_signal()" has a stack frame that is about 230 >> bytes even under normal circumstancs (largely due to "struct ksignal" >> - which in turn is largely due to the insane 128-byte padding in >> siginfo_t). Add a few other frames in there, and I guess that if it >> was close before, the coredump path just makes it go off. > > We could, in principle, put it into task_struct and make get_signal() > return its address - do_signal() is called only in the code that does > assorted returns to userland... We have better uses for random buffers in "struct task_struct", I'd hate to put a siginfo_t there. The thing is, siginfo_t has that idiotic 128-byte area, but it's all "for future expansion". I think it's some damn glibc disease - we've seen these kinds of insane paddings before. The actual *useful* part of siginfo_t is on the order of 32 bytes. If that. Sad. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 2/12/14, 12:10 AM, Dave Chinner wrote: > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote: >> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote: >> >> > None of the XFS code disables interrupts in that path, not does is >> > call outside XFS except to dispatch IO. The stack is pretty deep at >> > this point and I know that the standard (non stacked) IO stack can >> > consume >3kb of stack space when it gets down to having to do memory >> > reclaim during GFP_NOIO allocation at the lowest level of SCSI >> > drivers. Stack overruns typically show up with symptoms like we are >> > seeing. >> > .. >> > >> > Dave, before chasing ghosts, can you (like Eric originally asked) >> > turn on stack overrun detection? >> >> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on. > > That only checks stack usage when an interrupt is taken. If no > interrupts are taken when stack usage is within 128 bytes of > overflow, then it doesn't catch it. > > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum > stack usage of a process via canary overwrites and it records it in > do_exit(). I also use the stack tracer to record the largest stack > usage seen so I know exactly what code paths are approaching stack > overruns... > > Cheers, > > Dave. > I'm not sure if I'm off base here, but maybe this would make sense: check for a corrupted stack in __might_sleep. Compile tested only, possibly inelegant, and/or completely wrong, but: From: Eric Sandeen sched: Test for corrupted task_struct in __might_sleep If a thread overruns the stack, it may corrupt the task_struct, leading to false positives on tests like irqs_disabled(). Warn if this seems to be the case. Signed-off-by: Eric Sandeen --- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b46131e..6920c3c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6934,6 +6934,8 @@ static inline int preempt_count_equals(int preempt_offset) void __might_sleep(const char *file, int line, int preempt_offset) { + struct task_struct *tsk = current; + unsigned long *stackend; static unsigned long prev_jiffy;/* ratelimiting */ rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */ @@ -6952,6 +6954,11 @@ void __might_sleep(const char *file, int line, int preempt_offset) in_atomic(), irqs_disabled(), current->pid, current->comm); + /* A corrupted stack can cause a false positive on irqs_disabled etc */ + stackend = end_of_stack(tsk); + if (tsk != &init_task && *stackend != STACK_END_MAGIC) + printk(KERN_EMERG "Thread overran stack, or stack corrupted\n"); + debug_show_held_locks(current); if (irqs_disabled()) print_irqtrace_events(current); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote: > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote: > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote: > > > > > None of the XFS code disables interrupts in that path, not does is > > > call outside XFS except to dispatch IO. The stack is pretty deep at > > > this point and I know that the standard (non stacked) IO stack can > > > consume >3kb of stack space when it gets down to having to do memory > > > reclaim during GFP_NOIO allocation at the lowest level of SCSI > > > drivers. Stack overruns typically show up with symptoms like we are > > > seeing. > > > .. > > > > > > Dave, before chasing ghosts, can you (like Eric originally asked) > > > turn on stack overrun detection? > > > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on. > > That only checks stack usage when an interrupt is taken. If no > interrupts are taken when stack usage is within 128 bytes of > overflow, then it doesn't catch it. > > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum > stack usage of a process via canary overwrites and it records it in > do_exit(). I had that on too. The only message from it came from quite a while before the trace that happened overnight.. [ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left [12900.804230] BUG: sleeping function called from invalid context at mm/mempool.c:203 > I also use the stack tracer to record the largest stack > usage seen so I know exactly what code paths are approaching stack > overruns... I can give that a try later. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 07:40:36AM -0500, Steven Rostedt wrote: > The pt_regs structure. > > That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the > culprit here. > > Peter and Frederic, is there a way not to store that on the stack? Something like so? --- include/trace/ftrace.h | 7 --- kernel/trace/trace_event_perf.c | 5 - 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h index 1a8b28db3775..87ae3ef1d278 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -678,7 +678,7 @@ perf_trace_##call(void *__data, proto) \ struct ftrace_event_call *event_call = __data; \ struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\ struct ftrace_raw_##call *entry;\ - struct pt_regs __regs; \ + struct pt_regs *__regs; \ u64 __addr = 0, __count = 1;\ struct task_struct *__task = NULL; \ struct hlist_head *head;\ @@ -697,18 +697,19 @@ perf_trace_##call(void *__data, proto) \ sizeof(u64)); \ __entry_size -= sizeof(u32);\ \ - perf_fetch_caller_regs(&__regs);\ entry = perf_trace_buf_prepare(__entry_size,\ event_call->event.type, &__regs, &rctx);\ if (!entry) \ return; \ \ + perf_fetch_caller_regs(__regs); \ + \ tstruct \ \ { assign; } \ \ perf_trace_buf_submit(entry, __entry_size, rctx, __addr,\ - __count, &__regs, head, __task);\ + __count, __regs, head, __task); \ } /* diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index e854f420e033..1885f4aac109 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -232,8 +232,10 @@ void perf_trace_del(struct perf_event *p_event, int flags) tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event); } +static DEFINE_PER_CPU(struct pt_regs, tp_regs[4]); + __kprobes void *perf_trace_buf_prepare(int size, unsigned short type, - struct pt_regs *regs, int *rctxp) + struct pt_regs **regs, int *rctxp) { struct trace_entry *entry; unsigned long flags; @@ -252,6 +254,7 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned short type, if (*rctxp < 0) return NULL; + *regs = this_cpu_ptr(&tp_regs[*rctxp]); raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]); /* zero the dead bytes from align to not leak stack to user */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, 12 Feb 2014 19:35:13 +1100 Dave Chinner wrote: > And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch > on x86-64 (at least, according to the documentation). > CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more > stack usage than the tracer. As such, I also think that > CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of > stack usage because it is canary based and so captures the very > worst case usage of the process's stack... Yeah, with the new fentry (adding the mcount call before setting up the stack frame), the function tracer can not catch leaf functions, as it is called before the leaf function's frame is set up. Hmm, I wonder if I should add a config to disable fentry and go back to the old mcount that gets called after setting up the stack frame. This will lead to better stack tracing, but you lose out on all the benefits of fentry. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, 12 Feb 2014 03:13:33 -0500 Tejun Heo wrote: > On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote: > > There's a lot of 200+ byte stack frames in block/blk-core.s, and they > > all seem to be of the type perf_trace_block_buffer() - things created > > with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of > > frame, I have no idea. That sounds like a potential disaster too, > > although hopefully it's mostly leaf functions - but leaf functions > > *deep* in the callchain. Tejun? Steven, why _do_ they end up with such > > huge frames? > > It looks like they're essentially the same for all the automatically > generated trace functions. I'm seeing 232 byte stack frame in most of > them. If I'm not completely confused by these macros, these are > generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and > contains struct pt_regs in the stack frame which is already 168 bytes, > so that seems like the culprit. No idea whether this is something > avoidable. At least they shouldn't nest in any way. Steven? They shouldn't nest. But if the perf tracepoint is active, I don't know how much more of the stack is used in the functions that tracepoint calls. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
[ Added the perf tracepoint maintainers ] On Tue, 11 Feb 2014 22:59:58 -0800 Linus Torvalds wrote: > On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner wrote: > > > > FYI, just creating lots of files with open(O_CREAT): > > > > [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left > > [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left > > [ 349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left > > [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left > > [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left > > [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left > > [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left > > > > We've got absolutely no spare stack space anymore in the IO path. > > And the IO path can't get much simpler than filesystem -> virtio > > block device. > > Ugh, that's bad. A thousand bytes of stack space is much too close to > any limits. > > Do you have the stack traces for these things so that we can look at > worst offenders? > > If the new block-mq code is to blame, it needs to be fixed. > __virtblk_add_req() has a 300-byte stack frame, it seems. Looking > elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is > hopefully not in any normal path - online discard is moronic, and I'm > assuming XFS doesn't do that. > > There's a lot of 200+ byte stack frames in block/blk-core.s, and they > all seem to be of the type perf_trace_block_buffer() - things created > with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of > frame, I have no idea. That sounds like a potential disaster too, > although hopefully it's mostly leaf functions - but leaf functions > *deep* in the callchain. Tejun? Steven, why _do_ they end up with such > huge frames? The perf_trace_##event is defined in include/trace/ftrace.h. There we have this: perf_trace_##call(void *__data, proto) \ { \ struct ftrace_event_call *event_call = __data; \ struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\ struct ftrace_raw_##call *entry;\ struct pt_regs __regs; \ u64 __addr = 0, __count = 1;\ struct task_struct *__task = NULL; \ struct hlist_head *head;\ int __entry_size; \ int __data_size;\ int rctx; \ \ Mostly pointers except for two structures. The __data_offests, is dynamically defined, and only consists of values from the tracepoint entry_structure that defines dynamic arrays. But the other structure on the stack looks a bit harrier. The pt_regs structure. That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the culprit here. Peter and Frederic, is there a way not to store that on the stack? -- Steve > > And if the stack use comes from the VFS layer, we can probably work on > that too. But I don't think that has really changed much lately.. > > Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote: > It looks like just "do_signal()" has a stack frame that is about 230 > bytes even under normal circumstancs (largely due to "struct ksignal" > - which in turn is largely due to the insane 128-byte padding in > siginfo_t). Add a few other frames in there, and I guess that if it > was close before, the coredump path just makes it go off. We could, in principle, put it into task_struct and make get_signal() return its address - do_signal() is called only in the code that does assorted returns to userland... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote: > On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner wrote: > > > > FYI, just creating lots of files with open(O_CREAT): > > > > [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left > > [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left > > [ 349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left > > [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left > > [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left > > [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left > > [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left > > > > We've got absolutely no spare stack space anymore in the IO path. > > And the IO path can't get much simpler than filesystem -> virtio > > block device. > > Ugh, that's bad. A thousand bytes of stack space is much too close to > any limits. > > Do you have the stack traces for these things so that we can look at > worst offenders? Sure. It's almost through the XFS block allocation path where we do metadata reads or log writes. That's always been the worst case stack usage path for XFS. The worst path is about 4800 bytes - it was a double btree split (inode extent map tree split, followed by a freespace btree split when allocating a block for the extent map btree split), triggering a reallocation of a btree block that was pinned and STALE, which triggered a log force, which then made it's way into the block layer. I don't have a stack trace of that because I can't create the initial conditions required to trigger it on demand. I've only ever seen it in the wild once. More common, though, is the situation we see here - somewhere around 3-4k of stack usage from a single extent btree update operation. We've already got a stack switch in the data allocation path - we had to add it to stop the bdi-flusher from busting the stack with alarming regularity in production systems. I initially made the metadata allocation paths use it as well, but that part got reverted (aa29284 xfs: don't defer metadata allocation to the workqueue) because Mel Gorman demonstrated several workloads that regressed significantly as a result of making that change. So now we are stuck betweeen a rock and a hard place - those metadata block allocation paths are triggering the stack overflows, but switching stacks in the allocation path to avoid stack overflows causes massive performance regressions The thing is, the XFS stack usage has not changed significantly over the past 5 years - we keep a pretty close eye on it. The issue is that everything around XFS has slowly been growing and I can't point you at one stack trace that demonstrates the worst case. What I can point you to function calls that consume a lot of stack space and hence limit what is available to callers. So here's a few stack fragements I've generated in teh past few minutes on a couple of XFS filesystems inside a 16p/16GB VM by running: $ dd if=/dev/zero of=/mnt/test/foo bs=1024k & which writes at about 600MB/s to a 18TB RAID array. That runs in parallel with a multithreaded inode creation/walk/remove workload on a different filesystem (creates 50 million inodes ~250k inode creates/s, walks them removes them at around 170k removes/s) which is hosted on a dm RAID0 stripe of 2 samsung 840 EVO SSDs. mutex_lock() requires at least 1.2k of stack because of the scheduler overhead.: $ sudo cat /sys/kernel/debug/tracing/stack_trace DepthSize Location(39 entries) - 0) 4368 64 update_group_power+0x2c/0x270 1) 4304 384 find_busiest_group+0x10a/0x8b0 2) 3920 288 load_balance+0x165/0x870 3) 3632 96 idle_balance+0x106/0x1b0 4) 3536 112 __schedule+0x7b6/0x7e0 5) 3424 16 schedule+0x29/0x70 6) 3408 16 schedule_preempt_disabled+0xe/0x10 7) 3392 112 __mutex_lock_slowpath+0x11a/0x400 8) 3280 32 mutex_lock+0x1e/0x32 i.e. any function that is going to schedule needs at least 1k of stack, and some of the lead in infrastructure (like wait_for_completion) can require a total of up to 1.5k... Basic memory allocation can easily >1k of stack without entering reclaim: - 0) 4920 40 zone_statistics+0xbd/0xc0 1) 4880 256 get_page_from_freelist+0x3a8/0x8a0 2) 4624 256 __alloc_pages_nodemask+0x143/0x8e0 3) 4368 80 alloc_pages_current+0xb2/0x170 4) 4288 64 new_slab+0x265/0x2e0 5) 4224 240 __slab_alloc+0x2fb/0x4c4 6) 3984 80 kmem_cache_alloc+0x10b/0x140 7) 3904 48 kmem_zone_alloc+0x77/0x100 If it enters reclaim and we are allowed to IO? Well, you saw the stack I posted in the other thread - 4700 bytes from kmem_cache_alloc() to the top of the stack. Another bad case is the swap pa
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote: > There's a lot of 200+ byte stack frames in block/blk-core.s, and they > all seem to be of the type perf_trace_block_buffer() - things created > with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of > frame, I have no idea. That sounds like a potential disaster too, > although hopefully it's mostly leaf functions - but leaf functions > *deep* in the callchain. Tejun? Steven, why _do_ they end up with such > huge frames? It looks like they're essentially the same for all the automatically generated trace functions. I'm seeing 232 byte stack frame in most of them. If I'm not completely confused by these macros, these are generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and contains struct pt_regs in the stack frame which is already 168 bytes, so that seems like the culprit. No idea whether this is something avoidable. At least they shouldn't nest in any way. Steven? Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote: > On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner wrote: > > > > None of the XFS code disables interrupts in that path, not does is > > call outside XFS except to dispatch IO. The stack is pretty deep at > > this point and I know that the standard (non stacked) IO stack can > > consume >3kb of stack space when it gets down to having to do memory > > reclaim during GFP_NOIO allocation at the lowest level of SCSI > > drivers. Stack overruns typically show up with symptoms like we are > > seeing. > > That would also explain why it shows up for do_coredump(), even though > clearly interrupts are not disabled at that point. It's just because > do_coredump() opens a filename at a deeper point in the stack than the > more normal system call paths do. Right, it's exactly the same problem we have when knfsd is making the VFS calls. > It looks like just "do_signal()" has a stack frame that is about 230 > bytes even under normal circumstancs (largely due to "struct ksignal" > - which in turn is largely due to the insane 128-byte padding in > siginfo_t). Add a few other frames in there, and I guess that if it > was close before, the coredump path just makes it go off. Yup. But it's when you see this sort of thing you realise that almost no GFP_KERNEL memory allocation is really safe from stack overruns, though: 0) 6064 64 update_group_power+0x2c/0x270 1) 6000 384 find_busiest_group+0x10a/0x8b0 2) 5616 288 load_balance+0x165/0x870 3) 5328 96 idle_balance+0x106/0x1b0 4) 5232 112 __schedule+0x7b6/0x7e0 5) 5120 16 schedule+0x29/0x70 6) 5104 176 percpu_ida_alloc+0x1b3/0x3d0 7) 4928 32 blk_mq_wait_for_tags+0x1f/0x40 8) 4896 80 blk_mq_alloc_request_pinned+0x4e/0x110 9) 4816 128 blk_mq_make_request+0x42b/0x510 10) 4688 48 generic_make_request+0xc0/0x110 11) 4640 96 submit_bio+0x71/0x120 12) 4544 192 _xfs_buf_ioapply+0x2cc/0x3b0 13) 4352 48 xfs_buf_iorequest+0x6f/0xc0 14) 4304 32 xlog_bdstrat+0x23/0x60 15) 4272 96 xlog_sync+0x3a4/0x5c0 16) 4176 48 xlog_state_release_iclog+0x121/0x130 17) 4128 192 xlog_write+0x700/0x7c0 18) 3936 192 xlog_cil_push+0x2ae/0x3d0 19) 3744 128 xlog_cil_force_lsn+0x1b8/0x1e0 20) 3616 144 _xfs_log_force_lsn+0x7c/0x300 21) 3472 48 xfs_log_force_lsn+0x3b/0xa0 22) 3424 112 xfs_iunpin_wait+0xd7/0x160 23) 3312 80 xfs_reclaim_inode+0xd0/0x350 24) 3232 432 xfs_reclaim_inodes_ag+0x277/0x3d0 25) 2800 48 xfs_reclaim_inodes_nr+0x33/0x40 26) 2752 16 xfs_fs_free_cached_objects+0x15/0x20 27) 2736 80 super_cache_scan+0x169/0x170 28) 2656 160 shrink_slab_node+0x133/0x290 29) 2496 80 shrink_slab+0x122/0x160 30) 2416 112 do_try_to_free_pages+0x1de/0x360 31) 2304 192 try_to_free_pages+0x110/0x190 32) 2112 256 __alloc_pages_nodemask+0x5a2/0x8e0 33) 1856 80 alloc_pages_current+0xb2/0x170 34) 1776 64 new_slab+0x265/0x2e0 35) 1712 240 __slab_alloc+0x2fb/0x4c4 36) 1472 80 kmem_cache_alloc+0x10b/0x140 That's almost 4700 bytes of stack usage from the kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path. Yes, it's an extreme case, but if you're looking for a smoking gun :/ I can fix this one easily - we already have a workqueue for doing async log pushes (will split the stack between xlog_cil_force_lsn and xlog_cil_push), but the reason we haven't used it for synchronous log forces is that screws up fsync performance on CFQ. We don't recommend CFQ with XFS anyway, so I think I'll make this change anyway. > And some of the debug options that I'm sure DaveJ has enabled tend to > make the stack usage worse (simply because they make a lot of data > structures bigger). True - CONFIG_XFS_DEBUG tends to add about 5% to the stack usage of XFS, but realistically 5% is not significant especially as we've been hitting stack overflows with XFS on production systems regularly enough on x86-64 over the past 2-3 years that we know what "typical symptoms" of such overflows are... The problem we have now is that everything outside XFS continues to grow in stack usage, so the only choice that remains for us to avoid overruns is to add performance impacting stack switches into the middle of common XFS paths. We also have to do it unconditionally because we don't know ahead of time if any specific operation is going to require all the stack we can give it because - for example - we can't predict when the IO path is going to require memory allocation... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org M
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner wrote: > > FYI, just creating lots of files with open(O_CREAT): > > [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left > [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left > [ 349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left > [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left > [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left > [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left > [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left > > We've got absolutely no spare stack space anymore in the IO path. > And the IO path can't get much simpler than filesystem -> virtio > block device. Ugh, that's bad. A thousand bytes of stack space is much too close to any limits. Do you have the stack traces for these things so that we can look at worst offenders? If the new block-mq code is to blame, it needs to be fixed. __virtblk_add_req() has a 300-byte stack frame, it seems. Looking elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is hopefully not in any normal path - online discard is moronic, and I'm assuming XFS doesn't do that. There's a lot of 200+ byte stack frames in block/blk-core.s, and they all seem to be of the type perf_trace_block_buffer() - things created with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of frame, I have no idea. That sounds like a potential disaster too, although hopefully it's mostly leaf functions - but leaf functions *deep* in the callchain. Tejun? Steven, why _do_ they end up with such huge frames? And if the stack use comes from the VFS layer, we can probably work on that too. But I don't think that has really changed much lately.. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote: > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote: > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote: > > > > > None of the XFS code disables interrupts in that path, not does is > > > call outside XFS except to dispatch IO. The stack is pretty deep at > > > this point and I know that the standard (non stacked) IO stack can > > > consume >3kb of stack space when it gets down to having to do memory > > > reclaim during GFP_NOIO allocation at the lowest level of SCSI > > > drivers. Stack overruns typically show up with symptoms like we are > > > seeing. > > > .. > > > > > > Dave, before chasing ghosts, can you (like Eric originally asked) > > > turn on stack overrun detection? > > > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on. > > That only checks stack usage when an interrupt is taken. If no > interrupts are taken when stack usage is within 128 bytes of > overflow, then it doesn't catch it. > > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum > stack usage of a process via canary overwrites and it records it in > do_exit(). I also use the stack tracer to record the largest stack > usage seen so I know exactly what code paths are approaching stack > overruns... FYI, just creating lots of files with open(O_CREAT): [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left [ 349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left We've got absolutely no spare stack space anymore in the IO path. And the IO path can't get much simpler than filesystem -> virtio block device. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner wrote: > > None of the XFS code disables interrupts in that path, not does is > call outside XFS except to dispatch IO. The stack is pretty deep at > this point and I know that the standard (non stacked) IO stack can > consume >3kb of stack space when it gets down to having to do memory > reclaim during GFP_NOIO allocation at the lowest level of SCSI > drivers. Stack overruns typically show up with symptoms like we are > seeing. That would also explain why it shows up for do_coredump(), even though clearly interrupts are not disabled at that point. It's just because do_coredump() opens a filename at a deeper point in the stack than the more normal system call paths do. It looks like just "do_signal()" has a stack frame that is about 230 bytes even under normal circumstancs (largely due to "struct ksignal" - which in turn is largely due to the insane 128-byte padding in siginfo_t). Add a few other frames in there, and I guess that if it was close before, the coredump path just makes it go off. And some of the debug options that I'm sure DaveJ has enabled tend to make the stack usage worse (simply because they make a lot of data structures bigger). Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote: > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote: > > > None of the XFS code disables interrupts in that path, not does is > > call outside XFS except to dispatch IO. The stack is pretty deep at > > this point and I know that the standard (non stacked) IO stack can > > consume >3kb of stack space when it gets down to having to do memory > > reclaim during GFP_NOIO allocation at the lowest level of SCSI > > drivers. Stack overruns typically show up with symptoms like we are > > seeing. > > .. > > > > Dave, before chasing ghosts, can you (like Eric originally asked) > > turn on stack overrun detection? > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on. That only checks stack usage when an interrupt is taken. If no interrupts are taken when stack usage is within 128 bytes of overflow, then it doesn't catch it. I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum stack usage of a process via canary overwrites and it records it in do_exit(). I also use the stack tracer to record the largest stack usage seen so I know exactly what code paths are approaching stack overruns... Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote: > None of the XFS code disables interrupts in that path, not does is > call outside XFS except to dispatch IO. The stack is pretty deep at > this point and I know that the standard (non stacked) IO stack can > consume >3kb of stack space when it gets down to having to do memory > reclaim during GFP_NOIO allocation at the lowest level of SCSI > drivers. Stack overruns typically show up with symptoms like we are > seeing. > .. > > Dave, before chasing ghosts, can you (like Eric originally asked) > turn on stack overrun detection? CONFIG_DEBUG_STACKOVERFLOW ? Already turned on. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Wed, Feb 12, 2014 at 04:22:15AM +, Al Viro wrote: > On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote: > > [ 3111.414202] [] bio_alloc_bioset+0x156/0x210 > > [ 3111.414855] [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs] > > [ 3111.415517] [] ? xlog_bdstrat+0x22/0x60 [xfs] > > [ 3111.416175] [] xfs_buf_iorequest+0x6b/0xf0 [xfs] > > [ 3111.416843] [] xlog_bdstrat+0x22/0x60 [xfs] > > [ 3111.417509] [] xlog_sync+0x3a7/0x5b0 [xfs] > > [ 3111.418175] [] xlog_state_release_iclog+0x10f/0x120 > > [xfs] > > [ 3111.418846] [] xlog_write+0x6f0/0x800 [xfs] > > [ 3111.419518] [] xlog_cil_push+0x2f1/0x410 [xfs] > > Very interesting. The first thing xlog_cil_push() is doing is blocking > kmalloc(). So at that point it still hadn't been atomic. I'd probably > slap may_sleep() in the beginning of xlog_sync() and see if that triggers... None of the XFS code disables interrupts in that path, not does is call outside XFS except to dispatch IO. The stack is pretty deep at this point and I know that the standard (non stacked) IO stack can consume >3kb of stack space when it gets down to having to do memory reclaim during GFP_NOIO allocation at the lowest level of SCSI drivers. Stack overruns typically show up with symptoms like we are seeing. Simple example with memory allocation follows. keep in mind that memory reclaim uses a whole lot more stack if it is needed, and that scheduling at this point requires about 1k of stack to be free for the scheduler footprint, too. FWIW, the blk-mq stuff seems to hae added 200-300 bytes of new stack usage to the IO path $ sudo cat /sys/kernel/debug/tracing/stack_trace DepthSize Location(45 entries) - 0) 5944 40 zone_statistics+0xbd/0xc0 1) 5904 256 get_page_from_freelist+0x3a8/0x8a0 2) 5648 256 __alloc_pages_nodemask+0x143/0x8e0 3) 5392 80 alloc_pages_current+0xb2/0x170 4) 5312 64 new_slab+0x265/0x2e0 5) 5248 240 __slab_alloc+0x2fb/0x4c4 6) 5008 80 __kmalloc+0x133/0x180 7) 4928 112 virtqueue_add_sgs+0x2fe/0x520 8) 4816 288 __virtblk_add_req+0xd5/0x180 9) 4528 96 virtio_queue_rq+0xdd/0x1d0 10) 4432 112 __blk_mq_run_hw_queue+0x1c3/0x3c0 11) 4320 16 blk_mq_run_hw_queue+0x35/0x40 12) 4304 80 blk_mq_insert_requests+0xc5/0x120 13) 4224 96 blk_mq_flush_plug_list+0x129/0x140 14) 4128 112 blk_flush_plug_list+0xe7/0x240 15) 4016 32 blk_finish_plug+0x18/0x50 16) 3984 192 _xfs_buf_ioapply+0x30f/0x3b0 17) 3792 48 xfs_buf_iorequest+0x6f/0xc0 37) 928 16 xfs_vn_create+0x13/0x20 38) 912 64 vfs_create+0xb5/0xf0 39) 848 208 do_last.isra.53+0x6e0/0xd00 40) 640 176 path_openat+0xbe/0x620 41) 464 208 do_filp_open+0x43/0xa0 42) 256 112 do_sys_open+0x13c/0x230 43) 144 16 SyS_open+0x22/0x30 44) 128 128 system_call_fastpath+0x16/0x1b Dave, before chasing ghosts, can you (like Eric originally asked) turn on stack overrun detection? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote: > [ 3111.414202] [] bio_alloc_bioset+0x156/0x210 > [ 3111.414855] [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs] > [ 3111.415517] [] ? xlog_bdstrat+0x22/0x60 [xfs] > [ 3111.416175] [] xfs_buf_iorequest+0x6b/0xf0 [xfs] > [ 3111.416843] [] xlog_bdstrat+0x22/0x60 [xfs] > [ 3111.417509] [] xlog_sync+0x3a7/0x5b0 [xfs] > [ 3111.418175] [] xlog_state_release_iclog+0x10f/0x120 > [xfs] > [ 3111.418846] [] xlog_write+0x6f0/0x800 [xfs] > [ 3111.419518] [] xlog_cil_push+0x2f1/0x410 [xfs] Very interesting. The first thing xlog_cil_push() is doing is blocking kmalloc(). So at that point it still hadn't been atomic. I'd probably slap may_sleep() in the beginning of xlog_sync() and see if that triggers... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 06:52:19PM -0800, Linus Torvalds wrote: > On Tue, Feb 11, 2014 at 5:09 PM, Al Viro wrote: > > > > Slap the check in vfs_create(), see if interrupts had been disabled by it > > or > > by something in ->create(). Since it's reproducible... > > path_openat() starts off with a get_empty_filp(), which allocates a > file pointer with GFP_KERNEL. So that should have triggered the > might_sleep warning if irq's were already disabled at that point. > > So it's not before that - in particular, it's not in the signal > handling or do_coredump() paths. > > Also, at least xfs_buf_lock() - which is much deeper in that chain - > does a down(&bp->b_sema). I'm disguested that that doesn't have a > might_sleep() in it. > > Dave, mind adding a "might_sleep()" to the top of > "down[_interruptible]()". It's silly to not have coverage of semaphore > use in bad contexts. Added those, didn't trigger. Neither did Al's suggestion. Slightly different trace, but still from the coredump path. Dave [ 3111.403822] BUG: sleeping function called from invalid context at mm/mempool.c:203 [ 3111.404414] in_atomic(): 0, irqs_disabled(): 1, pid: 19213, name: trinity-c46 [ 3111.404884] 5 locks held by trinity-c46/19213: [ 3111.405354] #0: (sb_writers#9){..}, at: [] do_coredump+0xe05/0xf60 [ 3111.405862] #1: (shrinker_rwsem){..}, at: [] shrink_slab+0x3f/0x180 [ 3111.406374] #2: (&type->s_umount_key#30){..}, at: [] grab_super_passive+0x44/0x90 [ 3111.406905] #3: (&pag->pag_ici_reclaim_lock){..}, at: [] xfs_reclaim_inodes_ag+0x31a/0x430 [xfs] [ 3111.407466] #4: (&(&ip->i_lock)->mr_lock){..}, at: [] xfs_ilock+0x16f/0x1b0 [xfs] [ 3111.408039] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113 [ 3111.409779] 8da3fde6 4f998871 88023715cce0 8d72b091 [ 3111.410396] 88023715cd08 8d09ddb5 0010 [ 3111.411021] 880243566288 0008 88023715cd88 8d1534f3 [ 3111.411656] Call Trace: [ 3111.412283] [] dump_stack+0x4e/0x7a [ 3111.412922] [] __might_sleep+0x105/0x150 [ 3111.413562] [] mempool_alloc+0xa3/0x170 [ 3111.414202] [] bio_alloc_bioset+0x156/0x210 [ 3111.414855] [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs] [ 3111.415517] [] ? xlog_bdstrat+0x22/0x60 [xfs] [ 3111.416175] [] xfs_buf_iorequest+0x6b/0xf0 [xfs] [ 3111.416843] [] xlog_bdstrat+0x22/0x60 [xfs] [ 3111.417509] [] xlog_sync+0x3a7/0x5b0 [xfs] [ 3111.418175] [] xlog_state_release_iclog+0x10f/0x120 [xfs] [ 3111.418846] [] xlog_write+0x6f0/0x800 [xfs] [ 3111.419518] [] xlog_cil_push+0x2f1/0x410 [xfs] [ 3111.420195] [] xlog_cil_force_lsn+0x1d8/0x210 [xfs] [ 3111.420865] [] ? __bit_spin_unlock.constprop.66+0x19/0x40 [ 3111.421551] [] _xfs_log_force_lsn+0x93/0x340 [xfs] [ 3111.422239] [] xfs_log_force_lsn+0x2e/0xb0 [xfs] [ 3111.422932] [] ? xfs_iunpin_wait+0x19/0x20 [xfs] [ 3111.423625] [] __xfs_iunpin_wait+0xd0/0x1a0 [xfs] [ 3111.424310] [] ? autoremove_wake_function+0x40/0x40 [ 3111.425008] [] xfs_iunpin_wait+0x19/0x20 [xfs] [ 3111.425705] [] xfs_reclaim_inode+0x8c/0x380 [xfs] [ 3111.426405] [] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs] [ 3111.427104] [] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs] [ 3111.427797] [] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [ 3111.428481] [] xfs_fs_free_cached_objects+0x15/0x20 [xfs] [ 3111.429150] [] super_cache_scan+0x16c/0x180 [ 3111.429824] [] shrink_slab_node+0x14b/0x2e0 [ 3111.430489] [] ? shrink_slab+0x3f/0x180 [ 3111.431146] [] shrink_slab+0x8e/0x180 [ 3111.431796] [] try_to_free_pages+0x516/0x970 [ 3111.432436] [] ? __set_page_dirty+0x27/0xc0 [ 3111.433065] [] __alloc_pages_nodemask+0x7a9/0xb00 [ 3111.433689] [] alloc_pages_current+0x106/0x1f0 [ 3111.434304] [] ? pte_alloc_one+0x17/0x80 [ 3111.434911] [] pte_alloc_one+0x17/0x80 [ 3111.435510] [] __pte_alloc+0x27/0x130 [ 3111.436098] [] handle_mm_fault+0xafc/0xbb0 [ 3111.436681] [] __get_user_pages+0x1ce/0x620 [ 3111.437254] [] get_dump_page+0x54/0x80 [ 3111.437810] [] elf_core_dump+0x12d9/0x1420 [ 3111.438356] [] ? elf_core_dump+0x860/0x1420 [ 3111.438884] [] do_coredump+0xc02/0xf60 [ 3111.439398] [] get_signal_to_deliver+0x2b8/0x6b0 [ 3111.439898] [] do_signal+0x57/0x9d0 [ 3111.440386] [] ? __acct_update_integrals+0x8e/0x120 [ 3111.440873] [] ? preempt_count_sub+0x6b/0xf0 [ 3111.441357] [] ? _raw_spin_unlock+0x31/0x50 [ 3111.441833] [] ? vtime_account_user+0x91/0xa0 [ 3111.442307] [] ? context_tracking_user_exit+0x9b/0x100 [ 3111.442783] [] do_notify_resume+0x5c/0xa0 [ 3111.443260] [] retint_signal+0x46/0x90 [ 3111.443773] [ cut here ] [ 3111.444248] WARNING: CPU: 0 PID: 19213 at block/blk.h:227 generic_make_request_checks+0x33f/0x460() [ 3111.444742] Modules linked in: fuse can_bcm can_raw scsi_transport_iscsi ipt_ULOG nfnetlink nfc caif_socket caif af_802154 phonet af_rxrpc can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm net
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 5:09 PM, Al Viro wrote: > > Slap the check in vfs_create(), see if interrupts had been disabled by it or > by something in ->create(). Since it's reproducible... path_openat() starts off with a get_empty_filp(), which allocates a file pointer with GFP_KERNEL. So that should have triggered the might_sleep warning if irq's were already disabled at that point. So it's not before that - in particular, it's not in the signal handling or do_coredump() paths. Also, at least xfs_buf_lock() - which is much deeper in that chain - does a down(&bp->b_sema). I'm disguested that that doesn't have a might_sleep() in it. Dave, mind adding a "might_sleep()" to the top of "down[_interruptible]()". It's silly to not have coverage of semaphore use in bad contexts. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 07:44:03PM -0500, Dave Jones wrote: > The 'good' news is I seem to be able to reproduce it fairly quickly. > I've seen it 3-4 times this afternoon. (Always from the coredump path) > > I was on vacation last week, but I wasn't hitting this before then, so it's a > fairly recent change that's introduced this. Slap the check in vfs_create(), see if interrupts had been disabled by it or by something in ->create(). Since it's reproducible... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 03:49:14PM -0600, Eric Sandeen wrote: > On 2/11/14, 3:08 PM, Dave Chinner wrote: > > On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote: > >> BUG: sleeping function called from invalid context at mm/mempool.c:203 > >> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3 > >> 5 locks held by trinity-c3/27511: > >> #0: (sb_writers#9){..}, at: [] > >> mnt_want_write+0x24/0x50 > >> #1: (&type->i_mutex_dir_key#3){..}, at: [] > >> do_last.isra.51+0x294/0x11f0 > >> #2: (sb_internal#2){..}, at: [] > >> xfs_trans_alloc+0x24/0x40 [xfs] > >> #3: (&(&ip->i_lock)->mr_lock/1){..}, at: [] > >> xfs_ilock+0x16f/0x1b0 [xfs] > >> #4: (&(&ip->i_lock)->mr_lock){..}, at: [] > >> xfs_ilock_nowait+0x184/0x200 [xfs] > >> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 > >> 83a3fd56 45a3849a 88009f368f60 8372afea > >> 88009f368f88 8309ddb5 0010 > >> 880243566288 0008 88009f369008 831534d3 > >> Call Trace: > >> [] dump_stack+0x4e/0x7a > >> [] __might_sleep+0x105/0x150 > >> [] mempool_alloc+0xa3/0x170 > >> [] ? deactivate_slab+0x51a/0x590 > >> [] bio_alloc_bioset+0x156/0x210 > >> [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs] > >> [] ? xlog_bdstrat+0x22/0x60 [xfs] > >> [] xfs_buf_iorequest+0x6b/0xf0 [xfs] > >> [] xlog_bdstrat+0x22/0x60 [xfs] > >> [] xlog_sync+0x3a7/0x5b0 [xfs] > >> [] xlog_state_release_iclog+0x10f/0x120 [xfs] > >> [] xlog_write+0x6f0/0x800 [xfs] > >> [] xlog_cil_push+0x2f1/0x410 [xfs] > >> [] xlog_cil_force_lsn+0x1d8/0x210 [xfs] > >> [] ? xfs_bmbt_get_all+0x18/0x20 [xfs] > >> [] _xfs_log_force+0x70/0x290 [xfs] > >> [] ? get_parent_ip+0xd/0x50 > >> [] xfs_log_force+0x26/0xb0 [xfs] > >> [] ? _xfs_buf_find+0x1f6/0x3c0 [xfs] > >> [] xfs_buf_lock+0x133/0x140 [xfs] > >> [] _xfs_buf_find+0x1f6/0x3c0 [xfs] > >> [] xfs_buf_get_map+0x2a/0x1b0 [xfs] > >> [] xfs_trans_get_buf_map+0x1a1/0x240 [xfs] > >> [] xfs_da_get_buf+0xbd/0x100 [xfs] > >> [] xfs_dir3_data_init+0x59/0x1d0 [xfs] > >> [] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs] > >> [] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs] > >> [] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs] > >> [] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs] > >> [] ? xfs_da_compname+0x1f/0x30 [xfs] > >> [] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs] > >> [] xfs_dir2_sf_addname+0x348/0x6d0 [xfs] > >> [] ? xfs_setup_inode+0x1cd/0x320 [xfs] > >> [] xfs_dir_createname+0x184/0x1e0 [xfs] > >> [] xfs_create+0x469/0x580 [xfs] > >> [] xfs_vn_mknod+0xc4/0x1e0 [xfs] > >> [] xfs_vn_create+0x13/0x20 [xfs] > >> [] vfs_create+0x95/0xc0 > >> [] do_last.isra.51+0x9f8/0x11f0 > >> [] ? link_path_walk+0x81/0x870 > >> [] path_openat+0xc9/0x620 > >> [] ? put_dec+0x72/0x90 > >> [] do_filp_open+0x4d/0xb0 > >> [] file_open_name+0xfe/0x160 > >> [] filp_open+0x44/0x60 > >> [] do_coredump+0x602/0xf60 > >> [] get_signal_to_deliver+0x2b8/0x6b0 > >> [] do_signal+0x57/0x9d0 > >> [] ? __acct_update_integrals+0x8e/0x120 > >> [] ? __schedule+0x60/0x850 > >> [] ? preempt_count_sub+0x6b/0xf0 > >> [] ? _raw_spin_unlock+0x31/0x50 > >> [] ? vtime_account_user+0x91/0xa0 > >> [] ? context_tracking_user_exit+0x9b/0x100 > >> [] do_notify_resume+0x5c/0xa0 > >> [] retint_signal+0x46/0x90 > > > > There's nowhere in this XFS stack that disables interrupts, so I > > thinks it's either above or below XFS that is causing this problem. > > any chance the stack got blown & corrupted things such that it > only _looks_ like irqs are disabled? The 'good' news is I seem to be able to reproduce it fairly quickly. I've seen it 3-4 times this afternoon. (Always from the coredump path) I was on vacation last week, but I wasn't hitting this before then, so it's a fairly recent change that's introduced this. Al ? Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On 2/11/14, 3:08 PM, Dave Chinner wrote: > On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote: >> BUG: sleeping function called from invalid context at mm/mempool.c:203 >> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3 >> 5 locks held by trinity-c3/27511: >> #0: (sb_writers#9){..}, at: [] >> mnt_want_write+0x24/0x50 >> #1: (&type->i_mutex_dir_key#3){..}, at: [] >> do_last.isra.51+0x294/0x11f0 >> #2: (sb_internal#2){..}, at: [] >> xfs_trans_alloc+0x24/0x40 [xfs] >> #3: (&(&ip->i_lock)->mr_lock/1){..}, at: [] >> xfs_ilock+0x16f/0x1b0 [xfs] >> #4: (&(&ip->i_lock)->mr_lock){..}, at: [] >> xfs_ilock_nowait+0x184/0x200 [xfs] >> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 >> 83a3fd56 45a3849a 88009f368f60 8372afea >> 88009f368f88 8309ddb5 0010 >> 880243566288 0008 88009f369008 831534d3 >> Call Trace: >> [] dump_stack+0x4e/0x7a >> [] __might_sleep+0x105/0x150 >> [] mempool_alloc+0xa3/0x170 >> [] ? deactivate_slab+0x51a/0x590 >> [] bio_alloc_bioset+0x156/0x210 >> [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs] >> [] ? xlog_bdstrat+0x22/0x60 [xfs] >> [] xfs_buf_iorequest+0x6b/0xf0 [xfs] >> [] xlog_bdstrat+0x22/0x60 [xfs] >> [] xlog_sync+0x3a7/0x5b0 [xfs] >> [] xlog_state_release_iclog+0x10f/0x120 [xfs] >> [] xlog_write+0x6f0/0x800 [xfs] >> [] xlog_cil_push+0x2f1/0x410 [xfs] >> [] xlog_cil_force_lsn+0x1d8/0x210 [xfs] >> [] ? xfs_bmbt_get_all+0x18/0x20 [xfs] >> [] _xfs_log_force+0x70/0x290 [xfs] >> [] ? get_parent_ip+0xd/0x50 >> [] xfs_log_force+0x26/0xb0 [xfs] >> [] ? _xfs_buf_find+0x1f6/0x3c0 [xfs] >> [] xfs_buf_lock+0x133/0x140 [xfs] >> [] _xfs_buf_find+0x1f6/0x3c0 [xfs] >> [] xfs_buf_get_map+0x2a/0x1b0 [xfs] >> [] xfs_trans_get_buf_map+0x1a1/0x240 [xfs] >> [] xfs_da_get_buf+0xbd/0x100 [xfs] >> [] xfs_dir3_data_init+0x59/0x1d0 [xfs] >> [] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs] >> [] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs] >> [] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs] >> [] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs] >> [] ? xfs_da_compname+0x1f/0x30 [xfs] >> [] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs] >> [] xfs_dir2_sf_addname+0x348/0x6d0 [xfs] >> [] ? xfs_setup_inode+0x1cd/0x320 [xfs] >> [] xfs_dir_createname+0x184/0x1e0 [xfs] >> [] xfs_create+0x469/0x580 [xfs] >> [] xfs_vn_mknod+0xc4/0x1e0 [xfs] >> [] xfs_vn_create+0x13/0x20 [xfs] >> [] vfs_create+0x95/0xc0 >> [] do_last.isra.51+0x9f8/0x11f0 >> [] ? link_path_walk+0x81/0x870 >> [] path_openat+0xc9/0x620 >> [] ? put_dec+0x72/0x90 >> [] do_filp_open+0x4d/0xb0 >> [] file_open_name+0xfe/0x160 >> [] filp_open+0x44/0x60 >> [] do_coredump+0x602/0xf60 >> [] get_signal_to_deliver+0x2b8/0x6b0 >> [] do_signal+0x57/0x9d0 >> [] ? __acct_update_integrals+0x8e/0x120 >> [] ? __schedule+0x60/0x850 >> [] ? preempt_count_sub+0x6b/0xf0 >> [] ? _raw_spin_unlock+0x31/0x50 >> [] ? vtime_account_user+0x91/0xa0 >> [] ? context_tracking_user_exit+0x9b/0x100 >> [] do_notify_resume+0x5c/0xa0 >> [] retint_signal+0x46/0x90 > > There's nowhere in this XFS stack that disables interrupts, so I > thinks it's either above or below XFS that is causing this problem. any chance the stack got blown & corrupted things such that it only _looks_ like irqs are disabled? -Eric > Cheers, > > Dave. > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.14-rc2 XFS backtrace because irqs_disabled.
On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote: > BUG: sleeping function called from invalid context at mm/mempool.c:203 > in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3 > 5 locks held by trinity-c3/27511: > #0: (sb_writers#9){..}, at: [] > mnt_want_write+0x24/0x50 > #1: (&type->i_mutex_dir_key#3){..}, at: [] > do_last.isra.51+0x294/0x11f0 > #2: (sb_internal#2){..}, at: [] > xfs_trans_alloc+0x24/0x40 [xfs] > #3: (&(&ip->i_lock)->mr_lock/1){..}, at: [] > xfs_ilock+0x16f/0x1b0 [xfs] > #4: (&(&ip->i_lock)->mr_lock){..}, at: [] > xfs_ilock_nowait+0x184/0x200 [xfs] > CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 > 83a3fd56 45a3849a 88009f368f60 8372afea > 88009f368f88 8309ddb5 0010 > 880243566288 0008 88009f369008 831534d3 > Call Trace: > [] dump_stack+0x4e/0x7a > [] __might_sleep+0x105/0x150 > [] mempool_alloc+0xa3/0x170 > [] ? deactivate_slab+0x51a/0x590 > [] bio_alloc_bioset+0x156/0x210 > [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs] > [] ? xlog_bdstrat+0x22/0x60 [xfs] > [] xfs_buf_iorequest+0x6b/0xf0 [xfs] > [] xlog_bdstrat+0x22/0x60 [xfs] > [] xlog_sync+0x3a7/0x5b0 [xfs] > [] xlog_state_release_iclog+0x10f/0x120 [xfs] > [] xlog_write+0x6f0/0x800 [xfs] > [] xlog_cil_push+0x2f1/0x410 [xfs] > [] xlog_cil_force_lsn+0x1d8/0x210 [xfs] > [] ? xfs_bmbt_get_all+0x18/0x20 [xfs] > [] _xfs_log_force+0x70/0x290 [xfs] > [] ? get_parent_ip+0xd/0x50 > [] xfs_log_force+0x26/0xb0 [xfs] > [] ? _xfs_buf_find+0x1f6/0x3c0 [xfs] > [] xfs_buf_lock+0x133/0x140 [xfs] > [] _xfs_buf_find+0x1f6/0x3c0 [xfs] > [] xfs_buf_get_map+0x2a/0x1b0 [xfs] > [] xfs_trans_get_buf_map+0x1a1/0x240 [xfs] > [] xfs_da_get_buf+0xbd/0x100 [xfs] > [] xfs_dir3_data_init+0x59/0x1d0 [xfs] > [] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs] > [] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs] > [] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs] > [] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs] > [] ? xfs_da_compname+0x1f/0x30 [xfs] > [] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs] > [] xfs_dir2_sf_addname+0x348/0x6d0 [xfs] > [] ? xfs_setup_inode+0x1cd/0x320 [xfs] > [] xfs_dir_createname+0x184/0x1e0 [xfs] > [] xfs_create+0x469/0x580 [xfs] > [] xfs_vn_mknod+0xc4/0x1e0 [xfs] > [] xfs_vn_create+0x13/0x20 [xfs] > [] vfs_create+0x95/0xc0 > [] do_last.isra.51+0x9f8/0x11f0 > [] ? link_path_walk+0x81/0x870 > [] path_openat+0xc9/0x620 > [] ? put_dec+0x72/0x90 > [] do_filp_open+0x4d/0xb0 > [] file_open_name+0xfe/0x160 > [] filp_open+0x44/0x60 > [] do_coredump+0x602/0xf60 > [] get_signal_to_deliver+0x2b8/0x6b0 > [] do_signal+0x57/0x9d0 > [] ? __acct_update_integrals+0x8e/0x120 > [] ? __schedule+0x60/0x850 > [] ? preempt_count_sub+0x6b/0xf0 > [] ? _raw_spin_unlock+0x31/0x50 > [] ? vtime_account_user+0x91/0xa0 > [] ? context_tracking_user_exit+0x9b/0x100 > [] do_notify_resume+0x5c/0xa0 > [] retint_signal+0x46/0x90 There's nowhere in this XFS stack that disables interrupts, so I thinks it's either above or below XFS that is causing this problem. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/