Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-17 Thread Dave Chinner
On Sat, Feb 15, 2014 at 03:50:29PM -0800, Linus Torvalds wrote:
> [ Added linux-mm to the participants list ]
> 
> On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner  wrote:
> >
> > Dave, the patch below should chop off the stack usage from
> > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> > Can you given this a run?
> 
> Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
> people, your parents were some seriously boring people, were they not?
> We've got too many Dave's around),

It's an exclusive club - we have 'kernel hacker Dave' reunions in
bars around the world. We should get some tshirts made up  :)

> but DaveC earlier pointed out that
> pretty much any memory allocation path can end up using 3kB of stack
> even without XFS being involved.
> 
> Which does bring up the question whether we should look (once more) at
> the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
> reclaim..

We do that mostly already, but GFP_KERNEL allows swap IO and that's
where the deepest stack I saw came from.

Even if we don't allow IO at all, we're still going to see stack
usage of 2-2.5k in direct reclaim. e.g.  invalidate a page and enter
the rmap code.  The rmap is protected by a mutex, so if we fail to
get that we have about 1.2k of stack consumed from there and that is
on top of the the allocator/reclaim that has already consumes ~1k of
stack...

> Direct reclaim historically used to be an important throttling
> mechanism, and I used to not be a fan of trying to avoid direct
> reclaim. But the stack depth issue really looks to be pretty bad, and
> I think we've gotten better at throttling explicitly, so..
> 
> I *think* we already limit filesystem writeback to just kswapd (in
> shrink_page_list()), but DaveC posted a backtrace that goes through
> do_try_to_free_pages() to shrink_slab(), and through there to the
> filesystem and then IO. That looked like a disaster.

Right, that's an XFS problem, and I'm working on fixing it. The
Patch I sent to DaveJ fixes the worst case, but I need to make it
completely IO-less while still retaining the throttling the IO gives
us.

> And that's because (if I read things right) shrink_page_list() limits
> filesystem page writeback to kswapd, but not swap pages. Which I think
> probably made more sense back in the days than it does now (I
> certainly *hope* that swapping is less important today than it was,
> say, ten years ago)
> 
> So I'm wondering whether we should remove that page_is_file_cache()
> check from shrink_page_list()?

The thing is, the stack usage from the swap IO path is pretty well
bound - it's just the worst case stack of issuing IO. We know it
won't recurse into direct reclaim, so mempool allocation and
blocking is all we need to consider. Compare that to a filesystem
which may need to allocate extents and hence do transactions and
split btrees and read metadata and allocate large amounts of memory
even before it gets to the IO layers.

Hence I suspect that we could do a simple thing like only allow swap
if there's more than half the stack available in the current reclaim
context. Because, let's face it, if the submit_bio path is consuming
more than half the available stack then we're totally screwed from a
filesystem perspective

> And then there is that whole shrink_slab() case...

I think with shrinkers we just need to be more careful. The XFS
behaviour is all my fault, and I should have known better that to
design code that requires IO in the direct reclaim path. :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-17 Thread Al Viro
On Mon, Feb 17, 2014 at 06:46:48PM +0100, Oleg Nesterov wrote:
> > pulling the logics *out* of kernel/signal.c and into arch/*.
> 
> Not really, I think.

How so?  You propose to make all architectures call do_coredump()
themselves, instead of having it done centrally.  It's more boilerplate
for all of them to get wrong.  And IME any boilerplate in that are
*will* be gotten wrong - I've gone through signal handling on all
architectures more than a few times and every time it caught a pile
of bugs...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-17 Thread Oleg Nesterov
On 02/17, Al Viro wrote:
>
> On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:
>
> > Looks like, this is all is really nasty. Actually, I think siginfo on
> > stack is not that bad if we are going to do handle_signal() or restart,
> > perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> > Something like below.
>
> Yecc...  You've just broken every architecture other than x86,

Of course, this is only to explain what I meant.

> and to
> fix them you'll need to massage every get_signal()/get_signal_to_deliver()
> user out there,

Yes.

> pulling the logics *out* of kernel/signal.c and into arch/*.

Not really, I think. Of course this change should be cleanuped. And it
should not require to change all architectures at once.

> This is just plain wrong.

I agree, this change is also ugly.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-17 Thread Al Viro
On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:

> Looks like, this is all is really nasty. Actually, I think siginfo on
> stack is not that bad if we are going to do handle_signal() or restart,
> perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> Something like below.

Yecc...  You've just broken every architecture other than x86, and to
fix them you'll need to massage every get_signal()/get_signal_to_deliver()
user out there, pulling the logics *out* of kernel/signal.c and into arch/*.
This is just plain wrong.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-17 Thread Oleg Nesterov
On 02/15, Oleg Nesterov wrote:
>
> On 02/15, Al Viro wrote:
> >
> > On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > > So basically we want a different condition for "can we just go ahead and
> > > > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > > > it ought to be something like "it's on the list or it is referenced by
> > > > ksiginfo".  Locking will be interesting, though... ;-/
> > >
> > > I guess yes... send_sigqueue() checks list_empty() too, probably nobody 
> > > else.
> >
> > The trouble being, we might end up with
> > Q picked by collect_signal and and stuff into ksiginfo
> > Q resubmitted by timer code
>
> In this case the timer code should simply inc ->si_overrun and do nothing.
>
> IOW, list_empty() should be turned into is_queued(), and is_queued()
> should be true until dismiss_siginfo() which should also do
> do_schedule_next_timer(). I think.

No, this is even more complicated. We also need do_schedule_next_timer()
to calculate ->si_overrun we are going to report, I missed this...

Looks like, this is all is really nasty. Actually, I think siginfo on
stack is not that bad if we are going to do handle_signal() or restart,
perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
Something like below.

Oleg.


diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9e5de68..52f16f9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -684,6 +684,52 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
test_thread_flag(TIF_IA32) ? __NR_ia32_restart_syscall : 
__NR_restart_syscall
 #endif /* CONFIG_X86_32 */
 
+static noinline int process_signal(struct pt_regs *regs, siginfo_t **pinfo)
+{
+   struct ksignal ksig;
+
+   switch (get_signal(&ksig)) {
+   case SIG_DUMP:
+   *pinfo = kmalloc(sizeof(siginfo_t), GFP_KERNEL);
+   if (*pinfo)
+   copy_siginfo(*pinfo, &ksig.info);
+
+   case SIG_EXIT:
+   return ksig.info.si_signo;
+
+   default:
+   handle_signal(&ksig, regs);
+   break;
+
+   case 0:
+   /* Did we come from a system call? */
+   if (syscall_get_nr(current, regs) >= 0) {
+   /* Restart the system call - no handlers present */
+   switch (syscall_get_error(current, regs)) {
+   case -ERESTARTNOHAND:
+   case -ERESTARTSYS:
+   case -ERESTARTNOINTR:
+   regs->ax = regs->orig_ax;
+   regs->ip -= 2;
+   break;
+
+   case -ERESTART_RESTARTBLOCK:
+   regs->ax = NR_restart_syscall;
+   regs->ip -= 2;
+   break;
+   }
+   }
+
+   /*
+* If there's no signal to deliver, we just put the saved 
sigmask
+* back.
+*/
+   restore_saved_sigmask();
+   }
+
+   return 0;
+}
+
 /*
  * Note that 'init' is a special process: it doesn't get signals it doesn't
  * want to handle. Thus you cannot kill init even with a SIGKILL even by
@@ -691,37 +737,16 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
  */
 static void do_signal(struct pt_regs *regs)
 {
-   struct ksignal ksig;
+   siginfo_t *info = NULL;
+   int sig = process_signal(regs, &info);
 
-   if (get_signal(&ksig)) {
-   /* Whee! Actually deliver the signal.  */
-   handle_signal(&ksig, regs);
-   return;
-   }
-
-   /* Did we come from a system call? */
-   if (syscall_get_nr(current, regs) >= 0) {
-   /* Restart the system call - no handlers present */
-   switch (syscall_get_error(current, regs)) {
-   case -ERESTARTNOHAND:
-   case -ERESTARTSYS:
-   case -ERESTARTNOINTR:
-   regs->ax = regs->orig_ax;
-   regs->ip -= 2;
-   break;
-
-   case -ERESTART_RESTARTBLOCK:
-   regs->ax = NR_restart_syscall;
-   regs->ip -= 2;
-   break;
+   if (sig) {
+   if (info) {
+   do_coredump(info);
+   kfree(info);
}
+   do_group_exit(sig);
}
-
-   /*
-* If there's no signal to deliver, we just put the saved sigmask
-* back.
-*/
-   restore_saved_sigmask();
 }
 
 /*
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 2ac423b..33b5e04 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -285,6 +285,9 @@ struct ksignal {
int sig;
 };
 
+#define SIG_EXIT   -1
+#define SIG_DUMP   -2
+
 extern int get_signa

Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Linus Torvalds
[ Added linux-mm to the participants list ]

On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner  wrote:
>
> Dave, the patch below should chop off the stack usage from
> xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> Can you given this a run?

Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
people, your parents were some seriously boring people, were they not?
We've got too many Dave's around), but DaveC earlier pointed out that
pretty much any memory allocation path can end up using 3kB of stack
even without XFS being involved.

Which does bring up the question whether we should look (once more) at
the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
reclaim..

Direct reclaim historically used to be an important throttling
mechanism, and I used to not be a fan of trying to avoid direct
reclaim. But the stack depth issue really looks to be pretty bad, and
I think we've gotten better at throttling explicitly, so..

I *think* we already limit filesystem writeback to just kswapd (in
shrink_page_list()), but DaveC posted a backtrace that goes through
do_try_to_free_pages() to shrink_slab(), and through there to the
filesystem and then IO. That looked like a disaster.

And that's because (if I read things right) shrink_page_list() limits
filesystem page writeback to kswapd, but not swap pages. Which I think
probably made more sense back in the days than it does now (I
certainly *hope* that swapping is less important today than it was,
say, ten years ago)

So I'm wondering whether we should remove that page_is_file_cache()
check from shrink_page_list()?

And then there is that whole shrink_slab() case...

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Linus Torvalds
On Sat, Feb 15, 2014 at 2:28 PM, Dave Jones  wrote:
>
> I've got a shitload of debug options enabled, which may explain it.
> Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

Well, a lot of it is just the callee-saved registers. The compiler
will tend to preferentially allocate registers in the callee-trashed
registers, but if the function isn't a leaf function, any registers
that are live around a function call will have to be saved somewhere -
either explicitly around the function call, or - more likely - in
callee-saved registers that then get saved in the prologue/epilogue of
the function. And this will happen even in leaf functions when there
is enough register pressure that the callee-trashed registers aren't
sufficient (which is pretty common).

So saving 5-6 registers on the stack (in addition to any actual stack
frame) is pretty much the norm for anything but the very simplest
cases.

But yeah, I'm sure some config options make it worse.
STACK_PROTECTOR_STRONG could easily be one of those.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Dave Jones
On Sun, Feb 16, 2014 at 09:23:56AM +1100, Dave Chinner wrote:
 
 > There's a pretty massive difference between the actual stack usage
 > of the local variables and the amount of stack being used by the
 > compiled code.
 > 
 > What it appears to be is that the compiler is pushing 6-10 registers
 > to the stack on every function call. So a function that only has 3
 > local variables and does very little but allocate a structure and
 > call other functions saves an 6 registers to the stack before it
 > starts:

I've got a shitload of debug options enabled, which may explain it.
Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Dave Chinner
On Fri, Feb 14, 2014 at 11:01:23AM -0500, Dave Jones wrote:
> On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:
>  
>  > > I can fix this one easily - we already have a workqueue for doing
>  > > async log pushes (will split the stack between xlog_cil_force_lsn
>  > > and xlog_cil_push), but the reason we haven't used it for synchronous
>  > > log forces is that screws up fsync performance on CFQ. We don't
>  > > recommend CFQ with XFS anyway, so I think I'll make this change
>  > > anyway.
>  > 
>  > Dave, the patch below should chop off the stack usage from
>  > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
>  > Can you given this a run?
> 
> Looks like it's survived an overnight run..

Great.

One thing that has been puzzling me is why I'm seeing stack usage
reported that is way above what we declare locally on the stack.
e.g.

 29) 2048 224   xfs_da_grow_inode_int+0xbb/0x340
 30) 1824  96   xfs_dir2_grow_inode+0x63/0x110
 31) 1728 208   xfs_dir2_sf_to_block+0xe7/0x5e0
 32) 1520 144   xfs_dir2_sf_addname+0x115/0x5c0
 33) 1376  96   xfs_dir_createname+0x164/0x1a0
 34) 1280 224   xfs_create+0x536/0x660
 35) 1056 128   xfs_vn_mknod+0xc8/0x1d0

I pulled 128 bytes out of xfs_dir_createname by allocating the
xfs_da_args structure, but I still couldn't reconcile the amount of
stack use being reported with the amount used by locally declared
variables (even considering in-lined leaf functions). For example:

Locally
DeclaredUsed   function
  88 224   xfs_da_grow_inode_int
  32  96   xfs_dir2_grow_inode
 168 208   xfs_dir2_sf_to_block
  56 144   xfs_dir2_sf_addname
  16  96   xfs_dir_createname
 120 224   xfs_create
  52 128   xfs_vn_mknod

There's a pretty massive difference between the actual stack usage
of the local variables and the amount of stack being used by the
compiled code.

What it appears to be is that the compiler is pushing 6-10 registers
to the stack on every function call. So a function that only has 3
local variables and does very little but allocate a structure and
call other functions saves an 6 registers to the stack before it
starts:

Dump of assembler code for function xfs_dir_createname:
214 {
   0x814d7380 <+0>: callq  0x81cf0940
   0x814d7385 <+5>: push   %rbp
   0x814d7386 <+6>: mov%rsp,%rbp
   0x814d7389 <+9>: sub$0x50,%rsp
   0x814d738d <+13>:mov%rbx,-0x28(%rbp)
   0x814d7391 <+17>:mov%rdi,%rbx
   0x814d7394 <+20>:mov%r12,-0x20(%rbp)
   0x814d7398 <+24>:mov%rcx,%r12
   0x814d739b <+27>:mov%r13,-0x18(%rbp)
   0x814d739f <+31>:mov%rsi,%r13
   0x814d73a5 <+37>:mov%r14,-0x10(%rbp)
   0x814d73a9 <+41>:mov%rdx,%r14
   0x814d73ac <+44>:mov%r15,-0x8(%rbp)
   0x814d73b0 <+48>:mov%r8,%r15
   0x814d73b7 <+55>:mov%r9,-0x48(%rbp)
.

If this is typical across the call chain (appears to be from my
quick survey) then we are averaging 40-50 bytes of stack per
call for saved registers. That's a lot of stack space we can't
directly control the usage of, especially when we are talking about
call chains that can get 50+ functions deep...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Oleg Nesterov
On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > So basically we want a different condition for "can we just go ahead and
> > > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > > it ought to be something like "it's on the list or it is referenced by
> > > ksiginfo".  Locking will be interesting, though... ;-/
> >
> > I guess yes... send_sigqueue() checks list_empty() too, probably nobody 
> > else.
>
> The trouble being, we might end up with
>   Q picked by collect_signal and and stuff into ksiginfo
>   Q resubmitted by timer code

In this case the timer code should simply inc ->si_overrun and do nothing.

IOW, list_empty() should be turned into is_queued(), and is_queued()
should be true until dismiss_siginfo() which should also do
do_schedule_next_timer(). I think.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Al Viro
On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > So basically we want a different condition for "can we just go ahead and
> > free that sucker", right?  Instead of "it's on the list, shan't free it"
> > it ought to be something like "it's on the list or it is referenced by
> > ksiginfo".  Locking will be interesting, though... ;-/
> 
> I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.

The trouble being, we might end up with
Q picked by collect_signal and and stuff into ksiginfo
Q resubmitted by timer code
Q picked by *another* thread into its ksiginfo
the first thread finally done with signal
and at that point we still have a reference in the second thread's ksiginfo.
Hell knows - my first reflex in that kind of situation is to replace
that flag with refcount, so that timer code would hold a reference from
timer_create(2) to timer_delete(2), send_sigqueue() would bump it and
dismiss_siginfo() - drop the sucker.  But that means either grabbing
siglock in dismiss_siginfo() or making the counter atomic; either way
it's a cacheline ping-pong.  Atomic counter is less painful in that respect -
it would be right next to the list, so we dirty that cacheline anyway...

I'm still trying another approach (slightly bigger ksiginfo used to store
all variants with si_code >= 0), but it has messiness of its own; we'll
see how it goes...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Oleg Nesterov
On 02/15, Al Viro wrote:
>
> > Ouch...  I think I see what you mean.  Let me see if I got it right:
> > timer->sigq is *not* freed by collect_signal(); it's done by
> > release_posix_timer() instead, under siglock.  Frankly, this
> > /*
> >  * If it is queued it will be freed when dequeued,
> >  * like the "regular" sigqueue.
> >  */
> > if (!list_empty(&q->list))
> > q = NULL;
> > in sigqueue_free() smells like it's asking for races.  Sigh...

This is protected by ->siglock, should be safe...

> So basically we want a different condition for "can we just go ahead and
> free that sucker", right?  Instead of "it's on the list, shan't free it"
> it ought to be something like "it's on the list or it is referenced by
> ksiginfo".  Locking will be interesting, though... ;-/

I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.


> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

Or the task sends a SI_TIMER info to itself via sys_rt_sigqueueinfo().

Afaics, nothing really bad can happen, I mean the kernel should not
crash or something like this. do_schedule_next_timer() can be fooled,
but at least lock_timer() can only succeed if this process actually
has a timer with the same timer_id. This sigqueue != timer->sigq, but
I think this doesn't matter, posix_timer_event() will use timer->sigq
anyway.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Al Viro
On Sat, Feb 15, 2014 at 03:58:38PM +, Al Viro wrote:

> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

I wonder if it would be simpler to use the pointer *only* for si_code < 0
case.  It still gives us ksiginfo_t much smaller than siginfo_t, avoids
all the mess with timers and AFAICS results in less intrusive patch.
IOW, the split between "we know what that sucker is" and "completely
opaque shit the userland has given us".

I'll try to put together something along those lines and see how well does
that work...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Al Viro
On Sat, Feb 15, 2014 at 03:36:31PM +, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:22:51PM +, Al Viro wrote:
> > On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
> > 
> > > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> > > 
> > >Once get_signal_to_deliver() or any other caller drops ->siglock
> > >another thread can do sys_timer_delete()->sigqueue_free().
> > 
> > How the devil would it find the sucker?  It's off the list already.
> 
> Ouch...  I think I see what you mean.  Let me see if I got it right:
> timer->sigq is *not* freed by collect_signal(); it's done by
> release_posix_timer() instead, under siglock.  Frankly, this
> /*
>  * If it is queued it will be freed when dequeued,
>  * like the "regular" sigqueue.
>  */
> if (!list_empty(&q->list))
> q = NULL;
> in sigqueue_free() smells like it's asking for races.  Sigh...

So basically we want a different condition for "can we just go ahead and
free that sucker", right?  Instead of "it's on the list, shan't free it"
it ought to be something like "it's on the list or it is referenced by
ksiginfo".  Locking will be interesting, though... ;-/

BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
something blocked, shoves it back with PTRACE_SETSIGINFO and does
PTRACE_CONT with that new signal number?  Would we get two sigqueue instances
with the same ->si_tid, one of them matching the timer->sigq and another
- not?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Al Viro
On Sat, Feb 15, 2014 at 03:22:51PM +, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
> 
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> > 
> >Once get_signal_to_deliver() or any other caller drops ->siglock
> >another thread can do sys_timer_delete()->sigqueue_free().
> 
> How the devil would it find the sucker?  It's off the list already.

Ouch...  I think I see what you mean.  Let me see if I got it right:
timer->sigq is *not* freed by collect_signal(); it's done by
release_posix_timer() instead, under siglock.  Frankly, this
/*
 * If it is queued it will be freed when dequeued,
 * like the "regular" sigqueue.
 */
if (!list_empty(&q->list))
q = NULL;
in sigqueue_free() smells like it's asking for races.  Sigh...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Oleg Nesterov
On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
>
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> >
> >Once get_signal_to_deliver() or any other caller drops ->siglock
> >another thread can do sys_timer_delete()->sigqueue_free().
>
> How the devil would it find the sucker?

It simply frees the SIGQUEUE_PREALLOC sigqueue, k_itimer->sigq.

> It's off the list already.

Exactly, list_empty(q->list) == T. So release_posix_timer()->sigqueue_free()
assumes we can safely free it.

> > 2. We need to move do_schedule_next_timer() from dequeue_signal()
> >here.
> >
> >Otherwise ->q can be reused/overwritten by the next send_sigqueue()
> >right affter ->siglock is dropped.
>
> Ditto.  We rip them out of queue on collect_signal();

Yes, and dequeue_signal()->do_schedule_next_timer() can trigger another
send_sigqueue() which uses the same SIGQUEUE_PREALLOC sigqueue once we
drop ->siglock.

This is not that bad, but at least ->si_overrun can be overwritten
before __setup_rt_frame()->copy_siginfo_to_user().

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Al Viro
On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:

> 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> 
>Once get_signal_to_deliver() or any other caller drops ->siglock
>another thread can do sys_timer_delete()->sigqueue_free().

How the devil would it find the sucker?  It's off the list already.

> 2. We need to move do_schedule_next_timer() from dequeue_signal()
>here.
> 
>Otherwise ->q can be reused/overwritten by the next send_sigqueue()
>right affter ->siglock is dropped.

Ditto.  We rip them out of queue on collect_signal(); the only thing we do
not do is actual __sigqueue_free().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Oleg Nesterov
On 02/14, Christoph Hellwig wrote:
>
> Might aswell stick the discmiss into what was dequeue_signal_lock().
> Which at that point should get a saner name (maybe thread_dequeue_signal ?)
> and lose all argument except maybe task_struct

No, task_struct argument should die, I think. It is misleading.

spin_lock(tsk->sighand->siglock) is simply wrong unless tsk == current.

And dequeue_signal() assumes that tsk == current too, otherwise
recalc_sigpending() is wrong.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-15 Thread Oleg Nesterov
On 02/15, Al Viro wrote:
>
> OK, _very_ preliminary patch follows.  It's uglier than it has to

And I'm afraid it needs more uglifications...

> +void dismiss_siginfo(ksiginfo_t *info)
> +{
> + if (info->q)
> + __sigqueue_free(info->q);
> + info->q = NULL;
> +}

1. info->q can be already freed if SIGQUEUE_PREALLOC.

   Once get_signal_to_deliver() or any other caller drops ->siglock
   another thread can do sys_timer_delete()->sigqueue_free().

2. We need to move do_schedule_next_timer() from dequeue_signal()
   here.

   Otherwise ->q can be reused/overwritten by the next send_sigqueue()
   right affter ->siglock is dropped.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Al Viro
On Thu, Feb 13, 2014 at 09:58:47AM -0800, Linus Torvalds wrote:
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov  wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
> 
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc. That kind
> of detail should be confined to inside signal.c.
> 
> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation). And outside of signal.c only
> signalfd uses "dequeue_signal()" itself, and that would be the only
> one that would need to be taught to use (in signalfd_copyinfo()) and
> then free the sigqueue entry.
> 
> So it really looks like the right thing to do, and fairly
> straightforward. But I'm leaving the coding proof to Al, since he
> already offered ;)

OK, _very_ preliminary patch follows.  It's uglier than it has to and it's
not in the form I would consider suitable for merge.  It doesn't depend
on conversions of architectures to use of ksignal as prereqs; instead of
that it has config symbol (ARCH_USES_KSIGNAL) that can be selected on
architectures already through that conversion (in this patch - just on
x86, since that was all I could test on at the moment; e.g. alpha and
arm could also get such selects right now).  If it isn't selected, everything
builds as usual.  With large siginfo on stack for signal path, as before.
We get ksiginfo_t as an alias for siginfo_t in that case, dismiss_siginfo()
is a no-op, etc.  If it _is_ selected, ksiginfo_t is much smaller than
siginfo_t.

Of course, we pay for that with ifdefs in places that need to understand
[k]siginfo_t guts; signalfd_copyinfo(), copy_siginfo_from_user{,32}() and
copy_ksiginfo_to_user{,32}(), tracepoint for deliver_signal(), collect_signal(),
trivial ones in dequeue_signal() and an ugly bugger in ptrace_signal().
No way around that until all architectures are converted.  Other places
get away with some siginfo_t replaced with ksiginfo_t and dismiss_siginfo()
added in some places.  arch/x86/ia32/ia32_signal.c needed a bit of change
(we probably ought to unify copy_siginfo_from_user32() on all architectures
that have it, actually).

The sucker survived LTP syscall tests and hasn't died on xfstests yet either
(== doesn't seem to leak horribly).  Consider it a proof-of-concept and no
more than that.  One thing I'm really not satisfied with is signal sending
pathway - it would probably make a lot of sense to use ksiginfo there as
well, and have insertion into queue simply steal info->q instead of allocating
and copying.  Another is that ifdefs like that are tolerable only if we
plan to convert everything during this cycle; I believe it to be doable,
actually.  Having that sucker go before the conversions would make for
simpler logistics wrt architecture trees, but it may be better to get
all conversions merged first.  Not sure...  Anyway, here is it; comments
are welcome.

Signed-off-by: Al Viro 
---
diff --git a/arch/Kconfig b/arch/Kconfig
index 80bbb8c..f753561 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -522,4 +522,7 @@ config OLD_SIGACTION
 config COMPAT_OLD_SIGACTION
bool
 
+config ARCH_USES_KSIGNAL
+   bool
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..4ec468d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,6 +127,7 @@ config X86
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
select HAVE_CC_STACKPROTECTOR
+   select ARCH_USES_KSIGNAL
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 2206757..611a03e 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,7 +34,7 @@
 #include 
 #include 
 
-int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
+int __copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t 
*from)
 {
int err = 0;
bool ia32 = test_thread_flag(TIF_IA32);
@@ -105,6 +105,25 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, 
const siginfo_t *from)
return err;
 }
 
+int copy_siginfo_to_user32(compat_siginfo_t __user *to, const ksiginfo_t *from)
+{
+   int err = 0;
+   if (!from->q) {
+   if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t)))
+   return -EFAULT;
+   put_user_try {
+   /* fast case */
+   put_user_ex(from->si_signo, &to->si_signo);
+   put_user_ex(0, &to->si_errno);
+   put_user_ex((short)from->si_code, &to->

Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Christoph Hellwig
On Fri, Feb 14, 2014 at 04:16:24PM +, Al Viro wrote:
> > All of these have in common that they try to handle signals in a kernel
> > thread (which we don't even allow by default), and that they ignore the
> > siginfo.  I think they could mostly be replaced by an addition to the
> > kthread API to allow a kthread to be killed by signals for legacy
> > reasons.
> 
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely.  I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

Might aswell stick the discmiss into what was dequeue_signal_lock().
Which at that point should get a saner name (maybe thread_dequeue_signal ?)
and lose all argument except maybe task_struct - not that it's
nessecary, but it would mirror the other functions usually used around
it.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Al Viro
On Fri, Feb 14, 2014 at 04:16:24PM +, Al Viro wrote:
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely.  I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

s/jbd/jffs2/, obviously.  Sorry...  And yes, nbd and usb_storage are the
same story.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Al Viro
On Fri, Feb 14, 2014 at 08:13:02AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> > We'd have to teach each user of "dequeue_signal()" to free the siginfo
> > thing. Which shouldn't be too bad - I think we've collected all of
> > that into generic code, and there isn't the mass or architecture code
> > that knows about these things any more. But there are a few odd
> > drivers etc and signalfd.
> 
> The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
> All of these have in common that they try to handle signals in a kernel
> thread (which we don't even allow by default), and that they ignore the
> siginfo.  I think they could mostly be replaced by an addition to the
> kthread API to allow a kthread to be killed by signals for legacy
> reasons.

FWIW, there's a funny situation - all users of dequeue_signal_lock()
actually ignore info completely.  I'm not saying that we ought to
stop returning it, but e.g. jbd part of that patch is simply
diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b60ce1..aefdff2 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c)
/* Put_super will send a SIGKILL and then wait on the sem.
 */
while (signal_pending(current) || freezing(current)) {
-   siginfo_t info;
+   ksiginfo_t info;
unsigned long signr;
 
if (try_to_freeze())
goto again;
 
signr = dequeue_signal_lock(current, ¤t->blocked, 
&info);
+   dismiss_siginfo(&info);
 
switch(signr) {
case SIGSTOP:

Not complicated at all.  Where it does get complicated is ->last_siginfo and
PTRACE_SETSIGINFO - getting that reasonably clean is what I'm still fighting
right now...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Christoph Hellwig
On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd.

The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
All of these have in common that they try to handle signals in a kernel
thread (which we don't even allow by default), and that they ignore the
siginfo.  I think they could mostly be replaced by an addition to the
kthread API to allow a kthread to be killed by signals for legacy
reasons.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Oleg Nesterov
On 02/14, Al Viro wrote:
>
> BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
> separately?  IOW, why do we want the upper 16 bits of ->si_code exposed?
> It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)

Yes, but checkpoint/restart tools want to dump/restore the internal part
of ->si_code too.

See also the "task_pid_vnr(current) == pid" hack in do_rt_sigqueueinfo(),
this allows to queue a SI_FROMKERNEL() siginfo.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Dave Jones
On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:
 
 > > I can fix this one easily - we already have a workqueue for doing
 > > async log pushes (will split the stack between xlog_cil_force_lsn
 > > and xlog_cil_push), but the reason we haven't used it for synchronous
 > > log forces is that screws up fsync performance on CFQ. We don't
 > > recommend CFQ with XFS anyway, so I think I'll make this change
 > > anyway.
 > 
 > Dave, the patch below should chop off the stack usage from
 > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
 > Can you given this a run?

Looks like it's survived an overnight run..

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Al Viro
On Fri, Feb 14, 2014 at 02:29:30PM +0100, Richard Weinberger wrote:
> Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> > On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote:
> >> On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote:
> >>
> >>> I'll try to put something along those lines together, if you or Oleg don't
> >>> do it first.
> >>
> >> OK, having looked at that stuff...
> >>
> >> 1) things become much more compact if we finish conversion to get_signal()
> >> first.
> > 
> > I have vague memories that Richard sent out a series to convert over all
> > architectures a while ago.  Hopefully he has better memory than I do.
> 
> Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
> I think this is a good reason for me to start working on that series again.
> Stay tuned.

Would be great.   I have several done here, but I'll be glad to replace them
with something tested...

BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
separately?  IOW, why do we want the upper 16 bits of ->si_code exposed?
It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Richard Weinberger
Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote:
>> On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote:
>>
>>> I'll try to put something along those lines together, if you or Oleg don't
>>> do it first.
>>
>> OK, having looked at that stuff...
>>
>> 1) things become much more compact if we finish conversion to get_signal()
>> first.
> 
> I have vague memories that Richard sent out a series to convert over all
> architectures a while ago.  Hopefully he has better memory than I do.

Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
I think this is a good reason for me to start working on that series again.
Stay tuned.

Thanks,
//richard

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-14 Thread Christoph Hellwig
On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote:
> 
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
> 
> OK, having looked at that stuff...
> 
> 1) things become much more compact if we finish conversion to get_signal()
> first.

I have vague memories that Richard sent out a series to convert over all
architectures a while ago.  Hopefully he has better memory than I do.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-13 Thread Dave Chinner
On Wed, Feb 12, 2014 at 06:18:29PM +1100, Dave Chinner wrote:
> > It looks like just "do_signal()" has a stack frame that is about 230
> > bytes even under normal circumstancs (largely due to "struct ksignal"
> > - which in turn is largely due to the insane 128-byte padding in
> > siginfo_t). Add a few other frames in there, and I guess that if it
> > was close before, the coredump path just makes it go off.
> 
> Yup. But it's when you see this sort of thing you realise that
> almost no GFP_KERNEL memory allocation is really safe from stack
> overruns, though:
> 
>   0) 6064  64   update_group_power+0x2c/0x270
>   1) 6000 384   find_busiest_group+0x10a/0x8b0
>   2) 5616 288   load_balance+0x165/0x870
>   3) 5328  96   idle_balance+0x106/0x1b0
>   4) 5232 112   __schedule+0x7b6/0x7e0
>   5) 5120  16   schedule+0x29/0x70
>   6) 5104 176   percpu_ida_alloc+0x1b3/0x3d0
>   7) 4928  32   blk_mq_wait_for_tags+0x1f/0x40
>   8) 4896  80   blk_mq_alloc_request_pinned+0x4e/0x110
>   9) 4816 128   blk_mq_make_request+0x42b/0x510
>  10) 4688  48   generic_make_request+0xc0/0x110
>  11) 4640  96   submit_bio+0x71/0x120
>  12) 4544 192   _xfs_buf_ioapply+0x2cc/0x3b0
>  13) 4352  48   xfs_buf_iorequest+0x6f/0xc0
>  14) 4304  32   xlog_bdstrat+0x23/0x60
>  15) 4272  96   xlog_sync+0x3a4/0x5c0
>  16) 4176  48   xlog_state_release_iclog+0x121/0x130
>  17) 4128 192   xlog_write+0x700/0x7c0
>  18) 3936 192   xlog_cil_push+0x2ae/0x3d0
>  19) 3744 128   xlog_cil_force_lsn+0x1b8/0x1e0
>  20) 3616 144   _xfs_log_force_lsn+0x7c/0x300
>  21) 3472  48   xfs_log_force_lsn+0x3b/0xa0
>  22) 3424 112   xfs_iunpin_wait+0xd7/0x160
>  23) 3312  80   xfs_reclaim_inode+0xd0/0x350
>  24) 3232 432   xfs_reclaim_inodes_ag+0x277/0x3d0
>  25) 2800  48   xfs_reclaim_inodes_nr+0x33/0x40
>  26) 2752  16   xfs_fs_free_cached_objects+0x15/0x20
>  27) 2736  80   super_cache_scan+0x169/0x170
>  28) 2656 160   shrink_slab_node+0x133/0x290
>  29) 2496  80   shrink_slab+0x122/0x160
>  30) 2416 112   do_try_to_free_pages+0x1de/0x360
>  31) 2304 192   try_to_free_pages+0x110/0x190
>  32) 2112 256   __alloc_pages_nodemask+0x5a2/0x8e0
>  33) 1856  80   alloc_pages_current+0xb2/0x170
>  34) 1776  64   new_slab+0x265/0x2e0
>  35) 1712 240   __slab_alloc+0x2fb/0x4c4
>  36) 1472  80   kmem_cache_alloc+0x10b/0x140
> 
> That's almost 4700 bytes of stack usage from the
> kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
> Yes, it's an extreme case, but if you're looking for a smoking
> gun :/
> 
> I can fix this one easily - we already have a workqueue for doing
> async log pushes (will split the stack between xlog_cil_force_lsn
> and xlog_cil_push), but the reason we haven't used it for synchronous
> log forces is that screws up fsync performance on CFQ. We don't
> recommend CFQ with XFS anyway, so I think I'll make this change
> anyway.

Dave, the patch below should chop off the stack usage from
xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
Can you given this a run?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

xfs: always do log forces via the workqueue

From: Dave Chinner 

Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen at close to the call
chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the
background push wq and waiting for the for the push to complete.
This foreground push was done to avoid confusing the CFQ Io
scheduler when fsync()s were issued, as it has trouble dealing with
dependent IOs being issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since 3.2 kernels were release because of
it's problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work
to the CIL workqueue. We already do the waiting for the push to
complete in xlog_cil_force_lsn(), so there's nothing else we need to
modify to make this work.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_log_cil.c | 52 +++-
 1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b57a8e0..7e54553 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -499,13 +499,6 @@ xlog_cil_push(
cil->xc_ctx = new_ctx;
 
/*
-* mirror the new sequence into the cil structure so that we can do
- 

Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-13 Thread Al Viro
On Thu, Feb 13, 2014 at 08:51:46PM +, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote:
> 
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
> 
> OK, having looked at that stuff...
> 
> 1) things become much more compact if we finish conversion to get_signal()
> first.  Callers of get_signal_to_deliver() have k_sigaction and siginfo in
> pair of local variables; switching to ksignal will be neutral wrt stack
> footprint (it just gathers those two in one struct) *and* we are getting
> rid of passing struct siginfo * around.  With that done, we can change
> struct ksignal ->info with zero impact on the code in arch/*, and conversion
> makes sense on its own.  In the mainline we have it done for alpha, arm,
> openrisc, sparc and x86.  I've just put together preliminary (and completely
> untested) patches for arm64, m68k and um; doing the rest won't take long, but
> they'll obviously need to be tested.  It's a fairly safe conversion;
> I'd expect the worst bugs to be typos.

OK, there's a couple of tricks that allow to reorder that.  First of all,
temporary config symbol (ARCH_USES_KSIGNAL) selecting that stuff; if it's
not selected, we just have #define small_siginfo siginfo and
#define assign_sigqueue(info, q) WARN_ON((q) != NULL)
and that's it - most of the changes in core kernel consist of s/siginfo/small_&/
and non-trivial ones are under ifdef CONFIG_ARCH_USES_KSIGNAL for the time
being.  Small ifdefs, at that...

Once all architectures get converted, we'll just kill that config symbol and
make ifdefs unconditional.

Another is that we don't need to bother with task_work_add() at all; on
converted architectures we have signal_setup_done(..., ksig, ...) called after
each successful get_signal() (the first argument of signal_setup_done() is
"has sigframe setup failed" flag; we get there regardless of success or
failure of setup_...frame()).  So we can just have that sucker do
assign_sigqueue(&ksig->info, NULL) and be done with that - all this delayed
freeing is explicit now (and we don't get locking overhead, etc. of
task_work_add()).

If what I've got survives the local beating, I'll put it into signal.git,
throw the arch conversions I could test in there and ask on linux-arch for
help with the missing ones.  With any luck we'll get the full set by the
next merge window, at which point we'll be able to kill ifdefs.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-13 Thread Al Viro
On Wed, Feb 12, 2014 at 09:44:11PM +, Al Viro wrote:

> I'll try to put something along those lines together, if you or Oleg don't
> do it first.

OK, having looked at that stuff...

1) things become much more compact if we finish conversion to get_signal()
first.  Callers of get_signal_to_deliver() have k_sigaction and siginfo in
pair of local variables; switching to ksignal will be neutral wrt stack
footprint (it just gathers those two in one struct) *and* we are getting
rid of passing struct siginfo * around.  With that done, we can change
struct ksignal ->info with zero impact on the code in arch/*, and conversion
makes sense on its own.  In the mainline we have it done for alpha, arm,
openrisc, sparc and x86.  I've just put together preliminary (and completely
untested) patches for arm64, m68k and um; doing the rest won't take long, but
they'll obviously need to be tested.  It's a fairly safe conversion;
I'd expect the worst bugs to be typos.

2) struct small_siginfo ought to go into linux/signal.h.  Contains signal
number, si_code (usually 0), uid, pid and (often NULL) pointer to
struct sigqueue.  Overall: 20 or 24 bytes.  In struct coredump_params we
should replace ->siginfo with pointer to struct small_siginfo.  Ditto for
task_struct ->last_siginfo.  In struct ksignal ->info becomes
struct small_siginfo.

copy_siginfo_to_user{,32}() gets switched to struct small_siginfo *, so does
do_coredump(), ptrace_signal(), ptrace_stop(), ptrace_getsiginfo(),
ptrace_setsiginfo(), collect_signal(), __dequeue_signal(), dequeue_signal(),
signalfd_dequeue() and do_sigtimedwait().

New primitive: assign_sigqueue(small_info, sigqueue).  Frees
small_info->q and assigns a new value to it.

When ptrace_setsiginfo() is given a non-plain siginfo (si_code != SI_USER,
that is), it should a allocate struct sigqueue, copy the sucker there and
give it to assign_sigqueue().  Plain ones just get uid/pid/signo copied
and NULL passed to assign_sigqueue().

code in ptrace_signal() under if (signr != info->si_signo) should
start with assign_sigqueue(info, NULL).

specific_send_sig_info() in ptrace_signal() is rather clumsy to handle; not
sure what's the best way to deal with that.  In any case, ptrace_signal()
deciding to return 0 should take care of info->q - either copy and free the
original, or actually try to reuse the sucker.  Either way, info->q should
become NULL.

get_signal_to_deliver() should do assign_sigqueue(info, NULL) before
the call of do_group_exit().

arch_ptrace_stop_needed() can be left as-is; it's a macro and none of the
instances look at the info argument at all.  The same goes for
arch_ptrace_stop().

signalfd_read() should do assign_sigqueue(&info, NULL) right after
signalfd_copyinfo().

rt_sigtimedwait(2) and its compat variant should do the same right after
copy_siginfo_to_user{,32}().

TP'ed stuff is a mess, as usual.  AFAICS, TP_STORE_SIGINFO() needs to
be split into a variant taking siginfo (on trace_signal_generate side)
and one taking small_siginfo (for trace_signal_deliver).

get_signal() should do assign_sigqueue(&ksig->info, NULL) if it returns
false and use task_work_add() to schedule info->q for freeing otherwise.


Overall, aside of the need to complete arch/* conversion, the only unfinished
part is this:
/* If the (new) signal is now blocked, requeue it.  */
if (sigismember(¤t->blocked, signr)) {
specific_send_sig_info(signr, info, current);
signr = 0;
}
in the end of ptrace_signal().  Sure, we can add a local struct siginfo
in that if, fill it if needed and pass its address or &info->q->info
to specific_send_sig_info(), but that feels kludgy...  Hell knows, I'll
look a bit more into that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-13 Thread Oleg Nesterov
On 02/13, Oleg Nesterov wrote:
>
> On 02/13, Linus Torvalds wrote:
> >
> > On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov  wrote:
> > >
> > > And we should be careful with SIGQUEUE_PREALLOC, at least
> > > collect_signal() should not do list_del_init()... Plus we need to
> > > handle the SEND_SIG_FORCED-like case.
> >
> > I don't think the users need to care. They'd just call
> > "sigqueue_free()" not knowing about our preallocations etc.
>
> Yes, but we need to be careful to avoid the races with
> release_posix_timer().

Plus we need to delay do_schedule_next_timer() as well. But this
is probably good because we can avoid unlock/lock(siglock) in
dequeue_signal().

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-13 Thread Oleg Nesterov
On 02/13, Linus Torvalds wrote:
>
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov  wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
>
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc.

Yes, but we need to be careful to avoid the races with
release_posix_timer().

> That kind
> of detail should be confined to inside signal.c.

Yes, sure.

> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation).

Yes. Perhaps this helper should be changed/renamed. And perhaps we
can even change __send_signal() to avoid __sigqueue_alloc() if
PF_KTHREAD. Or sig == SIGKILL.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-13 Thread Linus Torvalds
On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov  wrote:
>
> And we should be careful with SIGQUEUE_PREALLOC, at least
> collect_signal() should not do list_del_init()... Plus we need to
> handle the SEND_SIG_FORCED-like case.

I don't think the users need to care. They'd just call
"sigqueue_free()" not knowing about our preallocations etc. That kind
of detail should be confined to inside signal.c.

But there really aren't that many users. There's a couple of
"dequeue_signal_lock()" users, but they don't actually *want* the
siginfo at all (they're kernel threads), so we can just make that
function free the siginfo immediately (and get rid of the totally
unnecessay kernel stack allocation). And outside of signal.c only
signalfd uses "dequeue_signal()" itself, and that would be the only
one that would need to be taught to use (in signalfd_copyinfo()) and
then free the sigqueue entry.

So it really looks like the right thing to do, and fairly
straightforward. But I'm leaving the coding proof to Al, since he
already offered ;)

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-13 Thread Oleg Nesterov
On 02/12, Linus Torvalds wrote:
>
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro  wrote:
> >
> > Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> > fat sucker normally comes from.  That way we might get away with much
> > smaller structure on stack...
>
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.

Probably... I'll try to reply tomorrow.

> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing.

And we should be careful with SIGQUEUE_PREALLOC, at least
collect_signal() should not do list_del_init()... Plus we need to
handle the SEND_SIG_FORCED-like case.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Al Viro
On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro  wrote:
> >
> > Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> > fat sucker normally comes from.  That way we might get away with much
> > smaller structure on stack...
> 
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.
> 
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd. I didn't look at what the lifetimes were.

Only signalfd, AFAICS.  And there we'd want to use the same small structure -
it's used in
do {
ret = signalfd_dequeue(ctx, &info, nonblock);
if (unlikely(ret <= 0))
break;
ret = signalfd_copyinfo(siginfo, &info);
if (ret < 0)
break;
siginfo++;
total += ret;
nonblock = 1;
} while (--count);
and using a smaller struct would actually speed the things up - skips one
copying.  sigqueue would be freed as soon as we'd done signalfd_copyinfo()
(if not by signalfd_copyinfo() itself).

I'll try to put something along those lines together, if you or Oleg don't
do it first.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Linus Torvalds
On Wed, Feb 12, 2014 at 1:14 PM, Al Viro  wrote:
>
> Umm...  What if we delay __sigqueue_free()?  After all, that's where the
> fat sucker normally comes from.  That way we might get away with much
> smaller structure on stack...

Sounds like the RightThing(tm) to do to me, and I don't see why it
wouldn't work.

We'd have to teach each user of "dequeue_signal()" to free the siginfo
thing. Which shouldn't be too bad - I think we've collected all of
that into generic code, and there isn't the mass or architecture code
that knows about these things any more. But there are a few odd
drivers etc and signalfd. I didn't look at what the lifetimes were.

Adding Oleg to the cc, since any time we touch any of that code, he
should be involved. Oleg - the issue is the biggish size of 'struct
ksignal' on stack, brought on by the silly "put a whole siginfo_t in
it".

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Al Viro
On Wed, Feb 12, 2014 at 12:13:19PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 3:39 AM, Al Viro  wrote:
> > On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> >
> >> It looks like just "do_signal()" has a stack frame that is about 230
> >> bytes even under normal circumstancs (largely due to "struct ksignal"
> >> - which in turn is largely due to the insane 128-byte padding in
> >> siginfo_t). Add a few other frames in there, and I guess that if it
> >> was close before, the coredump path just makes it go off.
> >
> > We could, in principle, put it into task_struct and make get_signal()
> > return its address - do_signal() is called only in the code that does
> > assorted returns to userland...
> 
> We have better uses for random buffers in "struct task_struct", I'd
> hate to put a siginfo_t there.

*nod*

> The thing is, siginfo_t has that idiotic 128-byte area, but it's all
> "for future expansion". I think it's some damn glibc disease - we've
> seen these kinds of insane paddings before.
> 
> The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.
> 
> Sad.

Umm...  What if we delay __sigqueue_free()?  After all, that's where the
fat sucker normally comes from.  That way we might get away with much
smaller structure on stack...

Just introduce a small structure that would contain signr, uid, pid and
pointer to struct sigqueue.  And pass a pointer to _that_ all the way down
to collect_signal().  Pointer's NULL == it's SI_USER with signr/uid/pid
from the small struct and all other fields are zero.  Pointer isn't NULL -
use &small_struct->p->info.  And have struct sigqueue actually freed
via task_work_add() in that case.

Do you see any fundamental problems with that?  Looks like it would be
faster as well - less copying involved...


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Dave Chinner
On Wed, Feb 12, 2014 at 09:25:38AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
>  > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
>  > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>  > > 
>  > >  > None of the XFS code disables interrupts in that path, not does is
>  > >  > call outside XFS except to dispatch IO. The stack is pretty deep at
>  > >  > this point and I know that the standard (non stacked) IO stack can
>  > >  > consume >3kb of stack space when it gets down to having to do memory
>  > >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>  > >  > drivers. Stack overruns typically show up with symptoms like we are
>  > >  > seeing.
>  > >  > ..
>  > >  > 
>  > >  > Dave, before chasing ghosts, can you (like Eric originally asked)
>  > >  > turn on stack overrun detection?
>  > > 
>  > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
>  > 
>  > That only checks stack usage when an interrupt is taken. If no
>  > interrupts are taken when stack usage is within 128 bytes of
>  > overflow, then it doesn't catch it.
>  > 
>  > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
>  > stack usage of a process via canary overwrites and it records it in
>  > do_exit().
> 
> I had that on too. The only message from it came from quite a while
> before the trace that happened overnight..

Right, it won't capture an overrun at the point in time an overrun
occurs, either, because it only checks when the process exits. But
it does tell you what stack usage is being seen, as this:

> [ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
> [12900.804230] BUG: sleeping function called from invalid context at 
> mm/mempool.c:203

is a pretty a good indication that trinity is at risk of stack
overuns...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Linus Torvalds
On Wed, Feb 12, 2014 at 3:39 AM, Al Viro  wrote:
> On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
>
>> It looks like just "do_signal()" has a stack frame that is about 230
>> bytes even under normal circumstancs (largely due to "struct ksignal"
>> - which in turn is largely due to the insane 128-byte padding in
>> siginfo_t). Add a few other frames in there, and I guess that if it
>> was close before, the coredump path just makes it go off.
>
> We could, in principle, put it into task_struct and make get_signal()
> return its address - do_signal() is called only in the code that does
> assorted returns to userland...

We have better uses for random buffers in "struct task_struct", I'd
hate to put a siginfo_t there.

The thing is, siginfo_t has that idiotic 128-byte area, but it's all
"for future expansion". I think it's some damn glibc disease - we've
seen these kinds of insane paddings before.

The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.

Sad.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Eric Sandeen
On 2/12/14, 12:10 AM, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
>> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>>
>>  > None of the XFS code disables interrupts in that path, not does is
>>  > call outside XFS except to dispatch IO. The stack is pretty deep at
>>  > this point and I know that the standard (non stacked) IO stack can
>>  > consume >3kb of stack space when it gets down to having to do memory
>>  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>>  > drivers. Stack overruns typically show up with symptoms like we are
>>  > seeing.
>>  > ..
>>  > 
>>  > Dave, before chasing ghosts, can you (like Eric originally asked)
>>  > turn on stack overrun detection?
>>
>> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
> 
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
> 
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...
> 
> Cheers,
> 
> Dave.
> 


I'm not sure if I'm off base here, but maybe this would make sense: check
for a corrupted stack in __might_sleep.  Compile tested only,
possibly inelegant, and/or completely wrong, but:


From: Eric Sandeen 

sched: Test for corrupted task_struct in __might_sleep

If a thread overruns the stack, it may corrupt the task_struct,
leading to false positives on tests like irqs_disabled().

Warn if this seems to be the case.

Signed-off-by: Eric Sandeen 
---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b46131e..6920c3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6934,6 +6934,8 @@ static inline int preempt_count_equals(int preempt_offset)
 
 void __might_sleep(const char *file, int line, int preempt_offset)
 {
+   struct task_struct *tsk = current;
+   unsigned long *stackend;
static unsigned long prev_jiffy;/* ratelimiting */
 
rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
@@ -6952,6 +6954,11 @@ void __might_sleep(const char *file, int line, int 
preempt_offset)
in_atomic(), irqs_disabled(),
current->pid, current->comm);
 
+   /* A corrupted stack can cause a false positive on irqs_disabled etc */
+   stackend = end_of_stack(tsk);
+   if (tsk != &init_task && *stackend != STACK_END_MAGIC)
+   printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
+   
debug_show_held_locks(current);
if (irqs_disabled())
print_irqtrace_events(current);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Dave Jones
On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
 > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
 > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
 > > 
 > >  > None of the XFS code disables interrupts in that path, not does is
 > >  > call outside XFS except to dispatch IO. The stack is pretty deep at
 > >  > this point and I know that the standard (non stacked) IO stack can
 > >  > consume >3kb of stack space when it gets down to having to do memory
 > >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
 > >  > drivers. Stack overruns typically show up with symptoms like we are
 > >  > seeing.
 > >  > ..
 > >  > 
 > >  > Dave, before chasing ghosts, can you (like Eric originally asked)
 > >  > turn on stack overrun detection?
 > > 
 > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
 > 
 > That only checks stack usage when an interrupt is taken. If no
 > interrupts are taken when stack usage is within 128 bytes of
 > overflow, then it doesn't catch it.
 > 
 > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
 > stack usage of a process via canary overwrites and it records it in
 > do_exit().

I had that on too. The only message from it came from quite a while
before the trace that happened overnight..

[ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
[12900.804230] BUG: sleeping function called from invalid context at 
mm/mempool.c:203

 > I also use the stack tracer to record the largest stack
 > usage seen so I know exactly what code paths are approaching stack
 > overruns...

I can give that a try later.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Peter Zijlstra
On Wed, Feb 12, 2014 at 07:40:36AM -0500, Steven Rostedt wrote:
> The pt_regs structure.
> 
> That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
> culprit here.
> 
> Peter and Frederic, is there a way not to store that on the stack?

Something like so?

---
 include/trace/ftrace.h  | 7 ---
 kernel/trace/trace_event_perf.c | 5 -
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 1a8b28db3775..87ae3ef1d278 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -678,7 +678,7 @@ perf_trace_##call(void *__data, proto)  
\
struct ftrace_event_call *event_call = __data;  \
struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
struct ftrace_raw_##call *entry;\
-   struct pt_regs __regs;  \
+   struct pt_regs *__regs; \
u64 __addr = 0, __count = 1;\
struct task_struct *__task = NULL;  \
struct hlist_head *head;\
@@ -697,18 +697,19 @@ perf_trace_##call(void *__data, proto)
\
 sizeof(u64));  \
__entry_size -= sizeof(u32);\
\
-   perf_fetch_caller_regs(&__regs);\
entry = perf_trace_buf_prepare(__entry_size,\
event_call->event.type, &__regs, &rctx);\
if (!entry) \
return; \
\
+   perf_fetch_caller_regs(__regs); \
+   \
tstruct \
\
{ assign; } \
\
perf_trace_buf_submit(entry, __entry_size, rctx, __addr,\
-   __count, &__regs, head, __task);\
+   __count, __regs, head, __task); \
 }
 
 /*
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index e854f420e033..1885f4aac109 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -232,8 +232,10 @@ void perf_trace_del(struct perf_event *p_event, int flags)
tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
 }
 
+static DEFINE_PER_CPU(struct pt_regs, tp_regs[4]);
+
 __kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
-  struct pt_regs *regs, int *rctxp)
+  struct pt_regs **regs, int *rctxp)
 {
struct trace_entry *entry;
unsigned long flags;
@@ -252,6 +254,7 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned 
short type,
if (*rctxp < 0)
return NULL;
 
+   *regs = this_cpu_ptr(&tp_regs[*rctxp]);
raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
 
/* zero the dead bytes from align to not leak stack to user */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Steven Rostedt
On Wed, 12 Feb 2014 19:35:13 +1100
Dave Chinner  wrote:


> And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch
> on x86-64 (at least, according to the documentation).
> CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more
> stack usage than the tracer. As such, I also think that
> CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of
> stack usage because it is canary based and so captures the very
> worst case usage of the process's stack...

Yeah, with the new fentry (adding the mcount call before setting up the
stack frame), the function tracer can not catch leaf functions, as it
is called before the leaf function's frame is set up.

Hmm, I wonder if I should add a config to disable fentry and go back to
the old mcount that gets called after setting up the stack frame. This
will lead to better stack tracing, but you lose out on all the benefits
of fentry.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Steven Rostedt
On Wed, 12 Feb 2014 03:13:33 -0500
Tejun Heo  wrote:

> On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> > There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> > all seem to be of the type perf_trace_block_buffer() - things created
> > with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> > frame, I have no idea. That sounds like a potential disaster too,
> > although hopefully it's mostly leaf functions - but leaf functions
> > *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> > huge frames?
> 
> It looks like they're essentially the same for all the automatically
> generated trace functions.  I'm seeing 232 byte stack frame in most of
> them.  If I'm not completely confused by these macros, these are
> generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
> contains struct pt_regs in the stack frame which is already 168 bytes,
> so that seems like the culprit.  No idea whether this is something
> avoidable.  At least they shouldn't nest in any way.  Steven?

They shouldn't nest. But if the perf tracepoint is active, I don't know
how much more of the stack is used in the functions that tracepoint
calls.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Steven Rostedt
[ Added the perf tracepoint maintainers ]

On Tue, 11 Feb 2014 22:59:58 -0800
Linus Torvalds  wrote:

> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner  wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [  349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
> 
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
> 
> Do you have the stack traces for these things so that we can look at
> worst offenders?
> 
> If the new block-mq code is to blame, it needs to be fixed.
> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.
> 
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

The perf_trace_##event is defined in include/trace/ftrace.h.

There we have this:

perf_trace_##call(void *__data, proto)  \
{   \
struct ftrace_event_call *event_call = __data;  \
struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
struct ftrace_raw_##call *entry;\
struct pt_regs __regs;  \
u64 __addr = 0, __count = 1;\
struct task_struct *__task = NULL;  \
struct hlist_head *head;\
int __entry_size;   \
int __data_size;\
int rctx;   \
\

Mostly pointers except for two structures. The __data_offests, is
dynamically defined, and only consists of values from the tracepoint
entry_structure that defines dynamic arrays. But the other structure on
the stack looks a bit harrier. The pt_regs structure.

That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
culprit here.

Peter and Frederic, is there a way not to store that on the stack?

-- Steve

> 
> And if the stack use comes from the VFS layer, we can probably work on
> that too. But I don't think that has really changed much lately..
> 
> Linus

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Al Viro
On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

We could, in principle, put it into task_struct and make get_signal()
return its address - do_signal() is called only in the code that does
assorted returns to userland...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Dave Chinner
On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner  wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [  349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
> 
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
> 
> Do you have the stack traces for these things so that we can look at
> worst offenders?

Sure. It's almost through the XFS block allocation path where we do
metadata reads or log writes. That's always been the worst case
stack usage path for XFS. The worst path is about 4800 bytes - it
was a double btree split (inode extent map tree split, followed by a
freespace btree split when allocating a block for the extent map
btree split), triggering a reallocation of a btree block that was
pinned and STALE, which triggered a log force, which then made it's
way into the block layer.

I don't have a stack trace of that because I can't create the
initial conditions required to trigger it on demand. I've only ever
seen it in the wild once. More common, though, is the situation we
see here - somewhere around 3-4k of stack usage from a single extent
btree update operation.

We've already got a stack switch in the data allocation path - we
had to add it to stop the bdi-flusher from busting the stack with
alarming regularity in production systems. I initially made the
metadata allocation paths use it as well, but that part got reverted
(aa29284 xfs: don't defer metadata allocation to the workqueue)
because Mel Gorman demonstrated several workloads that regressed
significantly as a result of making that change.

So now we are stuck betweeen a rock and a hard place - those
metadata block allocation paths are triggering the stack overflows,
but switching stacks in the allocation path to avoid stack overflows
causes massive performance regressions

The thing is, the XFS stack usage has not changed significantly over
the past 5 years - we keep a pretty close eye on it. The issue is
that everything around XFS has slowly been growing and I can't
point you at one stack trace that demonstrates the worst case. What
I can point you to function calls that consume a lot of stack space
and hence limit what is available to callers.

So here's a few stack fragements I've generated in teh past few
minutes on a couple of XFS filesystems inside a 16p/16GB VM by
running:

$ dd if=/dev/zero of=/mnt/test/foo bs=1024k &

which writes at about 600MB/s to a 18TB RAID array. That runs in
parallel with a multithreaded inode creation/walk/remove workload on
a different filesystem (creates 50 million inodes ~250k inode
creates/s, walks them removes them at around 170k removes/s) which
is hosted on a dm RAID0 stripe of 2 samsung 840 EVO SSDs.

mutex_lock() requires at least 1.2k of stack because of the
scheduler overhead.:

$ sudo cat /sys/kernel/debug/tracing/stack_trace
DepthSize   Location(39 entries)
-   
  0) 4368  64   update_group_power+0x2c/0x270
  1) 4304 384   find_busiest_group+0x10a/0x8b0
  2) 3920 288   load_balance+0x165/0x870
  3) 3632  96   idle_balance+0x106/0x1b0
  4) 3536 112   __schedule+0x7b6/0x7e0
  5) 3424  16   schedule+0x29/0x70
  6) 3408  16   schedule_preempt_disabled+0xe/0x10
  7) 3392 112   __mutex_lock_slowpath+0x11a/0x400
  8) 3280  32   mutex_lock+0x1e/0x32

i.e. any function that is going to schedule needs at least 1k of
stack, and some of the lead in  infrastructure (like
wait_for_completion) can require a total of up to 1.5k...

Basic memory allocation can easily >1k of stack without
entering reclaim:

-   
  0) 4920  40   zone_statistics+0xbd/0xc0
  1) 4880 256   get_page_from_freelist+0x3a8/0x8a0
  2) 4624 256   __alloc_pages_nodemask+0x143/0x8e0
  3) 4368  80   alloc_pages_current+0xb2/0x170
  4) 4288  64   new_slab+0x265/0x2e0
  5) 4224 240   __slab_alloc+0x2fb/0x4c4
  6) 3984  80   kmem_cache_alloc+0x10b/0x140
  7) 3904  48   kmem_zone_alloc+0x77/0x100

If it enters reclaim and we are allowed to IO? Well, you saw the
stack I posted in the other thread - 4700 bytes from
kmem_cache_alloc() to the top of the stack. Another bad case is
the swap pa

Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-12 Thread Tejun Heo
On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

It looks like they're essentially the same for all the automatically
generated trace functions.  I'm seeing 232 byte stack frame in most of
them.  If I'm not completely confused by these macros, these are
generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
contains struct pt_regs in the stack frame which is already 168 bytes,
so that seems like the culprit.  No idea whether this is something
avoidable.  At least they shouldn't nest in any way.  Steven?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Chinner
On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner  wrote:
> >
> > None of the XFS code disables interrupts in that path, not does is
> > call outside XFS except to dispatch IO. The stack is pretty deep at
> > this point and I know that the standard (non stacked) IO stack can
> > consume >3kb of stack space when it gets down to having to do memory
> > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > drivers. Stack overruns typically show up with symptoms like we are
> > seeing.
> 
> That would also explain why it shows up for do_coredump(), even though
> clearly interrupts are not disabled at that point. It's just because
> do_coredump() opens a filename at a deeper point in the stack than the
> more normal system call paths do.

Right, it's exactly the same problem we have when knfsd is making
the VFS calls.

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

Yup. But it's when you see this sort of thing you realise that
almost no GFP_KERNEL memory allocation is really safe from stack
overruns, though:

  0) 6064  64   update_group_power+0x2c/0x270
  1) 6000 384   find_busiest_group+0x10a/0x8b0
  2) 5616 288   load_balance+0x165/0x870
  3) 5328  96   idle_balance+0x106/0x1b0
  4) 5232 112   __schedule+0x7b6/0x7e0
  5) 5120  16   schedule+0x29/0x70
  6) 5104 176   percpu_ida_alloc+0x1b3/0x3d0
  7) 4928  32   blk_mq_wait_for_tags+0x1f/0x40
  8) 4896  80   blk_mq_alloc_request_pinned+0x4e/0x110
  9) 4816 128   blk_mq_make_request+0x42b/0x510
 10) 4688  48   generic_make_request+0xc0/0x110
 11) 4640  96   submit_bio+0x71/0x120
 12) 4544 192   _xfs_buf_ioapply+0x2cc/0x3b0
 13) 4352  48   xfs_buf_iorequest+0x6f/0xc0
 14) 4304  32   xlog_bdstrat+0x23/0x60
 15) 4272  96   xlog_sync+0x3a4/0x5c0
 16) 4176  48   xlog_state_release_iclog+0x121/0x130
 17) 4128 192   xlog_write+0x700/0x7c0
 18) 3936 192   xlog_cil_push+0x2ae/0x3d0
 19) 3744 128   xlog_cil_force_lsn+0x1b8/0x1e0
 20) 3616 144   _xfs_log_force_lsn+0x7c/0x300
 21) 3472  48   xfs_log_force_lsn+0x3b/0xa0
 22) 3424 112   xfs_iunpin_wait+0xd7/0x160
 23) 3312  80   xfs_reclaim_inode+0xd0/0x350
 24) 3232 432   xfs_reclaim_inodes_ag+0x277/0x3d0
 25) 2800  48   xfs_reclaim_inodes_nr+0x33/0x40
 26) 2752  16   xfs_fs_free_cached_objects+0x15/0x20
 27) 2736  80   super_cache_scan+0x169/0x170
 28) 2656 160   shrink_slab_node+0x133/0x290
 29) 2496  80   shrink_slab+0x122/0x160
 30) 2416 112   do_try_to_free_pages+0x1de/0x360
 31) 2304 192   try_to_free_pages+0x110/0x190
 32) 2112 256   __alloc_pages_nodemask+0x5a2/0x8e0
 33) 1856  80   alloc_pages_current+0xb2/0x170
 34) 1776  64   new_slab+0x265/0x2e0
 35) 1712 240   __slab_alloc+0x2fb/0x4c4
 36) 1472  80   kmem_cache_alloc+0x10b/0x140

That's almost 4700 bytes of stack usage from the
kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
Yes, it's an extreme case, but if you're looking for a smoking
gun :/

I can fix this one easily - we already have a workqueue for doing
async log pushes (will split the stack between xlog_cil_force_lsn
and xlog_cil_push), but the reason we haven't used it for synchronous
log forces is that screws up fsync performance on CFQ. We don't
recommend CFQ with XFS anyway, so I think I'll make this change
anyway.

> And some of the debug options that I'm sure DaveJ has enabled tend to
> make the stack usage worse (simply because they make a lot of data
> structures bigger).

True - CONFIG_XFS_DEBUG tends to add about 5% to the stack usage of
XFS, but realistically 5% is not significant especially as we've
been hitting stack overflows with XFS on production systems
regularly enough on x86-64 over the past 2-3 years that we know what
"typical symptoms" of such overflows are...

The problem we have now is that everything outside XFS continues to
grow in stack usage, so the only choice that remains for us to avoid
overruns is to add performance impacting stack switches into the
middle of common XFS paths. We also have to do it unconditionally
because we don't know ahead of time if any specific operation is
going to require all the stack we can give it because - for example
- we can't predict when the IO path is going to require memory
allocation...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
M

Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Linus Torvalds
On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner  wrote:
>
> FYI, just creating lots of files with open(O_CREAT):
>
> [  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> [  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> [  349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left
> [  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> [  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> [  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> [  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
>
> We've got absolutely no spare stack space anymore in the IO path.
> And the IO path can't get much simpler than filesystem -> virtio
> block device.

Ugh, that's bad. A thousand bytes of stack space is much too close to
any limits.

Do you have the stack traces for these things so that we can look at
worst offenders?

If the new block-mq code is to blame, it needs to be fixed.
__virtblk_add_req() has a 300-byte stack frame, it seems. Looking
elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
hopefully not in any normal path - online discard is moronic, and I'm
assuming XFS doesn't do that.

There's a lot of 200+ byte stack frames in block/blk-core.s, and they
all seem to be of the type perf_trace_block_buffer() - things created
with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
frame, I have no idea. That sounds like a potential disaster too,
although hopefully it's mostly leaf functions - but leaf functions
*deep* in the callchain. Tejun? Steven, why _do_ they end up with such
huge frames?

And if the stack use comes from the VFS layer, we can probably work on
that too. But I don't think that has really changed much lately..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Chinner
On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> > 
> >  > None of the XFS code disables interrupts in that path, not does is
> >  > call outside XFS except to dispatch IO. The stack is pretty deep at
> >  > this point and I know that the standard (non stacked) IO stack can
> >  > consume >3kb of stack space when it gets down to having to do memory
> >  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> >  > drivers. Stack overruns typically show up with symptoms like we are
> >  > seeing.
> >  > ..
> >  > 
> >  > Dave, before chasing ghosts, can you (like Eric originally asked)
> >  > turn on stack overrun detection?
> > 
> > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
> 
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
> 
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...

FYI, just creating lots of files with open(O_CREAT):

[  348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
[  348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
[  349.17] fs_mark (4826) used greatest stack depth: 2280 bytes left
[  418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
[  460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
[  544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
[  689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left

We've got absolutely no spare stack space anymore in the IO path.
And the IO path can't get much simpler than filesystem -> virtio
block device.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Linus Torvalds
On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner  wrote:
>
> None of the XFS code disables interrupts in that path, not does is
> call outside XFS except to dispatch IO. The stack is pretty deep at
> this point and I know that the standard (non stacked) IO stack can
> consume >3kb of stack space when it gets down to having to do memory
> reclaim during GFP_NOIO allocation at the lowest level of SCSI
> drivers. Stack overruns typically show up with symptoms like we are
> seeing.

That would also explain why it shows up for do_coredump(), even though
clearly interrupts are not disabled at that point. It's just because
do_coredump() opens a filename at a deeper point in the stack than the
more normal system call paths do.

It looks like just "do_signal()" has a stack frame that is about 230
bytes even under normal circumstancs (largely due to "struct ksignal"
- which in turn is largely due to the insane 128-byte padding in
siginfo_t). Add a few other frames in there, and I guess that if it
was close before, the coredump path just makes it go off.

And some of the debug options that I'm sure DaveJ has enabled tend to
make the stack usage worse (simply because they make a lot of data
structures bigger).

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Chinner
On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> 
>  > None of the XFS code disables interrupts in that path, not does is
>  > call outside XFS except to dispatch IO. The stack is pretty deep at
>  > this point and I know that the standard (non stacked) IO stack can
>  > consume >3kb of stack space when it gets down to having to do memory
>  > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>  > drivers. Stack overruns typically show up with symptoms like we are
>  > seeing.
>  > ..
>  > 
>  > Dave, before chasing ghosts, can you (like Eric originally asked)
>  > turn on stack overrun detection?
> 
> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

That only checks stack usage when an interrupt is taken. If no
interrupts are taken when stack usage is within 128 bytes of
overflow, then it doesn't catch it.

I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
stack usage of a process via canary overwrites and it records it in
do_exit(). I also use the stack tracer to record the largest stack
usage seen so I know exactly what code paths are approaching stack
overruns...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Jones
On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:

 > None of the XFS code disables interrupts in that path, not does is
 > call outside XFS except to dispatch IO. The stack is pretty deep at
 > this point and I know that the standard (non stacked) IO stack can
 > consume >3kb of stack space when it gets down to having to do memory
 > reclaim during GFP_NOIO allocation at the lowest level of SCSI
 > drivers. Stack overruns typically show up with symptoms like we are
 > seeing.
 > ..
 > 
 > Dave, before chasing ghosts, can you (like Eric originally asked)
 > turn on stack overrun detection?

CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Chinner
On Wed, Feb 12, 2014 at 04:22:15AM +, Al Viro wrote:
> On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> > [ 3111.414202]  [] bio_alloc_bioset+0x156/0x210
> > [ 3111.414855]  [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> > [ 3111.415517]  [] ? xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.416175]  [] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> > [ 3111.416843]  [] xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.417509]  [] xlog_sync+0x3a7/0x5b0 [xfs]
> > [ 3111.418175]  [] xlog_state_release_iclog+0x10f/0x120 
> > [xfs]
> > [ 3111.418846]  [] xlog_write+0x6f0/0x800 [xfs]
> > [ 3111.419518]  [] xlog_cil_push+0x2f1/0x410 [xfs]
> 
> Very interesting.  The first thing xlog_cil_push() is doing is blocking
> kmalloc().  So at that point it still hadn't been atomic.  I'd probably
> slap may_sleep() in the beginning of xlog_sync() and see if that triggers...

None of the XFS code disables interrupts in that path, not does is
call outside XFS except to dispatch IO. The stack is pretty deep at
this point and I know that the standard (non stacked) IO stack can
consume >3kb of stack space when it gets down to having to do memory
reclaim during GFP_NOIO allocation at the lowest level of SCSI
drivers. Stack overruns typically show up with symptoms like we are
seeing.

Simple example with memory allocation follows. keep in mind that
memory reclaim uses a whole lot more stack if it is needed, and that
scheduling at this point requires about 1k of stack to be free for
the scheduler footprint, too.

FWIW, the blk-mq stuff seems to hae added 200-300 bytes of new stack
usage to the IO path

$ sudo cat /sys/kernel/debug/tracing/stack_trace
DepthSize   Location(45 entries)
-   
  0) 5944  40   zone_statistics+0xbd/0xc0
  1) 5904 256   get_page_from_freelist+0x3a8/0x8a0
  2) 5648 256   __alloc_pages_nodemask+0x143/0x8e0
  3) 5392  80   alloc_pages_current+0xb2/0x170
  4) 5312  64   new_slab+0x265/0x2e0
  5) 5248 240   __slab_alloc+0x2fb/0x4c4
  6) 5008  80   __kmalloc+0x133/0x180
  7) 4928 112   virtqueue_add_sgs+0x2fe/0x520
  8) 4816 288   __virtblk_add_req+0xd5/0x180
  9) 4528  96   virtio_queue_rq+0xdd/0x1d0
 10) 4432 112   __blk_mq_run_hw_queue+0x1c3/0x3c0
 11) 4320  16   blk_mq_run_hw_queue+0x35/0x40
 12) 4304  80   blk_mq_insert_requests+0xc5/0x120
 13) 4224  96   blk_mq_flush_plug_list+0x129/0x140
 14) 4128 112   blk_flush_plug_list+0xe7/0x240
 15) 4016  32   blk_finish_plug+0x18/0x50
 16) 3984 192   _xfs_buf_ioapply+0x30f/0x3b0
 17) 3792  48   xfs_buf_iorequest+0x6f/0xc0

 37)  928  16   xfs_vn_create+0x13/0x20
 38)  912  64   vfs_create+0xb5/0xf0
 39)  848 208   do_last.isra.53+0x6e0/0xd00
 40)  640 176   path_openat+0xbe/0x620
 41)  464 208   do_filp_open+0x43/0xa0
 42)  256 112   do_sys_open+0x13c/0x230
 43)  144  16   SyS_open+0x22/0x30
 44)  128 128   system_call_fastpath+0x16/0x1b


Dave, before chasing ghosts, can you (like Eric originally asked)
turn on stack overrun detection?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Al Viro
On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> [ 3111.414202]  [] bio_alloc_bioset+0x156/0x210
> [ 3111.414855]  [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> [ 3111.415517]  [] ? xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.416175]  [] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> [ 3111.416843]  [] xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.417509]  [] xlog_sync+0x3a7/0x5b0 [xfs]
> [ 3111.418175]  [] xlog_state_release_iclog+0x10f/0x120 
> [xfs]
> [ 3111.418846]  [] xlog_write+0x6f0/0x800 [xfs]
> [ 3111.419518]  [] xlog_cil_push+0x2f1/0x410 [xfs]

Very interesting.  The first thing xlog_cil_push() is doing is blocking
kmalloc().  So at that point it still hadn't been atomic.  I'd probably
slap may_sleep() in the beginning of xlog_sync() and see if that triggers...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Jones
On Tue, Feb 11, 2014 at 06:52:19PM -0800, Linus Torvalds wrote:
 > On Tue, Feb 11, 2014 at 5:09 PM, Al Viro  wrote:
 > >
 > > Slap the check in vfs_create(), see if interrupts had been disabled by it 
 > > or
 > > by something in ->create().  Since it's reproducible...
 > 
 > path_openat() starts off with a get_empty_filp(), which allocates a
 > file pointer with GFP_KERNEL. So that should have triggered the
 > might_sleep warning if irq's were already disabled at that point.
 > 
 > So it's not before that - in particular, it's not in the signal
 > handling or do_coredump() paths.
 > 
 > Also, at least xfs_buf_lock() - which is much deeper in that chain -
 > does a down(&bp->b_sema). I'm disguested that that doesn't have a
 > might_sleep() in it.
 > 
 > Dave, mind adding a "might_sleep()" to the top of
 > "down[_interruptible]()". It's silly to not have coverage of semaphore
 > use in bad contexts.

Added those, didn't trigger. Neither did Al's suggestion.

Slightly different trace, but still from the coredump path.

Dave

[ 3111.403822] BUG: sleeping function called from invalid context at 
mm/mempool.c:203
[ 3111.404414] in_atomic(): 0, irqs_disabled(): 1, pid: 19213, name: trinity-c46
[ 3111.404884] 5 locks held by trinity-c46/19213:
[ 3111.405354]  #0:  (sb_writers#9){..}, at: [] 
do_coredump+0xe05/0xf60
[ 3111.405862]  #1:  (shrinker_rwsem){..}, at: [] 
shrink_slab+0x3f/0x180
[ 3111.406374]  #2:  (&type->s_umount_key#30){..}, at: [] 
grab_super_passive+0x44/0x90
[ 3111.406905]  #3:  (&pag->pag_ici_reclaim_lock){..}, at: 
[] xfs_reclaim_inodes_ag+0x31a/0x430 [xfs]
[ 3111.407466]  #4:  (&(&ip->i_lock)->mr_lock){..}, at: 
[] xfs_ilock+0x16f/0x1b0 [xfs]
[ 3111.408039] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113 
[ 3111.409779]  8da3fde6 4f998871 88023715cce0 
8d72b091
[ 3111.410396]   88023715cd08 8d09ddb5 
0010
[ 3111.411021]  880243566288 0008 88023715cd88 
8d1534f3
[ 3111.411656] Call Trace:
[ 3111.412283]  [] dump_stack+0x4e/0x7a
[ 3111.412922]  [] __might_sleep+0x105/0x150
[ 3111.413562]  [] mempool_alloc+0xa3/0x170
[ 3111.414202]  [] bio_alloc_bioset+0x156/0x210
[ 3111.414855]  [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
[ 3111.415517]  [] ? xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.416175]  [] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[ 3111.416843]  [] xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.417509]  [] xlog_sync+0x3a7/0x5b0 [xfs]
[ 3111.418175]  [] xlog_state_release_iclog+0x10f/0x120 [xfs]
[ 3111.418846]  [] xlog_write+0x6f0/0x800 [xfs]
[ 3111.419518]  [] xlog_cil_push+0x2f1/0x410 [xfs]
[ 3111.420195]  [] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[ 3111.420865]  [] ? __bit_spin_unlock.constprop.66+0x19/0x40
[ 3111.421551]  [] _xfs_log_force_lsn+0x93/0x340 [xfs]
[ 3111.422239]  [] xfs_log_force_lsn+0x2e/0xb0 [xfs]
[ 3111.422932]  [] ? xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.423625]  [] __xfs_iunpin_wait+0xd0/0x1a0 [xfs]
[ 3111.424310]  [] ? autoremove_wake_function+0x40/0x40
[ 3111.425008]  [] xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.425705]  [] xfs_reclaim_inode+0x8c/0x380 [xfs]
[ 3111.426405]  [] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs]
[ 3111.427104]  [] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs]
[ 3111.427797]  [] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 3111.428481]  [] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 3111.429150]  [] super_cache_scan+0x16c/0x180
[ 3111.429824]  [] shrink_slab_node+0x14b/0x2e0
[ 3111.430489]  [] ? shrink_slab+0x3f/0x180
[ 3111.431146]  [] shrink_slab+0x8e/0x180
[ 3111.431796]  [] try_to_free_pages+0x516/0x970
[ 3111.432436]  [] ? __set_page_dirty+0x27/0xc0
[ 3111.433065]  [] __alloc_pages_nodemask+0x7a9/0xb00
[ 3111.433689]  [] alloc_pages_current+0x106/0x1f0
[ 3111.434304]  [] ? pte_alloc_one+0x17/0x80
[ 3111.434911]  [] pte_alloc_one+0x17/0x80
[ 3111.435510]  [] __pte_alloc+0x27/0x130
[ 3111.436098]  [] handle_mm_fault+0xafc/0xbb0
[ 3111.436681]  [] __get_user_pages+0x1ce/0x620
[ 3111.437254]  [] get_dump_page+0x54/0x80
[ 3111.437810]  [] elf_core_dump+0x12d9/0x1420
[ 3111.438356]  [] ? elf_core_dump+0x860/0x1420
[ 3111.438884]  [] do_coredump+0xc02/0xf60
[ 3111.439398]  [] get_signal_to_deliver+0x2b8/0x6b0
[ 3111.439898]  [] do_signal+0x57/0x9d0
[ 3111.440386]  [] ? __acct_update_integrals+0x8e/0x120
[ 3111.440873]  [] ? preempt_count_sub+0x6b/0xf0
[ 3111.441357]  [] ? _raw_spin_unlock+0x31/0x50
[ 3111.441833]  [] ? vtime_account_user+0x91/0xa0
[ 3111.442307]  [] ? context_tracking_user_exit+0x9b/0x100
[ 3111.442783]  [] do_notify_resume+0x5c/0xa0
[ 3111.443260]  [] retint_signal+0x46/0x90
[ 3111.443773] [ cut here ]
[ 3111.444248] WARNING: CPU: 0 PID: 19213 at block/blk.h:227 
generic_make_request_checks+0x33f/0x460()
[ 3111.444742] Modules linked in: fuse can_bcm can_raw scsi_transport_iscsi 
ipt_ULOG nfnetlink nfc caif_socket caif af_802154 phonet af_rxrpc can pppoe 
pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm net

Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Linus Torvalds
On Tue, Feb 11, 2014 at 5:09 PM, Al Viro  wrote:
>
> Slap the check in vfs_create(), see if interrupts had been disabled by it or
> by something in ->create().  Since it's reproducible...

path_openat() starts off with a get_empty_filp(), which allocates a
file pointer with GFP_KERNEL. So that should have triggered the
might_sleep warning if irq's were already disabled at that point.

So it's not before that - in particular, it's not in the signal
handling or do_coredump() paths.

Also, at least xfs_buf_lock() - which is much deeper in that chain -
does a down(&bp->b_sema). I'm disguested that that doesn't have a
might_sleep() in it.

Dave, mind adding a "might_sleep()" to the top of
"down[_interruptible]()". It's silly to not have coverage of semaphore
use in bad contexts.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Al Viro
On Tue, Feb 11, 2014 at 07:44:03PM -0500, Dave Jones wrote:

> The 'good' news is I seem to be able to reproduce it fairly quickly.
> I've seen it 3-4 times this afternoon. (Always from the coredump path)
> 
> I was on vacation last week, but I wasn't hitting this before then, so it's a
> fairly recent change that's introduced this.

Slap the check in vfs_create(), see if interrupts had been disabled by it or
by something in ->create().  Since it's reproducible...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Jones
On Tue, Feb 11, 2014 at 03:49:14PM -0600, Eric Sandeen wrote:
 > On 2/11/14, 3:08 PM, Dave Chinner wrote:
 > > On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
 > >> BUG: sleeping function called from invalid context at mm/mempool.c:203
 > >> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
 > >> 5 locks held by trinity-c3/27511:
 > >>  #0:  (sb_writers#9){..}, at: [] 
 > >> mnt_want_write+0x24/0x50
 > >>  #1:  (&type->i_mutex_dir_key#3){..}, at: [] 
 > >> do_last.isra.51+0x294/0x11f0
 > >>  #2:  (sb_internal#2){..}, at: [] 
 > >> xfs_trans_alloc+0x24/0x40 [xfs]
 > >>  #3:  (&(&ip->i_lock)->mr_lock/1){..}, at: [] 
 > >> xfs_ilock+0x16f/0x1b0 [xfs]
 > >>  #4:  (&(&ip->i_lock)->mr_lock){..}, at: [] 
 > >> xfs_ilock_nowait+0x184/0x200 [xfs]
 > >> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
 > >>  83a3fd56 45a3849a 88009f368f60 8372afea
 > >>   88009f368f88 8309ddb5 0010
 > >>  880243566288 0008 88009f369008 831534d3
 > >> Call Trace:
 > >>  [] dump_stack+0x4e/0x7a
 > >>  [] __might_sleep+0x105/0x150
 > >>  [] mempool_alloc+0xa3/0x170
 > >>  [] ? deactivate_slab+0x51a/0x590
 > >>  [] bio_alloc_bioset+0x156/0x210
 > >>  [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
 > >>  [] ? xlog_bdstrat+0x22/0x60 [xfs]
 > >>  [] xfs_buf_iorequest+0x6b/0xf0 [xfs]
 > >>  [] xlog_bdstrat+0x22/0x60 [xfs]
 > >>  [] xlog_sync+0x3a7/0x5b0 [xfs]
 > >>  [] xlog_state_release_iclog+0x10f/0x120 [xfs]
 > >>  [] xlog_write+0x6f0/0x800 [xfs]
 > >>  [] xlog_cil_push+0x2f1/0x410 [xfs]
 > >>  [] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
 > >>  [] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
 > >>  [] _xfs_log_force+0x70/0x290 [xfs]
 > >>  [] ? get_parent_ip+0xd/0x50
 > >>  [] xfs_log_force+0x26/0xb0 [xfs]
 > >>  [] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
 > >>  [] xfs_buf_lock+0x133/0x140 [xfs]
 > >>  [] _xfs_buf_find+0x1f6/0x3c0 [xfs]
 > >>  [] xfs_buf_get_map+0x2a/0x1b0 [xfs]
 > >>  [] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
 > >>  [] xfs_da_get_buf+0xbd/0x100 [xfs]
 > >>  [] xfs_dir3_data_init+0x59/0x1d0 [xfs]
 > >>  [] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
 > >>  [] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
 > >>  [] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
 > >>  [] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
 > >>  [] ? xfs_da_compname+0x1f/0x30 [xfs]
 > >>  [] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
 > >>  [] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
 > >>  [] ? xfs_setup_inode+0x1cd/0x320 [xfs]
 > >>  [] xfs_dir_createname+0x184/0x1e0 [xfs]
 > >>  [] xfs_create+0x469/0x580 [xfs]
 > >>  [] xfs_vn_mknod+0xc4/0x1e0 [xfs]
 > >>  [] xfs_vn_create+0x13/0x20 [xfs]
 > >>  [] vfs_create+0x95/0xc0
 > >>  [] do_last.isra.51+0x9f8/0x11f0
 > >>  [] ? link_path_walk+0x81/0x870
 > >>  [] path_openat+0xc9/0x620
 > >>  [] ? put_dec+0x72/0x90
 > >>  [] do_filp_open+0x4d/0xb0
 > >>  [] file_open_name+0xfe/0x160
 > >>  [] filp_open+0x44/0x60
 > >>  [] do_coredump+0x602/0xf60
 > >>  [] get_signal_to_deliver+0x2b8/0x6b0
 > >>  [] do_signal+0x57/0x9d0
 > >>  [] ? __acct_update_integrals+0x8e/0x120
 > >>  [] ? __schedule+0x60/0x850
 > >>  [] ? preempt_count_sub+0x6b/0xf0
 > >>  [] ? _raw_spin_unlock+0x31/0x50
 > >>  [] ? vtime_account_user+0x91/0xa0
 > >>  [] ? context_tracking_user_exit+0x9b/0x100
 > >>  [] do_notify_resume+0x5c/0xa0
 > >>  [] retint_signal+0x46/0x90
 > > 
 > > There's nowhere in this XFS stack that disables interrupts, so I
 > > thinks it's either above or below XFS that is causing this problem.
 > 
 > any chance the stack got blown & corrupted things such that it
 > only _looks_ like irqs are disabled?

The 'good' news is I seem to be able to reproduce it fairly quickly.
I've seen it 3-4 times this afternoon. (Always from the coredump path)

I was on vacation last week, but I wasn't hitting this before then, so it's a
fairly recent change that's introduced this.

Al ?

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Eric Sandeen
On 2/11/14, 3:08 PM, Dave Chinner wrote:
> On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
>> BUG: sleeping function called from invalid context at mm/mempool.c:203
>> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
>> 5 locks held by trinity-c3/27511:
>>  #0:  (sb_writers#9){..}, at: [] 
>> mnt_want_write+0x24/0x50
>>  #1:  (&type->i_mutex_dir_key#3){..}, at: [] 
>> do_last.isra.51+0x294/0x11f0
>>  #2:  (sb_internal#2){..}, at: [] 
>> xfs_trans_alloc+0x24/0x40 [xfs]
>>  #3:  (&(&ip->i_lock)->mr_lock/1){..}, at: [] 
>> xfs_ilock+0x16f/0x1b0 [xfs]
>>  #4:  (&(&ip->i_lock)->mr_lock){..}, at: [] 
>> xfs_ilock_nowait+0x184/0x200 [xfs]
>> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
>>  83a3fd56 45a3849a 88009f368f60 8372afea
>>   88009f368f88 8309ddb5 0010
>>  880243566288 0008 88009f369008 831534d3
>> Call Trace:
>>  [] dump_stack+0x4e/0x7a
>>  [] __might_sleep+0x105/0x150
>>  [] mempool_alloc+0xa3/0x170
>>  [] ? deactivate_slab+0x51a/0x590
>>  [] bio_alloc_bioset+0x156/0x210
>>  [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
>>  [] ? xlog_bdstrat+0x22/0x60 [xfs]
>>  [] xfs_buf_iorequest+0x6b/0xf0 [xfs]
>>  [] xlog_bdstrat+0x22/0x60 [xfs]
>>  [] xlog_sync+0x3a7/0x5b0 [xfs]
>>  [] xlog_state_release_iclog+0x10f/0x120 [xfs]
>>  [] xlog_write+0x6f0/0x800 [xfs]
>>  [] xlog_cil_push+0x2f1/0x410 [xfs]
>>  [] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
>>  [] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>>  [] _xfs_log_force+0x70/0x290 [xfs]
>>  [] ? get_parent_ip+0xd/0x50
>>  [] xfs_log_force+0x26/0xb0 [xfs]
>>  [] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
>>  [] xfs_buf_lock+0x133/0x140 [xfs]
>>  [] _xfs_buf_find+0x1f6/0x3c0 [xfs]
>>  [] xfs_buf_get_map+0x2a/0x1b0 [xfs]
>>  [] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
>>  [] xfs_da_get_buf+0xbd/0x100 [xfs]
>>  [] xfs_dir3_data_init+0x59/0x1d0 [xfs]
>>  [] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
>>  [] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
>>  [] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
>>  [] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
>>  [] ? xfs_da_compname+0x1f/0x30 [xfs]
>>  [] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
>>  [] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
>>  [] ? xfs_setup_inode+0x1cd/0x320 [xfs]
>>  [] xfs_dir_createname+0x184/0x1e0 [xfs]
>>  [] xfs_create+0x469/0x580 [xfs]
>>  [] xfs_vn_mknod+0xc4/0x1e0 [xfs]
>>  [] xfs_vn_create+0x13/0x20 [xfs]
>>  [] vfs_create+0x95/0xc0
>>  [] do_last.isra.51+0x9f8/0x11f0
>>  [] ? link_path_walk+0x81/0x870
>>  [] path_openat+0xc9/0x620
>>  [] ? put_dec+0x72/0x90
>>  [] do_filp_open+0x4d/0xb0
>>  [] file_open_name+0xfe/0x160
>>  [] filp_open+0x44/0x60
>>  [] do_coredump+0x602/0xf60
>>  [] get_signal_to_deliver+0x2b8/0x6b0
>>  [] do_signal+0x57/0x9d0
>>  [] ? __acct_update_integrals+0x8e/0x120
>>  [] ? __schedule+0x60/0x850
>>  [] ? preempt_count_sub+0x6b/0xf0
>>  [] ? _raw_spin_unlock+0x31/0x50
>>  [] ? vtime_account_user+0x91/0xa0
>>  [] ? context_tracking_user_exit+0x9b/0x100
>>  [] do_notify_resume+0x5c/0xa0
>>  [] retint_signal+0x46/0x90
> 
> There's nowhere in this XFS stack that disables interrupts, so I
> thinks it's either above or below XFS that is causing this problem.

any chance the stack got blown & corrupted things such that it
only _looks_ like irqs are disabled?

-Eric

> Cheers,
> 
> Dave.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.14-rc2 XFS backtrace because irqs_disabled.

2014-02-11 Thread Dave Chinner
On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
> BUG: sleeping function called from invalid context at mm/mempool.c:203
> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
> 5 locks held by trinity-c3/27511:
>  #0:  (sb_writers#9){..}, at: [] 
> mnt_want_write+0x24/0x50
>  #1:  (&type->i_mutex_dir_key#3){..}, at: [] 
> do_last.isra.51+0x294/0x11f0
>  #2:  (sb_internal#2){..}, at: [] 
> xfs_trans_alloc+0x24/0x40 [xfs]
>  #3:  (&(&ip->i_lock)->mr_lock/1){..}, at: [] 
> xfs_ilock+0x16f/0x1b0 [xfs]
>  #4:  (&(&ip->i_lock)->mr_lock){..}, at: [] 
> xfs_ilock_nowait+0x184/0x200 [xfs]
> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111 
>  83a3fd56 45a3849a 88009f368f60 8372afea
>   88009f368f88 8309ddb5 0010
>  880243566288 0008 88009f369008 831534d3
> Call Trace:
>  [] dump_stack+0x4e/0x7a
>  [] __might_sleep+0x105/0x150
>  [] mempool_alloc+0xa3/0x170
>  [] ? deactivate_slab+0x51a/0x590
>  [] bio_alloc_bioset+0x156/0x210
>  [] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
>  [] ? xlog_bdstrat+0x22/0x60 [xfs]
>  [] xfs_buf_iorequest+0x6b/0xf0 [xfs]
>  [] xlog_bdstrat+0x22/0x60 [xfs]
>  [] xlog_sync+0x3a7/0x5b0 [xfs]
>  [] xlog_state_release_iclog+0x10f/0x120 [xfs]
>  [] xlog_write+0x6f0/0x800 [xfs]
>  [] xlog_cil_push+0x2f1/0x410 [xfs]
>  [] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
>  [] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>  [] _xfs_log_force+0x70/0x290 [xfs]
>  [] ? get_parent_ip+0xd/0x50
>  [] xfs_log_force+0x26/0xb0 [xfs]
>  [] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
>  [] xfs_buf_lock+0x133/0x140 [xfs]
>  [] _xfs_buf_find+0x1f6/0x3c0 [xfs]
>  [] xfs_buf_get_map+0x2a/0x1b0 [xfs]
>  [] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
>  [] xfs_da_get_buf+0xbd/0x100 [xfs]
>  [] xfs_dir3_data_init+0x59/0x1d0 [xfs]
>  [] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
>  [] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
>  [] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
>  [] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
>  [] ? xfs_da_compname+0x1f/0x30 [xfs]
>  [] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
>  [] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
>  [] ? xfs_setup_inode+0x1cd/0x320 [xfs]
>  [] xfs_dir_createname+0x184/0x1e0 [xfs]
>  [] xfs_create+0x469/0x580 [xfs]
>  [] xfs_vn_mknod+0xc4/0x1e0 [xfs]
>  [] xfs_vn_create+0x13/0x20 [xfs]
>  [] vfs_create+0x95/0xc0
>  [] do_last.isra.51+0x9f8/0x11f0
>  [] ? link_path_walk+0x81/0x870
>  [] path_openat+0xc9/0x620
>  [] ? put_dec+0x72/0x90
>  [] do_filp_open+0x4d/0xb0
>  [] file_open_name+0xfe/0x160
>  [] filp_open+0x44/0x60
>  [] do_coredump+0x602/0xf60
>  [] get_signal_to_deliver+0x2b8/0x6b0
>  [] do_signal+0x57/0x9d0
>  [] ? __acct_update_integrals+0x8e/0x120
>  [] ? __schedule+0x60/0x850
>  [] ? preempt_count_sub+0x6b/0xf0
>  [] ? _raw_spin_unlock+0x31/0x50
>  [] ? vtime_account_user+0x91/0xa0
>  [] ? context_tracking_user_exit+0x9b/0x100
>  [] do_notify_resume+0x5c/0xa0
>  [] retint_signal+0x46/0x90

There's nowhere in this XFS stack that disables interrupts, so I
thinks it's either above or below XFS that is causing this problem.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/