Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-10-20 Thread Andy Lutomirski
On 10/20/2014 07:00 PM, Dave Jones wrote:
> On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
>  > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <h...@zytor.com> wrote:
>  > >
>  > > If we removed struct thread_info from the stack allocation then one
>  > > could do a guard page below the stack.  Of course, we'd have to use IST
>  > > for #PF in that case, which makes it a non-production option.

Why is thread_info in the stack allocation anyway?  Every time I look at
the entry asm, one (minor) thing that contributes to general
brain-hurtingness / sense of horrified awe is the incomprehensible (to
me) split between task_struct and thread_info.

struct thread_info is at the bottom of the stack, right?  If we don't
want to merge it into task_struct, couldn't we stick it at the top of
the stack instead?  Anything that can overwrite the *top* of the stack
gives trivial user-controlled CPL0 execution regardless.
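
(For reference, a simplified sketch of the 2014-era arrangement; the real
x86_64 lookup goes through a per-cpu variable, so treat this as
illustrative rather than the actual implementation:)

#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

/*
 * thread_info sits at the lowest addresses of the THREAD_SIZE
 * allocation and is found by masking the stack pointer -- which is
 * also why it is the first thing a stack overrun tramples.
 */
static inline struct thread_info *current_thread_info(void)
{
        unsigned long sp;

        asm volatile("mov %%rsp, %0" : "=r" (sp));
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}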

>  > 
>  > We could just have the guard page in between the stack and the
>  > thread_info, take a double fault, and then just map it back in on
>  > double fault.
>  > 
>  > That would give us 8kB of "normal" stack, with a very loud fault - and
>  > then an extra 7kB or so of stack (whatever the size of thread-info is)
>  > - after the first time it traps.
>  > 
>  > That said, it's still likely a non-production option due to the page
>  > table games we'd have to play at fork/clone time.

What's wrong with vmalloc?  Doesn't it already have guard pages?

(Also, we have a shiny hardware dirty bit, so we could relatively
cheaply check whether we're near the limit without any weird
#PF-in-weird-context issues.)
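
(A minimal sketch of the vmalloc idea, assuming nothing fancier; vmalloc()
and vfree() are real interfaces, the two helpers are hypothetical:)

#include <linux/vmalloc.h>

/* vmalloc areas are separated by an unmapped guard page, so running
 * off the end of the stack faults loudly instead of silently
 * corrupting whatever happens to sit below the allocation. */
static void *alloc_vmapped_stack(void)
{
        return vmalloc(THREAD_SIZE);
}

static void free_vmapped_stack(void *stack)
{
        vfree(stack);
}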

Also, muahaha, I've infected more people with the crazy idea that
intentional double-faults are okay.  Suckers!  Soon I'll have Linux
returning from interrupts with lret!  (IIRC Windows used to do
intentional *triple* faults on context switches, so this should be
considered entirely sensible.)

> 
> [thread necrophilia]
> 
> So digging this back up, it occurs to me that after we bumped to 16K,
> we never did anything like the debug stuff you suggested here.
> 
> The reason I'm bringing this up, is that the last few weeks, I've been
> seeing things like..
> 
> [27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left
> 
> So we're now eating past that first 8KB in some situations.
> 
> Do we care ? Or shall we only start worrying if it gets even deeper ?

I would *love* to have an immediate, loud failure when we overrun the
stack.  This will unavoidably increase the number of TLB misses (a
vmalloc'ed stack is 4k-mapped instead of covered by the linear map's
huge pages), but that probably isn't so bad.

--Andy



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-10-20 Thread Dave Jones
On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
 > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <h...@zytor.com> wrote:
 > >
 > > If we removed struct thread_info from the stack allocation then one
 > > could do a guard page below the stack.  Of course, we'd have to use IST
 > > for #PF in that case, which makes it a non-production option.
 > 
 > We could just have the guard page in between the stack and the
 > thread_info, take a double fault, and then just map it back in on
 > double fault.
 > 
 > That would give us 8kB of "normal" stack, with a very loud fault - and
 > then an extra 7kB or so of stack (whatever the size of thread-info is)
 > - after the first time it traps.
 > 
 > That said, it's still likely a non-production option due to the page
 > table games we'd have to play at fork/clone time.

[thread necrophilia]

So digging this back up, it occurs to me that after we bumped to 16K,
we never did anything like the debug stuff you suggested here.

The reason I'm bringing this up, is that the last few weeks, I've been
seeing things like..

[27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left

So we're now eating past that first 8KB in some situations.

Do we care ? Or shall we only start worrying if it gets even deeper ?
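
(For reference, that dmesg line comes from the CONFIG_DEBUG_STACK_USAGE
check at process exit; roughly, simplified from kernel/exit.c:)

static void check_stack_usage(void)
{
        static DEFINE_SPINLOCK(low_water_lock);
        static int lowest_to_date = THREAD_SIZE;
        unsigned long free;

        /* deepest point the stack has ever reached, in bytes left */
        free = stack_not_used(current);
        if (free >= lowest_to_date)
                return;

        spin_lock(&low_water_lock);
        if (free < lowest_to_date) {
                printk(KERN_WARNING "%s (%d) used greatest stack depth: "
                                "%lu bytes left\n",
                                current->comm, task_pid_nr(current), free);
                lowest_to_date = free;
        }
        spin_unlock(&low_water_lock);
}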

Dave

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-03 Thread Linus Torvalds
On Tue, Jun 3, 2014 at 6:28 AM, Rasmus Villemoes
<li...@rasmusvillemoes.dk> wrote:
> Possibly stupid question: Is it true that any given task can only be
> using one wait_queue_t at a time?

Nope.

Being on multiple different wait-queues is actually very common. The
obvious case is select/poll, but there are others. The more subtle
ones involve being on a wait-queue while doing something that can
cause nested waiting (iow, it's technically not wrong to be on a
wait-queue and then do a user space access, which obviously can end up
doing IO).

That said, the case of a single wait-queue entry is another common
case, and it wouldn't necessarily be wrong to have one pre-initialized
wait queue entry in the task structure for that special case, for when
you know that there is no possible nesting. And even if it *does*
nest, if it's the "leaf" entry it could be used for that innermost
nesting without worrying about other wait queue users (who use stack
allocations or actual explicit allocations like poll).

So it might certainly be worth looking at. In fact, it might be worth
it having multiple per-thread entries, so that we could get rid of the
special on-stack allocation for poll too (and making one of them
special and not available to poll, to handle the "leaf waiter" case).

So it's not necessarily a bad idea at all, even if the general case
requires more than one (or a few) static per-thread allocations.

Anybody want to try to code it up?
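
(A very rough sketch of that, with hypothetical task_struct fields, just
to make the "leaf waiter" rule concrete:)

/* Hand the per-task entry to the innermost waiter only; anyone who
 * finds it busy falls back to a normal on-stack entry. */
static wait_queue_t *get_wait_entry(wait_queue_t *on_stack)
{
        struct task_struct *tsk = current;

        if (!tsk->leaf_wait_busy) {             /* hypothetical flag */
                tsk->leaf_wait_busy = true;
                init_waitqueue_entry(&tsk->leaf_wait, tsk); /* hypothetical */
                return &tsk->leaf_wait;
        }
        init_waitqueue_entry(on_stack, tsk);
        return on_stack;
}

static void put_wait_entry(wait_queue_t *entry)
{
        if (entry == &current->leaf_wait)
                current->leaf_wait_busy = false;
}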

  Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-03 Thread Rasmus Villemoes
Possibly stupid question: Is it true that any given task can only be
using one wait_queue_t at a time? If so, would it be an idea to put a
wait_queue_t into struct task_struct [maybe union'ed with a struct
wait_bit_queue] and avoid allocating this 40 byte structure repeatedly
on the stack.

E.g., in one of Minchan's stack traces, there are two calls of
mempool_alloc (which itself declares a wait_queue_t) and one
try_to_free_pages (which is the only caller of throttle_direct_reclaim,
which in turn uses wait_event_interruptible_timeout and
wait_event_killable).
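
(As a sketch, the suggestion amounts to something like this; the member
names are made up:)

struct task_struct {
        /* ... existing fields ... */
        union {                         /* one 40-odd-byte entry, reused */
                wait_queue_t            wait_entry;
                struct wait_bit_queue   wait_bit_entry;
        };
};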

Rasmus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-03 Thread Konstantin Khlebnikov
On Sat, May 31, 2014 at 6:06 AM, Jens Axboe <ax...@kernel.dk> wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
>>>
>>> Regardless of whether it is swap or something external queues the
>>> bio on the plug, perhaps we should look at why it's done inline
>>> rather than by kblockd, where it was moved because it was blowing
>>> the stack from schedule():
>>
>>
>> So it sounds like we need to do this for io_schedule() too.
>>
>> In fact, we've generally found it to be a mistake every time we
>> "automatically" unblock some IO queue. And I'm not saying that because
>> of stack space, but because we've _often_ had the situation that eager
>> unblocking results in IO that could have been done as bigger requests.
>
>
> We definitely need to auto-unplug on the schedule path, otherwise we run
> into all sorts of trouble. But making it async off the IO schedule path is
> fine. By definition, it's not latency sensitive if we are hitting unplug on
> schedule. I'm pretty sure it was run inline on CPU concerns here, as running
> inline is certainly cheaper than punting to kblockd.
>
>
>> Looking at that callchain, I have to say that ext4 doesn't look
>> horrible compared to the whole block layer and virtio.. Yes,
>> "ext4_writepages()" is using almost 400 bytes of stack, and most of
>> that seems to be due to:
>>
>>  struct mpage_da_data mpd;
>>  struct blk_plug plug;
>
>
> Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the magic
> part of it, since that's never caught any bugs. Only saves 8 bytes, but may
> as well take that. Especially if we end up with nested plugs.

In the case of nested plugs, only the first one is used, right?
If so, it could be embedded into task_struct together with an integer
recursion counter, as sketched below. That would save a bit of precious
stack and make things look cleaner.
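
(Roughly like this, with made-up member names:)

static void task_start_plug(struct task_struct *tsk)
{
        if (tsk->plug_depth++ == 0)             /* hypothetical counter */
                blk_start_plug(&tsk->task_plug);        /* hypothetical */
}

static void task_finish_plug(struct task_struct *tsk)
{
        if (--tsk->plug_depth == 0)
                blk_finish_plug(&tsk->task_plug);
}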


>
>
>> Well, we've definitely have had some issues with deeper callchains
>> with md, but I suspect virtio might be worse, and the new blk-mq code
>> is likely worse in this respect too.
>
>
> I don't think blk-mq is worse than the older stack, in fact it should be
> better. The call chains are shorter, and a lot less cruft on the stack.
> Historically the stack issues have been nested devices, however. And for
> sync IO, we do run it inline, so if the driver chews up a lot of stack,
> well...
>
> Looks like I'm late here and the decision has been made to go 16K stacks,
> which I think is a good one. We've been living on the edge (and sometimes
> over) for heavy dm/md setups for a while, and have been patching around that
> fact in the IO stack for years.
>
>
> --
> Jens Axboe



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-02 Thread Dave Chinner
On Fri, May 30, 2014 at 08:06:53PM -0600, Jens Axboe wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
> >Well, we've definitely have had some issues with deeper callchains
> >with md, but I suspect virtio might be worse, and the new blk-mq code
> >is likely worse in this respect too.
> 
> I don't think blk-mq is worse than the older stack, in fact it
> should be better. The call chains are shorter, and a lot less cruft
> on the stack. Historically the stack issues have been nested
> devices, however. And for sync IO, we do run it inline, so if the
> driver chews up a lot of stack, well...

Hi Jens - as we found out with the mm code, there's a significant
disconnect between what the code looks like (i.e. it may use very
little stack directly) and what the compiler is generating.

Before blk-mq:

  9) 3952 112   scsi_request_fn+0x4b/0x490
 10) 3840  32   __blk_run_queue+0x37/0x50
 11) 3808  64   queue_unplugged+0x39/0xb0
 12) 3744 112   blk_flush_plug_list+0x20b/0x240

Now with blk-mq:

  3) 4672  96   virtio_queue_rq+0xd2/0x1e0
  4) 4576 128   __blk_mq_run_hw_queue+0x1f0/0x3e0
  5) 4448  16   blk_mq_run_hw_queue+0x35/0x40
  6) 4432  80   blk_mq_insert_requests+0xc7/0x130
  7) 4352  96   blk_mq_flush_plug_list+0x129/0x140
  8) 4256 112   blk_flush_plug_list+0xe7/0x230

So previously flushing a plug used roughly 200 bytes of stack.  With
blk-mq, it's over 400 bytes. IOWs, blk-mq has more than doubled the
block layer stack usage...
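
(A toy illustration of that disconnect, not kernel code: each helper looks
cheap at the call site, but after inlining the scratch buffers and spill
slots all land in one frame:)

static inline int fill(char *dst, int len)
{
        char scratch[128];      /* invisible at the call site */
        int i;

        for (i = 0; i < len && i < (int)sizeof(scratch); i++)
                dst[i] = scratch[i] = (char)i;
        return len;
}

int looks_cheap(char *a, char *b, char *c)
{
        /* gcc does not necessarily share the three scratch buffers,
         * so this single frame can grow by ~384 bytes plus spills. */
        return fill(a, 32) + fill(b, 32) + fill(c, 32);
}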

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Jens Axboe

On 2014-05-28 20:42, Linus Torvalds wrote:
>> Regardless of whether it is swap or something external queues the
>> bio on the plug, perhaps we should look at why it's done inline
>> rather than by kblockd, where it was moved because it was blowing
>> the stack from schedule():
>
> So it sounds like we need to do this for io_schedule() too.
>
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.

We definitely need to auto-unplug on the schedule path, otherwise we run
into all sorts of trouble. But making it async off the IO schedule path
is fine. By definition, it's not latency sensitive if we are hitting
unplug on schedule. I'm pretty sure it was run inline on CPU concerns
here, as running inline is certainly cheaper than punting to kblockd.

> Looking at that callchain, I have to say that ext4 doesn't look
> horrible compared to the whole block layer and virtio.. Yes,
> "ext4_writepages()" is using almost 400 bytes of stack, and most of
> that seems to be due to:
>
>  struct mpage_da_data mpd;
>  struct blk_plug plug;

Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the
magic part of it, since that's never caught any bugs. Only saves 8
bytes, but may as well take that. Especially if we end up with nested plugs.

> Well, we've definitely have had some issues with deeper callchains
> with md, but I suspect virtio might be worse, and the new blk-mq code
> is likely worse in this respect too.

I don't think blk-mq is worse than the older stack, in fact it should be
better. The call chains are shorter, and a lot less cruft on the stack.
Historically the stack issues have been nested devices, however. And for
sync IO, we do run it inline, so if the driver chews up a lot of stack,
well...

Looks like I'm late here and the decision has been made to go 16K
stacks, which I think is a good one. We've been living on the edge (and
sometimes over) for heavy dm/md setups for a while, and have been
patching around that fact in the IO stack for years.



--
Jens Axboe



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Thu, May 29, 2014 at 9:37 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> It really might be very good to create a "struct alloc_info" that
> contains those shared arguments, and just pass a (const) pointer to
> that around. [ .. ]
>
> Ugh. I think I'll try looking at that tomorrow.

I did look at it, but the thing is horrible. I started on this
something like ten times, and always ended up running away screaming.
Some things are truly fixed (notably "order"), but most things end up
changing subtly halfway through the callchain.

I might look at it some more later, but people may have noticed that I
decided to just apply Minchan's original patch in the meantime. I'll
make an rc8 this weekend..

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Andi Kleen
Linus Torvalds <torva...@linux-foundation.org> writes:

> From a quick glance at the frame usage, some of it seems to be gcc
> being rather bad at stack allocation, but lots of it is just nasty
> spilling around the disgusting call-sites with tons of arguments. A
> _lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
> "spill frame pointer", afaik).

> Avoiding some inlining, and using a single flag value rather than the
> collection of "bool"s would probably help. But nothing really
> trivially obvious stands out.

One thing that may be worth playing around with is gcc's
--param large-stack-frame and --param large-stack-frame-growth.

These tell the inliner when to stop inlining once too much
stack would be used.

We use -fconserve-stack, I believe. So perhaps smaller values than 100
and 400 would make sense to try.

   -fconserve-stack
   Attempt to minimize stack usage.  The compiler attempts to
   use less stack space, even if that makes the program slower.
   This option
   implies setting the large-stack-frame parameter to 100 and
   the large-stack-frame-growth parameter to 400.


-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/30/2014 10:24 AM, Dave Hansen wrote:
> On 05/30/2014 09:06 AM, Linus Torvalds wrote:
>> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin <h...@zytor.com> wrote:
 That said, it's still likely a non-production option due to the page
 table games we'd have to play at fork/clone time.
>>>
>>> Still, seems much more tractable.
>>
>> We might be able to make it more attractive by having a small
>> front-end cache of the 16kB allocations with the second page unmapped.
>> That would at least capture the common "lots of short-lived processes"
>> case without having to do kernel page table work.
> 
> If we want to use 4k mappings, we'd need to move the stack over to using
> vmalloc() (or at least be out of the linear mapping) to avoid breaking
> up the linear map's page tables too much.  Doing that, we'd actually not
> _have_ to worry about fragmentation, and we could actually utilize the
> per-cpu-pageset code since we could be back to using order-0 pages.
> So it's at least not all a loss.  Although, I do remember playing with
> 4k stacks back in the 32-bit days and not getting much of a win with it.
> 
> We'd definitely need that cache, if for no other reason than the vmalloc/vmap
> code as-is isn't super-scalable.
> 

I don't think we want to use 4K mappings for production...

-hpa



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Dave Hansen
On 05/30/2014 09:06 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin <h...@zytor.com> wrote:
>>> That said, it's still likely a non-production option due to the page
>>> table games we'd have to play at fork/clone time.
>>
>> Still, seems much more tractable.
> 
> We might be able to make it more attractive by having a small
> front-end cache of the 16kB allocations with the second page unmapped.
> That would at least capture the common "lots of short-lived processes"
> case without having to do kernel page table work.

If we want to use 4k mappings, we'd need to move the stack over to using
vmalloc() (or at least be out of the linear mapping) to avoid breaking
up the linear map's page tables too much.  Doing that, we'd actually not
_have_ to worry about fragmentation, and we could actually utilize the
per-cpu-pageset code since we could be back to using order-0 pages.
So it's at least not all a loss.  Although, I do remember playing with
4k stacks back in the 32-bit days and not getting much of a win with it.

We'd definitely need that cache, if for no other reason than the vmalloc/vmap
code as-is isn't super-scalable.




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin <h...@zytor.com> wrote:
>>
>> That said, it's still likely a non-production option due to the page
>> table games we'd have to play at fork/clone time.
>
> Still, seems much more tractable.

We might be able to make it more attractive by having a small
front-end cache of the 16kB allocations with the second page unmapped.
That would at least capture the common "lots of short-lived processes"
case without having to do kernel page table work.
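
(Sketch of that front-end cache, every name hypothetical:)

#define STACK_CACHE_DEPTH 4

struct stack_cache {
        void *stacks[STACK_CACHE_DEPTH];        /* guard page already unmapped */
        int nr;
};
static DEFINE_PER_CPU(struct stack_cache, stack_cache);

static void *alloc_cached_stack(void)
{
        struct stack_cache *sc = &get_cpu_var(stack_cache);
        void *stack = NULL;

        if (sc->nr)
                stack = sc->stacks[--sc->nr];
        put_cpu_var(stack_cache);
        return stack;   /* NULL: allocate 16kB and unmap the guard page */
}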

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/30/2014 08:41 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <h...@zytor.com> wrote:
>>
>> If we removed struct thread_info from the stack allocation then one
>> could do a guard page below the stack.  Of course, we'd have to use IST
>> for #PF in that case, which makes it a non-production option.
> 
> We could just have the guard page in between the stack and the
> thread_info, take a double fault, and then just map it back in on
> double fault.
> 

Oh, duh.  Right, much better.  Similar to the espfix64 hack, too.

> That would give us 8kB of "normal" stack, with a very loud fault - and
> then an extra 7kB or so of stack (whatever the size of thread-info is)
> - after the first time it traps.
> 
> That said, it's still likely a non-production option due to the page
> table games we'd have to play at fork/clone time.

Still, seems much more tractable.

I would still like struct thread_info off the stack allocation for other
reasons (as we have discussed in the past.)

-hpa



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <h...@zytor.com> wrote:
>
> If we removed struct thread_info from the stack allocation then one
> could do a guard page below the stack.  Of course, we'd have to use IST
> for #PF in that case, which makes it a non-production option.

We could just have the guard page in between the stack and the
thread_info, take a double fault, and then just map it back in on
double fault.

That would give us 8kB of "normal" stack, with a very loud fault - and
then an extra 7kB or so of stack (whatever the size of thread-info is)
- after the first time it traps.

That said, it's still likely a non-production option due to the page
table games we'd have to play at fork/clone time.
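
(Very rough sketch, not working code: the two helpers are hypothetical,
and resuming from #DF is itself one of those games:)

dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
{
        unsigned long cr2 = read_cr2();

        if (is_stack_guard_page(current, cr2)) {        /* hypothetical */
                map_stack_guard_page(current);          /* hypothetical */
                WARN(1, "kernel stack overran into guard page at %#lx\n", cr2);
                return;                                 /* retry the access */
        }
        die("double fault", regs, error_code);
}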

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 2:48 AM, Richard Weinberger
<richard.weinber...@gmail.com> wrote:
>
> If we raise the stack size on x86_64 to 16k, what about i386?
> Beside of the fact that most of you consider 32bits as dead and must die... ;)

x86-32 doesn't have nearly the same issue, since a large portion of
stack content tends to be pointers and longs. So it's not like it uses
half the stack, but a 32-bit environment does use a lot less stack
than a 64-bit one.

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/29/2014 06:34 PM, Dave Chinner wrote:
>> ...
>> "kworker/u24:1 (94) used greatest stack depth: 8K bytes left, it means
>> there is some horrible stack hogger in your kernel. Please report it
>> the LKML and enable stacktrace to investigate who is culprit"
> 
> That, however, presumes that a user can reproduce the problem on
> demand. Experience tells me that this is the exception rather than
> the norm for production systems, and so capturing the stack in real
> time is IMO the only useful thing we could add...
> 

If we removed struct thread_info from the stack allocation then one
could do a guard page below the stack.  Of course, we'd have to use IST
for #PF in that case, which makes it a non-production option.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Richard Weinberger
On Thu, May 29, 2014 at 5:24 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.
>
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.

If we raise the stack size on x86_64 to 16k, what about i386?
Beside of the fact that most of you consider 32bits as dead and must die... ;)
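
(For reference, the x86_64 change under discussion boils down to this;
i386 keeps its own, separate THREAD_SIZE definition:)

-#define THREAD_SIZE_ORDER	1
+#define THREAD_SIZE_ORDER	2
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)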

-- 
Thanks,
//richard


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Minchan Kim
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim <minc...@kernel.org> wrote:
> >>
> >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
> >
> > Sure. While I write this, Rusty's test crashed, so I will try Dave's
> > patch, then yours except the vmscan.c part.
> 
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
> 
> I'd *much* rather slow down the swap side. Not "real IO". So I think
> my mm/vmscan.c patch is preferable (but yes, it might require some
> work to make kswapd do better).
> 
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.
> 
>  Linus

I tested the below patch and the result is endless OOM, although there
are lots of anon pages and plenty of empty swap space.

I guess __alloc_pages_direct_reclaim couldn't make progress because of
the anon pages once the VM had dropped most of the file-backed pages,
and then it went to OOM.

---
 mm/backing-dev.c | 25 +++--
 mm/vmscan.c  |  4 +---
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7a4f29..2762b16404bd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include <linux/writeback.h>
 #include <linux/device.h>
 #include <trace/events/writeback.h>
+#include <linux/blkdev.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -565,6 +566,18 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+   long ret;
+   DEFINE_WAIT(wait);
+
+   wait_queue_head_t *wqh = &congestion_wqh[sync];
+   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+   ret = schedule_timeout(timeout);
+   finish_wait(wqh, &wait);
+   return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -578,12 +591,8 @@ long congestion_wait(int sync, long timeout)
 {
    long ret;
    unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync,timeout);
 
    trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
    jiffies_to_usecs(jiffies - start));
@@ -614,8 +623,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
    long ret;
    unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
    /*
     * If there is no congestion, or heavy congestion is not being
@@ -635,9 +642,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
    }
 
    /* Sleep until uncongested or a write happens */
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync, timeout);
 
 out:
    trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..e4ad7cd1885b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -975,9 +975,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * avoid risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-			    (!current_is_kswapd() ||
-			     !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
-- 
1.9.2

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Minchan Kim
Final result,

I tested the machine with the below patch (Dave's suggestion plus some
parts I modified) and I couldn't see the problem any more (tested 4hr;
I will queue it into the machine during the weekend for a long-running
test if I don't get a more enhanced version before leaving the office
today), but as I reported in the interim result, the VM's stack usage
is still high.

Anyway, it's another issue that we should really put the VM functions
on a diet (ex, uninlining the slow path part of __alloc_pages_nodemask,
the alloc_info idea from Linus, and more).

Looking forward to seeing the blk_plug_start_async way.
Thanks, Dave!

---
 block/blk-core.c| 2 +-
 block/blk-mq.c  | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
trace_block_plug(q);
else {
if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 if (list_empty(&plug->mq_list))
 trace_block_plug(q);
 else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
 trace_block_plug(q);
 }
 list_add_tail(&rq->queuelist, &plug->mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
 delayacct_blkio_start();
 atomic_inc(&rq->nr_iowait);
-   blk_flush_plug(current);
+   blk_schedule_flush_plug(current);
 current->in_iowait = 1;
 schedule();
 current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
 delayacct_blkio_start();
 atomic_inc(&rq->nr_iowait);
-   blk_flush_plug(current);
+   blk_schedule_flush_plug(current);
 current->in_iowait = 1;
 ret = schedule_timeout(timeout);
 current->in_iowait = 0;
-- 
1.9.2


On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > > Hello Dave,
> > > 
> > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > > Author: Jens Axboe <jax...@fusionio.com>
> > > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > > 
> > > > > block: let io_schedule() flush the plug inline
> > > > > 
> > > > > Linus correctly observes that the most important dispatch cases
> > > > > are now done from kblockd, this isn't ideal for latency reasons.
> > > > > The original reason for switching dispatches out-of-line was to
> > > > > avoid too deep a stack, so by _only_ letting the "accidental"
> > > > > flush directly in schedule() be guarded by offload to kblockd,
> > > > > we should be able to get the best of both worlds.
> > > > > 
> > > > > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > > and only use that from the schedule() path.
> > > > > 
> > > > > Signed-off-by: Jens Axboe <jax...@fusionio.com>
> > > > > 
> > > > > And now we have too deep a stack due to unplugging from 
> > > > > io_schedule()...
> > > > 
> > > > So, if we make io_schedule() push the plug list off to the kblockd
> > > > like is done for schedule()
> > 
> > > I did below hacky test to apply your idea and the result is overflow 
> > > again.
> > > So, again it would second stack expansion. Otherwise, we should prevent
> > > swapout in direct reclaim.
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5c6635b806c..95f169e85dbe 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> > >  void __sched io_schedule(void)
> > >  {
> > >   struct rq *rq = raw_rq();
> > > + struct blk_plug *plug = current->plug;
> > >  
> > >   delayacct_blkio_start();
> > >   atomic_inc(&rq->nr_iowait);
> > > - blk_flush_plug(current);
> > > + if (plug)
> > > + blk_flush_plug_list(plug, true);
> > > +
> > >   current->in_iowait = 1;
> > >   schedule();
> > >   current->in_iowait = 0;
> > 
> > .
> > 
> > >         Depth    Size   Location    (46 entries)
> > 
> > >    0) 7200   8   

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Minchan Kim
Final result,

I tested the machine below patch (Dave suggested + some part I modified)
and I couldn't see the problem any more(tested 4hr, I will queue it into
the machine during weekend for long running test if I don't get more
enhanced version before leaving the office today) but as I reported
interim result, still VM's stack usage is high.

Anyway, it's another issue we should really diet of VM functions
(ex, uninlining slow path part from __alloc_pages_nodemask and
alloc_info idea from Linus and more).

Looking forwad to seeing blk_plug_start_async way.
Thanks, Dave!

---
 block/blk-core.c| 2 +-
 block/blk-mq.c  | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
trace_block_plug(q);
else {
if (request_count = BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, 
struct bio *bio)
if (list_empty(plug-mq_list))
trace_block_plug(q);
else if (request_count = BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
list_add_tail(rq-queuelist, plug-mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
delayacct_blkio_start();
atomic_inc(rq-nr_iowait);
-   blk_flush_plug(current);
+   blk_schedule_flush_plug(current);
current-in_iowait = 1;
schedule();
current-in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
delayacct_blkio_start();
atomic_inc(rq-nr_iowait);
-   blk_flush_plug(current);
+   blk_schedule_flush_plug(current);
current-in_iowait = 1;
ret = schedule_timeout(timeout);
current-in_iowait = 0;
-- 
1.9.2


On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
 On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
  On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
   Hello Dave,
   
   On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
 On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
 commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
 Author: Jens Axboe jax...@fusionio.com
 Date:   Sat Apr 16 13:27:55 2011 +0200
 
 block: let io_schedule() flush the plug inline
 
 Linus correctly observes that the most important dispatch cases
 are now done from kblockd, this isn't ideal for latency reasons.
 The original reason for switching dispatches out-of-line was to
 avoid too deep a stack, so by _only_ letting the accidental
 flush directly in schedule() be guarded by offload to kblockd,
 we should be able to get the best of both worlds.
 
 So add a blk_schedule_flush_plug() that offloads to kblockd,
 and only use that from the schedule() path.
 
 Signed-off-by: Jens Axboe jax...@fusionio.com
 
 And now we have too deep a stack due to unplugging from 
 io_schedule()...

So, if we make io_schedule() push the plug list off to the kblockd
like is done for schedule()
  
   I did below hacky test to apply your idea and the result is overflow 
   again.
   So, again it would second stack expansion. Otherwise, we should prevent
   swapout in direct reclaim.
   
   diff --git a/kernel/sched/core.c b/kernel/sched/core.c
   index f5c6635b806c..95f169e85dbe 100644
   --- a/kernel/sched/core.c
   +++ b/kernel/sched/core.c
   @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
void __sched io_schedule(void)
{
 struct rq *rq = raw_rq();
   + struct blk_plug *plug = current-plug;

 delayacct_blkio_start();
 atomic_inc(rq-nr_iowait);
   - blk_flush_plug(current);
   + if (plug)
   + blk_flush_plug_list(plug, true);
   +
 current-in_iowait = 1;
 schedule();
 current-in_iowait = 0;
  
  .
  
   DepthSize   Location(46 entries)
  
 0) 7200   8   

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Minchan Kim
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
 On Thu, May 29, 2014 at 5:50 PM, Minchan Kim minc...@kernel.org wrote:
 
  You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
 
  Sure. While I write this, Rusty's test was crached so I will try Dave's 
  patch,
  them yours except vmscan.c part.
 
 Looking more at Dave's patch (well, description), I don't think there
 is any way in hell we can ever apply it. If I read it right, it will
 cause all IO that overflows the max request count to go through the
 scheduler to get it flushed. Maybe I misread it, but that's definitely
 not acceptable. Maybe it's not noticeable with a slow rotational
 device, but modern ssd hardware? No way.
 
 I'd *much* rather slow down the swap side. Not real IO. So I think
 my mm/vmscan.c patch is preferable (but yes, it might require some
 work to make kswapd do better).
 
 So you can try Dave's patch just to see what it does for stack depth,
 but other than that it looks unacceptable unless I misread things.
 
  Linus

I tested below patch and the result is endless OOM although there are
lots of anon pages and empty space of swap.

I guess __alloc_pages_direct_reclaim couldn't proceed due to anon pages
once VM drop most of file-backed pages, then go to OOM.

---
 mm/backing-dev.c | 25 +++--
 mm/vmscan.c  |  4 +---
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7a4f29..2762b16404bd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include linux/writeback.h
 #include linux/device.h
 #include trace/events/writeback.h
+#include linux/blkdev.h
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -565,6 +566,18 @@ void set_bdi_congested(struct backing_dev_info *bdi, int 
sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+   long ret;
+   DEFINE_WAIT(wait);
+
+   wait_queue_head_t *wqh = congestion_wqh[sync];
+   prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
+   ret = schedule_timeout(timeout);
+   finish_wait(wqh, wait);
+   return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -578,12 +591,8 @@ long congestion_wait(int sync, long timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = congestion_wqh[sync];
 
-   prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, wait);
+   ret = congestion_timeout(sync,timeout);
 
trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
jiffies_to_usecs(jiffies - start));
@@ -614,8 +623,6 @@ long wait_iff_congested(struct zone *zone, int sync, long 
timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = congestion_wqh[sync];
 
/*
 * If there is no congestion, or heavy congestion is not being
@@ -635,9 +642,7 @@ long wait_iff_congested(struct zone *zone, int sync, long 
timeout)
}
 
/* Sleep until uncongested or a write happens */
-   prepare_to_wait(wqh, wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, wait);
+   ret = congestion_timeout(sync, timeout);
 
 out:
trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..e4ad7cd1885b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -975,9 +975,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
 * avoid risk of stack overflow but only writeback
 * if many dirty pages have been encountered.
 */
-   if (page_is_file_cache(page) 
-   (!current_is_kswapd() ||
-!zone_is_reclaim_dirty(zone))) {
+   if (!current_is_kswapd() || 
!zone_is_reclaim_dirty(zone)) {
/*
 * Immediately reclaim when written back.
 * Similar in principal to deactivate_page()
-- 
1.9.2

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Richard Weinberger
On Thu, May 29, 2014 at 5:24 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 So I'm not in fact arguing against Minchan's patch of upping
 THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
 remain one of my we really need to be careful issues, so while I am
 basically planning on applying that patch, I _also_ want to make sure
 that we fix the problems we do see and not just paper them over.

 The 8kB stack has been somewhat restrictive and painful for a while,
 and I'm ok with admitting that it is just getting _too_ damn painful,
 but I don't want to just give up entirely when we have a known deep
 stack case.

If we raise the stack size on x86_64 to 16k, what about i386?
Beside of the fact that most of you consider 32bits as dead and must die... ;)

-- 
Thanks,
//richard


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/29/2014 06:34 PM, Dave Chinner wrote:
 ...
 "kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
 there is some horrible stack hogger in your kernel. Please report it to
 the LKML and enable stacktrace to investigate who the culprit is"
 
 That, however, presumes that a user can reproduce the problem on
 demand. Experience tells me that this is the exception rather than
 the norm for production systems, and so capturing the stack in real
 time is IMO the only useful thing we could add...
 

If we removed struct thread_info from the stack allocation then one
could do a guard page below the stack.  Of course, we'd have to use IST
for #PF in that case, which makes it a non-production option.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 2:48 AM, Richard Weinberger
richard.weinber...@gmail.com wrote:

 If we raise the stack size on x86_64 to 16k, what about i386?
 Besides the fact that most of you consider 32 bits as dead and bound to die... ;)

x86-32 doesn't have nearly the same issue, since a large portion of
stack content tends to be pointers and longs. So it's not like it uses
half the stack, but a 32-bit environment does use a lot less stack
than a 64-bit one.
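
As a rough illustration of that point (hypothetical demo code, not from the thread), the same mix of pointer/long locals roughly doubles in size on a 64-bit build:

/* Hedged demo: compile for 32-bit and 64-bit targets and compare. */
#include <stdio.h>

struct frame_demo {
	void *ptrs[16];		/* 64 bytes on ILP32, 128 on LP64 */
	long vals[16];		/* likewise */
};

int main(void)
{
	/* Prints 128 on a typical ILP32 target, 256 on LP64. */
	printf("%zu\n", sizeof(struct frame_demo));
	return 0;
}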

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin h...@zytor.com wrote:

 If we removed struct thread_info from the stack allocation then one
 could do a guard page below the stack.  Of course, we'd have to use IST
 for #PF in that case, which makes it a non-production option.

We could just have the guard page in between the stack and the
thread_info, take a double fault, and then just map it back in on
double fault.

That would give us 8kB of "normal" stack, with a very loud fault - and
then an extra 7kB or so of stack (whatever the size of thread-info is)
- after the first time it traps.

That said, it's still likely a non-production option due to the page
table games we'd have to play at fork/clone time.
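
For reference, a sketch of the layout being described (offsets invented for illustration; this is not actual kernel code):

/*
 * Hypothetical 16kB THREAD_SIZE layout for the guard-page idea:
 *
 *  base + 16k +----------------------------+
 *             | 8kB "normal" stack         |  grows downward
 *  base + 8k  +----------------------------+
 *             | 4kB guard page (unmapped)  |  the double fault maps it
 *  base + 4k  +----------------------------+  back in on first contact
 *             | extra stack (~3kB)         |
 *             | struct thread_info         |
 *  base       +----------------------------+
 */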

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/30/2014 08:41 AM, Linus Torvalds wrote:
 On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin h...@zytor.com wrote:

 If we removed struct thread_info from the stack allocation then one
 could do a guard page below the stack.  Of course, we'd have to use IST
 for #PF in that case, which makes it a non-production option.
 
 We could just have the guard page in between the stack and the
 thread_info, take a double fault, and then just map it back in on
 double fault.
 

Oh, duh.  Right, much better.  Similar to the espfix64 hack, too.

 That would give us 8kB of "normal" stack, with a very loud fault - and
 then an extra 7kB or so of stack (whatever the size of thread-info is)
 - after the first time it traps.
 
 That said, it's still likely a non-production option due to the page
 table games we'd have to play at fork/clone time.

Still, seems much more tractable.

I would still like struct thread_info off the stack allocation for other
reasons (as we have discussed in the past.)

-hpa



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin h...@zytor.com wrote:

 That said, it's still likely a non-production option due to the page
 table games we'd have to play at fork/clone time.

 Still, seems much more tractable.

We might be able to make it more attractive by having a small
front-end cache of the 16kB allocations with the second page unmapped.
That would at least capture the common "lots of short-lived processes"
case without having to do kernel page table work.
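
A hedged sketch of such a front-end cache (names and structure invented here; not a real patch):

/* Hypothetical per-CPU cache of ready-made stacks whose guard page is
 * already unmapped, so fork/clone can skip the page table games. */
static DEFINE_PER_CPU(void *, cached_stack);

static void *pop_cached_stack(void)
{
	void *s;

	preempt_disable();
	s = __this_cpu_read(cached_stack);
	__this_cpu_write(cached_stack, NULL);
	preempt_enable();
	return s;	/* NULL: allocate 16kB and unmap the guard page */
}

static bool push_cached_stack(void *s)
{
	bool cached = false;

	preempt_disable();
	if (!__this_cpu_read(cached_stack)) {
		__this_cpu_write(cached_stack, s);
		cached = true;
	}
	preempt_enable();
	return cached;	/* false: caller remaps the guard page and frees */
}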

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Dave Hansen
On 05/30/2014 09:06 AM, Linus Torvalds wrote:
 On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin h...@zytor.com wrote:
 That said, it's still likely a non-production option due to the page
 table games we'd have to play at fork/clone time.

 Still, seems much more tractable.
 
 We might be able to make it more attractive by having a small
 front-end cache of the 16kB allocations with the second page unmapped.
 That would at least capture the common "lots of short-lived processes"
 case without having to do kernel page table work.

If we want to use 4k mappings, we'd need to move the stack over to using
vmalloc() (or at least be out of the linear mapping) to avoid breaking
up the linear map's page tables too much.  Doing that, we'd actually not
_have_ to worry about fragmentation, and we could actually utilize the
per-cpu-pageset code since we could be back to using order-0 pages.
So it's at least not all a loss.  Although, I do remember playing with
4k stacks back in the 32-bit days and not getting much of a win with it.

We'd definitely need that cache, if for no other reason than the vmalloc/vmap
code as-is isn't super-scalable.
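
For reference, a minimal sketch of a vmalloc-backed stack allocation, assuming the three-argument __vmalloc() of the 3.x era (the guard comes from the unmapped hole vmalloc leaves between areas; this is not how stacks were actually allocated then):

#include <linux/vmalloc.h>

/* Hedged sketch: a stack out of vmalloc space, so an overrun walks
 * into the unmapped guard area and faults loudly instead of silently
 * corrupting a neighbouring linear-map allocation. */
static void *alloc_stack_vmalloc(void)
{
	return __vmalloc(THREAD_SIZE, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
}

static void free_stack_vmalloc(void *stack)
{
	vfree(stack);
}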




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/30/2014 10:24 AM, Dave Hansen wrote:
 On 05/30/2014 09:06 AM, Linus Torvalds wrote:
 On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin h...@zytor.com wrote:
 That said, it's still likely a non-production option due to the page
 table games we'd have to play at fork/clone time.

 Still, seems much more tractable.

 We might be able to make it more attractive by having a small
 front-end cache of the 16kB allocations with the second page unmapped.
 That would at least capture the common "lots of short-lived processes"
 case without having to do kernel page table work.
 
 If we want to use 4k mappings, we'd need to move the stack over to using
 vmalloc() (or at least be out of the linear mapping) to avoid breaking
 up the linear map's page tables too much.  Doing that, we'd actually not
 _have_ to worry about fragmentation, and we could actually utilize the
 per-cpu-pageset code since we could be back to using order-0 pages.
 So it's at least not all a loss.  Although, I do remember playing with
 4k stacks back in the 32-bit days and not getting much of a win with it.
 
 We'd definitely need that cache, if for no other reason than the vmalloc/vmap
 code as-is isn't super-scalable.
 

I don't think we want to use 4K mappings for production...

-hpa



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Andi Kleen
Linus Torvalds torva...@linux-foundation.org writes:

 From a quick glance at the frame usage, some of it seems to be gcc
 being rather bad at stack allocation, but lots of it is just nasty
 spilling around the disgusting call-sites with tons or arguments. A
 _lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
 "spill frame pointer", afaik).

 Avoiding some inlining, and using a single flag value rather than the
 collection of bools would probably help. But nothing really
 trivially obvious stands out.

One thing that may be worth playing around with is gcc's
--param large-stack-frame and --param large-stack-frame-growth.

These tell the inliner to stop inlining when too much
stack would be used.

We use -fconserve-stack, I believe. So perhaps smaller values than 100
and 400 would make sense to try.

   -fconserve-stack
   Attempt to minimize stack usage.  The compiler attempts to
   use less stack space, even if that makes the program slower.
   This option
   implies setting the large-stack-frame parameter to 100 and
   the large-stack-frame-growth parameter to 400.
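
A toy example of what those parameters govern (hypothetical file, not from the thread): whether big_frame() gets inlined, and therefore how large caller()'s frame becomes, depends on the thresholds.

/* Hedged demo: try e.g.
 *   gcc -O2 -c demo.c
 * versus
 *   gcc -O2 --param large-stack-frame=64 \
 *           --param large-stack-frame-growth=50 -c demo.c
 * and compare the frame sizes in objdump -d. */
static inline int big_frame(const int *in)
{
	int scratch[128];	/* ~512-byte frame if inlined */
	int i, sum = 0;

	for (i = 0; i < 128; i++) {
		scratch[i] = in[i] * 2;
		sum += scratch[i];
	}
	return sum;
}

int caller(const int *in)
{
	return big_frame(in);
}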


-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Thu, May 29, 2014 at 9:37 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 It really might be very good to create a "struct alloc_info" that
 contains those shared arguments, and just pass a (const) pointer to
 that around. [ .. ]

 Ugh. I think I'll try looking at that tomorrow.

I did look at it, but the thing is horrible. I started on this
something like ten times, and always ended up running away screaming.
Some things are truly fixed (notably order), but most things end up
changing subtly halfway through the callchain.

I might look at it some more later, but people may have noticed that I
decided to just apply Minchan's original patch in the meantime. I'll
make an rc8 this weekend..

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Jens Axboe

On 2014-05-28 20:42, Linus Torvalds wrote:

Regardless of whether it is swap or something external queues the
bio on the plug, perhaps we should look at why it's done inline
rather than by kblockd, where it was moved because it was blowing
the stack from schedule():


So it sounds like we need to do this for io_schedule() too.

In fact, we've generally found it to be a mistake every time we
automatically unblock some IO queue. And I'm not saying that because
of stack space, but because we've _often_ had the situation that eager
unblocking results in IO that could have been done as bigger requests.


We definitely need to auto-unplug on the schedule path, otherwise we run 
into all sorts of trouble. But making it async off the IO schedule path 
is fine. By definition, it's not latency sensitive if we are hitting 
unplug on schedule. I'm pretty sure it was run inline due to CPU concerns
here, as running inline is certainly cheaper than punting to kblockd.



Looking at that callchain, I have to say that ext4 doesn't look
horrible compared to the whole block layer and virtio.. Yes,
ext4_writepages() is using almost 400 bytes of stack, and most of
that seems to be due to:

 struct mpage_da_data mpd;
 struct blk_plug plug;


Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the
"magic" part of it, since that's never caught any bugs. Only saves 8
bytes, but may as well take that. Especially if we end up with nested plugs.



Well, we've definitely have had some issues with deeper callchains
with md, but I suspect virtio might be worse, and the new blk-mq code
is likely worse in this respect too.


I don't think blk-mq is worse than the older stack, in fact it should be 
better. The call chains are shorter, and a lot less cruft on the stack. 
Historically the stack issues have been nested devices, however. And for 
sync IO, we do run it inline, so if the driver chews up a lot of stack, 
well...


Looks like I'm late here and the decision has been made to go with 16K
stacks, which I think is a good one. We've been living on the edge (and 
sometimes over) for heavy dm/md setups for a while, and have been 
patching around that fact in the IO stack for years.



--
Jens Axboe



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 7:12 PM, Minchan Kim  wrote:
>
> Interim report,
>
> And the result is as follows. It reduces usage by about 800 bytes compared to
> my first report, but stack usage still seems to be high.
> The VM functions really need a diet.

Yes. And in this case uninlining things might actually help, because
it's not actually performing reclaim in the second case, so
inlining the reclaim code into that huge __alloc_pages_nodemask()
function means that it has the stack frame for all those cases even if
they don't actually get used.

That said, the way those functions are set up (with lots of arguments
passed from one to the other), not inlining will cause huge costs too
for the argument setup.

It really might be very good to create a "struct alloc_info" that
contains those shared arguments, and just pass a (const) pointer to
that around. Gcc would likely tend to be *much* better at generating
code for that, because it avoids a tons of temporaries being created
by function calls. Even when it's inlined, the argument itself ends up
being a new temporary internally, and I suspect one reason gcc
(especially your 4.6.3 version, apparently) generates those big spill
frames is because there's tons of these duplicate temporaries that
apparently don't get merged properly.

Ugh. I think I'll try looking at that tomorrow.
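
A minimal sketch of that struct-of-arguments idea (field and function names invented here; the real mm/page_alloc.c signatures differ):

#include <linux/gfp.h>
#include <linux/mmzone.h>

/* Hypothetical bundle of the allocator arguments that stay fixed
 * across the whole call chain, passed as a single const pointer. */
struct alloc_info {
	gfp_t gfp_mask;
	unsigned int order;
	struct zonelist *zonelist;
	nodemask_t *nodemask;
	int migratetype;
};

struct page *get_page_fast(const struct alloc_info *ai);
struct page *get_page_slow(const struct alloc_info *ai);

/* One pointer instead of five arguments, so each call site spills a
 * single slot rather than re-materializing every temporary. */
struct page *alloc_pages_info(const struct alloc_info *ai)
{
	struct page *page = get_page_fast(ai);

	return page ? page : get_page_slow(ai);
}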

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 6:58 PM, Dave Chinner  wrote:
>
> If the patch I sent solves the swap stack usage issue, then perhaps
> we should look towards adding "blk_plug_start_async()" to pass such
> hints to the plug flushing. I'd want to use the same behaviour in
> __xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
> probably also in mpage_writepages() for bulk data writeback in
> WB_SYNC_NONE context...

Yeah, adding a flag to the plug about what kind of plug it is does
sound quite reasonable. It already has that "magic" field, it could
easily be extended to have a "async" vs "sync" bit to it..

Of course, it's also possible that the unplugging code could just look
at the actual requests that are plugged to determine that, and maybe
we wouldn't even need to mark things specially. I don't think we ever
end up mixing reads and writes under the same plug, so "first request
is a write" is probably a good approximation for "async".

 Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > Hello Dave,
> > 
> > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > Author: Jens Axboe 
> > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > 
> > > > block: let io_schedule() flush the plug inline
> > > > 
> > > > Linus correctly observes that the most important dispatch cases
> > > > are now done from kblockd, this isn't ideal for latency reasons.
> > > > The original reason for switching dispatches out-of-line was to
> > > > avoid too deep a stack, so by _only_ letting the "accidental"
> > > > flush directly in schedule() be guarded by offload to kblockd,
> > > > we should be able to get the best of both worlds.
> > > > 
> > > > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > and only use that from the schedule() path.
> > > > 
> > > > Signed-off-by: Jens Axboe 
> > > > 
> > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > 
> > > So, if we make io_schedule() push the plug list off to the kblockd
> > > like is done for schedule()
> 
> > I did the hacky test below to apply your idea and the result is overflow again.
> > So, again, it argues for stack expansion. Otherwise, we should prevent
> > swapout in direct reclaim.
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f5c6635b806c..95f169e85dbe 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> >  void __sched io_schedule(void)
> >  {
> > struct rq *rq = raw_rq();
> > +   struct blk_plug *plug = current->plug;
> >  
> > delayacct_blkio_start();
> > 	atomic_inc(&rq->nr_iowait);
> > -   blk_flush_plug(current);
> > +   if (plug)
> > +   blk_flush_plug_list(plug, true);
> > +
> > current->in_iowait = 1;
> > schedule();
> > current->in_iowait = 0;
> 
> .
> 
> >         Depth    Size   Location    (46 entries)
> >         -----    ----   --------
> >   0) 7200   8   _raw_spin_lock_irqsave+0x51/0x60
> >   1) 7192 296   get_page_from_freelist+0x886/0x920
> >   2) 6896 352   __alloc_pages_nodemask+0x5e1/0xb20
> >   3) 6544   8   alloc_pages_current+0x10f/0x1f0
> >   4) 6536 168   new_slab+0x2c5/0x370
> >   5) 6368   8   __slab_alloc+0x3a9/0x501
> >   6) 6360  80   __kmalloc+0x1cb/0x200
> >   7) 6280 376   vring_add_indirect+0x36/0x200
> >   8) 5904 144   virtqueue_add_sgs+0x2e2/0x320
> >   9) 5760 288   __virtblk_add_req+0xda/0x1b0
> >  10) 5472  96   virtio_queue_rq+0xd3/0x1d0
> >  11) 5376 128   __blk_mq_run_hw_queue+0x1ef/0x440
> >  12) 5248  16   blk_mq_run_hw_queue+0x35/0x40
> >  13) 5232  96   blk_mq_insert_requests+0xdb/0x160
> >  14) 5136 112   blk_mq_flush_plug_list+0x12b/0x140
> >  15) 5024 112   blk_flush_plug_list+0xc7/0x220
> >  16) 4912 128   blk_mq_make_request+0x42a/0x600
> >  17) 4784  48   generic_make_request+0xc0/0x100
> >  18) 4736 112   submit_bio+0x86/0x160
> >  19) 4624 160   __swap_writepage+0x198/0x230
> >  20) 4464  32   swap_writepage+0x42/0x90
> >  21) 4432 320   shrink_page_list+0x676/0xa80
> >  22) 4112 208   shrink_inactive_list+0x262/0x4e0
> >  23) 3904 304   shrink_lruvec+0x3e1/0x6a0
> 
> The device is supposed to be plugged here in shrink_lruvec().
> 
> Oh, a plug can only hold 16 individual bios, and then it does a
> synchronous flush. Hmmm - perhaps that should also defer the flush
> to the kblockd, because if we are overrunning a plug then we've
> already surrendered IO dispatch latency
> 
> So, in blk_mq_make_request(), can you do:
> 
>   if (list_empty(&plug->mq_list))
>   trace_block_plug(q);
>   else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> - blk_flush_plug_list(plug, false);
> + blk_flush_plug_list(plug, true);
>   trace_block_plug(q);
>   }
>   list_add_tail(&rq->queuelist, &plug->mq_list);
> 
> To see if that defers all the swap IO to kblockd?
> 

Interim report,

I applied the patch below (we need to fix io_schedule_timeout due to mempool_alloc).

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
trace_block_plug(q);
else {
if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim  wrote:
> >>
> >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
> >
> > Sure. While I write this, Rusty's test crashed, so I will try Dave's patch,
> > then yours except the vmscan.c part.
> 
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
> 
> I'd *much* rather slow down the swap side. Not "real IO". So I think
> my mm/vmscan.c patch is preferable (but yes, it might require some
> work to make kswapd do better).
> 
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.

Yeah, it's a hack, not intended as a potential solution.

I'm thinking, though, that plug flushing behaviour is actually
dependent on plugger context and there is no one "correct"
behaviour. If we are doing process driven IO, then we want to do
immediate dispatch, but for IO where stack is an issue or is for
bulk throughput (e.g. background writeback) async dispatch through
kblockd is desirable.

If the patch I sent solves the swap stack usage issue, then perhaps
we should look towards adding "blk_plug_start_async()" to pass such
hints to the plug flushing. I'd want to use the same behaviour in
__xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
probably also in mpage_writepages() for bulk data writeback in
WB_SYNC_NONE context

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Fri, May 30, 2014 at 09:32:19AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
> > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> > > 
> > >  > That sounds like a plan. Perhaps it would be useful to add a
> > >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
> > >  > 8k) so that we get some indication that we're hitting a deep stack
> > >  > but the system otherwise keeps functioning. That gives us some
> > >  > motivation to keep stack usage down but isn't a fatal problem like
> > >  > it is now
> > > 
> > > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> > > Though it needs some tweaking if we move to 16K
> > 
> > Right, but it doesn't throw loud warnings when a specific threshold
> > is reached - it just issues a quiet message when a process exits
> > telling you what the maximum was without giving us a stack to chew
> > on
> 
> But we could enhance the message to notify the user of the risk,
> as follows:
> 
> ...
> "kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
> there is some horrible stack hogger in your kernel. Please report it to
> the LKML and enable stacktrace to investigate who the culprit is"

That, however, presumes that a user can reproduce the problem on
demand. Experience tells me that this is the exception rather than
the norm for production systems, and so capturing the stack in real
time is IMO the only useful thing we could add...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:05 PM, Linus Torvalds
 wrote:
>
> So maybe test a patch something like the attached.
>
> NOTE! This is absolutely TOTALLY UNTESTED!

It's still untested, but I realized that the whole
"blk_flush_plug_list(plug, true);" thing is pointless, since
schedule() itself will do that for us.

So I think you can remove the

+   struct blk_plug *plug = current->plug;
+   if (plug)
+   blk_flush_plug_list(plug, true);

part from congestion_timeout().

Not that it should *hurt* to have it there, so I'm not bothering to
send a changed patch.

And again, no actual testing by me on any of this, just looking at the code.

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:50 PM, Minchan Kim  wrote:
>>
>> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
>
> Sure. While I write this, Rusty's test crashed, so I will try Dave's patch,
> then yours except the vmscan.c part.

Looking more at Dave's patch (well, description), I don't think there
is any way in hell we can ever apply it. If I read it right, it will
cause all IO that overflows the max request count to go through the
scheduler to get it flushed. Maybe I misread it, but that's definitely
not acceptable. Maybe it's not noticeable with a slow rotational
device, but modern ssd hardware? No way.

I'd *much* rather slow down the swap side. Not "real IO". So I think
my mm/vmscan.c patch is preferable (but yes, it might require some
work to make kswapd do better).

So you can try Dave's patch just to see what it does for stack depth,
but other than that it looks unacceptable unless I misread things.

 Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Thu, May 29, 2014 at 05:31:42PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:20 PM, Minchan Kim  wrote:
> >
> > I guess this part which avoids swapout in direct reclaim would be key
> > if this patch were successful. But it could make anon pages rotate back
> > to the inactive list's head from the tail in the direct reclaim path until
> > kswapd can catch up. And kswapd can swap out anon pages from the tail of
> > the inactive LRU, so I suspect it could cause LRU churning as a side effect.
> 
> Oh, it could make bad things happen, no question about that.
> 
> That said, those bad things are what happens to shared mapped pages
> today, so in that sense it's not new. But large dirty shared mmap's
> have traditionally been a great way to really hurt our VM, so "it
> should work as well as shared mapping pages" is definitely not a
> ringing endorsement!

True.

> 
> (Of course, *if* we can improve kswapd behavior for both swap-out and
> shared dirty pages, that would then be a double win, so there is
> _some_ argument for saying that we should aim to handle both kinds of
> pages equally).

Just an idea for preventing LRU churn:
we could return pages to the tail of the inactive list instead of the head
when they aren't suitable in this context, have the reclaimer scan victim
pages from a cursor (a list_head) rather than from the LRU head, and record
the cursor somewhere like the lruvec once shrinking is done. It makes the
VM code more complicated but is worth trying if we approach it that way.
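
A very rough sketch of that cursor idea (all names invented; not a real patch):

#include <linux/list.h>

struct demo_page {
	struct list_head lru;
};

struct demo_lruvec {
	struct list_head inactive;
	struct list_head *cursor;	/* where the previous scan stopped */
};

/* Rotate an unsuitable page to the tail, behind the cursor, instead of
 * to the head, so this pass won't rescan it and the LRU order isn't
 * churned. */
static void demo_rotate_unsuitable(struct demo_lruvec *lv,
				   struct demo_page *page)
{
	list_move_tail(&page->lru, &lv->inactive);
}

/* Record where shrinking stopped so the next scan resumes there. */
static void demo_save_cursor(struct demo_lruvec *lv, struct list_head *pos)
{
	lv->cursor = pos;
}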

> 
> > Anyway, I will queue it on the testing machine since Rusty's test is done.
> 
> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Sure. While I write this, Rusty's test crashed, so I will try Dave's patch,
then yours except the vmscan.c part.

Thanks.

> 
> Linus
> 

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:20 PM, Minchan Kim  wrote:
>
> I guess this part which avoids swapout in direct reclaim would be key
> if this patch were successful. But it could make anon pages rotate back
> to the inactive list's head from the tail in the direct reclaim path until
> kswapd can catch up. And kswapd can swap out anon pages from the tail of
> the inactive LRU, so I suspect it could cause LRU churning as a side effect.

Oh, it could make bad things happen, no question about that.

That said, those bad things are what happens to shared mapped pages
today, so in that sense it's not new. But large dirty shared mmap's
have traditionally been a great way to really hurt our VM, so "it
should work as well as shared mapping pages" is definitely not a
ringing endorsement!

(Of course, *if* we can improve kswapd behavior for both swap-out and
shared dirty pages, that would then be a double win, so there is
_some_ argument for saying that we should aim to handle both kinds of
pages equally).

> Anyway, I will queue it on the testing machine since Rusty's test is done.

You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
> On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> > 
> >  > That sounds like a plan. Perhaps it would be useful to add a
> >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
> >  > 8k) so that we get some indication that we're hitting a deep stack
> >  > but the system otherwise keeps functioning. That gives us some
> >  > motivation to keep stack usage down but isn't a fatal problem like
> >  > it is now
> > 
> > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> > Though it needs some tweaking if we move to 16K
> 
> Right, but it doesn't throw loud warnings when a specific threshold
> is reached - it just issues a quiet message when a process exits
> telling you what the maximum was without giving us a stack to chew
> on

But we could enhance the message to notify the user of the risk,
as follows:

...
"kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
there is some horrible stack hogger in your kernel. Please report it to
the LKML and enable stacktrace to investigate who the culprit is"
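
A hedged sketch of that kind of report, loosely modeled on check_stack_usage() in kernel/exit.c (threshold and wording invented here):

/* Illustrative only; the real hook is check_stack_usage() under
 * CONFIG_DEBUG_STACK_USAGE.  Threshold and message are made up. */
static void report_stack_hog(void)
{
	unsigned long free = stack_not_used(current);	/* bytes never touched */

	if (free < THREAD_SIZE / 2)
		pr_warn("%s (%d) used %lu bytes of stack; possible stack hog, please report to LKML with stacktrace enabled\n",
			current->comm, task_pid_nr(current),
			THREAD_SIZE - free);
}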


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Jones
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
 > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
 > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
 > > 
 > >  > That sounds like a plan. Perhaps it would be useful to add a
 > >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 > >  > 8k) so that we get some indication that we're hitting a deep stack
 > >  > but the system otherwise keeps functioning. That gives us some
 > >  > motivation to keep stack usage down but isn't a fatal problem like
 > >  > it is now
 > > 
 > > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
 > > Though it needs some tweaking if we move to 16K
 > 
 > Right, but it doesn't throw loud warnings when a specific threshold
 > is reached - it just issues a quiet message when a process exits
 > telling you what the maximum was without giving us a stack to chew
 > on

ah, right good point. That would be more useful.

Dave



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> 
>  > That sounds like a plan. Perhaps it would be useful to add a
>  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
>  > 8k) so that we get some indication that we're hitting a deep stack
>  > but the system otherwise keeps functioning. That gives us some
>  > motivation to keep stack usage down but isn't a fatal problem like
>  > it is now
> 
> We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> Though it needs some tweaking if we move to 16K

Right, but it doesn't throw loud warnings when a specific threshold
is reached - it just issues a quiet message when a process exits
telling you what the maximum was without giving us a stack to chew
on

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Linus,

On Thu, May 29, 2014 at 05:05:17PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 4:36 PM, Minchan Kim  wrote:
> >
> > I did the hacky test below to apply your idea and the result is overflow again.
> > So, again, it argues for stack expansion. Otherwise, we should prevent
> > swapout in direct reclaim.
> 
> So changing io_schedule() is bad, for the reasons I outlined elsewhere
> (we use it for wait_for_page*() - see sleep_on_page()).
> 
> It's the congestion waiting where the io_schedule() should be avoided.
> 
> So maybe test a patch something like the attached.
> 
> NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
> horrible things. It seems to compile, but I have absolutely no reason
> to believe that it would work. I didn't actually test that this moves
> anything at all to kblockd. So think of it as a concept patch that
> *might* work, but as Dave said, there might also be other things that
> cause unplugging and need some tough love.
> 
>Linus

>  mm/backing-dev.c | 28 ++++++++++++++++++----------
>  mm/vmscan.c  |  4 +---
>  2 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 09d9591b7708..cb26b24c2da2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
>  
> @@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  }
>  EXPORT_SYMBOL(set_bdi_congested);
>  
> +static long congestion_timeout(int sync, long timeout)
> +{
> +	long ret;
> +	DEFINE_WAIT(wait);
> +	struct blk_plug *plug = current->plug;
> +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	if (plug)
> +		blk_flush_plug_list(plug, true);
> +	ret = schedule_timeout(timeout);
> +	finish_wait(wqh, &wait);
> +	return ret;
> +}
> +
>  /**
>   * congestion_wait - wait for a backing_dev to become uncongested
>   * @sync: SYNC or ASYNC IO
> @@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
>  {
>  	long ret;
>  	unsigned long start = jiffies;
> -	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> -	finish_wait(wqh, &wait);
> +	ret = congestion_timeout(sync,timeout);
>  
>  	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
>  					jiffies_to_usecs(jiffies - start));
> @@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>  {
>  	long ret;
>  	unsigned long start = jiffies;
> -	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>  	/*
>  	 * If there is no congestion, or heavy congestion is not being
> @@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>  	}
>  
>  	/* Sleep until uncongested or a write happens */
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> -	finish_wait(wqh, &wait);
> +	ret = congestion_timeout(sync, timeout);
>  
>  out:
>  	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d66a45..1e524000b83e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				 * avoid risk of stack overflow but only writeback
>  				 * if many dirty pages have been encountered.
>  				 */
> -				if (page_is_file_cache(page) &&
> -				    (!current_is_kswapd() ||
> -				     !zone_is_reclaim_dirty(zone))) {
> +				if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
>  					/*
>  					 * Immediately reclaim when written back.
>  					 * Similar in principal to deactivate_page()

I guess this part which avoids swapout in direct reclaim would be key
if this patch were successful. But it could make anon pages rotate back
to the inactive list's head from the tail in the direct reclaim path until
kswapd can catch up. And kswapd can swap out anon pages from the tail of
the inactive LRU, so I suspect it could cause LRU churning as a side effect.

Anyway, I will queue it on the testing machine since Rusty's test is done.
Thanks!

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> Hello Dave,
> 
> On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > Author: Jens Axboe 
> > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > 
> > > block: let io_schedule() flush the plug inline
> > > 
> > > Linus correctly observes that the most important dispatch cases
> > > are now done from kblockd, this isn't ideal for latency reasons.
> > > The original reason for switching dispatches out-of-line was to
> > > avoid too deep a stack, so by _only_ letting the "accidental"
> > > flush directly in schedule() be guarded by offload to kblockd,
> > > we should be able to get the best of both worlds.
> > > 
> > > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > and only use that from the schedule() path.
> > > 
> > > Signed-off-by: Jens Axboe 
> > > 
> > > And now we have too deep a stack due to unplugging from io_schedule()...
> > 
> > So, if we make io_schedule() push the plug list off to the kblockd
> > like is done for schedule()

> I did the hacky test below to apply your idea and the result is overflow again.
> So, again, it argues for stack expansion. Otherwise, we should prevent
> swapout in direct reclaim.
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..95f169e85dbe 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
>  void __sched io_schedule(void)
>  {
>   struct rq *rq = raw_rq();
> + struct blk_plug *plug = current->plug;
>  
>   delayacct_blkio_start();
>   	atomic_inc(&rq->nr_iowait);
> - blk_flush_plug(current);
> + if (plug)
> + blk_flush_plug_list(plug, true);
> +
>   current->in_iowait = 1;
>   schedule();
>   current->in_iowait = 0;

.

>         Depth    Size   Location    (46 entries)
>         -----    ----   --------
>   0) 7200   8   _raw_spin_lock_irqsave+0x51/0x60
>   1) 7192 296   get_page_from_freelist+0x886/0x920
>   2) 6896 352   __alloc_pages_nodemask+0x5e1/0xb20
>   3) 6544   8   alloc_pages_current+0x10f/0x1f0
>   4) 6536 168   new_slab+0x2c5/0x370
>   5) 6368   8   __slab_alloc+0x3a9/0x501
>   6) 6360  80   __kmalloc+0x1cb/0x200
>   7) 6280 376   vring_add_indirect+0x36/0x200
>   8) 5904 144   virtqueue_add_sgs+0x2e2/0x320
>   9) 5760 288   __virtblk_add_req+0xda/0x1b0
>  10) 5472  96   virtio_queue_rq+0xd3/0x1d0
>  11) 5376 128   __blk_mq_run_hw_queue+0x1ef/0x440
>  12) 5248  16   blk_mq_run_hw_queue+0x35/0x40
>  13) 5232  96   blk_mq_insert_requests+0xdb/0x160
>  14) 5136 112   blk_mq_flush_plug_list+0x12b/0x140
>  15) 5024 112   blk_flush_plug_list+0xc7/0x220
>  16) 4912 128   blk_mq_make_request+0x42a/0x600
>  17) 4784  48   generic_make_request+0xc0/0x100
>  18) 4736 112   submit_bio+0x86/0x160
>  19) 4624 160   __swap_writepage+0x198/0x230
>  20) 4464  32   swap_writepage+0x42/0x90
>  21) 4432 320   shrink_page_list+0x676/0xa80
>  22) 4112 208   shrink_inactive_list+0x262/0x4e0
>  23) 3904 304   shrink_lruvec+0x3e1/0x6a0

The device is supposed to be plugged here in shrink_lruvec().

Oh, a plug can only hold 16 individual bios, and then it does a
synchronous flush. Hmmm - perhaps that should also defer the flush
to the kblockd, because if we are overrunning a plug then we've
already surrendered IO dispatch latency

So, in blk_mq_make_request(), can you do:

if (list_empty(&plug->mq_list))
trace_block_plug(q);
else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
list_add_tail(&rq->queuelist, &plug->mq_list);

To see if that defers all the swap IO to kblockd?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Jones
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

 > That sounds like a plan. Perhaps it would be useful to add a
 > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 > 8k) so that we get some indication that we're hitting a deep stack
 > but the system otherwise keeps functioning. That gives us some
 > motivation to keep stack usage down but isn't a fatal problem like
 > it is now

We have check_stack_usage() and DEBUG_STACK_USAGE for this.
Though it needs some tweaking if we move to 16K

I gave it a try yesterday, and noticed a spew of noisy warnings as soon
as I gave it a workload to chew on. (Moreso than usual with 8K stacks)

Dave



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 4:36 PM, Minchan Kim  wrote:
>
> I did the hacky test below to apply your idea and the result is overflow again.
> So, again, it argues for stack expansion. Otherwise, we should prevent
> swapout in direct reclaim.

So changing io_schedule() is bad, for the reasons I outlined elsewhere
(we use it for wait_for_page*() - see sleep_on_page()).

It's the congestion waiting where the io_schedule() should be avoided.

So maybe test a patch something like the attached.

NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
horrible things. It seems to compile, but I have absolutely no reason
to believe that it would work. I didn't actually test that this moves
anything at all to kblockd. So think of it as a concept patch that
*might* work, but as Dave said, there might also be other things that
cause unplugging and need some tough love.

   Linus
 mm/backing-dev.c | 28 ++++++++++++++++++----------
 mm/vmscan.c  |  4 +---
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 09d9591b7708..cb26b24c2da2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+	struct blk_plug *plug = current->plug;
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync,timeout);
 
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
@@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
@@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	}
 
 	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..1e524000b83e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 * avoid risk of stack overflow but only writeback
 				 * if many dirty pages have been encountered.
 				 */
-				if (page_is_file_cache(page) &&
-				    (!current_is_kswapd() ||
-				     !zone_is_reclaim_dirty(zone))) {
+				if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
/*
 * Immediately reclaim when written back.
 * Similar in principal to deactivate_page()


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 12:26 AM, Dave Chinner  wrote:
> >
> > What concerns me about both __alloc_pages_nodemask() and
> > kernel_map_pages is that when I look at the code I see functions
> > that have no obvious stack usage problem. However, the compiler is
> > producing functions with huge stack footprints and it's not at all
> > obvious when I read the code. So in this case I'm more concerned
> > that we have a major disconnect between the source code structure
> > and the code that the compiler produces...
> 
> I agree. In fact, this is the main reason that Minchan's call trace
> and this thread has actually convinced me that yes, we really do need
> to make x86-64 have a 16kB stack (well, 16kB allocation - there's
> still the thread info etc too).
> 
> Usually when we see the stack-smashing traces, they are because
> somebody did something stupid. In this case, there are certainly
> stupid details, and things I think we should fix, but there is *not*
> the usual red flag of "Christ, somebody did something _really_ wrong".
> 
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.
> 
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.

That sounds like a plan. Perhaps it would be useful to add a
WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
8k) so that we get some indication that we're hitting a deep stack
but the system otherwise keeps functioning. That gives us some
motivation to keep stack usage down but isn't a fatal problem like
it is now
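
Something along those lines, as a hedged sketch (helper name and threshold illustrative; not actual kernel code):

/* Warn loudly, once, when the current stack depth exceeds the old
 * 8kB limit.  Purely illustrative. */
static inline void warn_on_deep_stack(void)
{
	unsigned long sp = (unsigned long)&sp;	/* approximate stack pointer */
	unsigned long used = (unsigned long)task_stack_page(current)
				+ THREAD_SIZE - sp;

	WARN_ON_ONCE(used > 8 * 1024);
}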

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Linus,

On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 12:26 AM, Dave Chinner  wrote:
> >
> > What concerns me about both __alloc_pages_nodemask() and
> > kernel_map_pages is that when I look at the code I see functions
> > that have no obvious stack usage problem. However, the compiler is
> > producing functions with huge stack footprints and it's not at all
> > obvious when I read the code. So in this case I'm more concerned
> > that we have a major disconnect between the source code structure
> > and the code that the compiler produces...
> 
> I agree. In fact, this is the main reason that Minchan's call trace
> and this thread has actually convinced me that yes, we really do need
> to make x86-64 have a 16kB stack (well, 16kB allocation - there's
> still the thread info etc too).
> 
> Usually when we see the stack-smashing traces, they are because
> somebody did something stupid. In this case, there are certainly
> stupid details, and things I think we should fix, but there is *not*
> the usual red flag of "Christ, somebody did something _really_ wrong".
> 
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.

So, should I resend a patch w/o RFC in subject but with Acked-by from Dave?
Or will you do it yourself?

> 
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.
> 
>   Linus
> 

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Dave,

On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > Author: Jens Axboe 
> > Date:   Sat Apr 16 13:27:55 2011 +0200
> > 
> > block: let io_schedule() flush the plug inline
> > 
> > Linus correctly observes that the most important dispatch cases
> > are now done from kblockd, this isn't ideal for latency reasons.
> > The original reason for switching dispatches out-of-line was to
> > avoid too deep a stack, so by _only_ letting the "accidental"
> > flush directly in schedule() be guarded by offload to kblockd,
> > we should be able to get the best of both worlds.
> > 
> > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > and only use that from the schedule() path.
> > 
> > Signed-off-by: Jens Axboe 
> > 
> > And now we have too deep a stack due to unplugging from io_schedule()...
> 
> So, if we make io_schedule() push the plug list off to the kblockd
> like is done for schedule()
> 
> > > IOW, swap-out directly caused that extra 3kB of stack use in what was
> > > a deep call chain (due to memory allocation). I really don't
> > > understand why you are arguing anything else on a pure technicality.
> > >
> > > I thought you had some other argument for why swap was different, and
> > > against removing that "page_is_file_cache()" special case in
> > > shrink_page_list().
> > 
> > I've said in the past that swap is different to filesystem
> > ->writepage implementations because it doesn't require significant
> > stack to do block allocation and doesn't trigger IO deep in that
> > allocation stack. Hence it has much lower stack overhead than the
> > filesystem ->writepage implementations and so is much less likely to
> > have stack issues.
> > 
> > This stack overflow shows us that just the memory reclaim + IO
> > layers are sufficient to cause a stack overflow,
> 
>  we solve this problem directly by being able to remove the IO
> stack usage from the direct reclaim swap path.
> 
> IOWs, we don't need to turn swap off at all in direct reclaim
> because all the swap IO can be captured in a plug list and
> dispatched via kblockd. This could be done either by io_schedule()
> or a new blk_flush_plug_list() wrapper that pushes the work to
> kblockd...

I did the hacky test below to apply your idea and the result is overflow again.
So, again, it argues for stack expansion. Otherwise, we should prevent
swapout in direct reclaim.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..95f169e85dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
struct rq *rq = raw_rq();
+   struct blk_plug *plug = current->plug;
 
delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
-   blk_flush_plug(current);
+   if (plug)
+   blk_flush_plug_list(plug, true);
+
current->in_iowait = 1;
schedule();
current->in_iowait = 0;


[ 1209.764725] kworker/u24:0 (23627) used greatest stack depth: 304 bytes left
[ 1510.835509] kworker/u24:1 (25817) used greatest stack depth: 144 bytes left
[ 3701.482790] PANIC: double fault, error_code: 0x0
[ 3701.483297] CPU: 8 PID: 6117 Comm: kworker/u24:1 Not tainted 3.14.0+ #201
[ 3701.483980] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3701.484366] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[ 3701.484366] task: 8800353c41c0 ti: 88106000 task.ti: 88106000
[ 3701.484366] RIP: 0010:[]  [] __lock_acquire+0x170/0x1ca0
[ 3701.484366] RSP: :88105f58  EFLAGS: 00010046
[ 3701.484366] RAX: 0001 RBX: 8800353c41c0 RCX: 0002
[ 3701.484366] RDX:  RSI:  RDI: 81c4a1e0
[ 3701.484366] RBP: 88106048 R08: 0001 R09: 0001
[ 3701.484366] R10:  R11:  R12: 0001
[ 3701.484366] R13:  R14: 81c4a1e0 R15: 
[ 3701.484366] FS:  () GS:880037d0() 
knlGS:
[ 3701.484366] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3701.484366] CR2: 88105f48 CR3: 01c0b000 CR4: 06e0
[ 3701.484366] Stack:
[ 3701.484366] BUG: unable to handle kernel paging request at 88105f58
[ 3701.484366] IP: [] show_stack_log_lvl+0x134/0x1a0
[ 3701.484366] PGD 28c5067 PUD 28c6067 PMD 28c7067 PTE 80105060
[ 3701.484366] Thread overran stack, or stack corrupted
[ 3701.484366] Oops:  [#1] SMP DEBUG_PAGEALLOC
[ 3701.484366] Dumping ftrace buffer:
[ 3701.484366] -
[ 3701.484366]    <...>-6117    8d..4 3786719374us : stack_trace_call:
        Depth       Size   Location

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 12:26 AM, Dave Chinner  wrote:
>
> What concerns me about both __alloc_pages_nodemask() and
> kernel_map_pages is that when I look at the code I see functions
> that have no obvious stack usage problem. However, the compiler is
> producing functions with huge stack footprints and it's not at all
> obvious when I read the code. So in this case I'm more concerned
> that we have a major disconnect between the source code structure
> and the code that the compiler produces...

I agree. In fact, this is the main reason that Minchan's call trace
and this thread has actually convinced me that yes, we really do need
to make x86-64 have a 16kB stack (well, 16kB allocation - there's
still the thread info etc too).

Usually when we see the stack-smashing traces, they are because
somebody did something stupid. In this case, there are certainly
stupid details, and things I think we should fix, but there is *not*
the usual red flag of "Christ, somebody did something _really_ wrong".

So I'm not in fact arguing against Minchan's patch of upping
THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
remain one of my "we really need to be careful" issues, so while I am
basically planning on applying that patch, I _also_ want to make sure
that we fix the problems we do see and not just paper them over.

The 8kB stack has been somewhat restrictive and painful for a while,
and I'm ok with admitting that it is just getting _too_ damn painful,
but I don't want to just give up entirely when we have a known deep
stack case.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread One Thousand Gnomes
> Hmm, stupid question: what happens when 16K is not enough too, do we
> increase again? When do we stop increasing? 1M, 2M... ?

It's not a stupid question, it's IMHO the most important question

> Sounds like we want to make it a config option with a couple of sizes
> for everyone to be happy. :-)

At the moment it goes bang if you freakily get three layers of recursion
through allocations. But show me the proof we can't already hit four, or
five, or six  

We don't *need* to allocate tons of stack memory to each task just because
we might recursively allocate. We don't solve the problem by doing so
either. We at best fudge over it.

Why is *any* recursive memory allocation not ending up waiting for other
kernel worker threads to free up memory (beyond it being rather hard to
go and retrofit) ?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Wed, May 28, 2014 at 07:42:40PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 6:30 PM, Dave Chinner  wrote:
> >
> > You're focussing on the specific symptoms, not the bigger picture.
> > i.e. you're ignoring all the other "let's start IO" triggers in
> > direct reclaim. e.g there's two separate plug flush triggers in
> > shrink_inactive_list(), one of which is:
> 
> Fair enough. I certainly agree that we should look at the other cases here 
> too.
> 
> In fact, I also find it distasteful just how much stack space some of
> those VM routines are just using up on their own, never mind any
> actual IO paths at all. The fact that __alloc_pages_nodemask() uses
> 350 bytes of stackspace on its own is actually quite disturbing. The
> fact that kernel_map_pages() apparently has almost 400 bytes of stack
> is just crazy. Obviously that case only happens with
> CONFIG_DEBUG_PAGEALLOC, but still..

What concerns me about both __alloc_pages_nodemask() and
kernel_map_pages is that when I look at the code I see functions
that have no obvious stack usage problem. However, the compiler is
producing functions with huge stack footprints and it's not at all
obvious when I read the code. So in this case I'm more concerned
that we have a major disconnect between the source code structure
and the code that the compiler produces...

> > I'm not saying we shouldn't turn off swap from direct reclaim, just
> > that all we'd be doing by turning off swap is playing whack-a-stack
> > - the next report will simply be from one of the other direct
> > reclaim IO schedule points.
> 
> Playing whack-a-mole with this for a while might not be a bad idea,
> though. It's not like we will ever really improve unless we start
> whacking the worst cases. And it should still be a fairly limited
> number.

I guess I've been playing whack-a-stack for so long now and some of
the overruns have been so large I just don't see it as a viable
medium to long term solution.

> After all, historically, some of the cases we've played whack-a-mole
> on have been in XFS, so I'd think you'd be thrilled to see some other
> code get blamed this time around ;)

Blame shifting doesn't thrill me - I'm still at the pointy end of
stack overrun reports, and we've still got to do the hard work of
solving the problem. However, I am happy to see acknowledgement of
the problem so we can work out how to solve the issues...

> > Regardless of whether it is swap or something external queues the
> > bio on the plug, perhaps we should look at why it's done inline
> > rather than by kblockd, where it was moved because it was blowing
> > the stack from schedule():
> 
> So it sounds like we need to do this for io_schedule() too.
> 
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.
> 
> Of course, we do need to worry about latency for starting IO, but any
> of these kinds of memory-pressure writeback patterns are pretty much
> by definition not about the latency of one _particular_ IO, so they
> don't tend to be latency-sensitive. Quite the reverse: we start
> writeback and then end up waiting on something else altogether
> (possibly a writeback that got started much earlier).

*nod*

> swapout certainly is _not_ IO-latency-sensitive, especially these
> days. And while we _do_ want to throttle in direct reclaim, if it's
> about throttling I'd certainly think that it sounds quite reasonable
> to push any unplugging to kblockd than try to do that synchronously.
> If we are throttling in direct-reclaim, we need to slow things _down_
> for the writer, not worry about latency.

Right, we are adding latency to the caller by having to swap so
a small amount of additional IO dispatch latency for IO we aren't
going to wait directly on doesn't really matter at all.

> >That implies no IO in direct reclaim context
> > is safe - either from swap or io_schedule() unplugging. It also
> > lends a lot of weight to my assertion that the majority of the stack
> growth over the past couple of years has been occurring outside the
> > filesystems
> 
> I think Minchan's stack trace definitely backs you up on that. The
> filesystem part - despite that one ext4_writepages() function - is a
> very small part of the whole. It sits at about ~1kB of stack. Just the
> VM "top-level" writeback code is about as much, and then the VM page
> alloc/shrinking code when the filesystem needs memory is *twice* that,
> and then the block layer and the virtio code are another 1kB each.

*nod*

As i said early, look at this in the context of the bigger picture.
We can also have more stack using layers in the IO stack and/or more
stack-expensive layers. e.g.  it could be block -> dm -> md -> SCSI
-> mempool_alloc in that stack rather than block -> virtio ->

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Rusty Russell
Linus Torvalds  writes:
> Well, we've definitely have had some issues with deeper callchains
> with md, but I suspect virtio might be worse, and the new blk-mq code
> is likely worse in this respect too.

I looked at this; I've now got a couple of virtio core cleanups, and
I'm testing with Minchan's config and gcc versions now.

MST reported that gcc 4.8 is a better than 4.6, but I'll test that too.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > > [ cc XFS list ]
> > 
> > [and now there is a complete copy on the XFs list, I'll add my 2c]
> > 
> > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > > While I was playing with in-house patches under heavy memory pressure on
> > > > qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel stack
> > > > overflow.
> > > > 
> > > > When I investigated the problem, the callstack was a little bit deeper due
> > > > to the involvement of reclaim functions, but it was not the direct reclaim
> > > > path.
> > > > 
> > > > I tried to put some alloc/reclaim-related functions on a stack diet and
> > > > saved a hundred bytes or so, but the overflow didn't disappear; I just hit
> > > > another overflow via a deeper callstack on the reclaim/allocator path.
> > 
> > That's a no win situation. The stack overruns through ->writepage
> > we've been seeing with XFS over the past *4 years* are much larger
> > than a few bytes. The worst case stack usage on a virtio block
> > device was about 10.5KB of stack usage.
> > 
> > And, like this one, it came from the flusher thread as well. The
> > difference was that the allocation that triggered the reclaim path
> > you've reported occurred when 5k of the stack had already been
> > used...
> > 
> > > > Of course, we might sweep every site we have found to reduce
> > > > stack usage, but I'm not sure for how long that saves the world (surely,
> > > > lots of developers will start to add nice features which use stack
> > > > again), and if we consider more complex features in the I/O layer
> > > > and/or reclaim path, it might be better to increase the stack size
> > > > (meanwhile, stack usage on 64-bit machines doubled compared to 32-bit
> > > > while the stack has stuck at 8K. Hmm, that doesn't seem fair to me, and
> > > > arm64 already expanded to 16K.)
> > 
> > Yup, that's all been pointed out previously. 8k stacks were never
> > large enough to fit the linux IO architecture on x86-64, but nobody
> > outside filesystem and IO developers has been willing to accept that
> > argument as valid, despite regular stack overruns and filesystem
> > having to add workaround after workaround to prevent stack overruns.
> > 
> > That's why stuff like this appears in various filesystem's
> > ->writepage:
> > 
> > /*
> >  * Refuse to write the page out if we are called from reclaim 
> > context.
> >  *
> >  * This avoids stack overflows when called from deeply used stacks 
> > in
> >  * random callers for direct reclaim or memcg reclaim.  We 
> > explicitly
> >  * allow reclaim from kswapd as the stack usage there is relatively 
> > low.
> >  *
> >  * This should never happen except in the case of a VM regression so
> >  * warn about it.
> >  */
> > if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> > PF_MEMALLOC))
> > goto redirty;
> > 
> > That still doesn't guarantee us enough stack space to do writeback,
> > though, because memory allocation can occur when reading in metadata
> > needed to do delayed allocation, and so we could trigger GFP_NOFS
> > memory allocation from the flusher thread with 4-5k of stack already
> > consumed, so that would still overrun the stack.
> > 
> > So, a couple of years ago we started deferring half the writeback
> > stack usage to a worker thread (commit c999a22 "xfs: introduce an
> > allocation workqueue"), under the assumption that the worst stack
> > usage when we call memory allocation is around 3-3.5k of stack used.
> > We thought that would be safe, but the stack trace you've posted
> > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> > which means we're still screwed despite all the workarounds we have
> > in place.
> 
> The allocation and reclaim stack itself is only 2k per the stacktrace
> below.  What got us in this particular case is that we engaged a
> complicated block layer setup from within the allocation context in
> order to swap out a page.
> 
> In the past we disabled filesystem ->writepage from within the
> allocation context and deferred it to kswapd for stack reasons (see
> the WARN_ON_ONCE and the comment in your above quote), but I think we
> have to go further and do the same for even swap_writepage():
> 
> > > > I guess this topic was discussed several time so there might be
> > > > strong reason not to increase kernel stack size on x86_64, for me not
> > > > knowing so Ccing x86_64 maintainers, other MM guys and virtio
> > > > maintainers.
> > > >
> > > >         Depth       Size   Location    (51 entries)
> > > > 
> > > >0) 7696  16   lookup_address+0x28/0x30
> > > >1) 7680  16   _lookup_address_cpa.isra.3+0x3b/0x40
> > > >2) 7664  24   __change_page_attr_set_clr+0xe0/0xb50
> > > >3) 7640 392   kernel_map_pages+0x6c/0x120
> > > >4) 7248 256   get_page_from_freelist+0x489/0x920
> > > >5) 6992 352   __alloc_pages_nodemask+0x5e1/0xb20
> > > >6) 6640   8   alloc_pages_current+0x10f/0x1f0
> > > >7)

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Linus,

On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
 On Thu, May 29, 2014 at 12:26 AM, Dave Chinner da...@fromorbit.com wrote:
 
  What concerns me about both __alloc_pages_nodemask() and
  kernel_map_pages is that when I look at the code I see functions
  that have no obvious stack usage problem. However, the compiler is
  producing functions with huge stack footprints and it's not at all
  obvious when I read the code. So in this case I'm more concerned
  that we have a major disconnect between the source code structure
  and the code that the compiler produces...
 
 I agree. In fact, this is the main reason that Minchan's call trace
 and this thread has actually convinced me that yes, we really do need
 to make x86-64 have a 16kB stack (well, 16kB allocation - there's
 still the thread info etc too).
 
 Usually when we see the stack-smashing traces, they are because
 somebody did something stupid. In this case, there are certainly
 stupid details, and things I think we should fix, but there is *not*
 the usual red flag of "Christ, somebody did something _really_ wrong".
 
 So I'm not in fact arguing against Minchan's patch of upping
 THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
 remain one of my "we really need to be careful" issues, so while I am
 basically planning on applying that patch, I _also_ want to make sure
 that we fix the problems we do see and not just paper them over.

So, should I resend the patch w/o RFC in the subject but with an Acked-by from Dave?
Or will you do it yourself?

 
 The 8kB stack has been somewhat restrictive and painful for a while,
 and I'm ok with admitting that it is just getting _too_ damn painful,
 but I don't want to just give up entirely when we have a known deep
 stack case.
 
   Linus
 
 --
 To unsubscribe, send a message with 'unsubscribe linux-mm' in
 the body to majord...@kvack.org.  For more info on Linux MM,
 see: http://www.linux-mm.org/ .
 Don't email: em...@kvack.org

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
 On Thu, May 29, 2014 at 12:26 AM, Dave Chinner da...@fromorbit.com wrote:
 
  What concerns me about both __alloc_pages_nodemask() and
  kernel_map_pages is that when I look at the code I see functions
  that have no obvious stack usage problem. However, the compiler is
  producing functions with huge stack footprints and it's not at all
  obvious when I read the code. So in this case I'm more concerned
  that we have a major disconnect between the source code structure
  and the code that the compiler produces...
 
 I agree. In fact, this is the main reason that Minchan's call trace
 and this thread has actually convinced me that yes, we really do need
 to make x86-64 have a 16kB stack (well, 16kB allocation - there's
 still the thread info etc too).
 
 Usually when we see the stack-smashing traces, they are because
 somebody did something stupid. In this case, there are certainly
 stupid details, and things I think we should fix, but there is *not*
 the usual red flag of "Christ, somebody did something _really_ wrong".
 
 So I'm not in fact arguing against Minchan's patch of upping
 THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
 remain one of my "we really need to be careful" issues, so while I am
 basically planning on applying that patch, I _also_ want to make sure
 that we fix the problems we do see and not just paper them over.
 
 The 8kB stack has been somewhat restrictive and painful for a while,
 and I'm ok with admitting that it is just getting _too_ damn painful,
 but I don't want to just give up entirely when we have a known deep
 stack case.

That sounds like a plan. Perhaps it would be useful to add a
WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
8k) so that we get some indication that we're hitting a deep stack
but the system otherwise keeps functioning. That gives us some
motivation to keep stack usage down but isn't a fatal problem like
it is now
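
A minimal sketch of the check being proposed (illustrative only;
check_stack_depth() and STACK_WARN_DEPTH are made-up names, and the
threshold is as arbitrary as the one above):

#define STACK_WARN_DEPTH	(8 * 1024)

static inline void check_stack_depth(void)
{
	/*
	 * thread_info sits at the bottom of the stack on x86-64, so
	 * the address of a local variable tells us how deep we are.
	 */
	unsigned long sp = (unsigned long)&sp;
	unsigned long used = (unsigned long)current_thread_info()
				+ THREAD_SIZE - sp;

	WARN_ON_ONCE(used > STACK_WARN_DEPTH);
}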

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 4:36 PM, Minchan Kim minc...@kernel.org wrote:

 I did below hacky test to apply your idea and the result is overflow again.
 So, again it would second stack expansion. Otherwise, we should prevent
 swapout in direct reclaim.

So changing io_schedule() is bad, for the reasons I outlined elsewhere
(we use it for wait_for_page*() - see sleep_on_page()).

It's the congestion waiting where the io_schedule() should be avoided.

So maybe test a patch something like the attached.

NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
horrible things. It seems to compile, but I have absolutely no reason
to believe that it would work. I didn't actually test that this moves
anything at all to kblockd. So think of it as a concept patch that
*might* work, but as Dave said, there might also be other things that
cause unplugging and need some tough love.

   Linus
 mm/backing-dev.c | 28 ++--
 mm/vmscan.c  |  4 +---
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 09d9591b7708..cb26b24c2da2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include <linux/writeback.h>
 #include <linux/device.h>
 #include <trace/events/writeback.h>
+#include <linux/blkdev.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+   long ret;
+   DEFINE_WAIT(wait);
+   struct blk_plug *plug = current->plug;
+   wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+   if (plug)
+   blk_flush_plug_list(plug, true);
+   ret = schedule_timeout(timeout);
+   finish_wait(wqh, &wait);
+   return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync,timeout);
 
trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
jiffies_to_usecs(jiffies - start));
@@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
/*
 * If there is no congestion, or heavy congestion is not being
@@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
}
 
/* Sleep until uncongested or a write happens */
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync, timeout);
 
 out:
trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..1e524000b83e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 * avoid risk of stack overflow but only writeback
 * if many dirty pages have been encountered.
 */
-   if (page_is_file_cache(page) &&
-       (!current_is_kswapd() ||
-        !zone_is_reclaim_dirty(zone))) {
+   if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
/*
 * Immediately reclaim when written back.
 * Similar in principal to deactivate_page()


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Jones
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

  That sounds like a plan. Perhaps it would be useful to add a
  WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
  8k) so that we get some indication that we're hitting a deep stack
  but the system otherwise keeps functioning. That gives us some
  motivation to keep stack usage down but isn't a fatal problem like
  it is now

We have check_stack_usage() and DEBUG_STACK_USAGE for this.
Though it needs some tweaking if we move to 16K
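
For reference, the existing mechanism is roughly this (paraphrased
from kernel/exit.c of that era; treat it as a sketch rather than the
verbatim source). With CONFIG_DEBUG_STACK_USAGE enabled, every task
exit checks the stack low-water mark and quietly reports a new record:

static void check_stack_usage(void)
{
	static DEFINE_SPINLOCK(low_water_lock);
	static int lowest_to_date = THREAD_SIZE;
	unsigned long free;

	/*
	 * stack_not_used() scans up from the stack base for the first
	 * byte that was ever written, i.e. the low-water mark.
	 */
	free = stack_not_used(current);
	if (free >= lowest_to_date)
		return;

	spin_lock(&low_water_lock);
	if (free < lowest_to_date) {
		printk(KERN_WARNING "%s (%d) used greatest stack depth: "
				"%lu bytes left\n", current->comm,
				task_pid_nr(current), free);
		lowest_to_date = free;
	}
	spin_unlock(&low_water_lock);
}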

I gave it a try yesterday, and noticed a spew of noisy warnings as soon
as I gave it a workload to chew on. (Moreso than usual with 8K stacks)

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
 Hello Dave,
 
 On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
  On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
   On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
   commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
   Author: Jens Axboe jax...@fusionio.com
   Date:   Sat Apr 16 13:27:55 2011 +0200
   
   block: let io_schedule() flush the plug inline
   
   Linus correctly observes that the most important dispatch cases
   are now done from kblockd, this isn't ideal for latency reasons.
   The original reason for switching dispatches out-of-line was to
    avoid too deep a stack, so by _only_ letting the "accidental"
   flush directly in schedule() be guarded by offload to kblockd,
   we should be able to get the best of both worlds.
   
   So add a blk_schedule_flush_plug() that offloads to kblockd,
   and only use that from the schedule() path.
   
   Signed-off-by: Jens Axboe jax...@fusionio.com
   
   And now we have too deep a stack due to unplugging from io_schedule()...
  
  So, if we make io_schedule() push the plug list off to the kblockd
  like is done for schedule()

 I did below hacky test to apply your idea and the result is overflow again.
 So, again it would second stack expansion. Otherwise, we should prevent
 swapout in direct reclaim.
 
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index f5c6635b806c..95f169e85dbe 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
  void __sched io_schedule(void)
  {
   struct rq *rq = raw_rq();
 + struct blk_plug *plug = current->plug;
  
   delayacct_blkio_start();
   atomic_inc(&rq->nr_iowait);
 - blk_flush_plug(current);
 + if (plug)
 + blk_flush_plug_list(plug, true);
 +
   current->in_iowait = 1;
   schedule();
   current->in_iowait = 0;

.

         Depth       Size   Location    (46 entries)

   0) 7200   8   _raw_spin_lock_irqsave+0x51/0x60
   1) 7192 296   get_page_from_freelist+0x886/0x920
   2) 6896 352   __alloc_pages_nodemask+0x5e1/0xb20
   3) 6544   8   alloc_pages_current+0x10f/0x1f0
   4) 6536 168   new_slab+0x2c5/0x370
   5) 6368   8   __slab_alloc+0x3a9/0x501
   6) 6360  80   __kmalloc+0x1cb/0x200
   7) 6280 376   vring_add_indirect+0x36/0x200
   8) 5904 144   virtqueue_add_sgs+0x2e2/0x320
   9) 5760 288   __virtblk_add_req+0xda/0x1b0
  10) 5472  96   virtio_queue_rq+0xd3/0x1d0
  11) 5376 128   __blk_mq_run_hw_queue+0x1ef/0x440
  12) 5248  16   blk_mq_run_hw_queue+0x35/0x40
  13) 5232  96   blk_mq_insert_requests+0xdb/0x160
  14) 5136 112   blk_mq_flush_plug_list+0x12b/0x140
  15) 5024 112   blk_flush_plug_list+0xc7/0x220
  16) 4912 128   blk_mq_make_request+0x42a/0x600
  17) 4784  48   generic_make_request+0xc0/0x100
  18) 4736 112   submit_bio+0x86/0x160
  19) 4624 160   __swap_writepage+0x198/0x230
  20) 4464  32   swap_writepage+0x42/0x90
  21) 4432 320   shrink_page_list+0x676/0xa80
  22) 4112 208   shrink_inactive_list+0x262/0x4e0
  23) 3904 304   shrink_lruvec+0x3e1/0x6a0

The device is supposed to be plugged here in shrink_lruvec().

Oh, a plug can only hold 16 individual bios, and then it does a
synchronous flush. Hmmm - perhaps that should also defer the flush
to the kblockd, because if we are overrunning a plug then we've
already surrendered IO dispatch latency

So, in blk_mq_make_request(), can you do:

	if (list_empty(&plug->mq_list))
		trace_block_plug(q);
	else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-		blk_flush_plug_list(plug, false);
+		blk_flush_plug_list(plug, true);
		trace_block_plug(q);
	}
	list_add_tail(&rq->queuelist, &plug->mq_list);

To see if that defers all the swap IO to kblockd?
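
(For context on why from_schedule=true defers the work: in the non-mq
path of that era the unplug lands in queue_unplugged(), roughly as
below -- abridged from block/blk-core.c, so treat it as a sketch:)

static void queue_unplugged(struct request_queue *q, unsigned int depth,
			    bool from_schedule)
	__releases(q->queue_lock)
{
	trace_block_unplug(q, depth, !from_schedule);

	if (from_schedule)
		blk_run_queue_async(q);	/* kicks kblockd's delayed work */
	else
		__blk_run_queue(q);	/* dispatches on the current stack */
	spin_unlock(q->queue_lock);
}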

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Linus,

On Thu, May 29, 2014 at 05:05:17PM -0700, Linus Torvalds wrote:
 On Thu, May 29, 2014 at 4:36 PM, Minchan Kim minc...@kernel.org wrote:
 
  I did below hacky test to apply your idea and the result is overflow again.
  So, again it would second stack expansion. Otherwise, we should prevent
  swapout in direct reclaim.
 
 So changing io_schedule() is bad, for the reasons I outlined elsewhere
 (we use it for wait_for_page*() - see sleep_on_page()).
 
 It's the congestion waiting where the io_schedule() should be avoided.
 
 So maybe test a patch something like the attached.
 
 NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
 horrible things. It seems to compile, but I have absolutely no reason
 to believe that it would work. I didn't actually test that this moves
 anything at all to kblockd. So think of it as a concept patch that
 *might* work, but as Dave said, there might also be other things that
 cause unplugging and need some tough love.
 
Linus

  mm/backing-dev.c | 28 ++--
  mm/vmscan.c  |  4 +---
  2 files changed, 19 insertions(+), 13 deletions(-)
 
 diff --git a/mm/backing-dev.c b/mm/backing-dev.c
 index 09d9591b7708..cb26b24c2da2 100644
 --- a/mm/backing-dev.c
 +++ b/mm/backing-dev.c
 @@ -11,6 +11,7 @@
  #include <linux/writeback.h>
  #include <linux/device.h>
  #include <trace/events/writeback.h>
 +#include <linux/blkdev.h>
  
  static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
  
 @@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
  }
  EXPORT_SYMBOL(set_bdi_congested);
  
 +static long congestion_timeout(int sync, long timeout)
 +{
 + long ret;
 + DEFINE_WAIT(wait);
 + struct blk_plug *plug = current->plug;
 + wait_queue_head_t *wqh = &congestion_wqh[sync];
 +
 + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 + if (plug)
 + blk_flush_plug_list(plug, true);
 + ret = schedule_timeout(timeout);
 + finish_wait(wqh, &wait);
 + return ret;
 +}
 +
  /**
   * congestion_wait - wait for a backing_dev to become uncongested
   * @sync: SYNC or ASYNC IO
 @@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
  {
   long ret;
   unsigned long start = jiffies;
 - DEFINE_WAIT(wait);
 - wait_queue_head_t *wqh = &congestion_wqh[sync];
  
 - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 - ret = io_schedule_timeout(timeout);
 - finish_wait(wqh, &wait);
 + ret = congestion_timeout(sync,timeout);
  
   trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
   jiffies_to_usecs(jiffies - start));
 @@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
  {
   long ret;
   unsigned long start = jiffies;
 - DEFINE_WAIT(wait);
 - wait_queue_head_t *wqh = &congestion_wqh[sync];
  
   /*
* If there is no congestion, or heavy congestion is not being
 @@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
   }
  
   /* Sleep until uncongested or a write happens */
 - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 - ret = io_schedule_timeout(timeout);
 - finish_wait(wqh, &wait);
 + ret = congestion_timeout(sync, timeout);
  
  out:
   trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index 32c661d66a45..1e524000b83e 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* avoid risk of stack overflow but only writeback
* if many dirty pages have been encountered.
*/
 - if (page_is_file_cache(page) &&
 -     (!current_is_kswapd() ||
 -      !zone_is_reclaim_dirty(zone))) {
 + if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
   /*
* Immediately reclaim when written back.
* Similar in principal to deactivate_page()

I guess this part, which avoids swapout in direct reclaim, would be the key
if this patch were to succeed. But it could make anon pages rotate back
to the head of the inactive list from the tail in the direct reclaim path
until kswapd can catch up. And kswapd can swap out anon pages from the tail
of the inactive LRU, so I suspect it could cause LRU churning as a side effect.

Anyway, I will queue it into testing machine since Rusty's test is done.
Thanks!

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
 On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
 
   That sounds like a plan. Perhaps it would be useful to add a
    WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
   8k) so that we get some indication that we're hitting a deep stack
   but the system otherwise keeps functioning. That gives us some
   motivation to keep stack usage down but isn't a fatal problem like
   it is now
 
 We have check_stack_usage() and DEBUG_STACK_USAGE for this.
 Though it needs some tweaking if we move to 16K

Right, but it doesn't throw loud warnings when a specific threshold
is reached - it just issues a quiet message when a process exits
telling you what the maximum was without giving us a stack to chew
on

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Jones
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
  On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
   On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
   
 That sounds like a plan. Perhaps it would be useful to add a
  WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 8k) so that we get some indication that we're hitting a deep stack
 but the system otherwise keeps functioning. That gives us some
 motivation to keep stack usage down but isn't a fatal problem like
 it is now
   
   We have check_stack_usage() and DEBUG_STACK_USAGE for this.
   Though it needs some tweaking if we move to 16K
  
  Right, but it doesn't throw loud warnings when a specific threshold
  is reached - it just issues a quiet message when a process exits
  telling you what the maximum was without giving us a stack to chew
  on

ah, right good point. That would be more useful.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
 On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
  On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
  
That sounds like a plan. Perhaps it would be useful to add a
 WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
8k) so that we get some indication that we're hitting a deep stack
but the system otherwise keeps functioning. That gives us some
motivation to keep stack usage down but isn't a fatal problem like
it is now
  
  We have check_stack_usage() and DEBUG_STACK_USAGE for this.
  Though it needs some tweaking if we move to 16K
 
 Right, but it doesn't throw loud warnings when a specific threshold
 is reached - it just issues a quiet message when a process exits
 telling you what the maximum was without giving us a stack to chew
 on

But we could enhance the message so that the user notices the risk,
as follows:

...
kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
there is some horrible stack hogger in your kernel. Please report it
to LKML and enable stacktrace to investigate who the culprit is.


 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com
 
 --
 To unsubscribe, send a message with 'unsubscribe linux-mm' in
 the body to majord...@kvack.org.  For more info on Linux MM,
 see: http://www.linux-mm.org/ .
 Don't email: em...@kvack.org

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:20 PM, Minchan Kim minc...@kernel.org wrote:

 I guess this part which avoid swapout in direct reclaim would be key
 if this patch were successful. But it could make anon pages rotate back
 into inactive's head from tail in direct reclaim path until kswapd can
 catch up. And kswapd can swap out anon pages from the tail of the inactive
 LRU so I suspect it could make side-effect LRU churning.

Oh, it could make bad things happen, no question about that.

That said, those bad things are what happens to shared mapped pages
today, so in that sense it's not new. But large dirty shared mmap's
have traditionally been a great way to really hurt our VM, so "it
should work as well as shared mapping pages" is definitely not a
ringing endorsement!

(Of course, *if* we can improve kswapd behavior for both swap-out and
shared dirty pages, that would then be a double win, so there is
_some_ argument for saying that we should aim to handle both kinds of
pages equally).

 Anyway, I will queue it into testing machine since Rusty's test is done.

You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Thu, May 29, 2014 at 05:31:42PM -0700, Linus Torvalds wrote:
 On Thu, May 29, 2014 at 5:20 PM, Minchan Kim minc...@kernel.org wrote:
 
  I guess this part, which avoids swapout in direct reclaim, would be key
  if this patch were successful. But it could make anon pages rotate back
  to the inactive list's head from its tail in the direct reclaim path
  until kswapd can catch up. And kswapd can swap out anon pages from the
  tail of the inactive LRU, so I suspect it could cause LRU churning as a
  side effect.
 
 Oh, it could make bad things happen, no question about that.
 
 That said, those bad things are what happens to shared mapped pages
 today, so in that sense it's not new. But large dirty shared mmap's
 have traditionally been a great way to really hurt our VM, so "it
 should work as well as shared mapping pages" is definitely not a
 ringing endorsement!

True.

 
 (Of course, *if* we can improve kswapd behavior for both swap-out and
 shared dirty pages, that would then be a double win, so there is
 _some_ argument for saying that we should aim to handle both kinds of
 pages equally).

Just an idea for preventing LRU churn:
we could return the pages to the tail of the inactive list instead of the
head if they aren't suitable in this context, have the reclaimer use the
cursor as the list_head instead of the LRU head when scanning for victim
pages, and record the cursor somewhere like the lruvec after shrinking is
done. It makes the VM code more complicated but is worth trying if we
take that approach.
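
A minimal sketch of that cursor idea, assuming a hypothetical cursor
field added to struct lruvec (the names are illustrative, not actual
kernel code):

static void rotate_unsuitable_page(struct lruvec *lruvec, struct page *page,
				   enum lru_list lru)
{
	/* put the page back at the tail instead of the head, so the
	 * next scan does not immediately revisit it */
	list_move_tail(&page->lru, &lruvec->lists[lru]);

	/* remember where this shrink pass stopped (hypothetical field),
	 * so the next scan can resume here instead of at the LRU head */
	lruvec->cursor = &page->lru;
}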

 
  Anyway, I will queue it on the testing machine since Rusty's test is done.
 
 You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Sure. While I was writing this, Rusty's test crashed, so I will try Dave's
patch, then yours except the vmscan.c part.

Thanks.

 
 Linus
 

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:50 PM, Minchan Kim minc...@kernel.org wrote:

 You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

 Sure. While I was writing this, Rusty's test crashed, so I will try Dave's
 patch, then yours except the vmscan.c part.

Looking more at Dave's patch (well, description), I don't think there
is any way in hell we can ever apply it. If I read it right, it will
cause all IO that overflows the max request count to go through the
scheduler to get it flushed. Maybe I misread it, but that's definitely
not acceptable. Maybe it's not noticeable with a slow rotational
device, but modern ssd hardware? No way.

I'd *much* rather slow down the swap side. Not real IO. So I think
my mm/vmscan.c patch is preferable (but yes, it might require some
work to make kswapd do better).

So you can try Dave's patch just to see what it does for stack depth,
but other than that it looks unacceptable unless I misread things.

 Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:05 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 So maybe test a patch something like the attached.

 NOTE! This is absolutely TOTALLY UNTESTED!

It's still untested, but I realized that the whole
blk_flush_plug_list(plug, true); thing is pointless, since
schedule() itself will do that for us.

So I think you can remove the

+   struct blk_plug *plug = current->plug;
+   if (plug)
+           blk_flush_plug_list(plug, true);

part from congestion_timeout().

Not that it should *hurt* to have it there, so I'm not bothering to
send a changed patch.

And again, no actual testing by me on any of this, just looking at the code.

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Fri, May 30, 2014 at 09:32:19AM +0900, Minchan Kim wrote:
 On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
  On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
   On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
   
 That sounds like a plan. Perhaps it would be useful to add a
 WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 8k) so that we get some indication that we're hitting a deep stack
 but the system otherwise keeps functioning. That gives us some
 motivation to keep stack usage down but isn't a fatal problem like
 it is now.
   
   We have check_stack_usage() and DEBUG_STACK_USAGE for this.
   Though it needs some tweaking if we move to 16K.
  
  Right, but it doesn't throw loud warnings when a specific threshold
  is reached - it just issues a quiet message when a process exits
  telling you what the maximum was without giving us a stack to chew
  on.
 
 But we could enhance the message so that it notifies the user of the
 risk, as follows:

 ...
 kworker/u24:1 (94) used greatest stack depth: 8K bytes left, which means
 there is some horrible stack hog in your kernel. Please report it to
 LKML and enable stacktrace to investigate who the culprit is.

That, however, presumes that a user can reproduce the problem on
demand. Experience tells me that this is the exception rather than
the norm for production systems, and so capturing the stack in real
time is IMO the only useful thing we could add...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
 On Thu, May 29, 2014 at 5:50 PM, Minchan Kim minc...@kernel.org wrote:
 
  You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
 
  Sure. While I was writing this, Rusty's test crashed, so I will try
  Dave's patch, then yours except the vmscan.c part.
 
 Looking more at Dave's patch (well, description), I don't think there
 is any way in hell we can ever apply it. If I read it right, it will
 cause all IO that overflows the max request count to go through the
 scheduler to get it flushed. Maybe I misread it, but that's definitely
 not acceptable. Maybe it's not noticeable with a slow rotational
 device, but modern ssd hardware? No way.
 
 I'd *much* rather slow down the swap side. Not real IO. So I think
 my mm/vmscan.c patch is preferable (but yes, it might require some
 work to make kswapd do better).
 
 So you can try Dave's patch just to see what it does for stack depth,
 but other than that it looks unacceptable unless I misread things.

Yeah, it's a hack, not intended as a potential solution.

I'm thinking, though, that plug flushing behaviour is actually
dependent on plugger context and there is no one correct
behaviour. If we are doing process driven IO, then we want to do
immediate dispatch, but for IO where stack is an issue or is for
bulk throughput (e.g. background writeback) async dispatch through
kblockd is desirable.

If the patch I sent solves the swap stack usage issue, then perhaps
we should look towards adding blk_plug_start_async() to pass such
hints to the plug flushing. I'd want to use the same behaviour in
__xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
probably also in mpage_writepages() for bulk data writeback in
WB_SYNC_NONE context...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
 On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
  Hello Dave,
  
  On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
   On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe jax...@fusionio.com
Date:   Sat Apr 16 13:27:55 2011 +0200

block: let io_schedule() flush the plug inline

Linus correctly observes that the most important dispatch cases
are now done from kblockd, this isn't ideal for latency reasons.
The original reason for switching dispatches out-of-line was to
avoid too deep a stack, so by _only_ letting the accidental
flush directly in schedule() be guarded by offload to kblockd,
we should be able to get the best of both worlds.

So add a blk_schedule_flush_plug() that offloads to kblockd,
and only use that from the schedule() path.

Signed-off-by: Jens Axboe jax...@fusionio.com

And now we have too deep a stack due to unplugging from io_schedule()...
   
So, if we make io_schedule() push the plug list off to kblockd,
    like is done for schedule()...
 
  I did the hacky test below to apply your idea, and the result is an
  overflow again. So, again, it would second stack expansion. Otherwise,
  we should prevent swapout in direct reclaim.
  
  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
  index f5c6635b806c..95f169e85dbe 100644
  --- a/kernel/sched/core.c
  +++ b/kernel/sched/core.c
  @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
   void __sched io_schedule(void)
   {
  struct rq *rq = raw_rq();
   +   struct blk_plug *plug = current->plug;
    
   delayacct_blkio_start();
   atomic_inc(&rq->nr_iowait);
   -   blk_flush_plug(current);
   +   if (plug)
   +           blk_flush_plug_list(plug, true);
   +
   current->in_iowait = 1;
   schedule();
   current->in_iowait = 0;
 
 .
 
  Depth    Size   Location    (46 entries)
 
0) 7200   8   _raw_spin_lock_irqsave+0x51/0x60
1) 7192 296   get_page_from_freelist+0x886/0x920
2) 6896 352   __alloc_pages_nodemask+0x5e1/0xb20
3) 6544   8   alloc_pages_current+0x10f/0x1f0
4) 6536 168   new_slab+0x2c5/0x370
5) 6368   8   __slab_alloc+0x3a9/0x501
6) 6360  80   __kmalloc+0x1cb/0x200
7) 6280 376   vring_add_indirect+0x36/0x200
8) 5904 144   virtqueue_add_sgs+0x2e2/0x320
9) 5760 288   __virtblk_add_req+0xda/0x1b0
   10) 5472  96   virtio_queue_rq+0xd3/0x1d0
   11) 5376 128   __blk_mq_run_hw_queue+0x1ef/0x440
   12) 5248  16   blk_mq_run_hw_queue+0x35/0x40
   13) 5232  96   blk_mq_insert_requests+0xdb/0x160
   14) 5136 112   blk_mq_flush_plug_list+0x12b/0x140
   15) 5024 112   blk_flush_plug_list+0xc7/0x220
   16) 4912 128   blk_mq_make_request+0x42a/0x600
   17) 4784  48   generic_make_request+0xc0/0x100
   18) 4736 112   submit_bio+0x86/0x160
   19) 4624 160   __swap_writepage+0x198/0x230
   20) 4464  32   swap_writepage+0x42/0x90
   21) 4432 320   shrink_page_list+0x676/0xa80
   22) 4112 208   shrink_inactive_list+0x262/0x4e0
   23) 3904 304   shrink_lruvec+0x3e1/0x6a0
 
 The device is supposed to be plugged here in shrink_lruvec().
 
 Oh, a plug can only hold 16 individual bios, and then it does a
 synchronous flush. Hmmm - perhaps that should also defer the flush
 to the kblockd, because if we are overrunning a plug then we've
 already surrendered IO dispatch latency...
 
 So, in blk_mq_make_request(), can you do:
 
   if (list_empty(&plug->mq_list))
           trace_block_plug(q);
   else if (request_count >= BLK_MAX_REQUEST_COUNT) {
 -         blk_flush_plug_list(plug, false);
 +         blk_flush_plug_list(plug, true);
           trace_block_plug(q);
   }
   list_add_tail(&rq->queuelist, &plug->mq_list);
 
 To see if that defers all the swap IO to kblockd?
 

Interim report,

I applied the patch below (we need to fix io_schedule_timeout due to mempool_alloc).

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
trace_block_plug(q);
else {
if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
}
diff --git 

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 6:58 PM, Dave Chinner da...@fromorbit.com wrote:

 If the patch I sent solves the swap stack usage issue, then perhaps
 we should look towards adding blk_plug_start_async() to pass such
 hints to the plug flushing. I'd want to use the same behaviour in
 __xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
 probably also in mpage_writepages() for bulk data writeback in
 WB_SYNC_NONE context...

Yeah, adding a flag to the plug about what kind of plug it is does
sound quite reasonable. It already has that magic field, so it could
easily be extended to have an async vs sync bit...

Of course, it's also possible that the unplugging code could just look
at the actual requests that are plugged to determine that, and maybe
we wouldn't even need to mark things specially. I don't think we ever
end up mixing reads and writes under the same plug, so "first request
is a write" is probably a good approximation for "async".
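
As a sketch of such a bit, against the struct blk_plug of this era (the
async field and the helper are hypothetical additions, not existing API):

struct blk_plug {
	unsigned long magic;		/* detect uninitialized use */
	struct list_head list;		/* plugged requests */
	struct list_head mq_list;	/* plugged blk-mq requests */
	struct list_head cb_list;	/* md requires an unplug callback */
	bool async;			/* hypothetical: flush via kblockd */
};

static inline void blk_start_plug_async(struct blk_plug *plug)
{
	blk_start_plug(plug);
	plug->async = true;	/* flushers would pass from_schedule=true */
}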

 Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 7:12 PM, Minchan Kim minc...@kernel.org wrote:

 Interim report,

 And the result is as follows. It reduces usage by about 800 bytes
 compared to my first report, but stack usage still seems high.
 The VM functions really need a diet.

Yes. And in this case uninlining things might actually help, because
it's not actually performing reclaim in the second case, so
inlining the reclaim code into that huge __alloc_pages_nodemask()
function means that it has the stack frame for all those cases even if
they don't actually get used.

That said, the way those functions are set up (with lots of arguments
passed from one to the other), not inlining will cause huge costs too
for the argument setup.

It really might be very good to create a "struct alloc_info" that
contains those shared arguments, and just pass a (const) pointer to
that around. Gcc would likely tend to be *much* better at generating
code for that, because it avoids a ton of temporaries being created
by function calls. Even when it's inlined, the argument itself ends up
being a new temporary internally, and I suspect one reason gcc
(especially your 4.6.3 version, apparently) generates those big spill
frames is because there's tons of these duplicate temporaries that
apparently don't get merged properly.
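
As a sketch of that idea, with fields drawn from the arguments these
functions currently pass around (the names are illustrative; mainline
later adopted a very similar struct alloc_context):

struct alloc_info {
	struct zonelist *zonelist;
	nodemask_t *nodemask;
	struct zone *preferred_zone;
	enum zone_type high_zoneidx;
	int classzone_idx;
	int migratetype;
};

/* callees would take one const pointer instead of half a dozen scalars */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
		       int alloc_flags, const struct alloc_info *ai);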

Ugh. I think I'll try looking at that tomorrow.

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread H. Peter Anvin
On 05/28/2014 07:42 PM, Linus Torvalds wrote:
> 
> And Minchan running out of stack is at least _partly_ due to his debug
> options (that DEBUG_PAGEALLOC thing as an extreme example, but I
> suspect there's a few other options there that generate more bloated
> data structures too).
> 

I have wondered if a larger stack would make sense as a debug option.
I'm just worried it will be abused.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Minchan Kim
On Wed, May 28, 2014 at 09:13:15PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 8:46 PM, Minchan Kim  wrote:
> >
> > Yes. For example, by marking __alloc_pages_slowpath noinline_for_stack,
> > we can save 176 bytes.
> 
> Well, but it will then call that __alloc_pages_slowpath() function,
> which has a 176-byte stack frame.. Plus the call frame.
> 
> Now, that only triggers for when the initial "__GFP_HARDWALL" case
> fails, but that's exactly what happens when we do need to do direct
> reclaim.
> 
> That said, I *have* seen cases where the gcc spill code got really
> confused, and simplifying the function (by not inlining excessively)
> actually causes a truly smaller stack overall, despite the actual call
> frames etc.  But I think the gcc people fixed the kinds of things that
> caused *that* kind of stack slot explosion.
> 
> And avoiding inlining can end up resulting in less stack, if the
> really deep parts don't happen to go through that function that got
> inlined (ie any call chain that wouldn't have gone through that
> "slowpath" function at all).
> 
> But in this case, __alloc_pages_slowpath() is where we end up doing
> the actual direct reclaim anyway, so just uninlining doesn't actually
> help. Although it would probably make the asm code more readable ;)

Indeed. :(

Actually I found there are other places to optimize.
For example, we can uninline try_preserve_large_page for
__change_page_attr_set_clr.
Although I'm not familiar with that part, I guess large pages would be
rare, so we could save 112 bytes.
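
As a sketch, the uninlining itself is a one-line annotation with the
kernel's existing noinline_for_stack marker (assuming the function's
signature in arch/x86/mm/pageattr.c of this era):

-static int
-try_preserve_large_page(pte_t *kpte, unsigned long address,
-			struct cpa_data *cpa)
+static int noinline_for_stack
+try_preserve_large_page(pte_t *kpte, unsigned long address,
+			struct cpa_data *cpa)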

before:

81042330 <__change_page_attr_set_clr>:
81042330:   e8 4b da 6a 00  callq  816efd80 
<__entry_text_start>
81042335:   55  push   %rbp
81042336:   48 89 e5mov%rsp,%rbp
81042339:   41 57   push   %r15
8104233b:   41 56   push   %r14
8104233d:   41 55   push   %r13
8104233f:   41 54   push   %r12
81042341:   49 89 fcmov%rdi,%r12
81042344:   53  push   %rbx
81042345:   48 81 ec f8 00 00 00sub$0xf8,%rsp
8104234c:   8b 47 20mov0x20(%rdi),%eax
8104234f:   89 b5 50 ff ff ff   mov%esi,-0xb0(%rbp)
81042355:   85 c0   test   %eax,%eax
81042357:   89 85 5c ff ff ff   mov%eax,-0xa4(%rbp)
8104235d:   0f 84 8c 06 00 00   je 810429ef 
<__change_page_attr_set_clr+0x6bf>

after:

81042740 <__change_page_attr_set_clr>:
81042740:   e8 bb d5 6a 00  callq  816efd00 
<__entry_text_start>
81042745:   55  push   %rbp
81042746:   48 89 e5mov%rsp,%rbp
81042749:   41 57   push   %r15
8104274b:   41 56   push   %r14
8104274d:   41 55   push   %r13
8104274f:   49 89 fdmov%rdi,%r13
81042752:   41 54   push   %r12
81042754:   53  push   %rbx
81042755:   48 81 ec 88 00 00 00sub$0x88,%rsp
8104275c:   8b 47 20mov0x20(%rdi),%eax
8104275f:   89 b5 70 ff ff ff   mov%esi,-0x90(%rbp)
81042765:   85 c0   test   %eax,%eax
81042767:   89 85 74 ff ff ff   mov%eax,-0x8c(%rbp)
8104276d:   0f 84 cb 02 00 00   je 81042a3e 
<__change_page_attr_set_clr+0x2fe>


And the patch below saves 96 bytes in shrink_lruvec.

That's not all, and I am not saying that optimizing every VM function is
the way to go, but I just want to note that we have room to optimize.
I will wait for more discussion and am happy to test it (I can reproduce
the problem in 1-2 hours if I have luck).

Thanks!

8115b560 <shrink_lruvec>:
8115b560:   e8 db 46 59 00  callq  816efc40 
<__entry_text_start>
8115b565:   55  push   %rbp
8115b566:   65 48 8b 04 25 40 bamov%gs:0xba40,%rax
8115b56d:   00 00
8115b56f:   48 89 e5mov%rsp,%rbp
8115b572:   41 57   push   %r15
8115b574:   41 56   push   %r14
8115b576:   45 31 f6xor%r14d,%r14d
8115b579:   41 55   push   %r13
8115b57b:   49 89 fdmov%rdi,%r13
8115b57e:   41 54   push   %r12
8115b580:   49 89 f4mov%rsi,%r12
8115b583:   49 83 c4 34 add$0x34,%r12
8115b587:   53  push   

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
On Wed, May 28, 2014 at 8:46 PM, Minchan Kim  wrote:
>
> Yes. For example, by marking __alloc_pages_slowpath noinline_for_stack,
> we can save 176 bytes.

Well, but it will then call that __alloc_pages_slowpath() function,
which has a 176-byte stack frame.. Plus the call frame.

Now, that only triggers for when the initial "__GFP_HARDWALL" case
fails, but that's exactly what happens when we do need to do direct
reclaim.

That said, I *have* seen cases where the gcc spill code got really
confused, and simplifying the function (by not inlining excessively)
actually causes a truly smaller stack overall, despite the actual call
frames etc.  But I think the gcc people fixed the kinds of things that
caused *that* kind of stack slot explosion.

And avoiding inlining can end up resulting in less stack, if the
really deep parts don't happen to go through that function that got
inlined (ie any call chain that wouldn't have gone through that
"slowpath" function at all).

But in this case, __alloc_pages_slowpath() is where we end up doing
the actual direct reclaim anyway, so just uninlining doesn't actually
help. Although it would probably make the asm code more readable ;)

  Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Minchan Kim
On Wed, May 28, 2014 at 10:44:48PM -0400, Steven Rostedt wrote:
> On Thu, 29 May 2014 10:09:40 +0900
> Minchan Kim  wrote:
> 
> > stacktrace reported that vring_add_indirect used 376 bytes and objdump says
> > 
> > 8141dc60 <vring_add_indirect>:
> > 8141dc60:   55  push   %rbp
> > 8141dc61:   48 89 e5mov%rsp,%rbp
> > 8141dc64:   41 57   push   %r15
> > 8141dc66:   41 56   push   %r14
> > 8141dc68:   41 55   push   %r13
> > 8141dc6a:   49 89 fdmov%rdi,%r13
> > 8141dc6d:   89 cf   mov%ecx,%edi
> > 8141dc6f:   48 c1 e7 04 shl$0x4,%rdi
> > 8141dc73:   41 54   push   %r12
> > 8141dc75:   49 89 d4mov%rdx,%r12
> > 8141dc78:   53  push   %rbx
> > 8141dc79:   48 89 f3mov%rsi,%rbx
> > 8141dc7c:   48 83 ec 28 sub$0x28,%rsp
> > 8141dc80:   8b 75 20mov0x20(%rbp),%esi
> > 8141dc83:   89 4d bcmov%ecx,-0x44(%rbp)
> > 8141dc86:   44 89 45 cc mov%r8d,-0x34(%rbp)
> > 8141dc8a:   44 89 4d c8 mov%r9d,-0x38(%rbp)
> > 8141dc8e:   83 e6 dd                and    $0xffffffdd,%esi
> > 8141dc91:   e8 7a d1 d7 ff  callq  8119ae10 
> > <__kmalloc>
> > 8141dc96:   48 85 c0test   %rax,%rax
> > 
> > So, it's *strange*.
> > 
> > I will add .config and .o.
> > Maybe someone might find what happens.
> > 
> 
> This is really bothering me. I'm trying to figure it out. We have from
> the stack trace:
> 
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)   
>   6456  80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)   
>   6376 376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)   
>   6000 144   virtqueue_add_sgs+0x2e2/0x320
> 
> The way the stack tracer works, is that when it detects a new max stack
> it calls save_stack_trace() to get the complete call chain from the
> stack. This should be rather accurate as it seems that your kernel was
> compiled with frame pointers (confirmed by the objdump as well as the
> config file). It then uses that stack trace that it got to examine the
> stack to find the locations of the saved return addresses and records
> them in an array (in your case, an array of 50 entries).
> 
> From your .o file:
> 
> vring_add_indirect + 0x36: (0x370 + 0x36 = 0x3a6)
> 
> 0370 <vring_add_indirect>:
> 
>  39e:   83 e6 dd                and    $0xffffffdd,%esi
>  3a1:   e8 00 00 00 00  callq  3a6 <vring_add_indirect+0x36>
> 3a2: R_X86_64_PC32  __kmalloc-0x4
>  3a6:   48 85 c0test   %rax,%rax
> 
> Definitely the return address to the call to __kmalloc. Then to
> determine the size of the stack frame, it is subtracted from the next
> one down. In this case, the location of virtqueue_add_sgs+0x2e2.
> 
> virtqueue_add_sgs + 0x2e2: (0x880 + 0x2e2 = 0xb62)
> 
> 0880 <virtqueue_add_sgs>:
> 
> b4f:   89 4c 24 08 mov%ecx,0x8(%rsp)
>  b53:   48 c7 c2 00 00 00 00mov$0x0,%rdx
> b56: R_X86_64_32S   .text+0x570
>  b5a:   44 89 d1mov%r10d,%ecx
>  b5d:   e8 0e f8 ff ff  callq  370 <vring_add_indirect>
>  b62:   85 c0   test   %eax,%eax
> 
> 
> Which is the return address of where vring_add_indirect was called.
> 
> The return address back to virtqueue_add_sgs was found at 6000 bytes of
> the stack. The return address back to vring_add_indirect was found at
> 6376 bytes from the top of the stack.
> 
> My question is, why were they so far apart? I see 6 words pushed
> (8 bytes each, for a total of 48 bytes), and a subtraction of the stack
> pointer of 0x28 (40 bytes) giving us a total of 88 bytes. Plus we need
> to add the push of the return address itself which would just give us
> 96 bytes for the stack frame. What is making this show 376 bytes??

That's what I want to know. :(

> 
> Looking more into this, I'm not sure I trust the top numbers anymore.
> kmalloc reports a stack frame of 80, and I'm coming up with 104
> (perhaps even 112). And slab_alloc only has 8. Something's messed up there.

Yes, it looks weird, but some of the upper functions in the callstack
match well, so one might think only the top functions of the callstack
were corrupted. But in the case of alloc_pages_current (8 bytes), it looks
weird too, yet it reports the same 8-byte value at both the upper and the
bottom of the callstack. :(
> 
> -- Steve
> 

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Minchan Kim
On Wed, May 28, 2014 at 09:09:23AM -0700, Linus Torvalds wrote:
> On Tue, May 27, 2014 at 11:53 PM, Minchan Kim  wrote:
> >
> > So, my stupid idea is just let's expand stack size and keep an eye
> > toward stack consumption on each kernel functions via stacktrace of ftrace.
> 
> We probably have to do this at some point, but that point is not -rc7.
> 
> And quite frankly, from the backtrace, I can only say: there is some
> bad shit there. The current VM stands out as a bloated pig:
> 
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0) 
> > 7696  16   lookup_address+0x28/0x30
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1) 
> > 7680  16   _lookup_address_cpa.isra.3+0x3b/0x40
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2) 
> > 7664  24   __change_page_attr_set_clr+0xe0/0xb50
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3) 
> > 7640 392   kernel_map_pages+0x6c/0x120
> > [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4) 
> > 7248 256   get_page_from_freelist+0x489/0x920
> > [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5) 
> > 6992 352   __alloc_pages_nodemask+0x5e1/0xb20
> 
> > [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  23) 
> > 4672 160   __swap_writepage+0x150/0x230
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  24) 
> > 4512  32   swap_writepage+0x42/0x90
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  25) 
> > 4480 320   shrink_page_list+0x676/0xa80
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  26) 
> > 4160 208   shrink_inactive_list+0x262/0x4e0
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  27) 
> > 3952 304   shrink_lruvec+0x3e1/0x6a0
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  28) 
> > 3648  80   shrink_zone+0x3f/0x110
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  29) 
> > 3568 128   do_try_to_free_pages+0x156/0x4c0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  30) 
> > 3440 208   try_to_free_pages+0xf7/0x1e0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  31) 
> > 3232 352   __alloc_pages_nodemask+0x783/0xb20
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  32) 
> > 2880   8   alloc_pages_current+0x10f/0x1f0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  33) 
> > 2872 200   __page_cache_alloc+0x13f/0x160
> 
> That __alloc_pages_nodemask() thing in particular looks bad. It
> actually seems not to be the usual "let's just allocate some
> structures on the stack" disease, it looks more like "lots of
> inlining, horrible calling conventions, and lots of random stupid
> variables".

Yes. For example, by marking __alloc_pages_slowpath noinline_for_stack,
we can save 176 bytes. And there are more places where we could reduce
stack consumption, but I thought that was a band-aid, although reducing
stack usage itself is desirable.

before:

81150600 <__alloc_pages_nodemask>:
81150600:   e8 fb f6 59 00  callq  816efd00 
<__entry_text_start>
81150605:   55  push   %rbp
81150606:   b8 e8 e8 00 00  mov$0xe8e8,%eax
8115060b:   48 89 e5mov%rsp,%rbp
8115060e:   41 57   push   %r15
81150610:   41 56   push   %r14
81150612:   41 be 22 01 32 01   mov$0x1320122,%r14d
81150618:   41 55   push   %r13
8115061a:   41 54   push   %r12
8115061c:   41 89 fcmov%edi,%r12d
8115061f:   53  push   %rbx
81150620:   48 81 ec 28 01 00 00sub$0x128,%rsp
81150627:   48 89 55 88 mov%rdx,-0x78(%rbp)
8115062b:   89 fa   mov%edi,%edx
8115062d:   83 e2 0fand$0xf,%edx
81150630:   48 89 4d 90 mov%rcx,-0x70(%rbp)

after:

81150600 <__alloc_pages_nodemask>:
81150600:   e8 7b f6 59 00  callq  816efc80 
<__entry_text_start>
81150605:   55  push   %rbp
81150606:   b8 e8 e8 00 00  mov$0xe8e8,%eax
8115060b:   48 89 e5mov%rsp,%rbp
8115060e:   41 57   push   %r15
81150610:   41 bf 22 01 32 01   mov$0x1320122,%r15d
81150616:   41 56   push   %r14
81150618:   41 55 

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
[ Crossed emails ]

On Wed, May 28, 2014 at 6:58 PM, Dave Chinner  wrote:
> On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
>>
>> And now we have too deep a stack due to unplugging from io_schedule()...
>
> So, if we make io_schedule() push the plug list off to the kblockd
> like is done for schedule()

We might have a few different cases.

The cases where we *do* care about latency is when we are waiting for
the IO ourselves (ie in wait_on_page() and friends), and those end up
using io_schedule() too.

So in *that* case we definitely have a latency argument for doing it
directly, and we shouldn't kick it off to kblockd. That's very much a
"get this started as soon as humanly possible".

But the "wait_iff_congested()" code that also uses io_schedule()
should push it out to kblockd, methinks.

>> This stack overflow shows us that just the memory reclaim + IO
>> layers are sufficient to cause a stack overflow,
>
>  we solve this problem directly by being able to remove the IO
> stack usage from the direct reclaim swap path.
>
> IOWs, we don't need to turn swap off at all in direct reclaim
> because all the swap IO can be captured in a plug list and
> dispatched via kblockd. This could be done either by io_schedule()
> or a new blk_flush_plug_list() wrapper that pushes the work to
> kblockd...

That would work. That said, I personally would not mind seeing that
"swap is special" go away, if possible. Because it can be behind a
filesystem too. Christ, even NFS (and people used to fight that tooth
and nail!) is back as a swap target...

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Rusty Russell
Minchan Kim  writes:
> On Wed, May 28, 2014 at 12:04:09PM +0300, Michael S. Tsirkin wrote:
>> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   
>> > 9) 6456  80   __kmalloc+0x1cb/0x200
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  
>> > 10) 6376 376   vring_add_indirect+0x36/0x200
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  
>> > 11) 6000 144   virtqueue_add_sgs+0x2e2/0x320

Hmm, we can actually skip the vring_add_indirect if we're hurting for
stack.  It just means the request will try to fit linearly in the ring,
rather than using indirect.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e443629f76d..496e727cefc8 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -184,6 +184,13 @@ static inline int vring_add_indirect(struct 
vring_virtqueue *vq,
return head;
 }
 
+/* The Morton Technique */
+static noinline bool stack_trouble(void)
+{
+   unsigned long sp = (unsigned long)&sp;
+
+   return sp - (sp & ~(THREAD_SIZE - 1)) < 3000;
+}
+
 static inline int virtqueue_add(struct virtqueue *_vq,
struct scatterlist *sgs[],
struct scatterlist *(*next)
@@ -226,7 +233,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
/* If the host supports indirect descriptor tables, and we have multiple
 * buffers, then go indirect. FIXME: tune this threshold */
-   if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
+   if (vq->indirect && total_sg > 1 && vq->vq.num_free &&
+       !stack_trouble()) {
head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
  total_in,
  out_sgs, in_sgs, gfp);


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Steven Rostedt
On Thu, 29 May 2014 10:09:40 +0900
Minchan Kim  wrote:

> stacktrace reported that vring_add_indirect used 376 bytes and objdump says
> 
> 8141dc60 <vring_add_indirect>:
> 8141dc60:   55  push   %rbp
> 8141dc61:   48 89 e5mov%rsp,%rbp
> 8141dc64:   41 57   push   %r15
> 8141dc66:   41 56   push   %r14
> 8141dc68:   41 55   push   %r13
> 8141dc6a:   49 89 fdmov%rdi,%r13
> 8141dc6d:   89 cf   mov%ecx,%edi
> 8141dc6f:   48 c1 e7 04 shl$0x4,%rdi
> 8141dc73:   41 54   push   %r12
> 8141dc75:   49 89 d4mov%rdx,%r12
> 8141dc78:   53  push   %rbx
> 8141dc79:   48 89 f3mov%rsi,%rbx
> 8141dc7c:   48 83 ec 28 sub$0x28,%rsp
> 8141dc80:   8b 75 20mov0x20(%rbp),%esi
> 8141dc83:   89 4d bcmov%ecx,-0x44(%rbp)
> 8141dc86:   44 89 45 cc mov%r8d,-0x34(%rbp)
> 8141dc8a:   44 89 4d c8 mov%r9d,-0x38(%rbp)
> 8141dc8e:   83 e6 dd                and    $0xffffffdd,%esi
> 8141dc91:   e8 7a d1 d7 ff  callq  8119ae10 
> <__kmalloc>
> 8141dc96:   48 85 c0test   %rax,%rax
> 
> So, it's *strange*.
> 
> I will add .config and .o.
> Maybe someone might find what happens.
> 

This is really bothering me. I'm trying to figure it out. We have from
the stack trace:

[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9) 
6456  80   __kmalloc+0x1cb/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10) 
6376 376   vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11) 
6000 144   virtqueue_add_sgs+0x2e2/0x320

The way the stack tracer works, is that when it detects a new max stack
it calls save_stack_trace() to get the complete call chain from the
stack. This should be rather accurate as it seems that your kernel was
compiled with frame pointers (confirmed by the objdump as well as the
config file). It then uses that stack trace that it got to examine the
stack to find the locations of the saved return addresses and records
them in an array (in your case, an array of 50 entries).

From your .o file:

vring_add_indirect + 0x36: (0x370 + 0x36 = 0x3a6)

0370 <vring_add_indirect>:

 39e:   83 e6 dd                and    $0xffffffdd,%esi
 3a1:   e8 00 00 00 00  callq  3a6 <vring_add_indirect+0x36>
3a2: R_X86_64_PC32  __kmalloc-0x4
 3a6:   48 85 c0test   %rax,%rax

Definitely the return address to the call to __kmalloc. Then to
determine the size of the stack frame, it is subtracted from the next
one down. In this case, the location of virtqueue_add_sgs+0x2e2.

virtqueue_add_sgs + 0x2e2: (0x880 + 0x2e2 = 0xb62)

0880 <virtqueue_add_sgs>:

b4f:   89 4c 24 08 mov%ecx,0x8(%rsp)
 b53:   48 c7 c2 00 00 00 00mov$0x0,%rdx
b56: R_X86_64_32S   .text+0x570
 b5a:   44 89 d1mov%r10d,%ecx
 b5d:   e8 0e f8 ff ff  callq  370 <vring_add_indirect>
 b62:   85 c0   test   %eax,%eax


Which is the return address of where vring_add_indirect was called.

The return address back to virtqueue_add_sgs was found at 6000 bytes of
the stack. The return address back to vring_add_indirect was found at
6376 bytes from the top of the stack.

My question is, why were they so far apart? I see 6 words pushed
(8 bytes each, for a total of 48 bytes), and a subtraction of the stack
pointer of 0x28 (40 bytes) giving us a total of 88 bytes. Plus we need
to add the push of the return address itself which would just give us
96 bytes for the stack frame. What is making this show 376 bytes??

Looking more into this, I'm not sure I trust the top numbers anymore.
kmalloc reports a stack frame of 80, and I'm coming up with 104
(perhaps even 112). And slab_alloc only has 8. Something's messed up there.
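
To make the arithmetic concrete, the tracer derives the Size column
roughly like this (a simplified sketch of the logic in
kernel/trace/trace_stack.c, not the verbatim code):

/* stack_index[i] is how far entry i's saved return address sits from
 * the top of the stack; a frame's size is the gap to the next entry */
for (i = 0; i < nr_entries; i++) {
	if (i + 1 < nr_entries)
		frame_size[i] = stack_index[i] - stack_index[i + 1];
	else
		frame_size[i] = stack_index[i];
}

So the 376 bytes blamed on vring_add_indirect is simply 6376 - 6000, the
gap between two saved return addresses, and anything unaccounted for in
between gets charged to that frame.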

-- Steve


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
On Wed, May 28, 2014 at 6:30 PM, Dave Chinner  wrote:
>
> You're focussing on the specific symptoms, not the bigger picture.
> i.e. you're ignoring all the other "let's start IO" triggers in
> direct reclaim. e.g there's two separate plug flush triggers in
> shrink_inactive_list(), one of which is:

Fair enough. I certainly agree that we should look at the other cases here too.

In fact, I also find it distasteful just how much stack space some of
those VM routines are just using up on their own, never mind any
actual IO paths at all. The fact that __alloc_pages_nodemask() uses
350 bytes of stackspace on its own is actually quite disturbing. The
fact that kernel_map_pages() apparently has almost 400 bytes of stack
is just crazy. Obviously that case only happens with
CONFIG_DEBUG_PAGEALLOC, but still..

> I'm not saying we shouldn't turn off swap from direct reclaim, just
> that all we'd be doing by turning off swap is playing whack-a-stack
> - the next report will simply be from one of the other direct
> reclaim IO schedule points.

Playing whack-a-mole with this for a while might not be a bad idea,
though. It's not like we will ever really improve unless we start
whacking the worst cases. And it should still be a fairly limited
number.

After all, historically, some of the cases we've played whack-a-mole
on have been in XFS, so I'd think you'd be thrilled to see some other
code get blamed this time around ;)

> Regardless of whether it is swap or something external queues the
> bio on the plug, perhaps we should look at why it's done inline
> rather than by kblockd, where it was moved because it was blowing
> the stack from schedule():

So it sounds like we need to do this for io_schedule() too.

In fact, we've generally found it to be a mistake every time we
"automatically" unblock some IO queue. And I'm not saying that because
of stack space, but because we've _often_ had the situation that eager
unblocking results in IO that could have been done as bigger requests.

Of course, we do need to worry about latency for starting IO, but any
of these kinds of memory-pressure writeback patterns are pretty much
by definition not about the latency of one _particular_ IO, so they
don't tent to be latency-sensitive. Quite the reverse: we start
writeback and then end up waiting on something else altogether
(possibly a writeback that got started much earlier).

swapout certainly is _not_ IO-latency-sensitive, especially these
days. And while we _do_ want to throttle in direct reclaim, if it's
about throttling I'd certainly think that it sounds quite reasonable
to push any unplugging to kblockd than try to do that synchronously.
If we are throttling in direct-reclaim, we need to slow things _down_
for the writer, not worry about latency.

> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.

Clearly it is true that it lacks the actual filesystem part needed for
the writeback. At the same time, Minchan's example is certainly a good
one of a filesystem (ext4) already being reasonably deep in its own
stack space when it then wants memory.

Looking at that callchain, I have to say that ext4 doesn't look
horrible compared to the whole block layer and virtio.. Yes,
"ext4_writepages()" is using almost 400 bytes of stack, and most of
that seems to be due to:

struct mpage_da_data mpd;
struct blk_plug plug;

which looks at least understandable (nothing like the mess in the VM
code where the stack usage is because gcc creates horrible spills)

> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow, which is something
> I've never seen before.

Well, we've definitely had some issues with deeper callchains
with md, but I suspect virtio might be worse, and the new blk-mq code
is likely worse in this respect too.

And Minchan running out of stack is at least _partly_ due to his debug
options (that DEBUG_PAGEALLOC thing as an extreme example, but I
suspect there's a few other options there that generate more bloated
data structures too).

> That implies no IO in direct reclaim context
> is safe - either from swap or io_schedule() unplugging. It also
> lends a lot of weight to my assertion that the majority of the stack
> growth over the past couple of years has been occurring outside the
> filesystems

I think Minchan's stack trace definitely backs you up on that. The
filesystem part - despite that one ext4_writepages() function - is a
very small part of the whole. It sits at about ~1kB of stack. Just the
VM "top-level" writeback code is about as much, and then the VM page
alloc/shrinking code when the filesystem needs memory is 

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> Author: Jens Axboe 
> Date:   Sat Apr 16 13:27:55 2011 +0200
> 
> block: let io_schedule() flush the plug inline
> 
> Linus correctly observes that the most important dispatch cases
> are now done from kblockd, this isn't ideal for latency reasons.
> The original reason for switching dispatches out-of-line was to
> avoid too deep a stack, so by _only_ letting the "accidental"
> flush directly in schedule() be guarded by offload to kblockd,
> we should be able to get the best of both worlds.
> 
> So add a blk_schedule_flush_plug() that offloads to kblockd,
> and only use that from the schedule() path.
> 
> Signed-off-by: Jens Axboe 
> 
> And now we have too deep a stack due to unplugging from io_schedule()...

So, if we make io_schedule() push the plug list off to kblockd,
like is done for schedule()...
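
As a sketch of that change, reusing the blk_schedule_flush_plug() helper
that schedule() already uses (untested, for illustration only):

void __sched io_schedule(void)
{
	struct rq *rq = raw_rq();

	delayacct_blkio_start();
	atomic_inc(&rq->nr_iowait);
	blk_schedule_flush_plug(current);	/* defer flush to kblockd */
	current->in_iowait = 1;
	schedule();
	current->in_iowait = 0;
	atomic_dec(&rq->nr_iowait);
	delayacct_blkio_end();
}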

> > IOW, swap-out directly caused that extra 3kB of stack use in what was
> > a deep call chain (due to memory allocation). I really don't
> > understand why you are arguing anything else on a pure technicality.
> >
> > I thought you had some other argument for why swap was different, and
> > against removing that "page_is_file_cache()" special case in
> > shrink_page_list().
> 
> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.
> 
> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow,

 we solve this problem directly by being able to remove the IO
stack usage from the direct reclaim swap path.

IOWs, we don't need to turn swap off at all in direct reclaim
because all the swap IO can be captured in a plug list and
dispatched via kblockd. This could be done either by io_schedule()
or a new blk_flush_plug_list() wrapper that pushes the work to
kblockd...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 3:31 PM, Dave Chinner  wrote:
> >
> > Indeed, the call chain reported here is not caused by swap issuing
> > IO.
> 
> Well, that's one way of reading that callchain.
> 
> I think it's the *wrong* way of reading it, though. Almost dishonestly
> so.

I guess you haven't met your insult quota for the day, Linus. :/

> Because very clearly, the swapout _is_ what causes the unplugging
> of the IO queue, and does so because it is allocating the BIO for its
> own IO.  The fact that that then fails (because of other IO's in
> flight), and causes *other* IO to be flushed, doesn't really change
> anything fundamental. It's still very much swap that causes that
> "let's start IO".

It is not rocket science to see how plugging outside memory
allocation context can lead to flushing that plug within direct
reclaim without having dispatched any IO at all from direct
reclaim...

You're focussing on the specific symptoms, not the bigger picture.
i.e. you're ignoring all the other "let's start IO" triggers in
direct reclaim. e.g there's two separate plug flush triggers in
shrink_inactive_list(), one of which is:

/*
 * Stall direct reclaim for IO completions if underlying BDIs or zone
 * is congested. Allow kswapd to continue until it starts encountering
 * unqueued dirty pages or cycling through the LRU too quickly.
 */
if (!sc->hibernation_mode && !current_is_kswapd())
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

I'm not saying we shouldn't turn off swap from direct reclaim, just
that all we'd be doing by turning off swap is playing whack-a-stack
- the next report will simply be from one of the other direct
reclaim IO schedule points.

Regardless of whether it is swap or something external queues the
bio on the plug, perhaps we should look at why it's done inline
rather than by kblockd, where it was moved because it was blowing
the stack from schedule():

commit f4af3c3d077a004762aaad052049c809fd8c6f0c
Author: Jens Axboe 
Date:   Tue Apr 12 14:58:51 2011 +0200

block: move queue run on unplug to kblockd

There are worries that we are now consuming a lot more stack in
some cases, since we potentially call into IO dispatch from
schedule() or io_schedule(). We can reduce this problem by moving
the running of the queue to kblockd, like the old plugging scheme
did as well.

This may or may not be a good idea from a performance perspective,
depending on how many tasks have queue plugs running at the same
time. For even the slightly contended case, doing just a single
queue run from kblockd instead of multiple runs directly from the
unpluggers will be faster.

Signed-off-by: Jens Axboe 


commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe 
Date:   Sat Apr 16 13:27:55 2011 +0200

block: let io_schedule() flush the plug inline

Linus correctly observes that the most important dispatch cases
are now done from kblockd, this isn't ideal for latency reasons.
The original reason for switching dispatches out-of-line was to
avoid too deep a stack, so by _only_ letting the "accidental"
flush directly in schedule() be guarded by offload to kblockd,
we should be able to get the best of both worlds.

So add a blk_schedule_flush_plug() that offloads to kblockd,
and only use that from the schedule() path.

Signed-off-by: Jens Axboe 

And now we have too deep a stack due to unplugging from io_schedule()...

> IOW, swap-out directly caused that extra 3kB of stack use in what was
> a deep call chain (due to memory allocation). I really don't
> understand why you are arguing anything else on a pure technicality.
>
> I thought you had some other argument for why swap was different, and
> against removing that "page_is_file_cache()" special case in
> shrink_page_list().

I've said in the past that swap is different to filesystem
->writepage implementations because it doesn't require significant
stack to do block allocation and doesn't trigger IO deep in that
allocation stack. Hence it has much lower stack overhead than the
filesystem ->writepage implementations and so is much less likely to
have stack issues.

This stack overflow shows us that just the memory reclaim + IO
layers are sufficient to cause a stack overflow, which is something
I've never seen before. That implies no IO in direct reclaim context
is safe - either from swap or io_schedule() unplugging. It also
lends a lot of weight to my assertion that the majority of the stack
growth over the past couple of years has been occurring outside the
filesystems...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread H. Peter Anvin
On 05/28/2014 04:17 PM, Dave Chinner wrote:
>>
>> You were the one calling it a canary.
> 
> That doesn't mean it's to blame. Don't shoot the messenger...
> 

Fair enough.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 03:42:18PM -0700, H. Peter Anvin wrote:
> On 05/28/2014 03:11 PM, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
>> We tried for 4K on x86-64, too, for quite a while as I recall.
>> The kernel stack is one of the main costs for a thread.  I would
> >> like to decouple struct thread_info from the kernel stack (PJ
> >> Waskewicz was working on that before he left Intel) but that
> >> doesn't buy us all that much.
> >>
> >> 8K additional per thread is a huge hit.  XFS has indeed always
> >> been a canary, or troublespot, I suspect because it originally
> >> came from another kernel where this was not an optimization
> >> target.
> > 
> > 
> > 
> > Always blame XFS for stack usage problems.
> > 
> > Even when the reported problem is from IO to an ext4 filesystem.
> > 
> 
> You were the one calling it a canary.

That doesn't mean it's to blame. Don't shoot the messenger...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread H. Peter Anvin
On 05/28/2014 03:11 PM, Dave Chinner wrote:
> On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
>> We tried for 4K on x86-64, too, for quite a while as I recall.
>> The kernel stack is one of the main costs for a thread.  I would
>> like to decouple struct thread_info from the kernel stack (PJ
>> Waskewicz was working on that before he left Intel) but that
>> doesn't buy us all that much.
>>
>> 8K additional per thread is a huge hit.  XFS has indeed always
>> been a canary, or troublespot, I suspect because it originally
>> came from another kernel where this was not an optimization
>> target.
> 
> 
> 
> Always blame XFS for stack usage problems.
> 
> Even when the reported problem is from IO to an ext4 filesystem.
> 

You were the one calling it a canary.

-hpa



