Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-10-20 Thread Andy Lutomirski
On 10/20/2014 07:00 PM, Dave Jones wrote:
> On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
>  > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin  wrote:
>  > >
>  > > If we removed struct thread_info from the stack allocation then one
>  > > could do a guard page below the stack.  Of course, we'd have to use IST
>  > > for #PF in that case, which makes it a non-production option.

Why is thread_info in the stack allocation anyway?  Every time I look at
the entry asm, one (minor) thing that contributes to general
brain-hurtingness / sense of horrified awe is the incomprehensible (to
me) split between task_struct and thread_info.

struct thread_info is at the bottom of the stack, right?  If we don't
want to merge it into task_struct, couldn't we stick it at the top of
the stack instead?  Anything that can overwrite the *top* of the stack
gives trivial user-controlled CPL0 execution regardless.
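
(For reference, the layout under discussion -- roughly as in
include/linux/sched.h of that era -- is:

union thread_union {
	struct thread_info thread_info;	/* sits at the lowest addresses */
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};

so thread_info and the stack share one allocation, and a big enough
overrun off the bottom of the stack tramples thread_info.)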

>  > 
>  > We could just have the guard page in between the stack and the
>  > thread_info, take a double fault, and then just map it back in on
>  > double fault.
>  > 
>  > That would give us 8kB of "normal" stack, with a very loud fault - and
>  > then an extra 7kB or so of stack (whatever the size of thread-info is)
>  > - after the first time it traps.
>  > 
>  > That said, it's still likely a non-production option due to the page
>  > table games we'd have to play at fork/clone time.

What's wrong with vmalloc?  Doesn't it already have guard pages?

(Also, we have a shiny hardware dirty bit, so we could relatively
cheaply check whether we're near the limit without any weird
#PF-in-weird-context issues.)

Also, muahaha, I've infected more people with the crazy idea that
intentional double-faults are okay.  Suckers!  Soon I'll have Linux
returning from interrupts with lret!  (IIRC Windows used to do
intentional *triple* faults on context switches, so this should be
considered entirely sensible.)

> 
> [thread necrophilia]
> 
> So digging this back up, it occurs to me that after we bumped to 16K,
> we never did anything like the debug stuff you suggested here.
> 
> The reason I'm bringing this up is that, over the last few weeks, I've
> been seeing things like:
> 
> [27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left
> 
> So we're now eating past that first 8KB in some situations.
> 
> Do we care? Or shall we only start worrying if it gets even deeper?

I would *love* to have an immediate, loud failure when we overrun the
stack.  This will unavoidably increase the number of TLB misses, but
that probably isn't so bad.

--Andy



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-10-20 Thread Dave Jones
On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
 > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin  wrote:
 > >
 > > If we removed struct thread_info from the stack allocation then one
 > > could do a guard page below the stack.  Of course, we'd have to use IST
 > > for #PF in that case, which makes it a non-production option.
 > 
 > We could just have the guard page in between the stack and the
 > thread_info, take a double fault, and then just map it back in on
 > double fault.
 > 
 > That would give us 8kB of "normal" stack, with a very loud fault - and
 > then an extra 7kB or so of stack (whatever the size of thread-info is)
 > - after the first time it traps.
 > 
 > That said, it's still likely a non-production option due to the page
 > table games we'd have to play at fork/clone time.

[thread necrophilia]

So digging this back up, it occurs to me that after we bumped to 16K,
we never did anything like the debug stuff you suggested here.

The reason I'm bringing this up is that, over the last few weeks, I've
been seeing things like:

[27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left

So we're now eating past that first 8KB in some situations.

Do we care? Or shall we only start worrying if it gets even deeper?

Dave



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-03 Thread Linus Torvalds
On Tue, Jun 3, 2014 at 6:28 AM, Rasmus Villemoes
 wrote:
> Possibly stupid question: Is it true that any given task can only be
> using one wait_queue_t at a time?

Nope.

Being on multiple different wait-queues is actually very common. The
obvious case is select/poll, but there are others. The more subtle
ones involve being on a wait-queue while doing something that can
cause nested waiting (iow, it's technically not wrong to be on a
wait-queue and then do a user space access, which obviously can end up
doing IO).

That said, the case of a single wait-queue entry is another common
case, and it wouldn't necessarily be wrong to have one pre-initialized
wait queue entry in the task structure for that special case, for when
you know that there is no possible nesting. And even if it *does*
nest, if it's the "leaf" entry it could be used for that innermost
nesting without worrying about other wait queue users (who use stack
allocations or actual explicit allocations like poll).

So it might certainly be worth looking at. In fact, it might be worth
having multiple per-thread entries, so that we could get rid of the
special on-stack allocation for poll too (and making one of them
special and not available to poll, to handle the "leaf waiter" case).

So it's not necessarily a bad idea at all, even if the general case
requires more than one (or a few) static per-thread allocations.

Anybody want to try to code it up?
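
Something like this, perhaps -- a minimal sketch with invented field
names (wait_entry, wait_entry_used), just to make the idea concrete:

struct task_struct {
	/* ... existing fields ... */
	wait_queue_t	wait_entry;	/* pre-initialized "leaf" entry */
	bool		wait_entry_used;
};

/* Innermost waiters grab the per-task entry; anything nested falls
 * back to the usual on-stack DEFINE_WAIT(). */
static inline wait_queue_t *get_leaf_wait_entry(void)
{
	if (current->wait_entry_used)
		return NULL;
	current->wait_entry_used = true;
	return &current->wait_entry;
}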

  Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-03 Thread Rasmus Villemoes
Possibly stupid question: Is it true that any given task can only be
using one wait_queue_t at a time? If so, would it be an idea to put a
wait_queue_t into struct task_struct [maybe union'ed with a struct
wait_bit_queue] and avoid allocating this 40-byte structure repeatedly
on the stack?

E.g., in one of Minchan's stack traces, there are two calls of
mempool_alloc (which itself declares a wait_queue_t) and one
try_to_free_pages (which is the only caller of throttle_direct_reclaim,
which in turn uses wait_event_interruptible_timeout and
wait_event_killable).

Rasmus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-03 Thread Konstantin Khlebnikov
On Sat, May 31, 2014 at 6:06 AM, Jens Axboe  wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
>>>
>>> Regardless of whether it is swap or something external queues the
>>> bio on the plug, perhaps we should look at why it's done inline
>>> rather than by kblockd, where it was moved because it was blowing
>>> the stack from schedule():
>>
>>
>> So it sounds like we need to do this for io_schedule() too.
>>
>> In fact, we've generally found it to be a mistake every time we
>> "automatically" unblock some IO queue. And I'm not saying that because
>> of stack space, but because we've _often_ had the situation that eager
>> unblocking results in IO that could have been done as bigger requests.
>
>
> We definitely need to auto-unplug on the schedule path, otherwise we run
> into all sorts of trouble. But making it async off the IO schedule path is
> fine. By definition, it's not latency sensitive if we are hitting unplug on
> schedule. I'm pretty sure it was run inline due to CPU concerns here, as
> running inline is certainly cheaper than punting to kblockd.
>
>
>> Looking at that callchain, I have to say that ext4 doesn't look
>> horrible compared to the whole block layer and virtio.. Yes,
>> "ext4_writepages()" is using almost 400 bytes of stack, and most of
>> that seems to be due to:
>>
>>  struct mpage_da_data mpd;
>>  struct blk_plug plug;
>
>
> Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the magic
> part of it, since that's never caught any bugs. Only saves 8 bytes, but may
> as well take that. Especially if we end up with nested plugs.

In the case of nested plugs, only the outermost one is used, right?
If so, it could be embedded into task_struct together with an integer
recursion counter. That would save a bit of precious stack and make the
code look cleaner.
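
Rough sketch of that (names invented, and the signatures would
obviously change since there would be no on-stack plug to pass in):

/* task_struct grows an embedded plug plus a depth counter; only the
 * outermost start/finish pair actually arms and flushes it. */
void blk_start_plug_nested(void)
{
	if (current->plug_depth++ == 0)
		blk_start_plug(&current->plug);
}

void blk_finish_plug_nested(void)
{
	if (--current->plug_depth == 0)
		blk_finish_plug(&current->plug);
}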


>
>
>> Well, we've definitely had some issues with deeper callchains
>> with md, but I suspect virtio might be worse, and the new blk-mq code
>> is likely worse in this respect too.
>
>
> I don't think blk-mq is worse than the older stack, in fact it should be
> better. The call chains are shorter, and a lot less cruft on the stack.
> Historically the stack issues have been nested devices, however. And for
> sync IO, we do run it inline, so if the driver chews up a lot of stack,
> well...
>
> Looks like I'm late here and the decision has been made to go 16K stacks,
> which I think is a good one. We've been living on the edge (and sometimes
> over) for heavy dm/md setups for a while, and have been patching around that
> fact in the IO stack for years.
>
>
> --
> Jens Axboe
>
>


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-06-02 Thread Dave Chinner
On Fri, May 30, 2014 at 08:06:53PM -0600, Jens Axboe wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
> >Well, we've definitely had some issues with deeper callchains
> >with md, but I suspect virtio might be worse, and the new blk-mq code
> >is likely worse in this respect too.
> 
> I don't think blk-mq is worse than the older stack, in fact it
> should be better. The call chains are shorter, and a lot less cruft
> on the stack. Historically the stack issues have been nested
> devices, however. And for sync IO, we do run it inline, so if the
> driver chews up a lot of stack, well...

Hi Jens - as we found out with the mm code, there's a significant
disconnect between what the code looks like (i.e. it may use very
little stack directly) and what the compiler is generating.

Before blk-mq:

  9) 3952 112   scsi_request_fn+0x4b/0x490
 10) 3840  32   __blk_run_queue+0x37/0x50
 11) 3808  64   queue_unplugged+0x39/0xb0
 12) 3744 112   blk_flush_plug_list+0x20b/0x240

Now with blk-mq:

  3) 4672  96   virtio_queue_rq+0xd2/0x1e0
  4) 4576 128   __blk_mq_run_hw_queue+0x1f0/0x3e0
  5) 4448  16   blk_mq_run_hw_queue+0x35/0x40
  6) 4432  80   blk_mq_insert_requests+0xc7/0x130
  7) 4352  96   blk_mq_flush_plug_list+0x129/0x140
  8) 4256 112   blk_flush_plug_list+0xe7/0x230

So previously flushing a plug used roughly 200 bytes of stack.  With
blk-mq, it's over 400 bytes. IOWs, blk-mq has more than doubled the
block layer stack usage...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Jens Axboe

On 2014-05-28 20:42, Linus Torvalds wrote:

Regardless of whether it is swap or something external queues the
bio on the plug, perhaps we should look at why it's done inline
rather than by kblockd, where it was moved because it was blowing
the stack from schedule():


So it sounds like we need to do this for io_schedule() too.

In fact, we've generally found it to be a mistake every time we
"automatically" unblock some IO queue. And I'm not saying that because
of stack space, but because we've _often_ had the situation that eager
unblocking results in IO that could have been done as bigger requests.


We definitely need to auto-unplug on the schedule path, otherwise we run 
into all sorts of trouble. But making it async off the IO schedule path 
is fine. By definition, it's not latency sensitive if we are hitting 
unplug on schedule. I'm pretty sure it was run inline due to CPU concerns
here, as running inline is certainly cheaper than punting to kblockd.



Looking at that callchain, I have to say that ext4 doesn't look
horrible compared to the whole block layer and virtio.. Yes,
"ext4_writepages()" is using almost 400 bytes of stack, and most of
that seems to be due to:

 struct mpage_da_data mpd;
 struct blk_plug plug;


Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the 
magic part of it, since that's never caught any bugs. Only saves 8 
bytes, but may as well take that. Especially if we end up with nested plugs.



Well, we've definitely had some issues with deeper callchains
with md, but I suspect virtio might be worse, and the new blk-mq code
is likely worse in this respect too.


I don't think blk-mq is worse than the older stack, in fact it should be 
better. The call chains are shorter, and a lot less cruft on the stack. 
Historically the stack issues have been nested devices, however. And for 
sync IO, we do run it inline, so if the driver chews up a lot of stack, 
well...


Looks like I'm late here and the decision has been made to go 16K 
stacks, which I think is a good one. We've been living on the edge (and 
sometimes over) for heavy dm/md setups for a while, and have been 
patching around that fact in the IO stack for years.



--
Jens Axboe



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Thu, May 29, 2014 at 9:37 PM, Linus Torvalds
 wrote:
>
> It really might be very good to create a "struct alloc_info" that
> contains those shared arguments, and just pass a (const) pointer to
> that around. [ .. ]
>
> Ugh. I think I'll try looking at that tomorrow.

I did look at it, but the thing is horrible. I started on this
something like ten times, and always ended up running away screaming.
Some things are truly fixed (notably "order"), but most things end up
changing subtly halfway through the callchain.

I might look at it some more later, but people may have noticed that I
decided to just apply Minchan's original patch in the meantime. I'll
make an rc8 this weekend..

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Andi Kleen
Linus Torvalds  writes:

> From a quick glance at the frame usage, some of it seems to be gcc
> being rather bad at stack allocation, but lots of it is just nasty
> spilling around the disgusting call-sites with tons or arguments. A
> _lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
> "spill frame pointer", afaik).

> Avoiding some inlining, and using a single flag value rather than the
> collection of "bool"s would probably help. But nothing really
> trivially obvious stands out.

One thing that may be worth playing around with is gcc's
--param large-stack-frame and --param large-stack-frame-growth.

These tell the inliner to stop inlining when too much stack
would be used.

We use -fconserve-stack, I believe. So perhaps smaller values than 100
and 400 would make sense to try.

   -fconserve-stack
   Attempt to minimize stack usage.  The compiler attempts to
   use less stack space, even if that makes the program slower.
   This option
   implies setting the large-stack-frame parameter to 100 and
   the large-stack-frame-growth parameter to 400.
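
For example, something like this (untested; the thresholds here are
picked out of thin air, just tighter than the -fconserve-stack
defaults of 100/400):

   make KCFLAGS="--param large-stack-frame=60 --param large-stack-frame-growth=200"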


-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/30/2014 10:24 AM, Dave Hansen wrote:
> On 05/30/2014 09:06 AM, Linus Torvalds wrote:
>> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin  wrote:
 That said, it's still likely a non-production option due to the page
 table games we'd have to play at fork/clone time.
>>>
>>> Still, seems much more tractable.
>>
>> We might be able to make it more attractive by having a small
>> front-end cache of the 16kB allocations with the second page unmapped.
>> That would at least capture the common "lots of short-lived processes"
>> case without having to do kernel page table work.
> 
> If we want to use 4k mappings, we'd need to move the stack over to using
> vmalloc() (or at least be out of the linear mapping) to avoid breaking
> up the linear map's page tables too much.  Doing that, we'd actually not
> _have_ to worry about fragmentation, and we could actually utilize the
> per-cpu-pageset code since we could be back to using order-0 pages.
> So it's at least not all a loss.  Although, I do remember playing with
> 4k stacks back in the 32-bit days and not getting much of a win with it.
> 
> We'd definitely need that cache, if for no other reason than the vmalloc/vmap
> code as-is isn't super-scalable.
> 

I don't think we want to use 4K mappings for production...

-hpa



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Dave Hansen
On 05/30/2014 09:06 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin  wrote:
>>> That said, it's still likely a non-production option due to the page
>>> table games we'd have to play at fork/clone time.
>>
>> Still, seems much more tractable.
> 
> We might be able to make it more attractive by having a small
> front-end cache of the 16kB allocations with the second page unmapped.
> That would at least capture the common "lots of short-lived processes"
> case without having to do kernel page table work.

If we want to use 4k mappings, we'd need to move the stack over to using
vmalloc() (or at least be out of the linear mapping) to avoid breaking
up the linear map's page tables too much.  Doing that, we'd actually not
_have_ to worry about fragmentation, and we could actually utilize the
per-cpu-pageset code since we could be back to using order-0 pages.
So it's at least not all a loss.  Although, I do remember playing with
4k stacks back in the 32-bit days and not getting much of a win with it.

We'd definitely need that cache, if for no other reason than the vmalloc/vmap
code as-is isn't super-scalable.




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin  wrote:
>>
>> That said, it's still likely a non-production option due to the page
>> table games we'd have to play at fork/clone time.
>
> Still, seems much more tractable.

We might be able to make it more attractive by having a small
front-end cache of the 16kB allocations with the second page unmapped.
That would at least capture the common "lots of short-lived processes"
case without having to do kernel page table work.
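
Something along these lines, maybe (a rough sketch, all names
invented; make_guarded_stack() is the hypothetical slow path that
allocates 16kB and unmaps the guard page):

#define STACK_CACHE_SIZE 4

/* Tiny per-cpu cache of stacks that already have their guard page
 * unmapped, so the common fork()/exit() churn skips the page-table
 * surgery. */
struct stack_cache {
	void *stacks[STACK_CACHE_SIZE];
	int nr;
};
static DEFINE_PER_CPU(struct stack_cache, stack_cache);

static void *alloc_guarded_stack(void)
{
	struct stack_cache *sc = &get_cpu_var(stack_cache);
	void *stack = sc->nr ? sc->stacks[--sc->nr] : NULL;

	put_cpu_var(stack_cache);
	if (!stack)
		stack = make_guarded_stack();	/* slow path */
	return stack;
}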

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/30/2014 08:41 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin  wrote:
>>
>> If we removed struct thread_info from the stack allocation then one
>> could do a guard page below the stack.  Of course, we'd have to use IST
>> for #PF in that case, which makes it a non-production option.
> 
> We could just have the guard page in between the stack and the
> thread_info, take a double fault, and then just map it back in on
> double fault.
> 

Oh, duh.  Right, much better.  Similar to the espfix64 hack, too.

> That would give us 8kB of "normal" stack, with a very loud fault - and
> then an extra 7kB or so of stack (whatever the size of thread-info is)
> - after the first time it traps.
> 
> That said, it's still likely a non-production option due to the page
> table games we'd have to play at fork/clone time.

Still, seems much more tractable.

I would still like struct thread_info off the stack allocation for other
reasons (as we have discussed in the past.)

-hpa



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin  wrote:
>
> If we removed struct thread_info from the stack allocation then one
> could do a guard page below the stack.  Of course, we'd have to use IST
> for #PF in that case, which makes it a non-production option.

We could just have the guard page in between the stack and the
thread_info, take a double fault, and then just map it back in on
double fault.

That would give us 8kB of "normal" stack, with a very loud fault - and
then an extra 7kB or so of stack (whatever the size of thread-info is)
- after the first time it traps.

That said, it's still likely a non-production option due to the page
table games we'd have to play at fork/clone time.

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Linus Torvalds
On Fri, May 30, 2014 at 2:48 AM, Richard Weinberger
 wrote:
>
> If we raise the stack size on x86_64 to 16k, what about i386?
> Besides the fact that most of you consider 32-bit dead and something that must die... ;)

x86-32 doesn't have nearly the same issue, since a large portion of
stack content tends to be pointers and longs. So it's not like it uses
half the stack, but a 32-bit environment does use a lot less stack
than a 64-bit one.

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread H. Peter Anvin
On 05/29/2014 06:34 PM, Dave Chinner wrote:
>> ...
>> "kworker/u24:1 (94) used greatest stack depth: 8K bytes left, it means
>> there is some horrible stack hogger in your kernel. Please report it
>> the LKML and enable stacktrace to investigate who is culprit"
> 
> That, however, presumes that a user can reproduce the problem on
> demand. Experience tells me that this is the exception rather than
> the norm for production systems, and so capturing the stack in real
> time is IMO the only useful thing we could add...
> 

If we removed struct thread_info from the stack allocation then one
could do a guard page below the stack.  Of course, we'd have to use IST
for #PF in that case, which makes it a non-production option.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-30 Thread Richard Weinberger
On Thu, May 29, 2014 at 5:24 PM, Linus Torvalds
 wrote:
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.
>
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.

If we raise the stack size on x86_64 to 16k, what about i386?
Besides the fact that most of you consider 32-bit dead and something that must die... ;)

-- 
Thanks,
//richard


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim  wrote:
> >>
> >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
> >
> > Sure. While I write this, Rusty's test has crashed, so I will try Dave's
> > patch, then yours except the vmscan.c part.
> 
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
> 
> I'd *much* rather slow down the swap side. Not "real IO". So I think
> my mm/vmscan.c patch is preferable (but yes, it might require some
> work to make kswapd do better).
> 
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.
> 
>  Linus

I tested the patch below and the result is endless OOM, although there
are lots of anon pages and plenty of free swap space.

I guess __alloc_pages_direct_reclaim couldn't make progress because of
the anon pages once the VM had dropped most of the file-backed pages,
and then it went to OOM.

---
 mm/backing-dev.c | 25 +++--
 mm/vmscan.c  |  4 +---
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7a4f29..2762b16404bd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -565,6 +566,18 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+   long ret;
+   DEFINE_WAIT(wait);
+
+   wait_queue_head_t *wqh = &congestion_wqh[sync];
+   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+   ret = schedule_timeout(timeout);
+   finish_wait(wqh, &wait);
+   return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -578,12 +591,8 @@ long congestion_wait(int sync, long timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync,timeout);
 
trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
jiffies_to_usecs(jiffies - start));
@@ -614,8 +623,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
/*
 * If there is no congestion, or heavy congestion is not being
@@ -635,9 +642,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
}
 
/* Sleep until uncongested or a write happens */
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync, timeout);
 
 out:
trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..e4ad7cd1885b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -975,9 +975,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 * avoid risk of stack overflow but only writeback
 * if many dirty pages have been encountered.
 */
-   if (page_is_file_cache(page) &&
-   (!current_is_kswapd() ||
-!zone_is_reclaim_dirty(zone))) {
+   if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
/*
 * Immediately reclaim when written back.
 * Similar in principal to deactivate_page()
-- 
1.9.2

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Final result,

I tested the patch below (Dave's suggestion plus some parts I modified)
on the machine and I couldn't see the problem any more (tested for 4hr;
I will queue it on the machine over the weekend for a long-running test
if I don't get a more enhanced version before leaving the office today),
but as I reported in the interim result, the VM's stack usage is still
high.

Anyway, that's another issue; we should really put the VM functions on
a diet (e.g., uninlining the slow path of __alloc_pages_nodemask, the
alloc_info idea from Linus, and more).

Looking forward to seeing the blk_plug_start_async way.
Thanks, Dave!

---
 block/blk-core.c| 2 +-
 block/blk-mq.c  | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
trace_block_plug(q);
else {
if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
if (list_empty(&plug->mq_list))
trace_block_plug(q);
else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
list_add_tail(&rq->queuelist, &plug->mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
-   blk_flush_plug(current);
+   blk_schedule_flush_plug(current);
current->in_iowait = 1;
schedule();
current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
-   blk_flush_plug(current);
+   blk_schedule_flush_plug(current);
current->in_iowait = 1;
ret = schedule_timeout(timeout);
current->in_iowait = 0;
-- 
1.9.2


On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > > Hello Dave,
> > > 
> > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > > Author: Jens Axboe 
> > > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > > 
> > > > > block: let io_schedule() flush the plug inline
> > > > > 
> > > > > Linus correctly observes that the most important dispatch cases
> > > > > are now done from kblockd, this isn't ideal for latency reasons.
> > > > > The original reason for switching dispatches out-of-line was to
> > > > > avoid too deep a stack, so by _only_ letting the "accidental"
> > > > > flush directly in schedule() be guarded by offload to kblockd,
> > > > > we should be able to get the best of both worlds.
> > > > > 
> > > > > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > > and only use that from the schedule() path.
> > > > > 
> > > > > Signed-off-by: Jens Axboe 
> > > > > 
> > > > > And now we have too deep a stack due to unplugging from 
> > > > > io_schedule()...
> > > > 
> > > > So, if we make io_schedule() push the plug list off to the kblockd
> > > > like is done for schedule()
> > 
> > > I did the hacky test below to apply your idea and the result is
> > > overflow again. So, again, this argues for the stack expansion.
> > > Otherwise, we should prevent swapout in direct reclaim.
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5c6635b806c..95f169e85dbe 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> > >  void __sched io_schedule(void)
> > >  {
> > >   struct rq *rq = raw_rq();
> > > + struct blk_plug *plug = current->plug;
> > >  
> > >   delayacct_blkio_start();
> > >   atomic_inc(&rq->nr_iowait);
> > > - blk_flush_plug(current);
> > > + if (plug)
> > > + blk_flush_plug_list(plug, true);
> > > +
> >

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 7:12 PM, Minchan Kim  wrote:
>
> Interim report,
>
> And the result is as follows: it reduces usage by about 800 bytes
> compared to my first report, but stack usage still seems to be high.
> The VM functions really need a diet.

Yes. And in this case uninlining things might actually help, because
it's not actually performing reclaim in the second case, so
inlining the reclaim code into that huge __alloc_pages_nodemask()
function means that it has the stack frame for all those cases even if
they don't actually get used.

That said, the way those functions are set up (with lots of arguments
passed from one to the other), not inlining will cause huge costs too
for the argument setup.

It really might be very good to create a "struct alloc_info" that
contains those shared arguments, and just pass a (const) pointer to
that around. Gcc would likely tend to be *much* better at generating
code for that, because it avoids a tons of temporaries being created
by function calls. Even when it's inlined, the argument itself ends up
being a new temporary internally, and I suspect one reason gcc
(especially your 4.6.3 version, apparently) generates those big spill
frames is because there's tons of these duplicate temporaries that
apparently don't get merged properly.

Ugh. I think I'll try looking at that tomorrow.
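
I.e., something like this (the field list is illustrative, based on the
arguments __alloc_pages_nodemask() and friends pass around today):

struct alloc_info {
	gfp_t gfp_mask;
	unsigned int order;
	struct zonelist *zonelist;
	nodemask_t *nodemask;
	struct zone *preferred_zone;
	int migratetype;
};

and then the whole chain takes a single "const struct alloc_info *ai"
instead of half a dozen arguments that get spilled at every call site.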

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 6:58 PM, Dave Chinner  wrote:
>
> If the patch I sent solves the swap stack usage issue, then perhaps
> we should look towards adding "blk_plug_start_async()" to pass such
> hints to the plug flushing. I'd want to use the same behaviour in
> __xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
> probably also in mpage_writepages() for bulk data writeback in
> WB_SYNC_NONE context...

Yeah, adding a flag to the plug about what kind of plug it is does
sound quite reasonable. It already has that "magic" field, it could
easily be extended to have an "async" vs "sync" bit to it.

Of course, it's also possible that the unplugging code could just look
at the actual requests that are plugged to determine that, and maybe
we wouldn't even need to mark things specially. I don't think we ever
end up mixing reads and writes under the same plug, so "first request
is a write" is probably a good approximation for "async".

 Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > Hello Dave,
> > 
> > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > Author: Jens Axboe 
> > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > 
> > > > block: let io_schedule() flush the plug inline
> > > > 
> > > > Linus correctly observes that the most important dispatch cases
> > > > are now done from kblockd, this isn't ideal for latency reasons.
> > > > The original reason for switching dispatches out-of-line was to
> > > > avoid too deep a stack, so by _only_ letting the "accidental"
> > > > flush directly in schedule() be guarded by offload to kblockd,
> > > > we should be able to get the best of both worlds.
> > > > 
> > > > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > and only use that from the schedule() path.
> > > > 
> > > > Signed-off-by: Jens Axboe 
> > > > 
> > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > 
> > > So, if we make io_schedule() push the plug list off to the kblockd
> > > like is done for schedule()
> 
> > I did the hacky test below to apply your idea and the result is overflow
> > again. So, again, this argues for the stack expansion. Otherwise, we
> > should prevent swapout in direct reclaim.
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f5c6635b806c..95f169e85dbe 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> >  void __sched io_schedule(void)
> >  {
> > struct rq *rq = raw_rq();
> > +   struct blk_plug *plug = current->plug;
> >  
> > delayacct_blkio_start();
> > atomic_inc(&rq->nr_iowait);
> > -   blk_flush_plug(current);
> > +   if (plug)
> > +   blk_flush_plug_list(plug, true);
> > +
> > current->in_iowait = 1;
> > schedule();
> > current->in_iowait = 0;
> 
> .
> 
> > DepthSize   Location(46 entries)
> >
> >   0) 7200   8   _raw_spin_lock_irqsave+0x51/0x60
> >   1) 7192 296   get_page_from_freelist+0x886/0x920
> >   2) 6896 352   __alloc_pages_nodemask+0x5e1/0xb20
> >   3) 6544   8   alloc_pages_current+0x10f/0x1f0
> >   4) 6536 168   new_slab+0x2c5/0x370
> >   5) 6368   8   __slab_alloc+0x3a9/0x501
> >   6) 6360  80   __kmalloc+0x1cb/0x200
> >   7) 6280 376   vring_add_indirect+0x36/0x200
> >   8) 5904 144   virtqueue_add_sgs+0x2e2/0x320
> >   9) 5760 288   __virtblk_add_req+0xda/0x1b0
> >  10) 5472  96   virtio_queue_rq+0xd3/0x1d0
> >  11) 5376 128   __blk_mq_run_hw_queue+0x1ef/0x440
> >  12) 5248  16   blk_mq_run_hw_queue+0x35/0x40
> >  13) 5232  96   blk_mq_insert_requests+0xdb/0x160
> >  14) 5136 112   blk_mq_flush_plug_list+0x12b/0x140
> >  15) 5024 112   blk_flush_plug_list+0xc7/0x220
> >  16) 4912 128   blk_mq_make_request+0x42a/0x600
> >  17) 4784  48   generic_make_request+0xc0/0x100
> >  18) 4736 112   submit_bio+0x86/0x160
> >  19) 4624 160   __swap_writepage+0x198/0x230
> >  20) 4464  32   swap_writepage+0x42/0x90
> >  21) 4432 320   shrink_page_list+0x676/0xa80
> >  22) 4112 208   shrink_inactive_list+0x262/0x4e0
> >  23) 3904 304   shrink_lruvec+0x3e1/0x6a0
> 
> The device is supposed to be plugged here in shrink_lruvec().
> 
> Oh, a plug can only hold 16 individual bios, and then it does a
> synchronous flush. Hmmm - perhaps that should also defer the flush
> to the kblockd, because if we are overrunning a plug then we've
> already surrendered IO dispatch latency...
> 
> So, in blk_mq_make_request(), can you do:
> 
>   if (list_empty(&plug->mq_list))
>   trace_block_plug(q);
>   else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> - blk_flush_plug_list(plug, false);
> + blk_flush_plug_list(plug, true);
>   trace_block_plug(q);
>   }
>   list_add_tail(&rq->queuelist, &plug->mq_list);
> 
> To see if that defers all the swap IO to kblockd?
> 

Interim report,

I applied the patch below (we need to fix io_schedule_timeout as well,
due to mempool_alloc):

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
trace_block_plug(q);
else {
if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim  wrote:
> >>
> >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
> >
> > Sure. While I write this, Rusty's test was crached so I will try Dave's 
> > patch,
> > them yours except vmscan.c part.
> 
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
> 
> I'd *much* rather slow down the swap side. Not "real IO". So I think
> my mm/vmscan.c patch is preferable (but yes, it might require some
> work to make kswapd do better).
> 
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.

Yeah, it's a hack, not intended as a potential solution.

I'm thinking, though, that plug flushing behaviour is actually
dependent on plugger context and there is no one "correct"
behaviour. If we are doing process driven IO, then we want to do
immediate dispatch, but for IO where stack is an issue or is for
bulk throughput (e.g. background writeback) async dispatch through
kblockd is desirable.

If the patch I sent solves the swap stack usage issue, then perhaps
we should look towards adding "blk_plug_start_async()" to pass such
hints to the plug flushing. I'd want to use the same behaviour in
__xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
probably also in mpage_writepages() for bulk data writeback in
WB_SYNC_NONE context...
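
A sketch of what that hint might look like (the "async" flag on
struct blk_plug is invented here, it doesn't exist today):

void blk_plug_start_async(struct blk_plug *plug)
{
	blk_start_plug(plug);
	plug->async = true;	/* flush via kblockd, not inline */
}

with blk_flush_plug_list() then forcing from_schedule behaviour
whenever plug->async is set.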

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Fri, May 30, 2014 at 09:32:19AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
> > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> > > 
> > >  > That sounds like a plan. Perhaps it would be useful to add a
> > >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
> > >  > 8k) so that we get some indication that we're hitting a deep stack
> > >  > but the system otherwise keeps functioning. That gives us some
> > >  > motivation to keep stack usage down but isn't a fatal problem like
> > >  > it is now
> > > 
> > > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> > > Though it needs some tweaking if we move to 16K
> > 
> > Right, but it doesn't throw loud warnings when a specific threshold
> > is reached - it just issues a quiet message when a process exits
> > telling you what the maximum was without giving us a stack to chew
> > on
> 
> But we could enhance the message so the user notices the risk,
> as follows:
> 
> ...
> "kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
> there is some horrible stack hog in your kernel. Please report it to
> LKML and enable stacktrace to investigate who the culprit is"

That, however, presumes that a user can reproduce the problem on
demand. Experience tells me that this is the exception rather than
the norm for production systems, and so capturing the stack in real
time is IMO the only useful thing we could add...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:05 PM, Linus Torvalds
 wrote:
>
> So maybe test a patch something like the attached.
>
> NOTE! This is absolutely TOTALLY UNTESTED!

It's still untested, but I realized that the whole
"blk_flush_plug_list(plug, true);" thing is pointless, since
schedule() itself will do that for us.

So I think you can remove the

+   struct blk_plug *plug = current->plug;
+   if (plug)
+   blk_flush_plug_list(plug, true);

part from congestion_timeout().

Not that it should *hurt* to have it there, so I'm not bothering to
send a changed patch.

And again, no actual testing by me on any of this, just looking at the code.

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:50 PM, Minchan Kim  wrote:
>>
>> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
>
> Sure. While I write this, Rusty's test has crashed, so I will try Dave's
> patch, then yours except the vmscan.c part.

Looking more at Dave's patch (well, description), I don't think there
is any way in hell we can ever apply it. If I read it right, it will
cause all IO that overflows the max request count to go through the
scheduler to get it flushed. Maybe I misread it, but that's definitely
not acceptable. Maybe it's not noticeable with a slow rotational
device, but modern ssd hardware? No way.

I'd *much* rather slow down the swap side. Not "real IO". So I think
my mm/vmscan.c patch is preferable (but yes, it might require some
work to make kswapd do better).

So you can try Dave's patch just to see what it does for stack depth,
but other than that it looks unacceptable unless I misread things.

 Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Thu, May 29, 2014 at 05:31:42PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:20 PM, Minchan Kim  wrote:
> >
> > I guess this part, which avoids swapout in direct reclaim, would be key
> > if this patch were successful. But it could make anon pages rotate back
> > from the tail to the head of the inactive list in the direct reclaim
> > path until kswapd can catch up. And kswapd can swap out anon pages from
> > the tail of the inactive LRU, so I suspect it could cause LRU churning
> > as a side effect.
> 
> Oh, it could make bad things happen, no question about that.
> 
> That said, those bad things are what happens to shared mapped pages
> today, so in that sense it's not new. But large dirty shared mmap's
> have traditionally been a great way to really hurt our VM, so "it
> should work as well as shared mapping pages" is definitely not a
> ringing endorsement!

True.

> 
> (Of course, *if* we can improve kswapd behavior for both swap-out and
> shared dirty pages, that would then be a double win, so there is
> _some_ argument for saying that we should aim to handle both kinds of
> pages equally).

Just an idea for preventing LRU churn.
We could return the pages to the tail of the inactive list instead of
the head when they aren't suitable in this context, have the reclaimer
scan for victim pages from a cursor (a list_head) instead of from the
LRU head, and record the cursor somewhere like the lruvec after
shrinking is done. It makes the VM code more complicated, but it would
be worth trying if we go that way.

> 
> > Anyway, I will queue it on the testing machine since Rusty's test is done.
> 
> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Sure. While I write this, Rusty's test has crashed, so I will try Dave's
patch, then yours except the vmscan.c part.

Thanks.

> 
> Linus
> 

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 5:20 PM, Minchan Kim  wrote:
>
> I guess this part, which avoids swapout in direct reclaim, would be key
> if this patch were successful. But it could make anon pages rotate back
> from the tail to the head of the inactive list in the direct reclaim
> path until kswapd can catch up. And kswapd can swap out anon pages from
> the tail of the inactive LRU, so I suspect it could cause LRU churning
> as a side effect.

Oh, it could make bad things happen, no question about that.

That said, those bad things are what happens to shared mapped pages
today, so in that sense it's not new. But large dirty shared mmap's
have traditionally been a great way to really hurt our VM, so "it
should work as well as shared mapping pages" is definitely not a
ringing endorsement!

(Of course, *if* we can improve kswapd behavior for both swap-out and
shared dirty pages, that would then be a double win, so there is
_some_ argument for saying that we should aim to handle both kinds of
pages equally).

> Anyway, I will queue it on the testing machine since Rusty's test is done.

You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
> On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> > 
> >  > That sounds like a plan. Perhaps it would be useful to add a
> >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
> >  > 8k) so that we get some indication that we're hitting a deep stack
> >  > but the system otherwise keeps functioning. That gives us some
> >  > motivation to keep stack usage down but isn't a fatal problem like
> >  > it is now
> > 
> > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> > Though it needs some tweaking if we move to 16K
> 
> Right, but it doesn't throw loud warnings when a specific threshold
> is reached - it just issues a quiet message when a process exits
> telling you what the maximum was without giving us a stack to chew
> on

But we could enhance the message so the user notices the risk,
as follows:

...
"kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
there is some horrible stack hog in your kernel. Please report it to
LKML and enable stacktrace to investigate who the culprit is"


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Jones
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
 > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
 > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
 > > 
 > >  > That sounds like a plan. Perhaps it would be useful to add a
 > >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 > >  > 8k) so that we get some indication that we're hitting a deep stack
 > >  > but the system otherwise keeps functioning. That gives us some
 > >  > motivation to keep stack usage down but isn't a fatal problem like
 > >  > it is now
 > > 
 > > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
 > > Though it needs some tweaking if we move to 16K
 > 
 > Right, but it doesn't throw loud warnings when a specific threshold
 > is reached - it just issues a quiet message when a process exits
 > telling you what the maximum was without giving us a stack to chew
 > on

Ah, right, good point. That would be more useful.

Dave



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> 
>  > That sounds like a plan. Perhaps it would be useful to add a
>  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
>  > 8k) so that we get some indication that we're hitting a deep stack
>  > but the system otherwise keeps functioning. That gives us some
>  > motivation to keep stack usage down but isn't a fatal problem like
>  > it is now
> 
> We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> Though it needs some tweaking if we move to 16K

Right, but it doesn't throw loud warnings when a specific threshold
is reached - it just issues a quiet message when a process exits
telling you what the maximum was without giving us a stack to chew
on

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Linus,

On Thu, May 29, 2014 at 05:05:17PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 4:36 PM, Minchan Kim  wrote:
> >
> > I did the hacky test below to apply your idea and the result is overflow again.
> > So, again, it would second the stack expansion. Otherwise, we should prevent
> > swapout in direct reclaim.
> 
> So changing io_schedule() is bad, for the reasons I outlined elsewhere
> (we use it for wait_for_page*() - see sleep_on_page()).
> 
> It's the congestion waiting where the io_schedule() should be avoided.
> 
> So maybe test a patch something like the attached.
> 
> NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
> horrible things. It seems to compile, but I have absolutely no reason
> to believe that it would work. I didn't actually test that this moves
> anything at all to kblockd. So think of it as a concept patch that
> *might* work, but as Dave said, there might also be other things that
> cause unplugging and need some tough love.
> 
>Linus

>  mm/backing-dev.c | 28 ++--
>  mm/vmscan.c  |  4 +---
>  2 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 09d9591b7708..cb26b24c2da2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
>  
> @@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  }
>  EXPORT_SYMBOL(set_bdi_congested);
>  
> +static long congestion_timeout(int sync, long timeout)
> +{
> + long ret;
> + DEFINE_WAIT(wait);
> + struct blk_plug *plug = current->plug;
> + wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> + if (plug)
> + blk_flush_plug_list(plug, true);
> + ret = schedule_timeout(timeout);
> + finish_wait(wqh, &wait);
> + return ret;
> +}
> +
>  /**
>   * congestion_wait - wait for a backing_dev to become uncongested
>   * @sync: SYNC or ASYNC IO
> @@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
>  {
>   long ret;
>   unsigned long start = jiffies;
> - DEFINE_WAIT(wait);
> - wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
> - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> - ret = io_schedule_timeout(timeout);
> - finish_wait(wqh, &wait);
> + ret = congestion_timeout(sync,timeout);
>  
>   trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
>   jiffies_to_usecs(jiffies - start));
> @@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>  {
>   long ret;
>   unsigned long start = jiffies;
> - DEFINE_WAIT(wait);
> - wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>   /*
>* If there is no congestion, or heavy congestion is not being
> @@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>   }
>  
>   /* Sleep until uncongested or a write happens */
> - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> - ret = io_schedule_timeout(timeout);
> - finish_wait(wqh, &wait);
> + ret = congestion_timeout(sync, timeout);
>  
>  out:
>   trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d66a45..1e524000b83e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>* avoid risk of stack overflow but only writeback
>* if many dirty pages have been encountered.
>*/
> - if (page_is_file_cache(page) &&
> - (!current_is_kswapd() ||
> -  !zone_is_reclaim_dirty(zone))) {
> + if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
>   /*
>* Immediately reclaim when written back.
>* Similar in principal to deactivate_page()

I guess this part, which avoids swapout in direct reclaim, would be the key
if this patch were successful. But it could make anon pages rotate back
from the tail to the head of the inactive list in the direct reclaim path
until kswapd can catch up. And kswapd can swap out anon pages from the tail
of the inactive LRU, so I suspect it could cause LRU churning as a side effect.

Anyway, I will queue it on the testing machine since Rusty's test is done.
Thanks!

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> Hello Dave,
> 
> On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > Author: Jens Axboe 
> > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > 
> > > block: let io_schedule() flush the plug inline
> > > 
> > > Linus correctly observes that the most important dispatch cases
> > > are now done from kblockd, this isn't ideal for latency reasons.
> > > The original reason for switching dispatches out-of-line was to
> > > avoid too deep a stack, so by _only_ letting the "accidental"
> > > flush directly in schedule() be guarded by offload to kblockd,
> > > we should be able to get the best of both worlds.
> > > 
> > > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > and only use that from the schedule() path.
> > > 
> > > Signed-off-by: Jens Axboe 
> > > 
> > > And now we have too deep a stack due to unplugging from io_schedule()...
> > 
> > So, if we make io_schedule() push the plug list off to the kblockd
> > like is done for schedule()

> I did the hacky test below to apply your idea and the result is overflow again.
> So, again, it would second the stack expansion. Otherwise, we should prevent
> swapout in direct reclaim.
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..95f169e85dbe 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
>  void __sched io_schedule(void)
>  {
>   struct rq *rq = raw_rq();
> + struct blk_plug *plug = current->plug;
>  
>   delayacct_blkio_start();
>   atomic_inc(&rq->nr_iowait);
> - blk_flush_plug(current);
> + if (plug)
> + blk_flush_plug_list(plug, true);
> +
>   current->in_iowait = 1;
>   schedule();
>   current->in_iowait = 0;

...

> Depth    Size   Location    (46 entries)
>
>   0) 7200   8   _raw_spin_lock_irqsave+0x51/0x60
>   1) 7192 296   get_page_from_freelist+0x886/0x920
>   2) 6896 352   __alloc_pages_nodemask+0x5e1/0xb20
>   3) 6544   8   alloc_pages_current+0x10f/0x1f0
>   4) 6536 168   new_slab+0x2c5/0x370
>   5) 6368   8   __slab_alloc+0x3a9/0x501
>   6) 6360  80   __kmalloc+0x1cb/0x200
>   7) 6280 376   vring_add_indirect+0x36/0x200
>   8) 5904 144   virtqueue_add_sgs+0x2e2/0x320
>   9) 5760 288   __virtblk_add_req+0xda/0x1b0
>  10) 5472  96   virtio_queue_rq+0xd3/0x1d0
>  11) 5376 128   __blk_mq_run_hw_queue+0x1ef/0x440
>  12) 5248  16   blk_mq_run_hw_queue+0x35/0x40
>  13) 5232  96   blk_mq_insert_requests+0xdb/0x160
>  14) 5136 112   blk_mq_flush_plug_list+0x12b/0x140
>  15) 5024 112   blk_flush_plug_list+0xc7/0x220
>  16) 4912 128   blk_mq_make_request+0x42a/0x600
>  17) 4784  48   generic_make_request+0xc0/0x100
>  18) 4736 112   submit_bio+0x86/0x160
>  19) 4624 160   __swap_writepage+0x198/0x230
>  20) 4464  32   swap_writepage+0x42/0x90
>  21) 4432 320   shrink_page_list+0x676/0xa80
>  22) 4112 208   shrink_inactive_list+0x262/0x4e0
>  23) 3904 304   shrink_lruvec+0x3e1/0x6a0

The device is supposed to be plugged here in shrink_lruvec().

Oh, a plug can only hold 16 individual bios, and then it does a
synchronous flush. Hmmm - perhaps that should also defer the flush
to the kblockd, because if we are overrunning a plug then we've
already surrendered IO dispatch latency

So, in blk_mq_make_request(), can you do:

if (list_empty(&plug->mq_list))
trace_block_plug(q);
else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-   blk_flush_plug_list(plug, false);
+   blk_flush_plug_list(plug, true);
trace_block_plug(q);
}
list_add_tail(&rq->queuelist, &plug->mq_list);

To see if that defers all the swap IO to kblockd?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Jones
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

 > That sounds like a plan. Perhaps it would be useful to add a
 > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 > 8k) so that we get some indication that we're hitting a deep stack
 > but the system otherwise keeps functioning. That gives us some
 > motivation to keep stack usage down but isn't a fatal problem like
 > it is now

We have check_stack_usage() and DEBUG_STACK_USAGE for this.
Though it needs some tweaking if we move to 16K

I gave it a try yesterday, and noticed a spew of noisy warnings as soon
as I gave it a workload to chew on. (More so than usual with 8K stacks.)

Dave



Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 4:36 PM, Minchan Kim  wrote:
>
> I did the hacky test below to apply your idea and the result is overflow again.
> So, again, it would second the stack expansion. Otherwise, we should prevent
> swapout in direct reclaim.

So changing io_schedule() is bad, for the reasons I outlined elsewhere
(we use it for wait_for_page*() - see sleep_on_page()).

It's the congestion waiting where the io_schedule() should be avoided.

So maybe test a patch something like the attached.

NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
horrible things. It seems to compile, but I have absolutely no reason
to believe that it would work. I didn't actually test that this moves
anything at all to kblockd. So think of it as a concept patch that
*might* work, but as Dave said, there might also be other things that
cause unplugging and need some tough love.

   Linus
 mm/backing-dev.c | 28 ++--
 mm/vmscan.c  |  4 +---
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 09d9591b7708..cb26b24c2da2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+   long ret;
+   DEFINE_WAIT(wait);
+   struct blk_plug *plug = current->plug;
+   wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+   if (plug)
+   blk_flush_plug_list(plug, true);
+   ret = schedule_timeout(timeout);
+   finish_wait(wqh, &wait);
+   return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync,timeout);
 
trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
jiffies_to_usecs(jiffies - start));
@@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
long ret;
unsigned long start = jiffies;
-   DEFINE_WAIT(wait);
-   wait_queue_head_t *wqh = &congestion_wqh[sync];
 
/*
 * If there is no congestion, or heavy congestion is not being
@@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
}
 
/* Sleep until uncongested or a write happens */
-   prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-   ret = io_schedule_timeout(timeout);
-   finish_wait(wqh, &wait);
+   ret = congestion_timeout(sync, timeout);
 
 out:
trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..1e524000b83e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 * avoid risk of stack overflow but only writeback
 * if many dirty pages have been encountered.
 */
-   if (page_is_file_cache(page) &&
-   (!current_is_kswapd() ||
-!zone_is_reclaim_dirty(zone))) {
+   if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
/*
 * Immediately reclaim when written back.
 * Similar in principal to deactivate_page()


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 12:26 AM, Dave Chinner  wrote:
> >
> > What concerns me about both __alloc_pages_nodemask() and
> > kernel_map_pages is that when I look at the code I see functions
> > that have no obvious stack usage problem. However, the compiler is
> > producing functions with huge stack footprints and it's not at all
> > obvious when I read the code. So in this case I'm more concerned
> > that we have a major disconnect between the source code structure
> > and the code that the compiler produces...
> 
> I agree. In fact, this is the main reason that Minchan's call trace
> and this thread has actually convinced me that yes, we really do need
> to make x86-64 have a 16kB stack (well, 16kB allocation - there's
> still the thread info etc too).
> 
> Usually when we see the stack-smashing traces, they are because
> somebody did something stupid. In this case, there are certainly
> stupid details, and things I think we should fix, but there is *not*
> the usual red flag of "Christ, somebody did something _really_ wrong".
> 
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.
> 
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.

That sounds like a plan. Perhaps it would be useful to add a
WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
8k) so that we get some indication that we're hitting a deep stack
but the system otherwise keeps functioning. That gives us some
motivation to keep stack usage down but isn't a fatal problem like
it is now
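
A sketch of what such a check could look like (illustrative only; the
helper name is invented, and the 8k threshold is as arbitrary as
suggested above):

static noinline void warn_deep_stack(void)
{
	unsigned long sp = (unsigned long)&sp;

	/* Stacks are THREAD_SIZE-aligned, so the low bits of the stack
	 * pointer are the bytes still free below the current frame;
	 * everything above that is already in use. */
	WARN_ON_ONCE(THREAD_SIZE - (sp & (THREAD_SIZE - 1)) > 8192);
}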

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Linus,

On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 12:26 AM, Dave Chinner  wrote:
> >
> > What concerns me about both __alloc_pages_nodemask() and
> > kernel_map_pages is that when I look at the code I see functions
> > that have no obvious stack usage problem. However, the compiler is
> > producing functions with huge stack footprints and it's not at all
> > obvious when I read the code. So in this case I'm more concerned
> > that we have a major disconnect between the source code structure
> > and the code that the compiler produces...
> 
> I agree. In fact, this is the main reason that Minchan's call trace
> and this thread has actually convinced me that yes, we really do need
> to make x86-64 have a 16kB stack (well, 16kB allocation - there's
> still the thread info etc too).
> 
> Usually when we see the stack-smashing traces, they are because
> somebody did something stupid. In this case, there are certainly
> stupid details, and things I think we should fix, but there is *not*
> the usual red flag of "Christ, somebody did something _really_ wrong".
> 
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.

So, should I resend the patch without RFC in the subject, but with an
Acked-by from Dave? Or will you do it yourself?

> 
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.
> 
>   Linus
> 

-- 
Kind regards,
Minchan Kim


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Minchan Kim
Hello Dave,

On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > Author: Jens Axboe 
> > Date:   Sat Apr 16 13:27:55 2011 +0200
> > 
> > block: let io_schedule() flush the plug inline
> > 
> > Linus correctly observes that the most important dispatch cases
> > are now done from kblockd, this isn't ideal for latency reasons.
> > The original reason for switching dispatches out-of-line was to
> > avoid too deep a stack, so by _only_ letting the "accidental"
> > flush directly in schedule() be guarded by offload to kblockd,
> > we should be able to get the best of both worlds.
> > 
> > So add a blk_schedule_flush_plug() that offloads to kblockd,
> > and only use that from the schedule() path.
> > 
> > Signed-off-by: Jens Axboe 
> > 
> > And now we have too deep a stack due to unplugging from io_schedule()...
> 
> So, if we make io_schedule() push the plug list off to the kblockd
> like is done for schedule()
> 
> > > IOW, swap-out directly caused that extra 3kB of stack use in what was
> > > a deep call chain (due to memory allocation). I really don't
> > > understand why you are arguing anything else on a pure technicality.
> > >
> > > I thought you had some other argument for why swap was different, and
> > > against removing that "page_is_file_cache()" special case in
> > > shrink_page_list().
> > 
> > I've said in the past that swap is different to filesystem
> > ->writepage implementations because it doesn't require significant
> > stack to do block allocation and doesn't trigger IO deep in that
> > allocation stack. Hence it has much lower stack overhead than the
> > filesystem ->writepage implementations and so is much less likely to
> > have stack issues.
> > 
> > This stack overflow shows us that just the memory reclaim + IO
> > layers are sufficient to cause a stack overflow,
> 
>  we solve this problem directly by being able to remove the IO
> stack usage from the direct reclaim swap path.
> 
> IOWs, we don't need to turn swap off at all in direct reclaim
> because all the swap IO can be captured in a plug list and
> dispatched via kblockd. This could be done either by io_schedule()
> or a new blk_flush_plug_list() wrapper that pushes the work to
> kblockd...

I did the hacky test below to apply your idea and the result is overflow again.
So, again, it would second the stack expansion. Otherwise, we should prevent
swapout in direct reclaim.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..95f169e85dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
struct rq *rq = raw_rq();
+   struct blk_plug *plug = current->plug;
 
delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
-   blk_flush_plug(current);
+   if (plug)
+   blk_flush_plug_list(plug, true);
+
current->in_iowait = 1;
schedule();
current->in_iowait = 0;


[ 1209.764725] kworker/u24:0 (23627) used greatest stack depth: 304 bytes left
[ 1510.835509] kworker/u24:1 (25817) used greatest stack depth: 144 bytes left
[ 3701.482790] PANIC: double fault, error_code: 0x0
[ 3701.483297] CPU: 8 PID: 6117 Comm: kworker/u24:1 Not tainted 3.14.0+ #201
[ 3701.483980] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3701.484366] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[ 3701.484366] task: 8800353c41c0 ti: 88106000 task.ti: 
88106000
[ 3701.484366] RIP: 0010:[]  [] 
__lock_acquire+0x170/0x1ca0
[ 3701.484366] RSP: :88105f58  EFLAGS: 00010046
[ 3701.484366] RAX: 0001 RBX: 8800353c41c0 RCX: 0002
[ 3701.484366] RDX:  RSI:  RDI: 81c4a1e0
[ 3701.484366] RBP: 88106048 R08: 0001 R09: 0001
[ 3701.484366] R10:  R11:  R12: 0001
[ 3701.484366] R13:  R14: 81c4a1e0 R15: 
[ 3701.484366] FS:  () GS:880037d0() 
knlGS:
[ 3701.484366] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3701.484366] CR2: 88105f48 CR3: 01c0b000 CR4: 06e0
[ 3701.484366] Stack:
[ 3701.484366] BUG: unable to handle kernel paging request at 88105f58
[ 3701.484366] IP: [] show_stack_log_lvl+0x134/0x1a0
[ 3701.484366] PGD 28c5067 PUD 28c6067 PMD 28c7067 PTE 80105060
[ 3701.484366] Thread overran stack, or stack corrupted
[ 3701.484366] Oops:  [#1] SMP DEBUG_PAGEALLOC
[ 3701.484366] Dumping ftrace buffer:
[ 3701.484366] -
[ 3701.484366]<...>-61178d..4 3786719374us : stack_trace_call: 
Depth

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Linus Torvalds
On Thu, May 29, 2014 at 12:26 AM, Dave Chinner  wrote:
>
> What concerns me about both __alloc_pages_nodemask() and
> kernel_map_pages is that when I look at the code I see functions
> that have no obvious stack usage problem. However, the compiler is
> producing functions with huge stack footprints and it's not at all
> obvious when I read the code. So in this case I'm more concerned
> that we have a major disconnect between the source code structure
> and the code that the compiler produces...

I agree. In fact, this is the main reason that Minchan's call trace
and this thread has actually convinced me that yes, we really do need
to make x86-64 have a 16kB stack (well, 16kB allocation - there's
still the thread info etc too).

Usually when we see the stack-smashing traces, they are because
somebody did something stupid. In this case, there are certainly
stupid details, and things I think we should fix, but there is *not*
the usual red flag of "Christ, somebody did something _really_ wrong".

So I'm not in fact arguing against Minchan's patch of upping
THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
remain one of my "we really need to be careful" issues, so while I am
basically planning on applying that patch, I _also_ want to make sure
that we fix the problems we do see and not just paper them over.

The 8kB stack has been somewhat restrictive and painful for a while,
and I'm ok with admitting that it is just getting _too_ damn painful,
but I don't want to just give up entirely when we have a known deep
stack case.

  Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread One Thousand Gnomes
> Hmm, stupid question: what happens when 16K is not enough too, do we
> increase again? When do we stop increasing? 1M, 2M... ?

It's not a stupid question, it's IMHO the most important question

> Sounds like we want to make it a config option with a couple of sizes
> for everyone to be happy. :-)

At the moment it goes bang if you freakily get three layers of recursion
through allocations. But show me the proof we can't already hit four, or
five, or six  

We don't *need* to allocate tons of stack memory to each task just because
we might recursively allocate. We don't solve the problem by doing so
either. We at best fudge over it.

Why is *any* recursive memory allocation not ending up waiting for other
kernel worker threads to free up memory (beyond it being rather hard to
go and retrofit) ?




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Dave Chinner
On Wed, May 28, 2014 at 07:42:40PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 6:30 PM, Dave Chinner  wrote:
> >
> > You're focussing on the specific symptoms, not the bigger picture.
> > i.e. you're ignoring all the other "let's start IO" triggers in
> > direct reclaim. e.g there's two separate plug flush triggers in
> > shrink_inactive_list(), one of which is:
> 
> Fair enough. I certainly agree that we should look at the other cases here 
> too.
> 
> In fact, I also find it distasteful just how much stack space some of
> those VM routines are just using up on their own, never mind any
> actual IO paths at all. The fact that __alloc_pages_nodemask() uses
> 350 bytes of stackspace on its own is actually quite disturbing. The
> fact that kernel_map_pages() apparently has almost 400 bytes of stack
> is just crazy. Obviously that case only happens with
> CONFIG_DEBUG_PAGEALLOC, but still..

What concerns me about both __alloc_pages_nodemask() and
kernel_map_pages is that when I look at the code I see functions
that have no obvious stack usage problem. However, the compiler is
producing functions with huge stack footprints and it's not at all
obvious when I read the code. So in this case I'm more concerned
that we have a major disconnect between the source code structure
and the code that the compiler produces...

> > I'm not saying we shouldn't turn of swap from direct reclaim, just
> > that all we'd be doing by turning off swap is playing whack-a-stack
> > - the next report will simply be from one of the other direct
> > reclaim IO schedule points.
> 
> Playing whack-a-mole with this for a while might not be a bad idea,
> though. It's not like we will ever really improve unless we start
> whacking the worst cases. And it should still be a fairly limited
> number.

I guess I've been playing whack-a-stack for so long now and some of
the overruns have been so large I just don't see it as a viable
medium to long term solution.

> After all, historically, some of the cases we've played whack-a-mole
> on have been in XFS, so I'd think you'd be thrilled to see some other
> code get blamed this time around ;)

Blame shifting doesn't thrill me - I'm still at the pointy end of
stack overrun reports, and we've still got to do the hard work of
solving the problem. However, I am happy to see acknowledgement of
the problem so we can work out how to solve the issues...

> > Regardless of whether it is swap or something external queues the
> > bio on the plug, perhaps we should look at why it's done inline
> > rather than by kblockd, where it was moved because it was blowing
> > the stack from schedule():
> 
> So it sounds like we need to do this for io_schedule() too.
> 
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.
> 
> Of course, we do need to worry about latency for starting IO, but any
> of these kinds of memory-pressure writeback patterns are pretty much
> by definition not about the latency of one _particular_ IO, so they
> don't tent to be latency-sensitive. Quite the reverse: we start
> writeback and then end up waiting on something else altogether
> (possibly a writeback that got started much earlier).

*nod*

> swapout certainly is _not_ IO-latency-sensitive, especially these
> days. And while we _do_ want to throttle in direct reclaim, if it's
> about throttling I'd certainly think that it sounds quite reasonable
> to push any unplugging to kblockd than try to do that synchronously.
> If we are throttling in direct-reclaim, we need to slow things _down_
> for the writer, not worry about latency.

Right, we are adding latency to the caller by having to swap so
a small amount of additional IO dispatch latency for IO we aren't
going to wait directly on doesn't really matter at all.

> >That implies no IO in direct reclaim context
> > is safe - either from swap or io_schedule() unplugging. It also
> > lends a lot of weight to my assertion that the majority of the stack
> > growth over the past couple of years has been ocurring outside the
> > filesystems
> 
> I think Minchan's stack trace definitely backs you up on that. The
> filesystem part - despite that one ext4_writepages() function - is a
> very small part of the whole. It sits at about ~1kB of stack. Just the
> VM "top-level" writeback code is about as much, and then the VM page
> alloc/shrinking code when the filesystem needs memory is *twice* that,
> and then the block layer and the virtio code are another 1kB each.

*nod*

As I said earlier, look at this in the context of the bigger picture.
We can also have more stack-using layers in the IO stack and/or more
stack-expensive layers, e.g. it could be block -> dm -> md -> SCSI
-> mempool_alloc in that stack rather than block -> virtio ->
kmallo

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-29 Thread Rusty Russell
Linus Torvalds  writes:
> Well, we've definitely have had some issues with deeper callchains
> with md, but I suspect virtio might be worse, and the new blk-mq code
> is lilkely worse in this respect too.

I looked at this; I've now got a couple of virtio core cleanups, and
I'm testing with Minchan's config and gcc versions now.

MST reported that gcc 4.8 is better than 4.6, but I'll test that too.

Cheers,
Rusty.


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Minchan Kim
On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > > [ cc XFS list ]
> > 
> > [and now there is a complete copy on the XFs list, I'll add my 2c]
> > 
> > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > > While I was playing with in-house patches under heavy memory pressure
> > > > on qemu-kvm, the 3.14 kernel crashed randomly. The reason was a kernel
> > > > stack overflow.
> > > > 
> > > > When I investigated the problem, the callstack was a little deeper
> > > > because it involved reclaim functions, but it was not the direct
> > > > reclaim path.
> > > > 
> > > > I tried to put some functions related to alloc/reclaim on a stack
> > > > diet and shaved a hundred or so bytes, but the overflow didn't
> > > > disappear; I still hit an overflow via another, deeper callstack on
> > > > the reclaim/allocator path.
> > 
> > That's a no win situation. The stack overruns through ->writepage
> > we've been seeing with XFS over the past *4 years* are much larger
> > than a few bytes. The worst case stack usage on a virtio block
> > device was about 10.5KB of stack usage.
> > 
> > And, like this one, it came from the flusher thread as well. The
> > difference was that the allocation that triggered the reclaim path
> > you've reported occurred when 5k of the stack had already been
> > used...
> > 
> > > > Of course, we might sweep every site we have found to reduce
> > > > stack usage, but I'm not sure how long that would save the world
> > > > (surely lots of developers will start to add nice features which
> > > > use stack again), and if we consider more complex features in the
> > > > I/O layer and/or the reclaim path, it might be better to increase
> > > > the stack size (meanwhile, stack usage on 64-bit machines doubled
> > > > compared to 32-bit while the stack stuck at 8K. Hmm, that doesn't
> > > > seem fair to me, and arm64 has already expanded to 16K.)
> > 
> > Yup, that's all been pointed out previously. 8k stacks were never
> > large enough to fit the linux IO architecture on x86-64, but nobody
> > outside filesystem and IO developers has been willing to accept that
> > argument as valid, despite regular stack overruns and filesystem
> > having to add workaround after workaround to prevent stack overruns.
> > 
> > That's why stuff like this appears in various filesystem's
> > ->writepage:
> > 
> > /*
> >  * Refuse to write the page out if we are called from reclaim 
> > context.
> >  *
> >  * This avoids stack overflows when called from deeply used stacks 
> > in
> >  * random callers for direct reclaim or memcg reclaim.  We 
> > explicitly
> >  * allow reclaim from kswapd as the stack usage there is relatively 
> > low.
> >  *
> >  * This should never happen except in the case of a VM regression so
> >  * warn about it.
> >  */
> > if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> > PF_MEMALLOC))
> > goto redirty;
> > 
> > That still doesn't guarantee us enough stack space to do writeback,
> > though, because memory allocation can occur when reading in metadata
> > needed to do delayed allocation, and so we could trigger GFP_NOFS
> > memory allocation from the flusher thread with 4-5k of stack already
> > consumed, so that would still overrun the stack.
> > 
> > So, a couple of years ago we started defering half the writeback
> > stack usage to a worker thread (commit c999a22 "xfs: introduce an
> > allocation workqueue"), under the assumption that the worst stack
> > usage when we call memory allocation is around 3-3.5k of stack used.
> > We thought that would be safe, but the stack trace you've posted
> > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> > which means we're still screwed despite all the workarounds we have
> > in place.
> 
> The allocation and reclaim stack itself is only 2k per the stacktrace
> below.  What got us in this particular case is that we engaged a
> complicated block layer setup from within the allocation context in
> order to swap out a page.
> 
> In the past we disabled filesystem ->writepage from within the
> allocation context and deferred it to kswapd for stack reasons (see
> the WARN_ON_ONCE and the comment in your above quote), but I think we
> have to go further and do the same for even swap_writepage():
> 
> > > > I guess this topic was discussed several times before, so there
> > > > might be a strong reason, unknown to me, not to increase the kernel
> > > > stack size on x86_64; so I'm Ccing the x86_64 maintainers, other MM
> > > > guys, and the virtio maintainers.
> > > >
> > > >  Depth    Size   Location    (51 entries)
> > > > 
> > > >0) 7696  16   lookup_address+0x28/0x30
> > > >1) 7680  16   _lookup_address_cpa.isra.3+0x3b/0x40
> > > >2) 7664  24   __change_page_attr_set_clr+0xe0/0xb50
> > > 
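
(Johannes's patch for swap_writepage() is not preserved in this archive;
a sketch of the guard he describes, mirroring the filesystem ->writepage
check quoted earlier, might look like the helper below. The function name
is invented for illustration.)

static bool swapout_from_direct_reclaim(void)
{
	/* PF_MEMALLOC set but PF_KSWAPD clear: we are in direct reclaim,
	 * already deep in some random caller's stack, so swap IO should
	 * be left to kswapd, where stack usage is relatively low. */
	return (current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC;
}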

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread H. Peter Anvin
On 05/28/2014 07:42 PM, Linus Torvalds wrote:
> 
> And Minchan running out of stack is at least _partly_ due to his debug
> options (that DEBUG_PAGEALLOC thing as an extreme example, but I
> suspect there's a few other options there that generate more bloated
> data structures too too).
> 

I have wondered if a larger stack would make sense as a debug option.
I'm just worried it will be abused.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Minchan Kim
On Wed, May 28, 2014 at 09:13:15PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 8:46 PM, Minchan Kim  wrote:
> >
> > Yes. For example, with __alloc_pages_slowpath marked noinline_for_stack,
> > we can reduce it by 176 bytes.
> 
> Well, but it will then call that __alloc_pages_slowpath() function,
> which has a 176-byte stack frame.. Plus the call frame.
> 
> Now, that only triggers for when the initial "__GFP_HARDWALL" case
> fails, but that's exactly what happens when we do need to do direct
> reclaim.
> 
> That said, I *have* seen cases where the gcc spill code got really
> confused, and simplifying the function (by not inlining excessively)
> actually causes a truly smaller stack overall, despite the actual call
> frames etc.  But I think the gcc people fixed the kinds of things that
> caused *that* kind of stack slot explosion.
> 
> And avoiding inlining can end up resulting in less stack, if the
> really deep parts don't happen to go through that function that got
> inlined (ie any call chain that wouldn't have gone through that
> "slowpath" function at all).
> 
> But in this case, __alloc_pages_slowpath() is where we end up doing
> the actual direct reclaim anyway, so just uninlining doesn't actually
> help. Although it would probably make the asm code more readable ;)

Indeed. :(

Actually, I found other places to optimize.
For example, we can uninline try_preserve_large_page in
__change_page_attr_set_clr.
Although I'm not familiar with that part, I guess large pages would be
rare, so we could save 112 bytes.

before:

81042330 <__change_page_attr_set_clr>:
81042330:   e8 4b da 6a 00  callq  816efd80 
<__entry_text_start>
81042335:   55  push   %rbp
81042336:   48 89 e5mov%rsp,%rbp
81042339:   41 57   push   %r15
8104233b:   41 56   push   %r14
8104233d:   41 55   push   %r13
8104233f:   41 54   push   %r12
81042341:   49 89 fcmov%rdi,%r12
81042344:   53  push   %rbx
81042345:   48 81 ec f8 00 00 00sub$0xf8,%rsp
8104234c:   8b 47 20mov0x20(%rdi),%eax
8104234f:   89 b5 50 ff ff ff   mov%esi,-0xb0(%rbp)
81042355:   85 c0   test   %eax,%eax
81042357:   89 85 5c ff ff ff   mov%eax,-0xa4(%rbp)
8104235d:   0f 84 8c 06 00 00   je 810429ef 
<__change_page_attr_set_clr+0x6bf>

after:

81042740 <__change_page_attr_set_clr>:
81042740:   e8 bb d5 6a 00  callq  816efd00 
<__entry_text_start>
81042745:   55  push   %rbp
81042746:   48 89 e5mov%rsp,%rbp
81042749:   41 57   push   %r15
8104274b:   41 56   push   %r14
8104274d:   41 55   push   %r13
8104274f:   49 89 fdmov%rdi,%r13
81042752:   41 54   push   %r12
81042754:   53  push   %rbx
81042755:   48 81 ec 88 00 00 00sub$0x88,%rsp
8104275c:   8b 47 20mov0x20(%rdi),%eax
8104275f:   89 b5 70 ff ff ff   mov%esi,-0x90(%rbp)
81042765:   85 c0   test   %eax,%eax
81042767:   89 85 74 ff ff ff   mov%eax,-0x8c(%rbp)
8104276d:   0f 84 cb 02 00 00   je 81042a3e 
<__change_page_attr_set_clr+0x2fe>


And the patch below saves 96 bytes from shrink_lruvec.

That would not be all, and I am not saying that optimizing every VM
function is the way to go, but I just want to note that we have room to
optimize. I will wait for more discussion and am happy to test (I can
reproduce it in 1-2 hours with luck).

Thanks!

8115b560 :
8115b560:   e8 db 46 59 00  callq  816efc40 
<__entry_text_start>
8115b565:   55  push   %rbp
8115b566:   65 48 8b 04 25 40 bamov%gs:0xba40,%rax
8115b56d:   00 00
8115b56f:   48 89 e5mov%rsp,%rbp
8115b572:   41 57   push   %r15
8115b574:   41 56   push   %r14
8115b576:   45 31 f6xor%r14d,%r14d
8115b579:   41 55   push   %r13
8115b57b:   49 89 fdmov%rdi,%r13
8115b57e:   41 54   push   %r12
8115b580:   49 89 f4mov%rsi,%r12
8115b583:   49 83 c4 34 add$0x34,%r12
8115b587:   53  push   %rbx
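
For reference, the annotation being measured in the before/after dumps
above is only a one-line marking; here is a toy sketch of how
noinline_for_stack is applied (illustrative, not the actual patch -- the
macro comes from include/linux/compiler.h and simply expands to noinline):

/* Keep a rarely-taken helper's locals and spills out of the hot
 * caller's stack frame. */
static noinline_for_stack int rare_slow_path(int x)
{
	int scratch[32];	/* charged to this frame, not the caller's */

	memset(scratch, 0, sizeof(scratch));
	scratch[0] = x;
	return scratch[0] + 1;
}

static int hot_path(int x)
{
	if (unlikely(x < 0))
		return rare_slow_path(x);
	return x + 1;
}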

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
On Wed, May 28, 2014 at 8:46 PM, Minchan Kim  wrote:
>
> Yes. For example, with __alloc_pages_slowpath marked noinline_for_stack,
> we can reduce it by 176 bytes.

Well, but it will then call that __alloc_pages_slowpath() function,
which has a 176-byte stack frame.. Plus the call frame.

Now, that only triggers for when the initial "__GFP_HARDWALL" case
fails, but that's exactly what happens when we do need to do direct
reclaim.

That said, I *have* seen cases where the gcc spill code got really
confused, and simplifying the function (by not inlining excessively)
actually causes a truly smaller stack overall, despite the actual call
frames etc.  But I think the gcc people fixed the kinds of things that
caused *that* kind of stack slot explosion.

And avoiding inlining can end up resulting in less stack, if the
really deep parts don't happen to go through that function that got
inlined (ie any call chain that wouldn't have gone through that
"slowpath" function at all).

But in this case, __alloc_pages_slowpath() is where we end up doing
the actual direct reclaim anyway, so just uninlining doesn't actually
help. Although it would probably make the asm code more readable ;)

  Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Minchan Kim
On Wed, May 28, 2014 at 10:44:48PM -0400, Steven Rostedt wrote:
> On Thu, 29 May 2014 10:09:40 +0900
> Minchan Kim  wrote:
> 
> > stacktrace reported that vring_add_indirect used 376 bytes and objdump says
> > 
> > 8141dc60 :
> > 8141dc60:   55  push   %rbp
> > 8141dc61:   48 89 e5mov%rsp,%rbp
> > 8141dc64:   41 57   push   %r15
> > 8141dc66:   41 56   push   %r14
> > 8141dc68:   41 55   push   %r13
> > 8141dc6a:   49 89 fdmov%rdi,%r13
> > 8141dc6d:   89 cf   mov%ecx,%edi
> > 8141dc6f:   48 c1 e7 04 shl$0x4,%rdi
> > 8141dc73:   41 54   push   %r12
> > 8141dc75:   49 89 d4mov%rdx,%r12
> > 8141dc78:   53  push   %rbx
> > 8141dc79:   48 89 f3mov%rsi,%rbx
> > 8141dc7c:   48 83 ec 28 sub$0x28,%rsp
> > 8141dc80:   8b 75 20mov0x20(%rbp),%esi
> > 8141dc83:   89 4d bcmov%ecx,-0x44(%rbp)
> > 8141dc86:   44 89 45 cc mov%r8d,-0x34(%rbp)
> > 8141dc8a:   44 89 4d c8 mov%r9d,-0x38(%rbp)
> > 8141dc8e:   83 e6 ddand$0xffdd,%esi
> > 8141dc91:   e8 7a d1 d7 ff  callq  8119ae10 
> > <__kmalloc>
> > 8141dc96:   48 85 c0test   %rax,%rax
> > 
> > So, it's *strange*.
> > 
> > I will add .config and .o.
> > Maybe someone might find what happens.
> > 
> 
> This is really bothering me. I'm trying to figure it out. We have from
> the stack trace:
> 
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)   
>   6456  80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)   
>   6376 376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)   
>   6000 144   virtqueue_add_sgs+0x2e2/0x320
> 
> The way the stack tracer works, is that when it detects a new max stack
> it calls save_stack_trace() to get the complete call chain from the
> stack. This should be rather accurate as it seems that your kernel was
> compiled with frame pointers (confirmed by the objdump as well as the
> config file). It then uses that stack trace that it got to examine the
> stack to find the locations of the saved return addresses and records
> them in an array (in your case, an array of 50 entries).
> 
> From your .o file:
> 
> vring_add_indirect + 0x36: (0x370 + 0x36 = 0x3a6)
> 
> 0370 :
> 
>  39e:   83 e6 ddand$0xffdd,%esi
>  3a1:   e8 00 00 00 00  callq  3a6 
> 3a2: R_X86_64_PC32  __kmalloc-0x4
>  3a6:   48 85 c0test   %rax,%rax
> 
> Definitely the return address to the call to __kmalloc. Then to
> determine the size of the stack frame, it is subtracted from the next
> one down. In this case, the location of virtqueue_add_sgs+0x2e2.
> 
> virtqueue_add_sgs + 0x2e2: (0x880 + 0x2e2 = 0xb62)
> 
> 0880 :
> 
> b4f:   89 4c 24 08 mov%ecx,0x8(%rsp)
>  b53:   48 c7 c2 00 00 00 00mov$0x0,%rdx
> b56: R_X86_64_32S   .text+0x570
>  b5a:   44 89 d1mov%r10d,%ecx
>  b5d:   e8 0e f8 ff ff  callq  370 
>  b62:   85 c0   test   %eax,%eax
> 
> 
> Which is the return address of where vring_add_indirect was called.
> 
> The return address back to virtqueue_add_sgs was found at 6000 bytes of
> the stack. The return address back to vring_add_indirect was found at
> 6376 bytes from the top of the stack.
> 
> My question is, why were they so far apart? I see 6 words pushed
> (8bytes each, for a total of 48 bytes), and a subtraction of the stack
> pointer of 0x28 (40 bytes) giving us a total of 88 bytes. Plus we need
> to add the push of the return address itself which would just give us
> 96 bytes for the stack frame. What is making this show 376 bytes??

That's what I want to know. :(

> 
> Looking more into this, I'm not sure I trust the top numbers anymore.
> kmalloc reports a stack frame of 80, and I'm coming up with 104
> (perhaps even 112). And slab_alloc only has 8. Something's messed up there.

Yes, it looks weird, but some of the upper functions in the callstack
match well, so one might think only the top functions of the callstack
were corrupted. But in the case of alloc_pages_current (8 bytes), it
looks weird too, yet it reports the same 8-byte value in both the upper
and bottom parts of the callstack. :(
> 
> -- Steve
> 

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Minchan Kim
On Wed, May 28, 2014 at 09:09:23AM -0700, Linus Torvalds wrote:
> On Tue, May 27, 2014 at 11:53 PM, Minchan Kim  wrote:
> >
> > So, my stupid idea is just to expand the stack size and keep an eye on
> > stack consumption in each kernel function via ftrace's stacktrace.
> 
> We probably have to do this at some point, but that point is not -rc7.
> 
> And quite frankly, from the backtrace, I can only say: there is some
> bad shit there. The current VM stands out as a bloated pig:
> 
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0) 
> > 7696  16   lookup_address+0x28/0x30
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1) 
> > 7680  16   _lookup_address_cpa.isra.3+0x3b/0x40
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2) 
> > 7664  24   __change_page_attr_set_clr+0xe0/0xb50
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3) 
> > 7640 392   kernel_map_pages+0x6c/0x120
> > [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4) 
> > 7248 256   get_page_from_freelist+0x489/0x920
> > [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5) 
> > 6992 352   __alloc_pages_nodemask+0x5e1/0xb20
> 
> > [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  23) 
> > 4672 160   __swap_writepage+0x150/0x230
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  24) 
> > 4512  32   swap_writepage+0x42/0x90
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  25) 
> > 4480 320   shrink_page_list+0x676/0xa80
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  26) 
> > 4160 208   shrink_inactive_list+0x262/0x4e0
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  27) 
> > 3952 304   shrink_lruvec+0x3e1/0x6a0
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  28) 
> > 3648  80   shrink_zone+0x3f/0x110
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  29) 
> > 3568 128   do_try_to_free_pages+0x156/0x4c0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  30) 
> > 3440 208   try_to_free_pages+0xf7/0x1e0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  31) 
> > 3232 352   __alloc_pages_nodemask+0x783/0xb20
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  32) 
> > 2880   8   alloc_pages_current+0x10f/0x1f0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  33) 
> > 2872 200   __page_cache_alloc+0x13f/0x160
> 
> That __alloc_pages_nodemask() thing in particular looks bad. It
> actually seems not to be the usual "let's just allocate some
> structures on the stack" disease, it looks more like "lots of
> inlining, horrible calling conventions, and lots of random stupid
> variables".

Yes. For example, with __alloc_pages_slowpath marked noinline_for_stack,
we can reduce it by 176 bytes. And there are more places we could reduce
stack consumption, but I thought that was a band-aid, although reducing
stack usage itself is desirable.

before

81150600 <__alloc_pages_nodemask>:
81150600:   e8 fb f6 59 00  callq  816efd00 
<__entry_text_start>
81150605:   55  push   %rbp
81150606:   b8 e8 e8 00 00  mov$0xe8e8,%eax
8115060b:   48 89 e5mov%rsp,%rbp
8115060e:   41 57   push   %r15
81150610:   41 56   push   %r14
81150612:   41 be 22 01 32 01   mov$0x1320122,%r14d
81150618:   41 55   push   %r13
8115061a:   41 54   push   %r12
8115061c:   41 89 fcmov%edi,%r12d
8115061f:   53  push   %rbx
81150620:   48 81 ec 28 01 00 00sub$0x128,%rsp
81150627:   48 89 55 88 mov%rdx,-0x78(%rbp)
8115062b:   89 fa   mov%edi,%edx
8115062d:   83 e2 0fand$0xf,%edx
81150630:   48 89 4d 90 mov%rcx,-0x70(%rbp)

after:

81150600 <__alloc_pages_nodemask>:
81150600:   e8 7b f6 59 00  callq  816efc80 
<__entry_text_start>
81150605:   55  push   %rbp
81150606:   b8 e8 e8 00 00  mov$0xe8e8,%eax
8115060b:   48 89 e5mov%rsp,%rbp
8115060e:   41 57   push   %r15
81150610:   41 bf 22 01 32 01   mov$0x1320122,%r15d
81150616:   41 56   push   %r14
81150618:   41 55 

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
[ Crossed emails ]

On Wed, May 28, 2014 at 6:58 PM, Dave Chinner  wrote:
> On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
>>
>> And now we have too deep a stack due to unplugging from io_schedule()...
>
> So, if we make io_schedule() push the plug list off to the kblockd
> like is done for schedule()

We might have a few different cases.

The cases where we *do* care about latency is when we are waiting for
the IO ourselves (ie in wait_on_page() and friends), and those end up
using io_schedule() too.

So in *that* case we definitely have a latency argument for doing it
directly, and we shouldn't kick it off to kblockd. That's very much a
"get this started as soon as humanly possible".

But the "wait_iff_congested()" code that also uses io_schedule()
should push it out to kblockd, methinks.

>> This stack overflow shows us that just the memory reclaim + IO
>> layers are sufficient to cause a stack overflow,
>
>  we solve this problem directly by being able to remove the IO
> stack usage from the direct reclaim swap path.
>
> IOWs, we don't need to turn swap off at all in direct reclaim
> because all the swap IO can be captured in a plug list and
> dispatched via kblockd. This could be done either by io_schedule()
> or a new blk_flush_plug_list() wrapper that pushes the work to
> kblockd...

That would work. That said, I personally would not mind to see that
"swap is special" go away, if possible. Because it can be behind a
filesystem too. Christ, even NFS (and people used to fight that tooth
and nail!) is back as a swap target..

Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Rusty Russell
Minchan Kim  writes:
> On Wed, May 28, 2014 at 12:04:09PM +0300, Michael S. Tsirkin wrote:
>> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   
>> > 9) 6456  80   __kmalloc+0x1cb/0x200
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  
>> > 10) 6376 376   vring_add_indirect+0x36/0x200
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  
>> > 11) 6000 144   virtqueue_add_sgs+0x2e2/0x320

Hmm, we can actually skip the vring_add_indirect if we're hurting for
stack.  It just means the request will try to fit linearly in the ring,
rather than using indirect.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e443629f76d..496e727cefc8 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -184,6 +184,13 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
return head;
 }
 
+/* The Morton Technique */
+static noinline bool stack_trouble(void)
+{
+   unsigned long sp = (unsigned long)&sp;
+   return sp - (sp & ~(THREAD_SIZE - 1)) < 3000;
+}
+
 static inline int virtqueue_add(struct virtqueue *_vq,
struct scatterlist *sgs[],
struct scatterlist *(*next)
@@ -226,7 +233,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
/* If the host supports indirect descriptor tables, and we have multiple
 * buffers, then go indirect. FIXME: tune this threshold */
-   if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
+   if (vq->indirect && total_sg > 1 && vq->vq.num_free && !stack_trouble()) {
head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
  total_in,
  out_sgs, in_sgs, gfp);


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Steven Rostedt
On Thu, 29 May 2014 10:09:40 +0900
Minchan Kim  wrote:

> stacktrace reported that vring_add_indirect used 376 bytes and objdump says
> 
> 8141dc60 :
> 8141dc60:   55  push   %rbp
> 8141dc61:   48 89 e5mov%rsp,%rbp
> 8141dc64:   41 57   push   %r15
> 8141dc66:   41 56   push   %r14
> 8141dc68:   41 55   push   %r13
> 8141dc6a:   49 89 fdmov%rdi,%r13
> 8141dc6d:   89 cf   mov%ecx,%edi
> 8141dc6f:   48 c1 e7 04 shl$0x4,%rdi
> 8141dc73:   41 54   push   %r12
> 8141dc75:   49 89 d4mov%rdx,%r12
> 8141dc78:   53  push   %rbx
> 8141dc79:   48 89 f3mov%rsi,%rbx
> 8141dc7c:   48 83 ec 28 sub$0x28,%rsp
> 8141dc80:   8b 75 20mov0x20(%rbp),%esi
> 8141dc83:   89 4d bcmov%ecx,-0x44(%rbp)
> 8141dc86:   44 89 45 cc mov%r8d,-0x34(%rbp)
> 8141dc8a:   44 89 4d c8 mov%r9d,-0x38(%rbp)
> 8141dc8e:   83 e6 ddand$0xffdd,%esi
> 8141dc91:   e8 7a d1 d7 ff  callq  8119ae10 
> <__kmalloc>
> 8141dc96:   48 85 c0test   %rax,%rax
> 
> So, it's *strange*.
> 
> I will add .config and .o.
> Maybe someone might find what happens.
> 

This is really bothering me. I'm trying to figure it out. We have from
the stack trace:

[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9) 
6456  80   __kmalloc+0x1cb/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10) 
6376 376   vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11) 
6000 144   virtqueue_add_sgs+0x2e2/0x320

The way the stack tracer works, is that when it detects a new max stack
it calls save_stack_trace() to get the complete call chain from the
stack. This should be rather accurate as it seems that your kernel was
compiled with frame pointers (confirmed by the objdump as well as the
config file). It then uses that stack trace that it got to examine the
stack to find the locations of the saved return addresses and records
them in an array (in your case, an array of 50 entries).
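A rough sketch of that bookkeeping, loosely modelled on
kernel/trace/trace_stack.c (names and simplifications are mine, not the
actual tracer code):

/* Given the return addresses that save_stack_trace() reported and the
 * raw words of the stack, locate each saved return address and derive
 * per-frame sizes by subtracting neighbouring offsets.
 */
static void frame_sizes(const unsigned long *entries, int nr_entries,
			const unsigned long *stack, int nr_words,
			unsigned long *frame_size)
{
	int i, j = 0;
	int index[50];			/* offsets, in words, of each entry */

	for (i = 0; i < nr_entries && i < 50; i++) {
		while (j < nr_words && stack[j] != entries[i])
			j++;		/* scan down for the saved address */
		index[i] = j;
	}

	/* a frame ends where the next function's return address sits */
	for (i = 0; i < nr_entries - 1 && i < 49; i++)
		frame_size[i] = (index[i + 1] - index[i]) * sizeof(long);
}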

From your .o file:

vring_add_indirect + 0x36: (0x370 + 0x36 = 0x3a6)

0000000000000370 <vring_add_indirect>:

 39e:   83 e6 dd                and    $0xffffffdd,%esi
 3a1:   e8 00 00 00 00          callq  3a6 <vring_add_indirect+0x36>
                        3a2: R_X86_64_PC32      __kmalloc-0x4
 3a6:   48 85 c0                test   %rax,%rax

Definitely the return address of the call to __kmalloc. Then, to
determine the size of the stack frame, its offset is subtracted from
the next one down; in this case, the location of virtqueue_add_sgs+0x2e2.

virtqueue_add_sgs + 0x2e2: (0x880 + 0x2e2 = 0xb62)

0000000000000880 <virtqueue_add_sgs>:

 b4f:   89 4c 24 08             mov    %ecx,0x8(%rsp)
 b53:   48 c7 c2 00 00 00 00    mov    $0x0,%rdx
                        b56: R_X86_64_32S       .text+0x570
 b5a:   44 89 d1                mov    %r10d,%ecx
 b5d:   e8 0e f8 ff ff          callq  370 <vring_add_indirect>
 b62:   85 c0                   test   %eax,%eax


Which is the return address of where vring_add_indirect was called.

The return address back to virtqueue_add_sgs was found at 6000 bytes
from the top of the stack. The return address back to vring_add_indirect
was found at 6376 bytes from the top.

My question is, why were they so far apart? I see 6 words pushed
(8 bytes each, for a total of 48 bytes), and a subtraction of 0x28
(40 bytes) from the stack pointer, giving us a total of 88 bytes. Plus
we need to add the push of the return address itself, which gives us
96 bytes for the stack frame. What is making this show 376 bytes??
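Spelling the arithmetic out against the prologue quoted earlier:

/* Expected vring_add_indirect stack frame, per the disassembly:
 *   push %rbp/%r15/%r14/%r13/%r12/%rbx :  6 * 8 = 48 bytes
 *   sub  $0x28,%rsp                    :          40 bytes
 *   return address pushed by the call  :           8 bytes
 *                                        total:   96 bytes
 * yet the tracer reports a 376-byte frame, leaving 280 bytes
 * unaccounted for.
 */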

Looking more into this, I'm not sure I trust the top numbers anymore.
__kmalloc reports a stack frame of 80, and I'm coming up with 104
(perhaps even 112). And __slab_alloc only has 8. Something's messed up there.

-- Steve


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
On Wed, May 28, 2014 at 6:30 PM, Dave Chinner  wrote:
>
> You're focussing on the specific symptoms, not the bigger picture.
> i.e. you're ignoring all the other "let's start IO" triggers in
> direct reclaim. e.g there's two separate plug flush triggers in
> shrink_inactive_list(), one of which is:

Fair enough. I certainly agree that we should look at the other cases here too.

In fact, I also find it distasteful just how much stack space some of
those VM routines are just using up on their own, never mind any
actual IO paths at all. The fact that __alloc_pages_nodemask() uses
350 bytes of stackspace on its own is actually quite disturbing. The
fact that kernel_map_pages() apparently has almost 400 bytes of stack
is just crazy. Obviously that case only happens with
CONFIG_DEBUG_PAGEALLOC, but still..

> I'm not saying we shouldn't turn off swap from direct reclaim, just
> that all we'd be doing by turning off swap is playing whack-a-stack
> - the next report will simply be from one of the other direct
> reclaim IO schedule points.

Playing whack-a-mole with this for a while might not be a bad idea,
though. It's not like we will ever really improve unless we start
whacking the worst cases. And it should still be a fairly limited
number.

After all, historically, some of the cases we've played whack-a-mole
on have been in XFS, so I'd think you'd be thrilled to see some other
code get blamed this time around ;)

> Regardless of whether it is swap or something external queues the
> bio on the plug, perhaps we should look at why it's done inline
> rather than by kblockd, where it was moved because it was blowing
> the stack from schedule():

So it sounds like we need to do this for io_schedule() too.

In fact, we've generally found it to be a mistake every time we
"automatically" unblock some IO queue. And I'm not saying that because
of stack space, but because we've _often_ had the situation that eager
unblocking results in IO that could have been done as bigger requests.

Of course, we do need to worry about latency for starting IO, but any
of these kinds of memory-pressure writeback patterns are pretty much
by definition not about the latency of one _particular_ IO, so they
don't tend to be latency-sensitive. Quite the reverse: we start
writeback and then end up waiting on something else altogether
(possibly a writeback that got started much earlier).

swapout certainly is _not_ IO-latency-sensitive, especially these
days. And while we _do_ want to throttle in direct reclaim, if it's
about throttling I'd certainly think that it sounds quite reasonable
to push any unplugging to kblockd than try to do that synchronously.
If we are throttling in direct-reclaim, we need to slow things _down_
for the writer, not worry about latency.

> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.

Clearly it is true that it lacks the actual filesystem part needed for
the writeback. At the same time, Minchan's example is certainly a good
one of a filesystem (ext4) already being reasonably deep in its own
stack space when it then wants memory.

Looking at that callchain, I have to say that ext4 doesn't look
horrible compared to the whole block layer and virtio.. Yes,
"ext4_writepages()" is using almost 400 bytes of stack, and most of
that seems to be due to:

struct mpage_da_data mpd;
struct blk_plug plug;

which looks at least understandable (nothing like the mess in the VM
code, where the stack usage comes from gcc creating horrible spills).

> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow, which is something
> I've never seen before.

Well, we've definitely had some issues with deeper callchains
with md, but I suspect virtio might be worse, and the new blk-mq code
is likely worse in this respect too.

And Minchan running out of stack is at least _partly_ due to his debug
options (that DEBUG_PAGEALLOC thing as an extreme example, but I
suspect there are a few other options there that generate more bloated
data structures, too).

> That implies no IO in direct reclaim context
> is safe - either from swap or io_schedule() unplugging. It also
> lends a lot of weight to my assertion that the majority of the stack
> growth over the past couple of years has been occurring outside the
> filesystems

I think Minchan's stack trace definitely backs you up on that. The
filesystem part - despite that one ext4_writepages() function - is a
very small part of the whole. It sits at about ~1kB of stack. Just the
VM "top-level" writeback code is about as much, and then the VM page
alloc/shrinking code when the filesystem needs memory is *twice* that.

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> Author: Jens Axboe 
> Date:   Sat Apr 16 13:27:55 2011 +0200
> 
> block: let io_schedule() flush the plug inline
> 
> Linus correctly observes that the most important dispatch cases
> are now done from kblockd, this isn't ideal for latency reasons.
> The original reason for switching dispatches out-of-line was to
> avoid too deep a stack, so by _only_ letting the "accidental"
> flush directly in schedule() be guarded by offload to kblockd,
> we should be able to get the best of both worlds.
> 
> So add a blk_schedule_flush_plug() that offloads to kblockd,
> and only use that from the schedule() path.
> 
> Signed-off-by: Jens Axboe 
> 
> And now we have too deep a stack due to unplugging from io_schedule()...

So, if we make io_schedule() push the plug list off to kblockd,
as is done for schedule()...

> > IOW, swap-out directly caused that extra 3kB of stack use in what was
> > a deep call chain (due to memory allocation). I really don't
> > understand why you are arguing anything else on a pure technicality.
> >
> > I thought you had some other argument for why swap was different, and
> > against removing that "page_is_file_cache()" special case in
> > shrink_page_list().
> 
> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.
> 
> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow,

then we solve this problem directly by removing the IO
stack usage from the direct reclaim swap path.

IOWs, we don't need to turn swap off at all in direct reclaim
because all the swap IO can be captured in a plug list and
dispatched via kblockd. This could be done either by io_schedule()
or a new blk_flush_plug_list() wrapper that pushes the work to
kblockd...
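For concreteness, a sketch of what that could look like against the
3.14-era io_schedule() (an untested illustration, not a submitted patch;
blk_schedule_flush_plug() is the existing helper that defers the flush
to kblockd):

void __sched io_schedule(void)
{
	struct rq *rq = raw_rq();

	delayacct_blkio_start();
	atomic_inc(&rq->nr_iowait);
	/* was: blk_flush_plug(current), which dispatches the plug list
	 * inline on this (possibly already deep) stack */
	blk_schedule_flush_plug(current);
	current->in_iowait = 1;
	schedule();
	current->in_iowait = 0;
	atomic_dec(&rq->nr_iowait);
	delayacct_blkio_end();
}

The trade-off, as the commit messages quoted elsewhere in this thread
note, is latency: kblockd runs the queue a little later than an inline
flush would.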

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 3:31 PM, Dave Chinner  wrote:
> >
> > Indeed, the call chain reported here is not caused by swap issuing
> > IO.
> 
> Well, that's one way of reading that callchain.
> 
> I think it's the *wrong* way of reading it, though. Almost dishonestly
> so.

I guess you haven't met your insult quota for the day, Linus. :/

> Because very clearly, the swapout _is_ what causes the unplugging
> of the IO queue, and does so because it is allocating the BIO for its
> own IO.  The fact that that then fails (because of other IO's in
> flight), and causes *other* IO to be flushed, doesn't really change
> anything fundamental. It's still very much swap that causes that
> "let's start IO".

It is not rocket science to see how plugging outside memory
allocation context can lead to flushing that plug within direct
reclaim without having dispatched any IO at all from direct
reclaim

You're focussing on the specific symptoms, not the bigger picture.
i.e. you're ignoring all the other "let's start IO" triggers in
direct reclaim. e.g there's two separate plug flush triggers in
shrink_inactive_list(), one of which is:

/*
 * Stall direct reclaim for IO completions if underlying BDIs or zone
 * is congested. Allow kswapd to continue until it starts encountering
 * unqueued dirty pages or cycling through the LRU too quickly.
 */
if (!sc->hibernation_mode && !current_is_kswapd())
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

I'm not saying we shouldn't turn off swap from direct reclaim, just
that all we'd be doing by turning off swap is playing whack-a-stack
- the next report will simply be from one of the other direct
reclaim IO schedule points.

Regardless of whether it is swap or something external queues the
bio on the plug, perhaps we should look at why it's done inline
rather than by kblockd, where it was moved because it was blowing
the stack from schedule():

commit f4af3c3d077a004762aaad052049c809fd8c6f0c
Author: Jens Axboe 
Date:   Tue Apr 12 14:58:51 2011 +0200

block: move queue run on unplug to kblockd

There are worries that we are now consuming a lot more stack in
some cases, since we potentially call into IO dispatch from
schedule() or io_schedule(). We can reduce this problem by moving
the running of the queue to kblockd, like the old plugging scheme
did as well.

This may or may not be a good idea from a performance perspective,
depending on how many tasks have queue plugs running at the same
time. For even the slightly contended case, doing just a single
queue run from kblockd instead of multiple runs directly from the
unpluggers will be faster.

Signed-off-by: Jens Axboe 


commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe 
Date:   Sat Apr 16 13:27:55 2011 +0200

block: let io_schedule() flush the plug inline

Linus correctly observes that the most important dispatch cases
are now done from kblockd, this isn't ideal for latency reasons.
The original reason for switching dispatches out-of-line was to
avoid too deep a stack, so by _only_ letting the "accidental"
flush directly in schedule() be guarded by offload to kblockd,
we should be able to get the best of both worlds.

So add a blk_schedule_flush_plug() that offloads to kblockd,
and only use that from the schedule() path.

Signed-off-by: Jens Axboe 

And now we have too deep a stack due to unplugging from io_schedule()...

> IOW, swap-out directly caused that extra 3kB of stack use in what was
> a deep call chain (due to memory allocation). I really don't
> understand why you are arguing anything else on a pure technicality.
>
> I thought you had some other argument for why swap was different, and
> against removing that "page_is_file_cache()" special case in
> shrink_page_list().

I've said in the past that swap is different to filesystem
->writepage implementations because it doesn't require significant
stack to do block allocation and doesn't trigger IO deep in that
allocation stack. Hence it has much lower stack overhead than the
filesystem ->writepage implementations and so is much less likely to
have stack issues.

This stack overflow shows us that just the memory reclaim + IO
layers are sufficient to cause a stack overflow, which is something
I've never seen before. That implies no IO in direct reclaim context
is safe - either from swap or io_schedule() unplugging. It also
lends a lot of weight to my assertion that the majority of the stack
growth over the past couple of years has been occurring outside the
filesystems

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread H. Peter Anvin
On 05/28/2014 04:17 PM, Dave Chinner wrote:
>>
>> You were the one calling it a canary.
> 
> That doesn't mean it's to blame. Don't shoot the messenger...
> 

Fair enough.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 03:42:18PM -0700, H. Peter Anvin wrote:
> On 05/28/2014 03:11 PM, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
> >> We tried for 4K on x86-64, too, for quite a while as I recall.
> >> The kernel stack is one of the main costs for a thread.  I would
> >> like to decouple struct thread_info from the kernel stack (PJ
> >> Waskewicz was working on that before he left Intel) but that
> >> doesn't buy us all that much.
> >>
> >> 8K additional per thread is a huge hit.  XFS has indeed always
> >> been a canary, or trouble spot, I suspect because it originally
> >> came from another kernel where this was not an optimization
> >> target.
> > 
> > 
> > 
> > Always blame XFS for stack usage problems.
> > 
> > Even when the reported problem is from IO to an ext4 filesystem.
> > 
> 
> You were the one calling it a canary.

That doesn't mean it's to blame. Don't shoot the messenger...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread H. Peter Anvin
On 05/28/2014 03:11 PM, Dave Chinner wrote:
> On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
>> We tried for 4K on x86-64, too, for quite a while as I recall.
>> The kernel stack is one of the main costs for a thread.  I would
>> like to decouple struct thread_info from the kernel stack (PJ
>> Waskewicz was working on that before he left Intel) but that
>> doesn't buy us all that much.
>>
>> 8K additional per thread is a huge hit.  XFS has indeed always
>> been a canary, or trouble spot, I suspect because it originally
>> came from another kernel where this was not an optimization
>> target.
> 
> 
> 
> Always blame XFS for stack usage problems.
> 
> Even when the reported problem is from IO to an ext4 filesystem.
> 

You were the one calling it a canary.

-hpa




Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
On Wed, May 28, 2014 at 3:31 PM, Dave Chinner  wrote:
>
> Indeed, the call chain reported here is not caused by swap issuing
> IO.

Well, that's one way of reading that callchain.

I think it's the *wrong* way of reading it, though. Almost dishonestly
so. Because very clearly, the swapout _is_ what causes the unplugging
of the IO queue, and does so because it is allocating the BIO for its
own IO. The fact that that then fails (because of other IO's in
flight), and causes *other* IO to be flushed, doesn't really change
anything fundamental. It's still very much swap that causes that
"let's start IO".

IOW, swap-out directly caused that extra 3kB of stack use in what was
a deep call chain (due to memory allocation). I really don't
understand why you are arguing anything else on a pure technicality.

I thought you had some other argument for why swap was different, and
against removing that "page_is_file_cache()" special case in
shrink_page_list().

 Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 09:09:23AM -0700, Linus Torvalds wrote:
> On Tue, May 27, 2014 at 11:53 PM, Minchan Kim  wrote:
> >
> > So, my stupid idea is to just expand the stack size and keep an eye
> > on the stack consumption of each kernel function via ftrace's stacktrace.
[...]
> But what *does* stand out (once again) is that we probably shouldn't
> do swap-out in direct reclaim. This came up the last time we had stack
> issues (XFS) too. I really do suspect that direct reclaim should only
> do the kind of reclaim that does not need any IO at all.
> 
> I think we _do_ generally avoid IO in direct reclaim, but swap is
> special. And not for a good reason, afaik. DaveC, remind me, I think
> you said something about the swap case the last time this came up..

Right, we do generally avoid IO through filesystems via direct
reclaim because delayed allocation requires significant amounts
of additional memory, stack space and IO.

However, swap doesn't have that overhead - it's just the IO stack
that it drives through submit_bio(), and the worst case I'd seen
through that path was much less than other reclaim stack path usage.
I haven't seen swap in any of the stack overflows from production
machines, and I only rarely see it in worst case stack usage
profiles on my test machines.

Indeed, the call chain reported here is not caused by swap issuing
IO.  We scheduled in the swap code (throttling waiting for
congestion, I think) with a plugged block device (from the ext4
writeback layer) with pending bios queued on it and the scheduler
has triggered a flush of the device.  submit_bio in the swap path
has much less stack usage than io_schedule() because it doesn't have
any of the scheduler or plug list flushing overhead in the stack.

So, realistically, the swap path is not worst case stack usage here
and disabling it won't prevent this stack overflow from happening.
Direct reclaim will simply throttle elsewhere and that will still
cause the plug to be flushed, the IO to be issued and the stack to
overflow.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
> We tried for 4K on x86-64, too, for quite a while as I recall.
> The kernel stack is one of the main costs for a thread.  I would
> like to decouple struct thread_info from the kernel stack (PJ
> Waskewicz was working on that before he left Intel) but that
> doesn't buy us all that much.
> 
> 8K additional per thread is a huge hit.  XFS has indeed always
> been a canary, or trouble spot, I suspect because it originally
> came from another kernel where this was not an optimization
> target.



Always blame XFS for stack usage problems.

Even when the reported problem is from IO to an ext4 filesystem.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > > [ cc XFS list ]
> > 
> > [and now there is a complete copy on the XFs list, I'll add my 2c]
> > 
> > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > > While I was playing with in-house patches under heavy memory pressure
> > > > on qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel
> > > > stack overflow.
> > > >
> > > > When I investigated the problem, the callstack was a little deeper
> > > > because it involved reclaim functions, but not the direct reclaim path.
> > > >
> > > > I tried to put some alloc/reclaim-related functions on a stack diet,
> > > > and saved a hundred bytes or so, but the overflow didn't disappear; I
> > > > just hit it again via another, deeper callstack on the
> > > > reclaim/allocator path.
> > 
> > That's a no win situation. The stack overruns through ->writepage
> > we've been seeing with XFS over the past *4 years* are much larger
> > than a few bytes. The worst case stack usage on a virtio block
> > device was about 10.5KB of stack usage.
> > 
> > And, like this one, it came from the flusher thread as well. The
> > difference was that the allocation that triggered the reclaim path
> > you've reported occurred when 5k of the stack had already been
> > used...
> > 
> > > > Of course, we could sweep every site we have found to reduce stack
> > > > usage, but I'm not sure how long that would save the world (surely
> > > > lots of developers will keep adding nice features that use stack
> > > > again), and if we consider more complex features in the I/O layer
> > > > and/or the reclaim path, it might be better to increase the stack
> > > > size. (Meanwhile, stack usage on 64-bit machines doubled compared to
> > > > 32-bit while the stack stayed at 8K. Hmm, that doesn't seem fair to
> > > > me, and arm64 has already expanded to 16K.)
> > 
> > Yup, that's all been pointed out previously. 8k stacks were never
> > large enough to fit the linux IO architecture on x86-64, but nobody
> > outside filesystem and IO developers has been willing to accept that
> > argument as valid, despite regular stack overruns and filesystems
> > having to add workaround after workaround to prevent stack overruns.
> > 
> > That's why stuff like this appears in various filesystem's
> > ->writepage:
> > 
> > /*
> >  * Refuse to write the page out if we are called from reclaim context.
> >  *
> >  * This avoids stack overflows when called from deeply used stacks in
> >  * random callers for direct reclaim or memcg reclaim.  We explicitly
> >  * allow reclaim from kswapd as the stack usage there is relatively low.
> >  *
> >  * This should never happen except in the case of a VM regression so
> >  * warn about it.
> >  */
> > if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> > 		PF_MEMALLOC))
> > 	goto redirty;
> > 
> > That still doesn't guarantee us enough stack space to do writeback,
> > though, because memory allocation can occur when reading in metadata
> > needed to do delayed allocation, and so we could trigger GFP_NOFS
> > memory allocation from the flusher thread with 4-5k of stack already
> > consumed, so that would still overrun the stack.
> > 
> > So, a couple of years ago we started deferring half the writeback
> > stack usage to a worker thread (commit c999a22 "xfs: introduce an
> > allocation workqueue"), under the assumption that the worst stack
> > usage when we call memory allocation is around 3-3.5k of stack used.
> > We thought that would be safe, but the stack trace you've posted
> > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> > which means we're still screwed despite all the workarounds we have
> > in place.
> 
> The allocation and reclaim stack itself is only 2k per the stacktrace
> below.  What got us in this particular case is that we engaged a
> complicated block layer setup from within the allocation context in
> order to swap out a page.

The report does not have a complicated block layer setup - it's just
a swap device on a virtio device. There's no MD, no raid, no complex
transport and protocol layer, etc. It's about as simple as it gets.

> In the past we disabled filesystem ->writepage from within the
> allocation context and deferred it to kswapd for stack reasons (see
> the WARN_ON_ONCE and the comment in your above quote), but I think we
> have to go further and do the same for even swap_writepage():

I don't think that solves the problem. I've seen plenty of near
stack overflows that were caused by >3k of stack being used because
of memory allocation/reclaim overhead and then scheduling, with
another 1k of stack spent waiting in the scheduler.

If we have a subsystem that can put >3k on the stack at arbitrary
locations,

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Steven Rostedt
On Wed, 28 May 2014 17:43:50 +0200
Richard Weinberger  wrote:


> > diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> > index 8de6d9cf3b95..678205195ae1 100644
> > --- a/arch/x86/include/asm/page_64_types.h
> > +++ b/arch/x86/include/asm/page_64_types.h
> > @@ -1,7 +1,7 @@
> >  #ifndef _ASM_X86_PAGE_64_DEFS_H
> >  #define _ASM_X86_PAGE_64_DEFS_H
> >
> > -#define THREAD_SIZE_ORDER  1
> > +#define THREAD_SIZE_ORDER  2
> >  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> >  #define CURRENT_MASK (~(THREAD_SIZE - 1))
> 
> Do you have any numbers of the performance impact of this?
> 

What performance impact are you looking for? Now if the system is short
on memory, it would probably cause issues in creating tasks. But other
than that, I'm not sure what you are looking for.

-- Steve


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
On Wed, May 28, 2014 at 9:08 AM, Steven Rostedt  wrote:
>
> What performance impact are you looking for? Now if the system is short
> on memory, it would probably cause issues in creating tasks.

It doesn't necessarily need to be short on memory, it could just be
fragmented. But a page order of 2 should still be ok'ish.

That said, this is definitely not an rc7 issue. I'd *much* rather
disable swap from direct reclaim, although that kind of patch too
would be a "can Minchan test it, we can put it in the next merge
window and then backport it if we don't have issues".

I see that Johannes already did a patch for that (and this really
_has_ come up before), although I'd skip the WARN_ON_ONCE() part
except for perhaps Minchan testing it.

   Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Richard Weinberger
On 28.05.2014 18:08, Steven Rostedt wrote:
> On Wed, 28 May 2014 17:43:50 +0200
> Richard Weinberger  wrote:
> 
> 
>>> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
>>> index 8de6d9cf3b95..678205195ae1 100644
>>> --- a/arch/x86/include/asm/page_64_types.h
>>> +++ b/arch/x86/include/asm/page_64_types.h
>>> @@ -1,7 +1,7 @@
>>>  #ifndef _ASM_X86_PAGE_64_DEFS_H
>>>  #define _ASM_X86_PAGE_64_DEFS_H
>>>
>>> -#define THREAD_SIZE_ORDER  1
>>> +#define THREAD_SIZE_ORDER  2
>>>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>>>  #define CURRENT_MASK (~(THREAD_SIZE - 1))
>>
>> Do you have any numbers of the performance impact of this?
>>
> 
> What performance impact are you looking for? Now if the system is short
> on memory, it would probably cause issues in creating tasks. But other
> than that, I'm not sure what you are looking for.

Allocating more contiguous memory for every thread is not cheap.
I'd assume that such a change will put more pressure on the allocator.

Thanks,
//richard


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Linus Torvalds
On Tue, May 27, 2014 at 11:53 PM, Minchan Kim  wrote:
>
> So, my stupid idea is to just expand the stack size and keep an eye
> on the stack consumption of each kernel function via ftrace's stacktrace.

We probably have to do this at some point, but that point is not -rc7.

And quite frankly, from the backtrace, I can only say: there is some
bad shit there. The current VM stands out as a bloated pig:

> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20

> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
> [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
> [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
> [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
> [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
> [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160

That __alloc_pages_nodemask() thing in particular looks bad. It
actually seems not to be the usual "let's just allocate some
structures on the stack" disease, it looks more like "lots of
inlining, horrible calling conventions, and lots of random stupid
variables".

From a quick glance at the frame usage, some of it seems to be gcc
being rather bad at stack allocation, but lots of it is just nasty
spilling around the disgusting call-sites with tons of arguments. A
_lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
"spill frame pointer", afaik).

Avoiding some inlining, and using a single flag value rather than the
collection of "bool"s would probably help. But nothing really
trivially obvious stands out.

But what *does* stand out (once again) is that we probably shouldn't
do swap-out in direct reclaim. This came up the last time we had stack
issues (XFS) too. I really do suspect that direct reclaim should only
do the kind of reclaim that does not need any IO at all.

I think we _do_ generally avoid IO in direct reclaim, but swap is
special. And not for a good reason, afaik. DaveC, remind me, I think
you said something about the swap case the last time this came up..

  Linus


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Johannes Weiner
On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > [ cc XFS list ]
> 
> [and now there is a complete copy on the XFs list, I'll add my 2c]
> 
> > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > While I was playing with in-house patches under heavy memory pressure
> > > on qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel
> > > stack overflow.
> > >
> > > When I investigated the problem, the callstack was a little deeper
> > > because it involved reclaim functions, but not the direct reclaim path.
> > >
> > > I tried to put some alloc/reclaim-related functions on a stack diet,
> > > and saved a hundred bytes or so, but the overflow didn't disappear; I
> > > just hit it again via another, deeper callstack on the reclaim/allocator
> > > path.
> 
> That's a no win situation. The stack overruns through ->writepage
> we've been seeing with XFS over the past *4 years* are much larger
> than a few bytes. The worst case stack usage on a virtio block
> device was about 10.5KB of stack usage.
> 
> And, like this one, it came from the flusher thread as well. The
> difference was that the allocation that triggered the reclaim path
> you've reported occurred when 5k of the stack had already been
> used...
> 
> > > Of course, we could sweep every site we have found to reduce stack
> > > usage, but I'm not sure how long that would save the world (surely lots
> > > of developers will keep adding nice features that use stack again), and
> > > if we consider more complex features in the I/O layer and/or the reclaim
> > > path, it might be better to increase the stack size. (Meanwhile, stack
> > > usage on 64-bit machines doubled compared to 32-bit while the stack
> > > stayed at 8K. Hmm, that doesn't seem fair to me, and arm64 has already
> > > expanded to 16K.)
> 
> Yup, that's all been pointed out previously. 8k stacks were never
> large enough to fit the linux IO architecture on x86-64, but nobody
> outside filesystem and IO developers has been willing to accept that
> argument as valid, despite regular stack overruns and filesystems
> having to add workaround after workaround to prevent stack overruns.
> 
> That's why stuff like this appears in various filesystem's
> ->writepage:
> 
> /*
>  * Refuse to write the page out if we are called from reclaim context.
>  *
>  * This avoids stack overflows when called from deeply used stacks in
>  * random callers for direct reclaim or memcg reclaim.  We explicitly
>  * allow reclaim from kswapd as the stack usage there is relatively low.
>  *
>  * This should never happen except in the case of a VM regression so
>  * warn about it.
>  */
> if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> PF_MEMALLOC))
> goto redirty;
> 
> That still doesn't guarantee us enough stack space to do writeback,
> though, because memory allocation can occur when reading in metadata
> needed to do delayed allocation, and so we could trigger GFP_NOFS
> memory allocation from the flusher thread with 4-5k of stack already
> consumed, so that would still overrun the stack.
> 
> So, a couple of years ago we started deferring half the writeback
> stack usage to a worker thread (commit c999a22 "xfs: introduce an
> allocation workqueue"), under the assumption that the worst stack
> usage when we call memory allocation is around 3-3.5k of stack used.
> We thought that would be safe, but the stack trace you've posted
> shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> which means we're still screwed despite all the workarounds we have
> in place.

The allocation and reclaim stack itself is only 2k per the stacktrace
below.  What got us in this particular case is that we engaged a
complicated block layer setup from within the allocation context in
order to swap out a page.

In the past we disabled filesystem ->writepage from within the
allocation context and deferred it to kswapd for stack reasons (see
the WARN_ON_ONCE and the comment in your above quote), but I think we
have to go further and do the same for even swap_writepage() (a sketch
follows the quoted trace below):

> > > I guess this topic has been discussed several times, so there might be
> > > a strong reason not to increase the kernel stack size on x86_64 that I
> > > don't know about; hence Ccing the x86_64 maintainers, other MM guys and
> > > the virtio maintainers.
> > >
> > >    Depth    Size   Location    (51 entries)
> > >    -----    ----   --------
> > >      0)     7696      16   lookup_address+0x28/0x30
> > >      1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > >      2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > >      3)     7640     392   kernel_map_pages+0x6c/0x120
> > >      4)     7248     256   get_page_from_freelist+0x489/0x920
> > >      5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> > >      6)     6640       8   alloc_pages_current+0x10f/0x1f0
> > >      7)     6632     168   new_slab+0x2c5/0x370
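A sketch of the deferral Johannes describes (not his actual patch; the
real shrink_page_list() logic is more involved): drop the
page_is_file_cache() special case so dirty anon pages are likewise left
to kswapd instead of being written out from direct reclaim:

	if (PageDirty(page) &&
	    (!current_is_kswapd() || !zone_is_reclaim_dirty(zone))) {
		/*
		 * Previously only file-backed pages were deferred here;
		 * treating swap-backed pages the same keeps submit_bio()
		 * and the unplug machinery off the already-deep direct
		 * reclaim stack.
		 */
		inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
		SetPageReclaim(page);
		goto keep_locked;	/* kswapd will issue the IO later */
	}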

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Richard Weinberger
On Wed, May 28, 2014 at 8:53 AM, Minchan Kim  wrote:
> While I was playing with in-house patches under heavy memory pressure
> on qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel
> stack overflow.
>
> When I investigated the problem, the callstack was a little deeper
> because it involved reclaim functions, but not the direct reclaim path.
>
> I tried to put some alloc/reclaim-related functions on a stack diet,
> and saved a hundred bytes or so, but the overflow didn't disappear; I
> just hit it again via another, deeper callstack on the reclaim/allocator
> path.
>
> Of course, we could sweep every site we have found to reduce stack
> usage, but I'm not sure how long that would save the world (surely lots
> of developers will keep adding nice features that use stack again), and
> if we consider more complex features in the I/O layer and/or the reclaim
> path, it might be better to increase the stack size. (Meanwhile, stack
> usage on 64-bit machines doubled compared to 32-bit while the stack
> stayed at 8K. Hmm, that doesn't seem fair to me, and arm64 has already
> expanded to 16K.)
>
> So, my stupid idea is to just expand the stack size and keep an eye on
> the stack consumption of each kernel function via ftrace's stacktrace.
> For example, we could set a bar that no function should exceed (say,
> 200 bytes) and emit a warning when some function consumes more at
> runtime. Of course, it could produce false positives, but at least it
> would give us a chance to think it over.
>
> I guess this topic has been discussed several times, so there might be
> a strong reason not to increase the kernel stack size on x86_64 that I
> don't know about; hence Ccing the x86_64 maintainers, other MM guys and
> the virtio maintainers.
>
> [ 1065.604404] kworker/-57660d..2 1071625990us : stack_trace_call:    Depth    Size   Location    (51 entries)
> [ 1065.604404]    -----    ----   --------
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  22)     4720      48   get_swap_bio+0x30/0x90
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> [ 1065.604404] kworker/-57

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread H. Peter Anvin
We tried for 4K on x86-64, too, for quite a while as I recall.  The kernel
stack is one of the main costs for a thread.  I would like to decouple struct
thread_info from the kernel stack (PJ Waskewicz was working on that before he 
left Intel) but that doesn't buy us all that much.

8K additional per thread is a huge hit.  XFS has indeed always been a canary, 
or trouble spot, I suspect because it originally came from another kernel where
this was not an optimization target.



On May 28, 2014 7:14:01 AM PDT, Steven Rostedt  wrote:
>
>This looks like something that Linus should be involved in too. He's
>been critical in the past about stack usage.
>
>On Wed, 28 May 2014 15:53:59 +0900
>Minchan Kim  wrote:
>
>> While I was playing with in-house patches under heavy memory pressure
>> on qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel
>> stack overflow.
>>
>> When I investigated the problem, the callstack was a little deeper
>> because it involved reclaim functions, but not the direct reclaim path.
>>
>> I tried to put some alloc/reclaim-related functions on a stack diet,
>> and saved a hundred bytes or so, but the overflow didn't disappear; I
>> just hit it again via another, deeper callstack on the reclaim/allocator
>> path.
>>
>> Of course, we could sweep every site we have found to reduce stack
>> usage, but I'm not sure how long that would save the world (surely lots
>> of developers will keep adding nice features that use stack again), and
>> if we consider more complex features in the I/O layer and/or the reclaim
>> path, it might be better to increase the stack size. (Meanwhile, stack
>> usage on 64-bit machines doubled compared to 32-bit while the stack
>> stayed at 8K. Hmm, that doesn't seem fair to me, and arm64 has already
>> expanded to 16K.)
>> 
>> So, my stupid idea is to just expand the stack size and keep an eye on
>> the stack consumption of each kernel function via ftrace's stacktrace.
>> For example, we could set a bar that no function should exceed (say,
>> 200 bytes) and emit a warning when some function consumes more at
>> runtime. Of course, it could produce false positives, but at least it
>> would give us a chance to think it over.
>> 
>> I guess this topic has been discussed several times, so there might be
>> a strong reason not to increase the kernel stack size on x86_64 that I
>> don't know about; hence Ccing the x86_64 maintainers, other MM guys and
>> the virtio maintainers.
>
>I agree with Boris that if this goes in, it should be a config option.
>Or perhaps selected by those file systems that need it. I hate to have
>16K stacks on a box that doesn't have that much memory, but also just
>uses ext2.
>
>-- Steve
>
>> 
>> [ 1065.604404] kworker/-57660d..2 1071625990us : stack_trace_call:    Depth    Size   Location    (51 entries)
>> [ 1065.604404]    -----    ----   --------
>> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
>> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
>> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
>> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
>> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
>> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
>> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
>> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
>> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
>> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
>> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
>> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
>> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
>> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
>> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
>> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
>> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_reques

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Steven Rostedt

This looks like something that Linus should be involved in too. He's
been critical in the past about stack usage.

On Wed, 28 May 2014 15:53:59 +0900
Minchan Kim  wrote:

> While I was playing with in-house patches under heavy memory pressure
> on qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel
> stack overflow.
>
> When I investigated the problem, the callstack was a little deeper
> because it involved reclaim functions, but not the direct reclaim path.
>
> I tried to put some alloc/reclaim-related functions on a stack diet,
> and saved a hundred bytes or so, but the overflow didn't disappear; I
> just hit it again via another, deeper callstack on the reclaim/allocator
> path.
>
> Of course, we could sweep every site we have found to reduce stack
> usage, but I'm not sure how long that would save the world (surely lots
> of developers will keep adding nice features that use stack again), and
> if we consider more complex features in the I/O layer and/or the reclaim
> path, it might be better to increase the stack size. (Meanwhile, stack
> usage on 64-bit machines doubled compared to 32-bit while the stack
> stayed at 8K. Hmm, that doesn't seem fair to me, and arm64 has already
> expanded to 16K.)
>
> So, my stupid idea is to just expand the stack size and keep an eye on
> the stack consumption of each kernel function via ftrace's stacktrace.
> For example, we could set a bar that no function should exceed (say,
> 200 bytes) and emit a warning when some function consumes more at
> runtime. Of course, it could produce false positives, but at least it
> would give us a chance to think it over.
>
> I guess this topic has been discussed several times, so there might be
> a strong reason not to increase the kernel stack size on x86_64 that I
> don't know about; hence Ccing the x86_64 maintainers, other MM guys and
> the virtio maintainers.

I agree with Boris that if this goes in, it should be a config option.
Or perhaps selected by those file systems that need it. I hate to have
16K stacks on a box that doesn't have that much memory, but also just
uses ext2.

-- Steve

> 
> [ 1065.604404] kworker/-57660d..2 1071625990us : stack_trace_call:    Depth    Size   Location    (51 entries)
> [ 1065.604404]    -----    ----   --------
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
> [ 1065.604404] kworker/-57660d..2 107162599

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Borislav Petkov
On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> While I was playing with in-house patches under heavy memory pressure
> on qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel
> stack overflow.
>
> When I investigated the problem, the callstack was a little deeper
> because it involved reclaim functions, but not the direct reclaim path.
>
> I tried to put some alloc/reclaim-related functions on a stack diet,
> and saved a hundred bytes or so, but the overflow didn't disappear; I
> just hit it again via another, deeper callstack on the reclaim/allocator
> path.
>
> Of course, we could sweep every site we have found to reduce stack
> usage, but I'm not sure how long that would save the world (surely lots
> of developers will keep adding nice features that use stack again), and
> if we consider more complex features in the I/O layer and/or the reclaim
> path, it might be better to increase the stack size. (Meanwhile, stack
> usage on 64-bit machines doubled compared to 32-bit while the stack
> stayed at 8K. Hmm, that doesn't seem fair to me, and arm64 has already
> expanded to 16K.)

Hmm, stupid question: what happens when 16K is not enough either? Do we
increase again? When do we stop increasing? 1M, 2M... ?

Sounds like we want to make it a config option with a couple of sizes
for everyone to be happy. :-)
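Purely as an illustration of that (no such option exists in the tree;
CONFIG_X86_16K_STACKS is hypothetical), the arch header could key the
order off a config switch:

/* arch/x86/include/asm/page_64_types.h, hypothetical variant */
#ifdef CONFIG_X86_16K_STACKS
# define THREAD_SIZE_ORDER	2	/* four 4K pages: 16K stacks */
#else
# define THREAD_SIZE_ORDER	1	/* two 4K pages: today's 8K */
#endif
#define THREAD_SIZE	(PAGE_SIZE << THREAD_SIZE_ORDER)
#define CURRENT_MASK	(~(THREAD_SIZE - 1))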

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> [ cc XFS list ]

[and now there is a complete copy on the XFs list, I'll add my 2c]

> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > While I was playing with in-house patches under heavy memory pressure
> > on qemu-kvm, the 3.14 kernel randomly crashed. The reason was a kernel
> > stack overflow.
> >
> > When I investigated the problem, the callstack was a little deeper
> > because it involved reclaim functions, but not the direct reclaim path.
> >
> > I tried to put some alloc/reclaim-related functions on a stack diet,
> > and saved a hundred bytes or so, but the overflow didn't disappear; I
> > just hit it again via another, deeper callstack on the reclaim/allocator
> > path.

That's a no-win situation. The stack overruns through ->writepage
that we've been seeing with XFS over the past *4 years* are much
larger than a few bytes. The worst case on a virtio block device was
about 10.5KB of stack usage.

And, like this one, it came from the flusher thread as well. The
difference was that the allocation that triggered the reclaim path
you've reported occurred when 5k of the stack had already been
used...

> > Of course, we could sweep every site we find to reduce stack usage,
> > but I'm not sure for how long that would save the world (surely
> > developers will keep adding nice features that eat stack again), and
> > if we consider more complex features in the I/O layer and/or the
> > reclaim path, it might be better to increase the stack size.
> > (Meanwhile, stack usage on 64-bit machines has doubled compared to
> > 32-bit while the stack has stayed at 8K. Hmm, that doesn't seem fair
> > to me, and arm64 has already expanded to 16K.)

Yup, that's all been pointed out previously. 8k stacks were never
large enough to fit the Linux IO architecture on x86-64, but nobody
outside filesystem and IO developers has been willing to accept that
argument as valid, despite regular stack overruns and filesystems
having to add workaround after workaround to prevent them.

That's why stuff like this appears in various filesystems'
->writepage:

/*
 * Refuse to write the page out if we are called from reclaim context.
 *
 * This avoids stack overflows when called from deeply used stacks in
 * random callers for direct reclaim or memcg reclaim.  We explicitly
 * allow reclaim from kswapd as the stack usage there is relatively low.
 *
 * This should never happen except in the case of a VM regression so
 * warn about it.
 */
	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
			PF_MEMALLOC))
		goto redirty;

That still doesn't guarantee us enough stack space to do writeback,
though, because memory allocation can occur when reading in the
metadata needed to do delayed allocation. So we could still trigger a
GFP_NOFS memory allocation from the flusher thread with 4-5k of stack
already consumed, and that would still overrun the stack.

So, a couple of years ago we started deferring half the writeback
stack usage to a worker thread (commit c999a22 "xfs: introduce an
allocation workqueue"), under the assumption that the worst stack
usage when we call memory allocation is around 3-3.5k.
We thought that would be safe, but the stack trace you've posted
shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
which means we're still screwed despite all the workarounds we have
in place.
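
The shape of that workaround, sketched rather than quoted from the
actual xfs code (struct alloc_args, do_deep_alloc() and
deep_alloc_deferred() are invented names): the deep caller packages
its arguments, hands them to a worker that starts on a fresh kworker
stack, and sleeps until the work completes:

#include <linux/workqueue.h>
#include <linux/completion.h>

/* Invented argument block, carried from the deep caller to the worker. */
struct alloc_args {
	struct work_struct	work;
	struct completion	done;
	int			result;
};

/* Stand-in for the real, stack-hungry allocation path. */
static int do_deep_alloc(struct alloc_args *args)
{
	return 0;
}

static void alloc_worker(struct work_struct *work)
{
	struct alloc_args *args = container_of(work, struct alloc_args, work);

	/* This runs on the kworker's nearly empty stack, so the deep
	 * allocation path no longer piles on top of the caller's
	 * already-consumed frames. */
	args->result = do_deep_alloc(args);
	complete(&args->done);
}

/* Called from the deep context: block until the worker has done the
 * allocation on its own stack. */
static int deep_alloc_deferred(struct alloc_args *args)
{
	INIT_WORK_ONSTACK(&args->work, alloc_worker);
	init_completion(&args->done);
	queue_work(system_unbound_wq, &args->work);
	wait_for_completion(&args->done);
	destroy_work_on_stack(&args->work);
	return args->result;
}

The obvious price is an extra context switch per deferred operation,
which is why it was applied to the worst offenders rather than to
every allocation site.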

We've also had recent reports of allocation from direct IO blowing
the stack, as well as block allocation adding an entry to a
directory.  We're basically at the point where we have to push every
XFS operation that requires block allocation off to another thread
to get enough stack space for normal operation.

> > So, my stupid idea is simply to expand the stack size and keep an eye

Not stupid: it's what I've been advocating we need to do for the past
3-4 years. XFS has always been the stack usage canary, and this issue
is basically a repeat of the 4k-stack-on-i386 kernel debacle.

> > on per-function stack consumption via ftrace's stack tracer. For
> > example, we could set a bar such that no function may exceed 200K,
> > and emit a warning at runtime when some function consumes more. Of
> > course, that could produce false positives, but at least it would
> > give us a chance to think it over.

I don't think that's a good idea. There are reasons for putting a
150-200 byte structure on the stack (e.g. it is used in a context
where allocation cannot be guaranteed to succeed, because forward
progress must be guaranteed). Hence having those users warn all the
time would quickly get very annoying, and the functionality would be
switched off or removed.
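
An invented illustration of the kind of user he means (the struct and
its fields are hypothetical, not from any real filesystem): the
buffer lives on the stack precisely because allocating it dynamically
from a reclaim or writeback path could itself wait on reclaim:

#include <linux/types.h>

/* Hypothetical example: roughly 200 bytes deliberately on the stack.
 * A kmalloc() here could recurse into the very reclaim path being
 * executed, so forward progress demands storage that cannot fail:
 * the stack. */
struct wb_cluster {
	sector_t	blocks[16];	/* 128 bytes on 64-bit */
	unsigned int	nr_blocks;
	unsigned int	flags;
	char		tag[48];
};

static void flush_cluster_example(void)
{
	struct wb_cluster wc = { .nr_blocks = 0 };

	/* ... fill wc and issue the IO; no allocation on this path ... */
	(void)wc;
}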

> > I guess this topic has been discussed several times, so there might
> > be a strong reason, unknown to me, not to increase the kernel stack
> > size on x86_64; hence I'm Ccing the x86_64 maintainers, other MM
> > folks, and the virtio maintainers.

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Michael S. Tsirkin
On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> While I was testing in-house patches under heavy memory pressure on
> qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
> stack overflow.
> 
> When I investigated the problem, the call stack was somewhat deeper
> because it involved reclaim functions, though not the direct reclaim
> path.
> 
> I tried to put some alloc/reclaim-related functions on a stack diet
> and saved a few hundred bytes, but the overflow didn't disappear; I
> just hit it again through another, deeper call stack on the
> reclaim/allocator path.
> 
> Of course, we could sweep every site we find to reduce stack usage,
> but I'm not sure for how long that would save the world (surely
> developers will keep adding nice features that eat stack again), and
> if we consider more complex features in the I/O layer and/or the
> reclaim path, it might be better to increase the stack size.
> (Meanwhile, stack usage on 64-bit machines has doubled compared to
> 32-bit while the stack has stayed at 8K. Hmm, that doesn't seem fair
> to me, and arm64 has already expanded to 16K.)
> 
> So, my stupid idea is simply to expand the stack size and keep an eye
> on per-function stack consumption via ftrace's stack tracer. For
> example, we could set a bar such that no function may exceed 200K,
> and emit a warning at runtime when some function consumes more. Of
> course, that could produce false positives, but at least it would
> give us a chance to think it over.
> 
> I guess this topic has been discussed several times, so there might
> be a strong reason, unknown to me, not to increase the kernel stack
> size on x86_64; hence I'm Ccing the x86_64 maintainers, other MM
> folks, and the virtio maintainers.
> 
> [ 1065.604404] kworker/-57660d..2 1071625990us : stack_trace_call:        Depth    Size   Location    (51 entries)
> [ 1065.604404]                                                            -----    ----   --------
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0

virtio stack usage seems very high.
Here is virtio_ring.su, generated using gcc 4.8.2's -fstack-usage flag.

virtio_ring.c:107:35:sg_next_arr        16  static


<--- this is a surprise, I really expected it to be inlined;
     same for sg_next_chained.
<--- Rusty: should we force the compiler to inline it?
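
If forcing is wanted, the usual tool is the kernel's __always_inline
annotation, which overrides gcc's inlining heuristics. A sketch with
a simplified body (the real sg_next_arr in virtio_ring.c differs):

#include <linux/compiler.h>
#include <linux/scatterlist.h>

/* __always_inline removes gcc's discretion here, so the helper can
 * no longer show up as a separate frame (or a separate .su entry). */
static __always_inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
						       unsigned int *i)
{
	(*i)++;			/* simplified stand-in for the real body */
	return sg + 1;
}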


virtio_ring.c:584:6:virtqueue_disable_cb        16  static
virtio_ring.c:604:10:virtqueue_enable_cb_prepare        16  static
virtio_ring.c:632:6:virtqueue_poll      16  static
virtio_ring.c:652:6:virtqueue_enable_cb 16  static
virtio_ring.c:845:14:virtqueue_get_vring_size   16  static
virtio_ring.c:854:6:virtqueue_is_broken 16  static
virtio_ring.c:101:35:sg_next_chained    16  static
virtio_ring.c:436:6:virtqueue_notify    24  static
virtio_ring.c:672:6:virtqueue_enable_cb_delayed 16  static
virtio_ring.c:820:6:vring_transport_features    16  static
virtio_ring.c:472:13:detach_buf 40  static
virtio_ring.c:518:7:virtqueue_get_buf   32  static
virtio_ring.c:812:6:vring_del_virtqueue 16  static
virtio_ring.c:394:6:virtqueue_kick_prepare      16  static
virtio_ring.c:464:6:virtqueue_kick      32  static
virtio_ring.c:186:19:4  16  static
virtio_ring.c:733:13:vring_interrupt    24  static
virtio_ring.c:707:7:virtqueue_detach_

Re: [RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-28 Thread Dave Chinner
[ cc XFS list ]

On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> While I was testing in-house patches under heavy memory pressure on
> qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
> stack overflow.
>
> When I investigated the problem, the call stack was somewhat deeper
> because it involved reclaim functions, though not the direct reclaim
> path.
>
> I tried to put some alloc/reclaim-related functions on a stack diet
> and saved a few hundred bytes, but the overflow didn't disappear; I
> just hit it again through another, deeper call stack on the
> reclaim/allocator path.
>
> Of course, we could sweep every site we find to reduce stack usage,
> but I'm not sure for how long that would save the world (surely
> developers will keep adding nice features that eat stack again), and
> if we consider more complex features in the I/O layer and/or the
> reclaim path, it might be better to increase the stack size.
> (Meanwhile, stack usage on 64-bit machines has doubled compared to
> 32-bit while the stack has stayed at 8K. Hmm, that doesn't seem fair
> to me, and arm64 has already expanded to 16K.)
>
> So, my stupid idea is simply to expand the stack size and keep an eye
> on per-function stack consumption via ftrace's stack tracer. For
> example, we could set a bar such that no function may exceed 200K,
> and emit a warning at runtime when some function consumes more. Of
> course, that could produce false positives, but at least it would
> give us a chance to think it over.
>
> I guess this topic has been discussed several times, so there might
> be a strong reason, unknown to me, not to increase the kernel stack
> size on x86_64; hence I'm Ccing the x86_64 maintainers, other MM
> folks, and the virtio maintainers.
>
> [ 1065.604404] kworker/-57660d..2 1071625990us : stack_trace_call:        Depth    Size   Location    (51 entries)
> [ 1065.604404]                                                            -----    ----   --------
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> [ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  22)     4720      48   get_swap_bio+0x30/0x90
> [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
>

[RFC 2/2] x86_64: expand kernel stack to 16K

2014-05-27 Thread Minchan Kim
While I was testing in-house patches under heavy memory pressure on
qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
stack overflow.

When I investigated the problem, the call stack was somewhat deeper
because it involved reclaim functions, though not the direct reclaim
path.

I tried to put some alloc/reclaim-related functions on a stack diet
and saved a few hundred bytes, but the overflow didn't disappear; I
just hit it again through another, deeper call stack on the
reclaim/allocator path.

Of course, we could sweep every site we find to reduce stack usage,
but I'm not sure for how long that would save the world (surely
developers will keep adding nice features that eat stack again), and
if we consider more complex features in the I/O layer and/or the
reclaim path, it might be better to increase the stack size.
(Meanwhile, stack usage on 64-bit machines has doubled compared to
32-bit while the stack has stayed at 8K. Hmm, that doesn't seem fair
to me, and arm64 has already expanded to 16K.)

So, my stupid idea is simply to expand the stack size and keep an eye
on per-function stack consumption via ftrace's stack tracer. For
example, we could set a bar such that no function may exceed 200K,
and emit a warning at runtime when some function consumes more. Of
course, that could produce false positives, but at least it would
give us a chance to think it over.
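
(A rough sketch of what such a runtime check could look like;
check_stack_left() and STACK_WATERMARK are invented for illustration,
and the existing ftrace stack tracer already records the deepest
stack seen in /sys/kernel/debug/tracing/stack_max_size:)

#include <linux/sched.h>
#include <linux/kernel.h>

/* Invented helper: warn once when less than STACK_WATERMARK bytes of
 * the current task's stack remain.  Assumes the x86_64 layout where
 * the stack grows down toward end_of_stack(current). */
#define STACK_WATERMARK	1024

static inline void check_stack_left(void)
{
	unsigned long sp  = (unsigned long)__builtin_frame_address(0);
	unsigned long low = (unsigned long)end_of_stack(current);

	WARN_ONCE(sp - low < STACK_WATERMARK,
		  "only %lu bytes of stack left\n", sp - low);
}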

I guess this topic has been discussed several times, so there might
be a strong reason, unknown to me, not to increase the kernel stack
size on x86_64; hence I'm Ccing the x86_64 maintainers, other MM
folks, and the virtio maintainers.

[ 1065.604404] kworker/-57660d..2 1071625990us : stack_trace_call:        Depth    Size   Location    (51 entries)
[ 1065.604404]                                                            -----    ----   --------
[ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
[ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
[ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
[ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
[ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
[ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
[ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
[ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
[ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
[ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
[ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
[ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
[ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
[ 1065.604404] kworker/-57660d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
[ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
[ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
[ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
[ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  22)     4720      48   get_swap_bio+0x30/0x90
[ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
[ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
[ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676