Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 10/20/2014 07:00 PM, Dave Jones wrote:
> On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
> > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin wrote:
> > >
> > > If we removed struct thread_info from the stack allocation then one
> > > could do a guard page below the stack. Of course, we'd have to use IST
> > > for #PF in that case, which makes it a non-production option.

Why is thread_info in the stack allocation anyway?

Every time I look at the entry asm, one (minor) thing that contributes to general brain-hurtingness / sense of horrified awe is the incomprehensible (to me) split between task_struct and thread_info.

struct thread_info is at the bottom of the stack, right? If we don't want to merge it into task_struct, couldn't we stick it at the top of the stack instead? Anything that can overwrite the *top* of the stack gives trivial user-controlled CPL0 execution regardless.

> > We could just have the guard page in between the stack and the
> > thread_info, take a double fault, and then just map it back in on
> > double fault.
> >
> > That would give us 8kB of "normal" stack, with a very loud fault - and
> > then an extra 7kB or so of stack (whatever the size of thread-info is)
> > - after the first time it traps.
> >
> > That said, it's still likely a non-production option due to the page
> > table games we'd have to play at fork/clone time.

What's wrong with vmalloc? Doesn't it already have guard pages?

(Also, we have a shiny hardware dirty bit, so we could relatively cheaply check whether we're near the limit without any weird #PF-in-weird-context issues.)

Also, muahaha, I've infected more people with the crazy idea that intentional double-faults are okay. Suckers! Soon I'll have Linux returning from interrupts with lret! (IIRC Windows used to do intentional *triple* faults on context switches, so this should be considered entirely sensible.)

> [thread necrophilia]
>
> So digging this back up, it occurs to me that after we bumped to 16K,
> we never did anything like the debug stuff you suggested here.
>
> The reason I'm bringing this up, is that the last few weeks, I've been
> seeing things like..
>
>   [27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left
>
> So we're now eating past that first 8KB in some situations.
>
> Do we care? Or shall we only start worrying if it gets even deeper?

I would *love* to have an immediate, loud failure when we overrun the stack. This will unavoidably increase the number of TLB misses, but that probably isn't so bad.

--Andy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin wrote:
> >
> > If we removed struct thread_info from the stack allocation then one
> > could do a guard page below the stack. Of course, we'd have to use IST
> > for #PF in that case, which makes it a non-production option.
>
> We could just have the guard page in between the stack and the
> thread_info, take a double fault, and then just map it back in on
> double fault.
>
> That would give us 8kB of "normal" stack, with a very loud fault - and
> then an extra 7kB or so of stack (whatever the size of thread-info is)
> - after the first time it traps.
>
> That said, it's still likely a non-production option due to the page
> table games we'd have to play at fork/clone time.

[thread necrophilia]

So digging this back up, it occurs to me that after we bumped to 16K, we never did anything like the debug stuff you suggested here.

The reason I'm bringing this up, is that the last few weeks, I've been seeing things like..

  [27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left

So we're now eating past that first 8KB in some situations.

Do we care? Or shall we only start worrying if it gets even deeper?

Dave
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Tue, Jun 3, 2014 at 6:28 AM, Rasmus Villemoes wrote:
> Possibly stupid question: Is it true that any given task can only be
> using one wait_queue_t at a time?

Nope. Being on multiple different wait-queues is actually very common. The obvious case is select/poll, but there are others.

The more subtle ones involve being on a wait-queue while doing something that can cause nested waiting (iow, it's technically not wrong to be on a wait-queue and then do a user space access, which obviously can end up doing IO).

That said, the case of a single wait-queue entry is another common case, and it wouldn't necessarily be wrong to have one pre-initialized wait queue entry in the task structure for that special case, for when you know that there is no possible nesting. And even if it *does* nest, if it's the "leaf" entry it could be used for that innermost nesting without worrying about other wait queue users (who use stack allocations or actual explicit allocations like poll).

So it might certainly be worth looking at. In fact, it might be worth it having multiple per-thread entries, so that we could get rid of the special on-stack allocation for poll too (and making one of them special and not available to poll, to handle the "leaf waiter" case).

So it's not necessarily a bad idea at all, even if the general case requires more than one (or a few) static per-thread allocations.

Anybody want to try to code it up?

Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Possibly stupid question: Is it true that any given task can only be using one wait_queue_t at a time? If so, would it be an idea to put a wait_queue_t into struct task_struct [maybe union'ed with a struct wait_bit_queue] and avoid allocating this 40-byte structure repeatedly on the stack.

E.g., in one of Minchan's stack traces, there are two calls of mempool_alloc (which itself declares a wait_queue_t) and one try_to_free_pages (which is the only caller of throttle_direct_reclaim, which in turn uses wait_event_interruptible_timeout and wait_event_killable).

Rasmus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Sat, May 31, 2014 at 6:06 AM, Jens Axboe wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
>>> Regardless of whether it is swap or something external queues the
>>> bio on the plug, perhaps we should look at why it's done inline
>>> rather than by kblockd, where it was moved because it was blowing
>>> the stack from schedule():
>>
>> So it sounds like we need to do this for io_schedule() too.
>>
>> In fact, we've generally found it to be a mistake every time we
>> "automatically" unblock some IO queue. And I'm not saying that because
>> of stack space, but because we've _often_ had the situation that eager
>> unblocking results in IO that could have been done as bigger requests.
>
> We definitely need to auto-unplug on the schedule path, otherwise we run
> into all sorts of trouble. But making it async off the IO schedule path is
> fine. By definition, it's not latency sensitive if we are hitting unplug on
> schedule. I'm pretty sure it was run inline on CPU concerns here, as running
> inline is certainly cheaper than punting to kblockd.
>
>> Looking at that callchain, I have to say that ext4 doesn't look
>> horrible compared to the whole block layer and virtio.. Yes,
>> "ext4_writepages()" is using almost 400 bytes of stack, and most of
>> that seems to be due to:
>>
>>   struct mpage_da_data mpd;
>>   struct blk_plug plug;
>
> Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the magic
> part of it, since that's never caught any bugs. Only saves 8 bytes, but may
> as well take that. Especially if we end up with nested plugs.

In the case of nested plugs, only the first one is used, right? So it could be embedded into task_struct together with an integer recursion counter. That would save a bit of precious stack and make things look cleaner.

>> Well, we've definitely had some issues with deeper callchains
>> with md, but I suspect virtio might be worse, and the new blk-mq code
>> is likely worse in this respect too.
>
> I don't think blk-mq is worse than the older stack, in fact it should be
> better. The call chains are shorter, and a lot less cruft on the stack.
> Historically the stack issues have been nested devices, however. And for
> sync IO, we do run it inline, so if the driver chews up a lot of stack,
> well...
>
> Looks like I'm late here and the decision has been made to go 16K stacks,
> which I think is a good one. We've been living on the edge (and sometimes
> over) for heavy dm/md setups for a while, and have been patching around that
> fact in the IO stack for years.
>
> --
> Jens Axboe
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 08:06:53PM -0600, Jens Axboe wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
> > Well, we've definitely had some issues with deeper callchains
> > with md, but I suspect virtio might be worse, and the new blk-mq code
> > is likely worse in this respect too.
>
> I don't think blk-mq is worse than the older stack, in fact it
> should be better. The call chains are shorter, and a lot less cruft
> on the stack. Historically the stack issues have been nested
> devices, however. And for sync IO, we do run it inline, so if the
> driver chews up a lot of stack, well...

Hi Jens - as we found out with the mm code, there's a significant disconnect between what the code looks like (i.e. it may use very little stack directly) and what the compiler is generating.

Before blk-mq:

   9)  3952  112  scsi_request_fn+0x4b/0x490
  10)  3840   32  __blk_run_queue+0x37/0x50
  11)  3808   64  queue_unplugged+0x39/0xb0
  12)  3744  112  blk_flush_plug_list+0x20b/0x240

Now with blk-mq:

   3)  4672   96  virtio_queue_rq+0xd2/0x1e0
   4)  4576  128  __blk_mq_run_hw_queue+0x1f0/0x3e0
   5)  4448   16  blk_mq_run_hw_queue+0x35/0x40
   6)  4432   80  blk_mq_insert_requests+0xc7/0x130
   7)  4352   96  blk_mq_flush_plug_list+0x129/0x140
   8)  4256  112  blk_flush_plug_list+0xe7/0x230

So previously flushing a plug used roughly 200 bytes of stack. With blk-mq, it's over 400 bytes. IOWs, blk-mq has more than doubled the block layer stack usage...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 2014-05-28 20:42, Linus Torvalds wrote:
>> Regardless of whether it is swap or something external queues the
>> bio on the plug, perhaps we should look at why it's done inline
>> rather than by kblockd, where it was moved because it was blowing
>> the stack from schedule():
>
> So it sounds like we need to do this for io_schedule() too.
>
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.

We definitely need to auto-unplug on the schedule path, otherwise we run into all sorts of trouble. But making it async off the IO schedule path is fine. By definition, it's not latency sensitive if we are hitting unplug on schedule. I'm pretty sure it was run inline on CPU concerns here, as running inline is certainly cheaper than punting to kblockd.

> Looking at that callchain, I have to say that ext4 doesn't look
> horrible compared to the whole block layer and virtio.. Yes,
> "ext4_writepages()" is using almost 400 bytes of stack, and most of
> that seems to be due to:
>
>   struct mpage_da_data mpd;
>   struct blk_plug plug;

Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the magic part of it, since that's never caught any bugs. Only saves 8 bytes, but may as well take that. Especially if we end up with nested plugs.

> Well, we've definitely had some issues with deeper callchains
> with md, but I suspect virtio might be worse, and the new blk-mq code
> is likely worse in this respect too.

I don't think blk-mq is worse than the older stack, in fact it should be better. The call chains are shorter, and a lot less cruft on the stack. Historically the stack issues have been nested devices, however. And for sync IO, we do run it inline, so if the driver chews up a lot of stack, well...

Looks like I'm late here and the decision has been made to go 16K stacks, which I think is a good one. We've been living on the edge (and sometimes over) for heavy dm/md setups for a while, and have been patching around that fact in the IO stack for years.

--
Jens Axboe
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 9:37 PM, Linus Torvalds wrote:
>
> It really might be very good to create a "struct alloc_info" that
> contains those shared arguments, and just pass a (const) pointer to
> that around.
[ .. ]
>
> Ugh. I think I'll try looking at that tomorrow.

I did look at it, but the thing is horrible. I started on this something like ten times, and always ended up running away screaming. Some things are truly fixed (notably "order"), but most things end up changing subtly halfway through the callchain.

I might look at it some more later, but people may have noticed that I decided to just apply Minchan's original patch in the meantime. I'll make an rc8 this weekend..

Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Linus Torvalds writes:
> From a quick glance at the frame usage, some of it seems to be gcc
> being rather bad at stack allocation, but lots of it is just nasty
> spilling around the disgusting call-sites with tons of arguments. A
> _lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
> "spill frame pointer", afaik).
>
> Avoiding some inlining, and using a single flag value rather than the
> collection of "bool"s would probably help. But nothing really
> trivially obvious stands out.

One thing that may be worth playing around with is gcc's

  --param large-stack-frame
  --param large-stack-frame-growth

These tell the inliner when to stop inlining because too much stack would be used. We build with -fconserve-stack, I believe, so perhaps smaller values than 100 and 400 would make sense to try.

  -fconserve-stack
      Attempt to minimize stack usage. The compiler attempts to use less
      stack space, even if that makes the program slower. This option
      implies setting the large-stack-frame parameter to 100 and the
      large-stack-frame-growth parameter to 400.

-Andi

--
a...@linux.intel.com -- Speaking for myself only
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/30/2014 10:24 AM, Dave Hansen wrote:
> On 05/30/2014 09:06 AM, Linus Torvalds wrote:
>> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin wrote:
>>>> That said, it's still likely a non-production option due to the page
>>>> table games we'd have to play at fork/clone time.
>>>
>>> Still, seems much more tractable.
>>
>> We might be able to make it more attractive by having a small
>> front-end cache of the 16kB allocations with the second page unmapped.
>> That would at least capture the common "lots of short-lived processes"
>> case without having to do kernel page table work.
>
> If we want to use 4k mappings, we'd need to move the stack over to using
> vmalloc() (or at least be out of the linear mapping) to avoid breaking
> up the linear map's page tables too much. Doing that, we'd actually not
> _have_ to worry about fragmentation, and we could actually utilize the
> per-cpu-pageset code since we could be back to using order-0 pages.
> So it's at least not all a loss. Although, I do remember playing with
> 4k stacks back in the 32-bit days and not getting much of a win with it.
>
> We'd definitely need that cache, if for no other reason than the
> vmalloc/vmap code as-is isn't super-scalable.

I don't think we want to use 4K mappings for production...

-hpa
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/30/2014 09:06 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin wrote:
>>> That said, it's still likely a non-production option due to the page
>>> table games we'd have to play at fork/clone time.
>>
>> Still, seems much more tractable.
>
> We might be able to make it more attractive by having a small
> front-end cache of the 16kB allocations with the second page unmapped.
> That would at least capture the common "lots of short-lived processes"
> case without having to do kernel page table work.

If we want to use 4k mappings, we'd need to move the stack over to using vmalloc() (or at least be out of the linear mapping) to avoid breaking up the linear map's page tables too much. Doing that, we'd actually not _have_ to worry about fragmentation, and we could actually utilize the per-cpu-pageset code since we could be back to using order-0 pages. So it's at least not all a loss. Although, I do remember playing with 4k stacks back in the 32-bit days and not getting much of a win with it.

We'd definitely need that cache, if for no other reason than the vmalloc/vmap code as-is isn't super-scalable.
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin wrote:
>>
>> That said, it's still likely a non-production option due to the page
>> table games we'd have to play at fork/clone time.
>
> Still, seems much more tractable.

We might be able to make it more attractive by having a small front-end cache of the 16kB allocations with the second page unmapped. That would at least capture the common "lots of short-lived processes" case without having to do kernel page table work.

Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/30/2014 08:41 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin wrote:
>> If we removed struct thread_info from the stack allocation then one
>> could do a guard page below the stack. Of course, we'd have to use IST
>> for #PF in that case, which makes it a non-production option.
>
> We could just have the guard page in between the stack and the
> thread_info, take a double fault, and then just map it back in on
> double fault.

Oh, duh. Right, much better. Similar to the espfix64 hack, too.

> That would give us 8kB of "normal" stack, with a very loud fault - and
> then an extra 7kB or so of stack (whatever the size of thread-info is)
> - after the first time it traps.
>
> That said, it's still likely a non-production option due to the page
> table games we'd have to play at fork/clone time.

Still, seems much more tractable. I would still like struct thread_info off the stack allocation for other reasons (as we have discussed in the past.)

-hpa
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin wrote:
> If we removed struct thread_info from the stack allocation then one
> could do a guard page below the stack. Of course, we'd have to use IST
> for #PF in that case, which makes it a non-production option.

We could just have the guard page in between the stack and the thread_info, take a double fault, and then just map it back in on double fault.

That would give us 8kB of "normal" stack, with a very loud fault - and then an extra 7kB or so of stack (whatever the size of thread-info is) - after the first time it traps.

That said, it's still likely a non-production option due to the page table games we'd have to play at fork/clone time.

Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 2:48 AM, Richard Weinberger wrote:
> If we raise the stack size on x86_64 to 16k, what about i386?
> Beside the fact that most of you consider 32 bits as dead and must die... ;)

x86-32 doesn't have nearly the same issue, since a large portion of stack content tends to be pointers and longs. So it's not like it uses half the stack, but a 32-bit environment does use a lot less stack than a 64-bit one.

Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/29/2014 06:34 PM, Dave Chinner wrote:
>> ...
>> "kworker/u24:1 (94) used greatest stack depth: 8K bytes left", it means
>> there is some horrible stack hogger in your kernel. Please report it
>> to the LKML and enable stacktrace to investigate who is the culprit.
>
> That, however, presumes that a user can reproduce the problem on
> demand. Experience tells me that this is the exception rather than
> the norm for production systems, and so capturing the stack in real
> time is IMO the only useful thing we could add...

If we removed struct thread_info from the stack allocation then one could do a guard page below the stack. Of course, we'd have to use IST for #PF in that case, which makes it a non-production option.

-hpa
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 5:24 PM, Linus Torvalds wrote:
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.
>
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.

If we raise the stack size on x86_64 to 16k, what about i386? Beside the fact that most of you consider 32 bits as dead and must die... ;)

--
Thanks,
//richard
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim wrote:
>>>
>>> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
>>
>> Sure. While I write this, Rusty's test crashed, so I will try Dave's
>> patch, then yours except the vmscan.c part.
>
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
>
> I'd *much* rather slow down the swap side. Not "real IO". So I think
> my mm/vmscan.c patch is preferable (but yes, it might require some
> work to make kswapd do better).
>
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.
>
> Linus

I tested the patch below and the result is endless OOM although there are lots of anon pages and empty space in swap. I guess __alloc_pages_direct_reclaim couldn't proceed due to anon pages once the VM drops most of the file-backed pages, then it goes to OOM.
---
 mm/backing-dev.c | 25 +++++++++++++++----------
 mm/vmscan.c      |  4 +---
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7a4f29..2762b16404bd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include <linux/writeback.h>
 #include <linux/device.h>
 #include <trace/events/writeback.h>
+#include <linux/blkdev.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -565,6 +566,18 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -578,12 +591,8 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
@@ -614,8 +623,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
@@ -635,9 +642,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	}
 
 	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..e4ad7cd1885b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -975,9 +975,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * avoid risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-			    (!current_is_kswapd() ||
-			     !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
-- 
1.9.2

--
Kind regards,
Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Final result: I tested the machine with the patch below (Dave's suggestion plus some parts I modified) and I couldn't see the problem any more (tested 4hr; I will queue it on the machine over the weekend for a long-running test if I don't get a more enhanced version before leaving the office today), but as I reported in the interim result, the VM's stack usage is still high. Anyway, that's another issue; we should really put the VM functions on a diet (e.g., uninlining the slow-path part of __alloc_pages_nodemask, the alloc_info idea from Linus, and more).

Looking forward to seeing the blk_plug_start_async way. Thanks, Dave!

---
 block/blk-core.c    | 2 +-
 block/blk-mq.c      | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 			trace_block_plug(q);
 		else {
 			if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 		}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		if (list_empty(&plug->mq_list))
 			trace_block_plug(q);
 		else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-			blk_flush_plug_list(plug, false);
+			blk_flush_plug_list(plug, true);
 			trace_block_plug(q);
 		}
 		list_add_tail(&rq->queuelist, &plug->mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-- 
1.9.2

On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > > Hello Dave,
> > >
> > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > >
> > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > > Author: Jens Axboe
> > > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > >
> > > > >     block: let io_schedule() flush the plug inline
> > > > >
> > > > >     Linus correctly observes that the most important dispatch cases
> > > > >     are now done from kblockd, this isn't ideal for latency reasons.
> > > > >     The original reason for switching dispatches out-of-line was to
> > > > >     avoid too deep a stack, so by _only_ letting the "accidental"
> > > > >     flush directly in schedule() be guarded by offload to kblockd,
> > > > >     we should be able to get the best of both worlds.
> > > > >
> > > > >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > >     and only use that from the schedule() path.
> > > > >
> > > > >     Signed-off-by: Jens Axboe
> > > > >
> > > > > And now we have too deep a stack due to unplugging from
> > > > > io_schedule()...
> > > >
> > > > So, if we make io_schedule() push the plug list off to the kblockd
> > > > like is done for schedule()
> > >
> > > I did the hacky test below to apply your idea and the result is
> > > overflow again. So, again it would need the second stack expansion.
> > > Otherwise, we should prevent swapout in direct reclaim.
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5c6635b806c..95f169e85dbe 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> > >  void __sched io_schedule(void)
> > >  {
> > >  	struct rq *rq = raw_rq();
> > > +	struct blk_plug *plug = current->plug;
> > >
> > >  	delayacct_blkio_start();
> > >  	atomic_inc(&rq->nr_iowait);
> > > -	blk_flush_plug(current);
> > > +	if (plug)
> > > +		blk_flush_plug_list(plug, true);
> > > +
> > >  	current->in_iowait = 1;
> > >  	schedule();
> > >  	current->in_iowait = 0;
> > > .....
> > >
> > >         Depth    Size   Location    (46 entries)
> > >   0)     7200       8
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/30/2014 10:24 AM, Dave Hansen wrote:
> On 05/30/2014 09:06 AM, Linus Torvalds wrote:
>> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin wrote:
>>>> That said, it's still likely a non-production option due to the page
>>>> table games we'd have to play at fork/clone time.
>>>
>>> Still, seems much more tractable.
>>
>> We might be able to make it more attractive by having a small
>> front-end cache of the 16kB allocations with the second page unmapped.
>> That would at least capture the common "lots of short-lived processes"
>> case without having to do kernel page table work.
>
> If we want to use 4k mappings, we'd need to move the stack over to
> using vmalloc() (or at least be out of the linear mapping) to avoid
> breaking up the linear map's page tables too much.
>
> Doing that, we'd actually not _have_ to worry about fragmentation, and
> we could actually utilize the per-cpu-pageset code since we could be
> back to using order-0 pages. So it's at least not all a loss. Although,
> I do remember playing with 4k stacks back in the 32-bit days and not
> getting much of a win with it.
>
> We'd definitely need that cache, if for no other reason than the
> vmalloc/vmap code as-is isn't super-scalable.

I don't think we want to use 4K mappings for production...

-hpa
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Linus Torvalds <torva...@linux-foundation.org> writes:
> From a quick glance at the frame usage, some of it seems to be gcc
> being rather bad at stack allocation, but lots of it is just nasty
> spilling around the disgusting call-sites with tons of arguments. A
> _lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
> "spill frame pointer", afaik).
>
> Avoiding some inlining, and using a single flag value rather than the
> collection of bools would probably help. But nothing really trivially
> obvious stands out.

One thing that may be worth playing around with is gcc's

    --param large-stack-frame
    --param large-stack-frame-growth

This tells the inliner to stop inlining when too much stack would be used. We use -fconserve-stack, I believe, so perhaps smaller values than 100 and 400 would make sense to try.

    -fconserve-stack
        Attempt to minimize stack usage. The compiler attempts to use less
        stack space, even if that makes the program slower. This option
        implies setting the large-stack-frame parameter to 100 and the
        large-stack-frame-growth parameter to 400.

-Andi

--
a...@linux.intel.com -- Speaking for myself only
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 9:37 PM, Linus Torvalds torva...@linux-foundation.org wrote: It really might be very good to create a struct alloc_info that contains those shared arguments, and just pass a (const) pointer to that around. [ .. ] Ugh. I think I'll try looking at that tomorrow. I did look at it, but the thing is horrible. I started on this something like ten times, and always ended up running away screaming. Some things are truly fixed (notably order), but most things end up changing subtly halfway through the callchain. I might look at it some more later, but people may have noticed that I decided to just apply Minchan's original patch in the meantime. I'll make an rc8 this weekend.. Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 2014-05-28 20:42, Linus Torvalds wrote: Regardless of whether it is swap or something external queues the bio on the plug, perhaps we should look at why it's done inline rather than by kblockd, where it was moved because it was blowing the stack from schedule(): So it sounds like we need to do this for io_schedule() too. In fact, we've generally found it to be a mistake every time we automatically unblock some IO queue. And I'm not saying that because of stack space, but because we've _often_ had the situation that eager unblocking results in IO that could have been done as bigger requests. We definitely need to auto-unplug on the schedule path, otherwise we run into all sorts of trouble. But making it async off the IO schedule path is fine. By definition, it's not latency sensitive if we are hitting unplug on schedule. I'm pretty sure it was run inline due to CPU concerns here, as running inline is certainly cheaper than punting to kblockd. Looking at that callchain, I have to say that ext4 doesn't look horrible compared to the whole block layer and virtio.. Yes, ext4_writepages() is using almost 400 bytes of stack, and most of that seems to be due to: struct mpage_da_data mpd; struct blk_plug plug; Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the magic part of it, since that's never caught any bugs. Only saves 8 bytes, but may as well take that. Especially if we end up with nested plugs. Well, we've definitely had some issues with deeper callchains with md, but I suspect virtio might be worse, and the new blk-mq code is likely worse in this respect too. I don't think blk-mq is worse than the older stack; in fact it should be better. The call chains are shorter, and there's a lot less cruft on the stack. Historically the stack issues have been nested devices, however. And for sync IO, we do run it inline, so if the driver chews up a lot of stack, well... 
Looks like I'm late here and the decision has been made to go 16K stacks, which I think is a good one. We've been living on the edge (and sometimes over) for heavy dm/md setups for a while, and have been patching around that fact in the IO stack for years. -- Jens Axboe
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 7:12 PM, Minchan Kim wrote: > > Interim report, > > And the result is as follows. It reduces about 800 bytes compared to > my first report, but stack usage still seems to be high. > The VM functions really need a diet. Yes. And in this case uninlining things might actually help, because it's not actually performing reclaim in the second case, so inlining the reclaim code into that huge __alloc_pages_nodemask() function means that it has the stack frame for all those cases even if they don't actually get used. That said, the way those functions are set up (with lots of arguments passed from one to the other), not inlining will cause huge costs too for the argument setup. It really might be very good to create a "struct alloc_info" that contains those shared arguments, and just pass a (const) pointer to that around. Gcc would likely tend to be *much* better at generating code for that, because it avoids a ton of temporaries being created by function calls. Even when it's inlined, the argument itself ends up being a new temporary internally, and I suspect one reason gcc (especially your 4.6.3 version, apparently) generates those big spill frames is because there's tons of these duplicate temporaries that apparently don't get merged properly. Ugh. I think I'll try looking at that tomorrow. Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 6:58 PM, Dave Chinner wrote: > > If the patch I sent solves the swap stack usage issue, then perhaps > we should look towards adding "blk_plug_start_async()" to pass such > hints to the plug flushing. I'd want to use the same behaviour in > __xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and > probably also in mpage_writepages() for bulk data writeback in > WB_SYNC_NONE context... Yeah, adding a flag to the plug about what kind of plug it is does sound quite reasonable. It already has that "magic" field, it could easily be extended to have a "async" vs "sync" bit to it.. Of course, it's also possible that the unplugging code could just look at the actual requests that are plugged to determine that, and maybe we wouldn't even need to mark things specially. I don't think we ever end up mixing reads and writes under the same plug, so "first request is a write" is probably a good approximation for "async". Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote: > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote: > > Hello Dave, > > > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote: > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote: > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote: > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca > > > > Author: Jens Axboe > > > > Date: Sat Apr 16 13:27:55 2011 +0200 > > > > > > > > block: let io_schedule() flush the plug inline > > > > > > > > Linus correctly observes that the most important dispatch cases > > > > are now done from kblockd, this isn't ideal for latency reasons. > > > > The original reason for switching dispatches out-of-line was to > > > > avoid too deep a stack, so by _only_ letting the "accidental" > > > > flush directly in schedule() be guarded by offload to kblockd, > > > > we should be able to get the best of both worlds. > > > > > > > > So add a blk_schedule_flush_plug() that offloads to kblockd, > > > > and only use that from the schedule() path. > > > > > > > > Signed-off-by: Jens Axboe > > > > > > > > And now we have too deep a stack due to unplugging from io_schedule()... > > > > > > So, if we make io_schedule() push the plug list off to the kblockd > > > like is done for schedule() > > > I did the below hacky test to apply your idea and the result is overflow again. > > So, again, it argues for a second stack expansion. Otherwise, we should prevent > > swapout in direct reclaim. 
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index f5c6635b806c..95f169e85dbe 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to); > > void __sched io_schedule(void) > > { > > struct rq *rq = raw_rq(); > > + struct blk_plug *plug = current->plug; > > > > delayacct_blkio_start(); > > atomic_inc(&rq->nr_iowait); > > - blk_flush_plug(current); > > + if (plug) > > + blk_flush_plug_list(plug, true); > > + > > current->in_iowait = 1; > > schedule(); > > current->in_iowait = 0; > > . > > > Depth Size Location (46 entries) > > > > 0) 7200 8 _raw_spin_lock_irqsave+0x51/0x60 > > 1) 7192 296 get_page_from_freelist+0x886/0x920 > > 2) 6896 352 __alloc_pages_nodemask+0x5e1/0xb20 > > 3) 6544 8 alloc_pages_current+0x10f/0x1f0 > > 4) 6536 168 new_slab+0x2c5/0x370 > > 5) 6368 8 __slab_alloc+0x3a9/0x501 > > 6) 6360 80 __kmalloc+0x1cb/0x200 > > 7) 6280 376 vring_add_indirect+0x36/0x200 > > 8) 5904 144 virtqueue_add_sgs+0x2e2/0x320 > > 9) 5760 288 __virtblk_add_req+0xda/0x1b0 > > 10) 5472 96 virtio_queue_rq+0xd3/0x1d0 > > 11) 5376 128 __blk_mq_run_hw_queue+0x1ef/0x440 > > 12) 5248 16 blk_mq_run_hw_queue+0x35/0x40 > > 13) 5232 96 blk_mq_insert_requests+0xdb/0x160 > > 14) 5136 112 blk_mq_flush_plug_list+0x12b/0x140 > > 15) 5024 112 blk_flush_plug_list+0xc7/0x220 > > 16) 4912 128 blk_mq_make_request+0x42a/0x600 > > 17) 4784 48 generic_make_request+0xc0/0x100 > > 18) 4736 112 submit_bio+0x86/0x160 > > 19) 4624 160 __swap_writepage+0x198/0x230 > > 20) 4464 32 swap_writepage+0x42/0x90 > > 21) 4432 320 shrink_page_list+0x676/0xa80 > > 22) 4112 208 shrink_inactive_list+0x262/0x4e0 > > 23) 3904 304 shrink_lruvec+0x3e1/0x6a0 > > The device is supposed to be plugged here in shrink_lruvec(). > > Oh, a plug can only hold 16 individual bios, and then it does a > synchronous flush. 
Hmmm - perhaps that should also defer the flush > to the kblockd, because if we are overrunning a plug then we've > already surrendered IO dispatch latency > > So, in blk_mq_make_request(), can you do: > > if (list_empty(&plug->mq_list)) > trace_block_plug(q); > else if (request_count >= BLK_MAX_REQUEST_COUNT) { > - blk_flush_plug_list(plug, false); > + blk_flush_plug_list(plug, true); > trace_block_plug(q); > } > list_add_tail(&rq->queuelist, &plug->mq_list); > > To see if that defers all the swap IO to kblockd? > Interim report, I applied below (we need to fix io_schedule_timeout due to mempool_alloc) diff --git a/block/blk-core.c b/block/blk-core.c index bfe16d5af9f9..0c81aacec75b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1585,7 +1585,7 @@ get_rq: trace_block_plug(q); else { if (request_count >= BLK_MAX_REQUEST_COUNT) { - blk_flush_plug_list(plug, false);
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote: > On Thu, May 29, 2014 at 5:50 PM, Minchan Kim wrote: > >> > >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part. > > > > Sure. While I write this, Rusty's test was crached so I will try Dave's > > patch, > > them yours except vmscan.c part. > > Looking more at Dave's patch (well, description), I don't think there > is any way in hell we can ever apply it. If I read it right, it will > cause all IO that overflows the max request count to go through the > scheduler to get it flushed. Maybe I misread it, but that's definitely > not acceptable. Maybe it's not noticeable with a slow rotational > device, but modern ssd hardware? No way. > > I'd *much* rather slow down the swap side. Not "real IO". So I think > my mm/vmscan.c patch is preferable (but yes, it might require some > work to make kswapd do better). > > So you can try Dave's patch just to see what it does for stack depth, > but other than that it looks unacceptable unless I misread things. Yeah, it's a hack, not intended as a potential solution. I'm thinking, though, that plug flushing behaviour is actually dependent on plugger context and there is no one "correct" behaviour. If we are doing process driven IO, then we want to do immediate dispatch, but for IO where stack is an issue or is for bulk throughput (e.g. background writeback) async dispatch through kblockd is desirable. If the patch I sent solves the swap stack usage issue, then perhaps we should look towards adding "blk_plug_start_async()" to pass such hints to the plug flushing. I'd want to use the same behaviour in __xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and probably also in mpage_writepages() for bulk data writeback in WB_SYNC_NONE context Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 09:32:19AM +0900, Minchan Kim wrote: > On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote: > > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote: > > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote: > > > > > > > That sounds like a plan. Perhaps it would be useful to add a > > > > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond > > > > 8k) so that we get some indication that we're hitting a deep stack > > > > but the system otherwise keeps functioning. That gives us some > > > > motivation to keep stack usage down but isn't a fatal problem like > > > > it is now > > > > > > We have check_stack_usage() and DEBUG_STACK_USAGE for this. > > > Though it needs some tweaking if we move to 16K > > > > Right, but it doesn't throw loud warnings when a specific threshold > > is reached - it just issues a quiet message when a process exits > > telling you what the maximum was without giving us a stack to chew > > on > > But we could enhance the message to notify the user of the risk, > as follows: > > ... > "kworker/u24:1 (94) used greatest stack depth: 8K bytes left, it means > there is some horrible stack hogger in your kernel. Please report it > to the LKML and enable stacktrace to investigate who is the culprit" That, however, presumes that a user can reproduce the problem on demand. Experience tells me that this is the exception rather than the norm for production systems, and so capturing the stack in real time is IMO the only useful thing we could add... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 5:05 PM, Linus Torvalds wrote: > > So maybe test a patch something like the attached. > > NOTE! This is absolutely TOTALLY UNTESTED! It's still untested, but I realized that the whole "blk_flush_plug_list(plug, true);" thing is pointless, since schedule() itself will do that for us. So I think you can remove the + struct blk_plug *plug = current->plug; + if (plug) + blk_flush_plug_list(plug, true); part from congestion_timeout(). Not that it should *hurt* to have it there, so I'm not bothering to send a changed patch. And again, no actual testing by me on any of this, just looking at the code. Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 5:50 PM, Minchan Kim wrote: >> >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part. > > Sure. While I write this, Rusty's test crashed, so I will try Dave's patch, > then yours, except the vmscan.c part. Looking more at Dave's patch (well, description), I don't think there is any way in hell we can ever apply it. If I read it right, it will cause all IO that overflows the max request count to go through the scheduler to get it flushed. Maybe I misread it, but that's definitely not acceptable. Maybe it's not noticeable with a slow rotational device, but modern ssd hardware? No way. I'd *much* rather slow down the swap side. Not "real IO". So I think my mm/vmscan.c patch is preferable (but yes, it might require some work to make kswapd do better). So you can try Dave's patch just to see what it does for stack depth, but other than that it looks unacceptable unless I misread things. Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 05:31:42PM -0700, Linus Torvalds wrote: > On Thu, May 29, 2014 at 5:20 PM, Minchan Kim wrote: > > > > I guess this part which avoids swapout in direct reclaim would be key > > if this patch were successful. But it could make anon pages rotate back > > into the inactive list's head from the tail in the direct reclaim path until kswapd can > > catch up. And kswapd can swap out anon pages from the tail of the inactive > > LRU, so I suspect it could cause LRU churning as a side effect. > > Oh, it could make bad things happen, no question about that. > > That said, those bad things are what happens to shared mapped pages > today, so in that sense it's not new. But large dirty shared mmap's > have traditionally been a great way to really hurt our VM, so "it > should work as well as shared mapping pages" is definitely not a > ringing endorsement! True. > > (Of course, *if* we can improve kswapd behavior for both swap-out and > shared dirty pages, that would then be a double win, so there is > _some_ argument for saying that we should aim to handle both kinds of > pages equally). Just an idea for preventing LRU churn: we could return the pages to the tail of the inactive list instead of the head if they aren't proper pages in this context, and the reclaimer could use a cursor as the list_head instead of the LRU head to scan victim pages, recording the cursor somewhere like the lruvec after shrinking is done. It makes the VM code more complicated but is worth trying if we approach it that way. > > > Anyway, I will queue it on the testing machine since Rusty's test is done. > > You could also try Dave's patch, and _not_ do my mm/vmscan.c part. Sure. While I write this, Rusty's test crashed, so I will try Dave's patch, then yours, except the vmscan.c part. Thanks. > > Linus 
-- Kind regards, Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 5:20 PM, Minchan Kim wrote: > > I guess this part which avoids swapout in direct reclaim would be key > if this patch were successful. But it could make anon pages rotate back > into the inactive list's head from the tail in the direct reclaim path until kswapd can > catch up. And kswapd can swap out anon pages from the tail of the inactive > LRU, so I suspect it could cause LRU churning as a side effect. Oh, it could make bad things happen, no question about that. That said, those bad things are what happens to shared mapped pages today, so in that sense it's not new. But large dirty shared mmap's have traditionally been a great way to really hurt our VM, so "it should work as well as shared mapping pages" is definitely not a ringing endorsement! (Of course, *if* we can improve kswapd behavior for both swap-out and shared dirty pages, that would then be a double win, so there is _some_ argument for saying that we should aim to handle both kinds of pages equally). > Anyway, I will queue it on the testing machine since Rusty's test is done. You could also try Dave's patch, and _not_ do my mm/vmscan.c part. Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote: > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote: > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote: > > > > > That sounds like a plan. Perhaps it would be useful to add a > > > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond > > > 8k) so that we get some indication that we're hitting a deep stack > > > but the system otherwise keeps functioning. That gives us some > > > motivation to keep stack usage down but isn't a fatal problem like > > > it is now > > > > We have check_stack_usage() and DEBUG_STACK_USAGE for this. > > Though it needs some tweaking if we move to 16K > > Right, but it doesn't throw loud warnings when a specific threshold > is reached - it just issues a quiet message when a process exits > telling you what the maximum was without giving us a stack to chew > on But we could enhance the message to notify the user of the risk, as follows: ... "kworker/u24:1 (94) used greatest stack depth: 8K bytes left, it means there is some horrible stack hogger in your kernel. Please report it to the LKML and enable stacktrace to investigate who is the culprit" > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Kind regards, Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote: > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote: > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote: > > > > > That sounds like a plan. Perhaps it would be useful to add a > > > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond > > > 8k) so that we get some indication that we're hitting a deep stack > > > but the system otherwise keeps functioning. That gives us some > > > motivation to keep stack usage down but isn't a fatal problem like > > > it is now > > > > We have check_stack_usage() and DEBUG_STACK_USAGE for this. > > Though it needs some tweaking if we move to 16K > > Right, but it doesn't throw loud warnings when a specific threshold > is reached - it just issues a quiet message when a process exits > telling you what the maximum was without giving us a stack to chew > on Ah, right, good point. That would be more useful. Dave
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote: > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote: > > > That sounds like a plan. Perhaps it would be useful to add a > > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond > > 8k) so that we get some indication that we're hitting a deep stack > > but the system otherwise keeps functioning. That gives us some > > motivation to keep stack usage down but isn't a fatal problem like > > it is now > > We have check_stack_usage() and DEBUG_STACK_USAGE for this. > Though it needs some tweaking if we move to 16K Right, but it doesn't throw loud warnings when a specific threshold is reached - it just issues a quiet message when a process exits telling you what the maximum was without giving us a stack to chew on Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Hello Linus, On Thu, May 29, 2014 at 05:05:17PM -0700, Linus Torvalds wrote: > On Thu, May 29, 2014 at 4:36 PM, Minchan Kim wrote: > > > > I did the below hacky test to apply your idea and the result is overflow again. > > So, again, it argues for a second stack expansion. Otherwise, we should prevent > > swapout in direct reclaim. > > So changing io_schedule() is bad, for the reasons I outlined elsewhere > (we use it for wait_for_page*() - see sleep_on_page()). > > It's the congestion waiting where the io_schedule() should be avoided. > > So maybe test a patch something like the attached. > > NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible > horrible things. It seems to compile, but I have absolutely no reason > to believe that it would work. I didn't actually test that this moves > anything at all to kblockd. So think of it as a concept patch that > *might* work, but as Dave said, there might also be other things that > cause unplugging and need some tough love. > >Linus > mm/backing-dev.c | 28 ++-- > mm/vmscan.c | 4 +--- > 2 files changed, 19 insertions(+), 13 deletions(-) > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 09d9591b7708..cb26b24c2da2 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > > static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0); > > @@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int > sync) > } > EXPORT_SYMBOL(set_bdi_congested); > > +static long congestion_timeout(int sync, long timeout) > +{ > + long ret; > + DEFINE_WAIT(wait); > + struct blk_plug *plug = current->plug; > + wait_queue_head_t *wqh = &congestion_wqh[sync]; > + > + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > + if (plug) > + blk_flush_plug_list(plug, true); > + ret = schedule_timeout(timeout); > + finish_wait(wqh, &wait); > + return ret; > +} > + > /** > * congestion_wait - wait for a backing_dev to become uncongested > * @sync: SYNC or ASYNC IO > @@ -586,12 +602,8 @@ long 
congestion_wait(int sync, long timeout) > { > long ret; > unsigned long start = jiffies; > - DEFINE_WAIT(wait); > - wait_queue_head_t *wqh = &congestion_wqh[sync]; > > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > - ret = io_schedule_timeout(timeout); > - finish_wait(wqh, &wait); > + ret = congestion_timeout(sync, timeout); > > trace_writeback_congestion_wait(jiffies_to_usecs(timeout), > jiffies_to_usecs(jiffies - start)); > @@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long > timeout) > { > long ret; > unsigned long start = jiffies; > - DEFINE_WAIT(wait); > - wait_queue_head_t *wqh = &congestion_wqh[sync]; > > /* >* If there is no congestion, or heavy congestion is not being > @@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long > timeout) > } > > /* Sleep until uncongested or a write happens */ > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > - ret = io_schedule_timeout(timeout); > - finish_wait(wqh, &wait); > + ret = congestion_timeout(sync, timeout); > > out: > trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout), > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 32c661d66a45..1e524000b83e 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head > *page_list, >* avoid risk of stack overflow but only writeback >* if many dirty pages have been encountered. >*/ > - if (page_is_file_cache(page) && > - (!current_is_kswapd() || > - !zone_is_reclaim_dirty(zone))) { > + if (!current_is_kswapd() || > !zone_is_reclaim_dirty(zone)) { > /* >* Immediately reclaim when written back. >* Similar in principal to deactivate_page() I guess this part which avoids swapout in direct reclaim would be key if this patch were successful. But it could make anon pages rotate back into the inactive list's head from the tail in the direct reclaim path until kswapd can catch up. And kswapd can swap out anon pages from the tail of the inactive LRU, so I suspect it could cause LRU churning as a side effect. 
Anyway, I will queue it on the testing machine since Rusty's test is done. Thanks! -- Kind regards, Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote: > Hello Dave, > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote: > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote: > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote: > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca > > > Author: Jens Axboe > > > Date: Sat Apr 16 13:27:55 2011 +0200 > > > > > > block: let io_schedule() flush the plug inline > > > > > > Linus correctly observes that the most important dispatch cases > > > are now done from kblockd, this isn't ideal for latency reasons. > > > The original reason for switching dispatches out-of-line was to > > > avoid too deep a stack, so by _only_ letting the "accidental" > > > flush directly in schedule() be guarded by offload to kblockd, > > > we should be able to get the best of both worlds. > > > > > > So add a blk_schedule_flush_plug() that offloads to kblockd, > > > and only use that from the schedule() path. > > > > > > Signed-off-by: Jens Axboe > > > > > > And now we have too deep a stack due to unplugging from io_schedule()... > > > > So, if we make io_schedule() push the plug list off to the kblockd > > like is done for schedule() > I did the below hacky test to apply your idea and the result is overflow again. > So, again, it argues for a second stack expansion. Otherwise, we should prevent > swapout in direct reclaim. > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index f5c6635b806c..95f169e85dbe 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to); > void __sched io_schedule(void) > { > struct rq *rq = raw_rq(); > + struct blk_plug *plug = current->plug; > > delayacct_blkio_start(); > atomic_inc(&rq->nr_iowait); > - blk_flush_plug(current); > + if (plug) > + blk_flush_plug_list(plug, true); > + > current->in_iowait = 1; > schedule(); > current->in_iowait = 0; . 
> Depth Size Location (46 entries) > > 0) 7200 8 _raw_spin_lock_irqsave+0x51/0x60 > 1) 7192 296 get_page_from_freelist+0x886/0x920 > 2) 6896 352 __alloc_pages_nodemask+0x5e1/0xb20 > 3) 6544 8 alloc_pages_current+0x10f/0x1f0 > 4) 6536 168 new_slab+0x2c5/0x370 > 5) 6368 8 __slab_alloc+0x3a9/0x501 > 6) 6360 80 __kmalloc+0x1cb/0x200 > 7) 6280 376 vring_add_indirect+0x36/0x200 > 8) 5904 144 virtqueue_add_sgs+0x2e2/0x320 > 9) 5760 288 __virtblk_add_req+0xda/0x1b0 > 10) 5472 96 virtio_queue_rq+0xd3/0x1d0 > 11) 5376 128 __blk_mq_run_hw_queue+0x1ef/0x440 > 12) 5248 16 blk_mq_run_hw_queue+0x35/0x40 > 13) 5232 96 blk_mq_insert_requests+0xdb/0x160 > 14) 5136 112 blk_mq_flush_plug_list+0x12b/0x140 > 15) 5024 112 blk_flush_plug_list+0xc7/0x220 > 16) 4912 128 blk_mq_make_request+0x42a/0x600 > 17) 4784 48 generic_make_request+0xc0/0x100 > 18) 4736 112 submit_bio+0x86/0x160 > 19) 4624 160 __swap_writepage+0x198/0x230 > 20) 4464 32 swap_writepage+0x42/0x90 > 21) 4432 320 shrink_page_list+0x676/0xa80 > 22) 4112 208 shrink_inactive_list+0x262/0x4e0 > 23) 3904 304 shrink_lruvec+0x3e1/0x6a0 The device is supposed to be plugged here in shrink_lruvec(). Oh, a plug can only hold 16 individual bios, and then it does a synchronous flush. Hmmm - perhaps that should also defer the flush to the kblockd, because if we are overrunning a plug then we've already surrendered IO dispatch latency... So, in blk_mq_make_request(), can you do: if (list_empty(&plug->mq_list)) trace_block_plug(q); else if (request_count >= BLK_MAX_REQUEST_COUNT) { - blk_flush_plug_list(plug, false); + blk_flush_plug_list(plug, true); trace_block_plug(q); } list_add_tail(&rq->queuelist, &plug->mq_list); To see if that defers all the swap IO to kblockd? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote: > That sounds like a plan. Perhaps it would be useful to add a > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond > 8k) so that we get some indication that we're hitting a deep stack > but the system otherwise keeps functioning. That gives us some > motivation to keep stack usage down but isn't a fatal problem like > it is now. We have check_stack_usage() and DEBUG_STACK_USAGE for this, though it needs some tweaking if we move to 16K. I gave it a try yesterday and noticed a spew of noisy warnings as soon as I gave it a workload to chew on. (More so than usual with 8K stacks.) Dave
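For reference, CONFIG_DEBUG_STACK_USAGE works by zero-filling the stack at task creation and later scanning from the far end for the first byte the task ever dirtied. A hypothetical userspace sketch of that scan, modeled loosely on the kernel's stack_not_used() (the buffer here stands in for a downward-growing thread stack; the framing is illustrative, not kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count untouched (still-zero) bytes at the low end of a stack that
 * grows downward: the first nonzero byte marks the deepest excursion. */
static size_t stack_not_used(const unsigned char *stack, size_t size)
{
    size_t n = 0;
    while (n < size && stack[n] == 0)
        n++;
    return n;
}
```

check_stack_usage() then reports the stack size minus this value as the greatest depth reached, which is where the "used greatest stack depth: N bytes left" lines in this thread come from.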
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 4:36 PM, Minchan Kim wrote: > > I did below hacky test to apply your idea and the result is overflow again. > So, again it would second stack expansion. Otherwise, we should prevent > swapout in direct reclaim. So changing io_schedule() is bad, for the reasons I outlined elsewhere (we use it for wait_for_page*() - see sleep_on_page()). It's the congestion waiting where the io_schedule() should be avoided. So maybe test a patch something like the attached. NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible horrible things. It seems to compile, but I have absolutely no reason to believe that it would work. I didn't actually test that this moves anything at all to kblockd. So think of it as a concept patch that *might* work, but as Dave said, there might also be other things that cause unplugging and need some tough love. Linus

 mm/backing-dev.c | 28 ++--
 mm/vmscan.c | 4 +---
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 09d9591b7708..cb26b24c2da2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include <linux/blkdev.h>

 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);

@@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);

+static long congestion_timeout(int sync, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+	struct blk_plug *plug = current->plug;
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];

-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);

 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout), jiffies_to_usecs(jiffies - start));
@@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];

 	/*
 	 * If there is no congestion, or heavy congestion is not being
@@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	}

 	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout), jiffies_to_usecs(jiffies - start));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..1e524000b83e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * avoid risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-			    (!current_is_kswapd() ||
-			     !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote: > On Thu, May 29, 2014 at 12:26 AM, Dave Chinner wrote: > > > > What concerns me about both __alloc_pages_nodemask() and > > kernel_map_pages is that when I look at the code I see functions > > that have no obvious stack usage problem. However, the compiler is > > producing functions with huge stack footprints and it's not at all > > obvious when I read the code. So in this case I'm more concerned > > that we have a major disconnect between the source code structure > > and the code that the compiler produces... > > I agree. In fact, this is the main reason that Minchan's call trace > and this thread has actually convinced me that yes, we really do need > to make x86-64 have a 16kB stack (well, 16kB allocation - there's > still the thread info etc too). > > Usually when we see the stack-smashing traces, they are because > somebody did something stupid. In this case, there are certainly > stupid details, and things I think we should fix, but there is *not* > the usual red flag of "Christ, somebody did something _really_ wrong". > > So I'm not in fact arguing against Minchan's patch of upping > THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does > remain one of my "we really need to be careful" issues, so while I am > basically planning on applying that patch, I _also_ want to make sure > that we fix the problems we do see and not just paper them over. > > The 8kB stack has been somewhat restrictive and painful for a while, > and I'm ok with admitting that it is just getting _too_ damn painful, > but I don't want to just give up entirely when we have a known deep > stack case. That sounds like a plan. Perhaps it would be useful to add a WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond 8k) so that we get some indication that we're hitting a deep stack but the system otherwise keeps functioning. 
That gives us some motivation to keep stack usage down but isn't a fatal problem like it is now. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Hello Linus, On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote: > On Thu, May 29, 2014 at 12:26 AM, Dave Chinner wrote: > > > > What concerns me about both __alloc_pages_nodemask() and > > kernel_map_pages is that when I look at the code I see functions > > that have no obvious stack usage problem. However, the compiler is > > producing functions with huge stack footprints and it's not at all > > obvious when I read the code. So in this case I'm more concerned > > that we have a major disconnect between the source code structure > > and the code that the compiler produces... > > I agree. In fact, this is the main reason that Minchan's call trace > and this thread has actually convinced me that yes, we really do need > to make x86-64 have a 16kB stack (well, 16kB allocation - there's > still the thread info etc too). > > Usually when we see the stack-smashing traces, they are because > somebody did something stupid. In this case, there are certainly > stupid details, and things I think we should fix, but there is *not* > the usual red flag of "Christ, somebody did something _really_ wrong". > > So I'm not in fact arguing against Minchan's patch of upping > THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does > remain one of my "we really need to be careful" issues, so while I am > basically planning on applying that patch, I _also_ want to make sure > that we fix the problems we do see and not just paper them over. So, should I resend a patch w/o RFC in subject but with Acked-by from Dave? Or, you will do it by yourself? > > The 8kB stack has been somewhat restrictive and painful for a while, > and I'm ok with admitting that it is just getting _too_ damn painful, > but I don't want to just give up entirely when we have a known deep > stack case. > > Linus > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: em...@kvack.org -- Kind regards, Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Hello Dave, On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote: > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote: > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote: > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca > > Author: Jens Axboe > > Date: Sat Apr 16 13:27:55 2011 +0200 > > > > block: let io_schedule() flush the plug inline > > > > Linus correctly observes that the most important dispatch cases > > are now done from kblockd, this isn't ideal for latency reasons. > > The original reason for switching dispatches out-of-line was to > > avoid too deep a stack, so by _only_ letting the "accidental" > > flush directly in schedule() be guarded by offload to kblockd, > > we should be able to get the best of both worlds. > > > > So add a blk_schedule_flush_plug() that offloads to kblockd, > > and only use that from the schedule() path. > > > > Signed-off-by: Jens Axboe > > > > And now we have too deep a stack due to unplugging from io_schedule()... > > So, if we make io_schedule() push the plug list off to the kblockd > like is done for schedule() > > > > IOW, swap-out directly caused that extra 3kB of stack use in what was > > > a deep call chain (due to memory allocation). I really don't > > > understand why you are arguing anything else on a pure technicality. > > > > > > I thought you had some other argument for why swap was different, and > > > against removing that "page_is_file_cache()" special case in > > > shrink_page_list(). > > > > I've said in the past that swap is different to filesystem > > ->writepage implementations because it doesn't require significant > > stack to do block allocation and doesn't trigger IO deep in that > > allocation stack. Hence it has much lower stack overhead than the > > filesystem ->writepage implementations and so is much less likely to > > have stack issues. 
> > > > This stack overflow shows us that just the memory reclaim + IO > > layers are sufficient to cause a stack overflow, > > we solve this problem directly by being able to remove the IO > stack usage from the direct reclaim swap path. > > IOWs, we don't need to turn swap off at all in direct reclaim > because all the swap IO can be captured in a plug list and > dispatched via kblockd. This could be done either by io_schedule() > or a new blk_flush_plug_list() wrapper that pushes the work to > kblockd... I did below hacky test to apply your idea and the result is overflow again. So, again it would second stack expansion. Otherwise, we should prevent swapout in direct reclaim.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..95f169e85dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
 	struct rq *rq = raw_rq();
+	struct blk_plug *plug = current->plug;

 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;

[ 1209.764725] kworker/u24:0 (23627) used greatest stack depth: 304 bytes left
[ 1510.835509] kworker/u24:1 (25817) used greatest stack depth: 144 bytes left
[ 3701.482790] PANIC: double fault, error_code: 0x0
[ 3701.483297] CPU: 8 PID: 6117 Comm: kworker/u24:1 Not tainted 3.14.0+ #201
[ 3701.483980] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3701.484366] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[ 3701.484366] task: 8800353c41c0 ti: 88106000 task.ti: 88106000
[ 3701.484366] RIP: 0010:[] [] __lock_acquire+0x170/0x1ca0
[ 3701.484366] RSP: :88105f58 EFLAGS: 00010046
[ 3701.484366] RAX: 0001 RBX: 8800353c41c0 RCX: 0002
[ 3701.484366] RDX: RSI: RDI: 81c4a1e0
[ 3701.484366] RBP: 88106048 R08: 0001 R09: 0001
[ 3701.484366] R10: R11: R12: 0001
[ 3701.484366] R13: R14: 81c4a1e0 R15:
[ 3701.484366] FS: () GS:880037d0() knlGS:
[ 3701.484366] CS: 0010 DS: ES: CR0: 8005003b
[ 3701.484366] CR2: 88105f58 CR3: 01c0b000 CR4: 06e0
[ 3701.484366] Stack:
[ 3701.484366] BUG: unable to handle kernel paging request at 88105f58
[ 3701.484366] IP: [] show_stack_log_lvl+0x134/0x1a0
[ 3701.484366] PGD 28c5067 PUD 28c6067 PMD 28c7067 PTE 80105060
[ 3701.484366] Thread overran stack, or stack corrupted
[ 3701.484366] Oops: [#1] SMP DEBUG_PAGEALLOC
[ 3701.484366] Dumping ftrace buffer:
[ 3701.484366] -
[ 3701.484366]<...>-61178d..4 3786719374us : stack_trace_call: Depth
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 12:26 AM, Dave Chinner wrote: > > What concerns me about both __alloc_pages_nodemask() and > kernel_map_pages is that when I look at the code I see functions > that have no obvious stack usage problem. However, the compiler is > producing functions with huge stack footprints and it's not at all > obvious when I read the code. So in this case I'm more concerned > that we have a major disconnect between the source code structure > and the code that the compiler produces... I agree. In fact, this is the main reason that Minchan's call trace and this thread has actually convinced me that yes, we really do need to make x86-64 have a 16kB stack (well, 16kB allocation - there's still the thread info etc too). Usually when we see the stack-smashing traces, they are because somebody did something stupid. In this case, there are certainly stupid details, and things I think we should fix, but there is *not* the usual red flag of "Christ, somebody did something _really_ wrong". So I'm not in fact arguing against Minchan's patch of upping THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does remain one of my "we really need to be careful" issues, so while I am basically planning on applying that patch, I _also_ want to make sure that we fix the problems we do see and not just paper them over. The 8kB stack has been somewhat restrictive and painful for a while, and I'm ok with admitting that it is just getting _too_ damn painful, but I don't want to just give up entirely when we have a known deep stack case. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
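For the record, the patch under discussion is essentially a one-liner: the stack allocation is PAGE_SIZE shifted left by THREAD_SIZE_ORDER, so going from order 1 to order 2 doubles it from 8KB to 16KB. A sketch of the arithmetic, assuming 4KB base pages as on x86-64 (macro names for the old/new orders are illustrative):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL            /* x86-64 base page size */
#define THREAD_SIZE_ORDER_OLD 1     /* 2 pages ->  8KB stacks */
#define THREAD_SIZE_ORDER_NEW 2     /* 4 pages -> 16KB stacks */

/* THREAD_SIZE is defined in terms of the order of the page allocation. */
#define THREAD_SIZE(order) (PAGE_SIZE << (order))
```

Note the allocation still has struct thread_info carved out of it, so the usable stack is slightly less than the full 8KB or 16KB.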
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
> Hmm, stupid question: what happens when 16K is not enough too, do we > increase again? When do we stop increasing? 1M, 2M... ? It's not a stupid question; it's IMHO the most important question. > Sounds like we want to make it a config option with a couple of sizes > for everyone to be happy. :-) At the moment it goes bang if you freakily get three layers of recursion through allocations. But show me the proof we can't already hit four, or five, or six. We don't *need* to allocate tons of stack memory to each task just because we might recursively allocate. We don't solve the problem by doing so either. We at best fudge over it. Why is *any* recursive memory allocation not ending up waiting for other kernel worker threads to free up memory (beyond it being rather hard to go and retrofit)?
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 07:42:40PM -0700, Linus Torvalds wrote: > On Wed, May 28, 2014 at 6:30 PM, Dave Chinner wrote: > > > > You're focussing on the specific symptoms, not the bigger picture. > > i.e. you're ignoring all the other "let's start IO" triggers in > > direct reclaim. e.g. there's two separate plug flush triggers in > > shrink_inactive_list(), one of which is: > > Fair enough. I certainly agree that we should look at the other cases here > too. > > In fact, I also find it distasteful just how much stack space some of > those VM routines are just using up on their own, never mind any > actual IO paths at all. The fact that __alloc_pages_nodemask() uses > 350 bytes of stackspace on its own is actually quite disturbing. The > fact that kernel_map_pages() apparently has almost 400 bytes of stack > is just crazy. Obviously that case only happens with > CONFIG_DEBUG_PAGEALLOC, but still.. What concerns me about both __alloc_pages_nodemask() and kernel_map_pages is that when I look at the code I see functions that have no obvious stack usage problem. However, the compiler is producing functions with huge stack footprints and it's not at all obvious when I read the code. So in this case I'm more concerned that we have a major disconnect between the source code structure and the code that the compiler produces... > > I'm not saying we shouldn't turn off swap from direct reclaim, just > > that all we'd be doing by turning off swap is playing whack-a-stack > > - the next report will simply be from one of the other direct > > reclaim IO schedule points. > > Playing whack-a-mole with this for a while might not be a bad idea, > though. It's not like we will ever really improve unless we start > whacking the worst cases. And it should still be a fairly limited > number. I guess I've been playing whack-a-stack for so long now and some of the overruns have been so large I just don't see it as a viable medium- to long-term solution.
> After all, historically, some of the cases we've played whack-a-mole > on have been in XFS, so I'd think you'd be thrilled to see some other > code get blamed this time around ;) Blame shifting doesn't thrill me - I'm still at the pointy end of stack overrun reports, and we've still got to do the hard work of solving the problem. However, I am happy to see acknowledgement of the problem so we can work out how to solve the issues... > > Regardless of whether it is swap or something external queues the > > bio on the plug, perhaps we should look at why it's done inline > > rather than by kblockd, where it was moved because it was blowing > > the stack from schedule(): > > So it sounds like we need to do this for io_schedule() too. > > In fact, we've generally found it to be a mistake every time we > "automatically" unblock some IO queue. And I'm not saying that because > of stack space, but because we've _often_ had the situation that eager > unblocking results in IO that could have been done as bigger requests. > > Of course, we do need to worry about latency for starting IO, but any > of these kinds of memory-pressure writeback patterns are pretty much > by definition not about the latency of one _particular_ IO, so they > don't tend to be latency-sensitive. Quite the reverse: we start > writeback and then end up waiting on something else altogether > (possibly a writeback that got started much earlier). *nod* > swapout certainly is _not_ IO-latency-sensitive, especially these > days. And while we _do_ want to throttle in direct reclaim, if it's > about throttling I'd certainly think that it sounds quite reasonable > to push any unplugging to kblockd rather than try to do that synchronously. > If we are throttling in direct-reclaim, we need to slow things _down_ > for the writer, not worry about latency.
Right, we are adding latency to the caller by having to swap so a small amount of additional IO dispatch latency for IO we aren't going to wait directly on doesn't really matter at all. > >That implies no IO in direct reclaim context > > is safe - either from swap or io_schedule() unplugging. It also > > lends a lot of weight to my assertion that the majority of the stack > > growth over the past couple of years has been ocurring outside the > > filesystems > > I think Minchan's stack trace definitely backs you up on that. The > filesystem part - despite that one ext4_writepages() function - is a > very small part of the whole. It sits at about ~1kB of stack. Just the > VM "top-level" writeback code is about as much, and then the VM page > alloc/shrinking code when the filesystem needs memory is *twice* that, > and then the block layer and the virtio code are another 1kB each. *nod* As i said early, look at this in the context of the bigger picture. We can also have more stack using layers in the IO stack and/or more stack-expensive layers. e.g. it could be block -> dm -> md -> SCSI -> mempool_alloc in that stack rather than block -> virtio ->
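One way to see the disconnect Dave describes above - innocuous-looking source producing big compiler-generated frames - without reading disassembly is to measure it at run time. A crude, hypothetical userspace probe (assuming a conventional downward-growing stack, as on x86-64 Linux): the distance between the same local variable's address in two successive recursive calls approximates the function's real frame size, spills and padding included.

```c
#include <assert.h>
#include <stddef.h>

/* Each call carries a 300-byte local, mimicking a function whose
 * on-stack state isn't obvious from the source.  The address gap
 * between parent and child instances of `local` is the frame the
 * compiler actually emitted for this function. */
static ptrdiff_t frame_probe(int depth, char *parent_local)
{
    char local[300];
    ptrdiff_t d;

    local[0] = (char)depth;   /* force the array to live in memory */
    if (depth == 0)
        return parent_local ? parent_local - local : 0;
    d = frame_probe(depth - 1, local);
    local[1] = 1;             /* keep this frame live; prevents a tail call */
    return d;
}
```

The measured gap is always at least sizeof(local), and typically a few dozen bytes more for saved registers and alignment, which is exactly the kind of hidden overhead the 350-byte __alloc_pages_nodemask() frame illustrates. (GCC's -fstack-usage option reports the same per-function numbers at compile time.)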
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Linus Torvalds writes: > Well, we've definitely had some issues with deeper callchains > with md, but I suspect virtio might be worse, and the new blk-mq code > is likely worse in this respect too. I looked at this; I've now got a couple of virtio core cleanups, and I'm testing with Minchan's config and gcc versions now. MST reported that gcc 4.8 is better than 4.6, but I'll test that too. Cheers, Rusty.
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote: > On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote: > > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote: > > > [ cc XFS list ] > > > > [and now there is a complete copy on the XFS list, I'll add my 2c] > > > > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote: > > > > While I was testing in-house patches under heavy memory pressure on qemu-kvm, > > > > the 3.14 kernel randomly crashed. The reason was kernel stack overflow. > > > > > > > > When I investigated the problem, the callstack was made a little bit deeper > > > > by involving reclaim functions, though not the direct reclaim path. > > > > > > > > I tried to trim the stack usage of some functions related to alloc/reclaim, > > > > saving a hundred bytes or so, but the overflow didn't disappear; I just > > > > hit it > > > > again via another, deeper callstack on the reclaim/allocator path. > > > > That's a no-win situation. The stack overruns through ->writepage > > we've been seeing with XFS over the past *4 years* are much larger > > than a few bytes. The worst case stack usage on a virtio block > > device was about 10.5KB. > > > > And, like this one, it came from the flusher thread as well. The > > difference was that the allocation that triggered the reclaim path > > you've reported occurred when 5k of the stack had already been > > used... > > > > > > Of course, we could sweep every site we have found to reduce > > > > stack usage, but I'm not sure how long that saves the world (surely, > > > > lots of developers will keep adding nice features that use stack > > > > again) and if we consider another more complex feature in the I/O layer > > > > and/or reclaim path, it might be better to increase the stack size( > > > > meanwhile, stack usage on 64-bit machines has doubled compared to 32-bit > > > > while the stack has stayed at 8K. Hmm, that doesn't seem fair to me, and arm64 has > > > > already expanded to 16K.
) > > > > Yup, that's all been pointed out previously. 8k stacks were never > > large enough to fit the linux IO architecture on x86-64, but nobody > > outside filesystem and IO developers has been willing to accept that > > argument as valid, despite regular stack overruns and filesystems > > having to add workaround after workaround to prevent stack overruns. > > > > That's why stuff like this appears in various filesystems' > > ->writepage: > > > > /* > > * Refuse to write the page out if we are called from reclaim > > context. > > * > > * This avoids stack overflows when called from deeply used stacks > > in > > * random callers for direct reclaim or memcg reclaim. We > > explicitly > > * allow reclaim from kswapd as the stack usage there is relatively > > low. > > * > > * This should never happen except in the case of a VM regression so > > * warn about it. > > */ > > if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == > > PF_MEMALLOC)) > > goto redirty; > > > > That still doesn't guarantee us enough stack space to do writeback, > > though, because memory allocation can occur when reading in metadata > > needed to do delayed allocation, and so we could trigger GFP_NOFS > > memory allocation from the flusher thread with 4-5k of stack already > > consumed, so that would still overrun the stack. > > > > So, a couple of years ago we started deferring half the writeback > > stack usage to a worker thread (commit c999a22 "xfs: introduce an > > allocation workqueue"), under the assumption that the worst stack > > usage when we call memory allocation is around 3-3.5k of stack used. > > We thought that would be safe, but the stack trace you've posted > > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack, > > which means we're still screwed despite all the workarounds we have > > in place. > > The allocation and reclaim stack itself is only 2k per the stacktrace > below.
What got us in this particular case is that we engaged a > complicated block layer setup from within the allocation context in > order to swap out a page. > > In the past we disabled filesystem ->writepage from within the > allocation context and deferred it to kswapd for stack reasons (see > the WARN_ON_ONCE and the comment in your above quote), but I think we > have to go further and do the same for even swap_writepage(): > > > > > I guess this topic was discussed several time so there might be > > > > strong reason not to increase kernel stack size on x86_64, for me not > > > > knowing so Ccing x86_64 maintainers, other MM guys and virtio > > > > maintainers. > > > >
> > > > Depth    Size   Location    (51 entries)
> > > >
> > > >  0)  7696      16   lookup_address+0x28/0x30
> > > >  1)  7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > > >  2)  7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > > >  3)  7640     392   kernel_map_pages+0x6c/0x120
> > > >  4)  7248     256   get_page_from_freelist+0x489/0x920
> > > >  5)  6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> > > >  6)  6640       8   alloc_pages_current+0x10f/0x1f0
> > > >  7)
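The ->writepage guard Dave quotes is a pure flag test, and what Johannes proposes here amounts to extending the same test to swap_writepage(). A hypothetical standalone model of that predicate (the PF_* values are copied from 3.x include/linux/sched.h; the function name is made up for illustration):

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800u /* task is allocating / in reclaim */
#define PF_KSWAPD   0x00040000u /* task is the kswapd reclaim daemon */

/* True only for direct reclaim: a task doing reclaim from within its
 * own allocation, on an already-deep stack.  kswapd, which runs with
 * both flags set, is allowed through because its stack usage is
 * shallow and predictable. */
static int writepage_should_redirty(unsigned int flags)
{
    return (flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC;
}
```

In other words, the guard redirties the page in direct reclaim and lets kswapd (or an ordinary caller) proceed, which is exactly why pushing swapout IO dispatch off the direct-reclaim stack matters so much in this thread.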
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote: On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote: On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote: [ cc XFS list ] [and now there is a complete copy on the XFs list, I'll add my 2c] On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote: While I play inhouse patches with much memory pressure on qemu-kvm, 3.14 kernel was randomly crashed. The reason was kernel stack overflow. When I investigated the problem, the callstack was a little bit deeper by involve with reclaim functions but not direct reclaim path. I tried to diet stack size of some functions related with alloc/reclaim so did a hundred of byte but overflow was't disappeard so that I encounter overflow by another deeper callstack on reclaim/allocator path. That's a no win situation. The stack overruns through -writepage we've been seeing with XFS over the past *4 years* are much larger than a few bytes. The worst case stack usage on a virtio block device was about 10.5KB of stack usage. And, like this one, it came from the flusher thread as well. The difference was that the allocation that triggered the reclaim path you've reported occurred when 5k of the stack had already been used... Of course, we might sweep every sites we have found for reducing stack usage but I'm not sure how long it saves the world(surely, lots of developer start to add nice features which will use stack agains) and if we consider another more complex feature in I/O layer and/or reclaim path, it might be better to increase stack size( meanwhile, stack usage on 64bit machine was doubled compared to 32bit while it have sticked to 8K. Hmm, it's not a fair to me and arm64 already expaned to 16K. ) Yup, that's all been pointed out previously. 
8k stacks were never large enough to fit the linux IO architecture on x86-64, but nobody outside filesystem and IO developers has been willing to accept that argument as valid, despite regular stack overruns and filesystem having to add workaround after workaround to prevent stack overruns. That's why stuff like this appears in various filesystem's -writepage: /* * Refuse to write the page out if we are called from reclaim context. * * This avoids stack overflows when called from deeply used stacks in * random callers for direct reclaim or memcg reclaim. We explicitly * allow reclaim from kswapd as the stack usage there is relatively low. * * This should never happen except in the case of a VM regression so * warn about it. */ if (WARN_ON_ONCE((current-flags (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)) goto redirty; That still doesn't guarantee us enough stack space to do writeback, though, because memory allocation can occur when reading in metadata needed to do delayed allocation, and so we could trigger GFP_NOFS memory allocation from the flusher thread with 4-5k of stack already consumed, so that would still overrun teh stack. So, a couple of years ago we started defering half the writeback stack usage to a worker thread (commit c999a22 xfs: introduce an allocation workqueue), under the assumption that the worst stack usage when we call memory allocation is around 3-3.5k of stack used. We thought that would be safe, but the stack trace you've posted shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack, which means we're still screwed despite all the workarounds we have in place. The allocation and reclaim stack itself is only 2k per the stacktrace below. What got us in this particular case is that we engaged a complicated block layer setup from within the allocation context in order to swap out a page. 
In the past we disabled filesystem -writepage from within the allocation context and deferred it to kswapd for stack reasons (see the WARN_ON_ONCE and the comment in your above quote), but I think we have to go further and do the same for even swap_writepage(): I guess this topic was discussed several time so there might be strong reason not to increase kernel stack size on x86_64, for me not knowing so Ccing x86_64 maintainers, other MM guys and virtio maintainers. DepthSize Location(51 entries) 0) 7696 16 lookup_address+0x28/0x30 1) 7680 16 _lookup_address_cpa.isra.3+0x3b/0x40 2) 7664 24 __change_page_attr_set_clr+0xe0/0xb50 3) 7640 392 kernel_map_pages+0x6c/0x120 4) 7248 256 get_page_from_freelist+0x489/0x920 5) 6992 352 __alloc_pages_nodemask+0x5e1/0xb20 6) 6640 8 alloc_pages_current+0x10f/0x1f0 7)
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Linus Torvalds torva...@linux-foundation.org writes: Well, we've definitely have had some issues with deeper callchains with md, but I suspect virtio might be worse, and the new blk-mq code is lilkely worse in this respect too. I looked at this; I've now got a couple of virtio core cleanups, and I'm testing with Minchan's config and gcc versions now. MST reported that gcc 4.8 is a better than 4.6, but I'll test that too. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 07:42:40PM -0700, Linus Torvalds wrote:
On Wed, May 28, 2014 at 6:30 PM, Dave Chinner da...@fromorbit.com wrote:

You're focussing on the specific symptoms, not the bigger picture. i.e. you're ignoring all the other "let's start IO" triggers in direct reclaim. e.g. there are two separate plug flush triggers in shrink_inactive_list(), one of which is:

Fair enough. I certainly agree that we should look at the other cases here too.

In fact, I also find it distasteful just how much stack space some of those VM routines are using up on their own, never mind any actual IO paths at all. The fact that __alloc_pages_nodemask() uses 350 bytes of stack space on its own is actually quite disturbing. The fact that kernel_map_pages() apparently has almost 400 bytes of stack is just crazy. Obviously that case only happens with CONFIG_DEBUG_PAGEALLOC, but still..

What concerns me about both __alloc_pages_nodemask() and kernel_map_pages() is that when I look at the code I see functions that have no obvious stack usage problem. However, the compiler is producing functions with huge stack footprints, and it's not at all obvious when I read the code. So in this case I'm more concerned that we have a major disconnect between the source code structure and the code that the compiler produces...

I'm not saying we shouldn't turn off swap from direct reclaim, just that all we'd be doing by turning off swap is playing whack-a-stack - the next report will simply be from one of the other direct reclaim IO schedule points.

Playing whack-a-mole with this for a while might not be a bad idea, though. It's not like we will ever really improve unless we start whacking the worst cases. And it should still be a fairly limited number.

I guess I've been playing whack-a-stack for so long now, and some of the overruns have been so large, that I just don't see it as a viable medium to long term solution.
After all, historically, some of the cases we've played whack-a-mole on have been in XFS, so I'd think you'd be thrilled to see some other code get blamed this time around ;)

Blame shifting doesn't thrill me - I'm still at the pointy end of stack overrun reports, and we've still got to do the hard work of solving the problem. However, I am happy to see acknowledgement of the problem so we can work out how to solve the issues...

Regardless of whether it is swap or something external that queues the bio on the plug, perhaps we should look at why it's done inline rather than by kblockd, where it was moved because it was blowing the stack from schedule():

So it sounds like we need to do this for io_schedule() too. In fact, we've generally found it to be a mistake every time we automatically unblock some IO queue. And I'm not saying that because of stack space, but because we've _often_ had the situation that eager unblocking results in IO that could have been done as bigger requests.

Of course, we do need to worry about latency for starting IO, but any of these kinds of memory-pressure writeback patterns are pretty much by definition not about the latency of one _particular_ IO, so they don't tend to be latency-sensitive. Quite the reverse: we start writeback and then end up waiting on something else altogether (possibly a writeback that got started much earlier).

*nod* swapout certainly is _not_ IO-latency-sensitive, especially these days.

And while we _do_ want to throttle in direct reclaim, if it's about throttling I'd certainly think that it sounds quite reasonable to push any unplugging to kblockd rather than try to do that synchronously. If we are throttling in direct reclaim, we need to slow things _down_ for the writer, not worry about latency.

Right; we are adding latency to the caller by having to swap, so a small amount of additional IO dispatch latency for IO we aren't going to wait on directly doesn't really matter at all.
That implies no IO in direct reclaim context is safe - either from swap or io_schedule() unplugging. It also lends a lot of weight to my assertion that the majority of the stack growth over the past couple of years has been occurring outside the filesystems.

I think Minchan's stack trace definitely backs you up on that. The filesystem part - despite that one ext4_writepages() function - is a very small part of the whole. It sits at about ~1kB of stack. Just the VM top-level writeback code is about as much, and then the VM page alloc/shrinking code when the filesystem needs memory is *twice* that, and then the block layer and the virtio code are another 1kB each.

*nod* As I said earlier, look at this in the context of the bigger picture. We can also have more stack-using layers in the IO stack and/or more stack-expensive layers. e.g. it could be block -> dm -> md -> SCSI -> mempool_alloc in that stack rather than block -> virtio -> kmalloc. Hence 1k of virtio stack could be 1.5k of SCSI stack, md/dm could
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Hmm, stupid question: what happens when 16K is not enough too? Do we increase again? When do we stop increasing? 1M, 2M... ?

It's not a stupid question; it's IMHO the most important question. Sounds like we want to make it a config option with a couple of sizes for everyone to be happy. :-)

At the moment it goes bang if you freakily get three layers of recursion through allocations. But show me the proof we can't already hit four, or five, or six.

We don't *need* to allocate tons of stack memory to each task just because we might recursively allocate. We don't solve the problem by doing so either - at best we fudge over it. Why is *any* recursive memory allocation not ending up waiting for other kernel worker threads to free up memory (beyond it being rather hard to go and retrofit)?
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 12:26 AM, Dave Chinner da...@fromorbit.com wrote:

What concerns me about both __alloc_pages_nodemask() and kernel_map_pages() is that when I look at the code I see functions that have no obvious stack usage problem. However, the compiler is producing functions with huge stack footprints and it's not at all obvious when I read the code. So in this case I'm more concerned that we have a major disconnect between the source code structure and the code that the compiler produces...

I agree. In fact, this is the main reason that Minchan's call trace and this thread have actually convinced me that yes, we really do need to make x86-64 have a 16kB stack (well, 16kB allocation - there's still the thread info etc too).

Usually when we see the stack-smashing traces, they are because somebody did something stupid. In this case, there are certainly stupid details, and things I think we should fix, but there is *not* the usual red flag of "Christ, somebody did something _really_ wrong".

So I'm not in fact arguing against Minchan's patch of upping THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does remain one of my "we really need to be careful" issues, so while I am basically planning on applying that patch, I _also_ want to make sure that we fix the problems we do see and not just paper them over.

The 8kB stack has been somewhat restrictive and painful for a while, and I'm ok with admitting that it is just getting _too_ damn painful, but I don't want to just give up entirely when we have a known deep stack case.

Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Hello Dave,

On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:

commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe jax...@fusionio.com
Date:   Sat Apr 16 13:27:55 2011 +0200

    block: let io_schedule() flush the plug inline

    Linus correctly observes that the most important dispatch cases are now done from kblockd, this isn't ideal for latency reasons. The original reason for switching dispatches out-of-line was to avoid too deep a stack, so by _only_ letting the "accidental" flush directly in schedule() be guarded by offload to kblockd, we should be able to get the best of both worlds.

    So add a blk_schedule_flush_plug() that offloads to kblockd, and only use that from the schedule() path.

    Signed-off-by: Jens Axboe jax...@fusionio.com

And now we have too deep a stack due to unplugging from io_schedule()... So, if we make io_schedule() push the plug list off to kblockd like is done for schedule()

IOW, swap-out directly caused that extra 3kB of stack use in what was a deep call chain (due to memory allocation). I really don't understand why you are arguing anything else on a pure technicality. I thought you had some other argument for why swap was different, and against removing that page_is_file_cache() special case in shrink_page_list().

I've said in the past that swap is different to filesystem ->writepage implementations because it doesn't require significant stack to do block allocation and doesn't trigger IO deep in that allocation stack. Hence it has much lower stack overhead than the filesystem ->writepage implementations and so is much less likely to have stack issues. This stack overflow shows us that just the memory reclaim + IO layers are sufficient to cause a stack overflow, so we solve this problem directly by removing the IO stack usage from the direct reclaim swap path.
IOWs, we don't need to turn swap off at all in direct reclaim because all the swap IO can be captured in a plug list and dispatched via kblockd. This could be done either by io_schedule() or a new blk_flush_plug_list() wrapper that pushes the work to kblockd...

I did the below hacky test to apply your idea, and the result is overflow again. So, again, it seconds stack expansion. Otherwise, we should prevent swapout in direct reclaim.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..95f169e85dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
 	struct rq *rq = raw_rq();
+	struct blk_plug *plug = current->plug;
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;

[ 1209.764725] kworker/u24:0 (23627) used greatest stack depth: 304 bytes left
[ 1510.835509] kworker/u24:1 (25817) used greatest stack depth: 144 bytes left
[ 3701.482790] PANIC: double fault, error_code: 0x0
[ 3701.483297] CPU: 8 PID: 6117 Comm: kworker/u24:1 Not tainted 3.14.0+ #201
[ 3701.483980] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3701.484366] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[ 3701.484366] task: 8800353c41c0 ti: 88106000 task.ti: 88106000
[ 3701.484366] RIP: 0010:[810a5390] [810a5390] __lock_acquire+0x170/0x1ca0
[ 3701.484366] RSP: :88105f58 EFLAGS: 00010046
[ 3701.484366] RAX: 0001 RBX: 8800353c41c0 RCX: 0002
[ 3701.484366] RDX: RSI: RDI: 81c4a1e0
[ 3701.484366] RBP: 88106048 R08: 0001 R09: 0001
[ 3701.484366] R10: R11: R12: 0001
[ 3701.484366] R13: R14: 81c4a1e0 R15:
[ 3701.484366] FS: () GS:880037d0() knlGS:
[ 3701.484366] CS: 0010 DS: ES: CR0: 8005003b
[ 3701.484366] CR2: 88105f48 CR3: 01c0b000 CR4: 06e0
[ 3701.484366] Stack:
[ 3701.484366] BUG: unable to handle kernel paging request at 88105f58
[ 3701.484366] IP: [81004e14] show_stack_log_lvl+0x134/0x1a0
[ 3701.484366] PGD 28c5067 PUD 28c6067 PMD 28c7067 PTE 80105060
[ 3701.484366] Thread overran stack, or stack corrupted
[ 3701.484366] Oops: [#1] SMP DEBUG_PAGEALLOC
[ 3701.484366] Dumping ftrace buffer:
[ 3701.484366] ...-61178d..4 3786719374us : stack_trace_call:        Depth    Size   Location
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Hello Linus,

On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
On Thu, May 29, 2014 at 12:26 AM, Dave Chinner da...@fromorbit.com wrote:

What concerns me about both __alloc_pages_nodemask() and kernel_map_pages() is that when I look at the code I see functions that have no obvious stack usage problem. However, the compiler is producing functions with huge stack footprints and it's not at all obvious when I read the code. So in this case I'm more concerned that we have a major disconnect between the source code structure and the code that the compiler produces...

I agree. In fact, this is the main reason that Minchan's call trace and this thread have actually convinced me that yes, we really do need to make x86-64 have a 16kB stack (well, 16kB allocation - there's still the thread info etc too).

Usually when we see the stack-smashing traces, they are because somebody did something stupid. In this case, there are certainly stupid details, and things I think we should fix, but there is *not* the usual red flag of "Christ, somebody did something _really_ wrong".

So I'm not in fact arguing against Minchan's patch of upping THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does remain one of my "we really need to be careful" issues, so while I am basically planning on applying that patch, I _also_ want to make sure that we fix the problems we do see and not just paper them over.

So, should I resend the patch without RFC in the subject, but with an Acked-by from Dave? Or will you do it yourself?

The 8kB stack has been somewhat restrictive and painful for a while, and I'm ok with admitting that it is just getting _too_ damn painful, but I don't want to just give up entirely when we have a known deep stack case.

Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
--
Kind regards,
Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
On Thu, May 29, 2014 at 12:26 AM, Dave Chinner da...@fromorbit.com wrote:

What concerns me about both __alloc_pages_nodemask() and kernel_map_pages() is that when I look at the code I see functions that have no obvious stack usage problem. However, the compiler is producing functions with huge stack footprints and it's not at all obvious when I read the code. So in this case I'm more concerned that we have a major disconnect between the source code structure and the code that the compiler produces...

I agree. In fact, this is the main reason that Minchan's call trace and this thread have actually convinced me that yes, we really do need to make x86-64 have a 16kB stack (well, 16kB allocation - there's still the thread info etc too).

Usually when we see the stack-smashing traces, they are because somebody did something stupid. In this case, there are certainly stupid details, and things I think we should fix, but there is *not* the usual red flag of "Christ, somebody did something _really_ wrong".

So I'm not in fact arguing against Minchan's patch of upping THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does remain one of my "we really need to be careful" issues, so while I am basically planning on applying that patch, I _also_ want to make sure that we fix the problems we do see and not just paper them over.

The 8kB stack has been somewhat restrictive and painful for a while, and I'm ok with admitting that it is just getting _too_ damn painful, but I don't want to just give up entirely when we have a known deep stack case.

That sounds like a plan. Perhaps it would be useful to add a WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond 8k) so that we get some indication that we're hitting a deep stack but the system otherwise keeps functioning. That gives us some motivation to keep stack usage down, but isn't a fatal problem like it is now.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 4:36 PM, Minchan Kim minc...@kernel.org wrote:

I did the below hacky test to apply your idea, and the result is overflow again. So, again, it seconds stack expansion. Otherwise, we should prevent swapout in direct reclaim.

So changing io_schedule() is bad, for the reasons I outlined elsewhere (we use it for wait_for_page*() - see sleep_on_page()). It's the congestion waiting where the io_schedule() should be avoided. So maybe test a patch something like the attached.

NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible horrible things. It seems to compile, but I have absolutely no reason to believe that it would work. I didn't actually test that this moves anything at all to kblockd. So think of it as a "concept patch" that *might* work, but as Dave said, there might also be other things that cause unplugging and need some tough love.

Linus

 mm/backing-dev.c | 28 ++--
 mm/vmscan.c      |  4 +---
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 09d9591b7708..cb26b24c2da2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include <linux/writeback.h>
 #include <linux/device.h>
 #include <trace/events/writeback.h>
+#include <linux/blkdev.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+	struct blk_plug *plug = current->plug;
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync,timeout);
 
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
@@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
@@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	}
 
 	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..1e524000b83e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * avoid risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-			    (!current_is_kswapd() ||
-			     !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

That sounds like a plan. Perhaps it would be useful to add a WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond 8k) so that we get some indication that we're hitting a deep stack but the system otherwise keeps functioning. That gives us some motivation to keep stack usage down but isn't a fatal problem like it is now.

We have check_stack_usage() and DEBUG_STACK_USAGE for this. Though it needs some tweaking if we move to 16K.

I gave it a try yesterday, and noticed a spew of noisy warnings as soon as I gave it a workload to chew on. (More so than usual with 8K stacks.)

Dave
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
Hello Dave,

On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:

commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe jax...@fusionio.com
Date:   Sat Apr 16 13:27:55 2011 +0200

    block: let io_schedule() flush the plug inline

    Linus correctly observes that the most important dispatch cases are now done from kblockd, this isn't ideal for latency reasons. The original reason for switching dispatches out-of-line was to avoid too deep a stack, so by _only_ letting the "accidental" flush directly in schedule() be guarded by offload to kblockd, we should be able to get the best of both worlds.

    So add a blk_schedule_flush_plug() that offloads to kblockd, and only use that from the schedule() path.

    Signed-off-by: Jens Axboe jax...@fusionio.com

And now we have too deep a stack due to unplugging from io_schedule()... So, if we make io_schedule() push the plug list off to kblockd like is done for schedule()

I did the below hacky test to apply your idea, and the result is overflow again. So, again, it seconds stack expansion. Otherwise, we should prevent swapout in direct reclaim.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..95f169e85dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
 	struct rq *rq = raw_rq();
+	struct blk_plug *plug = current->plug;
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
        Depth    Size   Location    (46 entries)

  0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
  1)     7192     296   get_page_from_freelist+0x886/0x920
  2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
  3)     6544       8   alloc_pages_current+0x10f/0x1f0
  4)     6536     168   new_slab+0x2c5/0x370
  5)     6368       8   __slab_alloc+0x3a9/0x501
  6)     6360      80   __kmalloc+0x1cb/0x200
  7)     6280     376   vring_add_indirect+0x36/0x200
  8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
  9)     5760     288   __virtblk_add_req+0xda/0x1b0
 10)     5472      96   virtio_queue_rq+0xd3/0x1d0
 11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
 12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
 13)     5232      96   blk_mq_insert_requests+0xdb/0x160
 14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
 15)     5024     112   blk_flush_plug_list+0xc7/0x220
 16)     4912     128   blk_mq_make_request+0x42a/0x600
 17)     4784      48   generic_make_request+0xc0/0x100
 18)     4736     112   submit_bio+0x86/0x160
 19)     4624     160   __swap_writepage+0x198/0x230
 20)     4464      32   swap_writepage+0x42/0x90
 21)     4432     320   shrink_page_list+0x676/0xa80
 22)     4112     208   shrink_inactive_list+0x262/0x4e0
 23)     3904     304   shrink_lruvec+0x3e1/0x6a0

The device is supposed to be plugged here in shrink_lruvec().

Oh, a plug can only hold 16 individual bios, and then it does a synchronous flush. Hmmm - perhaps that should also defer the flush to kblockd, because if we are overrunning a plug then we've already surrendered IO dispatch latency. So, in blk_mq_make_request(), can you do:

	if (list_empty(&plug->mq_list))
		trace_block_plug(q);
	else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-		blk_flush_plug_list(plug, false);
+		blk_flush_plug_list(plug, true);
		trace_block_plug(q);
	}
	list_add_tail(&rq->queuelist, &plug->mq_list);

to see if that defers all the swap IO to kblockd?

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Hello Linus,

On Thu, May 29, 2014 at 05:05:17PM -0700, Linus Torvalds wrote:
On Thu, May 29, 2014 at 4:36 PM, Minchan Kim minc...@kernel.org wrote:

I did the below hacky test to apply your idea, and the result is overflow again. So, again, it seconds stack expansion. Otherwise, we should prevent swapout in direct reclaim.

So changing io_schedule() is bad, for the reasons I outlined elsewhere (we use it for wait_for_page*() - see sleep_on_page()). It's the congestion waiting where the io_schedule() should be avoided. So maybe test a patch something like the attached.

NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible horrible things. It seems to compile, but I have absolutely no reason to believe that it would work. I didn't actually test that this moves anything at all to kblockd. So think of it as a "concept patch" that *might* work, but as Dave said, there might also be other things that cause unplugging and need some tough love.

Linus

 mm/backing-dev.c | 28 ++--
 mm/vmscan.c      |  4 +---
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 09d9591b7708..cb26b24c2da2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include <linux/writeback.h>
 #include <linux/device.h>
 #include <trace/events/writeback.h>
+#include <linux/blkdev.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+	struct blk_plug *plug = current->plug;
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync,timeout);
 
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
@@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
@@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	}
 
 	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..1e524000b83e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * avoid risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-			    (!current_is_kswapd() ||
-			     !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()

I guess this part, which avoids swapout in direct reclaim, would be the key if this patch were successful. But it could make anon pages rotate back to the inactive list's head from its tail in the direct reclaim path until kswapd can catch up. And kswapd can swap out anon pages from the tail of the inactive LRU, so I suspect it could cause LRU-churning side effects.

Anyway, I will queue it into the testing machine since Rusty's test is done.

Thanks!
--
Kind regards,
Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

That sounds like a plan. Perhaps it would be useful to add a WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond 8k) so that we get some indication that we're hitting a deep stack but the system otherwise keeps functioning. That gives us some motivation to keep stack usage down but isn't a fatal problem like it is now.

We have check_stack_usage() and DEBUG_STACK_USAGE for this. Though it needs some tweaking if we move to 16K.

Right, but it doesn't throw loud warnings when a specific threshold is reached - it just issues a quiet message when a process exits telling you what the maximum was, without giving us a stack to chew on.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

That sounds like a plan. Perhaps it would be useful to add a WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond 8k) so that we get some indication that we're hitting a deep stack but the system otherwise keeps functioning. That gives us some motivation to keep stack usage down but isn't a fatal problem like it is now.

We have check_stack_usage() and DEBUG_STACK_USAGE for this. Though it needs some tweaking if we move to 16K.

Right, but it doesn't throw loud warnings when a specific threshold is reached - it just issues a quiet message when a process exits telling you what the maximum was, without giving us a stack to chew on.

Ah, right, good point. That would be more useful.

Dave
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

That sounds like a plan. Perhaps it would be useful to add a WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond 8k) so that we get some indication that we're hitting a deep stack but the system otherwise keeps functioning. That gives us some motivation to keep stack usage down but isn't a fatal problem like it is now.

We have check_stack_usage() and DEBUG_STACK_USAGE for this. Though it needs some tweaking if we move to 16K.

Right, but it doesn't throw loud warnings when a specific threshold is reached - it just issues a quiet message when a process exits telling you what the maximum was, without giving us a stack to chew on.

But we could enhance the message so the user notices the risk, something like:

    kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means there is some horrible stack hogger in your kernel. Please report it to LKML and enable stacktrace to investigate who the culprit is.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
--
Kind regards,
Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 5:20 PM, Minchan Kim <minc...@kernel.org> wrote:
>
> I guess this part which avoids swapout in direct reclaim would be key
> if this patch were successful. But it could make anon pages rotate back
> to the inactive list's head from its tail in the direct reclaim path
> until kswapd can catch up. And kswapd can swap out anon pages from the
> tail of the inactive LRU, so I suspect it could cause LRU churning as a
> side effect.

Oh, it could make bad things happen, no question about that.

That said, those bad things are what happens to shared mapped pages
today, so in that sense it's not new. But large dirty shared mmap's have
traditionally been a great way to really hurt our VM, so "it should work
as well as shared mapped pages" is definitely not a ringing endorsement!

(Of course, *if* we can improve kswapd behavior for both swap-out and
shared dirty pages, that would then be a double win, so there is _some_
argument for saying that we should aim to handle both kinds of pages
equally).

> Anyway, I will queue it into the testing machine since Rusty's test is
> done.

You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

            Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 05:31:42PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:20 PM, Minchan Kim <minc...@kernel.org> wrote:
> >
> > I guess this part which avoids swapout in direct reclaim would be key
> > if this patch were successful. But it could make anon pages rotate
> > back to the inactive list's head from its tail in the direct reclaim
> > path until kswapd can catch up. And kswapd can swap out anon pages
> > from the tail of the inactive LRU, so I suspect it could cause LRU
> > churning as a side effect.
>
> Oh, it could make bad things happen, no question about that.
>
> That said, those bad things are what happens to shared mapped pages
> today, so in that sense it's not new. But large dirty shared mmap's
> have traditionally been a great way to really hurt our VM, so "it
> should work as well as shared mapped pages" is definitely not a
> ringing endorsement!

True.

> (Of course, *if* we can improve kswapd behavior for both swap-out and
> shared dirty pages, that would then be a double win, so there is
> _some_ argument for saying that we should aim to handle both kinds of
> pages equally).

Just an idea for preventing LRU churn: we could put the pages back at
the tail of the inactive list instead of the head if they're not proper
pages in this context, and have the reclaimer use a cursor as the
list_head instead of the LRU head to scan for victim pages, recording
the cursor somewhere like the lruvec after shrinking is done. It makes
the VM code more complicated, but it is worth trying if we approach it
that way.

> > Anyway, I will queue it into the testing machine since Rusty's test
> > is done.
>
> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Sure. While I was writing this, Rusty's test crashed, so I will try
Dave's patch, then yours except the vmscan.c part.

Thanks.

>             Linus
--
Kind regards,
Minchan Kim
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 5:50 PM, Minchan Kim <minc...@kernel.org> wrote:
> >
> > You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
>
> Sure. While I was writing this, Rusty's test crashed, so I will try
> Dave's patch, then yours except the vmscan.c part.

Looking more at Dave's patch (well, description), I don't think there is
any way in hell we can ever apply it. If I read it right, it will cause
all IO that overflows the max request count to go through the scheduler
to get it flushed. Maybe I misread it, but that's definitely not
acceptable. Maybe it's not noticeable with a slow rotational device, but
modern ssd hardware? No way.

I'd *much* rather slow down the swap side. Not real IO. So I think my
mm/vmscan.c patch is preferable (but yes, it might require some work to
make kswapd do better).

So you can try Dave's patch just to see what it does for stack depth,
but other than that it looks unacceptable unless I misread things.

            Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 5:05 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> So maybe test a patch something like the attached.
>
> NOTE! This is absolutely TOTALLY UNTESTED!

It's still untested, but I realized that the whole
"blk_flush_plug_list(plug, true);" thing is pointless, since schedule()
itself will do that for us. So I think you can remove the

+	struct blk_plug *plug = current->plug;
+	if (plug)
+		blk_flush_plug_list(plug, true);

part from congestion_timeout(). Not that it should *hurt* to have it
there, so I'm not bothering to send a changed patch. And again, no
actual testing by me on any of this, just looking at the code.

            Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 09:32:19AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
> > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> > > > That sounds like a plan. Perhaps it would be useful to add a
> > > > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth
> > > > beyond 8k) so that we get some indication that we're hitting a
> > > > deep stack but the system otherwise keeps functioning. That
> > > > gives us some motivation to keep stack usage down but isn't a
> > > > fatal problem like it is now.
> > >
> > > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> > > Though it needs some tweaking if we move to 16K.
> >
> > Right, but it doesn't throw loud warnings when a specific threshold
> > is reached - it just issues a quiet message when a process exits
> > telling you what the maximum was, without giving us a stack to chew
> > on.
>
> But we could enhance the message so that the user notices the risk,
> along these lines:
>
>   kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it
>   means there is some horrible stack hog in your kernel. Please
>   report it to LKML and enable stacktrace to investigate who the
>   culprit is.

That, however, presumes that a user can reproduce the problem on
demand. Experience tells me that this is the exception rather than the
norm for production systems, and so capturing the stack in real time is
IMO the only useful thing we could add...

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim <minc...@kernel.org> wrote:
> > > You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
> >
> > Sure. While I was writing this, Rusty's test crashed, so I will try
> > Dave's patch, then yours except the vmscan.c part.
>
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
>
> I'd *much* rather slow down the swap side. Not real IO. So I think my
> mm/vmscan.c patch is preferable (but yes, it might require some work
> to make kswapd do better).
>
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.

Yeah, it's a hack, not intended as a potential solution.

I'm thinking, though, that plug flushing behaviour is actually
dependent on plugger context, and there is no one correct behaviour.
If we are doing process-driven IO, then we want to do immediate
dispatch, but for IO where stack is an issue or which is for bulk
throughput (e.g. background writeback), async dispatch through kblockd
is desirable.

If the patch I sent solves the swap stack usage issue, then perhaps we
should look towards adding a blk_plug_start_async() to pass such hints
to the plug flushing. I'd want to use the same behaviour in
__xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
probably also in mpage_writepages() for bulk data writeback in
WB_SYNC_NONE context...

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > Hello Dave,
> >
> > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > >
> > > >     commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > >     Author: Jens Axboe <jax...@fusionio.com>
> > > >     Date:   Sat Apr 16 13:27:55 2011 +0200
> > > >
> > > >         block: let io_schedule() flush the plug inline
> > > >
> > > >         Linus correctly observes that the most important dispatch
> > > >         cases are now done from kblockd, this isn't ideal for
> > > >         latency reasons. The original reason for switching
> > > >         dispatches out-of-line was to avoid too deep a stack, so
> > > >         by _only_ letting the accidental flush directly in
> > > >         schedule() be guarded by offload to kblockd, we should be
> > > >         able to get the best of both worlds.
> > > >
> > > >         So add a blk_schedule_flush_plug() that offloads to
> > > >         kblockd, and only use that from the schedule() path.
> > > >
> > > >         Signed-off-by: Jens Axboe <jax...@fusionio.com>
> > > >
> > > > And now we have too deep a stack due to unplugging from
> > > > io_schedule()...
> > >
> > > So, if we make io_schedule() push the plug list off to the kblockd
> > > like is done for schedule()...
> >
> > I did the hacky test below to apply your idea, and the result is
> > overflow again. So, again, it would need the second stack expansion.
> > Otherwise, we should prevent swapout in direct reclaim.
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f5c6635b806c..95f169e85dbe 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> >  void __sched io_schedule(void)
> >  {
> >  	struct rq *rq = raw_rq();
> > +	struct blk_plug *plug = current->plug;
> >
> >  	delayacct_blkio_start();
> >  	atomic_inc(&rq->nr_iowait);
> > -	blk_flush_plug(current);
> > +	if (plug)
> > +		blk_flush_plug_list(plug, true);
> > +
> >  	current->in_iowait = 1;
> >  	schedule();
> >  	current->in_iowait = 0;
> > ....
> >
> >         Depth    Size   Location    (46 entries)
> >
> >   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
> >   1)     7192     296   get_page_from_freelist+0x886/0x920
> >   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
> >   3)     6544       8   alloc_pages_current+0x10f/0x1f0
> >   4)     6536     168   new_slab+0x2c5/0x370
> >   5)     6368       8   __slab_alloc+0x3a9/0x501
> >   6)     6360      80   __kmalloc+0x1cb/0x200
> >   7)     6280     376   vring_add_indirect+0x36/0x200
> >   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
> >   9)     5760     288   __virtblk_add_req+0xda/0x1b0
> >  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
> >  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
> >  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
> >  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
> >  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
> >  15)     5024     112   blk_flush_plug_list+0xc7/0x220
> >  16)     4912     128   blk_mq_make_request+0x42a/0x600
> >  17)     4784      48   generic_make_request+0xc0/0x100
> >  18)     4736     112   submit_bio+0x86/0x160
> >  19)     4624     160   __swap_writepage+0x198/0x230
> >  20)     4464      32   swap_writepage+0x42/0x90
> >  21)     4432     320   shrink_page_list+0x676/0xa80
> >  22)     4112     208   shrink_inactive_list+0x262/0x4e0
> >  23)     3904     304   shrink_lruvec+0x3e1/0x6a0
>
> The device is supposed to be plugged here in shrink_lruvec().
>
> Oh, a plug can only hold 16 individual bios, and then it does a
> synchronous flush. Hmmm - perhaps that should also defer the flush to
> kblockd, because if we are overrunning a plug then we've already
> surrendered IO dispatch latency.
>
> So, in blk_mq_make_request(), can you do:
>
> 	if (list_empty(&plug->mq_list))
> 		trace_block_plug(q);
> 	else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> -		blk_flush_plug_list(plug, false);
> +		blk_flush_plug_list(plug, true);
> 		trace_block_plug(q);
> 	}
> 	list_add_tail(&rq->queuelist, &plug->mq_list);
>
> To see if that defers all the swap IO to kblockd?

Interim report: I applied the patch below (we need to fix
io_schedule_timeout too, due to mempool_alloc):

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 		trace_block_plug(q);
 	else {
 		if (request_count >= BLK_MAX_REQUEST_COUNT) {
-			blk_flush_plug_list(plug, false);
+			blk_flush_plug_list(plug, true);
 			trace_block_plug(q);
 		}
 	}
diff --git
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 6:58 PM, Dave Chinner <da...@fromorbit.com> wrote:
>
> If the patch I sent solves the swap stack usage issue, then perhaps we
> should look towards adding a blk_plug_start_async() to pass such hints
> to the plug flushing. I'd want to use the same behaviour in
> __xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
> probably also in mpage_writepages() for bulk data writeback in
> WB_SYNC_NONE context...

Yeah, adding a flag to the plug about what kind of plug it is does
sound quite reasonable. It already has that magic field, it could
easily be extended to have an "async vs sync" bit to it..

Of course, it's also possible that the unplugging code could just look
at the actual requests that are plugged to determine that, and maybe we
wouldn't even need to mark things specially. I don't think we ever end
up mixing reads and writes under the same plug, so "first request is a
write" is probably a good approximation for "async".

            Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 7:12 PM, Minchan Kim <minc...@kernel.org> wrote:
>
> Interim report,
>
> And the result is as follows. It reduces about 800 bytes compared to
> my first report, but stack usage still seems to be high. The VM
> functions really need a diet.

Yes. And in this case uninlining things might actually help, because
it's not actually performing reclaim in the second case, so inlining the
reclaim code into that huge __alloc_pages_nodemask() function means that
it has the stack frame for all those cases even if they don't actually
get used.

That said, the way those functions are set up (with lots of arguments
passed from one to the other), not inlining will cause huge costs too
for the argument setup.

It really might be very good to create a "struct alloc_info" that
contains those shared arguments, and just pass a (const) pointer to that
around. Gcc would likely tend to be *much* better at generating code for
that, because it avoids a ton of temporaries being created by function
calls. Even when it's inlined, the argument itself ends up being a new
temporary internally, and I suspect one reason gcc (especially your
4.6.3 version, apparently) generates those big spill frames is because
there's tons of these duplicate temporaries that apparently don't get
merged properly.

Ugh. I think I'll try looking at that tomorrow.

            Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/28/2014 07:42 PM, Linus Torvalds wrote:
>
> And Minchan running out of stack is at least _partly_ due to his debug
> options (that DEBUG_PAGEALLOC thing as an extreme example, but I
> suspect there's a few other options there that generate more bloated
> data structures too).
>

I have wondered if a larger stack would make sense as a debug option.
I'm just worried it will be abused.

	-hpa
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 09:13:15PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 8:46 PM, Minchan Kim wrote:
> >
> > Yes. For example, with marking __alloc_pages_slowpath
> > noinline_for_stack, we can reduce 176 bytes.
>
> Well, but it will then call that __alloc_pages_slowpath() function,
> which has a 176-byte stack frame.. Plus the call frame.
>
> Now, that only triggers for when the initial "__GFP_HARDWALL" case
> fails, but that's exactly what happens when we do need to do direct
> reclaim.
>
> That said, I *have* seen cases where the gcc spill code got really
> confused, and simplifying the function (by not inlining excessively)
> actually causes a truly smaller stack overall, despite the actual call
> frames etc. But I think the gcc people fixed the kinds of things that
> caused *that* kind of stack slot explosion.
>
> And avoiding inlining can end up resulting in less stack, if the
> really deep parts don't happen to go through that function that got
> inlined (ie any call chain that wouldn't have gone through that
> "slowpath" function at all).
>
> But in this case, __alloc_pages_slowpath() is where we end up doing
> the actual direct reclaim anyway, so just uninlining doesn't actually
> help. Although it would probably make the asm code more readable ;)

Indeed. :(

Actually, I found there are other places to optimize. For example, we
can uninline try_preserve_large_page in __change_page_attr_set_clr.
Although I'm not familiar with that part, I guess large pages would be
rare, so we could save 112 bytes.
before:

81042330 <__change_page_attr_set_clr>:
81042330:	e8 4b da 6a 00       	callq  816efd80 <__entry_text_start>
81042335:	55                   	push   %rbp
81042336:	48 89 e5             	mov    %rsp,%rbp
81042339:	41 57                	push   %r15
8104233b:	41 56                	push   %r14
8104233d:	41 55                	push   %r13
8104233f:	41 54                	push   %r12
81042341:	49 89 fc             	mov    %rdi,%r12
81042344:	53                   	push   %rbx
81042345:	48 81 ec f8 00 00 00 	sub    $0xf8,%rsp
8104234c:	8b 47 20             	mov    0x20(%rdi),%eax
8104234f:	89 b5 50 ff ff ff    	mov    %esi,-0xb0(%rbp)
81042355:	85 c0                	test   %eax,%eax
81042357:	89 85 5c ff ff ff    	mov    %eax,-0xa4(%rbp)
8104235d:	0f 84 8c 06 00 00    	je     810429ef <__change_page_attr_set_clr+0x6bf>

after:

81042740 <__change_page_attr_set_clr>:
81042740:	e8 bb d5 6a 00       	callq  816efd00 <__entry_text_start>
81042745:	55                   	push   %rbp
81042746:	48 89 e5             	mov    %rsp,%rbp
81042749:	41 57                	push   %r15
8104274b:	41 56                	push   %r14
8104274d:	41 55                	push   %r13
8104274f:	49 89 fd             	mov    %rdi,%r13
81042752:	41 54                	push   %r12
81042754:	53                   	push   %rbx
81042755:	48 81 ec 88 00 00 00 	sub    $0x88,%rsp
8104275c:	8b 47 20             	mov    0x20(%rdi),%eax
8104275f:	89 b5 70 ff ff ff    	mov    %esi,-0x90(%rbp)
81042765:	85 c0                	test   %eax,%eax
81042767:	89 85 74 ff ff ff    	mov    %eax,-0x8c(%rbp)
8104276d:	0f 84 cb 02 00 00    	je     81042a3e <__change_page_attr_set_clr+0x2fe>

And the patch below saves 96 bytes from shrink_lruvec. That would not be
everything, and I am not saying that optimizing every VM function is the
way to go, but I just want to note that we have room to optimize. I will
wait for more discussion and am happy to test (I can reproduce it in 1-2
hours if I have luck).

Thanks!

8115b560 <shrink_lruvec>:
8115b560:	e8 db 46 59 00       	callq  816efc40 <__entry_text_start>
8115b565:	55                   	push   %rbp
8115b566:	65 48 8b 04 25 40 ba 	mov    %gs:0xba40,%rax
8115b56d:	00 00
8115b56f:	48 89 e5             	mov    %rsp,%rbp
8115b572:	41 57                	push   %r15
8115b574:	41 56                	push   %r14
8115b576:	45 31 f6             	xor    %r14d,%r14d
8115b579:	41 55                	push   %r13
8115b57b:	49 89 fd             	mov    %rdi,%r13
8115b57e:	41 54                	push   %r12
8115b580:	49 89 f4             	mov    %rsi,%r12
8115b583:	49 83 c4 34          	add    $0x34,%r12
8115b587:	53                   	push
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 8:46 PM, Minchan Kim wrote:
>
> Yes. For example, with marking __alloc_pages_slowpath
> noinline_for_stack, we can reduce 176 bytes.

Well, but it will then call that __alloc_pages_slowpath() function,
which has a 176-byte stack frame.. Plus the call frame.

Now, that only triggers for when the initial "__GFP_HARDWALL" case
fails, but that's exactly what happens when we do need to do direct
reclaim.

That said, I *have* seen cases where the gcc spill code got really
confused, and simplifying the function (by not inlining excessively)
actually causes a truly smaller stack overall, despite the actual call
frames etc. But I think the gcc people fixed the kinds of things that
caused *that* kind of stack slot explosion.

And avoiding inlining can end up resulting in less stack, if the really
deep parts don't happen to go through that function that got inlined (ie
any call chain that wouldn't have gone through that "slowpath" function
at all).

But in this case, __alloc_pages_slowpath() is where we end up doing the
actual direct reclaim anyway, so just uninlining doesn't actually help.
Although it would probably make the asm code more readable ;)

            Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 10:44:48PM -0400, Steven Rostedt wrote:
> On Thu, 29 May 2014 10:09:40 +0900
> Minchan Kim wrote:
>
> > stacktrace reported that vring_add_indirect used 376 bytes, and
> > objdump says
> >
> > 8141dc60 <vring_add_indirect>:
> > 8141dc60:	55                   	push   %rbp
> > 8141dc61:	48 89 e5             	mov    %rsp,%rbp
> > 8141dc64:	41 57                	push   %r15
> > 8141dc66:	41 56                	push   %r14
> > 8141dc68:	41 55                	push   %r13
> > 8141dc6a:	49 89 fd             	mov    %rdi,%r13
> > 8141dc6d:	89 cf                	mov    %ecx,%edi
> > 8141dc6f:	48 c1 e7 04          	shl    $0x4,%rdi
> > 8141dc73:	41 54                	push   %r12
> > 8141dc75:	49 89 d4             	mov    %rdx,%r12
> > 8141dc78:	53                   	push   %rbx
> > 8141dc79:	48 89 f3             	mov    %rsi,%rbx
> > 8141dc7c:	48 83 ec 28          	sub    $0x28,%rsp
> > 8141dc80:	8b 75 20             	mov    0x20(%rbp),%esi
> > 8141dc83:	89 4d bc             	mov    %ecx,-0x44(%rbp)
> > 8141dc86:	44 89 45 cc          	mov    %r8d,-0x34(%rbp)
> > 8141dc8a:	44 89 4d c8          	mov    %r9d,-0x38(%rbp)
> > 8141dc8e:	83 e6 dd             	and    $0xffffffdd,%esi
> > 8141dc91:	e8 7a d1 d7 ff       	callq  8119ae10 <__kmalloc>
> > 8141dc96:	48 85 c0             	test   %rax,%rax
> >
> > So, it's *strange*.
> >
> > I will add .config and .o.
> > Maybe someone might find what happens.
>
> This is really bothering me. I'm trying to figure it out. We have from
> the stack trace:
>
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
>
> The way the stack tracer works is that when it detects a new max stack
> it calls save_stack_trace() to get the complete call chain from the
> stack. This should be rather accurate, as it seems that your kernel was
> compiled with frame pointers (confirmed by the objdump as well as the
> config file). It then uses that stack trace that it got to examine the
> stack to find the locations of the saved return addresses and records
> them in an array (in your case, an array of 50 entries).
>
> From your .o file:
>
> vring_add_indirect + 0x36: (0x370 + 0x36 = 0x3a6)
>
> 0370 <vring_add_indirect>:
>
>  39e:	83 e6 dd             	and    $0xffffffdd,%esi
>  3a1:	e8 00 00 00 00       	callq  3a6 <vring_add_indirect+0x36>
> 			3a2: R_X86_64_PC32	__kmalloc-0x4
>  3a6:	48 85 c0             	test   %rax,%rax
>
> Definitely the return address to the call to __kmalloc. Then to
> determine the size of the stack frame, it is subtracted from the next
> one down. In this case, the location of virtqueue_add_sgs+0x2e2.
>
> virtqueue_add_sgs + 0x2e2: (0x880 + 0x2e2 = 0xb62)
>
> 0880 <virtqueue_add_sgs>:
>
>  b4f:	89 4c 24 08          	mov    %ecx,0x8(%rsp)
>  b53:	48 c7 c2 00 00 00 00 	mov    $0x0,%rdx
> 			b56: R_X86_64_32S	.text+0x570
>  b5a:	44 89 d1             	mov    %r10d,%ecx
>  b5d:	e8 0e f8 ff ff       	callq  370 <vring_add_indirect>
>  b62:	85 c0                	test   %eax,%eax
>
> Which is the return address of where vring_add_indirect was called.
>
> The return address back to virtqueue_add_sgs was found at 6000 bytes of
> the stack. The return address back to vring_add_indirect was found at
> 6376 bytes from the top of the stack.
>
> My question is, why were they so far apart? I see 6 words pushed
> (8 bytes each, for a total of 48 bytes), and a subtraction of the stack
> pointer of 0x28 (40 bytes), giving us a total of 88 bytes. Plus we need
> to add the push of the return address itself, which would just give us
> 96 bytes for the stack frame. What is making this show 376 bytes??

That's what I want to know. :(

> Looking more into this, I'm not sure I trust the top numbers anymore.
> kmalloc reports a stack frame of 80, and I'm coming up with 104
> (perhaps even 112). And slab_alloc only has 8. Something's messed up
> there.

Yes, it looks weird, but some of the upper functions in the callstack
match well, so I suspect only the top functions of the callstack were
corrupted. But in the case of alloc_pages_current (8 bytes) it looks
weird too, yet it reports the same value, 8 bytes, in both the upper and
bottom parts of the callstack. :(

> -- Steve
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 09:09:23AM -0700, Linus Torvalds wrote:
> On Tue, May 27, 2014 at 11:53 PM, Minchan Kim wrote:
> >
> > So, my stupid idea is just let's expand the stack size and keep an
> > eye on stack consumption in each kernel function via the stacktrace
> > of ftrace.
>
> We probably have to do this at some point, but that point is not -rc7.
>
> And quite frankly, from the backtrace, I can only say: there is some
> bad shit there. The current VM stands out as a bloated pig:
>
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > [ 1065.604404] kworker/-57660d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> > [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> > [ 1065.604404] kworker/-57660d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
>
> > [ 1065.604404] kworker/-57660d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> > [ 1065.604404] kworker/-57660d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
> > [ 1065.604404] kworker/-57660d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160
>
> That __alloc_pages_nodemask() thing in particular looks bad. It
> actually seems not to be the usual "let's just allocate some
> structures on the stack" disease, it looks more like "lots of
> inlining, horrible calling conventions, and lots of random stupid
> variables".

Yes. For example, with marking __alloc_pages_slowpath
noinline_for_stack, we can reduce 176 bytes. And there are more places
where we could reduce stack consumption, but I thought that was a
band-aid, although reducing stack usage itself is desirable.

before:

81150600 <__alloc_pages_nodemask>:
81150600:	e8 fb f6 59 00       	callq  816efd00 <__entry_text_start>
81150605:	55                   	push   %rbp
81150606:	b8 e8 e8 00 00       	mov    $0xe8e8,%eax
8115060b:	48 89 e5             	mov    %rsp,%rbp
8115060e:	41 57                	push   %r15
81150610:	41 56                	push   %r14
81150612:	41 be 22 01 32 01    	mov    $0x1320122,%r14d
81150618:	41 55                	push   %r13
8115061a:	41 54                	push   %r12
8115061c:	41 89 fc             	mov    %edi,%r12d
8115061f:	53                   	push   %rbx
81150620:	48 81 ec 28 01 00 00 	sub    $0x128,%rsp
81150627:	48 89 55 88          	mov    %rdx,-0x78(%rbp)
8115062b:	89 fa                	mov    %edi,%edx
8115062d:	83 e2 0f             	and    $0xf,%edx
81150630:	48 89 4d 90          	mov    %rcx,-0x70(%rbp)

after:

81150600 <__alloc_pages_nodemask>:
81150600:	e8 7b f6 59 00       	callq  816efc80 <__entry_text_start>
81150605:	55                   	push   %rbp
81150606:	b8 e8 e8 00 00       	mov    $0xe8e8,%eax
8115060b:	48 89 e5             	mov    %rsp,%rbp
8115060e:	41 57                	push   %r15
81150610:	41 bf 22 01 32 01    	mov    $0x1320122,%r15d
81150616:	41 56                	push   %r14
81150618:	41 55
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
[ Crossed emails ]

On Wed, May 28, 2014 at 6:58 PM, Dave Chinner wrote:
> On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> >
> > And now we have too deep a stack due to unplugging from
> > io_schedule()...
>
> So, if we make io_schedule() push the plug list off to the kblockd
> like is done for schedule()

We might have a few different cases.

The cases where we *do* care about latency is when we are waiting for
the IO ourselves (ie in wait_on_page() and friends), and those end up
using io_schedule() too. So in *that* case we definitely have a latency
argument for doing it directly, and we shouldn't kick it off to kblockd.
That's very much a "get this started as soon as humanly possible".

But the "wait_iff_congested()" code that also uses io_schedule() should
push it out to kblockd, methinks.

> > This stack overflow shows us that just the memory reclaim + IO
> > layers are sufficient to cause a stack overflow,
>
> we solve this problem directly by being able to remove the IO
> stack usage from the direct reclaim swap path.
>
> IOWs, we don't need to turn swap off at all in direct reclaim
> because all the swap IO can be captured in a plug list and
> dispatched via kblockd. This could be done either by io_schedule()
> or a new blk_flush_plug_list() wrapper that pushes the work to
> kblockd...

That would work. That said, I personally would not mind to see that
"swap is special" go away, if possible. Because it can be behind a
filesystem too. Christ, even NFS (and people used to fight that tooth
and nail!) is back as a swap target..

            Linus
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
Minchan Kim writes:
> On Wed, May 28, 2014 at 12:04:09PM +0300, Michael S. Tsirkin wrote:
>> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call: 9) 6456 80 __kmalloc+0x1cb/0x200
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call: 10) 6376 376 vring_add_indirect+0x36/0x200
>> > [ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call: 11) 6000 144 virtqueue_add_sgs+0x2e2/0x320

Hmm, we can actually skip the vring_add_indirect if we're hurting for
stack. It just means the request will try to fit linearly in the ring,
rather than using indirect.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e443629f76d..496e727cefc8 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -184,6 +184,13 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	return head;
 }
 
+/* The Morton Technique */
+static noinline bool stack_trouble(void)
+{
+	unsigned long sp = (unsigned long)&sp;
+	return sp - (sp & ~(THREAD_SIZE - 1)) < 3000;
+}
+
 static inline int virtqueue_add(struct virtqueue *_vq,
 				struct scatterlist *sgs[],
 				struct scatterlist *(*next)
@@ -226,7 +233,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
+	if (vq->indirect && total_sg > 1 && vq->vq.num_free && !stack_trouble()) {
 		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
 					  total_in, out_sgs, in_sgs, gfp);
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, 29 May 2014 10:09:40 +0900 Minchan Kim wrote:

> stacktrace reported that vring_add_indirect used 376byte and objdump says
>
> 8141dc60 <vring_add_indirect>:
> 8141dc60:  55                    push   %rbp
> 8141dc61:  48 89 e5              mov    %rsp,%rbp
> 8141dc64:  41 57                 push   %r15
> 8141dc66:  41 56                 push   %r14
> 8141dc68:  41 55                 push   %r13
> 8141dc6a:  49 89 fd              mov    %rdi,%r13
> 8141dc6d:  89 cf                 mov    %ecx,%edi
> 8141dc6f:  48 c1 e7 04           shl    $0x4,%rdi
> 8141dc73:  41 54                 push   %r12
> 8141dc75:  49 89 d4              mov    %rdx,%r12
> 8141dc78:  53                    push   %rbx
> 8141dc79:  48 89 f3              mov    %rsi,%rbx
> 8141dc7c:  48 83 ec 28           sub    $0x28,%rsp
> 8141dc80:  8b 75 20              mov    0x20(%rbp),%esi
> 8141dc83:  89 4d bc              mov    %ecx,-0x44(%rbp)
> 8141dc86:  44 89 45 cc           mov    %r8d,-0x34(%rbp)
> 8141dc8a:  44 89 4d c8           mov    %r9d,-0x38(%rbp)
> 8141dc8e:  83 e6 dd              and    $0xffdd,%esi
> 8141dc91:  e8 7a d1 d7 ff        callq  8119ae10 <__kmalloc>
> 8141dc96:  48 85 c0              test   %rax,%rax
>
> So, it's *strange*.
>
> I will add .config and .o.
> Maybe someone might find what happens.

This is really bothering me. I'm trying to figure it out.

We have from the stack trace:

[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call: 9) 6456 80 __kmalloc+0x1cb/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call: 10) 6376 376 vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-57660d..2 1071625993us : stack_trace_call: 11) 6000 144 virtqueue_add_sgs+0x2e2/0x320

The way the stack tracer works is that when it detects a new max stack
it calls save_stack_trace() to get the complete call chain from the
stack. This should be rather accurate, as it seems that your kernel was
compiled with frame pointers (confirmed by the objdump as well as the
config file). It then uses the stack trace that it got to examine the
stack to find the locations of the saved return addresses, and records
them in an array (in your case, an array of 50 entries).
From your .o file, vring_add_indirect+0x36 (0x370 + 0x36 = 0x3a6):

0000000000000370 <vring_add_indirect>:
...
 39e:  83 e6 dd              and    $0xffdd,%esi
 3a1:  e8 00 00 00 00        callq  3a6 <vring_add_indirect+0x36>
            3a2: R_X86_64_PC32  __kmalloc-0x4
 3a6:  48 85 c0              test   %rax,%rax

Definitely the return address of the call to __kmalloc.

Then, to determine the size of the stack frame, it is subtracted from
the next one down. In this case, the location of virtqueue_add_sgs+0x2e2.

virtqueue_add_sgs+0x2e2 (0x880 + 0x2e2 = 0xb62):

0000000000000880 <virtqueue_add_sgs>:
...
 b4f:  89 4c 24 08           mov    %ecx,0x8(%rsp)
 b53:  48 c7 c2 00 00 00 00  mov    $0x0,%rdx
            b56: R_X86_64_32S  .text+0x570
 b5a:  44 89 d1              mov    %r10d,%ecx
 b5d:  e8 0e f8 ff ff        callq  370 <vring_add_indirect>
 b62:  85 c0                 test   %eax,%eax

Which is the return address of where vring_add_indirect was called.

The return address back to virtqueue_add_sgs was found at 6000 bytes
from the top of the stack. The return address back to vring_add_indirect
was found at 6376 bytes from the top of the stack.

My question is, why were they so far apart? I see 6 words pushed
(8 bytes each, for a total of 48 bytes), and a subtraction of the stack
pointer of 0x28 (40 bytes), giving us a total of 88 bytes. Plus we need
to add the push of the return address itself, which gives us just
96 bytes for the stack frame. What is making this show 376 bytes??

Looking more into this, I'm not sure I trust the top numbers anymore.
kmalloc reports a stack frame of 80, and I'm coming up with 104
(perhaps even 112). And slab_alloc only has 8. Something's messed up
there.

-- Steve
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 6:30 PM, Dave Chinner wrote:
>
> You're focussing on the specific symptoms, not the bigger picture.
> i.e. you're ignoring all the other "let's start IO" triggers in
> direct reclaim. e.g there's two separate plug flush triggers in
> shrink_inactive_list(), one of which is:

Fair enough. I certainly agree that we should look at the other cases
here too.

In fact, I also find it distasteful just how much stack space some of
those VM routines are just using up on their own, never mind any actual
IO paths at all. The fact that __alloc_pages_nodemask() uses 350 bytes
of stack space on its own is actually quite disturbing. The fact that
kernel_map_pages() apparently has almost 400 bytes of stack is just
crazy. Obviously that case only happens with CONFIG_DEBUG_PAGEALLOC,
but still..

> I'm not saying we shouldn't turn off swap from direct reclaim, just
> that all we'd be doing by turning off swap is playing whack-a-stack
> - the next report will simply be from one of the other direct
> reclaim IO schedule points.

Playing whack-a-mole with this for a while might not be a bad idea,
though. It's not like we will ever really improve unless we start
whacking the worst cases. And it should still be a fairly limited
number. After all, historically, some of the cases we've played
whack-a-mole on have been in XFS, so I'd think you'd be thrilled to see
some other code get blamed this time around ;)

> Regardless of whether it is swap or something external queues the
> bio on the plug, perhaps we should look at why it's done inline
> rather than by kblockd, where it was moved because it was blowing
> the stack from schedule():

So it sounds like we need to do this for io_schedule() too.

In fact, we've generally found it to be a mistake every time we
"automatically" unblock some IO queue. And I'm not saying that because
of stack space, but because we've _often_ had the situation that eager
unblocking results in IO that could have been done as bigger requests.
Of course, we do need to worry about latency for starting IO, but any
of these kinds of memory-pressure writeback patterns are pretty much by
definition not about the latency of one _particular_ IO, so they don't
tend to be latency-sensitive. Quite the reverse: we start writeback and
then end up waiting on something else altogether (possibly a writeback
that got started much earlier).

swapout certainly is _not_ IO-latency-sensitive, especially these days.
And while we _do_ want to throttle in direct reclaim, if it's about
throttling I'd certainly think that it sounds quite reasonable to push
any unplugging to kblockd rather than try to do that synchronously. If
we are throttling in direct-reclaim, we need to slow things _down_ for
the writer, not worry about latency.

> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.

Clearly it is true that it lacks the actual filesystem part needed for
the writeback. At the same time, Minchan's example is certainly a good
one of a filesystem (ext4) already being reasonably deep in its own
stack space when it then wants memory.

Looking at that callchain, I have to say that ext4 doesn't look
horrible compared to the whole block layer and virtio.. Yes,
"ext4_writepages()" is using almost 400 bytes of stack, and most of
that seems to be due to:

        struct mpage_da_data mpd;
        struct blk_plug plug;

which looks at least understandable (nothing like the mess in the VM
code where the stack usage is because gcc creates horrible spills)

> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow, which is something
> I've never seen before.
Well, we've definitely had some issues with deeper callchains with md,
but I suspect virtio might be worse, and the new blk-mq code is likely
worse in this respect too.

And Minchan running out of stack is at least _partly_ due to his debug
options (that DEBUG_PAGEALLOC thing as an extreme example, but I
suspect there are a few other options there that generate more bloated
data structures too).

> That implies no IO in direct reclaim context
> is safe - either from swap or io_schedule() unplugging. It also
> lends a lot of weight to my assertion that the majority of the stack
> growth over the past couple of years has been occurring outside the
> filesystems

I think Minchan's stack trace definitely backs you up on that. The
filesystem part - despite that one ext4_writepages() function - is a
very small part of the whole. It sits at about ~1kB of stack. Just the
VM "top-level" writeback code is about as much, and then the VM page
alloc/shrinking code when the filesystem needs memory is
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
>
> commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> Author: Jens Axboe
> Date:   Sat Apr 16 13:27:55 2011 +0200
>
>     block: let io_schedule() flush the plug inline
>
>     Linus correctly observes that the most important dispatch cases
>     are now done from kblockd, this isn't ideal for latency reasons.
>     The original reason for switching dispatches out-of-line was to
>     avoid too deep a stack, so by _only_ letting the "accidental"
>     flush directly in schedule() be guarded by offload to kblockd,
>     we should be able to get the best of both worlds.
>
>     So add a blk_schedule_flush_plug() that offloads to kblockd,
>     and only use that from the schedule() path.
>
>     Signed-off-by: Jens Axboe
>
> And now we have too deep a stack due to unplugging from io_schedule()...

So, if we make io_schedule() push the plug list off to the kblockd
like is done for schedule()

> > IOW, swap-out directly caused that extra 3kB of stack use in what was
> > a deep call chain (due to memory allocation). I really don't
> > understand why you are arguing anything else on a pure technicality.
> >
> > I thought you had some other argument for why swap was different, and
> > against removing that "page_is_file_cache()" special case in
> > shrink_page_list().
>
> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.
>
> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow,

we solve this problem directly by being able to remove the IO
stack usage from the direct reclaim swap path.
IOWs, we don't need to turn swap off at all in direct reclaim
because all the swap IO can be captured in a plug list and
dispatched via kblockd. This could be done either by io_schedule()
or a new blk_flush_plug_list() wrapper that pushes the work to
kblockd...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 3:31 PM, Dave Chinner wrote:
> >
> > Indeed, the call chain reported here is not caused by swap issuing
> > IO.
>
> Well, that's one way of reading that callchain.
>
> I think it's the *wrong* way of reading it, though. Almost dishonestly
> so.

I guess you haven't met your insult quota for the day, Linus. :/

> Because very clearly, the swapout _is_ what causes the unplugging
> of the IO queue, and does so because it is allocating the BIO for its
> own IO. The fact that that then fails (because of other IO's in
> flight), and causes *other* IO to be flushed, doesn't really change
> anything fundamental. It's still very much swap that causes that
> "let's start IO".

It is not rocket science to see how plugging outside memory allocation
context can lead to flushing that plug within direct reclaim without
having dispatched any IO at all from direct reclaim.

You're focussing on the specific symptoms, not the bigger picture.
i.e. you're ignoring all the other "let's start IO" triggers in
direct reclaim. e.g. there are two separate plug flush triggers in
shrink_inactive_list(), one of which is:

	/*
	 * Stall direct reclaim for IO completions if underlying BDIs or zone
	 * is congested. Allow kswapd to continue until it starts encountering
	 * unqueued dirty pages or cycling through the LRU too quickly.
	 */
	if (!sc->hibernation_mode && !current_is_kswapd())
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

I'm not saying we shouldn't turn off swap from direct reclaim, just
that all we'd be doing by turning off swap is playing whack-a-stack
- the next report will simply be from one of the other direct
reclaim IO schedule points.
Regardless of whether it is swap or something external queues the
bio on the plug, perhaps we should look at why it's done inline
rather than by kblockd, where it was moved because it was blowing
the stack from schedule():

commit f4af3c3d077a004762aaad052049c809fd8c6f0c
Author: Jens Axboe
Date:   Tue Apr 12 14:58:51 2011 +0200

    block: move queue run on unplug to kblockd

    There are worries that we are now consuming a lot more stack in
    some cases, since we potentially call into IO dispatch from
    schedule() or io_schedule(). We can reduce this problem by moving
    the running of the queue to kblockd, like the old plugging scheme
    did as well.

    This may or may not be a good idea from a performance perspective,
    depending on how many tasks have queue plugs running at the same
    time. For even the slightly contended case, doing just a single
    queue run from kblockd instead of multiple runs directly from the
    unpluggers will be faster.

    Signed-off-by: Jens Axboe

commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe
Date:   Sat Apr 16 13:27:55 2011 +0200

    block: let io_schedule() flush the plug inline

    Linus correctly observes that the most important dispatch cases
    are now done from kblockd, this isn't ideal for latency reasons.
    The original reason for switching dispatches out-of-line was to
    avoid too deep a stack, so by _only_ letting the "accidental"
    flush directly in schedule() be guarded by offload to kblockd,
    we should be able to get the best of both worlds.

    So add a blk_schedule_flush_plug() that offloads to kblockd,
    and only use that from the schedule() path.

    Signed-off-by: Jens Axboe

And now we have too deep a stack due to unplugging from io_schedule()...

> IOW, swap-out directly caused that extra 3kB of stack use in what was
> a deep call chain (due to memory allocation). I really don't
> understand why you are arguing anything else on a pure technicality.
>
> I thought you had some other argument for why swap was different, and
> against removing that "page_is_file_cache()" special case in
> shrink_page_list().

I've said in the past that swap is different to filesystem
->writepage implementations because it doesn't require significant
stack to do block allocation and doesn't trigger IO deep in that
allocation stack. Hence it has much lower stack overhead than the
filesystem ->writepage implementations and so is much less likely to
have stack issues.

This stack overflow shows us that just the memory reclaim + IO
layers are sufficient to cause a stack overflow, which is something
I've never seen before. That implies no IO in direct reclaim context
is safe - either from swap or io_schedule() unplugging. It also
lends a lot of weight to my assertion that the majority of the stack
growth over the past couple of years has been occurring outside the
filesystems.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/28/2014 04:17 PM, Dave Chinner wrote:
>>
>> You were the one calling it a canary.
>
> That doesn't mean it's to blame. Don't shoot the messenger...

Fair enough.

	-hpa
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On Wed, May 28, 2014 at 03:42:18PM -0700, H. Peter Anvin wrote:
> On 05/28/2014 03:11 PM, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
> >> We tried for 4K on x86-64, too, for quite a while as I recall.
> >> The kernel stack is one of the main costs for a thread. I would
> >> like to decouple struct thread_info from the kernel stack (PJ
> >> Waskewicz was working on that before he left Intel) but that
> >> doesn't buy us all that much.
> >>
> >> 8K additional per thread is a huge hit. XFS has indeed always
> >> been a canary, or troublespot, I suspect because it originally
> >> came from another kernel where this was not an optimization
> >> target.
> >
> > Always blame XFS for stack usage problems.
> >
> > Even when the reported problem is from IO to an ext4 filesystem.
>
> You were the one calling it a canary.

That doesn't mean it's to blame. Don't shoot the messenger...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC 2/2] x86_64: expand kernel stack to 16K
On 05/28/2014 03:11 PM, Dave Chinner wrote:
> On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
>> We tried for 4K on x86-64, too, for quite a while as I recall.
>> The kernel stack is one of the main costs for a thread. I would
>> like to decouple struct thread_info from the kernel stack (PJ
>> Waskewicz was working on that before he left Intel) but that
>> doesn't buy us all that much.
>>
>> 8K additional per thread is a huge hit. XFS has indeed always
>> been a canary, or troublespot, I suspect because it originally
>> came from another kernel where this was not an optimization
>> target.
>
> Always blame XFS for stack usage problems.
>
> Even when the reported problem is from IO to an ext4 filesystem.

You were the one calling it a canary.

	-hpa