Re: kswapd craziness in 3.7
On 11.12.2012 01:19, Zlatko Calusic wrote:
> On 10.12.2012 20:13, Linus Torvalds wrote:
> > It's worth giving this as much testing as is at all possible, but at
> > the same time I really don't think I can delay 3.7 any more without
> > messing up the holiday season too much. So unless something obvious
> > pops up, I will do the release tonight. So testing will be minimal -
> > but it's not like we haven't gone back-and-forth on this several times
> > already, and we revert to *mostly* the same old state as 3.6 anyway,
> > so it should be fairly safe.
>
> So, here's what I found. In short: close, but no cigar!
>
> Kswapd is certainly no longer a CPU pig, and memory seems to be utilized
> properly (the kernel still likes to keep 400MB free; somebody else can
> confirm whether that's to be expected on a 4GB THP-enabled machine). So it
> looks very decent, and much better than anything I've run in the last 10
> days, barring the !THP kernel.
>
> What remains a mystery is that kswapd occasionally still gets stuck in
> D state, only now it recovers faster than before (sometimes in a matter
> of seconds, but sometimes it takes a few minutes). Now, I admit it's a
> small, maybe even cosmetic issue. But it could also be a warning sign of
> a bigger problem that will reveal itself on a more loaded machine.

Ha, I nailed it!

The cigar, aka the explanation together with a patch, will follow shortly in a separate topic. It's a genuine bug that has been with us for a long, long time.

--
Zlatko
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: kswapd craziness in 3.7
On 11.12.2012 01:19, Zlatko Calusic wrote:
> I will now make one last attempt: I've just reverted two of Johannes'
> commits that were also applied in an attempt to fix the breakage that
> removing __GFP_NO_KSWAPD introduced, namely ed23ec4 & c702418. For various
> reasons the results of this test will be available tomorrow, so it's
> your call, Linus.

To be honest, I don't see any difference with those two commits reverted. It's as if those lines never did much anyway, so it's probably good we got rid of them. :P

--
Zlatko
Re: kswapd craziness in 3.7
On Mon, 10 Dec 2012, Linus Torvalds wrote: > [ Adding High Dickins because of the shmem oops. ] I had already noticed, and was about to reply; but only then refreshed my mbox window, to find that you've already done it all for me: thanks. > > On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic > wrote: > > > > And funny thing that you mention i915, because yesterday my daughter > > managed to lock up our laptop hard (that was a first), and this is what I > > found in kern.log after restart: > > > > Dec 9 21:29:42 titan vmunix: general protection fault: [#1] PREEMPT > > SMP > > Dec 9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) > > vboxnetflt(O) vboxdrv(O) [last unloaded: microcode] > > Dec 9 21:29:42 titan vmunix: CPU 2 > > Dec 9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G O > > 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B > > Dec 9 21:29:42 titan vmunix: RIP: 0010:[] > > [] find_get_page+0x3c/0x90 > > Ho humm.. > > I'm not convinced this is related. 
> > > Dec 9 21:29:42 titan vmunix: Call Trace: > > Dec 9 21:29:42 titan vmunix: [] find_lock_page+0x21/0x80 > > Dec 9 21:29:42 titan vmunix: [] > > shmem_getpage_gfp+0xa0/0x620 > > Dec 9 21:29:42 titan vmunix: [] > > shmem_read_mapping_page_gfp+0x2c/0x50 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_get_pages_gtt+0xe1/0x270 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_get_pages+0x4f/0x90 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_bind_to_gtt+0xc3/0x4c0 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_pin+0x123/0x190 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_execbuffer2+0x94/0x280 > > Dec 9 21:29:42 titan vmunix: [] drm_ioctl+0x493/0x530 > > Dec 9 21:29:42 titan vmunix: [] do_vfs_ioctl+0x8f/0x530 > > Dec 9 21:29:42 titan vmunix: [] sys_ioctl+0x4b/0x90 > > Dec 9 21:29:42 titan vmunix: [] > > system_call_fastpath+0x16/0x1b > > > > It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, > > the i915 driver will need to be taken better care of. > > That decodes to > > 11: e8 89 b7 15 00 callq 0x15b79f # radix_tree_lookup_slot > 16: 48 85 c0 test %rax,%rax > 19: 48 89 c6 mov%rax,%rsi > 1c: 74 41 je 0x5f > 1e: 48 8b 18 mov(%rax),%rbx # > 21: 48 85 db test %rbx,%rbx > 24: 74 1f je 0x45 > 26: f6 c3 03 test $0x3,%bl > 29: 75 3c jne0x67 > 2b:* 8b 53 1c mov0x1c(%rbx),%edx <-- trapping > instruction > 2e: 85 d2 test %edx,%edx > 30: 74 d9 je 0xb > > where %rbx is 0x0200. That looks like it could be a > single-bit error, and should have been zero. > > It's the "atomic_read(&page->counter)" which is part of > "page_cache_get_speculative()" as far as I can tell, and it's the > "page" pointer that is that odd (non-pointer) value. 
The fact that > %ecx contains the value "-6" makes me wonder if there was a -ENXIO > somewhere, though.

Yes, just what I was about to say; except I never considered the -6. I was going to suggest it's a new notebook with not-so-good memory, but see that Borislav has since made a better suggestion.

> None of it looks all that much related to whether the i915 driver uses
> GFP_NO_KSWAPD or not, though.

Yes, no evidence here of anything to delay 3.7 further.

I'm running on current git, and no problems observed; but then, I never did see any of these kswapd problems anyway. And, in particular, I was unable to reproduce Zlatko's 1GB of 4GB kept free (on yesterday's tree, with no swap) - I saw about 100MB kept free.

Hugh
Re: kswapd craziness in 3.7
On 10.12.2012 22:54, Borislav Petkov wrote:
> On Mon, Dec 10, 2012 at 01:47:23PM -0800, Linus Torvalds wrote:
> > On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov wrote:
> > > Aren't we gonna consider the out-of-tree vbox modules being loaded and
> > > causing some corruptions like maybe the single-bit error above?
> > >
> > > I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317
> >
> > Yup, that looks more likely, I agree.
>
> @Zlatko: can your daughter try to retrigger the freeze without the vbox
> modules loaded?

Sure thing! :)

Although the vbox modules were only loaded; no VM was running at the time the lockup happened. But I've just read the whole thread you mention above, and I understand the concern. I'll make sure the vbox modules are unloaded when not really needed (most of the time on that machine), in case the lockup happens again.

Next time my daughter plays online games, I'll tell her she's actually serving a greater purpose, and let her take her time. :)

--
Zlatko
Re: kswapd craziness in 3.7
On Mon, Dec 10, 2012 at 01:47:23PM -0800, Linus Torvalds wrote:
> On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov wrote:
> >
> > Aren't we gonna consider the out-of-tree vbox modules being loaded and
> > causing some corruptions like maybe the single-bit error above?
> >
> > I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317
>
> Yup, that looks more likely, I agree.

@Zlatko: can your daughter try to retrigger the freeze without the vbox modules loaded?

Thanks.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: kswapd craziness in 3.7
On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov wrote:
>
> Aren't we gonna consider the out-of-tree vbox modules being loaded and
> causing some corruptions like maybe the single-bit error above?
>
> I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317

Yup, that looks more likely, I agree.

Linus
Re: kswapd craziness in 3.7
On Mon, Dec 10, 2012 at 01:28:54PM -0800, Linus Torvalds wrote: > [ Adding High Dickins because of the shmem oops. ] > > On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic > wrote: > > > > And funny thing that you mention i915, because yesterday my daughter > > managed to lock up our laptop hard (that was a first), and this is what I > > found in kern.log after restart: > > > > Dec 9 21:29:42 titan vmunix: general protection fault: [#1] PREEMPT > > SMP > > Dec 9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) > > vboxnetflt(O) vboxdrv(O) [last unloaded: microcode] > > Dec 9 21:29:42 titan vmunix: CPU 2 > > Dec 9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G O > > 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B > > Dec 9 21:29:42 titan vmunix: RIP: 0010:[] > > [] find_get_page+0x3c/0x90 > > Ho humm.. > > I'm not convinced this is related. > > > Dec 9 21:29:42 titan vmunix: Call Trace: > > Dec 9 21:29:42 titan vmunix: [] find_lock_page+0x21/0x80 > > Dec 9 21:29:42 titan vmunix: [] > > shmem_getpage_gfp+0xa0/0x620 > > Dec 9 21:29:42 titan vmunix: [] > > shmem_read_mapping_page_gfp+0x2c/0x50 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_get_pages_gtt+0xe1/0x270 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_get_pages+0x4f/0x90 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_bind_to_gtt+0xc3/0x4c0 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_object_pin+0x123/0x190 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0 > > Dec 9 21:29:42 titan vmunix: [] > > i915_gem_execbuffer2+0x94/0x280 > > Dec 9 21:29:42 titan vmunix: [] drm_ioctl+0x493/0x530 > > Dec 9 21:29:42 titan vmunix: [] do_vfs_ioctl+0x8f/0x530 > > Dec 9 21:29:42 titan vmunix: [] sys_ioctl+0x4b/0x90 > > Dec 9 21:29:42 titan vmunix: 
[] > > system_call_fastpath+0x16/0x1b
> >
> > It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again,
> > the i915 driver will need to be taken better care of.
>
> That decodes to
>
>   11:   e8 89 b7 15 00    callq  0x15b79f        # radix_tree_lookup_slot
>   16:   48 85 c0          test   %rax,%rax
>   19:   48 89 c6          mov    %rax,%rsi
>   1c:   74 41             je     0x5f
>   1e:   48 8b 18          mov    (%rax),%rbx     #
>   21:   48 85 db          test   %rbx,%rbx
>   24:   74 1f             je     0x45
>   26:   f6 c3 03          test   $0x3,%bl
>   29:   75 3c             jne    0x67
>   2b:*  8b 53 1c          mov    0x1c(%rbx),%edx <-- trapping instruction
>   2e:   85 d2             test   %edx,%edx
>   30:   74 d9             je     0xb
>
> where %rbx is 0x0200. That looks like it could be a
> single-bit error, and should have been zero.
>
> It's the "atomic_read(&page->counter)" which is part of
> "page_cache_get_speculative()" as far as I can tell, and it's the
> "page" pointer that is that odd (non-pointer) value. The fact that
> %ecx contains the value "-6" makes me wonder if there was a -ENXIO
> somewhere, though.
>
> None of it looks all that much related to whether the i915 driver uses
> GFP_NO_KSWAPD or not, though.

Aren't we gonna consider the out-of-tree vbox modules being loaded and causing some corruptions like maybe the single-bit error above?

I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317

Hmm.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: kswapd craziness in 3.7
[ Adding Hugh Dickins because of the shmem oops. ]

On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic wrote:
>
> And funny thing that you mention i915, because yesterday my daughter managed
> to lock up our laptop hard (that was a first), and this is what I found in
> kern.log after restart:
>
> Dec 9 21:29:42 titan vmunix: general protection fault: [#1] PREEMPT SMP
> Dec 9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O)
> vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
> Dec 9 21:29:42 titan vmunix: CPU 2
> Dec 9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G O
> 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
> Dec 9 21:29:42 titan vmunix: RIP: 0010:[]
> [] find_get_page+0x3c/0x90

Ho humm..

I'm not convinced this is related.

> Dec 9 21:29:42 titan vmunix: Call Trace:
> Dec 9 21:29:42 titan vmunix: [] find_lock_page+0x21/0x80
> Dec 9 21:29:42 titan vmunix: [] shmem_getpage_gfp+0xa0/0x620
> Dec 9 21:29:42 titan vmunix: [] shmem_read_mapping_page_gfp+0x2c/0x50
> Dec 9 21:29:42 titan vmunix: [] i915_gem_object_get_pages_gtt+0xe1/0x270
> Dec 9 21:29:42 titan vmunix: [] i915_gem_object_get_pages+0x4f/0x90
> Dec 9 21:29:42 titan vmunix: [] i915_gem_object_bind_to_gtt+0xc3/0x4c0
> Dec 9 21:29:42 titan vmunix: [] i915_gem_object_pin+0x123/0x190
> Dec 9 21:29:42 titan vmunix: [] i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
> Dec 9 21:29:42 titan vmunix: [] i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
> Dec 9 21:29:42 titan vmunix: [] i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
> Dec 9 21:29:42 titan vmunix: [] i915_gem_execbuffer2+0x94/0x280
> Dec 9 21:29:42 titan vmunix: [] drm_ioctl+0x493/0x530
> Dec 9 21:29:42 titan vmunix: [] do_vfs_ioctl+0x8f/0x530
> Dec 9 21:29:42 titan vmunix: [] sys_ioctl+0x4b/0x90
> Dec 9 21:29:42 titan vmunix: [] system_call_fastpath+0x16/0x1b
>
> It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again,
> the i915 driver will need to be taken better care of.

That decodes to

  11:   e8 89 b7 15 00    callq  0x15b79f        # radix_tree_lookup_slot
  16:   48 85 c0          test   %rax,%rax
  19:   48 89 c6          mov    %rax,%rsi
  1c:   74 41             je     0x5f
  1e:   48 8b 18          mov    (%rax),%rbx     #
  21:   48 85 db          test   %rbx,%rbx
  24:   74 1f             je     0x45
  26:   f6 c3 03          test   $0x3,%bl
  29:   75 3c             jne    0x67
  2b:*  8b 53 1c          mov    0x1c(%rbx),%edx <-- trapping instruction
  2e:   85 d2             test   %edx,%edx
  30:   74 d9             je     0xb

where %rbx is 0x0200. That looks like it could be a single-bit error, and should have been zero.

It's the "atomic_read(&page->counter)" which is part of "page_cache_get_speculative()" as far as I can tell, and it's the "page" pointer that is that odd (non-pointer) value. The fact that %ecx contains the value "-6" makes me wonder if there was a -ENXIO somewhere, though.

None of it looks all that much related to whether the i915 driver uses GFP_NO_KSWAPD or not, though.

Linus
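The two observations in the message above (a pointer value that looks like a single flipped bit, and %ecx holding -6, which would match -ENXIO) can be checked with a small user-space sketch. This is only an illustration: the helper names `is_single_bit` and `reg32_as_signed` are made up here, the register dump in this archive appears truncated so the values are used as printed, and 0xfffffffa is simply the 32-bit two's-complement pattern for -6.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: true if exactly one bit is set in v. */
int is_single_bit(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper: interpret a 32-bit register value as a
 * two's-complement signed number, without relying on
 * implementation-defined integer conversions.
 */
int64_t reg32_as_signed(uint32_t v)
{
    return (v & 0x80000000u) ? (int64_t)v - 4294967296LL : (int64_t)v;
}

/* Aborts if the reading of the oops discussed above does not hold. */
void check_oops_observations(void)
{
    /* %rbx as printed in the thread: one bit set where NULL was expected. */
    assert(is_single_bit(0x0200));
    /* A 32-bit 0xfffffffa reads as -6, i.e. -ENXIO on Linux. */
    assert(reg32_as_signed(0xfffffffau) == -ENXIO);
}
```

Calling `check_oops_observations()` from a test program returns silently when both readings hold. A single set bit is consistent with the bit-flip theory, but it does not by itself distinguish bad memory from a stray write by another module.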
Re: kswapd craziness in 3.7
On 10.12.2012 20:13, Linus Torvalds wrote: > > It's worth giving this as much testing as is at all possible, but at > the same time I really don't think I can delay 3.7 any more without > messing up the holiday season too much. So unless something obvious > pops up, I will do the release tonight. So testing will be minimal - > but it's not like we haven't gone back-and-forth on this several times > already, and we revert to *mostly* the same old state as 3.6 anyway, > so it should be fairly safe. > It compiles and boots without a hitch, so it must be perfect. :) Seriously, a few more hours need to pass, until I can provide more convincing data. That's how long it takes on this particular machine for memory pressure to build up and memory fragmentation to ensue. Only then I'll be able to tell how it really behaves. I promise to get back as soon as I can. And funny thing that you mention i915, because yesterday my daughter managed to lock up our laptop hard (that was a first), and this is what I found in kern.log after restart: Dec 9 21:29:42 titan vmunix: general protection fault: [#1] PREEMPT SMP Dec 9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) [last unloaded: microcode] Dec 9 21:29:42 titan vmunix: CPU 2 Dec 9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G O 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B Dec 9 21:29:42 titan vmunix: RIP: 0010:[] [] find_get_page+0x3c/0x90 Dec 9 21:29:42 titan vmunix: RSP: 0018:88014d9f7928 EFLAGS: 00010246 Dec 9 21:29:42 titan vmunix: RAX: 880052594bc8 RBX: 0200 RCX: fffa Dec 9 21:29:42 titan vmunix: RDX: 0001 RSI: 880052594bc8 RDI: Dec 9 21:29:42 titan vmunix: RBP: 88014d9f7948 R08: 0200 R09: 880052594b18 Dec 9 21:29:42 titan vmunix: R10: 57ffe4cbb74d1280 R11: R12: 88011c959a90 Dec 9 21:29:42 titan vmunix: R13: 0053 R14: R15: 0053 Dec 9 21:29:42 titan vmunix: FS: 7fcd8d413880() GS:880157c8() knlGS: Dec 9 21:29:42 titan vmunix: CS: 0010 DS: ES: CR0: 80050033 
Dec 9 21:29:42 titan vmunix: CR2: ff600400 CR3: 00014d937000 CR4: 07e0 Dec 9 21:29:42 titan vmunix: DR0: DR1: DR2: Dec 9 21:29:42 titan vmunix: DR3: DR6: 0ff0 DR7: 0400 Dec 9 21:29:42 titan vmunix: Process Xorg (pid: 2523, threadinfo 88014d9f6000, task 88014d9c1260) Dec 9 21:29:42 titan vmunix: Stack: Dec 9 21:29:42 titan vmunix: 88014d9f7958 88011c959a88 0053 88011c959a88 Dec 9 21:29:42 titan vmunix: 88014d9f7978 81090e21 0001 ea00014d1280 Dec 9 21:29:42 titan vmunix: 88011c959960 0001 88014d9f7a28 810a1b60 Dec 9 21:29:42 titan vmunix: Call Trace: Dec 9 21:29:42 titan vmunix: [] find_lock_page+0x21/0x80 Dec 9 21:29:42 titan vmunix: [] shmem_getpage_gfp+0xa0/0x620 Dec 9 21:29:42 titan vmunix: [] shmem_read_mapping_page_gfp+0x2c/0x50 Dec 9 21:29:42 titan vmunix: [] i915_gem_object_get_pages_gtt+0xe1/0x270 Dec 9 21:29:42 titan vmunix: [] i915_gem_object_get_pages+0x4f/0x90 Dec 9 21:29:42 titan vmunix: [] i915_gem_object_bind_to_gtt+0xc3/0x4c0 Dec 9 21:29:42 titan vmunix: [] i915_gem_object_pin+0x123/0x190 Dec 9 21:29:42 titan vmunix: [] i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190 Dec 9 21:29:42 titan vmunix: [] i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320 Dec 9 21:29:42 titan vmunix: [] i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0 Dec 9 21:29:42 titan vmunix: [] i915_gem_execbuffer2+0x94/0x280 Dec 9 21:29:42 titan vmunix: [] drm_ioctl+0x493/0x530 Dec 9 21:29:42 titan vmunix: [] ? i915_gem_execbuffer+0x480/0x480 Dec 9 21:29:42 titan vmunix: [] do_vfs_ioctl+0x8f/0x530 Dec 9 21:29:42 titan vmunix: [] sys_ioctl+0x4b/0x90 Dec 9 21:29:42 titan vmunix: [] ? 
sys_read+0x4d/0xa0
Dec 9 21:29:42 titan vmunix: [] system_call_fastpath+0x16/0x1b
Dec 9 21:29:42 titan vmunix: Code: 63 08 48 83 ec 08 e8 84 9c fb ff 4c 89 ee 4c 89 e7 e8 89 b7 15 00 48 85 c0 48 89 c6 74 41 48 8b 18 48 85 db 74 1f f6 c3 03 75 3c <8b> 53 1c 85 d2 74 d9 8d 7a 01 89 d0 f0 0f b1 7b 1c 39 c2 75 23
Dec 9 21:29:42 titan vmunix: RIP [] find_get_page+0x3c/0x90
Dec 9 21:29:42 titan vmunix: RSP

It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, the i915 driver will need to be taken better care of.

--
Zlatko
Re: kswapd craziness in 3.7
On Mon, Dec 10, 2012 at 10:33 AM, Zlatko Calusic wrote:
>
> I was about to apply the patch that you sent, and reboot the server, but it
> seems there's no point because the patch is flawed?
>
> Anyway, if and when you have a proper one, I'll be glad to test it for you
> and report results.

I have reverted (again) the __GFP_NO_KSWAPD removal, and considering that it really looks like there are overwhelming reasons to have that flag, I will *not* take some new patch to revert it. I'm getting convinced that the original removal really was bogus, and had no actual valid reason for it.

Part of that is that I noticed that non-THP allocations wanted to use it too. The i915 driver had wanted to use __GFP_NO_KSWAPD because it too didn't want to start some cleaning thread.

The whole mindset that kswapd is somehow better than direct reclaim, or is needed whenever an allocation fails, is broken. Some allocations simply *will* fail, without necessarily wanting kswapd to be started. THP - where the high order of the allocation means that failure is inevitable under some fragmentation circumstances - is just one such case.

I also reverted one of the "fix up the mess from removing __GFP_NO_KSWAPD" patches, because that one was an obvious workaround that tried to re-introduce the "let's not wake up kswapd after all for that case". It clashed with a clean revert, and it was pointless in the presence of __GFP_NO_KSWAPD anyway.

I did *not* revert some of the other fixup patches that tried to help kswapd balancing decisions and avoid excessive CPU use in other ways. So some remains of this whole saga do still remain, but they look fairly minimal.

It's worth giving this as much testing as is at all possible, but at the same time I really don't think I can delay 3.7 any more without messing up the holiday season too much. So unless something obvious pops up, I will do the release tonight. So testing will be minimal - but it's not like we haven't gone back-and-forth on this several times already, and we revert to *mostly* the same old state as 3.6 anyway, so it should be fairly safe.

Linus
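To make the role of the flag concrete, here is a toy model in C of the decision described above: an allocation that cannot be satisfied immediately normally wakes the background reclaimer, unless the caller opted out. This is a sketch under stated assumptions, not kernel code. The `DEMO_*` bit values and both function names are invented for illustration and do not match the kernel's real gfp encoding.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; NOT the kernel's real gfp flag encoding. */
#define DEMO_GFP_WAIT      0x1u /* caller may block and reclaim directly */
#define DEMO_GFP_NO_KSWAPD 0x2u /* caller does not want kswapd woken up */

/* Should a failed fast-path allocation wake the background reclaimer? */
bool demo_should_wake_kswapd(unsigned int gfp_mask)
{
    return (gfp_mask & DEMO_GFP_NO_KSWAPD) == 0;
}

/* May the caller itself try reclaim/compaction once and then give up? */
bool demo_may_direct_reclaim(unsigned int gfp_mask)
{
    return (gfp_mask & DEMO_GFP_WAIT) != 0;
}
```

In this model a THP-style caller would pass `DEMO_GFP_WAIT | DEMO_GFP_NO_KSWAPD`: it may still try direct reclaim once, but a failure does not start a long-running background effort on its behalf.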
Re: kswapd craziness in 3.7
On 10.12.2012 19:01, Mel Gorman wrote:
> In this last-minute disaster, I'm not thinking properly at all any more. The
> shrink slab disabling should have happened before the loop_again but even
> then it's wrong because it's just covering over the problem.
>
> The way order and testorder interact with how balanced is calculated means
> that we potentially call shrink_slab() multiple times and that thing is
> global in nature and basically uncontrolled. You could argue that we should
> only call shrink_slab() if order-0 watermarks are not met but that will
> not necessarily prevent kswapd reclaiming too much. It keeps going back
> to balance_pgdat needing its list of requirements drawn up and receiving
> some major surgery and we're not going to do that as a quick hack.

I was about to apply the patch that you sent, and reboot the server, but it seems there's no point because the patch is flawed?

Anyway, if and when you have a proper one, I'll be glad to test it for you and report results.

--
Zlatko
Re: kswapd craziness in 3.7
On 10.12.2012 12:03, Mel Gorman wrote:
> There is a big difference between a direct reclaim/compaction for THP
> and kswapd doing the same work. Direct reclaim/compaction will try once,
> give up quickly and defer requests in the near future to avoid impacting
> the system heavily for THP. The same applies for khugepaged.
>
> kswapd is different. It can keep going until its watermarks for
> a THP allocation are met. Two reasons why it might keep going for a long
> time are that compaction is being inefficient which we know it may be due
> to crap like this
>
> end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
>
> and the second reason is if the highest zone is relatively small because
> compaction_suitable will keep saying that allocations are failing due to
> insufficient amounts of memory in the highest zone. It'll reclaim a little
> from this highest zone and then shrink_slab() potentially dumping a large
> amount of memory. This may be the case for Zlatko as with a 4G machine
> his ZONE_NORMAL could be small depending on how the 32-bit address space
> is used by his hardware.

The kernel is 64-bit, if it makes any difference (userspace, though, is still 32-bit). There's no swap (swap support is not even compiled in). The zones are as follows:

On node 0 totalpages: 1048019
  DMA zone: 64 pages used for memmap
  DMA zone: 6 pages reserved
  DMA zone: 3913 pages, LIFO batch:0
  DMA32 zone: 16320 pages used for memmap
  DMA32 zone: 831109 pages, LIFO batch:31
  Normal zone: 3072 pages used for memmap
  Normal zone: 193535 pages, LIFO batch:31

If I understand correctly, you think that because 193535 pages in ZONE_NORMAL is relatively small compared to 831109 pages of ZONE_DMA32, the system has a hard time balancing itself?

Is there any way I could force and test a different memory layout? I'm slightly lost in all the memory models (if I have a choice at all), so if you have any suggestions, I'm all ears.

Maybe I could limit available memory and thus have only the DMA32 zone, just to prove your theory? I remember doing tuning like that many years ago when I had more time to play with Linux MM. Unfortunately I haven't had much time lately, so I'm a bit rusty, but I'm willing to help with testing and resolving this issue.

--
Zlatko
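The zone sizes above can be put next to the "25% of the node" requirement for kswapd that Johannes mentions elsewhere in this thread. The following C sketch only does the arithmetic. The function names are invented for illustration, the node size is taken as the sum of the three zone page counts from the listing, and the 4 KiB page size is an assumption (the usual value on x86-64).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Zone page counts as posted in the message above. */
#define DMA_PAGES    3913u
#define DMA32_PAGES  831109u
#define NORMAL_PAGES 193535u

/* Convert a page count to MiB, assuming 4 KiB pages (an assumption). */
uint64_t pages_to_mib(uint64_t pages)
{
    return pages * 4096u / (1024u * 1024u);
}

/* True if balanced_pages covers at least a quarter of node_pages. */
bool covers_quarter_of_node(uint64_t balanced_pages, uint64_t node_pages)
{
    assert(node_pages > 0);
    return balanced_pages >= node_pages / 4;
}
```

With these numbers, `covers_quarter_of_node(NORMAL_PAGES, DMA_PAGES + DMA32_PAGES + NORMAL_PAGES)` is false, while adding `DMA32_PAGES` to the first argument makes it true. That matches the point that a small ZONE_NORMAL cannot satisfy the requirement on its own, so the larger DMA32 zone has to be balanced as well.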
Re: kswapd craziness in 3.7
On Mon, Dec 10, 2012 at 11:39:04AM -0500, Johannes Weiner wrote: > On Mon, Dec 10, 2012 at 11:03:37AM +, Mel Gorman wrote: > > On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote: > > > On Sat, 8 Dec 2012, Zlatko Calusic wrote: > > > > Or sooner... in short: nothing's changed! > > > > > > > > On a 4GB RAM system, where applications use close to 2GB, kswapd likes > > > > to keep > > > > around 1GB free (unused), leaving only 1GB for page/buffer cache. If I > > > > force > > > > bigger page cache by reading a big file and thus use the unused 1GB of > > > > RAM, > > > > kswapd will soon (in a matter of minutes) evict those (or other) pages > > > > out and > > > > once again keep unused memory close to 1GB. > > > > > > Ok, guys, what was the reclaim or kswapd patch during the merge window > > > that actually caused all of these insane problems? > > > > I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary > > candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd > > was excessively reclaiming. kswapd would stay awake aggressively reclaiming > > even if compaction was deferred. The flag was removed in this cycle when it > > was expected that it was no longer necessary. I'm not foisting the blame > > on Rik here, I was on the review list for that patch and did not identify > > that it would cause this many problems either. > > > > > It seems it was more > > > fundamentally buggered than the fifteen-million fixes for kswapd we have > > > already picked up. > > > > It was already fundamentally buggered up. The difference was it stayed > > asleep for THP requests in earlier kernels. > > > > There is a big difference between a direct reclaim/compaction for THP > > and kswapd doing the same work. Direct reclaim/compaction will try once, > > give up quickly and defer requests in the near future to avoid impacting > > the system heavily for THP. The same applies for khugepaged. > > > > kswapd is different. 
It can keep going until it meets its watermarks for > > a THP allocation are met. Two reasons why it might keep going for a long > > time are that compaction is being inefficient which we know it may be due > > to crap like this > > > > end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages); > > > > and the second reason is if the highest zone is relatively because > > compaction_suitable will keep saying that allocations are failing due to > > insufficient amounts of memory in the highest zone. It'll reclaim a little > > from this highest zone and then shrink_slab() potentially dumping a large > > amount of memory. This may be the case for Zlatko as with a 4G machine > > his ZONE_NORMAL could be small depending on how the 32-bit address space > > is used by his hardware. > > Unlike direct reclaim, kswapd also never does sync migration. Since > the fragmentation index is a ratio of free pages over free page > blocks, doing lightweight compaction that reduces the page blocks but > never really follows through to compact a THP block increases the free > memory requirement. > True. > I thought about the small Normal zone too. Direct reclaim/compaction > is fine with one zone being able to provide a THP, but kswapd requires > 25% of the node. A small ZONE_NORMAL would not be able to meet this > and so the bigger DMA32 zone would also be required to be balanced for > the THP allocation. > Also true. > > > Mel? Ideas? > > > > Consider reverting the revert of __GFP_NO_KSWAPD again until this can be > > ironed out at a more reasonable pace. Rik? Johannes? > > Yes, I also think we need more time for this. > Yes, the last minute band-aids are just getting worse and the result is more mess. > > > I don't see a shrink_slab() invocation after this point since the > loop_again jumps in this loop where removed, so this shouldn't change > anything? /me slaps self In this last-minute disaster, I'm not thinking properly at all any more. 
The shrink slab disabling should have happened before the loop_again but even then it's wrong because it's just covering over the problem.

The way order and testorder interact with how balanced is calculated means that we potentially call shrink_slab() multiple times and that thing is global in nature and basically uncontrolled. You could argue that we should only call shrink_slab() if order-0 watermarks are not met but that will not necessarily prevent kswapd reclaiming too much. It keeps going back to balance_pgdat needing its list of requirements drawn up and receiving some major surgery and we're not going to do that as a quick hack.

--
Mel Gorman
SUSE Labs
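For readers unfamiliar with the `ALIGN()` expression quoted in this message, the sketch below shows what it evaluates to, using plain integers in place of page frame numbers. It is a user-space illustration only: `align_up` and `quoted_end_pfn` are names invented here, and the block size of 512 pages used in the note below is an assumption (2 MiB pageblocks with 4 KiB pages). Mel does not spell out what he objects to, so reading the overshoot as the inefficiency he means is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a; a must be a power of two. */
uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* The expression quoted in the thread, with integers standing in for pfns. */
uint64_t quoted_end_pfn(uint64_t low_pfn, uint64_t pageblock_nr_pages)
{
    return align_up(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
}
```

With a block size of 512, an aligned `low_pfn` of 512 gives an end of 1024, exactly one block further. An unaligned `low_pfn` of 100 also gives 1024, which is past the end of the block that contains pfn 100 (that block ends at 512), so the scanned range spans more than one pageblock.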
Re: kswapd craziness in 3.7
On Mon, Dec 10, 2012 at 11:03:37AM, Mel Gorman wrote:
> On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> > On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > > Or sooner... in short: nothing's changed!
> > >
> > > On a 4GB RAM system, where applications use close to 2GB, kswapd likes
> > > to keep around 1GB free (unused), leaving only 1GB for page/buffer
> > > cache. If I force bigger page cache by reading a big file and thus use
> > > the unused 1GB of RAM, kswapd will soon (in a matter of minutes) evict
> > > those (or other) pages out and once again keep unused memory close to
> > > 1GB.
> >
> > Ok, guys, what was the reclaim or kswapd patch during the merge window
> > that actually caused all of these insane problems?
>
> I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
> candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
> was excessively reclaiming. kswapd would stay awake aggressively reclaiming
> even if compaction was deferred. The flag was removed in this cycle when it
> was expected that it was no longer necessary. I'm not foisting the blame
> on Rik here, I was on the review list for that patch and did not identify
> that it would cause this many problems either.
>
> > It seems it was more
> > fundamentally buggered than the fifteen-million fixes for kswapd we have
> > already picked up.
>
> It was already fundamentally buggered up. The difference was it stayed
> asleep for THP requests in earlier kernels.
>
> There is a big difference between a direct reclaim/compaction for THP
> and kswapd doing the same work. Direct reclaim/compaction will try once,
> give up quickly and defer requests in the near future to avoid impacting
> the system heavily for THP. The same applies for khugepaged.
>
> kswapd is different. It can keep going until its watermarks for a THP
> allocation are met. Two reasons why it might keep going for a long
> time are that compaction is being inefficient which we know it may be due
> to crap like this
>
>	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
>
> and the second reason is if the highest zone is relatively small because
> compaction_suitable will keep saying that allocations are failing due to
> insufficient amounts of memory in the highest zone. It'll reclaim a little
> from this highest zone and then shrink_slab() potentially dumping a large
> amount of memory. This may be the case for Zlatko as with a 4G machine
> his ZONE_NORMAL could be small depending on how the 32-bit address space
> is used by his hardware.

Unlike direct reclaim, kswapd also never does sync migration. Since the fragmentation index is a ratio of free pages over free page blocks, doing lightweight compaction that reduces the page blocks but never really follows through to compact a THP block increases the free memory requirement.

I thought about the small Normal zone too. Direct reclaim/compaction is fine with one zone being able to provide a THP, but kswapd requires 25% of the node. A small ZONE_NORMAL would not be able to meet this and so the bigger DMA32 zone would also be required to be balanced for the THP allocation.

> > Mel? Ideas?
>
> Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
> ironed out at a more reasonable pace. Rik? Johannes?

Yes, I also think we need more time for this.

> Verify if the shrinking slab is the issue with this brutally ugly
> hack. Zlatko?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b7ed376..2189d20 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  	unsigned long balanced;
>  	int i;
>  	int end_zone = 0;	/* Inclusive. 0 = ZONE_DMA */
> +	bool should_shrink_slab = true;
>  	unsigned long total_scanned;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_soft_reclaimed;
> @@ -2695,7 +2696,8 @@ loop_again:
>  				shrink_zone(zone, &sc);
>
>  				reclaim_state->reclaimed_slab = 0;
> -				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +				if (should_shrink_slab)
> +					nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
>
> @@ -2817,6 +2819,16 @@ out:
>  	if (order) {
>  		int zones_need_compaction = 1;
>
> +		/*
> +		 * Shrinking slab for high-order allocs can cause an excessive
> +		 * amount of memory to be dumped. Only shrink slab once per
> +		 * round for high-order allocs.
> +		 *
> +		 * This is a very stupid hack. balance_pgdat() is in serious
> +
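A note on the fragmentation index remark in the reply above: it can be made concrete with a small user-space model. The sketch below follows, from memory, the general shape of the kernel's `__fragmentation_index()` in mm/vmstat.c and the index test in `compaction_suitable()`; the exact formula, the helper names `frag_index()` and `compaction_skipped()`, and every number fed into them are assumptions for illustration, not values taken from this thread. The threshold of 500 is the usual default of the vm.extfrag_threshold sysctl.

```c
#include <assert.h>

/*
 * User-space model (an assumption, not kernel code) of the fragmentation
 * index, scaled to 0..1000:
 *   near 0    -> a failed allocation is blamed on lack of free memory
 *   near 1000 -> a failed allocation is blamed on fragmentation
 * Returns -1000 when a large enough free block already exists.
 */
static int frag_index(unsigned int order, unsigned long long free_pages,
		      unsigned long long free_blocks_total,
		      unsigned long long free_blocks_suitable)
{
	unsigned long long requested;

	assert(order < 32);
	requested = 1ULL << order;

	if (free_blocks_total == 0)
		return 0;
	if (free_blocks_suitable)
		return -1000;

	return 1000 - (int)((1000 + free_pages * 1000ULL / requested) /
			    free_blocks_total);
}

/* Hypothetical helper: is compaction skipped as "not enough memory"? */
static int compaction_skipped(int index, int threshold)
{
	return index >= 0 && index <= threshold;
}
```

With 50000 free pages and an order-9 (THP-sized) request, 25000 free blocks give an index of 997, 250 blocks give 606, and 196 blocks give 497. The amount of free memory never changes, yet once the free blocks have been merged far enough without producing a single order-9 block, the index drops below the threshold and the failure is attributed to a shortage of memory instead of fragmentation. That is one way to read the claim that half-finished compaction "increases the free memory requirement".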
Re: kswapd craziness in 3.7
On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> >
> > Or sooner... in short: nothing's changed!
> >
> > On a 4GB RAM system, where applications use close to 2GB, kswapd likes
> > to keep around 1GB free (unused), leaving only 1GB for page/buffer
> > cache. If I force bigger page cache by reading a big file and thus use
> > the unused 1GB of RAM, kswapd will soon (in a matter of minutes) evict
> > those (or other) pages out and once again keep unused memory close to
> > 1GB.
>
> Ok, guys, what was the reclaim or kswapd patch during the merge window
> that actually caused all of these insane problems?

I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd was excessively reclaiming. kswapd would stay awake aggressively reclaiming even if compaction was deferred. The flag was removed in this cycle when it was expected that it was no longer necessary. I'm not foisting the blame on Rik here, I was on the review list for that patch and did not identify that it would cause this many problems either.

> It seems it was more
> fundamentally buggered than the fifteen-million fixes for kswapd we have
> already picked up.

It was already fundamentally buggered up. The difference was it stayed asleep for THP requests in earlier kernels.

There is a big difference between a direct reclaim/compaction for THP and kswapd doing the same work. Direct reclaim/compaction will try once, give up quickly and defer requests in the near future to avoid impacting the system heavily for THP. The same applies for khugepaged.

kswapd is different. It can keep going until its watermarks for a THP allocation are met. Two reasons why it might keep going for a long time are that compaction is being inefficient which we know it may be due to crap like this

	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);

and the second reason is if the highest zone is relatively small because compaction_suitable will keep saying that allocations are failing due to insufficient amounts of memory in the highest zone. It'll reclaim a little from this highest zone and then shrink_slab() potentially dumping a large amount of memory. This may be the case for Zlatko as with a 4G machine his ZONE_NORMAL could be small depending on how the 32-bit address space is used by his hardware.

> (Ok, I may be exaggerating the number of patches, but it's starting to
> feel that way - I thought that 3.7 was going to be a calm and easy
> release, but the kswapd issues seem to just keep happening. We've been
> fighting the kswapd changes for a while now.)

Yes.

> Trying to keep a gigabyte free (presumably because that way we have lots
> of high-order allocation pages) is ridiculous. Is it one of the compaction
> changes?

Not directly. Compaction has been a bigger factor after 3.5 due to the removal of lumpy reclaim but it's not directly responsible for excessive amounts of memory being kept free. The closest patch I'm aware of that would cause problems of that nature would be commit 83fde0f2 (mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures) and it has already been reverted by 96710098.

> Mel? Ideas?

Consider reverting the revert of __GFP_NO_KSWAPD again until this can be ironed out at a more reasonable pace. Rik? Johannes?

Verify if the shrinking slab is the issue with this brutally ugly hack. Zlatko?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..2189d20 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	unsigned long balanced;
 	int i;
 	int end_zone = 0;	/* Inclusive. 0 = ZONE_DMA */
+	bool should_shrink_slab = true;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2695,7 +2696,8 @@ loop_again:
 				shrink_zone(zone, &sc);

 				reclaim_state->reclaimed_slab = 0;
-				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				if (should_shrink_slab)
+					nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;

@@ -2817,6 +2819,16 @@ out:
 	if (order) {
 		int zones_need_compaction = 1;

+		/*
+		 * Shrinking slab for high-order allocs can cause an excessive
+		 * amount of memory to be dumped. Only shrink slab once per
+		 * round for high-order allocs.
+		 *
+		 * This is a very stupid hack. balance_pgdat() is in ser
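A note on the `end_pfn = ALIGN(...)` line quoted in the message above: Mel does not spell out what is wrong with it, so the following is only one plausible reading, worked through as arithmetic. The kernel's `ALIGN(x, a)` rounds `x` up to a multiple of a power-of-two `a`. The names `align_up()`, `scan_end_quoted()` and `scan_end_next_boundary()` are hypothetical user-space stand-ins, and the second variant is an illustrative alternative, not a fix taken from this thread.

```c
#include <assert.h>

/* Round x up to a multiple of a; a must be a power of two. */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* End of the scan window as computed by the quoted expression. */
static unsigned long scan_end_quoted(unsigned long low_pfn,
				     unsigned long pageblock_nr_pages)
{
	return align_up(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
}

/* Hypothetical alternative: stop at the next pageblock boundary. */
static unsigned long scan_end_next_boundary(unsigned long low_pfn,
					    unsigned long pageblock_nr_pages)
{
	return align_up(low_pfn + 1, pageblock_nr_pages);
}
```

With 512-page pageblocks, a start of 1024 (already aligned) gives an end of 1536 under both variants, exactly one pageblock. A start of 1025 gives 2048 under the quoted expression, a window of 1023 pages that straddles the boundary at 1536, whereas the alternative stops at 1536. So whenever the scanner resumes from an unaligned pfn, the quoted expression can cover almost two pageblocks in one pass.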
Re: kswapd craziness in 3.7
On 9.12.2012 02:01, Linus Torvalds wrote:
> On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > Or sooner... in short: nothing's changed!
> >
> > On a 4GB RAM system, where applications use close to 2GB, kswapd likes
> > to keep around 1GB free (unused), leaving only 1GB for page/buffer
> > cache. If I force bigger page cache by reading a big file and thus use
> > the unused 1GB of RAM, kswapd will soon (in a matter of minutes) evict
> > those (or other) pages out and once again keep unused memory close to
> > 1GB.
>
> Ok, guys, what was the reclaim or kswapd patch during the merge window
> that actually caused all of these insane problems? It seems it was more
> fundamentally buggered than the fifteen-million fixes for kswapd we have
> already picked up.
>
> (Ok, I may be exaggerating the number of patches, but it's starting to
> feel that way - I thought that 3.7 was going to be a calm and easy
> release, but the kswapd issues seem to just keep happening. We've been
> fighting the kswapd changes for a while now.)
>
> Trying to keep a gigabyte free (presumably because that way we have lots
> of high-order allocation pages) is ridiculous. Is it one of the compaction
> changes?
>
> Mel? Ideas?

Very true. It's just as simple as making

	dd if=/dev/zero of=/tmp/zero bs=1M count=0 seek=100

and now

	dd if=/tmp/zero of=/dev/null bs=1M

and kswapd fights with dd for CPU time.

Zdenek
Re: kswapd craziness in 3.7
On Sat, 8 Dec 2012, Zlatko Calusic wrote:
>
> Or sooner... in short: nothing's changed!
>
> On a 4GB RAM system, where applications use close to 2GB, kswapd likes to
> keep around 1GB free (unused), leaving only 1GB for page/buffer cache. If I
> force bigger page cache by reading a big file and thus use the unused 1GB
> of RAM, kswapd will soon (in a matter of minutes) evict those (or other)
> pages out and once again keep unused memory close to 1GB.

Ok, guys, what was the reclaim or kswapd patch during the merge window that actually caused all of these insane problems? It seems it was more fundamentally buggered than the fifteen-million fixes for kswapd we have already picked up.

(Ok, I may be exaggerating the number of patches, but it's starting to feel that way - I thought that 3.7 was going to be a calm and easy release, but the kswapd issues seem to just keep happening. We've been fighting the kswapd changes for a while now.)

Trying to keep a gigabyte free (presumably because that way we have lots of high-order allocation pages) is ridiculous. Is it one of the compaction changes?

Mel? Ideas?

Linus
Re: kswapd craziness in 3.7
On 08.12.2012 13:06, Zlatko Calusic wrote:
> On 06.12.2012 20:31, Linus Torvalds wrote:
> > Ok, people seem to be reporting success.
> >
> > I've applied Johannes' last patch with the new tested-by tags.
>
> I've been testing this patch since it was applied, and it certainly fixes
> the kswapd craziness issue, good work Johannes!
>
> But, it's still not perfect yet, because I see that the system keeps lots
> of memory unused (free), where it previously used it all for the page
> cache (there's enough fs activity to warrant it).
>
> I'm now testing the last piece of Johannes' changes (still not in git
> tree), and can report results in 24-48 hours.
>
> Regards,

Or sooner... in short: nothing's changed!

On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force a bigger page cache by reading a big file and thus use the unused 1GB of RAM, kswapd will soon (in a matter of minutes) evict those (or other) pages and once again keep unused memory close to 1GB.

I guess it's not a showstopper, but it still counts as very bad memory management, wasting lots of RAM.

As an additional data point, if memory pressure is slightly higher (say backup kicks in, keeping page cache mostly full) kswapd gets into D (uninterruptible sleep) state (function: congestion_wait) and load average goes up by 1. It recovers only when it successfully throws out half of the page cache again.

Hope it helps.
--
Zlatko
Re: kswapd craziness in 3.7
On 06.12.2012 20:31, Linus Torvalds wrote:
> Ok, people seem to be reporting success.
>
> I've applied Johannes' last patch with the new tested-by tags.

I've been testing this patch since it was applied, and it certainly fixes the kswapd craziness issue, good work Johannes!

But, it's still not perfect yet, because I see that the system keeps lots of memory unused (free), where it previously used it all for the page cache (there's enough fs activity to warrant it).

I'm now testing the last piece of Johannes' changes (still not in git tree), and can report results in 24-48 hours.

Regards,
--
Zlatko
Re: kswapd craziness in 3.7
On 12/04/2012 05:11 PM, Johannes Weiner wrote:
> On Tue, Dec 04, 2012 at 10:15:09AM +0100, Jiri Slaby wrote:
>> It does not apply to -next :/. Should I try anything else?
>
> The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> is a -next patch. I hope you don't run into other problems that come
> out of -next craziness, because Linus is kinda waiting for this to be
> resolved to release 3.7. If you've always tested against -next so far
> and it worked otherwise, don't change the environment now, please. If
> you just started, it would make more sense to test based on 3.7-rc8.
>
> Thanks!
>
> ---
> From: Johannes Weiner
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>  to individual uncompactable zones
>
> When a zone meets its high watermark and is compactable in case of
> higher order allocations, it contributes to the percentage of the
> node's memory that is considered balanced.
>
> This requirement, that a node be only partially balanced, came about
> when kswapd was desperately trying to balance tiny zones when all
> bigger zones in the node had plenty of free memory. Arguably, the
> same should apply to compaction: if a significant part of the node is
> balanced enough to run compaction, do not get hung up on that tiny
> zone that might never get in shape.
>
> When the compaction logic in kswapd is reached, we know that at least
> 25% of the node's memory is balanced properly for compaction (see
> zone_balanced and pgdat_balanced). Remove the individual zone checks
> that restart the kswapd cycle.
>
> Otherwise, we may observe more endless looping in kswapd where the
> compaction code loops back to reclaim because of a single zone and
> reclaim does nothing because the node is considered balanced overall.
>
> Reported-by: Thorsten Leemhuis
> Signed-off-by: Johannes Weiner

Looks like it's gone with this patch now. Hopefully the send button won't trigger the issue the same as the last time :).

> ---
>  mm/vmscan.c | 16 ----------------
>  1 file changed, 16 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3b0aef4..486100f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			if (!populated_zone(zone))
>  				continue;
>
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> -				continue;
> -
> -			/* Would compaction fail due to lack of free memory? */
> -			if (IS_ENABLED(CONFIG_COMPACTION) &&
> -			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
> -				goto loop_again;
> -
> -			/* Confirm the zone is balanced for order-0 */
> -			if (!zone_watermark_ok(zone, 0,
> -					high_wmark_pages(zone), 0, 0)) {
> -				order = sc.order = 0;
> -				goto loop_again;
> -			}
> -
>  			/* Check if the memory needs to be defragmented. */
>  			if (zone_watermark_ok(zone, order,
>  				    low_wmark_pages(zone), *classzone_idx, 0))
>

--
js
suse labs
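A note on the "at least 25% of the node's memory is balanced" criterion in the patch description above: the idea can be sketched in user space. The code below is a simplified model, not the kernel's `pgdat_balanced()`: the `zone_model` struct, the `node_balanced()` helper and the zone sizes used afterwards are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory zone. */
struct zone_model {
	const char *name;
	unsigned long present_pages;
	bool balanced;	/* meets its watermark and is compactable */
};

/*
 * Node-level check: the node counts as balanced when the balanced zones
 * together hold at least a quarter of the pages present in the node.
 */
static bool node_balanced(const struct zone_model *zones, size_t nr)
{
	unsigned long present = 0, balanced = 0;
	size_t i;

	assert(zones != NULL);
	for (i = 0; i < nr; i++) {
		present += zones[i].present_pages;
		if (zones[i].balanced)
			balanced += zones[i].present_pages;
	}
	return balanced >= (present >> 2);
}
```

Take a made-up 4GB layout with 4096 pages in DMA, 917504 in DMA32 and 131072 in Normal. If only the small Normal zone is balanced, the node fails the check (131072 is well below a quarter of 1052672), so kswapd keeps working on DMA32 as well. If only DMA32 is balanced, the node passes, and after the patch above a still-unbalanced Normal zone no longer sends kswapd back to loop_again.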
Re: kswapd craziness in 3.7
On 12/06/2012 03:23 PM, Johannes Weiner wrote:
> From: Johannes Weiner
> Subject: [patch] mm: vmscan: fix inappropriate zone congestion clearing
>
> c702418 ("mm: vmscan: do not keep kswapd looping forever due to
> individual uncompactable zones") removed zone watermark checks from
> the compaction code in kswapd but left in the zone congestion
> clearing, which now happens unconditionally on higher order reclaim.
>
> This messes up the reclaim throttling logic for zones with
> dirty/writeback pages, where zones should only lose their congestion
> status when their watermarks have been restored.
>
> Remove the clearing from the zone compaction section entirely. The
> preliminary zone check and the reclaim loop in kswapd will clear it if
> the zone is considered balanced.
>
> Signed-off-by: Johannes Weiner

Reviewed-by: Rik van Riel

--
All rights reversed
Re: kswapd craziness in 3.7
On Thu, Dec 06, 2012 at 11:31:21AM -0800, Linus Torvalds wrote:
> Ok, people seem to be reporting success.
>
> I've applied Johannes' last patch with the new tested-by tags.
>
> Johannes (or anybody else, for that matter), please holler LOUDLY if
> you disagreed.. (or if I used the wrong version of the patch, there's
> been several, afaik).

I just went back one more time and of course that's when I spot that I forgot to remove the zone congestion clearing that depended on the now removed checks to ensure the zone is balanced. It's not too big of a deal, just the /risk/ of increased CPU use from reclaim because we go back to scanning zones that we previously deemed congested and slept a little bit before continuing reclaim.

Sorry, I should have seen that earlier. Removing it is a low risk fix, the clearing was kinda redundant anyway (the preliminary zone check clears it for OK zones, so does the reclaim loop under the same criteria), letting it stay is probably more problematic for 3.7 than just dropping it...

---
From: Johannes Weiner
Subject: [patch] mm: vmscan: fix inappropriate zone congestion clearing

c702418 ("mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones") removed zone watermark checks from the compaction code in kswapd but left in the zone congestion clearing, which now happens unconditionally on higher order reclaim.

This messes up the reclaim throttling logic for zones with dirty/writeback pages, where zones should only lose their congestion status when their watermarks have been restored.

Remove the clearing from the zone compaction section entirely. The preliminary zone check and the reclaim loop in kswapd will clear it if the zone is considered balanced.

Signed-off-by: Johannes Weiner
---
 mm/vmscan.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 124bbfe..b7ed376 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2827,9 +2827,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
 				zones_need_compaction = 0;
-
-			/* If balanced, clear the congested flag */
-			zone_clear_flag(zone, ZONE_CONGESTED);
 		}

 		if (zones_need_compaction)
--
1.7.11.7
Re: kswapd craziness in 3.7
On 12/06/2012 02:31 PM, Linus Torvalds wrote:
> Ok, people seem to be reporting success.
>
> I've applied Johannes' last patch with the new tested-by tags.
>
> Johannes (or anybody else, for that matter), please holler LOUDLY if
> you disagreed.. (or if I used the wrong version of the patch, there's
> been several, afaik).

Johannes's patch is a fairly big hammer, with kswapd not looping back to the start when zones are still unbalanced.

However, the next allocation will wake up kswapd again, and having kswapd stop early beats having it in an infinite loop.

I believe Johannes's patch will be fine for 3.7.

--
All rights reversed
Re: kswapd craziness in 3.7
Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.

Johannes (or anybody else, for that matter), please holler LOUDLY if you disagreed.. (or if I used the wrong version of the patch, there's been several, afaik).

Linus
Re: kswapd craziness in 3.7
On Tue, Dec 04, 2012 at 21:01:33 -0600, Bruno Wolff III wrote:
> On Tue, Dec 04, 2012 at 16:42:10 -0500, Johannes Weiner wrote:
> > kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and
> > kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64 for over
> > 24hours with no evidence of problems with kswapd"
> >
> > Now waiting for results from Jiri, Zdenek and Bruno...
>
> I have been running 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE
> a bit over 23 hours and kswapd has accumulated one minute 8 seconds of CPU
> time. I did several yum operations during that time and didn't see kswapd
> spike to 90+% CPU usage as I had seen in the past. With some kernels I
> wasn't reliably triggering the kswapd issue, so it may not be long enough
> to know for sure that the problem is fixed.

I am now at a bit over 2 and 1/2 days with kswapd having used 1 minute 53 seconds of CPU time.
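A note for anyone wanting to collect the same kind of figure as the kswapd CPU times reported above: the accumulated time can be read from /proc/<pid>/stat, where fields 14 and 15 are utime and stime in clock ticks. The helper below is a minimal sketch under that assumption; the function name `stat_cpu_ticks()` and the sample line used afterwards are made up, and converting ticks to seconds needs the tick rate from `sysconf(_SC_CLK_TCK)` (commonly 100).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract utime + stime (in clock ticks) from one line of /proc/<pid>/stat.
 * Returns 0 on success, -1 if the line cannot be parsed.
 */
static int stat_cpu_ticks(const char *line, unsigned long *ticks)
{
	const char *p;
	unsigned long utime, stime;

	assert(line != NULL && ticks != NULL);

	/* The command name sits in parentheses and may contain spaces,
	 * so parsing starts after the last ')'. */
	p = strrchr(line, ')');
	if (!p)
		return -1;

	/* Skip state plus the ten fields before utime (14) and stime (15). */
	if (sscanf(p + 1,
		   " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
		   &utime, &stime) != 2)
		return -1;

	*ticks = utime + stime;
	return 0;
}
```

For a made-up line such as `28 (kswapd0) S 2 0 0 0 -1 10758208 0 0 0 0 12 6788 0 0 20 0 1 0 5 0 0` the helper yields 6800 ticks, which at 100 ticks per second is 68 seconds of CPU time. Sampling this periodically shows whether kswapd is accumulating time slowly, as in the reports above, or burning a full CPU.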
Re: kswapd craziness in 3.7
On 4.12.2012 10:05, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
> > Hi Zdenek,
> >
> > On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> > > Ok, bad news - I've been hit by the kswapd0 loop again -
> > > my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
> > > showed kswapd0 on CPU for a couple of minutes.
> > >
> > > It seemed to go away instantly when I dropped caches
> > > (echo 3 >/proc/sys/vm/drop_caches)
> > > (After that I had over 1G of free memory)
> >
> > Any chance you could retry with this patch on top?
> >
> > ---
> > From: Johannes Weiner
> > Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> >  to individual uncompactable zones
> >
> > ---
> >  mm/vmscan.c | 16 ----------------
> >  1 file changed, 16 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>
> Ok, I'm running now b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
> with your patch. I'll be able to give some feedback after couple
> days (if I keep my machine running without reboot - since before

So to just give some positive info - with 2 1/2 days of uptime, several suspend/resumes, and ff at 1.4GB, I still have just 29 seconds for the kswapd0 process.

So the patch above might have helped - but I'll look for a few more days.

Zdenek
Re: kswapd craziness in 3.7
Hi! Just a quick update.

Johannes Weiner wrote on 03.12.2012 20:42:
> On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
>
>> BTW, I built that kernel without the patch you mentioned in
>> http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
>> ("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
>> [...]") It looked to me like that patch was only meant for debugging. Let
>> me know if that was wrong. Ohh, and I didn't update to a fresher
>> mainline checkout yet to make sure the base for John's testing didn't
>> change.
>
> Ah, yes, the ApplyPatch is commented out.
>
> I think we want that upstream as well, but it's not critical.
> [...]

Sorry, it had no "Signed-off-by", so I assumed it was just for debugging.

> Not rebasing sounds reasonable to me to verify the patch. It might be
> worth testing that the final version that will be 3.7 still works for
> John, however, once that is done. Just to be sure.

Just to be sure, I yesterday built an rc8 kernel with the patch referenced above and the one that is not yet merged (these two, to be precise:

http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91300

; all the other patches my kswap test kernels contained earlier were afaics merged a few days ago) and mentioned it in the Fedora bug report. John gave them a try and in https://bugzilla.redhat.com/show_bug.cgi?id=866988#c65 reported "No problems so far. I'll check back again in ~24hours."

CU, Thorsten
Re: kswapd craziness in 3.7
On Tue, Dec 04, 2012 at 16:42:10 -0500, Johannes Weiner wrote:
> kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and
> kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64 for over
> 24hours with no evidence of problems with kswapd"
>
> Now waiting for results from Jiri, Zdenek and Bruno...

I have been running 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE a bit over 23 hours and kswapd has accumulated one minute 8 seconds of CPU time. I did several yum operations during that time and didn't see kswapd spike to 90+% CPU usage as I had seen in the past. With some kernels I wasn't reliably triggering the kswapd issue, so it may not be long enough to know for sure that the problem is fixed.

I also should note that when I tried 3.7.0-0.rc7.git3.2.fc19.i686.PAE I did see problems with kswapd hitting 90+% usage of a CPU.
Re: kswapd craziness in 3.7
On Mon, Dec 03, 2012 at 02:42:08PM -0500, Johannes Weiner wrote:
> On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> > >> John was able to reproduce the problem quickly with a kernel that
> > >> contained the patch from your mail. For details see
> > >
> > > [stripped: all the gory details of what likely went wrong and led
> > > to the problem john sees or saw]
> > >
> > > ---
> > > From: Johannes Weiner
> > > Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> > >  to individual uncompactable zones
> > >
> > > When a zone meets its high watermark and is compactable in case of
> > > higher order allocations, it contributes to the percentage of the
> > > node's memory that is considered balanced.
> > > [...]
> >
> > FYI: I built a kernel with that patch. I've been running on my x86_64
> > machine at home over the weekend and everything was working fine (just
> > as without the patch). John gave it a quick try and in
> > https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:
> >
> > """
> > I just installed
> > kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
> > usual load that triggers the problem. OK so far. I'll check again in
> > 24hours, but looking good so far.
> > """
>
> w00t!

Update from John in the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=866988#c62):

"Good news. I've now been running both kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64 for over 24hours with no evidence of problems with kswapd"

Now waiting for results from Jiri, Zdenek and Bruno...
Re: kswapd craziness in 3.7
On Tue, Dec 04, 2012 at 05:22:38PM +0100, Jiri Slaby wrote:
> On 12/04/2012 05:11 PM, Johannes Weiner wrote:
> >>> Any chance you could retry with this patch on top?
> >>
> >> It does not apply to -next :/. Should I try anything else?
> >
> > The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> > is a -next patch. I hope you don't run into other problems that come
> > out of -next craziness, because Linus is kinda waiting for this to be
> > resolved to release 3.7. If you've always tested against -next so far
> > and it worked otherwise, don't change the environment now, please. If
> > you just started, it would make more sense to test based on 3.7-rc8.
>
> I reported the issue as soon as it appeared in -next for the first time
> on Oct 12. Since then I'm constantly hitting the issue (well, there were
> more than one I suppose, but not all of them were fixed by now) until
> now. I run only -next...

Okay. Yes, it was a couple of problems, but not everybody hit the same subset.

> Going to apply the patch now.

Thanks!
Re: kswapd craziness in 3.7
On 12/04/2012 05:11 PM, Johannes Weiner wrote:
>>> Any chance you could retry with this patch on top?
>>
>> It does not apply to -next :/. Should I try anything else?
>
> The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> is a -next patch. I hope you don't run into other problems that come
> out of -next craziness, because Linus is kinda waiting for this to be
> resolved to release 3.7. If you've always tested against -next so far
> and it worked otherwise, don't change the environment now, please. If
> you just started, it would make more sense to test based on 3.7-rc8.

I reported the issue as soon as it appeared in -next for the first time on Oct 12. Since then I have been hitting the issue constantly (well, there was more than one I suppose, and not all of them have been fixed by now). I run only -next...

Going to apply the patch now.

--
js
suse labs
Re: kswapd craziness in 3.7
On Tue, Dec 04, 2012 at 10:05:29AM +0100, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
> >Hi Zdenek,
> >
> >On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> >>Ok, bad news - I've been hit by the kswapd0 loop again -
> >>my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
> >>showed kswapd0 on CPU for a couple of minutes.
> >>
> >>It seemed to go away instantly when I dropped caches
> >>(echo 3 >/proc/sys/vm/drop_caches)
> >>(After that I had over 1G of free memory)
> >
> >Any chance you could retry with this patch on top?
> >
> >---
> >From: Johannes Weiner
> >Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> > to individual uncompactable zones
> >
> >---
> > mm/vmscan.c | 16 ----------------
> > 1 file changed, 16 deletions(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
>
> Ok, I'm running now b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
> with your patch. I'll be able to give some feedback after couple
> days (if I keep my machine running without reboot - since before
> I had occasional problems with ACPI now resolved.
> (https://bugzilla.kernel.org/show_bug.cgi?id=51071)
> (patch not yet in -rc8)
> I'm also using this extra patch: https://patchwork.kernel.org/patch/1792531/

Okay, fingers crossed! Thanks for persisting.

> What seems to be triggering condition on my machine - running laptop
> for some days - and having Thunderbird reaching 0.8G (I guess they
> must keep all my news messages in memory to consume that size) and
> Firefox 1.3GB of consumed
> memory (assuming massive leaking with combination of flash)

Were you able to speed this process up in the past? I.e. by doing a search over all mail? Watching 8 nyan cat videos in parallel? If not, it's probably better not to change anything now...

Thanks!
Re: kswapd craziness in 3.7
On Tue, Dec 04, 2012 at 10:15:09AM +0100, Jiri Slaby wrote:
> On 12/04/2012 10:05 AM, Zdenek Kabelac wrote:
> > On 3.12.2012 20:18, Johannes Weiner wrote:
> >> Szia Zdenek,
> >>
> >> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> >>> Ok, bad news - I've been hit by kswapd0 loop again -
> >>> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
> >>> shown kswapd0 for couple minutes on CPU.
> >>>
> >>> It seemed to go instantly away when I've drop caches
> >>> (echo 3 >/proc/sys/vm/drop_caches)
> >>> (After that I've had over 1G free memory)
> >>
> >> Any chance you could retry with this patch on top?
>
> It does not apply to -next :/. Should I try anything else?

The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below is a -next patch. I hope you don't run into other problems that come out of -next craziness, because Linus is kinda waiting for this to be resolved to release 3.7. If you've always tested against -next so far and it worked otherwise, don't change the environment now, please. If you just started, it would make more sense to test based on 3.7-rc8.

Thanks!

---
From: Johannes Weiner
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
 to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of higher order allocations, it contributes to the percentage of the node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about when kswapd was desperately trying to balance tiny zones when all bigger zones in the node had plenty of free memory. Arguably, the same should apply to compaction: if a significant part of the node is balanced enough to run compaction, do not get hung up on that tiny zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least 25% of the node's memory is balanced properly for compaction (see zone_balanced and pgdat_balanced). Remove the individual zone checks that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the compaction code loops back to reclaim because of a single zone and reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis
Signed-off-by: Johannes Weiner
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (IS_ENABLED(CONFIG_COMPACTION) &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
--
1.7.11.7
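The contrast the commit message draws, a node-level "25% balanced" rule for reclaim versus a strict per-zone rule for compaction, can be sketched in a few lines of C. This is only a simplified model, not the kernel code: `struct zone_model` and both function names are hypothetical, and the real pgdat_balanced() additionally restricts itself to zones up to the classzone index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
	unsigned long present_pages;	/* size of the zone in pages */
	bool balanced;			/* outcome of a zone_balanced()-style check */
};

/*
 * Node-level rule in the spirit of pgdat_balanced(): the node counts as
 * balanced once at least a quarter of its pages sit in zones that are
 * individually balanced.
 */
bool node_balanced_model(const struct zone_model *zones, size_t nr)
{
	unsigned long present = 0, balanced = 0;
	size_t i;

	assert(zones != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		present += zones[i].present_pages;
		if (zones[i].balanced)
			balanced += zones[i].present_pages;
	}
	return balanced >= (present >> 2);
}

/* Strict per-zone rule: a single unbalanced zone makes the whole check fail. */
bool every_zone_balanced_model(const struct zone_model *zones, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!zones[i].balanced)
			return false;
	return true;
}
```

With one tiny unbalanced zone next to two large balanced ones, the first function reports the node as balanced while the second does not. That disagreement is the situation in which, per the commit message, compaction keeps sending kswapd back to a reclaim pass that has nothing to do.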
Re: kswapd craziness in 3.7
On 12/04/2012 10:05 AM, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
>> Szia Zdenek,
>>
>> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
>>> Ok, bad news - I've been hit by kswapd0 loop again -
>>> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
>>> shown kswapd0 for couple minutes on CPU.
>>>
>>> It seemed to go instantly away when I've drop caches
>>> (echo 3 >/proc/sys/vm/drop_caches)
>>> (After that I've had over 1G free memory)
>>
>> Any chance you could retry with this patch on top?

It does not apply to -next :/. Should I try anything else?

>> From: Johannes Weiner
>> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>> to individual uncompactable zones
...
> What seems to be triggering condition on my machine - running laptop for
> some days - and having Thunderbird reaching 0.8G (I guess they must
> keep all my news messages in memory to consume that size) and Firefox
> 1.3GB of consumed
> memory (assuming massive leaking with combination of flash)

Similar here, 5 days of uptime (suspend/resumes in between). FF 900M, TB 250M, java 1.1G, kvm 550M, X 400M, cache 1.5G out of 6G total mem. And boom.

thanks,
--
js
suse labs
Re: kswapd craziness in 3.7
On 3.12.2012 20:18, Johannes Weiner wrote:
> Szia Zdenek,
>
> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
>> Ok, bad news - I've been hit by kswapd0 loop again -
>> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
>> shown kswapd0 for couple minutes on CPU.
>>
>> It seemed to go instantly away when I've drop caches
>> (echo 3 >/proc/sys/vm/drop_caches)
>> (After that I've had over 1G free memory)
>
> Any chance you could retry with this patch on top?
>
> ---
> From: Johannes Weiner
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> to individual uncompactable zones
>
> ---
> mm/vmscan.c | 16 ----------------
> 1 file changed, 16 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c

Ok, I'm running now b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8) with your patch. I'll be able to give some feedback after a couple of days, if I keep my machine running without a reboot. Before, I had occasional problems with ACPI, which are now resolved.
(https://bugzilla.kernel.org/show_bug.cgi?id=51071)
(patch not yet in -rc8)
I'm also using this extra patch: https://patchwork.kernel.org/patch/1792531/

What seems to be the triggering condition on my machine: running the laptop for some days, with Thunderbird reaching 0.8G (I guess they must keep all my news messages in memory to consume that size) and Firefox at 1.3GB of consumed memory (assuming massive leaking in combination with flash).

Zdenek
Re: kswapd craziness in 3.7
On 12/03/2012 02:14 PM, Jiri Slaby wrote:
> On 11/27/2012 09:48 PM, Johannes Weiner wrote:
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage. We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem. Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
>
> Hi, I reported the problem for the first time but I got lost in the
> patches flying around very early.
>
> Whatever is in the current -next, works for me since -next was
> resurrected after the 2 weeks gap last week...

Bah, I always need to write an email to reproduce that. It's back:

3.7.0-rc7-next-20121130
[] __cond_resched+0x2a/0x40
[] shrink_slab+0x1c0/0x2d0
[] kswapd+0x65d/0xb50
[] kthread+0xc0/0xd0
[] ret_from_fork+0x7c/0xb0
[] 0x

Going to apply this:
https://lkml.org/lkml/2012/12/3/407
and wait another 5 days to see the results...

thanks,
--
js
suse labs
Re: kswapd craziness in 3.7
On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> >> John was able to reproduce the problem quickly with a kernel that
> >> contained the patch from your mail. For details see
> >
> > [stripped: all the gory details of what likely went wrong and led
> > to the problem john sees or saw]
> >
> > ---
> > From: Johannes Weiner
> > Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> > to individual uncompactable zones
> >
> > When a zone meets its high watermark and is compactable in case of
> > higher order allocations, it contributes to the percentage of the
> > node's memory that is considered balanced.
> > [...]
>
> FYI: I built a kernel with that patch. I've been running it on my x86_64
> machine at home over the weekend and everything was working fine (just
> as without the patch). John gave it a quick try and in
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:
>
> """
> I just installed
> kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
> usual load that triggers the problem. OK so far. I'll check again in
> 24hours, but looking good so far.
> """

w00t!

> BTW, I built that kernel without the patch you mentioned in
> http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
> ("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> [...]") It looked to me like that patch was only meant for debugging. Let
> me know if that was wrong. Ohh, and I didn't update to a fresher
> mainline checkout yet to make sure the base for John's testing didn't
> change.

Ah, yes, the ApplyPatch is commented out. I think we want that upstream as well, but it's not critical. It'll reduce kswapd CPU usage marginally on highmem systems in certain situations, but I don't think any of the 100% CPU usage problems are fixed by it.

Not rebasing sounds reasonable to me to verify the patch. It might be worth testing that the final version that will be 3.7 still works for John, however, once that is done. Just to be sure.
Re: kswapd craziness in 3.7
Szia Zdenek,

On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> Ok, bad news - I've been hit by kswapd0 loop again -
> my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
> shown kswapd0 for couple minutes on CPU.
>
> It seemed to go instantly away when I've drop caches
> (echo 3 >/proc/sys/vm/drop_caches)
> (After that I've had over 1G free memory)

Any chance you could retry with this patch on top?

Thanks!

---
From: Johannes Weiner
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
 to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of higher order allocations, it contributes to the percentage of the node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about when kswapd was desperately trying to balance tiny zones when all bigger zones in the node had plenty of free memory. Arguably, the same should apply to compaction: if a significant part of the node is balanced enough to run compaction, do not get hung up on that tiny zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least 25% of the node's memory is balanced properly for compaction (see zone_balanced and pgdat_balanced). Remove the individual zone checks that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the compaction code loops back to reclaim because of a single zone and reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis
Signed-off-by: Johannes Weiner
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (COMPACTION_BUILD &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
--
1.7.11.7
Re: kswapd craziness in 3.7
On 28.11.2012 10:45, Mel Gorman wrote:
> (Adding Thorsten to cc)
>
> On Tue, Nov 27, 2012 at 03:48:34PM -0500, Johannes Weiner wrote:
>> Hi everyone,
>>
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage. We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem. Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
>>
>> One problem was, as it seems, overly aggressive reclaim due to scaling
>> up reclaim goals based on compaction failures. This one was reverted
>> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
>> reclaim/compaction based on failures".
>
> This particular one would have been made worse by the accounting bug and
> if kswapd was staying awake longer than necessary. As scaling the amount
> of reclaim only for direct reclaim helped this problem a lot, I strongly
> suspect the accounting bug was a factor.
>
> However the benefit for this is marginal -- it primarily affects how
> many THP pages we can allocate under stress. There is already a graceful
> fallback path and a system under heavy reclaim pressure is not going to
> notice the performance benefit of THP.
>
>> Another one was an accounting problem where a freed higher order page
>> was underreported, and so kswapd had trouble restoring watermarks.
>> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
>> (appears like memory leak).
>
> This almost certainly also requires the follow-on fix at
> https://lkml.org/lkml/2012/11/26/225 for reasons I explained in
> https://lkml.org/lkml/2012/11/27/190 .
>
>> The third one is a problem with small zones, like the DMA zone, where
>> the high watermark is lower than the low watermark plus compaction gap
>> (2 * allocation size). The zonelist reclaim in kswapd would do
>> nothing because all high watermarks are met, but the compaction logic
>> would find its own requirements unmet and loop over the zones again.
>> Indefinitely, until some third party would free enough memory to help
>> meet the higher compaction watermark. The problematic code has been
>> there since the 3.4 merge window for non-THP higher order allocations
>> but has been more prominent since the 3.7 merge window, where kswapd
>> is also woken up for the much more common THP allocations.
>
> Yes.
>
>> The following patch should fix the third issue by making both reclaim
>> and compaction code in kswapd use the same predicate to determine
>> whether a zone is balanced or not.
>>
>> Hopefully, the sum of all three fixes should tame kswapd enough for
>> 3.7.
>
> Not exactly sure of that. With just those patches it is possible for
> allocations for THP entering the slow path to keep kswapd continually
> awake doing busy work. This was an alternative to the revert that covered
> that https://lkml.org/lkml/2012/11/12/151 but it was not enough because
> kswapd would stay awake due to the bug you identified and fixed.
>
> I went with the __GFP_NO_KSWAPD patch in this cycle because 3.6 was/is
> very poor in how it handles THP after the removal of lumpy reclaim. 3.7
> was shaping up to be even worse with multiple root causes too close to the
> release date. Taking kswapd out of the equation covered some of the
> problems (yes, by hiding them) so it could be revisited but Johannes may
> have finally squashed it.
>
> However, if we revert the revert then I strongly recommend that it be
> replaced with "Avoid waking kswapd for THP allocations when compaction is
> deferred or contended".

Ok, bad news - I've been hit by kswapd0 loop again -
my kernel git commit cc19528bd3084c3c2d870b31a3578da8c69952f3 again
shown kswapd0 for couple minutes on CPU.

It seemed to go instantly away when I've drop caches
(echo 3 >/proc/sys/vm/drop_caches)
(After that I've had over 1G free memory)

Here are some stats before drop while kswapd0 was running:

kswapd0 R running task 0 30 2 0x
 880133207b08 0082 880133207b18 0246
 880135b92340 880133207fd8 880133207fd8 880133207fd8
 880103098000 880135b92340 880133206000
Call Trace:
 [] preempt_schedule+0x42/0x60
 [] _raw_spin_unlock+0x55/0x60
 [] grab_super_passive+0x3c/0x90
 [] prune_super+0x46/0x1b0
 [] shrink_slab+0xba/0x510
 [] ? mem_cgroup_iter+0x17a/0x2e0
 [] ? mem_cgroup_iter+0xca/0x2e0
 [] balance_pgdat+0x621/0x7e0
 [] kswapd+0x174/0x640
 [] ? __init_waitqueue_head+0x60/0x60
 [] ? balance_pgdat+0x7e0/0x7e0
 [] kthread+0xdb/0xe0
 [] ? kthread_create_on_node+0x140/0x140
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_create_on_node+0x140/0x140

runnable tasks:
 task PID tree-key switches prio exec-runtime sum-exec sum-sleep
 --
 kswapd0 30 8087056.792356 30543 120 8087056.792356 158938.479290 137131605.711862 /
 kworker/0:3 29833 8087050.792356 526664 120
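The third root cause quoted in the message above, a small zone whose high watermark sits below the low watermark plus the compaction gap, comes down to simple arithmetic. The sketch below is a simplified model with hypothetical helper names. It ignores lowmem reserves and kswapd's extra balance gap, and it assumes the compaction check compares free pages against `low + (2 << order)`, which is how the "2 * allocation size" gap in the quoted text is read here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Watermark that compaction wants to see met before it will run:
 * the low watermark plus twice the size of the requested allocation.
 */
unsigned long compaction_watermark_model(unsigned long low_wmark,
					 unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long) - 1);
	return low_wmark + (2UL << order);
}

/*
 * True when reclaim is satisfied (free pages above the high watermark)
 * while compaction still considers the zone short of memory. A zone in
 * this window is where the two checks in kswapd disagree.
 */
bool stuck_between_model(unsigned long free_pages, unsigned long low_wmark,
			 unsigned long high_wmark, unsigned int order)
{
	return free_pages > high_wmark &&
	       free_pages <= compaction_watermark_model(low_wmark, order);
}
```

For illustration only, take a small zone with a low watermark of 245 pages and a high watermark of 294 pages, and an order-9 request (a 2MB huge page with 4K base pages). The compaction watermark is then 245 + 1024 = 1269 pages, so any free-page count between 295 and 1269 satisfies reclaim but not compaction.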
Re: kswapd craziness in 3.7
On 11/27/2012 09:48 PM, Johannes Weiner wrote:
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage. We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem. Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.

Hi, I reported the problem for the first time but I got lost in the patches flying around very early.

Whatever is in the current -next, works for me since -next was resurrected after the 2 weeks gap last week...

thanks,
--
js
suse labs
Fedora repo (was: Re: kswapd craziness in 3.7)
On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> Np; BTW, in case anybody here on LKML cares: I started maintaining a
> side repo (PPA in ubuntu speak) a few weeks ago that offers kernel
> vanilla builds (mainline and stable) for the Fedora 17 and 18; see
> https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
> for details. It's not as good and up2date yet as I would like it, but
> one has to start somewhere.

Once you have this ready, you should send a more official mail to lkml and relevant lists, with "[ANNOUNCE]" in its subject and an explanation of how to use the repo, so that more people know about it.

Thanks.

--
Regards/Gruss,
Boris.
Re: kswapd craziness in 3.7
Hi!

Johannes Weiner wrote on 01.12.2012 01:45:
> On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
>> /me wonders how to elegantly get out of his man-in-the-middle position
> You control the mighty koji :-)

Something even a journalist can ;-)

> But seriously, this is very helpful, thank you!

Np; BTW, in case anybody here on LKML cares: I started maintaining a side repo (PPA in ubuntu speak) a few weeks ago that offers kernel vanilla builds (mainline and stable) for the Fedora 17 and 18; see
https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
for details. It's not as good and up2date yet as I would like it, but one has to start somewhere.

Back to topic:

> John now also Cc'd directly.
>
>> John was able to reproduce the problem quickly with a kernel that
>> contained the patch from your mail. For details see
>
> [stripped: all the gory details of what likely went wrong and led
> to the problem john sees or saw]
>
> ---
> From: Johannes Weiner
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> to individual uncompactable zones
>
> When a zone meets its high watermark and is compactable in case of
> higher order allocations, it contributes to the percentage of the
> node's memory that is considered balanced.
> [...]

FYI: I built a kernel with that patch. I've been running it on my x86_64 machine at home over the weekend and everything was working fine (just as without the patch). John gave it a quick try and in https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:

"""
I just installed kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my usual load that triggers the problem. OK so far. I'll check again in 24hours, but looking good so far.
"""

BTW, I built that kernel without the patch you mentioned in http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153 ("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored [...]"). It looked to me like that patch was only meant for debugging. Let me know if that was wrong. Ohh, and I didn't update to a fresher mainline checkout yet to make sure the base for John's testing didn't change.

CU
Thorsten
Re: kswapd craziness in 3.7
Hi Thorsten,

On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
> /me wonders how to elegantly get out of his man-in-the-middle position

You control the mighty koji :-)

But seriously, this is very helpful, thank you!

John now also Cc'd directly.

> John was able to reproduce the problem quickly with a kernel that
> contained the patch from your mail. For details see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later
>
> He provided the informations there. Parts of it:
> /proc/vmstat while kswad0 at 100%cpu
> /proc/zoneinfo with kswapd0 at 100% cpu
> perf profile

Thanks.

I'm quoting the interesting bits in order of the cars on my possibly derailing train of thought:

> pageoutrun 117729182
> allocstall 5

Okay, so kswapd is stupidly looping but it's still managing to do its actual job; nobody is dropping into direct reclaim.

> pgsteal_kswapd_dma 1
> pgsteal_kswapd_normal 202106
> pgsteal_kswapd_high 36515
> pgsteal_kswapd_movable 0
> pgscan_kswapd_dma 1
> pgscan_kswapd_normal 203044
> pgscan_kswapd_high 40407
> pgscan_kswapd_movable 0

Does not seem excessive, so apparently it also does not overreclaim.

> Node 0, zone DMA
>   pages free 1655
>         min 196
>         low 245
>         high 294
> Node 0, zone Normal
>   pages free 186234
>         min 10953
>         low 13691
>         high 16429
> Node 0, zone HighMem
>   pages free 8983
>         min 34
>         low 475
>         high 917

These are all well above their watermarks, yet kswapd is definitely finding something wrong with one of these as it actually does drop into the reclaim loop, so zone_balanced() must be returning false:

> 16.52% kswapd0 [kernel.kallsyms] [k] idr_get_next
>        |
>        --- idr_get_next
>           |
>           |--99.76%-- css_get_next
>           |          mem_cgroup_iter
>           |          |
>           |          |--50.49%-- shrink_zone
>           |          |          kswapd
>           |          |          kthread
>           |          |          ret_from_kernel_thread
>           |          |
>           |           --49.51%-- kswapd
>           |                     kthread
>           |                     ret_from_kernel_thread
>            --0.24%-- [...]
>
> 11.23% kswapd0 [kernel.kallsyms] [k] prune_super
>        |
>        --- prune_super
>           |
>           |--86.74%-- shrink_slab
>           |          kswapd
>           |          kthread
>           |          ret_from_kernel_thread
>           |
>            --13.26%-- kswapd
>                      kthread
>                      ret_from_kernel_thread

Spending so much time in shrink_zone and shrink_slab without overreclaiming a zone, I would say that a) this always stays on the DEF_PRIORITY and b) only loops on the DMA zone. At DEF_PRIORITY, the scan goal for file pages in the other zones would be > 0, for example.

As the DMA zone watermarks are fine, it must be the fragmentation index that indicates a lack of memory. Filling in the 1655 free pages into the fragmentation index formula indicates lack of free memory when these 1655 pages are lumped together in less than 9 page blocks. Not unrealistic, I think: on my desktop machine, the DMA zone's free 3975 pages are lumped together in only 12 blocks. But on my system, the DMA zone is either never used, in which case there is always at least one page block available that could satisfy a huge page allocation (fragmentation index == -1000), or the system is really close to OOM, at which point the DMA zone is highly fragmented. And keep in mind that if the priority level goes below DEF_PRIORITY, as it does close to OOM, the unreclaimable DMA zone is ignored anyway.

But the DMA zone here is just barely used:

> Node 0, zone DMA
[...]
> nr_slab_reclaimable 3
> nr_slab_unreclaimable 1
[...]
> nr_dirtied 315
> nr_written 315

which could explain a fragmentation index that asks for more free memory while the watermarks are fine.

Why this all loops: there is one more inconsistency where the conditions for reclaim and the conditions for compaction contradict each other: reclaim also does not consider the DMA zone balanced, but it needs only 25% of the whole node to be balanced, while compaction requires every single zone to be balanced individually.
So these strict per-zone checks for compaction at the end of balance_pgdat() are likely to be the culprits that keep kswapd looping forever on this machine, trying to balance DMA for compaction while reclaim decides it has enough balanced memory in the nod
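The "less than 9 page blocks" figure in the analysis above can be reproduced with a small model of the fragmentation index. This is a sketch, not the kernel source: the function names are hypothetical, the formula is the one recalled from mm/vmstat.c of that era (result scaled by 1000), and the threshold of 500 is assumed to be the default of the extfrag_threshold sysctl.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed default of /proc/sys/vm/extfrag_threshold. */
#define EXTFRAG_THRESHOLD_MODEL 500

/*
 * Simplified fragmentation index, scaled by 1000.
 * Values near 0 mean "an allocation would fail for lack of memory",
 * values near 1000 mean "it would fail because of fragmentation".
 * -1000 means a suitable free block exists, so the request would succeed.
 */
long fragindex_model(unsigned long free_pages,
		     unsigned long free_blocks_total,
		     unsigned long free_blocks_suitable,
		     unsigned int order)
{
	unsigned long long requested;

	assert(order < 32);
	requested = 1ULL << order;

	if (free_blocks_total == 0)
		return 0;
	if (free_blocks_suitable != 0)
		return -1000;

	return 1000 - (long)((1000 + free_pages * 1000ULL / requested) /
			     free_blocks_total);
}

/* Compaction is skipped when the index points at a lack of memory. */
bool compaction_skipped_model(long index)
{
	return index >= 0 && index <= EXTFRAG_THRESHOLD_MODEL;
}
```

For an order-9 request and 1655 free pages, the scaled term is 1000 + 1655000/512 = 4232. Spread over 8 free blocks the index is 1000 - 529 = 471, which is at or below the threshold, so compaction is skipped for lack of memory. With 9 free blocks it is 1000 - 470 = 530, above the threshold, which matches the boundary stated in the message.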
Re: kswapd craziness in 3.7
Johannes Weiner wrote on 29.11.2012 18:05:
> On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
>> Mel Gorman wrote on 29.11.2012 00:54:
>> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> >> On Wed, 28 Nov 2012 10:13:59 +
>> >> Mel Gorman wrote:
>> >> > Based on the reports I've seen I expect the following to work for 3.7
>> >> > Keep
>> >> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by
>> >> > reclaim/compaction based on failures"
>> >> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory
>> >> > leak)
>> >> > Revert
>> >> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> >> > Merge
>> >> > mm: vmscan: fix kswapd endless loop on higher order allocation
>> >> > mm: Avoid waking kswapd for THP allocations when compaction is
>> >> > deferred or contended
>> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> >> myself" and when Zdenek tested it he hit an unexplained oom.
>> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
>> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
>> > swap configured and inactive/active file pages were very low. Finally,
>> > the free pages for Normal looked off and could also have been affected by
>> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
>> > here. Are you thinking of something else?
>> > I have not tested with the patch admittedly but Thorsten has and seemed
>> > to be ok with it https://lkml.org/lkml/2012/11/23/276.
>> Yeah, on my two main work horses a few different kernels based on rc6 or
>> rc7 worked fine with this patch. But sorry, it seems the patch doesn't
>> fix the problems Fedora user John Ellson sees, who tried kernels I built
>> in the Fedora buildsystem. Details:
> [...]
>> I know, this makes things more complicated again; but I wanted to let
>> you guys know that some problem might still be lurking somewhere. Side
>> note: right now it seems John with kernels that contain
>> "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
>> quicker (or only?) on i686 than on x86-64.
>
> Humm, highmem... Could this be the lowmem protection forcing kswapd
> to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
> time it's woken up?
>
> This requires somebody to wake up kswapd regularly, though and from
> his report it's not quite clear to me if kswapd gets stuck or just has
> really high CPU usage while the system is still under load. The
> initial post says he would expect "<5% cpu when idling" but his top
> snippet in there shows there are other tasks running as well. So does
> it happen while the system is busy or when it's otherwise idle?
>
> [ On the other hand, not waking kswapd from THP allocations seems to
> not show this problem on his i686 machine. But it could also just
> be a tiny window of conditions aligning perfectly that drops kswapd
> in an endless loop, and the increased wakeups increase the
> probability of hitting it. So, yeah, this would be good to know. ]
>
> As the system is still responsive when this happens, any chance he
> could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
> haywire?
>
> Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
>
> Preferably with this patch applied, to rule out faulty lowmem
> protection:
>
> buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> when figuring out whether the zone is balanced and so priority levels
> are not descended and no progress is ever made.

/me wonders how to elegantly get out of his man-in-the-middle position

John was able to reproduce the problem quickly with a kernel that contained the patch from your mail. For details see

https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later

He provided the information there. Parts of it:

/proc/vmstat while kswapd0 at 100%cpu

nr_free_pages 196858
nr_inactive_anon 15804
nr_active_anon 65
nr_inactive_file 20792
nr_active_file 11307
nr_unevictable 0
nr_mlock 0
nr_anon_pages 14385
nr_mapped 2393
nr_file_pages 32563
nr_dirty 5
nr_writeback 0
nr_slab_reclaimable 3113
nr_slab_unreclaimable 4725
nr_page_table_pages 271
nr_kernel_stack 96
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1487
nr_vmscan_immediate_reclaim 3
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 381
nr_dirtied 388323
nr_written 361128
nr_anon_transparent_hugepages 1
nr_free_cma 0
nr_dirty_threshold 38188
nr_dirty_background_threshold 19094
pgpgin 1057223
pgpgout 1552306
pswpin 8
pswpout 1487
pgalloc_dma 5548
pgalloc_normal 10651864
pgalloc_high 2191246
pgalloc_movable 0
pgfree 13055503
pgactivate 440358
pgdeactivate 259724
pgfault 31423675
pgmajfault 3760
pgrefill_dma 2174
pgrefill_normal 212914
pgrefill_high 51755
pgrefill_movable 0
pgsteal_kswapd_dma 1
pgsteal_kswapd_normal 202106
pgsteal_kswapd_high 36515
pgsteal_kswapd_movable 0
pgsteal_direct_dma 18
pgsteal_direct_normal
Re: kswapd craziness in 3.7
On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote: > Mel Gorman wrote on 29.11.2012 00:54: > > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote: > >> On Wed, 28 Nov 2012 10:13:59 + > >> Mel Gorman wrote: > >> > >> > Based on the reports I've seen I expect the following to work for 3.7 > >> > Keep > >> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by > >> > reclaim/compaction based on failures" > >> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory > >> > leak) > >> > Revert > >> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD" > >> > Merge > >> > mm: vmscan: fix kswapd endless loop on higher order allocation > >> > mm: Avoid waking kswapd for THP allocations when compaction is > >> > deferred or contended > >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it > >> myself" and when Zdenek tested it he hit an unexplained oom. > > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM. > > Further, when he hit that OOM, it looked like a genuine OOM. He had no > > swap configured and inactive/active file pages were very low. Finally, > > the free pages for Normal looked off and could also have been affected by > > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132 > > here. Are you thinking of something else? > > > > I have not tested with the patch admittedly but Thorsten has and seemed > > to be ok with it https://lkml.org/lkml/2012/11/23/276. > > Yeah, on my two main work horses a few different kernels based on rc6 or > rc7 worked fine with this patch. But sorry, it seems the patch doesn't > fix the problems Fedora user John Ellson sees, who tried kernels I built > in the Fedora buildsystem. Details: > > In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned > his machine worked fine with a rc6 based kernel I built that contained > 82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). 
Before that he had tried > a kernel with the same baseline that contained "Avoid waking kswapd for > THP allocations when […]" instead and reported it didn't help on his > i686 machine (seems it helped the x86-64 one): > https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33 > > He now tried a recent mainline kernel I built 20 hours ago that is based > on a git checkout from round about two days ago, reverts 82b212f4, and had > * fix-kswapd-endless-loop-on-higher-order-allocation.patch > * Avoid-waking-kswapd-for-THP-allocations-when.patch > * mm-compaction-Fix-return-value-of-capture_free_page.patch > applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and > comment 41 he reported that this kernel on his i686 host showed 100%cpu > usage by kswapd0 :-/ > > Build log for said kernel rpms (I quite sure I applied the patches > properly, but you know: mistakes happen, so be careful, maybe I did > something stupid somewhere...): > http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log > > I know, this makes things more complicated again; but I wanted to let > you guys know that some problem might still be lurking somewhere. Side > note: right now it seems John with kernels that contain > "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem > quicker (or only?) on i686 than on x86-64. Humm, highmem... Could this be the lowmem protection forcing kswapd to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every time it's woken up? This requires somebody to wake up kswapd regularly, though and from his report it's not quite clear to me if kswapd gets stuck or just has really high CPU usage while the system is still under load. The initial post says he would expect "<5% cpu when idling" but his top snippet in there shows there are other tasks running as well. So does it happen while the system is busy or when it's otherwise idle? 
[ On the other hand, not waking kswapd from THP allocations seems to not show this problem on his i686 machine. But it could also just be a tiny window of conditions aligning perfectly that drops kswapd in an endless loop, and the increased wakeups increase the probability of hitting it. So, yeah, this would be good to know. ]

As the system is still responsive when this happens, any chance he could capture /proc/zoneinfo and /proc/vmstat when kswapd goes haywire?

Or even run

perf record -a -g sleep 5; perf report > kswapd.txt?

Preferably with this patch applied, to rule out faulty lowmem protection:

buffer_heads_over_limit can put kswapd into reclaim, but it's ignored when figuring out whether the zone is balanced and so priority levels are not descended and no progress is ever made.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..73c4f5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2400,6 +2400,14 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
static bool zone_balanced(struct zone *zone, int order, unsigned lo
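For anyone collecting the data requested above, a rough sketch of how two /proc/vmstat snapshots could be compared follows. It is not from the thread: the function names are hypothetical, it assumes the usual one "name value" pair per line, and it does not touch /proc/zoneinfo, which has a different layout and would still need to be saved separately.

```python
import time


def parse_counters(text):
    """Turn 'name value' lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def read_counters(path="/proc/vmstat"):
    """Read and parse one snapshot; the default path only exists on Linux."""
    with open(path) as f:
        return parse_counters(f.read())


def sample_deltas(seconds=5, path="/proc/vmstat"):
    """Take two snapshots `seconds` apart and report what moved in between."""
    before = read_counters(path)
    time.sleep(seconds)
    after = read_counters(path)
    return counter_deltas(before, after)
```

Calling `sample_deltas(5)` while kswapd0 is spinning would show which counters advance in that window. If the pgscan_kswapd_* counters climb while pgsteal_kswapd_* stay flat, that would be consistent with kswapd scanning without making progress, though the perf profile is still the more direct evidence.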
Re: kswapd craziness in 3.7
Mel Gorman wrote on 29.11.2012 00:54: > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote: >> On Wed, 28 Nov 2012 10:13:59 + >> Mel Gorman wrote: >> >> > Based on the reports I've seen I expect the following to work for 3.7 >> > Keep >> > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by >> > reclaim/compaction based on failures" >> > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory >> > leak) >> > Revert >> > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD" >> > Merge >> > mm: vmscan: fix kswapd endless loop on higher order allocation >> > mm: Avoid waking kswapd for THP allocations when compaction is deferred >> > or contended >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it >> myself" and when Zdenek tested it he hit an unexplained oom. > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM. > Further, when he hit that OOM, it looked like a genuine OOM. He had no > swap configured and inactive/active file pages were very low. Finally, > the free pages for Normal looked off and could also have been affected by > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132 > here. Are you thinking of something else? > > I have not tested with the patch admittedly but Thorsten has and seemed > to be ok with it https://lkml.org/lkml/2012/11/23/276. Yeah, on my two main work horses a few different kernels based on rc6 or rc7 worked fine with this patch. But sorry, it seems the patch doesn't fix the problems Fedora user John Ellson sees, who tried kernels I built in the Fedora buildsystem. Details: In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned his machine worked fine with a rc6 based kernel I built that contained 82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). 
Before that he had tried a kernel with the same baseline that contained "Avoid waking kswapd for THP allocations when […]" instead and reported it didn't help on his i686 machine (seems it helped the x86-64 one):
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33

He now tried a recent mainline kernel I built 20 hours ago that is based on a git checkout from roughly two days ago, reverts 82b212f4, and had
* fix-kswapd-endless-loop-on-higher-order-allocation.patch
* Avoid-waking-kswapd-for-THP-allocations-when.patch
* mm-compaction-Fix-return-value-of-capture_free_page.patch
applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and comment 41 he reported that this kernel on his i686 host showed 100% CPU usage by kswapd0 :-/

Build log for said kernel rpms (I'm quite sure I applied the patches properly, but you know: mistakes happen, so be careful, maybe I did something stupid somewhere...):
http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log

I know, this makes things more complicated again; but I wanted to let you guys know that some problem might still be lurking somewhere. Side note: right now it seems John can trigger the problem quicker (or only?) on i686 than on x86-64 with kernels that contain "Avoid-waking-kswapd-for-THP-allocations-when".

CU
Thorsten
Re: kswapd craziness in 3.7
On Wed, 28 Nov 2012 23:54:12 + Mel Gorman wrote: > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote: > > On Wed, 28 Nov 2012 10:13:59 + > > Mel Gorman wrote: > > > > > Based on the reports I've seen I expect the following to work for 3.7 > > > > > > Keep > > > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by > > > reclaim/compaction based on failures" > > > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory > > > leak) > > > > > > Revert > > > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD" > > > > > > Merge > > > mm: vmscan: fix kswapd endless loop on higher order allocation > > > mm: Avoid waking kswapd for THP allocations when compaction is deferred > > > or contended > > > > "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it > > myself" and when Zdenek tested it he hit an unexplained oom. > > > > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM. > Further, when he hit that OOM, it looked like a genuine OOM. He had no > swap configured and inactive/active file pages were very low. Finally, > the free pages for Normal looked off and could also have been affected by > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132 > here. Are you thinking of something else? who, me, think? I was trying to work out why I hadn't merged or queued a patch which you felt was important. Turned out it was because it didn't look very tested and final. > I have not tested with the patch admittedly but Thorsten has and seemed > to be ok with it https://lkml.org/lkml/2012/11/23/276. OK, I'll queue revert-revert-mm-remove-__gfp_no_kswapd.patch and the patch from https://patchwork.kernel.org/patch/1728081/. 
So what I'm currently sitting on for 3.7 is mm-compaction-fix-return-value-of-capture_free_page.patch mm-vmemmap-fix-wrong-use-of-virt_to_page.patch mm-vmscan-fix-endless-loop-in-kswapd-balancing.patch revert-revert-mm-remove-__gfp_no_kswapd.patch mm-avoid-waking-kswapd-for-thp-allocations-when-compaction-is-deferred-or-contended.patch mm-soft-offline-split-thp-at-the-beginning-of-soft_offline_page.patch > > Please identify "Johannes' patch"? > > mm: vmscan: fix kswapd endless loop on higher order allocation OK, we have that. I'll start a round of testing, do another -next drop and send the above Linuswards tomorrow. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kswapd craziness in 3.7
On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote: > On Wed, 28 Nov 2012 10:13:59 + > Mel Gorman wrote: > > > Based on the reports I've seen I expect the following to work for 3.7 > > > > Keep > > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by > > reclaim/compaction based on failures" > > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak) > > > > Revert > > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD" > > > > Merge > > mm: vmscan: fix kswapd endless loop on higher order allocation > > mm: Avoid waking kswapd for THP allocations when compaction is deferred > > or contended > > "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it > myself" and when Zdenek tested it he hit an unexplained oom. > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM. Further, when he hit that OOM, it looked like a genuine OOM. He had no swap configured and inactive/active file pages were very low. Finally, the free pages for Normal looked off and could also have been affected by the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132 here. Are you thinking of something else? I have not tested with the patch admittedly but Thorsten has and seemed to be ok with it https://lkml.org/lkml/2012/11/23/276. > > Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I > > think we should also avoid waking kswapd for THP allocations if compaction > > is deferred. Johannes' patch might mean that kswapd goes quickly go back > > to sleep but it's still busy work. > > > > 3.6 is still known to be screwed in terms of THP because of the amount of > > time it can spend in compaction after lumpy reclaim was removed. This is > > my old list of patches I felt needed to be backported after 3.7 came out. > > They are not tagged -stable, I'll be sending it to Greg manually. 
> > > > e64c523 mm: compaction: abort compaction loop if lock is contended or run > > too long > > 3cc668f mm: compaction: move fatal signal check out of > > compact_checklock_irqsave > > 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment > > 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible > > f40d1e4 mm: compaction: acquire the zone->lock as late as possible > > 753341a revert "mm: have order > 0 compaction start off where it left" > > bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were > > isolated > > c89511a mm: compaction: Restart compaction from near where it left off > > 6299702 mm: compaction: clear PG_migrate_skip based on compaction and > > reclaim activity > > 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA > > > > Only Johannes' patch needs to be added to this list. kswapd is not woken > > for THP in 3.6 but as it calls compaction for other high-order allocations > > it still makes sense. > > Please identify "Johannes' patch"? mm: vmscan: fix kswapd endless loop on higher order allocation -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kswapd craziness in 3.7
On Wed, 28 Nov 2012 10:13:59 + Mel Gorman wrote: > Based on the reports I've seen I expect the following to work for 3.7 > > Keep > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by > reclaim/compaction based on failures" > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak) > > Revert > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD" > > Merge > mm: vmscan: fix kswapd endless loop on higher order allocation > mm: Avoid waking kswapd for THP allocations when compaction is deferred or > contended "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it myself" and when Zdenek tested it he hit an unexplained oom. > Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I > think we should also avoid waking kswapd for THP allocations if compaction > is deferred. Johannes' patch might mean that kswapd goes quickly go back > to sleep but it's still busy work. > > 3.6 is still known to be screwed in terms of THP because of the amount of > time it can spend in compaction after lumpy reclaim was removed. This is > my old list of patches I felt needed to be backported after 3.7 came out. > They are not tagged -stable, I'll be sending it to Greg manually. 
> > e64c523 mm: compaction: abort compaction loop if lock is contended or run too > long > 3cc668f mm: compaction: move fatal signal check out of > compact_checklock_irqsave > 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment > 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible > f40d1e4 mm: compaction: acquire the zone->lock as late as possible > 753341a revert "mm: have order > 0 compaction start off where it left" > bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were > isolated > c89511a mm: compaction: Restart compaction from near where it left off > 6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim > activity > 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA > > Only Johannes' patch needs to be added to this list. kswapd is not woken > for THP in 3.6 but as it calls compaction for other high-order allocations > it still makes sense. Please identify "Johannes' patch"? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kswapd craziness in 3.7
On Wed, Nov 28, 2012 at 10:13:59AM +, Mel Gorman wrote: > On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote: > > On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner wrote: > > > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote: > > >> > > >> Kswapd going crazy is certainly a large part of the problem. > > >> > > >> However, that leaves the issue of page_alloc.c waking up > > >> kswapd when the system is not actually low on memory. > > >> > > >> Instead, kswapd is woken up because memory compaction failed, > > >> potentially even due to lock contention during compaction! > > >> > > >> Ideally the allocation code would only wake up kswapd if > > >> memory needs to be freed, or in order for kswapd to do > > >> memory compaction (so the allocator does not have to). > > > > > > Maybe I missed something, but shouldn't this be solved with my patch? > > > > Ok, guys. Cage fight! > > > > The rules are simple: two men enter, one man leaves. > > > > I'm fairly scorch damaged from this whole cycle already. I won't need a > prop master to look the part for a thunderdome match. > > > And the one who comes out gets to explain to me which patch(es) I > > should apply, and which I should revert, if any. > > > > Based on the reports I've seen I expect the following to work for 3.7 > > Keep > 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by > reclaim/compaction based on failures" > ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak) > > Revert > 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD" > > Merge > mm: vmscan: fix kswapd endless loop on higher order allocation > mm: Avoid waking kswapd for THP allocations when compaction is deferred or > contended > and mm: compaction: Fix return value of capture_free_page but this one may already be in flight from Andrew's tree as he picked it up already. 
-- 
Mel Gorman
SUSE Labs
Re: kswapd craziness in 3.7
On 11/28/2012 02:35 PM, Zdenek Kabelac wrote:
> and added slightly modified patch from Jiri
> (https://lkml.org/lkml/2012/11/15/950
> (Unsure where it still applies for -rc7??)

It is needed for -next only. And if you have recent -next, it's already there...

thanks,
-- 
js
suse labs
Re: kswapd craziness in 3.7
On 27.11.2012 21:58, Linus Torvalds wrote:

Note that in the meantime, I've also applied (through Andrew) the patch that reverts commit c654345924f7 (see commit 82b212f40059 'Revert "mm: remove __GFP_NO_KSWAPD"').

I wonder if that revert may be bogus, and a result of this same issue. Maybe that revert should be reverted, and replaced with your patch?

Mel? Zdenek? What's the status here?

For a longer term I've been testing with:

https://lkml.org/lkml/2012/11/5/308
https://lkml.org/lkml/2012/11/12/113

These two seem to be merged in -rc7 now (they disappeared after my git rebase).

I also added a slightly modified patch from Jiri (https://lkml.org/lkml/2012/11/15/950); I'm unsure whether it still applies for -rc7.

I'm also carrying Jan Kara's "fs: Fix imbalance in freeze protection in mark_files_ro()" (which is still not applied upstream).

And I think I'm NOT seeing huge load from kswapd0 (at least within my not really long uptimes).

But I'm now also a frequent victim of my other report: https://lkml.org/lkml/2012/11/15/369

That one turns into a real problem: if my T61 has support for 'old hw' docking (i.e. serial output) enabled in the BIOS, the machine becomes unstable, the 1st or 2nd resume deadlocks it, and the serial port gives just garbage.

Zdenek

Linus

On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner wrote:

Hi everyone,

I hope I included everybody that participated in the various threads on kswapd getting stuck / exhibiting high CPU usage. We were looking at at least three root causes as far as I can see, so it's not really clear who observed which problem. Please correct me if the reported-by, tested-by, bisected-by tags are incomplete.

One problem was, as it seems, overly aggressive reclaim due to scaling up reclaim goals based on compaction failures. This one was reverted in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures".
Another one was an accounting problem where a freed higher order page was underreported, and so kswapd had trouble restoring watermarks. This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting (appears like memory leak).

The third one is a problem with small zones, like the DMA zone, where the high watermark is lower than the low watermark plus compaction gap (2 * allocation size). The zonelist reclaim in kswapd would do nothing because all high watermarks are met, but the compaction logic would find its own requirements unmet and loop over the zones again. Indefinitely, until some third party would free enough memory to help meet the higher compaction watermark. The problematic code has been there since the 3.4 merge window for non-THP higher order allocations but has been more prominent since the 3.7 merge window, where kswapd is also woken up for the much more common THP allocations.

The following patch should fix the third issue by making both reclaim and compaction code in kswapd use the same predicate to determine whether a zone is balanced or not.

Hopefully, the sum of all three fixes should tame kswapd enough for 3.7.

Johannes
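The third issue described above comes down to two checks that disagree about whether a zone needs work. The toy model below is a sketch of that disagreement, not kernel code: the function names, the watermark values (low 40, high 48 pages), the starting free count and the reclaim batch size are all invented for illustration. The only figure taken from the description is the compaction gap of twice the allocation size, written here as `2 << order`.

```python
def meets_high_watermark(free, high):
    """Reclaim's view in this model: the zone is fine once free >= high."""
    return free >= high


def meets_compaction_watermark(free, low, order):
    """Compaction's view: needs low watermark plus 2 * allocation size."""
    return free >= low + (2 << order)


def kswapd_passes(free, low, high, order, unified,
                  reclaim_batch=128, max_passes=50):
    """Count passes until both checks are satisfied.

    Returns None if that never happens within max_passes, which stands in
    for the endless loop. With unified=True, reclaim uses the same
    predicate as the exit test, as the proposed fix is described to do.
    """
    for n in range(1, max_passes + 1):
        compaction_ok = meets_compaction_watermark(free, low, order)
        if unified:
            balanced = meets_high_watermark(free, high) and compaction_ok
        else:
            balanced = meets_high_watermark(free, high)
        if not balanced:
            free += reclaim_batch  # reclaim only runs for unbalanced zones
        if (meets_high_watermark(free, high)
                and meets_compaction_watermark(free, low, order)):
            return n
    return None


# Hypothetical small zone, order-9 (THP-sized) request.
stuck = kswapd_passes(free=500, low=40, high=48, order=9, unified=False)
fixed = kswapd_passes(free=500, low=40, high=48, order=9, unified=True)
```

In this model the non-unified variant never reclaims, because 500 free pages already exceed the high watermark of 48, yet it never exits either, because compaction wants 40 + 1024 pages. The unified variant keeps reclaiming until both conditions hold. Real kswapd has more exit paths than this sketch, so it only illustrates the shape of the bug.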
Re: kswapd craziness in 3.7
Mel Gorman wrote on 28.11.2012 11:13:
> On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
>> On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner wrote:
>> > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>
>> And the one who comes out gets to explain to me which patch(es) I
>> should apply, and which I should revert, if any.
>
> Based on the reports I've seen I expect the following to work for 3.7
>
> Keep
> 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures"
> ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
>
> Revert
> 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>
> Merge
> mm: vmscan: fix kswapd endless loop on higher order allocation
> mm: Avoid waking kswapd for THP allocations when compaction is deferred or
> contended

I'll build a kernel with this combination and will give it a try. Maybe one of those people that reported problems in https://bugzilla.redhat.com/show_bug.cgi?id=866988 can try them, too. There, two people recently reported their problems were gone with kernels that contained 82b212f4.

> Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> think we should also avoid waking kswapd for THP allocations if compaction
> is deferred. Johannes' patch might mean that kswapd goes quickly go back
> to sleep but it's still busy work.

Is there a way to trigger (some benchmark?) and detect (something in /proc/vmstat ?) the problem Hannes' patch tries to fix?

Background: The two main problems that got me into this discussion vanished thanks to 9671009 (mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures") and ef6c5be (fix incorrect NR_FREE_PAGES accounting (appears like memory leak)).

I thought all my problems had gone, but after a few days of uptime (I suspended and resumed the particular machine a few times in between, as I was using it just in the evenings) kswapd now and then started consuming nearly 100% of one CPU core for intervals of 10 to 15 seconds (it seems watching a YouTube video triggered it; and the machine was using a little bit of swap space). I had just started debugging this, but due to some stupid mistake (https://plus.google.com/107616711159256259828/posts/GXuhf1LTien ) I then rebooted the machine :-/

So maybe I hit the problem Hannes' patch tries to solve, but I'm not sure; and I have no easy way to verify quickly if the proposed patch combination helps.

Thorsten
Re: kswapd craziness in 3.7
On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote: > On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner wrote: > > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote: > >> > >> Kswapd going crazy is certainly a large part of the problem. > >> > >> However, that leaves the issue of page_alloc.c waking up > >> kswapd when the system is not actually low on memory. > >> > >> Instead, kswapd is woken up because memory compaction failed, > >> potentially even due to lock contention during compaction! > >> > >> Ideally the allocation code would only wake up kswapd if > >> memory needs to be freed, or in order for kswapd to do > >> memory compaction (so the allocator does not have to). > > > > Maybe I missed something, but shouldn't this be solved with my patch? > > Ok, guys. Cage fight! > > The rules are simple: two men enter, one man leaves. > I'm fairly scorch damaged from this whole cycle already. I won't need a prop master to look the part for a thunderdome match. > And the one who comes out gets to explain to me which patch(es) I > should apply, and which I should revert, if any. > Based on the reports I've seen I expect the following to work for 3.7 Keep 96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak) Revert 82b212f4 Revert "mm: remove __GFP_NO_KSWAPD" Merge mm: vmscan: fix kswapd endless loop on higher order allocation mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I think we should also avoid waking kswapd for THP allocations if compaction is deferred. Johannes' patch might mean that kswapd goes quickly go back to sleep but it's still busy work. 3.6 is still known to be screwed in terms of THP because of the amount of time it can spend in compaction after lumpy reclaim was removed. 
This is my old list of patches I felt needed to be backported after 3.7 came out. They are not tagged -stable, I'll be sending it to Greg manually.

e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
f40d1e4 mm: compaction: acquire the zone->lock as late as possible
753341a revert "mm: have order > 0 compaction start off where it left"
bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
c89511a mm: compaction: Restart compaction from near where it left off
6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA

Only Johannes' patch needs to be added to this list. kswapd is not woken for THP in 3.6 but as it calls compaction for other high-order allocations it still makes sense.

-- 
Mel Gorman
SUSE Labs
Re: kswapd craziness in 3.7
(Adding Thorsten to cc) On Tue, Nov 27, 2012 at 03:48:34PM -0500, Johannes Weiner wrote: > Hi everyone, > > I hope I included everybody that participated in the various threads > on kswapd getting stuck / exhibiting high CPU usage. We were looking > at at least three root causes as far as I can see, so it's not really > clear who observed which problem. Please correct me if the > reported-by, tested-by, bisected-by tags are incomplete. > > One problem was, as it seems, overly aggressive reclaim due to scaling > up reclaim goals based on compaction failures. This one was reverted > in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by > reclaim/compaction based on failures". > This particular one would have been made worse by the accounting bug and if kswapd was staying awake longer than necessary. As scaling the amount of reclaim only for direct reclaim helped this problem a lot, I strongly suspect the accounting bug was a factor. However the benefit for this is marginal -- it primarily affects how many THP pages we can allocate under stress. There is already a graceful fallback path and a system under heavy reclaim pressure is not going to notice the performance benefit of THP. > Another one was an accounting problem where a freed higher order page > was underreported, and so kswapd had trouble restoring watermarks. > This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting > (appears like memory leak). > This almost certainly also requires the follow-on fix at https://lkml.org/lkml/2012/11/26/225 for reasons I explained in https://lkml.org/lkml/2012/11/27/190 . > The third one is a problem with small zones, like the DMA zone, where > the high watermark is lower than the low watermark plus compaction gap > (2 * allocation size). The zonelist reclaim in kswapd would do > nothing because all high watermarks are met, but the compaction logic > would find its own requirements unmet and loop over the zones again. 
> Indefinitely, until some third party would free enough memory to help > meet the higher compaction watermark. The problematic code has been > there since the 3.4 merge window for non-THP higher order allocations > but has been more prominent since the 3.7 merge window, where kswapd > is also woken up for the much more common THP allocations. > Yes. > The following patch should fix the third issue by making both reclaim > and compaction code in kswapd use the same predicate to determine > whether a zone is balanced or not. > > Hopefully, the sum of all three fixes should tame kswapd enough for > 3.7. > Not exactly sure of that. With just those patches it is possible for allocations for THP entering the slow path to keep kswapd continually awake doing busy work. This was an alternative to the revert that covered that https://lkml.org/lkml/2012/11/12/151 but it was not enough because kswapd would stay awake due to the bug you identified and fixed. I went with the __GFP_NO_KSWAPD patch in this cycle because 3.6 was/is very poor in how it handles THP after the removal of lumpy reclaim. 3.7 was shaping up to be even worse with multiple root causes too close to the release date. Taking kswapd out of the equation covered some of the problems (yes, by hiding them) so it could be revisited but Johannes may have finally squashed it. However, if we revert the revert then I strongly recommend that it be replaced with "Avoid waking kswapd for THP allocations when compaction is deferred or contended". -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
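The two wakeup policies being weighed in this message can be put side by side. The sketch below is illustrative only: `AllocContext`, its fields and both function names are hypothetical and do not correspond to kernel structures. It assumes nothing beyond what the thread states, namely that __GFP_NO_KSWAPD keeps THP allocations from waking kswapd at all, while the proposed replacement skips the wakeup only when compaction is deferred or contended.

```python
from dataclasses import dataclass


@dataclass
class AllocContext:
    """Hypothetical summary of one allocation that entered the slow path."""
    is_thp: bool
    compaction_deferred: bool = False
    compaction_contended: bool = False


def wake_with_gfp_no_kswapd(ctx):
    """Policy of the __GFP_NO_KSWAPD flag: THP allocations never wake kswapd."""
    return not ctx.is_thp


def wake_unless_compaction_stalled(ctx):
    """Proposed replacement: skip the wakeup for THP allocations only when
    compaction is deferred or contended; wake kswapd in every other case."""
    if ctx.is_thp and (ctx.compaction_deferred or ctx.compaction_contended):
        return False
    return True


healthy_thp = AllocContext(is_thp=True)
stalled_thp = AllocContext(is_thp=True, compaction_deferred=True)
small_alloc = AllocContext(is_thp=False, compaction_contended=True)
```

The difference shows up for `healthy_thp`: the flag-based policy never wakes kswapd for it, while the replacement does, so background reclaim and compaction can still help THP allocations when compaction is not struggling.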
Re: kswapd craziness in 3.7
On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>>
>> Kswapd going crazy is certainly a large part of the problem.
>>
>> However, that leaves the issue of page_alloc.c waking up
>> kswapd when the system is not actually low on memory.
>>
>> Instead, kswapd is woken up because memory compaction failed,
>> potentially even due to lock contention during compaction!
>>
>> Ideally the allocation code would only wake up kswapd if
>> memory needs to be freed, or in order for kswapd to do
>> memory compaction (so the allocator does not have to).
>
> Maybe I missed something, but shouldn't this be solved with my patch?

Ok, guys. Cage fight!

The rules are simple: two men enter, one man leaves.

And the one who comes out gets to explain to me which patch(es) I should apply, and which I should revert, if any.

My current guess is that I should apply the one Johannes just sent ("mm: vmscan: fix kswapd endless loop on higher order allocation") after having added the cc to stable to it, and then revert the recent revert (commit 82b212f40059).

But I await the Thunderdome.

Linus
Re: kswapd craziness in 3.7
On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> On 11/27/2012 04:49 PM, Johannes Weiner wrote:
> >On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> >>On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >>>Note that in the meantime, I've also applied (through Andrew) the
> >>>patch that reverts commit c654345924f7 (see commit 82b212f40059
> >>>'Revert "mm: remove __GFP_NO_KSWAPD"').
> >>>
> >>>I wonder if that revert may be bogus, and a result of this same issue.
> >>>Maybe that revert should be reverted, and replaced with your patch?
> >>>
> >>>Mel? Zdenek? What's the status here?
> >>
> >>Mel posted several patches to fix the kswapd issue. This one is
> >>slightly more risky than the outright revert, but probably preferred
> >>from a performance point of view:
> >>
> >>https://lkml.org/lkml/2012/11/12/151
> >>
> >>It works by skipping the kswapd wakeup for THP allocations, only
> >>if compaction is deferred or contended.
> >
> >Just to clarify, this would be a replacement strictly for the
> >__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
> >up for higher order allocations like THP.
> >
> >My patch is to fix how kswapd actually does higher order reclaim, and
> >it is required either way.
> >
> >[ But isn't the _reason_ why the "wake up kswapd more carefully for
> >  THP" patch was written kind of moot now since it was developed
> >  against a crazy kswapd? It would certainly need to be re-evaluated.
> >  My (limited) testing didn't show any issues anymore with waking
> >  kswapd unconditionally once it's fixed. ]
>
> Kswapd going crazy is certainly a large part of the problem.
>
> However, that leaves the issue of page_alloc.c waking up
> kswapd when the system is not actually low on memory.
>
> Instead, kswapd is woken up because memory compaction failed,
> potentially even due to lock contention during compaction!
>
> Ideally the allocation code would only wake up kswapd if
> memory needs to be freed, or in order for kswapd to do
> memory compaction (so the allocator does not have to).

Maybe I missed something, but shouldn't this be solved with my patch?

The first scan over the zones finds the higher order watermark
breached, but the reclaim scan over the zones tests against order-0
(testorder) watermarks when compaction is suitable, i.e. no reclaim if
there are enough order-0 pages for compaction to work. It should just
fall through to that zones_need_compaction condition at the end and
run compaction.

As such, it should always be appropriate to wake kswapd if allocations
fail.
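The testorder logic described in the message above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the kernel code: the function name `kswapd_action`, its parameters, the example watermark values, and the reduction of "compaction is suitable" to "free pages cover the low watermark plus 2 * allocation size" are all made up for illustration.

```python
def kswapd_action(free_pages, low_wmark, high_wmark, order, has_free_block):
    """Toy decision for one zone: 'sleep', 'compact' or 'reclaim'.

    Hypothetical model, not kernel code. has_free_block says whether a
    free block of the requested order already exists in the zone.
    """
    order0_ok = free_pages >= high_wmark

    # Nothing to do: enough pages, and the requested order is available.
    if order0_ok and (order == 0 or has_free_block):
        return "sleep"

    # Assumed stand-in for "compaction is suitable": enough order-0
    # pages to cover the low watermark plus 2 * allocation size.
    compaction_suitable = order > 0 and free_pages >= low_wmark + 2 * (1 << order)

    # When compaction can do the job, test watermarks at order 0 only.
    testorder = 0 if compaction_suitable else order

    if testorder == 0 and order0_ok:
        # No reclaim; fall through to compaction at the end.
        return "compact"
    return "reclaim"


# Plenty of order-0 pages but no order-9 block: compact, do not reclaim.
print(kswapd_action(5000, 100, 150, order=9, has_free_block=False))
# Too few pages for compaction to work with: reclaim first.
print(kswapd_action(200, 100, 150, order=9, has_free_block=False))
```

In this model a failed higher order allocation on a zone with plenty of free order-0 pages leads to compaction rather than reclaim, which is the behaviour the message argues makes waking kswapd harmless.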
Re: kswapd craziness in 3.7
On 11/27/2012 04:49 PM, Johannes Weiner wrote:
>On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
>>On 11/27/2012 03:58 PM, Linus Torvalds wrote:
>>>Note that in the meantime, I've also applied (through Andrew) the
>>>patch that reverts commit c654345924f7 (see commit 82b212f40059
>>>'Revert "mm: remove __GFP_NO_KSWAPD"').
>>>
>>>I wonder if that revert may be bogus, and a result of this same issue.
>>>Maybe that revert should be reverted, and replaced with your patch?
>>>
>>>Mel? Zdenek? What's the status here?
>>
>>Mel posted several patches to fix the kswapd issue. This one is
>>slightly more risky than the outright revert, but probably preferred
>>from a performance point of view:
>>
>>https://lkml.org/lkml/2012/11/12/151
>>
>>It works by skipping the kswapd wakeup for THP allocations, only
>>if compaction is deferred or contended.
>
>Just to clarify, this would be a replacement strictly for the
>__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
>up for higher order allocations like THP.
>
>My patch is to fix how kswapd actually does higher order reclaim, and
>it is required either way.
>
>[ But isn't the _reason_ why the "wake up kswapd more carefully for
>  THP" patch was written kind of moot now since it was developed
>  against a crazy kswapd? It would certainly need to be re-evaluated.
>  My (limited) testing didn't show any issues anymore with waking
>  kswapd unconditionally once it's fixed. ]

Kswapd going crazy is certainly a large part of the problem.

However, that leaves the issue of page_alloc.c waking up
kswapd when the system is not actually low on memory.

Instead, kswapd is woken up because memory compaction failed,
potentially even due to lock contention during compaction!

Ideally the allocation code would only wake up kswapd if
memory needs to be freed, or in order for kswapd to do
memory compaction (so the allocator does not have to).

--
All rights reversed
Re: kswapd craziness in 3.7
On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >Note that in the meantime, I've also applied (through Andrew) the
> >patch that reverts commit c654345924f7 (see commit 82b212f40059
> >'Revert "mm: remove __GFP_NO_KSWAPD"').
> >
> >I wonder if that revert may be bogus, and a result of this same issue.
> >Maybe that revert should be reverted, and replaced with your patch?
> >
> >Mel? Zdenek? What's the status here?
>
> Mel posted several patches to fix the kswapd issue. This one is
> slightly more risky than the outright revert, but probably preferred
> from a performance point of view:
>
> https://lkml.org/lkml/2012/11/12/151
>
> It works by skipping the kswapd wakeup for THP allocations, only
> if compaction is deferred or contended.

Just to clarify, this would be a replacement strictly for the
__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
up for higher order allocations like THP.

My patch is to fix how kswapd actually does higher order reclaim, and
it is required either way.

[ But isn't the _reason_ why the "wake up kswapd more carefully for
  THP" patch was written kind of moot now since it was developed
  against a crazy kswapd? It would certainly need to be re-evaluated.
  My (limited) testing didn't show any issues anymore with waking
  kswapd unconditionally once it's fixed. ]
Re: kswapd craziness in 3.7
On Tue, Nov 27, 2012 at 12:58:18PM -0800, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?

The __GFP_NO_KSWAPD removal woke kswapd for THP reclaim and so it
exposed all these bugs that accumulated in there when higher order
kswapd reclaim was exercised less often. The revert will hide the
problem again, but doesn't make it go away entirely, so I think we
need my fix either way.

Whether you want to put the full THP weight back on the freshly fixed
higher order kswapd code for 3.7 is a different matter :-) At least
we would see quickly if it's still not working correctly...
Re: kswapd craziness in 3.7
On 11/27/2012 03:58 PM, Linus Torvalds wrote:
>Note that in the meantime, I've also applied (through Andrew) the
>patch that reverts commit c654345924f7 (see commit 82b212f40059
>'Revert "mm: remove __GFP_NO_KSWAPD"').
>
>I wonder if that revert may be bogus, and a result of this same issue.
>Maybe that revert should be reverted, and replaced with your patch?
>
>Mel? Zdenek? What's the status here?

Mel posted several patches to fix the kswapd issue. This one is
slightly more risky than the outright revert, but probably preferred
from a performance point of view:

https://lkml.org/lkml/2012/11/12/151

It works by skipping the kswapd wakeup for THP allocations, only
if compaction is deferred or contended.

--
All rights reversed
Re: kswapd craziness in 3.7
Note that in the meantime, I've also applied (through Andrew) the
patch that reverts commit c654345924f7 (see commit 82b212f40059
'Revert "mm: remove __GFP_NO_KSWAPD"').

I wonder if that revert may be bogus, and a result of this same issue.
Maybe that revert should be reverted, and replaced with your patch?

Mel? Zdenek? What's the status here?

Linus

On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner wrote:
> Hi everyone,
>
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage. We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem. Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
>
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures. This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
>
> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
>
> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size). The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark. The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
>
> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
>
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
>
> Johannes
>
kswapd craziness in 3.7
Hi everyone,

I hope I included everybody that participated in the various threads
on kswapd getting stuck / exhibiting high CPU usage. We were looking
at at least three root causes as far as I can see, so it's not really
clear who observed which problem. Please correct me if the
reported-by, tested-by, bisected-by tags are incomplete.

One problem was, as it seems, overly aggressive reclaim due to scaling
up reclaim goals based on compaction failures. This one was reverted
in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
reclaim/compaction based on failures".

Another one was an accounting problem where a freed higher order page
was underreported, and so kswapd had trouble restoring watermarks.
This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
(appears like memory leak).

The third one is a problem with small zones, like the DMA zone, where
the high watermark is lower than the low watermark plus compaction gap
(2 * allocation size). The zonelist reclaim in kswapd would do
nothing because all high watermarks are met, but the compaction logic
would find its own requirements unmet and loop over the zones again.
Indefinitely, until some third party would free enough memory to help
meet the higher compaction watermark. The problematic code has been
there since the 3.4 merge window for non-THP higher order allocations
but has been more prominent since the 3.7 merge window, where kswapd
is also woken up for the much more common THP allocations.

The following patch should fix the third issue by making both reclaim
and compaction code in kswapd use the same predicate to determine
whether a zone is balanced or not.

Hopefully, the sum of all three fixes should tame kswapd enough for
3.7.

Johannes
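The third root cause above (two different "is this zone balanced?" tests inside one loop) can be reproduced in a toy simulation. The sketch below is not the kernel code and not the patch itself: the `Zone` class, the function names, the page counts, and the 32-page reclaim batch are assumptions chosen only to show why mismatched predicates spin and a shared predicate terminates.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    free: int         # pages currently free
    reclaimable: int  # pages that reclaim could still free
    low_wmark: int
    high_wmark: int


def reclaim_satisfied(zone):
    # Test used by the reclaim step in this model: high watermark met.
    return zone.free >= zone.high_wmark


def compaction_satisfied(zone, order):
    # Test used by the compaction step in this model:
    # low watermark plus compaction gap (2 * allocation size).
    return zone.free >= zone.low_wmark + 2 * (1 << order)


def kswapd_passes(zone, order, unified, batch=32, max_passes=10_000):
    """Return the number of passes until the zone counts as balanced,
    or None if the loop is still spinning after max_passes."""
    for passes in range(1, max_passes + 1):
        if unified:
            needs_reclaim = not compaction_satisfied(zone, order)
        else:
            needs_reclaim = not reclaim_satisfied(zone)
        if needs_reclaim:
            freed = min(batch, zone.reclaimable)
            zone.reclaimable -= freed
            zone.free += freed
        # The loop only ends once the compaction requirement is met.
        if compaction_satisfied(zone, order):
            return passes
    return None


def small_zone():
    # Hypothetical small zone: high watermark (40) is far below
    # low watermark + gap (30 + 1024 for an order-9 request).
    return Zone(free=50, reclaimable=2000, low_wmark=30, high_wmark=40)


stuck = kswapd_passes(small_zone(), order=9, unified=False)
fixed = kswapd_passes(small_zone(), order=9, unified=True)
print("mismatched predicates:", stuck)
print("shared predicate:", fixed)
```

With mismatched predicates the reclaim step sees its watermark met and frees nothing, so the loop hits the pass limit and the function returns None. With one shared predicate the same zone makes progress on every pass and the loop ends after a finite number of passes, provided the zone has enough reclaimable pages.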