Re: kswapd craziness in 3.7

2012-12-19 Thread Zlatko Calusic

On 11.12.2012 01:19, Zlatko Calusic wrote:

On 10.12.2012 20:13, Linus Torvalds wrote:


It's worth giving this as much testing as is at all possible, but at
the same time I really don't think I can delay 3.7 any more without
messing up the holiday season too much. So unless something obvious
pops up, I will do the release tonight. So testing will be minimal -
but it's not like we haven't gone back-and-forth on this several times
already, and we revert to *mostly* the same old state as 3.6 anyway,
so it should be fairly safe.



So, here's what I found. In short: close, but no cigar!

Kswapd is certainly no longer a CPU pig, and memory seems to be utilized
properly (the kernel still likes to keep 400MB free; somebody else can
confirm whether that's to be expected on a 4GB THP-enabled machine). So it
looks very decent, and much better than anything I have run in the last 10
days, barring the !THP kernel.

What remains a mystery is that kswapd occasionally still likes to get
stuck in a D state, only now it recovers faster than before (sometimes
in a matter of seconds, but sometimes it takes a few minutes). Now, I
admit it's a small, maybe even cosmetic issue. But, it could also be a
warning sign of a bigger problem that will reveal itself on a more
loaded machine.



Ha, I nailed it!

The cigar aka the explanation together with a patch will follow shortly 
in a separate topic.


It's a genuine bug that has been with us for a long long time.
--
Zlatko
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kswapd craziness in 3.7

2012-12-11 Thread Zlatko Calusic

On 11.12.2012 01:19, Zlatko Calusic wrote:


I will now make one last attempt. I've just reverted two of Johannes'
commits that were also applied in an attempt to fix the breakage that
removing gfp_no_kswapd introduced, namely ed23ec4 & c702418. For various
reasons the results of this test will be available tomorrow, so it's your
call, Linus.



To be honest, I don't see any difference with those two commits 
reverted. Like those lines never did much anyway, so it's probably good 
we got rid of them. :P


--
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Hugh Dickins
On Mon, 10 Dec 2012, Linus Torvalds wrote:
> [ Adding Hugh Dickins because of the shmem oops. ]

I had already noticed, and was about to reply; but only then refreshed
my mbox window, to find that you've already done it all for me: thanks.

> 
> On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic
>  wrote:
> >
> > And funny thing that you mention i915, because yesterday my daughter 
> > managed to lock up our laptop hard (that was a first), and this is what I 
> > found in kern.log after restart:
> >
> > Dec  9 21:29:42 titan vmunix: general protection fault:  [#1] PREEMPT 
> > SMP
> > Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) 
> > vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
> > Dec  9 21:29:42 titan vmunix: CPU 2
> > Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G   O 
> > 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
> > Dec  9 21:29:42 titan vmunix: RIP: 0010:[]  
> > [] find_get_page+0x3c/0x90
> 
> Ho humm..
> 
> I'm not convinced this is related.
> 
> > Dec  9 21:29:42 titan vmunix: Call Trace:
> > Dec  9 21:29:42 titan vmunix:  [] find_lock_page+0x21/0x80
> > Dec  9 21:29:42 titan vmunix:  [] 
> > shmem_getpage_gfp+0xa0/0x620
> > Dec  9 21:29:42 titan vmunix:  [] 
> > shmem_read_mapping_page_gfp+0x2c/0x50
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_get_pages_gtt+0xe1/0x270
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_get_pages+0x4f/0x90
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_bind_to_gtt+0xc3/0x4c0
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_pin+0x123/0x190
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_execbuffer2+0x94/0x280
> > Dec  9 21:29:42 titan vmunix:  [] drm_ioctl+0x493/0x530
> > Dec  9 21:29:42 titan vmunix:  [] do_vfs_ioctl+0x8f/0x530
> > Dec  9 21:29:42 titan vmunix:  [] sys_ioctl+0x4b/0x90
> > Dec  9 21:29:42 titan vmunix:  [] 
> > system_call_fastpath+0x16/0x1b
> >
> > It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, 
> > the i915 driver will need to be taken better care of.
> 
> That decodes to
> 
>   11: e8 89 b7 15 00   callq  0x15b79f  # radix_tree_lookup_slot
>   16: 48 85 c0         test   %rax,%rax
>   19: 48 89 c6         mov    %rax,%rsi
>   1c: 74 41            je     0x5f
>   1e: 48 8b 18         mov    (%rax),%rbx  #
>   21: 48 85 db         test   %rbx,%rbx
>   24: 74 1f            je     0x45
>   26: f6 c3 03         test   $0x3,%bl
>   29: 75 3c            jne    0x67
>   2b:* 8b 53 1c        mov    0x1c(%rbx),%edx <-- trapping instruction
>   2e: 85 d2            test   %edx,%edx
>   30: 74 d9            je     0xb
> 
> where %rbx is 0x0200. That looks like it could be a
> single-bit error, and should have been zero.
> 
> It's the "atomic_read(&page->counter)" which is part of
> "page_cache_get_speculative()" as far as I can tell, and it's the
> "page" pointer that is that odd (non-pointer) value. The fact that
> %ecx contains the value "-6" makes me wonder if there was a -ENXIO
> somewhere, though.

Yes, just what I was about to say; except I never considered the -6.

I was going to suggest it's a new notebook with not-so-good memory,
but see that Borislav has since made a better suggestion.

> 
> None of it looks all that much related to whether the i915 driver uses
> GFP_NO_KSWAPD or not, though.

Yes, no evidence here of anything to delay 3.7 further.

I'm running on current git, and no problems observed; but then, I never
did see any of these kswapd problems anyway.  And, in particular, I was
unable to reproduce Zlatko's 1GB of 4GB kept free (on yesterday's tree,
with no swap) - I saw about 100MB kept free.

Hugh


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic

On 10.12.2012 22:54, Borislav Petkov wrote:

On Mon, Dec 10, 2012 at 01:47:23PM -0800, Linus Torvalds wrote:

On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov  wrote:


Aren't we gonna consider the out-of-tree vbox modules being loaded and
causing some corruptions like maybe the single-bit error above?

I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317


Yup, that looks more likely, I agree.


@Zlatko: can your daughter try to retrigger the freeze without the vbox
modules loaded?



Sure thing! :)

Although, the vbox modules were only loaded, no VM was running at the 
time lockup happened. But, I've just read the whole thread you mention 
above and I understand the concern. I'll make sure the vbox modules are 
unloaded when not really needed (most of the time on that machine), in 
case lockup happens again.


Next time my daughter plays online games, I'll tell her she's actually 
serving a greater purpose, and let her take her time. :)

--
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Borislav Petkov
On Mon, Dec 10, 2012 at 01:47:23PM -0800, Linus Torvalds wrote:
> On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov  wrote:
> >
> > Aren't we gonna consider the out-of-tree vbox modules being loaded and
> > causing some corruptions like maybe the single-bit error above?
> >
> > I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317
> 
> Yup, that looks more likely, I agree.

@Zlatko: can your daughter try to retrigger the freeze without the vbox
modules loaded?

Thanks.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: kswapd craziness in 3.7

2012-12-10 Thread Linus Torvalds
On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov  wrote:
>
> Aren't we gonna consider the out-of-tree vbox modules being loaded and
> causing some corruptions like maybe the single-bit error above?
>
> I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317

Yup, that looks more likely, I agree.

Linus


Re: kswapd craziness in 3.7

2012-12-10 Thread Borislav Petkov
On Mon, Dec 10, 2012 at 01:28:54PM -0800, Linus Torvalds wrote:
> [ Adding Hugh Dickins because of the shmem oops. ]
> 
> On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic
>  wrote:
> >
> > And funny thing that you mention i915, because yesterday my daughter 
> > managed to lock up our laptop hard (that was a first), and this is what I 
> > found in kern.log after restart:
> >
> > Dec  9 21:29:42 titan vmunix: general protection fault:  [#1] PREEMPT 
> > SMP
> > Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) 
> > vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
> > Dec  9 21:29:42 titan vmunix: CPU 2
> > Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G   O 
> > 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
> > Dec  9 21:29:42 titan vmunix: RIP: 0010:[]  
> > [] find_get_page+0x3c/0x90
> 
> Ho humm..
> 
> I'm not convinced this is related.
> 
> > Dec  9 21:29:42 titan vmunix: Call Trace:
> > Dec  9 21:29:42 titan vmunix:  [] find_lock_page+0x21/0x80
> > Dec  9 21:29:42 titan vmunix:  [] 
> > shmem_getpage_gfp+0xa0/0x620
> > Dec  9 21:29:42 titan vmunix:  [] 
> > shmem_read_mapping_page_gfp+0x2c/0x50
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_get_pages_gtt+0xe1/0x270
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_get_pages+0x4f/0x90
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_bind_to_gtt+0xc3/0x4c0
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_object_pin+0x123/0x190
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
> > Dec  9 21:29:42 titan vmunix:  [] 
> > i915_gem_execbuffer2+0x94/0x280
> > Dec  9 21:29:42 titan vmunix:  [] drm_ioctl+0x493/0x530
> > Dec  9 21:29:42 titan vmunix:  [] do_vfs_ioctl+0x8f/0x530
> > Dec  9 21:29:42 titan vmunix:  [] sys_ioctl+0x4b/0x90
> > Dec  9 21:29:42 titan vmunix:  [] 
> > system_call_fastpath+0x16/0x1b
> >
> > It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, 
> > the i915 driver will need to be taken better care of.
> 
> That decodes to
> 
>   11: e8 89 b7 15 00   callq  0x15b79f  # radix_tree_lookup_slot
>   16: 48 85 c0         test   %rax,%rax
>   19: 48 89 c6         mov    %rax,%rsi
>   1c: 74 41            je     0x5f
>   1e: 48 8b 18         mov    (%rax),%rbx  #
>   21: 48 85 db         test   %rbx,%rbx
>   24: 74 1f            je     0x45
>   26: f6 c3 03         test   $0x3,%bl
>   29: 75 3c            jne    0x67
>   2b:* 8b 53 1c        mov    0x1c(%rbx),%edx <-- trapping instruction
>   2e: 85 d2            test   %edx,%edx
>   30: 74 d9            je     0xb
> 
> where %rbx is 0x0200. That looks like it could be a
> single-bit error, and should have been zero.
> 
> It's the "atomic_read(&page->counter)" which is part of
> "page_cache_get_speculative()" as far as I can tell, and it's the
> "page" pointer that is that odd (non-pointer) value. The fact that
> %ecx contains the value "-6" makes me wonder if there was a -ENXIO
> somewhere, though.
> 
> None of it looks all that much related to whether the i915 driver uses
> GFP_NO_KSWAPD or not, though.

Aren't we gonna consider the out-of-tree vbox modules being loaded and
causing some corruptions like maybe the single-bit error above?

I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317

Hmm.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: kswapd craziness in 3.7

2012-12-10 Thread Linus Torvalds
[ Adding Hugh Dickins because of the shmem oops. ]

On Mon, Dec 10, 2012 at 12:35 PM, Zlatko Calusic
 wrote:
>
> And funny thing that you mention i915, because yesterday my daughter managed 
> to lock up our laptop hard (that was a first), and this is what I found in 
> kern.log after restart:
>
> Dec  9 21:29:42 titan vmunix: general protection fault:  [#1] PREEMPT SMP
> Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) 
> vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
> Dec  9 21:29:42 titan vmunix: CPU 2
> Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G   O 
> 3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
> Dec  9 21:29:42 titan vmunix: RIP: 0010:[]  
> [] find_get_page+0x3c/0x90

Ho humm..

I'm not convinced this is related.

> Dec  9 21:29:42 titan vmunix: Call Trace:
> Dec  9 21:29:42 titan vmunix:  [] find_lock_page+0x21/0x80
> Dec  9 21:29:42 titan vmunix:  [] 
> shmem_getpage_gfp+0xa0/0x620
> Dec  9 21:29:42 titan vmunix:  [] 
> shmem_read_mapping_page_gfp+0x2c/0x50
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_object_get_pages_gtt+0xe1/0x270
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_object_get_pages+0x4f/0x90
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_object_bind_to_gtt+0xc3/0x4c0
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_object_pin+0x123/0x190
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
> Dec  9 21:29:42 titan vmunix:  [] 
> i915_gem_execbuffer2+0x94/0x280
> Dec  9 21:29:42 titan vmunix:  [] drm_ioctl+0x493/0x530
> Dec  9 21:29:42 titan vmunix:  [] do_vfs_ioctl+0x8f/0x530
> Dec  9 21:29:42 titan vmunix:  [] sys_ioctl+0x4b/0x90
> Dec  9 21:29:42 titan vmunix:  [] 
> system_call_fastpath+0x16/0x1b
>
> It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, 
> the i915 driver will need to be taken better care of.

That decodes to

  11: e8 89 b7 15 00   callq  0x15b79f  # radix_tree_lookup_slot
  16: 48 85 c0         test   %rax,%rax
  19: 48 89 c6         mov    %rax,%rsi
  1c: 74 41            je     0x5f
  1e: 48 8b 18         mov    (%rax),%rbx  #
  21: 48 85 db         test   %rbx,%rbx
  24: 74 1f            je     0x45
  26: f6 c3 03         test   $0x3,%bl
  29: 75 3c            jne    0x67
  2b:* 8b 53 1c        mov    0x1c(%rbx),%edx <-- trapping instruction
  2e: 85 d2            test   %edx,%edx
  30: 74 d9            je     0xb

where %rbx is 0x0200. That looks like it could be a
single-bit error, and should have been zero.

It's the "atomic_read(&page->counter)" which is part of
"page_cache_get_speculative()" as far as I can tell, and it's the
"page" pointer that is that odd (non-pointer) value. The fact that
%ecx contains the value "-6" makes me wonder if there was a -ENXIO
somewhere, though.

None of it looks all that much related to whether the i915 driver uses
GFP_NO_KSWAPD or not, though.
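
The two observations above (a pointer that looks like a single stray bit, and a register holding -6) can be checked mechanically. The following is only a minimal Python sketch: the helper names are made up for illustration, and the %rbx value is taken exactly as it appears in this archived copy (0x0200), where leading digits may have been dropped.

```python
import errno

def is_single_bit(value: int) -> bool:
    """True if exactly one bit is set, i.e. one bit flip away from zero."""
    return value != 0 and (value & (value - 1)) == 0

def as_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

# Value as printed in this archive; the original may have had more digits.
rbx = 0x0200
print("single-bit value:", is_single_bit(rbx))

# -6 stored in a 32-bit register such as %ecx is the bit pattern 0xfffffffa.
ecx = 0xFFFFFFFA
err = as_signed32(ecx)
print("ecx as signed:", err, "errno name:", errno.errorcode.get(-err))
```

A value with exactly one bit set is consistent with a flipped bit in what should have been a NULL slot, but the check alone cannot tell a memory error from a stray write by some other code.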

Linus


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic
On 10.12.2012 20:13, Linus Torvalds wrote:
> 
> It's worth giving this as much testing as is at all possible, but at
> the same time I really don't think I can delay 3.7 any more without
> messing up the holiday season too much. So unless something obvious
> pops up, I will do the release tonight. So testing will be minimal -
> but it's not like we haven't gone back-and-forth on this several times
> already, and we revert to *mostly* the same old state as 3.6 anyway,
> so it should be fairly safe.
> 

It compiles and boots without a hitch, so it must be perfect. :)

Seriously, a few more hours need to pass before I can provide more convincing 
data. That's how long it takes on this particular machine for memory pressure 
to build up and memory fragmentation to ensue. Only then will I be able to tell 
how it really behaves. I promise to get back as soon as I can.

And funny thing that you mention i915, because yesterday my daughter managed to 
lock up our laptop hard (that was a first), and this is what I found in 
kern.log after restart:

Dec  9 21:29:42 titan vmunix: general protection fault:  [#1] PREEMPT SMP 
Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) 
vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
Dec  9 21:29:42 titan vmunix: CPU 2 
Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G   O 
3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
Dec  9 21:29:42 titan vmunix: RIP: 0010:[]  
[] find_get_page+0x3c/0x90
Dec  9 21:29:42 titan vmunix: RSP: 0018:88014d9f7928  EFLAGS: 00010246
Dec  9 21:29:42 titan vmunix: RAX: 880052594bc8 RBX: 0200 RCX: 
fffa
Dec  9 21:29:42 titan vmunix: RDX: 0001 RSI: 880052594bc8 RDI: 

Dec  9 21:29:42 titan vmunix: RBP: 88014d9f7948 R08: 0200 R09: 
880052594b18
Dec  9 21:29:42 titan vmunix: R10: 57ffe4cbb74d1280 R11:  R12: 
88011c959a90
Dec  9 21:29:42 titan vmunix: R13: 0053 R14:  R15: 
0053
Dec  9 21:29:42 titan vmunix: FS:  7fcd8d413880() 
GS:880157c8() knlGS:
Dec  9 21:29:42 titan vmunix: CS:  0010 DS:  ES:  CR0: 80050033
Dec  9 21:29:42 titan vmunix: CR2: ff600400 CR3: 00014d937000 CR4: 
07e0
Dec  9 21:29:42 titan vmunix: DR0:  DR1:  DR2: 

Dec  9 21:29:42 titan vmunix: DR3:  DR6: 0ff0 DR7: 
0400
Dec  9 21:29:42 titan vmunix: Process Xorg (pid: 2523, threadinfo 
88014d9f6000, task 88014d9c1260)
Dec  9 21:29:42 titan vmunix: Stack:
Dec  9 21:29:42 titan vmunix:  88014d9f7958 88011c959a88 
0053 88011c959a88
Dec  9 21:29:42 titan vmunix:  88014d9f7978 81090e21 
0001 ea00014d1280
Dec  9 21:29:42 titan vmunix:  88011c959960 0001 
88014d9f7a28 810a1b60
Dec  9 21:29:42 titan vmunix: Call Trace:
Dec  9 21:29:42 titan vmunix:  [] find_lock_page+0x21/0x80
Dec  9 21:29:42 titan vmunix:  [] shmem_getpage_gfp+0xa0/0x620
Dec  9 21:29:42 titan vmunix:  [] 
shmem_read_mapping_page_gfp+0x2c/0x50
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_get_pages_gtt+0xe1/0x270
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_get_pages+0x4f/0x90
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_bind_to_gtt+0xc3/0x4c0
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_pin+0x123/0x190
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_execbuffer2+0x94/0x280
Dec  9 21:29:42 titan vmunix:  [] drm_ioctl+0x493/0x530
Dec  9 21:29:42 titan vmunix:  [] ? 
i915_gem_execbuffer+0x480/0x480
Dec  9 21:29:42 titan vmunix:  [] do_vfs_ioctl+0x8f/0x530
Dec  9 21:29:42 titan vmunix:  [] sys_ioctl+0x4b/0x90
Dec  9 21:29:42 titan vmunix:  [] ? sys_read+0x4d/0xa0
Dec  9 21:29:42 titan vmunix:  [] 
system_call_fastpath+0x16/0x1b
Dec  9 21:29:42 titan vmunix: Code: 63 08 48 83 ec 08 e8 84 9c fb ff 4c 89 ee 
4c 89 e7 e8 89 b7 15 00 48 85 c0 48 89 c6 74 41 48 8b 18 48 85 db 74 1f f6 c3 
03 75 3c <8b> 53 1c 85 d2 74 d9 8d 7a 01 89 d0 f0 0f b1 7b 1c 39 c2 75 23 
Dec  9 21:29:42 titan vmunix: RIP  [] find_get_page+0x3c/0x90
Dec  9 21:29:42 titan vmunix:  RSP 

It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, the 
i915 driver will need to be taken better care of.
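
For readers who want to locate the faulting instruction in a dump like the one above: the byte wrapped in angle brackets on the "Code:" line marks where the trap occurred. The following is a minimal Python sketch that finds its offset within the dump; the function name is made up for illustration, and the byte string is copied from the "Code:" line above.

```python
def trapping_offset(code_line: str) -> int:
    """Return the byte offset of the <xx>-marked byte in an oops code dump."""
    for offset, token in enumerate(code_line.split()):
        if token.startswith("<") and token.endswith(">"):
            return offset
    raise ValueError("no <..> marker found in code dump")

# Byte dump copied from the "Code:" line of the oops above.
code = (
    "63 08 48 83 ec 08 e8 84 9c fb ff 4c 89 ee 4c 89 e7 e8 89 b7 15 00 "
    "48 85 c0 48 89 c6 74 41 48 8b 18 48 85 db 74 1f f6 c3 03 75 3c "
    "<8b> 53 1c 85 d2 74 d9 8d 7a 01 89 d0 f0 0f b1 7b 1c 39 c2 75 23"
)

offset = trapping_offset(code)
print(f"trapping byte at offset {offset:#x} of the dump")
```

The offset is relative to the start of the dumped bytes, not to the start of find_get_page; it is the same numbering a disassembly of the dump would use.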
-- 
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Linus Torvalds
On Mon, Dec 10, 2012 at 10:33 AM, Zlatko Calusic
 wrote:
>
> I was about to apply the patch that you sent, and reboot the server, but it
> seems there's no point because the patch is flawed?
>
> Anyway, if and when you have a proper one, I'll be glad to test it for you
> and report results.

I have reverted (again) the __GFP_NO_KSWAPD removal, and considering
that it really looks like there are overwhelming reasons to have that
flag, I will *not* take some new patch to revert it. I'm getting
convinced that the original removal really was bogus, and had no
actual valid reason for it.

Part of that is that I noticed that non-THP allocations wanted to use
it too. The i915 driver had wanted to use __GFP_NO_KSWAPD because it
too didn't want to start some cleaning thread. The whole mindset that
kswapd is somehow better than direct reclaim, or needed when it fails,
is broken. Some allocations simply *will* fail, without necessarily
wanting kswapd to be started. THP - where the high order of the
allocation means that failure is inevitable under some fragmentation
circumstances - is just one such case.
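
To make the meaning of the flag concrete, here is a toy model in Python of the decision described above: a caller that sets the flag is saying that a failed allocation should not start background reclaim on its behalf. This is only a sketch; the bit values and function names are hypothetical and are not the kernel's.

```python
# Hypothetical bit values for illustration only; the kernel's differ.
GFP_WAIT = 0x1
GFP_NO_KSWAPD = 0x2

def should_wake_kswapd(gfp_mask: int) -> bool:
    """Wake background reclaim unless the caller explicitly opted out."""
    return not (gfp_mask & GFP_NO_KSWAPD)

def describe(name: str, gfp_mask: int) -> str:
    """Summarise what happens when the named allocation needs the slow path."""
    if should_wake_kswapd(gfp_mask):
        action = "wake kswapd"
    else:
        action = "leave kswapd asleep"
    return f"{name}: {action}"

print(describe("ordinary allocation", GFP_WAIT))
print(describe("opportunistic high-order allocation", GFP_WAIT | GFP_NO_KSWAPD))
```

The point of the model is that the opt-out belongs to the caller: an allocation that is expected to fail sometimes, and has a fallback, can say so without changing how everyone else's failures are handled.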

I also reverted one of the "fix up the mess from removing
__GFP_NO_KSWAPD" patches, because that one was an obvious workaround
that tried to re-introduce the "let's not wake up kswapd after all for
that case". It clashed with a clean revert, and it was pointless in
the presence of __GFP_NO_KSWAPD anyway.

I did *not* revert some of the other fixup patches that tried to help
kswapd balancing decisions and avoid excessive CPU use other ways. So
some remains of this whole saga do still remain, but they look fairly
minimal.

It's worth giving this as much testing as is at all possible, but at
the same time I really don't think I can delay 3.7 any more without
messing up the holiday season too much. So unless something obvious
pops up, I will do the release tonight. So testing will be minimal -
but it's not like we haven't gone back-and-forth on this several times
already, and we revert to *mostly* the same old state as 3.6 anyway,
so it should be fairly safe.

   Linus


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic

On 10.12.2012 19:01, Mel Gorman wrote:

In this last-minute disaster, I'm not thinking properly at all any more. The
shrink slab disabling should have happened before the loop_again but even
then it's wrong because it's just covering over the problem.

The way order and testorder interact with how balanced is calculated means
that we potentially call shrink_slab() multiple times and that thing is
global in nature and basically uncontrolled. You could argue that we should
only call shrink_slab() if order-0 watermarks are not met but that will
not necessarily prevent kswapd reclaiming too much. It keeps going back
to balance_pgdat needing its list of requirements drawn up and to receive
some major surgery, and we're not going to do that as a quick hack.



I was about to apply the patch that you sent, and reboot the server, but 
it seems there's no point because the patch is flawed?


Anyway, if and when you have a proper one, I'll be glad to test it for 
you and report results.

--
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic

On 10.12.2012 12:03, Mel Gorman wrote:

There is a big difference between a direct reclaim/compaction for THP
and kswapd doing the same work. Direct reclaim/compaction will try once,
give up quickly and defer requests in the near future to avoid impacting
the system heavily for THP. The same applies for khugepaged.

kswapd is different. It can keep going until its watermarks for
a THP allocation are met. Two reasons why it might keep going for a long
time are that compaction is being inefficient which we know it may be due
to crap like this

end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);

and the second reason is if the highest zone is relatively small because
compaction_suitable will keep saying that allocations are failing due to
insufficient amounts of memory in the highest zone. It'll reclaim a little
from this highest zone and then shrink_slab() potentially dumping a large
amount of memory. This may be the case for Zlatko as with a 4G machine
his ZONE_NORMAL could be small depending on how the 32-bit address space
is used by his hardware.
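
The quoted message does not spell out what is wrong with the ALIGN() line. One plausible reading, which is an assumption here and not something stated in the thread, is that for a low_pfn that is not pageblock-aligned the result lands a full pageblock beyond the end of the block that contains low_pfn. The Python sketch below only shows the arithmetic; the pageblock size of 512 pages is likewise assumed, and the function names are made up for illustration.

```python
PAGEBLOCK_NR_PAGES = 512  # assumed: 2MB pageblocks made of 4KB pages

def align_up(x: int, a: int) -> int:
    """Round x up to the next multiple of a (a must be a power of two)."""
    return (x + a - 1) & ~(a - 1)

def quoted_end_pfn(low_pfn: int) -> int:
    """The expression quoted above: ALIGN(low_pfn + pageblock_nr_pages, ...)."""
    return align_up(low_pfn + PAGEBLOCK_NR_PAGES, PAGEBLOCK_NR_PAGES)

def block_end_pfn(low_pfn: int) -> int:
    """End of the pageblock containing low_pfn, shown for comparison only."""
    return align_up(low_pfn + 1, PAGEBLOCK_NR_PAGES)

for low_pfn in (0, 512, 100, 700):
    print(low_pfn, quoted_end_pfn(low_pfn), block_end_pfn(low_pfn))
```

For aligned inputs the two expressions agree; they differ only when low_pfn sits in the middle of a pageblock.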



The kernel is 64-bit, if it makes any difference (userspace, though, is 
still 32-bit). There's no swap (swap support not even compiled in). The 
zones are as follows:


On node 0 totalpages: 1048019
  DMA zone: 64 pages used for memmap
  DMA zone: 6 pages reserved
  DMA zone: 3913 pages, LIFO batch:0
  DMA32 zone: 16320 pages used for memmap
  DMA32 zone: 831109 pages, LIFO batch:31
  Normal zone: 3072 pages used for memmap
  Normal zone: 193535 pages, LIFO batch:31

If I understand correctly, you think that because 193535 pages in 
ZONE_NORMAL is relatively small compared to 831109 pages of ZONE_DMA32 
the system has a hard time balancing itself?
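
To put the zone listing above into proportion, here is a small Python sketch that converts the page counts to sizes and shares of the node. It assumes 4KB pages and uses the sum of the three listed zone sizes as the node total; the 25% threshold is the figure cited elsewhere in this thread for kswapd's balancing requirement, not something derived here.

```python
PAGE_SIZE = 4096  # assumed: 4KB pages

# Page counts copied from the zone listing above.
zones = {"DMA": 3913, "DMA32": 831109, "Normal": 193535}

total = sum(zones.values())
for name, pages in zones.items():
    mib = pages * PAGE_SIZE / 2**20
    share = pages / total
    print(f"{name:7s} {pages:7d} pages  {mib:8.1f} MiB  {share:6.1%}")

# Compare the Normal zone against the 25%-of-node figure from the thread.
normal_share = zones["Normal"] / total
print("Normal alone reaches 25% of the node:", normal_share >= 0.25)
```

Under these assumptions the Normal zone is well under a quarter of the node, so it could not satisfy a 25% requirement by itself.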


Is there any way I could force and test a different memory layout? I'm 
slightly lost among all the memory models (if I have a choice at all), so 
if you have any suggestions, I'm all ears.


Maybe I could limit available memory and thus have only the DMA32 zone, 
just to prove your theory? I remember doing tuning like that many years 
ago when I had more time to play with Linux MM. Unfortunately I haven't 
had much time lately, so I'm a bit rusty, but I'm willing to help with 
testing and resolving this issue.


--
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Mel Gorman
On Mon, Dec 10, 2012 at 11:39:04AM -0500, Johannes Weiner wrote:
> On Mon, Dec 10, 2012 at 11:03:37AM +, Mel Gorman wrote:
> > On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> > > On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > > > Or sooner... in short: nothing's changed!
> > > > 
> > > > On a 4GB RAM system, where applications use close to 2GB, kswapd
> > > > likes to keep around 1GB free (unused), leaving only 1GB for
> > > > page/buffer cache. If I force bigger page cache by reading a big
> > > > file and thus use the unused 1GB of RAM, kswapd will soon (in a
> > > > matter of minutes) evict those (or other) pages out and once again
> > > > keep unused memory close to 1GB.
> > > 
> > > Ok, guys, what was the reclaim or kswapd patch during the merge window 
> > > that actually caused all of these insane problems?
> > 
> > I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
> > candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
> > was excessively reclaiming. kswapd would stay awake aggressively reclaiming
> > even if compaction was deferred. The flag was removed in this cycle when it
> > was expected that it was no longer necessary. I'm not foisting the blame
> > on Rik here, I was on the review list for that patch and did not identify
> > that it would cause this many problems either.
> >
> > > It seems it was more 
> > > fundamentally buggered than the fifteen-million fixes for kswapd we have 
> > > already picked up.
> > 
> > It was already fundamentally buggered up. The difference was it stayed
> > asleep for THP requests in earlier kernels.
> > 
> > There is a big difference between a direct reclaim/compaction for THP
> > and kswapd doing the same work. Direct reclaim/compaction will try once,
> > give up quickly and defer requests in the near future to avoid impacting
> > the system heavily for THP. The same applies for khugepaged.
> > 
> > kswapd is different. It can keep going until its watermarks for
> > a THP allocation are met. Two reasons why it might keep going for a long
> > time are that compaction is being inefficient which we know it may be due
> > to crap like this
> > 
> > end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > 
> > and the second reason is if the highest zone is relatively small because
> > compaction_suitable will keep saying that allocations are failing due to
> > insufficient amounts of memory in the highest zone. It'll reclaim a little
> > from this highest zone and then shrink_slab() potentially dumping a large
> > amount of memory. This may be the case for Zlatko as with a 4G machine
> > his ZONE_NORMAL could be small depending on how the 32-bit address space
> > is used by his hardware.
> 
> Unlike direct reclaim, kswapd also never does sync migration.  Since
> the fragmentation index is a ratio of free pages over free page
> blocks, doing lightweight compaction that reduces the page blocks but
> never really follows through to compact a THP block increases the free
> memory requirement.
> 

True.

> I thought about the small Normal zone too.  Direct reclaim/compaction
> is fine with one zone being able to provide a THP, but kswapd requires
> 25% of the node.  A small ZONE_NORMAL would not be able to meet this
> and so the bigger DMA32 zone would also be required to be balanced for
> the THP allocation.
> 

Also true.

> > > Mel? Ideas?
> > 
> > Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
> > ironed out at a more reasonable pace. Rik? Johannes?
> 
> Yes, I also think we need more time for this.
> 

Yes, the last minute band-aids are just getting worse and the result is
more mess.

> 
> 
> I don't see a shrink_slab() invocation after this point since the
> loop_again jumps in this loop were removed, so this shouldn't change
> anything?

/me slaps self

In this last-minute disaster, I'm not thinking properly at all any more. The
shrink slab disabling should have happened before the loop_again but even
then it's wrong because it's just covering over the problem.

The way order and testorder interact with how balanced is calculated means
that we potentially call shrink_slab() multiple times and that thing is
global in nature and basically uncontrolled. You could argue that we should
only call shrink_slab() if order-0 watermarks are not met but that will
not necessarily prevent kswapd reclaiming too much. It keeps going back
to balance_pgdat needing its list of requirements drawn up and to receive
some major surgery, and we're not going to do that as a quick hack.

-- 
Mel Gorman
SUSE Labs


Re: kswapd craziness in 3.7

2012-12-10 Thread Johannes Weiner
On Mon, Dec 10, 2012 at 11:03:37AM +, Mel Gorman wrote:
> On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> > On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > > Or sooner... in short: nothing's changed!
> > > 
> > > > On a 4GB RAM system, where applications use close to 2GB, kswapd likes to
> > > > keep around 1GB free (unused), leaving only 1GB for page/buffer cache. If I
> > > > force bigger page cache by reading a big file and thus use the unused 1GB of
> > > > RAM, kswapd will soon (in a matter of minutes) evict those (or other) pages
> > > > out and once again keep unused memory close to 1GB.
> > 
> > Ok, guys, what was the reclaim or kswapd patch during the merge window 
> > that actually caused all of these insane problems?
> 
> I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
> candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
> was excessively reclaiming. kswapd would stay awake aggressively reclaiming
> even if compaction was deferred. The flag was removed in this cycle when it
> was expected that it was no longer necessary. I'm not foisting the blame
> on Rik here, I was on the review list for that patch and did not identify
> that it would cause this many problems either.
>
> > It seems it was more 
> > fundamentally buggered than the fifteen-million fixes for kswapd we have 
> > already picked up.
> 
> It was already fundamentally buggered up. The difference was it stayed
> asleep for THP requests in earlier kernels.
> 
> There is a big difference between a direct reclaim/compaction for THP
> and kswapd doing the same work. Direct reclaim/compaction will try once,
> give up quickly and defer requests in the near future to avoid impacting
> the system heavily for THP. The same applies for khugepaged.
> 
> kswapd is different. It can keep going until its watermarks for a THP
> allocation are met. Two reasons why it might keep going for a long
> time are that compaction is being inefficient which we know it may be due
> to crap like this
> 
> end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> 
> and the second reason is if the highest zone is relatively small because
> compaction_suitable will keep saying that allocations are failing due to
> insufficient amounts of memory in the highest zone. It'll reclaim a little
> from this highest zone and then call shrink_slab(), potentially dumping a
> large amount of memory. This may be the case for Zlatko as with a 4G machine
> his ZONE_NORMAL could be small depending on how the 32-bit address space
> is used by his hardware.

Unlike direct reclaim, kswapd also never does sync migration.  Since
the fragmentation index is a ratio of free pages over free page
blocks, doing lightweight compaction that reduces the page blocks but
never really follows through to compact a THP block increases the free
memory requirement.
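
A rough user-space sketch of that index, for illustration only. It is
loosely modeled on __fragmentation_index() in mm/vmstat.c; the struct and
function names (frag_info, frag_index) are invented here and the arithmetic
is simplified, so treat it as an assumption rather than the kernel's exact
code. Values are scaled by 1000: lower results blame a failing request on
lack of free memory, results toward 1000 blame it on fragmentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of a zone's free lists (illustrative only). */
struct frag_info {
	uint64_t free_pages;           /* total free base pages */
	uint64_t free_blocks_total;    /* free blocks of any order */
	uint64_t free_blocks_suitable; /* free blocks >= requested order */
};

/*
 * Fragmentation index scaled by 1000.
 *   -1000         : a suitable block exists, the request would succeed
 *   lower values  : failure attributed to lack of free memory
 *   toward 1000   : failure attributed to fragmentation
 */
int frag_index(unsigned int order, const struct frag_info *info)
{
	uint64_t requested;

	assert(order < 64);
	requested = 1ULL << order;

	if (info->free_blocks_total == 0)
		return 0;
	if (info->free_blocks_suitable != 0)
		return -1000;

	return 1000 - (int)((1000 + info->free_pages * 1000 / requested) /
			    info->free_blocks_total);
}
```

With 1024 free pages and an order-9 request, 1024 scattered order-0 blocks
give 998, while the same pages merged into four order-8 blocks give 250.
Fewer, larger blocks that still cannot satisfy the request push the index
toward "not enough memory", which is the effect described above.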

I thought about the small Normal zone too.  Direct reclaim/compaction
is fine with one zone being able to provide a THP, but kswapd requires
25% of the node.  A small ZONE_NORMAL would not be able to meet this
and so the bigger DMA32 zone would also be required to be balanced for
the THP allocation.
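
The 25% rule can be sketched as follows. This is a simplified user-space
model of the node-level check (the thread names pgdat_balanced() for it);
struct zone_model, node_balanced() and the zone sizes in the note below are
hypothetical and only there to show the arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down zone: only what the check needs. */
struct zone_model {
	unsigned long present_pages;
	bool balanced;	/* meets its watermark for the requested order */
};

/* True if the balanced zones cover at least 25% of the node's pages. */
bool node_balanced(const struct zone_model *zones, size_t nr_zones)
{
	unsigned long present = 0, balanced = 0;
	size_t i;

	assert(zones != NULL);
	for (i = 0; i < nr_zones; i++) {
		present += zones[i].present_pages;
		if (zones[i].balanced)
			balanced += zones[i].present_pages;
	}
	return balanced >= (present >> 2);
}
```

For a hypothetical node with a 917504-page DMA32 zone and a 131072-page
Normal zone, a balanced Normal zone alone covers 12.5% of the node and is
not enough, whereas a balanced DMA32 zone alone is.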

> > Mel? Ideas?
> 
> Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
> ironed out at a more reasonable pace. Rik? Johannes?

Yes, I also think we need more time for this.

> Verify if the shrinking slab is the issue with this brutally ugly
> hack. Zlatko?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b7ed376..2189d20 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  	unsigned long balanced;
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> +	bool should_shrink_slab = true;
>  	unsigned long total_scanned;
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_soft_reclaimed;
> @@ -2695,7 +2696,8 @@ loop_again:
>  				shrink_zone(zone, &sc);
>  
>  				reclaim_state->reclaimed_slab = 0;
> -				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> +				if (should_shrink_slab)
> +					nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
>  				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  				total_scanned += sc.nr_scanned;
>  
> @@ -2817,6 +2819,16 @@ out:
>  	if (order) {
>  		int zones_need_compaction = 1;
>  
> +		/*
> +		 * Shrinking slab for high-order allocs can cause an excessive
> +		 * amount of memory to be dumped. Only shrink slab once per
> +		 * round for high-order allocs.
> +		 *
> +		 * This is a very stupid hack. balance_pgdat() is in serious
> +  

Re: kswapd craziness in 3.7

2012-12-10 Thread Mel Gorman
On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> > 
> > Or sooner... in short: nothing's changed!
> > 
> > > On a 4GB RAM system, where applications use close to 2GB, kswapd likes to
> > > keep around 1GB free (unused), leaving only 1GB for page/buffer cache. If I
> > > force bigger page cache by reading a big file and thus use the unused 1GB of
> > > RAM, kswapd will soon (in a matter of minutes) evict those (or other) pages
> > > out and once again keep unused memory close to 1GB.
> 
> Ok, guys, what was the reclaim or kswapd patch during the merge window 
> that actually caused all of these insane problems?

I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
was excessively reclaiming. kswapd would stay awake aggressively reclaiming
even if compaction was deferred. The flag was removed in this cycle when it
was expected that it was no longer necessary. I'm not foisting the blame
on Rik here, I was on the review list for that patch and did not identify
that it would cause this many problems either.

> It seems it was more 
> fundamentally buggered than the fifteen-million fixes for kswapd we have 
> already picked up.
> 

It was already fundamentally buggered up. The difference was it stayed
asleep for THP requests in earlier kernels.

There is a big difference between a direct reclaim/compaction for THP
and kswapd doing the same work. Direct reclaim/compaction will try once,
give up quickly and defer requests in the near future to avoid impacting
the system heavily for THP. The same applies for khugepaged.

kswapd is different. It can keep going until its watermarks for a THP
allocation are met. Two reasons why it might keep going for a long
time are that compaction is being inefficient which we know it may be due
to crap like this

end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);

and the second reason is if the highest zone is relatively small because
compaction_suitable will keep saying that allocations are failing due to
insufficient amounts of memory in the highest zone. It'll reclaim a little
from this highest zone and then call shrink_slab(), potentially dumping a
large amount of memory. This may be the case for Zlatko as with a 4G machine
his ZONE_NORMAL could be small depending on how the 32-bit address space
is used by his hardware.
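
For readers unfamiliar with ALIGN(), here is what the end_pfn expression
quoted above evaluates to. The message does not spell out what is wrong
with it, so this sketch only shows the arithmetic. align_up() is a
user-space stand-in for the kernel macro, scan_end_pfn() is a name made up
for the quoted expression, and the 512-page pageblock size used in the note
below is an assumption.

```c
#include <assert.h>

/* Round x up to the next multiple of a; a must be a power of two. */
unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* The quoted expression, with the pageblock size passed in. */
unsigned long scan_end_pfn(unsigned long low_pfn,
			   unsigned long pageblock_nr_pages)
{
	return align_up(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
}
```

Starting from an aligned pfn such as 512 the result is 1024, exactly one
512-page block further. Starting from 513 it is 1536, so the range covers
1023 pages and spans parts of two pageblocks.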

> (Ok, I may be exaggerating the number of patches, but it's starting to 
> feel that way - I thought that 3.7 was going to be a calm and easy 
> release, but the kswapd issues seem to just keep happening. We've been 
> fighting the kswapd changes for a while now.)
> 

Yes.

> Trying to keep a gigabyte free (presumably because that way we have lots
> of high-order allocation pages) is ridiculous. Is it one of the compaction
> changes?
> 

Not directly. Compaction has been a bigger factor after 3.5 due to the
removal of lumpy reclaim but it's not directly responsible for excessive
amounts of memory being kept free. The closest patch I'm aware of that
would cause problems of that nature would be commit 83fde0f2 (mm: vmscan:
scale number of pages reclaimed by reclaim/compaction based on failures)
and it has already been reverted by 96710098.

> Mel? Ideas?
> 

Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
ironed out at a more reasonable pace. Rik? Johannes?

Verify if the shrinking slab is the issue with this brutally ugly
hack. Zlatko?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..2189d20 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	unsigned long balanced;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+	bool should_shrink_slab = true;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2695,7 +2696,8 @@ loop_again:
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
-				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				if (should_shrink_slab)
+					nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;
 
@@ -2817,6 +2819,16 @@ out:
 	if (order) {
 		int zones_need_compaction = 1;
 
+		/*
+		 * Shrinking slab for high-order allocs can cause an excessive
+		 * amount of memory to be dumped. Only shrink slab once per
+		 * round for high-order allocs.
+		 *
+		 * This is a very stupid hack. balance_pgdat() is in ser

Re: kswapd craziness in 3.7

2012-12-09 Thread Zdenek Kabelac

On 9.12.2012 02:01, Linus Torvalds wrote:



On Sat, 8 Dec 2012, Zlatko Calusic wrote:


Or sooner... in short: nothing's changed!

On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
bigger page cache by reading a big file and thus use the unused 1GB of RAM,
kswapd will soon (in a matter of minutes) evict those (or other) pages out and
once again keep unused memory close to 1GB.


Ok, guys, what was the reclaim or kswapd patch during the merge window
that actually caused all of these insane problems? It seems it was more
fundamentally buggered than the fifteen-million fixes for kswapd we have
already picked up.

(Ok, I may be exaggerating the number of patches, but it's starting to
feel that way - I thought that 3.7 was going to be a calm and easy
release, but the kswapd issues seem to just keep happening. We've been
fighting the kswapd changes for a while now.)

Trying to keep a gigabyte free (presumably because that way we have lots
of high-order allocation pages) is ridiculous. Is it one of the compaction
changes?

Mel? Ideas?



Very true

It's as simple as running

dd if=/dev/zero of=/tmp/zero bs=1M count=0 seek=100

and then

dd if=/tmp/zero of=/dev/null bs=1M

and kswapd fights with dd for CPU time.


Zdenek




Re: kswapd craziness in 3.7

2012-12-08 Thread Linus Torvalds


On Sat, 8 Dec 2012, Zlatko Calusic wrote:
> 
> Or sooner... in short: nothing's changed!
> 
> On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
> around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
> bigger page cache by reading a big file and thus use the unused 1GB of RAM,
> kswapd will soon (in a matter of minutes) evict those (or other) pages out and
> once again keep unused memory close to 1GB.

Ok, guys, what was the reclaim or kswapd patch during the merge window 
that actually caused all of these insane problems? It seems it was more 
fundamentally buggered than the fifteen-million fixes for kswapd we have 
already picked up.

(Ok, I may be exaggerating the number of patches, but it's starting to 
feel that way - I thought that 3.7 was going to be a calm and easy 
release, but the kswapd issues seem to just keep happening. We've been 
fighting the kswapd changes for a while now.)

Trying to keep a gigabyte free (presumably because that way we have lots
of high-order allocation pages) is ridiculous. Is it one of the compaction
changes?

Mel? Ideas?

Linus


Re: kswapd craziness in 3.7

2012-12-08 Thread Zlatko Calusic

On 08.12.2012 13:06, Zlatko Calusic wrote:

On 06.12.2012 20:31, Linus Torvalds wrote:

Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.



I've been testing this patch since it was applied, and it certainly
fixes the kswapd craziness issue, good work Johannes!

But, it's still not perfect yet, because I see that the system keeps
lots of memory unused (free), where it previously used it all for the
page cache (there's enough fs activity to warrant it).

I'm now testing the last piece of Johannes' changes (still not in git
tree), and can report results in 24-48 hours.

Regards,


Or sooner... in short: nothing's changed!

On a 4GB RAM system, where applications use close to 2GB, kswapd likes 
to keep around 1GB free (unused), leaving only 1GB for page/buffer 
cache. If I force bigger page cache by reading a big file and thus use 
the unused 1GB of RAM, kswapd will soon (in a matter of minutes) evict 
those (or other) pages out and once again keep unused memory close to 1GB.


I guess it's not a showstopper, but it still counts as very bad memory
management, wasting lots of RAM.


As an additional data point, if memory pressure is slightly higher (say 
backup kicks in, keeping page cache mostly full) kswapd gets in D 
(uninterruptible sleep) state (function: congestion_wait) and load 
average goes up by 1. It recovers only when it successfully throws out 
half of page cache again.


Hope it helps.
--
Zlatko


Re: kswapd craziness in 3.7

2012-12-08 Thread Zlatko Calusic

On 06.12.2012 20:31, Linus Torvalds wrote:

Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.



I've been testing this patch since it was applied, and it certainly 
fixes the kswapd craziness issue, good work Johannes!


But, it's still not perfect yet, because I see that the system keeps 
lots of memory unused (free), where it previously used it all for the 
page cache (there's enough fs activity to warrant it).


I'm now testing the last piece of Johannes' changes (still not in git 
tree), and can report results in 24-48 hours.


Regards,
--
Zlatko


Re: kswapd craziness in 3.7

2012-12-08 Thread Jiri Slaby
On 12/04/2012 05:11 PM, Johannes Weiner wrote:
> On Tue, Dec 04, 2012 at 10:15:09AM +0100, Jiri Slaby wrote:
>> It does not apply to -next :/. Should I try anything else?
> 
> The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> is a -next patch.  I hope you don't run into other problems that come
> out of -next craziness, because Linus is kinda waiting for this to be
> resolved to release 3.7.  If you've always tested against -next so far
> and it worked otherwise, don't change the environment now, please.  If
> you just started, it would make more sense to test based on 3.7-rc8.
> 
> Thanks!
> 
> ---
> From: Johannes Weiner 
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>  to individual uncompactable zones
> 
> When a zone meets its high watermark and is compactable in case of
> higher order allocations, it contributes to the percentage of the
> node's memory that is considered balanced.
> 
> This requirement, that a node be only partially balanced, came about
> when kswapd was desperately trying to balance tiny zones when all
> bigger zones in the node had plenty of free memory.  Arguably, the
> same should apply to compaction: if a significant part of the node is
> balanced enough to run compaction, do not get hung up on that tiny
> zone that might never get in shape.
> 
> When the compaction logic in kswapd is reached, we know that at least
> 25% of the node's memory is balanced properly for compaction (see
> zone_balanced and pgdat_balanced).  Remove the individual zone checks
> that restart the kswapd cycle.
> 
> Otherwise, we may observe more endless looping in kswapd where the
> compaction code loops back to reclaim because of a single zone and
> reclaim does nothing because the node is considered balanced overall.
> 
> Reported-by: Thorsten Leemhuis 
> Signed-off-by: Johannes Weiner 

Looks like it's gone with this patch now. Hopefully the send button
won't trigger the issue the same as the last time :).

> ---
>  mm/vmscan.c | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3b0aef4..486100f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			if (!populated_zone(zone))
>  				continue;
>  
> -			if (zone->all_unreclaimable &&
> -			    sc.priority != DEF_PRIORITY)
> -				continue;
> -
> -			/* Would compaction fail due to lack of free memory? */
> -			if (IS_ENABLED(CONFIG_COMPACTION) &&
> -			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
> -				goto loop_again;
> -
> -			/* Confirm the zone is balanced for order-0 */
> -			if (!zone_watermark_ok(zone, 0,
> -					high_wmark_pages(zone), 0, 0)) {
> -				order = sc.order = 0;
> -				goto loop_again;
> -			}
> -
>  			/* Check if the memory needs to be defragmented. */
>  			if (zone_watermark_ok(zone, order,
>  				    low_wmark_pages(zone), *classzone_idx, 0))
> 


-- 
js
suse labs


Re: kswapd craziness in 3.7

2012-12-06 Thread Rik van Riel

On 12/06/2012 03:23 PM, Johannes Weiner wrote:


From: Johannes Weiner 
Subject: [patch] mm: vmscan: fix inappropriate zone congestion clearing

c702418 ("mm: vmscan: do not keep kswapd looping forever due to
individual uncompactable zones") removed zone watermark checks from
the compaction code in kswapd but left in the zone congestion
clearing, which now happens unconditionally on higher order reclaim.

This messes up the reclaim throttling logic for zones with
dirty/writeback pages, where zones should only lose their congestion
status when their watermarks have been restored.

Remove the clearing from the zone compaction section entirely.  The
preliminary zone check and the reclaim loop in kswapd will clear it if
the zone is considered balanced.

Signed-off-by: Johannes Weiner 


Reviewed-by: Rik van Riel 


--
All rights reversed


Re: kswapd craziness in 3.7

2012-12-06 Thread Johannes Weiner
On Thu, Dec 06, 2012 at 11:31:21AM -0800, Linus Torvalds wrote:
> Ok, people seem to be reporting success.
> 
> I've applied Johannes' last patch with the new tested-by tags.
> 
> Johannes (or anybody else, for that matter), please holler LOUDLY if
> you disagree... (or if I used the wrong version of the patch, there have
> been several, afaik).

I just went back one more time and of course that's when I spot that I
forgot to remove the zone congestion clearing that depended on the now
removed checks to ensure the zone is balanced.  It's not too big of a
deal, just the /risk/ of increased CPU use from reclaim because we go
back to scanning zones that we previously deemed congested and slept a
little bit before continuing reclaim.

Sorry, I should have seen that earlier.

Removing it is a low risk fix, the clearing was kinda redundant anyway
(the preliminary zone check clears it for OK zones, so does the
reclaim loop under the same criteria), letting it stay is probably
more problematic for 3.7 than just dropping it...
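
The rule the fix restores can be modeled like this. It is a toy user-space
sketch: struct czone and its fields are invented for illustration, and a
plain free-pages comparison stands in for the kernel's watermark check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for a zone; not the kernel's struct zone. */
struct czone {
	unsigned long free_pages;
	unsigned long high_wmark;
	bool congested;
};

/* Stand-in for the watermark check: is the zone balanced? */
bool czone_balanced(const struct czone *z)
{
	assert(z != NULL);
	return z->free_pages >= z->high_wmark;
}

/* Clear the congested flag only once the watermark is restored. */
void czone_update_congestion(struct czone *z)
{
	if (czone_balanced(z))
		z->congested = false;
}
```

A zone still below its watermark keeps its congested flag, so reclaim
continues to throttle on it. Clearing the flag unconditionally would skip
that throttling.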

---
From: Johannes Weiner 
Subject: [patch] mm: vmscan: fix inappropriate zone congestion clearing

c702418 ("mm: vmscan: do not keep kswapd looping forever due to
individual uncompactable zones") removed zone watermark checks from
the compaction code in kswapd but left in the zone congestion
clearing, which now happens unconditionally on higher order reclaim.

This messes up the reclaim throttling logic for zones with
dirty/writeback pages, where zones should only lose their congestion
status when their watermarks have been restored.

Remove the clearing from the zone compaction section entirely.  The
preliminary zone check and the reclaim loop in kswapd will clear it if
the zone is considered balanced.

Signed-off-by: Johannes Weiner 
---
 mm/vmscan.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 124bbfe..b7ed376 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2827,9 +2827,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
 				zones_need_compaction = 0;
-
-			/* If balanced, clear the congested flag */
-			zone_clear_flag(zone, ZONE_CONGESTED);
 		}
 
 		if (zones_need_compaction)
-- 
1.7.11.7



Re: kswapd craziness in 3.7

2012-12-06 Thread Rik van Riel

On 12/06/2012 02:31 PM, Linus Torvalds wrote:

Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.

Johannes (or anybody else, for that matter), please holler LOUDLY if
you disagree... (or if I used the wrong version of the patch, there have
been several, afaik).


Johannes's patch is a fairly big hammer, with kswapd not looping
back to the start when zones are still unbalanced.

However, the next allocation will wake up kswapd again, and
having kswapd stop early beats having it in an infinite loop.

I believe Johannes's patch will be fine for 3.7.

--
All rights reversed


Re: kswapd craziness in 3.7

2012-12-06 Thread Linus Torvalds
Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.

Johannes (or anybody else, for that matter), please holler LOUDLY if
you disagree... (or if I used the wrong version of the patch, there have
been several, afaik).

 Linus


Re: kswapd craziness in 3.7

2012-12-06 Thread Bruno Wolff III

On Tue, Dec 04, 2012 at 21:01:33 -0600,
  Bruno Wolff III  wrote:

On Tue, Dec 04, 2012 at 16:42:10 -0500,
 Johannes Weiner  wrote:

kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
and
kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64
for over 24hours with no evidence of problems with kswapd"

Now waiting for results from Jiri, Zdenek and Bruno...


I have been running 
3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE a bit over 23 
hours and kswapd has accumulated one minute 8 seconds of CPU time. I 
did several yum operations during that time and didn't see kswapd 
spike to 90+% CPU usage as I had seen in the past. With some kernels 
I wasn't reliably triggering the kswapd issue, so it may not be long 
enough to know for sure that the problem is fixed.


I am now at a bit over 2 and 1/2 days with kswapd having used 1 minute 
53 seconds of CPU time.



Re: kswapd craziness in 3.7

2012-12-06 Thread Zdenek Kabelac

On 4.12.2012 10:05, Zdenek Kabelac wrote:

On 3.12.2012 20:18, Johannes Weiner wrote:

Szia Zdenek,

On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:

Ok, bad news - I've been hit by the kswapd0 loop again -
my kernel (git commit cc19528bd3084c3c2d870b31a3578da8c69952f3) again
showed kswapd0 on the CPU for a couple of minutes.

It seemed to go away instantly when I dropped caches
(echo 3 >/proc/sys/vm/drop_caches)
(After that I had over 1G of free memory)


Any chance you could retry with this patch on top?

---
From: Johannes Weiner 
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
  to individual uncompactable zones

---
  mm/vmscan.c | 16 ----------------
  1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c



Ok, I'm now running b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
with your patch.  I'll be able to give some feedback after a couple of
days (if I keep my machine running without reboot - since before


So to just give some positive info -

with  2 1/2 day uptime, several suspend/resumes, ff at 1.4GB
I still have just 29 seconds for kswapd0 process.

So the patch above might have helped - but I'll look for a few more days.

Zdenek


Re: kswapd craziness in 3.7

2012-12-06 Thread Thorsten Leemhuis
Hi!

Just a quick update

Johannes Weiner wrote on 03.12.2012 20:42:
> On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
>
>> BTW, I built that kernel without the patch you mentioned in
>> http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
>> ("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
>> [...]) It looked to me like that patch was only meant for debugging. Let
>> me know if that was wrong. Ohh, and I didn't update to a fresher
>> mainline checkout yet to make sure the base for John's testing didn't
>> change.
> 
> Ah, yes, the ApplyPatch is commented out.
> 
> I think we want that upstream as well, but it's not critical.
> [...]

Sorry, it had no "Signed-off-by", so I assumed it was just for debugging.

> Not rebasing sounds reasonable to me to verify the patch.  It might be
> worth testing that the final version that will be 3.7 still works for
> John, however, once that is done.  Just to be sure.

Just to be sure, yesterday I built an rc8 kernel with the patch
referenced above and the one that is not yet merged (these two, to be
precise: http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91300
; all the other patches my kswap test kernels contained earlier were
afaics merged a few days ago) and mentioned it in the Fedora bug report.
John gave them a try and in
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c65 reported "No
problems so far.  I'll check back again in ~24hours."

CU, Thorsten


Re: kswapd craziness in 3.7

2012-12-04 Thread Bruno Wolff III

On Tue, Dec 04, 2012 at 16:42:10 -0500,
  Johannes Weiner  wrote:

 kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
and
 kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64
for over 24hours with no evidence of problems with kswapd"

Now waiting for results from Jiri, Zdenek and Bruno...


I have been running 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE 
a bit over 23 hours and kswapd has accumulated one minute 8 seconds of 
CPU time. I did several yum operations during that time and didn't see 
kswapd spike to 90+% CPU usage as I had seen in the past. With some kernels 
I wasn't reliably triggering the kswapd issue, so it may not be long enough 
to know for sure that the problem is fixed.


I also should note that when I tried 3.7.0-0.rc7.git3.2.fc19.i686.PAE I 
did see problems with kswapd hitting 90+% usage of a CPU.



Re: kswapd craziness in 3.7

2012-12-04 Thread Johannes Weiner
On Mon, Dec 03, 2012 at 02:42:08PM -0500, Johannes Weiner wrote:
> On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> > >> John was able to reproduce the problem quickly with a kernel that 
> > >> contained the patch from your mail. For details see
> > >
> > > [stripped: all the gory details of what likely went wrong and led
> > > to the problem John sees or saw]
> > >
> > > ---
> > > From: Johannes Weiner 
> > > Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> > >  to individual uncompactable zones
> > > 
> > > When a zone meets its high watermark and is compactable in case of
> > > higher order allocations, it contributes to the percentage of the
> > > node's memory that is considered balanced.
> > > [...]
> > 
> > FYI: I built a kernel with that patch. I've been running on my x86_64
> > machine at home over the weekend and everything was working fine (just
> > as without the patch). John gave it a quick try and in
> > https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:
> > 
> > """
> > I just installed
> > kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
> > usual load that triggers the problem.  OK so far.  I'll check again in
> > 24hours, but looking good so far.
> > """
> 
> w00t!

Update from John in the BZ
(https://bugzilla.redhat.com/show_bug.cgi?id=866988#c62):

"Good news.

I've now been running both
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
and
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64
for over 24hours with no evidence of problems with kswapd"

Now waiting for results from Jiri, Zdenek and Bruno...


Re: kswapd craziness in 3.7

2012-12-04 Thread Johannes Weiner
On Tue, Dec 04, 2012 at 05:22:38PM +0100, Jiri Slaby wrote:
> On 12/04/2012 05:11 PM, Johannes Weiner wrote:
>  Any chance you could retry with this patch on top?
> >>
> >> It does not apply to -next :/. Should I try anything else?
> > 
> > The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> > is a -next patch.  I hope you don't run into other problems that come
> > out of -next craziness, because Linus is kinda waiting for this to be
> > resolved to release 3.7.  If you've always tested against -next so far
> > and it worked otherwise, don't change the environment now, please.  If
> > you just started, it would make more sense to test based on 3.7-rc8.
> 
> I reported the issue as soon as it appeared in -next for the first time
> on Oct 12. Since then I'm constantly hitting the issue (well, there were
> more than one I suppose, but not all of them were fixed by now) until
> now. I run only -next...

Okay.  Yes, it was a couple of problems, but not everybody hit the
same subset.

> Going to apply the patch now.

Thanks!


Re: kswapd craziness in 3.7

2012-12-04 Thread Jiri Slaby
On 12/04/2012 05:11 PM, Johannes Weiner wrote:
>>>> Any chance you could retry with this patch on top?
>>
>> It does not apply to -next :/. Should I try anything else?
> 
> The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
> is a -next patch.  I hope you don't run into other problems that come
> out of -next craziness, because Linus is kinda waiting for this to be
> resolved to release 3.7.  If you've always tested against -next so far
> and it worked otherwise, don't change the environment now, please.  If
> you just started, it would make more sense to test based on 3.7-rc8.

I reported the issue as soon as it appeared in -next for the first time
on Oct 12. Since then I'm constantly hitting the issue (well, there were
more than one I suppose, but not all of them were fixed by now) until
now. I run only -next...

Going to apply the patch now.

-- 
js
suse labs


Re: kswapd craziness in 3.7

2012-12-04 Thread Johannes Weiner
On Tue, Dec 04, 2012 at 10:05:29AM +0100, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
> >Szia Zdenek,
> >
> >On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> >>Ok, bad news - I've been hit by the kswapd0 loop again -
> >>my kernel (git commit cc19528bd3084c3c2d870b31a3578da8c69952f3) again
> >>showed kswapd0 on the CPU for a couple of minutes.
> >>
> >>It seemed to go away instantly when I dropped caches
> >>(echo 3 >/proc/sys/vm/drop_caches)
> >>(After that I had over 1G of free memory)
> >
> >Any chance you could retry with this patch on top?
> >
> >---
> >From: Johannes Weiner 
> >Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> >  to individual uncompactable zones
> >
> >---
> >  mm/vmscan.c | 16 ----------------
> >  1 file changed, 16 deletions(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> 
> 
> Ok, I'm running now b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
> with your patch.  I'll be able to give some feedback after couple
> days (if I keep my machine running without reboot - since before
> I had occasional problems with ACPI now resolved.
> (https://bugzilla.kernel.org/show_bug.cgi?id=51071)
> (patch not yet in -rc8)
> I'm also using this extra patch: https://patchwork.kernel.org/patch/1792531/

Okay, fingers crossed!  Thanks for persisting.

> What seems to be triggering condition on my machine - running laptop
> for some days - and having   Thunderbird reaching 0.8G (I guess they
> must keep all my news messages in memory to consume that size) and
> Firefox 1.3GB of consumed
> memory (assuming massive leaking with combination of flash)

Were you able to speed this process up in the past?  I.e. by doing a
search over all mail?  Watching 8 nyan cat videos in parallel?

If not, it's probably better not to change anything now...

Thanks!


Re: kswapd craziness in 3.7

2012-12-04 Thread Johannes Weiner
On Tue, Dec 04, 2012 at 10:15:09AM +0100, Jiri Slaby wrote:
> On 12/04/2012 10:05 AM, Zdenek Kabelac wrote:
> > On 3.12.2012 20:18, Johannes Weiner wrote:
> >> Szia Zdenek,
> >>
> >> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> >>> Ok, bad news - I've been hit by the kswapd0 loop again -
> >>> my kernel (git commit cc19528bd3084c3c2d870b31a3578da8c69952f3) again
> >>> showed kswapd0 on the CPU for a couple of minutes.
> >>>
> >>> It seemed to go away instantly when I dropped caches
> >>> (echo 3 >/proc/sys/vm/drop_caches)
> >>> (After that I had over 1G of free memory)
> >>
> >> Any chance you could retry with this patch on top?
> 
> It does not apply to -next :/. Should I try anything else?

The COMPACTION_BUILD changed to IS_ENABLED(CONFIG_COMPACTION), below
is a -next patch.  I hope you don't run into other problems that come
out of -next craziness, because Linus is kinda waiting for this to be
resolved to release 3.7.  If you've always tested against -next so far
and it worked otherwise, don't change the environment now, please.  If
you just started, it would make more sense to test based on 3.7-rc8.

Thanks!

---
From: Johannes Weiner 
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
 to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the
node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about
when kswapd was desperately trying to balance tiny zones when all
bigger zones in the node had plenty of free memory.  Arguably, the
same should apply to compaction: if a significant part of the node is
balanced enough to run compaction, do not get hung up on that tiny
zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced).  Remove the individual zone checks
that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis 
Signed-off-by: Johannes Weiner 
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (IS_ENABLED(CONFIG_COMPACTION) &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
-- 
1.7.11.7
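
[Archive note] The changelog above leans on the rule that a node counts as
balanced once zones holding at least 25% of its memory are balanced. A rough
user-space sketch of that rule follows. It is not the kernel code: struct
zone_model, zone_is_balanced() and node_is_balanced() are made-up names, and
the per-zone test is reduced to a plain comparison against the high watermark,
whereas the real zone_balanced() also looks at the allocation order, lowmem
reserves and compaction suitability.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
	unsigned long present_pages;	/* pages that belong to the zone */
	unsigned long free_pages;	/* pages currently free */
	unsigned long high_wmark;	/* high watermark, in pages */
};

/* Simplification: a zone is balanced once it reaches its high watermark. */
static bool zone_is_balanced(const struct zone_model *z)
{
	return z->free_pages >= z->high_wmark;
}

/*
 * Node-level predicate: true when the balanced zones together hold at
 * least a quarter of the node's present pages.
 */
static bool node_is_balanced(const struct zone_model *zones, size_t nr)
{
	unsigned long present = 0, balanced = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		present += zones[i].present_pages;
		if (zone_is_balanced(&zones[i]))
			balanced += zones[i].present_pages;
	}
	return balanced >= (present >> 2);
}
```

With one tiny unbalanced zone next to a large balanced one, node_is_balanced()
still returns true. That is the situation in which the per-zone checks removed
by the patch could send kswapd back to a reclaim pass that had nothing to do.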




Re: kswapd craziness in 3.7

2012-12-04 Thread Jiri Slaby
On 12/04/2012 10:05 AM, Zdenek Kabelac wrote:
> On 3.12.2012 20:18, Johannes Weiner wrote:
>> Szia Zdenek,
>>
>> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
>>> Ok, bad news - I've been hit by the kswapd0 loop again -
>>> my kernel (git commit cc19528bd3084c3c2d870b31a3578da8c69952f3) again
>>> showed kswapd0 on the CPU for a couple of minutes.
>>>
>>> It seemed to go away instantly when I dropped caches
>>> (echo 3 >/proc/sys/vm/drop_caches)
>>> (After that I had over 1G of free memory)
>>
>> Any chance you could retry with this patch on top?

It does not apply to -next :/. Should I try anything else?

>> From: Johannes Weiner 
>> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>>   to individual uncompactable zones
...
> What seems to be triggering condition on my machine - running laptop for
> some days - and having   Thunderbird reaching 0.8G (I guess they must
> keep all my news messages in memory to consume that size) and Firefox
> 1.3GB of consumed
> memory (assuming massive leaking with combination of flash)

Similar here, 5 days of uptime (suspend/resumes in between). FF 900M, TB
250M, java 1.1G, kvm 550M, X 400M, cache 1.5G out of 6G total mem. And boom.

thanks,
-- 
js
suse labs


Re: kswapd craziness in 3.7

2012-12-04 Thread Zdenek Kabelac

On 3.12.2012 20:18, Johannes Weiner wrote:
> Szia Zdenek,
>
> On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
>> Ok, bad news - I've been hit by the kswapd0 loop again -
>> my kernel (git commit cc19528bd3084c3c2d870b31a3578da8c69952f3) again
>> showed kswapd0 on the CPU for a couple of minutes.
>>
>> It seemed to go away instantly when I dropped caches
>> (echo 3 >/proc/sys/vm/drop_caches)
>> (After that I had over 1G of free memory)
>
> Any chance you could retry with this patch on top?
>
> ---
> From: Johannes Weiner
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>   to individual uncompactable zones
>
> ---
>   mm/vmscan.c | 16 ----------------
>   1 file changed, 16 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c

Ok, I'm now running b69f0859dc8e633c5d8c06845811588fe17e68b3 (-rc8)
with your patch.  I'll be able to give some feedback after a couple of
days, if I can keep my machine running without a reboot: before this
I had occasional problems with ACPI, which are now resolved
(https://bugzilla.kernel.org/show_bug.cgi?id=51071)
(patch not yet in -rc8).
I'm also using this extra patch: https://patchwork.kernel.org/patch/1792531/

What seems to be the triggering condition on my machine: running the laptop
for some days, with Thunderbird reaching 0.8G (I guess it must keep all my
news messages in memory to consume that much) and Firefox at 1.3GB of
consumed memory (assuming massive leaking in combination with Flash).

Zdenek



Re: kswapd craziness in 3.7

2012-12-04 Thread Jiri Slaby
On 12/03/2012 02:14 PM, Jiri Slaby wrote:
> On 11/27/2012 09:48 PM, Johannes Weiner wrote:
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem.  Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
> 
> Hi, I reported the problem for the first time but I got lost in the
> patches flying around very early.
> 
> Whatever is in the current -next, works for me since -next was
> resurrected after the 2 weeks gap last week...

Bah, I always need to write an email to reproduce that. It's back:
3.7.0-rc7-next-20121130

[] __cond_resched+0x2a/0x40
[] shrink_slab+0x1c0/0x2d0
[] kswapd+0x65d/0xb50
[] kthread+0xc0/0xd0
[] ret_from_fork+0x7c/0xb0
[] 0x

Going to apply this:
https://lkml.org/lkml/2012/12/3/407
and wait another 5 days to see the results...

thanks,
-- 
js
suse labs


Re: kswapd craziness in 3.7

2012-12-03 Thread Johannes Weiner
On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> >> John was able to reproduce the problem quickly with a kernel that 
> >> contained the patch from your mail. For details see
> >
> > [stripped: all the glory details of what likely went wrong and lead
> > to the problem john sees or saw]
> >
> > ---
> > From: Johannes Weiner 
> > Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
> >  to individual uncompactable zones
> > 
> > When a zone meets its high watermark and is compactable in case of
> > higher order allocations, it contributes to the percentage of the
> > node's memory that is considered balanced.
> > [...]
> 
> FYI: I built a kernel with that patch. I've been running on my x86_64
> machine at home over the weekend and everything was working fine (just
> as without the patch). John gave it a quick try and in
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:
> 
> """
> I just installed
> kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
> usual load that triggers the problem.  OK so far.  I'll check again in
> 24hours, but looking good so far.
> """

w00t!

> BTW, I built that kernel without the patch you mentioned in
> http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
> ("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> [...]) It looked to me like that patch was only meant for debugging. Let
> me know if that was wrong. Ohh, and I didn't update to a fresher
> mainline checkout yet to make sure the base for John's testing didn't
> change.

Ah, yes, the ApplyPatch is commented out.

I think we want that upstream as well, but it's not critical.  It'll
reduce kswapd CPU usage marginally on highmem systems in certain
situations, but I don't think any of the 100% CPU usage problems are
fixed by it.

Not rebasing sounds reasonable to me to verify the patch.  It might be
worth testing that the final version that will be 3.7 still works for
John, however, once that is done.  Just to be sure.


Re: kswapd craziness in 3.7

2012-12-03 Thread Johannes Weiner
Szia Zdenek,

On Mon, Dec 03, 2012 at 04:23:15PM +0100, Zdenek Kabelac wrote:
> Ok, bad news - I've been hit by the kswapd0 loop again -
> my kernel (git commit cc19528bd3084c3c2d870b31a3578da8c69952f3) again
> showed kswapd0 on the CPU for a couple of minutes.
> 
> It seemed to go away instantly when I dropped caches
> (echo 3 >/proc/sys/vm/drop_caches)
> (After that I had over 1G of free memory)

Any chance you could retry with this patch on top?

Thanks!

---
From: Johannes Weiner 
Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
 to individual uncompactable zones

When a zone meets its high watermark and is compactable in case of
higher order allocations, it contributes to the percentage of the
node's memory that is considered balanced.

This requirement, that a node be only partially balanced, came about
when kswapd was desperately trying to balance tiny zones when all
bigger zones in the node had plenty of free memory.  Arguably, the
same should apply to compaction: if a significant part of the node is
balanced enough to run compaction, do not get hung up on that tiny
zone that might never get in shape.

When the compaction logic in kswapd is reached, we know that at least
25% of the node's memory is balanced properly for compaction (see
zone_balanced and pgdat_balanced).  Remove the individual zone checks
that restart the kswapd cycle.

Otherwise, we may observe more endless looping in kswapd where the
compaction code loops back to reclaim because of a single zone and
reclaim does nothing because the node is considered balanced overall.

Reported-by: Thorsten Leemhuis 
Signed-off-by: Johannes Weiner 
---
 mm/vmscan.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..486100f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2806,22 +2806,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable &&
-			    sc.priority != DEF_PRIORITY)
-				continue;
-
-			/* Would compaction fail due to lack of free memory? */
-			if (COMPACTION_BUILD &&
-			    compaction_suitable(zone, order) == COMPACT_SKIPPED)
-				goto loop_again;
-
-			/* Confirm the zone is balanced for order-0 */
-			if (!zone_watermark_ok(zone, 0,
-					high_wmark_pages(zone), 0, 0)) {
-				order = sc.order = 0;
-				goto loop_again;
-			}
-
 			/* Check if the memory needs to be defragmented. */
 			if (zone_watermark_ok(zone, order,
 				    low_wmark_pages(zone), *classzone_idx, 0))
-- 
1.7.11.7


Re: kswapd craziness in 3.7

2012-12-03 Thread Zdenek Kabelac

On 28.11.2012 10:45, Mel Gorman wrote:
> (Adding Thorsten to cc)
>
> On Tue, Nov 27, 2012 at 03:48:34PM -0500, Johannes Weiner wrote:
>> Hi everyone,
>>
>> I hope I included everybody that participated in the various threads
>> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
>> at at least three root causes as far as I can see, so it's not really
>> clear who observed which problem.  Please correct me if the
>> reported-by, tested-by, bisected-by tags are incomplete.
>>
>> One problem was, as it seems, overly aggressive reclaim due to scaling
>> up reclaim goals based on compaction failures.  This one was reverted
>> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
>> reclaim/compaction based on failures".
>
> This particular one would have been made worse by the accounting bug and
> if kswapd was staying awake longer than necessary. As scaling the amount
> of reclaim only for direct reclaim helped this problem a lot, I strongly
> suspect the accounting bug was a factor.
>
> However the benefit for this is marginal -- it primarily affects how
> many THP pages we can allocate under stress. There is already a graceful
> fallback path and a system under heavy reclaim pressure is not going to
> notice the performance benefit of THP.
>
>> Another one was an accounting problem where a freed higher order page
>> was underreported, and so kswapd had trouble restoring watermarks.
>> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
>> (appears like memory leak).
>
> This almost certainly also requires the follow-on fix at
> https://lkml.org/lkml/2012/11/26/225 for reasons I explained in
> https://lkml.org/lkml/2012/11/27/190 .
>
>> The third one is a problem with small zones, like the DMA zone, where
>> the high watermark is lower than the low watermark plus compaction gap
>> (2 * allocation size).  The zonelist reclaim in kswapd would do
>> nothing because all high watermarks are met, but the compaction logic
>> would find its own requirements unmet and loop over the zones again.
>> Indefinitely, until some third party would free enough memory to help
>> meet the higher compaction watermark.  The problematic code has been
>> there since the 3.4 merge window for non-THP higher order allocations
>> but has been more prominent since the 3.7 merge window, where kswapd
>> is also woken up for the much more common THP allocations.
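
[Archive note] The disagreement described in the third problem can be shown
with a few lines of user-space C. This is only a sketch under stated
assumptions: the three helper names are invented for illustration, the
watermark checks are reduced to plain comparisons (the kernel's
zone_watermark_ok() also accounts for lowmem reserves and per-order free
lists), and the gap is taken literally from the text as 2 * allocation size,
i.e. 2UL << order pages.

```c
#include <stdbool.h>

/* Reclaim side: kswapd has nothing to do once the high watermark is met. */
static bool reclaim_satisfied(unsigned long free, unsigned long high_wmark)
{
	return free >= high_wmark;
}

/*
 * Compaction side: wants the low watermark plus a gap of twice the
 * allocation size, i.e. 2 * (1 << order) pages.
 */
static bool compaction_satisfied(unsigned long free, unsigned long low_wmark,
				 unsigned int order)
{
	return free >= low_wmark + (2UL << order);
}

/*
 * A zone can keep kswapd spinning when reclaim sees no work to do
 * while compaction still considers the zone short of memory.
 */
static bool zone_causes_loop(unsigned long free, unsigned long low_wmark,
			     unsigned long high_wmark, unsigned int order)
{
	return reclaim_satisfied(free, high_wmark) &&
	       !compaction_satisfied(free, low_wmark, order);
}
```

Taking the DMA zone watermarks quoted elsewhere in this thread (low 245,
high 294) and assuming an order-9 request, the compaction threshold is
245 + 1024 = 1269 pages, so any free count from 294 up to 1268 satisfies
reclaim but not compaction.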



> Yes.
>
>> The following patch should fix the third issue by making both reclaim
>> and compaction code in kswapd use the same predicate to determine
>> whether a zone is balanced or not.
>>
>> Hopefully, the sum of all three fixes should tame kswapd enough for
>> 3.7.
>
> Not exactly sure of that. With just those patches it is possible for
> allocations for THP entering the slow path to keep kswapd continually awake
> doing busy work. This was an alternative to the revert that covered that
> https://lkml.org/lkml/2012/11/12/151 but it was not enough because kswapd
> would stay awake due to the bug you identified and fixed.
>
> I went with the __GFP_NO_KSWAPD patch in this cycle because 3.6 was/is
> very poor in how it handles THP after the removal of lumpy reclaim. 3.7
> was shaping up to be even worse with multiple root causes too close to the
> release date.  Taking kswapd out of the equation covered some of the
> problems (yes, by hiding them) so it could be revisited but Johannes may
> have finally squashed it.
>
> However, if we revert the revert then I strongly recommend that it be
> replaced with "Avoid waking kswapd for THP allocations when compaction is
> deferred or contended".




Ok, bad news - I've been hit by the kswapd0 loop again -
my kernel (git commit cc19528bd3084c3c2d870b31a3578da8c69952f3) again
showed kswapd0 on the CPU for a couple of minutes.

It seemed to go away instantly when I dropped caches
(echo 3 >/proc/sys/vm/drop_caches)
(After that I had over 1G of free memory)

Here are some stats from before the drop, while kswapd0 was running:

kswapd0         R  running task        0    30      2 0x
 880133207b08 0082 880133207b18 0246
 880135b92340 880133207fd8 880133207fd8 880133207fd8
 880103098000 880135b92340  880133206000
Call Trace:
 [] preempt_schedule+0x42/0x60
 [] _raw_spin_unlock+0x55/0x60
 [] grab_super_passive+0x3c/0x90
 [] prune_super+0x46/0x1b0
 [] shrink_slab+0xba/0x510
 [] ? mem_cgroup_iter+0x17a/0x2e0
 [] ? mem_cgroup_iter+0xca/0x2e0
 [] balance_pgdat+0x621/0x7e0
 [] kswapd+0x174/0x640
 [] ? __init_waitqueue_head+0x60/0x60
 [] ? balance_pgdat+0x7e0/0x7e0
 [] kthread+0xdb/0xe0
 [] ? kthread_create_on_node+0x140/0x140
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_create_on_node+0x140/0x140

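runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
         kswapd0    30   8087056.792356     30543   120   8087056.792356    158938.479290 137131605.711862 /
     kworker/0:3 29833   8087050.792356    526664   120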

Re: kswapd craziness in 3.7

2012-12-03 Thread Jiri Slaby
On 11/27/2012 09:48 PM, Johannes Weiner wrote:
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem.  Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.

Hi, I reported the problem for the first time but I got lost in the
patches flying around very early.

Whatever is in the current -next, works for me since -next was
resurrected after the 2 weeks gap last week...

thanks,
-- 
js
suse labs


Fedora repo (was: Re: kswapd craziness in 3.7)

2012-12-03 Thread Borislav Petkov
On Mon, Dec 03, 2012 at 09:30:12AM +0100, Thorsten Leemhuis wrote:
> Np; BTW, in case anybody here on LKML cares: I started maintaining a
> side repo (PPA in ubuntu speak) a few weeks ago that offers kernel
> vanilla builds (mainline and stable) for the Fedora 17 and 18; see
> https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
> for details. It's not as good and up2date yet as I would like it, but
> one has to start somewhere.

Once you have this ready, you should send a more official mail to lkml
and the relevant lists, with "[ANNOUNCE]" in its subject and an explanation
of how to use the repo, so that more people know about it.

Thanks.

-- 
Regards/Gruss,
Boris.


Re: kswapd craziness in 3.7

2012-12-03 Thread Thorsten Leemhuis
Hi!

Johannes Weiner wrote on 01.12.2012 01:45:
> On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
>> /me wonders how to elegantly get out of his man-in-the-middle position
> You control the mighty koji :-)

Something even a journalist can ;-)

> But seriously, this is very helpful, thank you!

Np; BTW, in case anybody here on LKML cares: a few weeks ago I started
maintaining a side repo (PPA in Ubuntu speak) that offers vanilla kernel
builds (mainline and stable) for Fedora 17 and 18; see
https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
for details. It's not yet as good and up to date as I would like, but
one has to start somewhere.

Back to topic:

> John now also Cc'd directly.
> 
>> John was able to reproduce the problem quickly with a kernel that 
>> contained the patch from your mail. For details see
>
> [stripped: all the gory details of what likely went wrong and led
> to the problem John sees or saw]
>
> ---
> From: Johannes Weiner 
> Subject: [patch] mm: vmscan: do not keep kswapd looping forever due
>  to individual uncompactable zones
> 
> When a zone meets its high watermark and is compactable in case of
> higher order allocations, it contributes to the percentage of the
> node's memory that is considered balanced.
> [...]

FYI: I built a kernel with that patch. I've been running on my x86_64
machine at home over the weekend and everything was working fine (just
as without the patch). John gave it a quick try and in
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c57 reported:

"""
I just installed
kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686 and ran my
usual load that triggers the problem.  OK so far.  I'll check again in
24hours, but looking good so far.
"""

BTW, I built that kernel without the patch you mentioned in
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
("buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
[...]"). It looked to me like that patch was only meant for debugging. Let
me know if that was wrong. Oh, and I haven't updated to a fresher
mainline checkout yet, so that the base for John's testing doesn't
change.

CU
 Thorsten


Re: kswapd craziness in 3.7

2012-11-30 Thread Johannes Weiner
Hi Thorsten,

On Fri, Nov 30, 2012 at 01:39:03PM +0100, Thorsten Leemhuis wrote:
> /me wonders how to elegantly get out of his man-in-the-middle position

You control the mighty koji :-)

But seriously, this is very helpful, thank you!  John now also Cc'd
directly.

> John was able to reproduce the problem quickly with a kernel that 
> contained the patch from your mail. For details see
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later
> 
> He provided the information there. Parts of it:

> /proc/vmstat while kswapd0 at 100%cpu
> /proc/zoneinfo with kswapd0 at 100% cpu
> perf profile

Thanks.

I'm quoting the interesting bits in order of the cars on my possibly
derailing train of thought:

> pageoutrun 117729182
> allocstall 5

Okay, so kswapd is stupidly looping but it's still managing to do its
actual job; nobody is dropping into direct reclaim.

> pgsteal_kswapd_dma 1
> pgsteal_kswapd_normal 202106
> pgsteal_kswapd_high 36515
> pgsteal_kswapd_movable 0

> pgscan_kswapd_dma 1
> pgscan_kswapd_normal 203044
> pgscan_kswapd_high 40407
> pgscan_kswapd_movable 0

Does not seem excessive, so apparently it also does not overreclaim.

> Node 0, zone  DMA
>   pages free 1655
> min  196
> low  245
> high 294

> Node 0, zone   Normal
>   pages free 186234
> min  10953
> low  13691
> high 16429

> Node 0, zone  HighMem
>   pages free 8983
> min  34
> low  475
> high 917

These are all well above their watermarks, yet kswapd is definitely
finding something wrong with one of these as it actually does drop
into the reclaim loop, so zone_balanced() must be returning false:

> 16.52%  kswapd0  [kernel.kallsyms]  [k] idr_get_next
>         |
>         --- idr_get_next
>            |
>            |--99.76%-- css_get_next
>            |          mem_cgroup_iter
>            |          |
>            |          |--50.49%-- shrink_zone
>            |          |          kswapd
>            |          |          kthread
>            |          |          ret_from_kernel_thread
>            |          |
>            |           --49.51%-- kswapd
>            |                     kthread
>            |                     ret_from_kernel_thread
>             --0.24%-- [...]
>
> 11.23%  kswapd0  [kernel.kallsyms]  [k] prune_super
>         |
>         --- prune_super
>            |
>            |--86.74%-- shrink_slab
>            |          kswapd
>            |          kthread
>            |          ret_from_kernel_thread
>            |
>             --13.26%-- kswapd
>                       kthread
>                       ret_from_kernel_thread

Spending so much time in shrink_zone and shrink_slab without
overreclaiming a zone, I would say that a) this always stays on the
DEF_PRIORITY and b) only loops on the DMA zone.  At DEF_PRIORITY, the
scan goal for filepages in the other zones would be > 0 e.g.

As the DMA zone watermarks are fine, it must be the fragmentation
index that indicates a lack of memory.  Plugging the 1655 free pages
into the fragmentation index formula indicates a lack of free memory
when these 1655 pages are lumped together in fewer than 9 page blocks.
Not unrealistic, I think: on my desktop machine, the DMA zone's 3975
free pages are lumped together in only 12 blocks.  But on my system,
the DMA zone is practically never used, so there is always at least one
page block available that could satisfy a huge page allocation
(fragmentation index == -1000).  That only changes when the system gets
really close to OOM, at which point the DMA zone is highly fragmented.
And keep in mind that if the priority level goes below DEF_PRIORITY, as
it does close to OOM, the unreclaimable DMA zone is ignored anyway.  But
the DMA zone here is just barely used:

> Node 0, zone  DMA
[...]
> nr_slab_reclaimable 3
> nr_slab_unreclaimable 1
[...]
> nr_dirtied   315
> nr_written   315

which could explain a fragmentation index that asks for more free
memory while the watermarks are fine.
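
[Archive note] The "less than 9 page blocks" figure above can be reproduced
with a user-space transcription of the fragmentation index arithmetic. This
is a sketch, not the kernel source: the formula is written down as recalled
from mm/vmstat.c of that era, frag_index() and
skipped_for_lack_of_memory() are illustrative names, the threshold of 500 is
the assumed default of vm.extfrag_threshold, and order 9 (512 pages) is an
assumption that happens to match the numbers in the mail.

```c
#include <stdbool.h>

#define EXTFRAG_THRESHOLD 500	/* assumed default of vm.extfrag_threshold */

/*
 * Returns -1000 when a free block of the requested order exists.
 * Otherwise returns a value in 0..1000: low values point at a lack of
 * free memory, high values at fragmentation.
 */
static int frag_index(unsigned int order, unsigned long free_pages,
		      unsigned long free_blocks_total,
		      unsigned long free_blocks_suitable)
{
	unsigned long long requested = 1ULL << order;

	if (free_blocks_total == 0)
		return 0;
	if (free_blocks_suitable > 0)
		return -1000;
	return 1000 - (int)((1000 + free_pages * 1000ULL / requested) /
			    free_blocks_total);
}

/* Compaction is skipped when the index says "not enough free memory". */
static bool skipped_for_lack_of_memory(int index)
{
	return index >= 0 && index <= EXTFRAG_THRESHOLD;
}
```

For 1655 free pages and order 9, eight free blocks give an index of 471
(at or below the threshold, so compaction is skipped for lack of memory),
while nine blocks give 530 (above the threshold).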

Why this all loops: there is one more inconsistency where the
conditions for reclaim and the conditions for compaction contradict
each other: reclaim also does not consider the DMA zone balanced, but
it needs only 25% of the whole node to be balanced, while compaction
requires every single zone to be balanced individually.

So these strict per-zone checks for compaction at the end of
balance_pgdat() are likely to be the culprits that keep kswapd looping
forever on this machine, trying to balance DMA for compaction while
reclaim decides it has enough balanced memory in the nod

Re: kswapd craziness in 3.7

2012-11-30 Thread Thorsten Leemhuis
Johannes Weiner wrote on 29.11.2012 18:05:
> On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
>> Mel Gorman wrote on 29.11.2012 00:54:
>> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> >> On Wed, 28 Nov 2012 10:13:59 +
>> >> Mel Gorman  wrote:
>> >> > Based on the reports I've seen I expect the following to work for 3.7
>> >> > Keep
>> >> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
>> >> > reclaim/compaction based on failures"
>> >> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory 
>> >> > leak)
>> >> > Revert
>> >> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> >> > Merge
>> >> >   mm: vmscan: fix kswapd endless loop on higher order allocation
>> >> >   mm: Avoid waking kswapd for THP allocations when compaction is 
>> >> > deferred or contended
>> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> >> myself" and when Zdenek tested it he hit an unexplained oom.
>> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
>> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
>> > swap configured and inactive/active file pages were very low. Finally,
>> > the free pages for Normal looked off and could also have been affected by
>> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
>> > here. Are you thinking of something else?
>> > I have not tested with the patch admittedly but Thorsten has and seemed
>> > to be ok with it https://lkml.org/lkml/2012/11/23/276.
>> Yeah, on my two main work horses a few different kernels based on rc6 or
>> rc7 worked fine with this patch. But sorry, it seems the patch doesn't
>> fix the problems Fedora user John Ellson sees, who tried kernels I built
>> in the Fedora buildsystem. Details:
> [...]
>> I know, this makes things more complicated again; but I wanted to let
>> you guys know that some problem might still be lurking somewhere. Side
>> note: right now it seems John with kernels that contain
>> "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
>> quicker (or only?) on i686 than on x86-64.
>
> Humm, highmem...  Could this be the lowmem protection forcing kswapd
> to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
> time it's woken up?
> 
> This requires somebody to wake up kswapd regularly, though and from
> his report it's not quite clear to me if kswapd gets stuck or just has
> really high CPU usage while the system is still under load.  The
> initial post says he would expect "<5% cpu when idling" but his top
> snippet in there shows there are other tasks running as well.  So does
> it happen while the system is busy or when it's otherwise idle?
> 
> [ On the other hand, not waking kswapd from THP allocations seems to
>   not show this problem on his i686 machine.  But it could also just
>   be a tiny window of conditions aligning perfectly that drops kswapd
>   in an endless loop, and the increased wakeups increase the
>   probability of hitting it.  So, yeah, this would be good to know. ]
> 
> As the system is still responsive when this happens, any chance he
> could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
> haywire?
> 
> Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
> 
> Preferably with this patch applied, to rule out faulty lowmem
> protection:
> 
> buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
> when figuring out whether the zone is balanced and so priority levels
> are not descended and no progress is ever made.

/me wonders how to elegantly get out of his man-in-the-middle position

John was able to reproduce the problem quickly with a kernel that 
contained the patch from your mail. For details see 

https://bugzilla.redhat.com/show_bug.cgi?id=866988#c42 and later

He provided the information there. Parts of it:

/proc/vmstat while kswapd0 was at 100% CPU

nr_free_pages 196858
nr_inactive_anon 15804
nr_active_anon 65
nr_inactive_file 20792
nr_active_file 11307
nr_unevictable 0
nr_mlock 0
nr_anon_pages 14385
nr_mapped 2393
nr_file_pages 32563
nr_dirty 5
nr_writeback 0
nr_slab_reclaimable 3113
nr_slab_unreclaimable 4725
nr_page_table_pages 271
nr_kernel_stack 96
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1487
nr_vmscan_immediate_reclaim 3
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 381
nr_dirtied 388323
nr_written 361128
nr_anon_transparent_hugepages 1
nr_free_cma 0
nr_dirty_threshold 38188
nr_dirty_background_threshold 19094
pgpgin 1057223
pgpgout 1552306
pswpin 8
pswpout 1487
pgalloc_dma 5548
pgalloc_normal 10651864
pgalloc_high 2191246
pgalloc_movable 0
pgfree 13055503
pgactivate 440358
pgdeactivate 259724
pgfault 31423675
pgmajfault 3760
pgrefill_dma 2174
pgrefill_normal 212914
pgrefill_high 51755
pgrefill_movable 0
pgsteal_kswapd_dma 1
pgsteal_kswapd_normal 202106
pgsteal_kswapd_high 36515
pgsteal_kswapd_movable 0
pgsteal_direct_dma 18
pgsteal_direct_normal 

Re: kswapd craziness in 3.7

2012-11-29 Thread Johannes Weiner
On Thu, Nov 29, 2012 at 04:30:12PM +0100, Thorsten Leemhuis wrote:
> Mel Gorman wrote on 29.11.2012 00:54:
> > On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> >> On Wed, 28 Nov 2012 10:13:59 +
> >> Mel Gorman  wrote:
> >> 
> >> > Based on the reports I've seen I expect the following to work for 3.7
> >> > Keep
> >> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
> >> > reclaim/compaction based on failures"
> >> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory 
> >> > leak)
> >> > Revert
> >> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> >> > Merge
> >> >   mm: vmscan: fix kswapd endless loop on higher order allocation
> >> >   mm: Avoid waking kswapd for THP allocations when compaction is 
> >> > deferred or contended
> >> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> >> myself" and when Zdenek tested it he hit an unexplained oom.
> > I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> > Further, when he hit that OOM, it looked like a genuine OOM. He had no
> > swap configured and inactive/active file pages were very low. Finally,
> > the free pages for Normal looked off and could also have been affected by
> > the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> > here. Are you thinking of something else?
> > 
> > I have not tested with the patch admittedly but Thorsten has and seemed
> > to be ok with it https://lkml.org/lkml/2012/11/23/276.
> 
> Yeah, on my two main work horses a few different kernels based on rc6 or
> rc7 worked fine with this patch. But sorry, it seems the patch doesn't
> fix the problems Fedora user John Ellson sees, who tried kernels I built
> in the Fedora buildsystem. Details:
> 
> In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned
> his machine worked fine with a rc6 based kernel I built that contained
> 82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). Before that he had tried
> a kernel with the same baseline that contained "Avoid waking kswapd for
> THP allocations when […]" instead and reported it didn't help on his
> i686 machine (seems it helped the x86-64 one):
> https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33
> 
> He now tried a recent mainline kernel I built 20 hours ago that is based
> on a git checkout from round about two days ago, reverts 82b212f4, and had
>  * fix-kswapd-endless-loop-on-higher-order-allocation.patch
>  * Avoid-waking-kswapd-for-THP-allocations-when.patch
>  * mm-compaction-Fix-return-value-of-capture_free_page.patch
> applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and
> comment 41 he reported that this kernel on his i686 host showed 100%cpu
> usage by kswapd0 :-/
> 
> Build log for said kernel rpms (I'm quite sure I applied the patches
> properly, but you know: mistakes happen, so be careful, maybe I did
> something stupid somewhere...):
> http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log
> 
> I know, this makes things more complicated again; but I wanted to let
> you guys know that some problem might still be lurking somewhere. Side
> note: right now it seems John with kernels that contain
> "Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
> quicker (or only?) on i686 than on x86-64.

Humm, highmem...  Could this be the lowmem protection forcing kswapd
to reclaim highmem at DEF_PRIORITY (not useful but burns CPU) every
time it's woken up?

This requires somebody to wake up kswapd regularly, though and from
his report it's not quite clear to me if kswapd gets stuck or just has
really high CPU usage while the system is still under load.  The
initial post says he would expect "<5% cpu when idling" but his top
snippet in there shows there are other tasks running as well.  So does
it happen while the system is busy or when it's otherwise idle?

[ On the other hand, not waking kswapd from THP allocations seems to
  not show this problem on his i686 machine.  But it could also just
  be a tiny window of conditions aligning perfectly that drops kswapd
  in an endless loop, and the increased wakeups increase the
  probability of hitting it.  So, yeah, this would be good to know. ]

As the system is still responsive when this happens, any chance he
could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
haywire?

Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
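
For comparing two /proc/vmstat captures, a small helper along these lines could pull out individual counters. This is only a sketch under assumptions: `vmstat_lookup()` is a hypothetical name, and it expects the "name value" one-per-line layout seen in vmstat dumps, passed in as a string that the caller has already read.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one "name value" line in text formatted like /proc/vmstat.
 * Returns true and stores the value in *out if the counter is found.
 */
bool vmstat_lookup(const char *text, const char *key, unsigned long long *out)
{
    size_t keylen;

    assert(text != NULL && key != NULL && out != NULL);
    keylen = strlen(key);

    while (*text != '\0') {
        const char *eol = strchr(text, '\n');

        /* Match the whole name: key, one space, then a digit. */
        if (strncmp(text, key, keylen) == 0 && text[keylen] == ' ' &&
            isdigit((unsigned char)text[keylen + 1])) {
            *out = strtoull(text + keylen + 1, NULL, 10);
            return true;
        }
        if (eol == NULL)
            break;
        text = eol + 1; /* advance to the start of the next line */
    }
    return false;
}
```

Taking two captures a few seconds apart and comparing counters such as pgsteal_kswapd_normal would show whether kswapd is actually reclaiming pages or just spinning.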

Preferably with this patch applied, to rule out faulty lowmem
protection:

buffer_heads_over_limit can put kswapd into reclaim, but it's ignored
when figuring out whether the zone is balanced and so priority levels
are not descended and no progress is ever made.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b0aef4..73c4f5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2400,6 +2400,14 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 static bool zone_balanced(struct zone *zone, int order,
  unsigned lo

Re: kswapd craziness in 3.7

2012-11-29 Thread Thorsten Leemhuis
Mel Gorman wrote on 29.11.2012 00:54:
> On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
>> On Wed, 28 Nov 2012 10:13:59 +
>> Mel Gorman  wrote:
>> 
>> > Based on the reports I've seen I expect the following to work for 3.7
>> > Keep
>> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
>> > reclaim/compaction based on failures"
>> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory 
>> > leak)
>> > Revert
>> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
>> > Merge
>> >   mm: vmscan: fix kswapd endless loop on higher order allocation
>> >   mm: Avoid waking kswapd for THP allocations when compaction is deferred 
>> > or contended
>> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
>> myself" and when Zdenek tested it he hit an unexplained oom.
> I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> Further, when he hit that OOM, it looked like a genuine OOM. He had no
> swap configured and inactive/active file pages were very low. Finally,
> the free pages for Normal looked off and could also have been affected by
> the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> here. Are you thinking of something else?
> 
> I have not tested with the patch admittedly but Thorsten has and seemed
> to be ok with it https://lkml.org/lkml/2012/11/23/276.

Yeah, on my two main work horses a few different kernels based on rc6 or
rc7 worked fine with this patch. But sorry, it seems the patch doesn't
fix the problems Fedora user John Ellson sees, who tried kernels I built
in the Fedora buildsystem. Details:

In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c35 he mentioned
his machine worked fine with a rc6 based kernel I built that contained
82b212f4 (Revert "mm: remove __GFP_NO_KSWAPD"). Before that he had tried
a kernel with the same baseline that contained "Avoid waking kswapd for
THP allocations when […]" instead and reported it didn't help on his
i686 machine (seems it helped the x86-64 one):
https://bugzilla.redhat.com/show_bug.cgi?id=866988#c33

He now tried a recent mainline kernel I built 20 hours ago that is based
on a git checkout from round about two days ago, reverts 82b212f4, and had
 * fix-kswapd-endless-loop-on-higher-order-allocation.patch
 * Avoid-waking-kswapd-for-THP-allocations-when.patch
 * mm-compaction-Fix-return-value-of-capture_free_page.patch
applied. In https://bugzilla.redhat.com/show_bug.cgi?id=866988#c39 and
comment 41 he reported that this kernel on his i686 host showed 100%cpu
usage by kswapd0 :-/

Build log for said kernel rpms (I'm quite sure I applied the patches
properly, but you know: mistakes happen, so be careful, maybe I did
something stupid somewhere...):
http://kojipkgs.fedoraproject.org//work/tasks/8253/4738253/build.log

I know, this makes things more complicated again; but I wanted to let
you guys know that some problem might still be lurking somewhere. Side
note: right now it seems John with kernels that contain
"Avoid-waking-kswapd-for-THP-allocations-when" can trigger the problem
quicker (or only?) on i686 than on x86-64.

CU
Thorsten


Re: kswapd craziness in 3.7

2012-11-28 Thread Andrew Morton
On Wed, 28 Nov 2012 23:54:12 +
Mel Gorman  wrote:

> On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> > On Wed, 28 Nov 2012 10:13:59 +
> > Mel Gorman  wrote:
> > 
> > > Based on the reports I've seen I expect the following to work for 3.7
> > > 
> > > Keep
> > >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
> > > reclaim/compaction based on failures"
> > >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory 
> > > leak)
> > > 
> > > Revert
> > >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> > > 
> > > Merge
> > >   mm: vmscan: fix kswapd endless loop on higher order allocation
> > >   mm: Avoid waking kswapd for THP allocations when compaction is deferred 
> > > or contended
> > 
> > "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> > myself" and when Zdenek tested it he hit an unexplained oom.
> > 
> 
> I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
> Further, when he hit that OOM, it looked like a genuine OOM. He had no
> swap configured and inactive/active file pages were very low. Finally,
> the free pages for Normal looked off and could also have been affected by
> the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
> here. Are you thinking of something else?

who, me, think?  I was trying to work out why I hadn't merged or queued
a patch which you felt was important.  Turned out it was because it
didn't look very tested and final.

> I have not tested with the patch admittedly but Thorsten has and seemed
> to be ok with it https://lkml.org/lkml/2012/11/23/276.

OK, I'll queue revert-revert-mm-remove-__gfp_no_kswapd.patch and the
patch from https://patchwork.kernel.org/patch/1728081/.

So what I'm currently sitting on for 3.7 is

mm-compaction-fix-return-value-of-capture_free_page.patch
mm-vmemmap-fix-wrong-use-of-virt_to_page.patch
mm-vmscan-fix-endless-loop-in-kswapd-balancing.patch
revert-revert-mm-remove-__gfp_no_kswapd.patch
mm-avoid-waking-kswapd-for-thp-allocations-when-compaction-is-deferred-or-contended.patch
mm-soft-offline-split-thp-at-the-beginning-of-soft_offline_page.patch

> > Please identify "Johannes' patch"?
> 
> mm: vmscan: fix kswapd endless loop on higher order allocation

OK, we have that.  I'll start a round of testing, do another -next drop
and send the above Linuswards tomorrow.



Re: kswapd craziness in 3.7

2012-11-28 Thread Mel Gorman
On Wed, Nov 28, 2012 at 02:52:15PM -0800, Andrew Morton wrote:
> On Wed, 28 Nov 2012 10:13:59 +
> Mel Gorman  wrote:
> 
> > Based on the reports I've seen I expect the following to work for 3.7
> > 
> > Keep
> >   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
> > reclaim/compaction based on failures"
> >   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> > 
> > Revert
> >   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> > 
> > Merge
> >   mm: vmscan: fix kswapd endless loop on higher order allocation
> >   mm: Avoid waking kswapd for THP allocations when compaction is deferred 
> > or contended
> 
> "mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
> myself" and when Zdenek tested it he hit an unexplained oom.
> 

I thought Zdenek was testing with __GFP_NO_KSWAPD when he hit that OOM.
Further, when he hit that OOM, it looked like a genuine OOM. He had no
swap configured and inactive/active file pages were very low. Finally,
the free pages for Normal looked off and could also have been affected by
the accounting bug. I'm looking at https://lkml.org/lkml/2012/11/18/132
here. Are you thinking of something else?

I have not tested with the patch admittedly but Thorsten has and seemed
to be ok with it https://lkml.org/lkml/2012/11/23/276.

> > Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> > think we should also avoid waking kswapd for THP allocations if compaction
> > is deferred. Johannes' patch might mean that kswapd quickly goes back
> > to sleep but it's still busy work.
> > 
> > 3.6 is still known to be screwed in terms of THP because of the amount of
> > time it can spend in compaction after lumpy reclaim was removed. This is
> > my old list of patches I felt needed to be backported after 3.7 came out.
> > They are not tagged -stable, I'll be sending it to Greg manually.
> > 
> > e64c523 mm: compaction: abort compaction loop if lock is contended or run 
> > too long
> > 3cc668f mm: compaction: move fatal signal check out of 
> > compact_checklock_irqsave
> > 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
> > 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
> > f40d1e4 mm: compaction: acquire the zone->lock as late as possible
> > 753341a revert "mm: have order > 0 compaction start off where it left"
> > bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were 
> > isolated
> > c89511a mm: compaction: Restart compaction from near where it left off
> > 6299702 mm: compaction: clear PG_migrate_skip based on compaction and 
> > reclaim activity
> > 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA
> > 
> > Only Johannes' patch needs to be added to this list. kswapd is not woken
> > for THP in 3.6 but as it calls compaction for other high-order allocations
> > it still makes sense.
> 
> Please identify "Johannes' patch"?

mm: vmscan: fix kswapd endless loop on higher order allocation

-- 
Mel Gorman
SUSE Labs


Re: kswapd craziness in 3.7

2012-11-28 Thread Andrew Morton
On Wed, 28 Nov 2012 10:13:59 +
Mel Gorman  wrote:

> Based on the reports I've seen I expect the following to work for 3.7
> 
> Keep
>   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
> reclaim/compaction based on failures"
>   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> 
> Revert
>   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> 
> Merge
>   mm: vmscan: fix kswapd endless loop on higher order allocation
>   mm: Avoid waking kswapd for THP allocations when compaction is deferred or 
> contended

"mm: Avoid waking kswapd for THP ..." is marked "I have not tested it
myself" and when Zdenek tested it he hit an unexplained oom.

> Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> think we should also avoid waking kswapd for THP allocations if compaction
> is deferred. Johannes' patch might mean that kswapd quickly goes back
> to sleep but it's still busy work.
> 
> 3.6 is still known to be screwed in terms of THP because of the amount of
> time it can spend in compaction after lumpy reclaim was removed. This is
> my old list of patches I felt needed to be backported after 3.7 came out.
> They are not tagged -stable, I'll be sending it to Greg manually.
> 
> e64c523 mm: compaction: abort compaction loop if lock is contended or run too 
> long
> 3cc668f mm: compaction: move fatal signal check out of 
> compact_checklock_irqsave
> 661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
> 2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
> f40d1e4 mm: compaction: acquire the zone->lock as late as possible
> 753341a revert "mm: have order > 0 compaction start off where it left"
> bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were 
> isolated
> c89511a mm: compaction: Restart compaction from near where it left off
> 6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim 
> activity
> 0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA
> 
> Only Johannes' patch needs to be added to this list. kswapd is not woken
> for THP in 3.6 but as it calls compaction for other high-order allocations
> it still makes sense.

Please identify "Johannes' patch"?


Re: kswapd craziness in 3.7

2012-11-28 Thread Mel Gorman
On Wed, Nov 28, 2012 at 10:13:59AM +, Mel Gorman wrote:
> On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner  wrote:
> > > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> > >>
> > >> Kswapd going crazy is certainly a large part of the problem.
> > >>
> > >> However, that leaves the issue of page_alloc.c waking up
> > >> kswapd when the system is not actually low on memory.
> > >>
> > >> Instead, kswapd is woken up because memory compaction failed,
> > >> potentially even due to lock contention during compaction!
> > >>
> > >> Ideally the allocation code would only wake up kswapd if
> > >> memory needs to be freed, or in order for kswapd to do
> > >> memory compaction (so the allocator does not have to).
> > >
> > > Maybe I missed something, but shouldn't this be solved with my patch?
> > 
> > Ok, guys. Cage fight!
> > 
> > The rules are simple: two men enter, one man leaves.
> > 
> 
> I'm fairly scorch damaged from this whole cycle already. I won't need a
> prop master to look the part for a thunderdome match.
> 
> > And the one who comes out gets to explain to me which patch(es) I
> > should apply, and which I should revert, if any.
> > 
> 
> Based on the reports I've seen I expect the following to work for 3.7
> 
> Keep
>   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
> reclaim/compaction based on failures"
>   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> 
> Revert
>   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> 
> Merge
>   mm: vmscan: fix kswapd endless loop on higher order allocation
>   mm: Avoid waking kswapd for THP allocations when compaction is deferred or 
> contended
> 

and
mm: compaction: Fix return value of capture_free_page

but this one may already be in flight from Andrew's tree as he picked it
up already.

-- 
Mel Gorman
SUSE Labs


Re: kswapd craziness in 3.7

2012-11-28 Thread Jiri Slaby
On 11/28/2012 02:35 PM, Zdenek Kabelac wrote:
> and added a slightly modified patch from Jiri
> (https://lkml.org/lkml/2012/11/15/950)
> (Unsure whether it still applies for -rc7??)

It is needed for -next only. And if you have recent -next, it's already
there...

thanks,
-- 
js
suse labs


Re: kswapd craziness in 3.7

2012-11-28 Thread Zdenek Kabelac

On 27.11.2012 21:58, Linus Torvalds wrote:

Note that in the meantime, I've also applied (through Andrew) the
patch that reverts commit c654345924f7 (see commit 82b212f40059
'Revert "mm: remove __GFP_NO_KSWAPD"').

I wonder if that revert may be bogus, and a result of this same issue.
Maybe that revert should be reverted, and replaced with your patch?

Mel? Zdenek? What's the status here?




I've been testing these for a longer period:

https://lkml.org/lkml/2012/11/5/308
https://lkml.org/lkml/2012/11/12/113

these 2 seem to be merged in -rc7 now
(since they disappeared after my git rebase)


and added a slightly modified patch from Jiri
(https://lkml.org/lkml/2012/11/15/950)
(Unsure whether it still applies for -rc7??)

Also I've applied Jan Kara's patch
fs: Fix imbalance in freeze protection in mark_files_ro()
(which is still not applied upstream)

And I think I'm NOT seeing a huge load from kswapd0.
(At least as far as I can tell from my not really long uptimes)


But I'm also now a frequent victim of my other report:

https://lkml.org/lkml/2012/11/15/369

This turns into a problem: if my T61 docking station has support for
'old hw' docking (i.e. serial output) enabled in the BIOS, the machine
becomes unstable and either the 1st or the 2nd resume deadlocks it
(and the serial port gives just garbage).

Zdenek



  Linus

On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner  wrote:

Hi everyone,

I hope I included everybody that participated in the various threads
on kswapd getting stuck / exhibiting high CPU usage.  We were looking
at at least three root causes as far as I can see, so it's not really
clear who observed which problem.  Please correct me if the
reported-by, tested-by, bisected-by tags are incomplete.

One problem was, as it seems, overly aggressive reclaim due to scaling
up reclaim goals based on compaction failures.  This one was reverted
in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
reclaim/compaction based on failures".

Another one was an accounting problem where a freed higher order page
was underreported, and so kswapd had trouble restoring watermarks.
This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
(appears like memory leak).

The third one is a problem with small zones, like the DMA zone, where
the high watermark is lower than the low watermark plus compaction gap
(2 * allocation size).  The zonelist reclaim in kswapd would do
nothing because all high watermarks are met, but the compaction logic
would find its own requirements unmet and loop over the zones again.
Indefinitely, until some third party would free enough memory to help
meet the higher compaction watermark.  The problematic code has been
there since the 3.4 merge window for non-THP higher order allocations
but has been more prominent since the 3.7 merge window, where kswapd
is also woken up for the much more common THP allocations.

The following patch should fix the third issue by making both reclaim
and compaction code in kswapd use the same predicate to determine
whether a zone is balanced or not.
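
As a rough illustration of the third issue, the two views of a zone can be written as separate predicates. This is a user-space sketch, not the kernel's implementation: `struct zone_marks` and the three function names are hypothetical, and the compaction requirement is taken from the description above as low watermark plus 2 * allocation size.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified watermark state of one zone (all in pages). */
struct zone_marks {
    unsigned long free_pages;
    unsigned long low_wmark;
    unsigned long high_wmark;
};

/* Reclaim's view: nothing to do once free pages reach the high watermark. */
bool reclaim_satisfied(const struct zone_marks *z)
{
    return z->free_pages >= z->high_wmark;
}

/*
 * Compaction's view as described above: low watermark plus a gap of
 * 2 * allocation size, i.e. 2 << order pages.
 */
bool compaction_satisfied(const struct zone_marks *z, unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long) - 1);
    return z->free_pages >= z->low_wmark + (2UL << order);
}

/* The two checks disagree: reclaim frees nothing, compaction keeps asking. */
bool kswapd_may_spin(const struct zone_marks *z, unsigned int order)
{
    return reclaim_satisfied(z) && !compaction_satisfied(z, order);
}
```

For a small zone whose high watermark is only a few dozen pages, an order-9 request adds a 1024-page gap that the zone never reaches, so `kswapd_may_spin()` is true; using one shared predicate for both decisions removes that state.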

Hopefully, the sum of all three fixes should tame kswapd enough for
3.7.

Johannes





Re: kswapd craziness in 3.7

2012-11-28 Thread Thorsten Leemhuis
Mel Gorman wrote on 28.11.2012 11:13:
> On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
>> On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner  wrote:
>> > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>
>> And the one who comes out gets to explain to me which patch(es) I
>> should apply, and which I should revert, if any.
> 
> Based on the reports I've seen I expect the following to work for 3.7
> 
> Keep
>   96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by 
> reclaim/compaction based on failures"
>   ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)
> 
> Revert
>   82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"
> 
> Merge
>   mm: vmscan: fix kswapd endless loop on higher order allocation
>   mm: Avoid waking kswapd for THP allocations when compaction is deferred or 
> contended

I'll build a kernel with this combination and will give it a try. Maybe
one of those people that reported problems in
https://bugzilla.redhat.com/show_bug.cgi?id=866988 can try them, too.
Two people there recently reported their problems were gone with kernels
that contained 82b212f4.

> Johannes' patch should remove the necessity for __GFP_NO_KSWAPD revert but I
> think we should also avoid waking kswapd for THP allocations if compaction
> is deferred. Johannes' patch might mean that kswapd quickly goes back
> to sleep but it's still busy work.

Is there a way to trigger (some benchmark?) and detect (something in
/proc/vmstat ?) the problem Hannes' patch tries to fix?

Background: The two main problems that got me into this discussion
vanished thanks to 9671009 (mm: revert "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures") and ef6c5be (fix
incorrect NR_FREE_PAGES accounting (appears like memory leak)). I
thought all my problems had gone, but after a few days of uptime
(I suspended and resumed the particular machine a few times in between, as
I was using it just in the evenings) kswapd now and then started
consuming nearly 100% of one CPU core for intervals of 10 to 15 seconds (it
seems watching a YouTube video triggered it; and the machine was using a
little bit of swap space). I had just started debugging this, but due to
some stupid mistake
(https://plus.google.com/107616711159256259828/posts/GXuhf1LTien ) I then
rebooted the machine :-/ So maybe I hit the problem Hannes' patch tries
to solve, but I'm not sure; and I have no easy way to verify quickly if
the proposed patch combination helps.

Thorsten


Re: kswapd craziness in 3.7

2012-11-28 Thread Mel Gorman
On Tue, Nov 27, 2012 at 03:19:38PM -0800, Linus Torvalds wrote:
> On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner  wrote:
> > On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> >>
> >> Kswapd going crazy is certainly a large part of the problem.
> >>
> >> However, that leaves the issue of page_alloc.c waking up
> >> kswapd when the system is not actually low on memory.
> >>
> >> Instead, kswapd is woken up because memory compaction failed,
> >> potentially even due to lock contention during compaction!
> >>
> >> Ideally the allocation code would only wake up kswapd if
> >> memory needs to be freed, or in order for kswapd to do
> >> memory compaction (so the allocator does not have to).
> >
> > Maybe I missed something, but shouldn't this be solved with my patch?
> 
> Ok, guys. Cage fight!
> 
> The rules are simple: two men enter, one man leaves.
> 

I'm fairly scorch damaged from this whole cycle already. I won't need a
prop master to look the part for a thunderdome match.

> And the one who comes out gets to explain to me which patch(es) I
> should apply, and which I should revert, if any.
> 

Based on the reports I've seen I expect the following to work for 3.7

Keep
  96710098 mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
  ef6c5be6 fix incorrect NR_FREE_PAGES accounting (appears like memory leak)

Revert
  82b212f4 Revert "mm: remove __GFP_NO_KSWAPD"

Merge
  mm: vmscan: fix kswapd endless loop on higher order allocation
  mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

Johannes' patch should remove the necessity for the __GFP_NO_KSWAPD revert, but I
think we should also avoid waking kswapd for THP allocations if compaction
is deferred. Johannes' patch might mean that kswapd quickly goes back
to sleep, but it's still busy work.
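The wakeup rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the code from the patch: the struct, its fields, and the function name are hypothetical stand-ins for whatever state the allocator slow path actually carries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one allocation attempt in the slow path. */
struct toy_alloc_request {
    bool is_thp;               /* allocation is for a transparent huge page */
    bool compaction_deferred;  /* compaction was recently deferred for this zone */
    bool compaction_contended; /* compaction backed off due to lock contention */
};

/*
 * Returns true if kswapd should be woken for this request.
 * THP allocations skip the wakeup when compaction is deferred or
 * contended, because waking kswapd then is unlikely to help.
 */
static bool toy_should_wake_kswapd(const struct toy_alloc_request *req)
{
    assert(req != NULL);

    if (req->is_thp &&
        (req->compaction_deferred || req->compaction_contended))
        return false;

    return true;
}
```

Non-THP requests always wake kswapd in this sketch; only the THP case is filtered, which matches the scope of the patch title.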

3.6 is still known to be screwed in terms of THP because of the amount of
time it can spend in compaction after lumpy reclaim was removed. This is
my old list of patches I felt needed to be backported after 3.7 came out.
They are not tagged -stable, I'll be sending it to Greg manually.

e64c523 mm: compaction: abort compaction loop if lock is contended or run too long
3cc668f mm: compaction: move fatal signal check out of compact_checklock_irqsave
661c4cb mm: compaction: Update try_to_compact_pages()kerneldoc comment
2a1402a mm: compaction: acquire the zone->lru_lock as late as possible
f40d1e4 mm: compaction: acquire the zone->lock as late as possible
753341a revert "mm: have order > 0 compaction start off where it left"
bb13ffe mm: compaction: cache if a pageblock was scanned and no pages were isolated
c89511a mm: compaction: Restart compaction from near where it left off
6299702 mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
0db63d7 mm: compaction: correct the nr_strict va isolated check for CMA

Only Johannes' patch needs to be added to this list. kswapd is not woken
for THP in 3.6 but as it calls compaction for other high-order allocations
it still makes sense.

-- 
Mel Gorman
SUSE Labs


Re: kswapd craziness in 3.7

2012-11-28 Thread Mel Gorman
(Adding Thorsten to cc)

On Tue, Nov 27, 2012 at 03:48:34PM -0500, Johannes Weiner wrote:
> Hi everyone,
> 
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem.  Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
> 
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures.  This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
> 

This particular one would have been made worse by the accounting bug and
if kswapd was staying awake longer than necessary. As scaling the amount
of reclaim only for direct reclaim helped this problem a lot, I strongly
suspect the accounting bug was a factor.

However the benefit for this is marginal -- it primarily affects how
many THP pages we can allocate under stress. There is already a graceful
fallback path and a system under heavy reclaim pressure is not going to
notice the performance benefit of THP.

> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
> 

This almost certainly also requires the follow-on fix at
https://lkml.org/lkml/2012/11/26/225 for reasons I explained in
https://lkml.org/lkml/2012/11/27/190 .

> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size).  The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark.  The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
> 

Yes. 

> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
> 
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
> 

Not exactly sure of that. With just those patches it is possible for
THP allocations entering the slow path to keep kswapd continually awake
doing busy work. https://lkml.org/lkml/2012/11/12/151 was an alternative
to the revert that covered that case, but it was not enough because kswapd
would stay awake due to the bug you identified and fixed.

I went with the __GFP_NO_KSWAPD patch in this cycle because 3.6 was/is
very poor in how it handles THP after the removal of lumpy reclaim. 3.7
was shaping up to be even worse with multiple root causes too close to the
release date.  Taking kswapd out of the equation covered some of the
problems (yes, by hiding them) so it could be revisited but Johannes may
have finally squashed it.

However, if we revert the revert then I strongly recommend that it be
replaced with "Avoid waking kswapd for THP allocations when compaction is
deferred or contended".

-- 
Mel Gorman
SUSE Labs


Re: kswapd craziness in 3.7

2012-11-27 Thread Linus Torvalds
On Tue, Nov 27, 2012 at 2:26 PM, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
>>
>> Kswapd going crazy is certainly a large part of the problem.
>>
>> However, that leaves the issue of page_alloc.c waking up
>> kswapd when the system is not actually low on memory.
>>
>> Instead, kswapd is woken up because memory compaction failed,
>> potentially even due to lock contention during compaction!
>>
>> Ideally the allocation code would only wake up kswapd if
>> memory needs to be freed, or in order for kswapd to do
>> memory compaction (so the allocator does not have to).
>
> Maybe I missed something, but shouldn't this be solved with my patch?

Ok, guys. Cage fight!

The rules are simple: two men enter, one man leaves.

And the one who comes out gets to explain to me which patch(es) I
should apply, and which I should revert, if any.

My current guess is that I should apply the one Johannes just sent
("mm: vmscan: fix kswapd endless loop on higher order allocation")
after having added the cc to stable to it, and then revert the recent
revert (commit 82b212f40059).

But I await the Thunderdome. 

  Linus


Re: kswapd craziness in 3.7

2012-11-27 Thread Johannes Weiner
On Tue, Nov 27, 2012 at 05:02:36PM -0500, Rik van Riel wrote:
> On 11/27/2012 04:49 PM, Johannes Weiner wrote:
> >On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> >>On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >>>Note that in the meantime, I've also applied (through Andrew) the
> >>>patch that reverts commit c654345924f7 (see commit 82b212f40059
> >>>'Revert "mm: remove __GFP_NO_KSWAPD"').
> >>>
> >>>I wonder if that revert may be bogus, and a result of this same issue.
> >>>Maybe that revert should be reverted, and replaced with your patch?
> >>>
> >>>Mel? Zdenek? What's the status here?
> >>
> >>Mel posted several patches to fix the kswapd issue.  This one is
> >>slightly more risky than the outright revert, but probably preferred
> >>from a performance point of view:
> >>
> >>https://lkml.org/lkml/2012/11/12/151
> >>
> >>It works by skipping the kswapd wakeup for THP allocations, only
> >>if compaction is deferred or contended.
> >
> >Just to clarify, this would be a replacement strictly for the
> >__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
> >up for higher order allocations like THP.
> >
> >My patch is to fix how kswapd actually does higher order reclaim, and
> >it is required either way.
> >
> >[ But isn't the _reason_ why the "wake up kswapd more carefully for
> >   THP" patch was written kind of moot now since it was developed
> >   against a crazy kswapd?  It would certainly need to be re-evaluated.
> >   My (limited) testing didn't show any issues anymore with waking
> >   kswapd unconditionally once it's fixed. ]
> 
> Kswapd going crazy is certainly a large part of the problem.
> 
> However, that leaves the issue of page_alloc.c waking up
> kswapd when the system is not actually low on memory.
> 
> Instead, kswapd is woken up because memory compaction failed,
> potentially even due to lock contention during compaction!
> 
> Ideally the allocation code would only wake up kswapd if
> memory needs to be freed, or in order for kswapd to do
> memory compaction (so the allocator does not have to).

Maybe I missed something, but shouldn't this be solved with my patch?

The first scan over the zones finds the higher order watermark
breached, but the reclaim scan over the zones tests against order-0
(testorder) watermarks when compaction is suitable, i.e. no reclaim if
there are enough order-0 pages for compaction to work.  It should just
fall through to that zones_need_compaction condition at the end and
run compaction.

As such, it should always be appropriate to wake kswapd if allocations
fail.
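
The flow described above can be sketched roughly as follows. This is a toy model, not the kernel code: the struct, the enum, and the function are hypothetical, and the three flags are treated as independent inputs, whereas in a real zone they are derived from the same free page counts.

```c
#include <assert.h>
#include <stdbool.h>

/* What the toy kswapd decides to do for one zone. */
enum toy_action { TOY_SLEEP, TOY_RECLAIM, TOY_COMPACT };

/* Hypothetical snapshot of one zone's state. */
struct toy_zone_state {
    bool high_order_watermark_ok; /* watermark met at the requested order */
    bool order0_watermark_ok;     /* watermark met for order-0 pages */
    bool compaction_suitable;     /* compaction could work with current memory */
};

static enum toy_action toy_kswapd_step(const struct toy_zone_state *z,
                                       unsigned int order)
{
    unsigned int testorder;
    bool balanced;

    assert(order > 0); /* the sketch only covers higher order wakeups */

    /* First scan: nothing to do if the higher order watermark holds. */
    if (z->high_order_watermark_ok)
        return TOY_SLEEP;

    /* Reclaim scan: test against order-0 when compaction is suitable. */
    testorder = z->compaction_suitable ? 0 : order;
    balanced = (testorder == 0) ? z->order0_watermark_ok
                                : z->high_order_watermark_ok;
    if (!balanced)
        return TOY_RECLAIM;

    /* Enough order-0 pages exist: fall through and run compaction. */
    return TOY_COMPACT;
}
```

A fragmented zone with plenty of order-0 pages ends up in the compaction branch without any reclaim, which is the case the paragraph above argues for.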


Re: kswapd craziness in 3.7

2012-11-27 Thread Rik van Riel

On 11/27/2012 04:49 PM, Johannes Weiner wrote:
> On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
>> On 11/27/2012 03:58 PM, Linus Torvalds wrote:
>>> Note that in the meantime, I've also applied (through Andrew) the
>>> patch that reverts commit c654345924f7 (see commit 82b212f40059
>>> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>>>
>>> I wonder if that revert may be bogus, and a result of this same issue.
>>> Maybe that revert should be reverted, and replaced with your patch?
>>>
>>> Mel? Zdenek? What's the status here?
>>
>> Mel posted several patches to fix the kswapd issue.  This one is
>> slightly more risky than the outright revert, but probably preferred
>> from a performance point of view:
>>
>> https://lkml.org/lkml/2012/11/12/151
>>
>> It works by skipping the kswapd wakeup for THP allocations, only
>> if compaction is deferred or contended.
>
> Just to clarify, this would be a replacement strictly for the
> __GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
> up for higher order allocations like THP.
>
> My patch is to fix how kswapd actually does higher order reclaim, and
> it is required either way.
>
> [ But isn't the _reason_ why the "wake up kswapd more carefully for
>   THP" patch was written kind of moot now since it was developed
>   against a crazy kswapd?  It would certainly need to be re-evaluated.
>   My (limited) testing didn't show any issues anymore with waking
>   kswapd unconditionally once it's fixed. ]

Kswapd going crazy is certainly a large part of the problem.

However, that leaves the issue of page_alloc.c waking up
kswapd when the system is not actually low on memory.

Instead, kswapd is woken up because memory compaction failed,
potentially even due to lock contention during compaction!

Ideally the allocation code would only wake up kswapd if
memory needs to be freed, or in order for kswapd to do
memory compaction (so the allocator does not have to).

--
All rights reversed


Re: kswapd craziness in 3.7

2012-11-27 Thread Johannes Weiner
On Tue, Nov 27, 2012 at 04:16:52PM -0500, Rik van Riel wrote:
> On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> >Note that in the meantime, I've also applied (through Andrew) the
> >patch that reverts commit c654345924f7 (see commit 82b212f40059
> >'Revert "mm: remove __GFP_NO_KSWAPD"').
> >
> >I wonder if that revert may be bogus, and a result of this same issue.
> >Maybe that revert should be reverted, and replaced with your patch?
> >
> >Mel? Zdenek? What's the status here?
> 
> Mel posted several patches to fix the kswapd issue.  This one is
> slightly more risky than the outright revert, but probably preferred
> from a performance point of view:
> 
> https://lkml.org/lkml/2012/11/12/151
> 
> It works by skipping the kswapd wakeup for THP allocations, only
> if compaction is deferred or contended.

Just to clarify, this would be a replacement strictly for the
__GFP_NO_KSWAPD removal revert, to control how often kswapd is woken
up for higher order allocations like THP.

My patch is to fix how kswapd actually does higher order reclaim, and
it is required either way.

[ But isn't the _reason_ why the "wake up kswapd more carefully for
  THP" patch was written kind of moot now since it was developed
  against a crazy kswapd?  It would certainly need to be re-evaluated.
  My (limited) testing didn't show any issues anymore with waking
  kswapd unconditionally once it's fixed. ]


Re: kswapd craziness in 3.7

2012-11-27 Thread Johannes Weiner
On Tue, Nov 27, 2012 at 12:58:18PM -0800, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
> 
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?

The __GFP_NO_KSWAPD removal woke kswapd for THP reclaim and so it
exposed all these bugs that accumulated in there when higher order
kswapd reclaim was exercised less often.

The revert will hide the problem again, but doesn't make it go away
entirely, so I think we need my fix either way.

Whether you want to put the full THP weight back on the freshly fixed
higher order kswapd code for 3.7 is a different matter :-) At least we
would see quickly if it's still not working correctly...


Re: kswapd craziness in 3.7

2012-11-27 Thread Rik van Riel

On 11/27/2012 03:58 PM, Linus Torvalds wrote:
> Note that in the meantime, I've also applied (through Andrew) the
> patch that reverts commit c654345924f7 (see commit 82b212f40059
> 'Revert "mm: remove __GFP_NO_KSWAPD"').
>
> I wonder if that revert may be bogus, and a result of this same issue.
> Maybe that revert should be reverted, and replaced with your patch?
>
> Mel? Zdenek? What's the status here?

Mel posted several patches to fix the kswapd issue.  This one is
slightly more risky than the outright revert, but probably preferred
from a performance point of view:

https://lkml.org/lkml/2012/11/12/151

It works by skipping the kswapd wakeup for THP allocations, only
if compaction is deferred or contended.

--
All rights reversed


Re: kswapd craziness in 3.7

2012-11-27 Thread Linus Torvalds
Note that in the meantime, I've also applied (through Andrew) the
patch that reverts commit c654345924f7 (see commit 82b212f40059
'Revert "mm: remove __GFP_NO_KSWAPD"').

I wonder if that revert may be bogus, and a result of this same issue.
Maybe that revert should be reverted, and replaced with your patch?

Mel? Zdenek? What's the status here?

 Linus

On Tue, Nov 27, 2012 at 12:48 PM, Johannes Weiner wrote:
> Hi everyone,
>
> I hope I included everybody that participated in the various threads
> on kswapd getting stuck / exhibiting high CPU usage.  We were looking
> at at least three root causes as far as I can see, so it's not really
> clear who observed which problem.  Please correct me if the
> reported-by, tested-by, bisected-by tags are incomplete.
>
> One problem was, as it seems, overly aggressive reclaim due to scaling
> up reclaim goals based on compaction failures.  This one was reverted
> in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
> reclaim/compaction based on failures".
>
> Another one was an accounting problem where a freed higher order page
> was underreported, and so kswapd had trouble restoring watermarks.
> This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
> (appears like memory leak).
>
> The third one is a problem with small zones, like the DMA zone, where
> the high watermark is lower than the low watermark plus compaction gap
> (2 * allocation size).  The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark.  The problematic code has been
> there since the 3.4 merge window for non-THP higher order allocations
> but has been more prominent since the 3.7 merge window, where kswapd
> is also woken up for the much more common THP allocations.
>
> The following patch should fix the third issue by making both reclaim
> and compaction code in kswapd use the same predicate to determine
> whether a zone is balanced or not.
>
> Hopefully, the sum of all three fixes should tame kswapd enough for
> 3.7.
>
> Johannes
>


kswapd craziness in 3.7

2012-11-27 Thread Johannes Weiner
Hi everyone,

I hope I included everybody that participated in the various threads
on kswapd getting stuck / exhibiting high CPU usage.  We were looking
at at least three root causes as far as I can see, so it's not really
clear who observed which problem.  Please correct me if the
reported-by, tested-by, bisected-by tags are incomplete.

One problem was, as it seems, overly aggressive reclaim due to scaling
up reclaim goals based on compaction failures.  This one was reverted
in 9671009 mm: revert "mm: vmscan: scale number of pages reclaimed by
reclaim/compaction based on failures".

Another one was an accounting problem where a freed higher order page
was underreported, and so kswapd had trouble restoring watermarks.
This one was fixed in ef6c5be fix incorrect NR_FREE_PAGES accounting
(appears like memory leak).
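
To illustrate why underreporting a freed higher order page hurts, here is a minimal sketch. It assumes only that a free block of order n covers 2^n base pages; the function names and the bare counter are hypothetical, and the "underreported" variant is one possible shape of such a bug, not a copy of the code that was fixed.

```c
#include <assert.h>

/* Number of base pages covered by one free block of the given order. */
static unsigned long toy_pages_in_block(unsigned int order)
{
    assert(order < 11); /* illustrative upper bound on the order */
    return 1UL << order;
}

/* Correct accounting: credit every base page in the freed block. */
static void toy_account_free(unsigned long *nr_free, unsigned int order)
{
    *nr_free += toy_pages_in_block(order);
}

/* Underreporting: the freed block is credited as a single page. */
static void toy_account_free_underreported(unsigned long *nr_free,
                                           unsigned int order)
{
    (void)order; /* the order is ignored, which is the mistake */
    *nr_free += 1;
}
```

With the underreporting variant, freeing one order-9 block raises the counter by 1 instead of 512, so a watermark check based on that counter keeps seeing a shortage that does not exist.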

The third one is a problem with small zones, like the DMA zone, where
the high watermark is lower than the low watermark plus compaction gap
(2 * allocation size).  The zonelist reclaim in kswapd would do
nothing because all high watermarks are met, but the compaction logic
would find its own requirements unmet and loop over the zones again.
Indefinitely, until some third party would free enough memory to help
meet the higher compaction watermark.  The problematic code has been
there since the 3.4 merge window for non-THP higher order allocations
but has been more prominent since the 3.7 merge window, where kswapd
is also woken up for the much more common THP allocations.

The following patch should fix the third issue by making both reclaim
and compaction code in kswapd use the same predicate to determine
whether a zone is balanced or not.
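
The mismatch between the two checks can be modelled in a few lines. This is a toy sketch, not the patch itself: the struct, its fields, and all function names are hypothetical, and the watermark checks are reduced to plain comparisons on a free page count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a memory zone; field names are illustrative only. */
struct toy_zone {
    unsigned long free_pages;
    unsigned long low_wmark;
    unsigned long high_wmark;
};

/* Compaction gap as described above: 2 * allocation size, in pages. */
static unsigned long toy_compaction_gap(unsigned int order)
{
    assert(order < 11); /* illustrative upper bound on the order */
    return 2UL << order;
}

/* Reclaim side: the zone is fine once the high watermark is met. */
static bool toy_reclaim_balanced(const struct toy_zone *z)
{
    return z->free_pages >= z->high_wmark;
}

/* Compaction side: needs the low watermark plus the compaction gap. */
static bool toy_compaction_ready(const struct toy_zone *z, unsigned int order)
{
    return z->free_pages >= z->low_wmark + toy_compaction_gap(order);
}

/* The loop case: reclaim has nothing to do, compaction is unsatisfied. */
static bool toy_kswapd_would_spin(const struct toy_zone *z, unsigned int order)
{
    return toy_reclaim_balanced(z) && !toy_compaction_ready(z, order);
}

/* One shared predicate, so both sides agree on "balanced". */
static bool toy_zone_balanced(const struct toy_zone *z, unsigned int order)
{
    if (!toy_reclaim_balanced(z))
        return false;
    if (order > 0 && !toy_compaction_ready(z, order))
        return false;
    return true;
}
```

For a small zone with 50 free pages, a low watermark of 30 and a high watermark of 40, an order-9 request needs 30 + 1024 pages on the compaction side, so the two separate checks disagree and the toy kswapd would spin. With the shared predicate the zone is simply reported as not balanced for that order.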

Hopefully, the sum of all three fixes should tame kswapd enough for
3.7.

Johannes
