Re: [RFC] Reproducible OOM with partial workaround
Dear Andrew,

>>> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>> Please see below ...
> ... Was this dump taken when the system was at or near oom?

No, that was a "quiescent" machine. Please see a just-before-OOM dump in
my next message (in a little while).

> Please send a copy of the oom-killer kernel message dump, if you still
> have one.

Please see one in next message, or in http://bugs.debian.org/695182

>> I tried setting dirty_ratio to "funny" values, that did not seem to
>> help.
> Did you try setting it as low as possible?

Probably. Maybe. Sorry, cannot say with certainty.

>> Did you notice my patch about bdi_position_ratio(), how it was
>> plain wrong half the time (for negative x)?
> Nope, please resend.

Quoting from
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182 :

...
 - In bdi_position_ratio() get the difference (setpoint - dirty) right
   even when it is negative, which happens often. Normally these numbers
   are "small" and even with the left-shift I never observed a 32-bit
   overflow. I believe it should be possible to re-write the whole
   function in 32-bit ints; maybe it is not worth the effort to make it
   "efficient"; seeing how this function was always wrong and we
   survived, it should simply be removed.
...

--- mm/page-writeback.c.old	2012-10-17 13:50:15.0 +1100
+++ mm/page-writeback.c	2013-01-06 21:54:59.0 +1100

[ Line numbers out because other patches not shown ]

...
@@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
 	 * => fast response on large errors; small oscillation near setpoint
 	 */
 	setpoint = (freerun + limit) / 2;
-	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
 		    limit - setpoint + 1);
 	pos_ratio = x;
 	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
...
Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Reproducible OOM with partial workaround
On Fri, 11 Jan 2013 22:51:35 +1100 paul.sz...@sydney.edu.au wrote:

> Dear Andrew,
>
> > Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>
> Please see below: I do not know what any of that means. This machine has
> been running just fine, with all my users logging in here via XDMCP from
> X-terminals, dozens logged in simultaneously. (But, I think I could make
> it go OOM with more processes or logins.)

I'm counting 107MB in slab there. Was this dump taken when the system
was at or near oom?

Please send a copy of the oom-killer kernel message dump, if you still
have one.

> > If so, you *may* be able to work around this by setting
> > /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> > amount of dirty pagecache around. Then, with luck, if we haven't
> > broken the buffer_heads_over_limit logic in the past decade (we
> > probably have), the VM should be able to reclaim those buffer_heads.
>
> I tried setting dirty_ratio to "funny" values, that did not seem to
> help.

Did you try setting it as low as possible?

> Did you notice my patch about bdi_position_ratio(), how it was
> plain wrong half the time (for negative x)?

Nope, please resend.

> Anyway that did not help.
>
> > Alternatively, use a filesystem which doesn't attach buffer_heads to
> > dirty pages. xfs or btrfs, perhaps.
>
> Seems there is also a problem not related to filesystem... or rather,
> the essence does not seem to be filesystem or caches. The filesystem
> thing now seems OK with my patch doing drop_caches.

hm, if doing a regular drop_caches fixes things then that implies the
problem is not with dirty pagecache. Odd.
Re: [RFC] Reproducible OOM with partial workaround
On 01/10/2013 05:46 PM, paul.sz...@sydney.edu.au wrote:
>> ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
>> kernel without either violating the ABI (3GB/1GB split) or doing
>> something that never got merged upstream ...
>
> Sorry to be so contradictory:
>
> psz@como:~$ uname -a
> Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
> 18:34:25 EST 2013 i686 GNU/Linux
> psz@como:~$ free -l
>              total       used       free     shared    buffers     cached
> Mem:      64446900    4729292   59717608          0      15972     480520
> Low:        375836     304400      71436
> High:     64071064    4424892   59646172
> -/+ buffers/cache:    4232800   60214100
> Swap:    134217724          0  134217724

Hey, that's pretty cool! I would swear that the mem_map[] overhead was
such that they wouldn't boot, but perhaps those brain cells died on me.
Re: [RFC] Reproducible OOM with partial workaround
Dear Andrew,

> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

Please see below: I do not know what any of that means. This machine has
been running just fine, with all my users logging in here via XDMCP from
X-terminals, dozens logged in simultaneously. (But, I think I could make
it go OOM with more processes or logins.)

> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around. Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.

I tried setting dirty_ratio to "funny" values, that did not seem to
help. Did you notice my patch about bdi_position_ratio(), how it was
plain wrong half the time (for negative x)? Anyway that did not help.

> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages. xfs or btrfs, perhaps.

Seems there is also a problem not related to filesystem... or rather,
the essence does not seem to be filesystem or caches. The filesystem
thing now seems OK with my patch doing drop_caches.
Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia

---

root@como:~# free -lm
             total       used       free     shared    buffers     cached
Mem:         62936       2317      60618          0         41        635
Low:           367        271         95
High:        62569       2045      60523
-/+ buffers/cache:       1640      61295
Swap:       131071          0     131071
root@como:~# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fuse_request             0      0    376   43    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode               0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                  0      0    288   28    2 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache     0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache         0      0    176   46    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_direct_cache         0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
nfs_inode_cache       5404   5404    584   28    4 : tunables    0    0    0 : slabdata    193    193      0
isofs_inode_cache        0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache          0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache                0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record       0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle        5440   5440     24  170    1 : tunables    0    0    0 : slabdata     32     32      0
journal_head         16768  16768     64   64    1 : tunables    0    0    0 : slabdata    262    262      0
revoke_record        20224  20224     16  256    1 : tunables    0    0    0 : slabdata     79     79      0
ext4_inode_cache         0      0    584   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data           0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context  0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space      0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end              0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page             0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache         0      0    480   34    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache     16531  19965    488   33    4 : tunables    0    0    0 : slabdata    605    605      0
ext3_xattr               0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                  840    840    192   42    2 : tunables    0    0    0 : slabdata     20     20      0
rpc_inode_cache        144    144    448   36    4 : tunables    0    0    0 : slabdata      4      4      0
UDP-Lite                 0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache           0      0    320   51    4 : tunables    0    0    0 : slabdata      0      0      0
UDP                    896    896    576   28    4 : tunables    0    0    0 : slabdata     32     32      0
tw_sock_TCP           1344   1344    128   32    1 : tunables    0    0    0 : slabdata     42     42      0
TCP                   1457   1624   1152   28
Re: [RFC] Reproducible OOM with partial workaround
On Fri, 2013-01-11 at 00:01 -0800, Andrew Morton wrote:
> On Fri, 11 Jan 2013 12:46:15 +1100 paul.sz...@sydney.edu.au wrote:
>
> > > ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> > > kernel without either violating the ABI (3GB/1GB split) or doing
> > > something that never got merged upstream ...
> >
> > Sorry to be so contradictory:
> >
> > psz@como:~$ uname -a
> > Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
> > 18:34:25 EST 2013 i686 GNU/Linux
> > psz@como:~$ free -l
> >              total       used       free     shared    buffers     cached
> > Mem:      64446900    4729292   59717608          0      15972     480520
> > Low:        375836     304400      71436
> > High:     64071064    4424892   59646172
> > -/+ buffers/cache:    4232800   60214100
> > Swap:    134217724          0  134217724
> > psz@como:~$
> >
> > (though I would not know about violations).
> >
> > But OK, I take your point that I should move with the times.
>
> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>
> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around. Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.
>
> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages. xfs or btrfs, perhaps.

Hi Andrew,

What's the meaning of attaching buffer_heads to dirty pages?
Re: [RFC] Reproducible OOM with partial workaround
On Fri, 11 Jan 2013 12:46:15 +1100 paul.sz...@sydney.edu.au wrote:

> > ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> > kernel without either violating the ABI (3GB/1GB split) or doing
> > something that never got merged upstream ...
>
> Sorry to be so contradictory:
>
> psz@como:~$ uname -a
> Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
> 18:34:25 EST 2013 i686 GNU/Linux
> psz@como:~$ free -l
>              total       used       free     shared    buffers     cached
> Mem:      64446900    4729292   59717608          0      15972     480520
> Low:        375836     304400      71436
> High:     64071064    4424892   59646172
> -/+ buffers/cache:    4232800   60214100
> Swap:    134217724          0  134217724
> psz@como:~$
>
> (though I would not know about violations).
>
> But OK, I take your point that I should move with the times.

Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

If so, you *may* be able to work around this by setting
/proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
amount of dirty pagecache around. Then, with luck, if we haven't
broken the buffer_heads_over_limit logic in the past decade (we
probably have), the VM should be able to reclaim those buffer_heads.

Alternatively, use a filesystem which doesn't attach buffer_heads to
dirty pages. xfs or btrfs, perhaps.
Re: [RFC] Reproducible OOM with partial workaround
Dear Dave,

> ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> kernel without either violating the ABI (3GB/1GB split) or doing
> something that never got merged upstream ...

Sorry to be so contradictory:

psz@como:~$ uname -a
Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5
18:34:25 EST 2013 i686 GNU/Linux
psz@como:~$ free -l
             total       used       free     shared    buffers     cached
Mem:      64446900    4729292   59717608          0      15972     480520
Low:        375836     304400      71436
High:     64071064    4424892   59646172
-/+ buffers/cache:    4232800   60214100
Swap:    134217724          0  134217724
psz@como:~$

(though I would not know about violations).

But OK, I take your point that I should move with the times.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia
Re: [RFC] Reproducible OOM with partial workaround
On 01/10/2013 04:46 PM, paul.sz...@sydney.edu.au wrote:
> > Your configuration has never worked. This isn't a regression ...
> > ... does not mean that we expect it to work.
>
> Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
> that all development is for 64-bit only?

My last 4GB laptop had a 1GB hole and needed HIGHMEM64G since it had RAM
at 0->5GB. That worked just fine, btw. The problem isn't with
HIGHMEM64G itself. I'm not saying HIGHMEM64G is inherently bad, just
that it gets gradually worse and worse as you add more RAM. I don't
believe 64GB of RAM has _ever_ been booted on a 32-bit kernel without
either violating the ABI (3GB/1GB split) or doing something that never
got merged upstream (that 4GB/4GB split, or other fun stuff like page
clustering).

> I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
> no problem under but OOM just over; whereas I would have expected
> lowmem starvation to be gradual, with OOM occurring much sooner with
> 64GB than with 34GB. Also, the kernel seems capable of reclaiming
> lowmem, so I wonder why does that fail just over the 32GB threshold.
> (Obviously I have no idea what I am talking about.)

It _is_ puzzling. It isn't immediately obvious to me why the slab that
you have isn't being reclaimed. There might, indeed, be a fixable bug
there. But, there are probably a bunch more bugs which will keep you
from having a nice, smoothly-running system; mostly those bugs have not
had much attention in the 10 years or so since 64-bit x86 became
commonplace.

Plus, even 10 years ago, when folks were working on this actively, we
_never_ got things running smoothly on 32GB of RAM. Take a look at this:

http://support.bull.com/ols/product/system/linux/redhat/help/kbf/g/inst/PrKB11417

You are effectively running the "SMP kernel" (hugemem is a completely
different beast). I had a 32GB i386 system. It was a really, really
fun system to play with, and its never-ending list of bugs helped keep
me employed for several years. You don't want to unnecessarily inflict
that pain on yourself, really.
Re: [RFC] Reproducible OOM with partial workaround
Dear Dave,

> Your configuration has never worked. This isn't a regression ...
> ... does not mean that we expect it to work.

Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
that all development is for 64-bit only?

> ... 64-bit kernels should basically be drop-in replacements ...

Will think about that. I know all my servers are 64-bit capable, will
need to check all my desktops.

---

I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
no problem under but OOM just over; whereas I would have expected
lowmem starvation to be gradual, with OOM occurring much sooner with
64GB than with 34GB. Also, the kernel seems capable of reclaiming
lowmem, so I wonder why does that fail just over the 32GB threshold.
(Obviously I have no idea what I am talking about.)

---

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia
Re: [RFC] Reproducible OOM with partial workaround
On 01/10/2013 01:58 PM, paul.sz...@sydney.edu.au wrote:
> I developed a workaround patch for this particular OOM demo, dropping
> filesystem caches when about to exhaust lowmem. However, subsequently
> I observed OOM when running many processes (as yet I do not have an
> easy-to-reproduce demo of this); so as I suspected, the essence of the
> problem is not with FS caches.
>
> Could you please help in finding the cause of this OOM bug?

As was mentioned in the bug, your 32GB of physical memory only ends up
giving ~900MB of low memory to the kernel. Of that, around 600MB is
used for "mem_map[]", leaving only about 300MB available to the kernel
for *ALL* of its allocations at runtime.

Your configuration has never worked. This isn't a regression, it's
simply something that we know never worked in Linux, and it's a very
hard problem to solve. One Linux vendor (at least) went to a huge
amount of trouble to develop, ship, and support a kernel that handled
large 32-bit machines, but it was never merged upstream and work
stopped on it when such machines became rare beasts:

http://lwn.net/Articles/39925/

I believe just about any Linux vendor would call your configuration
"unsupported". Just because the kernel can boot does not mean that we
expect it to work.

It's possible that some tweaks of the vm knobs (like lowmem_reserve)
could help you here. But, really, you don't want to run a 32-bit
kernel on such a large machine. Very, very few folks are running
32-bit kernels on these systems and you're likely to keep running in to
bugs because this is such a rare configuration. We've been very
careful to ensure that 64-bit kernels should basically be drop-in
replacements for 32-bit ones. You can keep userspace 100% 32-bit, and
just have a 64-bit kernel.

If you're really set on staying 32-bit, I might have a NUMA-Q I can
give you. ;)