Bug#695182: Write couple of 1GB files for OOM crash

2013-01-21 Thread Jonathan Nieder
paul.sz...@sydney.edu.au wrote:

>   I am coming back to the idea that this is some
> signed-vs-unsigned or similar issue... though I could not find it yet!

Thanks for your work, and good luck.


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#695182: Write couple of 1GB files for OOM crash

2013-01-21 Thread paul . szabo
Dear Jonathan,

Thanks again for your help in writing "correct" and "acceptable"
patches. I did change a few things, hopefully for the better.

I decided not to "push" the drop-caches part of my patch; because it
now seems to me that it is not the "essence" of the issue: it protects
against OOM when writing a few files, but does not protect when running
a few sleeps. I am coming back to the idea that this is some
signed-vs-unsigned or similar issue... though I could not find it yet!

---

Using the amd64 kernel seems a "workable" workaround for the OOM issue.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#695182: Write couple of 1GB files for OOM crash

2013-01-10 Thread Jonathan Nieder
paul.sz...@sydney.edu.au wrote:
> Dear Jonathan,

>> ... once you have a reproducible test I imagine the mm folks will
>> already be very interested and they may be able to help ...
>
> But, I do already have a reproducible test! Write a few files, as per
> the initial message in this http://bugs.debian.org/695182 ; I also have
> a patch/solution/workaround for that particular test.
>
> Now I observed another way of making a machine with 64GB crash (sorry,
> not crash but to suffer an "OOM episode"). I am pretty sure this other
> "test" is reproducible, but is cumbersome to set up and tedious to do.

Right.  So please contact the mm folks (cc-ing us). :)  They know much
more about this stuff than we do.

Thanks,
Jonathan


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#695182: Write couple of 1GB files for OOM crash

2013-01-10 Thread paul . szabo
Dear Jonathan,

> ... once you have a reproducible test I imagine the mm folks will
> already be very interested and they may be able to help ...

But, I do already have a reproducible test! Write a few files, as per
the initial message in this http://bugs.debian.org/695182 ; I also have
a patch/solution/workaround for that particular test.

Now I observed another way of making a machine with 64GB crash (sorry,
not crash but to suffer an "OOM episode"). I am pretty sure this other
"test" is reproducible, but is cumbersome to set up and tedious to do.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#695182: Write couple of 1GB files for OOM crash

2013-01-09 Thread Jonathan Nieder
paul.sz...@sydney.edu.au wrote:

> So, there is a bug still. I will try to make up some easily reproducible
> test, and if can provoke OOM then look again into kernel code.

Ok, thanks for the update.

BTW, once you have a reproducible test I imagine the mm folks will
already be very interested and they may be able to help out by
answering your questions and indicating what's likely and unlikely to
work in fixing it.

Actually, that goes even if you don't have a reproducible test. ;-)

Thanks for your persistence working on this so far, btw.

Jonathan


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#695182: Write couple of 1GB files for OOM crash

2013-01-09 Thread paul . szabo
I am slowly coming around to the idea of splitting my patch up into many
parts, each addressing just one little issue, so there would be a couple
of [PATCH]-es and some [RFC]-s, each profusely explained. Unfortunately
did not yet have time to work on this; but want to do, hopefully soon.

However... we now did another test on the server, and it ran into OOM.
We were load-testing. We intend to use the server for student logins,
via XDMCP from any one of 109 X-terminals (similar to LTSP); we logged
in on all our terminals, and on each ran a firefox and an rstudio. The
server ran into OOM, having plenty of free memory but having exhausted
lowmem, after 80 logins. Then rebooted the server with mem=32G on the
kernel, and there was no OOM after logging on all our "fake" students.

So, there is a bug still. I will try to make up some easily reproducible
test, and if can provoke OOM then look again into kernel code.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Bug#695182: Write couple of 1GB files for OOM crash

2013-01-06 Thread Ben Hutchings
On Mon, 2013-01-07 at 14:03 +1100, paul.sz...@sydney.edu.au wrote:
> Dear Jonathan,
> 
> > Here's a quick review from a non-expert (i.e., me). ...
> 
> Thanks!
> 
> > ... When the patch
> > seems ready to you, please send it inline in a message to
> > linux...@kvack.org, cc-ing linux-ker...@vger.kernel.org and Ben and me
> > (or this bug log) so we can track it.  The mm folks will probably have
> > more feedback, and we can take it from there.
> 
> I was hoping that Ben would take that on; but OK, will do (first waiting
> for maybe Ben's comments, and your reply to my questions below).
[...]

I think Jonathan's covered everything.

Ben.

-- 
Ben Hutchings
If you seem to know what you are doing, you'll be given more to do.


signature.asc
Description: This is a digitally signed message part


Bug#695182: Write couple of 1GB files for OOM crash

2013-01-06 Thread Jonathan Nieder
paul.sz...@sydney.edu.au wrote:

>>> +/* Easy call: do "echo 3 > /proc/sys/vm/drop_caches" */
>>
>> I don't understand this comment.  What does "Easy call" mean?
>>
>>> +void easy_drop_caches(void)
[...]
>> This should be declared in some appropriate header (e.g.,
>> include/linux/mm.h).
>
> I added one new call, so I could easily call it from elsewhere without
> having to (cross?)-declare many things; in fact would have preferred to
> call iterate_supers() and drop_slab() directly.
>
> Yes, easy_drop_caches() might be declared in some include file, but
> should that be  or ? Would add another file to
> those changed, seemed more direct this way.

Remember that once your patch is applied, it is part of the Linux
kernel that other people develop against.  So it is best to optimize
for readability of the result, not readability of the patch.  Another
file changed is not a big deal.

The existing drop_caches functionality is declared in linux/mm.h, so I
think this new function would be easiest to find there.

> Should I change something?

Please clarify the comment to explain what the function is meant
to do.  It will help future readers.

[...]
>>> +   /* Subtract min_free_kbytes */
>>> +   if (min_free_kbytes > 0)
>>> +   y = min_free_kbytes >> (PAGE_SHIFT - 10);
>>
>> Why the "if"?
>
> Sanity-check paranoia: because I do not trust the input values. Maybe
> should have
>   BUG_ON(min_free_kbytes < 0);
> but that does not belong there.

Elsewhere the code assumes that min_free_kbytes is nonnegative.  If
you want to check that assertion here, that's fine.  An assertion test
would be spelled as

if (!WARN_ON(min_free_kbytes < 0))

since it's possible for the kernel to recover for additional debugging
when it fails.

[...]
>>> setpoint = (freerun + limit) / 2;
>>> -   x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
>>> +   x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
>>> limit - setpoint + 1);
>>
>> Why are 'setpoint' and 'dirty' of type unsigned long instead of long
>> in the first place?  What happens if one of their values overflows a
>> long?
>
> I guess because most other such things are also unsigned long; and I
> guess they were so when they meant memory addresses. These are now page
> counts, cannot overflow (with current limit of 64GB RAM).

This could be worth a comment and WARN_ON to document the assumption
so other readers don't have to be distracted by the same question.
Something like

/*
 * Since memory addresses fit in an unsigned long, page
 * counts cannot overflow a "long" and a cast to s64 is safe.
 */
WARN_ON(setpoint > LONG_MAX || dirty > LONG_MAX);
BUILD_BUG_ON(sizeof(long) > sizeof(s64));

x = div_s64(((s64)setpoint - ...

[...]
>>> --- mm/vmscan.c.old 2012-10-17 13:50:15.0 +1100
>>> +++ mm/vmscan.c 2013-01-06 09:50:49.0 +1100
>> [...]
>>> @@ -2726,9 +2731,87 @@ loop_again:
>>> nr_slab = shrink_slab(&shrink, sc.nr_scanned, 
>>> lru_pages);
>>> sc.nr_reclaimed += 
>>> reclaim_state->reclaimed_slab;
>>> total_scanned += sc.nr_scanned;
>>> +   if (unlikely(
>>> +   i == 1 &&
>>> +   nr_slab < 10 &&
>>> +   (reclaim_state->reclaimed_slab) < 10 &&
>>> +   zone_page_state(zone, NR_SLAB_RECLAIMABLE) 
>>> > 10 &&
[...]
>> This is getting really deeply nested.  Would it be possible to split
>> out a function so this code could be more easily contemplated in
>> isolation?
>
> Hmm... I would much prefer to leave it as is.

Can you say a little about why?  At this point in the function, the
context is

static unsigned long balance_pgdat()
{
...
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
...
for (i = 0; i <= end_zone; i++) {
...
if (!zone_watermark_ok_safe(zone, order, ...) {
...
if (unlikely(i == 1 && ...) {

It is easy to lose track at this point in a long function.  As
Documentation/CodingStyle explains:

Now, some people will claim that having 8-character indentations makes
the code move too far to the right, and makes it hard to read on a
80-character terminal screen.  The answer to that is that if you need
more than 3 levels of indentation, you're screwed anyway, and should fix
your program.

and:

Functions should be short and sweet, and do just one thing.  They should
fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
as we all know), and do one thing and do that well.

The maximum length of a f

Bug#695182: Write couple of 1GB files for OOM crash

2013-01-06 Thread paul . szabo
Dear Jonathan,

> Here's a quick review from a non-expert (i.e., me). ...

Thanks!

> ... When the patch
> seems ready to you, please send it inline in a message to
> linux...@kvack.org, cc-ing linux-ker...@vger.kernel.org and Ben and me
> (or this bug log) so we can track it.  The mm folks will probably have
> more feedback, and we can take it from there.

I was hoping that Ben would take that on; but OK, will do (first waiting
for maybe Ben's comments, and your reply to my questions below).

>> Also included are several minor fixes:
> Those should go in separate patches, but if you're just seeking
> comments then lumping them with the main change in a patch with
> [RFC/PATCH] in the subject line is fine.

In a sense, I am seeking comments on why lowmem is ever polluted with
caches and why slabs are not cleared up without dropping pagecache, as
I feel that the "real bug" is lurking there. - I admit that I do not
understand MM, it seems a mess to my un-educatedf eyes. - Some of those
comments are questions, some others I am sure are bugs.

OK, will use [RFC/PATCH] in the subject.

>> +/* Easy call: do "echo 3 > /proc/sys/vm/drop_caches" */
>
> I don't understand this comment.  What does "Easy call" mean?
>
>> +void easy_drop_caches(void)
>> +{
>> +iterate_supers(drop_pagecache_sb, NULL);
>> +drop_slab();
>> +}
>
> This should be declared in some appropriate header (e.g.,
> include/linux/mm.h).

I added one new call, so I could easily call it from elsewhere without
having to (cross?)-declare many things; in fact would have preferred to
call iterate_supers() and drop_slab() directly.

Yes, easy_drop_caches() might be declared in some include file, but
should that be  or ? Would add another file to
those changed, seemed more direct this way.

Should I change something?

>> --- mm/page-writeback.c.old  2012-10-17 13:50:15.0 +1100
>> +++ mm/page-writeback.c  2013-01-06 21:54:59.0 +1100
>> @@ -39,7 +39,8 @@
>>  /*
>>   * Sleep at most 200ms at a time in balance_dirty_pages().
>>   */
>> -#define MAX_PAUSE   max(HZ/5, 1)
>> +/* Might as well be max(HZ/5,4) to ensure max_pause/4>0 always */
>> +#define MAX_PAUSE   max(HZ/5, 4)
>
> This is one of the "while at it"s rather than the main point of
> the patch, right?

Yes, it was one of the unimportant oddities I noticed.

> [...]
>> @@ -343,11 +344,26 @@ static unsigned long highmem_dirtyable_m
>>  unsigned long determine_dirtyable_memory(void)
>>  {
>>  unsigned long x;
>> +int y = 0;
>> +extern int min_free_kbytes;
>
> Probably should be declared in a header like mm/internal.h, but
> there's precedent for the ad-hoc declaration, so meh.

Yes, I noticed some precedents.

> [...]
>> +/*
>> + * Seems that highmem_is_dirtyable is only used here, in the
>> + * calculation of limits and threshholds of dirtiness, not in deciding
>> + * where to put dirty things. Is that so? Is that as should be?
>> + * What is the recommended setting of highmem_is_dirtyable?
>> + */
>>  if (!vm_highmem_is_dirtyable)
>>  x -= highmem_dirtyable_memory(x);
>
> I dunno.  See 195cf453d2c3 (mm/page-writeback: highmem_is_dirtyable
> option, 2008-02-04) and the thread surrounding
> .

That made my head spin... Thanks for the pointers, will go through them
sometime.

>> +/* Subtract min_free_kbytes */
>> +if (min_free_kbytes > 0)
>> +y = min_free_kbytes >> (PAGE_SHIFT - 10);
>
> Why the "if"?

Sanity-check paranoia: because I do not trust the input values. Maybe
should have
  BUG_ON(min_free_kbytes < 0);
but that does not belong there.

> If I were doing it, I think I'd unconditionally set 'y' here, for
> clarity and to avoid bugs if some later patch reuses the var for
> something else.
>
>   y = min_free_kbytes >> (PAGE_SHIFT - 10);
>   x -= min(x, y);

Do you think I should change my patch? To me, it seemed clearer the way 
I had it.

> [...]
>> @@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
>>   * => fast response on large errors; small oscillation near setpoint
>>   */
>>  setpoint = (freerun + limit) / 2;
>> -x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
>> +x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
>>  limit - setpoint + 1);
>
> Why are 'setpoint' and 'dirty' of type unsigned long instead of long
> in the first place?  What happens if one of their values overflows a
> long?

I guess because most other such things are also unsigned long; and I
guess they were so when they meant memory addresses. These are now page
counts, cannot overflow (with current limit of 64GB RAM).

>> @@ -995,6 +1014,13 @@ static unsigned long bdi_max_pause(struc
>>   * The pause time will be settled within range (max_pause/4, max_pause).
>>   * Apply a minimal value of 4 to get a non-zero max_pause/4.
>>   */
>> +/*
>> + * O

Bug#695182: Write couple of 1GB files for OOM crash

2013-01-06 Thread Jonathan Nieder
Hi Paul,

Paul Szabo wrote:

> I changed the proposed patch accordingly, scripts/checkpatch.pl produces
> just a few warnings. I had my patch in use for a while now, so I believe
> it is suitably tested.

Thanks much --- this is easier to read.

Here's a quick review from a non-expert (i.e., me).  When the patch
seems ready to you, please send it inline in a message to
linux...@kvack.org, cc-ing linux-ker...@vger.kernel.org and Ben and me
(or this bug log) so we can track it.  The mm folks will probably have
more feedback, and we can take it from there.

> Avoid OOM when filesystem caches fill lowmem and are not reclaimed,
> doing drop_caches at that point. The issue is easily reproducible on
> machines with over 32GB RAM. The patch correctly protects against OOM.
> The added call to drop_caches has been observed to trigger "needlessly"
> but on quite rare occasions only.

Sounds like a reasonable strategy.

> Also included are several minor fixes:

Those should go in separate patches, but if you're just seeking
comments then lumping them with the main change in a patch with
[RFC/PATCH] in the subject line is fine.

[...]
> Signed-off-by: Paul Szabo 

Thanks.

> --- fs/drop_caches.c.old  2012-10-17 13:50:15.0 +1100
> +++ fs/drop_caches.c  2013-01-04 21:52:47.0 +1100
> @@ -65,3 +65,10 @@ int drop_caches_sysctl_handler(ctl_table
>   }
>   return 0;
>  }
> +
> +/* Easy call: do "echo 3 > /proc/sys/vm/drop_caches" */

I don't understand this comment.  What does "Easy call" mean?

> +void easy_drop_caches(void)
> +{
> + iterate_supers(drop_pagecache_sb, NULL);
> + drop_slab();
> +}

This should be declared in some appropriate header (e.g.,
include/linux/mm.h).

> --- mm/page-writeback.c.old   2012-10-17 13:50:15.0 +1100
> +++ mm/page-writeback.c   2013-01-06 21:54:59.0 +1100
> @@ -39,7 +39,8 @@
>  /*
>   * Sleep at most 200ms at a time in balance_dirty_pages().
>   */
> -#define MAX_PAUSEmax(HZ/5, 1)
> +/* Might as well be max(HZ/5,4) to ensure max_pause/4>0 always */
> +#define MAX_PAUSEmax(HZ/5, 4)

This is one of the "while at it"s rather than the main point of
the patch, right?

[...]
> @@ -343,11 +344,26 @@ static unsigned long highmem_dirtyable_m
>  unsigned long determine_dirtyable_memory(void)
>  {
>   unsigned long x;
> + int y = 0;
> + extern int min_free_kbytes;

Probably should be declared in a header like mm/internal.h, but
there's precedent for the ad-hoc declaration, so meh.

[...]
> + /*
> +  * Seems that highmem_is_dirtyable is only used here, in the
> +  * calculation of limits and threshholds of dirtiness, not in deciding
> +  * where to put dirty things. Is that so? Is that as should be?
> +  * What is the recommended setting of highmem_is_dirtyable?
> +  */
>   if (!vm_highmem_is_dirtyable)
>   x -= highmem_dirtyable_memory(x);

I dunno.  See 195cf453d2c3 (mm/page-writeback: highmem_is_dirtyable
option, 2008-02-04) and the thread surrounding
.

> + /* Subtract min_free_kbytes */
> + if (min_free_kbytes > 0)
> + y = min_free_kbytes >> (PAGE_SHIFT - 10);

Why the "if"?

If I were doing it, I think I'd unconditionally set 'y' here, for
clarity and to avoid bugs if some later patch reuses the var for
something else.

y = min_free_kbytes >> (PAGE_SHIFT - 10);
x -= min(x, y);

[...]
> @@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
>* => fast response on large errors; small oscillation near setpoint
>*/
>   setpoint = (freerun + limit) / 2;
> - x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> + x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
>   limit - setpoint + 1);

Why are 'setpoint' and 'dirty' of type unsigned long instead of long
in the first place?  What happens if one of their values overflows a
long?

>   pos_ratio = x;
>   pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> @@ -995,6 +1014,13 @@ static unsigned long bdi_max_pause(struc
>* The pause time will be settled within range (max_pause/4, max_pause).
>* Apply a minimal value of 4 to get a non-zero max_pause/4.
>*/
> + /*
> +  * On large machine it seems we always return 4,
> +  * on smaller desktop machine mostly return 5 (rarely 9 or 14).
> +  * Are those too small? Should we return something fixed e.g.
> + return (HZ/10);
> +  * instead of this wasted/useless calculation?
> +  */
>   return clamp_val(t, 4, MAX_PAUSE);

Another "while at it", I guess.

>  }
>  
> @@ -1109,6 +1135,11 @@ static void balance_dirty_pages(struct a
>   }
>   pause = HZ * pages_dirtied / task_ratelimit;
>   if (unlikely(pause <= 0)) {
> + /*
> +  * Not unlikely: often we get zero.
> +