Re: CFS scheduler unfairly prefers pinned tasks

2015-10-11 Thread paul . szabo
I wrote:

  The Linux CFS scheduler prefers pinned tasks and unfairly
  gives more CPU time to tasks that have set CPU affinity.
  ...
  I believe I have now solved the problem, simply by setting:
    for n in /proc/sys/kernel/sched_domain/cpu*/domain0/min_interval; do echo 0 > $n; done
    for n in /proc/sys/kernel/sched_domain/cpu*/domain0/max_interval; do echo 1 > $n; done

Testing with real-life jobs, I found I needed min_- and max_interval for
domain1 also, and a couple of other non-default values, so:

  for n in /proc/sys/kernel/sched_domain/cpu*/dom*/min_interval; do echo 0 > $n; done
  for n in /proc/sys/kernel/sched_domain/cpu*/dom*/max_interval; do echo 1 > $n; done
  echo 10 > /proc/sys/kernel/sched_latency_ns
  echo 10 > /proc/sys/kernel/sched_min_granularity_ns
  echo 1 >  /proc/sys/kernel/sched_wakeup_granularity_ns

and then things seem fair and my users are happy.
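The sequence above can be sketched in Python with the write plan separated from the writes themselves. The paths and values are quoted verbatim from the commands above, but the sched_domain tree is kernel- and config-dependent, so the globs may match nothing (or different domains) on other systems:

```python
import glob

# Tunables exactly as set above (sched_domain layout as seen on the
# 3.16-era kernel discussed here; values quoted verbatim).
SETTINGS = [
    ("/proc/sys/kernel/sched_domain/cpu*/dom*/min_interval", "0"),
    ("/proc/sys/kernel/sched_domain/cpu*/dom*/max_interval", "1"),
    ("/proc/sys/kernel/sched_latency_ns", "10"),
    ("/proc/sys/kernel/sched_min_granularity_ns", "10"),
    ("/proc/sys/kernel/sched_wakeup_granularity_ns", "1"),
]

def planned_writes(settings, expand=glob.glob):
    """Expand each glob pattern and pair every matching path with its value."""
    plan = []
    for pattern, value in settings:
        paths = expand(pattern) if "*" in pattern else [pattern]
        plan.extend((path, value) for path in paths)
    return plan

def apply_plan(plan):
    """Perform the writes; needs root, just like the shell loops."""
    for path, value in plan:
        with open(path, "w") as f:
            f.write(value)
```

Building the plan first makes it easy to inspect or log what would be written before touching /proc.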

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: disable task group re-weighting on the desktop

2015-10-11 Thread paul . szabo
Dear Mike,

Did you check whether setting min_- and max_interval e.g. as per
  https://lkml.org/lkml/2015/10/11/34
would help with your issue (instead of your "horrible gs destroying"
patch)?

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: CFS scheduler unfairly prefers pinned tasks

2015-10-11 Thread paul . szabo
I wrote:

  The Linux CFS scheduler prefers pinned tasks and unfairly
  gives more CPU time to tasks that have set CPU affinity.

I believe I have now solved the problem, simply by setting:

  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/min_interval; do echo 0 > $n; done
  for n in /proc/sys/kernel/sched_domain/cpu*/domain0/max_interval; do echo 1 > $n; done

I am not sure what the domain1 values are for (I see they exist on my
4*E5-4627v2 server). So far I see no negative effects from these
(extreme?) settings. (An explanation of what these knobs are meant for,
or pointers to documentation, would be appreciated.)

---

Thanks for the insightful discussion.

(Scary, isn't it?)

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: [patch] sched: disable task group re-weighting on the desktop

2015-10-11 Thread paul . szabo
Dear Mike,

> ... so yes, un-related.

Thanks for clarifying.

> I haven't seen the problem you reported. ...

You mean you chose not to reproduce it: you persisted in pinning your
perts, whereas the problem was stated with un-pinned perts (and pinned
oinks). But that is OK... others did reproduce it, and anyway I believe
I have now fixed my problem. (The solution is in that "other" email thread.)

Cheers,

Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: [patch] sched: disable task group re-weighting on the desktop

2015-10-10 Thread paul . szabo
Dear Mike,

You CCed me on this patch. Is that because you expect this to solve "my"
problem also? You had some measurements of many oinks vs many perts or
vs "desktop", but not many oinks vs 1 or 2 perts as per my "complaint". 
You also changed the subject line, so maybe this is all un-related.

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: CFS scheduler unfairly prefers pinned tasks

2015-10-08 Thread paul . szabo
Dear Mike,

>>> I see a fairness issue ... but one opposite to your complaint.
>> Why is that opposite? ...
>
> Well, not exactly opposite, only opposite in that the one pert task also
> receives MORE than its fair share when unpinned.  Two 100% hogs sharing
> one CPU should each get 50% of that CPU. ...

But you are using CGROUPs, grouping all oinks into one group and the
one pert into another, requesting that each group get the same total CPU.
Since pert is a single process, the most it can get is 100% (not 400%),
and it is quite OK for the oinks together to get 700%.
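The arithmetic here can be checked with a small sketch of equal-weight group scheduling (a hypothetical helper, not kernel code): each group is entitled to an equal slice of the machine, a group can use at most 100% per runnable task, and any surplus flows to the remaining groups:

```python
def group_shares(total_cpus, groups):
    """groups: {name: runnable_tasks}. Returns each group's CPU share in
    percent, assuming equal group weights and a 100%-per-task cap, with
    surplus redistributed (water-filling)."""
    shares = {}
    remaining_cpu = total_cpus * 100.0
    active = dict(groups)
    while active:
        per_group = remaining_cpu / len(active)
        # Groups that cannot use their entitlement get their cap...
        capped = {g: n * 100.0 for g, n in active.items() if n * 100.0 < per_group}
        if not capped:
            # ...and once no group is capped, the rest split evenly.
            for g in active:
                shares[g] = per_group
            break
        for g, share in capped.items():
            shares[g] = share
            remaining_cpu -= share
            del active[g]
    return shares
```

With 8 CPUs, 7 oinks in one group and 1 pert in another, pert is capped at 100% and the oinks absorb the remaining 700%, exactly as argued above.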

> IFF ... massively parallel and synchronized ...

You would be making the assumption that you had the machine to yourself:
might be the wrong thing to assume.

>> Good to see that you agree ...
> Weeell, we've disagreed on pretty much everything ...

Sorry I disagree: we do agree on the essence. :-)

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: CFS scheduler unfairly prefers pinned tasks

2015-10-08 Thread paul . szabo
Dear Mike,

> I see a fairness issue ... but one opposite to your complaint.

Why is that opposite? I think it would be fair for the one pert process
to get 100% CPU; the many oink processes can get everything else. That
one oink gets only a lowly 10% (while the others get 100%) is of no
consequence.

What happens when you un-pin pert: does it get 100%? What if you run two
perts? Have you reproduced my observations?

---

Good to see that you agree on the fairness issue... it MUST be fixed!
CFS might be wrong or wasteful, but never unfair.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: CFS scheduler unfairly prefers pinned tasks

2015-10-06 Thread paul . szabo
Dear Mike,

>> ... the CFS is meant to be fair, using things like vruntime
>> to preempt, and throttling. Why are those pinned tasks not preempted or
>> throttled?
>
> Imagine you own a 8192 CPU box for a moment, all CPUs having one pinned
> task, plus one extra unpinned task, and ponder what would have to happen
> in order to meet your utilization expectation. ...

Sorry, but the kernel contradicts that. As per my original report, things
are "fair" in this case:
 - with CGROUP controls and the two kinds of processes run by
   different users, when there is just one un-pinned process
and that is so on my quad-core i5-3470 baby and my 32-core 4*E5-4627v2
server (and everywhere else that I tested). The kernel is smart and gets
it right for one un-pinned process: why not for two?

Now re-testing further (on some machines with CGROUP): on the i5-3470
things are still fair with one un-pinned task (they become un-fair with
two); on the 4*E5-4627v2 they are still fair with 4 un-pinned (un-fair
with 5). Does this suggest that the kernel does things right within each
physical CPU, but breaks across several (or exactly the contrary)? Maybe
not: on a 2*E5530 machine, things are fair with just one un-pinned task
and un-fair already with two.

> What you're seeing is not a bug.  No task can occupy more than one CPU
> at a time, making space reservation on multiple CPUs a very bad idea.

I agree that pinning may be bad... should not the kernel penalize the
badly pinned processes?

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: CFS scheduler unfairly prefers pinned tasks

2015-10-06 Thread paul . szabo
Dear Mike,

>> .. CFS ... unfairly gives more CPU time to [pinned] tasks ...
>
> If they can all migrate, load balancing can move any of them to try to
> fix the permanent imbalance, so they'll all bounce about sharing a CPU
> with some other hog, and it all kinda sorta works out.
>
> When most are pinned, to make it work out long term you'd have to be
> short term unfair, walking the unpinned minority around the box in a
> carefully orchestrated dance... and have omniscient powers that assure
> that none of the tasks you're trying to equalize is gonna do something
> rude like leave, sleep, fork or whatever, and muck up the grand plan.

Could not your argument be turned around: for a pinned task it is harder
to find an idle CPU, so it should get less time?

But really... those pinned tasks do not hog the CPU forever. Whatever
kicks them off: could that not be done just a little earlier?

And further... CFS is meant to be fair, using things like vruntime to
preempt, and throttling. Why are those pinned tasks not preempted or
throttled?
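The vruntime idea referred to here can be shown with a toy model (an illustration of the concept, not the kernel's implementation): the runnable task with the smallest virtual runtime runs next, and running advances its vruntime by the slice divided by its weight, so equal-weight tasks converge to equal CPU time:

```python
import heapq

def schedule(tasks, slice_ns, steps):
    """Run `steps` scheduling decisions over tasks = {name: weight}:
    always pick the smallest vruntime; one slice of running advances a
    task's vruntime by slice/weight, so heavier tasks accrue it slower."""
    weights = dict(tasks)
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    ran = {name: 0 for name in weights}
    for _ in range(steps):
        vruntime, name = heapq.heappop(heap)
        ran[name] += 1
        heapq.heappush(heap, (vruntime + slice_ns / weights[name], name))
    return ran
```

In this model two equal-weight tasks split the steps evenly, and a weight-2 task gets twice the steps of a weight-1 task.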

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


CFS scheduler unfairly prefers pinned tasks

2015-10-05 Thread paul . szabo
The Linux CFS scheduler prefers pinned tasks and unfairly
gives more CPU time to tasks that have set CPU affinity.
This effect is observed with or without CGROUP controls.

To demonstrate: on an otherwise idle machine, as some user,
run one process pinned to each CPU (as many processes as there
are CPUs in the system), e.g. for a quad-core non-HyperThreaded
machine:

  taskset -c 0 perl -e 'while(1){1}' &
  taskset -c 1 perl -e 'while(1){1}' &
  taskset -c 2 perl -e 'while(1){1}' &
  taskset -c 3 perl -e 'while(1){1}' &

and (as that same or some other user) run some without
pinning:

  perl -e 'while(1){1}' &
  perl -e 'while(1){1}' &

and use e.g. top to observe that the pinned processes get
more CPU time than is "fair".

Fairness is obtained when either:
 - there are as many un-pinned processes as CPUs; or
 - with CGROUP controls and the two kinds of processes run by
   different users, when there is just one un-pinned process; or
 - if the pinning is turned off for these processes (or they
   are started without).
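A process can also pin or un-pin itself without taskset; a minimal sketch using Python's Linux-only affinity calls, equivalent in effect to the taskset commands above:

```python
import os

def pin_to_cpu(cpu):
    """Pin the calling process (pid 0 = self) to a single CPU,
    like `taskset -c <cpu>`."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

def unpin():
    """Allow the calling process to run on every CPU again."""
    os.sched_setaffinity(0, set(range(os.cpu_count())))
    return os.sched_getaffinity(0)
```

Comparing top output with tasks pinned via pin_to_cpu() against the same tasks after unpin() reproduces the setup described above.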

Any insight is welcome!

---

I would appreciate replies direct to me as I am not subscribed to the
linux-kernel mailing list (but will try to watch the archives).

This bug is also reported to Debian, please see
  http://bugs.debian.org/800945

I use Debian with the 3.16 kernel, have not yet tried 4.* kernels.


Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: [RFC] Reproducible OOM with just a few sleeps

2013-02-24 Thread paul . szabo
Dear Simon,

> So if he config sparse memory, the issue can be solved I think.

In my config file I have:

CONFIG_HAVE_SPARSE_IRQ=y
CONFIG_SPARSE_IRQ=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_SPARSEMEM_STATIC=y
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_SPARSE_RCU_POINTER is not set

Is that sufficient for sparse memory, or should I try something else?
Or maybe, you meant that some kernel source patches might be possible
in the sparse memory code?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps

2013-01-31 Thread paul . szabo
Dear Ben,

>>>> PAE is broken for any amount of RAM.
>>> No it isn't.
>> Could I please ask you to expand on that?
>
> I already did, a few messages back.

OK, thanks. Noting however that, fewer messages back, I said:
  ... PAE with any RAM fails the "sleep test":
  n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done
and somewhere also said that non-PAE passes. Does that not prove
that PAE is broken?

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps

2013-01-31 Thread paul . szabo
Dear Ben,

>> PAE is broken for any amount of RAM.
>
> No it isn't.

Could I please ask you to expand on that?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps

2013-01-31 Thread paul . szabo
Dear Ben,

> Based on your experience I might propose to change the automatic kernel
> selection for i386 so that we use 'amd64' on a system with >16GB RAM and
> a capable processor.

Don't you mean change to amd64 for >4GB (or any RAM), never using PAE?
PAE is broken for any amount of RAM. More precisely, PAE with any RAM
fails the "sleep test":
  n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done
and with >32GB fails the "write test":
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=n+1)); done
Why do you think 16GB is significant?
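The sleep-test threshold can be turned into a rough lowmem budget. This is back-of-envelope arithmetic on assumed figures: i386 PAE keeps a directly mapped "lowmem" region of about 896 MiB, and every process, even a sleeping one, consumes lowmem for its kernel stack, task_struct and page tables (PAE doubles page-table entries to 8 bytes):

```python
MIB = 1024 * 1024
LOWMEM = 896 * MIB   # classic i386 direct-mapped region (assumed figure)

def lowmem_per_task(n_tasks, lowmem=LOWMEM):
    """Lowmem available per process if n_tasks processes exhaust it."""
    return lowmem // n_tasks

# The observed OOM near 33,000 sleeps corresponds to roughly 28 KiB of
# lowmem per process -- plausibly an 8 KiB kernel stack plus task_struct,
# page tables and other per-task allocations.
```

That the per-task figure comes out in the tens of kilobytes, rather than megabytes, is consistent with the "lowmem starvation" explanation offered elsewhere in the thread.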

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps

2013-01-31 Thread paul . szabo
Dear Ben,

Thanks for the repeated explanations.

> PAE was a stop-gap ...
> ... [PAE] completely untenable.

Is this a good time to withdraw PAE, and to tell the world that it does
not work? Maybe you should have put such comments in the code.

Seems that amd64 now works "somewhat": on Debian the linux-image package
is tricky to install, and linux-headers is even harder. Is work being
done to make this smoother?

---

I am still not convinced by the "lowmem starvation" explanation: because
then PAE should have worked fine on my 3GB machine; maybe I should also
try PAE on my 512MB laptop. (Though, what do I know; I have not yet found
the buggy line of code I believe is lurking there...)

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney   Australia


Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps

2013-01-31 Thread paul . szabo
Dear Ben,

> Based on your experience I might propose to change the automatic kernel
> selection for i386 so that we use 'amd64' on a system with 16GB RAM and
> a capable processor.

Don't you mean change to amd64 for 4GB (or any RAM), never using PAE?
PAE is broken for any amount of RAM. More precisely, PAE with any RAM
fails the sleep test:
  n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done
and with 32GB fails the write test:
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; 
((n=n+1)); done
Why do you think 16GB is significant?
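For anyone wanting to reproduce this, a parameterized sketch of the sleep test above (`spawn_sleeps` is a made-up helper name, not from the thread; the original one-liner corresponds to `spawn_sleeps 33000 600`). Run it only on a scratch machine — on an affected PAE kernel it ends in OOM.

```shell
# Sketch of the "sleep test": spawn COUNT background sleeps of SECS seconds.
# spawn_sleeps is a hypothetical name; stdout of each sleep is redirected so
# the function can be command-substituted cleanly.
spawn_sleeps() {
  n=0
  while [ "$n" -lt "$1" ]; do
    sleep "$2" >/dev/null 2>&1 &
    n=$((n + 1))
  done
  echo "spawned $n sleeps"
}
# original test: spawn_sleeps 33000 600
```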

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps

2013-01-31 Thread paul . szabo
Dear Ben,

>> PAE is broken for any amount of RAM.
>
> No it isn't.

Could I please ask you to expand on that?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: Bug#695182: [RFC] Reproducible OOM with just a few sleeps

2013-01-31 Thread paul . szabo
Dear Ben,

>>>> PAE is broken for any amount of RAM.
>>> No it isn't.
>> Could I please ask you to expand on that?
>
> I already did, a few messages back.

OK, thanks. Noting however that, fewer messages back than that, I said:
  ... PAE with any RAM fails the sleep test:
  n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done
and somewhere also said that non-PAE passes. Does not that prove
that PAE is broken?

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [RFC] Reproducible OOM with just a few sleeps

2013-01-30 Thread paul . szabo
Dear Pavel and Dave,

> The assertion was that 4GB with no PAE passed a forkbomb test (ooming)
> while 4GB of RAM with PAE hung, thus _PAE_ is broken.

Yes, PAE is broken. Still, maybe the above needs slight correction:
non-PAE HIGHMEM4G passed the "sleep test": no OOM, nothing unexpected;
whereas PAE OOMed then hung (tested with various RAM from 3GB to 64GB).

The feeling I get is that amd64 is proposed as a drop-in replacement for
PAE, that support and development of PAE is gone, that PAE is dead.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-26 Thread paul . szabo
Dear Jonathan,

>> If you can identify where it was fixed then your patch for older
>> versions should go to stable with a reference to the upstream fix (see
>> Documentation/stable_kernel_rules.txt).
>
> How about this patch?
>
> It was applied in mainline during the 3.3 merge window, so kernels
> newer than 3.2.y shouldn't need it.
>
> ...
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d upstream.
> ...

Yes, I believe that is the correct patch, surely better than my simple
subtraction of min_free_kbytes.

Noting that this does not "solve" all problems: the latest 3.8 kernel
still crashes with OOM:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1098961/comments/18

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()

2013-01-25 Thread paul . szabo
Dear Fengguang (et al),

> There are 260MB reclaimable slab pages in the normal zone, however we
> somehow failed to reclaim them. ...

Could the problem be that, without CONFIG_NUMA, zone_reclaim_mode stays
at zero, and anyway zone_reclaim() does nothing in include/linux/swap.h?

Though... there is no CONFIG_NUMA nor /proc/sys/vm/zone_reclaim_mode in
the Ubuntu non-PAE "plain" HIGHMEM4G kernel, and still it handles the
"sleep test" just fine.

Where does reclaiming happen (or where is it meant to happen)?
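As a reading aid (not from the original mail), a small shell helper can show whether the sysctl is even exposed; `show_zone_reclaim_mode` and its fallback message are my own naming.

```shell
# Hypothetical helper: print zone_reclaim_mode, tolerating kernels built
# without CONFIG_NUMA, where /proc/sys/vm/zone_reclaim_mode does not exist.
show_zone_reclaim_mode() {
  f="${1:-/proc/sys/vm/zone_reclaim_mode}"
  if [ -r "$f" ]; then
    cat "$f"
  else
    echo "absent (kernel without CONFIG_NUMA?)"
  fi
}
```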

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-25 Thread paul . szabo
Dear Ben,

> ... the mm maintainers are probably much better placed ...

Exactly. Now I wonder: are you one of them?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: Bug#695182: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-25 Thread paul . szabo
Dear Ben,

> If you can identify where it was fixed then ...

Sorry I cannot do that. I have no idea where kernel changelogs are kept.

I am happy to do some work. Please do not call me lazy.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-25 Thread paul . szabo
Dear Minchan,

> So what's the effect for user?
> ...
> It seems you saw old kernel.
> ...
> Current kernel includes ...
> So I think we don't need this patch.

As I understand now, my patch is "right" and needed for older kernels;
for newer kernels, the issue has been fixed in equivalent ways; it was
an oversight that the change was not backported; and any justification
you need, you can get from those "later better" patches.

I asked:

  A question: what is the use or significance of vm_highmem_is_dirtyable?
  It seems odd that it would be used in setting limits or thresholds, but
  not used in decisions where to put dirty things. Is that so, is that as
  should be? What is the recommended setting of highmem_is_dirtyable?

The silence is deafening. I guess highmem_is_dirtyable is an aberration.

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()

2013-01-24 Thread paul . szabo
Dear Fengguang,

> There are 260MB reclaimable slab pages in the normal zone ...

Marked "all_unreclaimable? yes": is that wrong? Question asked also in:
http://marc.info/?l=linux-mm=135873981326767=2

> ... however we somehow failed to reclaim them. ...

I made a patch that would do a drop_caches at that point, please see:
http://bugs.debian.org/695182
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;filename=drop_caches.patch;att=1;bug=695182
http://marc.info/?l=linux-mm=135785511125549=2
and that successfully avoided OOM when writing files.
But, the drop_caches patch did not protect against the "sleep test".

> ... What's your filesystem and the content of /proc/slabinfo?

Filesystem is EXT3. See output of slabinfo in Debian bug above or in
http://marc.info/?l=linux-mm=135796154427544=2

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()

2013-01-24 Thread paul . szabo
Dear Jan,

> I think he found the culprit of the problem being min_free_kbytes was not
> properly reflected in the dirty throttling. ... Paul please correct me
> if I'm wrong.

Sorry but have to correct you.

I noticed and patched/corrected two problems, one with (setpoint-dirty)
in bdi_position_ratio(), another with min_free_kbytes not subtracted
from dirtyable memory. Fixing those problems, singly or in combination,
did not help in avoiding OOM: running
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; 
((n=$n+1)); done
still produces an OOM after a few files written (on a PAE machine with
over 32GB RAM).

Also, a quite similar OOM may be produced on any PAE machine with
  n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); done
This was tested on machines with as low as just 3GB RAM ... and
curiously the same machine with "plain" (not PAE but HIGHMEM4G)
kernel handles the same "sleep test" without any problems.

(Thus I now think that the remaining bug is not with writeback.)

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH] Negative (setpoint-dirty) in bdi_position_ratio()

2013-01-24 Thread paul . szabo
Dear Fengguang,

> Or more simple, you may show us the OOM dmesg which will contain the
> number of dirty pages. ...

Do you mean kern.log lines like:

[  744.754199] bash invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, 
oom_score_adj=0
[  744.754202] bash cpuset=/ mems_allowed=0
[  744.754204] Pid: 3836, comm: bash Not tainted 3.2.0-4-686-pae #1 Debian 
3.2.32-1
...
[  744.754354] active_anon:13497 inactive_anon:129 isolated_anon:0
[  744.754354]  active_file:2664 inactive_file:4144756 isolated_file:0
[  744.754355]  unevictable:0 dirty:510 writeback:0 unstable:0
[  744.754356]  free:11867217 slab_reclaimable:68289 slab_unreclaimable:7204
[  744.754356]  mapped:8066 shmem:250 pagetables:519 bounce:0
[  744.754361] DMA free:4260kB min:784kB low:980kB high:1176kB active_anon:0kB 
inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15784kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11628kB 
slab_unreclaimable:4kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB 
writeback_tmp:0kB pages_scanned:499 all_unreclaimable? yes
[  744.754364] lowmem_reserve[]: 0 867 62932 62932
[  744.754369] Normal free:43788kB min:44112kB low:55140kB high:66168kB 
active_anon:0kB inactive_anon:0kB active_file:912kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:887976kB 
mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB 
slab_reclaimable:261528kB slab_unreclaimable:28812kB kernel_stack:3096kB 
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:16060 
all_unreclaimable? yes
[  744.754372] lowmem_reserve[]: 0 0 496525 496525
[  744.754377] HighMem free:47420820kB min:512kB low:789888kB high:1579264kB 
active_anon:53988kB inactive_anon:516kB active_file:9740kB 
inactive_file:16579320kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:63555300kB mlocked:0kB dirty:2040kB writeback:0kB mapped:32260kB 
shmem:1000kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB 
pagetables:2076kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 
all_unreclaimable? no
[  744.754380] lowmem_reserve[]: 0 0 0 0
[  744.754381] DMA: 445*4kB 36*8kB 3*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 1*2048kB 0*4096kB = 4260kB
[  744.754386] Normal: 1132*4kB 620*8kB 237*16kB 70*32kB 38*64kB 26*128kB 
20*256kB 14*512kB 4*1024kB 3*2048kB 0*4096kB = 43808kB
[  744.754390] HighMem: 226*4kB 242*8kB 155*16kB 66*32kB 10*64kB 1*128kB 
1*256kB 0*512kB 1*1024kB 2*2048kB 11574*4096kB = 47420680kB
[  744.754395] 4148173 total pagecache pages
[  744.754396] 0 pages in swap cache
[  744.754397] Swap cache stats: add 0, delete 0, find 0/0
[  744.754397] Free swap  = 0kB
[  744.754398] Total swap = 0kB
[  744.900649] 16777200 pages RAM
[  744.900650] 16549378 pages HighMem
[  744.900651] 664304 pages reserved
[  744.900652] 4162276 pages shared
[  744.900653] 104263 pages non-shared

? (The above and similar were reported to http://bugs.debian.org/695182 .)
Do you want me to log and report something else?

I believe the above crash may be provoked simply by running:
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; (( n = 
$n + 1 )); done &
on any PAE machine with over 32GB RAM. Oddly the problem does not seem
to occur when using mem=32g or lower on the kernel boot line (or on
machines with less than 32GB RAM).
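One way to read the dmesg above (my interpretation, with the free/min figures copied from it): gfp_mask=0xd0 looks like a GFP_KERNEL, order-1 request, so it cannot be served from HighMem, and the Normal zone is already below its min watermark — hence OOM despite ~47 GB free in HighMem.

```shell
# Per-zone watermark check using the values from the dmesg above.
check_zone() {  # args: NAME FREE_KB MIN_KB
  if [ "$2" -ge "$3" ]; then ok=yes; else ok=no; fi
  echo "$1: free=${2}kB min=${3}kB above-min=$ok"
}
check_zone DMA     4260     784
check_zone Normal  43788    44112
check_zone HighMem 47420820 512
```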

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-22 Thread paul . szabo
Dear Minchan,

> So what's the effect for user?

Sorry I have no idea.

The kernel seems to work well without this patch; or in fact not so
well, PAE crashing with spurious OOM. In my fruitless efforts of
avoiding OOM by sensible choices of sysctl tunables, I noticed that
maybe the treatment of min_free_kbytes was not right. Getting this
right did not help in avoiding OOM.

> It seems you saw old kernel.

Yes I have Debian on my machines. :-)

> Current kernel includes following logic.
> 
> static unsigned long global_dirtyable_memory(void)
> {
> unsigned long x;
> 
> x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> x -= min(x, dirty_balance_reserve);
> 
> if (!vm_highmem_is_dirtyable)
> x -= highmem_dirtyable_memory(x);
> 
> return x + 1;   /* Ensure that we never return 0 */
> }
> 
> And dirty_balance_reserve already includes high_wmark_pages.
> Look at calculate_totalreserve_pages.
> 
> So I think we don't need this patch.
> Thanks.

Presumably, dirty_balance_reserve takes min_free_kbytes into account?
Then I agree that this patch is not needed on those newer kernels.

A question: what is the use or significance of vm_highmem_is_dirtyable?
It seems odd that it would be used in setting limits or thresholds, but
not used in decisions where to put dirty things. Is that so, is that as
should be? What is the recommended setting of highmem_is_dirtyable?
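To make the quoted arithmetic concrete, here is the same computation transcribed into shell, with illustrative page counts (not real kernel state; `dirtyable_pages` is my own name).

```shell
# Shell transcription of the quoted global_dirtyable_memory() logic.
dirtyable_pages() {  # args: FREE RECLAIMABLE RESERVE HIGHMEM HIGHMEM_IS_DIRTYABLE
  x=$(($1 + $2))
  sub=$3
  if [ "$sub" -gt "$x" ]; then sub=$x; fi    # min(x, dirty_balance_reserve)
  x=$((x - sub))
  if [ "$5" -eq 0 ]; then x=$((x - $4)); fi  # subtract highmem when not dirtyable
  echo $((x + 1))                            # ensure we never return 0
}
dirtyable_pages 100000 50000 10000 120000 0
```

With these made-up numbers: 100000 + 50000 - 10000 - 120000 + 1 = 20001 dirtyable pages.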

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-22 Thread paul . szabo
Dear Minchan,

 So what's the effect for user?

Sorry I have no idea.

The kernel seems to work well without this patch; or in fact not so
well, PAE crashing with spurious OOM. In my fruitless efforts of
avoiding OOM by sensible choices of sysctl tunables, I noticed that
maybe the treatment of min_free_kbytes was not right. Getting this
right did not help in avoiding OOM.

 It seems you saw old kernel.

Yes I have Debian on my machines. :-)

 Current kernel includes following logic.
 
 static unsigned long global_dirtyable_memory(void)
 {
 unsigned long x;
 
 x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 x -= min(x, dirty_balance_reserve);
 
 if (!vm_highmem_is_dirtyable)
 x -= highmem_dirtyable_memory(x);
 
 return x + 1;   /* Ensure that we never return 0 */
 }
 
 And dirty_lanace_reserve already includes high_wmark_pages.
 Look at calculate_totalreserve_pages.
 
 So I think we don't need this patch.
 Thanks.

Presumably, dirty_balance_reserve takes min_free_kbytes into account?
Then I agree that this patch is not needed on those newer kernels.

A question: what is the use or significance of vm_highmem_is_dirtyable?
It seems odd that it would be used in setting limits or threshholds, but
not used in decisions where to put dirty things. Is that so, is that as
should be? What is the recommended setting of highmem_is_dirtyable?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


[PATCH] Subtract min_free_kbytes from dirtyable memory

2013-01-20 Thread paul . szabo
When calculating the amount of dirtyable memory, min_free_kbytes should be
subtracted because it is not intended for dirty pages.

Using an "extern int" because that is the only interface to some such
sysctl values.

(This patch does not solve the PAE OOM issue.)

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia

Reported-by: Paul Szabo 
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo 

--- mm/page-writeback.c.old 2012-12-06 22:20:40.0 +1100
+++ mm/page-writeback.c 2013-01-21 13:57:05.0 +1100
@@ -343,12 +343,16 @@
 unsigned long determine_dirtyable_memory(void)
 {
unsigned long x;
+   extern int min_free_kbytes;
 
x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 
if (!vm_highmem_is_dirtyable)
x -= highmem_dirtyable_memory(x);
 
+   /* Subtract min_free_kbytes */
+   x -= min(x, min_free_kbytes >> (PAGE_SHIFT - 10));
+
return x + 1;   /* Ensure that we never return 0 */
 }



[PATCH] MAX_PAUSE to be at least 4

2013-01-20 Thread paul . szabo
Ensure MAX_PAUSE is 4 or larger, so that the limits in
  return clamp_val(t, 4, MAX_PAUSE);
(its only use) are not back-to-front.

(This patch does not solve the PAE OOM issue.)

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia

Reported-by: Paul Szabo 
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo 

--- mm/page-writeback.c.old 2012-12-06 22:20:40.0 +1100
+++ mm/page-writeback.c 2013-01-21 13:57:05.0 +1100
@@ -39,7 +39,7 @@
 /*
  * Sleep at most 200ms at a time in balance_dirty_pages().
  */
-#define MAX_PAUSE  max(HZ/5, 1)
+#define MAX_PAUSE  max(HZ/5, 4)
 
 /*
  * Estimate write bandwidth at 200ms intervals.


[PATCH] Negative (setpoint-dirty) in bdi_position_ratio()

2013-01-19 Thread paul . szabo
In bdi_position_ratio(), get the difference (setpoint-dirty) right even
when negative. Both setpoint and dirty are unsigned long, so the difference
wrapped and was then zero-extended, not sign-extended, to s64. This issue
affects all 32-bit architectures; it does not affect 64-bit architectures,
where long and s64 are equivalent.

In this function, dirty is between freerun and limit, the pseudo-float x
is between [-1,1], expected to be negative about half the time. With
zero-padding, instead of a small negative x we obtained a large positive
one so bdi_position_ratio() returned garbage.

Casting the difference to s64 also prevents overflow with left-shift;
though normally these numbers are small and I never observed a 32-bit
overflow there.

(This patch does not solve the PAE OOM issue.)

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia

Reported-by: Paul Szabo 
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo 

--- mm/page-writeback.c.old 2012-12-06 22:20:40.0 +1100
+++ mm/page-writeback.c 2013-01-20 07:47:55.0 +1100
@@ -559,7 +559,7 @@ static unsigned long bdi_position_ratio(
 * => fast response on large errors; small oscillation near setpoint
 */
setpoint = (freerun + limit) / 2;
-   x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+   x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
limit - setpoint + 1);
pos_ratio = x;
pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;


Re: [RFC] Reproducible OOM with just a few sleeps

2013-01-17 Thread paul . szabo
Dear Dave,

>> On my large machine, 'free' fails to show about 2GB memory ...
> You probably have a memory hole. ...
> The e820 map (during early boot in dmesg) or /proc/iomem will let you
> locate your memory holes.

Now that my machine is running an amd64 kernel, 'free' shows total Mem
65854128 (up from 64447796 with PAE kernel), and I do not see much
change in /proc/iomem output (below). Is that as it should be?

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


---

root@zeno:~# uname -a
Linux zeno.maths.usyd.edu.au 3.2.35-pk06.12-amd64 #2 SMP Thu Jan 17 13:19:53 
EST 2013 x86_64 GNU/Linux
root@zeno:~# free
             total       used       free     shared    buffers     cached
Mem:      65854128    1591704   64262424          0     227036     175620
-/+ buffers/cache:    1189048   64665080
Swap:    195312636          0  195312636
root@zeno:~# cat /proc/iomem
- : reserved
0001-00099bff : System RAM
00099c00-0009 : reserved
000a-000b : PCI Bus :00
000c-000d : PCI Bus :00
  000c-000c7fff : Video ROM
  000c8000-000cf5ff : Adapter ROM
  000cf800-000d07ff : Adapter ROM
  000d0800-000d0bff : Adapter ROM
000e-000f : reserved
  000f-000f : System ROM
0010-7e445fff : System RAM
  0100-0168f8c6 : Kernel code
  0168f8c7-018f24bf : Kernel data
  0197d000-019dafff : Kernel bss
7e446000-7e565fff : ACPI Non-volatile Storage
7e566000-7f1e2fff : reserved
7f1e3000-7f25efff : ACPI Tables
7f25f000-7f31cfff : reserved
7f31d000-7f323fff : ACPI Non-volatile Storage
7f324000-7f333fff : reserved
7f334000-7f33bfff : ACPI Non-volatile Storage
7f33c000-7f365fff : reserved
7f366000-7f7f : ACPI Non-volatile Storage
7f80-7fff : RAM buffer
8000-dfff : PCI Bus :00
  8000-8fff : PCI MMCONFIG  [bus 00-ff]
8000-8fff : reserved
  9000-900f : :00:16.0
  9010-901f : :00:16.1
  dd00-ddff : PCI Bus :08
dd00-ddff : :08:03.0
  de00-de4f : PCI Bus :07
de00-de3f : :07:00.0
de47c000-de47 : :07:00.0
  de60-de6f : PCI Bus :02
  df00-df8f : PCI Bus :08
df00-df7f : :08:03.0
df80-df803fff : :08:03.0
  df90-df9f : PCI Bus :07
  dfa0-dfaf : PCI Bus :02
dfa0-dfa1 : :02:00.1
  dfa0-dfa1 : igb
dfa2-dfa3 : :02:00.0
  dfa2-dfa3 : igb
dfa4-dfa43fff : :02:00.1
  dfa4-dfa43fff : igb
dfa44000-dfa47fff : :02:00.0
  dfa44000-dfa47fff : igb
  dfb0-dfb03fff : :00:04.7
  dfb04000-dfb07fff : :00:04.6
  dfb08000-dfb0bfff : :00:04.5
  dfb0c000-dfb0 : :00:04.4
  dfb1-dfb13fff : :00:04.3
  dfb14000-dfb17fff : :00:04.2
  dfb18000-dfb1bfff : :00:04.1
  dfb1c000-dfb1 : :00:04.0
  dfb2-dfb200ff : :00:1f.3
  dfb21000-dfb217ff : :00:1f.2
dfb21000-dfb217ff : ahci
  dfb22000-dfb223ff : :00:1d.0
dfb22000-dfb223ff : ehci_hcd
  dfb23000-dfb233ff : :00:1a.0
dfb23000-dfb233ff : ehci_hcd
  dfb25000-dfb25fff : :00:05.4
  dfffc000-dfffdfff : pnp 00:02
e000-fbff : PCI Bus :80
  fbe0-fbef : PCI Bus :84
fbe0-fbe3 : :84:00.0
fbe4-fbe5 : :84:00.0
fbe6-fbe63fff : :84:00.0
  fbf0-fbf03fff : :80:04.7
  fbf04000-fbf07fff : :80:04.6
  fbf08000-fbf0bfff : :80:04.5
  fbf0c000-fbf0 : :80:04.4
  fbf1-fbf13fff : :80:04.3
  fbf14000-fbf17fff : :80:04.2
  fbf18000-fbf1bfff : :80:04.1
  fbf1c000-fbf1 : :80:04.0
  fbf2-fbf20fff : :80:05.4
  fbffe000-fbff : pnp 00:12
fc00-fcff : pnp 00:01
fd00-fdff : pnp 00:01
fe00-feaf : pnp 00:01
feb0-febf : pnp 00:01
fec0-fec003ff : IOAPIC 0
fec01000-fec013ff : IOAPIC 1
fec4-fec403ff : IOAPIC 2
fed0-fed003ff : HPET 0
fed08000-fed08fff : pnp 00:0c
fed1c000-fed3 : reserved
  fed1c000-fed1 : pnp 00:0c
fed45000-fedf : pnp 00:01
fee0-fee00fff : Local APIC
ff00- : reserved
  ff00- : pnp 00:0c
1-107fff : System RAM
root@zeno:~# 

---

For comparison, output obtained (and reported previously) when machine
was running PAE kernel:
root@zeno:~# cat /proc/iomem
- : reserved
0001-00099bff : System RAM
00099c00-0009 : reserved
000a-000b : PCI Bus :00
  000a-000b : Video RAM area
000c-000d : PCI Bus :00
  000c-000c7fff : Video ROM
  000c8000-000cf5ff : Adapter ROM
  000cf800-000d07ff : Adapter ROM
  000d0800-000d0bff : Adapter ROM
000e-000f : reserved
  000f-000f : System ROM
0010-7e445fff : System RAM
  0100-01610e15 : Kernel code
  01610e16-01802dff : Kernel data
  0188-018b2fff : Kernel bss
7e

Re: [RFC] Reproducible OOM with just a few sleeps

2013-01-14 Thread paul . szabo
Dear Dave,

>> ... What is unacceptable is that PAE crashes or freezes with OOM:
>> it should gracefully handle the issue. Noting that (for a machine
>> with 4GB or under) PAE fails where the HIGHMEM4G kernel succeeds ...
>
> You have found a delta, but you're not really making apples-to-apples
> comparisons.  The page tables ...

I understand that the exact sizes of page tables are very important to
developers. To the rest of us, all that matters is that the kernel moves
them to highmem or swap or whatever, that it maybe emits some error
message but that it does not crash or freeze.

> There's probably a bug here.  But, it's incredibly unlikely to be seen
> in practice on anything resembling a modern system. ...

Probably, I found the bug on a very modern and brand-new system, just
trying to copy a few ISO image files and trying to log in a hundred
students. My machine crashed under those very practical and normal
circumstances. The demos with dd and sleep were just that: easily
reproducible demos.

> ... easily worked around by upgrading to a 64-bit kernel ...

Do you mean that PAE should never be used, and that amd64 should be used instead?

> ... Raising the vm.min_free_kbytes sysctl (to perhaps 10x of
> its current value on your system) is likely to help the hangs too,
> although it will further "consume" lowmem.

I have tried that, it did not work. As you say, it is backward.

> ... for a bug with ... so many reasonable workarounds ...

Only one workaround was proposed: use amd64.

PAE is buggy and useless, should be deprecated and removed.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [RFC] Reproducible OOM with just a few sleeps

2013-01-14 Thread paul . szabo
Dear Dave,

>> Seems that any i386 PAE machine will go OOM just by running a few
>> processes. To reproduce:
>>   sh -c 'n=0; while [ $n -lt 1 ]; do sleep 600 & ((n=n+1)); done'
>> ...
> I think what you're seeing here is that, as the amount of total memory
> increases, the amount of lowmem available _decreases_ due to inflation
> of mem_map[] (and a few other more minor things).  The number of sleeps
> you can do is bound by the number of processes, as you noticed from
> ulimit.  Creating processes that don't use much memory eats a relatively
> large amount of low memory.
> This is a sad (and counterintuitive) fact: more RAM actually *CREATES*
> RAM bottlenecks on 32-bit systems.

I understand that more RAM leaves less lowmem. What is unacceptable is
that PAE crashes or freezes with OOM: it should gracefully handle the
issue. Noting that (for a machine with 4GB or under) PAE fails where the
HIGHMEM4G kernel succeeds and survives.

>> On my large machine, 'free' fails to show about 2GB memory ...
> You probably have a memory hole. ...
> The e820 map (during early boot in dmesg) or /proc/iomem will let you
> locate your memory holes.

Thanks, that might explain it. Output of /proc/iomem below: sorry I do
not know how to interpret it.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


---
root@zeno:~# cat /proc/iomem
- : reserved
0001-00099bff : System RAM
00099c00-0009 : reserved
000a-000b : PCI Bus :00
  000a-000b : Video RAM area
000c-000d : PCI Bus :00
  000c-000c7fff : Video ROM
  000c8000-000cf5ff : Adapter ROM
  000cf800-000d07ff : Adapter ROM
  000d0800-000d0bff : Adapter ROM
000e-000f : reserved
  000f-000f : System ROM
0010-7e445fff : System RAM
  0100-01610e15 : Kernel code
  01610e16-01802dff : Kernel data
  0188-018b2fff : Kernel bss
7e446000-7e565fff : ACPI Non-volatile Storage
7e566000-7f1e2fff : reserved
7f1e3000-7f25efff : ACPI Tables
7f25f000-7f31cfff : reserved
7f31d000-7f323fff : ACPI Non-volatile Storage
7f324000-7f333fff : reserved
7f334000-7f33bfff : ACPI Non-volatile Storage
7f33c000-7f365fff : reserved
7f366000-7f7f : ACPI Non-volatile Storage
7f80-7fff : RAM buffer
8000-dfff : PCI Bus :00
  8000-8fff : PCI MMCONFIG  [bus 00-ff]
8000-8fff : reserved
  9000-900f : :00:16.0
  9010-901f : :00:16.1
  dd00-ddff : PCI Bus :08
dd00-ddff : :08:03.0
  de00-de4f : PCI Bus :07
de00-de3f : :07:00.0
de47c000-de47 : :07:00.0
  de60-de6f : PCI Bus :02
  df00-df8f : PCI Bus :08
df00-df7f : :08:03.0
df80-df803fff : :08:03.0
  df90-df9f : PCI Bus :07
  dfa0-dfaf : PCI Bus :02
dfa0-dfa1 : :02:00.1
  dfa0-dfa1 : igb
dfa2-dfa3 : :02:00.0
  dfa2-dfa3 : igb
dfa4-dfa43fff : :02:00.1
  dfa4-dfa43fff : igb
dfa44000-dfa47fff : :02:00.0
  dfa44000-dfa47fff : igb
  dfb0-dfb03fff : :00:04.7
  dfb04000-dfb07fff : :00:04.6
  dfb08000-dfb0bfff : :00:04.5
  dfb0c000-dfb0 : :00:04.4
  dfb1-dfb13fff : :00:04.3
  dfb14000-dfb17fff : :00:04.2
  dfb18000-dfb1bfff : :00:04.1
  dfb1c000-dfb1 : :00:04.0
  dfb2-dfb200ff : :00:1f.3
  dfb21000-dfb217ff : :00:1f.2
dfb21000-dfb217ff : ahci
  dfb22000-dfb223ff : :00:1d.0
dfb22000-dfb223ff : ehci_hcd
  dfb23000-dfb233ff : :00:1a.0
dfb23000-dfb233ff : ehci_hcd
  dfb25000-dfb25fff : :00:05.4
  dfffc000-dfffdfff : pnp 00:02
e000-fbff : PCI Bus :80
  fbe0-fbef : PCI Bus :84
fbe0-fbe3 : :84:00.0
fbe4-fbe5 : :84:00.0
fbe6-fbe63fff : :84:00.0
  fbf0-fbf03fff : :80:04.7
  fbf04000-fbf07fff : :80:04.6
  fbf08000-fbf0bfff : :80:04.5
  fbf0c000-fbf0 : :80:04.4
  fbf1-fbf13fff : :80:04.3
  fbf14000-fbf17fff : :80:04.2
  fbf18000-fbf1bfff : :80:04.1
  fbf1c000-fbf1 : :80:04.0
  fbf2-fbf20fff : :80:05.4
  fbffe000-fbff : pnp 00:12
fc00-fcff : pnp 00:01
fd00-fdff : pnp 00:01
fe00-feaf : pnp 00:01
feb0-febf : pnp 00:01
fec0-fec003ff : IOAPIC 0
fec01000-fec013ff : IOAPIC 1
fec4-fec403ff : IOAPIC 2
fed0-fed003ff : HPET 0
fed08000-fed08fff : pnp 00:0c
fed1c000-fed3 : reserved
  fed1c000-fed1 : pnp 00:0c
fed45000-fedf : pnp 00:01
fee0-fee00fff : Local APIC
ff00- : reserved
  ff00- : pnp 00:0c
1-107fff : System RAM
root@zeno:~# 


Re: [RFC] Reproducible OOM with just a few sleeps

2013-01-12 Thread paul . szabo
The issue is a regression with PAE, reproduced and verified on Ubuntu,
on my home PC with 3GB RAM.

My PC was running kernel linux-image-3.2.0-35-generic so it showed:
  psz@DellE520:~$ uname -a
  Linux DellE520 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:45:18 UTC 2012 
i686 i686 i386 GNU/Linux
  psz@DellE520:~$ free -l
               total       used       free     shared    buffers     cached
  Mem:       3087972     692256    2395716          0      18276     427116
  Low:        861464      71372     790092
  High:      2226508     620884    1605624
  -/+ buffers/cache:     246864    2841108
  Swap:     20000920     258364   19742556
Then it handled the "sleep test"
  bash -c 'n=0; while [ $n -lt 33000 ]; do sleep 600 & ((n=n+1)); ((m=n%500)); 
if [ $m -lt 1 ]; then echo -n "$n - "; date; free -l; sleep 1; fi; done'
just fine, stopped only by "max user processes" (the default setting of
"ulimit -u 23964"); with that limit raised, it stopped only when the machine
ran out of PID space. There was no OOM.

Installing and running the PAE kernel so it showed:
  psz@DellE520:~$ uname -a
  Linux DellE520 3.2.0-35-generic-pae #55-Ubuntu SMP Wed Dec 5 18:04:39 UTC 
2012 i686 i686 i386 GNU/Linux
  psz@DellE520:~$ free -l
               total       used       free     shared    buffers     cached
  Mem:       3087620     681188    2406432          0     167332     352296
  Low:        865208     214080     651128
  High:      2222412     467108    1755304
  -/+ buffers/cache:     161560    2926060
  Swap:     20000920          0   20000920
and re-trying the "sleep test", it ran into OOM after 18000 or so sleeps
and crashed/froze so I had to press the POWER button to recover.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


[RFC] Reproducible OOM with just a few sleeps

2013-01-11 Thread paul . szabo
Dear Linux-MM,

Seems that any i386 PAE machine will go OOM just by running a few
processes. To reproduce:
  sh -c 'n=0; while [ $n -lt 1 ]; do sleep 600 & ((n=n+1)); done'
My machine has 64GB RAM. With previous OOM episodes, it seemed that
running (booting) it with mem=32G might avoid OOM; but an OOM was
obtained just the same, and also with lower memory:
  Memory     sleeps to OOM   free shows total
  (mem=64G)      5300            64447796
  mem=32G       10200            31155512
  mem=16G       13400            14509364
  mem=8G        14200             6186296
  mem=6G        15200             4105532
  mem=4G        16400             2041364
The machine does not run out of highmem, nor does it use any swap.

Comparing with my desktop PC: it has 4GB RAM installed, and free shows
3978592 total. Running the "sleep test", it simply froze after about
16400 were running: no response to ping, and I will need to press the
RESET button.

---

On my large machine, 'free' fails to show about 2GB memory, e.g. with
mem=16G it shows:

root@zeno:~# free -l
             total       used       free     shared    buffers     cached
Mem:      14509364     435440   14073924          0       4068     111328
Low:        769044     120232     648812
High:     13740320     315208   13425112
-/+ buffers/cache:      320044   14189320
Swap:    134217724          0  134217724

---

Please let me know of any ideas, or if you want me to run some other
test or want to see some other output.

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


-

Details for when my machine was running with 64GB RAM:

In another window I was running
  cat /proc/slabinfo; free -l
repeatedly, and output of that (just before OOM) was:

+ cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fuse_request               0      0    376   43    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode                 0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                    0      0    288   28    2 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache       0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache           0      0    176   46    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_direct_cache           0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
nfs_inode_cache           28     28    584   28    4 : tunables    0    0    0 : slabdata      1      1      0
isofs_inode_cache          0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache            0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache                  0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle          4080   4080     24  170    1 : tunables    0    0    0 : slabdata     24     24      0
journal_head            1024   1024     64   64    1 : tunables    0    0    0 : slabdata     16     16      0
revoke_record            768    768     16  256    1 : tunables    0    0    0 : slabdata      3      3      0
ext4_inode_cache           0      0    584   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data             0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context    0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space        0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end                0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page               0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache           0      0    480   34    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache        1467   2079    488   33    4 : tunables    0    0    0 : slabdata     63     63      0
ext3_xattr                 0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                    168    168    192   42    2 : tunables    0    0    0 : slabdata      4      4      0
rpc_inode_cache          108    108    448   36    4 : tunables    0    0    0 : slabdata      3      3      0
UDP-Lite                   0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache             0      0    320   51    4 : tunables    0    0    0 : slabdata      0      0      0
UDP                      336    336    576   28    4 : tunables    0    0    0 : slabdata     12     12      0
tw_sock_TCP               32     32

Re: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread paul . szabo
Dear Andrew,

>>> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>> Please see below ...
> ... Was this dump taken when the system was at or near oom?

No, that was a "quiescent" machine. Please see a just-before-OOM dump in
my next message (in a little while).

> Please send a copy of the oom-killer kernel message dump, if you still
> have one.

Please see one in next message, or in
http://bugs.debian.org/695182

>> I tried setting dirty_ratio to "funny" values, that did not seem to
>> help.
> Did you try setting it as low as possible?

Probably. Maybe. Sorry, cannot say with certainty.

>> Did you notice my patch about bdi_position_ratio(), how it was
>> plain wrong half the time (for negative x)? 
> Nope, please resend.

Quoting from
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182
:
...
 - In bdi_position_ratio() get difference (setpoint-dirty) right even
   when it is negative, which happens often. Normally these numbers are
   "small" and even with left-shift I never observed a 32-bit overflow.
   I believe it should be possible to re-write the whole function in
   32-bit ints; maybe it is not worth the effort to make it "efficient";
   seeing how this function was always wrong and we survived, it should
   simply be removed.
...
--- mm/page-writeback.c.old 2012-10-17 13:50:15.0 +1100
+++ mm/page-writeback.c 2013-01-06 21:54:59.0 +1100
[ Line numbers out because other patches not shown ]
...
@@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
 	 * => fast response on large errors; small oscillation near setpoint
 	 */
 	setpoint = (freerun + limit) / 2;
-	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
 		    limit - setpoint + 1);
 	pos_ratio = x;
 	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
...

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread paul . szabo
Dear Andrew,

> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

Please see below: I do not know what any of that means. This machine has
been running just fine, with all my users logging in here via XDMCP from
X-terminals, dozens logged in simultaneously. (But, I think I could make
it go OOM with more processes or logins.)

> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around.  Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic it in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.

I tried setting dirty_ratio to "funny" values, that did not seem to
help. Did you notice my patch about bdi_position_ratio(), how it was
plain wrong half the time (for negative x)? Anyway that did not help.

> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages.  xfs or btrfs, perhaps.

Seems there is also a problem not related to the filesystem... or rather,
the essence does not seem to be filesystems or caches: the filesystem
side now seems OK with my patch doing drop_caches.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


---

root@como:~# free -lm
             total       used       free     shared    buffers     cached
Mem:         62936       2317      60618          0         41        635
Low:           367        271         95
High:        62569       2045      60523
-/+ buffers/cache:        1640      61295
Swap:       131071          0     131071
root@como:~# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fuse_request               0      0    376   43    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode                 0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                    0      0    288   28    2 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache       0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache           0      0    176   46    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_direct_cache           0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
nfs_inode_cache         5404   5404    584   28    4 : tunables    0    0    0 : slabdata    193    193      0
isofs_inode_cache          0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache            0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache                  0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle          5440   5440     24  170    1 : tunables    0    0    0 : slabdata     32     32      0
journal_head           16768  16768     64   64    1 : tunables    0    0    0 : slabdata    262    262      0
revoke_record          20224  20224     16  256    1 : tunables    0    0    0 : slabdata     79     79      0
ext4_inode_cache           0      0    584   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data             0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context    0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space        0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end                0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page               0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache           0      0    480   34    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache       16531  19965    488   33    4 : tunables    0    0    0 : slabdata    605    605      0
ext3_xattr                 0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                    840    840    192   42    2 : tunables    0    0    0 : slabdata     20     20      0
rpc_inode_cache          144    144    448   36    4 : tunables    0    0    0 : slabdata      4      4      0
UDP-Lite                   0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache             0      0    320   51    4 : tunables    0    0    0 : slabdata      0      0      0
UDP                      896    896    576   28    4 : tunables    0    0    0 : slabdata     32     32      0
tw_sock_TCP             1344   1344    128   32    1 : tunables    0    0    0 : slabdata     42     42      0
TCP

Re: [RFC] Reproducible OOM with partial workaround

2013-01-10 Thread paul . szabo
Dear Dave,

> ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> kernel without either violating the ABI (3GB/1GB split) or doing
> something that never got merged upstream ...

Sorry to be so contradictory:

psz@como:~$ uname -a
Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
psz@como:~$ free -l
             total       used       free     shared    buffers     cached
Mem:      64446900    4729292   59717608          0      15972     480520
Low:        375836     304400      71436
High:     64071064    4424892   59646172
-/+ buffers/cache:     4232800   60214100
Swap:    134217724          0  134217724
psz@como:~$ 

(though I would not know about violations).

But OK, I take your point that I should move with the times.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


Re: [RFC] Reproducible OOM with partial workaround

2013-01-10 Thread paul . szabo
Dear Dave,

> Your configuration has never worked.  This isn't a regression ...
> ... does not mean that we expect it to work.

Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
that all development is for 64-bit only?

> ... 64-bit kernels should basically be drop-in replacements ...

Will think about that. I know all my servers are 64-bit capable, will
need to check all my desktops.

---

I find it puzzling that there seems to be a sharp cutoff at 32GB RAM:
no problem under it, but OOM just over; whereas I would have expected
lowmem starvation to be gradual, with OOM occurring much sooner with
64GB than with 34GB. Also, the kernel seems capable of reclaiming
lowmem, so I wonder why that fails just over the 32GB threshold.
(Obviously I have no idea what I am talking about.)

---

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia


[RFC] Reproducible OOM with partial workaround

2013-01-10 Thread paul . szabo
Dear Linux-MM,

On a machine with i386 kernel and over 32GB RAM, an OOM condition is
reliably obtained simply by writing a few files to some local disk
e.g. with:
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=$n+1)); done
Crash usually occurs after 16 or 32 files written. Seems that the
problem may be avoided by using mem=32G on the kernel boot, and that
it occurs with any amount of RAM over 32GB.

I developed a workaround patch for this particular OOM demo, dropping
filesystem caches when about to exhaust lowmem. However, subsequently
I observed OOM when running many processes (as yet I do not have an
easy-to-reproduce demo of this); so as I suspected, the essence of the
problem is not with FS caches.

Could you please help in finding the cause of this OOM bug?

Please see
http://bugs.debian.org/695182
for details, in particular my workaround patch
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182

(Please reply to me directly, as I am not a subscriber to the linux-mm
mailing list.)

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of SydneyAustralia

