Re: the new VMt

2000-09-27 Thread Andrey Savochkin

Hello,

On Wed, Sep 27, 2000 at 01:55:52PM +0100, Hugh Dickins wrote:
> On Wed, 27 Sep 2000, Andrey Savochkin wrote:
> > 
> > It's a waste of resources to reserve memory+swap for the case that every
> > running process decides to modify libc code (and, thus, should receive its
> > private copy of the pages).   A real waste!
> 
> A real waste indeed, but a bad example: libc code is mapped read-only,
> so nobody would recommend reserving memory+swap for private mods to it.
> Of course, a process might choose to mprotect it writable at some time,
> that would be when to refuse if overcommitted.

Returning error from mprotect() call for private mappings?
It wouldn't be what people expect...

The other example where overcommit makes sense is fork() (not vfork) and
immediate exec in one of the threads.
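
A minimal C sketch of the fork()-then-exec pattern, for illustration only. Between fork() and exec the child nominally owns a copy-on-write duplicate of the parent's whole address space, which strict accounting would have to reserve even though it is thrown away almost at once. The helper name run_command and the use of /bin/sh are assumptions made for the example.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run "/bin/sh -c cmd" in a child process and return
 * its exit status, or -1 if the child could not be created or waited for. */
int run_command(const char *cmd)
{
    assert(cmd != NULL);

    pid_t pid = fork();   /* child gets a copy-on-write view of every parent page */
    if (pid < 0)
        return -1;        /* under strict accounting a large parent may fail here */

    if (pid == 0) {
        /* The child touches almost nothing before replacing its image. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);       /* reached only if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_command("ls") from a large process is the case in question: the reservation for the child's private copy is needed only for the instant between the two calls.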

Best regards
Andrey
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: the new VMt

2000-09-27 Thread Hugh Dickins

On Wed, 27 Sep 2000, Andrey Savochkin wrote:
> 
> It's a waste of resources to reserve memory+swap for the case that every
> running process decides to modify libc code (and, thus, should receive its
> private copy of the pages).   A real waste!

A real waste indeed, but a bad example: libc code is mapped read-only,
so nobody would recommend reserving memory+swap for private mods to it.
Of course, a process might choose to mprotect it writable at some time,
that would be when to refuse if overcommitted.

Hugh




Re: the new VMt

2000-09-27 Thread Jamie Lokier

Horst von Brand wrote:
> I'd call emacs consistently not being able to start an ls on a 16Mb machine
> much worse than a surprise...
> 
> Hint: Think about how emacs would go about doing that...

vfork ;-)

-- Jamie



Re: the new VMt

2000-09-27 Thread Andrey Savochkin

On Tue, Sep 26, 2000 at 11:45:02AM -0600, Erik Andersen wrote:
[snip]
> "Overcommit" to me is the same thing as Mark Hemment stated earlier in this
> thread -- the "fact that the system has over committed its memory resources.
> ie. it has sold too many tickets for the number of seats in the plane, and all
> the passengers have turned up."   Basically any case where too many tickets
> have been sold (applied to the entire system, and all subsystems).
[snip]
> If the Beancounter patch lets the kernel count "passengers", classify them
> (with user hinting) so the pilot and flight attendants (init, X, or whatever)
> always stay on the plane, and has some sane predictable mechanism for booting
> non-privileged passengers, then I am all for it.  

That's exactly what I'm doing.

> How does one provide the kernel with hints as to which processes are sacred?
> Where does one find this beancounter patch?   How much weight does it add to
> the kernel? 

ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html

The current version has some drawbacks, and one of them is the performance.
Memory accounting is implemented as a kernel thread which goes through page
tables of processes (similar to kswapd), and it appears to consume 1-5% of
CPU (depending on number of processes).  I consider it unacceptable, and have
started reimplementation of the process memory accounting from the beginning.

Best regards
Andrey



Re: the new VMt

2000-09-27 Thread Andrey Savochkin

Hello,

On Tue, Sep 26, 2000 at 01:10:30PM +0100, Mark Hemment wrote:
> 
> On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: 
> > So you have run out of physical memory --- what do you do about it?
> 
>   Why let the system get into the state where it is necessary to kill a
> process?
>   Per-user/task resource counters should prevent unprivileged users from
> soaking up too many resources.  That is the DoS protection.
> 
[snip]
>   It is possible to do true, system wide, resource counting of physical
> memory and swap space, and to deny a fork() or mmap() which would cause
> over committing of memory resources if everyone cashed in their
> requirements.
[snip]

People use overcommitting not because they are fans of the idea.
Overcommitting simply is the _efficient_ way of resource sharing.
It's a waste of resources to reserve memory+swap for the case that every
running process decides to modify libc code (and, thus, should receive its
private copy of the pages).   A real waste!
I always agree to take the risk of some applications being killed in such a
case of all processes turning crazy.

The approach I believe in is:
 - ensure that accidental or intentional madness of applications of one user
   may cause only limited damage to other users; and
 - introduce a way to tell the kernel that some applications should be
   saved longer than others when troubles begin and ways to set up some
   guaranteed amounts for important processes.
Certainly, a lot of processes may consume more than their guarantee until
bad things start to happen.  Then the rules of user protection and killing
order apply.
That's how I develop the resource control in the beancounter patch
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html#s7
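
A rough C sketch of the "guarantee plus killing order" idea, under stated assumptions: the struct, its fields, and the tie-breaking rule are all hypothetical and are not taken from the beancounter patch. It only shows that a process staying within its guarantee is never chosen, and that less important processes go first. Per-user damage limits are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process record; none of these names come from the patch. */
struct proc_acct {
    int           pid;
    unsigned long used_pages;        /* pages currently charged */
    unsigned long guaranteed_pages;  /* amount promised to the process */
    int           importance;        /* higher means "save longer" */
};

/* Pick a victim when memory runs out. Only processes above their guarantee
 * are eligible; among those, take the least important one, breaking ties by
 * the largest excess. Returns an index into procs, or -1 if none qualifies. */
int pick_victim(const struct proc_acct *procs, size_t n)
{
    int best = -1;
    unsigned long best_excess = 0;

    assert(procs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (procs[i].used_pages <= procs[i].guaranteed_pages)
            continue;   /* within its guarantee: protected */

        unsigned long excess = procs[i].used_pages - procs[i].guaranteed_pages;
        if (best < 0 ||
            procs[i].importance < procs[best].importance ||
            (procs[i].importance == procs[best].importance &&
             excess > best_excess)) {
            best = (int)i;
            best_excess = excess;
        }
    }
    return best;
}
```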

Best regards
Andrey



Re: the new VMt

2000-09-27 Thread Rusty Russell

In message <[EMAIL PROTECTED]> you write:
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a multi-
> page allocation and that the MM layer should not try forever to handle it.
> 
> In fact, that's independent of whether it is a multi-page allocation or
> not. It might be something like __GFP_SOFT - you could use it with single
> pages too. 

That'd be a lovely interface, now wouldn't it?

*yecch*

Please consider at least:

/* Never fails. */
#define trivial_kmalloc(s)  \
 ((void *)((s) > PAGE_SIZE ? bad_size_##s : __kmalloc((s), GFP_KERNEL)))

/* Can fail */
#define kmalloc(s, pri) __kmalloc((s), (pri)|__GFP_SOFT)

Rusty.
--
Hacking time.



Re: the new VMt

2000-09-26 Thread Horst von Brand

Erik Andersen <[EMAIL PROTECTED]> said:

[...]

> Another approach would be to let user space turn off overcommit.  
> That way, user space can be assured there will be no surprises...

I'd call emacs consistently not being able to start an ls on a 16Mb machine
much worse than a surprise...

Hint: Think about how emacs would go about doing that...

Also, to ensure there is /no/ overcommit /anywhere/ amounts to a rigorous
audit of the whole kernel, and of each single patchlet that goes in. You
are certainly welcome to do the job...
-- 
Horst von Brand [EMAIL PROTECTED]
Casilla 9G, Vin~a del Mar, Chile   +56 32 672616



Re: the new VMt

2000-09-26 Thread Eric Lowe

Hello,

> > Another approach would be to let user space turn off overcommit.  
> 
> No.  Overcommit only applies to pageable memory.  Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.
> 

In addition to beancounter, do you think pageable page tables are
something we want to tackle in 2.5.x?  4MB page mappings on x86
could be cool too, as an option...

--
Eric Lowe
FibreChannel Software Engineer, Systran Corporation
[EMAIL PROTECTED]





Re: the new VMt

2000-09-26 Thread Erik Andersen

On Tue Sep 26, 2000 at 06:08:20PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:
> 
> > Another approach would be to let user space turn off overcommit.  
> 
> No.  Overcommit only applies to pageable memory.  Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.

I think we do agree here, though we are having problems with semantics.

"Overcommit" to me is the same thing as Mark Hemment stated earlier in this
thread -- the "fact that the system has over committed its memory resources.
ie. it has sold too many tickets for the number of seats in the plane, and all
the passengers have turned up."   Basically any case where too many tickets
have been sold (applied to the entire system, and all subsystems).

To extend the airplane metaphor a bit past credibility...

When an airline sells too many tickets, it bribes people to get off the plane.
For the kernel, it tends to fall over, or starts kicking off pilots and flight
attendants.

If the Beancounter patch lets the kernel count "passengers", classify them
(with user hinting) so the pilot and flight attendants (init, X, or whatever)
always stay on the plane, and has some sane predictable mechanism for booting
non-privileged passengers, then I am all for it.  

How does one provide the kernel with hints as to which processes are sacred?
Where does one find this beancounter patch?   How much weight does it add to
the kernel? 

 -Erik

--
Erik B. Andersen   email:  [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--



Re: the new VMt

2000-09-26 Thread Stephen C. Tweedie

Hi,

On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:

> Another approach would be to let user space turn off overcommit.  

No.  Overcommit only applies to pageable memory.  Beancounter is
really needed for non-pageable resources such as page tables and
mlock()ed pages.

Cheers,
 Stephen



Re: the new VMt

2000-09-26 Thread Erik Andersen

On Tue Sep 26, 2000 at 05:04:06PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:
> 
> > Operating systems cannot make more memory appear by magic.
> > The question is really about the best strategy for dealing with low memory. In my
> > opinion, the OS should not try to out-think physical limitations. Instead, the OS 
> > should take as little space as possible and provide the ability for user level 
> > clever management of space. In a truly embedded system, there can easily be a
> > user level root process that watches memory usage and prevents DOS attacks --
> > if the OS provides settable enforced quotas etc.
> 
> Agreed, absolutely.  The beancounter is one approach to those quotas,
> and has the advantage of allowing per-user as well as per-process
> quotas.

Another approach would be to let user space turn off overcommit.  
That way, user space can be assured there will be no surprises...

 -Erik

--
Erik B. Andersen   email:  [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--



Re: the new VMt

2000-09-26 Thread Stephen C. Tweedie

Hi,

On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:

> Operating systems cannot make more memory appear by magic.
> The question is really about the best strategy for dealing with low memory. In my
> opinion, the OS should not try to out-think physical limitations. Instead, the OS 
> should take as little space as possible and provide the ability for user level 
> clever management of space. In a truly embedded system, there can easily be a
> user level root process that watches memory usage and prevents DOS attacks --
> if the OS provides settable enforced quotas etc.

Agreed, absolutely.  The beancounter is one approach to those quotas,
and has the advantage of allowing per-user as well as per-process
quotas.

--Stephen



Re: the new VMt

2000-09-26 Thread yodaiken

On Mon, Sep 25, 2000 at 05:14:11PM -0600, Erik Andersen wrote:
> On Mon Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote:
> > 
> > > all of the pending requests just as long as they are serialised, is
> > > this a problem?
> > 
> > I think you are solving the wrong problem. On a small memory machine, the kernel,
> > utilities, and applications should be configured to use little memory.  
> > BusyBox is better than BeanCount. 
> > 
> 
> Granted that smaller apps can help -- for a particular workload.  But while I
> am very partial to BusyBox (in fact I am about to cut a new release) I can
> assure you that OOM is easily possible even when your user space is tiny.  I do
> it all the time.  There are mallocs in busybox and when under memory pressure,
> the kernel still tends to fall over...

Operating systems cannot make more memory appear by magic.
The question is really about the best strategy for dealing with low memory. In my
opinion, the OS should not try to out-think physical limitations. Instead, the OS 
should take as little space as possible and provide the ability for user level 
clever management of space. In a truly embedded system, there can easily be a
user level root process that watches memory usage and prevents DOS attacks --
if the OS provides settable enforced quotas etc.


-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-26 Thread yodaiken

On Tue, Sep 26, 2000 at 11:07:36AM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 03:12:50PM -0600, [EMAIL PROTECTED] wrote:
> > > > 
> > > > I'm not too sure of what you have in mind, but if it is
> > > >  "process creates vast virtual space to generate many page table
> > > >   entries -- using mmap"
> > > > the answer is, virtual address space quotas and mmap should kill 
> > > > the process on low mem for page tables.
> > > 
> > > No.  Page tables are not freed after munmap (and for good reason).  The
> > > counting of page table "beans" is critical.
> > 
> > I've seen the assertion before, reasons would be interesting.
> 
> Reason 1: under DoS attack, you want to target not the process using
> the most resources, but the *user* using the most resources (else a
> fork-bomb style attack can work around your OOM-killer algorithms).

Ok.
  if (over_allocated_page_tables(task->uid)) return -ENOMEM;

makes sense in "fork".   I guess the argument here is not about whether
accounting is good, it's about where the accounting should be done. To me
the alternatives of

  if(preallocate_pages(page_table_size_for_this_process()) == -1)return error
 then actually allocate making sure to adjust counts if some other
 error turns up and with something taking care of how the pre-allocation
 works while we are sleeping waiting for possibly unrelated resources.

or
  just kmalloc with kmalloc magically juggling resources in some safe way


seem less clear.

   

 

> Reason 2: if you've got tasks stuck in low-level page allocation
> routines, then you can't immediately kill -9 them, so reactive OOM
> killing always has vulnerabilities --- to be robust in preventing
> resource exhaustion you want limits on the use of those resources
> before they are exhausted --- the necessary accounting being part of
> what we refer to as "beancounter".

Doesn't the problem really come from doing low-level page allocation at too high a level?
That is, instead of select doing get_free_page, maybe it should do
get_per_process_page(myprocess) or even get_per_process_file_use_page(myprocess).
Then we could have config-optional per-process pinned-page accounting, with the
possibility of doing something sensible in a user-space daemon when memory is low.

> 
> --Stephen

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-26 Thread yodaiken

On Tue, Sep 26, 2000 at 10:54:23AM +0100, Stephen C. Tweedie wrote:
> Beancounter is a framework for user-level accounting.  _What_ you
> account is up to the callers.  Maybe this has been a miscommunication,
> but beancounter is all about allowing callers to account for stuff
> before allocation, not about having the page allocation functions
> themselves enforce quotas.


Per-user, system-wide, and per-process quotas are one thing; a generic
pre-allocate-and-then-allocate scheme seems to me to be an error-prone
way of getting there. In particular, I think it is dangerous to have a pre-count that
is only approximately tethered to the thing it is counting -- in the memory allocation
we were discussing, you need to make sure that the pre-allocations are for memory that
is really going to be allocated soon and that they are later correlated with free in
some way.

So, to me, a quota-bounded allocate_page_table(process_id) makes much more sense than
pre-allocate counting, or, even worse, a "smart" kmalloc that never fails.
If the problem is unaccounted-for page tables, then account for
page tables and return a -EYOURPROCESSISOUTOFCONTROL so that calling kernel code
can take the responsible action.
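
A user-space C sketch of what a quota-bounded allocate_page_table() might look like, assuming a per-process counter that changes only when memory is really handed out or returned. Everything here is hypothetical: the struct, the 4096-byte page size, and the use of -EDQUOT as a stand-in for the tongue-in-cheek error code; calloc() stands in for the kernel's page allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define PT_PAGE_SIZE 4096  /* assumed page size for the sketch */

/* Hypothetical per-process accounting record. */
struct pt_account {
    unsigned long pt_pages;  /* page-table pages currently held */
    unsigned long pt_limit;  /* quota */
};

/* Allocate one page-table page charged to acct. The count moves only when
 * memory is actually handed out, so it cannot drift from the allocation.
 * Returns 0 on success, -EDQUOT when over quota, -ENOMEM when out of memory. */
int allocate_page_table(struct pt_account *acct, void **page)
{
    assert(acct != NULL && page != NULL);

    if (acct->pt_pages >= acct->pt_limit)
        return -EDQUOT;

    *page = calloc(1, PT_PAGE_SIZE);
    if (*page == NULL)
        return -ENOMEM;

    acct->pt_pages++;
    return 0;
}

/* Release a page obtained from allocate_page_table() and uncharge it. */
void free_page_table(struct pt_account *acct, void *page)
{
    assert(acct != NULL && acct->pt_pages > 0);
    free(page);
    acct->pt_pages--;
}
```

The caller sees a plain error return and decides what to do; there is no separate pre-count that has to be reconciled with a later allocation.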
   

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-26 Thread Mark Hemment

Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: 
> So you have run out of physical memory --- what do you do about it?

  Why let the system get into the state where it is necessary to kill a
process?
  Per-user/task resource counters should prevent unprivileged users from
soaking up too many resources.  That is the DoS protection.

  So an OOM is possibly;
1) A privileged, legally resource hungry, app(s) has taken all
   the memory.  Could be too important to simply kill (it
   should exit gracefully).
2) Simply too many tasks*(memory-requirements-of-each-task).

  Ignoring allocations done by the kernel, the situation comes down to the
fact that the system has over committed its memory resources.  ie. it has
sold too many tickets for the number of seats in the plane, and all the
passengers have turned up.
 (note, I use the term "memory" and not "physical memory", I'm including
swap space).

  Why not protect the system from over committing its memory resources?
  
  It is possible to do true, system wide, resource counting of physical
memory and swap space, and to deny a fork() or mmap() which would cause
over committing of memory resources if everyone cashed in their
requirements.

  Named pages (those which came from a file) are the simplest to
handle.  If dirty, they already have allocated backing store, so we know
there is somewhere to put them when memory is low.
  How many named pages need to be held in physical memory at any one
instance for the system to function?  Only a few, although if you reach
that state, the system will be thrashing itself to death.

  Anonymous and copied (those faulted from a write to a writable
MAP_PRIVATE mapping) pages can be stored in either physical
memory or on swap.  To avoid getting into the OOM situation, when these
mappings are created the system needs to check that it has (and will have,
in the future) space for every page that _could_ be allocated for the
mapping - ie. work out the worst case (including page-tables).
  This space could be on swap or in physical memory.  It is the accounting
which needs to be done, not the actual allocation (and not even the
decision of where to store the page when allocated - that is made much
later, when it needs to be).  If a machine has 2GB of RAM, a 1MB
swap, and 1GB of dirty anon or copied pages, that is fine.
  I'm stressing this point, as the scheme of reserving space for an (as
yet) unallocated page is sometimes referred to as "eager swap
allocation" (or some such similar term).  This is confusing.  People then
start to believe they need backing store for each anon/copied page.  You
don't.  You simply need somewhere to store it, and that could be a
physical page.  It is all in the accounting. :)

  Allocations made by the kernel, for the kernel, are (obviously) pinned
memory.  To ensure kernel allocations do not completely exhaust physical
memory (or cause physical memory to be over committed if the worst case
occurs), they need to be limited.
  How to limit?
  As a first guess (and this is only a guess);
1) don't let kernel allocations exceed 25% of physical memory
   (tunable)
2) don't let kernel allocations succeed if they would cause
   over commitment.
  Both conditions would need to pass before an allocation could succeed.
  This does need much more thought.  Should some tuning be per subsystem?
I don't know.

  Perhaps 1) isn't needed.  I'm not sure.

  Because of 2), the total physical memory accounted for anon/copied
pages needs to have a high watermark.  Otherwise, in the accounting, the
system could allow too much physical memory to be reserved for these
types of pages (there doesn't need to be space on swap for each
anon/copied page, just space somewhere - a watermark would prevent too
much of this being physical memory).  Note, this doesn't mean start
swapping earlier - remember, this is accounting of anon/copied pages to
avoid over commitment.
  For named pages, the page cache needs to have a reserved number of
physical pages (ie. how small is it allowed to get, before pruning
stops).  Again, these reserved pages are in the accounting.

 mlock()ed pages need to have accounting also to prevent over commitment of
physical memory.  All fun.

  The disadvantages;

1) Extra code to do the accounting.
This shouldn't be too heavy.

2) mmap(MAP_ANON)/mmap(MAP_PRIVATE|MAP_SHARED) can fail more readily.

Programs which expect to memory map areas (which would create
anon/copied pages when written to) will see an increased failure
rate in mmap().  This can be very annoying, especially when you
know the mapping will be used sparsely.

One solution is to add a new mmap() flag, which tells the kernel
to let this mmap() exceed the actual resources.
With such a flag, the mmap() will be allowed, but the task should
expect to be killed if memory is exhausted.  (It could be

Re: the new VMt

2000-09-26 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 03:12:50PM -0600, [EMAIL PROTECTED] wrote:
> > > 
> > > I'm not too sure of what you have in mind, but if it is
> > >  "process creates vast virtual space to generate many page table
> > >   entries -- using mmap"
> > > the answer is, virtual address space quotas and mmap should kill 
> > > the process on low mem for page tables.
> > 
> > No.  Page tables are not freed after munmap (and for good reason).  The
> > counting of page table "beans" is critical.
> 
> I've seen the assertion before, reasons would be interesting.

Reason 1: under DoS attack, you want to target not the process using
the most resources, but the *user* using the most resources (else a
fork-bomb style attack can work around your OOM-killer algorithms).

Reason 2: if you've got tasks stuck in low-level page allocation
routines, then you can't immediately kill -9 them, so reactive OOM
killing always has vulnerabilities --- to be robust in preventing
resource exhaustion you want limits on the use of those resources
before they are exhausted --- the necessary accounting being part of
what we refer to as "beancounter".
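
Reason 1 can be illustrated with a small C sketch, assuming a flat array of per-task usage records (the struct and function names are hypothetical). Summing by uid shows why a fork bomb made of many small tasks outweighs one large, legitimate task once usage is judged per user.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record for the sketch. */
struct task_usage {
    unsigned int  uid;
    unsigned long pages;
};

/* Sum the pages charged to one user across all of that user's tasks. */
static unsigned long user_total(const struct task_usage *t, size_t n,
                                unsigned int uid)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (t[i].uid == uid)
            sum += t[i].pages;
    return sum;
}

/* Return the uid with the largest total usage (the first such uid wins
 * ties). Requires n > 0. Quadratic, which is acceptable for a sketch. */
unsigned int heaviest_user(const struct task_usage *t, size_t n)
{
    assert(t != NULL && n > 0);

    unsigned int best_uid = t[0].uid;
    unsigned long best = user_total(t, n, best_uid);

    for (size_t i = 1; i < n; i++) {
        unsigned long total = user_total(t, n, t[i].uid);
        if (total > best) {
            best = total;
            best_uid = t[i].uid;
        }
    }
    return best_uid;
}
```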

--Stephen



Re: the new VMt

2000-09-26 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 03:07:44PM -0600, [EMAIL PROTECTED] wrote:
> On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote:
> > > I'm not too sure of what you have in mind, but if it is
> > >  "process creates vast virtual space to generate many page table
> > >   entries -- using mmap"
> > > the answer is, virtual address space quotas and mmap should kill 
> > > the process on low mem for page tables.
> > 
> > Those quotas being exactly what beancounter is
> 
> But that is a function specific counter, not a counter in the 
> alloc code.

Beancounter is a framework for user-level accounting.  _What_ you
account is up to the callers.  Maybe this has been a miscommunication,
but beancounter is all about allowing callers to account for stuff
before allocation, not about having the page allocation functions
themselves enforce quotas.
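
A hedged C sketch of the caller-side pattern described here: charge the counter first, allocate second, and undo the charge if the allocation fails. The struct and the bc_* names are invented for the example and are not the patch's real interface; malloc() stands in for a kernel allocator that knows nothing about quotas.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-user counter; not the patch's real data structure. */
struct beancounter {
    unsigned long held;   /* units currently charged */
    unsigned long limit;  /* quota for this user */
};

/* Charge n units before allocating. Fails without side effects. */
int bc_charge(struct beancounter *bc, unsigned long n)
{
    assert(bc != NULL && bc->held <= bc->limit);
    if (n > bc->limit - bc->held)
        return -ENOMEM;
    bc->held += n;
    return 0;
}

/* Return n units, for example when the object is freed. */
void bc_uncharge(struct beancounter *bc, unsigned long n)
{
    assert(bc != NULL && n <= bc->held);
    bc->held -= n;
}

/* Caller-side pattern: account first, allocate second, and roll the charge
 * back if the allocation itself fails. */
void *alloc_charged(struct beancounter *bc, size_t bytes)
{
    if (bc_charge(bc, bytes) != 0)
        return NULL;

    void *p = malloc(bytes);
    if (p == NULL)
        bc_uncharge(bc, bytes);
    return p;
}
```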

--Stephen



Re: the new VMt

2000-09-26 Thread Jes Sorensen

> "Ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes:

Ingo> On 26 Sep 2000, Jes Sorensen wrote:

>> 9.5KB blocks is common for people running Gigabit Ethernet with
>> Jumbo frames at least.

Ingo> yep, although this is more of a Linux limitation, the cards
Ingo> themselves are happy to DMA fragmented buffers as well. (sans
Ingo> some small penalty per new fragment.)

Hence the reason I have been pushing for the kiobufifying of the skbs ;-)
It's even more important for HIPPI with the 65280 bytes MTU.

Jes



Re: the new VMt

2000-09-26 Thread Ingo Molnar


On 26 Sep 2000, Jes Sorensen wrote:

> 9.5KB blocks is common for people running Gigabit Ethernet with Jumbo
> frames at least.

yep, although this is more of a Linux limitation, the cards themselves are
happy to DMA fragmented buffers as well. (sans some small penalty per new
fragment.)

Ingo




Re: the new VMt

2000-09-26 Thread Jes Sorensen

> "Ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes:

Ingo> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

>> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
>> 
>> You're right. That's why it's a waste to have so many order in the
>> buddy allocator. [...]

Ingo> yep, i agree. I'm not sure what the biggest allocation is, some
Ingo> drivers might use megabytes of contiguous RAM?

9.5KB blocks is common for people running Gigabit Ethernet with Jumbo
frames at least.

Jes



Re: the new VMt

2000-09-26 Thread Eric Lowe

Hello,

> > Another approach would be to let user space turn off overcommit.
> 
> No.  Overcommit only applies to pageable memory.  Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.

In addition to beancounter, do you think pageable page tables are
something we want to tackle in 2.5.x?  4MB page mappings on x86
could be cool too, as an option...

--
Eric Lowe
FibreChannel Software Engineer, Systran Corporation
[EMAIL PROTECTED]





Re: the new VMt

2000-09-26 Thread Horst von Brand

Erik Andersen [EMAIL PROTECTED] said:

[...]

> Another approach would be to let user space turn off overcommit.
> That way, user space can be assured there will be no surprises...

I'd call emacs consistently not being able to start an ls on a 16Mb machine
much worse than a surprise...

Hint: Think about how emacs would go about doing that...

Also, ensuring there is /no/ overcommit /anywhere/ amounts to a rigorous
audit of the whole kernel, and of each single patchlet that goes in. You
are certainly welcome to do the job...
-- 
Horst von Brand [EMAIL PROTECTED]
Casilla 9G, Vin~a del Mar, Chile   +56 32 672616



Re: the new VMt

2000-09-26 Thread Mark Hemment

Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote: 
> So you have run out of physical memory --- what do you do about it?

  Why let the system get into the state where it is necessary to kill a
process?
  Per-user/task resource counters should prevent unprivileged users from
soaking up too many resources.  That is the DoS protection.

  So an OOM is possibly one of:
1) A privileged, legitimately resource-hungry app (or apps) has taken
   all the memory.  It could be too important to simply kill (it
   should exit gracefully).
2) Simply too many tasks*(memory-requirements-of-each-task).

  Ignoring allocations done by the kernel, the situation comes down to the
fact that the system has over committed its memory resources.  ie. it has
sold too many tickets for the number of seats in the plane, and all the
passengers have turned up.
 (note, I use the term "memory" and not "physical memory", I'm including
swap space).

  Why not protect the system from over committing its memory resources?
  
  It is possible to do true, system wide, resource counting of physical
memory and swap space, and to deny a fork() or mmap() which would cause
over committing of memory resources if everyone cashed in their
requirements.

  Named pages (those which came from a file) are the simplest to
handle.  If dirty, they already have allocated backing store, so we know
there is somewhere to put them when memory is low.
  How many named pages need to be held in physical memory at any one
instance for the system to function?  Only a few, although if you reach
that state, the system will be thrashing itself to death.

  Anonymous and copied (those faulted from a write to a MAP_PRIVATE
mapping with PROT_WRITE) pages can be stored in either physical
memory or on swap.  To avoid getting into the OOM situation, when these
mappings are created the system needs to check that it has (and will have,
in the future) space for every page that _could_ be allocated for the
mapping - ie. work out the worst case (including page-tables).
  This space could be on swap or in physical memory.  It is the accounting
which needs to be done, not the actual allocation (and not even the
decision of where to store the page when allocated - that is made much
later, when it needs to be).  If a machine has 2GB of RAM, 1MB of
swap, and 1GB of dirty anon or copied pages, that is fine.
  I'm stressing this point, as the scheme of reserving space for an (as
yet) unallocated page is sometimes referred to as "eager swap
allocation" (or some such similar term).  This is confusing.  People then
start to believe they need backing store for each anon/copied page.  You
don't.  You simply need somewhere to store it, and that could be a
physical page.  It is all in the accounting. :)
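
A rough user-space sketch of the accounting described above, purely as an
illustration.  It is not kernel code: the names CommitLedger, reserve and
release are made up for this example, a 4KB page size is assumed, and
kernel allocations and the watermark discussed further down are ignored.
It only shows that a reservation needs "somewhere" (RAM or swap), not
swap specifically.

```python
# Toy model of worst-case commit accounting (not kernel code).
# All names here (CommitLedger, reserve, release) are hypothetical.

PAGE = 4096  # assumed page size in bytes
GB = 1 << 30
MB = 1 << 20


class CommitLedger:
    """Tracks pages promised to anon/copied mappings against RAM + swap."""

    def __init__(self, ram_pages, swap_pages):
        self.capacity = ram_pages + swap_pages  # total places a page can live
        self.committed = 0                      # pages promised so far

    def reserve(self, pages):
        """Account for a new mapping; refuse if the worst case cannot fit."""
        if self.committed + pages > self.capacity:
            return False  # this is where mmap()/fork() would be denied
        self.committed += pages
        return True

    def release(self, pages):
        """Give a reservation back when the mapping goes away."""
        self.committed -= min(pages, self.committed)


# The example from the text: 2GB of RAM, 1MB of swap, 1GB of anon/copied pages.
ledger = CommitLedger(ram_pages=2 * GB // PAGE, swap_pages=1 * MB // PAGE)
fits = ledger.reserve(1 * GB // PAGE)      # accepted: RAM counts as "somewhere"
too_much = ledger.reserve(2 * GB // PAGE)  # refused: 3GB promised > 2GB + 1MB
```

Only the counters change at reservation time; nothing is allocated and no
decision is made about where a page will eventually be stored.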

  Allocations made by the kernel, for the kernel, are (obviously) pinned
memory.  To ensure kernel allocations do not completely exhaust physical
memory (or cause physical memory to be over committed if the worst case
occurs), they need to be limited.
  How to limit?
  As a first guess (and this is only a guess):
1) don't let kernel allocations exceed 25% of physical memory
   (tunable)
2) don't let kernel allocations succeed if they would cause
   over commitment.
  Both conditions would need to pass before an allocation could succeed.
  This does need much more thought.  Should some tuning be per subsystem?
I don't know

  Perhaps 1) isn't needed.  I'm not sure.
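
As a sketch only, the two conditions above could be combined like this.
The helper name, its parameters and the way condition 2) is computed
(pinned kernel pages reduce the room left for pages already promised) are
assumptions for illustration, not something taken from an existing kernel
interface.

```python
def kernel_alloc_allowed(kernel_pages, request, ram_pages, committed,
                         capacity, kernel_share=0.25):
    """Toy check: both conditions must pass before a kernel allocation.

    kernel_pages: pages the kernel already holds (pinned)
    request:      pages asked for now
    ram_pages:    physical memory, in pages
    committed:    pages already promised to anon/copied mappings
    capacity:     RAM + swap, in pages
    kernel_share: tunable limit from condition 1), 25% by default
    """
    # 1) kernel allocations stay under a tunable share of physical memory
    under_share = kernel_pages + request <= kernel_share * ram_pages
    # 2) pinning these pages must still leave room for every promised page
    no_overcommit = committed + kernel_pages + request <= capacity
    return under_share and no_overcommit


# 1000 pages of RAM plus 500 of swap, 200 pages already pinned by the kernel.
ok = kernel_alloc_allowed(200, 40, 1000, committed=1000, capacity=1500)
over_share = kernel_alloc_allowed(200, 60, 1000, committed=1000, capacity=1500)
over_commit = kernel_alloc_allowed(200, 40, 1000, committed=1300, capacity=1500)
```

Dropping condition 1) would amount to passing kernel_share=1.0.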

  Because of 2), the total physical memory accounted for anon/copied
pages needs to have a high watermark.  Otherwise, in the accounting, the
system could allow too much physical memory to be reserved for these
types of pages (there doesn't need to be space on swap for each
anon/copied page, just space somewhere - a watermark would prevent too
much of this being physical memory).  Note, this doesn't mean start
swapping earlier - remember, this is accounting of anon/copied pages to
avoid over commitment.
  For named pages, the page cache needs to have a reserved number of
physical pages (ie. how small is it allowed to get, before pruning
stops).  Again, these reserved pages are in the accounting.

  mlock()ed pages also need accounting to prevent over commitment of
physical memory.  All fun.

  The disadvantages:

1) Extra code to do the accounting.
This shouldn't be too heavy.

2) mmap(MAP_ANON)/mmap(MAP_PRIVATE) with PROT_WRITE can fail more readily.

Programs which expect to memory map areas (which would create
anon/copied pages when written to) will see an increased failure
rate in mmap().  This can be very annoying, especially when you
know the mapping will be used sparsely.

One solution is to add a new mmap() flag, which tells the kernel
to let this mmap() exceed the actual resources.
With such a flag, the mmap() will be allowed, but the task should
expect to be killed if memory is exhausted.  (It could be
 

Re: the new VMt

2000-09-26 Thread yodaiken

On Tue, Sep 26, 2000 at 10:54:23AM +0100, Stephen C. Tweedie wrote:
> Beancounter is a framework for user-level accounting.  _What_ you
> account is up to the callers.  Maybe this has been a miscommunication,
> but beancounter is all about allowing callers to account for stuff
> before allocation, not about having the page allocation functions
> themselves enforce quotas.


Per-user, system-wide, and per-process quotas are one thing; a generic
pre-allocate-and-then-allocate scheme seems to me to be an error-prone
way of getting there.  In particular, I think it is dangerous to have a
pre-count that is only approximately tethered to the thing it is
counting.  In the memory allocation case we were discussing, you need to
make sure that the pre-allocations are for memory that is really going
to be allocated soon, and that each one is later correlated with a free
in some way.

So, to me, a quota-bounded allocate_page_table(process_id) makes much
more sense than pre-allocate counting or, even worse, a "smart" kmalloc
that never fails.  If the problem is unaccounted-for page tables, then
account for page tables and return a -EYOURPROCESSISOUTOFCONTROL so that
the calling kernel code can take the responsible action.
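
A toy user-space model of what such a quota-bounded call might look like,
as a sketch under assumptions rather than anyone's actual proposal.  The
class name, the per-owner limit and the use of -ENOMEM as a stand-in for
the joke error code are all invented for this example.  The point it tries
to show is that the charge happens at the moment of allocation and the
uncharge at the moment of the real free, so the count cannot drift away
from the thing being counted.

```python
import errno


class PageTableQuota:
    """Per-owner page-table page counts with a hard limit (toy model)."""

    def __init__(self, limit_per_owner):
        self.limit = limit_per_owner
        self.used = {}  # owner id -> page-table pages currently charged

    def allocate_page_table(self, owner):
        """Charge one page-table page to owner, or return -ENOMEM."""
        if self.used.get(owner, 0) >= self.limit:
            return -errno.ENOMEM  # the caller decides what to do about it
        self.used[owner] = self.used.get(owner, 0) + 1
        return 0

    def free_page_table(self, owner):
        """Uncharge one page at the point where the page is really freed."""
        if self.used.get(owner, 0) > 0:
            self.used[owner] -= 1


quota = PageTableQuota(limit_per_owner=2)
results = [quota.allocate_page_table(500) for _ in range(3)]
```

Here the third request from owner 500 is refused while other owners are
unaffected; whether the owner should be a process or a user is exactly
the kind of policy question the thread is arguing about.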
   

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-26 Thread yodaiken

On Tue, Sep 26, 2000 at 11:07:36AM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 03:12:50PM -0600, [EMAIL PROTECTED] wrote:
> 
> > > > > I'm not too sure of what you have in mind, but if it is
> > > > >  "process creates vast virtual space to generate many page table
> > > > >   entries -- using mmap"
> > > > > the answer is, virtual address space quotas and mmap should kill
> > > > > the process on low mem for page tables.
> > > 
> > > No.  Page tables are not freed after munmap (and for good reason).  The
> > > counting of page table "beans" is critical.
> > 
> > I've seen the assertion before, reasons would be interesting.
> 
> Reason 1: under DoS attack, you want to target not the process using
> the most resources, but the *user* using the most resources (else a
> fork-bomb style attack can work around your OOM-killer algorithms).

Ok.
  if (over_allocated_page_tables(task->uid)) return ENOMEM;

makes sense in "fork".  I guess the argument here is not about whether
accounting is good; it's about where the accounting should be done.  To me
the alternatives of

  if (preallocate_pages(page_table_size_for_this_process()) == -1) return error;
  then actually allocate, making sure to adjust counts if some other
  error turns up, and with something taking care of how the pre-allocation
  works while we are sleeping waiting for possibly unrelated resources

or

  just kmalloc, with kmalloc magically juggling resources in some safe way

seem less clear.

   

 

> Reason 2: if you've got tasks stuck in low-level page allocation
> routines, then you can't immediately kill -9 them, so reactive OOM
> killing always has vulnerabilities --- to be robust in preventing
> resource exhaustion you want limits on the use of those resources
> before they are exhausted --- the necessary accounting being part of
> what we refer to as "beancounter".

Doesn't the problem really come from doing low-level page allocation at
too high a level?  That is, instead of select doing get_free_page, maybe
it should do get_per_process_page(myprocess) or even
get_per_process_file_use_page(myprocess).  Then we could have
config-optional per-process pinned-page accounting, with the possibility
of doing something sensible in a user-space daemon when memory is low.

 
> --Stephen

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-26 Thread yodaiken

On Mon, Sep 25, 2000 at 05:14:11PM -0600, Erik Andersen wrote:
> On Mon Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote:
> > 
> > > all of the pending requests just as long as they are serialised, is
> > > this a problem?
> > 
> > I think you are solving the wrong problem. On a small memory machine, the kernel,
> > utilities, and applications should be configured to use little memory.
> > BusyBox is better than BeanCount.
> > 
> 
> Granted that smaller apps can help -- for a particular workload.  But while I
> am very partial to BusyBox (in fact I am about to cut a new release) I can
> assure you that OOM is easily possible even when your user space is tiny.  I do
> it all the time.  There are mallocs in busybox and when under memory pressure,
> the kernel still tends to fall over...

Operating systems cannot make more memory appear by magic.
The question is really about the best strategy for dealing with low memory. In my
opinion, the OS should not try to out-think physical limitations. Instead, the OS
should take as little space as possible and provide the ability for user level
clever management of space. In a truly embedded system, there can easily be a user
level root process that watches memory usage and prevents DOS attacks -- if the OS
provides settable enforced quotas etc.


-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-26 Thread Stephen C. Tweedie

Hi,

On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:

> Operating systems cannot make more memory appear by magic.
> The question is really about the best strategy for dealing with low memory. In my
> opinion, the OS should not try to out-think physical limitations. Instead, the OS
> should take as little space as possible and provide the ability for user level
> clever management of space. In a truly embedded system, there can easily be a user
> level root process that watches memory usage and prevents DOS attacks -- if the OS
> provides settable enforced quotas etc.

Agreed, absolutely.  The beancounter is one approach to those quotas,
and has the advantage of allowing per-user as well as per-process
quotas.

--Stephen



Re: the new VMt

2000-09-26 Thread Stephen C. Tweedie

Hi,

On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:

> Another approach would be to let user space turn off overcommit.

No.  Overcommit only applies to pageable memory.  Beancounter is
really needed for non-pageable resources such as page tables and
mlock()ed pages.

Cheers,
 Stephen



Re: the new VMt

2000-09-26 Thread Erik Andersen

On Tue Sep 26, 2000 at 05:04:06PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:
> 
> > Operating systems cannot make more memory appear by magic.
> > The question is really about the best strategy for dealing with low memory. In my
> > opinion, the OS should not try to out-think physical limitations. Instead, the OS
> > should take as little space as possible and provide the ability for user level
> > clever management of space. In a truly embedded system, there can easily be a user
> > level root process that watches memory usage and prevents DOS attacks -- if the OS
> > provides settable enforced quotas etc.
> 
> Agreed, absolutely.  The beancounter is one approach to those quotas,
> and has the advantage of allowing per-user as well as per-process
> quotas.

Another approach would be to let user space turn off overcommit.  
That way, user space can be assured there will be no surprises...

 -Erik

--
Erik B. Andersen   email:  [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--



Re: the new VMt

2000-09-25 Thread Erik Andersen

On Mon Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote:
> 
> > all of the pending requests just as long as they are serialised, is
> > this a problem?
> 
> I think you are solving the wrong problem. On a small memory machine, the kernel,
> utilities, and applications should be configured to use little memory.  
> BusyBox is better than BeanCount. 
> 

Granted that smaller apps can help -- for a particular workload.  But while I
am very partial to BusyBox (in fact I am about to cut a new release) I can
assure you that OOM is easily possible even when your user space is tiny.  I do
it all the time.  There are mallocs in busybox and when under memory pressure,
the kernel still tends to fall over...

 -Erik

--
Erik B. Andersen   email:  [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--



Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 04:47:21PM -0400, Benjamin C.R. LaHaise wrote:
> On Mon, 25 Sep 2000 [EMAIL PROTECTED] wrote:
> 
> > On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote:
> > > > my prediction is that if you show me an example of 
> > > > DoS vulnerability,  I can show you fix that does not require bean counting.
> > > > Am I wrong?
> > > 
> > > I think so. Page tables are a good example
> > 
> > I'm not too sure of what you have in mind, but if it is
> >  "process creates vast virtual space to generate many page table
> >   entries -- using mmap"
> > the answer is, virtual address space quotas and mmap should kill 
> > the process on low mem for page tables.
> 
> No.  Page tables are not freed after munmap (and for good reason).  The
> counting of page table "beans" is critical.

I've seen the assertion before, reasons would be interesting.


-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote:
> > I'm not too sure of what you have in mind, but if it is
> >  "process creates vast virtual space to generate many page table
> >   entries -- using mmap"
> > the answer is, virtual address space quotas and mmap should kill 
> > the process on low mem for page tables.
> 
> Those quotas being exactly what beancounter is

But that is a function specific counter, not a counter in the 
alloc code.


-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-25 Thread Alan Cox

> I'm not too sure of what you have in mind, but if it is
>  "process creates vast virtual space to generate many page table
>   entries -- using mmap"
> the answer is, virtual address space quotas and mmap should kill 
> the process on low mem for page tables.

Those quotas being exactly what beancounter is




Re: the new VMt

2000-09-25 Thread Benjamin C.R. LaHaise

On Mon, 25 Sep 2000 [EMAIL PROTECTED] wrote:

> On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote:
> > > my prediction is that if you show me an example of 
> > > DoS vulnerability,  I can show you fix that does not require bean counting.
> > > Am I wrong?
> > 
> > I think so. Page tables are a good example
> 
> I'm not too sure of what you have in mind, but if it is
>  "process creates vast virtual space to generate many page table
>   entries -- using mmap"
> the answer is, virtual address space quotas and mmap should kill 
> the process on low mem for page tables.

No.  Page tables are not freed after munmap (and for good reason).  The
counting of page table "beans" is critical.

-ben




Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote:
> > my prediction is that if you show me an example of 
> > DoS vulnerability,  I can show you fix that does not require bean counting.
> > Am I wrong?
> 
> I think so. Page tables are a good example

I'm not too sure of what you have in mind, but if it is
 "process creates vast virtual space to generate many page table
  entries -- using mmap"
the answer is, virtual address space quotas and mmap should kill 
the process on low mem for page tables.


-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote:

> > Right, but if the alternative is spurious ENOMEM when we can satisfy
> 
> An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the
> OS to do impossible tricks.

Yes, but the ENOMEM _is_ spurious if you actually meant EAGAIN, and if
the OS was perfectly capable of doing the retry itself.

> > all of the pending requests just as long as they are serialised, is
> > this a problem?
> 
> I think you are solving the wrong problem. On a small memory machine, the kernel,
> utilities, and applications should be configured to use little memory.  
> BusyBox is better than BeanCount. 

Any box is a small memory machine if you get the wrong workload on it,
and the DoS attacks which are possible without beancounting let any
user bring even a large system to its knees right now.  If solving
that problem also means that small memory machines do the right thing
on their own rather than requiring specific manual configuration, then
it sounds like a good aim.

> > However, you just can't escape from the fact that on low memory
> > machines, we *need* beancounter-style accounting of pinned pages or
> > we'll be in Deep Trouble (TM).  We already have nasty DoS situations
> 
> What we need is simple kernel code that does not hold resources
> into a  possible deadlock situation. 



> On general principles, I don't see any substitute for clean code in the kernel and
> my prediction is that if you show me an example of 
> DoS vulnerability,  I can show you fix that does not require bean counting.
> Am I wrong?

If you have a user forking multiple processes and exhausting some
resource, then at some point you have to do something about it.  Let's
say it's page tables, just for argument's sake, because those are
currently non-swappable, but even if you make those swappable there
are plenty of other resources it might be (eg. data shoved down unix
domain sockets if you want another example).

So you have run out of physical memory --- what do you do about it?
The important observation here is that in a multi-user environment,
simply denying further allocations isn't good enough --- unless you
revoke those existing allocations you have DoS.  And you can't fairly
revoke existing allocations without knowing WHICH user has exhausted
the memory (which requires beancounter-style resource tracking), AND
having mechanisms in place to revoke all of the possible resources
which might be involved (eg unix domain socket datagrams).  kill -9
might solve that latter problem but it doesn't help in identifying who
to kill.
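
To make the "WHICH user" point concrete, here is a small illustrative
model.  The task list, the uids and both helper functions are made up for
the example; it only contrasts a per-process view with a per-user view of
the same numbers under a fork-bomb style load.

```python
from collections import defaultdict


def heaviest_user(tasks):
    """tasks: list of (pid, uid, pinned_pages).  Returns the uid whose
    tasks hold the most pages in total."""
    per_user = defaultdict(int)
    for _pid, uid, pages in tasks:
        per_user[uid] += pages
    return max(per_user, key=per_user.get)


def uid_of_heaviest_task(tasks):
    """What a purely per-process heuristic would single out instead."""
    _pid, uid, _pages = max(tasks, key=lambda t: t[2])
    return uid


# uid 100 runs one large but legitimate task; uid 200 has forked
# 100 small tasks of 10 pages each.
tasks = [(1, 100, 300)] + [(pid, 200, 10) for pid in range(2, 102)]

by_user = heaviest_user(tasks)         # uid 200: 1000 pages in total
by_task = uid_of_heaviest_task(tasks)  # uid 100: biggest single task
```

Looking only at the biggest process blames uid 100, although uid 200 holds
more than three times as much in total; spotting that requires keeping the
counts per user.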

--Stephen



Re: the new VMt

2000-09-25 Thread Alan Cox

> my prediction is that if you show me an example of 
> DoS vulnerability,  I can show you fix that does not require bean counting.
> Am I wrong?

I think so. Page tables are a good example





Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 08:25:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 12:34:56PM -0600, [EMAIL PROTECTED] wrote:
> 
> > > > Process 1,2 and 3 all start allocating 20 pages
> > > > now 57 pages are locked up in non-swapable kernel space and the system
> > > > deadlocks OOM.
> > > 
> > > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > > told "yes", and goes allocating them, blocking as necessary until it
> > 
> > So you have a "pre-allocation allocator"?  Leads to interesting and hard to detect
> > bugs with old code that does not pre-allocate or with code that incorrectly
> > pre-allocates or that blocks on something unrelated
> 
> Right, but if the alternative is spurious ENOMEM when we can satisfy

An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the
OS to do impossible tricks.

> all of the pending requests just as long as they are serialised, is
> this a problem?

I think you are solving the wrong problem. On a small memory machine, the kernel,
utilities, and applications should be configured to use little memory.  
BusyBox is better than BeanCount. 


> However, you just can't escape from the fact that on low memory
> machines, we *need* beancounter-style accounting of pinned pages or
> we'll be in Deep Trouble (TM).  We already have nasty DoS situations

What we need is simple kernel code that does not hold resources
into a  possible deadlock situation. 

> which are embarrassingly easy to reproduce.  If we need such
> beancounter protection, AND such protection can prevent the situation
> you describe, then do we need to go looking for another way of
> achieving the same protection?


On general principles, I don't see any substitute for clean code in the
kernel, and my prediction is that if you show me an example of a DoS
vulnerability, I can show you a fix that does not require bean counting.
Am I wrong?





-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt [4MB+ blocks]

2000-09-25 Thread Stephen Williams


[EMAIL PROTECTED] said:
> Sometimes allocating such monster memory blocks could be supported,
>   but it should not be expected to be *fast*.  E.g. if doing it in
>   "reliable" way needs possibly moving currently allocated pages
>   away from memory to create such a hole(s), so be it.


[EMAIL PROTECTED] said:
> Anybody here who can describe those M$ API calls ?
>   Are they kernel/DDK-only, or userspace ones, or both ?

NT does indeed support allocating contiguous buffers of memory, which is
useful when the hardware in question doesn't do scatter-gather. I have
on occasion been compelled to use these routines. (Paradoxically, the
requirements in my case came from broken NT mmap support and not from the
hardware. Blech!)

Anyhow, these routines are indeed slow. And judging by the amount of disk
noise I hear when they are called, they do try to kick out pages to make
an allocation work. However, even so the M$ calls will eventually fail due
to lack of large enough holes, so fragmentation takes its toll.

So, they are both slow and unreliable under NT. But drivers that use them
tend to be loaded once at boot time, and that's it.
-- 
Steve Williams                "The woods are lovely, dark and deep.
[EMAIL PROTECTED]             But I have promises to keep,
[EMAIL PROTECTED]             and lines to code before I sleep,
http://www.picturel.com       And lines to code before I sleep."





Re: the new VMt

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 12:34:56PM -0600, [EMAIL PROTECTED] wrote:

> > > Process 1,2 and 3 all start allocating 20 pages
> > > now 57 pages are locked up in non-swapable kernel space and the system
> > > deadlocks OOM.
> > 
> > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > told "yes", and goes allocating them, blocking as necessary until it
> 
> So you have a "pre-allocation allocator"?  Leads to interesting and hard to detect
> bugs with old code that does not pre-allocate or with code that incorrectly
> pre-allocates or that blocks on something unrelated

Right, but if the alternative is spurious ENOMEM when we can satisfy
all of the pending requests just as long as they are serialised, is
this a problem?

If you want, wrap it in a "get_free_pagev" call which returns a vector
of pointers to free pages, doing whatever accounting is needed.  You
don't have to push all of it to the callers.
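
A toy simulation of the scenario under discussion, offered only as a
sketch.  It is not an implementation of "get_free_pagev"; the function
names, the round-robin scheduling and the assumption that an admitted
request always completes and returns its pages are simplifications made
up for this example.  It compares page-at-a-time allocation with asking
for the whole request up front, using the numbers from the quoted
example (three processes, 20 pages each, 57 pages available).

```python
def incremental(pool, need, nprocs):
    """Each process grabs pages one at a time, round robin.
    Returns (pages held per process, whether anyone completed)."""
    held = [0] * nprocs
    while pool > 0 and all(h < need for h in held):
        for i in range(nprocs):
            if pool > 0 and held[i] < need:
                held[i] += 1
                pool -= 1
    return held, any(h == need for h in held)


def reserve_first(pool, need, nprocs):
    """Admit a process only if its whole request fits; the rest wait.
    Returns (completion order, whether everyone completed)."""
    reserved = 0
    waiting = list(range(nprocs))
    order = []
    while waiting:
        admitted = []
        for p in list(waiting):
            if reserved + need <= pool:
                reserved += need
                admitted.append(p)
                waiting.remove(p)
        if not admitted:
            return order, False  # a single request is larger than the pool
        for p in admitted:
            order.append(p)      # admitted work can always finish...
            reserved -= need     # ...and then gives its pages back
    return order, True


stuck_held, stuck_done = incremental(pool=57, need=20, nprocs=3)
order, all_done = reserve_first(pool=57, need=20, nprocs=3)
```

In the first case every process ends up holding 19 pages and nobody can
finish; in the second, two requests are admitted at once and the third
simply waits its turn, which is the serialisation referred to above.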

However, you just can't escape from the fact that on low memory
machines, we *need* beancounter-style accounting of pinned pages or
we'll be in Deep Trouble (TM).  We already have nasty DoS situations
which are embarrassingly easy to reproduce.  If we need such
beancounter protection, AND such protection can prevent the situation
you describe, then do we need to go looking for another way of
achieving the same protection?

--Stephen



Re: the new VMt

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 08:09:31PM +0100, Alan Cox wrote:
> > > Indeed. But we wont fail the kmalloc with a NULL return
> > 
> > Isn't that the preferred behaviour, though?  If we are completely out
> > of VM on a no-swap machine, we should be killing one of the existing
> > processes rather than preventing any progress and keeping all of the
> > old tasks alive but deadlocked.
> 
> Unless Im missing something we wont kill any task in that condition - even
> a SIGKILL will make no odds as everyone is asleep in kmalloc

Right.  Eeek.




Re: the new VMt

2000-09-25 Thread Alan Cox

> > Indeed. But we wont fail the kmalloc with a NULL return
> 
> Isn't that the preferred behaviour, though?  If we are completely out
> of VM on a no-swap machine, we should be killing one of the existing
> processes rather than preventing any progress and keeping all of the
> old tasks alive but deadlocked.

Unless Im missing something we wont kill any task in that condition - even
a SIGKILL will make no odds as everyone is asleep in kmalloc





Re: the new VMt [4MB+ blocks]

2000-09-25 Thread Matti Aarnio

 [Chopped the recipient list radically]

On Mon, Sep 25, 2000 at 06:06:11PM +0100, Alan Cox wrote:
> > > > Stupidity has no limits...
> > > Unfortunately its frequently wired into the hardware to save a few cents on
> > > scatter gather logic.
> > 
> > Since when hardware folks became exempt from the rule above? 128K is
> > almost tolerable, there were requests for 64 _mega_bytes...
> 
> Most cheap ass PCI hardware is built on the basis you can do linear 4Mb 
> allocations. There is a reason for this. You can do that 4Mb allocation on
> NT or Windows 9x

Sure, but intel processors have this neat 4 MB "super-page"
feature in the MMU...  (as we all well know)

Sometimes allocating such monster memory blocks could be supported,
but it should not be expected to be *fast*.  E.g. if doing it in a
"reliable" way means moving currently allocated pages out of the way
to create such a hole (or holes), so be it..


Anybody here who can describe those M$ API calls ?
Are they kernel/DDK-only, or userspace ones, or both ?

/Matti Aarnio



Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 07:24:53PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 12:13:15PM -0600, [EMAIL PROTECTED] wrote:
> 
> > > Definitely not.  GFP_ATOMIC is reserved for things that really can't
> > > swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
> > > have to increase the number of atomic-allocatable pages.
> > 
> > Process 1,2 and 3 all start allocating 20 pages
> >   process 1 stalls after allocating 19
> >   some memory is freed and process 2 runs and stall after allocating 19
> >   some memory is free and process 3 runs and stalls after allocating 19
> >  
> > now 57 pages are locked up in non-swappable kernel space and the system
> > deadlocks OOM.
> 
> Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> told "yes", and goes allocating them, blocking as necessary until it

So you have a "pre-allocation allocator"?  Leads to interesting and hard-to-detect
bugs with old code that does not pre-allocate, or with code that incorrectly
pre-allocates, or that blocks on something unrelated:

   preallocate 20 pages
   get first
   ask for an inode -- block waiting for an inode


or
   preallocate 20 pages
   if(checkuserpath())return -ENOWAY; /* stranding my pre-allocate */
   else get them pages


What's nice about these is they don't cause errors on test and seem more 
difficult to spot than looking for cases where allocated memory gets stranded.
Doesn't the alloc_vec method seem simpler to you?

> gets them.  Process 2 asks "can *I* pin 20 pages" and the answer is
> either "not right now", in which case it waits for process 1 to
> release its reservation, or "no, you've exceeded your user quota" in

Or for someone else to free more pages ... 

> which case it fails with ENOMEM.  (That latter case can protect us
> against a lot of DoS attacks from local users.)

I like ENOMEM anyways.

> 
> The same accounting really needs to be done for page tables, as that
> represents one of the biggest sources of unaccounted, unswappable
> pages which user processes can cause to be created right now.



-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 12:13:15PM -0600, [EMAIL PROTECTED] wrote:

> > Definitely not.  GFP_ATOMIC is reserved for things that really can't
> > swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
> > have to increase the number of atomic-allocatable pages.
> 
> Process 1,2 and 3 all start allocating 20 pages
>   process 1 stalls after allocating 19
>   some memory is freed and process 2 runs and stall after allocating 19
>   some memory is free and process 3 runs and stalls after allocating 19
>  
> now 57 pages are locked up in non-swappable kernel space and the system
> deadlocks OOM.

Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
told "yes", and goes allocating them, blocking as necessary until it
gets them.  Process 2 asks "can *I* pin 20 pages" and the answer is
either "not right now", in which case it waits for process 1 to
release its reservation, or "no, you've exceeded your user quota" in
which case it fails with ENOMEM.  (That latter case can protect us
against a lot of DoS attacks from local users.)
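
A rough userspace sketch of the reservation step described above, under stated
assumptions: `struct page_pool`, `struct user_account`, `bc_reserve` and
`bc_release` are hypothetical names made up for illustration, not kernel
interfaces, and the "not right now" answer is modelled as an -EAGAIN return
rather than a real sleep.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: pinnable pages not yet promised to anyone. */
struct page_pool {
    long free;
};

/* Hypothetical per-user accounting state. */
struct user_account {
    long quota;   /* ceiling on pages this user may have pinned */
    long held;    /* pages currently reserved by this user */
};

/*
 * Ask "can I pin `pages` pages?" before allocating any of them.
 * Returns 0 if the reservation is granted,
 *        -ENOMEM if it would exceed the user's quota,
 *        -EAGAIN if the caller should wait for a release and retry.
 */
int bc_reserve(struct page_pool *pool, struct user_account *user, long pages)
{
    assert(pages > 0);
    if (user->held + pages > user->quota)
        return -ENOMEM;
    if (pages > pool->free)
        return -EAGAIN;
    pool->free -= pages;
    user->held += pages;
    return 0;
}

/* Give a reservation back so that waiters can make progress. */
void bc_release(struct page_pool *pool, struct user_account *user, long pages)
{
    assert(pages > 0 && pages <= user->held);
    user->held -= pages;
    pool->free += pages;
}
```

Because the whole request is granted or refused up front, no caller ends up
holding 19 of its 20 pages while waiting for the last one.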

The same accounting really needs to be done for page tables, as that
represents one of the biggest sources of unaccounted, unswappable
pages which user processes can cause to be created right now.

--Stephen



Re: the new VMt

2000-09-25 Thread Alan Cox

> there is no swap.  If there is truly nothing kswapd can do to recover
> here, then we are truly OOM.  Otherwise, kswapd should be able to free

Indeed. But we wont fail the kmalloc with a NULL return




Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 08:04:54PM +0200, Jamie Lokier wrote:
> [EMAIL PROTECTED] wrote:
> > > [EMAIL PROTECTED] wrote:
> > > >walk = out;
> > > > while(nfds > 0) {
> > > > poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > > > if (!tmp) {
> > > 
> > > Shouldn't this be GFP_USER?  (Which would also conveniently fix the
> > > problem Victor's pointing out...)
> > 
> > It should probably be GFP_ATOMIC, if I understand the mm right. 
> 
> Definitely not.  GFP_ATOMIC is reserved for things that really can't
> swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
> have to increase the number of atomic-allocatable pages.

Processes 1, 2 and 3 all start allocating 20 pages
  process 1 stalls after allocating 19
  some memory is freed and process 2 runs and stalls after allocating 19
  some memory is freed and process 3 runs and stalls after allocating 19
 
now 57 pages are locked up in non-swappable kernel space and the system deadlocks
OOM.



> > The algorithm for requesting a collection of resources and freeing all
> > of them on failure is simple, fast, and robust.
> 
> Allocation is just as fast with GFP_KERNEL/USER, just less likely to

It's not speed, it's deadlock avoidance. 

> fail and less likely to break something else that really needs
> GFP_ATOMIC allocations.

My point here is simply that error returns in memory allocation allow 
higher level kernel operations to safely marshal a collection of resources following
a safe algorithm that is optimized for the case when there is no memory shortage
and that only starts going to the slow case when the system is stalling due to memory
shortages anyways.
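
The marshalling pattern described above (take everything, or give it all back
on the first failure) can be sketched in userspace C. This is an illustration
only: `alloc_vec`, `free_vec`, `page_alloc` and the `fail_after` failure
injection are hypothetical helpers, with malloc() standing in for a page
allocator that is allowed to return NULL.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES 4096

/* Hypothetical failure injection: -1 never fails, N fails after N successes. */
int fail_after = -1;
/* Number of pages currently held, to make stranded pages visible. */
int live_pages = 0;

void *page_alloc(void)
{
    void *p;

    if (fail_after == 0)
        return NULL;            /* simulated memory shortage */
    if (fail_after > 0)
        fail_after--;
    p = malloc(PAGE_BYTES);
    if (p)
        live_pages++;
    return p;
}

void page_free(void *p)
{
    free(p);
    live_pages--;
}

/* Fill vec[0..n-1] with pages, or release everything and return -1. */
int alloc_vec(void **vec, int n)
{
    int i;

    assert(vec != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        vec[i] = page_alloc();
        if (!vec[i]) {
            /* Roll back: nothing stays pinned while the caller backs off. */
            while (i-- > 0) {
                page_free(vec[i]);
                vec[i] = NULL;
            }
            return -1;
        }
    }
    return 0;
}

void free_vec(void **vec, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        page_free(vec[i]);
        vec[i] = NULL;
    }
}
```

In the common case the loop costs nothing beyond the allocations themselves;
the rollback path only runs when an allocation has already failed.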



> 
> -- Jamie

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 11:51:39AM -0600, [EMAIL PROTECTED] wrote:
> It should probably be GFP_ATOMIC, if I understand the mm right. 

poll_wait is called from the f_op->poll callback from select just before
a sleep and since it's allowed to sleep too it should be a GFP_KERNEL
(not ATOMIC). Using GFP_ATOMIC where GFP_KERNEL can be used is a bug
and it can lead to failed allocations even while there's huge amount
of freeable/recyclable cache.

The reason it isn't GFP_USER but it's a GFP_KERNEL is because the memory
isn't allocated in userspace.

On a solid VM the only difference between GFP_USER and GFP_KERNEL happens to be
when the machine runs truly out of memory. In 2.4.x GFP_KERNEL should probably
be changed not to short the PF_MEMALLOC atomic queue when memory balancing
fails (then they would be equal).

> The algorithm for requesting a collection of resources and freeing all of them
>  on failure is simple, fast, and robust. 

Yes, I tend to like that style too because it's obviously safe and it obviously
can't deadlock during oom.

Andrea



Re: the new VMt

2000-09-25 Thread Jamie Lokier

[EMAIL PROTECTED] wrote:
> > [EMAIL PROTECTED] wrote:
> > >walk = out;
> > > while(nfds > 0) {
> > > poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > > if (!tmp) {
> > 
> > Shouldn't this be GFP_USER?  (Which would also conveniently fix the
> > problem Victor's pointing out...)
> 
> It should probably be GFP_ATOMIC, if I understand the mm right. 

Definitely not.  GFP_ATOMIC is reserved for things that really can't
swap or schedule right now.  Use GFP_ATOMIC indiscriminately and you'll
have to increase the number of atomic-allocatable pages.

> The algorithm for requesting a collection of resources and freeing all
> of them on failure is simple, fast, and robust.

Allocation is just as fast with GFP_KERNEL/USER, just less likely to
fail and less likely to break something else that really needs
GFP_ATOMIC allocations.

-- Jamie



Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 07:18:29PM +0200, Jamie Lokier wrote:
> [EMAIL PROTECTED] wrote:
> >walk = out;
> > while(nfds > 0) {
> > poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > if (!tmp) {
> 
> Shouldn't this be GFP_USER?  (Which would also conveniently fix the
> problem Victor's pointing out...)

It should probably be GFP_ATOMIC, if I understand the mm right. 

The algorithm for requesting a collection of resources and freeing all of them
on failure is simple, fast, and robust. 


  

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-25 Thread Jeff Garzik

On Mon, 25 Sep 2000, Oliver Xymoron wrote:
> Sure about that? It's been a while, but I seem to recall NT enforcing a
> scatter-gather framework on all drivers because it only gave them virtual
> allocations. For the cheaper cards, the s-g was done by software issuing
> single span requests to the card.

The Matrox framegrabber guys use some API under NT to allocate
megabytes upon megabytes of contiguous memory for DMA.

Jeff






Re: the new VMt

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 05:51:49PM +0100, Alan Cox wrote:
> > > 2 active processes, no swap
> > > 
> > > #1              #2
> > > kmalloc 32K     kmalloc 16K
> > > OK              OK
> > > kmalloc 16K     kmalloc 32K
> > > block           block
> > > 
> > 
> > ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> > able to eat memory which processes #1 and #2 are not allowed to touch.
> 
> 'no swap'

kswapd is perfectly capable of evicting clean pages and triggering any
necessary writeback of dirty filesystem data at this point, even if
there is no swap.  If there is truly nothing kswapd can do to recover
here, then we are truly OOM.  Otherwise, kswapd should be able to free
the required memory, providing that the PF_MEMALLOC flag allows it to
eat into a reserved set of free pages which nobody else can allocate
once physical free pages get below a certain threshold.

--Stephen 



Re: the new VMt

2000-09-25 Thread Oliver Xymoron

On Mon, 25 Sep 2000, Alan Cox wrote:

> > > > Stupidity has no limits...
> > > 
> > > Unfortunately its frequently wired into the hardware to save a few cents on
> > > scatter gather logic.
> > 
> > Since when hardware folks became exempt from the rule above? 128K is
> > almost tolerable, there were requests for 64 _mega_bytes...
> 
> Most cheap ass PCI hardware is built on the basis you can do linear 4Mb 
> allocations. There is a reason for this. You can do that 4Mb allocation on
> NT or Windows 9x

Sure about that? It's been a while, but I seem to recall NT enforcing a
scatter-gather framework on all drivers because it only gave them virtual
allocations. For the cheaper cards, the s-g was done by software issuing
single span requests to the card.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 




Re: the new VMt

2000-09-25 Thread Stephen C. Tweedie

Hi,

On Mon, Sep 25, 2000 at 06:05:00PM +0200, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> > Progress is made, clean pages are discarded and dirty ones queued for
> 
> How can you make progress if there isn't swap available and all the
> freeable page/buffer cache has just been freed? The deadlock happens
> in OOM condition (not when we can make progress).

Agreed --- this assumes that all pinned, nonswappable pages are
subject to resource limiting to prevent them from exhausting the whole
of memory.  For things like page tables, that means we need
beancounter in place for us to be 100% safe.  For the no-swap case,
that requires an OOM killer.

The problem of avoiding filling memory with pinned pages is orthogonal
to the problem of managing the unpinned memory.  Both are obviously
required for a stable system.

Cheers,
 Stephen



Re: the new VMt

2000-09-25 Thread Jamie Lokier

[EMAIL PROTECTED] wrote:
>walk = out;
> while(nfds > 0) {
> poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> if (!tmp) {

Shouldn't this be GFP_USER?  (Which would also conveniently fix the
problem Victor's pointing out...)

-- Jamie



Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 02:10:07PM -0300, Rik van Riel wrote:
> Not really. We could fix this by making the page freeing
> functions smarter and only free the pages we need.

That's what I proposed in the first place, in fact.

To free a large chunk of memory you may have to throw away lots of cache. We're
not freeing contiguous cache as we do in 2.2.x.

Andrea



Re: the new VMt

2000-09-25 Thread yodaiken

On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> > 
> > Unless Im missing something here think about this case
> > 
> > 2 active processes, no swap
> > 
> > #1  #2
> > kmalloc 32K kmalloc 16K
> > OK  OK
> > kmalloc 16K kmalloc 32K
> > block   block
> > 
> 
> ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.
> Progress is made, clean pages are discarded and dirty ones queued for
> write, memory becomes free again and the world is a better place.
> 
> Or so goes the theory, at least.

from fs/select.c

    walk = out;
    while(nfds > 0) {
        poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
        if (!tmp) {
            while(out != NULL) {
                tmp = out->next;
                free_page((unsigned long)out);
                out = tmp;
            }
            return NULL;
        }
        tmp->nr = 0;
        tmp->entry = (struct poll_table_entry *)(tmp + 1);
        tmp->next = NULL;
        walk->next = tmp;
        walk = tmp;
        nfds -= __MAX_POLL_TABLE_ENTRIES;
    }


> 
> --Stephen

-- 
-
Victor Yodaiken 
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com




Re: the new VMt

2000-09-25 Thread Alan Cox

> > > Stupidity has no limits...
> > 
> > Unfortunately its frequently wired into the hardware to save a few cents on
> > scatter gather logic.
> 
> Since when hardware folks became exempt from the rule above? 128K is
> almost tolerable, there were requests for 64 _mega_bytes...

Most cheap ass PCI hardware is built on the basis you can do linear 4Mb 
allocations. There is a reason for this. You can do that 4Mb allocation on
NT or Windows 9x





Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 07:03:46PM +0200, Ingo Molnar wrote:
> [..] __GFP_SOFT solves this all very nicely [..]

s/very nicely/throwing away lots of useful cache for no one good reason/

Andrea



Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 09:49:46AM -0700, Linus Torvalds wrote:
> [..] I
> don't think the balancing has to take the order of the allocation into
> account [..]

Why do you prefer to throw away most of the cache (potentially at fork time)
instead of freeing only the few contiguous bits that we need?

Andrea



Re: the new VMt

2000-09-25 Thread Alexander Viro



On Mon, 25 Sep 2000, Alan Cox wrote:

> > > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > > might use megabytes or contiguous RAM?
> > 
> > Stupidity has no limits...
> 
> Unfortunately its frequently wired into the hardware to save a few cents on
> scatter gather logic.

Since when did hardware folks become exempt from the rule above? 128K is
almost tolerable, there were requests for 64 _mega_bytes...




Re: the new VMt

2000-09-25 Thread Alan Cox

> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?
> 
> Stupidity has no limits...

Unfortunately it's frequently wired into the hardware to save a few cents on
scatter-gather logic.

We need 128K blocks for sound DMA buffers, and on most sound cards they need to
be linear (but not the newer ones, thankfully). Some video capture hardware
needs 4Mb, but that needs to use bootmem (in 2.2 they use bigmem hacks).




Re: the new VMt

2000-09-25 Thread Alan Cox

> > kmalloc 16K kmalloc 32K
> > block   block
> > 
> 2) set PF_MEMALLOC on the task you're killing for OOM,
>that way this task will either get the memory or
>fail (note that PF_MEMALLOC tasks don't wait)

Nobody is out of memory at this point. Everyone is in kernel space blocking
for someone else. There is also no further allocation after this deadlock 
point to cause a kill




Re: the new VMt

2000-09-25 Thread Alan Cox

> Frankly, how often do we allocate multi-order pages? I've just made quick
> statistics wrt. how allocation orders are distributed on a more or less
> typical system:

Enough that failures on this crashed older 2.2 kernels, because the tcp code
ended up looping trying to get memory and the slab allocator couldn't get
a new multipage block. 

Alan




Re: the new VMt

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Linus Torvalds wrote:

> Yes, I'm inclined to agree. Or at least not disagree. I'm more arguing
> that the order itself may not be the most interesting thing, and that
> I don't think the balancing has to take the order of the allocation
> into account - because it should be equivalent to just tell that it's
> a soft allocation (whether though the current !__GFP_HIGH or through a
> new __GFP_SOFT with slightly different logic).

yep, and there is another problem with pure order-based distinction: if i
do kmalloc(5k), and write the code on Alpha and expect it to never fail,
shouldnt i expect this to never fail on x86 as well? Along with the fork()
failure. __GFP_SOFT solves this all very nicely - the *allocator* decides
what allocation policy to follow. Great!

Ingo




Re: the new VMt

2000-09-25 Thread Alan Cox

> > 2 active processes, no swap
> > 
> > #1              #2
> > kmalloc 32K     kmalloc 16K
> > OK              OK
> > kmalloc 16K     kmalloc 32K
> > block           block
> > 
> 
> ... and we get two wakeup_kswapd()s.  kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.

'no swap'




Re: the new VMt

2000-09-25 Thread Rik van Riel

On Mon, 25 Sep 2000, Linus Torvalds wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > 
> > But I'd much prefer to pass not only the classzone from allocator
> > to memory balancing, but _also_ the order of the allocation,
> > and then shrink_mmap will know it isn't worth freeing anything
> > that isn't contiguous on the order of the allocation that we need.
> 
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a multi-
> page allocation and that the MM layer should not try forever to handle it.
> 
> In fact, that's independent of whether it is a multi-page
> allocation or not. It might be something like __GFP_SOFT - you
> could use it with single pages too.
> 
> Thinking about it, we do have it already. It's called
> !__GFP_HIGH, and it used by all the GFP_USER allocations.

Hmm, I think these two are orthogonal.

__GFP_HIGH means that we are allowed to eat deeper into
the free list (maybe needed to avoid a deadlock freeing
pages)

__GFP_SOFT would mean "don't bother waiting for free pages",
which is something very different...

(I wouldn't want a user process to get killed simply because
kswapd is waiting for IO to finish on a swapout, in that case
we really do want to sleep for a while)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: the new VMt

2000-09-25 Thread Linus Torvalds



On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> But I'd much prefer to pass not only the classzone from allocator
> to memory balancing, but _also_ the order of the allocation,
> and then shrink_mmap will know it isn't worth freeing anything
> that isn't contiguous on the order of the allocation that we need.

I suspect that the proper way to do this is to just make another gfp_flag,
which is basically another hint to the mm layer that we're doing a multi-
page allocation and that the MM layer should not try forever to handle it.

In fact, that's independent of whether it is a multi-page allocation or
not. It might be something like __GFP_SOFT - you could use it with single
pages too. 

Thinking about it, we do have it already. It's called !__GFP_HIGH, and it's
used by all the GFP_USER allocations.

Linus




Re: the new VMt

2000-09-25 Thread Jeff Garzik

On Mon, 25 Sep 2000, Alexander Viro wrote:
> On Mon, 25 Sep 2000, Ingo Molnar wrote:
> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?

> Stupidity has no limits...

Blame the hardware designers... and give me my big allocations. :)

Sound drivers (not mine, though) do stuff like

    order = 20; /* just a made-up high number */
    while ((order-- > 0) && (mem == NULL)) {
        mem = __get_free_pages (GFP_KERNEL, order);
    }
    /* use sound buffer 'mem' */

Older or modern, less-than-cool framegrabbers need tons of contiguous
memory too...
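
A userspace sketch of the descending-order retry loop quoted above, for
illustration only. `model_get_free_pages` and `max_order_ok` are hypothetical
stand-ins (malloc() with an artificial size limit), not the kernel's
__get_free_pages(). In the quoted loop, `order` is decremented once more by
the loop test after a successful allocation, so it no longer matches the size
obtained; this version records the order that actually succeeded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_BYTES 4096UL

/* Hypothetical limit: the largest order the model allocator will satisfy. */
int max_order_ok = 5;

/* Stand-in for an order-based page allocator: 2^order contiguous pages. */
void *model_get_free_pages(int order)
{
    if (order < 0 || order > max_order_ok)
        return NULL;
    return malloc(PAGE_BYTES << order);
}

/*
 * Try start_order first, then fall back to smaller orders.
 * On success, *got_order holds the order that was actually allocated.
 */
void *alloc_largest(int start_order, int *got_order)
{
    int order;

    assert(got_order != NULL);
    for (order = start_order; order >= 0; order--) {
        void *mem = model_get_free_pages(order);

        if (mem) {
            *got_order = order;
            return mem;
        }
    }
    return NULL;    /* not even a single page was available */
}
```

The caller has to size its buffer from `*got_order`, since the block obtained
can be much smaller than the one first requested.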

Jeff






Re: the new VMt

2000-09-25 Thread Rik van Riel

On Mon, 25 Sep 2000, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andi Kleen wrote:
> 
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> 
> yep, but these cases are not affected, i think in the order != 0
> case we should return NULL if a certain number of iterations did
> not yield any free page.

Indeed. You're right here.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 06:18:17PM +0200, Andi Kleen wrote:
> On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > > Another thing I would worry about are ports with multiple user page
> > > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> > 
> > yep, but these cases are not affected, i think in the order != 0 case we
> > should return NULL if a certain number of iterations did not yield any
> > free page.
> 
> Ok, that would just break fork()

Not sure if I have the whole context (I've not yet received Ingo's email
that you're replying to).

Currently we do a memory balancing pass independently of the order of the
allocation. Thus we don't do any iteration and the memory balancing
is completely order blind (unfortunately it's also zone blind, while
at least in 2.2.x the memory balancing knew which zone it had
to allocate memory from).

If Ingo suggested more iterations of memory balancing for those cases
that should only make things better with respect to fragmentation.

But I'd much prefer to pass not only the classzone from allocator
to memory balancing, but _also_ the order of the allocation,
and then shrink_mmap will know it isn't worth freeing anything
that isn't contiguous on the order of the allocation that we need.

classzone hasn't reached this point yet.

Andrea



Re: the new VMt

2000-09-25 Thread Rik van Riel

On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> > Progress is made, clean pages are discarded and dirty ones queued for
> 
> How can you make progress if there isn't swap available and all the
> freeable page/buffer cache has just been freed? The deadlock happens
> in OOM condition (not when we can make progress).

This is exactly why integrating the OOM killer is on
my TODO list.

The important difference between the new VM and the
old one is that we can't fail while we are not OOM,
whereas the old allocator could break down even when
we still had enough swap free

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 06:22:42PM +0200, Ingo Molnar wrote:
> yep, i agree. I'm not sure what the biggest allocation is, some drivers
> might use megabytes or contiguous RAM?

I'm not sure (we should grep all the drivers to be sure...) but I bet the old
2.2.0 MAX_ORDER #define will work for everything.

The fact is that over a certain order there's no hope anyway at runtime
and the only big allocations done through the init sequence are for
the hashtable.

Andrea



Re: the new VMt

2000-09-25 Thread Rik van Riel

On Mon, 25 Sep 2000, Alan Cox wrote:

> > > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > > swapper cannot make space you die.
> > 
> > if one can get everything jammed waiting for GFP_KERNEL, and not being
> > able to deallocate anything, thats a VM or resource-limit bug. This
> > situation is just 1% RAM away from the 'root cannot log in', situation.
> 
> Unless Im missing something here think about this case
> 
> 2 active processes, no swap
> 
> #1              #2
> kmalloc 32K     kmalloc 16K
> OK              OK
> kmalloc 16K     kmalloc 32K
> block           block
> 
> so GFP_KERNEL has to be able to fail - it can wait for I/O in
> some cases with care, but when we have no pages left something
> has to give

The trick here is to:
1) keep some reserved pages around for PF_MEMALLOC tasks
   (we need this anyway)
2) set PF_MEMALLOC on the task you're killing for OOM,
   that way this task will either get the memory or
   fail (note that PF_MEMALLOC tasks don't wait)

This way the OOM-killed task will be able to exit quickly
and the rest of the system will not get killed as a side
effect.
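
A toy model of the reserve described above, as a sketch under stated
assumptions: `struct zone_model`, `model_alloc_page` and `RESERVED_PAGES` are
hypothetical names and values, and "block" is modelled as a false return
instead of a real sleep. The point is only that ordinary callers stop at the
reserve while a task flagged like PF_MEMALLOC may dig into it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical size of the reserve kept back for PF_MEMALLOC-style tasks. */
#define RESERVED_PAGES 8

struct zone_model {
    long free_pages;
};

/*
 * Try to take one page.
 * memalloc == false: ordinary caller, must leave the reserve untouched
 *                    (a real allocator would put this caller to sleep).
 * memalloc == true:  privileged caller, may use the reserve but never waits;
 *                    it either gets the page or fails immediately.
 */
bool model_alloc_page(struct zone_model *z, bool memalloc)
{
    long floor = memalloc ? 0 : RESERVED_PAGES;

    assert(z->free_pages >= 0);
    if (z->free_pages <= floor)
        return false;
    z->free_pages--;
    return true;
}
```

With this split, a task that has been picked for OOM killing can still
allocate the few pages it needs to exit, even when every other caller is
refused.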

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: the new VMt

2000-09-25 Thread Andi Kleen

On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> 
> yep, but these cases are not affected, i think in the order != 0 case we
> should return NULL if a certain number of iterations did not yield any
> free page.

Ok, that would just break fork()

-Andi



Re: the new VMt

2000-09-25 Thread Alexander Viro



On Mon, 25 Sep 2000, Ingo Molnar wrote:

> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> 
> > > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> > 
> > You're right. That's why it's a waste to have so many orders in the
> > buddy allocator. [...]
> 
> yep, i agree. I'm not sure what the biggest allocation is, some drivers
> might use megabytes of contiguous RAM?

Stupidity has no limits...




Re: the new VMt

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> 
> You're right. That's why it's a waste to have so many orders in the
> buddy allocator. [...]

yep, i agree. I'm not sure what the biggest allocation is, some drivers
might use megabytes of contiguous RAM?

Ingo




Re: the new VMt

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andi Kleen wrote:

> An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed
> in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable 
> (=GFP_ATOMIC) 16K allocations.  

the discussion does not affect GFP_ATOMIC - GFP_ATOMIC allocators *must*
be prepared to handle occasional oom situations gracefully.

> Another thing I would worry about are ports with multiple user page
> sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> but may likely need a 16K kernel stack due to the 64bit stack bloat.

yep, but these cases are not affected, i think in the order != 0 case we
should return NULL if a certain number of iterations did not yield any
free page.
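
A rough sketch of the policy proposed here, assuming a toy allocator in place of the real page allocator. `MAX_ORDER_RETRIES`, `struct toy_allocator`, `try_once` and `alloc_pages_sketch` are hypothetical names, and the bound of 4 is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ORDER_RETRIES 4     /* hypothetical bound on retries */

/* Toy stand-in for "try the free lists, then run one reclaim pass". */
struct toy_allocator {
    unsigned attempts_needed;   /* attempt on which memory appears, 0 = never */
    unsigned attempts_made;
};

bool try_once(struct toy_allocator *a)
{
    assert(a != NULL);
    a->attempts_made++;
    return a->attempts_needed != 0 && a->attempts_made >= a->attempts_needed;
}

/*
 * order == 0: keep trying until a page shows up (never returns false,
 *             and loops forever if attempts_needed is 0).
 * order != 0: give up and report failure after a bounded number of tries.
 */
bool alloc_pages_sketch(struct toy_allocator *a, unsigned order)
{
    if (order == 0) {
        while (!try_once(a))
            ;                   /* real code would reclaim and sleep here */
        return true;
    }
    for (unsigned i = 0; i < MAX_ORDER_RETRIES; i++)
        if (try_once(a))
            return true;
    return false;               /* the caller sees NULL */
}
```

The unbounded order-0 branch is exactly the part the rest of the thread disputes: if nothing can ever be freed, it never returns.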

Ingo




Re: the new VMt

2000-09-25 Thread Andi Kleen

On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote:
> Frankly, how often do we allocate multi-order pages? I've just made quick
> statistics wrt. how allocation orders are distributed on a more or less
> typical system:
> 
>   (ALLOC ORDER)
>   0: 167081
>   1: 850
>   2: 16
>   3: 25
>   4: 0
>   5: 1
>   6: 0
>   7: 2
>   8: 13
>   9: 5
> 
> ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> task-structure. The rest is 0.05%.

An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed
in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable 
(=GFP_ATOMIC) 16K allocations.  

Another thing I would worry about are ports with multiple user page sizes in 2.5.
Another ugly case is the x86-64 port which has 4K pages but may likely need
a 16K kernel stack due to the 64bit stack bloat.
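
For reference, the sizes mentioned above map to buddy-allocator orders as follows. A minimal sketch, assuming 4K pages as stated; `order_for` is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= bytes.
 * Assumes page_size is a non-zero power of two.
 */
unsigned order_for(size_t bytes, size_t page_size)
{
    assert(page_size > 0);

    unsigned order = 0;
    size_t span = page_size;

    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}
```

With 4K pages, the 8K task structure is an order-1 allocation, and both the 16K NFS buffer and a 16K kernel stack would be order-2 allocations.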


-Andi



Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote:
> Frankly, how often do we allocate multi-order pages? I've just made quick

The deadlock Alan pointed out can happen also with single page allocation
if we in 2.4.x-current put a loop in GFP_KERNEL.

> ie. 99.45% of all allocations are single-page! 0.50% is the 8kb

You're right. That's why it's a waste to have so many orders in the
buddy allocator. Even more so now that the hashtables should be allocated
with the bootmem allocator! :) Chuck has seen the slowdown from increasing
the highest-order allocation in his bench. But of course in 2.2.x we can't
avoid that.

Andrea



Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Progress is made, clean pages are discarded and dirty ones queued for

How can you make progress if there isn't swap available and all the
freeable page/buffer cache has just been freed? The deadlock happens
in the OOM condition (not when we can make progress).

Andrea



Re: the new VMt

2000-09-25 Thread Ingo Molnar


On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

> Ingo's point is that the underlined line won't ever happen in the
> first place

please dont misinterpret my point ...

Frankly, how often do we allocate multi-order pages? I've just made quick
statistics wrt. how allocation orders are distributed on a more or less
typical system:

(ALLOC ORDER)
0: 167081
1: 850
2: 16
3: 25
4: 0
5: 1
6: 0
7: 2
8: 13
9: 5

ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
task-structure. The rest is 0.05%.
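
The percentages can be rechecked from the table with a few lines of C. This is only a sketch; `order_counts`, `total_allocs` and `share_percent` are names made up for the illustration, and the counts are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>

/* Allocation counts per order 0..9, copied from the table above. */
static const unsigned long order_counts[] = {
    167081, 850, 16, 25, 0, 1, 0, 2, 13, 5
};
#define N_ORDERS (sizeof order_counts / sizeof order_counts[0])

unsigned long total_allocs(void)
{
    unsigned long total = 0;

    for (size_t i = 0; i < N_ORDERS; i++)
        total += order_counts[i];
    return total;
}

/* Share of all allocations, in percent, with order in [lo, hi]. */
double share_percent(size_t lo, size_t hi)
{
    assert(lo <= hi && hi < N_ORDERS);

    unsigned long sum = 0;

    for (size_t i = lo; i <= hi; i++)
        sum += order_counts[i];
    return 100.0 * (double)sum / (double)total_allocs();
}
```

The table sums to 167993 allocations. `share_percent(0, 0)` comes out at roughly 99.46, `share_percent(1, 1)` at roughly 0.51 and `share_percent(2, 9)` at roughly 0.04, which is close to the figures quoted.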

i'm not talking about 4MB contiguous physical allocations having to
succeed on an 8MB box. I'm talking about 99% of the simple allocation
points not having to worry about a NULL pointer. (not checking for NULL is
one of the most common allocation-related bugs that bite low-RAM systems.)

Ingo





Re: the new VMt

2000-09-25 Thread Andrea Arcangeli

On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> Unless Im missing something here think about this case
> 
> 2 active processes, no swap
> 
> #1              #2
> kmalloc 32K     kmalloc 16K
> OK              OK
> kmalloc 16K     kmalloc 32K
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> block           block

Yep, you're not missing anything. That was my complaint: GFP_KERNEL not
failing will obviously deadlock the kernel all over the place.

Ingo's point is that the underlined line won't ever happen in the first place
because of the resource accounting that will tell the upper layer that they
can't try to allocate anything, so they won't enter kmalloc at all. But he's
obviously not talking about 2.4.x. (and I'm not sure if that's the right
way to go in the general case but certainly it's the right way to go for
special cases like skbs with gigabit ethernet)
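
The accounting idea described above, where the upper layer is refused before it ever enters kmalloc, might look roughly like this. A sketch only: `struct budget`, `budget_charge` and `budget_uncharge` are hypothetical names and not a 2.4 interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-subsystem memory budget, in KB. Invariant: used_kb <= limit_kb. */
struct budget {
    size_t limit_kb;
    size_t used_kb;
};

/*
 * Called by the upper layer before it allocates. A false return means
 * "do not call the allocator at all", so the caller never blocks inside it.
 */
bool budget_charge(struct budget *b, size_t kb)
{
    assert(b != NULL && b->used_kb <= b->limit_kb);

    if (kb > b->limit_kb - b->used_kb)
        return false;
    b->used_kb += kb;
    return true;
}

/* Called after the memory has been freed again. */
void budget_uncharge(struct budget *b, size_t kb)
{
    assert(b != NULL && kb <= b->used_kb);
    b->used_kb -= kb;
}
```

With a 48K budget, a first 32K charge succeeds and a second 32K charge is refused up front, instead of the caller sleeping in the allocator.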

In 2.4.x GFP_KERNEL not failing is a deadlock as you said.

Andrea



Re: the new VMt

2000-09-25 Thread Alan Cox

> > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > swapper cannot make space you die.
> 
> if one can get everything jammed waiting for GFP_KERNEL, and not being
> able to deallocate anything, thats a VM or resource-limit bug. This
> situation is just 1% RAM away from the 'root cannot log in', situation.

Unless I'm missing something here, think about this case

2 active processes, no swap

#1  #2
kmalloc 32K kmalloc 16K
OK  OK
kmalloc 16K kmalloc 32K
block   block

so GFP_KERNEL has to be able to fail - it can wait for I/O in some cases with
care, but when we have no pages left something has to give
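
The case above can be replayed as a toy model. This is a minimal sketch, not kernel code: `struct pool`, `pool_take`, `pool_give` and `replay` are hypothetical names, and a 48K pool is an assumption chosen so that the first two allocations use up all memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fixed pool of memory in KB: no swap, nothing reclaimable. */
struct pool { size_t free_kb; };

bool pool_take(struct pool *p, size_t kb)
{
    assert(p != NULL);
    if (kb > p->free_kb)
        return false;
    p->free_kb -= kb;
    return true;
}

void pool_give(struct pool *p, size_t kb)
{
    assert(p != NULL);
    p->free_kb += kb;
}

/*
 * Replay the two-process case on a pool of total_kb.
 * Returns how many processes finish: 0 means both are stuck,
 * -1 means the pool is too small for the first two allocations.
 */
int replay(size_t total_kb, bool alloc_may_fail)
{
    struct pool p = { .free_kb = total_kb };
    int finished = 0;

    /* First round: #1 takes 32K, #2 takes 16K. */
    if (!pool_take(&p, 32) || !pool_take(&p, 16))
        return -1;

    /* Second round, #1 asks for 16K more. */
    if (pool_take(&p, 16)) {
        finished++;
        pool_give(&p, 32 + 16);     /* #1 is done and frees everything */
    } else if (alloc_may_fail) {
        pool_give(&p, 32);          /* #1 gets NULL and backs out */
    }                               /* else #1 sleeps, still holding 32K */

    /* Second round, #2 asks for 32K more. */
    if (pool_take(&p, 32)) {
        finished++;
        pool_give(&p, 16 + 32);
    } else if (alloc_may_fail) {
        pool_give(&p, 16);
    }                               /* else #2 sleeps, still holding 16K */

    return finished;
}
```

In this model `replay(48, false)` returns 0, since both processes block while holding memory the other needs, and `replay(48, true)` returns 1, since #1 fails, releases its 32K, and #2 completes.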




