Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-28 Thread Ethan Solomita

Nick Piggin wrote:

Eric W. Biederman wrote:

First touch page ownership does not give me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that application isn't running,
my application will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

(Unless by deterministic you mean that it never enters a non-trivial
syscall, in which case you just want to know the maximum RSS of the
process, which we already account.)


   If we used Beancounters as Pavel and Kirill mentioned, they would 
keep track of each container that has referenced a page, not just the 
first container. It sounds like beancounters can return a usage count 
where each page's charge is divided by the number of referencing 
containers (e.g. 1/3rd if 3 containers share a page). Presumably it 
could also return a full count of 1 to each container.


   If we look at data in the latter form, i.e. each container must pay 
fully for each page used, then Eric could use that to determine the real 
usage needs of the container. However, we could also use the fractional 
count to do things such as charging the container for its actual usage, 
i.e. full count for setting guarantees, fractional count for actual 
usage.

   -- Ethan


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Balbir Singh

Nick Piggin wrote:

Kirill Korotaev wrote:


The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.



User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner 
container, who has charged it, is the easiest/cleanest way to handle
all the problems with dynamic context change, races, etc.
and 1 pointer in page struct is just 0.1% overhead.


The pointer in struct page approach is a decent one, which I have
liked since this whole container effort came up. IIRC Linus and Alan
also thought that was a reasonable way to go.

I haven't reviewed the rest of the beancounters patch since looking
at it quite a few months ago... I probably don't have time for a
good review at the moment, but I should eventually.



This patch is not really beancounters.

1. It uses the containers framework
2. It is similar to my RSS controller (http://lkml.org/lkml/2007/2/26/8)

I would say that beancounters are changing and evolving.


Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.



big PAE doesn't make any difference IMHO
(as long as struct pages are not created for non-present physical memory 
areas)


The issue is just that struct pages use low memory, which is a really
scarce commodity on PAE. One more pointer in the struct page means
64MB less lowmem.

But PAE is crap anyway. We've already made enough concessions in the
kernel to support it. I agree: struct page overhead is not really
significant. The benefits of simplicity seem to outweigh the downside.


But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.



But it lacks some other features:
1. page can't be shared easily with another container


I think they could be shared. You allocate _new_ pages from your own
node, but you can definitely use existing pages allocated to other
nodes.


2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it


Yes there would be some accounting differences. I think it is hard
to say exactly what containers are "using" what page anyway, though.
What do you say about unmapped pages? Kernel allocations? etc.


3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.


?
page_to_nid(page) ~= container that owns it.


4. I guess container destroy requires destroy of the memory zone,
   which means writing out dirty data. That doesn't sound
   good to me either.


I haven't looked at any implementation, but I think it is fine for
the zone to stay around.


5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.


I don't understand why? You can much more easily target a specific
container for reclaim with this approach than with others (because
you have an lru per container).



Yes, but we break the global LRU. With these RSS patches, reclaim not
triggered by containers still uses the global LRU; by using nodes,
we would lose the global LRU.


6. You cannot overcommit. AFAIU, the memory should be granted
   to the node for exclusive usage and cannot be used by another
   container, even if it is unused. This is not an option for us.


I'm not sure about that. If you have a larger number of nodes, then
you could assign more free nodes to a container on demand. But I
think there would definitely be less flexibility with nodes...

I don't know... and seeing as I don't really know where the google
guys are going with it, I won't misrepresent their work any further ;)



Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?



hope we'll have it soon :)


Good luck ;)



I think we have made some forward progress on the consensus.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Nick Piggin

Kirill Korotaev wrote:


The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.



User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner container,
who has charged it, is the easiest/cleanest way to handle
all the problems with dynamic context change, races, etc.
and 1 pointer in page struct is just 0.1% overhead.


The pointer in struct page approach is a decent one, which I have
liked since this whole container effort came up. IIRC Linus and Alan
also thought that was a reasonable way to go.

I haven't reviewed the rest of the beancounters patch since looking
at it quite a few months ago... I probably don't have time for a
good review at the moment, but I should eventually.


Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.



big PAE doesn't make any difference IMHO
(as long as struct pages are not created for non-present physical memory areas)


The issue is just that struct pages use low memory, which is a really
scarce commodity on PAE. One more pointer in the struct page means
64MB less lowmem (with PAE's 64GB maximum there are 16M 4K pages,
and 16M extra 4-byte pointers is 64MB).

But PAE is crap anyway. We've already made enough concessions in the
kernel to support it. I agree: struct page overhead is not really
significant. The benefits of simplicity seem to outweigh the downside.


But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.



But it lacks some other features:
1. page can't be shared easily with another container


I think they could be shared. You allocate _new_ pages from your own
node, but you can definitely use existing pages allocated to other
nodes.


2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it


Yes there would be some accounting differences. I think it is hard
to say exactly what containers are "using" what page anyway, though.
What do you say about unmapped pages? Kernel allocations? etc.


3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.


?
page_to_nid(page) ~= container that owns it.
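
For illustration, the lookup being described is essentially the
following (container_from_nid() is hypothetical; page_to_nid() is the
kernel's real page-to-node accessor):

	/* With one memory node per container, ownership falls out of
	 * the existing page-to-node mapping - no extra field needed. */
	static struct container *page_container(struct page *page)
	{
		return container_from_nid(page_to_nid(page));
	}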


4. I guess container destroy requires destroy of the memory zone,
   which means writing out dirty data. That doesn't sound
   good to me either.


I haven't looked at any implementation, but I think it is fine for
the zone to stay around.


5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.


I don't understand why? You can much more easily target a specific
container for reclaim with this approach than with others (because
you have an lru per container).
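
A rough sketch of what such targeted reclaim could look like (names
are hypothetical; the real reclaim path is built around zones):

	struct container {
		struct list_head lru;	/* per-container page LRU */
	};

	static void container_shrink(struct container *cont, unsigned long nr)
	{
		struct page *page, *next;

		/* walk only this container's LRU, not a global one */
		list_for_each_entry_safe(page, next, &cont->lru, lru) {
			if (reclaim_page(page) && --nr == 0)	/* reclaim_page() is hypothetical */
				break;
		}
	}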


6. You cannot overcommit. AFAIU, the memory should be granted
   to the node for exclusive usage and cannot be used by another
   container, even if it is unused. This is not an option for us.


I'm not sure about that. If you have a larger number of nodes, then
you could assign more free nodes to a container on demand. But I
think there would definitely be less flexibility with nodes...

I don't know... and seeing as I don't really know where the google
guys are going with it, I won't misrepresent their work any further ;)



Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?



hope we'll have it soon :)


Good luck ;)

--
SUSE Labs, Novell Inc.


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Kirill Korotaev
Nick,

>>Accounting becomes easy if we have a container pointer in struct page.
>> This can form base ground for building controllers since any memory
>>related controller would be interested in tracking pages.  However we
>>still want to evaluate if we can build them without bloating the
>>struct page.  Pagecache controller (2) we can implement with container
>>pointer in struct page or container pointer in struct address space.
> 
> 
> The thing is, you have to worry about actually getting anything in the
> kernel rather than trying to do fancy stuff.
> 
> The approaches I have seen that don't have a struct page pointer, do
> intrusive things like try to put hooks everywhere throughout the kernel
> where a userspace task can cause an allocation (and of course end up
> missing many, so they aren't secure anyway)... and basically just
> nasty stuff that will never get merged.

User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner container,
who has charged it, is the easiest/cleanest way to handle
all the problems with dynamic context change, races, etc.
and 1 pointer in page struct is just 0.1% overhead.
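
Schematically, the owner-pointer scheme looks like this (names are
illustrative; the real user beancounters code differs in detail):

	/* Each charged page remembers who paid for it, so uncharging
	 * finds the right owner even if the task changed context. */
	struct page_beancounter {
		struct beancounter *owner;	/* container charged for the page */
	};

	static int bc_charge_page(struct page_beancounter *pb,
				  struct beancounter *bc)
	{
		if (bc_try_charge(bc, PAGE_SIZE))	/* hypothetical: 0 on success */
			return -ENOMEM;
		pb->owner = bc;				/* remember who paid */
		return 0;
	}

	static void bc_uncharge_page(struct page_beancounter *pb)
	{
		bc_uncharge(pb->owner, PAGE_SIZE);	/* hypothetical helper */
		pb->owner = NULL;
	}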

> Struct page overhead really isn't bad. Sure, nobody who doesn't use
> containers will want to turn it on, but unless you're using a big PAE
> system you're actually unlikely to notice.

big PAE doesn't make any difference IMHO
(as long as struct pages are not created for non-present physical memory areas)

> But again, I'll say the node-container approach of course does avoid
> this nicely (because we already can get the node from the page). So
> definitely that approach needs to be discredited before going with this
> one.

But it lacks some other features:
1. page can't be shared easily with another container
2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it
3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.
4. I guess container destroy requires destroy of the memory zone,
   which means writing out dirty data. That doesn't sound
   good to me either.
5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.
6. You cannot overcommit. AFAIU, the memory should be granted
   to the node for exclusive usage and cannot be used by another
   container, even if it is unused. This is not an option for us.

>>Building on this patchset is much simpler and we hope the bloat in
>>struct page will be compensated by the benefits in memory controllers
>>in terms of performance and simplicity.
>>
>>Adding too many controllers and accounting parameters to start with
>>will make the patch too big and complex.  As Balbir mentioned, we have
>>a plan and we shall add new control parameters in stages.
> 
> Everyone seems to have a plan ;) I don't read the containers list...
> does everyone still have *different* plans, or is any sort of consensus
> being reached?

hope we'll have it soon :)

Thanks,
Kirill



Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Pavel Emelianov
Cedric Le Goater wrote:
>> --- linux-2.6.20.orig/mm/migrate.c   2007-02-04 21:44:54.0 +0300
>> +++ linux-2.6.20-0/mm/migrate.c  2007-03-06 13:33:28.0 +0300
>> @@ -134,6 +134,7 @@ static void remove_migration_pte(struct 
>>  pte_t *ptep, pte;
>>  spinlock_t *ptl;
>>  unsigned long addr = page_address_in_vma(new, vma);
>> +struct page_container *pcont;
>>
>>  if (addr == -EFAULT)
>>  return;
>> @@ -157,6 +158,11 @@ static void remove_migration_pte(struct 
>>  return;
>>  }
>>
>> +if (container_rss_prepare(new, vma, &pcont)) {
>> +pte_unmap(ptep);
>> +return;
>> +}
>> +
>>  ptl = pte_lockptr(mm, pmd);
>>  spin_lock(ptl);
>>  pte = *ptep;
>> @@ -175,16 +181,19 @@ static void remove_migration_pte(struct 
>>  set_pte_at(mm, addr, ptep, pte);
>>
>>  if (PageAnon(new))
>> -page_add_anon_rmap(new, vma, addr);
>> +page_add_anon_rmap(new, vma, addr, pcont);
>>  else
>> -page_add_file_rmap(new);
>> +page_add_file_rmap(new, pcont);
>>
>>  /* No need to invalidate - it was non-present before */
>>  update_mmu_cache(vma, addr, pte);
>>  lazy_mmu_prot_update(pte);
>> +pte_unmap_unlock(ptep, ptl);
>> +return;
>>
>>  out:
>>  pte_unmap_unlock(ptep, ptl);
>> +container_rss_release(pcont);
>>  }
>>
>>  /*
> 
> you missed out an include in mm/migrate.c
> 
> cheers,

Thanks! :)

> C.
> Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]>
> ---
>  mm/migrate.c |1 +
>  1 file changed, 1 insertion(+)
> 
> Index: 2.6.20/mm/migrate.c
> ===================================================================
> --- 2.6.20.orig/mm/migrate.c
> +++ 2.6.20/mm/migrate.c
> @@ -28,6 +28,7 @@
>  #include <linux/mempolicy.h>
>  #include <linux/vmalloc.h>
>  #include <linux/security.h>
> +#include <linux/rss_container.h>
>  
>  #include "internal.h"
> 



Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Cedric Le Goater
> --- linux-2.6.20.orig/mm/migrate.c2007-02-04 21:44:54.0 +0300
> +++ linux-2.6.20-0/mm/migrate.c   2007-03-06 13:33:28.0 +0300
> @@ -134,6 +134,7 @@ static void remove_migration_pte(struct 
>   pte_t *ptep, pte;
>   spinlock_t *ptl;
>   unsigned long addr = page_address_in_vma(new, vma);
> + struct page_container *pcont;
> 
>   if (addr == -EFAULT)
>   return;
> @@ -157,6 +158,11 @@ static void remove_migration_pte(struct 
>   return;
>   }
> 
> + if (container_rss_prepare(new, vma, &pcont)) {
> + pte_unmap(ptep);
> + return;
> + }
> +
>   ptl = pte_lockptr(mm, pmd);
>   spin_lock(ptl);
>   pte = *ptep;
> @@ -175,16 +181,19 @@ static void remove_migration_pte(struct 
>   set_pte_at(mm, addr, ptep, pte);
> 
>   if (PageAnon(new))
> - page_add_anon_rmap(new, vma, addr);
> + page_add_anon_rmap(new, vma, addr, pcont);
>   else
> - page_add_file_rmap(new);
> + page_add_file_rmap(new, pcont);
> 
>   /* No need to invalidate - it was non-present before */
>   update_mmu_cache(vma, addr, pte);
>   lazy_mmu_prot_update(pte);
> + pte_unmap_unlock(ptep, ptl);
> + return;
> 
>  out:
>   pte_unmap_unlock(ptep, ptl);
> + container_rss_release(pcont);
>  }
> 
>  /*

you missed out an include in mm/migrate.c

cheers,

C.
Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]>
---
 mm/migrate.c |1 +
 1 file changed, 1 insertion(+)

Index: 2.6.20/mm/migrate.c
===================================================================
--- 2.6.20.orig/mm/migrate.c
+++ 2.6.20/mm/migrate.c
@@ -28,6 +28,7 @@
 #include <linux/mempolicy.h>
 #include <linux/vmalloc.h>
 #include <linux/security.h>
+#include <linux/rss_container.h>
 
 #include "internal.h"



Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Vaidyanathan Srinivasan


Nick Piggin wrote:
> Vaidyanathan Srinivasan wrote:
> 
>> Accounting becomes easy if we have a container pointer in struct page.
>>  This can form base ground for building controllers since any memory
>> related controller would be interested in tracking pages.  However we
>> still want to evaluate if we can build them without bloating the
>> struct page.  Pagecache controller (2) we can implement with container
>> pointer in struct page or container pointer in struct address space.
> 
> The thing is, you have to worry about actually getting anything in the
> kernel rather than trying to do fancy stuff.
> 
> The approaches I have seen that don't have a struct page pointer, do
> intrusive things like try to put hooks everywhere throughout the kernel
> where a userspace task can cause an allocation (and of course end up
> missing many, so they aren't secure anyway)... and basically just
> nasty stuff that will never get merged.
> 
> Struct page overhead really isn't bad. Sure, nobody who doesn't use
> containers will want to turn it on, but unless you're using a big PAE
> system you're actually unlikely to notice.
> 
> But again, I'll say the node-container approach of course does avoid
> this nicely (because we already can get the node from the page). So
> definitely that approach needs to be discredited before going with this
> one.

I agree :)

>> Building on this patchset is much simpler and we hope the bloat in
>> struct page will be compensated by the benefits in memory controllers
>> in terms of performance and simplicity.
>>
>> Adding too many controllers and accounting parameters to start with
>> will make the patch too big and complex.  As Balbir mentioned, we have
>> a plan and we shall add new control parameters in stages.
> 
> Everyone seems to have a plan ;) I don't read the containers list...
> does everyone still have *different* plans, or is any sort of consensus
> being reached?

Consensus?  I believe at this point we have a sort of consensus on the
base container infrastructure and the need for a memory controller to
control RSS, pagecache, mlock, kernel memory etc.  However the
implementation and approach taken are still being discussed :)

--Vaidy



Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Nick Piggin

Vaidyanathan Srinivasan wrote:


Accounting becomes easy if we have a container pointer in struct page.
 This can form base ground for building controllers since any memory
related controller would be interested in tracking pages.  However we
still want to evaluate if we can build them without bloating the
struct page.  Pagecache controller (2) we can implement with container
pointer in struct page or container pointer in struct address space.


The thing is, you have to worry about actually getting anything in the
kernel rather than trying to do fancy stuff.

The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.

Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.

But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.


Building on this patchset is much simpler and we hope the bloat in
struct page will be compensated by the benefits in memory controllers
in terms of performance and simplicity.

Adding too many controllers and accounting parameters to start with
will make the patch too big and complex.  As Balbir mentioned, we have
a plan and we shall add new control parameters in stages.


Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?

--
SUSE Labs, Novell Inc.


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Vaidyanathan Srinivasan


Balbir Singh wrote:
> Nick Piggin wrote:
>> Balbir Singh wrote:
>>> Nick Piggin wrote:
 And strangely, this example does not go outside the parameters of
 what you asked for AFAIKS. In the worst case of one container getting
 _all_ the shared pages, they will still remain inside their maximum
 rss limit.

>>> When that does happen and if a container hits it limit, with a LRU
>>> per-container, if the container is not actually using those pages,
>>> they'll get thrown out of that container and get mapped into the
>>> container that is using those pages most frequently.
>> Exactly. Statistically, first touch will work OK. It may mean some
>> reclaim inefficiencies in corner cases, but things will tend to
>> even out.
>>
> 
> Exactly!
> 
 So they might get penalised a bit on reclaim, but maximum rss limits
 will work fine, and you can (almost) guarantee X amount of memory for
 a given container, and it will _work_.

 But I also take back my comments about this being the only design I
 have seen that gets everything, because the node-per-container idea
 is a really good one on the surface. And it could mean even less impact
 on the core VM than this patch. That is also a first-touch scheme.

>>> With the proposed node-per-container, we will need to make massive core
>>> VM changes to reorganize zones and nodes. We would want to allow
>>>
>>> 1. For sharing of nodes
>>> 2. Resizing nodes
>>> 3. May be more
>> But a lot of that is happening anyway for other reasons (eg. memory
>> plug/unplug). And I don't consider node/zone setup to be part of the
>> "core VM" as such... it is _good_ if we can move extra work into setup
>> rather than have it in the mm.
>>
>> That said, I don't think this patch is terribly intrusive either.
>>
> 
> Thanks, that's one of our goals, to keep it simple, understandable and
> non-intrusive.
> 
>>> With the node-per-container idea, it will be hard to control page cache
>>> limits, independent of RSS limits or mlock limits.
>>>
>>> NOTE: page cache == unmapped page cache here.
>> I don't know that it would be particularly harder than any other
>> first-touch scheme. If one container ends up being charged with too
>> much pagecache, eventually they'll reclaim a bit of it and the pages
>> will get charged to more frequent users.
>>
>>
> 
> Yes, true, but what if a user does not want to control the page
> cache usage in a particular container or wants to turn off
> RSS control.
> 
> However the messed up accounting that doesn't handle sharing between
> groups of processes properly really bugs me.  Especially when we have
> the infrastructure to do it right.
>
> Does that make more sense?

 I think it is simplistic.

 Sure you could probably use some of the rmap stuff to account shared
 mapped _user_ pages once for each container that touches them. And
 this patchset isn't preventing that.

 But how do you account kernel allocations? How do you account unmapped
 pagecache?

 What's the big deal so many accounting people have with just RSS? I'm
 not a container person, this is an honest question. Because from my
 POV if you conveniently ignore everything else... you may as well just
 not do any accounting at all.

>>> We decided to implement accounting and control in phases
>>>
>>> 1. RSS control
>>> 2. unmapped page cache control
>>> 3. mlock control
>>> 4. Kernel accounting and limits
>>>
>>> This has several advantages
>>>
>>> 1. The limits can be individually set and controlled.
>>> 2. The code is broken down into simpler chunks for review and merging.
>> But this patch gives the groundwork to handle 1-4, and it is in a small
>> chunk, and one would be able to apply different limits to different types
>> of pages with it. Just using rmap to handle 1 does not really seem like a
>> viable alternative because it fundamentally isn't going to handle 2 or 4.
>>
> 
> For (2), we have the basic setup in the form of a per-container LRU list
> and a pointer from struct page to the container that first brought in
> the page.
> 
>> I'm not saying that you couldn't _later_ add something that uses rmap or
>> our current RSS accounting to tweak container-RSS semantics. But isn't it
>> sensible to lay the groundwork first? Get a clear path to something that
>> is good (not perfect), but *works*?
>>
> 
> I agree with your development model suggestion. One of the things we are going 
> to do in the near future is to build (2) and then add (3) and (4). So far,
> we've not encountered any difficulties on building on top of (1).
> 
> Vaidy, any comments?

Accounting becomes easy if we have a container pointer in struct page.
 This can form base ground for building controllers since any memory
related controller would be interested in tracking pages.  However we
still want to evaluate if we can build them without bloating the
struct page.  Pagecache controller (2) we can implement with container
pointer in struct page or container pointer in struct address space.

Building on this patchset is much simpler and we hope the bloat in
struct page will be compensated by the benefits in memory controllers
in terms of performance and simplicity.

Adding too many controllers and accounting parameters to start with
will make the patch too big and complex.  As Balbir mentioned, we have
a plan and we shall add new control parameters in stages.

Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Balbir Singh

Nick Piggin wrote:

Balbir Singh wrote:

Nick Piggin wrote:



And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.



When that does happen and if a container hits it limit, with a LRU
per-container, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the
container that is using those pages most frequently.


Exactly. Statistically, first touch will work OK. It may mean some
reclaim inefficiencies in corner cases, but things will tend to
even out.



Exactly!


So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. For sharing of nodes
2. Resizing nodes
3. May be more


But a lot of that is happening anyway for other reasons (eg. memory
plug/unplug). And I don't consider node/zone setup to be part of the
"core VM" as such... it is _good_ if we can move extra work into setup
rather than have it in the mm.

That said, I don't think this patch is terribly intrusive either.



Thanks, that's one of our goals, to keep it simple, understandable and
non-intrusive.




With the node-per-container idea, it will be hard to control page cache
limits, independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.


I don't know that it would be particularly harder than any other
first-touch scheme. If one container ends up being charged with too
much pagecache, eventually they'll reclaim a bit of it and the pages
will get charged to more frequent users.




Yes, true, but what if a user does not want to control the page
cache usage in a particular container or wants to turn off
RSS control.


However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?



I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.



We decided to implement accounting and control in phases

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.


But this patch gives the groundwork to handle 1-4, and it is in a small
chunk, and one would be able to apply different limits to different types
of pages with it. Just using rmap to handle 1 does not really seem like a
viable alternative because it fundamentally isn't going to handle 2 or 4.



For (2), we have the basic setup in the form of a per-container LRU list
and a pointer from struct page to the container that first brought in
the page.
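
Concretely, that setup might look like the following (field names are
illustrative, not the actual patch):

	struct rss_container {
		struct list_head page_list;	/* per-container LRU        */
		unsigned long rss;		/* pages charged            */
		unsigned long limit;		/* reclaim above this point */
	};

	struct page_container {
		struct page *page;		/* the charged page            */
		struct rss_container *cont;	/* who first brought it in     */
		struct list_head list;		/* entry on cont->page_list    */
	};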


I'm not saying that you couldn't _later_ add something that uses rmap or
our current RSS accounting to tweak container-RSS semantics. But isn't it
sensible to lay the groundwork first? Get a clear path to something that
is good (not perfect), but *works*?



I agree with your development model suggestion. One of the things we are going 
to do in the near future is to build (2) and then add (3) and (4). So far,

we've not encountered any difficulties on building on top of (1).

Vaidy, any comments?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Balbir Singh

Nick Piggin wrote:

Balbir Singh wrote:

Nick Piggin wrote:



And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.



When that does happen and if a container hits it limit, with a LRU
per-container, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the
container that is using those pages most frequently.


Exactly. Statistically, first touch will work OK. It may mean some
reclaim inefficiencies in corner cases, but things will tend to
even out.



Exactly!


So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. For sharing of nodes
2. Resizing nodes
3. May be more


But a lot of that is happening anyway for other reasons (eg. memory
plug/unplug). And I don't consider node/zone setup to be part of the
core VM as such... it is _good_ if we can move extra work into setup
rather than have it in the mm.

That said, I don't think this patch is terribly intrusive either.



Thanks, thats one of our goals, to keep it simple, understandable and
non-intrusive.




With the node-per-container idea, it will hard to control page cache
limits, independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.


I don't know that it would be particularly harder than any other
first-touch scheme. If one container ends up being charged with too
much pagecache, eventually they'll reclaim a bit of it and the pages
will get charged to more frequent users.




Yes, true, but what if a user does not want to control the page
cache usage in a particular container or wants to turn off
RSS control.


However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?



I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.



We decided to implement accounting and control in phases

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.


But this patch gives the groundwork to handle 1-4, and it is in a small
chunk, and one would be able to apply different limits to different types
of pages with it. Just using rmap to handle 1 does not really seem like a
viable alternative because it fundamentally isn't going to handle 2 or 4.



For (2), we have the basic setup in the form of a per-container LRU list
and a pointer from struct page to the container that first brought in
the page.


I'm not saying that you couldn't _later_ add something that uses rmap or
our current RSS accounting to tweak container-RSS semantics. But isn't it
sensible to lay the groundwork first? Get a clear path to something that
is good (not perfect), but *works*?



I agree with your development model suggestion. One of things we are going 
to do in the near future is to build (2) and then add (3) and (4). So far,

we've not encountered any difficulties on building on top of (1).

Vaidy, any comments?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Vaidyanathan Srinivasan


Balbir Singh wrote:
 Nick Piggin wrote:
 Balbir Singh wrote:
 Nick Piggin wrote:
 And strangely, this example does not go outside the parameters of
 what you asked for AFAIKS. In the worst case of one container getting
 _all_ the shared pages, they will still remain inside their maximum
 rss limit.

 When that does happen and if a container hits it limit, with a LRU
 per-container, if the container is not actually using those pages,
 they'll get thrown out of that container and get mapped into the
 container that is using those pages most frequently.
 Exactly. Statistically, first touch will work OK. It may mean some
 reclaim inefficiencies in corner cases, but things will tend to
 even out.

 
 Exactly!
 
 So they might get penalised a bit on reclaim, but maximum rss limits
 will work fine, and you can (almost) guarantee X amount of memory for
 a given container, and it will _work_.

 But I also take back my comments about this being the only design I
 have seen that gets everything, because the node-per-container idea
 is a really good one on the surface. And it could mean even less impact
 on the core VM than this patch. That is also a first-touch scheme.

 With the proposed node-per-container, we will need to make massive core
 VM changes to reorganize zones and nodes. We would want to allow

 1. For sharing of nodes
 2. Resizing nodes
 3. May be more
 But a lot of that is happening anyway for other reasons (eg. memory
 plug/unplug). And I don't consider node/zone setup to be part of the
 core VM as such... it is _good_ if we can move extra work into setup
 rather than have it in the mm.

 That said, I don't think this patch is terribly intrusive either.

 
 Thanks, thats one of our goals, to keep it simple, understandable and
 non-intrusive.
 
 With the node-per-container idea, it will hard to control page cache
 limits, independent of RSS limits or mlock limits.

 NOTE: page cache == unmapped page cache here.
 I don't know that it would be particularly harder than any other
 first-touch scheme. If one container ends up being charged with too
 much pagecache, eventually they'll reclaim a bit of it and the pages
 will get charged to more frequent users.


 
 Yes, true, but what if a user does not want to control the page
 cache usage in a particular container or wants to turn off
 RSS control.
 
 However the messed up accounting that doesn't handle sharing between
 groups of processes properly really bugs me.  Especially when we have
 the infrastructure to do it right.

 Does that make more sense?

 I think it is simplistic.

 Sure you could probably use some of the rmap stuff to account shared
 mapped _user_ pages once for each container that touches them. And
 this patchset isn't preventing that.

 But how do you account kernel allocations? How do you account unmapped
 pagecache?

 What's the big deal so many accounting people have with just RSS? I'm
 not a container person, this is an honest question. Because from my
 POV if you conveniently ignore everything else... you may as well just
 not do any accounting at all.

 We decided to implement accounting and control in phases

 1. RSS control
 2. unmapped page cache control
 3. mlock control
 4. Kernel accounting and limits

 This has several advantages

 1. The limits can be individually set and controlled.
 2. The code is broken down into simpler chunks for review and merging.
 But this patch gives the groundwork to handle 1-4, and it is in a small
 chunk, and one would be able to apply different limits to different types
 of pages with it. Just using rmap to handle 1 does not really seem like a
 viable alternative because it fundamentally isn't going to handle 2 or 4.

 
 For (2), we have the basic setup in the form of a per-container LRU list
 and a pointer from struct page to the container that first brought in
 the page.
 
 I'm not saying that you couldn't _later_ add something that uses rmap or
 our current RSS accounting to tweak container-RSS semantics. But isn't it
 sensible to lay the groundwork first? Get a clear path to something that
 is good (not perfect), but *works*?

 
 I agree with your development model suggestion. One of things we are going 
 to do in the near future is to build (2) and then add (3) and (4). So far,
 we've not encountered any difficulties on building on top of (1).
 
 Vaidy, any comments?

Accounting becomes easy if we have a container pointer in struct page.
 This can form base ground for building controllers since any memory
related controller would be interested in tracking pages.  However we
still want to evaluate if we can build them without bloating the
struct page.  Pagecache controller (2) we can implement with container
pointer in struct page or container pointer in struct address space.

Building on this patchset is much simple and and we hope the bloat in
struct page will be compensated by the benefits in memory controllers
in terms of performance and simplicity.

Adding too many controllers and 

Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Nick Piggin

Vaidyanathan Srinivasan wrote:


Accounting becomes easy if we have a container pointer in struct page.
 This can form base ground for building controllers since any memory
related controller would be interested in tracking pages.  However we
still want to evaluate if we can build them without bloating the
struct page.  Pagecache controller (2) we can implement with container
pointer in struct page or container pointer in struct address space.


The thing is, you have to worry about actually getting anything in the
kernel rather than trying to do fancy stuff.

The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.

Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.

But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.


Building on this patchset is much simple and and we hope the bloat in
struct page will be compensated by the benefits in memory controllers
in terms of performance and simplicity.

Adding too many controllers and accounting parameters to start with
will make the patch too big and complex.  As Balbir mentioned, we have
a plan and we shall add new control parameters in stages.


Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Vaidyanathan Srinivasan


Nick Piggin wrote:
 Vaidyanathan Srinivasan wrote:
 
 Accounting becomes easy if we have a container pointer in struct page.
  This can form base ground for building controllers since any memory
 related controller would be interested in tracking pages.  However we
 still want to evaluate if we can build them without bloating the
 struct page.  Pagecache controller (2) we can implement with container
 pointer in struct page or container pointer in struct address space.
 
 The thing is, you have to worry about actually getting anything in the
 kernel rather than trying to do fancy stuff.
 
 The approaches I have seen that don't have a struct page pointer, do
 intrusive things like try to put hooks everywhere throughout the kernel
 where a userspace task can cause an allocation (and of course end up
 missing many, so they aren't secure anyway)... and basically just
 nasty stuff that will never get merged.
 
 Struct page overhead really isn't bad. Sure, nobody who doesn't use
 containers will want to turn it on, but unless you're using a big PAE
 system you're actually unlikely to notice.
 
 But again, I'll say the node-container approach of course does avoid
 this nicely (because we already can get the node from the page). So
 definitely that approach needs to be discredited before going with this
 one.

I agree :)

 Building on this patchset is much simple and and we hope the bloat in
 struct page will be compensated by the benefits in memory controllers
 in terms of performance and simplicity.

 Adding too many controllers and accounting parameters to start with
 will make the patch too big and complex.  As Balbir mentioned, we have
 a plan and we shall add new control parameters in stages.
 
 Everyone seems to have a plan ;) I don't read the containers list...
 does everyone still have *different* plans, or is any sort of consensus
 being reached?

Consensus?  I believe at this point we have a sort of consensus on the
base container infrastructure and the need for memory controller to
control RSS, pagecache, mlock, kernel memory etc.  However the
implementation and approach taken is still being discussed :)

--Vaidy

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Kirill Korotaev
Nick,

Accounting becomes easy if we have a container pointer in struct page.
 This can form base ground for building controllers since any memory
related controller would be interested in tracking pages.  However we
still want to evaluate if we can build them without bloating the
struct page.  Pagecache controller (2) we can implement with container
pointer in struct page or container pointer in struct address space.
 
 
 The thing is, you have to worry about actually getting anything in the
 kernel rather than trying to do fancy stuff.
 
 The approaches I have seen that don't have a struct page pointer, do
 intrusive things like try to put hooks everywhere throughout the kernel
 where a userspace task can cause an allocation (and of course end up
 missing many, so they aren't secure anyway)... and basically just
 nasty stuff that will never get merged.

User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner container,
who has charged it - is the most easy/clean way to handle
all the problems with dynamic context change, races, etc.
and 1 pointer in page struct is just 0.1% overehad.

 Struct page overhead really isn't bad. Sure, nobody who doesn't use
 containers will want to turn it on, but unless you're using a big PAE
 system you're actually unlikely to notice.

big PAE doesn't make any difference IMHO
(until struct pages are not created for non-present physical memory areas)

 But again, I'll say the node-container approach of course does avoid
 this nicely (because we already can get the node from the page). So
 definitely that approach needs to be discredited before going with this
 one.

But it lacks some other features:
1. page can't be shared easily with another container
2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it
3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.
4. I guess container destroy requires destroy of memory zone,
   which means write out of dirty data. Which doesn't sound
   good for me as well.
5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.
6. You cannot overcommit. AFAIU, the memory should be granted
   to node exclusive usage and cannot be used by by another containers,
   even if it is unused. This is not an option for us.

Building on this patchset is much simple and and we hope the bloat in
struct page will be compensated by the benefits in memory controllers
in terms of performance and simplicity.

Adding too many controllers and accounting parameters to start with
will make the patch too big and complex.  As Balbir mentioned, we have
a plan and we shall add new control parameters in stages.
 
 Everyone seems to have a plan ;) I don't read the containers list...
 does everyone still have *different* plans, or is any sort of consensus
 being reached?

hope we'll have it soon :)

Thanks,
Kirill

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Cedric Le Goater
 --- linux-2.6.20.orig/mm/migrate.c2007-02-04 21:44:54.0 +0300
 +++ linux-2.6.20-0/mm/migrate.c   2007-03-06 13:33:28.0 +0300
 @@ -134,6 +134,7 @@ static void remove_migration_pte(struct 
   pte_t *ptep, pte;
   spinlock_t *ptl;
   unsigned long addr = page_address_in_vma(new, vma);
 + struct page_container *pcont;
 
   if (addr == -EFAULT)
   return;
 @@ -157,6 +158,11 @@ static void remove_migration_pte(struct 
   return;
   }
 
 + if (container_rss_prepare(new, vma, pcont)) {
 + pte_unmap(ptep);
 + return;
 + }
 +
   ptl = pte_lockptr(mm, pmd);
   spin_lock(ptl);
   pte = *ptep;
 @@ -175,16 +181,19 @@ static void remove_migration_pte(struct 
   set_pte_at(mm, addr, ptep, pte);
 
   if (PageAnon(new))
 - page_add_anon_rmap(new, vma, addr);
 + page_add_anon_rmap(new, vma, addr, pcont);
   else
 - page_add_file_rmap(new);
 + page_add_file_rmap(new, pcont);
 
   /* No need to invalidate - it was non-present before */
   update_mmu_cache(vma, addr, pte);
   lazy_mmu_prot_update(pte);
 + pte_unmap_unlock(ptep, ptl);
 + return;
 
  out:
   pte_unmap_unlock(ptep, ptl);
 + container_rss_release(pcont);
  }
 
  /*

you missed out an include in mm/migrate.c

cheers,

C.
Signed-off-by: Cedric Le Goater [EMAIL PROTECTED]
---
 mm/migrate.c |1 +
 1 file changed, 1 insertion(+)

Index: 2.6.20/mm/migrate.c
===
--- 2.6.20.orig/mm/migrate.c
+++ 2.6.20/mm/migrate.c
@@ -28,6 +28,7 @@
 #include linux/mempolicy.h
 #include linux/vmalloc.h
 #include linux/security.h
+#include linux/rss_container.h
 
 #include internal.h

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Pavel Emelianov
Cedric Le Goater wrote:
 --- linux-2.6.20.orig/mm/migrate.c   2007-02-04 21:44:54.0 +0300
 +++ linux-2.6.20-0/mm/migrate.c  2007-03-06 13:33:28.0 +0300
 @@ -134,6 +134,7 @@ static void remove_migration_pte(struct 
  pte_t *ptep, pte;
  spinlock_t *ptl;
  unsigned long addr = page_address_in_vma(new, vma);
 +struct page_container *pcont;

  if (addr == -EFAULT)
  return;
 @@ -157,6 +158,11 @@ static void remove_migration_pte(struct 
  return;
  }

 +if (container_rss_prepare(new, vma, pcont)) {
 +pte_unmap(ptep);
 +return;
 +}
 +
  ptl = pte_lockptr(mm, pmd);
  spin_lock(ptl);
  pte = *ptep;
 @@ -175,16 +181,19 @@ static void remove_migration_pte(struct 
  set_pte_at(mm, addr, ptep, pte);

  if (PageAnon(new))
 -page_add_anon_rmap(new, vma, addr);
 +page_add_anon_rmap(new, vma, addr, pcont);
  else
 -page_add_file_rmap(new);
 +page_add_file_rmap(new, pcont);

  /* No need to invalidate - it was non-present before */
  update_mmu_cache(vma, addr, pte);
  lazy_mmu_prot_update(pte);
 +pte_unmap_unlock(ptep, ptl);
 +return;

  out:
  pte_unmap_unlock(ptep, ptl);
 +container_rss_release(pcont);
  }

  /*
 
 you missed out an include in mm/migrate.c
 
 cheers,

Thanks! :)

 C.
 Signed-off-by: Cedric Le Goater [EMAIL PROTECTED]
 ---
  mm/migrate.c |1 +
  1 file changed, 1 insertion(+)
 
 Index: 2.6.20/mm/migrate.c
 ===
 --- 2.6.20.orig/mm/migrate.c
 +++ 2.6.20/mm/migrate.c
 @@ -28,6 +28,7 @@
  #include linux/mempolicy.h
  #include linux/vmalloc.h
  #include linux/security.h
 +#include linux/rss_container.h
  
  #include internal.h
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Nick Piggin

Kirill Korotaev wrote:


The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.



User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner container
that charged it is the easiest/cleanest way to handle
all the problems with dynamic context change, races, etc.,
and 1 pointer in page struct is just 0.1% overhead.


The pointer in struct page approach is a decent one, which I have
liked since this whole container effort came up. IIRC Linus and Alan
also thought that was a reasonable way to go.
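To make the shape of that concrete, a toy illustration (not the actual
beancounters layout; the type and field names here are made up):

struct rss_container;

struct page_info {                        /* stand-in for struct page */
	unsigned long flags;
	struct rss_container *owner;      /* set at charge time; one   */
	                                  /* owner per page, NULL      */
	                                  /* while uncharged           */
};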

I haven't reviewed the rest of the beancounters patch since looking
at it quite a few months ago... I probably don't have time for a
good review at the moment, but I should eventually.


Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.



big PAE doesn't make any difference IMHO
(as long as struct pages are not created for non-present physical memory areas)


The issue is just that struct pages use low memory, which is a really
scarce commodity on PAE. One more pointer in the struct page means
64MB less lowmem.

But PAE is crap anyway. We've already made enough concessions in the
kernel to support it. I agree: struct page overhead is not really
significant. The benefits of simplicity seem to outweigh the downside.
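The numbers above check out on the back of an envelope; a throwaway
program to verify them, assuming a 32-bit pointer, 4 KiB pages, and the
64 GiB PAE maximum:

#include <stdio.h>

int main(void)
{
	unsigned long long page_size = 4096;         /* bytes per page */
	unsigned long long ptr_size  = 4;            /* 32-bit pointer */
	unsigned long long pae_ram   = 64ULL << 30;  /* 64 GiB         */
	unsigned long long pages     = pae_ram / page_size;

	/* ~0.1% of memory spent on the new field... */
	printf("per-page overhead: %.2f%%\n",
	       100.0 * (double)ptr_size / (double)page_size);
	/* ...and on PAE all of it comes out of lowmem: 64 MiB. */
	printf("extra lowmem: %llu MiB\n", (pages * ptr_size) >> 20);
	return 0;
}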


But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.



But it lacks some other features:
1. page can't be shared easily with another container


I think they could be shared. You allocate _new_ pages from your own
node, but you can definitely use existing pages allocated to other
nodes.


2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it


Yes there would be some accounting differences. I think it is hard
to say exactly what containers are using what page anyway, though.
What do you say about unmapped pages? Kernel allocations? etc.
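For what the fractional scheme in item 2 would mean, a toy model (purely
illustrative; real beancounters bookkeeping is more involved): each of
the N containers mapping a page is charged PAGE_SIZE/N, so every map or
unmap by one container forces a recharge of all the others.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* charge seen by each container while nusers containers map the page */
static unsigned long share(unsigned long nusers)
{
	return nusers ? PAGE_SIZE / nusers : 0;
}

int main(void)
{
	for (unsigned long n = 1; n <= 3; n++)
		printf("%lu container(s): %lu bytes charged each\n",
		       n, share(n));
	return 0;
}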


3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.


?
page_to_nid(page) ~= container that owns it.
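Spelled out, that shorthand amounts to something like the following
kernel-style sketch, where the per-node container table is an assumption
of this example rather than anything in the proposed patches:

extern struct rss_container *node_container[MAX_NUMNODES];

static struct rss_container *page_container_of(struct page *page)
{
	return node_container[page_to_nid(page)];
}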


4. I guess container destroy requires destroying the memory zone,
   which means writing out dirty data. That doesn't sound
   good to me either.


I haven't looked at any implementation, but I think it is fine for
the zone to stay around.


5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.


I don't understand why? You can much more easily target a specific
container for reclaim with this approach than with others (because
you have an lru per container).


6. You cannot overcommit. AFAIU, the memory should be granted
   to a node for exclusive usage and cannot be used by other containers,
   even if it is unused. This is not an option for us.


I'm not sure about that. If you have a larger number of nodes, then
you could assign more free nodes to a container on demand. But I
think there would definitely be less flexibility with nodes...

I don't know... and seeing as I don't really know where the google
guys are going with it, I won't misrepresent their work any further ;)



Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?



hope we'll have it soon :)


Good luck ;)

--
SUSE Labs, Novell Inc.




Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-14 Thread Balbir Singh

Nick Piggin wrote:

Kirill Korotaev wrote:


The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.



User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner
container that charged it is the easiest/cleanest way to handle
all the problems with dynamic context change, races, etc.,
and 1 pointer in page struct is just 0.1% overhead.


The pointer in struct page approach is a decent one, which I have
liked since this whole container effort came up. IIRC Linus and Alan
also thought that was a reasonable way to go.

I haven't reviewed the rest of the beancounters patch since looking
at it quite a few months ago... I probably don't have time for a
good review at the moment, but I should eventually.



This patch is not really beancounters.

1. It uses the containers framework
2. It is similar to my RSS controller (http://lkml.org/lkml/2007/2/26/8)

I would say that beancounters are changing and evolving.


Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.



big PAE doesn't make any difference IMHO
(as long as struct pages are not created for non-present physical memory
areas)


The issue is just that struct pages use low memory, which is a really
scarce commodity on PAE. One more pointer in the struct page means
64MB less lowmem.

But PAE is crap anyway. We've already made enough concessions in the
kernel to support it. I agree: struct page overhead is not really
significant. The benefits of simplicity seem to outweigh the downside.


But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this
one.



But it lacks some other features:
1. page can't be shared easily with another container


I think they could be shared. You allocate _new_ pages from your own
node, but you can definitely use existing pages allocated to other
nodes.


2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it


Yes there would be some accounting differences. I think it is hard
to say exactly what containers are using what page anyway, though.
What do you say about unmapped pages? Kernel allocations? etc.


3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.


?
page_to_nid(page) ~= container that owns it.


4. I guess container destroy requires destroying the memory zone,
   which means writing out dirty data. That doesn't sound
   good to me either.


I haven't looked at any implementation, but I think it is fine for
the zone to stay around.


5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.


I don't understand why? You can much more easily target a specific
container for reclaim with this approach than with others (because
you have an lru per container).



Yes, but we break the global LRU. With these RSS patches, reclaim not
triggered by containers still uses the global LRU; by using nodes,
we would lose the global LRU.


6. You cannot overcommit. AFAIU, the memory should be granted
   to a node for exclusive usage and cannot be used by other containers,
   even if it is unused. This is not an option for us.


I'm not sure about that. If you have a larger number of nodes, then
you could assign more free nodes to a container on demand. But I
think there would definitely be less flexibility with nodes...

I don't know... and seeing as I don't really know where the google
guys are going with it, I won't misrepresent their work any further ;)



Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?



hope we'll have it soon :)


Good luck ;)



I think we have made some forward progress on the consensus.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Nick Piggin

Balbir Singh wrote:

Nick Piggin wrote:



And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.



When that does happen and a container hits its limit, with an LRU
per-container, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the
container that is using those pages most frequently.


Exactly. Statistically, first touch will work OK. It may mean some
reclaim inefficiencies in corner cases, but things will tend to
even out.
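A rough sketch of that recharge-on-reclaim behaviour (not the
patchset's code; every structure and the helper below are assumptions
for illustration):

struct page_charge {
	struct list_head lru;           /* links into the charging container */
	struct page *page;
};

struct container {
	long rss, limit;
	struct list_head lru;           /* pages currently charged here */
};

/* assumed helper: does any task in @c still map @pc->page? */
bool charge_still_mapped(struct page_charge *pc, struct container *c);

/* Over limit: drop charges for pages this container no longer uses;
 * the next container to touch such a page picks up the charge. */
static void shrink_container(struct container *c)
{
	struct page_charge *pc, *next;

	list_for_each_entry_safe(pc, next, &c->lru, lru) {
		if (c->rss <= c->limit)
			break;
		if (!charge_still_mapped(pc, c)) {
			list_del(&pc->lru);
			c->rss--;
		}
	}
}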


So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. Sharing of nodes
2. Resizing nodes
3. Maybe more


But a lot of that is happening anyway for other reasons (eg. memory
plug/unplug). And I don't consider node/zone setup to be part of the
"core VM" as such... it is _good_ if we can move extra work into setup
rather than have it in the mm.

That said, I don't think this patch is terribly intrusive either.



With the node-per-container idea, it will be hard to control page cache
limits, independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.


I don't know that it would be particularly harder than any other
first-touch scheme. If one container ends up being charged with too
much pagecache, eventually they'll reclaim a bit of it and the pages
will get charged to more frequent users.



However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?



I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.



We decided to implement accounting and control in phases

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.


But this patch gives the groundwork to handle 1-4, and it is in a small
chunk, and one would be able to apply different limits to different types
of pages with it. Just using rmap to handle 1 does not really seem like a
viable alternative because it fundamentally isn't going to handle 2 or 4.

I'm not saying that you couldn't _later_ add something that uses rmap or
our current RSS accounting to tweak container-RSS semantics. But isn't it
sensible to lay the groundwork first? Get a clear path to something that
is good (not perfect), but *works*?
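A sketch of how the same groundwork could carry the four phases as
independently settable limits later (the names here are assumptions,
not anything from the patch):

enum page_class { PC_RSS, PC_PAGECACHE, PC_MLOCK, PC_KERNEL, PC_NR };

struct container_res {
	long usage[PC_NR];
	long limit[PC_NR];              /* each settable on its own */
};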

--
SUSE Labs, Novell Inc.




Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Balbir Singh

Nick Piggin wrote:

Eric W. Biederman wrote:

Nick Piggin <[EMAIL PROTECTED]> writes:



Eric W. Biederman wrote:


First touch page ownership does not guarantee me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that application isn't running,
my application will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters a non-trivial
syscall, in which case you just want to know about the maximum
RSS of the process, which we already account.



Not per process; I want this on a group of processes, and yes, that
is all I want.  I just want accounting of the maximum RSS of
a group of processes and then a mechanism to limit that maximum rss.


Well don't you just sum up the maximum for each process?

Or do you want to only count shared pages inside a container once,
or something difficult like that?



I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.


You seem to want total isolation. You could use virtualization?



No.  I don't want the meaning of my rss limit to be affected by what
other processes are doing.  We have constraints on how many resources
the box actually has.  But I don't want accounting so sloppy that
processes outside my group of processes can artificially
lower my rss value, which magically raises my rss limit.


So what are you going to do about all the shared caches and slabs
inside the kernel?



It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define "proper", then we could discuss that.



I will agree that this patchset is probably in the right general
ballpark.

But the fact that pages are assigned exactly one owner is pure nonsense.
We can do better.  All I am asking is for someone to at least attempt
to actually account for the rss of a group of processes and get the
numbers right when we have shared pages between different groups of
processes.  We have the data structures to support this with rmap.


Well rmap only supports mapped, userspace pages.



Let me describe the situation where I think the accounting in the
patchset goes totally wonky.

Gcc as I recall maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 &
make -jN O=../compile2 &

But set it up so that the two compiles are in different rss groups.
If I run them concurrently they will use the same files at the same
time, and most likely, because of the first-touch rss limit rule, even
if I have a draconian rss limit both compiles will be able to
complete and finish.  However, if I run either of them alone under
the most draconian rss limit that still allows both concurrent compiles
to finish, I won't be able to compile a single kernel tree.


Yeah it is not perfect. Fortunately, there is no perfect solution,
so we don't have to be too upset about that.

And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.



When that does happen and a container hits its limit, with an LRU
per-container, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the
container that is using those pages most frequently.


So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. Sharing of nodes
2. Resizing nodes
3. Maybe more

With the node-per-container idea, it will be hard to control page cache
limits, independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.




However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?


I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Nick Piggin

Eric W. Biederman wrote:

Nick Piggin <[EMAIL PROTECTED]> writes:



Eric W. Biederman wrote:


First touch page ownership does not guarantee me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that application isn't running,
my application will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters a non-trivial
syscall, in which case you just want to know about the maximum
RSS of the process, which we already account.



Not per process; I want this on a group of processes, and yes, that
is all I want.  I just want accounting of the maximum RSS of
a group of processes and then a mechanism to limit that maximum rss.


Well don't you just sum up the maximum for each process?

Or do you want to only count shared pages inside a container once,
or something difficult like that?



I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.


You seem to want total isolation. You could use virtualization?



No.  I don't want the meaning of my rss limit to be affected by what
other processes are doing.  We have constraints on how many resources
the box actually has.  But I don't want accounting so sloppy that
processes outside my group of processes can artificially
lower my rss value, which magically raises my rss limit.


So what are you going to do about all the shared caches and slabs
inside the kernel?



It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define "proper", then we could discuss that.



I will agree that this patchset is probably in the right general ballpark.
But the fact that pages are assigned exactly one owner is pure nonsense.
We can do better.  All I am asking is for someone to at least attempt
to actually account for the rss of a group of processes and get the numbers
right when we have shared pages between different groups of
processes.  We have the data structures to support this with rmap.


Well rmap only supports mapped, userspace pages.



Let me describe the situation where I think the accounting in the
patchset goes totally wonky. 



Gcc as I recall maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 &
make -jN O=../compile2 &

But set it up so that the two compiles are in different rss groups.
If I run them concurrently they will use the same files at the same
time, and most likely, because of the first-touch rss limit rule, even
if I have a draconian rss limit both compiles will be able to
complete and finish.  However, if I run either of them alone under
the most draconian rss limit that still allows both concurrent compiles
to finish, I won't be able to compile a single kernel tree.


Yeah it is not perfect. Fortunately, there is no perfect solution,
so we don't have to be too upset about that.

And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.

So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?


I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.

--
SUSE Labs, Novell Inc.




Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Nick Piggin <[EMAIL PROTECTED]> writes:

> Eric W. Biederman wrote:
>>
>> First touch page ownership does not guarantee me anything useful
>> for knowing if I can run my application or not.  Because of page
>> sharing my application might run inside the rss limit only because
>> I got lucky and happened to share a lot of pages with another running
>> application.  If the next time I run it that application isn't running,
>> my application will fail.  That is ridiculous.
>
> Let's be practical here, what you're asking is basically impossible.
>
> Unless by deterministic you mean that it never enters a non-trivial
> syscall, in which case you just want to know about the maximum
> RSS of the process, which we already account.

Not per process; I want this on a group of processes, and yes, that
is all I want.  I just want accounting of the maximum RSS of
a group of processes and then a mechanism to limit that maximum rss.

>> I don't want sharing between vservers/VE/containers to affect how many
>> pages I can have mapped into my processes at once.
>
> You seem to want total isolation. You could use virtualization?

No.  I don't want the meaning of my rss limit to be affected by what
other processes are doing.  We have constraints on how many resources
the box actually has.  But I don't want accounting so sloppy that
processes outside my group of processes can artificially
lower my rss value, which magically raises my rss limit.

>> Now sharing is sufficiently rare that I'm pretty certain that problems
>> come up rarely.  So maybe these problems have not shown up in testing
>> yet.  But until I see proof that actually doing the accounting for
>> sharing properly has intolerable overhead, I want proper accounting,
>> not this hand waving that is only accurate on the third Tuesday of the
>> month.
>
> It is basically handwaving anyway. The only approach I've seen with
> a sane (not perfect, but good) way of accounting memory use is this
> one. If you care to define "proper", then we could discuss that.

I will agree that this patchset is probably in the right general ballpark.
But the fact that pages are assigned exactly one owner is pure nonsense.
We can do better.  All I am asking is for someone to at least attempt
to actually account for the rss of a group of processes and get the numbers
right when we have shared pages between different groups of
processes.  We have the data structures to support this with rmap.

Let me describe the situation where I think the accounting in the
patchset goes totally wonky. 


Gcc as I recall maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 &
make -jN O=../compile2 &

But set it up so that the two compiles are in different rss groups.
If I run them concurrently they will use the same files at the same
time, and most likely, because of the first-touch rss limit rule, even
if I have a draconian rss limit both compiles will be able to
complete and finish.  However, if I run either of them alone under
the most draconian rss limit that still allows both concurrent compiles
to finish, I won't be able to compile a single kernel tree.

The reason for the failure with a single tree (in my thought
experiment) is that the rss limit was set below what is actually
needed for the code to work.  When we were compiling two kernels and
they were mapping the same pages at the same time, we could put the rss
limit below the minimum rss needed for the compile to execute and
still have it complete, because with first touch only one group
accounted for the pages and the other just leeched off the first; as
long as both compiles grabbed some of the pages they could complete.
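Putting toy numbers on that thought experiment (illustrative only):

#include <stdio.h>

int main(void)
{
	int R = 1000;   /* pages each compile actually touches   */
	int S = 600;    /* pages shared between the two compiles */

	/* First touch splits the shared pages roughly in half, so a
	 * limit of R - S/2 = 700 pages lets both concurrent compiles
	 * finish... */
	printf("apparent rss per group (concurrent): ~%d\n", R - S / 2);

	/* ...but a compile running alone is charged for everything it
	 * touches and fails under the same limit. */
	printf("rss of one group running alone: %d\n", R);
	return 0;
}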

Now, I know in practice most draconian limits will simply result in the
pages staying in the page cache but not mapped into processes in the
group with the draconian limit, or they will result in pages of the
group with the draconian limit being pushed out into the swap cache.
So the chances of actual application failure even with a draconian
rss limit are quite unlikely.  (I actually really appreciate this
fact.)

However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?

Eric


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Nick Piggin

Eric W. Biederman wrote:

Herbert Poetzl <[EMAIL PROTECTED]> writes:



On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:


On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:


For these you essentially need per-container page->_mapcount counter,
otherwise you can't detect whether rss group still has the page 
in question being mapped in its processes' address spaces or not. 



What do you mean by this?  You can always tell whether a process has a
particular page mapped.  Could you explain the issue a bit more.  I'm
not sure I get it.


OpenVZ wants to account _shared_ pages in a guest
differently from separate pages, so that the RSS-accounted
values reflect the actual RAM used instead
of the sum of all processes' RSS pages, which for
sure is more relevant to the administrator, but IMHO
not so terribly important to justify memory-consuming
structures and sacrificing performance to get it right

YMMV, but maybe we can find a smart solution to the
issue too :)



I will tell you what I want.

I want a shared page cache that has nothing to do with RSS limits.

I want an RSS limit such that, once I know I can run a deterministic
application with a fixed set of inputs inside it, I know it will
always run.

First touch page ownership does not guarantee me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that application isn't running,
my application will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters a non-trivial
syscall, in which case you just want to know about the maximum
RSS of the process, which we already account.


I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.


You seem to want total isolation. You could use virtualization?


Now sharing is sufficiently rare that I'm pretty certain that problems
come up rarely.  So maybe these problems have not shown up in testing
yet.  But until I see proof that actually doing the accounting for
sharing properly has intolerable overhead, I want proper accounting,
not this hand waving that is only accurate on the third Tuesday of the
month.


It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define "proper", then we could discuss that.

--
SUSE Labs, Novell Inc.




Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Herbert Poetzl <[EMAIL PROTECTED]> writes:

> On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
>> On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
>> > 
>> > For these you essentially need per-container page->_mapcount counter,
>> > otherwise you can't detect whether rss group still has the page 
>> > in question being mapped in its processes' address spaces or not. 
>
>> What do you mean by this?  You can always tell whether a process has a
>> particular page mapped.  Could you explain the issue a bit more.  I'm
>> not sure I get it.
>
> OpenVZ wants to account _shared_ pages in a guest
> differently from separate pages, so that the RSS-accounted
> values reflect the actual RAM used instead
> of the sum of all processes' RSS pages, which for
> sure is more relevant to the administrator, but IMHO
> not so terribly important to justify memory-consuming
> structures and sacrificing performance to get it right
>
> YMMV, but maybe we can find a smart solution to the
> issue too :)

I will tell you what I want.

I want a shared page cache that has nothing to do with RSS limits.

I want an RSS limit such that, once I know I can run a deterministic
application with a fixed set of inputs inside it, I know it will
always run.

First touch page ownership does not guarantee me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that application isn't running,
my application will fail.  That is ridiculous.

I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.

Now sharing is sufficiently rare that I'm pretty certain that problems
come up rarely.  So maybe these problems have not shown up in testing
yet.  But until I see proof that actually doing the accounting for
sharing properly has intolerable overhead, I want proper accounting,
not this hand waving that is only accurate on the third Tuesday of the
month.

Ideally all of this will be followed by smarter rss based swapping.
There are some very cool things that can be done to eliminate machine
overload once you have the ability to track real rss values.  

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Dave Hansen <[EMAIL PROTECTED]> writes:

> On Mon, 2007-03-12 at 20:07 +0300, Kirill Korotaev wrote:
>> > On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
>> >>For these you essentially need per-container page->_mapcount counter,
>> >>otherwise you can't detect whether rss group still has the page in question
>> >>being mapped
>> >>in its processes' address spaces or not. 
>> > 
>> > What do you mean by this?  You can always tell whether a process has a
>> > particular page mapped.  Could you explain the issue a bit more.  I'm
>> > not sure I get it.
>> When we do charge/uncharge we have to answer on another question:
>> "whether *any* task from the *container* has this page mapped", not the
>> "whether *this* task has this page mapped".
>
> That's a bit more clear. ;)
>
> OK, just so I make sure I'm getting your argument here.  It would be too
> expensive to go looking through all of the rmap data for _any_ other
> task that might be sharing the charge (in the same container) with the
> current task that is doing the unmapping.  

Which is a questionable assumption.  Worst case we are talking about a
list several thousand entries long, and generally, if the page is used
by the same container, you will hit one of your own processes long
before you traverse the whole list.

So at least the average case performance should be good.

It is only in the case where a page is shared between multiple
containers that this matters.
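A sketch of the walk being costed here: does *any* task in a given
container still map the page?  Locking is omitted, the mm->container
field and struct container are assumptions, and the list layout follows
the 2.6.20-era anon_vma; illustrative only.

static bool container_maps_page(struct container *c,
                                struct anon_vma *anon_vma)
{
	struct vm_area_struct *vma;

	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
		if (vma->vm_mm->container == c)
			return true;    /* same-container pages hit early */
	return false;
}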

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Dave Hansen [EMAIL PROTECTED] writes:

 On Mon, 2007-03-12 at 20:07 +0300, Kirill Korotaev wrote:
  On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
 For these you essentially need per-container page-_mapcount counter,
 otherwise you can't detect whether rss group still has the page in question
 being mapped
 in its processes' address spaces or not. 
  
  What do you mean by this?  You can always tell whether a process has a
  particular page mapped.  Could you explain the issue a bit more.  I'm
  not sure I get it.
 When we do charge/uncharge we have to answer on another question:
 whether *any* task from the *container* has this page mapped, not the
 whether *this* task has this page mapped.

 That's a bit more clear. ;)

 OK, just so I make sure I'm getting your argument here.  It would be too
 expensive to go looking through all of the rmap data for _any_ other
 task that might be sharing the charge (in the same container) with the
 current task that is doing the unmapping.  

Which is a questionable assumption.  Worse case we are talking a list
several thousand entries long, and generally if you are used by the same
container you will hit one of your processes long before you traverse
the whole list.

So at least the average case performance should be good.

It is only in the case when you a page is shared between multiple
containers when this matters.

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:

 On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
 On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
  
  For these you essentially need per-container page-_mapcount counter,
  otherwise you can't detect whether rss group still has the page 
  in question being mapped in its processes' address spaces or not. 

 What do you mean by this?  You can always tell whether a process has a
 particular page mapped.  Could you explain the issue a bit more.  I'm
 not sure I get it.

 OpenVZ wants to account _shared_ pages in a guest
 different than separate pages, so that the RSS
 accounted values reflect the actual used RAM instead
 of the sum of all processes RSS' pages, which for
 sure is more relevant to the administrator, but IMHO
 not so terribly important to justify memory consuming
 structures and sacrifice performance to get it right

 YMMV, but maybe we can find a smart solution to the
 issue too :)

I will tell you what I want.

I want a shared page cache that has nothing to do with RSS limits.

I want an RSS limit that once I know I can run a deterministic
application with a fixed set of inputs in I want to know it will
always run.

First touch page ownership does not guarantee give me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next I run and it isn't running my application
will fail.  That is ridiculous.

I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.

Now sharing is sufficiently rare that I'm pretty certain that problems
come up rarely.  So maybe these problems have not shown up in testing
yet.  But until I see the proof that actually doing the accounting for
sharing properly has intolerable overhead.  I want proper accounting
not this hand waving that is only accurate on the third Tuesday of the
month.

Ideally all of this will be followed by smarter rss based swapping.
There are some very cool things that can be done to eliminate machine
overload once you have the ability to track real rss values.  

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Nick Piggin

Eric W. Biederman wrote:

Herbert Poetzl [EMAIL PROTECTED] writes:



On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:


On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:


For these you essentially need per-container page-_mapcount counter,
otherwise you can't detect whether rss group still has the page 
in question being mapped in its processes' address spaces or not. 



What do you mean by this?  You can always tell whether a process has a
particular page mapped.  Could you explain the issue a bit more.  I'm
not sure I get it.


OpenVZ wants to account _shared_ pages in a guest
different than separate pages, so that the RSS
accounted values reflect the actual used RAM instead
of the sum of all processes RSS' pages, which for
sure is more relevant to the administrator, but IMHO
not so terribly important to justify memory consuming
structures and sacrifice performance to get it right

YMMV, but maybe we can find a smart solution to the
issue too :)



I will tell you what I want.

I want a shared page cache that has nothing to do with RSS limits.

I want an RSS limit that once I know I can run a deterministic
application with a fixed set of inputs in I want to know it will
always run.

First touch page ownership does not guarantee give me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next I run and it isn't running my application
will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters the a non
trivial syscall, in which case, you just want to know about maximum
RSS of the process, which we already account).


I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.


You seem to want total isolation. You could use virtualization?


Now sharing is sufficiently rare that I'm pretty certain that problems
come up rarely.  So maybe these problems have not shown up in testing
yet.  But until I see the proof that actually doing the accounting for
sharing properly has intolerable overhead.  I want proper accounting
not this hand waving that is only accurate on the third Tuesday of the
month.


It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define proper, then we could discuss that.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Eric W. Biederman
Nick Piggin [EMAIL PROTECTED] writes:

 Eric W. Biederman wrote:

 First touch page ownership does not guarantee give me anything useful
 for knowing if I can run my application or not.  Because of page
 sharing my application might run inside the rss limit only because
 I got lucky and happened to share a lot of pages with another running
 application.  If the next I run and it isn't running my application
 will fail.  That is ridiculous.

 Let's be practical here, what you're asking is basically impossible.

 Unless by deterministic you mean that it never enters the a non
 trivial syscall, in which case, you just want to know about maximum
 RSS of the process, which we already account).

Not per process I want this on a group of processes, and yes that
is all I want just.  I just want accounting of the maximum RSS of
a group of processes and then the mechanism to limit that maximum rss.

 I don't want sharing between vservers/VE/containers to affect how many
 pages I can have mapped into my processes at once.

 You seem to want total isolation. You could use virtualization?

No.  I don't want the meaning of my rss limit to be affected by what
other processes are doing.  We have constraints of how many resources
the box actually has.  But I don't want accounting so sloppy that
processes outside my group of processes can artificially
lower my rss value, which magically raises my rss limit.

 Now sharing is sufficiently rare that I'm pretty certain that problems
 come up rarely.  So maybe these problems have not shown up in testing
 yet.  But until I see the proof that actually doing the accounting for
 sharing properly has intolerable overhead.  I want proper accounting
 not this hand waving that is only accurate on the third Tuesday of the
 month.

 It is basically handwaving anyway. The only approach I've seen with
 a sane (not perfect, but good) way of accounting memory use is this
 one. If you care to define proper, then we could discuss that.

I will agree that this patchset is probably in the right general ballpark.
But the fact that pages are assigned exactly one owner is pure non-sense.
We can do better.  That is all I am asking for someone to at least attempt
to actually account for the rss of a group of processes and get the numbers
right when we have shared pages, between different groups of
processes.  We have the data structures to support this with rmap.

Let me describe the situation where I think the accounting in the
patchset goes totally wonky. 


Gcc as I recall maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 
make -jN O=../compile2 

But set it up so that the two compiles are in different rss groups.
If I run the concurrently they will use the same files at the same
time and most likely because of the first touch rss limit rule even
if I have a draconian rss limit the compiles will both be able to
complete and finish.   However if I run either of them alone if I
use the most draconian rss limit I can that allows both compiles to
finish I won't be able to compile a single kernel tree.

The reason for the failure with a single tree (in my thought
experiment) is that the rss limit was set below the what is actually
needed for the code to work.  When we were compiling two kernels and
they were mapping the same pages at the same time we could put the rss
limit below the minimum rss needed for the compile to execute and
still have it complete because of with first touch only one group
accounted for the pages and the other just leached of the first, as
long as both compiles grabbed some of the pages they could complete.

No I know in practice most draconian limits will simply result in the
page staying in the page cache but not mapped into processes in the
group with the draconian limit, or they will result in pages of the
group with the draconian limit being pushed out into the swap cache.
So the chances of actual application failure even with a draconian
rss limit are quite unlikely.  (I actually really appreciate this
fact).

However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Nick Piggin

Eric W. Biederman wrote:

Nick Piggin [EMAIL PROTECTED] writes:



Eric W. Biederman wrote:


First touch page ownership does not guarantee give me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next I run and it isn't running my application
will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters the a non
trivial syscall, in which case, you just want to know about maximum
RSS of the process, which we already account).



Not per process I want this on a group of processes, and yes that
is all I want just.  I just want accounting of the maximum RSS of
a group of processes and then the mechanism to limit that maximum rss.


Well don't you just sum up the maximum for each process?

Or do you want to only count shared pages inside a container once,
or something difficult like that?



I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.


You seem to want total isolation. You could use virtualization?



No.  I don't want the meaning of my rss limit to be affected by what
other processes are doing.  We have constraints of how many resources
the box actually has.  But I don't want accounting so sloppy that
processes outside my group of processes can artificially
lower my rss value, which magically raises my rss limit.


So what are you going to do about all the shared caches and slabs
inside the kernel?



It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define proper, then we could discuss that.



I will agree that this patchset is probably in the right general ballpark.
But the fact that pages are assigned exactly one owner is pure non-sense.
We can do better.  That is all I am asking for someone to at least attempt
to actually account for the rss of a group of processes and get the numbers
right when we have shared pages, between different groups of
processes.  We have the data structures to support this with rmap.


Well rmap only supports mapped, userspace pages.



Let me describe the situation where I think the accounting in the
patchset goes totally wonky. 



Gcc as I recall maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 
make -jN O=../compile2 

But set it up so that the two compiles are in different rss groups.
If I run the concurrently they will use the same files at the same
time and most likely because of the first touch rss limit rule even
if I have a draconian rss limit the compiles will both be able to
complete and finish.   However if I run either of them alone if I
use the most draconian rss limit I can that allows both compiles to
finish I won't be able to compile a single kernel tree.


Yeah it is not perfect. Fortunately, there is no perfect solution,
so we don't have to be too upset about that.

And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.

So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?


I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Balbir Singh

Nick Piggin wrote:

Eric W. Biederman wrote:

Nick Piggin [EMAIL PROTECTED] writes:



Eric W. Biederman wrote:


First touch page ownership does not guarantee give me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next I run and it isn't running my application
will fail.  That is ridiculous.


Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters the a non
trivial syscall, in which case, you just want to know about maximum
RSS of the process, which we already account).



Not per process I want this on a group of processes, and yes that
is all I want just.  I just want accounting of the maximum RSS of
a group of processes and then the mechanism to limit that maximum rss.


Well don't you just sum up the maximum for each process?

Or do you want to only count shared pages inside a container once,
or something difficult like that?



I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.


You seem to want total isolation. You could use virtualization?



No.  I don't want the meaning of my rss limit to be affected by what
other processes are doing.  We have constraints of how many resources
the box actually has.  But I don't want accounting so sloppy that
processes outside my group of processes can artificially
lower my rss value, which magically raises my rss limit.


So what are you going to do about all the shared caches and slabs
inside the kernel?



It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define proper, then we could discuss that.



I will agree that this patchset is probably in the right general 
ballpark.

But the fact that pages are assigned exactly one owner is pure non-sense.
We can do better.  That is all I am asking for someone to at least 
attempt
to actually account for the rss of a group of processes and get the 
numbers

right when we have shared pages, between different groups of
processes.  We have the data structures to support this with rmap.


Well rmap only supports mapped, userspace pages.



Let me describe the situation where I think the accounting in the
patchset goes totally wonky.

Gcc as I recall maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 
make -jN O=../compile2 

But set it up so that the two compiles are in different rss groups.
If I run the concurrently they will use the same files at the same
time and most likely because of the first touch rss limit rule even
if I have a draconian rss limit the compiles will both be able to
complete and finish.   However if I run either of them alone if I
use the most draconian rss limit I can that allows both compiles to
finish I won't be able to compile a single kernel tree.


Yeah it is not perfect. Fortunately, there is no perfect solution,
so we don't have to be too upset about that.

And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.



When that does happen and if a container hits it limit, with a LRU
per-container, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the
container that is using those pages most frequently.


So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. For sharing of nodes
2. Resizing nodes
3. May be more

With the node-per-container idea, it will hard to control page cache
limits, independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.




However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?


I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.

Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-13 Thread Nick Piggin

Balbir Singh wrote:

Nick Piggin wrote:



And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.



When that does happen and if a container hits its limit, with an LRU
per container, if the container is not actually using those pages,
they'll get thrown out of that container and get charged to the
container that is using those pages most frequently.


Exactly. Statistically, first touch will work OK. It may mean some
reclaim inefficiencies in corner cases, but things will tend to
even out.


So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact
on the core VM than this patch. That is also a first-touch scheme.



With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. Sharing of nodes
2. Resizing nodes
3. Maybe more


But a lot of that is happening anyway for other reasons (eg. memory
plug/unplug). And I don't consider node/zone setup to be part of the
core VM as such... it is _good_ if we can move extra work into setup
rather than have it in the mm.

That said, I don't think this patch is terribly intrusive either.



With the node-per-container idea, it will be hard to control page cache
limits independent of RSS limits or mlock limits.

NOTE: page cache == unmapped page cache here.


I don't know that it would be particularly harder than any other
first-touch scheme. If one container ends up being charged with too
much pagecache, eventually they'll reclaim a bit of it and the pages
will get charged to more frequent users.



However the messed up accounting that doesn't handle sharing between
groups of processes properly really bugs me.  Especially when we have
the infrastructure to do it right.

Does that make more sense?



I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.



We decided to implement accounting and control in phases:

1. RSS control
2. Unmapped page cache control
3. Mlock control
4. Kernel accounting and limits

This has several advantages:

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.


But this patch gives the groundwork to handle 1-4, and it is in a small
chunk, and one would be able to apply different limits to different types
of pages with it. Just using rmap to handle 1 does not really seem like a
viable alternative because it fundamentally isn't going to handle 2 or 4.

I'm not saying that you couldn't _later_ add something that uses rmap or
our current RSS accounting to tweak container-RSS semantics. But isn't it
sensible to lay the groundwork first? Get a clear path to something that
is good (not perfect), but *works*?

--
SUSE Labs, Novell Inc.


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-12 Thread Herbert Poetzl
On Mon, Mar 12, 2007 at 09:50:08AM -0700, Dave Hansen wrote:
> On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
> > 
> > For these you essentially need per-container page->_mapcount counter,
> > otherwise you can't detect whether rss group still has the page 
> > in question being mapped in its processes' address spaces or not. 

> What do you mean by this?  You can always tell whether a process has a
> particular page mapped.  Could you explain the issue a bit more.  I'm
> not sure I get it.

OpenVZ wants to account _shared_ pages in a guest
differently than separate pages, so that the RSS
accounted values reflect the actual RAM used instead
of the sum of all processes' RSS pages. That is for
sure more relevant to the administrator, but IMHO
not so terribly important as to justify memory-consuming
structures and sacrificed performance to get it right.

YMMV, but maybe we can find a smart solution to the
issue too :)

best,
Herbert

> -- Dave


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-12 Thread Dave Hansen
On Mon, 2007-03-12 at 20:07 +0300, Kirill Korotaev wrote:
> > On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
> >>For these you essentially need per-container page->_mapcount counter,
> >>otherwise you can't detect whether rss group still has the page in question 
> >>being mapped
> >>in its processes' address spaces or not. 
> > 
> > What do you mean by this?  You can always tell whether a process has a
> > particular page mapped.  Could you explain the issue a bit more.  I'm
> > not sure I get it.
> When we do charge/uncharge we have to answer another question:
> "whether *any* task from the *container* has this page mapped", not
> "whether *this* task has this page mapped".

That's a bit more clear. ;)

OK, just so I make sure I'm getting your argument here.  It would be too
expensive to go looking through all of the rmap data for _any_ other
task that might be sharing the charge (in the same container) with the
current task that is doing the unmapping.  

The requirements you're presenting so far appear to be:

1. The first user of a page in a container must be charged
2. The second user of a page in a container must not be charged
3. A container using a page must take a diminished charge when 
   another container is already using the page.
4. Additional fields in data structures (including 'struct page') are
   permitted

What have I missed?  What are your requirements for performance?

I'm not quite sure how the page->container stuff fits in here, though.
page->container would appear to be strictly assigning one page to one
container, but I know that beancounters can do partial page charges.
Care to fill me in?
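
For what it's worth, requirements 1-3 can be modelled in a few lines
of userspace C (a sketch with hypothetical names, beancounter-flavoured,
not actual OpenVZ code):

#include <stdio.h>

#define MAXCONT	4
#define SCALE	1200	/* one page = 1200 units, divisible by 1..4 */

struct page_bc {
	int mapcount[MAXCONT];	/* per-container mapcount (req 2) */
	int ncont;		/* containers currently using the page */
};

static long rss[MAXCONT];	/* per-container charge, in SCALE units */

static void map_page(struct page_bc *p, int c)
{
	int i;

	if (p->mapcount[c]++ > 0)
		return;		/* req 2: repeat user, no new charge */

	/* a new container joined: take back the old shares... */
	for (i = 0; i < MAXCONT; i++)
		if (p->mapcount[i] && i != c)
			rss[i] -= SCALE / p->ncont;
	p->ncont++;
	/* ...and hand out diminished shares (reqs 1 and 3) */
	for (i = 0; i < MAXCONT; i++)
		if (p->mapcount[i])
			rss[i] += SCALE / p->ncont;
}

int main(void)
{
	struct page_bc page = { { 0 }, 0 };

	map_page(&page, 0);	/* first user: full charge of 1200 */
	map_page(&page, 1);	/* second container: 600 each */
	map_page(&page, 1);	/* repeat user in container 1: free */
	printf("rss[0]=%ld rss[1]=%ld\n", rss[0], rss[1]);
	return 0;
}

It prints rss[0]=600 rss[1]=600, the diminished charge of requirement
3; unmapping would need the symmetric rebalance on the 1 -> 0 mapcount
transition.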

-- Dave



Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-12 Thread Kirill Korotaev
> On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
> 
>>For these you essentially need per-container page->_mapcount counter,
>>otherwise you can't detect whether rss group still has the page in question 
>>being mapped
>>in its processes' address spaces or not. 
> 
> 
> What do you mean by this?  You can always tell whether a process has a
> particular page mapped.  Could you explain the issue a bit more.  I'm
> not sure I get it.
When we do charge/uncharge we have to answer another question:
"whether *any* task from the *container* has this page mapped", not
"whether *this* task has this page mapped".

Thanks,
Kirill


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-12 Thread Dave Hansen
On Mon, 2007-03-12 at 19:23 +0300, Kirill Korotaev wrote:
> 
> For these you essentially need per-container page->_mapcount counter,
> otherwise you can't detect whether rss group still has the page in question 
> being mapped
> in its processes' address spaces or not. 

What do you mean by this?  You can always tell whether a process has a
particular page mapped.  Could you explain the issue a bit more.  I'm
not sure I get it.

-- Dave



Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-12 Thread Kirill Korotaev
Eric W. Biederman wrote:
> Pavel Emelianov <[EMAIL PROTECTED]> writes:
> 
> 
>>Pages are charged to their first touchers which are
>>determined using pages' mapcount manipulations in
>>rmap calls.
> 
> 
> NAK pages should be charged to every rss group whose mm_struct they
> are mapped into.
For this you essentially need a per-container page->_mapcount counter;
otherwise you can't detect whether the rss group still has the page in
question mapped in its processes' address spaces or not.

1. This was discussed before and considered OK by all the people
   involved in resource management.
2. This can be done with something like the page beancounters used in
   OpenVZ for accounting shared fractions. That is the next step
   forward.

If you know how to get "pages should be charged to every rss group
whose mm_struct they are mapped into" without an additional pointer in
struct page, please throw me an idea.
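
One shape such an idea could take (purely a sketch, every name here
hypothetical): keep the per-container mapcounts out of struct page
altogether, in a hash keyed by (page, container), and charge or
uncharge only on the 0 -> 1 and 1 -> 0 transitions:

#include <stdlib.h>

#define HASHSZ 1024

struct pc_count {
	unsigned long pfn;	/* page identifier */
	int cont;		/* container id */
	int mapcount;		/* this container's mappings of the page */
	struct pc_count *next;	/* hash chain */
};

static struct pc_count *table[HASHSZ];

static struct pc_count *pc_lookup(unsigned long pfn, int cont)
{
	unsigned int h = (pfn * 31 + cont) % HASHSZ;
	struct pc_count *pc;

	for (pc = table[h]; pc; pc = pc->next)
		if (pc->pfn == pfn && pc->cont == cont)
			return pc;
	pc = calloc(1, sizeof(*pc));	/* error handling elided */
	pc->pfn = pfn;
	pc->cont = cont;
	pc->next = table[h];
	table[h] = pc;
	return pc;
}

/* returns 1 when the container must be charged (0 -> 1 transition) */
static int pc_map(unsigned long pfn, int cont)
{
	return pc_lookup(pfn, cont)->mapcount++ == 0;
}

/* returns 1 when the container must be uncharged (1 -> 0 transition) */
static int pc_unmap(unsigned long pfn, int cont)
{
	return --pc_lookup(pfn, cont)->mapcount == 0;
}

Whether the extra lookup on every map and unmap is tolerable is exactly
the performance question; a pointer in struct page trades memory for
avoiding it.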

Thanks,
Kirill


Re: [RFC][PATCH 4/7] RSS accounting hooks over the code

2007-03-11 Thread Eric W. Biederman
Pavel Emelianov <[EMAIL PROTECTED]> writes:

> Pages are charged to their first touchers which are
> determined using pages' mapcount manipulations in
> rmap calls.

NAK. Pages should be charged to every rss group whose mm_struct they
are mapped into.

Eric