Re: [RFC] Limit the size of the pagecache

2007-01-26 Thread KAMEZAWA Hiroyuki
On Fri, 26 Jan 2007 02:29:55 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Wed, 24 Jan 2007 14:15:10 +0900
> KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:
> 
> > - One for stability
> >   When a customer constructs their database (Oracle), the system often
> >   goes OOM.
> >   This is because the system cannot allocate ZONE_DMA memory for 32bit
> >   devices (USB or e100).
> >   Not allowing almost all pages to be used as page cache (for temporary
> >   use) would be some help.
> >   (Note: the DB was constructed on ext3, so all writes were serialized
> >   and the system couldn't free page cache.)
> 
> I'm surprised that any reasonable driver has a dependency on ZONE_DMA.  Are
> you sure?  Send full oom-killer output, please.
> 
> 
Our ia64 server's USB/e100 devices use 32bit PCI, so OOM sometimes happens in
the DMA zone.
(ia64's ZONE_DMA is the 0-4G area.)

But, very sorry, I was confused.

I looked at the issue above again and found that ZONE_NORMAL on x86 was exhausted.

This was an interesting incident:

Constructing the DB on a 4GB system has no problem.
Constructing the DB on an 8GB system always causes OOM.

I asked the users to change the DB's parameters. (This happened on the
RHEL4/linux-2.6.9 series.)


> >   And...some customers want to keep as much memory free as possible.
> >   99% memory usage makes them feel insecure ;)
> 
> Tell them to do "echo 3 > /proc/sys/vm/drop_caches", then wait three minutes?

Ah, maybe we can use it on RHEL5. We'll test it. thank you.

Thanks,
-Kamezawa



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Limit the size of the pagecache

2007-01-26 Thread Andrew Morton
On Wed, 24 Jan 2007 15:03:23 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> 
> Yeah, it will be failing at order=4, because the allocator won't try
> very hard to reclaim pagecache pages at that cutoff point. This needs to
> be fixed in the allocator.

A simple and perhaps sufficient fix for this nommu problem would be to replace
the magic "3" in __alloc_pages() with a tunable.
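
For illustration only, here is a toy model of the cutoff under discussion. It is not kernel code. It assumes the hard-coded 3 is compared against the allocation order, which is how the order=4 remark above reads, and the names `DEFAULT_RETRY_ORDER` and `allocator_keeps_retrying` are made up for this sketch.

```python
# Toy model only, not kernel code. The names and the comparison are
# assumptions made for illustration.
DEFAULT_RETRY_ORDER = 3  # stands in for the hard-coded "3" discussed above


def pages_for_order(order: int) -> int:
    """An order-N request asks for 2**N contiguous pages."""
    return 1 << order


def allocator_keeps_retrying(order: int,
                             retry_order: int = DEFAULT_RETRY_ORDER) -> bool:
    """Return True if a request of this order is at or below the cutoff."""
    return order <= retry_order


if __name__ == "__main__":
    for order in range(6):
        print(order, pages_for_order(order), allocator_keeps_retrying(order))
```

With the default, an order=4 request (16 pages) falls just past the cutoff; passing `retry_order=4` is the "tunable" variant of the same check.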


Re: [RFC] Limit the size of the pagecache

2007-01-26 Thread Andrew Morton
On Wed, 24 Jan 2007 14:15:10 +0900
KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

> - One for stability
>   When a customer constructs their database (Oracle), the system often
>   goes OOM.
>   This is because the system cannot allocate ZONE_DMA memory for 32bit
>   devices (USB or e100).
>   Not allowing almost all pages to be used as page cache (for temporary
>   use) would be some help.
>   (Note: the DB was constructed on ext3, so all writes were serialized
>   and the system couldn't free page cache.)

I'm surprised that any reasonable driver has a dependency on ZONE_DMA.  Are
you sure?  Send full oom-killer output, please.


> - One for tuning.
>   Sometimes our customers ask us to limit the size of the page cache.
>   
>   Many customers' memory usage reaches 99.x%. (This is a very common situation.)
>   If almost all memory is used by page cache, we can assume that we can
>   free it.
>   But the customer cannot estimate how much page cache can be freed
>   (without performance regression).
>   
>   When a customer wants to add a new application, he tunes the system.
>   But memory usage is always 99%.
>   A page-cache limit is useful when the customer tunes his system and works
>   out how much memory is data and how much is page cache.
>   (Of course, we can use some other, more complicated resource management
>   system for this.)
>   This will allow the users to decide whether they need extra memory or not.
> 
>   And...some customers want to keep as much memory free as possible.
>   99% memory usage makes them feel insecure ;)
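
As a rough sketch of how one might see how much of that "99% used" is page cache, the snippet below parses `/proc/meminfo`-style text. The `SAMPLE` values are invented for illustration, and the helper names are hypothetical. Note that `Cached` only reports how much memory is page cache; it does not say how much could be freed without a performance regression, which is the question raised above.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style 'Key:   value kB' lines into {key: kB}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def cache_share(fields: dict) -> float:
    """Fraction of MemTotal currently reported as Cached."""
    return fields["Cached"] / fields["MemTotal"]


# Invented numbers, only to exercise the parser.
SAMPLE = """\
MemTotal:      8000000 kB
MemFree:         80000 kB
Buffers:        120000 kB
Cached:        6000000 kB
"""

if __name__ == "__main__":
    # On a Linux machine, read the real file instead of SAMPLE:
    #   fields = parse_meminfo(open("/proc/meminfo").read())
    fields = parse_meminfo(SAMPLE)
    print(f"free:  {fields['MemFree'] / fields['MemTotal']:.1%}")
    print(f"cache: {cache_share(fields):.1%}")
```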

Tell them to do "echo 3 > /proc/sys/vm/drop_caches", then wait three minutes?
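
A minimal sketch of the same step done from a script, assuming a kernel that provides `/proc/sys/vm/drop_caches` and a caller running as root. The function name and the `path` parameter are illustrative; the parameter exists so the function can be tried against an ordinary file.

```python
import os


def drop_caches(mode: int = 3, path: str = "/proc/sys/vm/drop_caches") -> None:
    """Write 1 (pagecache), 2 (dentries and inodes) or 3 (both) to the file."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # dirty pages are not dropped, so flush them to disk first
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

This is a one-shot operation: it does not limit the cache, and the cache refills as soon as files are read again.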


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Vaidyanathan Srinivasan


Al Boldi wrote:
> Vaidyanathan Srinivasan wrote:
>> Al Boldi wrote:
>>> Rik van Riel wrote:
>>>> Christoph Lameter wrote:
>>>>> This is a patch using some of Aubrey's work plugging it in what is
>>>>> IMHO the right way. Feel free to improve on it. I have gotten
>>>>> repeated requests to be able to limit the pagecache.
>>>> IMHO it's a bad hack.
>>>>
>>>> It would be better to identify the problem this "feature" is
>>>> trying to fix, and then fix the root cause.
>>> Ok, here is the problem:  kswapd.
>>>
>>> Limiting the page-cache memory inhibits invoking kswapd needlessly,
>>> aiding performance and easing OOM pressures.
>> Apart from kswapd, limiting pagecache helps performance of
>> applications by not eating away their ANON pages or other parts of its
>> resident data set.  When there is enough free memory, then there is no
>> performance issue.  However memory is always utilized to the max.
>> Hence every pagecache page that is allocated should come from some
>> application's RSS, or from cold pagecache page.  If that page was
>> stolen from some application, then that application pays the price for
>> swapping or reading the page back to memory.  This scenario is what we
>>  want to avoid.  All that we are trying to achieve is that pagecache
>> eats a (unmapped) pagecache page and not steal memory from other
>> important application's resident set.
> 
> Agreed 100%.  Thanks for expanding exactly what I meant.
> 
>> Certainly this should be a configurable option and kernel's behavior
>> should not be changed in general.
>>
>>> I tried the patch; it works.
>>>
>> :)
>> :
>>> But it needs a bit of debugging.  Setting pagecache_ratio = 1 either
>>> deadlocks or reduces thru-put to < 1mb/s.
>> Yes, going below 5% on my 1GB RAM machine causes severe performance
>> problems.  We need to hard wire a reasonable lower limit and not
>> provide a noose for the end user to tie around!
> 
> One reason to test full range settings, is to expose underlying system 
> problems, like scalability.  By limiting the range, you only hide a problem 
> that was exposed.

Agreed.  This is a good point.

> 
> Thanks!
> 
> --
> Al
> 


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Balbir Singh

Rik van Riel wrote:
> Vaidyanathan Srinivasan wrote:
>> Rik van Riel wrote:
>>> There are a few databases out there that mmap the whole
>>> thing.  Sleepycat for one...
>>
>> That is why my suggestion would be not to touch mmapped pagecache
>> pages in the current pagecache limit code.  The limit should concern
>> only unmapped pagecache pages.
>
> So you want to limit how much data the kernel caches for mysql
> or postgresql, but not limit how much of the rpm database is
> cached ?!
>
> IMHO your proposal does the exact opposite of what would be
> right for my systems :)

One scenario I can think of is

A group of I/O intensive tasks can cause readahead and
dirty page I/O and make good forward progress, but
they'll hit another group of processes by swapping
their pages out. How do we make fair forward progress?
The system administrator can currently control the
amount of swappiness by setting it, but swappiness is
a reclaim time control parameter.

We can control dirty page I/O by setting vm_dirty_ratio.
Readahead is also tuneable with fadvise(), but not many
applications use fadvise.
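
For reference, a small sketch of reading and setting these knobs from a script. It assumes the usual `/proc/sys/vm` layout, where the dirty ratio appears as `dirty_ratio` and swappiness as `swappiness`; the helper names and the `root` parameter are made up so the code can be tried against a scratch directory.

```python
import os


def read_vm_tunable(name: str, root: str = "/proc/sys/vm") -> int:
    """Read an integer VM tunable such as 'swappiness' or 'dirty_ratio'."""
    with open(os.path.join(root, name)) as f:
        return int(f.read().split()[0])


def write_vm_tunable(name: str, value: int, root: str = "/proc/sys/vm") -> None:
    """Set a VM tunable (needs root when pointed at the real /proc/sys/vm)."""
    with open(os.path.join(root, name), "w") as f:
        f.write(f"{value}\n")


if __name__ == "__main__":
    for knob in ("swappiness", "dirty_ratio"):
        try:
            print(knob, read_vm_tunable(knob))
        except OSError as exc:
            print(knob, "not readable here:", exc)
```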

The question now is: is it easier for the system administrator
to say, limit my page cache usage to, say, 30% of total memory available,
so that other allocations do not have to wait on disk I/O or page
reclaim (consider slab allocations, other kernel data structures)?

A low priority task might run infrequently and end up spending all
its time either swapping in pages or reclaiming memory and by
the time it runs again, it ends up doing the same thing.

I understand the swap token mitigates this problem to some extent,
but limiting the page cache will give the system administrator
control over system memory behaviour.

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Rik van Riel

Vaidyanathan Srinivasan wrote:
> Rik van Riel wrote:
>> There are a few databases out there that mmap the whole
>> thing.  Sleepycat for one...
>
> That is why my suggestion would be not to touch mmapped pagecache
> pages in the current pagecache limit code.  The limit should concern
> only unmapped pagecache pages.

So you want to limit how much data the kernel caches for mysql
or postgresql, but not limit how much of the rpm database is
cached ?!

IMHO your proposal does the exact opposite of what would be
right for my systems :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Bodo Eggert
Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
>> On 1/24/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:

>> > He wants to make a nommu system act like a mmu system; this will just
>> > never ever work.
>> 
>> Nope. Actually my nommu system works great with some of the patches made by us.
>> What makes you think this will never work?
> 
> Because there are perfectly valid things user-space can do to mess you
> up. I forgot the test-case but it had something to do with opening a
> million files, this will scatter slab pages all over the place.

a) Limit the number of open files.
b) Don't do that then.

> Also, if you cycle your large user-space allocations a bit unluckily
> you'll also fragment it into oblivion.
> 
> So you can not guarantee it will not fragment into smithereens, stopping
> your user-space from using larger than page size allocations.

Therefore you should purposely increase the mess up to the point where the
system is guaranteed not to work? IMO you should rather put the other issues
onto the TODO list.

BTW: I'm not sure a hard limit is the right thing to do for mmu systems,
I'd rather implement high and low watermarks; if one pool is larger than
its high watermark, it will be next to get its pages evicted, and it won't
lose pages if it's at the lower watermark.
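
The watermark idea can be sketched as a victim-selection rule. This is a toy model, not a proposal for actual reclaim code: the `Pool` type, the pool names, and the tie-breaking (largest overshoot first, then largest headroom above the low mark) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pool:
    name: str
    pages: int
    low: int   # at or below this, the pool gives up no pages
    high: int  # above this, the pool is first in line for eviction


def pick_victim(pools: List[Pool]) -> Optional[Pool]:
    """Choose the pool to evict from, or None if every pool is protected."""
    over = [p for p in pools if p.pages > p.high]
    if over:
        # Evict from the pool that overshoots its high watermark the most.
        return max(over, key=lambda p: p.pages - p.high)
    eligible = [p for p in pools if p.pages > p.low]
    if eligible:
        # Nobody is over its high mark: take from the pool with most headroom.
        return max(eligible, key=lambda p: p.pages - p.low)
    return None
```

Unlike a hard limit, a pool between its two watermarks can still grow when nobody else needs the memory.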

> If your user-space consists of several applications that do dynamic
> memory allocation of various sizes its a matter of (run-) time before
> things will start failing.
> 
> If you prealloc a large area at boot time (like we now do for hugepages)
> and use that for user-space, you might 'reset' the status quo by cycling
> the whole of userspace.

Preallocating the page cache (and maybe the slab space?) may very well be
the right thing to do for nommu systems. It worked quite well in DOS times
and on old MACs.
-- 
Funny quotes:
30. Why is a person who plays the piano called a pianist but a person who
drives a race car not called a racist?


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Al Boldi
Vaidyanathan Srinivasan wrote:
> Al Boldi wrote:
> > Rik van Riel wrote:
> >> Christoph Lameter wrote:
> >>> This is a patch using some of Aubrey's work plugging it in what is
> >>> IMHO the right way. Feel free to improve on it. I have gotten
> >>> repeatedly requests to be able to limit the pagecache.
> >>
> >> IMHO it's a bad hack.
> >>
> >> It would be better to identify the problem this "feature" is
> >> trying to fix, and then fix the root cause.
> >
> > Ok, here is the problem:  kswapd.
> >
> > Limiting the page-cache memory inhibits invoking kswapd needlessly,
> > aiding performance and easing OOM pressures.
>
> Apart from kswapd, limiting pagecache helps performance of
> applications by not eating away their ANON pages or other parts of its
> resident data set.  When there is enough free memory, then there is no
> performance issue.  However memory is always utilized to the max.
> Hence every pagecache page that is allocated should come from some
> application's RSS, or from cold pagecache page.  If that page was
> stolen from some application, then that application pays the price for
> swapping or reading the page back to memory.  This scenario is what we
>  want to avoid.  All that we are trying to achieve is that pagecache
> eats a (unmapped) pagecache page and not steal memory from other
> important application's resident set.

Agreed 100%.  Thanks for expanding exactly what I meant.

> Certainly this should be a configurable option and kernel's behavior
> should not be changed in general.
>
> > I tried the patch; it works.
> >
> :)
> :
> > But it needs a bit of debugging.  Setting pagecache_ratio = 1 either
> > deadlocks or reduces thru-put to < 1mb/s.
>
> Yes, going below 5% on my 1GB RAM machine causes severe performance
> problems.  We need to hard wire a reasonable lower limit and not
> provide a noose for the end user to tie around!

One reason to test full range settings, is to expose underlying system 
problems, like scalability.  By limiting the range, you only hide a problem 
that was exposed.


Thanks!

--
Al



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Al Boldi
Peter Zijlstra wrote:
> > Apart from kswapd, limiting pagecache helps performance of
> > applications by not eating away their ANON pages or other parts of its
> > resident data set.  When there is enough free memory, then there is no
> > performance issue.  However memory is always utilized to the max.
> > Hence every pagecache page that is allocated should come from some
> > application's RSS, or from cold pagecache page.  If that page was
> > stolen from some application, then that application pays the price for
> > swapping or reading the page back to memory.  This scenario is what we
> >  want to avoid.  All that we are trying to achieve is that pagecache
> > eats a (unmapped) pagecache page and not steal memory from other
> > important application's resident set.
> >
> > Certainly this should be a configurable option and kernel's behavior
> > should not be changed in general.
>
> Ah, this would be a clear case of the page reclaim selecting the wrong
> working set.

Yes.

> It is perfectly fine for a page cache page to evict a app page (be it
> anon or not) if that page cache page is used more frequently than the
> app page in question.

It seems that there is currently a clear preference for pagecache pages over
app pages.  Some form of priority selection could probably aid the situation.

> Trouble seems to be that the current algorithm gets it quite wrong at
> times.

It breaks down when memory gets tight.  You can actually hear it thrashing 
the disk, although it's not supposed to thrash, even with swapoff.

> Also stating that free memory somehow is good for you is weird, free
> memory is a loss, you under utilise your machine. Keeping clean
> pagecache pages in there that are likely to be referenced again is a
> clear win; it saves the tediously slow load from disk.

That's the theory.

> So you're now proposing to limit the page cache

As a workaround.

> where as its clear that
> the better solution would be to tune replacement policy

Yes.  Hopefully successfully.

> (and or provide
> hints to said mechanism using madvise/fadvise)

Not feasible; source is sometimes not immediately available.


Thanks!

--
Al



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Vaidyanathan Srinivasan


Peter Zijlstra wrote:
>> Apart from kswapd, limiting pagecache helps performance of
>> applications by not eating away their ANON pages or other parts of its
>> resident data set.  When there is enough free memory, then there is no
>> performance issue.  However memory is always utilized to the max.
>> Hence every pagecache page that is allocated should come from some
>> application's RSS, or from cold pagecache page.  If that page was
>> stolen from some application, then that application pays the price for
>> swapping or reading the page back to memory.  This scenario is what we
>>  want to avoid.  All that we are trying to achieve is that pagecache
>> eats a (unmapped) pagecache page and not steal memory from other
>> important application's resident set.
>>
>> Certainly this should be a configurable option and kernel's behavior
>> should not be changed in general.
> 
> Ah, this would be a clear case of the page reclaim selecting the wrong
> working set.
> 
> It is perfectly fine for a page cache page to evict a app page (be it
> anon or not) if that page cache page is used more frequently than the
> app page in question.

Well, this is true only as long as all applications running in the
system are graded equally and it is kernel's job to provide the best
of the system resources to all applications.

> Trouble seems to be that the current algorithm gets it quite wrong at
> times.

The current reclaim code does a good job based on the assumption that
pages belonging to different applications have equal priority.  The
aging of the page is independent of application's priority or class.
This is good for best overall system performance.

The new use case that is challenging this assumption is the fact that
application groups fall into different classes on the same system and
there is a need to make a certain class perform better at the cost of
certain other classes of applications.  In this scenario system
performance is not judged by overall average throughput, but by the
performance of a certain class of applications only.

A backup job running in the database server can take any amount of
performance hit to marginally improve database performance since that
is what the users care about.  We would run into similar situations
when running various virtualization and consolidation solutions.

> Also stating that free memory somehow is good for you is weird, free
> memory is a loss, you under utilise your machine. Keeping clean
> pagecache pages in there that are likely to be referenced again is a
> clear win; it saves the tediously slow load from disk.

Agreed

> 
> So you're now proposing to limit the page cache were as its clear that
> the better solution would be to tune replacement policy (and or provide
> hints to said mechanism using madvise/fadvise)

Well, we may need to use both approaches.  Hints with
madvise/fadvise are definitely a good approach and the kernel should
take these hints aggressively. Yet even with these hints we may want
to have limits in the interest of other applications that do not use
pagecache.

A system-wide limit on pagecache may not sound very interesting, but if
we think about 'containers' and groups of processes having such limits it
will have more practical use cases.  An aggregation of processes having
a limit on pagecache would give relative importance to a certain class of
pages during page replacement.  Controlling limits among groups of
applications will help achieve peak application performance for the
applications that we care about.

--Vaidy



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Peter Zijlstra

> Apart from kswapd, limiting pagecache helps performance of
> applications by not eating away their ANON pages or other parts of its
> resident data set.  When there is enough free memory, then there is no
> performance issue.  However memory is always utilized to the max.
> Hence every pagecache page that is allocated should come from some
> application's RSS, or from cold pagecache page.  If that page was
> stolen from some application, then that application pays the price for
> swapping or reading the page back to memory.  This scenario is what we
>  want to avoid.  All that we are trying to achieve is that pagecache
> eats a (unmapped) pagecache page and not steal memory from other
> important application's resident set.
> 
> Certainly this should be a configurable option and kernel's behavior
> should not be changed in general.

Ah, this would be a clear case of the page reclaim selecting the wrong
working set.

It is perfectly fine for a page cache page to evict a app page (be it
anon or not) if that page cache page is used more frequently than the
app page in question.

Trouble seems to be that the current algorithm gets it quite wrong at
times.

Also stating that free memory somehow is good for you is weird, free
memory is a loss, you under utilise your machine. Keeping clean
pagecache pages in there that are likely to be referenced again is a
clear win; it saves the tediously slow load from disk.

So you're now proposing to limit the page cache, whereas it's clear that
the better solution would be to tune the replacement policy (and/or provide
hints to said mechanism using madvise/fadvise).
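
As an illustration of the madvise side of such hints, here is a sketch of an application scanning a mapped file once and telling the kernel so. It assumes Python 3.8 or later on a platform that exposes `mmap.madvise` and the `MADV_*` constants (the code skips the hints otherwise); the function name is made up. How much the kernel acts on the hints is up to the kernel.

```python
import mmap

CHUNK = 1 << 20  # process 1 MiB of the mapping at a time


def scan_mapped_file(path: str) -> int:
    """Sum the bytes of a non-empty file through mmap, hinting the access."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        can_advise = hasattr(m, "madvise")
        if can_advise and hasattr(mmap, "MADV_SEQUENTIAL"):
            m.madvise(mmap.MADV_SEQUENTIAL)  # expect one front-to-back pass
        total = 0
        for off in range(0, len(m), CHUNK):
            total += sum(m[off:off + CHUNK])  # slicing yields bytes
        if can_advise and hasattr(mmap, "MADV_DONTNEED"):
            m.madvise(mmap.MADV_DONTNEED)  # this mapping is done with the pages
        return total
```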



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Vaidyanathan Srinivasan


Al Boldi wrote:
> Rik van Riel wrote:
>> Christoph Lameter wrote:
>>> This is a patch using some of Aubrey's work plugging it in what is IMHO
>>> the right way. Feel free to improve on it. I have gotten repeatedly
>>> requests to be able to limit the pagecache.
>> IMHO it's a bad hack.
>>
>> It would be better to identify the problem this "feature" is
>> trying to fix, and then fix the root cause.
> 
> Ok, here is the problem:  kswapd.
> 
> Limiting the page-cache memory inhibits invoking kswapd needlessly, aiding 
> performance and easing OOM pressures.

Apart from kswapd, limiting pagecache helps performance of
applications by not eating away their ANON pages or other parts of their
resident data set.  When there is enough free memory, then there is no
performance issue.  However memory is always utilized to the max.
Hence every pagecache page that is allocated has to come from some
application's RSS, or from a cold pagecache page.  If that page was
stolen from some application, then that application pays the price for
swapping or reading the page back to memory.  This scenario is what we
want to avoid.  All that we are trying to achieve is that pagecache
eats an (unmapped) pagecache page and does not steal memory from another
important application's resident set.

Certainly this should be a configurable option and kernel's behavior
should not be changed in general.

> I tried the patch; it works.

:)

> But it needs a bit of debugging.  Setting pagecache_ratio = 1 either 
> deadlocks or reduces thru-put to < 1mb/s.

Yes, going below 5% on my 1GB RAM machine causes severe performance
problems.  We need to hard wire a reasonable lower limit and not
provide a noose for the end user to tie around!
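
To put numbers on this, a sketch of turning a percentage into a page count with a hard-wired floor. The 4 KiB page size, the 5% floor (taken from the remark above) and the function name are assumptions for illustration; this is not how the patch computes its limit.

```python
PAGE_SIZE = 4096          # assumed 4 KiB pages
MIN_PAGECACHE_RATIO = 5   # hypothetical floor, from the 5% remark above


def pagecache_limit_pages(total_bytes: int, ratio: int,
                          floor: int = MIN_PAGECACHE_RATIO) -> int:
    """Translate a percentage into a page count, never going below floor."""
    effective = max(floor, min(100, ratio))
    return (total_bytes // PAGE_SIZE) * effective // 100


if __name__ == "__main__":
    one_gib = 1 << 30
    for ratio in (1, 5, 30):
        clamped = pagecache_limit_pages(one_gib, ratio)
        raw = pagecache_limit_pages(one_gib, ratio, floor=0)
        print(f"ratio={ratio}%: raw {raw} pages, clamped {clamped} pages")
```

On a 1GB machine, 1% works out to roughly 10 MB of page cache, which gives a feel for why throughput collapses at that setting. Passing `floor=0` keeps the full range available for testing.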

--Vaidy

> 
> Thanks!
> 
> --
> Al
> 


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Vaidyanathan Srinivasan


Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Erik Andersen wrote:
> 
>> It would be far more useful if an application could hint to the
>> pagecache as to which files are and which files are not worth
>> caching, especially when the application knows a priori that data
>> from a particular file will or will not ever be reused.
> 
> It can give such hints via madvise(2).

I think you meant fadvise.  That is certainly a possibility which we
need to work on.  The current implementation of fadvise only throttles
readahead in case of sequential access and flushes the file in case
of DONTNEED.  We leave it at default for NOREUSE.

In case of DONTNEED and NOREUSE, we need to limit the pages used for
page cache and also reclaim them as soon as possible.  The interaction of
mmap() and fadvise is a little more difficult to handle.
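
For readers who have not used the interface, a minimal sketch of an application giving such hints for a file it reads exactly once (a backup job, say). It assumes a platform where Python exposes `os.posix_fadvise`; the function name and chunk size are illustrative, and what the kernel does with each hint depends on the kernel version.

```python
import os

CHUNK = 1 << 20  # read 1 MiB at a time


def read_once(path: str) -> int:
    """Stream a file, then tell the kernel its cached pages are not needed."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            buf = os.read(fd, CHUNK)
            if not buf:
                break
            total += len(buf)
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```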

--Vaidy



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Vaidyanathan Srinivasan


Al Boldi wrote:
> Rik van Riel wrote:
>> Christoph Lameter wrote:
>>> This is a patch using some of Aubrey's work plugging it in what is IMHO
>>> the right way. Feel free to improve on it. I have gotten repeatedly
>>> requests to be able to limit the pagecache.
>> IMHO it's a bad hack.
>>
>> It would be better to identify the problem this "feature" is
>> trying to fix, and then fix the root cause.
>
> Ok, here is the problem:  kswapd.
>
> Limiting the page-cache memory inhibits invoking kswapd needlessly, aiding
> performance and easing OOM pressures.

Apart from kswapd, limiting pagecache helps application performance by
not eating away their ANON pages or other parts of their resident data
set.  When there is enough free memory, there is no performance issue.
However, memory is always utilized to the max.  Hence every newly
allocated pagecache page has to come either from some application's
RSS or from a cold pagecache page.  If that page was stolen from an
application, then that application pays the price of swapping or
reading the page back into memory.  This is the scenario we want to
avoid.  All that we are trying to achieve is that a new pagecache page
displaces an (unmapped) pagecache page instead of stealing memory from
the resident set of other, more important applications.

Certainly this should be a configurable option, and the kernel's
behavior should not be changed in general.

> I tried the patch; it works.

:)

> But it needs a bit of debugging.  Setting pagecache_ratio = 1 either
> deadlocks or reduces thru-put to < 1mb/s.

Yes, going below 5% on my 1GB RAM machine causes severe performance
problems.  We need to hard-wire a reasonable lower limit and not
provide a noose for end users to tie around their own necks!
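
As a rough illustration of what such a floor could look like, here is a
small sketch.  The constant values and the function name are hypothetical;
the 5% floor is simply the figure from the observation above, not a tested
recommendation.

```python
# Hypothetical bounds, in percent of total RAM.
MIN_PAGECACHE_RATIO = 5
MAX_PAGECACHE_RATIO = 100


def clamp_pagecache_ratio(requested):
    """Return the requested ratio forced into the supported range."""
    return max(MIN_PAGECACHE_RATIO, min(MAX_PAGECACHE_RATIO, int(requested)))
```

With this, a request for pagecache_ratio = 1 would silently become 5
instead of driving the machine into the deadlock described above.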

--Vaidy

 
> Thanks!
>
> --
> Al
 


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Peter Zijlstra

> Apart from kswapd, limiting pagecache helps performance of
> applications by not eating away their ANON pages or other parts of its
> resident data set.  When there is enough free memory, then there is no
> performance issue.  However memory is always utilized to the max.
> Hence every pagecache page that is allocated should come from some
> application's RSS, or from cold pagecache page.  If that page was
> stolen from some application, then that application pays the price for
> swapping or reading the page back to memory.  This scenario is what we
> want to avoid.  All that we are trying to achieve is that pagecache
> eats a (unmapped) pagecache page and not steal memory from other
> important application's resident set.
>
> Certainly this should be a configurable option and kernel's behavior
> should not be changed in general.

Ah, this would be a clear case of the page reclaim selecting the wrong
working set.

It is perfectly fine for a page cache page to evict an app page (be it
anon or not) if that page cache page is used more frequently than the
app page in question.

Trouble seems to be that the current algorithm gets it quite wrong at
times.

Also, stating that free memory is somehow good for you is weird: free
memory is a loss, because you underutilise your machine. Keeping clean
pagecache pages in there that are likely to be referenced again is a
clear win; it saves the tediously slow load from disk.

So you're now proposing to limit the page cache, whereas it's clear that
the better solution would be to tune the replacement policy (and/or
provide hints to said mechanism using madvise/fadvise).
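
To make the madvise side of that concrete, here is a small sketch of an
application hinting on its own file mapping.  It assumes Python 3.8 or
later on Linux (for mmap.madvise and the MADV_* constants); the function
name and the CRC workload are made up for illustration.

```python
import mmap
import zlib


def crc_of_mapped_file(path):
    """Checksum a non-empty file through mmap, hinting the access pattern."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # We will walk the mapping front to back exactly once.
            m.madvise(mmap.MADV_SEQUENTIAL)
            crc = zlib.crc32(m)
            # We are done with these pages; they may be dropped.
            m.madvise(mmap.MADV_DONTNEED)
    return crc
```

The hints are advisory only: the replacement policy remains free to
ignore them, which is the point of tuning the policy rather than imposing
a hard limit.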



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Vaidyanathan Srinivasan


Peter Zijlstra wrote:
>> Apart from kswapd, limiting pagecache helps performance of
>> applications by not eating away their ANON pages or other parts of its
>> resident data set.  When there is enough free memory, then there is no
>> performance issue.  However memory is always utilized to the max.
>> Hence every pagecache page that is allocated should come from some
>> application's RSS, or from cold pagecache page.  If that page was
>> stolen from some application, then that application pays the price for
>> swapping or reading the page back to memory.  This scenario is what we
>> want to avoid.  All that we are trying to achieve is that pagecache
>> eats a (unmapped) pagecache page and not steal memory from other
>> important application's resident set.
>>
>> Certainly this should be a configurable option and kernel's behavior
>> should not be changed in general.
>
> Ah, this would be a clear case of the page reclaim selecting the wrong
> working set.
>
> It is perfectly fine for a page cache page to evict a app page (be it
> anon or not) if that page cache page is used more frequently than the
> app page in question.

Well, this is true only as long as all applications running in the
system are graded equally and it is the kernel's job to provide the
best of the system resources to all applications.

> Trouble seems to be that the current algorithm gets it quite wrong at
> times.

The current reclaim code does a good job based on the assumption that
pages belonging to different applications have equal priority.  The
aging of a page is independent of the application's priority or class.
This is good for best overall system performance.

The new use case that challenges this assumption is that application
groups fall into different classes on the same system, and there is a
need to make a certain class perform better at the cost of certain
other classes of applications.  In this scenario system performance is
not judged by overall average throughput, but by the performance of a
certain class of applications only.

A backup job running on the database server can take any amount of
performance hit to marginally improve database performance, since that
is what the users care about.  We would run into similar situations
when running various virtualization and consolidation solutions.

> Also stating that free memory somehow is good for you is weird, free
> memory is a loss, you under utilise your machine. Keeping clean
> pagecache pages in there that are likely to be referenced again is a
> clear win; it saves the tediously slow load from disk.

Agreed

>
> So you're now proposing to limit the page cache were as its clear that
> the better solution would be to tune replacement policy (and or provide
> hints to said mechanism using madvise/fadvise)

Well, we may need to use both approaches.  Hints with
madvise/fadvise are definitely a good approach and the kernel should
take these hints aggressively.  Yet even with these hints we may want
to have limits in the interest of other applications that do not use
pagecache.

A system-wide limit on pagecache may not sound very interesting, but if
we think about 'containers' and groups of processes having such limits,
it will have more practical use cases.  An aggregation of processes
with a limit on pagecache would give relative importance to a certain
class of pages during page replacement.  Controlling limits among groups
of applications will help achieve peak performance for the
applications that we care about.
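
A toy user-space model may make the per-group idea easier to picture.
This is only a sketch under simplifying assumptions: the class name, the
group names and the plain LRU policy are all hypothetical, and it says
nothing about how the kernel would actually account pages to groups.

```python
from collections import OrderedDict


class GroupedPageCache:
    """Toy page cache where each group of processes has its own page limit."""

    def __init__(self, limits):
        self.limits = dict(limits)                       # group -> max pages
        self.pages = {g: OrderedDict() for g in limits}  # group -> LRU order

    def touch(self, group, page):
        """Reference a page; return the page evicted from this group, if any."""
        lru = self.pages[group]
        if page in lru:
            lru.move_to_end(page)  # page becomes most recently used
            return None
        lru[page] = True
        if len(lru) > self.limits[group]:
            # Over the limit: the group evicts its own coldest page and
            # never steals from another group.
            victim, _ = lru.popitem(last=False)
            return victim
        return None
```

In this model a "backup" group with a small limit recycles its own pages
while the "database" group's pages stay untouched, which is the relative
importance described above.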

--Vaidy



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Al Boldi
Peter Zijlstra wrote:
> > Apart from kswapd, limiting pagecache helps performance of
> > applications by not eating away their ANON pages or other parts of its
> > resident data set.  When there is enough free memory, then there is no
> > performance issue.  However memory is always utilized to the max.
> > Hence every pagecache page that is allocated should come from some
> > application's RSS, or from cold pagecache page.  If that page was
> > stolen from some application, then that application pays the price for
> > swapping or reading the page back to memory.  This scenario is what we
> > want to avoid.  All that we are trying to achieve is that pagecache
> > eats a (unmapped) pagecache page and not steal memory from other
> > important application's resident set.
> >
> > Certainly this should be a configurable option and kernel's behavior
> > should not be changed in general.
>
> Ah, this would be a clear case of the page reclaim selecting the wrong
> working set.

Yes.

> It is perfectly fine for a page cache page to evict a app page (be it
> anon or not) if that page cache page is used more frequently than the
> app page in question.

It seems that there is currently a clear preference for pagecache pages
over app pages.  Some form of priority selection could probably help.

> Trouble seems to be that the current algorithm gets it quite wrong at
> times.

It breaks down when memory gets tight.  You can actually hear it thrashing
the disk, although it's not supposed to thrash, even with swapoff.

> Also stating that free memory somehow is good for you is weird, free
> memory is a loss, you under utilise your machine. Keeping clean
> pagecache pages in there that are likely to be referenced again is a
> clear win; it saves the tediously slow load from disk.

That's the theory.

> So you're now proposing to limit the page cache

As a workaround.

> where as its clear that
> the better solution would be to tune replacement policy

Yes.  Hopefully successfully.

> (and or provide
> hints to said mechanism using madvise/fadvise)

Not feasible; source is sometimes not immediately available.


Thanks!

--
Al



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Al Boldi
Vaidyanathan Srinivasan wrote:
> Al Boldi wrote:
> > Rik van Riel wrote:
> > > Christoph Lameter wrote:
> > > > This is a patch using some of Aubrey's work plugging it in what is
> > > > IMHO the right way. Feel free to improve on it. I have gotten
> > > > repeatedly requests to be able to limit the pagecache.
> > >
> > > IMHO it's a bad hack.
> > >
> > > It would be better to identify the problem this "feature" is
> > > trying to fix, and then fix the root cause.
> >
> > Ok, here is the problem:  kswapd.
> >
> > Limiting the page-cache memory inhibits invoking kswapd needlessly,
> > aiding performance and easing OOM pressures.
>
> Apart from kswapd, limiting pagecache helps performance of
> applications by not eating away their ANON pages or other parts of its
> resident data set.  When there is enough free memory, then there is no
> performance issue.  However memory is always utilized to the max.
> Hence every pagecache page that is allocated should come from some
> application's RSS, or from cold pagecache page.  If that page was
> stolen from some application, then that application pays the price for
> swapping or reading the page back to memory.  This scenario is what we
> want to avoid.  All that we are trying to achieve is that pagecache
> eats a (unmapped) pagecache page and not steal memory from other
> important application's resident set.

Agreed 100%.  Thanks for expanding on exactly what I meant.

> Certainly this should be a configurable option and kernel's behavior
> should not be changed in general.
>
> > I tried the patch; it works.
>
> :)
> :
> > But it needs a bit of debugging.  Setting pagecache_ratio = 1 either
> > deadlocks or reduces thru-put to < 1mb/s.
>
> Yes, going below 5% on my 1GB RAM machine causes severe performance
> problems.  We need to hard wire a reasonable lower limit and not
> provide a noose for the end user to tie around!

One reason to test the full range of settings is to expose underlying
system problems, such as scalability.  Limiting the range only hides a
problem that was exposed.


Thanks!

--
Al



Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Bodo Eggert
Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
>> On 1/24/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:

>>> He wants to make a nommu system act like a mmu system; this will just
>>> never ever work.
>>
>> Nope. Actually my nommu system works great with some of patches made by us.
>> What let you think this will never work?
>
> Because there are perfectly valid things user-space can do to mess you
> up. I forgot the test-case but it had something to do with opening a
> million files, this will scatter slab pages all over the place.

a) Limit the number of open files.
b) Don't do that then.

> Also, if you cycle your large user-space allocations a bit unluckily
> you'll also fragment it into oblivion.
>
> So you can not guarantee it will not fragment into smithereens stopping
> your user-space from using larger than page size allocations.

Therefore you should purposely increase the mess up to the point where the
system is guaranteed not to work? IMO you should rather put the other issues
onto the TODO list.

BTW: I'm not sure a hard limit is the right thing to do for mmu systems,
I'd rather implement high and low watermarks; if one pool is larger than
its high watermark, it will be the next to get its pages evicted, and it
won't lose pages if it's at its low watermark.
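
A minimal sketch of how such a watermark rule could pick the pool to
evict from.  The function name, the tuple layout and the tie-breaking
are assumptions made for illustration only; "furthest over the
watermark goes first" is one possible reading of the rule.

```python
def pick_victim_pool(pools):
    """Choose the pool to evict from.

    pools maps a pool name to (size, low_watermark, high_watermark),
    all in pages.  Returns a pool name, or None if every pool is at or
    below its low watermark and therefore protected.
    """
    # Pools above their high watermark are evicted from first.
    over_high = [(size - high, name)
                 for name, (size, low, high) in pools.items() if size > high]
    if over_high:
        return max(over_high)[1]
    # Otherwise take from whichever pool is furthest above its low mark.
    above_low = [(size - low, name)
                 for name, (size, low, high) in pools.items() if size > low]
    if above_low:
        return max(above_low)[1]
    return None
```

Unlike a hard limit, a pool may grow past its high watermark while
memory is free; the watermarks only decide who pays when memory gets
tight.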

> If your user-space consists of several applications that do dynamic
> memory allocation of various sizes it's a matter of (run-) time before
> things will start failing.
>
> If you prealloc a large area at boot time (like we now do for hugepages)
> and use that for user-space, you might 'reset' the status quo by cycling
> the whole of userspace.

Preallocating the page cache (and maybe the slab space?) may very well be
the right thing to do for nommu systems. It worked quite well in DOS times
and on old Macs.
-- 
Funny quotes:
30. Why is a person who plays the piano called a pianist but a person who
drives a race car not called a racist?


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Rik van Riel

Vaidyanathan Srinivasan wrote:

> Rik van Riel wrote:
>
>> There are a few databases out there that mmap the whole
>> thing.  Sleepycat for one...
>
> That is why my suggestion would be not to touch mmapped pagecache
> pages in the current pagecache limit code.  The limit should concern
> only unmapped pagecache pages.


So you want to limit how much data the kernel caches for mysql
or postgresql, but not limit how much of the rpm database is
cached ?!

IMHO your proposal does the exact opposite of what would be
right for my systems :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Balbir Singh

Rik van Riel wrote:

> Vaidyanathan Srinivasan wrote:
>
>> Rik van Riel wrote:
>>
>>> There are a few databases out there that mmap the whole
>>> thing.  Sleepycat for one...
>>
>> That is why my suggestion would be not to touch mmapped pagecache
>> pages in the current pagecache limit code.  The limit should concern
>> only unmapped pagecache pages.
>
> So you want to limit how much data the kernel caches for mysql
> or postgresql, but not limit how much of the rpm database is
> cached ?!
>
> IMHO your proposal does the exact opposite of what would be
> right for my systems :)



Jumping late into the discussion.

One scenario I can think of is:

A group of I/O-intensive tasks can cause readahead and
dirty page I/O and make good forward progress, but
they'll hurt another group of processes by swapping
their pages out. How do we make fair forward progress?
The system administrator can currently control the
amount of swapping by setting swappiness, but swappiness
is a reclaim-time control parameter.

We can control dirty page I/O by setting vm_dirty_ratio.
Readahead is also tuneable with fadvise(), but not many
applications use fadvise.
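
For anyone who wants to check what these knobs are currently set to,
here is a small sketch that reads them from /proc/sys/vm (swappiness and
dirty_ratio are the sysctl file names).  The helper name and the root
parameter are illustrative assumptions; the parameter exists only so the
function can be pointed at a different directory.

```python
import os


def read_vm_tunable(name, root="/proc/sys/vm"):
    """Return the integer value of a VM sysctl such as 'swappiness'."""
    with open(os.path.join(root, name)) as f:
        return int(f.read().split()[0])
```

On a Linux box, read_vm_tunable("swappiness") and
read_vm_tunable("dirty_ratio") return the current settings; writing them
needs root and is deliberately left out of the sketch.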

The question now is: is it easier for the system administrator
to say, limit my page cache usage to, say, 30% of total memory available,
so that other allocations do not have to wait on disk I/O or page
reclaim (consider slab allocations, other kernel data structures)?
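
To put a number on "30% of total memory", here is a small sketch that
computes the current page cache share from the text of /proc/meminfo.
The function names and the 30% default are illustrative assumptions, and
the "Cached" field is only an approximation of what a limit would count
(it includes mapped file pages, for instance).

```python
def pagecache_percent(meminfo_text):
    """Return Cached as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are in kB
    return 100.0 * fields["Cached"] / fields["MemTotal"]


def over_limit(meminfo_text, limit_percent=30):
    """True if the page cache share exceeds the administrator's limit."""
    return pagecache_percent(meminfo_text) > limit_percent
```

Typical use would be over_limit(open("/proc/meminfo").read()).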

A low-priority task might run infrequently and end up spending all
its time either swapping in pages or reclaiming memory, and by
the time it runs again, it ends up doing the same thing.

I understand the swap token mitigates this problem to some extent,
but limiting the page cache will give the system administrator
control over system memory behaviour.

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [RFC] Limit the size of the pagecache

2007-01-25 Thread Vaidyanathan Srinivasan


Al Boldi wrote:
> Vaidyanathan Srinivasan wrote:
>> Al Boldi wrote:
>>> Rik van Riel wrote:
>>>> Christoph Lameter wrote:
>>>>> This is a patch using some of Aubrey's work plugging it in what is
>>>>> IMHO the right way. Feel free to improve on it. I have gotten
>>>>> repeatedly requests to be able to limit the pagecache.
>>>> IMHO it's a bad hack.
>>>>
>>>> It would be better to identify the problem this "feature" is
>>>> trying to fix, and then fix the root cause.
>>> Ok, here is the problem:  kswapd.
>>>
>>> Limiting the page-cache memory inhibits invoking kswapd needlessly,
>>> aiding performance and easing OOM pressures.
>> Apart from kswapd, limiting pagecache helps performance of
>> applications by not eating away their ANON pages or other parts of its
>> resident data set.  When there is enough free memory, then there is no
>> performance issue.  However memory is always utilized to the max.
>> Hence every pagecache page that is allocated should come from some
>> application's RSS, or from cold pagecache page.  If that page was
>> stolen from some application, then that application pays the price for
>> swapping or reading the page back to memory.  This scenario is what we
>> want to avoid.  All that we are trying to achieve is that pagecache
>> eats a (unmapped) pagecache page and not steal memory from other
>> important application's resident set.
>
> Agreed 100%.  Thanks for expanding exactly what I meant.
>
>> Certainly this should be a configurable option and kernel's behavior
>> should not be changed in general.
>>
>>> I tried the patch; it works.
>>
>> :)
>> :
>>> But it needs a bit of debugging.  Setting pagecache_ratio = 1 either
>>> deadlocks or reduces thru-put to < 1mb/s.
>> Yes, going below 5% on my 1GB RAM machine causes severe performance
>> problems.  We need to hard wire a reasonable lower limit and not
>> provide a noose for the end user to tie around!
>
> One reason to test full range settings, is to expose underlying system
> problems, like scalability.  By limiting the range, you only hide a problem
> that was exposed.

Agreed.  This is a good point.

 
> Thanks!
>
> --
> Al
 


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Vaidyanathan Srinivasan


Aubrey Li wrote:
> On 1/25/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:
>>
>> Christoph Lameter wrote:
>>> On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
>>>
>>>> With your patch, MMAP of a file that will cross the pagecache limit hangs
>>>> the system.  As I mentioned in my previous mail, without subtracting the
>>>> NR_FILE_MAPPED, the reclaim will infinitely try and fail.
>>> Well mapped pages are still pagecache pages.
>>>
>> Yes, but they can be classified under a process RSS pages.  Whether it
>> is an anon page or shared mem or mmap of pagecache, it would show up
>> under RSS.  Those pages can be limited by RSS limiter similar to the
>> one we are discussing in pagecache limiter.  In my opinion, once a
>> file page is mapped by the process, then it should be treated at par
>> with anon pages.  Application programs generally do not mmap a file
>> page if the reuse for the content is very low.
>>
> 
> I agree, we shouldn't take mmapped page into account.
> But Vaidy - even with your patch, we are still using the existing
> reclaimer, that means we dont ensure that only page cache is
> reclaimed/limited. mapped pages will be hit also.
> I think we still need to add a new scancontrol field to lock mmaped
> pages and remove unmapped pagecache pages only.

I have tried to add scan control to Roy's patch at
http://lkml.org/lkml/2007/01/17/96

In that patch, we search and remove only pages that are not mapped.
We also remove referenced and hot pagecache pages which the normal
reclaimer is not expected to consider.

I will try to fit that logic in Christoph's patch and test.

--Vaidy

> -Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Thu, 25 Jan 2007, Aubrey Li wrote:

> But Vaidy - even with your patch, we are still using the existing
> reclaimer, that means we dont ensure that only page cache is
> reclaimed/limited. mapped pages will be hit also.
> I think we still need to add a new scancontrol field to lock mmaped
> pages and remove unmapped pagecache pages only.

Setting sc->swappiness to zero will make the reclaimer hit 
unmapped pages until we get into problems. Maybe set that to some negative 
value to avoid reclaim_mapped being set to 1 in shrink_active_list?

Oh. But reclaim_mapped is staying at zero anyways if may_swap is off. So 
we are already fine.

I still wonder why you are doing this at all. If you just run your own app 
on the box then preallocate your higher order allocations from user space. 
Much less trouble.




Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Al Boldi
Rik van Riel wrote:
> Christoph Lameter wrote:
> > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > the right way. Feel free to improve on it. I have gotten repeatedly
> > requests to be able to limit the pagecache.
>
> IMHO it's a bad hack.
>
> It would be better to identify the problem this "feature" is
> trying to fix, and then fix the root cause.

Ok, here is the problem:  kswapd.

Limiting the page-cache memory inhibits invoking kswapd needlessly, aiding 
performance and easing OOM pressures.

I tried the patch; it works.

But it needs a bit of debugging.  Setting pagecache_ratio = 1 either 
deadlocks or reduces thru-put to < 1mb/s.


Thanks!

--
Al



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Aubrey Li

On 1/25/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:



> Christoph Lameter wrote:
>> On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
>>
>>> With your patch, MMAP of a file that will cross the pagecache limit hangs the
>>> system.  As I mentioned in my previous mail, without subtracting the
>>> NR_FILE_MAPPED, the reclaim will infinitely try and fail.
>>
>> Well mapped pages are still pagecache pages.
>>
>
> Yes, but they can be classified under a process RSS pages.  Whether it
> is an anon page or shared mem or mmap of pagecache, it would show up
> under RSS.  Those pages can be limited by RSS limiter similar to the
> one we are discussing in pagecache limiter.  In my opinion, once a
> file page is mapped by the process, then it should be treated at par
> with anon pages.  Application programs generally do not mmap a file
> page if the reuse for the content is very low.



I agree, we shouldn't take mmapped pages into account.
But Vaidy - even with your patch, we are still using the existing
reclaimer, which means we don't ensure that only page cache is
reclaimed/limited. Mapped pages will be hit also.
I think we still need to add a new scan_control field to lock mmapped
pages and remove unmapped pagecache pages only.

-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Thu, 25 Jan 2007 00:40:54 -0500
Rik van Riel <[EMAIL PROTECTED]> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 24 Jan 2007 23:28:15 -0500
> > Rik van Riel <[EMAIL PROTECTED]> wrote:
> > 
> >> KAMEZAWA Hiroyuki wrote:
> > I always says Linux is different from mainframes.
> 
> It's not just about Linux.
> 
> Applications behave differently too from the way they were 15
> years ago.
> 
> Some databases, eg. sleepycat's db, map the whole database in
> memory.  Other databases, like MySQL and postgresql, rely on
> the kernel's page cache to cache the most frequently accessed
> data.
> 
> To make matters more interesting, memory sizes have increased
> by a factor 1000, but disk seek times have only gotten 10 times
> faster.  This means that simplistic memory management algorithms
> can hurt performance a lot more than they could back then.
> 
> In short, I am not convinced that any of the simple tunable knobs
> from the "good old days" will do much to actually help people
> with modern workloads on modern computers.
> 
I agree.

My current concern is not adding knobs but how to show and explain
what the users are doing. In most cases, users don't know what they are
doing and believe that the system information can tell them.

For example, a user sometimes asks, "Why are the amounts of pagecache on
system A and system B different? I definitely run the same jobs on both
systems."

...just because he used a different data set ;)

Thanks,
-Kame




Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Vaidyanathan Srinivasan


Rik van Riel wrote:
> Vaidyanathan Srinivasan wrote:
> 
>> In my opinion, once a
>> file page is mapped by the process, then it should be treated at par
>> with anon pages.  Application programs generally do not mmap a file
>> page if the reuse for the content is very low.
> 
> Why not have the VM measure this, instead of making wild
> assumptions about every possible workload out there?

Yes, the VM's page aging and page replacement algorithm should decide on
the relevance of an anon or mmap page.  However, we may still need to
limit the total pages in memory for a given set of processes.

> There are a few databases out there that mmap the whole
> thing.  Sleepycat for one...
> 

That is why my suggestion would be not to touch mmapped pagecache
pages in the current pagecache limit code.  The limit should concern
only unmapped pagecache pages.

When the application unmaps the pages, we would instantly go over the
limit, and the 'now' unmapped pages can be reclaimed.  This behavior has
been verified with my fix on top of Christoph's patch.
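
The accounting behind this can be sketched in a few lines.  This is not
the patch itself, only a user-space illustration: the function names are
made up, the counters are passed in as plain numbers, and pagecache_ratio
is taken as a percentage of total pages as in the patch discussion.

```python
def unmapped_pagecache_pages(nr_file_pages, nr_file_mapped):
    """Pagecache pages that are not mapped into any process."""
    return max(0, nr_file_pages - nr_file_mapped)


def needs_reclaim(nr_file_pages, nr_file_mapped, total_pages, pagecache_ratio):
    """True if unmapped pagecache exceeds pagecache_ratio percent of memory."""
    limit = total_pages * pagecache_ratio // 100
    return unmapped_pagecache_pages(nr_file_pages, nr_file_mapped) > limit
```

On a 1GB machine with 4K pages (262144 pages) and pagecache_ratio = 10,
30000 file pages of which 10000 are mapped stay under the limit; once
the application unmaps them, all 30000 count and reclaim is triggered.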



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

KAMEZAWA Hiroyuki wrote:

> On Wed, 24 Jan 2007 23:28:15 -0500
> Rik van Riel <[EMAIL PROTECTED]> wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>
>>> FYI:
>>> Because some customers are migrated from mainframes, they want to control
>>> almost all features in OS, IOW, designing memory usages.
>>
>> Don't you mean:
>>
>> "Because some customers are migrating from mainframes, they are
>>   used to needing to control all features in OS" ? :)
>
> Ah yes ;)
> I always says Linux is different from mainframes.


It's not just about Linux.

Applications behave differently too from the way they were 15
years ago.

Some databases, eg. sleepycat's db, map the whole database in
memory.  Other databases, like MySQL and postgresql, rely on
the kernel's page cache to cache the most frequently accessed
data.

To make matters more interesting, memory sizes have increased
by a factor of 1000, but disk seek times have only gotten 10 times
faster.  This means that simplistic memory management algorithms
can hurt performance a lot more than they could back then.

In short, I am not convinced that any of the simple tunable knobs
from the "good old days" will do much to actually help people
with modern workloads on modern computers.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Jan 2007 23:28:15 -0500
Rik van Riel <[EMAIL PROTECTED]> wrote:

> KAMEZAWA Hiroyuki wrote:
> 
> > FYI:
> > Because some customers are migrated from mainframes, they want to control
> > almost all features in OS, IOW, designing memory usages.
> 
> Don't you mean:
> 
> "Because some customers are migrating from mainframes, they are
>   used to needing to control all features in OS" ? :)
> 
Ah yes ;)
I always say that Linux is different from mainframes.

--
Because some customers have migrated from mainframes,
they expected to be able to do what they did on mainframes.
They want to control almost all features of the OS, but now they can't.
This means they can't use their experience and schemes from the old days.
--

Since they are learning Linux now, I think this may change in the future.


Thanks,
-Kame



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

Vaidyanathan Srinivasan wrote:

> In my opinion, once a
> file page is mapped by the process, then it should be treated at par
> with anon pages.  Application programs generally do not mmap a file
> page if the reuse for the content is very low.


Why not have the VM measure this, instead of making wild
assumptions about every possible workload out there?

There are a few databases out there that mmap the whole
thing.  Sleepycat for one...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

KAMEZAWA Hiroyuki wrote:

> FYI:
> Because some customers are migrated from mainframes, they want to control
> almost all features in OS, IOW, designing memory usages.


Don't you mean:

"Because some customers are migrating from mainframes, they are
 used to needing to control all features in OS" ? :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeatedly
> requests to be able to limit the pagecache.


IMHO it's a bad hack.

It would be better to identify the problem this "feature" is
trying to fix, and then fix the root cause.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Vaidyanathan Srinivasan


Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
> 
>> With your patch, MMAP of a file that will cross the pagecache limit hangs the
>> system.  As I mentioned in my previous mail, without subtracting the
>> NR_FILE_MAPPED, the reclaim will infinitely try and fail.
> 
> Well mapped pages are still pagecache pages.
> 

Yes, but they can be classified as a process's RSS pages.  Whether it
is an anon page, shared memory, or an mmap of pagecache, it shows up
under RSS.  Those pages can be limited by an RSS limiter similar to the
pagecache limiter we are discussing.  In my opinion, once a
file page is mapped by a process, it should be treated on par
with anon pages.  Application programs generally do not mmap a file
page if its content is rarely reused.

>> I have tested your patch with the attached fix on my PPC64 box.
> 
> Interesting. What is your reason for wanting to limit the size of the
> pagecache?

1. Systems primarily running database workloads would benefit if
background housekeeping applications like backup processes did not
fill the pagecache.  Databases use O_DIRECT, and we do not want the
kernel to remove even cold pages belonging to the database application
to make room for pagecache that is going to be used by an unimportant
backup application.  The objective is to have some limit on pagecache
usage so that the backup application takes all of the performance hit
and the main database workload sees zero impact.

Solutions:

* The backup applications could use O_DIRECT as well, but this is not
very flexible since there are restrictions on using O_DIRECT.

Please review http://lkml.org/lkml/2007/1/4/55 for issues with O_DIRECT

* Improve fadvise to specify caching behavior.  Right now we only model
the readahead behavior.  However, this would need a change in all
applications and more command line options.

* The technique we are discussing right now can serve the purpose.

2. In the context of 'containers' and per-container resource
management, there is a need to restrict the resources used by each of
the process groups within a container.  Resources like CPU time,
RSS, pagecache usage, IO bandwidth etc. may have to be controlled for
each process group.

Some of today's open virtualisation solutions, like UML instances and KVM
instances among others, also need to control CPU time, RSS and
(unmapped) pagecache pages to be able to successfully execute
commercial workloads within their virtual environments.  Each of these
instances is a normal Linux process within the host kernel.
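
For the fadvise option listed under reason 1, here is a minimal sketch of
what a cooperating backup reader could already do with the existing
posix_fadvise(2) hints.  stream_once() is a hypothetical helper name, the
64 KB chunk size is arbitrary, and the hints are advisory only, so the
kernel may ignore them.  It does not address the objection above that every
application would have to be changed.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Read a file once, front to back, and ask the kernel to drop each chunk
 * from the page cache after it has been consumed.  Returns the number of
 * bytes read, or -1 on error.  stream_once() is a hypothetical helper.
 */
long long stream_once(const char *path)
{
	char buf[64 * 1024];
	long long total = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	/* Hint: the file will be read sequentially. */
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		/* A real backup tool would write buf to its target here. */

		/* Hint: the range just read will not be needed again. */
		(void)posix_fadvise(fd, total, n, POSIX_FADV_DONTNEED);
		total += n;
	}

	close(fd);
	return n < 0 ? -1 : total;
}
```

A caller would simply invoke stream_once() on each file being backed up;
the database's own pages are untouched either way, the hints only affect
how long the backup's pages are likely to stay cached.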

--Vaidy



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Jan 2007 18:41:27 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > But I can't think of the way to show that.
> > ==
> > [EMAIL PROTECTED] src]$ free
> >              total       used       free     shared    buffers     cached
> > Mem:        741604     724628      16976          0      62700     564600
> > -/+ buffers/cache:      97328     644276
> > Swap:      1052216       2532    1049684
> > ==
> 
> Could we call the free memory "unused memory" and not talk about free 
> memory at all?
> 
Ah, maybe it's better.

I have run into several memory problems on users' systems recently (on older
kernels), with hundreds or thousands of processes running on them.

When I explain memory management to customers, I divide memory into:

(1) unused memory --- memory which is not used, on the free lists of zones.

(2) reclaimable memory --- page cache, which is reclaimable
    clean pages --- can be reclaimed soon
    dirty pages --- need to be written back
    *BUT* busy pages are unreclaimable.

(3) swappable memory --- user processes' pages. Basically reclaimable if
    swap is available.
    shmem pages are included here.

(4) locked memory --- mlocked memory, which is not reclaimable (but movable)

(5) kernel memory --- used by the kernel
    (and we can't see how many pages are reclaimable)

We can know the amounts of (1) and (5) and the total memory.
Basically, (3) = (Total) - (2) - (1).
The busy data set of (2) and (3) is not reclaimable, but its size
is unknown. Many users take logs of 'ps' or 'sar' to estimate their memory
usage (and sometimes the page cache of a log file eats their memory).

The amount of (4) is unknown. But there was a system where 6GB of its 8GB
of memory was mlocked (--; and the OOM killer ran.

I'm sorry that I haven't caught up on how the current kernel shows memory usage.
I should investigate that.
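
The rule of thumb "(3) = (Total) - (2) - (1)" can be sketched in user space
from /proc/meminfo-style text.  This is only an illustration under stated
assumptions: meminfo_kb() and swappable_kb() are hypothetical helper names,
Buffers + Cached is used as a rough stand-in for (2), and the result still
lumps locked memory (4) and kernel memory (5) into the remainder.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up "Key:   <value> kB" in /proc/meminfo-style text and return the
 * value in kB, or -1 if the key is absent.  The key must start a line, so
 * "Cached" does not match "SwapCached".
 */
long meminfo_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		long val;

		if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
		    sscanf(line + klen + 1, "%ld", &val) == 1)
			return val;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Rule of thumb from the mail: (3) = Total - (2) - (1), all in kB. */
long swappable_kb(long total, long unused, long pagecache)
{
	return total - pagecache - unused;
}
```

Fed the figures from the free output quoted in this thread (total 741604,
free 16976, buffers 62700, cached 564600), swappable_kb() yields 97328 kB,
which matches the "used" column of the "-/+ buffers/cache" row.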

FYI:
Because some customers are migrated from mainframes, they want to control
almost all features in OS, IOW, designing memory usages.

-Kame




Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Wed, 24 Jan 2007, Erik Andersen wrote:

> It would be far more useful if an application could hint to the
> pagecache as to which files are and which files as not worth
> caching, especially when the application knows a priori that data
> from a particular file will or will not ever be reused.

It can give such hints via madvise(2).
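
A minimal sketch of such a hint, assuming the application accesses the file
through mmap(2): madvise(2) applies to a mapped address range, so it only
helps programs that map their files.  scan_once() is a hypothetical helper
name, and MADV_SEQUENTIAL is advice that the kernel is free to ignore.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a file read-only, tell the kernel it will be scanned once from start
 * to end, and return a simple byte sum.  Returns -1 on error or if the
 * file is empty.  scan_once() is a hypothetical helper.
 */
long scan_once(const char *path)
{
	struct stat st;
	unsigned char *p;
	long sum = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);		/* the mapping stays valid after close() */
	if (p == MAP_FAILED)
		return -1;

	/* Hint only: expect sequential access over the whole mapping. */
	(void)madvise(p, st.st_size, MADV_SEQUENTIAL);

	for (off_t i = 0; i < st.st_size; i++)
		sum += p[i];

	munmap(p, st.st_size);
	return sum;
}
```

For files accessed with read(2) rather than mmap(2), the analogous
per-descriptor interface is posix_fadvise(2).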



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Thu, 25 Jan 2007, KAMEZAWA Hiroyuki wrote:

> On Wed, 24 Jan 2007 14:15:10 +0900
> KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:
> 
> >   And...some customers want to keep memory Free as much as possible.
> >   99% memory usage makes insecure them ;)
> > 
> If there is a way that the "free" command can show "never used" memory,
> they will not complain ;).
> 
> But I can't think of the way to show that.
> ==
> [EMAIL PROTECTED] src]$ free
>              total       used       free     shared    buffers     cached
> Mem:        741604     724628      16976          0      62700     564600
> -/+ buffers/cache:      97328     644276
> Swap:      1052216       2532    1049684
> ==

Could we call the free memory "unused memory" and not talk about free 
memory at all?


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Aubrey Li

On 1/24/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
> > On 1/24/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > > > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > > > the right way. Feel free to improve on it. I have gotten repeatedly
> > > > requests to be able to limit the pagecache. With the revised VM statistics
> > > > this is now actually possile. I'd like to know more about possible uses of
> > > > such a feature.
> > > >
> > > >
> > > >
> > > >
> > > > It may be useful to limit the size of the page cache for various reasons
> > > > such as
> > > >
> > > > 1. Insure that anonymous pages that may contain performance
> > > >    critical data is never subject to swap.
> > >
> > > This is what we have mlock for, no?
> > >
> > > > 2. Insure rapid turnaround of pages in the cache.
> > >
> > > This sounds like we either need more fadvise hints and/or understand why
> > > the VM doesn't behave properly.
> > >
> > > > 3. Reserve memory for other uses? (Aubrey?)
> > >
> > > He wants to make a nommu system act like a mmu system; this will just
> > > never ever work.
> >
> > Nope. Actually my nommu system works great with some of patches made by us.
> > What let you think this will never work?
>
> Because there are perfectly valid things user-space can do to mess you
> up. I forgot the test-case but it had something to do with opening a
> million files, this will scatter slab pages all over the place.
>
> Also, if you cycle your large user-space allocations a bit unluckily
> you'll also fragment it into oblivion.
>
> So you can not guarantee it will not fragment into smithereens stopping
> your user-space from using large than page size allocations.
>
> If your user-space consists of several applications that do dynamic
> memory allocation of various sizes its a matter of (run-) time before
> things will start failing.
>
> If you prealloc a large area at boot time (like we now do for hugepages)
> and use that for user-space, you might 'reset' the status quo by cycling
> the whole of userspace.



It seems you are talking about a perfect system. Opening a million
files will never be a requirement of my system. You know I'm working
on an embedded system; most of the time the whole system just runs
one application. If I can guarantee this application works forever, I
think that's enough. I'm not trying to make a nommu system act like an
mmu system, which is impossible; I just make my nommu system work.

-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Jan 2007 14:15:10 +0900
KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

>   And...some customers want to keep memory Free as much as possible.
>   99% memory usage makes insecure them ;)
> 
If there were a way for the "free" command to show "never used" memory,
they would not complain ;).

But I can't think of a way to show that.
==
[EMAIL PROTECTED] src]$ free
             total       used       free     shared    buffers     cached
Mem:        741604     724628      16976          0      62700     564600
-/+ buffers/cache:      97328     644276
Swap:      1052216       2532    1049684
==

If anyone has a good idea, could you let me know?
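
For reference, the "-/+ buffers/cache" row in the output above is plain
arithmetic on the "Mem:" row.  The sketch below reproduces it;
adjust_for_cache() and struct free_adjusted are hypothetical names, and the
adjusted "free" figure is only an upper bound, since dirty or busy cache
pages cannot be dropped immediately.  It does not show "never used" memory,
which is what was asked for.

```c
/* The two values on the "-/+ buffers/cache" row of older free(1) output. */
struct free_adjusted {
	long used;	/* "used" with buffers and cache subtracted */
	long free;	/* "free" with buffers and cache added back */
};

/* All arguments and results are in kB, as printed by free. */
struct free_adjusted adjust_for_cache(long used, long free_kb,
				      long buffers, long cached)
{
	struct free_adjusted r;

	r.used = used - buffers - cached;
	r.free = free_kb + buffers + cached;
	return r;
}
```

With the numbers shown (used 724628, free 16976, buffers 62700, cached
564600) this gives 97328 and 644276, the same values free printed.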

Regards,
-Kame



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Erik Andersen
On Wed Jan 24, 2007 at 06:58:42AM -0800, Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
> 
> > I can't argue that a smaller pagecache will be subject to a
> > higher turnaround given the same workload, but I don't know why
> > that would be a good thing.
> 
> Neither do I. Wonder why we need this but I keep getting 
> these requests. Could we either find a reason for limiting the pagecache 
> or get this out of our system for good?

I think this paints with too broad a brushstroke...

Simply limiting the page cache with no regard to the potential
for particular content to be later reused seems a rather
pointless exercise which is guaranteed to diminish system
performance.

It would be far more useful if an application could hint to the
pagecache as to which files are and which files are not worth
caching, especially when the application knows a priori that data
from a particular file will or will not ever be reused.

 -Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Wed, 24 Jan 2007, Nick Piggin wrote:

> I can't argue that a smaller pagecache will be subject to a
> higher turnaround given the same workload, but I don't know why
> that would be a good thing.

Neither do I. Wonder why we need this but I keep getting 
these requests. Could we either find a reason for limiting the pagecache 
or get this out of our system for good?


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:

> With your patch, MMAP of a file that will cross the pagecache limit hangs the
> system.  As I mentioned in my previous mail, without subtracting the
> NR_FILE_MAPPED, the reclaim will infinitely try and fail.

Well mapped pages are still pagecache pages.
 
> I have tested your patch with the attached fix on my PPC64 box.

Interesting. What is your reason for wanting to limit the size of the 
pagecache?



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Peter Zijlstra
On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
> On 1/24/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > > the right way. Feel free to improve on it. I have gotten repeatedly
> > > requests to be able to limit the pagecache. With the revised VM statistics
> > > this is now actually possile. I'd like to know more about possible uses of
> > > such a feature.
> > >
> > >
> > >
> > >
> > > It may be useful to limit the size of the page cache for various reasons
> > > such as
> > >
> > > 1. Insure that anonymous pages that may contain performance
> > >critical data is never subject to swap.
> >
> > This is what we have mlock for, no?
> >
> > > 2. Insure rapid turnaround of pages in the cache.
> >
> > This sounds like we either need more fadvise hints and/or understand why
> > the VM doesn't behave properly.
> >
> > > 3. Reserve memory for other uses? (Aubrey?)
> >
> > He wants to make a nommu system act like a mmu system; this will just
> > never ever work.
> 
> Nope. Actually my nommu system works great with some of patches made by us.
> What let you think this will never work?

Because there are perfectly valid things user-space can do to mess you
up. I forgot the test-case but it had something to do with opening a
million files; this will scatter slab pages all over the place.

Also, if you cycle your large user-space allocations a bit unluckily
you'll also fragment it into oblivion.

So you cannot guarantee it will not fragment into smithereens, stopping
your user-space from using larger-than-page-size allocations.

If your user-space consists of several applications that do dynamic
memory allocation of various sizes, it's a matter of (run-) time before
things will start failing.

If you prealloc a large area at boot time (like we now do for hugepages)
and use that for user-space, you might 'reset' the status quo by cycling
the whole of userspace.

> > Memory fragmentation is a real issue not some gimmick
> > thought up by the hardware folks to sell these mmu chips.
> >
> I totally disagree. Memory fragmentations is the issue not only on
> nommu, it's also on mmu chips. That's not the reason mmu chips can be
> sold.

For MMU-enabled chips these fragmentation issues (at the page allocation
level) will never reach (regular - !hugepages) user-space, precisely
because the MMU makes things virtually contiguous.

Yes, there are problems in kernel space, esp. when we want to use huge
pages.
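
The fragmentation argument can be illustrated with a toy page map.  This is
a sketch only, not kernel code: largest_free_run() and free_pages() are
hypothetical helpers operating on an array where 1 means "allocated" and 0
means "free".  Without an MMU, an allocation of N pages needs a physically
contiguous free run of N pages; with an MMU, any N free pages can be mapped
to look contiguous to regular user-space.

```c
#include <stddef.h>

/*
 * Given a page map (1 = allocated, 0 = free), return the length of the
 * longest run of physically contiguous free pages.
 */
size_t largest_free_run(const unsigned char *map, size_t npages)
{
	size_t best = 0, run = 0;

	for (size_t i = 0; i < npages; i++) {
		run = map[i] ? 0 : run + 1;
		if (run > best)
			best = run;
	}
	return best;
}

/* Count free pages regardless of where they are. */
size_t free_pages(const unsigned char *map, size_t npages)
{
	size_t n = 0;

	for (size_t i = 0; i < npages; i++)
		n += !map[i];
	return n;
}
```

A map that alternates allocated and free pages has half of its memory free
but a largest free run of one page, so a two-page contiguous request fails
on it even though plenty of memory is unused.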



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Aubrey Li

On 1/24/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > the right way. Feel free to improve on it. I have gotten repeatedly
> > requests to be able to limit the pagecache. With the revised VM statistics
> > this is now actually possile. I'd like to know more about possible uses of
> > such a feature.
> >
> >
> >
> >
> > It may be useful to limit the size of the page cache for various reasons
> > such as
> >
> > 1. Insure that anonymous pages that may contain performance
> >    critical data is never subject to swap.
>
> This is what we have mlock for, no?
>
> > 2. Insure rapid turnaround of pages in the cache.
>
> This sounds like we either need more fadvise hints and/or understand why
> the VM doesn't behave properly.
>
> > 3. Reserve memory for other uses? (Aubrey?)
>
> He wants to make a nommu system act like a mmu system; this will just
> never ever work.

Nope. Actually my nommu system works great with some of patches made by us.
What let you think this will never work?

> Memory fragmentation is a real issue not some gimmick
> thought up by the hardware folks to sell these mmu chips.

I totally disagree. Memory fragmentations is the issue not only on
nommu, it's also on mmu chips. That's not the reason mmu chips can be
sold.

-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Peter Zijlstra
On Wed, 2007-01-24 at 23:50 +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> 
> >>2. Insure rapid turnaround of pages in the cache.
> 
> [...]
> 
> > The  only maybe valid point would be 2, and I'd like to see if we can't
> > solve that differently - a better use-once logic comes to mind.
> 
> There must be something I'm missing with that point. The faster
> the turnaround of pagecache pages, the *less* efficiently the
> pagecache is working (assuming a rapid turnaround means a high
> rate of pages brought into, then reclaimed from pagecache).
> 
> I can't argue that a smaller pagecache will be subject to a
> higher turnaround given the same workload, but I don't know why
> that would be a good thing.

I interpreted the issue as selecting the wrong pages for the 'working
set'. Like not quickly evicting pages from a large streaming read, which
then pushes out more useful pages.



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Nick Piggin

Peter Zijlstra wrote:
> On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:

> > 2. Insure rapid turnaround of pages in the cache.

[...]

> The only maybe valid point would be 2, and I'd like to see if we can't
> solve that differently - a better use-once logic comes to mind.


There must be something I'm missing with that point. The faster
the turnaround of pagecache pages, the *less* efficiently the
pagecache is working (assuming a rapid turnaround means a high
rate of pages brought into, then reclaimed from pagecache).

I can't argue that a smaller pagecache will be subject to a
higher turnaround given the same workload, but I don't know why
that would be a good thing.

--
SUSE Labs, Novell Inc.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Vaidyanathan Srinivasan


Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeatedly
> requests to be able to limit the pagecache. With the revised VM statistics
> this is now actually possile. I'd like to know more about possible uses of
> such a feature.
> 
> 

[snip]

Hi Christoph,

With your patch, MMAP of a file that will cross the pagecache limit hangs the
system.  As I mentioned in my previous mail, without subtracting the
NR_FILE_MAPPED, the reclaim will infinitely try and fail.

I have tested your patch with the attached fix on my PPC64 box.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>

---
 mm/page_alloc.c |    3 ++-
 mm/vmscan.c     |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

--- linux-2.6.20-rc5.orig/mm/page_alloc.c
+++ linux-2.6.20-rc5/mm/page_alloc.c
@@ -1171,7 +1171,8 @@ zonelist_scan:
 				goto try_next_zone;
 
 		if ((gfp_mask & __GFP_PAGECACHE) &&
-				zone_page_state(zone, NR_FILE_PAGES) >
+				(zone_page_state(zone, NR_FILE_PAGES) -
+				 zone_page_state(zone, NR_FILE_MAPPED)) >
 					zone->max_pagecache_pages)
 				goto try_next_zone;
 
--- linux-2.6.20-rc5.orig/mm/vmscan.c
+++ linux-2.6.20-rc5/mm/vmscan.c
@@ -936,7 +936,8 @@ static unsigned long shrink_zone(int pri
 	 * If the page cache is too big then focus on page cache
 	 * and ignore anonymous pages
 	 */
-	if (sc->may_swap && zone_page_state(zone, NR_FILE_PAGES)
+	if (sc->may_swap && (zone_page_state(zone, NR_FILE_PAGES) -
+				zone_page_state(zone, NR_FILE_MAPPED))
 			> zone->max_pagecache_pages)
 		sc->may_swap = 0;
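
The change in both hunks is the same predicate, so it can be modelled in
user space.  This is an illustrative sketch, not kernel code: the
parameters stand in for zone_page_state(zone, NR_FILE_PAGES),
zone_page_state(zone, NR_FILE_MAPPED) and zone->max_pagecache_pages, the
function names are hypothetical, and it assumes file_mapped never exceeds
file_pages (otherwise the unsigned subtraction would wrap).

```c
#include <stdbool.h>

/* Check before the fix: every page cache page counts against the limit. */
bool over_limit_original(unsigned long file_pages,
			 unsigned long max_pagecache_pages)
{
	return file_pages > max_pagecache_pages;
}

/* Check after the fix: only unmapped page cache counts against the limit. */
bool over_limit_fixed(unsigned long file_pages,
		      unsigned long file_mapped,
		      unsigned long max_pagecache_pages)
{
	/* Assumes file_mapped <= file_pages. */
	return file_pages - file_mapped > max_pagecache_pages;
}
```

For example, with a limit of 1000 pages and 1500 page cache pages of which
800 are mapped, the original check stays over the limit for as long as the
mapping exists, while the fixed check sees 700 unmapped pages and passes.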


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Peter Zijlstra
On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
> On 1/24/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
> > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > > the right way. Feel free to improve on it. I have gotten repeatedly
> > > requests to be able to limit the pagecache. With the revised VM statistics
> > > this is now actually possible. I'd like to know more about possible uses of
> > > such a feature.
> > >
> > > It may be useful to limit the size of the page cache for various reasons
> > > such as
> > >
> > > 1. Insure that anonymous pages that may contain performance
> > >    critical data is never subject to swap.
> >
> > This is what we have mlock for, no?
> >
> > > 2. Insure rapid turnaround of pages in the cache.
> >
> > This sounds like we either need more fadvise hints and/or understand why
> > the VM doesn't behave properly.
> >
> > > 3. Reserve memory for other uses? (Aubrey?)
> >
> > He wants to make a nommu system act like a mmu system; this will just
> > never ever work.
>
> Nope. Actually my nommu system works great with some of the patches made by us.
> What makes you think this will never work?

Because there are perfectly valid things user-space can do to mess you
up. I forgot the test-case but it had something to do with opening a
million files, this will scatter slab pages all over the place.

Also, if you cycle your large user-space allocations a bit unluckily
you'll also fragment it into oblivion.

So you cannot guarantee it will not fragment into smithereens, stopping
your user-space from using larger-than-page-size allocations.

If your user-space consists of several applications that do dynamic
memory allocation of various sizes, it's a matter of (run-) time before
things will start failing.

If you prealloc a large area at boot time (like we now do for hugepages)
and use that for user-space, you might 'reset' the status quo by cycling
the whole of userspace.

> > Memory fragmentation is a real issue not some gimmick
> > thought up by the hardware folks to sell these mmu chips.
>
> I totally disagree. Memory fragmentation is an issue not only on
> nommu, it's also an issue on mmu chips. That's not the reason mmu chips can be
> sold.

For MMU enabled chips these fragmentation issues (at the page allocation
level) will never reach (regular - !hugepages) user-space. Exactly
because of the MMU, it will make things virtually contiguous.

Yes, there are problems in kernel space, esp. when we want to use huge
pages.



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:

> With your patch, MMAP of a file that will cross the pagecache limit hangs the
> system.  As I mentioned in my previous mail, without subtracting the
> NR_FILE_MAPPED, the reclaim will infinitely try and fail.

Well mapped pages are still pagecache pages.

> I have tested your patch with the attached fix on my PPC64 box.

Interesting. What is your reason for wanting to limit the size of the 
pagecache?



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Wed, 24 Jan 2007, Nick Piggin wrote:

> I can't argue that a smaller pagecache will be subject to a
> higher turnaround given the same workload, but I don't know why
> that would be a good thing.

Neither do I. Wonder why we need this but I keep getting 
these requests. Could we either find a reason for limiting the pagecache 
or get this out of our system for good?


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Erik Andersen
On Wed Jan 24, 2007 at 06:58:42AM -0800, Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
>
> > I can't argue that a smaller pagecache will be subject to a
> > higher turnaround given the same workload, but I don't know why
> > that would be a good thing.
>
> Neither do I. Wonder why we need this but I keep getting
> these requests. Could we either find a reason for limiting the pagecache
> or get this out of our system for good?

I think this paints with too broad a brushstroke...

Simply limiting the page cache with no regard to the potential
for particular content to be later reused seems a rather
pointless exercise which is guaranteed to diminish system
performance.

It would be far more useful if an application could hint to the
pagecache as to which files are and which files are not worth
caching, especially when the application knows a priori that data
from a particular file will or will not ever be reused.

 -Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Jan 2007 14:15:10 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

>   And...some customers want to keep memory Free as much as possible.
>   99% memory usage makes insecure them ;)

If there were a way for the free command to show never-used memory,
they would not complain ;).

But I can't think of a way to show that.
==
[EMAIL PROTECTED] src]$ free
             total       used       free     shared    buffers     cached
Mem:        741604     724628      16976          0      62700     564600
-/+ buffers/cache:      97328     644276
Swap:      1052216       2532    1049684
==

If anyone has a good idea, could you let me know?
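
For reference, the "-/+ buffers/cache" row is derived from the "Mem:" row by
moving buffers and cached from "used" to "free". A minimal sketch in Python,
using the figures from the free output in this message; the function name is
made up for illustration, and the second value is only an upper bound, since
dirty or busy pagecache pages cannot be reclaimed immediately:

```python
def buffers_cache_row(used, free, buffers, cached):
    """Recompute the '-/+ buffers/cache' row of free(1); all values in kB.

    Returns (used excluding buffers/cache, free including buffers/cache).
    """
    used_without_cache = used - buffers - cached
    free_with_cache = free + buffers + cached
    return used_without_cache, free_with_cache


# Figures taken from the "Mem:" row of the free output quoted in this thread.
used_kb, avail_kb = buffers_cache_row(
    used=724628, free=16976, buffers=62700, cached=564600
)
print(used_kb, avail_kb)  # 97328 644276
```

So on this machine about 97 MB is held by programs and the kernel, and the
rest of "used" is cache that the kernel can give back under memory pressure.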

Regards,
-Kame



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Aubrey Li

On 1/24/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
> On Wed, 2007-01-24 at 22:22 +0800, Aubrey Li wrote:
> > On 1/24/07, Peter Zijlstra [EMAIL PROTECTED] wrote:
> > > On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> > > > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > > > the right way. Feel free to improve on it. I have gotten repeatedly
> > > > requests to be able to limit the pagecache. With the revised VM statistics
> > > > this is now actually possible. I'd like to know more about possible uses of
> > > > such a feature.
> > > >
> > > > It may be useful to limit the size of the page cache for various reasons
> > > > such as
> > > >
> > > > 1. Insure that anonymous pages that may contain performance
> > > >    critical data is never subject to swap.
> > >
> > > This is what we have mlock for, no?
> > >
> > > > 2. Insure rapid turnaround of pages in the cache.
> > >
> > > This sounds like we either need more fadvise hints and/or understand why
> > > the VM doesn't behave properly.
> > >
> > > > 3. Reserve memory for other uses? (Aubrey?)
> > >
> > > He wants to make a nommu system act like a mmu system; this will just
> > > never ever work.
> >
> > Nope. Actually my nommu system works great with some of the patches made by us.
> > What makes you think this will never work?
>
> Because there are perfectly valid things user-space can do to mess you
> up. I forgot the test-case but it had something to do with opening a
> million files, this will scatter slab pages all over the place.
>
> Also, if you cycle your large user-space allocations a bit unluckily
> you'll also fragment it into oblivion.
>
> So you cannot guarantee it will not fragment into smithereens, stopping
> your user-space from using larger-than-page-size allocations.
>
> If your user-space consists of several applications that do dynamic
> memory allocation of various sizes, it's a matter of (run-) time before
> things will start failing.
>
> If you prealloc a large area at boot time (like we now do for hugepages)
> and use that for user-space, you might 'reset' the status quo by cycling
> the whole of userspace.



It seems you are talking about a perfect system. Opening a million
files will never be a requirement of my system. You know I'm working
on an embedded system; most of the time the whole system just runs
one application. If I can guarantee this application works forever, I
think that's enough. I'm not trying to make a nommu system act like a
mmu system, that's impossible; I just make my nommu system work.

-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Thu, 25 Jan 2007, KAMEZAWA Hiroyuki wrote:

> On Wed, 24 Jan 2007 14:15:10 +0900
> KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
>
> >   And...some customers want to keep memory Free as much as possible.
> >   99% memory usage makes insecure them ;)
>
> If there were a way for the free command to show never-used memory,
> they would not complain ;).
>
> But I can't think of a way to show that.
> ==
> [EMAIL PROTECTED] src]$ free
>              total       used       free     shared    buffers     cached
> Mem:        741604     724628      16976          0      62700     564600
> -/+ buffers/cache:      97328     644276
> Swap:      1052216       2532    1049684
> ==

Could we call the free memory unused memory and not talk about free 
memory at all?


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Wed, 24 Jan 2007, Erik Andersen wrote:

> It would be far more useful if an application could hint to the
> pagecache as to which files are and which files are not worth
> caching, especially when the application knows a priori that data
> from a particular file will or will not ever be reused.

It can give such hints via madvise(2).
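
As an illustration of such a hint, here is a minimal sketch in Python, which
exposes madvise(2) on mmap objects from version 3.8 on. It assumes a Linux
system; the function name and the choice of MADV_SEQUENTIAL followed by
MADV_DONTNEED are illustrative only, and these calls are advice the kernel is
free to ignore. Note that madvise(2) only covers mapped ranges; for plain
read()/write() I/O the corresponding interface is posix_fadvise(2).

```python
import mmap
import os


def checksum_with_hints(path):
    """Map a file read-only, advise the kernel, and return the sum of its bytes."""
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be mapped
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # madvise() and the MADV_* constants need Python 3.8+ and OS support.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)   # expect one sequential pass
            total = sum(m[:])                      # touch every page once
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_DONTNEED"):
                m.madvise(mmap.MADV_DONTNEED)     # this mapping is done with the data
            return total


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(checksum_with_hints(sys.argv[1]))
```

The hint is per mapping and per process, so it addresses Erik's "the
application knows a priori" case without any system-wide limit.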



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Jan 2007 18:41:27 -0800 (PST)
Christoph Lameter [EMAIL PROTECTED] wrote:
> > But I can't think of a way to show that.
> > ==
> > [EMAIL PROTECTED] src]$ free
> >              total       used       free     shared    buffers     cached
> > Mem:        741604     724628      16976          0      62700     564600
> > -/+ buffers/cache:      97328     644276
> > Swap:      1052216       2532    1049684
> > ==
>
> Could we call the free memory unused memory and not talk about free
> memory at all?

Ah, maybe that's better.

I have run into several memory problems on users' systems recently (on older
kernels), with hundreds or thousands of processes running on them.

When I explain memory management to customers, I divide memory into:

(1) unused memory  --- memory which is not used, on the free lists of zones.

(2) reclaimable memory --- page cache, which is reclaimable
    clean pages  --- can be reclaimed soon
    dirty pages  --- need to be written back first
    *BUT* busy pages are unreclaimable.

(3) swappable memory --- user processes' pages. Basically reclaimable if
    swap is available.
    shmem pages are included here.

(4) locked memory --- mlocked memory, which is not reclaimable (but movable)

(5) kernel memory --- used by the kernel
    (and we can't see how many pages are reclaimable)

We can know the amounts of (1) and (5) and the total memory.
Basically, (3) = (Total) - (2) - (1).
The busy data set of (2) and (3) is not reclaimable, but its size
is unknown. Many users log 'ps' or 'sar' output to estimate their memory
usage (and sometimes the page cache of a log file eats their memory).

The amount of (4) is unknown. But there was a system where 6GB of 8GB
of memory was mlocked (--; and the OOM killer ran.

I'm sorry that I haven't caught up on how the current kernel reports memory
usage. I should investigate that.

FYI:
Because some customers are migrated from mainframes, they want to control
almost all features in OS, IOW, designing memory usages.

-Kame




Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Vaidyanathan Srinivasan


Christoph Lameter wrote:
> On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
>
> > With your patch, MMAP of a file that will cross the pagecache limit hangs the
> > system.  As I mentioned in my previous mail, without subtracting the
> > NR_FILE_MAPPED, the reclaim will infinitely try and fail.
>
> Well mapped pages are still pagecache pages.
 

Yes, but they can be classified under a process RSS pages.  Whether it
is an anon page or shared mem or mmap of pagecache, it would show up
under RSS.  Those pages can be limited by RSS limiter similar to the
one we are discussing in pagecache limiter.  In my opinion, once a
file page is mapped by the process, then it should be treated at par
with anon pages.  Application programs generally do not mmap a file
page if the reuse for the content is very low.

> > I have tested your patch with the attached fix on my PPC64 box.
>
> Interesting. What is your reason for wanting to limit the size of the
> pagecache?

1. Systems primarily running database workloads would benefit if
background house keeping applications like backup processes do not
fill the pagecache.  Databases use O_DIRECT and we do not want the
kernel to even remove cold pages belonging to that application to make
room for pagecache that is going to be used by an unimportant backup
application.  The objective is to have some limit on pagecache usage
and make the backup application take all the performance hit and have
zero impact on the main database workload.

Solutions:

* The backup applications could use O_DIRECT as well, but this is not
very flexible since there are restrictions in using O_DIRECT.

Please review http://lkml.org/lkml/2007/1/4/55 for issues with O_DIRECT

* Improve fadvise to specify caching behavior.  Right now we only model
the readahead behavior.  However this would need a change in all
applications and more command line options.

* The technique we are discussing right now can serve the purpose

2. In the context of 'containers' and per-container resource
management, there is a need to restrict resources utilized by each of
the process groups within the container.  Resources like CPU time,
RSS, pagecache usage, IO bandwidth etc. may have to be controlled for
each process group.

Some of today's open virtualisation solutions like UML instances, KVM
instances among others also have a need to control CPU time, RSS and
(unmapped) pagecache pages to be able to successfully execute
commercial workloads within their virtual environments.  Each of these
instances is a normal Linux process within the host kernel.
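
To make the fadvise option for a backup application concrete, here is a
minimal sketch in Python, assuming a Linux system where os.posix_fadvise is
available. The function name and chunk size are made up for illustration.
POSIX_FADV_DONTNEED is only a hint, and it drops pages after the fact; it
does not cap how much pagecache the copy uses while it runs, which is the
gap the limit under discussion is meant to close.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def copy_without_caching(src, dst):
    """Copy src to dst, then ask the kernel to drop the pages it cached."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(CHUNK)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
        fout.flush()
        os.fsync(fout.fileno())  # dirty pages cannot be dropped before writeback
        if hasattr(os, "posix_fadvise"):
            # offset 0, length 0 means "the whole file"
            os.posix_fadvise(fin.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
            os.posix_fadvise(fout.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return copied
```

Calling the hint once per chunk instead of once at the end would keep the
footprint smaller during the copy, at the cost of more system calls.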

--Vaidy



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have gotten repeatedly
> requests to be able to limit the pagecache.


IMHO it's a bad hack.

It would be better to identify the problem this feature is
trying to fix, and then fix the root cause.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

KAMEZAWA Hiroyuki wrote:


> FYI:
> Because some customers are migrated from mainframes, they want to control
> almost all features in OS, IOW, designing memory usages.


Don't you mean:

Because some customers are migrating from mainframes, they are
 used to needing to control all features in OS ? :)

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

Vaidyanathan Srinivasan wrote:


> In my opinion, once a
> file page is mapped by the process, then it should be treated at par
> with anon pages.  Application programs generally do not mmap a file
> page if the reuse for the content is very low.


Why not have the VM measure this, instead of making wild
assumptions about every possible workload out there?

There are a few databases out there that mmap the whole
thing.  Sleepycat for one...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Jan 2007 23:28:15 -0500
Rik van Riel [EMAIL PROTECTED] wrote:

> KAMEZAWA Hiroyuki wrote:
>
> > FYI:
> > Because some customers are migrated from mainframes, they want to control
> > almost all features in OS, IOW, designing memory usages.
>
> Don't you mean:
>
> Because some customers are migrating from mainframes, they are
>   used to needing to control all features in OS ? :)

Ah yes ;)
I always say Linux is different from mainframes.

--
Because some customers have been migrated from mainframes,
they expected that they could do what they did on mainframes.
They want to control almost all features in OS. But they can't now.
This means they can't use their experience and schemes from old days.
--

Because they are studying Linux now, the case may change in future, I think.


Thanks,
-Kame



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Rik van Riel

KAMEZAWA Hiroyuki wrote:

> On Wed, 24 Jan 2007 23:28:15 -0500
> Rik van Riel [EMAIL PROTECTED] wrote:
>
> > KAMEZAWA Hiroyuki wrote:
> >
> > > FYI:
> > > Because some customers are migrated from mainframes, they want to control
> > > almost all features in OS, IOW, designing memory usages.
> >
> > Don't you mean:
> >
> > Because some customers are migrating from mainframes, they are
> >   used to needing to control all features in OS ? :)
>
> Ah yes ;)
> I always say Linux is different from mainframes.


It's not just about Linux.

Applications behave differently too from the way they were 15
years ago.

Some databases, eg. sleepycat's db, map the whole database in
memory.  Other databases, like MySQL and postgresql, rely on
the kernel's page cache to cache the most frequently accessed
data.

To make matters more interesting, memory sizes have increased
by a factor 1000, but disk seek times have only gotten 10 times
faster.  This means that simplistic memory management algorithms
can hurt performance a lot more than they could back then.

In short, I am not convinced that any of the simple tunable knobs
from the good old days will do much to actually help people
with modern workloads on modern computers.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Vaidyanathan Srinivasan


Rik van Riel wrote:
> Vaidyanathan Srinivasan wrote:
>
> > In my opinion, once a
> > file page is mapped by the process, then it should be treated at par
> > with anon pages.  Application programs generally do not mmap a file
> > page if the reuse for the content is very low.
>
> Why not have the VM measure this, instead of making wild
> assumptions about every possible workload out there?

Yes, VM page aging and page replacement algorithm should decide on the
relevance of anon or mmap page.  However we may still need to limit
total pages in memory for a given set of processes.

> There are a few databases out there that mmap the whole
> thing.  Sleepycat for one...
 

That is why my suggestion would be not to touch mmapped pagecache
pages in the current pagecache limit code.  The limit should concern
only unmapped pagecache pages.

When the application unmaps the pages, then instantly we would go over
limit and 'now' unmapped pages can be reclaimed.  This behavior has
been verified with my fix on top of Christoph's patch.
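
The accounting argument can be sketched as follows. This is a simplified
model in Python, not the code from either patch; the function names and page
counts are made up for illustration. It shows why counting mapped pages
against the limit can leave reclaim with nothing it is allowed to free, and
why pages become reclaimable as soon as they are unmapped.

```python
def pages_over_limit(nr_file_pages, nr_file_mapped, limit, count_mapped):
    """How many pagecache pages are counted above the limit (0 if none)."""
    if count_mapped:
        counted = nr_file_pages
    else:
        counted = nr_file_pages - nr_file_mapped
    return max(0, counted - limit)


def reclaim_can_succeed(nr_file_pages, nr_file_mapped, limit, count_mapped):
    """True if freeing only unmapped pagecache pages can reach the limit."""
    unmapped = nr_file_pages - nr_file_mapped
    excess = pages_over_limit(nr_file_pages, nr_file_mapped, limit, count_mapped)
    return excess <= unmapped


# Hypothetical zone: 600 pagecache pages, 550 of them mmapped, limit 500.
print(reclaim_can_succeed(600, 550, 500, count_mapped=True))   # False
print(reclaim_can_succeed(600, 550, 500, count_mapped=False))  # True
# After the application unmaps everything, 600 unmapped pages exceed the limit.
print(pages_over_limit(600, 0, 500, count_mapped=False))       # 100
```

In the first case reclaim needs to free 100 pages but only 50 are unmapped,
which corresponds to the "infinitely try and fail" behavior reported earlier
in the thread.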



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread KAMEZAWA Hiroyuki
On Thu, 25 Jan 2007 00:40:54 -0500
Rik van Riel [EMAIL PROTECTED] wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 24 Jan 2007 23:28:15 -0500
> > Rik van Riel [EMAIL PROTECTED] wrote:
> >
> > > KAMEZAWA Hiroyuki wrote:
> > I always say Linux is different from mainframes.
>
> It's not just about Linux.
>
> Applications behave differently too from the way they were 15
> years ago.
>
> Some databases, eg. sleepycat's db, map the whole database in
> memory.  Other databases, like MySQL and postgresql, rely on
> the kernel's page cache to cache the most frequently accessed
> data.
>
> To make matters more interesting, memory sizes have increased
> by a factor 1000, but disk seek times have only gotten 10 times
> faster.  This means that simplistic memory management algorithms
> can hurt performance a lot more than they could back then.
>
> In short, I am not convinced that any of the simple tunable knobs
> from the good old days will do much to actually help people
> with modern workloads on modern computers.
 
I agree. 

My current concern is not adding knobs but how to show/explain
what the users are doing. In most cases, users don't know what they are doing
and believe that system information can tell them.

For example, a user sometimes asks, "Why are the amounts of pagecache on
system A and system B different? I definitely run the same jobs on both
systems."

...just because he used a different data set ;)

Thanks,
-Kame




Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Aubrey Li

On 1/25/07, Vaidyanathan Srinivasan [EMAIL PROTECTED] wrote:
> Christoph Lameter wrote:
> > On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
> >
> > > With your patch, MMAP of a file that will cross the pagecache limit hangs the
> > > system.  As I mentioned in my previous mail, without subtracting the
> > > NR_FILE_MAPPED, the reclaim will infinitely try and fail.
> >
> > Well mapped pages are still pagecache pages.
>
> Yes, but they can be classified under a process RSS pages.  Whether it
> is an anon page or shared mem or mmap of pagecache, it would show up
> under RSS.  Those pages can be limited by RSS limiter similar to the
> one we are discussing in pagecache limiter.  In my opinion, once a
> file page is mapped by the process, then it should be treated at par
> with anon pages.  Application programs generally do not mmap a file
> page if the reuse for the content is very low.

I agree, we shouldn't take mmapped pages into account.
But Vaidy - even with your patch, we are still using the existing
reclaimer, that means we don't ensure that only page cache is
reclaimed/limited. Mapped pages will be hit also.
I think we still need to add a new scancontrol field to lock mmapped
pages and remove unmapped pagecache pages only.

-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Al Boldi
Rik van Riel wrote:
> Christoph Lameter wrote:
> > This is a patch using some of Aubrey's work plugging it in what is IMHO
> > the right way. Feel free to improve on it. I have gotten repeatedly
> > requests to be able to limit the pagecache.
>
> IMHO it's a bad hack.
>
> It would be better to identify the problem this feature is
> trying to fix, and then fix the root cause.

Ok, here is the problem:  kswapd.

Limiting the page-cache memory inhibits invoking kswapd needlessly, aiding 
performance and easing OOM pressures.

I tried the patch; it works.

But it needs a bit of debugging.  Setting pagecache_ratio = 1 either 
deadlocks or reduces thru-put to  1mb/s.


Thanks!

--
Al



Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Christoph Lameter
On Thu, 25 Jan 2007, Aubrey Li wrote:

> But Vaidy - even with your patch, we are still using the existing
> reclaimer, that means we don't ensure that only page cache is
> reclaimed/limited. Mapped pages will be hit also.
> I think we still need to add a new scancontrol field to lock mmapped
> pages and remove unmapped pagecache pages only.

Setting sc->swappiness to zero will make the reclaimer hit
unmapped pages until we get into problems. Maybe set that to some negative
value to avoid reclaim_mapped being set to 1 in shrink_active_list?

Oh. But reclaim_mapped is staying at zero anyways if may_swap is off. So 
we are already fine.

I still wonder why you are doing this at all. If you just run your own app 
on the box then preallocate your higher order allocations from user space. 
Much less trouble.




Re: [RFC] Limit the size of the pagecache

2007-01-24 Thread Vaidyanathan Srinivasan


Aubrey Li wrote:
> On 1/25/07, Vaidyanathan Srinivasan [EMAIL PROTECTED] wrote:
> >
> > Christoph Lameter wrote:
> > > On Wed, 24 Jan 2007, Vaidyanathan Srinivasan wrote:
> > >
> > > > With your patch, MMAP of a file that will cross the pagecache limit hangs
> > > > the system.  As I mentioned in my previous mail, without subtracting the
> > > > NR_FILE_MAPPED, the reclaim will infinitely try and fail.
> > > Well mapped pages are still pagecache pages.
> >
> > Yes, but they can be classified under a process RSS pages.  Whether it
> > is an anon page or shared mem or mmap of pagecache, it would show up
> > under RSS.  Those pages can be limited by RSS limiter similar to the
> > one we are discussing in pagecache limiter.  In my opinion, once a
> > file page is mapped by the process, then it should be treated at par
> > with anon pages.  Application programs generally do not mmap a file
> > page if the reuse for the content is very low.
>
> I agree, we shouldn't take mmapped pages into account.
> But Vaidy - even with your patch, we are still using the existing
> reclaimer, that means we don't ensure that only page cache is
> reclaimed/limited. Mapped pages will be hit also.
> I think we still need to add a new scancontrol field to lock mmapped
> pages and remove unmapped pagecache pages only.

I have tried to add scan control to Roy's patch at
http://lkml.org/lkml/2007/01/17/96

In that patch, we search and remove only pages that are not mapped.
We also remove referenced and hot pagecache pages which the normal
reclaimer is not expected to consider.

I will try to fit that logic in Christoph's patch and test.

--Vaidy



Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Peter Zijlstra
On Tue, 2007-01-23 at 16:49 -0800, Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO 
> the right way. Feel free to improve on it. I have gotten repeatedly 
> requests to be able to limit the pagecache. With the revised VM statistics 
> this is now actually possible. I'd like to know more about possible uses of 
> such a feature.
> 
> 
> 
> 
> It may be useful to limit the size of the page cache for various reasons
> such as
> 
> 1. Insure that anonymous pages that may contain performance
>critical data is never subject to swap.

This is what we have mlock for, no?

> 2. Insure rapid turnaround of pages in the cache.

This sounds like we either need more fadvise hints and/or understand why
the VM doesn't behave properly.

> 3. Reserve memory for other uses? (Aubrey?)

He wants to make a nommu system act like a mmu system; this will just
never ever work. Memory fragmentation is a real issue not some gimmick
thought up by the hardware folks to sell these mmu chips.

> We add a new variable "pagecache_ratio" to /proc/sys/vm/ that
> defaults to 100 (all memory usable for the pagecache).
> 
> The size of the pagecache is the number of file backed
> pages in a zone which is available through NR_FILE_PAGES.
> 
> We skip zones that contain too many page cache pages in
> the page allocator which may cause us to enter reclaim.
> 
> If we enter reclaim and the number of page cache pages
> is too high then we switch off swapping during reclaim
> to avoid touching anonymous pages.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
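
A rough model of the behavior described in the quoted patch summary, written
as a Python sketch rather than kernel code. Only pagecache_ratio and
NR_FILE_PAGES come from the description; the Zone class, the function names,
the fallback to the first zone, and all page counts are assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    present_pages: int   # all pages in the zone
    nr_file_pages: int   # file-backed (pagecache) pages, cf. NR_FILE_PAGES


def over_pagecache_limit(zone, pagecache_ratio=100):
    """True if pagecache exceeds pagecache_ratio percent of the zone."""
    return zone.nr_file_pages * 100 > zone.present_pages * pagecache_ratio


def choose_zone(zonelist, pagecache_ratio=100):
    """Return (zone, may_swap) for an allocation, skipping over-limit zones."""
    for zone in zonelist:
        if not over_pagecache_limit(zone, pagecache_ratio):
            return zone, True          # under the limit: allocate normally
    # Every zone is over the limit: reclaim from the first one, with
    # swapping switched off so that anonymous pages are left alone.
    return zonelist[0], False


# Hypothetical zones and page counts.
zones = [Zone("Normal", 1000, 600), Zone("DMA", 200, 50)]
print(choose_zone(zones, pagecache_ratio=50)[0].name)  # DMA
print(choose_zone(zones, pagecache_ratio=20))
```

With the default ratio of 100 no zone can be over the limit, so the check
changes nothing unless the sysctl is lowered.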

Code looks nice, however earlier responses have raised good points. Esp.
the one pointing out you'd need to defeat swappiness too.

That said, I'm not much in favour of a limit pagecache knob.

Esp. the "my customers are scared of the 99.9% memory used scenario" is
a clear case of educate them. We don't go fix psychological problems
with code.

The only maybe valid point would be 2, and I'd like to see if we can't
solve that differently - a better use-once logic comes to mind.



Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Vaidyanathan Srinivasan


Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have repeatedly gotten
> requests to be able to limit the pagecache. With the revised VM statistics
> this is now actually possible. I'd like to know more about possible uses of
> such a feature.
>
> It may be useful to limit the size of the page cache for various reasons
> such as
> 
> 1. Insure that anonymous pages that may contain performance
>    critical data is never subject to swap.
> 
> 2. Insure rapid turnaround of pages in the cache.
> 
> 3. Reserve memory for other uses? (Aubrey?)
> 
> We add a new variable "pagecache_ratio" to /proc/sys/vm/ that
> defaults to 100 (all memory usable for the pagecache).
> 
> The size of the pagecache is the number of file backed
> pages in a zone which is available through NR_FILE_PAGES.
> 
> We skip zones that contain too many page cache pages in
> the page allocator which may cause us to enter reclaim.

Skipping the zone may not be a good idea.  We can have a threshold
for reclaim to avoid running the reclaim code too often
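
The threshold idea can be illustrated with a toy counter model. This is not
kernel code: struct pc_limit and pc_add_page() are invented for the example,
and reclaim is assumed to always succeed in freeing down to the low mark.

```c
#include <assert.h>

/* Toy model of a page cache limit with a low/high threshold pair. */
struct pc_limit {
    unsigned long high;         /* start reclaiming above this many pages */
    unsigned long low;          /* reclaim down to this many pages */
    unsigned long nr_pages;     /* current page cache size */
    unsigned long reclaim_runs; /* how often reclaim was entered */
};

/* Account one new page cache page; reclaim in a batch only above 'high'. */
static void pc_add_page(struct pc_limit *pc)
{
    assert(pc->low <= pc->high);
    pc->nr_pages++;
    if (pc->nr_pages > pc->high) {
        pc->nr_pages = pc->low; /* pretend reclaim freed the excess */
        pc->reclaim_runs++;
    }
}
```

With high = 100 and low = 80, adding 1000 pages enters reclaim 43 times in
this model; with low = high = 100 it enters reclaim 900 times, i.e. once per
allocation after the limit has been reached.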

> If we enter reclaim and the number of page cache pages
> is too high then we switch off swapping during reclaim
> to avoid touching anonymous pages.

This is a good idea; however, there could be the following problems:

1. We may not find many unmapped pages in the given number of
pages to scan.  We will have to iterate too much in shrink_zone and
artificially increase memory pressure in order to scan more pages
and find sufficient pagecache pages to free and bring it under the limit.

2. NR_FILE_PAGES includes the mapped pagecache page count. If we turn
off may_swap, then reclaim_mapped will also be off and we will not
remove mapped pagecache.  This is correct because these pages are
'in use' relative to unmapped pagecache pages.

But the problem is in the limit comparison: we need to subtract
mapped pages before checking for overlimit
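
A sketch of the adjusted comparison from point 2, written as a stand-alone
function rather than as a change to the patch. The struct and function names
are hypothetical; in kernel terms the two inputs would presumably come from
per-zone counters such as NR_FILE_PAGES and NR_FILE_MAPPED, which is an
assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the two per-zone counters involved. */
struct toy_counts {
    unsigned long file_pages;   /* all file backed pages */
    unsigned long file_mapped;  /* the subset that is mapped into processes */
};

/* Compare only the unmapped part of the page cache against the limit. */
static bool pagecache_over_limit(const struct toy_counts *c,
                                 unsigned long max_pagecache_pages)
{
    unsigned long unmapped;

    assert(c != NULL);
    /* Counters may be updated independently, so guard against underflow. */
    unmapped = c->file_pages > c->file_mapped ?
               c->file_pages - c->file_mapped : 0;
    return unmapped > max_pagecache_pages;
}
```

For example, 1000 file pages of which 300 are mapped stay under a limit of
800, while a plain NR_FILE_PAGES comparison would report the zone as over
the limit.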

3. We may want to write out dirty and referenced pagecache pages and
free them.  Current shrink_zone looks for easily freeable pagecache
pages only, but if we set a 200MB limit and write out a 1GB file,
then all of the pages will be dirty, active and referenced and still
we will have to force the reclaimer to remove those pages.

Adding more scan control flags in reclaim can give us better
control.  Please review http://lkml.org/lkml/2007/01/17/96 which
used new scan control flags.


> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.20-rc5/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/gfp.h   2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/gfp.h        2007-01-23 17:54:51.750696888 -0600
> @@ -46,6 +46,7 @@ struct vm_area_struct;
>  #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
>  #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
>  #define __GFP_THISNODE   ((__force gfp_t)0x40000u) /* No fallback, no policies */
> +#define __GFP_PAGECACHE  ((__force gfp_t)0x80000u) /* Page cache allocation */
>
>  #define __GFP_BITS_SHIFT 20     /* Room for 20 __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> Index: linux-2.6.20-rc5/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/pagemap.h       2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/pagemap.h    2007-01-23 18:13:14.310062155 -0600
> @@ -62,12 +62,13 @@ static inline struct page *__page_cache_
>
>  static inline struct page *page_cache_alloc(struct address_space *x)
>  {
> -       return __page_cache_alloc(mapping_gfp_mask(x));
> +       return __page_cache_alloc(mapping_gfp_mask(x)| __GFP_PAGECACHE);
>  }
>
>  static inline struct page *page_cache_alloc_cold(struct address_space *x)
>  {
> -       return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
> +       return __page_cache_alloc(mapping_gfp_mask(x) |
> +               __GFP_COLD | __GFP_PAGECACHE);
>  }
>
>  typedef int filler_t(void *, struct page *);
> Index: linux-2.6.20-rc5/include/linux/sysctl.h
> ===================================================================
> --- linux-2.6.20-rc5.orig/include/linux/sysctl.h        2007-01-12 12:54:26.000000000 -0600
> +++ linux-2.6.20-rc5/include/linux/sysctl.h     2007-01-23 18:17:09.285324555 -0600
> @@ -202,6 +202,7 @@ enum
>         VM_PANIC_ON_OOM=33,     /* panic at out-of-memory */
>         VM_VDSO_ENABLED=34,     /* map VDSO into new processes? */
>         VM_MIN_SLAB=35,          /* Percent pages ignored by zone reclaim */
> +       VM_PAGECACHE_RATIO=36,  /* percent of RAM to use as page cache */
>  };
> 
> 
> @@ -956,7 +957,6 @@ extern ctl_handler sysctl_intvec;
>  extern ctl_handler sysctl_jiffies;
>  

Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Aubrey Li

Christoph's patch is better than mine. My only suggestion is to check
that zone->max_pagecache_pages is never less than
zone->pages_low.
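
A sketch of such a clamp, assuming the per-zone limit is derived as a simple
percentage of the zone's pages. The formula and the function name
calc_max_pagecache_pages() are assumptions for illustration; the patch text
in this thread is cut off before the actual computation.

```c
#include <assert.h>

/*
 * Hypothetical per-zone limit: 'ratio' percent of the zone's pages, but
 * never below the zone's low watermark.
 */
static unsigned long calc_max_pagecache_pages(unsigned long present_pages,
                                              unsigned int ratio,
                                              unsigned long pages_low)
{
    unsigned long limit;

    assert(ratio <= 100);
    /* Split the multiplication so that large zones cannot overflow. */
    limit = present_pages / 100 * ratio + present_pages % 100 * ratio / 100;
    return limit < pages_low ? pages_low : limit;
}
```

With 1000 pages and a ratio of 50 the limit is 500 pages; with a ratio of 0
and pages_low = 32 the clamp keeps the limit at 32 instead of 0.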

The good part of the patch is that it uses the existing reclaimer. But in
my opinion the problem with the idea is the existing reclaimer too. Think
of what happens when the vfs cache limit is hit: the reclaimer doesn't
reclaim all of the reclaimable pages, it just gives a few out. So the next
time there is a vfs pagecache request, it is quite possible that the
reclaimer is triggered again. That means that after the limit is hit,
reclaim will run every time fs ops allocate memory. In my mind that is
the point that impacts the performance of the applications.

-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread KAMEZAWA Hiroyuki
On Tue, 23 Jan 2007 20:30:16 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:
> 
> > I don't prefer to cause zone fallback by this.
> > This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64),
> 
> Hmmm... We could use node_page_state instead of zone_page_state.
> 
> > Very rapid page allocation can eat some amount of lower zone.
> 
> One question: For what purpose would you be using the page cache size 
> limitation?
> 
This is my experience on the support desk for RHEL4.
(Therefore, this may not be suitable for talking about the current kernel.)

- One for stability
  When a customer constructs their database (Oracle), the system often goes
  to OOM.
  This is because the system cannot allocate ZONE_DMA memory for a 32bit
  device (USB or e100).
  Not allowing almost all pages to be used as page cache (for temporary use)
  would be some help.
  (Note: the DB is constructed on ext3, so all writes are serialized and the
   system couldn't free page cache.)

- One for tuning
  Sometimes our customers request us to limit the size of the page cache.

  Many customers' memory usage reaches 99.x%. (This is a very common situation.)
  If almost all memory is used by the page cache, we can assume that we can
  free it.
  But the customer cannot estimate what amount of page cache can be freed
  (without performance regression).

  When a customer wants to add a new application, he tunes the system.
  But memory usage is always 99%.
  A page-cache limitation is useful when the customer tunes his system and
  finds sets of data and page-cache.
  (Of course, we can use some other complicated resource management system for
  this.)
  This will allow the users to decide whether they need extra memory or not.

  And... some customers want to keep as much memory free as possible.
  99% memory usage makes them feel insecure ;)
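
As a rough aid for the tuning case, the share of memory currently held by
the page cache can be read from /proc/meminfo. The sketch below parses
meminfo-style text; the function names are made up, and "Cached" is only an
upper bound, since it says nothing about how much can be freed without a
performance regression.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find "<key>: <value> kB" in /proc/meminfo-style text and return the value
 * in kB, or -1 if the key is missing.
 */
static long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    assert(text != NULL && klen > 0);
    while (line && *line) {
        long val;

        if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
            sscanf(line + klen + 1, "%ld", &val) == 1)
            return val;
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline */
    }
    return -1;
}

/* Page cache share of total memory in percent, or -1 on parse failure. */
static long pagecache_percent(const char *text)
{
    long total = meminfo_kb(text, "MemTotal");
    long cached = meminfo_kb(text, "Cached");

    if (total <= 0 || cached < 0)
        return -1;
    return cached * 100 / total;
}
```

A caller would read /proc/meminfo into a buffer and pass it to
pagecache_percent().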

-Kame





Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Christoph Lameter
On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:

> I don't prefer to cause zone fallback by this.
> This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64),

Hmmm... We could use node_page_state instead of zone_page_state.
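
A toy illustration of the difference: a node-wide check sums the file pages
of all zones in the node, so one full zone does not by itself force a
fallback. struct toy_zone and node_over_limit() are invented for the
example and are not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for per-zone counters; not the kernel's struct zone. */
struct toy_zone {
    unsigned long file_pages;
};

/* Node-wide check: sum the file pages of all zones in the node. */
static bool node_over_limit(const struct toy_zone *zones, size_t nr_zones,
                            unsigned long node_limit)
{
    unsigned long total = 0;
    size_t i;

    assert(zones != NULL || nr_zones == 0);
    for (i = 0; i < nr_zones; i++)
        total += zones[i].file_pages;
    return total > node_limit;
}
```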

> Very rapid page allocation can eat some amount of lower zone.

One question: For what purpose would you be using the page cache size 
limitation?



Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Nick Piggin

Aubrey Li wrote:
> On 1/24/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
>> On Wed, 24 Jan 2007, Nick Piggin wrote:
>>
>> > > 1. Insure that anonymous pages that may contain performance
>> > >    critical data is never subject to swap.
>> > >
>> > > 2. Insure rapid turnaround of pages in the cache.
>> >
>> > So if these two aren't working properly at 100%, then I want to know the
>> > reason why. Or at least see what the workload and the numbers look like.
>>
>> The reason for the anonymous page may be because data is rarely touched
>> but for some reason the pages must stay in memory. Rapid turnaround is
>> just one of the reasons that I vaguely recall but I never really
>> understood what the purpose was.
>>
>> > > 3. Reserve memory for other uses? (Aubrey?)
>> >
>> > Maybe. This is still a bad hack, and I don't like to legitimise such use
>> > though. I hope Aubrey isn't relying on this alone for his device to work
>> > because his customers might end up hitting fragmentation problems sooner
>> > or later.
>>
>> I surely wish that Aubrey would give us some more clarity on
>> how this should work. Maybe the others who want this feature could also
>> speak up? I am not that clear on its purpose.
>
> Sorry for the delay. Somehow this thread was put into the spam folder
> of my gmail box. :(
> The patch I posted several days ago works properly on my side. I'm
> working on the blackfin-uclinux platform, so I'm not sure it works 100% on
> other architectures. From the O_DIRECT threads, I know different
> people suffer from VFS pagecache issues for different reasons. So I
> really hope the patch can be improved.


So we need to work out what those issues are and fix them.


> On my side, when the VFS pagecache eats up all of the available memory,
> applications that want to allocate a largeish block (order = 4?) will
> fail. So the logic is as follows:


Yeah, it will be failing at order=4, because the allocator won't try
very hard to reclaim pagecache pages at that cutoff point. This needs to
be fixed in the allocator.
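
To make the order arithmetic concrete: an order=4 request needs 2^4 = 16
contiguous, naturally aligned pages, so plenty of free memory does not help
if it is scattered. A toy check over a free-page map (the function names
and the map representation are invented for the example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of contiguous pages in an allocation of the given order. */
static unsigned long pages_for_order(unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long));
    return 1UL << order;
}

/*
 * Toy check: is there a naturally aligned run of free pages big enough for
 * 'order'? free_map[i] is true when page i is free.
 */
static bool can_alloc_order(const bool *free_map, size_t nr_pages,
                            unsigned int order)
{
    size_t run = pages_for_order(order);
    size_t start, i;

    for (start = 0; start + run <= nr_pages; start += run) {
        for (i = 0; i < run && free_map[start + i]; i++)
            ;                   /* count free pages at the start of the block */
        if (i == run)
            return true;
    }
    return false;
}
```

With every second page of a 64-page map free, half of the memory is free
but no order=4 block can be found.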


>> I hope Aubrey isn't relying on this alone for his device to work
>> because his customers might end up hitting fragmentation problems sooner
>> or later.
>
> That's true. I wrote a replacement of buddy system, it's here:
> http://lkml.org/lkml/2006/12/30/36.
>
> That can improve the fragmentation problems on our platform.


That might be a good idea, but while the buddy system may not seem as
efficient and wastes space, it is actually really good for fragmentation.

Anyway, point being that you can't eliminate fragmentation, so you need
to cope with allocation failures or implement reserve pools if you want a
robust system.
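
A reserve pool in its simplest form is a set of buffers allocated up front,
before memory gets tight or fragmented. The user-space sketch below only
illustrates the idea; the names are made up, and it does not use the
kernel's own mempool facility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy reserve pool: a stack of buffers allocated at init time. */
struct reserve_pool {
    void **bufs;        /* spare buffers */
    size_t nr_free;     /* how many spares are left */
    size_t cap;         /* total number of buffers owned by the pool */
};

/* Free the spares; outstanding buffers must be returned first. */
static void pool_destroy(struct reserve_pool *p)
{
    while (p->nr_free > 0)
        free(p->bufs[--p->nr_free]);
    free(p->bufs);
    p->bufs = NULL;
}

/* Allocate 'cap' buffers of 'buf_size' bytes up front; 0 on success. */
static int pool_init(struct reserve_pool *p, size_t cap, size_t buf_size)
{
    assert(p != NULL && cap > 0 && buf_size > 0);
    p->cap = cap;
    p->nr_free = 0;
    p->bufs = calloc(cap, sizeof(*p->bufs));
    if (!p->bufs)
        return -1;
    while (p->nr_free < cap) {
        void *buf = malloc(buf_size);

        if (!buf) {
            pool_destroy(p);
            return -1;
        }
        p->bufs[p->nr_free++] = buf;
    }
    return 0;
}

/* Take a buffer from the reserve; NULL means the caller must cope. */
static void *pool_get(struct reserve_pool *p)
{
    return p->nr_free ? p->bufs[--p->nr_free] : NULL;
}

/* Return a buffer previously obtained from pool_get(). */
static void pool_put(struct reserve_pool *p, void *buf)
{
    assert(p->nr_free < p->cap);
    p->bufs[p->nr_free++] = buf;
}
```

The important property is that pool_get() never calls the allocator, so it
cannot fail because of fragmentation; it can only run out of reserves.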

--
SUSE Labs, Novell Inc.


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Aubrey Li

On 1/24/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
>
> > > 1. Insure that anonymous pages that may contain performance
> > >    critical data is never subject to swap.
> > >
> > > 2. Insure rapid turnaround of pages in the cache.
> >
> > So if these two aren't working properly at 100%, then I want to know the
> > reason why. Or at least see what the workload and the numbers look like.
>
> The reason for the anonymous page may be because data is rarely touched
> but for some reason the pages must stay in memory. Rapid turnaround is
> just one of the reasons that I vaguely recall but I never really
> understood what the purpose was.
>
> > > 3. Reserve memory for other uses? (Aubrey?)
> >
> > Maybe. This is still a bad hack, and I don't like to legitimise such use
> > though. I hope Aubrey isn't relying on this alone for his device to work
> > because his customers might end up hitting fragmentation problems sooner
> > or later.
>
> I surely wish that Aubrey would give us some more clarity on
> how this should work. Maybe the others who want this feature could also
> speak up? I am not that clear on its purpose.


Sorry for the delay. Somehow this thread was put into the spam folder
of my gmail box. :(
The patch I posted several days ago works properly on my side. I'm
working on the blackfin-uclinux platform, so I'm not sure it works 100% on
other architectures. From the O_DIRECT threads, I know different
people suffer from VFS pagecache issues for different reasons. So I
really hope the patch can be improved.

On my side, when the VFS pagecache eats up all of the available memory,
applications that want to allocate a largeish block (order = 4?) will
fail. So the logic is as follows:

if request pagecache
        watermark = min + reserved_pagecache.
else
        watermark = min.

Here, assume min = 123 pages and reserved_pagecache = 200 pages. That means
that when the VFS pagecache eats up all of its available memory, there are
still 200 pages available for the application's allocations. Does that
make sense?
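
The rule above can be written out as a small stand-alone function. This is
a sketch of the described logic only, not the posted patch; alloc_allowed()
and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy version of the rule: page cache requests must leave
 * 'reserved_pagecache' extra pages free on top of the normal minimum.
 */
static bool alloc_allowed(unsigned long free_pages, unsigned long nr_request,
                          bool for_pagecache, unsigned long min,
                          unsigned long reserved_pagecache)
{
    unsigned long watermark = min;

    if (for_pagecache)
        watermark += reserved_pagecache;
    assert(watermark >= min);   /* the sum did not overflow */
    /* Allowed only if the request leaves at least 'watermark' pages free. */
    return free_pages >= nr_request && free_pages - nr_request >= watermark;
}
```

With min = 123 and reserved_pagecache = 200, a page cache request is refused
once it would leave fewer than 323 pages free, while other requests are
served until only 123 pages would remain. Note that this counts pages only;
it does not guarantee that the remaining 200 pages are contiguous.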


> > I hope Aubrey isn't relying on this alone for his device to work
> > because his customers might end up hitting fragmentation problems sooner
> > or later.


That's true. I wrote a replacement of buddy system, it's here:
http://lkml.org/lkml/2006/12/30/36.

That can improve the fragmentation problems on our platform.

Christoph - I can't find your original patch. Can you send it to me again?
It would be great if you merged all of the enhancements.

Thanks,
-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread KAMEZAWA Hiroyuki

one more thing...

On Tue, 23 Jan 2007 16:49:55 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> @@ -1168,6 +1170,11 @@ zonelist_scan:
>                         !cpuset_zone_allowed_softwall(zone, gfp_mask))
>                                 goto try_next_zone;
>  
> +               if ((gfp_mask & __GFP_PAGECACHE) &&
> +                               zone_page_state(zone, NR_FILE_PAGES) >
> +                                       zone->max_pagecache_pages)
> +                       goto try_next_zone;
> +
> +

I don't prefer to cause zone fallback by this.
This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64),
ZONE_NORMAL before ZONE_HIGHMEM (x86).
Very rapid page allocation can eat some amount of lower zone.
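
The fallback effect can be reproduced in a toy model of the zonelist walk.
The struct and function below are invented for illustration and only mimic
the "skip the zone when it is over its page cache limit" rule from the
quoted hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Toy zone: not the kernel's struct zone. */
struct toy_zone {
    const char *name;
    unsigned long free_pages;
    unsigned long file_pages;
    unsigned long max_pagecache_pages;
};

/*
 * Walk the zonelist from the preferred zone downwards and allocate one page
 * cache page from the first zone that is under its limit and has free pages.
 * Returns the index of the zone used, or -1 if every zone was skipped.
 */
static int alloc_pagecache_page(struct toy_zone *zonelist, size_t nr_zones)
{
    size_t i;

    assert(zonelist != NULL);
    for (i = 0; i < nr_zones; i++) {
        struct toy_zone *z = &zonelist[i];

        if (z->file_pages > z->max_pagecache_pages)
            continue;           /* the "goto try_next_zone" case */
        if (z->free_pages == 0)
            continue;
        z->free_pages--;
        z->file_pages++;
        return (int)i;
    }
    return -1;
}
```

With a Normal zone (1000 free pages, limit 100) ahead of a DMA zone (500
free pages, limit 50), 200 page cache allocations put 101 pages in Normal
and then 51 pages in DMA, although Normal still has 899 free pages.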

Regards,
-Kame



Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Christoph Lameter
On Wed, 24 Jan 2007, Nick Piggin wrote:

> > 1. Insure that anonymous pages that may contain performance
> >    critical data is never subject to swap.
> > 
> > 2. Insure rapid turnaround of pages in the cache.
> 
> So if these two aren't working properly at 100%, then I want to know the
> reason why. Or at least see what the workload and the numbers look like.

The reason for the anonymous page may be because data is rarely touched 
but for some reason the pages must stay in memory. Rapid turnaround is 
just one of the reasons that I vaguely recall but I never really 
understood what the purpose was.

> > 3. Reserve memory for other uses? (Aubrey?)
> 
> Maybe. This is still a bad hack, and I don't like to legitimise such use
> though. I hope Aubrey isn't relying on this alone for his device to work
> because his customers might end up hitting fragmentation problems sooner
> or later.

I surely wish that Aubrey would give us some more clarity on 
how this should work. Maybe the others who want this feature could also 
speak up? I am not that clear on its purpose.


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Christoph Lameter
On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:

> if (sc->may_swap &&
>     zone_page_state(zone, NR_FILE_PAGES) &&
>     !(current->flags & PF_MEMALLOC))
>         sc->may_swap = 0;

That is probably better than what we have so far.



Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Nick Piggin

Christoph Lameter wrote:
> This is a patch using some of Aubrey's work plugging it in what is IMHO
> the right way. Feel free to improve on it. I have repeatedly gotten
> requests to be able to limit the pagecache. With the revised VM statistics
> this is now actually possible. I'd like to know more about possible uses of
> such a feature.
>
> It may be useful to limit the size of the page cache for various reasons
> such as
>
> 1. Insure that anonymous pages that may contain performance
>    critical data is never subject to swap.
>
> 2. Insure rapid turnaround of pages in the cache.


So if these two aren't working properly at 100%, then I want to know the
reason why. Or at least see what the workload and the numbers look like.



> 3. Reserve memory for other uses? (Aubrey?)


Maybe. This is still a bad hack, and I don't like to legitimise such use
though. I hope Aubrey isn't relying on this alone for his device to work
because his customers might end up hitting fragmentation problems sooner
or later.

--
SUSE Labs, Novell Inc.


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread KAMEZAWA Hiroyuki
On Tue, 23 Jan 2007 16:49:55 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> If we enter reclaim and the number of page cache pages
> is too high then we switch off swapping during reclaim
> to avoid touching anonymous pages.

In general, I like this (kind of) feature.

> +       /*
> +        * If the page cache is too big then focus on page cache
> +        * and ignore anonymous pages
> +        */
> +       if (sc->may_swap && zone_page_state(zone, NR_FILE_PAGES)
> +                       > zone->max_pagecache_pages)
> +               sc->may_swap = 0;
> +


How about adding this (kind of) check?

if (sc->may_swap &&
    zone_page_state(zone, NR_FILE_PAGES) &&
    !(current->flags & PF_MEMALLOC))
        sc->may_swap = 0;

-Kame



[RFC] Limit the size of the pagecache

2007-01-23 Thread Christoph Lameter
This is a patch using some of Aubrey's work plugging it in what is IMHO
the right way. Feel free to improve on it. I have repeatedly gotten
requests to be able to limit the pagecache. With the revised VM statistics
this is now actually possible. I'd like to know more about possible uses of
such a feature.

It may be useful to limit the size of the page cache for various reasons
such as

1. Insure that anonymous pages that may contain performance
   critical data is never subject to swap.

2. Insure rapid turnaround of pages in the cache.

3. Reserve memory for other uses? (Aubrey?)

We add a new variable "pagecache_ratio" to /proc/sys/vm/ that
defaults to 100 (all memory usable for the pagecache).

The size of the pagecache is the number of file backed
pages in a zone which is available through NR_FILE_PAGES.

We skip zones that contain too many page cache pages in
the page allocator which may cause us to enter reclaim.

If we enter reclaim and the number of page cache pages
is too high then we switch off swapping during reclaim
to avoid touching anonymous pages.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/include/linux/gfp.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/gfp.h   2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/gfp.h        2007-01-23 17:54:51.750696888 -0600
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE   ((__force gfp_t)0x40000u) /* No fallback, no policies */
+#define __GFP_PAGECACHE  ((__force gfp_t)0x80000u) /* Page cache allocation */
 
 #define __GFP_BITS_SHIFT 20     /* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
Index: linux-2.6.20-rc5/include/linux/pagemap.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/pagemap.h       2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/pagemap.h    2007-01-23 18:13:14.310062155 -0600
@@ -62,12 +62,13 @@ static inline struct page *__page_cache_
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-   return __page_cache_alloc(mapping_gfp_mask(x));
+   return __page_cache_alloc(mapping_gfp_mask(x)| __GFP_PAGECACHE);
 }
 
 static inline struct page *page_cache_alloc_cold(struct address_space *x)
 {
-   return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+   return __page_cache_alloc(mapping_gfp_mask(x) |
+__GFP_COLD | __GFP_PAGECACHE);
 }
 
 typedef int filler_t(void *, struct page *);
Index: linux-2.6.20-rc5/include/linux/sysctl.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/sysctl.h        2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/include/linux/sysctl.h     2007-01-23 18:17:09.285324555 -0600
@@ -202,6 +202,7 @@ enum
        VM_PANIC_ON_OOM=33,     /* panic at out-of-memory */
        VM_VDSO_ENABLED=34,     /* map VDSO into new processes? */
        VM_MIN_SLAB=35,          /* Percent pages ignored by zone reclaim */
+       VM_PAGECACHE_RATIO=36,  /* percent of RAM to use as page cache */
 };
 
 
@@ -956,7 +957,6 @@ extern ctl_handler sysctl_intvec;
 extern ctl_handler sysctl_jiffies;
 extern ctl_handler sysctl_ms_jiffies;
 
-
 /*
  * Register a set of sysctl names by calling register_sysctl_table
  * with an initialised array of ctl_table's.  An entry with zero
Index: linux-2.6.20-rc5/kernel/sysctl.c
===================================================================
--- linux-2.6.20-rc5.orig/kernel/sysctl.c       2007-01-12 12:54:26.000000000 -0600
+++ linux-2.6.20-rc5/kernel/sysctl.c    2007-01-23 18:24:04.763443772 -0600
@@ -1023,6 +1023,17 @@ static ctl_table vm_table[] = {
                .extra2         = &one_hundred,
        },
 #endif
+       {
+               .ctl_name       = VM_PAGECACHE_RATIO,
+               .procname       = "pagecache_ratio",
+               .data           = &sysctl_pagecache_ratio,
+               .maxlen         = sizeof(sysctl_pagecache_ratio),
+               .mode           = 0644,
+               .proc_handler   = &sysctl_pagecache_ratio_sysctl_handler,
+               .strategy       = &sysctl_intvec,
+               .extra1         = &zero,
+               .extra2         = &one_hundred,
+       },
 #ifdef CONFIG_X86_32
        {
                .ctl_name       = VM_VDSO_ENABLED,
Index: linux-2.6.20-rc5/mm/page_alloc.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page_alloc.c       2007-01-16 23:26:28.000000000 -0600
+++ linux-2.6.20-rc5/mm/page_alloc.c    2007-01-23 18:11:40.484617205 -0600
@@ -59,6 +59,8 @@ unsigned long totalreserve_pages __read_
 long nr_swap_pages;
 

Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Christoph Lameter
On Wed, 24 Jan 2007, Nick Piggin wrote:

  1. Insure that anonymous pages that may contain performance
 critical data is never subject to swap.
  
  2. Insure rapid turnaround of pages in the cache.
 
 So if these two aren't working properly at 100%, then I want to know the
 reason why. Or at least see what the workload and the numbers look like.

The reason for the anonymous page may be because data is rarely touched 
but for some reason the pages must stay in memory. Rapid turnaround is 
just one of the reason that I vaguely recall but I never really 
understood what the purpose was.

> > 3. Reserve memory for other uses? (Aubrey?)
>
> Maybe. This is still a bad hack, and I don't like to legitimise such use
> though. I hope Aubrey isn't relying on this alone for his device to work
> because his customers might end up hitting fragmentation problems sooner
> or later.

I surely wish that Aubrey would give us some more clarity on 
how this should work. Maybe the others who want this feature could also 
speak up? I am not that clear on its purpose.


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread KAMEZAWA Hiroyuki

one more thing...

On Tue, 23 Jan 2007 16:49:55 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> @@ -1168,6 +1170,11 @@ zonelist_scan:
>              !cpuset_zone_allowed_softwall(zone, gfp_mask))
>                  goto try_next_zone;
>
> +        if ((gfp_mask & __GFP_PAGECACHE) &&
> +                zone_page_state(zone, NR_FILE_PAGES) >
> +                    zone->max_pagecache_pages)
> +            goto try_next_zone;
> +

I would rather not cause zone fallback this way.
This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64), or
ZONE_NORMAL before ZONE_HIGHMEM (x86).
Very rapid page allocation can eat up some amount of the lower zone.
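
To make the fallback concern concrete, here is a minimal userspace sketch of
a zonelist walk that applies the per-zone check from the quoted patch. The
struct fb_zone type and pick_zone_for_pagecache() are hypothetical stand-ins
invented for illustration, not kernel code; the real allocator also checks
watermarks and cpusets, which are left out here.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a zone; not the kernel's struct zone. */
struct fb_zone {
	const char *name;
	unsigned long free_pages;
	unsigned long file_pages;          /* stands in for NR_FILE_PAGES */
	unsigned long max_pagecache_pages; /* per-zone limit from the quoted patch */
};

/*
 * Walk the zonelist from the preferred zone downwards and return the index of
 * the zone that would satisfy a one-page pagecache allocation, or -1 if none.
 */
int pick_zone_for_pagecache(const struct fb_zone *zonelist, size_t nr_zones)
{
	for (size_t i = 0; i < nr_zones; i++) {
		const struct fb_zone *z = &zonelist[i];

		/* The check under discussion: skip a zone that is over its limit. */
		if (z->file_pages > z->max_pagecache_pages)
			continue;
		if (z->free_pages > 0)
			return (int)i;
	}
	return -1;
}
```

If the first zone in the list is over its pagecache limit but still has free
pages, the walk returns the next (lower) zone, which is the behaviour
described above.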

Regards,
-Kame



Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Aubrey Li

On 1/24/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 24 Jan 2007, Nick Piggin wrote:
>
> > > 1. Insure that anonymous pages that may contain performance
> > >    critical data is never subject to swap.
> > >
> > > 2. Insure rapid turnaround of pages in the cache.
> >
> > So if these two aren't working properly at 100%, then I want to know the
> > reason why. Or at least see what the workload and the numbers look like.
>
> The reason for the anonymous page may be because data is rarely touched
> but for some reason the pages must stay in memory. Rapid turnaround is
> just one of the reasons that I vaguely recall but I never really
> understood what the purpose was.
>
> > > 3. Reserve memory for other uses? (Aubrey?)
> >
> > Maybe. This is still a bad hack, and I don't like to legitimise such use
> > though. I hope Aubrey isn't relying on this alone for his device to work
> > because his customers might end up hitting fragmentation problems sooner
> > or later.
>
> I surely wish that Aubrey would give us some more clarity on
> how this should work. Maybe the others who want this feature could also
> speak up? I am not that clear on its purpose.


Sorry for the delay. Somehow this thread was put into the spam folder
of my gmail box. :(
The patch I posted several days ago works properly on my side. I'm
working on the blackfin-uclinux platform, so I'm not sure it works 100% on
other architectures. From the O_DIRECT threads, I know different
people suffer from VFS pagecache issues for different reasons, so I
really hope the patch can be improved.

On my side, when the VFS pagecache eats up all of the available memory,
applications that want to allocate a largish block (order >= 4?) will
fail. So the logic is as follows:

if request pagecache
 watermark =  min + reserved_pagecache.
else
 watermark =  min.

Here, assume min = 123 pages and reserved_pagecache = 200 pages. That means
that when the VFS pagecache has eaten up all of the memory available to it,
there are still 200 pages available for the application's allocations. Does
that make sense?
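
The watermark logic above can be sketched in plain C roughly as follows.
This is only an illustration under stated assumptions: struct wm_zone,
watermark_for() and may_allocate() are hypothetical names, and the real
allocator's watermark checks are more involved than a single comparison.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a zone; not the kernel's struct zone. */
struct wm_zone {
	unsigned long free_pages;         /* pages currently free */
	unsigned long pages_min;          /* "min" in the logic above */
	unsigned long reserved_pagecache; /* pages withheld from the pagecache */
};

/* Watermark that an allocation must leave intact. */
unsigned long watermark_for(const struct wm_zone *z, bool for_pagecache)
{
	if (for_pagecache)
		return z->pages_min + z->reserved_pagecache;
	return z->pages_min;
}

/* True if nr pages can be handed out without dipping below the watermark. */
bool may_allocate(const struct wm_zone *z, unsigned long nr, bool for_pagecache)
{
	unsigned long mark = watermark_for(z, for_pagecache);

	return z->free_pages >= nr && z->free_pages - nr >= mark;
}
```

With min = 123 and reserved_pagecache = 200, a zone that is down to 323 free
pages refuses further pagecache allocations but can still give 200 pages to
other callers.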


> > I hope Aubrey isn't relying on this alone for his device to work
> > because his customers might end up hitting fragmentation problems sooner
> > or later.


That's true. I wrote a replacement for the buddy system; it's here:
http://lkml.org/lkml/2006/12/30/36.

That can improve the fragmentation problems on our platform.

Christoph - I can't find your original patch. Can you send it to me again?
It would be great if you merged all of the enhancements.

Thanks,
-Aubrey


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Nick Piggin

Aubrey Li wrote:

> On 1/24/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
>
> > On Wed, 24 Jan 2007, Nick Piggin wrote:
> >
> > > > 1. Insure that anonymous pages that may contain performance
> > > >    critical data is never subject to swap.
> > > >
> > > > 2. Insure rapid turnaround of pages in the cache.
> > >
> > > So if these two aren't working properly at 100%, then I want to know the
> > > reason why. Or at least see what the workload and the numbers look like.
> >
> > The reason for the anonymous page may be because data is rarely touched
> > but for some reason the pages must stay in memory. Rapid turnaround is
> > just one of the reasons that I vaguely recall but I never really
> > understood what the purpose was.
> >
> > > > 3. Reserve memory for other uses? (Aubrey?)
> > >
> > > Maybe. This is still a bad hack, and I don't like to legitimise such use
> > > though. I hope Aubrey isn't relying on this alone for his device to work
> > > because his customers might end up hitting fragmentation problems sooner
> > > or later.
> >
> > I surely wish that Aubrey would give us some more clarity on
> > how this should work. Maybe the others who want this feature could also
> > speak up? I am not that clear on its purpose.
>
> Sorry for the delay. Somehow this thread was put into the spam folder
> of my gmail box. :(
> The patch I posted several days ago works properly on my side. I'm
> working on the blackfin-uclinux platform, so I'm not sure it works 100% on
> other architectures. From the O_DIRECT threads, I know different
> people suffer from VFS pagecache issues for different reasons, so I
> really hope the patch can be improved.


So we need to work out what those issues are and fix them.


> On my side, when the VFS pagecache eats up all of the available memory,
> applications that want to allocate a largish block (order >= 4?) will
> fail. So the logic is as follows:


Yeah, it will be failing at order=4, because the allocator won't try
very hard to reclaim pagecache pages at that cutoff point. This needs to
be fixed in the allocator.


> > > I hope Aubrey isn't relying on this alone for his device to work
> > > because his customers might end up hitting fragmentation problems sooner
> > > or later.



> That's true. I wrote a replacement for the buddy system; it's here:
> http://lkml.org/lkml/2006/12/30/36.
>
> That can improve the fragmentation problems on our platform.


That might be a good idea, but while the buddy system may not seem as
efficient and wastes space, it is actually really good for fragmentation.

Anyway, point being that you can't eliminate fragmentation, so you need
to cope with allocation failures or implement reserve pools if you want a
robust system.
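
As a rough illustration of the reserve pool idea, here is a minimal userspace
sketch: a fixed number of elements is preallocated, the normal allocator is
tried first, and the reserve is used only when that fails. Everything in it
(struct reserve_pool, POOL_SLOTS, the pool_* functions and failing_alloc) is
hypothetical and chosen for the example; it is not an existing kernel or libc
interface, and it is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_SLOTS 4

struct reserve_pool {
	size_t elem_size;
	size_t nr_free;           /* preallocated elements still held */
	void *slots[POOL_SLOTS];
};

/* Stand-in for an allocator under memory pressure: always fails. */
void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Preallocate POOL_SLOTS elements up front; 0 on success, -1 on failure. */
int pool_init(struct reserve_pool *p, size_t elem_size)
{
	p->elem_size = elem_size;
	p->nr_free = 0;
	for (size_t i = 0; i < POOL_SLOTS; i++) {
		void *e = malloc(elem_size);

		if (!e) {
			while (p->nr_free > 0)
				free(p->slots[--p->nr_free]);
			return -1;
		}
		p->slots[p->nr_free++] = e;
	}
	return 0;
}

/* Try the normal allocator first; fall back to the reserve if it fails. */
void *pool_alloc(struct reserve_pool *p, void *(*alloc_fn)(size_t))
{
	void *e = alloc_fn(p->elem_size);

	if (e)
		return e;
	if (p->nr_free > 0)
		return p->slots[--p->nr_free];
	return NULL; /* the caller must still cope with failure */
}

/* Refill the reserve before handing memory back to the system. */
void pool_free(struct reserve_pool *p, void *e)
{
	if (p->nr_free < POOL_SLOTS)
		p->slots[p->nr_free++] = e;
	else
		free(e);
}

/* Release whatever the reserve still holds. */
void pool_destroy(struct reserve_pool *p)
{
	while (p->nr_free > 0)
		free(p->slots[--p->nr_free]);
}
```

Passing malloc as alloc_fn gives the normal path; passing failing_alloc shows
the fallback. Once the reserve is empty, pool_alloc() returns NULL, so callers
still need a failure path.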

--
SUSE Labs, Novell Inc.


Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Christoph Lameter
On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:

> I would rather not cause zone fallback this way.
> This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64),

Hmmm... We could use node_page_state instead of zone_page_state.

> Very rapid page allocation can eat up some amount of the lower zone.

One question: For what purpose would you be using the page cache size
limitation?



Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread KAMEZAWA Hiroyuki
On Tue, 23 Jan 2007 20:30:16 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Wed, 24 Jan 2007, KAMEZAWA Hiroyuki wrote:
>
> > I would rather not cause zone fallback this way.
> > This may use ZONE_DMA before exhausting ZONE_NORMAL (ia64),
>
> Hmmm... We could use node_page_state instead of zone_page_state.
>
> > Very rapid page allocation can eat up some amount of the lower zone.
>
> One question: For what purpose would you be using the page cache size
> limitation?
 
This is my experience on the support desk for RHEL4.
(Therefore, this may not be suitable for talking about the current kernel.)

- One for stability
  When a customer constructs their database (Oracle), the system often goes
  to OOM.
  This is because the system cannot allocate ZONE_DMA memory for a 32-bit
  device (USB or e100).
  Not allowing almost all pages to be used as page cache (for temporary use)
  will be some help.
  (Note: the DB is constructed on ext3, so all writes are serialized and the
  system couldn't free page cache.)

- One for tuning
  Sometimes our customers request us to limit the size of the page-cache.

  Many customers' memory usage reaches 99.x%. (This is a very common situation.)
  If almost all memory is used by page-cache, we can assume we can free it.
  But the customer cannot estimate what amount of page-cache can be freed
  (without performance regression).

  When a customer wants to add a new application, he tunes the system.
  But memory usage is always 99%.
  A page-cache limit is useful when the customer tunes his system and finds
  the sets of data and page-cache.
  (Of course, we can use some other complicated resource management system for
  this.)
  This will allow the users to decide whether they need extra memory or not.

  And... some customers want to keep as much memory free as possible.
  99% memory usage makes them feel insecure ;)

-Kame





Re: [RFC] Limit the size of the pagecache

2007-01-23 Thread Aubrey Li

Christoph's patch is better than mine. The only thing I would add is that
zone->max_pagecache_pages should be checked so that it is never less than
zone->pages_low.

The good part of the patch is that it uses the existing reclaimer. But the
problem with the idea, in my opinion, is the existing reclaimer too. Consider
what happens when the VFS cache limit is hit: the reclaimer doesn't reclaim
all of the reclaimable pages, it just gives a few back. So on the next VFS
pagecache request, it is quite possible that the reclaimer is triggered
again. That means that after the limit is hit, reclaim will run every time a
filesystem operation allocates memory. In my mind, that is the point that
impacts the performance of applications.
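
A small simulation may help show why the amount freed per reclaim pass
matters here. The sketch below is a toy model, not kernel code:
count_reclaim_passes() and its batch parameter are hypothetical, and it
assumes the cache starts exactly at the limit and that every allocation is a
single page.

```c
/*
 * Count how often reclaim runs while nr_allocs single-page pagecache
 * allocations arrive at a cache that starts exactly at limit.
 * Each reclaim pass frees at most batch pages.
 */
unsigned long count_reclaim_passes(unsigned long limit, unsigned long batch,
				   unsigned long nr_allocs)
{
	unsigned long cached = limit;
	unsigned long passes = 0;

	for (unsigned long i = 0; i < nr_allocs; i++) {
		if (cached >= limit) {
			/* Limit hit: reclaim before the allocation may proceed. */
			cached -= (batch < cached) ? batch : cached;
			passes++;
		}
		cached++; /* the new pagecache page */
	}
	return passes;
}
```

In this model, freeing one page per pass means reclaim runs on every
allocation once the limit is reached, while freeing a larger batch spreads
the cost over that many allocations.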

-Aubrey