Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-30 Thread Matthew Dillon
Yes, this makes a lot of sense to me.   You are exercising the 
system in a way that breaks the LRU algorithm.  The buffer cache,
without your patch, is carefully tuned to deal with this case...
that is why vm_page_dontneed() exists and why the vm_object code
calls it.  This creates a little extra work when the buffer cache
cycles, but prevents the system from reusing pages that it actually
needs under certain types of load.  In particular, the situation the
system is saving itself from by making this call is the one where
a user is reading data file(s) sequentially which are far larger than
can reasonably be cached.  In that situation strict LRU operation
would result in terrible performance due to the system attempting to
unconditionally cache data it is going to have to throw away anyway,
and soon, which displaces older cached data that it will actually need
soon.   LRU isn't always the best policy.

When you disable vm_page_dontneed(), the huge amount of data you are
moving through the system creates enormous pressure on the rest
of the VM system, thus the slower performance when your data operations
exceed what can be reasonably cached.  This would also have a severely
detrimental effect on production systems running real loads.

It's a tradeoff.  The system is trading off some cpu overhead
generally in order to deal with a fairly common heavy-loading
case and in order to reduce the pressure on the VM system for
situations (such as reading a large file sequentially) which
have no business putting pressure on the VM system.  e.g. the
system is trying to avoid blowing away user B's cache when user A
reads a huge file.  Your patch is changing the tradeoff, but not
really making things better overall.  Sure, the buildworld test went
faster, but that's just one type of load.

I am somewhat surprised at your 32MB tests.  Are you sure you 
stabilized the dd before getting those timings?  It would take
more than one run of the dd on the file to completely cache it (that's
one of the effects of vm_page_dontneed()).  Since the system can't
predict whether a large file is going to be re-read over and over
again, or just read once, or even how much data will be read, it
depresses the priority of pages statistically so it might take
several full reads of the file for the system to realize that you
really do want to cache the whole thing.  In any case, 32MB dd's
should be fully cached in the buffer cache, with no rewiring of
pages occurring at all, so I'm not sure why your patch is faster
for that case.  It shouldn't be.  Or the 64MB case.  The 96MB
case is getting close to what your setup can cache reasonably.
The pre-patch code can deal with it, but with your patch you are
probably putting enough extra pressure on the VM system to force
the pageout daemon to run earlier than it would without the patch.
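
The statistical depression described above can be illustrated with a small
toy program.  This is only a sketch of the idea, not the actual
vm_page_dontneed() code; the 1-in-16 ratio and all names here are invented
for illustration.  Most calls demote the page toward the reclaim end of the
queues, a few leave it alone, so a file that really is re-read repeatedly can
still work its way back into the cache over several passes.

#include <stdio.h>

/* Toy page with just an activity count and a "demoted" flag. */
struct toy_page {
        int act_count;
        int on_inactive_head;
};

static unsigned int dontneed_calls;

/* Most calls demote the page; occasionally only nudge act_count. */
static void
depress_page_priority(struct toy_page *m)
{
        unsigned int n = ++dontneed_calls;

        if ((n & 0x0F) == 0) {
                /* 1 call in 16: leave the page where it is. */
                if (m->act_count > 0)
                        m->act_count--;
                return;
        }
        /* Otherwise make it an early reclaim candidate. */
        m->on_inactive_head = 1;
}

int
main(void)
{
        struct toy_page p = { 5, 0 };
        int demoted = 0;

        for (int i = 0; i < 160; i++) {
                p.on_inactive_head = 0;
                depress_page_priority(&p);
                demoted += p.on_inactive_head;
        }
        /* Expect 150 of the 160 calls to demote the page. */
        printf("demoted %d of 160 calls, act_count now %d\n",
            demoted, p.act_count);
        return (0);
}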

The VM system is a very finely tuned beast.  That isn't to say that
it can't be improved, I'm sure it can, and I encourage you to play
with it!  But you have to be wary of it as well.   The VM system is
tuned primarily for performance under heavy loads.  There is a slight
loss of performance under light loads because of the extra management.
You have to be sure not to screw up the heavy-load performance when
running light-load benchmarks.  A buildworld is a light load benchmark,
primarily because it execs programs so many times (the compiler)
that there are a lot of free VM pages sitting around for it to use.
Buildworlds do not load-test the VM system all that well!  A dd test
is not supposed to load-test the VM system either.  This is why we have
the vm_page_dontneed() calls... user B's cache shouldn't be blown away just
because user A is reading a large file.  We lose a little in a light
load test but gain a lot under real world loads which put constant 
pressure on the VM system.

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>

:I tried that on the same PC as my last benchmark.  The PC has 160MB
:RAM, so I created a file of 256MB.
:
:One pre-read (in order to stabilize the buffer cache) and four read
:tests were run consecutively for each of six distinct read sizes just
:after boot.  The average read times (in seconds) and speeds (in
:MB/sec) are shown below:
:
:
:                 without my patch        with my patch
:read size        time    speed           time    speed
:32MB             .497    65.5            .471    69.0
:64MB             1.02    63.6            .901    72.1
:96MB             2.24    50.5            5.52    18.9
:128MB            20.7    6.19            16.5    7.79
:192MB            32.9    5.83            32.9    5.83
:256MB            42.5    6.02

Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-30 Thread Seigo Tanimura
On Mon, 28 Oct 2002 00:54:57 -0800 (PST),
  Matthew Dillon <[EMAIL PROTECTED]> said:

dillon> I can demonstrate the issue with a simple test.  Create a large file
dillon> with dd, larger than physical memory:

dillon> dd if=/dev/zero of=test bs=1m count=4096    # create a 4G file.

dillon> Then dd (read) portions of the file and observe the performance.
dillon> Do this several times to get stable numbers.

dillon> dd if=test of=/dev/null bs=1m count=16  # repeat several times
dillon> dd if=test of=/dev/null bs=1m count=32  # etc...

dillon> You will find that read performance will drop in two significant
dillon> places:  (1) When the data no longer fits in the buffer cache and
dillon> the buffer cache is forced to teardown wirings and rewire other
dillon> pages from the VM page cache.  Still no physical I/O is being done.
dillon> (2) When the data no longer fits in the VM page cache and the system
dillon> is forced to perform physical I/O.

I tried that on the same PC as my last benchmark.  The PC has 160MB
RAM, so I created a file of 256MB.

One pre-read (in order to stabilize the buffer cache) and four read
tests were run consecutively for each of six distinct read sizes just
after boot.  The average read times (in seconds) and speeds (in
MB/sec) are shown below:


                  without my patch        with my patch
read size         time    speed           time    speed
32MB              .497    65.5            .471    69.0
64MB              1.02    63.6            .901    72.1
96MB              2.24    50.5            5.52    18.9
128MB             20.7    6.19            16.5    7.79
192MB             32.9    5.83            32.9    5.83
256MB             42.5    6.02            43.0    5.95


dillon> Its case (1) that you are manipulating with your patch, and as you can
dillon> see it is entirely dependent on the number of wired pages that the
dillon> system is able to maintain in the buffer cache.

The 128MB-read results are likely explained by that.

96MB-read gave interesting results.  Since vfs_unwirepages() passes
buffer pages to vm_page_dontneed(), it seems that the page scanner
reclaims buffer cache pages too aggressively.

The table below shows the results with my patch where
vfs_unwirepages() does not call vm_page_dontneed().


read size         time    speed
32MB              .503    63.7
64MB              .916    70.5
96MB              4.57    27.1
128MB             17.0    7.62
192MB             35.8    5.36
256MB             46.0    5.56


The 96MB-read results were a little bit better, although the reads of
larger sizes became slower.  The unwired buffer pages may be putting
pressure on user process pages and the page scanner.

-- 
Seigo Tanimura <[EMAIL PROTECTED]>




Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-28 Thread Matthew Dillon
:I was going to comment on fragmentation issues, but that seems to have
:been very well covered.  I would like to point out that removing the
:buffer_map not only contributes to kernel map fragmentation, but also
:contention for the kernel map.  It might also prevent us from removing
:giant from the kernel map because it would add another interrupt time
:consumer.

Yes.  Whatever the case, any sort of temporary KVA mapping management
system would need its own submap.  It would be insane to use the
kernel_map or kmem_map for this.

In regards to Seigo's patch:

The scalability issue is entirely related to the KVA mapping portion
of the buffer cache.  Only I/O *WRITE* performance is specifically
limited by the size of the buffer_map, due to the limited number of
dirty buffers allowed in the map.  This in turn is a restriction
required by filesystems which must keep track of 'dirty' buffers
in order to sequence out writes.  Currently the only way around this
limitation is to use mmap/MAP_NOSYNC.  In other words, we support
dirty VM pages that are not associated with the buffer cache but
most of the filesystem algorithms are still based around the
assumption that dirty pages will be mapped into dirty buffers.

I/O *READ* caching is limited only by the VM Page cache.   
The reason you got slightly better numbers with your patch
has nothing to do with I/O performance, it is simply related to 
the cost of the buffer instantiations and teardowns that occur in
the limited buffer_map space when mapping pages out of the VM page cache.
Since you could have more buffers, there were fewer instantiations
and teardowns.  It's that simple.

Unfortunately, this performance gain is *DIRECTLY* tied to the number
of pages wired into the buffer cache.  It is precisely the wired pages
portion of the instantiation and teardown that eats the extra cpu.
So the moment you regulate the number of wired pages in the system, you
will blow the performance you are getting.
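
For readers unfamiliar with what "instantiation and teardown" costs here,
the following schematic shows the per-page work involved: each page coming
from the VM page cache has to be wired and entered into the buffer's KVA
window, and the reverse on teardown.  This is not the actual vfs_bio.c code;
the include list is abbreviated, locking is omitted, and the helper names are
invented.  vm_page_wire(), vm_page_unwire(), pmap_qenter() and pmap_qremove()
are the existing kernel interfaces of that era.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <vm/vm_page.h>

/* Bring VM page-cache pages into a buffer: wire and map each one. */
static void
sketch_instantiate(struct buf *bp, vm_page_t *pages, int npages)
{
        int i;

        for (i = 0; i < npages; i++) {
                vm_page_wire(pages[i]);         /* pin the page */
                bp->b_pages[i] = pages[i];
        }
        bp->b_npages = npages;
        /* Map the wired pages into the buffer's fixed KVA window. */
        pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages, npages);
}

/* Tear the buffer down: unmap and unwire every page again. */
static void
sketch_teardown(struct buf *bp)
{
        int i;

        pmap_qremove((vm_offset_t)bp->b_kvabase, bp->b_npages);
        for (i = 0; i < bp->b_npages; i++) {
                /* Leave the page on a VM queue so it can be reused. */
                vm_page_unwire(bp->b_pages[i], 0);
                bp->b_pages[i] = NULL;
        }
        bp->b_npages = 0;
}

Regulating the number of wired pages means going through the teardown path
more often, which is exactly the CPU cost referred to above.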

I can demonstrate the issue with a simple test.  Create a large file
with dd, larger than physical memory:

dd if=/dev/zero of=test bs=1m count=4096    # create a 4G file.

Then dd (read) portions of the file and observe the performance.
Do this several times to get stable numbers.

dd if=test of=/dev/null bs=1m count=16  # repeat several times
dd if=test of=/dev/null bs=1m count=32  # etc...

You will find that read performance will drop in two significant
places:  (1) When the data no longer fits in the buffer cache and
the buffer cache is forced to teardown wirings and rewire other
pages from the VM page cache.  Still no physical I/O is being done.
(2) When the data no longer fits in the VM page cache and the system
is forced to perform physical I/O.

It's case (1) that you are manipulating with your patch, and as you can
see it is entirely dependent on the number of wired pages that the
system is able to maintain in the buffer cache.

-Matt





Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-28 Thread Matthew Dillon
Hmm.  Well, the real problem is not going to be the struct bio
but will instead be the filesystem support.  Filesystems expect
KVA mapped data from the buffer cache, and they use pointers
to the data all over the place.  

The buffer cache is very efficient, at least as long as
filesystem block sizes are <= 16K.  You can mix filesystem
block sizes as long as they are <= 16K and there will
be no remapping and no fragmentation and buffer cache
operation will be O(1).  If you mix filesystem block
sizes <= 16K and > 16K the buffer cache will start to hit
remapping and fragmentation cases (though it's really
the remapping cases that hurt).  It isn't a terrible 
problem, but it is an issue.
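
A back-of-the-envelope illustration of that 16K boundary follows.  It is
only a sketch, assuming the usual constants of that era (BKVASIZE = 16KB,
MAXBSIZE = 64KB) and a purely illustrative nbuf of 4000: buffer_map reserves
roughly nbuf * BKVASIZE of KVA, so a block no larger than BKVASIZE fits a
buffer's share of the map, while a larger block has to take KVA from its
neighbours, which is where the remapping and fragmentation come from.

#include <stdio.h>

#define BKVASIZE        16384           /* per-buffer KVA quota (assumed) */
#define MAXBSIZE        65536           /* largest filesystem block (assumed) */

int
main(void)
{
        int nbuf = 4000;                /* illustrative value only */

        printf("buffer_map KVA ~ %d MB (nbuf * BKVASIZE)\n",
            nbuf * BKVASIZE / (1024 * 1024));
        printf("a %d KB block spans %d BKVASIZE quotas; <= %d KB fits one\n",
            MAXBSIZE / 1024, MAXBSIZE / BKVASIZE, BKVASIZE / 1024);
        return (0);
}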

Tor has test cases for the above issue and could probably
give you more information on it.

The real performance problem is the fact that the buffer
cache exists at all.  I wouldn't bother fixing the remapping
issue and would instead focus on getting rid of the buffer
cache entirely.   As I said, the issue there is filesystem
block mapping support for meta-data (bitmaps, inodes), not I/O.

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>


:On Mon, 28 Oct 2002, Seigo Tanimura wrote:
:
:> On Thu, 24 Oct 2002 15:05:30 +1000 (EST),
:>   Bruce Evans <[EMAIL PROTECTED]> said:
:>
:> bde> Almost exactly what we have.  It turns out to be not very good, at least
:> bde> in its current implementation, since remapping is too expensive.  Things
:> bde> work OK to the extent that remapping is not required, but so would a
:> bde> much simpler implementation that uses less vm and more copying of data
:> bde> (copying seems to be faster than remapping).
:>
:> Which process is expensive in remapping?  Allocation of a KVA space?
:> Page wiring?  Or pmap operation?
:
:The allocation seemed to be most expensive when I looked at this about 2
:years ago.  The cause of the remapping seemed to be that different amounts
:of buffer kva were allocated for different buffer sizes.  Copying between
:filesystems with different block sizes therefore caused lots of remapping.
:I think this cause of remapping has been fixed.  VM has been improved too.
:I'm not sure how much in this area.
:
:Bruce
:




Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-27 Thread Bruce Evans
On Mon, 28 Oct 2002, Seigo Tanimura wrote:

> On Thu, 24 Oct 2002 15:05:30 +1000 (EST),
>   Bruce Evans <[EMAIL PROTECTED]> said:
>
> bde> Almost exactly what we have.  It turns out to be not very good, at least
> bde> in its current implementation, since remapping is too expensive.  Things
> bde> work OK to the extent that remapping is not required, but so would a
> bde> much simpler implementation that uses less vm and more copying of data
> bde> (copying seems to be faster than remapping).
>
> Which process is expensive in remapping?  Allocation of a KVA space?
> Page wiring?  Or pmap operation?

The allocation seemed to be most expensive when I looked at this about 2
years ago.  The cause of the remapping seemed to be that different amounts
of buffer kva were allocated for different buffer sizes.  Copying between
filesystems with different block sizes therefore caused lots of remapping.
I think this cause of remapping has been fixed.  VM has been improved too.
I'm not sure how much in this area.

Bruce





Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-27 Thread Seigo Tanimura
On Thu, 24 Oct 2002 15:05:30 +1000 (EST),
  Bruce Evans <[EMAIL PROTECTED]> said:

bde> Almost exactly what we have.  It turns out to be not very good, at least
bde> in its current implementation, since remapping is too expensive.  Things
bde> work OK to the extent that remapping is not required, but so would a
bde> much simpler implementation that uses less vm and more copying of data
bde> (copying seems to be faster than remapping).

Which process is expensive in remapping?  Allocation of a KVA space?
Page wiring?  Or pmap operation?

-- 
Seigo Tanimura <[EMAIL PROTECTED]>




Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-27 Thread Seigo Tanimura
On Wed, 23 Oct 2002 16:51:44 -0400 (EDT),
  Jeff Roberson <[EMAIL PROTECTED]> said:

jroberson> I do, however, like the page unwiring idea.  As long as it's not too
jroberson> expensive.  I have been somewhat disappointed that the buffer cache's
jroberson> buffers are hands off for the vm.  I'm confused about your approach
jroberson> though.  I think that the rewire function is unnecessary.  You could move
jroberson> this code into allocbuf() which would limit the number of times that you
jroberson> have to make a pass over this list and keep the maintenance of it in a
jroberson> more central place.  This would also remove the need for truncating the
jroberson> buf.

I just wanted to make sure that buffers not in the clean queue behave
as they used to without the patch.  At least, if a buffer does not
become busy or held, then it need not be rewired down.

-- 
Seigo Tanimura <[EMAIL PROTECTED]>




Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-23 Thread Bruce Evans
On Wed, 23 Oct 2002, Julian Elischer wrote:

> Bill Jolitz had a plan for 386BSD where all the buffers were nearly
> always unmapped from KVM. He was going to have a number of slots
> available for mapping them which would be used in a lifo order
>
> The number of slots was going to be somehow tunable
> but I don't remember the details.

We essentially have this now.  Most disk blocks are cached in physical
pages (VMIO pages) and are disassociated from the buffer cache and
not mapped into vm.  Some blocks are mapped into buffers.  There are
a limited number of slots (nbuf).  nbuf hasn't grown nearly as fast
as disks or main memory, so what was once a large non-unified buffer
cache (nbuf * MAXBSIZE worth of caching) is now just a small number
of vm mappings (nbuf of them).

> When you wanted to access a buffer, it was mapped for you
> (unless already mapped).. It would be unmapped when its slot
> was needed for something else. When you accessed a buffer already mapped
> it would move it back to the top of the list.
> Various events could pre-unmap a buffer. e.g. the related vm object was
> closed. (0 references).

Almost exactly what we have.  It turns out to be not very good, at least
in its current implementation, since remapping is too expensive.  Things
work OK to the extent that remapping is not required, but so would a
much simpler implementation that uses less vm and more copying of data
(copying seems to be faster than remapping).

Bruce





Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-23 Thread Poul-Henning Kamp
In message <[EMAIL PROTECTED]>, Julian Elischer writes:

>Bill Jolitz had a plan for 386BSD where all the buffers were nearly
>always unmapped from KVM. He was going to have a number of slots
>available for mapping them which would be used in a lifo order

This entire area needs to be rethought.

And by "rethought" I really mean try to redesign it from scratch
to match our current needs and see what that leads to compared to
the stuff we have.

One of my first TODO items after the 5.x/6.x branch is to give struct bio
the ability to communicate in a vector of separate pages, not
necessarily mapped.  This gives us a scatter/gather ability in
the entire disk I/O path.

This opens up a host of possibilities for things like clustering,
background writes (using copy-on-write pages) etc etc etc.

Needless to say, it will also drastically change the working
environment for struct buf.
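
Purely as a thought experiment on the idea above (none of these field names
come from the real struct bio; they only make the page-vector, scatter/gather
notion concrete), such a request might carry something like:

#include <vm/vm.h>
#include <vm/vm_page.h>

struct bio_pagevec_sketch {
        vm_page_t       *bpv_pages;     /* physical pages to transfer */
        int              bpv_npages;    /* number of entries above */
        int              bpv_offset;    /* byte offset into the first page */
        long             bpv_length;    /* total transfer length in bytes */
        void            *bpv_kva;       /* optional mapping, NULL if unmapped */
};

A driver that can DMA straight from the page vector would never need
bpv_kva at all, which is what makes unmapped I/O and copy-on-write
background writes attractive.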

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.




Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-23 Thread Julian Elischer


On Wed, 23 Oct 2002, Jeff Roberson wrote:

> 
> I do, however, like the page unwiring idea.  As long as it's not too
> expensive.  I have been somewhat disappointed that the buffer cache's
> buffers are hands off for the vm.  I'm confused about your approach
> though.  I think that the rewire function is unnecessary.  You could move
> this code into allocbuf() which would limit the number of times that you
> have to make a pass over this list and keep the maintenance of it in a
> more central place.  This would also remove the need for truncating the
> buf.
> 

Bill Jolitz had a plan for 386BSD where all the buffers were nearly
always unmapped from KVM. He was going to have a number of slots
available for mapping them which would be used in a lifo order

The number of slots was going to be somehow tunable
but I don't remember the details.

When you wanted to access a buffer, it was mapped for you
(unless already mapped).. It would be unmapped when its slot
was needed for something else. When you accessed a buffer already mapped
it would move it back to the top of the list.
Various events could pre-unmap a buffer. e.g. the related vm object was
closed. (0 references).






Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-23 Thread Jeff Roberson


On Wed, 23 Oct 2002, Seigo Tanimura wrote:

> On Wed, 23 Oct 2002 16:44:06 +1000 (EST),
>   Bruce Evans <[EMAIL PROTECTED]> said:
>
> Incidentally, Solaris 7 on sun4u reserves a space of 256MB in the KVM
> according to Solaris Internals.  On i386 (x86), the size is only 4MB.
> Not sure whether they use those spaces in a pure form, or they cluster
> some consecutive pages (which leads to fragmentation), though...
>
> NetBSD UBC also makes a map dedicated to buffers in kernel_map.
>
> Maybe there is a point in having a map dedicated to the buffer space for
> better stability, and the size of the buffer map could be much
> smaller than now.  During my testing, I found that only up to 6-7MB of
> the buffers out of 40-50MB were wired down (ie busy, locked for
> background write or dirty) at most.
>

I was going to comment on fragmentation issues, but that seems to have
been very well covered.  I would like to point out that removing the
buffer_map not only contributes to kernel map fragmentation, but also
contention for the kernel map.  It might also prevent us from removing
giant from the kernel map because it would add another interrupt time
consumer.

I do, however, like the page unwiring idea.  As long as it's not too
expensive.  I have been somewhat disappointed that the buffer cache's
buffers are hands off for the vm.  I'm confused about your approach
though.  I think that the rewire function is unnecessary.  You could move
this code into allocbuf() which would limit the number of times that you
have to make a pass over this list and keep the maintenance of it in a
more central place.  This would also remove the need for truncating the
buf.

I have some other ideas for the buffer cache that you may be interested
in.  I have been discussing them in private for some time but I'll bring
it up on arch soon so that others can comment.

Cheers,
Jeff





Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-23 Thread Seigo Tanimura
On Wed, 23 Oct 2002 16:44:06 +1000 (EST),
  Bruce Evans <[EMAIL PROTECTED]> said:

bde> I should be the last to defend the current design and implementation of
bde> the buffer cache, since I think it gets almost everything wrong (the
bde> implementation is OK, but has vast complications to work around design
bde> errors), but I think buffer_map is one of the things that it gets right
bde> (if we're going to have buffers at all).
(snip)
bde> I use the following changes in -current to enlarge the buffer cache and
bde> avoid fragmentation.  These only work because I don't have much physical
bde> memory (512MB max).  Even i386's have enough vm for the pure form of
bde> buffer_map to work:
bde> - enlarge BKVASIZE to MAXBSIZE so that fragmentation can not (should not?)
bde>   occur.
bde> - enlarge nbuf by a factor of (my_BKVASIZE / current_BKVASIZE) to work
bde>   around bugs.  The point of BKVASIZE got lost somewhere.
bde> - enlarge nbuf and associated variables by another factor of 2 or 4 to
bde>   get a larger buffer cache.
bde> This is marginal for 512MB physical, and probably wouldn't work if I had
bde> a lot of mbufs.  nbuf is about 4000 and buffer_map takes about 256MB.
bde> 256MB is a lot, but nbuf = 4000 isn't a lot.  I used buffer caches
bde> with 2000 * 1K buffers under Minix and Linux before FreeBSD, and ISTR
bde> having an nbuf of 5000 or so in FreeBSD-1.1.  At least 2880 buffers are
bde> needed to properly cache a tiny 1.44MB floppy with an msdosfs file
bde> system with a block size of 512, and that was an important test case.

bde> End of FreeBSD-[2-5] history.
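
As a quick sanity check of the figures quoted above (assuming the usual
MAXBSIZE of 64KB): with BKVASIZE raised to MAXBSIZE, roughly 4000 buffers
reserve about 4000 x 64KB, i.e. around 250MB of buffer_map KVA, which
matches the "about 256MB" figure and explains why this is marginal with
only 512MB of physical memory.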

Incidentally, Solaris 7 on sun4u reserves a space of 256MB in the KVM
according to Solaris Internals.  On i386 (x86), the size is only 4MB.
Not sure whether they use those spaces in a pure form, or they cluster
some consecutive pages (which leads to fragmentation), though...

NetBSD UBC also makes a map dedicated to buffers in kernel_map.

Maybe there is a point in having a map dedicated to the buffer space for
better stability, and the size of the buffer map could be much
smaller than now.  During my testing, I found that at most 6-7MB of
the buffers out of 40-50MB were wired down (i.e. busy, locked for
background write, or dirty).

-- 
Seigo Tanimura <[EMAIL PROTECTED]>




Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-22 Thread Bruce Evans
On Tue, 22 Oct 2002, Seigo Tanimura wrote:

> Introduction:
>
> The I/O buffers of the kernel are currently allocated in buffer_map,
> which is sized statically upon boot and never grows.  This limits the scale of
> I/O performance on a host with large physical memory.  We used to tune
> NBUF to cope with that problem.  This workaround, however, results in
> a lot of wired pages not available for user processes, which is not
> acceptable for memory-bound applications.
>
> In order to run both I/O-bound and memory-bound processes on the same
> host, it is essential to achieve:
>
> A) allocation of buffer from kernel_map to break the limit of a map
>size, and
>
> B) page reclaim from idle buffers to regulate the number of wired
>pages.
>
> The patch at:
>
> http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz

I should be the last to defend the current design and implementation of
the buffer cache, since I think it gets almost everything wrong (the
implementation is OK, but has vast complications to work around design
errors), but I think buffer_map is one of the things that it gets right
(if we're going to have buffers at all).

Some history of this problem:

FreeBSD-1:

Allocating from kernel_map instead of buffer_map would almost take us
back to FreeBSD-1 where buffers were allocated from kmem_map using
malloc().  This caused larger problems with fragmentation.  Some of
these were due to foot-shooting, but I think large-memory machines
give essentially the same problems and complete fragmentation of
kernel_map would cause more problems than complete fragmentation of
any other map.  Part of the foot-shooting was to allocate too little
vm to the kernel and correspondingly too little vm to kmem_map.  The
(i386) kernel was originally at 0xFE000000, so there was only 32MB
of kernel vm.  32MB was far too small even for the relatively small
physical memories at the time (1992 or 1993), so this was changed to
0xF0000000 in FreeBSD-1.1.5.  Then there was 256MB of kernel vm.  I
suspect that this increase reduced the fragmentation problems to
insignificance in most but not all cases.

Some of the interesting cases at the time of FreeBSD-1 were:
- machines with a small amount of physical memory.  These should have
  few problems since there is not enough physical memory to make the
  maps more than sparse (unless the maps are undersized).
- machines with a not so small amount of physical memory.  It's possible
  that the too-small-in-general value for nbuf limits problems.
- machines which only use one type of filesystem with one (small?) block
  size.  If all allocations have the same size, then there need be no
  fragmentation.  I'm not sure how strong this effect was in FreeBSD-1.
  malloc() used a power-of-2 algorithm, but only up to a certain size
  which covered 4K-blocks but possibly not 8K-blocks.  Note that machines
  with large amounts of memory were likely to be specialized machines so
  were likely to take advantage of this without really trying, just by
  not mounting or not significantly using unusual filesystems like
  msdosfs, ext2fs and cd9660.

I used the following allocation policies in my version of FreeBSD-1.1.5:
- enlarge nbuf and the limit on buffer space (freebufspace) by a factor
  of 2 or 4 to get a larger buffer cache
- enlarge nbuf by another factor of 8, but don't enlarge freebufspace,
  so that buffers of size 512 can hold as much as buffers of size 4096.
  I didn't care about buffers of size 8192 or larger at the time.
- actually enforce the freebufspace limit by discarding buffers in
  allocbuf() using a simplistic algorithm.
This worked well enough, but I only tested it on 486's with 8-16MB.
The buffer cache had size 2MB or so.

End of FreeBSD-1 history.

FreeBSD-[2-5]:

Use of buffer_map was somehow implemented at the beginning in rev.1.2
of vfs_bio.c although this wasn't in FreeBSD-1.1.5.  Either I'm missing
some history or it was only in dyson's tree for FreeBSD-1.  Rev.1.2 used
buffer_map in its purest form: each of nbuf buffers has a data buffer
consisting of MAXBSIZE bytes of vm attached to it at bufinit() time.
The allocation never changes and we simply map physical pages into the
vm when we have actual data.  The problems with this are that MAXBSIZE
is rather large and nbuf should be rather large (and/or dynamic).
Subsequent changes add vast complications to reduce the amount of vm.
I think these complications should only exist on machines with limited
amounts of vm (mainly i386's).
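
A schematic of that pure form follows.  It is not the historical rev.1.2
code; the include list is abbreviated and the setup function is invented.
Every one of nbuf buffers is given a fixed MAXBSIZE slice of KVA once, at
bufinit() time, and afterwards only physical pages are mapped in and out of
that fixed window.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <vm/vm.h>
#include <vm/pmap.h>

static void
pure_bufinit_sketch(struct buf *bufs, int nbuf, vm_offset_t buffers_va)
{
        int i;

        for (i = 0; i < nbuf; i++) {
                /* Fixed, never-changing KVA slice for this buffer. */
                bufs[i].b_kvabase = (caddr_t)(buffers_va +
                    (vm_offset_t)i * MAXBSIZE);
                bufs[i].b_kvasize = MAXBSIZE;
                bufs[i].b_data = bufs[i].b_kvabase;
                bufs[i].b_npages = 0;   /* pages are mapped in on demand */
        }
}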

One of the complications was to reintroduce fragmentation problems.
buffer_map only has enough space for nbuf buffers of size BKVASIZE,
and the mappings are not statically allocated.  Another of the
complications is to discard buffers to reduce the fragmentation problems.
Perhaps similar defragmentation would have worked well enough in
FreeBSD-1.1.  I suspect that your change depends on this defragmentation,
but I don't think the defragmentation can work as well, since it can
only touc

Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-22 Thread Julian Elischer


On Tue, 22 Oct 2002, Seigo Tanimura wrote:

> Introduction:
> 
[...]
> 
> The patch at:
> 
> http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz

Cool..


> 
> 
> -j      baseline                        w/ my patch
>         real     user     sys          real     user     sys
> 1       1608.21  1387.94  125.96       1577.88  1391.02  100.90
> 10      1576.10  1360.17  132.76       1531.79  1347.30  103.60
> 20      1568.01  1280.89  133.22       1509.36  1276.75  104.69
> 30      1923.42  1215.00  155.50       1865.13  1219.07  113.43
> 

definitely statistically significant.


> 
> Another interesting results are the numbers of swaps, shown below.
> 
> -j      baseline        w/ my patch
> 1       0               0
> 10      0               0
> 20      141             77
> 30      530             465

this too.

> 
> 
> Comments and flames are welcome.  Thanks a lot.
> 

No flames..


Julian






Re: Dynamic growth of the buffer and buffer page reclaim

2002-10-22 Thread Poul-Henning Kamp
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes:

>The patch at:
>
>http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz


>Comments and flames are welcome.  Thanks a lot.

This looks very very interesting!

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED] | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.




Dynamic growth of the buffer and buffer page reclaim

2002-10-22 Thread Seigo Tanimura
Introduction:

The I/O buffers of the kernel are currently allocated in buffer_map,
which is sized statically upon boot and never grows.  This limits the scale of
I/O performance on a host with large physical memory.  We used to tune
NBUF to cope with that problem.  This workaround, however, results in
a lot of wired pages not available for user processes, which is not
acceptable for memory-bound applications.

In order to run both I/O-bound and memory-bound processes on the same
host, it is essential to achieve:

A) allocation of buffer from kernel_map to break the limit of a map
   size, and

B) page reclaim from idle buffers to regulate the number of wired
   pages.

The patch at:

http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz

implements buffer allocation from kernel_map and reclaim of buffer
pages.  With this patch, make kernel-depend && make kernel completes
about 30-60 seconds faster on my PC.


Implementation in Detail:

A) is easy; first you need to do s/buffer_map/kernel_map/.  Since an
arbitrary number of buffer pages can be allocated dynamically, buffer
headers (struct buf) should be allocated dynamically as well.  Glue
them together into a list so that they can be traversed by boot()
et al.
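
A minimal sketch of that list follows.  It only illustrates the description
above and does not use the identifiers from the patch at the URL; the include
list is abbreviated and locking of the list is omitted.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/queue.h>
#include <sys/buf.h>

static MALLOC_DEFINE(M_BUFHDR_SKETCH, "bufhdr", "dynamic buffer headers");

/* Wrapper so we do not have to steal a list field inside struct buf. */
struct bufhdr_link {
        TAILQ_ENTRY(bufhdr_link) bl_link;
        struct buf               bl_buf;
};

static TAILQ_HEAD(, bufhdr_link) allbufs = TAILQ_HEAD_INITIALIZER(allbufs);

/* Allocate one header dynamically and glue it onto the global list. */
static struct buf *
alloc_buf_header(void)
{
        struct bufhdr_link *bl;

        bl = malloc(sizeof(*bl), M_BUFHDR_SKETCH, M_WAITOK | M_ZERO);
        TAILQ_INSERT_TAIL(&allbufs, bl, bl_link);
        return (&bl->bl_buf);
}

/* Walk every header, e.g. from boot() when flushing dirty buffers. */
static void
foreach_buf_header(void (*fn)(struct buf *))
{
        struct bufhdr_link *bl;

        TAILQ_FOREACH(bl, &allbufs, bl_link)
                fn(&bl->bl_buf);
}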

In order to accomplish B), we must find buffers that both the filesystem
and I/O code will not touch.  The clean buffer queue holds such
buffers.  (Exception: if the vnode associated with a clean buffer is
held by the namecache, it may access the buffer page.)  Thus, we
should unwire the pages of a buffer prior to enqueuing it to the clean
queue, and rewire the pages down in bremfree() if the pages are not
reclaimed.

Although unwiring gives a page a chance of being reclaimed, we can go
further.  In Solaris, it is known that file cache pages should be
reclaimed prior to the other kinds of pages (anonymous, executable,
etc.) for better performance.  Mainly due to a lack of time to work
on distinguishing the kind of a page to be unwired, I simply pass all
unwired pages to vm_page_dontneed().  This approach places most of the
unwired buffer pages at just one step to the cache queue.
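
A sketch of the unwire/rewire pair just described: the bodies are simplified
and locking is omitted, and the patch at the URL above is the authoritative
version, not these functions.  vm_page_unwire(), vm_page_wire() and
vm_page_dontneed() are the existing VM interfaces mentioned in this thread.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/* Called before the buffer is put on the clean queue. */
static void
vfs_unwirepages_sketch(struct buf *bp)
{
        vm_page_t m;
        int i;

        for (i = 0; i < bp->b_npages; i++) {
                m = bp->b_pages[i];
                vm_page_unwire(m, 0);   /* drop the buffer's wiring */
                /*
                 * Tell the VM system these pages are good early
                 * candidates for reclaim; cf. the Future Works note
                 * about scanning only buffer pages instead.
                 */
                vm_page_dontneed(m);
        }
}

/*
 * Called from bremfree() when the buffer leaves the clean queue
 * with its pages still resident: wire them down again.
 */
static void
vfs_rewirepages_sketch(struct buf *bp)
{
        int i;

        for (i = 0; i < bp->b_npages; i++)
                vm_page_wire(bp->b_pages[i]);
}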


Experimental Evaluation and Results:

The times taken to complete make kernel-depend && make kernel just
after booting into single-user mode have been measured on my ThinkPad
600E (CPU: Pentium II 366MHz, RAM: 160MB) by time(1).  The number
passed to the -j option of make(1) has been varied from 1 to 30 in
order to control the memory pressure from user processes.
The baseline is the kernel without my patch.

The following table shows the results.  All of the times are in
seconds.

-j      baseline                        w/ my patch
        real     user     sys          real     user     sys
1       1608.21  1387.94  125.96       1577.88  1391.02  100.90
10      1576.10  1360.17  132.76       1531.79  1347.30  103.60
20      1568.01  1280.89  133.22       1509.36  1276.75  104.69
30      1923.42  1215.00  155.50       1865.13  1219.07  113.43

Most of the improvements in the real times are accomplished by the
speedup of system calls.  The hit ratio of getblk() may have increased,
but this has not been examined yet.

Other interesting results are the numbers of swaps, shown below.

-j      baseline        w/ my patch
1       0               0
10      0               0
20      141             77
30      530             465

Since the baseline kernel does not free buffer pages at all(*), it may
be putting too much pressure on the pages.

(*) bfreekva() is called only when the whole KVA is too fragmented.


Userland Interfaces:

The sysctl variable vfs.bufspace now reports the size of the pages
allocated for buffers, both wired and unwired.  A new sysctl variable,
vfs.bufwiredspace, reports the size of the buffer pages wired down.

vfs.bufkvaspace returns the size of the KVA space for buffer.
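
For completeness, a small userland example (a sketch, not part of the patch)
of reading those three counters with sysctlbyname(3); it queries the size
first so it does not assume the width of the kernel counters.

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static void
print_vfs_counter(const char *name)
{
        long long val = 0;
        size_t len;

        /* Ask for the size, then fetch at most sizeof(val) bytes. */
        if (sysctlbyname(name, NULL, &len, NULL, 0) == -1 ||
            len > sizeof(val) ||
            sysctlbyname(name, &val, &len, NULL, 0) == -1) {
                printf("%-20s unavailable (kernel without the patch?)\n",
                    name);
                return;
        }
        /* Little-endian (e.g. i386) assumed to keep the sketch simple. */
        printf("%-20s %lld bytes\n", name, val);
}

int
main(void)
{
        print_vfs_counter("vfs.bufspace");
        print_vfs_counter("vfs.bufwiredspace");
        print_vfs_counter("vfs.bufkvaspace");
        return (0);
}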


Future Works:

The handling of unwired pages can be improved by scanning only buffer
pages.  In that case, we may have to run the vm page scanner more
frequently, as does Solaris.

vfs.bufspace does not track the buffer pages reclaimed by the page
scanner.  They are only counted when the buffers associated with those
pages are removed from the clean queue, which is too late.

Benchmark tools concentrating on disk I/O performance (bonnie, iozone,
postmark, etc) may be more suitable than make kernel for evaluation.


Comments and flames are welcome.  Thanks a lot.

-- 
Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
