Re: vm_pageout_scan badness

2000-12-07 Thread Matt Dillon


:
:Matt Dillon wrote:
:> 
:> You may be able to achieve an effect very similar to mlock(), but
:> runnable by the 'news' user without hacking the kernel, by
:> writing a quick little C program to mmap() the two smaller history
:> files and then madvise() the map using MADV_WILLNEED in a loop
:> with a sleep(15).  Keeping in mind that expire may recreate those
:> files, the program should unmap, close(), and re-open()/mmap/madvise the
:> descriptors every so often (like once a minute).  You shouldn't have
:> to access the underlying pages but that would also have a similar
:> effect.  If you do, use a volatile pointer so GCC doesn't optimize
:> the access out of the loop.  e.g.
:
:Err... wouldn't it be better to write a quick little C program that
:mlocked the files? It would need suid, sure, but as a small program
:without user input it wouldn't have security problems.
:
:-- 
:Daniel C. Sobral   (8-DCS)
:[EMAIL PROTECTED]
:[EMAIL PROTECTED]

mlock()ing is dangerous when used on a cyclic file.  If you aren't
careful you can run your system out of memory.

-Matt






Re: vm_pageout_scan badness

2000-12-07 Thread Daniel C. Sobral

Matt Dillon wrote:
> 
> You may be able to achieve an effect very similar to mlock(), but
> runnable by the 'news' user without hacking the kernel, by
> writing a quick little C program to mmap() the two smaller history
> files and then madvise() the map using MADV_WILLNEED in a loop
> with a sleep(15).  Keeping in mind that expire may recreate those
> files, the program should unmap, close(), and re-open()/mmap/madvise the
> descriptors every so often (like once a minute).  You shouldn't have
> to access the underlying pages but that would also have a similar
> effect.  If you do, use a volatile pointer so GCC doesn't optimize
> the access out of the loop.  e.g.

Err... wouldn't it be better to write a quick little C program that
mlocked the files? It would need suid, sure, but as a small program
without user input it wouldn't have security problems.

-- 
Daniel C. Sobral(8-DCS)
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]

"The bronze landed last, which canceled that method of impartial
choice."







Re: vm_pageout_scan badness

2000-12-07 Thread Daniel C. Sobral

Matt Dillon wrote:
> 
> One possible fix would be to have the kernel track cache hits and misses
> on a file and implement a heuristic from those statistics which is used
> to reduce the 'initial page weighting' for pages read-in from the
> 'generally uncacheable file'.  This would cause the kernel to reuse
> those cache pages more quickly and prevent it from throwing away (reusing)
> cache pages associated with more cacheable files like the .index and
> .hash files.  I don't have time to do this now, but it's definitely
> something I am going to keep in mind for a later release.

That sounds very, very clever. In fact, it sounds so clever I keep
wondering what is the huge flaw with it. :-) Still, promising, to say
the least.

-- 
Daniel C. Sobral(8-DCS)
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]

"The bronze landed last, which canceled that method of impartial
choice."






Re: vm_pageout_scan badness

2000-12-06 Thread Matt Dillon

Excellent. 

What I believe is going on is that without the madvise()/mlock() the
general accesses to the 1 GB main history file are causing the pages to be
flushed from the .hash and .index files too quickly.  The performance
problems in general appear to be due to the system trying to cache more
of the (essentially uncacheable) main history file at the expense of
not caching as much of the (eminently cacheable) .index and .hash files.

One possible fix would be to have the kernel track cache hits and misses
on a file and implement a heuristic from those statistics which is used
to reduce the 'initial page weighting' for pages read-in from the
'generally uncacheable file'.  This would cause the kernel to reuse 
those cache pages more quickly and prevent it from throwing away (reusing)
cache pages associated with more cacheable files like the .index and
.hash files.  I don't have time to do this now, but it's definitely
something I am going to keep in mind for a later release.
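
A minimal sketch of what such a per-file hit/miss heuristic might look
like (the structure, field names, and thresholds below are hypothetical
illustrations, not actual FreeBSD code):

    /*
     * Hypothetical sketch of the per-file cache hit/miss heuristic
     * described above -- not actual FreeBSD code.  Assume each file's
     * VM object carries two counters bumped from the read path.
     */
    struct vm_cache_stats {
            unsigned long hits;     /* reads satisfied from cache */
            unsigned long misses;   /* reads that had to go to disk */
    };

    /*
     * Pick the initial page weighting for a page read in from this file.
     * Files that historically miss the cache get a lower initial weight,
     * so the pageout daemon reuses their pages sooner.
     */
    static int
    initial_page_weight(const struct vm_cache_stats *st, int default_weight)
    {
            unsigned long total = st->hits + st->misses;

            if (total < 1000)                /* not enough history yet */
                    return (default_weight);
            if (st->hits < total / 10)       /* <10% hit rate: barely cacheable */
                    return (0);
            if (st->hits < total / 2)        /* <50% hit rate: weight lightly */
                    return (default_weight / 2);
            return (default_weight);
    }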

-Matt






Re: vm_pageout_scan badness

2000-12-06 Thread News History File User

> :The mlock man page refers to some system limit on wired pages; I get no
> :error when mlock()'ing the hash file, and I'm reasonably sure I tweaked
> :the INN source to treat both files identically (and on the other machines
> :I have running, the timestamps of both files remains pretty much unchanged).
> :I'm not sure why I'm not seeing the desired results here with both files
> 
> I think you are on to something here.  It's got to be mlock().  Run
> 'limit' from csh/tcsh and you will see a 'memorylocked' resource.
> Whatever this resource is as of when innd is run -- presumably however
> it is initialized for the 'news' user (see /etc/login.conf) is going

Yep, `unlimited'...  same as the bash `ulimit -a'.  OH NO.  I HAVE IT
SET TO `infinity' IN LOGIN DOT CONF, no wonder it is all b0rken-like.

The weird thing is that mlock() does return success, the amount of
wired memory matches the two files, and I've seen nothing obvious in
the source code as to why it's different, but I'll keep plugging away
at it.


> History files are notorious for random I/O... the problem is due
> to the hash table being, well, a hash table.  The hash table 
> lookups are bad enough but this will also result in random-like
> lookups on the main history file.  You get a little better
> locality of reference on the main history file (meaning the system

Ah, but ...  this is where the recent history format (based on MD5 hashes,
introduced as dbz v6 around the time you were busy with Diablo and your
history mechanism there) differs from what you remember -- and,
speaking of your 64-bit CRC history mechanism, whatever happened to the
links that would get you there from the backplane homepage? -- in this
case, you don't do the random-like lookups to verify message-ID presence
in the text file at all.  Everything you do is in the data in the two hash
tables.  At least for transit.  I'm not sure whether reader requests
require a hit on the main file -- it'd be worth pointing a Diablo
frontend at such a box to see how it does there, even when the overview
performance for traditional readership is, uh, suboptimal.  I think they
do, but that's a trivial seek to one specific known offset.

I'm sure this is applicable to other databases somehow, for those who
aren't doing news and are bored stiff by this.


> At the moment madvise() MADV_WILLNEED does nothing more than activate
> the pages in question and force them into the process's mmap.
> You have to call it every so often to keep the pages 'fresh'... calling
> it once isn't going to do anything.  

Well, it definitely does do a Good Thing when I call it once, as you
can see from the initial timer numbers, which approach the long-running
values I'm used to (values I had tried to simulate by doing lookups on a
small fraction of history entries, in the hope of activating a majority
of the needed pages -- not perfect, but a decent hack).  You can see
from the timestamps of the debugging here that while it slows down the
startup somewhat, the work of reading in the data happens quickly and
is a definite positive tradeoff:

Dec  6 07:32:14 crotchety innd: dbz openhashtable /news/db/history.index
Dec  6 07:32:14 crotchety innd: dbz madvise WILLNEED ok
Dec  6 07:32:14 crotchety innd: dbz madvise RANDOM ok
Dec  6 07:32:14 crotchety innd: dbz madvise NOSYNC ok
Dec  6 07:32:27 crotchety innd: dbz mlock ok
Dec  6 07:32:27 crotchety innd: dbz openhashtable /news/db/history.hash
Dec  6 07:32:27 crotchety innd: dbz madvise WILLNEED ok
Dec  6 07:32:27 crotchety innd: dbz madvise RANDOM ok
Dec  6 07:32:27 crotchety innd: dbz madvise NOSYNC ok
Dec  6 07:32:38 crotchety innd: dbz mlock ok

This happens quickly when the data is still in cache, leading me to
believe it's something else affecting the .hash file (I added the
madvise() MADV_NOSYNC call just in case somehow it wasn't happening
in the mmap() for some reason):

Dec  6 09:29:34 crotchety innd: dbz openhashtable /news/db/history.index
Dec  6 09:29:34 crotchety innd: dbz madvise WILLNEED ok
Dec  6 09:29:34 crotchety innd: dbz madvise RANDOM ok
Dec  6 09:29:34 crotchety innd: dbz madvise NOSYNC ok
Dec  6 09:29:34 crotchety innd: dbz mlock ok
Dec  6 09:29:34 crotchety innd: dbz openhashtable /news/db/history.hash
Dec  6 09:29:34 crotchety innd: dbz madvise WILLNEED ok
Dec  6 09:29:34 crotchety innd: dbz madvise RANDOM ok
Dec  6 09:29:34 crotchety innd: dbz madvise NOSYNC ok
Dec  6 09:29:34 crotchety innd: dbz mlock ok


> You may be able to achieve an effect very similar to mlock(), but
> runnable by the 'news' user without hacking the kernel, by 

Yeah, sounds like a hack, but I figured out what was going on earlier
with my mlock() hack -- INN and the reader daemon now use a dynamically
linked library, so the nnrpd processes were also trying to mlock() the
files.  Hmmm.  Either I can statically compile INN (which I chose
to do) or I can further butcher the source by attempting to

Re: vm_pageout_scan badness

2000-12-05 Thread Matt Dillon


:To recap, the difference here is that by cheating, I was able to mlock
:one of the two files (the behaviour I was hoping to be able to achieve
:through first MAP_NOSYNC alone, then in combination with MADV_WILLNEED
:to keep all the pages in memory so much as possible) and achieve a much
:improved level of performance -- I'm able to catch up on backlogs from
:a full feed that had built up during the time I wasn't cheating -- by
:using memory for the history database files rather than for general
:filesystem caching.  I even have spare capacity!  Woo.
:
:The mlock man page refers to some system limit on wired pages; I get no
:error when mlock()'ing the hash file, and I'm reasonably sure I tweaked
:the INN source to treat both files identically (and on the other machines
:I have running, the timestamps of both files remains pretty much unchanged).
:I'm not sure why I'm not seeing the desired results here with both files
:(maybe some call hidden somewhere I haven't located yet), but I hope you
:can see the improvements so far.  I even let abusive readers pound on
:me.  Well, for a while 'til I got tired of 'em.

I think you are on to something here.  It's got to be mlock().  Run
'limit' from csh/tcsh and you will see a 'memorylocked' resource.

Whatever this resource is as of when innd is run -- presumably however
it is initialized for the 'news' user (see /etc/login.conf) -- is going
to affect mlock() operation.

mlock() will wire pages.  I think you can safely call it on your 
two smaller history files (history.hash, history.index).  I can
definitely see how this could result in better performance.
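
A minimal sketch of such a wiring program, assuming the 'news' user's
memorylocked limit (RLIMIT_MEMLOCK) is large enough; the path is a
placeholder:

    /*
     * Sketch only: mmap() one of the small history files and wire it with
     * mlock().  The path is a placeholder; real code would loop over both
     * files and handle expire recreating them.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <sys/stat.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char *path = "/news/db/history.index";    /* placeholder */
            struct rlimit rl;
            struct stat st;
            void *base;
            int fd;

            if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
                    printf("memorylocked: cur=%lld max=%lld\n",
                        (long long)rl.rlim_cur, (long long)rl.rlim_max);

            if ((fd = open(path, O_RDONLY)) < 0 || fstat(fd, &st) < 0)
                    err(1, "%s", path);
            base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (base == MAP_FAILED)
                    err(1, "mmap");
            if (mlock(base, (size_t)st.st_size) < 0)    /* wires the pages */
                    err(1, "mlock");
            for (;;)
                    pause();        /* keep the mapping (and wiring) alive */
    }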

:I still don't know for certain if the disk updates I am seeing are
:slow because they aren't sorted well, or if they're random pages and
:not a sequential set, given that I hope I've ruled out fragmentation
:of the database files.  I still maintain that in the case of a true
:MADV_RANDOM madvise'd file, any attempts to clean out `unused' pages
:are ill-advised, or if they're needed, anything other than freeing of
:sequential pages results in excess disk activity that gains nothing,
:if it's the case that this is not how it's done, due to the nature
:of random access.

History files are notorious for random I/O... the problem is due
to the hash table being, well, a hash table.  The hash table 
lookups are bad enough but this will also result in random-like
lookups on the main history file.  You get a little better
locality of reference on the main history file (meaning the system
can do a better job caching it optimally), but the hash tables
are a lost cause so mlock()ing them could be a very good thing.

:Yeah, hacking the vm source to allow me to mlock() isn't kosher, but
:I wanted to test a theory.  Doing so probably requires a few more
:tweaks in the INN source to handle expiry, so it seems, so I'd rather
:the vm subsystem do this for me automagically with the right invocation
:of the suitable mmap/madvise operations, if this is reasonable.

At the moment madvise() MADV_WILLNEED does nothing more than activate
the pages in question and force them into the process's mmap.
You have to call it every so often to keep the pages 'fresh'... calling
it once isn't going to do anything.  

When you call madvise() MADV_WILLNEED the system has to go through
a number of steps before the pages will be thrown away:  

- it has to remove them from the process pmap
- it has to deactivate them
- it has to cache them
- then it can free them

You may be able to achieve an effect very similar to mlock(), but
runnable by the 'news' user without hacking the kernel, by 
writing a quick little C program to mmap() the two smaller history
files and then madvise() the map using MADV_WILLNEED in a loop
with a sleep(15).  Keeping in mind that expire may recreate those
files, the program should unmap, close(), and re-open()/mmap/madvise the 
descriptors every so often (like once a minute).  You shouldn't have
to access the underlying pages but that would also have a similar 
effect.  If you do, use a volatile pointer so GCC doesn't optimize
the access out of the loop.  e.g.

    for (ptr = mapBase; ptr < mapEnd; ptr += pageSize) {
            volatile char c = *ptr;
    }

or

    for (ptr = mapBase; ptr < mapEnd; ptr += pageSize) {
            dummyroutine(*ptr);
    }

And my earlier suggestion above would look something like:

    for (;;) {
            open descriptor
            map
            for (i = 0; i < 15; ++i) {
                    madvise(mapBase, mapSize, MADV_WILLNEED);
                    sleep(15);
            }
            munmap
            close descriptor
    }
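
Filled in as a self-contained program, that sketch might look roughly
like this; the path and the remap interval are illustrative choices,
not part of the suggestion above:

    /*
     * Sketch of the "keep the history pages warm" loop described above.
     * The path is a placeholder; run one instance per small history file.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char *path = "/news/db/history.index";    /* placeholder */

            for (;;) {
                    struct stat st;
                    void *base;
                    int fd, i;

                    /* Reopen and remap each pass, in case expire recreated
                     * the file. */
                    if ((fd = open(path, O_RDONLY)) < 0 || fstat(fd, &st) < 0)
                            err(1, "%s", path);
                    base = mmap(NULL, (size_t)st.st_size, PROT_READ,
                        MAP_SHARED, fd, 0);
                    if (base == MAP_FAILED)
                            err(1, "mmap");

                    /* Re-advise every 15 seconds for about a minute,
                     * then remap. */
                    for (i = 0; i < 4; ++i) {
                            madvise(base, (size_t)st.st_size, MADV_WILLNEED);
                            sleep(15);
                    }
                    munmap(base, (size_t)st.st_size);
                    close(fd);
            }
    }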

-Matt






Re: vm_pageout_scan badness

2000-12-05 Thread Matt Dillon

   I wouldn't worry about madvise() too much.  4.2 has a really good
   heuristic that figures it out for the most part.

   (still reading the rest of your postings)

-Matt






Re: vm_pageout_scan badness

2000-12-05 Thread News History File User

Howdy,
I'm going to breach all sorts of ethics in the worst way by following
up to my own message, just to throw out some new info...  'kay?


Matt wrote, and I quote --
: > However, I noticed something interesting!

Of course I clipped away the interesting Thing, but note the following
that I saw...


: INN after adding the memory, I did a `cp -p' on both the history.hash
: and history.index files, just to start fresh and clean.  It didn't seem
[...]
: > There is an easy way to test file fragmentation.  Kill off everything
: > and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for 
: > history.hash and history.index.  Look at the iostat on the history
: > drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
: > per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
: > is severely fragmented.  Go through the entire history file(s) w/ dd...
: 
: Okay, I'm doing this:  The two hash-type files give me between 9 and
: 10K/t; the history text file gives me more like 60KB/t.  Hmmm.  It's

Now, remember what Matt wrote, that partially-cached data played havoc
with read-ahead.  That is apparently what I was seeing here, pulling
some bit of data off the disk proper, but then pulling a chunk of data
that was cached, and so on.

I figured that out as I attempted to copy one of the files to create an
unfragmented copy to test transfer size and saw the expected 64K (well
DUH, that was the write size), and then attempted to `dd' these to /dev/null
and saw ... no disk activity.  The file was in cache.  Bummer.

Oh well, I had to reboot anyway for some reason, and did so.  Immediately
after reboot I `dd'ed the two database files and got the expected 64K/t
of an unfragmented file.  I also made copies of them just to push their
contents into memory, because...


: The actual history lookups and updates that matter are all done within
: the memory taken up by the .index and .hash files.  So, by keeping
: them in memory, one doesn't need to do any disk activity at all for
: lookups, and updates, well, so long as you commit them to the disk at
: shutdown, all should be okay.  That's what I'm attempting to achieve.
: These lookups and updates are bleedin' expensive when disk activity
: rears its ugly head.
: 
: Not to worry, I'm going to keep plugging to see if there is a way for
: me to lock these two files into memory so that they *stay* there, just
: to prove whether or not that's a significant performance improvement.
: I may have to break something, but hey...

I b0rked something.  I `fixed' the mlock operation to allow a lowly user
such as myself to use it, just as proof of concept.  (I still need to do
a bit of tuning, I can see, but hey, I got results)

So I attempt to pass all the madvise suggestions I can for both the
history.index and .hash files, and then I attempt to mlock both of them.
I don't get a failure, although the history.hash file (108MB) doesn't
quite achieve the desired results -- I do see Good Things with the
smaller history.index (72MB and don't remind me that 1MB really isn't
100bytes).

Anyway, the number of `Wired' Megs in `top' is up from 71MB to 200+,
and after some hours of operation, look at the timestamps of the two
database files (the .n.* files are those I copied after reboot, and
serve as a nice reference for when I started things)

-rw-rw-r--  1 news  news  755280213 Dec  5 19:05 history
-rw-rw-r--  1 news  news 57 Dec  5 19:05 history.dir
-rw-rw-r--  1 news  news  10800 Dec  5 19:05 history.hash
-rw-rw-r--  1 news  news   7200 Dec  5 08:44 history.index
-rw-rw-r--  1 news  news  10800 Dec  5 08:43 history.n.hash
-rw-rw-r--  1 news  news   7200 Dec  5 08:44 history.n.index

So, okay, history.hash still sees disk activity, but look at a handful
of INN timer stats following the boot:


The last two stats with the default vm k0deZ before restart:

Dec  5 08:30:40 crotchety innd: ME time 301532 idle 28002(120753)
 artwrite 70033(2853) artlink 0(0) hiswrite 49396(3097) hissync 28(6)
^
 sitesend 460(5706) artctrl 296(25) artcncl 295(25) hishave 32016(8923)
^
 hisgrep 45(10) artclean 20816(3150) perl 12536(3082) overv 29927(2853)
 python 0(0) ncread 33729(152735) ncproc 227796(152735) 

80 seconds of 300 spent on history activity...  urk...  on a steady-state
system with a few readers that had been running for some hours.

Dec  5 08:35:37 crotchety innd: ME time 300052 idle 16425(136209) artwrite 77811(2726) 
artlink 0(0) hiswrite 35676(2941) hissync 28(6) sitesend 571(5450) artctrl 454(41) 
artcncl 451(41) hishave 33311(7392) hisgrep 55(14) artclean 22778(3000) perl 
14137(2914) overv 28516(2726) python 0(0) ncread 38832(172145) ncproc 226513(172145) 

[REB00T]

Dec  5 08:59:32 crotchety innd: ME time 300059 idle 62840(189385)
 artwrite 68361(5580) artlink 0(0) hiswrite 8782(6567) hissync 104(12

Re: vm_pageout_scan badness

2000-12-04 Thread News History File User

> ok, since I got about 6 requests in four hours to be Cc'd, I'm 
> throwing this back onto the list.  Sorry for the double-response that
> some people are going to get!

Ah, good, since I've been deliberately avoiding reading mail in an
attempt to get something useful done in my last days in the country,
and probably wouldn't get around to reading it until I'm without Net
access in a couple weeks...

(Also, because your mailer seems to be ignoring the `Reply-To:' header
I've been using, but I'd get a copy through the cc: list, in case you
puzzled over why your previous messages bounced)


> I am going to include some additional thoughts in the front, then break
> to my originally private email response.

I'll mention that I've discovered the miracle of man pages, and found
the interesting `madvise' capability of `MADV_WILLNEED' that, from the
description, looks very promising.  Pity the results I'm seeing still
don't match my expectations.

Also, in case the amount of system memory on this machine might be
insufficient to do what I want with the size of the history.hash/.index
files, I've just gotten an upgrade to a full gig.  Unfortunately, now
performance is worse than it had been, so it looks like I'll be butchering
the k0deZ to see if I can get my way.

Now, for `madvise' -- this is already used in the INN source in lib/dbz.c
(where one would add MAP_NOSYNC to the MAP__FLAGS) as MADV_RANDOM --
this matches the random access pattern of the history hash table.
Supposedly, MADV_WILLNEED will tell the system to avoid freeing these
pages, which looks to be my holy grail of this week, plus the immediate
mapping that certainly can't hurt.

There's only a single madvise call in the INN source, but I see that the
Diablo code does make two calls to it (although both WILLNEED and, unlike
INN, SEQUENTIAL access -- this could be part of the cause of the apparent
misunderstanding of the INN history file that I see below).  Since it
looks to my non-programmer eyes like I can't combine the behaviours in a
single call, I followed Diablo's example to specify both RANDOM and the
WILLNEED that I thought would improve things.
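
For reference, issuing both hints is simply two separate calls against
the same mapping; a minimal fragment, where fd and len are assumed to
come from opening and fstat()ing the .hash or .index file, and
MAP_NOSYNC is the dbz.c addition mentioned above:

    /* Two separate madvise() hints on one mapping; fd and len are assumed
     * to have been set up from the history .hash/.index file. */
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (base != MAP_FAILED) {
            madvise(base, len, MADV_RANDOM);        /* access-pattern hint */
            madvise(base, len, MADV_WILLNEED);      /* fault the pages in now */
    }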

The machine is, of course, as you can see from the timings, not optimized
at all, since I've just thrown something together as a proof of concept
having run into a brick wall with the codes under test with Slowaris.
And because a departmental edict has come down that I must migrate all
services off Free/NetBSD and onto Slowaris, I can't expect to get the
needed hardware to beef up the system -- even though the MAP_NOSYNC
option on the transit machine enabled it to whup the pants off a far
more expensive chunk of Sun hardware.  So I'm trying to be able to say
`Look, see? see what you can do with FreeBSD' as I'm shown out the door.


> I ran a couple of tests with MAP_NOSYNC to make sure that the
> fragmentation issue is real.  It definitely is.  If you create a
> file by ftruncate()ing it to a large size, then mmap() it SHARED +
> NOSYNC, then modify the file via the mmap, massive fragmentation occurs

I've heard it confirmed that even the newer INN does not mmap() the
newly-created files for makehistory or expire.  As reported to the
INN-workers mailing list:

: From: [EMAIL PROTECTED] (Richard Todd)
: Newsgroups: mailing.unix.inn-workers
: Subject: Re: expire/makehistory and mmap/madvise'd dbz filez
: Date: 4 Dec 2000 06:30:47 +0800
: Message-ID: <90ehin$1ndk$[EMAIL PROTECTED]>
: 
: In servalan.mailinglist.inn-workers you write:
: 
: >Moin moin
: 
: >I'm engaged in a discussion on one of the FreeBSD developer lists
: >and I thought I'd verify the present source against my memory of how
: >INN 1.5 runs, to see if I might be having problems...
: 
: >Anyway, the Makefile in the 1.5 expire directory has the following bit,
: >that seems to be absent in present source, and I didn't see any
: >obvious indication in the makedbz source as to how it's initializing
: >the new files, which, if done wrong, could trigger some bugs, at least
: >when `expire' is run.
: 
: ># Build our own version of dbz.o for expire and makehistory, to avoid
: ># any -DMMAP in DBZCFLAGS - using mmap() for dbz in expire can slow it
: ># down really bad, and has no benefits as it pertains to the *new* .pag.
: >dbz.o: ../lib/dbz.c
: >   $(CC) $(CFLAGS) -c ../lib/dbz.c
: 
: >Is this functionality in the newest expire, or do I need to go a hackin'?
: 
: Whether dbz uses mmap or not on a given invocation is controlled by the 
: dbzsetoptions() call; look for that call and setting of the INCORE_MEM 
: option in expire/expire.c and expire/makedbz.c.  Neither expire nor
: makedbz mmaps the new dbz indices it creates. 

The remaining condition I'm not positive about is the case of an
overflow, which ideally would never need to be considered, and which is
not happening on this machine now.


> on the file.  This is easily demonstrated by issuing a sequential read
> on the file and noting that the syste

Re: vm_pageout_scan badness

2000-12-03 Thread Matt Dillon

ok, since I got about 6 requests in four hours to be Cc'd, I'm 
throwing this back onto the list.  Sorry for the double-response that
some people are going to get!

I am going to include some additional thoughts in the front, then break
to my originally private email response.

I ran a couple of tests with MAP_NOSYNC to make sure that the
fragmentation issue is real.  It definitely is.  If you create a
file by ftruncate()ing it to a large size, then mmap() it SHARED +
NOSYNC, then modify the file via the mmap, massive fragmentation occurs
on the file.  This is easily demonstrated by issuing a sequential read
on the file and noting that the system is not able to do any clustering
whatsoever and gets a measly 0.6MB/sec of throughput (on a disk
that can do 12-15MB/sec).  (and the disk seeks wildly during the read).

When you create a large file and fill it with zeros, THEN mmap() it
SHARED + NOSYNC and write to it randomly via the mmap(), the file
remains laid out on disk optimally.  However, I noticed something interesting!
When I dd if=file of=/dev/null bs=32k the first time after
randomly writing it and then fsync()ing it, I only get 4MB/sec of
throughput.  If I dd the file a second time I get around 8MB/sec.  If
I dd it the third time I get the platter speed - 12-15MB/sec.  The issue
here has to do with the fact that the file is partially cached in the
first two dd runs.

The partially cached file shortcuts the I/O clustering code, preventing
it from issuing read-aheads once it hits a buffer that is already
in the cache.  So if you have a spattering of cached blocks and then
read a file sequentially, you actually get lower throughput than if
you don't have *any* cached blocks and then read the file sequentially.
Verrry interesting!  I think it may be beneficial to the clustering code
to issue the full read-ahead even if some of the blocks in the middle
are already cached.  The clustering code only operates when sequential
operation is detected, so I don't think it can make things worse.

large file == at least 2 x main memory.


-- original response --

Ok, let's concentrate on your hishave, artclean, artctrl, and overview
numbers.

:-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
:-rw-rw-r--  1 news  news 67 Dec  3 05:22 history.dir
:-rw-rw-r--  1 news  news   8100 Dec  1 01:55 history.hash
:-rw-rw-r--  1 news  news   5400 Nov 30 22:49 history.index
:
:More observations that may or may not mean anything -- before rebooting,
:I timed the `fsync' commands on the 108MB and 72MB history files, as

note: the fsync command will not flush MAP_NOSYNC pages.

:The time taken to do the `fsync' was around one minute for the two
:history files.  And around 1 second for the BerkeleyDB file...

This is an indication of file fragmentation, probably due to holes
in the history file being filled via the mmap() instead of filled via
write().

In order for MAP_NOSYNC to be reasonable, you have to fix the code
that extends a file via ftruncate() to write() zeros into the
extended portion.
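
A sketch of what that fix amounts to, extending the file by writing
zeros rather than leaving an ftruncate() hole; the function name and
block size are arbitrary:

    /* Extend a file from oldsize to newsize by write()ing zeros instead
     * of relying on ftruncate() to leave a hole; sketch only. */
    #include <sys/types.h>
    #include <unistd.h>

    static int
    extend_with_zeros(int fd, off_t oldsize, off_t newsize)
    {
            static const char zeros[65536];         /* one 64KB zero block */
            off_t off = oldsize;

            if (lseek(fd, oldsize, SEEK_SET) == (off_t)-1)
                    return (-1);
            while (off < newsize) {
                    size_t chunk = sizeof(zeros);

                    if ((off_t)chunk > newsize - off)
                            chunk = (size_t)(newsize - off);
                    if (write(fd, zeros, chunk) != (ssize_t)chunk)
                            return (-1);
                    off += chunk;
            }
            return (0);
    }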

:data getting flushed to disk, then it seems like someone's priorities
:are a bit, well, wrong.  The way I see it, by giving the MAP_NOSYNC
:flag, I'm sort of asking for preferential treatment, kinda like mlock,
:even though that's not available to me as `news' user.

 The pages are treated the way any VM page is treated... they'll
 be cached based on use.  I don't think this is the problem.

Ok, let's look at a summary of your timing results:

hishave         overv           artclean        artctrl

38857(26474)    112176(6077)    12264(6930)     2297(308)
22114(28196)    136855(6402)    12757(7295)     1257(322)
13614(24312)    156723(6071)    13232(6800)     324(244)
9944(25198)     164223(6620)    13441(7753)     255(160)
2777(50732)     24979(3788)     29821(4017)     131(51)
31975(11904)    21593(3320)     25148(3567)     5935(340)

Specifically, look at the last one where it blew up on you.  hishave
and artctrl are much worse, overview and artclean are about the same.

This is an indication of excessive seeking on the history disk.  I
believe that this seeking may be due to file fragmentation.

There is an easy way to test file fragmentation.  Kill off everything
and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for 
history.hash and history.index.  Look at the iostat on the history
drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
is severely fragmented.  Go through the entire history file(s) w/ dd...
the fragmentation may occur near the end.

If the file turns out to be fragmented, the only way to fix it is to 

Re: vm_pageout_scan badness

2000-12-03 Thread Matt Dillon

:>  I'm going to take this off of hackers and to private email.  My reply
:>  will be via private email.
:
:Actually, I was enjoying the discussion, since I was learning something
:in the process of hearing you debug this remotely.
:
:It sure beats the K&R vs. ANSI discussion. :)
:
:Nate

Heh.  Well, I didn't think there'd be as much interest as there is,
so I guess I'll throw it back onto the mailing list.

-Matt






Re: vm_pageout_scan badness

2000-12-03 Thread Nate Williams

>  I'm going to take this off of hackers and to private email.  My reply
>  will be via private email.

Actually, I was enjoying the discussion, since I was learning something
in the process of hearing you debug this remotely.

It sure beats the K&R vs. ANSI discussion. :)



Nate





Re: vm_pageout_scan badness

2000-12-03 Thread Matt Dillon

:errr then keep me in the CC 
:
:it's interesting
:
:-- 
:  __--_|\  Julian Elischer
: /   \ [EMAIL PROTECTED]

Sure thing.  Anyone else who wants to be in the Cc, email me.

-Matt





Re: vm_pageout_scan badness

2000-12-03 Thread Julian Elischer

Matt Dillon wrote:
> 
>  I'm going to take this off of hackers and to private email.  My reply
>  will be via private email.
> 
> -Matt
> 
> To Unsubscribe: send mail to [EMAIL PROTECTED]
> with "unsubscribe freebsd-hackers" in the body of the message

errr then keep me in the CC 

it's interesting

-- 
  __--_|\  Julian Elischer
 /   \ [EMAIL PROTECTED]
(   OZ) World tour 2000
---> X_.---._/  presently in:  Budapest
v





Re: vm_pageout_scan badness

2000-12-03 Thread Matt Dillon

 I'm going to take this off of hackers and to private email.  My reply
 will be via private email.

-Matt






Re: vm_pageout_scan badness

2000-12-02 Thread News History File User

> :but at last look, history lookups and writes are accounting for more
> :than half (!) of the INN news process time, with available idle time
> :being essentially zero.  So...
> 
> No idle time?  That doesn't sound like blocked I/O to me, it sounds
> like the machine has run out of cpu.

Um, I knew I'd be unclear somehow.   The machine itself (with 2 CPUs)
has plenty of idle time -- `top' reports typically 70-80% idle, and
INN takes from 20-40% of CPU (being SMP, a process like `perl' locked
to one CPU will appear around 98%, unlike a certain other OS that will
show this percentage for the system total, rather than for a particular
CPU).

What I mean is that the INN process timer -- basically Joe Greco's
timer, which wraps key functions with start/stop timer calls to show
where INN spends much of its time -- is showing little to no idle time
(meaning it couldn't take in more articles no matter how hard I push
them).  Let me show you the timer stats from the time I started things
not long ago on this reader machine, where it's taking in backlogs:

Dec  3 04:33:47 crotchety innd: ME time 300449 idle 376(4577)
 all times in milliseconds: elapsed time^^=5min ^^^idle time (numbers
in parentheses are number of calls; only significant in calls like
artwrite to show how many articles were actually written to spool,
hiswrite to show how many unique articles were received over this
time period, and hishave to show how many history lookups were done)
 artwrite 52601(6077) artlink 0(0) hiswrite 40200(7035) hissync 11(14)
  ^^^ 53 seconds writing articles   ^^ 40 seconds updating history
 sitesend 647(12154) artctrl 2297(308) artcncl 2288(308) hishave 38857(26474)
39 seconds doing history lookups ^^
 hisgrep 70(111) artclean 12264(6930) perl 13819(6838) overv 112176(6077)
 python 0(0) ncread 13818(21287) ncproc 284413(21287) 

Dec  3 04:38:48 crotchety innd: ME time 301584 idle 406(5926) artwrite 55774(6402) 
artlink 0(0) hiswrite 25483(7474) hissync 15(15) sitesend 733(12805) artctrl 1257(322) 
artcncl 1245(321) hishave 22114(28196) hisgrep 90(38) artclean 12757(7295) perl 
14696(7191) overv 136855(6402) python 0(0) ncread 14446(23235) ncproc 284767(23235) 

(as time passes and more of the MAP_NOSYNC file is in memory, the time
needed for history writes/lookups drops)
[...]
Dec  3 04:58:49 crotchety innd: ME time 300047 idle 566(6272) artwrite 59850(6071) 
artlink 0(0) hiswrite 11630(6894) hissync 33(14) sitesend 692(12142) artctrl 324(244) 
artcncl 320(244) hishave 13614(24312) hisgrep 0(77) artclean 13232(6800) perl 
14531(6727) overv 156723(6071) python 0(0) ncread 15116(23838) ncproc 281745(23838) 
Dec  3 05:03:49 crotchety innd: ME time 300018 idle 366(5936) artwrite 56956(6620) 
artlink 0(0) hiswrite 8850(7749) hissync 7(15) sitesend 760(13240) artctrl 255(160) 
artcncl 255(160) hishave 9944(25198) hisgrep 0(31) artclean 13441(7753) perl 
15605(7620) overv 164223(6620) python 0(0) ncread 14783(24123) ncproc 282791(24123) 

Most of the time is spent on the BerkeleyDB overview now.  This is
probably because some reader is giving repeated commands pounding
the overview database.  That reader's IP now has a
different gateway address, and won't be bothering me for a while.

Now, for a reference, here are the timings on a transit-only machine
with no readers, after it's been running for a while:


Dec  3 05:22:09 news-feed69 innd: ME time 30 idle 91045(91733)
 a reasonable amount of idle time ^^
 artwrite 48083(2096) artlink 0(0) hiswrite 1639(2096) hissync 33(11)

 sitesend 4291(12510) artctrl 0(0) artcncl 0(0) hishave 1600(30129)

 hisgrep 0(0) artclean 25591(2121) perl 79(2096) overv 0(0) python 0(0)
 ncread 69798(147925) ncproc 108624(147919) 

A total of just over 3 seconds out of every 300 seconds spent on
history activity.  That's reflected by the timestamps on the NOSYNC'ed
history database (index/hash) files you see here:

-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
-rw-rw-r--  1 news  news 67 Dec  3 05:22 history.dir
-rw-rw-r--  1 news  news   8100 Dec  1 01:55 history.hash
-rw-rw-r--  1 news  news   5400 Nov 30 22:49 history.index

However, the timings shown by `top' here show from 10 to 20% idle CPU
time, even though INN itself has capacity to do more work.


The problem is that I'm not seeing this on the reader box.  Or if I
do see it, it doesn't last long.  The timestamps on the above files
are pretty much current, in spite of the files being NOSYNC'ed.


> :As is to be expected, INN increases in size as it does history lookups
> :and updates, and the amount of memory shown as Active tracks this,
> :more or less.  But what's happening to the Free value!  It's going
> :down at as much as 4MB per `top' interval.  Or should I say, what is
> :happening to the Inactive value -- it's constan

Re: vm_pageout_scan badness

2000-12-02 Thread Matt Dillon

:closely the pattern of what happens to the available memory following
:a fresh boot...  At the moment, this (reader) machine has been up for
:half a day, with performance barely able to keep up with a full feed
:(but starting to slip as the overnight burst of binaries is starting),
:but at last look, history lookups and writes are accounting for more
:than half (!) of the INN news process time, with available idle time
:being essentially zero.  So...

No idle time?  That doesn't sound like blocked I/O to me, it sounds
like the machine has run out of cpu.

:Following the boot, things start out with plenty of memory Free, and
:something like 4MB Active, which seems reasonable to me.  Then I start
:things.
:
:As is to be expected, INN increases in size as it does history lookups
:and updates, and the amount of memory shown as Active tracks this,
:more or less.  But what's happening to the Free value!  It's going
:down at as much as 4MB per `top' interval.  Or should I say, what is
:happening to the Inactive value -- it's constantly increasing, and I
:observe a rapid migration of all the Free memory to Inactive, until
:the value of Inactive peaks out at the time that Free drops to about
:996k, beyond which it changes little.  None of the swap space has
:been touched yet.
:
:As soon as the value for Free hits bottom and that of Inactive has
:reached a max, now the migration happens from Inactive to Active --
:until this point, the value of Active has been roughly what I would
:expect to see, given the size of the history hash/index files, and
:the BerkeleyDB file I'm now using MAP_NOSYNC as well for a definite
:improvement in overview access times.

Hmm.  An increasing 'inactive' most often occurs when a program
is reading a file sequentially.  It sounds like most of the inactive
pages are probably due to reader requests from the spool.

:> Is it possible that history file rewriting is creating an issue?  Doesn't
:> INN rewrite the history file every once in a while to clear out old
:> garbage?  I'm not up on the latest INN.
:
:In normal operation, no -- the text file is append-only (the text file
:isn't used for lookups with the MD5-based hashing), and expire, which
:I'm running manually, rewrites the hash files -- leading to a mysterious
:lack of space today when I attempted to run both expire and makedbz (a
:variant of makehistory), and apparently some reader processes or some
:daemons still had the old inodes open, until suddenly in one swell foop,
:some 750MB was freed up -- far more than I expected to see, so I should
:probably look into this space usage sometime...
:
:This shouldn't be a problem the way I'm running things now.  I haven't
:run an expire process since the last reboot to observe things closely.

Woa.  750MB?  There are only two things that can cause that:

* A process with hundreds of megabytes of private store exited

* A large (500+ MB) file is deleted after having previously been
  mmap()'d.  (or the process holding the last open descriptor to
  the file, after deletion, now exits).

If I remember INN right, there is a situation that can occur here... the
reader processes open up the history file in order to implement certain
NNTP commands.  I'm trying to remember which one... I think it's one of
the search commands.  Fubar... anyone remember which NNTP command opens
up the history file?  In any case, I remember at BEST I had to completely
disable that command when running INN because it caused long-running
reader processes to keep a descriptor open on now-deleted history files.
When you do an expire run which replaces the history file, the original
(now deleted) history file may still be open by those reader processes.
This could easily account for your problems.

This sort of situation occurs most often when there is no timeout
or too long a timeout in the reader processes, and/or if tcp keepalives
are not turned on, plus when certain NNTP commands (used mostly by
abusers who try to download feeds via their reader
access) are enabled.  I would immediately research this... look for
reader processes that have hung around too long and try killing them,
then see if that clears out some memory.

There will also be a serious file fragmentation issue using MAP_NOSYNC
in the expire process.  You can probably use MAP_NOSYNC safely in the
INND core, but don't use it to rebuild the history file in the expire
process.

-Matt






Re: vm_pageout_scan badness

2000-12-01 Thread News History File User

> :> Personally speaking, I would much rather use MAP_NOSYNC anyway,
> even with
> :...
> :Everything starts out well, where the history disk is beaten at startup
> :but as time passes, the time taken to do lookups and writes drops down
> :to near-zero levels, and the disk gets quiet.  And actually, the transit
> :...
> :What I notice is that the amount of memory used keeps increasing, until
> :it's all used, and the Free amount shown by `top' drops to a meg or so.
> :Cache and Buf get a bit, but most of it is Active.  Far more than is
> :accounted for by the processes.
> 
> This is to be expected, because the dirty MAP_NOSYNC pages will not
> be written out until they are forced out, or by msync().

I just discovered the user command `fsync' which has revealed a few
things to me, clearing up some mysteries.  Also, I've watched more
closely the pattern of what happens to the available memory following
a fresh boot...  At the moment, this (reader) machine has been up for
half a day, with performance barely able to keep up with a full feed
(but starting to slip as the overnight burst of binaries is starting),
but at last look, history lookups and writes are accounting for more
than half (!) of the INN news process time, with available idle time
being essentially zero.  So...


> :Now, what happens on the reader machine is that after some time of the
> :Active memory increasing, it runs out and starts to swap out processes,
> :and the timestamps on the history database files (.index and .hash, this
> :is the md5-based history) get updated, rather than remaining at the
> :time INN is started.  Then the rapid history times skyrocket until it
> :takes more than 1/4 of the time.  I don't see this on the transit boxen
> :even after days of operation.
> 
> Hmm.  That doesn't sound right.  Free memory should drop to near zero,
> but then what should happen is the pageout daemon should come along
> and deactivate a big chunk of the 'active' pages... so you should
> see a situation where you have, say, 200MB worth of active pages
> and 200MB worth of inactive pages.  After that the pageout daemon
> should start paging out the inactive pages and increasing the 'cache'.
> The number of 'free' pages will always be near zero, which is to be
> expected.  But it should not be swapping out any process.

Here is what I noticed while watching the `top' values for Active,
Inactive, and Free following this last boot (I didn't pay any attention
to the other fields to notice any wild fluctuations there, next time
maybe), on this machine with 512MB of RAM, if it reveals anything:

Following the boot, things start out with plenty of memory Free, and
something like 4MB Active, which seems reasonable to me.  Then I start
things.

As is to be expected, INN increases in size as it does history lookups
and updates, and the amount of memory shown as Active tracks this,
more or less.  But what's happening to the Free value!  It's going
down at as much as 4MB per `top' interval.  Or should I say, what is
happening to the Inactive value -- it's constantly increasing, and I
observe a rapid migration of all the Free memory to Inactive, until
the value of Inactive peaks out at the time that Free drops to about
996k, beyond which it changes little.  None of the swap space has
been touched yet.

As soon as the value for Free hits bottom and that of Inactive has
reached a max, now the migration happens from Inactive to Active --
until this point, the value of Active has been roughly what I would
expect to see, given the size of the history hash/index files, and
the BerkeleyDB file I'm now using MAP_NOSYNC as well for a definite
improvement in overview access times.

Anyway, I don't remember what values exactly I was seeing for Free
and Inactive or Active, since I was just watching for general trends,
but I seem to recall Active being ~100MB, and Inactive somewhat more.

(Are you saying above that this Inactive value should be migrating to
Cache, which I'm not seeing, rather than to Active, which I do see?
If so, then hmmm.)

Now memory is drifting at a fairly rapid pace from Inactive (the
meaning of which I'm not exactly clear about, although there's some
explanation in the `top' man page that hasn't quite clicked into
understanding yet), over to the Active field, at something like 2MB
or so per `top' interval.  Free remains close to 1MB, but Active is
constantly growing, although no processes are clearly taking up any
of this, apart from INN which only accounts for around 100MB at this
time, and isn't increasing at the rate of increase of Active memory.

Anyway, the Active field continues to increase as Inactive decreases
until finally Inactive bottoms out, down from several hundred MB to
a one or two digit MB value (I don't remember exactly), while Active
has increased to almost 400MB.  This is something like 20 minutes
after the reboot, and now the first bit of swap gets hit.  However,
the value of A

Re: vm_pageout_scan badness

2000-12-01 Thread Matt Dillon

:> Personally speaking, I would much rather use MAP_NOSYNC anyway, even with
:> a fixed filesystem syncer.   MAP_NOSYNC pages are not restricted by
:...
:
:Yeah, no kidding -- here's what I see it screwing up.  First, some
:background:
:
:I've built three news machines, two transit boxen and one reader box,
:with recent INN k0dez, and 4.2-STABLE of a few days ago (having tested
:NetBSD, more on that later), and a brief detour into 5-current.
:..
:
:Everything starts out well, where the history disk is beaten at startup
:but as time passes, the time taken to do lookups and writes drops down
:to near-zero levels, and the disk gets quiet.  And actually, the transit
:...
:
:What I notice is that the amount of memory used keeps increasing, until
:it's all used, and the Free amount shown by `top' drops to a meg or so.
:Cache and Buf get a bit, but most of it is Active.  Far more than is
:accounted for by the processes.

This is to be expected, because the dirty MAP_NOSYNC pages will not
be written out until they are forced out, or by msync().

:Now, what happens on the reader machine is that after some time of the
:Active memory increasing, it runs out and starts to swap out processes,
:and the timestamps on the history database files (.index and .hash, this
:is the md5-based history) get updated, rather than remaining at the
:time INN is started.  Then the rapid history times skyrocket until it
:takes more than 1/4 of the time.  I don't see this on the transit boxen
:even after days of operation.

Hmm.  That doesn't sound right.  Free memory should drop to near zero,
but then what should happen is the pageout daemon should come along
and deactivate a big chunk of the 'active' pages... so you should
see a situation where you have, say, 200MB worth of active pages
and 200MB worth of inactive pages.  After that the pageout daemon
should start paging out the inactive pages and increasing the 'cache'.
The number of 'free' pages will always be near zero, which is to be
expected.  But it should not be swapping out any process.

The actual amount of 'free' memory in the system is 'free+cache'
pages.

:Now, what happens when I stop INN and everything news-related is that
:some memory is freed up, but still, there can be, say, 400MB still
:reported as Active.  More when I had a full gig in this machine to
:...
:
:Then, when I reboot the machine, it gives the kernel messages about
:syncing disks; done, and then suddenly the history drive light goes
:on and it starts grinding for five minutes or so, before the actual
:reboot happens.

Right.  This is to be expected.  You have a lot of dirty pages
in the system due to the use of MAP_NOSYNC that have to be flushed
out.

:No history activity happens when I shut down INN normally, which should
:free the MAP_NOSYNC'ed pages and make them available to be written to
:disk before rebooting, maybe.

MAP_NOSYNC pages are not flushed when the referencing program exits.
They stick around until they are forced out.  You can flush them
manually by using a mmap()/msync() combination.  i.e. an msync() prior
to munmap()ing (from INND only) ought to do it.
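
In code the flush is just this, where base and len are assumed to
describe the existing MAP_NOSYNC mapping:

    /* Flush dirty MAP_NOSYNC pages to disk before tearing down the
     * mapping; fragment only, base/len come from the original mmap(). */
    if (msync(base, len, MS_SYNC) < 0)
            warn("msync");          /* the pages stay dirty if this fails */
    munmap(base, len);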

:What I think is happening, based on these observations, is that the
:data from the history hash files (less than 100MB) gets read into
:memory, but the updates to it are not written over the data to be
:replaced -- it's simply appended to, up to the limit of the available
:memory.  When this limit is reached on the transit machines, then
:things stabilize and old pages get recycled (but still, more memory
:overall is used than the size of the actual file).

It doesn't append... the pages are reused.  The set of 'active'
pages in the VM system is effectively the set of all files accessed
for the entire system, not just MAP_NOSYNC pages.  If you are only
MAP_NOSYNC'ing 100MB worth of pages, then only 100MB worth of pages
will be left unflushed.

Is it possible that history file rewriting is creating an issue?  Doesn't
INN rewrite the history file every once in a while to clear out old
garbage?  I'm not up on the latest INN.

:I'm guessing that additional activity of the reader machine causes
:jumps in memory usage not seen on the transit machines, that is enough
:to force some of the unwritten dirty pages to be written to the
:history file, as a few megs of swap get used, which is why it does
:not stabilize as `nicely' as the transit machines.

This makes sense... the amount of swap that gets used is critical.
If we are talking about only a few megabytes, then your system is
*not* swapping significantly, it is simply swapping out completely
idle pages from things like idle getty's and such.  This is a good
thing.  The disk activity would thus be mostly due to MAP_NOSYNC pages
being written out.

:Now, something I contemplated -- it seems that Bad Undesirable Things
:happen as soon as I start

Re: vm_pageout_scan badness

2000-12-01 Thread News History File User

Long ago, it was written here on 25 Oct 2000 by Matt Dillon:

> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
> :...
> :   Terry Lambert
> :   [EMAIL PROTECTED]
> 
> This isn't exactly what I was talking about.  The issue in regards to
> the filesystem syncer is that it fsync()'s an entire file.  If
> you have a big file (e.g. a USENET news history file) the 
> filesystem syncer can come along and exclusively lock it for
> *seconds* while it is fsync()ing it, stalling all activity on
> the file every 30 seconds.
[...]
> One of the reasons why Yahoo uses MAP_NOSYNC so much (causing the problem
> that Alfred has been talking about) is because the filesystem
> syncer is 'broken' in regards to generating unnecessarily long stalls.
> 
> Personally speaking, I would much rather use MAP_NOSYNC anyway, even with
> a fixed filesystem syncer.   MAP_NOSYNC pages are not restricted by
> the size of the filesystem buffer cache, so you can have a whole
> lot more dirty pages in the system then you would normally be able to
> have.  This 'feature' has had the unfortunate side effect of screwing
> up *THWACK*

Yeah, no kidding -- here's what I see it screwing up.  First, some
background:

I've built three news machines, two transit boxen and one reader box,
with recent INN k0dez, and 4.2-STABLE of a few days ago (having tested
NetBSD, more on that later), and a brief detour into 5-current.

The two transit boxes have somewhere on the order of ~400MB memory
or less; the amount I've put in the reader box has increased up to a
Gig as I try to figure out what's happening.  I'm using the MAP_NOSYNC
on the history database files on all of them, to try to get the NetBSD
performance of not hitting history, and I've made a couple of other
minor tweaks to use mmap where the INN history code probably should,
but doesn't.

Everything starts out well, where the history disk is beaten at startup
but as time passes, the time taken to do lookups and writes drops down
to near-zero levels, and the disk gets quiet.  And actually, the transit
machines stay that way, while the reader machine gives me problems after
some time.

What I notice is that the amount of memory used keeps increasing, until
it's all used, and the Free amount shown by `top' drops to a meg or so.
Cache and Buf get a bit, but most of it is Active.  Far more than is
accounted for by the processes.

Now, what happens on the reader machine is that after some time of the
Active memory increasing, it runs out and starts to swap out processes,
and the timestamps on the history database files (.index and .hash, this
is the md5-based history) get updated, rather than remaining at the
time INN is started.  Then the rapid history times skyrocket until it
takes more than 1/4 of the time.  I don't see this on the transit boxen
even after days of operation.

Now, what happens when I stop INN and everything news-related is that
some memory is freed up, but still, there can be, say, 400MB still
reported as Active.  More when I had a full gig in this machine to
try to keep it from swapping, all of which got used...

Then, when I reboot the machine, it gives the kernel messages about
syncing disks; done, and then suddenly the history drive light goes
on and it starts grinding for five minutes or so, before the actual
reboot happens.

No history activity happens when I shut down INN normally, which should
free the MAP_NOSYNC'ed pages and make them available to be written to
disk before rebooting, maybe.


I'm also running BerkeleyDB for the reader overview on this machine,
and I just discovered that I had applied MAP_NOSYNC to an earlier
release, but the library linked in had not had this -- I just fixed
that and am running that way now (and see a noticeable improvement)
so now when I reboot, I may see both the overview database disk and
the history disk get some pre-reboot activity, if what I think is
happening really is happening.

What I think is happening, based on these observations, is that the
data from the history hash files (less than 100MB) gets read into
memory, but the updates to it are not written over the data to be
replaced -- it's simply appended to, up to the limit of the available
memory.  When this limit is reached on the transit machines, then
things stabilize and old pages get recycled (but still, more memory
overall is used than the size of the actual file).

I'm guessing that additional activity of the reader machine causes
jumps in memory usage not seen on the transit machines, that is enough
to force some of the unwritten dirty pages to be written to the
history file, as a few megs of swap get used, which is why it does
not sta

Re: vm_pageout_scan badness

2000-11-02 Thread Peter Jeremy

On Wed, 25 Oct 2000 21:54:42 + (GMT), Terry Lambert <[EMAIL PROTECTED]> wrote:
>I think the idea of a fixed limit on the FS buffer cache is
>probably wrong in the first place; certainly, there must be
>high and low reserves, but:
>
>|--| all of memory
> |-| FS allowed use
>|-|  non-FS allowed use
>||   non-FS reserve
>  || FS reserve
>
>...in other words, a reserve-based system, rather than a limit
>based system.

This is what Compaq Tru64 (aka Digital UNIX aka OSF/1) does.  It
splits physical RAM as follows:

|| physical RAM
|| Static wired memory
 |===| managed memory
 |=-|  dynamic wired memory
   |-| UBC memory
|| VM

The default configuration provides:
- up to 80% of RAM can be wired.
- UBC (unified buffer cache) uses a minimum of 10% RAM and can use up
  to 100% RAM.
- The VM subsystem can steal UBC pages if the UBC is using >20% RAM

There's no minimum limit for VM space.  The UBC can't directly steal
VM pages, just pages off the common free list.  The VM manages the
free list by paging and swapping based on target page counts (fixed
number of pages, not % of RAM).

The FS metadata cache is a fixed size wired pool.
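
Purely for illustration, here is what those default percentages work out
to on a hypothetical 1GB machine with 4K pages.  The numbers and names
below are mine, not Tru64's:

#include <stdio.h>

int
main(void)
{
        long pages = 262144;                    /* assume 1GB of RAM in 4K pages */
        long max_wired  = pages * 80 / 100;     /* at most 80% may be wired */
        long ubc_min    = pages * 10 / 100;     /* UBC always keeps at least 10% */
        long ubc_borrow = pages * 20 / 100;     /* VM may steal UBC pages only
                                                   when UBC holds more than this */

        printf("max wired:       %ld pages\n", max_wired);
        printf("UBC minimum:     %ld pages\n", ubc_min);
        printf("UBC steal point: %ld pages\n", ubc_borrow);
        return (0);
}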

I can see benefits in being able to separately control FS and non-FS
RAM usage.  The Tru64 defaults are definitely a very poor match for
the application we run on it[1], and being able to reduce the RAM
associated with filesystem buffers is an advantage.

[1] Basically a number of processes querying a _very_ large Oracle SGA.

Peter





Re: vm_pageout_scan badness

2000-10-26 Thread Matt Dillon

:On Tue, Oct 24, 2000 at 01:10:19PM -0700, Matt Dillon wrote:
:> Ouch.  The original VM code assumed that pages would not often be
:> ripped out from under the pageadaemon, so it felt free to restart
:> whenever.  I think you are absolutely correct in regards to the
:> clustering code causing nearby-page ripouts.
:> 
:> I don't have much time available, but let me take a crack at the
:> problem tonight.
:
:While you are at it, would you care and have a look at PR19672.  It
:seems to be at least remotely relevant.  ;-)

Hmmm.  Blech.  contigmalloc is awful.  I'm not even sure if what
it is doing is legal!  If it can't find contiguous space it
tries to flush the entire inactive and active queues.  Every
single page!  not to mention the insane restarting.  The 
algorithm is O(N^2) on an idle machine, and even worse on
machines that might be doing something.

There is no easy fix.  contigmalloc would have to be completely
rewritten.  We could use the placemarker idea to make the loop
'retry' the page that blocked rather than restart at the beginning,
but the fact that contigmalloc tries to flush the entire page
queue means that it could very trivially get stuck on dead devices
(e.g. like a dead NFS mount).  Also, if we don't restart, there is
less of a chance that contigmalloc can find sufficient free space.
When it frees pages it does so haphazardly, and when it flushes
pages out it makes no attempt to free them so an active process
may reuse the page instantly.  Bleh.

I'm afraid I don't have the time to rewrite contigmalloc myself, but
my brain is available to answer questions if someone else wants to
have a go at it.

-Matt

:Cheers,
:%Anton.
:-- 
: and  would be a nice addition
:to HTML specification.
:






VM pager patch (was Re: vm_pageout_scan badness)

2000-10-25 Thread Matt Dillon

Here's a test patch, inclusive of some debugging sysctls:

vm.always_launder               set to 1 to give up on trying to avoid
                                pageouts.

vm.vm_pageout_stats_rescans     Number of times the main inactive scan
                                in the pageout loop had to restart

vm.vm_pageout_stats_xtralaunder Number of times a second pass had to be
                                taken (in normal mode, with always_launder
                                set to 0).

This patch:

* implements a placemarker to try to avoid restarts.

* does not penalize the pageout daemon for being able 
  to cluster writes.

* adds an additional vnode check that should be there

One last note:  I wrote a quick and dirty program to mmap() a bunch
of big files MAP_NOSYNC and then dirty them in a loop.  I noticed
that the filesystem update daemon 'froze up' the system for about a 
second every 30 seconds due to the huge number of dirty MAP_NOSYNC
pages (about 1GB worth) sitting around (it has to scan the vm_page_t's
even if it doesn't do anything with them).  This is a separate issue.
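
For reference, a minimal sketch of that kind of test program, assuming
FreeBSD's MAP_NOSYNC flag and files named on the command line (this is
not the actual program, and error handling is kept to a bare minimum):

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        char **base = calloc(argc, sizeof(*base));
        size_t *len = calloc(argc, sizeof(*len));
        long pgsz = sysconf(_SC_PAGESIZE);
        struct stat st;
        size_t off;
        int i, fd;

        for (i = 1; i < argc; i++) {
                if ((fd = open(argv[i], O_RDWR)) < 0 || fstat(fd, &st) < 0)
                        err(1, "%s", argv[i]);
                len[i] = (size_t)st.st_size;
                base[i] = mmap(NULL, len[i], PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NOSYNC, fd, 0);
                if (base[i] == MAP_FAILED)
                        err(1, "mmap %s", argv[i]);
                close(fd);              /* the mapping keeps the file referenced */
        }

        for (;;) {                      /* dirty one byte per page, forever */
                for (i = 1; i < argc; i++)
                        for (off = 0; off < len[i]; off += (size_t)pgsz)
                                base[i][off]++;
                sleep(1);
        }
}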

If Alfred and others running heavily loaded systems are able to test
this patch sufficiently, we can include it (minus the debugging
sysctls) in the release.  If not, I will wait until after
the release is rolled before I commit it, or whatever the final patch
winds up looking like.

-Matt

Index: vm_page.c
===
RCS file: /home/ncvs/src/sys/vm/vm_page.c,v
retrieving revision 1.147.2.3
diff -u -r1.147.2.3 vm_page.c
--- vm_page.c   2000/08/04 22:31:11 1.147.2.3
+++ vm_page.c   2000/10/26 04:43:22
@@ -1783,6 +1783,12 @@
("contigmalloc1: page %p is not PQ_INACTIVE", 
m));
 
next = TAILQ_NEXT(m, pageq);
+   /*
+* ignore markers
+*/
+   if (m->flags & PG_MARKER)
+   continue;
+
if (vm_page_sleep_busy(m, TRUE, "vpctw0"))
goto again1;
vm_page_test_dirty(m);
Index: vm_page.h
===
RCS file: /home/ncvs/src/sys/vm/vm_page.h,v
retrieving revision 1.75.2.3
diff -u -r1.75.2.3 vm_page.h
--- vm_page.h   2000/09/16 01:08:03 1.75.2.3
+++ vm_page.h   2000/10/26 04:17:28
@@ -251,6 +251,7 @@
 #define PG_SWAPINPROG  0x0200  /* swap I/O in progress on page  */
 #define PG_NOSYNC  0x0400  /* do not collect for syncer */
 #define PG_UNMANAGED   0x0800  /* No PV management for page */
+#define PG_MARKER  0x1000  /* special queue marker page */
 
 /*
  * Misc constants.
Index: vm_pageout.c
===
RCS file: /home/ncvs/src/sys/vm/vm_pageout.c,v
retrieving revision 1.151.2.4
diff -u -r1.151.2.4 vm_pageout.c
--- vm_pageout.c    2000/08/04 22:31:11 1.151.2.4
+++ vm_pageout.c    2000/10/26 05:07:45
@@ -143,6 +143,9 @@
 static int disable_swap_pageouts=0;
 
 static int max_page_launder=100;
+static int always_launder=0;
+static int vm_pageout_stats_rescans=0;
+static int vm_pageout_stats_xtralaunder=0;
 #if defined(NO_SWAPPING)
 static int vm_swap_enabled=0;
 static int vm_swap_idle_enabled=0;
@@ -186,6 +189,12 @@
 
 SYSCTL_INT(_vm, OID_AUTO, max_page_launder,
CTLFLAG_RW, &max_page_launder, 0, "Maximum number of pages to clean per pass");
+SYSCTL_INT(_vm, OID_AUTO, always_launder,
+   CTLFLAG_RW, &always_launder, 0, "Always launder on the first pass");
+SYSCTL_INT(_vm, OID_AUTO, vm_pageout_stats_rescans,
+   CTLFLAG_RD, &vm_pageout_stats_rescans, 0, "");
+SYSCTL_INT(_vm, OID_AUTO, vm_pageout_stats_xtralaunder,
+   CTLFLAG_RD, &vm_pageout_stats_xtralaunder, 0, "");
 
 
 #define VM_PAGEOUT_PAGE_COUNT 16
@@ -613,11 +622,16 @@
 
 /*
  * vm_pageout_scan does the dirty work for the pageout daemon.
+ *
+ * This code is responsible for calculating the page shortage
+ * and then attempting to clean or free enough pages to hit that
+ * mark.
  */
 static int
 vm_pageout_scan()
 {
vm_page_t m, next;
+   struct vm_page marker;
int page_shortage, maxscan, pcount;
int addl_page_shortage, addl_page_shortage_init;
int maxlaunder;
@@ -651,27 +665,41 @@
/*
 * Figure out what to do with dirty pages when they are encountered.
 * Assume that 1/3 of the pages on the inactive list are clean.  If
-* we think we can reach our target, disable laundering (do not
-* clean any dirty pages).  If we 

Re: vm_pageout_scan badness

2000-10-25 Thread Terry Lambert

> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
> 
> This isn't exactly what I was talking about.  The issue in regards to
> the filesystem syncer is that it fsync()'s an entire file.  If
> you have a big file (e.g. a USENET news history file) the 
> filesystem syncer can come along and exclusively lock it for
> *seconds* while it is fsync()ing it, stalling all activity on
> the file every 30 seconds.

This seems like a broken (non)use of the _SYNC parameters, but I
do now remember the FreeBSD breakage where the dirty page sync
path doesn't know which pages should be sync'ed and which not,
so an msync() on an mmap region degrades to an fsync() of the
whole file.

I guess O_WRITESYNC or msync() fixing is not an option?


> The current VM system already does a good job in allowing files
> to steal pages from themselves.  The sequential I/O detection
> heuristic depresses the priority of pages as they are read making
> it more likely for them to be reused.  Since sequential I/O tends
> to be the biggest abuser of file cache, the current FreeBSD
> algorithms work well in real-life situations.  We also have a few
> other optimizations to reuse pages in there that I had added a year
> or so ago (or fixed up, in the case of the sequential detection
> heuristic).

The biggest abuser that I have seen of this is actually not
sequential.  It is a linker that mmap()'s the object files,
and then seeks all over creation to do the link, forcing all
other pages out of core.

I think the assumption that this is a sequential access problem,
instead of a more general problem, is a bad one (FWIW, building
per-vnode working set quotas fixed the problem with the linker
being antagonistic).


> One of the reasons why Yahoo uses MAP_NOSYNC so much (causing
> the problem that Alfred has been talking about) is because the
> filesystem syncer is 'broken' in regards to generating
> unnecessarily long stalls.

It doesn't stall when it should?  8-) 8-).  I think this is a
case of needing to eventually pay the piper for the music
being played.  If the pages are truly anonymous, then they
don't need sync'ed; if they aren't, then they do need sync'ed.

It sounds to me that if they are seeing long stalls, it's the
msync() bug with not being able to tell what's dirty and what's
clean...


> Personally speaking, I would much rather use MAP_NOSYNC anyway,
> even with a fixed filesystem syncer.   MAP_NOSYNC pages are not
> restricted by the size of the filesystem buffer cache,

I see this as a bug in the non-MAP_NOSYNC case in FreeBSD's use of
vnodes as synonyms for vm_object_t's.  I really doubt, though,
that they are exceeding the maximum file size with a mapping; if
not, then the issue is tuning.  The limits on the size of the
FS buffer cache are arbitrary; it should be possible to relax
them.

Again, I think the biggest problem here is historical, and it
derives from the ability to dissociate a vnode with pages
still hung off it from the backing inode (a cache bust).  I
suspect that if they increased the size of the ihash cache,
they would see much better characteristics.  My personal
preference would be to not dissociate valid but clean pages
from the reference object, until absolutely necessary.  An
easy fix for this would be to allow the FS to own the vnodes,
not have a fixed size pool, and have a struct like:

struct ufs_vnode {
        struct vnode             uv_vnode;     /* must come first, so a pointer */
        struct ufs_in_core_inode uv_inode;     /* to this doubles as a vnode */
};

And pass that around as if it were just a vnode, giving it back
to the VFS that owned it, instead of using a system reclaim
method, in order to reclaim it.  Then if an ihash reclaim was
wanted, it would have to free up the vnode resources to get it.

Using high and low watermarks, instead of a fixed pool would
complete the picture (the use of a fixed per-FS ihash pool in
combination with a high/low watermarked per-system vnode pool
is part of what causes the problem in the first place; an
analytical mechanic or electronics buff would call this a
classic case of "impedance mismatch").


> so you can have a whole lot more dirty pages in the system
> than you would normally be able to have.

E.g. they are working around an arbitrary, and wrong-for-them,
administrative limit, instead of changing it.  Bletch.


> This 'feature' has had the unfortunate side effect of screwing
> up the pageout daemon's algorithms, but that's fixable.

I think the idea of a fixed limit on the FS buffer cache is
probably wrong in the first place; certainly, there must be
high and low reserves, but:

|---------------------------------------------------|  all of memory
      |---------------------------------------------|  FS allowed use
|---------------------------------------------|        non-FS allowed use
|-----|                                                 non-FS reserve
                                              |------|  FS reserve

...in other words, a reserve-based system, rather than a limit
based system.

Re: vm_pageout_scan badness

2000-10-25 Thread Matt Dillon

:
:Consider that a file with a huge number of pages outstanding
:should probably be stealing pages from its own LRU list, and
:not the system, to satisfy new requests.  This is particularly
:true of files that are demanding resources on a resource-bound
:system.
:...
:   Terry Lambert
:   [EMAIL PROTECTED]

This isn't exactly what I was talking about.  The issue in regards to
the filesystem syncer is that it fsync()'s an entire file.  If
you have a big file (e.g. a USENET news history file) the 
filesystem syncer can come along and exclusively lock it for
*seconds* while it is fsync()ing it, stalling all activity on
the file every 30 seconds.

The current VM system already does a good job in allowing files
to steal pages from themselves.  The sequential I/O detection
heuristic depresses the priority of pages as they are read making
it more likely for them to be reused.  Since sequential I/O tends
to be the biggest abuser of file cache, the current FreeBSD
algorithms work well in real-life situations.  We also have a few
other optimizations to reuse pages in there that I had added a year
or so ago (or fixed up, in the case of the sequential detection
heuristic).

One of the reasons why Yahoo uses MAP_NOSYNC so much (causing the problem
that Alfred has been talking about) is because the filesystem
syncer is 'broken' in regards to generating unnecessarily long stalls.

Personally speaking, I would much rather use MAP_NOSYNC anyway, even with
a fixed filesystem syncer.   MAP_NOSYNC pages are not restricted by
the size of the filesystem buffer cache, so you can have a whole
lot more dirty pages in the system than you would normally be able to
have.  This 'feature' has had the unfortunate side effect of screwing
up the pageout daemon's algorithms, but that's fixable.

-Matt






Re: vm_pageout_scan badness

2000-10-25 Thread Anton Berezin

On Tue, Oct 24, 2000 at 01:10:19PM -0700, Matt Dillon wrote:
> Ouch.  The original VM code assumed that pages would not often be
> ripped out from under the pageadaemon, so it felt free to restart
> whenever.  I think you are absolutely correct in regards to the
> clustering code causing nearby-page ripouts.
> 
> I don't have much time available, but let me take a crack at the
> problem tonight.

While you are at it, would you care and have a look at PR19672.  It
seems to be at least remotely relevant.  ;-)

> I don't think we want to add another workaround to code that
> already has too many of them.  The solution may be to create a
> dummy placemarker vm_page_t and to insert it into the pagelist
> just after the current page after we've locked it and decided we
> have to do something significant to it.  We would then be able to
> pick the scan up where we left off using the placemarker.
> 
> This would allow us to get rid of the restart code entirely, or at
> least devolve it back into its original design (i.e. something
> that would not happen very often).  Since we already have cache
> locality of reference for the list node, the placemarker idea
> ought to be quite fast.
> 
> I'll take a crack at implementing the openbsd (or was it netbsd?)
> partial fsync() code as well, to prevent the update daemon from
> locking up large files that have lots of dirty pages for long
> periods of time.

Cheers,
%Anton.
-- 
 and  would be a nice addition
to HTML specification.





Re: vm_pageout_scan badness

2000-10-25 Thread Terry Lambert

> > I'll take a crack at implementing the openbsd (or was it netbsd?) partial
> > fsync() code as well, to prevent the update daemon from locking up large
> > files that have lots of dirty pages for long periods of time.
> 
> Making the partial fsync would help some people but probably not
> these folks.

I think this would be better handled as a per file working set
quota, which could not be exceeded, unless changed by root.

Consider that a file with a huge number of pages outstanding
should probably be stealing pages from its own LRU list, and
not the system, to satisfy new requests.  This is particularly
true of files that are demanding resources on a resource-bound
system.


> The people getting hit by this are Yahoo! boxes, they have giant areas
> of NOSYNC mmap'd data, the issue is that for them the first scan through
> the loop always sees dirty data that needs to be written out.  I think
> they also need a _lot_ more than 32 pages cleaned per pass because all
> of thier pages need laundering.

First principles?

What are they doing, such that this situation arises in the
first place?  Having a clue to the problem they are trying to
resolve, which causes this problem as a side effect, would
both help to clarify if there were a better soloution for them,
as well as what FreeBSD should potentially act like they were
asking for instead, when/if the situation arose.


> It might be wise to switch to a 'launder mode' if this sort of
> usage pattern is detected and figure some better figure to use than
> 32, I was hoping you'd have some suggestions for a heuristic to
> detect this along the lines of what you have implemented in bufdaemon.

This is kind of evil.  You could do low and high watermarking,
as you suggest, but without any idea of the queue retention
time to expect, and how bursty the situation is, there's no
way to pick an appropriate algorithm.


Terry Lambert
[EMAIL PROTECTED]
---
Any opinions in this posting are my own and not those of my present
or previous employers.





Re: vm_pageout_scan badness

2000-10-24 Thread Matt Dillon

:Ok, now I feel pretty lost, how is there a relationship between
:max_page_launder and async writes?  If increasing max_page_launder
:increases the amount of async writes, isn't that a good thing?

The async writes are competing against the rest of the system
for disk resources.  While it is ok for an async write to stall, the
fact that it will cause other processes read() or page-in (which is
nominally synchronous) requests to stall can result in seriously
degraded operation for those processes.

Piling on the number of async writes running in parallel is not
going to improve the performance of page-out daemon, but it will
degrade the performance of I/O issued by other processes in the system.
The only two reasons the pageout daemon is not doing synchronous writes
are: (1) because it can't afford to stall on a slow device (or NFS, etc.)
and (2) so it can parallelize I/O across different devices.  But since
the pageout daemon isn't really all that smart and doesn't track what it
does, the whole algorithm devolves into issuing a certain number of
asynchronous I/O's all at once governed by max_page_launder.
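
For what it's worth, the knob in question is exported as the
vm.max_page_launder sysctl (see the SYSCTL_INT line in the patch
context above), so it can be inspected or nudged from user space with
a tiny program along these lines (illustrative only):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
        int cur, val;
        size_t len = sizeof(cur);

        if (sysctlbyname("vm.max_page_launder", &cur, &len, NULL, 0) < 0)
                err(1, "sysctlbyname");
        printf("vm.max_page_launder = %d\n", cur);

        if (argc > 1) {                         /* optional new value */
                val = atoi(argv[1]);
                if (sysctlbyname("vm.max_page_launder", NULL, NULL,
                    &val, sizeof(val)) < 0)
                        err(1, "sysctlbyname (set)");
                printf("set to %d\n", val);
        }
        return (0);
}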

-Matt






Re: vm_pageout_scan badness

2000-10-24 Thread Alfred Perlstein

* Matt Dillon <[EMAIL PROTECTED]> [001024 15:32] wrote:
> 
> :The people getting hit by this are Yahoo! boxes, they have giant areas
> :of NOSYNC mmap'd data, the issue is that for them the first scan through
> :the loop always sees dirty data that needs to be written out.  I think
> :they also need a _lot_ more than 32 pages cleaned per pass because all
> :of their pages need laundering.
> :
> :Perhaps if you detected how often the routine was being called you
> :could slowly raise max_page_launder to compensate and lower it
> :after some time without a shortage.  Perhaps adding a quarter of
> :'should_have_laundered' to maxlaunder for a short interval.
> :
> :It might be wise to switch to a 'launder mode' if this sort of
> :usage pattern is detected and figure some better figure to use than
> :32, I was hoping you'd have some suggestions for a heuristic to
> :detect this along the lines of what you have implemented in bufdaemon.
> 
> We definitely don't want to increase max_page_launder too much... the
> problem is that there is a relationship between it and the number of
> simultaneous async writes that can be queued in one go, and that can
> interfere with normal I/O.  But perhaps we should decouple it from the
> I/O count and have it count clusters instead of pages.  i.e. this line:

Ok, now I feel pretty lost, how is there a relationship between
max_page_launder and async writes?  If increasing max_page_launder
increases the amount of async writes, isn't that a good thing?

> 
>   written = vm_pageout_clean(m);
>   if (vp)
>   vput(vp)
>   maxlaunder -= written;
> 
> Can turn into:
> 
>   if (vm_pageout_clean(m))
>   --maxlaunder;
>   if (vp)
>   vput(vp);
> 
> In regards to speeding up paging, perhaps we can implement a heuristic
> similar to what buf_daemon() does.  We could wake the pageout daemon up
> more often.   I'll experiment with it a bit.  We certainly have enough
> statistical information to come up with something good.

That looks like it would help by ignoring the clustered data which
probably got written out pretty quickly and reducing the negative
cost/gain to a single page.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."





Re: vm_pageout_scan badness

2000-10-24 Thread Matt Dillon


:The people getting hit by this are Yahoo! boxes, they have giant areas
:of NOSYNC mmap'd data, the issue is that for them the first scan through
:the loop always sees dirty data that needs to be written out.  I think
:they also need a _lot_ more than 32 pages cleaned per pass because all
:of their pages need laundering.
:
:Perhaps if you detected how often the routine was being called you
:could slowly raise max_page_launder to compensate and lower it
:after some time without a shortage.  Perhaps adding a quarter of
:'should_have_laundered' to maxlaunder for a short interval.
:
:It might be wise to switch to a 'launder mode' if this sort of
:usage pattern is detected and figure some better figure to use than
:32, I was hoping you'd have some suggestions for a heuristic to
:detect this along the lines of what you have implemented in bufdaemon.

We definitely don't want to increase max_page_launder too much... the
problem is that there is a relationship between it and the number of
simultaneous async writes that can be queued in one go, and that can
interfere with normal I/O.  But perhaps we should decouple it from the
I/O count and have it count clusters instead of pages.  i.e. this line:

written = vm_pageout_clean(m);
if (vp)
vput(vp)
maxlaunder -= written;

Can turn into:

if (vm_pageout_clean(m))
--maxlaunder;
if (vp)
vput(vp);

In regards to speeding up paging, perhaps we can implement a heuristic
similar to what buf_daemon() does.  We could wake the pageout daemon up
more often.   I'll experiment with it a bit.  We certainly have enough
statistical information to come up with something good.

-Matt

:-Alfred
:






Re: vm_pageout_scan badness

2000-10-24 Thread Alfred Perlstein

* Matt Dillon <[EMAIL PROTECTED]> [001024 13:11] wrote:
> Ouch.  The original VM code assumed that pages would not often be
> ripped out from under the pageadaemon, so it felt free to restart
> whenever.  I think you are absolutely correct in regards to the
> clustering code causing nearby-page ripouts.

Yes, that would make sense to me: if you did a sequential write
to a file, after some time those pages would likely end up in
order on the inactive queue, and when they are cluster-written,
'next' would end up on a different queue because it was written
along with the preceding page.

> I don't have much time available, but let me take a crack at the
> problem tonight.  I don't think we want to add another workaround to
> code that already has too many of them.  The solution may be
> to create a dummy placemarker vm_page_t and to insert it into the pagelist
> just after the current page after we've locked it and decided we have
> to do something significant to it.  We would then be able to pick the
> scan up where we left off using the placemarker.
> 
> This would allow us to get rid of the restart code entirely, or at least
> devolve it back into its original design (i.e. something that would not
> happen very often).  Since we already have cache locality of reference for
> the list node, the placemarker idea ought to be quite fast.
> 
> I'll take a crack at implementing the openbsd (or was it netbsd?) partial
> fsync() code as well, to prevent the update daemon from locking up large
> files that have lots of dirty pages for long periods of time.

Making the partial fsync would help some people but probably not
these folks.

The people getting hit by this are Yahoo! boxes, they have giant areas
of NOSYNC mmap'd data, the issue is that for them the first scan through
the loop always sees dirty data that needs to be written out.  I think
they also need a _lot_ more than 32 pages cleaned per pass because all
of their pages need laundering.

Perhaps if you detected how often the routine was being called you
could slowly raise max_page_launder to compensate and lower it
after some time without a shortage.  Perhaps adding a quarter of
'should_have_laundered' to maxlaunder for a short interval.

It might be wise to switch to a 'launder mode' if this sort of
usage pattern is detected and figure some better figure to use than
32, I was hoping you'd have some suggestions for a heuristic to
detect this along the lines of what you have implemented in bufdaemon.

What you could also do is count the number of pages that could/should have
been laundered during the first pass, and check whether it exceeds a
certain threshold, surpassing the number of pages that were freed via:

if (m->object->ref_count == 0) {
and:
if (m->valid == 0) {
and:
} else if (m->dirty == 0) {

Basically, if maxlaunder is equal to zero and we miss all those tests,
you might want to bump up a counter, and if it exceeds a threshold,
immediately start rescanning and double(?) maxlaunder.
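
A rough user-space model of that heuristic might look like the sketch
below.  The names scan_state, should_relaunder, and LAUNDER_THRESH are
invented for illustration; this is not a tested kernel patch:

#include <stdbool.h>

#define LAUNDER_THRESH  128     /* arbitrary threshold for this sketch */

struct scan_state {
        int maxlaunder;         /* pages we are still allowed to clean */
        int skipped_dirty;      /* dirty pages passed over with maxlaunder == 0 */
};

/*
 * Call this for each inactive page that turned out to be dirty but was
 * not laundered.  Returns true when the caller should jump back to
 * rescan0 with the larger maxlaunder set up here.
 */
static bool
should_relaunder(struct scan_state *ss, int max_page_launder)
{
        if (ss->maxlaunder > 0)
                return (false);         /* still laundering normally */
        if (++ss->skipped_dirty <= LAUNDER_THRESH)
                return (false);         /* not enough evidence yet */
        ss->maxlaunder = 2 * max_page_launder;
        ss->skipped_dirty = 0;
        return (true);
}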

-Alfred






Re: vm_pageout_scan badness

2000-10-24 Thread Matt Dillon

Ouch.  The original VM code assumed that pages would not often be
ripped out from under the pageadaemon, so it felt free to restart
whenever.  I think you are absolutely correct in regards to the
clustering code causing nearby-page ripouts.

I don't have much time available, but let me take a crack at the
problem tonight.  I don't think we want to add another workaround to
code that already has too many of them.  The solution may be
to create a dummy placemarker vm_page_t and to insert it into the pagelist
just after the current page after we've locked it and decided we have
to do something significant to it.  We would then be able to pick the
scan up where we left off using the placemarker.

This would allow us to get rid of the restart code entirely, or at least
devolve it back into its original design (i.e. something that would not
happen very often).  Since we already have cache locality of reference for
the list node, the placemarker idea ought to be quite fast.
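
A stripped-down, user-space sketch of that placemarker idea, using the
stock sys/queue.h TAILQ macros (the page type and helper functions here
are invented stand-ins, not the eventual kernel change): a dummy entry
is inserted after the current element before blocking, and the scan
resumes from it afterwards, so neighbouring entries can be ripped out
in the meantime without forcing a restart.

#include <sys/queue.h>
#include <stddef.h>

struct page {
        TAILQ_ENTRY(page) pageq;
        int is_marker;                  /* stands in for a PG_MARKER flag */
        /* ... real page state would go here ... */
};
TAILQ_HEAD(pagelist, page);

/* Hypothetical helpers standing in for the real per-page work. */
static int  must_block_on(struct page *m) { (void)m; return (0); }
static void do_blocking_work(struct page *m) { (void)m; }
static void do_cheap_work(struct page *m) { (void)m; }

static void
scan_queue(struct pagelist *q)
{
        struct page marker = { .is_marker = 1 };
        struct page *m, *next;

        for (m = TAILQ_FIRST(q); m != NULL; m = next) {
                next = TAILQ_NEXT(m, pageq);
                if (m->is_marker)       /* skip other scanners' markers */
                        continue;
                if (!must_block_on(m)) {
                        do_cheap_work(m);
                        continue;
                }
                /* Leave a placemarker behind us before doing anything slow. */
                TAILQ_INSERT_AFTER(q, m, &marker, pageq);
                do_blocking_work(m);
                /*
                 * Resume from the marker; it doesn't matter what happened
                 * to 'm' or its neighbours while we were blocked.
                 */
                next = TAILQ_NEXT(&marker, pageq);
                TAILQ_REMOVE(q, &marker, pageq);
        }
}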

I'll take a crack at implementing the openbsd (or was it netbsd?) partial
fsync() code as well, to prevent the update daemon from locking up large
files that have lots of dirty pages for long periods of time.

-Matt

:
:Matt, I'm not sure if Paul mailed you yet so I thought I'd take the
:initiative of bugging you about some reported problems (lockups)
:when dealing with machines that have substantial MAP_NOSYNC'd
:data along with a page shortage.
:
:What seems to happen is that vm_pageout_scan (src/sys/vm/vm_pageout.c
:line 618) keeps rescanning the inactive queue.
:
:My guess is that it just doesn't expect someone to have hosed themselves
:by having so many pages that need laundering (maxlaunder/launder_loop);
:on top of that, the comment and code here do the wrong thing for the
:situation:
:
:   /*
:* Figure out what to do with dirty pages when they are encountered.
:* Assume that 1/3 of the pages on the inactive list are clean.  If
:* we think we can reach our target, disable laundering (do not
:* clean any dirty pages).  If we miss the target we will loop back
:* up and do a laundering run.
:*/
:
:   if (cnt.v_inactive_count / 3 > page_shortage) {
:   maxlaunder = 0;
:   launder_loop = 0;
:   } else {
:   maxlaunder = 
:   (cnt.v_inactive_target > max_page_launder) ?
:   max_page_launder : cnt.v_inactive_target;
:   launder_loop = 1;
:   }
:
:The problem is that there's a chance that nearly all the pages on
:the inactive queue need laundering and the loop that starts with
:the label 'rescan0:' is never able to clean enough pages before
:stumbling across a block that has moved to another queue.  This
:forces a jump back to the 'rescan0' label with effectively nothing
:accomplished.
:
:Here's a patch that may help; it's untested, but it shows what I'm
:hoping to achieve, which is to force a maximum on the number of times
:rescan0 will be jumped to while not laundering.
:...
:
:I'm pretty sure that there's yet another problem here, when paging
:out a vnode's pages the output routine attempts to cluster them,
:this could easily make 'next' point to a page that is cleaned and
:put on the FREE queue, when the loop then tests it for
:'m->queue != PQ_INACTIVE' it forces 'rescan0' to happen.
:
:I think one could fix this by keeping a pointer to the previous
:page then the 'goto rescan0;' test might become something like
:this:
:...
:
:Of course we need to set 'prev' properly, but I need to get back
:to my database stuff right now. :)
:
:Also... I wish there was a good heuristic to make max_page_launder
:a bit more adaptive, you've done some wonders with bufdaemon so
:I'm wondering if you had some ideas about that.
:
:-- 
:-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
:"I have the heart of a child; I keep it in a jar on my desk."






vm_pageout_scan badness

2000-10-24 Thread Alfred Perlstein

Matt, I'm not sure if Paul mailed you yet so I thought I'd take the
initiative of bugging you about some reported problems (lockups)
when dealing with machines that have substantial MAP_NOSYNC'd
data along with a page shortage.

What seems to happen is that vm_pageout_scan (src/sys/vm/vm_pageout.c
line 618) keeps rescanning the inactive queue.

My guess is that it just doesn't expect someone to have hosed themselves
by having so many pages that need laundering (maxlaunder/launder_loop);
on top of that, the comment and code here do the wrong thing for the
situation:

/*
 * Figure out what to do with dirty pages when they are encountered.
 * Assume that 1/3 of the pages on the inactive list are clean.  If
 * we think we can reach our target, disable laundering (do not
 * clean any dirty pages).  If we miss the target we will loop back
 * up and do a laundering run.
 */

if (cnt.v_inactive_count / 3 > page_shortage) {
maxlaunder = 0;
launder_loop = 0;
} else {
maxlaunder = 
(cnt.v_inactive_target > max_page_launder) ?
max_page_launder : cnt.v_inactive_target;
launder_loop = 1;
}

The problem is that there's a chance that nearly all the pages on
the inactive queue need laundering and the loop that starts with
the label 'rescan0:' is never able to clean enough pages before
stumbling across a block that has moved to another queue.  This
forces a jump back to the 'rescan0' label with effectively nothing
accomplished.

Here's a patch that may help; it's untested, but it shows what I'm
hoping to achieve, which is to force a maximum on the number of times
rescan0 will be jumped to while not laundering.

Index: vm_pageout.c
===
RCS file: /home/ncvs/src/sys/vm/vm_pageout.c,v
retrieving revision 1.151.2.4
diff -u -u -r1.151.2.4 vm_pageout.c
--- vm_pageout.c    2000/08/04 22:31:11 1.151.2.4
+++ vm_pageout.c    2000/10/24 07:31:39
@@ -618,7 +618,7 @@
 vm_pageout_scan()
 {
vm_page_t m, next;
-   int page_shortage, maxscan, pcount;
+   int page_shortage, maxscan, maxtotscan, pcount;
int addl_page_shortage, addl_page_shortage_init;
int maxlaunder;
int launder_loop = 0;
@@ -672,13 +672,23 @@
 * we have scanned the entire inactive queue.
 */
 
+rescantot:
+   /* make sure we don't hit rescan0 too many times */
+   maxtotscan = cnt.v_inactive_count;
 rescan0:
addl_page_shortage = addl_page_shortage_init;
maxscan = cnt.v_inactive_count;
+   if (maxtotscan < 1) {
+   maxlaunder = 
+   (cnt.v_inactive_target > max_page_launder) ?
+   max_page_launder : cnt.v_inactive_target;
+   }   
for (m = TAILQ_FIRST(&vm_page_queues[PQ_INACTIVE].pl);
 m != NULL && maxscan-- > 0 && page_shortage > 0;
 m = next) {
 
+   --maxtotscan;
+
cnt.v_pdpages++;
 
if (m->queue != PQ_INACTIVE) {
@@ -930,7 +940,7 @@
maxlaunder = 
(cnt.v_inactive_target > max_page_launder) ?
max_page_launder : cnt.v_inactive_target;
-   goto rescan0;
+   goto rescantot;
}
 
/*


(still talking about vm_pageout_scan()):

I'm pretty sure that there's yet another problem here, when paging
out a vnode's pages the output routine attempts to cluster them,
this could easily make 'next' point to a page that is cleaned and
put on the FREE queue, when the loop then tests it for
'm->queue != PQ_INACTIVE' it forces 'rescan0' to happen.

I think one could fix this by keeping a pointer to the previous
page; then the 'goto rescan0;' test might become something like
this:

/*
 * We keep a back reference just in case the vm_pageout_clean()
 * happens to clean the page after the one we just cleaned
 * via clustering, this would make next point to something not
 * one the INACTIVE queue, as a stop-gap we keep a pointer
 * to the previous page and attempt to use it as a fallback
 * starting point before actually starting at the head of the
 * INACTIVE queue again
 */
if (m->queue != PQ_INACTIVE) {
if (prev != NULL && prev->queue == PQ_INACTIVE) {
m = TAILQ_NEXT(prev, pageq);
if (m == NULL || m->queue != PQ_INACTIVE)
goto rescan0;
} else {
goto rescan0;
}
}


Of course we need to set 'prev' properly, but I need to get back
to my database stuff right now. :)

Also... I wish there was a good heuristic to make max_page_launder
a bit more adaptive, you've done some wonders with bufdaemon so