Re: vm balance
:
:Julian Elischer wrote:
:> You can mmap() devices and you can mmap files..
:>
:> you cannot mmap FIFOs or sockets.
:>
:> for this reason I think that devices are still well represented by
:> vnodes. If we merged vnodes and vm objects,
:> then if devices were not vnodes, how would you represent
:> a vm area that maps a device?
:
:Merging vnodes and vm objects is an incredibly bad idea. There
:is a lot of other work that should be done before that can even
:be considered, and then it shouldn't be considered.
:
:In other words, it's a good excuse for getting some needed
:changes in, but it's not a good idea.
:
:I know you and Kirk love the idea, but, truly, it is a bad
:idea.

    I like the idea too, but every time I've looked at it it's been
    a huge mess. In short, I don't think we will *ever* be able to
    merge vnodes and VM objects.

					-Matt
Re: vm balance
:
:Julian Elischer wrote:
:> Actually there have been times when I did want to mmap a datastream..
:> I think a datastream mapped into a user buffer-space is one of the
:> possible 0-copy methods people sometimes mention.
:
:This is ugly. There are prettier ways of doing it.
:
:-- Terry

    Considering that a number of failed attempts have already been made
    to optimize standard read()/write() calls, and that mmap() isn't
    really all that well suited to a datastream, I would be inclined to
    develop a set of system calls to deal with 0-copy streams. I did
    something similar in one of my embedded OS's. It could actually
    apply to normal files as easily as to pipes and both UDP and TCP
    data streams, and would not require any fancy splitting of network
    headers versus the data payload. It gives the OS the ultimate
    flexibility in managing 0-copy buffer spaces.

    actual = readptr(int fd, void **ptr, int bytes);

	Attempt to read 'bytes' bytes of data from the descriptor. The
	operating system will map the data read-only and supply a
	pointer to the base of the buffer (which may or may not be
	page-aligned). The actual number of bytes available is
	returned. actual < bytes does NOT signify EOF, because the OS
	may have other limitations, such as having to return piecemeal
	mbufs, skip packet headers, and so forth.

	The data will remain valid until the next readptr(), read(), or
	lseek() call on the descriptor, or until the descriptor is
	closed. You can inform the OS that you have read all the data
	by calling readptr(fd, NULL, 0) (i.e. if this is a TCP
	connection, this would allow TCP to reclaim the related mbufs).
	The OS typically leaves the mapped space mapped for efficiency,
	but the only valid data exists within the specific portion
	represented by your last readptr() call. The OS is free to
	reuse its own mappings at any time as long as it leaves the
	data it has guaranteed to be valid in place.

    avail = writeptr(int fd, void **ptr, int bytes);

	Request buffer space to write 'bytes' bytes of data. The OS
	will map appropriate buffer space and return a pointer to it.
	This procedure returns the actual number of bytes that may be
	written into the returned buffer. The OS may limit the
	available buffer space to fit mbuf/MTU requirements on a TCP
	connection, or for other reasons.

	You should fill the buffer with 'avail' bytes and call
	writeptr() again to commit your buffer. Calling lseek() or
	write() will abort the buffer. You can commit your last
	writeptr() by calling writeptr(fd, NULL, 0). Close()ing the
	descriptor without committing the buffer will result in the
	loss of the buffer.

    note: readptr() and writeptr() do not interfere with each other
    when operating on streams, but one will abort the other when
    operating on files, due to the seek position changing.

    IOCTLs:

    ioctl(fd, IOPTR_WABORT, bytes);

	Abort 'bytes' worth of a previously reserved write buffer.
	Passing -1 aborts the entire buffer.

    ioctl(fd, IOPTR_WCOMMIT, bytes);

	Commit 'bytes' worth of a previously reserved write buffer,
	aborting any remainder after that. Passing -1 commits the
	entire 'avail' space. This can be used to reserve a large
	write buffer and then commit a smaller data set. For example,
	a web server can reserve a 4K response buffer but only commit
	the actual length of the response.

    ioctl(fd, IOPTR_WCLEAR, 0);

	Abort any previously reserved write buffer and force the OS to
	unmap any cached memory space associated with writeptr().
    ioctl(fd, IOPTR_RABORT, bytes);

	Abort a previously returned read buffer, allowing the OS to
	reclaim the buffer space if it wishes (especially useful for
	TCP connections, which might have to hold onto mbufs). 'bytes'
	worth of the buffer is aborted. Passing -1 aborts the entire
	buffer.

    ioctl(fd, IOPTR_RCLEAR, 0);

	Abort any previously returned read buffer and force the OS to
	unmap any cached memory space associated with readptr().

					-Matt
					Matthew Dillon
					<[EMAIL PROTECTED]>
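[A minimal consumer sketch of the interface proposed above. Nothing
here exists in any kernel: readptr(), writeptr(), and the IOPTR_*
ioctls are taken verbatim from the proposal, and the error handling
and buffer size are guesswork.]

	#include <string.h>
	#include <sys/ioctl.h>

	/*
	 * Hypothetical zero-copy stream pump using the proposed API.
	 * Illustrates the calling pattern only: borrow a read-only
	 * kernel buffer, reserve a write buffer, commit, release.
	 */
	int
	pump(int ifd, int ofd)
	{
		void *rbuf, *wbuf;
		int n, avail, off;

		while ((n = readptr(ifd, &rbuf, 65536)) > 0) {
			for (off = 0; off < n; off += avail) {
				avail = writeptr(ofd, &wbuf, n - off);
				if (avail <= 0)
					return (-1);
				/* a real producer would generate data in
				 * place rather than copying it over */
				memcpy(wbuf, (char *)rbuf + off, avail);
				ioctl(ofd, IOPTR_WCOMMIT, avail);
			}
			readptr(ifd, NULL, 0);	/* release mbufs/pages */
		}
		writeptr(ofd, NULL, 0);		/* commit pending buffer */
		return (n);
	}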
Re: vm balance
Julian Elischer wrote:
> You can mmap() devices and you can mmap files..
>
> you cannot mmap FIFOs or sockets.
>
> for this reason I think that devices are still well represented by
> vnodes. If we merged vnodes and vm objects,
> then if devices were not vnodes, how would you represent
> a vm area that maps a device?

Merging vnodes and vm objects is an incredibly bad idea. There
is a lot of other work that should be done before that can even
be considered, and then it shouldn't be considered.

In other words, it's a good excuse for getting some needed
changes in, but it's not a good idea.

I know you and Kirk love the idea, but, truly, it is a bad
idea.

As far as the other work is concerned:

o	Get rid of struct fileops
o	Get rid of specfs entirely; use vp's instead
o	Fix the permission/ownership problems on FIFOs and sockets
	that result from the use of struct fileops
o	Fix range locks on non-file objects
o	Move the lock list to the vnode
o	Make the VFS advisory locking into a veto-based interface,
	which only has something other than "return 0;" in the NFS
	client code
o	Delay lock coalescing until after the attempt has *not been*
	vetoed, in order to save wire traffic in the "local lock
	conflict" case
o	Consider getting rid of lock coalescing entirely, by default,
	in order to comply with the NFSv4 RFC's non-coalescing of
	locks requirement
o	Allow mmap'ing of FIFO objects
o	Constrain the buffer size to a multiple of a page size,
	instead of the weird value of "5K"
o	Implement them slightly differently
o	Get rid of fifofs
o	Give up on the idea of mmap'ing streams, since to do that
	would require constraining mbufs to page-sized chunks per
	mbuf, at least in the receive case, since there are adjacency
	problems with mapping consecutive packets that don't represent
	the same flow (unless you are willing to rewrite all the
	firmware in the world, in which case, "go for it!" 8-)).

-- Terry
Re: vm balance
Julian Elischer wrote:
> Actually there have been times when I did want to mmap a datastream..
> I think a datastream mapped into a user buffer-space is one of the
> possible 0-copy methods people sometimes mention.

This is ugly. There are prettier ways of doing it.

-- Terry
Re: vm balance
:I think we need to remember that we do not always have a
:backing object, nor is a backing object always desirable.
:
:The performance of an mmap'ed file, or swap-backed anonymous
:region is _significantly_ below that of unbacked objects.
:
:-- Terry

    This is not true, Terry. There is no performance degradation with
    swap versus unbacked storage, and no performance degradation with
    file-backed storage if you use MAP_NOSYNC to adjust the write
    flushing characteristics of the map.

    Additionally, there is no 'write through' in the VM layer per se --
    the filesystem syncer has to come along and actually look for dirty
    pages to sync to the backing store (and with MAP_NOSYNC it doesn't
    bother). The VM layers do not touch the backing store at all until
    they absolutely have to. For example, swap is not allocated until
    the pagedaemon actually decides to page something out.

    This leaves only the pageout daemon, which operates as it always
    has... if you are not squeezed for memory, it won't try to page
    anything out. And you can always use madvise(), msync(), and
    mlock() on top of everything else to adjust the VM characteristics
    of a section of memory (though personally speaking I don't think
    mlock() is necessary with 4.x's VM system unless you need
    realtime).

    In short, mmap()'s backing store is not an issue in 4.x. Read the
    manual page for mmap for more information; I fleshed it out a long
    time ago to explain all of this.

					-Matt
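[MAP_NOSYNC is a real FreeBSD mmap(2) flag; a minimal example of the
usage described above. The file path and length are illustrative, and
error handling is trimmed.]

	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* Map a scratch file read/write without the filesystem syncer
	 * periodically flushing its dirty pages to disk. */
	static char *
	map_scratch(const char *path, size_t len)
	{
		int fd = open(path, O_RDWR | O_CREAT, 0600);

		if (fd < 0 || ftruncate(fd, len) < 0)
			return (NULL);
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_NOSYNC, fd, 0);
		close(fd);	/* the mapping persists after close */
		return (p == MAP_FAILED ? NULL : p);
	}
	/* Dirty pages stay in RAM until pageout, or until an explicit
	 * msync(p, len, MS_SYNC) when persistence actually matters. */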
Re: Proposed struct file (was Re: vm balance)
Matt Dillon wrote:
>
>     This is all preliminary. The question is whether we can
>     cover enough bases for this to be viable.
>
>     Here is a proposed struct file. Make f_data opaque (or
>     more opaque), add f_object, extend fileops (see next
>     structure), and add f_vopflags to indicate the presence
>     of a vnode in f_data, allowing extended filesystem ops
>     (e.g. rename, remove, fchown, etc etc etc).

1)	struct fileops is evil; adding to it contributes to its
	inherent evil-ness.

2)	The new structure is too large.

3)	The old structure is too large; I have a need for 1,000,000
	open files for a particular application, and I'm not willing
	to give up that much memory.

-- Terry
Re: vm balance
Poul-Henning Kamp wrote:
>
> In message <[EMAIL PROTECTED]>, Kirk McKusick writes:
>
> >Every vnode in the system has an associated object.
>
> No: device vnodes dont...
>
> I think the correct solution to that is to move devices away from
> vnodes and into the fdesc layer, just like fifo's and sockets.

This is really, likewise, a bad idea.

The "struct fileops" has been a problem from day one. It exists for
devices because we still have "specfs", and have not moved over to a
"devfs" that uses vnodes instead of using strategy routines invoked
from a "struct fileops *" dereference.

The code was smeared into the FIFO/socket/IPC code as a poor man's
integration to get something working. When that happened, the ability
to do normal things like set ownership, permissions, etc., on things
like FIFOs disappeared. FreeBSD is much poorer with regard to full
compliance with POSIX semantics on things like F_ fcntl() arguments
and the like when applied to sockets. Linux, Solaris, AIX, and other
POSIX and Single UNIX Specification compliant OSs don't suffer these
same problems.

Perhaps one of the most annoying things about FreeBSD is the inability
to perform advisory locking on anything but true vnode objects... and
then only if the underlying VFS has an advisory lock chain hung off of
some private structure, which can't be rescued except through the
evils of POSIX locking semantics. Many applications use advisory lock
chains off of devices to communicate region protection information not
directly related to really protecting the resource.

Similarly, "struct fileops" is the main culprit, to my mind, behind
the inability of FreeBSD to support cloning devices, such as that
needed for multiple virtual machine instances in vmware to work as it
does in Linux and other first-class host OSs.

-- Terry
Re: vm balance
[ ... merging vnode and vm_object_t ... ]

Kirk McKusick wrote:
> Every vnode in the system has an associated object. Every object
> backed by a file (e.g., everything but anonymous objects) has an
> associated vnode. So, the performance of one is pretty tied to the
> performance of the other. Matt is right that the VM does locking
> on a page level, but then has to get a lock on the associated
> vnode to do a read or a write, so really is pretty tied to the
> vnode lock performance. Merging the two data structures is not
> likely to change the performance characteristics of the system for
> either better or worse. But it will save a lot of headaches having
> to do with lock ordering that we have to deal with at the moment.

I really, really dislike the idea of a merge of these objects, still,
and not just because it will be nearly impossible to make object
coherency work in a stack of two or more VFS layers if this change
ever goes through.

When John Dyson originally wrote the FreeBSD unified VM and buffer
cache code under contract for Oracle, for use in their Oracle 8i and
FreeBSD based NC server platform, he did so in such a way as to allow
anonymous objects, which did not have backing store associated with
them. This was the memory pulled off of /dev/zero, and the memory in
SYSVSHM. The main benefit of doing this is that it saves an incredible
amount of write-through, which would otherwise be necessary to
maintain coherency with the backing object (vnode).

I think we need to remember that we do not always have a backing
object, nor is a backing object always desirable.

The performance of an mmap'ed file, or swap-backed anonymous region,
is _significantly_ below that of unbacked objects.

-- Terry
Re: vm balance
Poul-Henning Kamp wrote:
>
> In message <[EMAIL PROTECTED]>, Matt Dillon writes:
>
> >Actually, all this talk does imply that VM objects should be
> >independent of vnodes. Devices may need to mmap (requiring a VM
> >object), but don't need all the baggage of a vnode. Julian is
> >absolutely correct there.
>
> Well, you have other VM Objects which don't map to vnodes: swap
> backed anonymous objects for instance.

there has been talk of MAKING those have a vnode by making a swapfs.

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> In message <[EMAIL PROTECTED]>, Matt Dillon writes:
> >If this will get rid of or clean up the specfs garbage, then I'm all
> >for it. I would love to see a 'clean' fileops based device interface.
>
> specfs, aliased vnodes, you name it...
>
> I think the aliased vnodes is the single strongest argument of them
> all for doing this...

	I think that this can be (and already is) solved in the other
way. Here is how I did it on my test system (quoted from a mail to
Bruce Evans):

--quote-start--
	I'm working on this problem too, and these vop_lock/unlock in
the spec_open/read/write vnops cause a real pain. Using a generic
vnode stacking/layering mechanism (diffs will be published soon) I've
reorganized the way device vnodes are handled. Each device gets its
own vnode of type VT_SPEC, which belongs to a hidden specfs mount.
When any real filesystem tries to look up a vnode for a specific
device via addaliasu(), addalias() just stacks the filesystem vnode
over the specfs vnode:

	fs1/vnode1      fs1/vnode8      fs2/vnode1
	    |               |               |
	    +---------------+---------------+
	                    |
	                    V
	              specfs vnode

	The specfs vnode can also be used directly as the root vnode
for any mounted filesystem. Obviously, there is no need for device
aliases, because a device can be controlled only via a single vnode.
The v_rdev field also goes away from the vnode structure, and
vn_todev() is the right way to get a pointer to the underlying device.

	But there is a real problem with the locking/unlocking used by
specfs. E.g., if the specfs vnode's lock is used as the lock for an
entire layer tree, then things will be totally broken, because a
blocked spec_read() operation may unlock a different vnode which
should stay locked, and even more problems are caused by the read
lock being shared... Using a separate lock for each vnode partially
solves the problem, but does not completely emulate the old behavior
of an exclusive lock on the open operation. For example, if we call
open(vn1) and it blocks, a second open(vn1) will get stuck waiting for
the lock on vn1, while open(vn8) will work just fine. This problem is
common to stacked filesystems, and many papers avoid talking about it.
The "right" solution is to have a "call stack", so that an unlock
operation can unlock only a single chain of the above vnodes, but I
don't see a simple way to implement it for stacks containing more than
two layers :(
--quote-end--

	Now, regarding the new file operations structure: it is pretty
obvious that most of the operations will resemble vnode operations.
However, it is a misdesign of VFS to not let a filesystem do
per-file-descriptor tracking of at least OPEN/CLOSE operations. It is
also pretty obvious that file operations (FOPs) are just a layer above
VOP operations. So, why not do things right and add the capability to
the existing VFS to handle per-file operations properly? Of course,
this will require more brain work, but the results will be definitely
better.

	Let's get back to vnodes/vm/files/devices: I think it is a
mistake to rip vnodes out of devices. But I agree that the vnode
structure is too fat to be used in a more general way. If it is
possible to clean it up, then we can easily build any hierarchies we
want:

	file1   file2   file3
	  |       |       |
	  +---+---+       |
	      |           |
	   vnode1      vnode2
	      |           |
	      +-----+-----+
	            |
	        device1

--
Boris Popov
http://www.butya.kz/~bp/
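[A rough sketch of the stacking Boris describes, in the style of
FreeBSD's null layer. addaliasu()/addalias() and vn_todev() are the
real 4.x entry points he mentions; everything else here, including
specfs_getvnode() and the struct name, is invented for illustration.]

	/* Per-device private data hung off each filesystem-level
	 * device vnode; it forwards to the one shared specfs vnode. */
	struct specnode {
		struct vnode	*sn_lower;	/* shared specfs vnode */
		dev_t		 sn_dev;	/* underlying device */
	};

	/* Hypothetical addalias() replacement: stack the filesystem
	 * vnode 'vp' over the specfs vnode for 'dev', creating the
	 * specfs vnode on first use. */
	static void
	spec_stack(struct vnode *vp, dev_t dev)
	{
		struct specnode *sn;

		sn = malloc(sizeof(*sn), M_TEMP, M_WAITOK);
		sn->sn_lower = specfs_getvnode(dev);	/* find-or-create */
		sn->sn_dev = dev;
		vp->v_data = sn;	/* VOPs now forward downward */
	}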
Re: vm balance
On Wed, Apr 18, 2001 at 10:26:40AM -0700, Julian Elischer wrote:
> Robert Watson wrote:
> >
> > On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
> >
> > As I indicated in my follow-up mail, the statement about seeking was
> > incorrect, that is a property of the open file structure; I believe
> > the remainder still holds true. When was the last time you tried
> > mmap'ing or seeking on the socket? A socket represents a buffered
> > data stream which does not allow arbitrary read/write operations at
> > arbitrary offsets.
>
> Actually there have been times when I did want to mmap a datastream..
> I think a datastream mapped into a user buffer-space is one of the
> possible 0-copy methods people sometimes mention.

Mmapped data streams: audio IO. There are probably others.

--
Andrew
Re: Proposed struct file (was Re: vm balance)
    (oops, I forgot to add fo_truncate() to the fileops)

					-Matt
Proposed struct file (was Re: vm balance)
    This is all preliminary. The question is whether we can cover
    enough bases for this to be viable.

    Here is a proposed struct file. Make f_data opaque (or more
    opaque), add f_object, extend fileops (see next structure), and
    add f_vopflags to indicate the presence of a vnode in f_data,
    allowing extended filesystem ops (e.g. rename, remove, fchown,
    etc etc etc).

struct file {
	LIST_ENTRY(file) f_list;	/* list of active files */
	short	f_flag;			/* see fcntl.h */
	short	f_type;			/* descriptor type */
	short	f_vopflags;		/* extended command set flags */
	short	f_FILLER2;		/* (OLD) references from message queue */
	struct	ucred *f_cred;		/* credentials associated with descriptor */
	struct	fileops *f_ops;		/* FILE OPS */
	int	f_seqcount;		/* (sequential heuristic) */
	off_t	f_nextoff;		/* (sequential heuristic) */
	off_t	f_offset;		/* seek position */
	caddr_t	f_data;			/* opaque data (was vnode or socket) */
	vm_object_t f_object;		/* VM object if mmapable/cacheable, or NULL */
	int	f_count;		/* reference count */
	int	f_msgcount;		/* reference count from message queue */

	(additional elements required to support devices, maybe just a
	dev_t reference or something like that. I dunno).
};

    Proposed fileops structure (shared): Remove the ucred argument
    (obtain the ucred from struct file), and add additional functions.

    Add cached and uncached versions of fo_read() ... all users will
    use fo_read(), but this way you can vector fo_read() to a generic
    VM Object layer which can then call fo_readnc() for anything that
    can't be handled by that layer. Same with fo_write(). Add
    additional flags to fo_writenc() to handle write-behind,
    notification that a write occurred in the VM layer (e.g. required
    by NFS), and other heuristic features.

    Note the lack of any reference to the buffer cache here. The
    filesystem is responsible for manipulation of the buffer cache if
    it wants to use the buffer cache. I've left the uio in for the
    moment since it's the most generic way of passing a buffer.

struct fileops {
	int	(*fo_read) (fp, uio, flags, p);		/* cachable */
	int	(*fo_readnc) (fp, uio, flags, p);	/* uncached */
	int	(*fo_write) (fp, uio, flags, p);	/* cachable */
	int	(*fo_writenc) (fp, uio, flags, p);	/* uncached */
	int	(*fo_ioctl) (fp, com, data, p);
	int	(*fo_poll) (fp, events, p);
	int	(*fo_kqfilter) (fp, knote);
	int	(*fo_stat) (fp, stat, p);
	int	(*fo_close) (fp, p);
	int	(*fo_mmap) (fp, mmap_args);
	int	(*fo_dump) ( ? )
	... others ...
} *f_ops;

					-Matt
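[To make the cached/uncached split concrete, here is one way the
generic VM-object read vector could look. This is purely a sketch of
the idea in the proposal: vm_object_page_lookup() and
uiomove_frompage() are invented placeholder helpers, not existing
kernel functions.]

	/* fo_read for any mmapable file: satisfy what we can from the
	 * VM object's resident pages, and fall back to the backend's
	 * uncached fo_readnc() on a miss. */
	static int
	generic_fo_read(struct file *fp, struct uio *uio, int flags,
	    struct proc *p)
	{
		vm_page_t m;

		if (fp->f_object == NULL)	/* not cacheable */
			return (fp->f_ops->fo_readnc(fp, uio, flags, p));

		while (uio->uio_resid > 0) {
			m = vm_object_page_lookup(fp->f_object,
			    OFF_TO_IDX(uio->uio_offset));
			if (m == NULL)		/* miss: backend fills */
				return (fp->f_ops->fo_readnc(fp, uio,
				    flags, p));
			uiomove_frompage(m, uio); /* copy out cached page */
		}
		return (0);
	}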
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:

>Actually, all this talk does imply that VM objects should be
>independent of vnodes. Devices may need to mmap (requiring a VM
>object), but don't need all the baggage of a vnode. Julian is
>absolutely correct there.

Well, you have other VM Objects which don't map to vnodes: swap
backed anonymous objects for instance.

>We do need to guarantee locking order, which means that all I/O
>operations should be consistent. If a device or vnode is mmap()able,
>then all read, write, and truncation(/extension) ops *should* run
>through the VM object first:

We guarantee that today by mapping the actual hardware and by having
all reads/writes be synchronous. I remember at least one other UNIX
which didn't make that guarantee.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:
>
>:You can mmap() devices and you can mmap files..
>:
>:you cannot mmap FIFOs or sockets.
>:
>:for this reason I think that devices are still well represented by
>:vnodes. If we merged vnodes and vm objects,
>:then if devices were not vnodes, how would you represent
>:a vm area that maps a device?
>:
>:--
>: __--_|\ Julian Elischer
>
>I think the crux of the issue here is that most devices just don't
>need the baggage of a vnode, and many don't need the baggage of a VM
>object except possibly for mmap(). A fileops interface would be the
>cleanest way to implement a wide range of devices.
>
>Lets compare our various function dispatch structures. It's quite
>obvious to me that we can merge cdevsw and fileops and remove all
>vnode references from most of our devices. Ok, maybe not /dev/tty...
>but most of the rest surely! We would also want to have an optional
>vnode pointer in the fileops (like we do now) which 'enables' the
>additional VOP operations on that file descriptor (in this case the
>fileops for read, write, etc... would point to VOP wrappers like they
>do now), and of course we would need an opaque pointer for use by
>the fileops (devices would most likely load their cdev reference into
>it).

Right on.

I think your table is wrong for "REVOKE"; there is TTY magic in that.

The fact that we have aliased vnodes for devices, and for nothing
else, is one good reason for doing this. The fact that all devices
are handled by a magic filesystem (specfs), in the same "orphan" mode,
by all filesystems which support devices is another good reason.

I think I'll kick back tonight and try to see what it actually takes
to do it...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
In message <[EMAIL PROTECTED]>, Julian Elischer writes:

>If we merged vnodes and vm objects,
>then if devices were not vnodes, how would you represent
>a vm area that maps a device?

You would use a VM object of course, but it would be a special kind
of VM object, just like today...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
:Does this give you a cache coherence problem if the file system itself
:invokes data writes on files? Consider the UFS quota and extended
:attribute cases: here, the file system will invoke VOP_WRITE() on its
:vnodes to avoid understanding file system internals, so you can have such
:operations shared across file systems using UFS. If there is caching
:happening above VOP_WRITE(), will changes get propagated up the stack? Or
:does VOP_WRITE() change so that it talks to the memory object which then
:talks to VOP_REALLYWRITE()?

    There are a number of places where the kernel opens and then
    manipulates files with VOP calls. That's been a major eyesore,
    frankly. We would change those instances to open and manipulate
    files through struct file's (like it should have been done in the
    first place).

:Also, what implications does this have for security-oriented revocation?
:Memory mapping has always been a problem for revocation, but a number of
:interesting pieces of work have been done wherein access to a file is

    I don't think there are any implications. Rather than scanning for
    a vnode, we instead just scan for an opaque data pointer in the
    struct file. It might not be quite that trivial, but it wouldn't
    be difficult either. mmap is another matter, but certainly no more
    difficult than it would be with the current scheme.

:Also, however this is implemented, it would be nice to consider supporting
:stateful access to devices: i.e., dev_open() returns a state reference
:that is fed into future operations, so that pseudo-devices emulating
:multi-instance devices from other platforms can operate correctly. In my

    I was thinking more like allocating a struct file, filling it in
    with defaults, then passing it to dev_open() which would override
    the defaults as necessary. In other words, the open function
    manipulates the struct file and is otherwise completely opaque to
    the caller.

:use), or we need a more general state management technique. In any case,
:one thing this means is that if operations are pushed through a virtual
:memory object, different "instances" must have different objects...

    If the fileops must handle mmap, then the VM object would be
    directly associated with the fileops. If a file has an associated
    vnode there might also be a VM object reference in the vnode
    (assuming we don't merge them), but it would be opaque to the rest
    of the system.

					-Matt
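[A sketch of the open path Matt describes in words. Every name below
that is not plain struct-file bookkeeping, including file_alloc(),
devfileops, and the d_fileopen entry point, is invented for
illustration; this is not the 4.x code path.]

	/* Allocate a struct file pre-filled with generic device
	 * defaults, then let the driver's open routine override
	 * whatever it wants (f_ops, f_data, f_object, ...). */
	static int
	dev_file_open(dev_t dev, int flags, struct ucred *cred,
	    struct proc *p, struct file **fpp)
	{
		struct file *fp;
		int error;

		fp = file_alloc(cred);		/* hypothetical allocator */
		fp->f_flag = flags;
		fp->f_ops = &devfileops;	/* default device vector */
		fp->f_data = (caddr_t)dev;	/* opaque to callers */

		/* A cloning driver may replace f_data with per-instance
		 * state, or swap in a private fileops vector. */
		error = devsw(dev)->d_fileopen(dev, flags, p, fp);
		if (error) {
			file_free(fp);		/* hypothetical */
			return (error);
		}
		*fpp = fp;
		return (0);
	}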
Re: vm balance
On Wed, 18 Apr 2001, Matt Dillon wrote:

> If a device or file can be mmap()'d, then the VM Object acts as the
> cache layer for the object. We would in fact be able to remove nearly
> *ALL* the caching crap from *ALL* the filesystem code. Filesystem
> code would be responsible for low level I/O operations and meta ops
> (VOPs) only and not be responsible for any caching of file data. The
> filesystem would still potentially be responsible for caching things
> like bitmaps and such, but it could use a struct file for the backing
> device and get it for free (the backing device is mmapable and thus
> would have a VM Object layer, so you get the bitmap caching for free).

Does this give you a cache coherence problem if the file system itself
invokes data writes on files? Consider the UFS quota and extended
attribute cases: here, the file system will invoke VOP_WRITE() on its
vnodes to avoid understanding file system internals, so you can have
such operations shared across file systems using UFS. If there is
caching happening above VOP_WRITE(), will changes get propagated up
the stack? Or does VOP_WRITE() change so that it talks to the memory
object, which then talks to VOP_REALLYWRITE()?

Also, what implications does this have for security-oriented
revocation? Memory mapping has always been a problem for revocation,
but a number of interesting pieces of work have been done wherein
access to a file is revoked, resulting in EPERM being returned from
future reads. In fact, I believe Secure Computing even contracted with
BSDI to have support for some sort of virtual memory revocation
service written -- in MAC environments, a label change on a file can
result in future operations failing. Many third-party security
extensions on various platforms implement some sort of revocation
service -- while it hasn't been part of the base OS in many cases,
this is still a relevant audience.

Also, however this is implemented, it would be nice to consider
supporting stateful access to devices: i.e., dev_open() returns a
state reference that is fed into future operations, so that
pseudo-devices emulating multi-instance devices from other platforms
can operate correctly. In my mind, for this to work with file
descriptor passing, either the open file record needs to hold the
state, and be passed into operations (this is what Linux does -- all
file system operations accept an open file entry pointer, allowing
vmmon, for example, to determine which session is in use), or we need
a more general state management technique. In any case, one thing this
means is that if operations are pushed through a virtual memory
object, different "instances" must have different objects...

I may be off-base on some points here based on a lack of expertise on
the device and vm sides, but my feeling is that there are a lot of
implications to this type of change, and we want to be careful not to
preclude a number of potential future development directions,
especially when it comes to security work and emulation.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
:
:Great. Then we have aliased file pointers...
:that's not a great improvement..
:
:You'd still have to have 'per instance' storage somewhere,
:so that the opened devices could have different permissions, and still
:have them point to common data. so you still need
:aliases, except now it's not a vnode being aliased but some
:other structure.

    VNodes should never have been aliased in the first place, IMHO. We
    have to deal with certain special cases, like mmap'ing /dev/zero,
    but that is a minor issue I think.

    Actually, all this talk does imply that VM objects should be
    independent of vnodes. Devices may need to mmap (requiring a VM
    object), but don't need all the baggage of a vnode. Julian is
    absolutely correct there.

    We do need to guarantee locking order, which means that all I/O
    operations should be consistent. If a device or vnode is
    mmap()able, then all read, write, and truncation(/extension) ops
    *should* run through the VM object first:

	read/write/truncate fileops -> [VM object] -> device
	read/write/truncate fileops -> [VM object] -> vnode

    Relative to Poul's last message, this would require not only
    adding MMAP to the fileops, but also adding FTRUNCATE to the
    fileops. Not a big deal!

    If a device or file is not mmap()able, then the VM object would
    not exist. You wouldn't get any caching, either, in that case,
    unless the device implemented the caching natively.

    If a device or file can be mmap()'d, then the VM Object acts as
    the cache layer for the object. We would in fact be able to remove
    nearly *ALL* the caching crap from *ALL* the filesystem code.
    Filesystem code would be responsible for low level I/O operations
    and meta ops (VOPs) only, and not be responsible for any caching
    of file data. The filesystem would still potentially be
    responsible for caching things like bitmaps and such, but it could
    use a struct file for the backing device and get it for free (the
    backing device is mmapable and thus would have a VM Object layer,
    so you get the bitmap caching for free).

					-Matt
Re: vm balance
On Wed, 18 Apr 2001, Julian Elischer wrote:

> Poul-Henning Kamp wrote:
> >
> > In message <[EMAIL PROTECTED]>, Matt Dillon writes:
> > >If this will get rid of or clean up the specfs garbage, then I'm all
> > >for it. I would love to see a 'clean' fileops based device interface.
> >
> > specfs, aliased vnodes, you name it...
> >
> > I think the aliased vnodes is the single strongest argument of them
> > all for doing this...
>
> Great. Then we have aliased file pointers... that's not a great
> improvement..
>
> You'd still have to have 'per instance' storage somewhere, so that the
> opened devices could have different permissions, and still have them
> point to common data. so you still need aliases, except now it's not a
> vnode being aliased but some other structure.

As I just stated in a private e-mail to Matt, I'm not opposed to the
idea of promoting devices to a first-class object (i.e., equivalent to
vnodes, rather than below vnodes) in FreeBSD; I just want to approach
this very cautiously, as there's a lot of obscure behavior in this
area, and a lot of portability concerns regarding the obscure
behavior. In particular, the "special case" of ttys is a very
important special case -- operations such as revoke() must continue to
work. With device operations currently being pushed through VFS, VFS
becomes a possible mediation point for those operations, allowing "VFS
magic" to be used on devices. If we remove VFS from the call stack, we
lose that capability. Poul-Henning has successfully argued that this
has a number of good implications, but we need to make sure that the
functionality lost there doesn't outweigh the good bits.

One way to look at this disagreement might be the following: some
people feel that devices are simply evil, and not files, and shouldn't
try to act like them. Others feel that, modulo ioctl, we really can
make devices look like files, and should do that. The observation I
tried to make in an earlier e-mail was that it might be possible to
accept the world-view that "devices aren't files" by mapping some
devices into a better abstraction, such as the socket "data stream"
concept, while still making use of current abstractions.

For example, using read() on /dev/audit sucks, since what comes out of
/dev/audit is a set of discrete records. I'd rather use recv(), which
has far superior semantics, since this is a record-oriented data
stream. The same goes for kernel log messages, which on a discrete
message-oriented stream could essentially become standard syslog
messages, rather than treating it as a text buffer with character
pointers. This would allow wrap-around to be handled much more
cleanly, by simply dropping records off one end of the record chain,
rather than severing lines and ending up with the current /dev/console
abomination (send too much to /dev/console -- i.e., single user mode,
and dmesg becomes useless).

I won't claim that moving to the slightly more abstracted viewpoint I
proposed earlier is the way to go, just that it's worth keeping in
mind. Maybe we should just throw up our hands and say "devices are
devices, screw files" -- this decision was made with NFS, and
dramatically simplifies the problem space.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
Poul-Henning Kamp wrote:
>
> In message <[EMAIL PROTECTED]>, Matt Dillon writes:
> >If this will get rid of or clean up the specfs garbage, then I'm all
> >for it. I would love to see a 'clean' fileops based device interface.
>
> specfs, aliased vnodes, you name it...
>
> I think the aliased vnodes is the single strongest argument of them
> all for doing this...

Great. Then we have aliased file pointers... that's not a great
improvement..

You'd still have to have 'per instance' storage somewhere, so that the
opened devices could have different permissions, and still have them
point to common data. so you still need aliases, except now it's not a
vnode being aliased but some other structure.

> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> [EMAIL PROTECTED]       | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
:You can mmap() devices and you can mmap files..
:
:you cannot mmap FIFOs or sockets.
:
:for this reason I think that devices are still well represented by
:vnodes. If we merged vnodes and vm objects,
:then if devices were not vnodes, how would you represent
:a vm area that maps a device?
:
:--
: __--_|\ Julian Elischer

    I think the crux of the issue here is that most devices just don't
    need the baggage of a vnode, and many don't need the baggage of a
    VM object except possibly for mmap(). A fileops interface would be
    the cleanest way to implement a wide range of devices.

    Lets compare our various function dispatch structures. It's quite
    obvious to me that we can merge cdevsw and fileops and remove all
    vnode references from most of our devices. Ok, maybe not
    /dev/tty... but most of the rest surely! We would also want to
    have an optional vnode pointer in the fileops (like we do now)
    which 'enables' the additional VOP operations on that file
    descriptor (in this case the fileops for read, write, etc... would
    point to VOP wrappers like they do now), and of course we would
    need an opaque pointer for use by the fileops (devices would most
    likely load their cdev reference into it).

                      cdevsw   fileops  vfsops   VOPs

    OPEN                X        -        -        X
    CLOSE               X        X        -        X
    READ                X        X        -        X
    WRITE               X        X        -        X
    IOCTL               X        X        -        -
    POLL                X        X        -        X
    MMAP                X        -        -        X
    STRATEGY            X        -        -        X
    DUMP                X        -        -        -
    KQFILTER            -        X        -        X
    STAT                -        X        -        -
    NAME                X        -        -        -
    MAJ                 X        -        -        -
    PSIZE               X        -        -        -
    FLAGS               X        -        -        -
    BMAJ                X        -        -        -
    ADVLOCK             -        -        -        X
    BWRITE              -        -        -        X
    FSYNC               -        -        -        X
    ISLOCKED            -        -        -        X
    LEASE               -        -        -        X
    LOCK                -        -        -        X
    PATHCONF            -        -        -        X
    READLINK            -        -        -        X
    REALLOCBLKS         -        -        -        X
    REVOKE              -        -        -        X
    UNLOCK              -        -        -        X
    BMAP                -        -        -        X
    PRINT               -        -        -        X
    BALLOC              -        -        -        X
    GETPAGES            -        -        -        X
    PUTPAGES            -        -        -        X
    FREEBLKS            -        -        -        X
    GETACL              -        -        -        X
    SETACL              -        -        -        X
    ACLCHECK            -        -        -        X
    GETEXTATTR          -        -        -        X
    SETEXTATTR          -        -        -        X
    LOOKUP              -        -        -        X
    CACHEDLOOKUP        -        -        -        X
    CREATE              -        -        -        X
    WHITEOUT            -        -        -        X
    MKNOD               -        -        -        X
    ACCESS              -        -        -        X
    GETATTR             -        -        -        X
    SETATTR             -        -        -        X
    REMOVE              -        -        -        X
    LINK                -        -        -        X
    RENAME              -        -        -        X
    MKDIR               -        -        -        X
    RMDIR               -        -        -        X
    SYMLINK             -        -        -        X
    READDIR             -        -        -        X
    INACTIVE            -        -        -        X
    RECLAIM             -        -        -        X
    MOUNT               -        -        X        -
    START               -        -        X        -
    UNMOUNT             -        -        X        -
    ROOT                -        -        X        -
    QUOTACTL            -        -        X        -
    STATFS              -        -        X        -
    SYNC                -        -        X        -
    VGET                -        -        X        -
    FHTOVP              -        -        X        -
    CHECKEXP            -        -        X        -
    VPTOFH              -        -        X        -
    INIT                -        -        X        -
    UNINIT              -        -        X        -
    EXTATTRCTL          -        -        X        -

					-Matt
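[A speculative sketch of what the merged dispatch vector suggested by
the table above might look like. The struct name, member names, and
signatures are all invented; it simply folds the cdevsw and fileops
columns into one struct, with the static cdevsw attributes (NAME, MAJ,
PSIZE, FLAGS, BMAJ) carried as plain fields.]

	/* Hypothetical merged device/file dispatch vector. */
	struct devfileops {
		/* methods (cdevsw + fileops rows) */
		int	(*fo_open)(struct file *fp, int flags, struct proc *p);
		int	(*fo_close)(struct file *fp, struct proc *p);
		int	(*fo_read)(struct file *fp, struct uio *uio,
			    int flags, struct proc *p);
		int	(*fo_write)(struct file *fp, struct uio *uio,
			    int flags, struct proc *p);
		int	(*fo_ioctl)(struct file *fp, u_long com,
			    caddr_t data, struct proc *p);
		int	(*fo_poll)(struct file *fp, int events,
			    struct proc *p);
		int	(*fo_mmap)(struct file *fp, vm_object_t *objp);
		void	(*fo_strategy)(struct buf *bp);	/* block devices */
		int	(*fo_dump)(dev_t dev);		/* crash dumps */
		int	(*fo_kqfilter)(struct file *fp, struct knote *kn);
		int	(*fo_stat)(struct file *fp, struct stat *sb,
			    struct proc *p);
		/* static attributes (NAME/MAJ/PSIZE/FLAGS/BMAJ rows) */
		const char *fo_name;
		int	fo_maj, fo_bmaj, fo_flags;
		int	(*fo_psize)(dev_t dev);
	};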
Re: vm balance
Robert Watson wrote:
>
> On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
>
> As I indicated in my follow-up mail, the statement about seeking was
> incorrect, that is a property of the open file structure; I believe
> the remainder still holds true. When was the last time you tried
> mmap'ing or seeking on the socket? A socket represents a buffered
> data stream which does not allow arbitrary read/write operations at
> arbitrary offsets.

Actually there have been times when I did want to mmap a datastream..
I think a datastream mapped into a user buffer-space is one of the
possible 0-copy methods people sometimes mention.

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
You can mmap() devices and you can mmap files..

you cannot mmap FIFOs or sockets.

for this reason I think that devices are still well represented by
vnodes. If we merged vnodes and vm objects,
then if devices were not vnodes, how would you represent
a vm area that maps a device?

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:

>If this will get rid of or clean up the specfs garbage, then I'm all
>for it. I would love to see a 'clean' fileops based device interface.

specfs, aliased vnodes, you name it...

I think the aliased vnodes is the single strongest argument of them
all for doing this...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
    If this will get rid of or clean up the specfs garbage, then I'm
    all for it. I would love to see a 'clean' fileops based device
    interface.

					-Matt

:I have not examined the full details of doing the shift yet, but it is
:my impression that it actually will reduce the amount of code
:duplication and special casing.
:
:Basically we will need a new:
:
:	struct fileops devfileops = {
:		dev_read,
:		dev_write,
:		dev_ioctl,
:		dev_poll,
:		dev_kqfilter,
:		dev_stat,
:		dev_close
:	};
:
:The only places we will need new magic is
:	open, which needs to fix the plumbing for us.
:	mmap, which may have to be added to the fileops vector.
:
:The amount of special-casing code this would remove from the vnode
:layer is rather astonishing.
:
:If we merge vm-objects and vnodes without taking devices out of the
:mix, we will need even more special-case code for devices.
:
:>The vnode is our abstraction for objects that have
:>address spaces, can be opened/closed and retain a seeking position, can be
:>mapped, have protections, etc, etc.
:
:This is simply not correct Robert, UNIX::sockets also have many of
:those properties, but they're not vnodes...
:
:>Besides which,
:>the kernel knows how to act on vnodes, and there is plenty of precedent
:>for the kernel opening vnodes and keeping around references for its own
:>ends, but there isn't all that much precedent for the kernel doing this
:>using file descriptors :-).
:
:Have you actually examined how FIFO and Sockets work Robert ? :-)
:
:--
:Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
:[EMAIL PROTECTED]       | TCP/IP since RFC 956
:FreeBSD committer       | BSD since 4.3-tahoe
:Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> I have not examined the full details of doing the shift yet, but it is
> my impression that it actually will reduce the amount of code
> duplication and special casing.
..
> The only places we will need new magic is
> 	open, which needs to fix the plumbing for us.
> 	mmap, which may have to be added to the fileops vector.
>
> The amount of special-casing code this would remove from the vnode
> layer is rather astonishing.
>
> If we merge vm-objects and vnodes without taking devices out of the
> mix, we will need even more special-case code for devices.

Let me expand a bit on what I want to object to, and then comment a
bit on what I have mixed feelings about but am not actively objecting
to.

I believe it is necessary to retain a reference to the vnode used to
access the device in f_data, and an f_type of DTYPE_VNODE. This is
used with tty's extensively, where it is desirable to open /dev/ttyfoo
and then perform file system operations on it, such as fchflags(),
fchmod(), fchown(), revoke(), et al, and this relies on reaching the
vnode via the open file entry associated with the file descriptor
designated by the invoking process. This behavior is needed for a
variety of race-free operations at login, et al. Changing this would
require *extensive* modification to the syscall service layer (that
is, what sits above VFS). Assuming the modifications were made so that
the fileops array provided these services (making the struct file be
the entire abstraction, hiding VFS from the system call service
layer), you've now completely rewritten the large majority of system
calls, as well as introduced a whole new category of inter-abstraction
synchronization that must occur when a change is made to any
abstraction (i.e., adding ACLs, MAC, ...). So it seems to me that
access to the vnode must be maintained in struct file, and that we
cannot totally replace references to the vnode with references to,
for example, the device abstraction.

So with these assumptions in place, it's still possible to consider
what you were suggesting: replacing the vnode fileops array with a
device fileops array, so that these calls would be shortcut directly
to the device abstraction rather than passing through the VFS
abstractions on the way. In some ways, this makes sense: many of the
device services map poorly into the file-like abstraction of the
vnode. For example, devices may have a notion of a stateful seeking
position: tape drives, for example, really *do* seek to a particular
location where the next read or write must be performed. Similarly,
some devices really do act like streaming data sources or sinks:
especially with regard to pseudo-devices, they may behave much more
like sockets, with a notion of a discrete transmission unit, a maximum
transmission unit, or addressability (imagine if you could open a
device representing a bus, and use socket addressing calls to set the
bus address being targeted -- say, for a /dev/usb0, you could say
"address the following messages to USB address 4", or being able to
open /dev/ed0, set the target address of the device instance to an
ethernet address, and send).

We already have this problem to some extent with sockets: we use the
file system vnode for two purposes: first, as a namespace in which to
identify the IPC object, and second, as a means for storing protection
properties. It's arguable that devices might work that way also, which
I think is what you're asserting.
I'm not strictly opposed to this viewpoint, but it begins to make me
wonder a bit about the current structuring of that whole section of
the kernel: to me, a vnode really does seem like a decent abstraction
of the file system concept. The socket seems like a less decent
abstraction of the IPC concept, but a better abstraction of a
send/receive stream. This is all complicated by long-standing
interfaces and notions about how the abstractions are to be used. I
guess I'd rather see it look something like this:

              +-------------------------+
              |     file descriptor     |
              +------------+------------+
                           |
              +------------+------------+
              | kernel object reference |
              +------------+------------+
                           |
               +-----------+-----------+
               |           |           |
             vfile      kqueue      vstream
                                       |
                 +----------+----------+--------------+
                 |          |          |              |
             IPC Socket    FIFO      Pipe       Stream Device

(note the above, and below, are highly fictional)

Where "kernel object reference" is the equivalent of today's "struct
file", "vfile" is t
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> >The vnode is our abstraction for objects that have
> >address spaces, can be opened/closed and retain a seeking position, can be
> >mapped, have protections, etc, etc.
>
> This is simply not correct Robert, UNIX::sockets also have many of those
> properties, but they're not vnodes...

As I indicated in my follow-up mail, the statement about seeking was
incorrect; that is a property of the open file structure. I believe
the remainder still holds true. When was the last time you tried
mmap'ing or seeking on the socket? A socket represents a buffered data
stream which does not allow arbitrary read/write operations at
arbitrary offsets.

I guess what I'd really like to see is this: for devices that provide
an address space service (such as disks), vnodes would be used. For
devices that represent streams (such as many pseudo-devices and ttys),
they would be represented by a slightly improved socket abstraction.
The socket is a somewhat poor abstraction for this right now; perhaps
a vstream would be a better concept.

> >Besides which,
> >the kernel knows how to act on vnodes, and there is plenty of precedent
> >for the kernel opening vnodes and keeping around references for its own
> >ends, but there isn't all that much precedent for the kernel doing this
> >using file descriptors :-).
>
> Have you actually examined how FIFO and Sockets work Robert ? :-)

What I'm referring to is the fact that the kernel frequently keeps
open vnodes for use internally in various sorts of operations, such as
quotas, accounting, core dumps, etc.

BTW, part of the problem here may be a terminology problem: for me, a
file descriptor refers to the per-process reference in the file
descriptor table. What you appear to refer to here is the open file
entry, struct file, which stores the operation array, seeking
location, cached credential, etc.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
In message <[EMAIL PROTECTED]>, Robert Watson writes:
>
>On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
>
>> In message <[EMAIL PROTECTED]>, Kirk McKusick writes:
>>
>> >Every vnode in the system has an associated object.
>>
>> No: device vnodes dont...
>>
>> I think the correct solution to that is to move devices away from vnodes
>> and into the fdesc layer, just like fifo's and sockets.
>
>I dislike that idea for a number of reasons, not least of which is that
>introducing more and more file-descriptor level objects increases the
>complexity of the system call service implementation, and duplicates
>code. If we're going to pretend that everything in the system is a
>file, and most people seem willing to accept that, acting on devices
>through vnodes seems like a reasonable choice.

I have not examined the full details of doing the shift yet, but it is
my impression that it actually will reduce the amount of code
duplication and special casing.

Basically we will need a new:

	struct fileops devfileops = {
		dev_read,
		dev_write,
		dev_ioctl,
		dev_poll,
		dev_kqfilter,
		dev_stat,
		dev_close
	};

The only places we will need new magic is
	open, which needs to fix the plumbing for us.
	mmap, which may have to be added to the fileops vector.

The amount of special-casing code this would remove from the vnode
layer is rather astonishing.

If we merge vm-objects and vnodes without taking devices out of the
mix, we will need even more special-case code for devices.

>The vnode is our abstraction for objects that have
>address spaces, can be opened/closed and retain a seeking position, can be
>mapped, have protections, etc, etc.

This is simply not correct Robert, UNIX::sockets also have many of
those properties, but they're not vnodes...

>Besides which,
>the kernel knows how to act on vnodes, and there is plenty of precedent
>for the kernel opening vnodes and keeping around references for its own
>ends, but there isn't all that much precedent for the kernel doing this
>using file descriptors :-).

Have you actually examined how FIFO and Sockets work Robert ? :-)

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
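[For illustration, one entry in the devfileops table above might look
like the following. This is a guess at the wiring, not code from the
tree: it borrows the 4.x fileops read signature and the 4.x
devsw()/d_read driver entry point, and assumes the open path stashed
the dev_t in f_data.]

	/* Hypothetical dev_read: bypass specfs/vnodes and forward the
	 * request straight to the driver through the cdevsw. */
	static int
	dev_read(struct file *fp, struct uio *uio, struct ucred *cred,
	    int flags, struct proc *p)
	{
		dev_t dev = (dev_t)fp->f_data;	/* stashed at open time */
		int ioflag = 0;

		if (fp->f_flag & FNONBLOCK)
			ioflag |= IO_NDELAY;
		return (devsw(dev)->d_read(dev, uio, ioflag));
	}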
Re: vm balance
On Wed, 18 Apr 2001, Robert Watson wrote:

> On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
>
> address spaces, can be opened/closed and retain a seeking position, can be

This is what I get for sending messages in the morning after staying
up late -- needless to say, you can ignore the "retain a seeking
position" statement: vnodes generally don't operate with a notion of
"position"; that occurs at the struct file level. It's arguable, if
you had stateful vnodes, that you might want to push the seek
operation down from the open file layer, as devices might want to
implement the seeking service themselves.

In any case, this is not a problem that moving the device operations
into the struct file array will fix -- in fact, it's arguable that for
devices wanting to offer services to different consumers on the same
instance (such as /dev/vmmon), you want the vnode reference counting
notion of open/close plus the sprinkled-state vnode design we've
discussed before, which would allow VFS and the struct file layer to
do the state management binding state to consumers, rather than
teaching the device layer how to do that.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> In message <[EMAIL PROTECTED]>, Kirk McKusick writes:
>
> >Every vnode in the system has an associated object.
>
> No: device vnodes dont...
>
> I think the correct solution to that is to move devices away from vnodes
> and into the fdesc layer, just like fifo's and sockets.

I dislike that idea for a number of reasons, not least of which is
that introducing more and more file-descriptor level objects increases
the complexity of the system call service implementation, and
duplicates code. If we're going to pretend that everything in the
system is a file, and most people seem willing to accept that, acting
on devices through vnodes seems like a reasonable choice.

The vnode provides us with a notion of open/close, reference counting,
access to a generic vnode pager for memory mapping of objects without
specific memory mapping characteristics, and so on. Right now, the
mapping from vnodes into devices is a bit poor due to some odd
reference / open / close behavior, and due to a lack of a notion of
stateful access to vnodes (there have been a number of proposals to
remedy this, however).

The vnode is our abstraction for objects that have address spaces, can
be opened/closed and retain a seeking position, can be mapped, have
protections, etc, etc. It may not be a perfect representation of a
device, but it does a reasonable job. Besides which, the kernel knows
how to act on vnodes, and there is plenty of precedent for the kernel
opening vnodes and keeping around references for its own ends, but
there isn't all that much precedent for the kernel doing this using
file descriptors :-).

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
RE: vm balance
Dear Matt,

> :
> :Well, if that's the case, yank all uses of v_id from the nfs code,
> :I'll do the namecache and vnodes can be deleted to the joy of our
> :users...
> :
>
>     If you can yank v_id out from the kern/vfs_cache code, I will
>     make similar fixes to the NFS code. I am not particularly
>     interested in returning vnodes to the MALLOC pool myself, but I
>     am interested in fixing the two bugs I noticed when I ran over
>     the code earlier today.
>
>     Actually one bug. The vput() turns out to be correct, I just
>     looked at the code again. However, the cache_lookup() call in
>     nfs_vnops.c is broken. Assuming no other fixes, the vpid load
>     needs to occur before the VOP_ACCESS call rather than after.

I'm just curious: would this be the "redundant call/non-optimal
performance"-type bug or the "panics or trashes the system in dark and
mysterious ways"-type bug? If it is the latter, do you think it may be
an opportunity for you to close some NFS-related PR's?

Kees Jan

You are only young once, but you can stay immature all your life.
Re: vm balance
In message <[EMAIL PROTECTED]>, Kirk McKusick writes: >Every vnode in the system has an associated object. No: device vnodes don't... I think the correct solution to that is to move devices away from vnodes and into the fdesc layer, just like fifos and sockets. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
Date: Tue, 17 Apr 2001 09:49:54 -0400 (EDT) From: Robert Watson <[EMAIL PROTECTED]> To: Kirk McKusick <[EMAIL PROTECTED]> cc: Julian Elischer <[EMAIL PROTECTED]>, Rik van Riel <[EMAIL PROTECTED]>, [EMAIL PROTECTED], Matt Dillon <[EMAIL PROTECTED]>, David Xu <[EMAIL PROTECTED]> Subject: Re: vm balance On Mon, 16 Apr 2001, Kirk McKusick wrote: > I am still of the opinion that merging VM objects and vnodes would be a > good idea. Although it would touch a huge number of lines of code, when > the dust settled, it would simplify some nasty bits of the system. This > merger is really independent of making the number of vnodes dynamic. > Under the old name cache implementation, decreasing the number of vnodes > was slow and hard. With the current name cache implementation, > decreasing the number of vnodes would be easy. I concur that adding a > dynamically sized vnode cache would help performance on some workloads. I'm interested in this idea, although I profess a gaping blind spot in expertise in the area of the VM system. However, one of the aspects of our VFS that has always concerned me is that the use of a single vnode simplelock funnels most of the relevant (and performance-sensitive) calls. The result is that all accesses to an object represented by a vnode are serialized, which can represent a substantial performance hit for applications such as databases, where simultaneous writes would be advantageous, or for various vn-backed oddities (possibly including vnode-backed swap?). At some point, apparently an effort was made to mark up vnode_if.src with possible alternative locking using read/write locks, but given that all the consumers use exclusive locks right now, I assume that was not followed through on. A large part of the cost is mitigated through caching on the under-side of VFS, allowing vnode operations to return rapidly, but while this catches a number of common cases (where the file is already in the cache), there are sufficient non-common cases that I would anticipate this being a problem. Are there any performance figures available that either confirm this concern, or demonstrate that in fact it is not relevant? :-) Would this concern introduce additional funneling in the VM system, or is the granularity of locks in the VM sufficiently low that it might improve performance by combining existing broad locks? Robert N M Watson FreeBSD Core Team, TrustedBSD Project [EMAIL PROTECTED] NAI Labs, Safeport Network Services Every vnode in the system has an associated object. Every object backed by a file (i.e., everything but anonymous objects) has an associated vnode. So, the performance of one is pretty tied to the performance of the other. Matt is right that the VM does locking on a page level, but then has to get a lock on the associated vnode to do a read or a write, so really is pretty tied to the vnode lock performance. Merging the two data structures is not likely to change the performance characteristics of the system for either better or worse. But it will save a lot of headaches having to do with lock ordering that we have to deal with at the moment. Kirk McKusick To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>reference to me. I'm not even sure why they bother to check v_id. :>The vp reference from an nfsnode is a hard reference. :> : :Well, if that's the case, yank all uses of v_id from the nfs code, :I'll do the namecache and vnodes can be deleted to the joy of our users... : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :[EMAIL PROTECTED] | TCP/IP since RFC 956 If you can yank v_id out from the kern/vfs_cache code, I will make similar fixes to the NFS code. I am not particularly interested in returning vnodes to the MALLOC pool myself, but I am interested in fixing the two bugs I noticed when I ran over the code earlier today. Actually one bug. The vput() turns out to be correct, I just looked at the code again. However, the cache_lookup() call in nfs_vnops.c is broken. Assuming no other fixes, the vpid load needs to occur before the VOP_ACCESS call rather than after. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >: >:In message <[EMAIL PROTECTED]>, Matt Dillon writes: >:> >:>:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >:>: >:>:>I thought vnodes were in stable storage? >:>: >:>:They are, that's the point Matt is not seeing yet. >:> >:>I know vnodes are in stable storage. I'm just saying that NFS >:>is the least of your worries in trying to change that. >: >:The namecache can do without the use of soft references. >: >:The only reason vnodes are stable storage any more is that NFS >:uses soft references to vnodes. > >The only place I see soft references on vnodes is in the NFS >lookup code which duplicates the VFS lookup code (except it gets it wrong). >If you are referring to the nqlease code... that looks like a hard >reference to me. I'm not even sure why they bother to check v_id. >The vp reference from an nfsnode is a hard reference. > Well, if that's the case, yank all uses of v_id from the nfs code, I'll do the namecache and vnodes can be deleted to the joy of our users... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
: :In message <[EMAIL PROTECTED]>, Matt Dillon writes: :> :>:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: :>: :>:>I thought vnodes were in stable storage? :>: :>:They are, that's the point Matt is not seeing yet. :> :>I know vnodes are in stable storage. I'm just saying that NFS :>is the least of your worries in trying to change that. : :The namecache can do without the use of soft references. : :The only reason vnodes are stable storage any more is that NFS :uses soft references to vnodes. The only place I see soft references on vnodes is in the NFS lookup code which duplicates the VFS lookup code (except it gets it wrong). If you are referring to the nqlease code... that looks like a hard reference to me. I'm not even sure why they bother to check v_id. The vp reference from an nfsnode is a hard reference. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >: >:>I thought vnodes were in stable storage? >: >:They are, that's the point Matt is not seeing yet. > >I know vnodes are in stable storage. I'm just saying that NFS >is the least of your worries in trying to change that. The namecache can do without the use of soft references. The only reason vnodes are stable storage any more is that NFS uses soft references to vnodes. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>Note that I really don't care for using stable storage as a hack :>to deal with this sort of thing. : :Well, I have to admit that it is a pretty smart way of dealing with :it for remote operations, but the trouble is that it prevents us from :ever lowering their number again. : :If Matt can devise a smart way to lose the soft reference in nfs, :vnodes can be a truly dynamic thing. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 NFS uses vnodes the same way that VFS uses vnodes. If you solve the problem for general VFS operation (namely *cache_lookup), you solve the problem for NFS as well. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: : :>I thought vnodes were in stable storage? : :They are, that's the point Matt is not seeing yet. I know vnodes are in stable storage. I'm just saying that NFS is the least of your worries in trying to change that. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >I thought vnodes were in stable storage? They are, that's the point Matt is not seeing yet. >Note that I really don't care for using stable storage as a hack >to deal with this sort of thing. Well, I have to admit that it is a pretty smart way of dealing with it for remote operations, but the trouble is that it prevents us from ever lowering their number again. If Matt can devise a smart way to lose the soft reference in nfs, vnodes can be a truly dynamic thing. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
* Poul-Henning Kamp <[EMAIL PROTECTED]> [010417 10:56] wrote: > In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >:>I don't think NFS relies on vnodes never being freed. > >: > >:It does, in some cases NFS stashes a vnode pointer and the v_id > >:value away, and some time later uses that pair to try to > >:refind the vnode again. If you free vnodes, it will still think > >:the pointer is a vnode and if junk happens to be right it will > >:think it is still a vnode. QED: Bad things (TM) will happen. > >: > >:# cd /sys/nfs > >:# grep v_id * > >:nfs_nqlease.c: vpid = vp->v_id; > >:nfs_nqlease.c: if (vpid == vp->v_id) { > >:nfs_nqlease.c: if (vpid == vp->v_id && > >:nfs_vnops.c:vpid = newvp->v_id; > >:nfs_vnops.c:if (vpid == newvp->v_id) { > >: > >:-- > >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > > > > hahahahahahahaha.. Look at the code more closely. v_id is not > > managed by NFS, it's managed by vfs_cache.c. There's a big XXX > > comment just before cache_purge() that explains it. Believe me, > > NFS is the least of your worries here. > > Matt, you try to free vnodes back to the malloc pool and you will > see what happens OK ? I thought vnodes were in stable storage? Note that I really don't care for using stable storage as a hack to deal with this sort of thing. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Represent yourself, show up at BABUG http://www.babug.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
: :In message <[EMAIL PROTECTED]>, Matt Dillon writes: :>:>I don't think NFS relies on vnodes never being freed. :>: :>:It does, in some cases NFS stashes a vnode pointer and the v_id :>:value away, and some time later uses that pair to try to :>:refind the vnode again. If you free vnodes, it will still think :>:the pointer is a vnode and if junk happens to be right it will :>:think it is still a vnode. QED: Bad things (TM) will happen. :>: :>:# cd /sys/nfs :>:# grep v_id * :>:nfs_nqlease.c: vpid = vp->v_id; :>:nfs_nqlease.c: if (vpid == vp->v_id) { :>:nfs_nqlease.c: if (vpid == vp->v_id && :>:nfs_vnops.c:vpid = newvp->v_id; :>:nfs_vnops.c:if (vpid == newvp->v_id) { :>: :>:-- :>:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :> :> hahahahahahahaha.. Look at the code more closely. v_id is not :> managed by NFS, it's managed by vfs_cache.c. There's a big XXX :> comment just before cache_purge() that explains it. Believe me, :> NFS is the least of your worries here. : :Matt, you try to free vnodes back to the malloc pool and you will :see what happens OK ? : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 ok ok... let's see. Oh, ok I see what it's doing. Actually I think you just found a bug.

    if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
            struct vattr vattr;
            int vpid;

            if ((error = VOP_ACCESS(dvp, VEXEC, cnp->cn_cred, p)) != 0) {
                    *vpp = NULLVP;
                    return (error);
            }
            newvp = *vpp;
            vpid = newvp->v_id;

This is totally bogus. VOP_ACCESS can block, so even using vpid above to check that the vnode hasn't been ripped out from under the code won't work. Also, take a look at the vput() later on, and also the vput() in kern/vfs_cache.c/vfs_cache_lookup() - that looks bogus to me too and would probably crash the machine. The easiest solution here is to make cache_lookup bump the ref count on the returned vnode and require that all users of cache_lookup vrele() it. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
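The reordering Matt settles on elsewhere in the thread ("the vpid load needs to occur before the VOP_ACCESS call rather than after") would look roughly like the sketch below. This is illustrative only, not the committed code, and the trailing miss label is invented here:

    if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
            int vpid;

            newvp = *vpp;
            vpid = newvp->v_id;     /* capture before anything can block */
            if ((error = VOP_ACCESS(dvp, VEXEC, cnp->cn_cred, p)) != 0) {
                    *vpp = NULLVP;
                    return (error);
            }
            /*
             * VOP_ACCESS may have slept; an unchanged v_id is the only
             * evidence that newvp was not recycled out from under us.
             */
            if (vpid != newvp->v_id)
                    goto miss;      /* fall through to a real lookup */
    }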
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: >:>I don't think NFS relies on vnodes never being freed. >: >:It does, in some cases NFS stashes a vnode pointer and the v_id >:value away, and some time later uses that pair to try to >:refind the vnode again. If you free vnodes, it will still think >:the pointer is a vnode and if junk happens to be right it will >:think it is still a vnode. QED: Bad things (TM) will happen. >: >:# cd /sys/nfs >:# grep v_id * >:nfs_nqlease.c: vpid = vp->v_id; >:nfs_nqlease.c: if (vpid == vp->v_id) { >:nfs_nqlease.c: if (vpid == vp->v_id && >:nfs_vnops.c:vpid = newvp->v_id; >:nfs_vnops.c:if (vpid == newvp->v_id) { >: >:-- >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > > hahahahahahahaha.. Look at the code more closely. v_id is not > managed by NFS, it's managed by vfs_cache.c. There's a big XXX > comment just before cache_purge() that explains it. Believe me, > NFS is the least of your worries here. Matt, you try to free vnodes back to the malloc pool and you will see what happens OK ? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>I don't think NFS relies on vnodes never being freed. : :It does, in some cases NFS stashes a vnode pointer and the v_id :value away, and some time later uses that pair to try to :refind the vnode again. If you free vnodes, it will still think :the pointer is a vnode and if junk happens to be right it will :think it is still a vnode. QED: Bad things (TM) will happen. : :# cd /sys/nfs :# grep v_id * :nfs_nqlease.c: vpid = vp->v_id; :nfs_nqlease.c: if (vpid == vp->v_id) { :nfs_nqlease.c: if (vpid == vp->v_id && :nfs_vnops.c:vpid = newvp->v_id; :nfs_vnops.c:if (vpid == newvp->v_id) { : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 hahahahahahahaha.. Look at the code more closely. v_id is not managed by NFS, it's managed by vfs_cache.c. There's a big XXX comment just before cache_purge() that explains it. Believe me, NFS is the least of your worries here. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >:When I first heard you say this I thought you were off your rocker, >:but gradually I have come to think that you may be right. >: >:I think the task will be easier if we get the vnode/buf relationship >:untangled a bit first. >: >:It may also pay off to take vnodes out of disk operations entirely before >:we try the merge. > >Yes, I agree. The vnode/VM-object issue is minor compared to >the vnode/buf/io issue. We're getting there, we're getting there... >:Actually the main problem is that NFS relies on vnodes never being >:freed to hold "soft references" using "struct vnode * + v_id". >: >:-- >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > >I don't think NFS relies on vnodes never being freed. It does, in some cases NFS stashes a vnode pointer and the v_id value away, and some time later uses that pair to try to refind the vnode again. If you free vnodes, it will still think the pointer is a vnode and if junk happens to be right it will think it is still a vnode. QED: Bad things (TM) will happen. # cd /sys/nfs # grep v_id * nfs_nqlease.c: vpid = vp->v_id; nfs_nqlease.c: if (vpid == vp->v_id) { nfs_nqlease.c: if (vpid == vp->v_id && nfs_vnops.c:vpid = newvp->v_id; nfs_vnops.c:if (vpid == newvp->v_id) { -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:When I first heard you say this I thought you were off your rocker, :but gradually I have come to think that you may be right. : :I think the task will be easier if we get the vnode/buf relationship :untangled a bit first. : :It may also pay off to take vnodes out of disk operations entirely before :we try the merge. Yes, I agree. The vnode/VM-object issue is minor compared to the vnode/buf/io issue. :>Under the old name cache implementation, decreasing :>the number of vnodes was slow and hard. With the current name cache :>implementation, decreasing the number of vnodes would be easy. : :Actually the main problem is that NFS relies on vnodes never being :freed to hold "soft references" using "struct vnode * + v_id". : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 I don't think NFS relies on vnodes never being freed. The worst that should happen is that NFS might need to do a LOOKUP. I haven't had a chance to look at the namei/vnode patch set yet but as long as a reasonable number of vnodes remain cached NFS shouldn't be affected too much. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:I'm interested in this idea, although I profess a gaping blind spot in :expertise in the area of the VM system. However, one of the aspects of :our VFS that has always concerned me is that the use of a single vnode :simplelock funnels most of the relevant (and performance-sensitive) calls. :The result is that all accesses to an object represented by a vnode are :serialized, which can represent a substantial performance hit for :applications such as databases, where simultaneous writes would be :advantageous, or for various vn-backed oddities (possibly including :vnode-backed swap?). : :At some point, apparently an effort was made to mark up vnode_if.src with :possible alternative locking using read/write locks, but given that all :... We only use simplelocks on vnodes for interlock operations. We use normal kern/kern_lock.c locks for vnode locking and use both shared and exclusive locks. You are absolutely correct about the serialization that can occur. A stalled write() will stall all other write()'s plus any read()'s. Stalled write()s are easy to come by. I did some work in this area to try to mitigate the problem. In 4.1/4.2 I added the bwillwrite() function. This function is called prior to obtaining the exclusive vnode lock and blocks the process if there aren't a sufficient number of filesystem buffers available to (likely) accommodate the operation. This (mostly) prevents the process from blocking in the buffer cache while holding an exclusive vnode lock and makes a big difference. :is already in the cache), there are sufficient non-common cases that I :would anticipate this being a problem. Are there any performance figures :available that either confirm this concern, or demonstrate that in fact it :is not relevant? :-) Would this concern introduce additional funneling in :the VM system, or is the granularity of locks in the VM sufficiently low :that it might improve performance by combining existing broad locks? : :Robert N M Watson FreeBSD Core Team, TrustedBSD Project :[EMAIL PROTECTED] NAI Labs, Safeport Network Services The VM system is in pretty good shape in regard to fine-grained locking (you get down to the VM page). The VFS system is in terrible shape - there is no fine grained locking at all for writes. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
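As a concrete picture of the ordering Matt describes, a 4.x-style write path looks roughly like the following; this is a sketch from memory of the vn_write() convention, not a quote from the tree, and the function name is invented here:

    static int
    vn_write_sketch(struct vnode *vp, struct uio *uio, struct ucred *cred,
        struct proc *p)
    {
            int error;

            /*
             * Throttle here, while no locks are held: if buffer space is
             * scarce, sleep now rather than inside VOP_WRITE(), where the
             * exclusive lock would stall every other reader and writer of
             * this vnode.
             */
            bwillwrite();

            vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
            error = VOP_WRITE(vp, uio, IO_UNIT, cred);
            VOP_UNLOCK(vp, 0, p);
            return (error);
    }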
Re: vm balance
On Mon, 16 Apr 2001, Kirk McKusick wrote: > I am still of the opinion that merging VM objects and vnodes would be a > good idea. Although it would touch a huge number of lines of code, when > the dust settled, it would simplify some nasty bits of the system. This > merger is really independent of making the number of vnodes dynamic. > Under the old name cache implementation, decreasing the number of vnodes > was slow and hard. With the current name cache implementation, > decreasing the number of vnodes would be easy. I concur that adding a > dynamically sized vnode cache would help performance on some workloads. I'm interested in this idea, although I profess a gaping blind spot in expertise in the area of the VM system. However, one of the aspects of our VFS that has always concerned me is that the use of a single vnode simplelock funnels most of the relevant (and performance-sensitive) calls. The result is that all accesses to an object represented by a vnode are serialized, which can represent a substantial performance hit for applications such as databases, where simultaneous writes would be advantageous, or for various vn-backed oddities (possibly including vnode-backed swap?). At some point, apparently an effort was made to mark up vnode_if.src with possible alternative locking using read/write locks, but given that all the consumers use exclusive locks right now, I assume that was not followed through on. A large part of the cost is mitigated through caching on the under-side of VFS, allowing vnode operations to return rapidly, but while this catches a number of common cases (where the file is already in the cache), there are sufficient non-common cases that I would anticipate this being a problem. Are there any performance figures available that either confirm this concern, or demonstrate that in fact it is not relevant? :-) Would this concern introduce additional funneling in the VM system, or is the granularity of locks in the VM sufficiently low that it might improve performance by combining existing broad locks? Robert N M Watson FreeBSD Core Team, TrustedBSD Project [EMAIL PROTECTED] NAI Labs, Safeport Network Services To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Kirk McKusick writes: >I am still of the opinion that merging VM objects and vnodes would >be a good idea. Although it would touch a huge number of lines of >code, when the dust settled, it would simplify some nasty bits of >the system. When I first heard you say this I thought you were off your rocker, but gradually I have come to think that you may be right. I think the task will be easier if we get the vnode/buf relationship untangled a bit first. It may also pay off to take vnodes out of disk operations entirely before we try the merge. >Under the old name cache implementation, decreasing >the number of vnodes was slow and hard. With the current name cache >implementation, decreasing the number of vnodes would be easy. Actually the main problem is that NFS relies on vnodes never being freed to hold "soft references" using "struct vnode * + v_id". -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Mon, 16 Apr 2001 04:02:34 -0700, Alfred Perlstein <[EMAIL PROTECTED]> said: Alfred> I'm also wondering why you can't track the number of Alfred> nodes that ought to be cleaned, well, you do, but it doesn't Alfred> look like it's used: Alfred> + numcachehv--; Alfred> + numcachehv++; then later: Alfred> + if (vnodeallocs % vnoderecycleperiod == 0 && Alfred> + freevnodes < vnoderecycleminfreevn && Alfred> + vnoderecyclemintotalvn < numvnodes) { Alfred> shouldn't this be related to numcachehv somehow? One reason is that the number of directory vnodes attempted to reclaim should be greater than vnoderecycleperiod, the period of reclaim in getnewvnode() calls. Otherwise, all of the vnodes reclaimed in the last attempt might be eaten up by the next attempt. This fact calls for a constraint of vnoderecyclenumber >= vnoderecycleperiod, but it is not checked yet. The other one is that not all of the directory vnodes in namecache can be reclaimed because some of them may be held as the working directory of a process. Since a directory vnode in namecache can become, or cease to be, a working directory without entering or purging namecache, it is rather hard to track the number of the reclaimable directory vnodes in namecache by simply watching cache_enter() and cache_purge(). While the number of reclaimable directory vnodes can be counted by traversing all of the namecache entries, we again have to traverse the namecache entries in order to actually reclaim vnodes, so this method is not an option to predetermine the number of directory vnodes attempted to reclaim. -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
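The unchecked constraint Seigo mentions could be enforced once the parameters become sysctl-tunable; a hypothetical handler, where the function name and the EINVAL policy are invented here for illustration:

    static int
    sysctl_vnoderecyclenumber(SYSCTL_HANDLER_ARGS)
    {
            int error, new;

            new = vnoderecyclenumber;
            error = sysctl_handle_int(oidp, &new, 0, req);
            if (error != 0 || req->newptr == NULL)
                    return (error);
            /*
             * Reject settings that would let getnewvnode() eat the
             * reclaimed vnodes faster than a reclaim pass frees them.
             */
            if (new < vnoderecycleperiod)
                    return (EINVAL);
            vnoderecyclenumber = new;
            return (0);
    }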
Re: vm balance
Date: Tue, 10 Apr 2001 22:14:28 -0700 From: Julian Elischer <[EMAIL PROTECTED]> To: Rik van Riel <[EMAIL PROTECTED]> CC: Matt Dillon <[EMAIL PROTECTED]>, David Xu <[EMAIL PROTECTED]>, [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: vm balance Rik van Riel wrote: > > I'm curious about the other things though ... FreeBSD still seems > to have the early 90's abstraction layer from Mach and the vnode > cache doesn't seem to grow and shrink dynamically (which can be a > big win for systems with lots of metadata activity). > > So while it's true that FreeBSD's VM balancing seems to be the > best one out there, I'm not quite sure about the rest of the VM... > Many years ago Kirk was talking about merging the vm objects and the vnodes.. (they tend to come in pairs anyhow) I still think it might be an idea worth investigating further. kirk? -- __--_|\ Julian Elischer / \ [EMAIL PROTECTED] ( OZ) World tour 2000-2001 ---> X_.---._/ v I am still of the opinion that merging VM objects and vnodes would be a good idea. Although it would touch a huge number of lines of code, when the dust settled, it would simplify some nasty bits of the system. This merger is really independent of making the number of vnodes dynamic. Under the old name cache implementation, decreasing the number of vnodes was slow and hard. With the current name cache implementation, decreasing the number of vnodes would be easy. I concur that adding a dynamically sized vnode cache would help performance on some workloads. Kirk McKusick To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > >has been updated and is now ready to commit. Ok, I ran a "cvs update ; make buildworld" here with and without your patch. without: 2049.846u 1077.358s 41:29.65 125.6% 594+714k 121161+5608io 7725pf+331w with: 2053.464u 1075.493s 41:29.50 125.6% 595+715k 123125+5682io 8897pf+446w Difference: +.17% -.18% ~0% 0% +.17% +.14% +1.6% +1.3% +15% +35% I think that means we're inside epsilon for the normal case, so I'll commit your patch later tonight. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >Poul-Henning> I'm a bit worried about the amount of work done in the >Poul-Henning> cache_purgeleafdirs(), considering how often it is called. >Poul-Henning> Have you measured the performance impact of this to be an >Poul-Henning> insignificant overhead ? > >No precise results right now, mainly because I cannot find a benchmark >to measure the performance of name lookup going down to a deep >directory depth. Have you done any "trivial" checks, like timing "make world" and such ? >It has been confirmed, though, that the hit ratio of name lookup is >around 96-98% for a box serving cvsup both with and without my patch >(observed by systat(1)). Here are the details of the name lookup on >that box: Ohh, sure, I don't expect this to have a big impact on the hit rate. If I thought it would have, I would have protested :-) >For a more precise investigation, we have to measure the actual time >taken for a lookup operation, in which case I may have to write a >benchmark for it and test in single-user mode. I would be satisfied with a "sanity-check", for instance running a "cvs co src ; cd src ; make buildworld ; cd release ; make release" with and without, just to see that it doesn't have a significant negative impact. >It is interesting that the hit ratio of directory lookup is only >1% at most, even without my patch. Why is it like that? Uhm, which cache is this ? The one reported in "vmstat -vm" ? That is entirely different from the vfs-namecache, I think it is a per process one-slot directory cache. I have never studied its performance, but I believe a good case was made for it in the 4.[34] BSD books. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Mon, 16 Apr 2001 12:36:03 +0200, Poul-Henning Kamp <[EMAIL PROTECTED]> said: Poul-Henning> In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >> Those pieces of work were done over the last weekend, and the patch at >> Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff >> >> has been updated and is now ready to commit. Poul-Henning> I'm a bit worried about the amount of work done in the Poul-Henning> cache_purgeleafdirs(), considering how often it is called. Poul-Henning> Have you measured the performance impact of this to be an Poul-Henning> insignificant overhead ? No precise results right now, mainly because I cannot find a benchmark to measure the performance of name lookup going down to a deep directory depth. It has been confirmed, though, that the hit ratio of name lookup is around 96-98% for a box serving cvsup both with and without my patch (observed by systat(1)). Here are the details of the name lookup on that box: Frequency: Around 25,000-35,000 lookups/sec at most, 8,000-10,000 generally. Name vs Directory: 98% or more of the lookups are for names, the rest of them are for directories (up to 1.5% of all lookups at most). Hit ratio: 96-98% for names and 1% at most for directories (both with and without my patch) Considering that most lookup operations are for names and their hit ratio is not observed to degrade, and assuming that the time consumed for a lookup hit is always constant, the performance of lookup does not appear to have deteriorated. For a more precise investigation, we have to measure the actual time taken for a lookup operation, in which case I may have to write a benchmark for it and test in single-user mode. It is interesting that the hit ratio of directory lookup is only 1% at most, even without my patch. Why is it like that? -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
* Seigo Tanimura <[EMAIL PROTECTED]> [010416 03:25] wrote: > On Fri, 13 Apr 2001 20:08:57 +0900, > Seigo Tanimura said: > > Alfred> Are these changes planned for integration? > > Seigo> Yes, but not very soon as there are a few kinds of work that should > Seigo> be done. > > Seigo> One is that a directory vnode may be held as the working directory of > Seigo> a process, in which case we should not reclaim the directory vnode. > > Seigo> Another is to determine how often namecache should be traversed to > Seigo> reclaim how many directory vnodes. At this moment, namecache is > (snip) > > Those pieces of work were done over the last weekend, and the patch at > > Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > > has been updated and is now ready to commit. There are actually a few style bugs in here: pointers should be compared against NULL, not 0 using a bit more meaningful variable names would be nice: + struct nchashhead *ncpp; + struct namecache *ncp, *nnp, *ncpc, *nnpc; I'm also wondering why you can't track the number of nodes that ought to be cleaned, well, you do, but it doesn't look like it's used: + numcachehv--; + numcachehv++; then later: + if (vnodeallocs % vnoderecycleperiod == 0 && + freevnodes < vnoderecycleminfreevn && + vnoderecyclemintotalvn < numvnodes) { shouldn't this be related to numcachehv somehow? excuse me if i'm missing something obvious, i'm in desperate need of sleep. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
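For the style(9) point above, the difference is just explicitness; a generic illustration (the exact offending lines are not quoted, so the field name here is merely taken from struct namecache):

    if (ncp->nc_dvp != NULL)        /* style(9): compare against NULL */
    if (ncp->nc_dvp != 0)           /* what the patch does */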
Re: vm balance
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >Those pieces of work were done over the last weekend, and the patch at > >Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > >has been updated and is now ready to commit. I'm a bit worried about the amount of work done in the cache_purgeleafdirs(), considering how often it is called. Have you measured the performance impact of this to be an insignificant overhead ? Once we have that figured out I will commit the patch for you... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
* Seigo Tanimura <[EMAIL PROTECTED]> [010416 03:25] wrote: > On Fri, 13 Apr 2001 20:08:57 +0900, > Seigo Tanimura said: > > Alfred> Are these changes planned for integration? > > Seigo> Yes, but not very soon as there are a few kinds of work that should > Seigo> be done. > > Seigo> One is that a directory vnode may be held as the working directory of > Seigo> a process, in which case we should not reclaim the directory vnode. > > Seigo> Another is to determine how often namecache should be traversed to > Seigo> reclaim how many directory vnodes. At this moment, namecache is > (snip) > > Those pieces of work were done over the last weekend, and the patch at > > Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > > has been updated and is now ready to commit. Heh, go for it. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Fri, 13 Apr 2001 20:08:57 +0900, Seigo Tanimura said: Alfred> Are these changes planned for integration? Seigo> Yes, but not very soon as there are a few kinds of work that should Seigo> be done. Seigo> One is that a directory vnode may be held as the working directory of Seigo> a process, in which case we should not reclaim the directory vnode. Seigo> Another is to determine how often namecache should be traversed to Seigo> reclaim how many directory vnodes. At this moment, namecache is (snip) Those pieces of work were done over the last weekend, and the patch at Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff has been updated and is now ready to commit. -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:Speaking of vmiodirenable, what are the issues with it that it's not :enabled by default? ISTR that it's been in a while, and most people :pointed at it have reported success with it, and it seems to have solved :problems here and there for a number of people. What's keeping it from :the general case? : :-- :Matthew Fuller (MF4839) |[EMAIL PROTECTED] I'll probably turn it on after the 4.3 release. As far as Kirk and I can tell, there are no (hah!) filesystem corruption bugs left in the filesystem or VM code. I am guessing that what corruption still occurs occasionally is either due to something elsewhere in the kernel, or motherboard issues (e.g. like the VIA chipset IDE DMA corruption bug). I have just four words to say about IDE DMA: It's a f**ked standard. Neither Kirk nor I have been able to reproduce reported problems at all, but with help from others we have fixed a number of bugs which seem to have had a positive effect on Yahoo's test machines. At the moment one of Yahoo's 8 IDE test systems may crash once after a few hours, but then after reboot will never crash again. This hopefully means that fsck is fixing corruption generated from earlier buggy kernels that is caught later on. I've been exchanging email with three other people with corruption issues. One turned out to be hardware (fsck after newfs was failing, so obviously not a filesystem issue!), another is indeterminate, the third was working fine until late February and then new kernels started to result in corruption (while old kernels still worked) and he is now trying to narrow down the date range where the problem was introduced. Either way it should be fairly obvious if turning on vmiodirenable makes it worse or not. My guess is: not, and it's just my paranoia that is holding up turning on vmiodirenable. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Sat, Apr 14, 2001 at 09:34:26AM -0500, Matthew D. Fuller wrote: > On Thu, Apr 12, 2001 at 02:24:36PM -0700, a little birdie told me > that Matt Dillon remarked > > > > Without vmiodirenable turned on, any directory exceeding > > vfs.maxmallocbufspace becomes extremely expensive to work with, > > O(N * diskIO). With vmiodirenable turned on huge directories > > are O(N), but have a better chance of being in the VM page cache > > so cost proportionally less even though they don't do any > > better on a relative scale. > > Speaking of vmiodirenable, what are the issues with it that it's not > enabled by default? ISTR that it's been in a while, and most people > pointed at it have reported success with it, and it seems to have solved > problems here and there for a number of people. What's keeping it from > the general case? Attached is a message from Matt Dillon from an earlier -hackers discussion. G'luck, Peter -- The rest of this sentence is written in Thailand, on From [EMAIL PROTECTED] Fri Mar 23 02:15:39 2001 Date: Thu, 22 Mar 2001 16:14:11 -0800 (PST) From: Matt Dillon <[EMAIL PROTECTED]> Message-Id: <[EMAIL PROTECTED]> To: "Michael C. Wu" <[EMAIL PROTECTED]> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: tuning a VERY heavily (30.0) loaded server :(Why is vfs.vmiodirenable=1 not enabled by default?) : The only reason it isn't enabled by default is some unresolved filesystem corruption that occurs very rarely (with or without it) that Kirk and I are still trying to nail down. I want to get that figured out first. It is true that some people have brought up memory use issues, but I don't consider memory use to really be that much of an issue. This is a cache, after all, so the blocks can be reused at just about any time. And directory blocks do not get cached well at all with vmiodirenable turned off. So the net result should be an increase in performance even on low-memory boxes. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Thu, Apr 12, 2001 at 02:24:36PM -0700, a little birdie told me that Matt Dillon remarked > > Without vmiodirenable turned on, any directory exceeding > vfs.maxmallocbufspace becomes extremely expensive to work with > O(N * diskIO). With vmiodirenable turned on huge directories > are O(N), but have a better chance of being in the VM page cache > so cost proportionally less even though they don't do any > better on a relative scale. Speaking of vmiodirenable, what are the issues with it that it's not enabled by default? ISTR that it's been in a while, and most people pointed at it have reported success with it, and it seems to have solved problems here and there for a number of people. What's keeping it from the general case? -- Matthew Fuller (MF4839) |[EMAIL PROTECTED] Unix Systems Administrator |[EMAIL PROTECTED] Specializing in FreeBSD |http://www.over-yonder.net/ "The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet" To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Fri, 13 Apr 2001 02:58:07 -0700, Alfred Perlstein <[EMAIL PROTECTED]> said: Alfred> * Seigo Tanimura <[EMAIL PROTECTED]> [010413 02:39] wrote: >> On Thu, 12 Apr 2001 22:50:50 +0200, >> Poul-Henning Kamp <[EMAIL PROTECTED]> said: >> >> Poul-Henning> We keep namecache entries around as long as we can use them, and that >> Poul-Henning> generally means that recreating them is a rather expensive operation, >> Poul-Henning> involving creation of a vnode and very likely a vm object again. >> >> Holding a namecache entry forever until its vnode is reused results in >> disaster when a huge number of files are accessed concurrently, causing >> active vnodes to eat up all memory. This beast killed a box of mine >> with 3GB of memory and 200GB of a RAID0 disk array serving about >> 300,000 files by cvsupd and making the world a few months ago, when >> the number of vnodes reached around 400,000, making all of the >> processes wait for a free vnode. >> >> With help from tegge, the box is now reclaiming directory vnodes when >> few free vnodes are available. Only directory vnodes holding no child >> directory vnodes in v_cache_src are recycled, so that directory >> vnodes near the root of the filesystem hierarchy remain in namecache >> and directory vnodes are not reclaimed in cascade. The number of >> vnodes in the box is now about 135,000, staying quite steady. >> >> Name'cache' is the place to hold vnodes for future use which may *not* >> come, hence vnodes held in namecache should be reclaimed in case of >> critical vnode shortage. Alfred> Are these changes planned for integration? Yes, but not very soon as there are a few kinds of work that should be done. One is that a directory vnode may be held as the working directory of a process, in which case we should not reclaim the directory vnode. Another is to determine how often namecache should be traversed to reclaim how many directory vnodes. At this moment, namecache is traversed once every 1,000 calls of getnewvnode(). If the following two inequalities are satisfied, then up to 3,000 directory vnodes are attempted to be reclaimed: freevnodes < wantfreevnodes + 2 * 1000 (1) wantfreevnodes + 2 * 1000 < numvnodes * 2 (2) (1) means that we reclaim directory vnodes if the number of free vnodes is smaller than about 2,000. (2) is so that vnode reclaiming does not occur in the early stage of boot until the number of vnodes reaches around 2,000. Although I chose those parameters so that vnode reclaiming does not degrade the hit ratio of name lookup, they may not be optimal. Those parameters should be tunable via sysctl(2). Anyway, the patch can be found at: http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
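In code form, the trigger described above amounts to something like the following in getnewvnode(); this is a paraphrase with the literals from the text inlined -- the posted patch uses the tunables quoted earlier in the thread, and the argument to cache_purgeleafdirs() is assumed here:

    vnodeallocs++;
    if (vnodeallocs % 1000 == 0 &&                       /* once per 1,000 calls */
        freevnodes < wantfreevnodes + 2 * 1000 &&        /* (1): few free vnodes */
        wantfreevnodes + 2 * 1000 < numvnodes * 2) {     /* (2): not early in boot */
            /* Attempt to reclaim up to 3,000 leaf directory vnodes. */
            cache_purgeleafdirs(3000);
    }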
Re: vm balance
* Seigo Tanimura <[EMAIL PROTECTED]> [010413 02:39] wrote: > On Thu, 12 Apr 2001 22:50:50 +0200, > Poul-Henning Kamp <[EMAIL PROTECTED]> said: > > Poul-Henning> We keep namecache entries around as long as we can use them, and that > Poul-Henning> generally means that recreating them is a rather expensive operation, > Poul-Henning> involving creation of a vnode and very likely a vm object again. > > Holding a namecache entry forever until its vnode is reused results in > disaster when a huge number of files are accessed concurrently, causing > active vnodes to eat up all memory. This beast killed a box of mine > with 3GB of memory and 200GB of a RAID0 disk array serving about > 300,000 files by cvsupd and making the world a few months ago, when > the number of vnodes reached around 400,000, making all of the > processes wait for a free vnode. > > With help from tegge, the box is now reclaiming directory vnodes when > few free vnodes are available. Only directory vnodes holding no child > directory vnodes in v_cache_src are recycled, so that directory > vnodes near the root of the filesystem hierarchy remain in namecache > and directory vnodes are not reclaimed in cascade. The number of > vnodes in the box is now about 135,000, staying quite steady. > > Name'cache' is the place to hold vnodes for future use which may *not* > come, hence vnodes held in namecache should be reclaimed in case of > critical vnode shortage. Are these changes planned for integration? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Thu, 12 Apr 2001 22:50:50 +0200, Poul-Henning Kamp <[EMAIL PROTECTED]> said: Poul-Henning> We keep namecache entries around as long as we can use them, and that Poul-Henning> generally means that recreating them is a rather expensive operation, Poul-Henning> involving creation of a vnode and very likely a vm object again. Holding a namecache entry forever until its vnode is reused results in disaster when a huge number of files are accessed concurrently, causing active vnodes to eat up all memory. This beast killed a box of mine with 3GB of memory and 200GB of a RAID0 disk array serving about 300,000 files by cvsupd and making the world a few months ago, when the number of vnodes reached around 400,000, making all of the processes wait for a free vnode. With help from tegge, the box is now reclaiming directory vnodes when few free vnodes are available. Only directory vnodes holding no child directory vnodes in v_cache_src are recycled, so that directory vnodes near the root of the filesystem hierarchy remain in namecache and directory vnodes are not reclaimed in cascade. The number of vnodes in the box is now about 135,000, staying quite steady. Name'cache' is the place to hold vnodes for future use which may *not* come, hence vnodes held in namecache should be reclaimed in case of critical vnode shortage. -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
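The "no child directory vnodes in v_cache_src" test reads naturally as a small predicate; a sketch assuming the 4.x struct namecache fields (the nc_src linkage and nc_vp) and the vnode's v_cache_src list, with the function name invented here:

    static int
    cache_isleafdir(struct vnode *dvp)
    {
            struct namecache *ncp;

            LIST_FOREACH(ncp, &dvp->v_cache_src, nc_src) {
                    /*
                     * A cached child that is itself a directory makes dvp
                     * an interior node of the cached hierarchy; keep it,
                     * since recycling it would cascade toward the root.
                     */
                    if (ncp->nc_vp != NULL && ncp->nc_vp->v_type == VDIR)
                            return (0);
            }
            return (1);
    }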
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: >:>:>scalability. >:>: >:>:Uhm, that is actually not true. >:>: >:>:We keep namecache entries around as long as we can use them, and that >:>:generally means that recreating them is a rather expensive operation, >:>:involving creation of a vnode and very likely a vm object again. >:> >:>The vnode cache is a different cache. Positive namei hits will >:>reference a vnode, but namei elements can be flushed at any >:>time without flushing the underlying vnode. >: >:Right, but doing so means that to refind that vnode from the name >:is (comparatively) very expensive. >: >:-- >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 >:[EMAIL PROTECTED] | TCP/IP since RFC 956 > >The only thing that is truly expensive is having to physically >scan a large directory in order to instantiate a new namei >record. Everything else is inexpensive by comparison (by two >orders of magnitude!), even constructing new vnodes. > >Without vmiodirenable turned on, any directory [...] It's worse than that: we are still way too rude about throwing away directory data... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>:>scalability. :>: :>:Uhm, that is actually not true. :>: :>:We keep namecache entries around as long as we can use them, and that :>:generally means that recreating them is a rather expensive operation, :>:involving creation of a vnode and very likely a vm object again. :> :>The vnode cache is a different cache. Positive namei hits will :>reference a vnode, but namei elements can be flushed at any :>time without flushing the underlying vnode. : :Right, but doing so means that to refind that vnode from the name :is (comparatively) very expensive. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :[EMAIL PROTECTED] | TCP/IP since RFC 956 The only thing that is truly expensive is having to physically scan a large directory in order to instantiate a new namei record. Everything else is inexpensive by comparison (by two orders of magnitude!), even constructing new vnodes. Without vmiodirenable turned on, any directory exceeding vfs.maxmallocbufspace becomes extremely expensive to work with, O(N * diskIO). With vmiodirenable turned on huge directories are O(N), but have a better chance of being in the VM page cache so cost proportionally less even though they don't do any better on a relative scale. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >: >:In message <[EMAIL PROTECTED]>, Matt Dillon writes: >: >:>Again, keep in mind that the namei cache is strictly throw-away, but >:>entries can often be reconstituted later by the filesystem without I/O >:>due to the VM Page cache (and/or buffer cache depending on >:>vfs.vmiodirenable). So as with the buffer cache and inode cache, >:>the number of entries can be limited without killing performance or >:>scalability. >: >:Uhm, that is actually not true. >: >:We keep namecache entries around as long as we can use them, and that >:generally means that recreating them is a rather expensive operation, >:involving creation of a vnode and very likely a vm object again. > >The vnode cache is a different cache. Positive namei hits will >reference a vnode, but namei elements can be flushed at any >time without flushing the underlying vnode. Right, but doing so means that to refind that vnode from the name is (comparatively) very expensive. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
: :In message <[EMAIL PROTECTED]>, Matt Dillon writes: : :>Again, keep in mind that the namei cache is strictly throw-away, but :>entries can often be reconstituted later by the filesystem without I/O :>due to the VM Page cache (and/or buffer cache depending on :>vfs.vmiodirenable). So as with the buffer cache and inode cache, :>the number of entries can be limited without killing performance or :>scalability. : :Uhm, that is actually not true. : :We keep namecache entries around as long as we can use them, and that :generally means that recreating them is a rather expensive operation, :involving creation of a vnode and very likely a vm object again. The vnode cache is a different cache. Positive namei hits will reference a vnode, but namei elements can be flushed at any time without flushing the underlying vnode. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: >Again, keep in mind that the namei cache is strictly throw-away, but >entries can often be reconstituted later by the filesystem without I/O >due to the VM Page cache (and/or buffer cache depending on >vfs.vmiodirenable). So as with the buffer cache and inode cache, >the number of entries can be limited without killing performance or >scalability. Uhm, that is actually not true. We keep namecache entries around as long as we can use them, and that generally means that recreating them is a rather expensive operation, involving creation of a vnode and very likely a vm object again. We can safely say that you cannot profitably _increase_ the size of the namecache, except for the negative entries where raw statistics will have to be the judge of the profitability of the idea. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Thu, 12 Apr 2001, Matt Dillon wrote: > Again, keep in mind that the namei cache is strictly throw-away, This seems to be the main difference between Linux and FreeBSD. In Linux, open files directly refer to an entry in the dentry (and inode) cache, so we really need to have dynamically growing and shrinking caches in order to accommodate programs that have huge numbers of files open (but we want to free the memory again later, because the system load changes). regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:You should also know that negative entries, since they have no :objects to "hang from" and consequently would clog up the name-cache, :are limited by the sysctl: : debug.ncnegfactor: 16 :which means that max 1/16 of the name cache entries can be negative :entries. You can monitor the number of negative entries with the :sysctl : debug.numneg: 305 : :the value of "16" was rather arbitrarily chosen and better defaults :may exist. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :[EMAIL PROTECTED] | TCP/IP since RFC 956 Here's an example from a lightly loaded machine that's been up about two months (since I last upgraded its kernel): earth:/home/dillon> sysctl -a | fgrep vfs.cache vfs.cache.numneg: 1596 vfs.cache.numcache: 30557 vfs.cache.numcalls: 352196140 vfs.cache.dothits: 5598866 vfs.cache.dotdothits: 14055093 vfs.cache.numchecks: 435747692 vfs.cache.nummiss: 29963655 vfs.cache.nummisszap: 3042073 vfs.cache.numposzaps: 3308219 vfs.cache.numposhits: 274527703 vfs.cache.numnegzaps: 939714 vfs.cache.numneghits: 20760817 vfs.cache.numcwdcalls: 215565 vfs.cache.numcwdfail1: 29 vfs.cache.numcwdfail2: 1730 vfs.cache.numcwdfail3: 0 vfs.cache.numcwdfail4: 4 vfs.cache.numcwdfound: 213802 vfs.cache.numfullpathcalls: 0 vfs.cache.numfullpathfail1: 0 vfs.cache.numfullpathfail2: 0 vfs.cache.numfullpathfail3: 0 vfs.cache.numfullpathfail4: 0 vfs.cache.numfullpathfound: 0 Again, keep in mind that the namei cache is strictly throw-away, but entries can often be reconstituted later by the filesystem without I/O due to the VM Page cache (and/or buffer cache depending on vfs.vmiodirenable). So as with the buffer cache and inode cache, the number of entries can be limited without killing performance or scalability. earth:/home/dillon> vmstat -m | egrep 'Type|vfsc' ... Type InUse MemUse HighUse Limit Requests Limit Limit Size(s) vfscache 30567 2386K 2489K 85444K 275524850 0 64,128,256,256K This particular machine has 30567 component entries in the namei cache at the moment, eating around 2.3 MB of kernel memory. That makes the namei cache quite efficient. Of course, there are many situations where the namei cache is ineffective, such as on machines with insanely huge mail queues or older usenet news systems that used individual files for article storage, or a squid cache that uses individual files. The ultimate solution is to back the name cache with a filesystem that uses hashed or sorted/indexed directories - one of the few disadvantages that remain with UFS/FFS. I've never found that to be a show stopper, though. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
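Working the arithmetic on those counters (and counting the "." and ".." hits as hits), the overall hit rate is (274527703 + 20760817 + 5598866 + 14055093) / 352196140, or roughly 89%, while the vmstat line puts the cost at about 2386K / 30567 entries, i.e. around 80 bytes of kernel memory per cached name -- which is the sense in which the cache is cheap for what it delivers.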
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:
>
>:
>:On Tue, 10 Apr 2001, Matt Dillon wrote:
>:
>:>It's randomness that will kill performance.  You know the old saying
>:>about caches:  They only work if you get cache hits, otherwise
>:>they only slow things down.
>:
>:I wonder ... how does FreeBSD handle negative directory entries?
>:
>:That is, /bin/sh looks through the PATH to search for some executable
>:(eg grep) and doesn't find it in the first 3 directories.
>:
>:Does the vfs cache handle this or does FreeBSD have to go down into
>:the filesystem code every time?
>:
>:Rik
>
>The namei cache stores negative hits.  /usr/src/sys/kern/vfs_cache.c
>cache_lookup() - if ncp->nc_vp (the vnode) is NULL, the cache entry
>represents a negative hit.  cache_enter() - vp may be passed as NULL
>to create a negative cache entry.  In ufs/ufs/ufs_lookup.c, calls to
>cache_enter() enter positive or negative lookups as appropriate.
>

You should also know that negative entries, since they have no
objects to "hang from" and consequently would clog up the name-cache,
are limited by the sysctl:

	debug.ncnegfactor: 16

which means that at most 1/16 of the name cache entries can be
negative entries.  You can monitor the number of negative entries
with the sysctl:

	debug.numneg: 305

The value of "16" was rather arbitrarily chosen and better defaults
may exist.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
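The cap PHK describes can be sketched roughly as follows. This is a simplified illustration of the debug.ncnegfactor enforcement, not the actual vfs_cache.c code; cache_enter_negative(), the ncneg list and the counters are reduced stand-ins for the real data structures.

    /*
     * Illustrative sketch of a negative-entry cap like ncnegfactor.
     * Negative entries sit on their own LRU list; when adding one
     * would exceed numcache / ncnegfactor, the oldest negative entry
     * is recycled first.  Not the real vfs_cache.c code.
     */
    #include <sys/queue.h>
    #include <stddef.h>

    struct vnode;

    struct namecache {
        TAILQ_ENTRY(namecache) nc_neglist;  /* LRU of negative entries */
        struct vnode *nc_vp;                /* NULL => negative entry  */
    };

    static TAILQ_HEAD(, namecache) ncneg = TAILQ_HEAD_INITIALIZER(ncneg);
    static int numneg;           /* current negative entries           */
    static int numcache;         /* total name cache entries (positive
                                    entries are counted elsewhere)     */
    static int ncnegfactor = 16; /* at most 1/16 may be negative       */

    static void
    cache_enter_negative(struct namecache *ncp)
    {
        struct namecache *old;

        ncp->nc_vp = NULL;               /* mark as a negative hit     */

        /* Over the cap?  Recycle the LRU negative entry first. */
        if ((numneg + 1) * ncnegfactor > numcache &&
            (old = TAILQ_FIRST(&ncneg)) != NULL) {
            TAILQ_REMOVE(&ncneg, old, nc_neglist);
            numneg--;
            numcache--;
            /* the real code would cache_zap() 'old' here */
        }
        TAILQ_INSERT_TAIL(&ncneg, ncp, nc_neglist);
        numneg++;
        numcache++;
    }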
Re: vm balance
:
:On Tue, 10 Apr 2001, Matt Dillon wrote:
:
:>It's randomness that will kill performance.  You know the old saying
:>about caches:  They only work if you get cache hits, otherwise
:>they only slow things down.
:
:I wonder ... how does FreeBSD handle negative directory entries?
:
:That is, /bin/sh looks through the PATH to search for some executable
:(eg grep) and doesn't find it in the first 3 directories.
:
:Does the vfs cache handle this or does FreeBSD have to go down into
:the filesystem code every time?
:
:Rik

The namei cache stores negative hits.  See /usr/src/sys/kern/vfs_cache.c:
in cache_lookup(), if ncp->nc_vp (the vnode) is NULL, the cache entry
represents a negative hit.  In cache_enter(), vp may be passed as NULL
to create a negative cache entry.  In ufs/ufs/ufs_lookup.c, calls to
cache_enter() enter positive or negative lookups as appropriate.

					-Matt
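Condensed into code, a lookup therefore has three outcomes: a miss (fall through to the filesystem), a positive hit (a vnode comes back), and a negative hit (nc_vp is NULL, so ENOENT can be returned without touching the filesystem). The sketch below is illustrative only; the real cache_lookup() takes componentname and locking arguments omitted here, and find_entry() is a hypothetical stand-in for the hash-chain search.

    #include <stddef.h>

    struct vnode;

    struct namecache {
        struct vnode *nc_vp;        /* NULL => negative entry */
    };

    enum lookup_result { CACHE_MISS, CACHE_POS_HIT, CACHE_NEG_HIT };

    /* Hypothetical stand-in for the real hash-chain search. */
    static struct namecache *
    find_entry(struct vnode *dvp, const char *name)
    {
        (void)dvp;
        (void)name;
        return (NULL);              /* always misses in this sketch  */
    }

    static enum lookup_result
    cache_lookup_sketch(struct vnode *dvp, const char *name,
        struct vnode **vpp)
    {
        struct namecache *ncp = find_entry(dvp, name);

        if (ncp == NULL)
            return (CACHE_MISS);    /* go down into the filesystem   */
        if (ncp->nc_vp == NULL)
            return (CACHE_NEG_HIT); /* known absent: ENOENT, no I/O  */
        *vpp = ncp->nc_vp;          /* positive hit: hand back vnode */
        return (CACHE_POS_HIT);
    }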
Re: vm balance
On Tue, 10 Apr 2001, Matt Dillon wrote:

>It's randomness that will kill performance.  You know the old saying
>about caches:  They only work if you get cache hits, otherwise
>they only slow things down.

I wonder ... how does FreeBSD handle negative directory entries?

That is, /bin/sh looks through the PATH to search for some executable
(eg grep) and doesn't find it in the first 3 directories.  The next
time the script is started (it might be run for every file in a large
compile) the next invocation of the script looks for the file in the
3 directories where it isn't present ... again.

Does the vfs cache handle this or does FreeBSD have to go down into
the filesystem code every time?

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/
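The access pattern Rik describes is essentially the execvp(3)-style loop below (a hypothetical userland sketch, not shell source). Every failed access() in the loop is an ENOENT lookup, and those repeated failures are exactly what a negative name-cache entry could answer without going down into the filesystem each time.

    /*
     * Hypothetical sketch of a shell's PATH scan.  Each failing
     * access() repeats the same ENOENT lookups on every invocation
     * of the script -- the case negative cache entries exist for.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static int
    find_in_path(const char *prog, char *out, size_t outlen)
    {
        const char *path = getenv("PATH");
        char *copy, *dir, *save;
        int found = 0;

        if (path == NULL || (copy = strdup(path)) == NULL)
            return (0);
        for (dir = strtok_r(copy, ":", &save); dir != NULL;
             dir = strtok_r(NULL, ":", &save)) {
            snprintf(out, outlen, "%s/%s", dir, prog);
            if (access(out, X_OK) == 0) {   /* positive lookup       */
                found = 1;
                break;
            }
            /* ENOENT: answerable from a negative name-cache entry   */
        }
        free(copy);
        return (found);
    }

    int
    main(void)
    {
        char buf[1024];

        if (find_in_path("grep", buf, sizeof(buf)))
            printf("found: %s\n", buf);
        return (0);
    }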
Re: vm balance
Rik van Riel wrote:
>
> I'm curious about the other things though ... FreeBSD still seems
> to have the early 90's abstraction layer from Mach and the vnode
> cache doesn't seem to grow and shrink dynamically (which can be a
> big win for systems with lots of metadata activity).
>
> So while it's true that FreeBSD's VM balancing seems to be the
> best one out there, I'm not quite sure about the rest of the VM...

Many years ago Kirk was talking about merging the vm objects and
the vnodes.. (they tend to come in pairs anyhow)

I still think it might be an idea worth investigating further.

kirk?

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
It's randomness that will kill performance.  You know the old saying
about caches:  They only work if you get cache hits, otherwise
they only slow things down.

					-Matt

:Which is ok if there isn't too much activity with these data
:structures, but I'm not sure if it works when you have a lot
:of metadata activity (though I'm not sure in what kind of
:workload you'd see this).
:
:Also, if you have a lot of metadata activity, you'll essentially
:double the memory requirements, since you'll have the stuff cached
:in both the internal structures and in the VM PAGE cache.  I'm not
:sure how much of a hit this would be, though, if the internal
:structures are limited to a small enough size...
:
:regards,
:
:Rik
Re: vm balance
On Tue, 10 Apr 2001, Matt Dillon wrote:

> :I'm curious about the other things though ... FreeBSD still seems
> :to have the early 90's abstraction layer from Mach and the vnode
> :cache doesn't seem to grow and shrink dynamically (which can be a
> :big win for systems with lots of metadata activity).

> Well, the approach we take is that of a two-layered cache.
> The vnode, dentry (namei for FreeBSD), and inode caches
> in FreeBSD are essentially throw-away caches of data
> represented in an internal form.  The VM PAGE cache 'backs'
> these caches loosely by caching the physical on-disk representation
> of inodes, and directory entries (see note 1 at bottom).
>
> This means that even though we limit the number of the namei
> and inode structures we keep around in the kernel, the data
> required to reconstitute those structures is 'likely' to
> still be in the VM PAGE cache, allowing us to pretty much
> throw away those structures on a whim.  The only cost is that
> we have to go through a filesystem op (possibly not requiring I/O)
> to reconstitute the internal structure.

Which is ok if there isn't too much activity with these data
structures, but I'm not sure if it works when you have a lot
of metadata activity (though I'm not sure in what kind of
workload you'd see this).

Also, if you have a lot of metadata activity, you'll essentially
double the memory requirements, since you'll have the stuff cached
in both the internal structures and in the VM PAGE cache.  I'm not
sure how much of a hit this would be, though, if the internal
structures are limited to a small enough size...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/
Re: vm balance
:In the balancing part, definitely.  FreeBSD seems to be the only
:system that has the balancing right.  I'm planning on integrating
:some of the balancing tactics into Linux for the 2.5 kernel, but
:I'm not sure how to integrate the inode and dentry cache into the
:balancing scheme ...
:
:I'm curious about the other things though ... FreeBSD still seems
:to have the early 90's abstraction layer from Mach and the vnode
:cache doesn't seem to grow and shrink dynamically (which can be a
:big win for systems with lots of metadata activity).
:
:So while it's true that FreeBSD's VM balancing seems to be the
:best one out there, I'm not quite sure about the rest of the VM...
:
:regards,
:
:Rik

Well, the approach we take is that of a two-layered cache.
The vnode, dentry (namei for FreeBSD), and inode caches in FreeBSD
are essentially throw-away caches of data represented in an internal
form.  The VM PAGE cache 'backs' these caches loosely by caching the
physical on-disk representation of inodes and directory entries
(see note 1 at bottom).

This means that even though we limit the number of the namei and
inode structures we keep around in the kernel, the data required to
reconstitute those structures is 'likely' to still be in the VM PAGE
cache, allowing us to pretty much throw away those structures on a
whim.  The only cost is that we have to go through a filesystem op
(possibly not requiring I/O) to reconstitute the internal structure.

For example, take the namei cache.  The namei cache allows the kernel
to bypass big pieces of the filesystem when doing path name lookups.
If a path is not in the namei cache, the filesystem has to do a
directory lookup.  But a directory lookup could very well access pages
in the VM PAGE cache and thus still not actually result in a disk I/O.

The inode cache works the same way ... inodes can be thrown away at
any time and most of the time they can be reconstituted from the
VM PAGE cache without an I/O.

The vnode cache works slightly differently.  VNodes that are not in
active use can be thrown away and reconstituted at a later time from
either the inode cache or the VM PAGE cache (or, if not, they require
a disk I/O to get at the stat information).

There is a caveat for the vnode cache, however.  VNodes are tightly
integrated with VM Objects, which in turn help hold VM pages in place
in the VM PAGE cache.  Thus when you throw away an inactive vnode you
also have to throw away any cached VM pages representing the cached
file or directory data represented by that vnode.  Nearly all
installations of FreeBSD run out of physical memory long before they
run out of vnodes, so this side effect is almost never an issue.  On
some extremely rare occasions it is possible that the system will
have plenty of free memory but hit its vnode cache limit and start
recycling vnodes, causing it to recycle cache pages even when there
is plenty of free memory available.  But this is very rare.

The key point to all of this is that we put most of our marbles in
the VM PAGE cache.  The namei and inode caches are there simply for
convenience so we don't have to 'lock' big portions of the underlying
VM PAGE cache.  The VM PAGE cache is pretty much an independent
entity.  It does not know or care *what* is being cached, it only
cares how often the data is being accessed and whether it is clean or
dirty.  It treats all the data nearly the same.

note (1): Physical directory blocks have historically been cached in
the buffer cache, using kernel MALLOC space, not in the VM PAGE cache.
Buffer-cache based MALLOC space is severely limited (only a few
megabytes) compared to what the VM PAGE cache can offer.  In FreeBSD
a 'sysctl -w vfs.vmiodirenable=1' will cause physical directory
blocks to be cached in the VM PAGE cache, just like files are cached.
This is not the default, but it will be soon, and many people already
turn this sysctl on.

I should also say that there is a *fourth* cache not yet mentioned
which actually has a huge effect on the VM PAGE cache.  This fourth
cache relates to pages *actively* mapped into user space.  A page
mapped into user space is wired (cannot be ripped out of the VM PAGE
cache) and also has various other pmap-related tracking structures
(which you are familiar with, Rik, so I won't expound on that too
much).  If the VM PAGE cache wants to get rid of an idle page that is
still mapped to a user process, it has to unwire it first, which means
it has to get rid of the user mappings - a pmap*() call from
vm/vm_pageout.c and vm/vm_page.c accomplishes this.  This fourth cache
(the active user mappings of pages) is also a throw-away cache, though
one with the side effect of making VM PAGE cache pages available for
loading.
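Matt's two-layer scheme boils down to the lookup flow sketched below. All helper names here (icache_find(), page_cache_find(), disk_read(), inode_from_block()) are hypothetical stand-ins rather than kernel functions; the point is only the fall-through order: the throw-away internal cache first, then the VM PAGE cache, and a disk I/O only when both miss.

    /*
     * Illustrative outline of the two-layered cache.  Layer 1 is the
     * throw-away internal cache (namei/inode structures); layer 2 is
     * the VM PAGE cache holding the raw on-disk bytes.  All helpers
     * are dummy stand-ins so the sketch is self-contained.
     */
    #include <stddef.h>

    struct inode { unsigned long i_number; };

    static struct inode *icache_find(unsigned long ino)
        { (void)ino; return (NULL); }
    static void *page_cache_find(unsigned long blk)
        { (void)blk; return (NULL); }
    static void *disk_read(unsigned long blk)
        { static char buf[512]; (void)blk; return (buf); }
    static unsigned long ino_to_block(unsigned long ino)
        { return (ino / 64); }
    static struct inode *inode_from_block(void *blk, unsigned long ino)
        { static struct inode ip; (void)blk; ip.i_number = ino;
          return (&ip); }

    static struct inode *
    get_inode(unsigned long ino)
    {
        struct inode *ip;
        void *blk;

        /* Layer 1: cheap internal structure, freely throw-away. */
        if ((ip = icache_find(ino)) != NULL)
            return (ip);

        /* Layer 2: the on-disk bytes may still be in the VM PAGE
         * cache, so reconstituting the structure needs no I/O. */
        if ((blk = page_cache_find(ino_to_block(ino))) == NULL)
            blk = disk_read(ino_to_block(ino));   /* true miss      */

        return (inode_from_block(blk, ino));      /* rebuild layer 1 */
    }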
Re: vm balance
On Tue, 10 Apr 2001, Matt Dillon wrote:

> :I heard NetBSD has implemented a FreeBSD-like VM; it also implemented
> :VM balancing in a recent version of NetBSD.  Some parameters like
> :TEXT, DATA and anonymous memory space can be tuned.  Is anyone doing
> :such work on FreeBSD, or has FreeBSD already implemented it?
>
> FreeBSD implements a very sophisticated VM balancing algorithm.  Nobody's
> complaining about it so I don't think we need to really change it.  Most
> of the other UNIXes, including Linux, are actually playing catch-up to
> FreeBSD's VM design.

In the balancing part, definitely.  FreeBSD seems to be the only
system that has the balancing right.  I'm planning on integrating
some of the balancing tactics into Linux for the 2.5 kernel, but
I'm not sure how to integrate the inode and dentry cache into the
balancing scheme ...

I'm curious about the other things though ... FreeBSD still seems
to have the early 90's abstraction layer from Mach and the vnode
cache doesn't seem to grow and shrink dynamically (which can be a
big win for systems with lots of metadata activity).

So while it's true that FreeBSD's VM balancing seems to be the
best one out there, I'm not quite sure about the rest of the VM...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/
Re: vm balance
>
> FreeBSD implements a very sophisticated VM balancing algorithm.  Nobody's
> complaining about it so I don't think we need to really change it.  Most
> of the other UNIXes, including Linux, are actually playing catch-up to
> FreeBSD's VM design.
>

I remember hearing/viewing a zero-copy networking patch for 4.2...
Anyone else seen this?  If it's already part of the tree, ignore me :-)

Andrew

*-.
| Andrew R. Reiter
| [EMAIL PROTECTED]
| "It requires a very unusual mind
|  to undertake the analysis of the obvious" -- A.N. Whitehead
Re: vm balance
:I heard NetBSD has implemented a FreeBSD-like VM; it also implemented
:VM balancing in a recent version of NetBSD.  Some parameters like
:TEXT, DATA and anonymous memory space can be tuned.  Is anyone doing
:such work on FreeBSD, or has FreeBSD already implemented it?
:
:--
:David Xu

FreeBSD implements a very sophisticated VM balancing algorithm.  Nobody's
complaining about it so I don't think we need to really change it.  Most
of the other UNIXes, including Linux, are actually playing catch-up to
FreeBSD's VM design.

					-Matt