Re: vm balance
:
:Julian Elischer wrote:
:> You can mmap() devices and you can mmap files..
:>
:> you cannot mmap FIFOs or sockets.
:>
:> for this reason I think that devices are still well represented by
:> vnodes. If we merged vnodes and vm objects,
:> then if devices were not vnodes, how would you represent
:> a vm area that maps a device?
:
:Merging vnodes and vm objects is an incredibly bad idea. There
:is a lot of other work that should be done before that can even
:be considered, and then it shouldn't be considered.
:
:In other words, it's a good excuse for getting some needed
:changes in, but it's not a good idea.
:
:I know you and Kirk love the idea, but, truly, it is a bad
:idea.

    I like the idea too, but every time I've looked at it it's been
    a huge mess. In short, I don't think we will *ever* be able to
    merge vnodes and VM objects.

					-Matt
Re: vm balance
:
:Julian Elischer wrote:
:> Actually there have been times when I did want to mmap a datastream..
:> I think a datastream mapped into a user buffer-space is one of the
:> possible 0-copy methods people sometimes mention.
:
:This is ugly. There are prettier ways of doing it.
:
:-- Terry

    Considering that a number of failed attempts have already been made
    to optimize standard read()/write() calls, and that mmap() isn't
    really all that well suited to a datastream, I would be inclined to
    develop a set of system calls to deal with 0-copy streams. I did
    something similar in one of my embedded OS's. It could actually
    apply to normal files as easily as to pipes and both UDP and TCP
    data streams, and would not require any fancy splitting of network
    headers versus the data payload. It gives the OS the ultimate
    flexibility in managing 0-copy buffer spaces.

    actual = readptr(int fd, void **ptr, int bytes);

	Attempt to read 'bytes' bytes of data from the descriptor. The
	operating system will map the data read-only and supply a
	pointer to the base of the buffer (which may or may not be
	page-aligned). The actual number of bytes available is
	returned. actual < bytes does NOT signify EOF, because the OS
	may have other limitations, such as having to return piecemeal
	mbufs, skip packet headers, and so forth.

	The data will remain valid until the next readptr(), read(), or
	lseek() call on the descriptor, or until the descriptor is
	closed. You can inform the OS that you have read all the data
	by calling readptr(fd, NULL, 0) (i.e. if this is a TCP
	connection, this would allow TCP to reclaim the related mbufs).
	The OS typically leaves the mapped space mapped for efficiency,
	but the only valid data exists within the specific portion
	represented by your last readptr() call. The OS is free to
	reuse its own mappings at any time as long as it leaves the
	data it has guaranteed to be valid in place.

    avail = writeptr(int fd, void **ptr, int bytes);

	Request buffer space to write 'bytes' bytes of data. The OS
	will map appropriate buffer space and return a pointer to it.
	This procedure returns the actual number of bytes that may be
	written into the returned buffer. The OS may limit the
	available buffer space to fit mbuf/MTU requirements on a TCP
	connection, or for other reasons.

	You should fill the buffer with 'avail' bytes and call
	writeptr() again to commit your buffer. Calling lseek() or
	write() will abort the buffer. You can commit your last
	writeptr() by calling writeptr(fd, NULL, 0). Close()ing the
	descriptor without committing the buffer will result in the
	loss of the buffer.

    note: readptr() and writeptr() do not interfere with each other
    when operating on streams, but one will abort the other when
    operating on files, due to the seek position changing.

    IOCTLs:

    ioctl(fd, IOPTR_WABORT, bytes);

	Abort 'bytes' worth of a previously reserved write buffer.
	Passing -1 aborts the entire buffer.

    ioctl(fd, IOPTR_WCOMMIT, bytes);

	Commit 'bytes' worth of a previously reserved write buffer,
	aborting any remainder after that. Passing -1 commits the
	entire 'avail' space. This can be used to reserve a large
	write buffer and then commit a smaller data set. For example,
	a web server can reserve a 4K response buffer but only commit
	the actual length of the response.

    ioctl(fd, IOPTR_WCLEAR, 0);

	Abort any previously reserved write buffer and force the OS to
	unmap any cached memory space associated with writeptr().
    ioctl(fd, IOPTR_RABORT, bytes);

	Abort a previously returned read buffer, allowing the OS to
	reclaim the buffer space if it wishes (especially useful for
	TCP connections, which might have to hold onto mbufs). 'bytes'
	worth of the buffer is aborted. Passing -1 aborts the entire
	buffer.

    ioctl(fd, IOPTR_RCLEAR, 0);

	Abort any previously returned read buffer and force the OS to
	unmap any cached memory space associated with readptr().

					-Matt
					Matthew Dillon
					<[EMAIL PROTECTED]>
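[A minimal consumer sketch of the interface proposed above. Nothing
here exists in any kernel: readptr(), writeptr(), and the IOPTR_*
ioctls are taken verbatim from the proposal, and the error handling
and buffer size are guesswork.]

	#include <string.h>
	#include <sys/ioctl.h>

	/*
	 * Hypothetical zero-copy stream pump using the proposed API.
	 * Illustrates the calling pattern only: borrow a read-only
	 * kernel buffer, reserve a write buffer, commit, release.
	 */
	int
	pump(int ifd, int ofd)
	{
		void *rbuf, *wbuf;
		int n, avail, off;

		while ((n = readptr(ifd, &rbuf, 65536)) > 0) {
			for (off = 0; off < n; off += avail) {
				avail = writeptr(ofd, &wbuf, n - off);
				if (avail <= 0)
					return (-1);
				/* a real producer would generate data in
				 * place rather than copying it over */
				memcpy(wbuf, (char *)rbuf + off, avail);
				ioctl(ofd, IOPTR_WCOMMIT, avail);
			}
			readptr(ifd, NULL, 0);	/* release mbufs/pages */
		}
		writeptr(ofd, NULL, 0);		/* commit pending buffer */
		return (n);
	}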
Re: vm balance
Julian Elischer wrote:
> You can mmap() devices and you can mmap files..
>
> you cannot mmap FIFOs or sockets.
>
> for this reason I think that devices are still well represented by
> vnodes. If we merged vnodes and vm objects,
> then if devices were not vnodes, how would you represent
> a vm area that maps a device?

Merging vnodes and vm objects is an incredibly bad idea. There
is a lot of other work that should be done before that can even
be considered, and then it shouldn't be considered.

In other words, it's a good excuse for getting some needed
changes in, but it's not a good idea.

I know you and Kirk love the idea, but, truly, it is a bad
idea.

As far as the other work is concerned:

o	Get rid of struct fileops
o	Get rid of specfs entirely; use vp's instead
o	Fix the permission/ownership problems on FIFOs and sockets
	that result from the use of struct fileops
o	Fix range locks on non-file objects
o	Move the lock list to the vnode
o	Make the VFS advisory locking into a veto-based interface,
	which only has something other than "return 0;" in the NFS
	client code
o	Delay lock coalescing until after the attempt has *not been*
	vetoed, in order to save wire traffic in the "local lock
	conflict" case
o	Consider getting rid of lock coalescing entirely, by default,
	in order to comply with the NFSv4 RFC's non-coalescing of
	locks requirement
o	Allow mmap'ing of FIFO objects
o	Constrain the buffer size to a multiple of a page size,
	instead of the weird value of "5K"
o	Implement them slightly differently
o	Get rid of fifofs
o	Give up on the idea of mmap'ing streams, since to do that
	would require constraining mbufs to page-sized chunks per
	mbuf, at least in the receive case, since there are adjacency
	problems with mapping consecutive packets that don't represent
	the same flow (unless you are willing to rewrite all the
	firmware in the world, in which case, "go for it!" 8-)).

-- Terry
Re: vm balance
Julian Elischer wrote:
> Actually there have been times when I did want to mmap a datastream..
> I think a datastream mapped into a user buffer-space is one of the
> possible 0-copy methods people sometimes mention.

This is ugly. There are prettier ways of doing it.

-- Terry
Re: vm balance
:I think we need to remember that we do not always have a
:backing object, nor is a backing object always desirable.
:
:The performance of an mmap'ed file, or swap-backed anonymous
:region is _significantly_ below that of unbacked objects.
:
:-- Terry

    This is not true, Terry. There is no performance degradation with
    swap versus unbacked storage, and no performance degradation with
    file-backed storage if you use MAP_NOSYNC to adjust the write
    flushing characteristics of the map.

    Additionally, there is no 'write through' in the VM layer per se --
    the filesystem syncer has to come along and actually look for dirty
    pages to sync to the backing store (and with MAP_NOSYNC it doesn't
    bother). The VM layers do not touch the backing store at all until
    they absolutely have to. For example, swap is not allocated until
    the pagedaemon actually decides to page something out.

    This leaves only the pageout daemon, which operates as it always
    has... if you are not squeezed for memory, it won't try to page
    anything out. And you can always use madvise(), msync(), and
    mlock() on top of everything else to adjust the VM characteristics
    of a section of memory (though personally speaking I don't think
    mlock() is necessary with 4.x's VM system unless you need
    realtime).

    In short, mmap()'s backing store is not an issue in 4.x. Read the
    manual page for mmap for more information; I fleshed it out a long
    time ago to explain all of this.

					-Matt
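[MAP_NOSYNC is a real FreeBSD mmap(2) flag; a minimal example of the
usage described above. The file path and length are illustrative, and
error handling is trimmed.]

	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* Map a scratch file read/write without the filesystem syncer
	 * periodically flushing its dirty pages to disk. */
	static char *
	map_scratch(const char *path, size_t len)
	{
		int fd = open(path, O_RDWR | O_CREAT, 0600);

		if (fd < 0 || ftruncate(fd, len) < 0)
			return (NULL);
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_NOSYNC, fd, 0);
		close(fd);	/* the mapping persists after close */
		return (p == MAP_FAILED ? NULL : p);
	}
	/* Dirty pages stay in RAM until pageout, or until an explicit
	 * msync(p, len, MS_SYNC) when persistence actually matters. */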
Re: Proposed struct file (was Re: vm balance)
Matt Dillon wrote:
>
>     This is all preliminary. The question is whether we can
>     cover enough bases for this to be viable.
>
>     Here is a proposed struct file. Make f_data opaque (or
>     more opaque), add f_object, extend fileops (see next
>     structure), and add f_vopflags to indicate the presence
>     of a vnode in f_data, allowing extended filesystem ops
>     (e.g. rename, remove, fchown, etc etc etc).

1)	struct fileops is evil; adding to it contributes to its
	inherent evil-ness.

2)	The new structure is too large.

3)	The old structure is too large; I have a need for 1,000,000
	open files for a particular application, and I'm not willing
	to give up that much memory.

-- Terry
Re: vm balance
Poul-Henning Kamp wrote:
>
> In message <[EMAIL PROTECTED]>, Kirk McKusick writes:
>
> >Every vnode in the system has an associated object.
>
> No: device vnodes dont...
>
> I think the correct solution to that is to move devices away from
> vnodes and into the fdesc layer, just like fifo's and sockets.

This is really, likewise, a bad idea.

The "struct fileops" has been a problem from day one. It exists for
devices because we still have "specfs", and have not moved over to a
"devfs" that uses vnodes instead of using strategy routines invoked
from a "struct fileops *" dereference.

The code was smeared into the FIFO/socket/IPC code as a poor man's
integration to get something working. When that happened, the ability
to do normal things like set ownership, permissions, etc., on things
like FIFOs disappeared. FreeBSD is much poorer with regard to full
compliance with POSIX semantics on things like F_ fcntl() arguments
and the like when applied to sockets. Linux, Solaris, AIX, and other
POSIX and Single UNIX Specification compliant OSs don't suffer these
same problems.

Perhaps one of the most annoying things about FreeBSD is the inability
to perform advisory locking on anything but true vnode objects... and
then only if the underlying VFS has an advisory lock chain hung off of
some private structure, which can't be rescued except through the
evils of POSIX locking semantics. Many applications use advisory lock
chains off of devices to communicate region protection information not
directly related to really protecting the resource.

Similarly, "struct fileops" is the main culprit, to my mind, behind
the inability of FreeBSD to support cloning devices, such as that
needed for multiple virtual machine instances in vmware to work as it
does in Linux and other first-class host OSs.

-- Terry
Re: vm balance
[ ... merging vnode and vm_object_t ... ]

Kirk McKusick wrote:
> Every vnode in the system has an associated object. Every object
> backed by a file (e.g., everything but anonymous objects) has an
> associated vnode. So, the performance of one is pretty tied to the
> performance of the other. Matt is right that the VM does locking
> on a page level, but then has to get a lock on the associated
> vnode to do a read or a write, so really is pretty tied to the
> vnode lock performance. Merging the two data structures is not
> likely to change the performance characteristics of the system for
> either better or worse. But it will save a lot of headaches having
> to do with lock ordering that we have to deal with at the moment.

I really, really dislike the idea of a merge of these objects, still,
and not just because it will be nearly impossible to make object
coherency work in a stack of two or more VFS layers if this change
ever goes through.

When John Dyson originally wrote the FreeBSD unified VM and buffer
cache code under contract for Oracle, for use in their Oracle 8i and
FreeBSD based NC server platform, he did so in such a way as to allow
anonymous objects, which did not have backing store associated with
them. This was the memory pulled off of /dev/zero, and the memory in
SYSVSHM. The main benefit of doing this is that it saves an incredible
amount of write-through, which would otherwise be necessary to
maintain coherency with the backing object (vnode).

I think we need to remember that we do not always have a backing
object, nor is a backing object always desirable.

The performance of an mmap'ed file, or swap-backed anonymous region,
is _significantly_ below that of unbacked objects.

-- Terry
Re: vm balance
Poul-Henning Kamp wrote:
>
> In message <[EMAIL PROTECTED]>, Matt Dillon writes:
>
> >Actually, all this talk does imply that VM objects should be
> >independent of vnodes. Devices may need to mmap (requiring a VM
> >object), but don't need all the baggage of a vnode. Julian is
> >absolutely correct there.
>
> Well, you have other VM Objects which don't map to vnodes: swap
> backed anonymous objects for instance.

there has been talk of MAKING those have a vnode by making a swapfs.

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> In message <[EMAIL PROTECTED]>, Matt Dillon writes:
> >If this will get rid of or clean up the specfs garbage, then I'm all
> >for it. I would love to see a 'clean' fileops based device interface.
>
> specfs, aliased vnodes, you name it...
>
> I think the aliased vnodes is the single strongest argument of them
> all for doing this...

	I think that this can be (and already is) solved in the other
way. Here is how I did it on my test system (quoted from a mail to
Bruce Evans):

--quote-start--
	I'm working on this problem too, and these vop_lock/unlock in
the spec_open/read/write vnops cause a real pain. Using a generic
vnode stacking/layering mechanism (diffs will be published soon) I've
reorganized the way device vnodes are handled. Each device gets its
own vnode of type VT_SPEC, which belongs to a hidden specfs mount.
When any real filesystem tries to look up a vnode for a specific
device via addaliasu(), addalias() just stacks the filesystem vnode
over the specfs vnode:

	fs1/vnode1      fs1/vnode8      fs2/vnode1
	    |               |               |
	    +---------------+---------------+
	                    |
	                    V
	              specfs vnode

	The specfs vnode can also be used directly as the root vnode
for any mounted filesystem. Obviously, there is no need for device
aliases, because a device can be controlled only via a single vnode.
The v_rdev field also goes away from the vnode structure, and
vn_todev() is the right way to get a pointer to the underlying device.

	But there is a real problem with the locking/unlocking used by
specfs. E.g., if the specfs vnode's lock is used as the lock for an
entire layer tree, then things will be totally broken, because a
blocked spec_read() operation may unlock a different vnode which
should stay locked, and even more problems are caused by the read
lock being shared... Using a separate lock for each vnode partially
solves the problem, but does not completely emulate the old behavior
of an exclusive lock on the open operation. For example, if we call
open(vn1) and it blocks, a second open(vn1) will get stuck waiting for
the lock on vn1, while open(vn8) will work just fine. This problem is
common to stacked filesystems, and many papers avoid talking about it.
The "right" solution is to have a "call stack", so that an unlock
operation can unlock only a single chain of the above vnodes, but I
don't see a simple way to implement it for stacks containing more than
two layers :(
--quote-end--

	Now, regarding the new file operations structure: it is pretty
obvious that most of the operations will resemble vnode operations.
However, it is a misdesign of VFS to not let a filesystem do
per-file-descriptor tracking of at least OPEN/CLOSE operations. It is
also pretty obvious that file operations (FOPs) are just a layer above
VOP operations. So, why not do things right and add the capability to
the existing VFS to handle per-file operations properly? Of course,
this will require more brain work, but the results will be definitely
better.

	Let's get back to vnodes/vm/files/devices: I think it is a
mistake to rip vnodes out of devices. But I agree that the vnode
structure is too fat to be used in a more general way. If it is
possible to clean it up, then we can easily build any hierarchies we
want:

	file1   file2   file3
	  |       |       |
	  +---+---+       |
	      |           |
	   vnode1      vnode2
	      |           |
	      +-----+-----+
	            |
	        device1

--
Boris Popov
http://www.butya.kz/~bp/
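[A rough sketch of the stacking Boris describes, in the style of
FreeBSD's null layer. addaliasu()/addalias() and vn_todev() are the
real 4.x entry points he mentions; everything else here, including
specfs_getvnode() and the struct name, is invented for illustration.]

	/* Per-device private data hung off each filesystem-level
	 * device vnode; it forwards to the one shared specfs vnode. */
	struct specnode {
		struct vnode	*sn_lower;	/* shared specfs vnode */
		dev_t		 sn_dev;	/* underlying device */
	};

	/* Hypothetical addalias() replacement: stack the filesystem
	 * vnode 'vp' over the specfs vnode for 'dev', creating the
	 * specfs vnode on first use. */
	static void
	spec_stack(struct vnode *vp, dev_t dev)
	{
		struct specnode *sn;

		sn = malloc(sizeof(*sn), M_TEMP, M_WAITOK);
		sn->sn_lower = specfs_getvnode(dev);	/* find-or-create */
		sn->sn_dev = dev;
		vp->v_data = sn;	/* VOPs now forward downward */
	}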
Re: vm balance
On Wed, Apr 18, 2001 at 10:26:40AM -0700, Julian Elischer wrote:
> Robert Watson wrote:
> >
> > On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
> >
> > As I indicated in my follow-up mail, the statement about seeking was
> > incorrect, that is a property of the open file structure; I believe
> > the remainder still holds true. When was the last time you tried
> > mmap'ing or seeking on the socket? A socket represents a buffered
> > data stream which does not allow arbitrary read/write operations at
> > arbitrary offsets.
>
> Actually there have been times when I did want to mmap a datastream..
> I think a datastream mapped into a user buffer-space is one of the
> possible 0-copy methods people sometimes mention.

Mmapped data streams: audio IO. There are probably others.

--
Andrew
Re: Proposed struct file (was Re: vm balance)
    (oops, I forgot to add fo_truncate() to the fileops)

					-Matt
Proposed struct file (was Re: vm balance)
    This is all preliminary. The question is whether we can cover
    enough bases for this to be viable.

    Here is a proposed struct file. Make f_data opaque (or more
    opaque), add f_object, extend fileops (see next structure), and
    add f_vopflags to indicate the presence of a vnode in f_data,
    allowing extended filesystem ops (e.g. rename, remove, fchown,
    etc etc etc).

struct file {
	LIST_ENTRY(file) f_list;	/* list of active files */
	short	f_flag;			/* see fcntl.h */
	short	f_type;			/* descriptor type */
	short	f_vopflags;		/* extended command set flags */
	short	f_FILLER2;		/* (OLD) references from message queue */
	struct	ucred *f_cred;		/* credentials associated with descriptor */
	struct	fileops *f_ops;		/* FILE OPS */
	int	f_seqcount;		/* (sequential heuristic) */
	off_t	f_nextoff;		/* (sequential heuristic) */
	off_t	f_offset;		/* seek position */
	caddr_t	f_data;			/* opaque data (was vnode or socket) */
	vm_object_t f_object;		/* VM object if mmapable/cacheable, or NULL */
	int	f_count;		/* reference count */
	int	f_msgcount;		/* reference count from message queue */

	(additional elements required to support devices, maybe just a
	dev_t reference or something like that. I dunno).
};

    Proposed fileops structure (shared): Remove the ucred argument
    (obtain the ucred from struct file), and add additional functions.

    Add cached and uncached versions of fo_read() ... all users will
    use fo_read(), but this way you can vector fo_read() to a generic
    VM Object layer which can then call fo_readnc() for anything that
    can't be handled by that layer. Same with fo_write(). Add
    additional flags to fo_writenc() to handle write-behind,
    notification that a write occurred in the VM layer (e.g. required
    by NFS), and other heuristic features.

    Note the lack of any reference to the buffer cache here. The
    filesystem is responsible for manipulation of the buffer cache if
    it wants to use the buffer cache. I've left the uio in for the
    moment since it's the most generic way of passing a buffer.

struct fileops {
	int	(*fo_read) (fp, uio, flags, p);		/* cachable */
	int	(*fo_readnc) (fp, uio, flags, p);	/* uncached */
	int	(*fo_write) (fp, uio, flags, p);	/* cachable */
	int	(*fo_writenc) (fp, uio, flags, p);	/* uncached */
	int	(*fo_ioctl) (fp, com, data, p);
	int	(*fo_poll) (fp, events, p);
	int	(*fo_kqfilter) (fp, knote);
	int	(*fo_stat) (fp, stat, p);
	int	(*fo_close) (fp, p);
	int	(*fo_mmap) (fp, mmap_args);
	int	(*fo_dump) ( ? )
	... others ...
} *f_ops;

					-Matt
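[To make the cached/uncached split concrete, here is one way the
generic VM-object read vector could look. This is purely a sketch of
the idea in the proposal: vm_object_page_lookup() and
uiomove_frompage() are invented placeholder helpers, not existing
kernel functions.]

	/* fo_read for any mmapable file: satisfy what we can from the
	 * VM object's resident pages, and fall back to the backend's
	 * uncached fo_readnc() on a miss. */
	static int
	generic_fo_read(struct file *fp, struct uio *uio, int flags,
	    struct proc *p)
	{
		vm_page_t m;

		if (fp->f_object == NULL)	/* not cacheable */
			return (fp->f_ops->fo_readnc(fp, uio, flags, p));

		while (uio->uio_resid > 0) {
			m = vm_object_page_lookup(fp->f_object,
			    OFF_TO_IDX(uio->uio_offset));
			if (m == NULL)		/* miss: backend fills */
				return (fp->f_ops->fo_readnc(fp, uio,
				    flags, p));
			uiomove_frompage(m, uio); /* copy out cached page */
		}
		return (0);
	}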
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:

>Actually, all this talk does imply that VM objects should be
>independent of vnodes. Devices may need to mmap (requiring a VM
>object), but don't need all the baggage of a vnode. Julian is
>absolutely correct there.

Well, you have other VM Objects which don't map to vnodes: swap
backed anonymous objects for instance.

>We do need to guarantee locking order, which means that all I/O
>operations should be consistent. If a device or vnode is mmap()able,
>then all read, write, and truncation(/extension) ops *should* run
>through the VM object first:

We guarantee that today by mapping the actual hardware and by having
all reads/writes be synchronous. I remember at least one other UNIX
which didn't make that guarantee.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:
>
>:You can mmap() devices and you can mmap files..
>:
>:you cannot mmap FIFOs or sockets.
>:
>:for this reason I think that devices are still well represented by
>:vnodes. If we merged vnodes and vm objects,
>:then if devices were not vnodes, how would you represent
>:a vm area that maps a device?
>:
>:--
>: __--_|\ Julian Elischer
>
>I think the crux of the issue here is that most devices just don't
>need the baggage of a vnode, and many don't need the baggage of a VM
>object except possibly for mmap(). A fileops interface would be the
>cleanest way to implement a wide range of devices.
>
>Lets compare our various function dispatch structures. It's quite
>obvious to me that we can merge cdevsw and fileops and remove all
>vnode references from most of our devices. Ok, maybe not /dev/tty...
>but most of the rest surely! We would also want to have an optional
>vnode pointer in the fileops (like we do now) which 'enables' the
>additional VOP operations on that file descriptor (in this case the
>fileops for read, write, etc... would point to VOP wrappers like they
>do now), and of course we would need an opaque pointer for use by
>the fileops (devices would most likely load their cdev reference into
>it).

Right on.

I think your table is wrong for "REVOKE"; there is TTY magic in that.

The fact that we have aliased vnodes for devices, and for nothing
else, is one good reason for doing this. The fact that all devices
are handled by a magic filesystem (specfs), in the same "orphan" mode,
by all filesystems which support devices is another good reason.

I think I'll kick back tonight and try to see what it actually takes
to do it...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
In message <[EMAIL PROTECTED]>, Julian Elischer writes:

>If we merged vnodes and vm objects,
>then if devices were not vnodes, how would you represent
>a vm area that maps a device?

You would use a VM object of course, but it would be a special kind
of VM object, just like today...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
:Does this give you a cache coherence problem if the file system itself
:invokes data writes on files? Consider the UFS quota and extended
:attribute cases: here, the file system will invoke VOP_WRITE() on its
:vnodes to avoid understanding file system internals, so you can have such
:operations shared across file systems using UFS. If there is caching
:happening above VOP_WRITE(), will changes get propagated up the stack? Or
:does VOP_WRITE() change so that it talks to the memory object which then
:talks to VOP_REALLYWRITE()?

    There are a number of places where the kernel opens and then
    manipulates files with VOP calls. That's been a major eyesore,
    frankly. We would change those instances to open and manipulate
    files through struct file's (like it should have been done in the
    first place).

:Also, what implications does this have for security-oriented revocation?
:Memory mapping has always been a problem for revocation, but a number of
:interesting pieces of work have been done wherein access to a file is

    I don't think there are any implications. Rather than scanning for
    a vnode, we instead just scan for an opaque data pointer in the
    struct file. It might not be quite that trivial, but it wouldn't
    be difficult either. mmap is another matter, but certainly no more
    difficult than it would be with the current scheme.

:Also, however this is implemented, it would be nice to consider supporting
:stateful access to devices: i.e., dev_open() returns a state reference
:that is fed into future operations, so that pseudo-devices emulating
:multi-instance devices from other platforms can operate correctly. In my

    I was thinking more like allocating a struct file, filling it in
    with defaults, then passing it to dev_open() which would override
    the defaults as necessary. In other words, the open function
    manipulates the struct file and is otherwise completely opaque to
    the caller.

:use), or we need a more general state management technique. In any case,
:one thing this means is that if operations are pushed through a virtual
:memory object, different "instances" must have different objects...

    If the fileops must handle mmap, then the VM object would be
    directly associated with the fileops. If a file has an associated
    vnode there might also be a VM object reference in the vnode
    (assuming we don't merge them), but it would be opaque to the rest
    of the system.

					-Matt
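[A sketch of the open path Matt describes in words. Every name below
that is not plain struct-file bookkeeping, including file_alloc(),
devfileops, and the d_fileopen entry point, is invented for
illustration; this is not the 4.x code path.]

	/* Allocate a struct file pre-filled with generic device
	 * defaults, then let the driver's open routine override
	 * whatever it wants (f_ops, f_data, f_object, ...). */
	static int
	dev_file_open(dev_t dev, int flags, struct ucred *cred,
	    struct proc *p, struct file **fpp)
	{
		struct file *fp;
		int error;

		fp = file_alloc(cred);		/* hypothetical allocator */
		fp->f_flag = flags;
		fp->f_ops = &devfileops;	/* default device vector */
		fp->f_data = (caddr_t)dev;	/* opaque to callers */

		/* A cloning driver may replace f_data with per-instance
		 * state, or swap in a private fileops vector. */
		error = devsw(dev)->d_fileopen(dev, flags, p, fp);
		if (error) {
			file_free(fp);		/* hypothetical */
			return (error);
		}
		*fpp = fp;
		return (0);
	}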
Re: vm balance
On Wed, 18 Apr 2001, Matt Dillon wrote:

> If a device or file can be mmap()'d, then the VM Object acts as the
> cache layer for the object. We would in fact be able to remove nearly
> *ALL* the caching crap from *ALL* the filesystem code. Filesystem
> code would be responsible for low level I/O operations and meta ops
> (VOPs) only and not be responsible for any caching of file data. The
> filesystem would still potentially be responsible for caching things
> like bitmaps and such, but it could use a struct file for the backing
> device and get it for free (the backing device is mmapable and thus
> would have a VM Object layer, so you get the bitmap caching for free).

Does this give you a cache coherence problem if the file system itself
invokes data writes on files? Consider the UFS quota and extended
attribute cases: here, the file system will invoke VOP_WRITE() on its
vnodes to avoid understanding file system internals, so you can have
such operations shared across file systems using UFS. If there is
caching happening above VOP_WRITE(), will changes get propagated up
the stack? Or does VOP_WRITE() change so that it talks to the memory
object, which then talks to VOP_REALLYWRITE()?

Also, what implications does this have for security-oriented
revocation? Memory mapping has always been a problem for revocation,
but a number of interesting pieces of work have been done wherein
access to a file is revoked, resulting in EPERM being returned from
future reads. In fact, I believe Secure Computing even contracted with
BSDI to have support for some sort of virtual memory revocation
service written -- in MAC environments, a label change on a file can
result in future operations failing. Many third-party security
extensions on various platforms implement some sort of revocation
service -- while it hasn't been part of the base OS in many cases,
this is still a relevant audience.

Also, however this is implemented, it would be nice to consider
supporting stateful access to devices: i.e., dev_open() returns a
state reference that is fed into future operations, so that
pseudo-devices emulating multi-instance devices from other platforms
can operate correctly. In my mind, for this to work with file
descriptor passing, either the open file record needs to hold the
state, and be passed into operations (this is what Linux does -- all
file system operations accept an open file entry pointer, allowing
vmmon, for example, to determine which session is in use), or we need
a more general state management technique. In any case, one thing this
means is that if operations are pushed through a virtual memory
object, different "instances" must have different objects...

I may be off-base on some points here based on a lack of expertise on
the device and vm sides, but my feeling is that there are a lot of
implications to this type of change, and we want to be careful not to
preclude a number of potential future development directions,
especially when it comes to security work and emulation.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
:
:Great. Then we have aliased file pointers...
:that's not a great improvement..
:
:You'd still have to have 'per instance' storage somewhere,
:so that the opened devices could have different permissions, and still
:have them point to common data. so you still need
:aliases, except now it's not a vnode being aliased but some
:other structure.

    VNodes should never have been aliased in the first place, IMHO. We
    have to deal with certain special cases, like mmap'ing /dev/zero,
    but that is a minor issue I think.

    Actually, all this talk does imply that VM objects should be
    independent of vnodes. Devices may need to mmap (requiring a VM
    object), but don't need all the baggage of a vnode. Julian is
    absolutely correct there.

    We do need to guarantee locking order, which means that all I/O
    operations should be consistent. If a device or vnode is
    mmap()able, then all read, write, and truncation(/extension) ops
    *should* run through the VM object first:

	read/write/truncate fileops -> [VM object] -> device
	read/write/truncate fileops -> [VM object] -> vnode

    Relative to Poul's last message, this would require not only
    adding MMAP to the fileops, but also adding FTRUNCATE to the
    fileops. Not a big deal!

    If a device or file is not mmap()able, then the VM object would
    not exist. You wouldn't get any caching, either, in that case,
    unless the device implemented the caching natively.

    If a device or file can be mmap()'d, then the VM Object acts as
    the cache layer for the object. We would in fact be able to remove
    nearly *ALL* the caching crap from *ALL* the filesystem code.
    Filesystem code would be responsible for low level I/O operations
    and meta ops (VOPs) only, and not be responsible for any caching
    of file data. The filesystem would still potentially be
    responsible for caching things like bitmaps and such, but it could
    use a struct file for the backing device and get it for free (the
    backing device is mmapable and thus would have a VM Object layer,
    so you get the bitmap caching for free).

					-Matt
Re: vm balance
On Wed, 18 Apr 2001, Julian Elischer wrote:

> Poul-Henning Kamp wrote:
> >
> > In message <[EMAIL PROTECTED]>, Matt Dillon writes:
> > >If this will get rid of or clean up the specfs garbage, then I'm all
> > >for it. I would love to see a 'clean' fileops based device interface.
> >
> > specfs, aliased vnodes, you name it...
> >
> > I think the aliased vnodes is the single strongest argument of them
> > all for doing this...
>
> Great. Then we have aliased file pointers... that's not a great
> improvement..
>
> You'd still have to have 'per instance' storage somewhere, so that the
> opened devices could have different permissions, and still have them
> point to common data. so you still need aliases, except now it's not a
> vnode being aliased but some other structure.

As I just stated in a private e-mail to Matt, I'm not opposed to the
idea of promoting devices to a first-class object (i.e., equivalent to
vnodes, rather than below vnodes) in FreeBSD; I just want to approach
this very cautiously, as there's a lot of obscure behavior in this
area, and a lot of portability concerns regarding the obscure
behavior. In particular, the "special case" of ttys is a very
important special case -- operations such as revoke() must continue to
work. With device operations currently being pushed through VFS, VFS
becomes a possible mediation point for those operations, allowing "VFS
magic" to be used on devices. If we remove VFS from the call stack, we
lose that capability. Poul-Henning has successfully argued that this
has a number of good implications, but we need to make sure that the
functionality lost there doesn't outweigh the good bits.

One way to look at this disagreement might be the following: some
people feel that devices are simply evil, and not files, and shouldn't
try to act like them. Others feel that, modulo ioctl, we really can
make devices look like files, and should do that. The observation I
tried to make in an earlier e-mail was that it might be possible to
accept the world-view that "devices aren't files" by mapping some
devices into a better abstraction, such as the socket "data stream"
concept, while still making use of current abstractions.

For example, using read() on /dev/audit sucks, since what comes out of
/dev/audit is a set of discrete records. I'd rather use recv(), which
has far superior semantics, since this is a record-oriented data
stream. The same goes for kernel log messages, which on a discrete
message-oriented stream could essentially become standard syslog
messages, rather than treating it as a text buffer with character
pointers. This would allow wrap-around to be handled much more
cleanly, by simply dropping records off one end of the record chain,
rather than severing lines and ending up with the current /dev/console
abomination (send too much to /dev/console -- i.e., single user mode,
and dmesg becomes useless).

I won't claim that moving to the slightly more abstracted viewpoint I
proposed earlier is the way to go, just that it's worth keeping in
mind. Maybe we should just throw up our hands and say "devices are
devices, screw files" -- this decision was made with NFS, and
dramatically simplifies the problem space.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
Poul-Henning Kamp wrote:
>
> In message <[EMAIL PROTECTED]>, Matt Dillon writes:
> >If this will get rid of or clean up the specfs garbage, then I'm all
> >for it. I would love to see a 'clean' fileops based device interface.
>
> specfs, aliased vnodes, you name it...
>
> I think the aliased vnodes is the single strongest argument of them
> all for doing this...

Great. Then we have aliased file pointers... that's not a great
improvement..

You'd still have to have 'per instance' storage somewhere, so that the
opened devices could have different permissions, and still have them
point to common data. so you still need aliases, except now it's not a
vnode being aliased but some other structure.

> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> [EMAIL PROTECTED]       | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
:You can mmap() devices and you can mmap files..
:
:you cannot mmap FIFOs or sockets.
:
:for this reason I think that devices are still well represented by
:vnodes. If we merged vnodes and vm objects,
:then if devices were not vnodes, how would you represent
:a vm area that maps a device?
:
:--
: __--_|\ Julian Elischer

    I think the crux of the issue here is that most devices just don't
    need the baggage of a vnode, and many don't need the baggage of a
    VM object except possibly for mmap(). A fileops interface would be
    the cleanest way to implement a wide range of devices.

    Lets compare our various function dispatch structures. It's quite
    obvious to me that we can merge cdevsw and fileops and remove all
    vnode references from most of our devices. Ok, maybe not
    /dev/tty... but most of the rest surely! We would also want to
    have an optional vnode pointer in the fileops (like we do now)
    which 'enables' the additional VOP operations on that file
    descriptor (in this case the fileops for read, write, etc... would
    point to VOP wrappers like they do now), and of course we would
    need an opaque pointer for use by the fileops (devices would most
    likely load their cdev reference into it).

                      cdevsw   fileops  vfsops   VOPs

    OPEN                X        -        -        X
    CLOSE               X        X        -        X
    READ                X        X        -        X
    WRITE               X        X        -        X
    IOCTL               X        X        -        -
    POLL                X        X        -        X
    MMAP                X        -        -        X
    STRATEGY            X        -        -        X
    DUMP                X        -        -        -
    KQFILTER            -        X        -        X
    STAT                -        X        -        -
    NAME                X        -        -        -
    MAJ                 X        -        -        -
    PSIZE               X        -        -        -
    FLAGS               X        -        -        -
    BMAJ                X        -        -        -
    ADVLOCK             -        -        -        X
    BWRITE              -        -        -        X
    FSYNC               -        -        -        X
    ISLOCKED            -        -        -        X
    LEASE               -        -        -        X
    LOCK                -        -        -        X
    PATHCONF            -        -        -        X
    READLINK            -        -        -        X
    REALLOCBLKS         -        -        -        X
    REVOKE              -        -        -        X
    UNLOCK              -        -        -        X
    BMAP                -        -        -        X
    PRINT               -        -        -        X
    BALLOC              -        -        -        X
    GETPAGES            -        -        -        X
    PUTPAGES            -        -        -        X
    FREEBLKS            -        -        -        X
    GETACL              -        -        -        X
    SETACL              -        -        -        X
    ACLCHECK            -        -        -        X
    GETEXTATTR          -        -        -        X
    SETEXTATTR          -        -        -        X
    LOOKUP              -        -        -        X
    CACHEDLOOKUP        -        -        -        X
    CREATE              -        -        -        X
    WHITEOUT            -        -        -        X
    MKNOD               -        -        -        X
    ACCESS              -        -        -        X
    GETATTR             -        -        -        X
    SETATTR             -        -        -        X
    REMOVE              -        -        -        X
    LINK                -        -        -        X
    RENAME              -        -        -        X
    MKDIR               -        -        -        X
    RMDIR               -        -        -        X
    SYMLINK             -        -        -        X
    READDIR             -        -        -        X
    INACTIVE            -        -        -        X
    RECLAIM             -        -        -        X
    MOUNT               -        -        X        -
    START               -        -        X        -
    UNMOUNT             -        -        X        -
    ROOT                -        -        X        -
    QUOTACTL            -        -        X        -
    STATFS              -        -        X        -
    SYNC                -        -        X        -
    VGET                -        -        X        -
    FHTOVP              -        -        X        -
    CHECKEXP            -        -        X        -
    VPTOFH              -        -        X        -
    INIT                -        -        X        -
    UNINIT              -        -        X        -
    EXTATTRCTL          -        -        X        -

					-Matt
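[A speculative sketch of what the merged dispatch vector suggested by
the table above might look like. The struct name, member names, and
signatures are all invented; it simply folds the cdevsw and fileops
columns into one struct, with the static cdevsw attributes (NAME, MAJ,
PSIZE, FLAGS, BMAJ) carried as plain fields.]

	/* Hypothetical merged device/file dispatch vector. */
	struct devfileops {
		/* methods (cdevsw + fileops rows) */
		int	(*fo_open)(struct file *fp, int flags, struct proc *p);
		int	(*fo_close)(struct file *fp, struct proc *p);
		int	(*fo_read)(struct file *fp, struct uio *uio,
			    int flags, struct proc *p);
		int	(*fo_write)(struct file *fp, struct uio *uio,
			    int flags, struct proc *p);
		int	(*fo_ioctl)(struct file *fp, u_long com,
			    caddr_t data, struct proc *p);
		int	(*fo_poll)(struct file *fp, int events,
			    struct proc *p);
		int	(*fo_mmap)(struct file *fp, vm_object_t *objp);
		void	(*fo_strategy)(struct buf *bp);	/* block devices */
		int	(*fo_dump)(dev_t dev);		/* crash dumps */
		int	(*fo_kqfilter)(struct file *fp, struct knote *kn);
		int	(*fo_stat)(struct file *fp, struct stat *sb,
			    struct proc *p);
		/* static attributes (NAME/MAJ/PSIZE/FLAGS/BMAJ rows) */
		const char *fo_name;
		int	fo_maj, fo_bmaj, fo_flags;
		int	(*fo_psize)(dev_t dev);
	};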
Re: vm balance
Robert Watson wrote:
>
> On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
>
> As I indicated in my follow-up mail, the statement about seeking was
> incorrect, that is a property of the open file structure; I believe
> the remainder still holds true. When was the last time you tried
> mmap'ing or seeking on the socket? A socket represents a buffered
> data stream which does not allow arbitrary read/write operations at
> arbitrary offsets.

Actually there have been times when I did want to mmap a datastream..
I think a datastream mapped into a user buffer-space is one of the
possible 0-copy methods people sometimes mention.

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
You can mmap() devices and you can mmap files..

you cannot mmap FIFOs or sockets.

for this reason I think that devices are still well represented by
vnodes. If we merged vnodes and vm objects,
then if devices were not vnodes, how would you represent
a vm area that maps a device?

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:

>If this will get rid of or clean up the specfs garbage, then I'm all
>for it. I would love to see a 'clean' fileops based device interface.

specfs, aliased vnodes, you name it...

I think the aliased vnodes is the single strongest argument of them
all for doing this...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
    If this will get rid of or clean up the specfs garbage, then I'm
    all for it. I would love to see a 'clean' fileops based device
    interface.

					-Matt

:I have not examined the full details of doing the shift yet, but it is
:my impression that it actually will reduce the amount of code
:duplication and special casing.
:
:Basically we will need a new:
:
:	struct fileops devfileops = {
:		dev_read,
:		dev_write,
:		dev_ioctl,
:		dev_poll,
:		dev_kqfilter,
:		dev_stat,
:		dev_close
:	};
:
:The only places we will need new magic is
:	open, which needs to fix the plumbing for us.
:	mmap, which may have to be added to the fileops vector.
:
:The amount of special-casing code this would remove from the vnode
:layer is rather astonishing.
:
:If we merge vm-objects and vnodes without taking devices out of the
:mix, we will need even more special-case code for devices.
:
:>The vnode is our abstraction for objects that have
:>address spaces, can be opened/closed and retain a seeking position, can be
:>mapped, have protections, etc, etc.
:
:This is simply not correct Robert, UNIX::sockets also have many of
:those properties, but they're not vnodes...
:
:>Besides which,
:>the kernel knows how to act on vnodes, and there is plenty of precedent
:>for the kernel opening vnodes and keeping around references for its own
:>ends, but there isn't all that much precedent for the kernel doing this
:>using file descriptors :-).
:
:Have you actually examined how FIFO and Sockets work Robert ? :-)
:
:--
:Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
:[EMAIL PROTECTED]       | TCP/IP since RFC 956
:FreeBSD committer       | BSD since 4.3-tahoe
:Never attribute to malice what can adequately be explained by incompetence.
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> I have not examined the full details of doing the shift yet, but it is
> my impression that it actually will reduce the amount of code
> duplication and special casing.
..
> The only places we will need new magic is
> 	open, which needs to fix the plumbing for us.
> 	mmap, which may have to be added to the fileops vector.
>
> The amount of special-casing code this would remove from the vnode
> layer is rather astonishing.
>
> If we merge vm-objects and vnodes without taking devices out of the
> mix, we will need even more special-case code for devices.

Let me expand a bit on what I want to object to, and then comment a
bit on what I have mixed feelings about but am not actively objecting
to.

I believe it is necessary to retain a reference to the vnode used to
access the device in f_data, and an f_type of DTYPE_VNODE. This is
used with tty's extensively, where it is desirable to open /dev/ttyfoo
and then perform file system operations on it, such as fchflags(),
fchmod(), fchown(), revoke(), et al, and this relies on reaching the
vnode via the open file entry associated with the file descriptor
designated by the invoking process. This behavior is needed for a
variety of race-free operations at login, et al. Changing this would
require *extensive* modification to the syscall service layer (that
is, what sits above VFS). Assuming the modifications were made so that
the fileops array provided these services (making the struct file be
the entire abstraction, hiding VFS from the system call service
layer), you've now completely rewritten the large majority of system
calls, as well as introduced a whole new category of inter-abstraction
synchronization that must occur when a change is made to any
abstraction (i.e., adding ACLs, MAC, ...). So it seems to me that
access to the vnode must be maintained in struct file, and that we
cannot totally replace references to the vnode with references to,
for example, the device abstraction.

So with these assumptions in place, it's still possible to consider
what you were suggesting: replacing the vnode fileops array with a
device fileops array, so that these calls would be shortcut directly
to the device abstraction rather than passing through the VFS
abstractions on the way. In some ways, this makes sense: many of the
device services map poorly into the file-like abstraction of the
vnode. For example, devices may have a notion of a stateful seeking
position: tape drives, for example, really *do* seek to a particular
location where the next read or write must be performed. Similarly,
some devices really do act like streaming data sources or sinks:
especially with regard to pseudo-devices, they may behave much more
like sockets, with a notion of a discrete transmission unit, a maximum
transmission unit, or addressability (imagine if you could open a
device representing a bus, and use socket addressing calls to set the
bus address being targeted -- say, for a /dev/usb0, you could say
"address the following messages to USB address 4", or being able to
open /dev/ed0, set the target address of the device instance to an
ethernet address, and send).

We already have this problem to some extent with sockets: we use the
file system vnode for two purposes: first, as a namespace in which to
identify the IPC object, and second, as a means for storing protection
properties. It's arguable that devices might work that way also, which
I think is what you're asserting.
I'm not strictly opposed to this viewpoint, but it begins to make me
wonder a bit about the current structuring of that whole section of
the kernel: to me, a vnode really does seem like a decent abstraction
of the file system concept. The socket seems like a less decent
abstraction of the IPC concept, but a better abstraction of a
send/receive stream. This is all complicated by long-standing
interfaces and notions about how the abstractions are to be used. I
guess I'd rather see it look something like this:

              +-------------------------+
              |     file descriptor     |
              +------------+------------+
                           |
              +------------+------------+
              | kernel object reference |
              +------------+------------+
                           |
               +-----------+-----------+
               |           |           |
             vfile      kqueue      vstream
                                       |
                 +----------+----------+--------------+
                 |          |          |              |
             IPC Socket    FIFO      Pipe       Stream Device

(note the above, and below, are highly fictional)

Where "kernel object reference" is the equivalent of today's "struct
file", "vfile" is t
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> >The vnode is our abstraction for objects that have
> >address spaces, can be opened/closed and retain a seeking position, can be
> >mapped, have protections, etc, etc.
>
> This is simply not correct Robert, UNIX::sockets also have many of those
> properties, but they're not vnodes...

As I indicated in my follow-up mail, the statement about seeking was
incorrect; that is a property of the open file structure. I believe
the remainder still holds true. When was the last time you tried
mmap'ing or seeking on the socket? A socket represents a buffered data
stream which does not allow arbitrary read/write operations at
arbitrary offsets.

I guess what I'd really like to see is this: for devices that provide
an address space service (such as disks), vnodes would be used. For
devices that represent streams (such as many pseudo-devices and ttys),
they would be represented by a slightly improved socket abstraction.
The socket is a somewhat poor abstraction for this right now; perhaps
a vstream would be a better concept.

> >Besides which,
> >the kernel knows how to act on vnodes, and there is plenty of precedent
> >for the kernel opening vnodes and keeping around references for its own
> >ends, but there isn't all that much precedent for the kernel doing this
> >using file descriptors :-).
>
> Have you actually examined how FIFO and Sockets work Robert ? :-)

What I'm referring to is the fact that the kernel frequently keeps
open vnodes for use internally in various sorts of operations, such as
quotas, accounting, core dumps, etc.

BTW, part of the problem here may be a terminology problem: for me, a
file descriptor refers to the per-process reference in the file
descriptor table. What you appear to refer to here is the open file
entry, struct file, which stores the operation array, seeking
location, cached credential, etc.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
In message <[EMAIL PROTECTED]>, Robert Watson writes:
>
>On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
>
>> In message <[EMAIL PROTECTED]>, Kirk McKusick writes:
>>
>> >Every vnode in the system has an associated object.
>>
>> No: device vnodes dont...
>>
>> I think the correct solution to that is to move devices away from vnodes
>> and into the fdesc layer, just like fifo's and sockets.
>
>I dislike that idea for a number of reasons, not least of which is that
>introducing more and more file-descriptor level objects increases the
>complexity of the system call service implementation, and duplicates
>code. If we're going to pretend that everything in the system is a
>file, and most people seem willing to accept that, acting on devices
>through vnodes seems like a reasonable choice.

I have not examined the full details of doing the shift yet, but it is
my impression that it actually will reduce the amount of code
duplication and special casing.

Basically we will need a new:

	struct fileops devfileops = {
		dev_read,
		dev_write,
		dev_ioctl,
		dev_poll,
		dev_kqfilter,
		dev_stat,
		dev_close
	};

The only places we will need new magic is
	open, which needs to fix the plumbing for us.
	mmap, which may have to be added to the fileops vector.

The amount of special-casing code this would remove from the vnode
layer is rather astonishing.

If we merge vm-objects and vnodes without taking devices out of the
mix, we will need even more special-case code for devices.

>The vnode is our abstraction for objects that have
>address spaces, can be opened/closed and retain a seeking position, can be
>mapped, have protections, etc, etc.

This is simply not correct Robert, UNIX::sockets also have many of
those properties, but they're not vnodes...

>Besides which,
>the kernel knows how to act on vnodes, and there is plenty of precedent
>for the kernel opening vnodes and keeping around references for its own
>ends, but there isn't all that much precedent for the kernel doing this
>using file descriptors :-).

Have you actually examined how FIFO and Sockets work Robert ? :-)

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
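[For illustration, one entry in the devfileops table above might look
like the following. This is a guess at the wiring, not code from the
tree: it borrows the 4.x fileops read signature and the 4.x
devsw()/d_read driver entry point, and assumes the open path stashed
the dev_t in f_data.]

	/* Hypothetical dev_read: bypass specfs/vnodes and forward the
	 * request straight to the driver through the cdevsw. */
	static int
	dev_read(struct file *fp, struct uio *uio, struct ucred *cred,
	    int flags, struct proc *p)
	{
		dev_t dev = (dev_t)fp->f_data;	/* stashed at open time */
		int ioflag = 0;

		if (fp->f_flag & FNONBLOCK)
			ioflag |= IO_NDELAY;
		return (devsw(dev)->d_read(dev, uio, ioflag));
	}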
Re: vm balance
On Wed, 18 Apr 2001, Robert Watson wrote:

> On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:
>
> address spaces, can be opened/closed and retain a seeking position, can be

This is what I get for sending messages in the morning after staying
up late -- needless to say, you can ignore the "retain a seeking
position" statement: vnodes generally don't operate with a notion of
"position"; that occurs at the struct file level. It's arguable, if
you had stateful vnodes, that you might want to push the seek
operation down from the open file layer, as devices might want to
implement the seeking service themselves.

In any case, this is not a problem that moving the device operations
into the struct file array will fix -- in fact, it's arguable that for
devices wanting to offer services to different consumers on the same
instance (such as /dev/vmmon), you want the vnode reference counting
notion of open/close plus the sprinkled-state vnode design we've
discussed before, which would allow VFS and the struct file layer to
do the state management binding state to consumers, rather than
teaching the device layer how to do that.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
Re: vm balance
On Wed, 18 Apr 2001, Poul-Henning Kamp wrote:

> In message <[EMAIL PROTECTED]>, Kirk McKusick writes:
>
> >Every vnode in the system has an associated object.
>
> No: device vnodes dont...
>
> I think the correct solution to that is to move devices away from vnodes
> and into the fdesc layer, just like fifo's and sockets.

I dislike that idea for a number of reasons, not least of which is
that introducing more and more file-descriptor level objects increases
the complexity of the system call service implementation, and
duplicates code. If we're going to pretend that everything in the
system is a file, and most people seem willing to accept that, acting
on devices through vnodes seems like a reasonable choice.

The vnode provides us with a notion of open/close, reference counting,
access to a generic vnode pager for memory mapping of objects without
specific memory mapping characteristics, and so on. Right now, the
mapping from vnodes into devices is a bit poor due to some odd
reference / open / close behavior, and due to a lack of a notion of
stateful access to vnodes (there have been a number of proposals to
remedy this, however).

The vnode is our abstraction for objects that have address spaces, can
be opened/closed and retain a seeking position, can be mapped, have
protections, etc, etc. It may not be a perfect representation of a
device, but it does a reasonable job. Besides which, the kernel knows
how to act on vnodes, and there is plenty of precedent for the kernel
opening vnodes and keeping around references for its own ends, but
there isn't all that much precedent for the kernel doing this using
file descriptors :-).

Robert N M Watson             FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED]             NAI Labs, Safeport Network Services
RE: vm balance
Dear Matt,

> :
> :Well, if that's the case, yank all uses of v_id from the nfs code,
> :I'll do the namecache and vnodes can be deleted to the joy of our
> :users...
> :
>
>     If you can yank v_id out from the kern/vfs_cache code, I will
>     make similar fixes to the NFS code. I am not particularly
>     interested in returning vnodes to the MALLOC pool myself, but I
>     am interested in fixing the two bugs I noticed when I ran over
>     the code earlier today.
>
>     Actually one bug. The vput() turns out to be correct, I just
>     looked at the code again. However, the cache_lookup() call in
>     nfs_vnops.c is broken. Assuming no other fixes, the vpid load
>     needs to occur before the VOP_ACCESS call rather than after.

I'm just curious: would this be the "redundant call/non-optimal
performance"-type bug or the "panics or trashes the system in dark and
mysterious ways"-type bug? If it is the latter, do you think it may be
an opportunity for you to close some NFS-related PR's?

Kees Jan

You are only young once, but you can stay immature all your life.
Re: vm balance
In message <[EMAIL PROTECTED]>, Kirk McKusick writes: >Every vnode in the system has an associated object. No: device vnodes don't... I think the correct solution to that is to move devices away from vnodes and into the fdesc layer, just like fifos and sockets. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
Date: Tue, 17 Apr 2001 09:49:54 -0400 (EDT) From: Robert Watson <[EMAIL PROTECTED]> To: Kirk McKusick <[EMAIL PROTECTED]> cc: Julian Elischer <[EMAIL PROTECTED]>, Rik van Riel <[EMAIL PROTECTED]>, [EMAIL PROTECTED], Matt Dillon <[EMAIL PROTECTED]>, David Xu <[EMAIL PROTECTED]> Subject: Re: vm balance On Mon, 16 Apr 2001, Kirk McKusick wrote: > I am still of the opinion that merging VM objects and vnodes would be a > good idea. Although it would touch a huge number of lines of code, when > the dust settled, it would simplify some nasty bits of the system. This > merger is really independent of making the number of vnodes dynamic. > Under the old name cache implementation, decreasing the number of vnodes > was slow and hard. With the current name cache implementation, > decreasing the number of vnodes would be easy. I concur that adding a > dynamically sized vnode cache would help performance on some workloads. I'm interested in this idea, although I profess a gaping blind spot in expertise in the area of the VM system. However, one of the aspects of our VFS that has always concerned me is that the use of a single vnode simplelock funnels most of the relevant (and performance-sensitive) calls. The result is that all accesses to an object represented by a vnode are serialized, which can represent a substantial performance hit for applications such as databases, where simultaneous writes would be advantageous, or for various vn-backed oddities (possibly including vnode-backed swap?). At some point, apparently an effort was made to mark up vnode_if.src with possible alternative locking using read/write locks, but given that all the consumers use exclusive locks right now, I assume that was not followed through on. A large part of the cost is mitigated through caching on the under-side of VFS, allowing vnode operations to return rapidly, but while this catches a number of common cases (where the file is already in the cache), there are sufficient non-common cases that I would anticipate this being a problem. Are there any performance figures available that either confirm this concern, or demonstrate that in fact it is not relevant? :-) Would this concern introduce additional funneling in the VM system, or is the granularity of locks in the VM sufficiently low that it might improve performance by combining existing broad locks? Robert N M Watson FreeBSD Core Team, TrustedBSD Project [EMAIL PROTECTED] NAI Labs, Safeport Network Services Every vnode in the system has an associated object. Every object backed by a file (i.e., everything but anonymous objects) has an associated vnode. So, the performance of one is pretty tied to the performance of the other. Matt is right that the VM does locking on a page level, but then has to get a lock on the associated vnode to do a read or a write, so really is pretty tied to the vnode lock performance. Merging the two data structures is not likely to change the performance characteristics of the system for either better or worse. But it will save a lot of headaches having to do with lock ordering that we have to deal with at the moment. Kirk McKusick To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>reference to me. I'm not even sure why they bother to check v_id. :>The vp reference from an nfsnode is a hard reference. :> : :Well, if that's the case, yank all uses of v_id from the nfs code, :I'll do the namecache and vnodes can be deleted to the joy of our users... : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :[EMAIL PROTECTED] | TCP/IP since RFC 956 If you can yank v_id out from the kern/vfs_cache code, I will make similar fixes to the NFS code. I am not particularly interested in returning vnodes to the MALLOC pool myself, but I am interested in fixing the two bugs I noticed when I ran over the code earlier today. Actually one bug. The vput() turns out to be correct, I just looked at the code again. However, the cache_lookup() call in nfs_vnops.c is broken. Assuming no other fixes, the vpid load needs to occur before the VOP_ACCESS call rather than after. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >: >:In message <[EMAIL PROTECTED]>, Matt Dillon writes: >:> >:>:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >:>: >:>:>I thought vnodes were in stable storage? >:>: >:>:They are, that's the point Matt is not seeing yet. >:> >:>I know vnodes are in stable storage. I'm just saying that NFS >:>is the least of your worries in trying to change that. >: >:The namecache can do without the use of soft references. >: >:The only reason vnodes are stable storage any more is that NFS >:uses soft references to vnodes. > >The only place I see soft references on vnodes is in the NFS >lookup code which duplicates the VFS lookup code (except it gets it wrong). >If you are referring to the nqlease code... that looks like a hard >reference to me. I'm not even sure why they bother to check v_id. >The vp reference from an nfsnode is a hard reference. > Well, if that's the case, yank all uses of v_id from the nfs code, I'll do the namecache and vnodes can be deleted to the joy of our users... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
: :In message <[EMAIL PROTECTED]>, Matt Dillon writes: :> :>:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: :>: :>:>I thought vnodes were in stable storage? :>: :>:They are, that's the point Matt is not seeing yet. :> :>I know vnodes are in stable storage. I'm just saying that NFS :>is the least of your worries in trying to change that. : :The namecache can do without the use of soft references. : :The only reason vnodes are stable storage any more is that NFS :uses soft references to vnodes. The only place I see soft references on vnodes is in the NFS lookup code which duplicates the VFS lookup code (except it gets it wrong). If you are referring to the nqlease code... that looks like a hard reference to me. I'm not even sure why they bother to check v_id. The vp reference from an nfsnode is a hard reference. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >: >:>I thought vnodes were in stable storage? >: >:They are, that's the point Matt is not seeing yet. > >I know vnodes are in stable storage. I'm just saying that NFS >is the least of your worries in trying to change that. The namecache can do without the use of soft references. The only reason vnodes are stable storage any more is that NFS uses soft references to vnodes. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>Note that I really don't care for using stable storage as a hack :>to deal with this sort of thing. : :Well, I have to admit that it is a pretty smart way of dealing with :it for remote operations, but the trouble is that it prevents us from :ever lowering their number again. : :If Matt can devise a smart way to lose the soft reference in nfs, :vnodes can be a truly dynamic thing. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 NFS uses vnodes the same way that VFS uses vnodes. If you solve the problem for general VFS operation (namely *cache_lookup), you solve the problem for NFS as well. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: : :>I thought vnodes were in stable storage? : :They are, that's the point Matt is not seeing yet. I know vnodes are in stable storage. I'm just saying that NFS is the least of your worries in trying to change that. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >I thought vnodes were in stable storage? They are, that's the point Matt is not seeing yet. >Note that I really don't care for using stable storage as a hack >to deal with this sort of thing. Well, I have to admit that it is a pretty smart way of dealing with it for remote operations, but the trouble is that it prevents us from ever lowering their number again. If Matt can devise a smart way to lose the soft reference in nfs, vnodes can be a truly dynamic thing. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
* Poul-Henning Kamp <[EMAIL PROTECTED]> [010417 10:56] wrote: > In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >:>I don't think NFS relies on vnodes never being freed. > >: > >:It does, in some cases NFS stashes a vnode pointer and the v_id > >:value away, and some time later uses that pair to try to > >:refind the vnode again. If you free vnodes, it will still think > >:the pointer is a vnode and if junk happens to be right it will > >:think it is still a vnode. QED: Bad things (TM) will happen. > >: > >:# cd /sys/nfs > >:# grep v_id * > >:nfs_nqlease.c: vpid = vp->v_id; > >:nfs_nqlease.c: if (vpid == vp->v_id) { > >:nfs_nqlease.c: if (vpid == vp->v_id && > >:nfs_vnops.c:vpid = newvp->v_id; > >:nfs_vnops.c:if (vpid == newvp->v_id) { > >: > >:-- > >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > > > > hahahahahahahaha.. Look at the code more closely. v_id is not > > managed by NFS, it's managed by vfs_cache.c. There's a big XXX > > comment just before cache_purge() that explains it. Believe me, > > NFS is the least of your worries here. > > Matt, you try to free vnodes back to the malloc pool and you will > see what happens OK ? I thought vnodes were in stable storage? Note that I really don't care for using stable storage as a hack to deal with this sort of thing. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Represent yourself, show up at BABUG http://www.babug.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
: :In message <[EMAIL PROTECTED]>, Matt Dillon writes: :>:>I don't think NFS relies on vnodes never being freed. :>: :>:It does, in some cases NFS stashes a vnode pointer and the v_id :>:value away, and some time later uses that pair to try to :>:refind the vnode again. If you free vnodes, it will still think :>:the pointer is a vnode and if junk happens to be right it will :>:think it is still a vnode. QED: Bad things (TM) will happen. :>: :>:# cd /sys/nfs :>:# grep v_id * :>:nfs_nqlease.c: vpid = vp->v_id; :>:nfs_nqlease.c: if (vpid == vp->v_id) { :>:nfs_nqlease.c: if (vpid == vp->v_id && :>:nfs_vnops.c:vpid = newvp->v_id; :>:nfs_vnops.c:if (vpid == newvp->v_id) { :>: :>:-- :>:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :> :> hahahahahahahaha.. Look at the code more closely. v_id is not :> managed by NFS, it's managed by vfs_cache.c. There's a big XXX :> comment just before cache_purge() that explains it. Believe me, :> NFS is the least of your worries here. : :Matt, you try to free vnodes back to the malloc pool and you will :see what happens OK ? : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 ok ok... let's see. Oh, ok I see what it's doing. Actually I think you just found a bug.

    if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
            struct vattr vattr;
            int vpid;

            if ((error = VOP_ACCESS(dvp, VEXEC, cnp->cn_cred, p)) != 0) {
                    *vpp = NULLVP;
                    return (error);
            }
            newvp = *vpp;
            vpid = newvp->v_id;

This is totally bogus. VOP_ACCESS can block, so even using vpid above to check that the vnode hasn't been ripped out from under the code won't work. Also, take a look at the vput() later on, and also the vput() in kern/vfs_cache.c/vfs_cache_lookup() - that looks bogus to me too and would probably crash the machine. The easiest solution here is to make cache_lookup bump the ref count on the returned vnode and require that all users of cache_lookup vrele() it. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
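The reordering Matt settles on elsewhere in the thread ("the vpid load needs to occur before the VOP_ACCESS call rather than after") would look roughly like the sketch below. This is illustrative only, not the committed code, and the trailing miss label is invented here:

    if ((error = cache_lookup(dvp, vpp, cnp)) && error != ENOENT) {
            int vpid;

            newvp = *vpp;
            vpid = newvp->v_id;     /* capture before anything can block */
            if ((error = VOP_ACCESS(dvp, VEXEC, cnp->cn_cred, p)) != 0) {
                    *vpp = NULLVP;
                    return (error);
            }
            /*
             * VOP_ACCESS may have slept; an unchanged v_id is the only
             * evidence that newvp was not recycled out from under us.
             */
            if (vpid != newvp->v_id)
                    goto miss;      /* fall through to a real lookup */
    }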
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: >:>I don't think NFS relies on vnodes never being freed. >: >:It does, in some cases NFS stashes a vnode pointer and the v_id >:value away, and some time later uses that pair to try to >:refind the vnode again. If you free vnodes, it will still think >:the pointer is a vnode and if junk happens to be right it will >:think it is still a vnode. QED: Bad things (TM) will happen. >: >:# cd /sys/nfs >:# grep v_id * >:nfs_nqlease.c: vpid = vp->v_id; >:nfs_nqlease.c: if (vpid == vp->v_id) { >:nfs_nqlease.c: if (vpid == vp->v_id && >:nfs_vnops.c:vpid = newvp->v_id; >:nfs_vnops.c:if (vpid == newvp->v_id) { >: >:-- >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > > hahahahahahahaha.. Look at the code more closely. v_id is not > managed by NFS, it's managed by vfs_cache.c. There's a big XXX > comment just before cache_purge() that explains it. Believe me, > NFS is the least of your worries here. Matt, you try to free vnodes back to the malloc pool and you will see what happens OK ? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>I don't think NFS relies on vnodes never being freed. : :It does, in some cases NFS stashes a vnode pointer and the v_id :value away, and some time later uses that pair to try to :refind the vnode again. If you free vnodes, it will still think :the pointer is a vnode and if junk happens to be right it will :think it is still a vnode. QED: Bad things (TM) will happen. : :# cd /sys/nfs :# grep v_id * :nfs_nqlease.c: vpid = vp->v_id; :nfs_nqlease.c: if (vpid == vp->v_id) { :nfs_nqlease.c: if (vpid == vp->v_id && :nfs_vnops.c:vpid = newvp->v_id; :nfs_vnops.c:if (vpid == newvp->v_id) { : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 hahahahahahahaha.. Look at the code more closely. v_id is not managed by NFS, it's managed by vfs_cache.c. There's a big XXX comment just before cache_purge() that explains it. Believe me, NFS is the least of your worries here. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >:When I first heard you say this I thought you were off your rocker, >:but gradually I have come to think that you may be right. >: >:I think the task will be easier if we get the vnode/buf relationship >:untangled a bit first. >: >:It may also pay off to take vnodes out of disk operations entirely before >:we try the merge. > >Yes, I agree. The vnode/VM-object issue is minor compared to >the vnode/buf/io issue. We're getting there, we're getting there... >:Actually the main problem is that NFS relies on vnodes never being >:freed to hold "soft references" using "struct vnode * + v_id". >: >:-- >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > >I don't think NFS relies on vnodes never being freed. It does, in some cases NFS stashes a vnode pointer and the v_id value away, and some time later uses that pair to try to refind the vnode again. If you free vnodes, it will still think the pointer is a vnode and if junk happens to be right it will think it is still a vnode. QED: Bad things (TM) will happen. # cd /sys/nfs # grep v_id * nfs_nqlease.c: vpid = vp->v_id; nfs_nqlease.c: if (vpid == vp->v_id) { nfs_nqlease.c: if (vpid == vp->v_id && nfs_vnops.c:vpid = newvp->v_id; nfs_vnops.c:if (vpid == newvp->v_id) { -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:When I first heard you say this I thought you were off your rocker, :but gradually I have come to think that you may be right. : :I think the task will be easier if we get the vnode/buf relationship :untangled a bit first. : :It may also pay off to take vnodes out of disk operations entirely before :we try the merge. Yes, I agree. The vnode/VM-object issue is minor compared to the vnode/buf/io issue. :>Under the old name cache implementation, decreasing :>the number of vnodes was slow and hard. With the current name cache :>implementation, decreasing the number of vnodes would be easy. : :Actually the main problem is that NFS relies on vnodes never being :freed to hold "soft references" using "struct vnode * + v_id". : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 I don't think NFS relies on vnodes never being freed. The worst that should happen is that NFS might need to do a LOOKUP. I haven't had a chance to look at the namei/vnode patch set yet but as long as a reasonable number of vnodes remain cached NFS shouldn't be affected too much. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:I'm interested in this idea, although I profess a gaping blind spot in :expertise in the area of the VM system. However, one of the aspects of :our VFS that has always concerned me is that the use of a single vnode :simplelock funnels most of the relevant (and performance-sensitive) calls. :The result is that all accesses to an object represented by a vnode are :serialized, which can represent a substantial performance hit for :applications such as databases, where simultaneous writes would be :advantageous, or for various vn-backed oddities (possibly including :vnode-backed swap?). : :At some point, apparently an effort was made to mark up vnode_if.src with :possible alternative locking using read/write locks, but given that all :... We only use simplelocks on vnodes for interlock operations. We use normal kern/kern_lock.c locks for vnode locking and use both shared and exclusive locks. You are absolutely correct about the serialization that can occur. A stalled write() will stall all other write()'s plus any read()'s. Stalled write()s are easy to come by. I did some work in this area to try to mitigate the problem. In 4.1/4.2 I added the bwillwrite() function. This function is called prior to obtaining the exclusive vnode lock and blocks the process if there aren't a sufficient number of filesystem buffers available to (likely) accommodate the operation. This (mostly) prevents the process from blocking in the buffer cache while holding an exclusive vnode lock and makes a big difference. :is already in the cache), there are sufficient non-common cases that I :would anticipate this being a problem. Are there any performance figures :available that either confirm this concern, or demonstrate that in fact it :is not relevant? :-) Would this concern introduce additional funneling in :the VM system, or is the granularity of locks in the VM sufficiently low :that it might improve performance by combining existing broad locks? : :Robert N M Watson FreeBSD Core Team, TrustedBSD Project :[EMAIL PROTECTED] NAI Labs, Safeport Network Services The VM system is in pretty good shape in regard to fine-grained locking (you get down to the VM page). The VFS system is in terrible shape - there is no fine grained locking at all for writes. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
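As a concrete picture of the ordering Matt describes, a 4.x-style write path looks roughly like the following; this is a sketch from memory of the vn_write() convention, not a quote from the tree, and the function name is invented here:

    static int
    vn_write_sketch(struct vnode *vp, struct uio *uio, struct ucred *cred,
        struct proc *p)
    {
            int error;

            /*
             * Throttle here, while no locks are held: if buffer space is
             * scarce, sleep now rather than inside VOP_WRITE(), where the
             * exclusive lock would stall every other reader and writer of
             * this vnode.
             */
            bwillwrite();

            vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
            error = VOP_WRITE(vp, uio, IO_UNIT, cred);
            VOP_UNLOCK(vp, 0, p);
            return (error);
    }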
Re: vm balance
On Mon, 16 Apr 2001, Kirk McKusick wrote: > I am still of the opinion that merging VM objects and vnodes would be a > good idea. Although it would touch a huge number of lines of code, when > the dust settled, it would simplify some nasty bits of the system. This > merger is really independent of making the number of vnodes dynamic. > Under the old name cache implementation, decreasing the number of vnodes > was slow and hard. With the current name cache implementation, > decreasing the number of vnodes would be easy. I concur that adding a > dynamically sized vnode cache would help performance on some workloads. I'm interested in this idea, although I profess a gaping blind spot in expertise in the area of the VM system. However, one of the aspects of our VFS that has always concerned me is that the use of a single vnode simplelock funnels most of the relevant (and performance-sensitive) calls. The result is that all accesses to an object represented by a vnode are serialized, which can represent a substantial performance hit for applications such as databases, where simultaneous writes would be advantageous, or for various vn-backed oddities (possibly including vnode-backed swap?). At some point, apparently an effort was made to mark up vnode_if.src with possible alternative locking using read/write locks, but given that all the consumers use exclusive locks right now, I assume that was not followed through on. A large part of the cost is mitigated through caching on the under-side of VFS, allowing vnode operations to return rapidly, but while this catches a number of common cases (where the file is already in the cache), there are sufficient non-common cases that I would anticipate this being a problem. Are there any performance figures available that either confirm this concern, or demonstrate that in fact it is not relevant? :-) Would this concern introduce additional funneling in the VM system, or is the granularity of locks in the VM sufficiently low that it might improve performance by combining existing broad locks? Robert N M Watson FreeBSD Core Team, TrustedBSD Project [EMAIL PROTECTED] NAI Labs, Safeport Network Services To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Kirk McKusick writes: >I am still of the opinion that merging VM objects and vnodes would >be a good idea. Although it would touch a huge number of lines of >code, when the dust settled, it would simplify some nasty bits of >the system. When I first heard you say this I thought you were off your rocker, but gradually I have come to think that you may be right. I think the task will be easier if we get the vnode/buf relationship untangled a bit first. It may also pay off to take vnodes out of disk operations entirely before we try the merge. >Under the old name cache implementation, decreasing >the number of vnodes was slow and hard. With the current name cache >implementation, decreasing the number of vnodes would be easy. Actually the main problem is that NFS relies on vnodes never being freed to hold "soft references" using "struct vnode * + v_id". -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Mon, 16 Apr 2001 04:02:34 -0700, Alfred Perlstein <[EMAIL PROTECTED]> said: Alfred> I'm also wondering why you can't track the number of Alfred> nodes that ought to be cleaned, well, you do, but it doesn't Alfred> look like it's used: Alfred> + numcachehv--; Alfred> + numcachehv++; then later: Alfred> + if (vnodeallocs % vnoderecycleperiod == 0 && Alfred> + freevnodes < vnoderecycleminfreevn && Alfred> + vnoderecyclemintotalvn < numvnodes) { Alfred> shouldn't this be related to numcachehv somehow? One reason is that the number of directory vnodes attempted to reclaim should be greater than vnoderecycleperiod, the period of reclaim in getnewvnode() calls. Otherwise, all of the vnodes reclaimed in the last attempt might be eaten up by the next attempt. This fact calls for a constraint of vnoderecyclenumber >= vnoderecycleperiod, but it is not checked yet. The other one is that not all of the directory vnodes in namecache can be reclaimed because some of them may be held as the working directory of a process. Since a directory vnode in namecache can become, or cease to be, a working directory without entering or purging namecache, it is rather hard to track the number of the reclaimable directory vnodes in namecache by simply watching cache_enter() and cache_purge(). While the number of reclaimable directory vnodes can be counted by traversing all of the namecache entries, we again have to traverse the namecache entries in order to actually reclaim vnodes, so this method is not an option to predetermine the number of directory vnodes attempted to reclaim. -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
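The unchecked constraint Seigo mentions could be enforced once the parameters become sysctl-tunable; a hypothetical handler, where the function name and the EINVAL policy are invented here for illustration:

    static int
    sysctl_vnoderecyclenumber(SYSCTL_HANDLER_ARGS)
    {
            int error, new;

            new = vnoderecyclenumber;
            error = sysctl_handle_int(oidp, &new, 0, req);
            if (error != 0 || req->newptr == NULL)
                    return (error);
            /*
             * Reject settings that would let getnewvnode() eat the
             * reclaimed vnodes faster than a reclaim pass frees them.
             */
            if (new < vnoderecycleperiod)
                    return (EINVAL);
            vnoderecyclenumber = new;
            return (0);
    }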
Re: vm balance
Date: Tue, 10 Apr 2001 22:14:28 -0700 From: Julian Elischer <[EMAIL PROTECTED]> To: Rik van Riel <[EMAIL PROTECTED]> CC: Matt Dillon <[EMAIL PROTECTED]>, David Xu <[EMAIL PROTECTED]>, [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: vm balance Rik van Riel wrote: > > I'm curious about the other things though ... FreeBSD still seems > to have the early 90's abstraction layer from Mach and the vnode > cache doesn't seem to grow and shrink dynamically (which can be a > big win for systems with lots of metadata activity). > > So while it's true that FreeBSD's VM balancing seems to be the > best one out there, I'm not quite sure about the rest of the VM... > Many years ago Kirk was talking about merging the vm objects and the vnodes.. (they tend to come in pairs anyhow) I still think it might be an idea worth investigating further. kirk? -- __--_|\ Julian Elischer / \ [EMAIL PROTECTED] ( OZ) World tour 2000-2001 ---> X_.---._/ v I am still of the opinion that merging VM objects and vnodes would be a good idea. Although it would touch a huge number of lines of code, when the dust settled, it would simplify some nasty bits of the system. This merger is really independent of making the number of vnodes dynamic. Under the old name cache implementation, decreasing the number of vnodes was slow and hard. With the current name cache implementation, decreasing the number of vnodes would be easy. I concur that adding a dynamically sized vnode cache would help performance on some workloads. Kirk McKusick To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > >has been updated and is now ready to commit. Ok, I ran a "cvs update ; make buildworld" here with and without your patch. without: 2049.846u 1077.358s 41:29.65 125.6% 594+714k 121161+5608io 7725pf+331w with: 2053.464u 1075.493s 41:29.50 125.6% 595+715k 123125+5682io 8897pf+446w Difference: +.17% -.18% ~0% 0% +.17% +.14% +1.6% +1.3% +15% +35% I think that means we're inside epsilon for the normal case, so I'll commit your patch later tonight. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >Poul-Henning> I'm a bit worried about the amount of work done in the >Poul-Henning> cache_purgeleafdirs(), considering how often it is called. >Poul-Henning> Have you measured the performance impact of this to be an >Poul-Henning> insignificant overhead ? > >No precise results right now, mainly because I cannot find a benchmark >to measure the performance of name lookup going down to a deep >directory depth. Have you done any "trivial" checks, like timing "make world" and such ? >It has been confirmed, though, that the hit ratio of name lookup is >around 96-98% for a box serving cvsup both with and without my patch >(observed by systat(1)). Here are the details of the name lookup on >that box: Ohh, sure, I don't expect this to have a big impact on the hit rate. If I thought it would have, I would have protested :-) >For a more precise investigation, we have to measure the actual time >taken for a lookup operation, in which case I may have to write a >benchmark for it and test in single-user mode. I would be satisfied with a "sanity-check", for instance running a "cvs co src ; cd src ; make buildworld ; cd release ; make release" with and without, just to see that it doesn't have a significant negative impact. >It is interesting that the hit ratio of directory lookup is only >1% at most, even without my patch. Why is it like that? Uhm, which cache is this ? The one reported in "vmstat -vm" ? That is entirely different from the vfs-namecache, I think it is a per process one-slot directory cache. I have never studied its performance, but I believe a good case was made for it in the 4.[34] BSD books. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Mon, 16 Apr 2001 12:36:03 +0200, Poul-Henning Kamp <[EMAIL PROTECTED]> said: Poul-Henning> In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >> Those pieces of work were done over the last weekend, and the patch at >> Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff >> >> has been updated and is now ready to commit. Poul-Henning> I'm a bit worried about the amount of work done in the Poul-Henning> cache_purgeleafdirs(), considering how often it is called. Poul-Henning> Have you measured the performance impact of this to be an Poul-Henning> insignificant overhead ? No precise results right now, mainly because I cannot find a benchmark to measure the performance of name lookup going down to a deep directory depth. It has been confirmed, though, that the hit ratio of name lookup is around 96-98% for a box serving cvsup both with and without my patch (observed by systat(1)). Here are the details of the name lookup on that box: Frequency: Around 25,000-35,000 lookups/sec at most, 8,000-10,000 generally. Name vs Directory: 98% or more of the lookups are for names, the rest of them are for directories (up to 1.5% of all lookups at most). Hit ratio: 96-98% for names and 1% at most for directories (both with and without my patch) Considering that most lookup operations are for names and their hit ratio is not observed to degrade, and assuming that the time consumed for a lookup hit is always constant, the performance of lookup does not appear to have deteriorated. For a more precise investigation, we have to measure the actual time taken for a lookup operation, in which case I may have to write a benchmark for it and test in single-user mode. It is interesting that the hit ratio of directory lookup is only 1% at most, even without my patch. Why is it like that? -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
* Seigo Tanimura <[EMAIL PROTECTED]> [010416 03:25] wrote: > On Fri, 13 Apr 2001 20:08:57 +0900, > Seigo Tanimura said: > > Alfred> Are these changes planned for integration? > > Seigo> Yes, but not very soon as there are a few kinds of work that should > Seigo> be done. > > Seigo> One is that a directory vnode may be held as the working directory of > Seigo> a process, in which case we should not reclaim the directory vnode. > > Seigo> Another is to determine how often namecache should be traversed to > Seigo> reclaim how many directory vnodes. At this moment, namecache is > (snip) > > Those pieces of work were done over the last weekend, and the patch at > > Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > > has been updated and is now ready to commit. There are actually a few style bugs in here: pointers should be compared against NULL, not 0 using a bit more meaningful variable names would be nice: + struct nchashhead *ncpp; + struct namecache *ncp, *nnp, *ncpc, *nnpc; I'm also wondering why you can't track the number of nodes that ought to be cleaned, well, you do, but it doesn't look like it's used: + numcachehv--; + numcachehv++; then later: + if (vnodeallocs % vnoderecycleperiod == 0 && + freevnodes < vnoderecycleminfreevn && + vnoderecyclemintotalvn < numvnodes) { shouldn't this be related to numcachehv somehow? excuse me if i'm missing something obvious, i'm in desperate need of sleep. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
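For the style(9) point above, the difference is just explicitness; a generic illustration (the exact offending lines are not quoted, so the field name here is merely taken from struct namecache):

    if (ncp->nc_dvp != NULL)        /* style(9): compare against NULL */
    if (ncp->nc_dvp != 0)           /* what the patch does */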
Re: vm balance
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes: >Those pieces of work were done over the last weekend, and the patch at > >Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > >has been updated and is now ready to commit. I'm a bit worried about the amount of work done in the cache_purgeleafdirs(), considering how often it is called. Have you measured the performance impact of this to be an insignificant overhead ? Once we have that figured out I will commit the patch for you... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
* Seigo Tanimura <[EMAIL PROTECTED]> [010416 03:25] wrote: > On Fri, 13 Apr 2001 20:08:57 +0900, > Seigo Tanimura said: > > Alfred> Are these changes planned for integration? > > Seigo> Yes, but not very soon as there are a few kinds of work that should > Seigo> be done. > > Seigo> One is that a directory vnode may be held as the working directory of > Seigo> a process, in which case we should not reclaim the directory vnode. > > Seigo> Another is to determine how often namecache should be traversed to > Seigo> reclaim how many directory vnodes. At this moment, namecache is > (snip) > > Those pieces of work were done over the last weekend, and the patch at > > Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff > > has been updated and is now ready to commit. Heh, go for it. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Fri, 13 Apr 2001 20:08:57 +0900, Seigo Tanimura said: Alfred> Are these changes planned for integration? Seigo> Yes, but not very soon as there are a few kinds of work that should Seigo> be done. Seigo> One is that a directory vnode may be held as the working directory of Seigo> a process, in which case we should not reclaim the directory vnode. Seigo> Another is to determine how often namecache should be traversed to Seigo> reclaim how many directory vnodes. At this moment, namecache is (snip) Those pieces of work were done over the last weekend, and the patch at Seigo> http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff has been updated and is now ready to commit. -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:Speaking of vmiodirenable, what are the issues with it that it's not :enabled by default? ISTR that it's been in a while, and most people :pointed at it have reported success with it, and it seems to have solved :problems here and there for a number of people. What's keeping it from :the general case? : :-- :Matthew Fuller (MF4839) |[EMAIL PROTECTED] I'll probably turn it on after the 4.3 release. As far as Kirk and I can tell, there are no (hah!) filesystem corruption bugs left in the filesystem or VM code. I am guessing that what corruption still occurs occasionally is either due to something elsewhere in the kernel, or motherboard issues (e.g. like the VIA chipset IDE DMA corruption bug). I have just four words to say about IDE DMA: It's a f**ked standard. Neither Kirk nor I have been able to reproduce reported problems at all, but with help from others we have fixed a number of bugs which seem to have had a positive effect on Yahoo's test machines. At the moment one of Yahoo's 8 IDE test systems may crash once after a few hours, but then after reboot will never crash again. This hopefully means that fsck is fixing corruption generated from earlier buggy kernels that is caught later on. I've been exchanging email with three other people with corruption issues. One turned out to be hardware (fsck after newfs was failing, so obviously not a filesystem issue!), another is indeterminate, the third was working fine until late February and then new kernels started to result in corruption (while old kernels still worked) and he is now trying to narrow down the date range where the problem was introduced. Either way it should be fairly obvious if turning on vmiodirenable makes it worse or not. My guess is: not, and it's just my paranoia that is holding up turning on vmiodirenable. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Sat, Apr 14, 2001 at 09:34:26AM -0500, Matthew D. Fuller wrote: > On Thu, Apr 12, 2001 at 02:24:36PM -0700, a little birdie told me > that Matt Dillon remarked > > > > Without vmiodirenable turned on, any directory exceeding > > vfs.maxmallocbufspace becomes extremely expensive to work with, > > O(N * diskIO). With vmiodirenable turned on huge directories > > are O(N), but have a better chance of being in the VM page cache > > so cost proportionally less even though they don't do any > > better on a relative scale. > > Speaking of vmiodirenable, what are the issues with it that it's not > enabled by default? ISTR that it's been in a while, and most people > pointed at it have reported success with it, and it seems to have solved > problems here and there for a number of people. What's keeping it from > the general case? Attached is a message from Matt Dillon from an earlier -hackers discussion. G'luck, Peter -- The rest of this sentence is written in Thailand, on From [EMAIL PROTECTED] Fri Mar 23 02:15:39 2001 Date: Thu, 22 Mar 2001 16:14:11 -0800 (PST) From: Matt Dillon <[EMAIL PROTECTED]> Message-Id: <[EMAIL PROTECTED]> To: "Michael C. Wu" <[EMAIL PROTECTED]> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: tuning a VERY heavily (30.0) loaded server :(Why is vfs.vmiodirenable=1 not enabled by default?) : The only reason it isn't enabled by default is some unresolved filesystem corruption that occurs very rarely (with or without it) that Kirk and I are still trying to nail down. I want to get that figured out first. It is true that some people have brought up memory use issues, but I don't consider memory use to really be that much of an issue. This is a cache, after all, so the blocks can be reused at just about any time. And directory blocks do not get cached well at all with vmiodirenable turned off. So the net result should be an increase in performance even on low-memory boxes. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Thu, Apr 12, 2001 at 02:24:36PM -0700, a little birdie told me that Matt Dillon remarked > > Without vmiodirenable turned on, any directory exceeding > vfs.maxmallocbufspace becomes extremely expensive to work with > O(N * diskIO). With vmiodirenable turned on huge directories > are O(N), but have a better chance of being in the VM page cache > so cost proportionally less even though they don't do any > better on a relative scale. Speaking of vmiodirenable, what are the issues with it that it's not enabled by default? ISTR that it's been in a while, and most people pointed at it have reported success with it, and it seems to have solved problems here and there for a number of people. What's keeping it from the general case? -- Matthew Fuller (MF4839) |[EMAIL PROTECTED] Unix Systems Administrator |[EMAIL PROTECTED] Specializing in FreeBSD |http://www.over-yonder.net/ "The only reason I'm burning my candle at both ends, is because I haven't figured out how to light the middle yet" To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Fri, 13 Apr 2001 02:58:07 -0700, Alfred Perlstein <[EMAIL PROTECTED]> said: Alfred> * Seigo Tanimura <[EMAIL PROTECTED]> [010413 02:39] wrote: >> On Thu, 12 Apr 2001 22:50:50 +0200, >> Poul-Henning Kamp <[EMAIL PROTECTED]> said: >> >> Poul-Henning> We keep namecache entries around as long as we can use them, and that >> Poul-Henning> generally means that recreating them is a rather expensive operation, >> Poul-Henning> involving creation of a vnode and very likely a vm object again. >> >> Holding a namecache entry forever until its vnode is reused results in >> disaster when a huge number of files are accessed concurrently, causing >> active vnodes to eat up all memory. This beast killed a box of mine >> with 3GB of memory and 200GB of a RAID0 disk array serving about >> 300,000 files by cvsupd and making the world a few months ago, when >> the number of vnodes reached around 400,000, making all of the >> processes wait for a free vnode. >> >> With help from tegge, the box is now reclaiming directory vnodes when >> few free vnodes are available. Only directory vnodes holding no child >> directory vnodes in v_cache_src are recycled, so that directory >> vnodes near the root of the filesystem hierarchy remain in namecache >> and directory vnodes are not reclaimed in cascade. The number of >> vnodes in the box is now about 135,000, staying quite steady. >> >> Name'cache' is the place to hold vnodes for future use which may *not* >> come, hence vnodes held in namecache should be reclaimed in case of >> critical vnode shortage. Alfred> Are these changes planned for integration? Yes, but not very soon as there are a few kinds of work that should be done. One is that a directory vnode may be held as the working directory of a process, in which case we should not reclaim the directory vnode. Another is to determine how often namecache should be traversed to reclaim how many directory vnodes. At this moment, namecache is traversed once every 1,000 calls of getnewvnode(). If the following two inequalities are satisfied, then up to 3,000 directory vnodes are attempted to be reclaimed: freevnodes < wantfreevnodes + 2 * 1000 (1) wantfreevnodes + 2 * 1000 < numvnodes * 2 (2) (1) means that we reclaim directory vnodes if the number of free vnodes is smaller than about 2,000. (2) is so that vnode reclaiming does not occur in the early stage of boot until the number of vnodes reaches around 2,000. Although I chose those parameters so that vnode reclaiming does not degrade the hit ratio of name lookup, they may not be optimal. Those parameters should be tunable via sysctl(2). Anyway, the patch can be found at: http://people.FreeBSD.org/~tanimura/patches/vnrecycle.diff -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
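In code form, the trigger described above amounts to something like the following in getnewvnode(); this is a paraphrase with the literals from the text inlined -- the posted patch uses the tunables quoted earlier in the thread, and the argument to cache_purgeleafdirs() is assumed here:

    vnodeallocs++;
    if (vnodeallocs % 1000 == 0 &&                       /* once per 1,000 calls */
        freevnodes < wantfreevnodes + 2 * 1000 &&        /* (1): few free vnodes */
        wantfreevnodes + 2 * 1000 < numvnodes * 2) {     /* (2): not early in boot */
            /* Attempt to reclaim up to 3,000 leaf directory vnodes. */
            cache_purgeleafdirs(3000);
    }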
Re: vm balance
* Seigo Tanimura <[EMAIL PROTECTED]> [010413 02:39] wrote: > On Thu, 12 Apr 2001 22:50:50 +0200, > Poul-Henning Kamp <[EMAIL PROTECTED]> said: > > Poul-Henning> We keep namecache entries around as long as we can use them, and that > Poul-Henning> generally means that recreating them is a rather expensive operation, > Poul-Henning> involving creation of a vnode and very likely a vm object again. > > Holding a namecache entry forever until its vnode is reused results in > disaster when a huge number of files are accessed concurrently, causing > active vnodes to eat up all memory. This beast killed a box of mine > with 3GB of memory and 200GB of a RAID0 disk array serving about > 300,000 files by cvsupd and making the world a few months ago, when > the number of vnodes reached around 400,000, making all of the > processes wait for a free vnode. > > With help from tegge, the box is now reclaiming directory vnodes when > few free vnodes are available. Only directory vnodes holding no child > directory vnodes in v_cache_src are recycled, so that directory > vnodes near the root of the filesystem hierarchy remain in namecache > and directory vnodes are not reclaimed in cascade. The number of > vnodes in the box is now about 135,000, staying quite steady. > > Name'cache' is the place to hold vnodes for future use which may *not* > come, hence vnodes held in namecache should be reclaimed in case of > critical vnode shortage. Are these changes planned for integration? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Thu, 12 Apr 2001 22:50:50 +0200, Poul-Henning Kamp <[EMAIL PROTECTED]> said: Poul-Henning> We keep namecache entries around as long as we can use them, and that Poul-Henning> generally means that recreating them is a rather expensive operation, Poul-Henning> involving creation of a vnode and very likely a vm object again. Holding a namecache entry forever until its vnode is reused results in disaster when a huge number of files are accessed concurrently, causing active vnodes to eat up all memory. This beast killed a box of mine with 3GB of memory and 200GB of a RAID0 disk array serving about 300,000 files by cvsupd and making the world a few months ago, when the number of vnodes reached around 400,000, making all of the processes wait for a free vnode. With help from tegge, the box is now reclaiming directory vnodes when few free vnodes are available. Only directory vnodes holding no child directory vnodes in v_cache_src are recycled, so that directory vnodes near the root of the filesystem hierarchy remain in namecache and directory vnodes are not reclaimed in cascade. The number of vnodes in the box is now about 135,000, staying quite steady. Name'cache' is the place to hold vnodes for future use which may *not* come, hence vnodes held in namecache should be reclaimed in case of critical vnode shortage. -- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
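The "no child directory vnodes in v_cache_src" test reads naturally as a small predicate; a sketch assuming the 4.x struct namecache fields (the nc_src linkage and nc_vp) and the vnode's v_cache_src list, with the function name invented here:

    static int
    cache_isleafdir(struct vnode *dvp)
    {
            struct namecache *ncp;

            LIST_FOREACH(ncp, &dvp->v_cache_src, nc_src) {
                    /*
                     * A cached child that is itself a directory makes dvp
                     * an interior node of the cached hierarchy; keep it,
                     * since recycling it would cascade toward the root.
                     */
                    if (ncp->nc_vp != NULL && ncp->nc_vp->v_type == VDIR)
                            return (0);
            }
            return (1);
    }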
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: >:>:>scalability. >:>: >:>:Uhm, that is actually not true. >:>: >:>:We keep namecache entries around as long as we can use them, and that >:>:generally means that recreating them is a rather expensive operation, >:>:involving creation of a vnode and very likely a vm object again. >:> >:>The vnode cache is a different cache. Positive namei hits will >:>reference a vnode, but namei elements can be flushed at any >:>time without flushing the underlying vnode. >: >:Right, but doing so means that to refind that vnode from the name >:is (comparatively) very expensive. >: >:-- >:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 >:[EMAIL PROTECTED] | TCP/IP since RFC 956 > >The only thing that is truly expensive is having to physically >scan a large directory in order to instantiate a new namei >record. Everything else is inexpensive by comparison (by two >orders of magnitude!), even constructing new vnodes. > >Without vmiodirenable turned on, any directory [...] It's worse than that: we are still way too rude about throwing away directory data... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:>:>scalability. :>: :>:Uhm, that is actually not true. :>: :>:We keep namecache entries around as long as we can use them, and that :>:generally means that recreating them is a rather expensive operation, :>:involving creation of a vnode and very likely a vm object again. :> :>The vnode cache is a different cache. Positive namei hits will :>reference a vnode, but namei elements can be flushed at any :>time without flushing the underlying vnode. : :Right, but doing so means that to refind that vnode from the name :is (comparatively) very expensive. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :[EMAIL PROTECTED] | TCP/IP since RFC 956 The only thing that is truly expensive is having to physically scan a large directory in order to instantiate a new namei record. Everything else is inexpensive by comparison (by two orders of magnitude!), even constructing new vnodes. Without vmiodirenable turned on, any directory exceeding vfs.maxmallocbufspace becomes extremely expensive to work with, O(N * diskIO). With vmiodirenable turned on huge directories are O(N), but have a better chance of being in the VM page cache so cost proportionally less even though they don't do any better on a relative scale. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: > >: >:In message <[EMAIL PROTECTED]>, Matt Dillon writes: >: >:>Again, keep in mind that the namei cache is strictly throw-away, but >:>entries can often be reconstituted later by the filesystem without I/O >:>due to the VM Page cache (and/or buffer cache depending on >:>vfs.vmiodirenable). So as with the buffer cache and inode cache, >:>the number of entries can be limited without killing performance or >:>scalability. >: >:Uhm, that is actually not true. >: >:We keep namecache entries around as long as we can use them, and that >:generally means that recreating them is a rather expensive operation, >:involving creation of a vnode and very likely a vm object again. > >The vnode cache is a different cache. Positive namei hits will >reference a vnode, but namei elements can be flushed at any >time without flushing the underlying vnode. Right, but doing so means that to refind that vnode from the name is (comparatively) very expensive. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
: :In message <[EMAIL PROTECTED]>, Matt Dillon writes: : :>Again, keep in mind that the namei cache is strictly throw-away, but :>entries can often be reconstituted later by the filesystem without I/O :>due to the VM Page cache (and/or buffer cache depending on :>vfs.vmiodirenable). So as with the buffer cache and inode cache, :>the number of entries can be limited without killing performance or :>scalability. : :Uhm, that is actually not true. : :We keep namecache entries around as long as we can use them, and that :generally means that recreating them is a rather expensive operation, :involving creation of a vnode and very likely a vm object again. The vnode cache is a different cache. Positive namei hits will reference a vnode, but namei elements can be flushed at any time without flushing the underlying vnode. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes: >Again, keep in mind that the namei cache is strictly throw-away, but >entries can often be reconstituted later by the filesystem without I/O >due to the VM Page cache (and/or buffer cache depending on >vfs.vmiodirenable). So as with the buffer cache and inode cache, >the number of entries can be limited without killing performance or >scalability. Uhm, that is actually not true. We keep namecache entries around as long as we can use them, and that generally means that recreating them is a rather expensive operation, involving creation of a vnode and very likely a vm object again. We can safely say that you cannot profitably _increase_ the size of the namecache, except for the negative entries where raw statistics will have to be the judge of the profitability of the idea. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
On Thu, 12 Apr 2001, Matt Dillon wrote: > Again, keep in mind that the namei cache is strictly throw-away, This seems to be the main difference between Linux and FreeBSD. In Linux, open files directly refer to an entry in the dentry (and inode) cache, so we really need to have dynamically growing and shrinking caches in order to accommodate programs that have huge numbers of files open (but we want to free the memory again later, because the system load changes). regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
Re: vm balance
:You should also know that negative entries, since they have no :objects to "hang from" and consequently would clog up the name-cache, :are limited by the sysctl: : debug.ncnegfactor: 16 :which means that max 1/16 of the name cache entries can be negative :entries. You can monitor the number of negative entries with the :sysctl : debug.numneg: 305 : :the value of "16" was rather arbitrarily chosen and better defaults :may exist. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 :[EMAIL PROTECTED] | TCP/IP since RFC 956 Here's an example from a lightly loaded machine that's been up about two months (since I last upgraded its kernel): earth:/home/dillon> sysctl -a | fgrep vfs.cache vfs.cache.numneg: 1596 vfs.cache.numcache: 30557 vfs.cache.numcalls: 352196140 vfs.cache.dothits: 5598866 vfs.cache.dotdothits: 14055093 vfs.cache.numchecks: 435747692 vfs.cache.nummiss: 29963655 vfs.cache.nummisszap: 3042073 vfs.cache.numposzaps: 3308219 vfs.cache.numposhits: 274527703 vfs.cache.numnegzaps: 939714 vfs.cache.numneghits: 20760817 vfs.cache.numcwdcalls: 215565 vfs.cache.numcwdfail1: 29 vfs.cache.numcwdfail2: 1730 vfs.cache.numcwdfail3: 0 vfs.cache.numcwdfail4: 4 vfs.cache.numcwdfound: 213802 vfs.cache.numfullpathcalls: 0 vfs.cache.numfullpathfail1: 0 vfs.cache.numfullpathfail2: 0 vfs.cache.numfullpathfail3: 0 vfs.cache.numfullpathfail4: 0 vfs.cache.numfullpathfound: 0 Again, keep in mind that the namei cache is strictly throw-away, but entries can often be reconstituted later by the filesystem without I/O due to the VM Page cache (and/or buffer cache depending on vfs.vmiodirenable). So as with the buffer cache and inode cache, the number of entries can be limited without killing performance or scalability. earth:/home/dillon> vmstat -m | egrep 'Type|vfsc' ... Type InUse MemUse HighUse Limit Requests Limit Limit Size(s) vfscache 30567 2386K 2489K 85444K 275524850 0 64,128,256,256K This particular machine has 30567 component entries in the namei cache at the moment, eating around 2.3 MB of kernel memory. That makes the namei cache quite efficient. Of course, there are many situations where the namei cache is ineffective, such as on machines with insanely huge mail queues or older usenet news systems that used individual files for article storage, or a squid cache that uses individual files. The ultimate solution is to back the name cache with a filesystem that uses hashed or sorted/indexed directories - one of the few disadvantages that remain with UFS/FFS. I've never found that to be a show stopper, though. -Matt To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-hackers" in the body of the message
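Working the arithmetic on those counters (and counting the "." and ".." hits as hits), the overall hit rate is (274527703 + 20760817 + 5598866 + 14055093) / 352196140, or roughly 89%, while the vmstat line puts the cost at about 2386K / 30567 entries, i.e. around 80 bytes of kernel memory per cached name -- which is the sense in which the cache is cheap for what it delivers.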
Re: vm balance
In message <[EMAIL PROTECTED]>, Matt Dillon writes:
>
>:
>:On Tue, 10 Apr 2001, Matt Dillon wrote:
>:
>:>It's randomness that will kill performance.  You know the old saying
>:>about caches:  They only work if you get cache hits, otherwise
>:>they only slow things down.
>:
>:I wonder ... how does FreeBSD handle negative directory entries?
>:
>:That is, /bin/sh looks through the PATH to search for some executable
>:(eg grep) and doesn't find it in the first 3 directories.
>:
>:Does the vfs cache handle this or does FreeBSD have to go down into
>:the filesystem code every time?
>:
>:Rik
>
>The namei cache stores negative hits.  /usr/src/sys/kern/vfs_cache.c
>cache_lookup() - if ncp->nc_vp (the vnode) is NULL, the cache entry
>represents a negative hit.  cache_enter() - vp may be passed as NULL
>to create a negative cache entry.  In ufs/ufs/ufs_lookup.c, calls to
>cache_enter() enter positive or negative lookups as appropriate.
>

You should also know that negative entries, since they have no
objects to "hang from" and consequently would clog up the name-cache,
are limited by the sysctl:

	debug.ncnegfactor: 16

which means that at most 1/16 of the name cache entries can be
negative entries.  You can monitor the number of negative entries
with the sysctl:

	debug.numneg: 305

The value of "16" was rather arbitrarily chosen and better defaults
may exist.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[EMAIL PROTECTED]       | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
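The cap PHK describes can be sketched roughly as follows. This is a simplified illustration of the debug.ncnegfactor enforcement, not the actual vfs_cache.c code; cache_enter_negative(), the ncneg list and the counters are reduced stand-ins for the real data structures.

    /*
     * Illustrative sketch of a negative-entry cap like ncnegfactor.
     * Negative entries sit on their own LRU list; when adding one
     * would exceed numcache / ncnegfactor, the oldest negative entry
     * is recycled first.  Not the real vfs_cache.c code.
     */
    #include <sys/queue.h>
    #include <stddef.h>

    struct vnode;

    struct namecache {
        TAILQ_ENTRY(namecache) nc_neglist;  /* LRU of negative entries */
        struct vnode *nc_vp;                /* NULL => negative entry  */
    };

    static TAILQ_HEAD(, namecache) ncneg = TAILQ_HEAD_INITIALIZER(ncneg);
    static int numneg;           /* current negative entries           */
    static int numcache;         /* total name cache entries (positive
                                    entries are counted elsewhere)     */
    static int ncnegfactor = 16; /* at most 1/16 may be negative       */

    static void
    cache_enter_negative(struct namecache *ncp)
    {
        struct namecache *old;

        ncp->nc_vp = NULL;               /* mark as a negative hit     */

        /* Over the cap?  Recycle the LRU negative entry first. */
        if ((numneg + 1) * ncnegfactor > numcache &&
            (old = TAILQ_FIRST(&ncneg)) != NULL) {
            TAILQ_REMOVE(&ncneg, old, nc_neglist);
            numneg--;
            numcache--;
            /* the real code would cache_zap() 'old' here */
        }
        TAILQ_INSERT_TAIL(&ncneg, ncp, nc_neglist);
        numneg++;
        numcache++;
    }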
Re: vm balance
:
:On Tue, 10 Apr 2001, Matt Dillon wrote:
:
:>It's randomness that will kill performance.  You know the old saying
:>about caches:  They only work if you get cache hits, otherwise
:>they only slow things down.
:
:I wonder ... how does FreeBSD handle negative directory entries?
:
:That is, /bin/sh looks through the PATH to search for some executable
:(eg grep) and doesn't find it in the first 3 directories.
:
:Does the vfs cache handle this or does FreeBSD have to go down into
:the filesystem code every time?
:
:Rik

The namei cache stores negative hits.  See /usr/src/sys/kern/vfs_cache.c:
in cache_lookup(), if ncp->nc_vp (the vnode) is NULL, the cache entry
represents a negative hit.  In cache_enter(), vp may be passed as NULL
to create a negative cache entry.  In ufs/ufs/ufs_lookup.c, calls to
cache_enter() enter positive or negative lookups as appropriate.

					-Matt
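Condensed into code, a lookup therefore has three outcomes: a miss (fall through to the filesystem), a positive hit (a vnode comes back), and a negative hit (nc_vp is NULL, so ENOENT can be returned without touching the filesystem). The sketch below is illustrative only; the real cache_lookup() takes componentname and locking arguments omitted here, and find_entry() is a hypothetical stand-in for the hash-chain search.

    #include <stddef.h>

    struct vnode;

    struct namecache {
        struct vnode *nc_vp;        /* NULL => negative entry */
    };

    enum lookup_result { CACHE_MISS, CACHE_POS_HIT, CACHE_NEG_HIT };

    /* Hypothetical stand-in for the real hash-chain search. */
    static struct namecache *
    find_entry(struct vnode *dvp, const char *name)
    {
        (void)dvp;
        (void)name;
        return (NULL);              /* always misses in this sketch  */
    }

    static enum lookup_result
    cache_lookup_sketch(struct vnode *dvp, const char *name,
        struct vnode **vpp)
    {
        struct namecache *ncp = find_entry(dvp, name);

        if (ncp == NULL)
            return (CACHE_MISS);    /* go down into the filesystem   */
        if (ncp->nc_vp == NULL)
            return (CACHE_NEG_HIT); /* known absent: ENOENT, no I/O  */
        *vpp = ncp->nc_vp;          /* positive hit: hand back vnode */
        return (CACHE_POS_HIT);
    }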
Re: vm balance
On Tue, 10 Apr 2001, Matt Dillon wrote:

>It's randomness that will kill performance.  You know the old saying
>about caches:  They only work if you get cache hits, otherwise
>they only slow things down.

I wonder ... how does FreeBSD handle negative directory entries?

That is, /bin/sh looks through the PATH to search for some executable
(eg grep) and doesn't find it in the first 3 directories.  The next
time the script is started (it might be run for every file in a large
compile) the next invocation of the script looks for the file in the
3 directories where it isn't present ... again.

Does the vfs cache handle this or does FreeBSD have to go down into
the filesystem code every time?

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/
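The access pattern Rik describes is essentially the execvp(3)-style loop below (a hypothetical userland sketch, not shell source). Every failed access() in the loop is an ENOENT lookup, and those repeated failures are exactly what a negative name-cache entry could answer without going down into the filesystem each time.

    /*
     * Hypothetical sketch of a shell's PATH scan.  Each failing
     * access() repeats the same ENOENT lookups on every invocation
     * of the script -- the case negative cache entries exist for.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static int
    find_in_path(const char *prog, char *out, size_t outlen)
    {
        const char *path = getenv("PATH");
        char *copy, *dir, *save;
        int found = 0;

        if (path == NULL || (copy = strdup(path)) == NULL)
            return (0);
        for (dir = strtok_r(copy, ":", &save); dir != NULL;
             dir = strtok_r(NULL, ":", &save)) {
            snprintf(out, outlen, "%s/%s", dir, prog);
            if (access(out, X_OK) == 0) {   /* positive lookup       */
                found = 1;
                break;
            }
            /* ENOENT: answerable from a negative name-cache entry   */
        }
        free(copy);
        return (found);
    }

    int
    main(void)
    {
        char buf[1024];

        if (find_in_path("grep", buf, sizeof(buf)))
            printf("found: %s\n", buf);
        return (0);
    }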
Re: vm balance
Rik van Riel wrote:
>
> I'm curious about the other things though ... FreeBSD still seems
> to have the early 90's abstraction layer from Mach and the vnode
> cache doesn't seem to grow and shrink dynamically (which can be a
> big win for systems with lots of metadata activity).
>
> So while it's true that FreeBSD's VM balancing seems to be the
> best one out there, I'm not quite sure about the rest of the VM...

Many years ago Kirk was talking about merging the vm objects and
the vnodes.. (they tend to come in pairs anyhow)

I still think it might be an idea worth investigating further.

kirk?

--
      __--_|\  Julian Elischer
     /       \ [EMAIL PROTECTED]
    (   OZ    ) World tour 2000-2001
---> X_.---._/
            v
Re: vm balance
It's randomness that will kill performance.  You know the old saying
about caches:  They only work if you get cache hits, otherwise
they only slow things down.

					-Matt

:Which is ok if there isn't too much activity with these data
:structures, but I'm not sure if it works when you have a lot
:of metadata activity (though I'm not sure in what kind of
:workload you'd see this).
:
:Also, if you have a lot of metadata activity, you'll essentially
:double the memory requirements, since you'll have the stuff cached
:in both the internal structures and in the VM PAGE cache.  I'm not
:sure how much of a hit this would be, though, if the internal
:structures are limited to a small enough size...
:
:regards,
:
:Rik
Re: vm balance
On Tue, 10 Apr 2001, Matt Dillon wrote:

> :I'm curious about the other things though ... FreeBSD still seems
> :to have the early 90's abstraction layer from Mach and the vnode
> :cache doesn't seem to grow and shrink dynamically (which can be a
> :big win for systems with lots of metadata activity).

> Well, the approach we take is that of a two-layered cache.
> The vnode, dentry (namei for FreeBSD), and inode caches
> in FreeBSD are essentially throw-away caches of data
> represented in an internal form.  The VM PAGE cache 'backs'
> these caches loosely by caching the physical on-disk representation
> of inodes, and directory entries (see note 1 at bottom).
>
> This means that even though we limit the number of the namei
> and inode structures we keep around in the kernel, the data
> required to reconstitute those structures is 'likely' to
> still be in the VM PAGE cache, allowing us to pretty much
> throw away those structures on a whim.  The only cost is that
> we have to go through a filesystem op (possibly not requiring I/O)
> to reconstitute the internal structure.

Which is ok if there isn't too much activity with these data
structures, but I'm not sure if it works when you have a lot
of metadata activity (though I'm not sure in what kind of
workload you'd see this).

Also, if you have a lot of metadata activity, you'll essentially
double the memory requirements, since you'll have the stuff cached
in both the internal structures and in the VM PAGE cache.  I'm not
sure how much of a hit this would be, though, if the internal
structures are limited to a small enough size...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/
Re: vm balance
:In the balancing part, definitely.  FreeBSD seems to be the only
:system that has the balancing right.  I'm planning on integrating
:some of the balancing tactics into Linux for the 2.5 kernel, but
:I'm not sure how to integrate the inode and dentry cache into the
:balancing scheme ...
:
:I'm curious about the other things though ... FreeBSD still seems
:to have the early 90's abstraction layer from Mach and the vnode
:cache doesn't seem to grow and shrink dynamically (which can be a
:big win for systems with lots of metadata activity).
:
:So while it's true that FreeBSD's VM balancing seems to be the
:best one out there, I'm not quite sure about the rest of the VM...
:
:regards,
:
:Rik

Well, the approach we take is that of a two-layered cache.
The vnode, dentry (namei for FreeBSD), and inode caches in FreeBSD
are essentially throw-away caches of data represented in an internal
form.  The VM PAGE cache 'backs' these caches loosely by caching the
physical on-disk representation of inodes and directory entries
(see note 1 at bottom).

This means that even though we limit the number of the namei and
inode structures we keep around in the kernel, the data required to
reconstitute those structures is 'likely' to still be in the VM PAGE
cache, allowing us to pretty much throw away those structures on a
whim.  The only cost is that we have to go through a filesystem op
(possibly not requiring I/O) to reconstitute the internal structure.

For example, take the namei cache.  The namei cache allows the kernel
to bypass big pieces of the filesystem when doing path name lookups.
If a path is not in the namei cache, the filesystem has to do a
directory lookup.  But a directory lookup could very well access pages
in the VM PAGE cache and thus still not actually result in a disk I/O.

The inode cache works the same way ... inodes can be thrown away at
any time and most of the time they can be reconstituted from the
VM PAGE cache without an I/O.

The vnode cache works slightly differently.  VNodes that are not in
active use can be thrown away and reconstituted at a later time from
either the inode cache or the VM PAGE cache (or, if not, they require
a disk I/O to get at the stat information).

There is a caveat for the vnode cache, however.  VNodes are tightly
integrated with VM Objects, which in turn help hold VM pages in place
in the VM PAGE cache.  Thus when you throw away an inactive vnode you
also have to throw away any cached VM pages representing the cached
file or directory data represented by that vnode.  Nearly all
installations of FreeBSD run out of physical memory long before they
run out of vnodes, so this side effect is almost never an issue.  On
some extremely rare occasions it is possible that the system will
have plenty of free memory but hit its vnode cache limit and start
recycling vnodes, causing it to recycle cache pages even when there
is plenty of free memory available.  But this is very rare.

The key point to all of this is that we put most of our marbles in
the VM PAGE cache.  The namei and inode caches are there simply for
convenience so we don't have to 'lock' big portions of the underlying
VM PAGE cache.  The VM PAGE cache is pretty much an independent
entity.  It does not know or care *what* is being cached, it only
cares how often the data is being accessed and whether it is clean or
dirty.  It treats all the data nearly the same.

note (1): Physical directory blocks have historically been cached in
the buffer cache, using kernel MALLOC space, not in the VM PAGE cache.
Buffer-cache based MALLOC space is severely limited (only a few
megabytes) compared to what the VM PAGE cache can offer.  In FreeBSD
a 'sysctl -w vfs.vmiodirenable=1' will cause physical directory
blocks to be cached in the VM PAGE cache, just like files are cached.
This is not the default, but it will be soon, and many people already
turn this sysctl on.

I should also say that there is a *fourth* cache not yet mentioned
which actually has a huge effect on the VM PAGE cache.  This fourth
cache relates to pages *actively* mapped into user space.  A page
mapped into user space is wired (cannot be ripped out of the VM PAGE
cache) and also has various other pmap-related tracking structures
(which you are familiar with, Rik, so I won't expound on that too
much).  If the VM PAGE cache wants to get rid of an idle page that is
still mapped to a user process, it has to unwire it first, which means
it has to get rid of the user mappings - a pmap*() call from
vm/vm_pageout.c and vm/vm_page.c accomplishes this.  This fourth cache
(the active user mappings of pages) is also a throw-away cache, though
one with the side effect of making VM PAGE cache pages available for
loading.
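Matt's two-layer scheme boils down to the lookup flow sketched below. All helper names here (icache_find(), page_cache_find(), disk_read(), inode_from_block()) are hypothetical stand-ins rather than kernel functions; the point is only the fall-through order: the throw-away internal cache first, then the VM PAGE cache, and a disk I/O only when both miss.

    /*
     * Illustrative outline of the two-layered cache.  Layer 1 is the
     * throw-away internal cache (namei/inode structures); layer 2 is
     * the VM PAGE cache holding the raw on-disk bytes.  All helpers
     * are dummy stand-ins so the sketch is self-contained.
     */
    #include <stddef.h>

    struct inode { unsigned long i_number; };

    static struct inode *icache_find(unsigned long ino)
        { (void)ino; return (NULL); }
    static void *page_cache_find(unsigned long blk)
        { (void)blk; return (NULL); }
    static void *disk_read(unsigned long blk)
        { static char buf[512]; (void)blk; return (buf); }
    static unsigned long ino_to_block(unsigned long ino)
        { return (ino / 64); }
    static struct inode *inode_from_block(void *blk, unsigned long ino)
        { static struct inode ip; (void)blk; ip.i_number = ino;
          return (&ip); }

    static struct inode *
    get_inode(unsigned long ino)
    {
        struct inode *ip;
        void *blk;

        /* Layer 1: cheap internal structure, freely throw-away. */
        if ((ip = icache_find(ino)) != NULL)
            return (ip);

        /* Layer 2: the on-disk bytes may still be in the VM PAGE
         * cache, so reconstituting the structure needs no I/O. */
        if ((blk = page_cache_find(ino_to_block(ino))) == NULL)
            blk = disk_read(ino_to_block(ino));   /* true miss      */

        return (inode_from_block(blk, ino));      /* rebuild layer 1 */
    }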
Re: vm balance
On Tue, 10 Apr 2001, Matt Dillon wrote:

> :I heard NetBSD has implemented a FreeBSD-like VM; it also implemented
> :VM balancing in a recent version of NetBSD.  Some parameters like
> :TEXT, DATA and anonymous memory space can be tuned.  Is anyone doing
> :such work on FreeBSD, or has FreeBSD already implemented it?
>
> FreeBSD implements a very sophisticated VM balancing algorithm.  Nobody's
> complaining about it so I don't think we need to really change it.  Most
> of the other UNIXes, including Linux, are actually playing catch-up to
> FreeBSD's VM design.

In the balancing part, definitely.  FreeBSD seems to be the only
system that has the balancing right.  I'm planning on integrating
some of the balancing tactics into Linux for the 2.5 kernel, but
I'm not sure how to integrate the inode and dentry cache into the
balancing scheme ...

I'm curious about the other things though ... FreeBSD still seems
to have the early 90's abstraction layer from Mach and the vnode
cache doesn't seem to grow and shrink dynamically (which can be a
big win for systems with lots of metadata activity).

So while it's true that FreeBSD's VM balancing seems to be the
best one out there, I'm not quite sure about the rest of the VM...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com.br/
Re: vm balance
>
> FreeBSD implements a very sophisticated VM balancing algorithm.  Nobody's
> complaining about it so I don't think we need to really change it.  Most
> of the other UNIXes, including Linux, are actually playing catch-up to
> FreeBSD's VM design.
>

I remember hearing/viewing a zero-copy networking patch for 4.2...
Anyone else seen this?  If it's already part of the tree, ignore me :-)

Andrew

*-.
| Andrew R. Reiter
| [EMAIL PROTECTED]
| "It requires a very unusual mind
|  to undertake the analysis of the obvious" -- A.N. Whitehead
Re: vm balance
:I heard NetBSD has implemented a FreeBSD-like VM; it also implemented
:VM balancing in a recent version of NetBSD.  Some parameters like
:TEXT, DATA and anonymous memory space can be tuned.  Is anyone doing
:such work on FreeBSD, or has FreeBSD already implemented it?
:
:--
:David Xu

FreeBSD implements a very sophisticated VM balancing algorithm.  Nobody's
complaining about it so I don't think we need to really change it.  Most
of the other UNIXes, including Linux, are actually playing catch-up to
FreeBSD's VM design.

					-Matt