Hi,
Roland Mainz wrote:
serious cycles on this though we'd like to, and the folks who were
involved in 64K simply don't have any interest in working on this in the open.
Why? What do they fear? Being swamped and overburdened with too many emails
or what?
This is a community -- it's up to them whether they want to participate.
The VOP_GETPAGE() and VOP_PUTPAGE() interfaces were added in the page
cache unification which I believe (without looking) dates back to SunOS
4.x. The original vnode interface specification from srk was based on the
buffer cache.
Who or what is "srk" ?
Steve Kleiman. Looking at the notes at the end of the paper it says that
the architecture was designed by Bill Joy but (I'm told) Steve was the
original implementor of the vnode interfaces.
The major problem with these interfaces is that they expose PAGESIZE to
filesystems. UFS in particular is problematic because (as I've said before)
it assumes that PAGESIZE <= MAXBSIZE.
Yes, but in the x86 case we have PAGESIZE != MAXBSIZE so these locations
are likely well-known and we do not have to search anymore...
The filesystems appear to be prepared to handle multiple pages per block
but not multiple blocks per page.
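
To make the two cases concrete, here is a minimal C sketch of the
arithmetic. The helper names and the example sizes used with it (4K and
64K pages, an 8K maximum block size) are illustrative assumptions, not
values read from any kernel header.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helpers, for illustration only.
 * Both sizes are in bytes and are expected to be powers of two.
 */

/* Whole pages covering one block; 0 when a block is smaller than a page. */
static size_t
pages_per_block(size_t pagesize, size_t bsize)
{
	assert(pagesize > 0 && bsize > 0);
	return (bsize / pagesize);
}

/* Whole blocks covering one page; 0 when a page is smaller than a block. */
static size_t
blocks_per_page(size_t pagesize, size_t bsize)
{
	assert(pagesize > 0 && bsize > 0);
	return (pagesize / bsize);
}

/* Nonzero when the "PAGESIZE <= MAXBSIZE" assumption holds. */
static int
page_fits_in_block(size_t pagesize, size_t bsize)
{
	return (pagesize <= bsize);
}
```

With a 4K page and an 8K block, pages_per_block() returns 2, which is the
case the filesystems cope with. With a 64K page and the same block size,
blocks_per_page() returns 8 and page_fits_in_block() returns 0, which is
the case they do not.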
The right way to fix this is to refactor the VM/FS interfaces to remove
PAGESIZE from them. That work is getting underway now.
What about skipping UFS in the initial pass and only concentrate - as
Holger Berger suggested - on NFS booting ?
That sounds like a fine idea.
BTW: Are the ZFS/zfsboot people aware of the problem ? IMO at least the
person(s) working on zfsboot should be warned that relying on pagesize
for whatever reason may be bad...
ZFS doesn't have any of these issues; when Jeff Bonwick started ZFS he
knew more about the VM system than most of us did back then, and
specifically he realized how broken the VM/FS interfaces are long before
we did.
<weeds> One "for instance" is to go look at the ZFS implementation of
mmap() and compare it to UFS. UFS has all sorts of horrible deadlocks
between mmap() fault-driven file I/O and its other data paths due to the
page cache, whereas ZFS does not. ZFS doesn't exhibit similar problems
partly because it requires transactional interfaces -- from the time the
data is checksummed to when the transaction group is committed the data
buffers have to be immutable. Pulling this off with a unified page cache
is simply impractical, since every write would require a writers-lock plus
a global TLB shootdown to prevent subsequent writes. Modern architectures
are capable of copying the data in about 1/10th of the time it takes to
make a user-mapped buffer immutable on an MP. Logging UFS also has
transactional issues which have been band-aided over leading to locking
hell and codepaths so twisted they make my head spin. VOP_GETPAGE() and
VOP_PUTPAGE() and their reliance on PAGESIZE lead to a complete lack of
transactional semantics which more and more modern filesystems are
starting to require. </weeds>
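
A rough sketch of the "copy instead of lock" idea, in user-level C. This
is not ZFS code: the structure, the function names and the toy checksum
are all made up for illustration. The point is only that once the data
has been copied into a private buffer and checksummed, later stores to
the caller's buffer cannot invalidate the checksum.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private buffer owned by an in-flight transaction. */
typedef struct ex_txn_buf {
	uint8_t		*data;	/* private copy of the caller's data */
	size_t		len;	/* length of the copy in bytes */
	uint32_t	cksum;	/* checksum taken over the private copy */
} ex_txn_buf_t;

/* Toy checksum, standing in for a real one. */
static uint32_t
ex_cksum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + p[i];
	return (sum);
}

/* Copy the caller's data, then checksum the copy. Returns 0 or -1. */
static int
ex_txn_capture(ex_txn_buf_t *tb, const void *src, size_t len)
{
	assert(tb != NULL && src != NULL && len > 0);
	tb->data = malloc(len);
	if (tb->data == NULL)
		return (-1);
	memcpy(tb->data, src, len);
	tb->len = len;
	tb->cksum = ex_cksum(tb->data, len);
	return (0);
}

/* Nonzero while the private copy still matches its checksum. */
static int
ex_txn_verify(const ex_txn_buf_t *tb)
{
	return (ex_cksum(tb->data, tb->len) == tb->cksum);
}

/* Free the private copy once the transaction is done with it. */
static void
ex_txn_release(ex_txn_buf_t *tb)
{
	free(tb->data);
	tb->data = NULL;
	tb->len = 0;
}
```

If the caller scribbles on its own buffer after ex_txn_capture() returns,
ex_txn_verify() still succeeds, with no lock on the caller's mapping and
no TLB shootdown.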
Since extensive file system changes are necessary one way or the other to
pull this off, others and I would prefer to take the tack (since we
have to go there eventually ANYWAY) of blowing up VOP_*PAGE(), and
introducing a new VOP_IO() interface which does not depend on PAGESIZE at
all. Instead of a page_t, it would use a base/bounds pair attached to a
different data structure which is associated with the I/O transaction.
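
Purely as a sketch of what "base/bounds instead of a page_t" could look
like: the structure, its field names and the helper below are
hypothetical, and I am assuming that "base" means a starting byte offset
and "bounds" a length in bytes. Nothing here is a real or proposed
VOP_IO() definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical I/O descriptor: a byte range, with no page_t in sight. */
typedef struct ex_io_txn {
	uint64_t	base;	/* starting byte offset in the file (assumed) */
	uint64_t	bounds;	/* length of the range in bytes (assumed) */
	void		*buf;	/* data buffer tied to this transaction */
} ex_io_txn_t;

/*
 * Number of filesystem blocks the range touches. The filesystem can work
 * this out from its own block size alone; PAGESIZE never appears.
 */
static uint64_t
ex_io_blocks_touched(const ex_io_txn_t *io, uint64_t bsize)
{
	uint64_t first, last;

	assert(io != NULL && bsize > 0);
	if (io->bounds == 0)
		return (0);
	first = io->base / bsize;
	last = (io->base + io->bounds - 1) / bsize;
	return (last - first + 1);
}
```

For example, a range with base 1000 and bounds 8192 touches two 8K
blocks, whatever the page size of the machine happens to be.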
What about committing the original solution first (excluding UFS (e.g.
make UFS module unloadable for now in a 64k kernel) and concentrating on
a prototype which can only boot via NFS) ?
You're welcome to try to implement partial-page reads and writes in the
VOP layer if you want. Myself, I wouldn't waste my time on it.
As you can see from the discussion thread I think there is a lot more than
throwing the code over the wall,
Yes... but at least the hungry wolves can chew on it for some time until
complaints come back... :-)
:)
because it implicitly assumes that their
approach was the best approach to solving the problem or was even tractable.
Ok... I'll try to sync with Holger Berger then on how to proceed...
Sounds good.
- Eric
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org