Here's an update on the syslink work. After much thought on how best to approach the problem of accessing machine resources over a remote link, I finally realized that building a clustered operating system requires sharing far more than just VM objects, processes, and devices.
For all intents and purposes it requires sharing almost every type of resource an operating system can have. Here's a short list:

    VM spaces
    VM objects
    VM pages
    processes
    lwps
    vnodes
    inodes (e.g. for clustered FS support)
    sockets
    file descriptor tables
    file descriptors
    devices
    labeled disks (logical abstractions using the 'label' field)
    creds
    file buffers
    file BIOs
    ... and probably many other things ...

When I first contemplated doing this I came up with SYSLINK, a message-based protocol that can devolve down into almost direct procedure calls when two localized resources talk to each other. I am still on track with the SYSLINK concept, but as I continued to develop it I hit a snag, and I think I finally have a solution to it.

The snag is this: in order to transport requests across a machine boundary (that is, outside the domain of a direct memory access), it is necessary to assign a unique identifier to the resource. The easiest way to think about this is to consider something like NFS. Accessing a file over NFS requires a name lookup which translates into an identifier that represents the inode. The NFS client can simply cache the identifier without having to know much about the complex resource the identifier represents, other than that it is a 'file'.

In order to do this with SYSLINK I was, up until today, contemplating reworking all the major system structures so they would use a 'syslink compatible' API. That would mean changing DEVOPS, VOPS, file descriptor access routines, and so on and so forth. I didn't quite realize that there were over a dozen (maybe even two dozen) different structures that would need to be redone. Well, reworking two dozen structures is out of the question. I'd like to get this done before I start looking like Rumpelstiltskin! Hence the hair pulling.

--

Today I came up with something that IS possible to do in a more reasonable time frame. I'm kinda kicking myself for not thinking of it sooner. Instead of reworking all the APIs I am going to rework JUST the reference counting methodology used in these resource structures.

Right now all the resource structures roll their own ref counting mechanisms. That's all going to be replaced with a common ref counting API and a little structure that includes a 64 bit unique sysid, a red-black tree node, the ref count, and a pointer to a resource type structure (e.g. identifying it as a vnode, vm object, or whatever).

When any of the above resources are allocated, they will be indexed in a red-black tree. In other words it will be possible to identify every single resource in the system by traversing the red-black tree, which means it will be possible to look up ANY resource in the system by its sysid using a red-black tree lookup!

I am going to implement a per-cpu red-black tree and use critical sections to control access to it. All resources will be registered when they are allocated, and deregistered when they are released. Use of a per-cpu RB tree means no lock contention on the RB tree itself, and cross-cpu releases will just use a passive IPI, which costs us almost nothing. The ref count field will be bus-locked or protected by a spinlock, but I don't expect that to create a contention issue.

What does that mean for SYSLINK? It means that all of a system's resources will now become addressable via a 64 bit id and thus will be suitably represented in any remote protocol. The 64 bit sysids will be unique, and the allocation mechanism is very simple: each cpu initializes a 64 bit sysid to a shifted timestamp on boot and then increments it by <ncpus> to 'allocate' a sysid. You can't get much simpler than that.
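To make the concept concrete, here is a minimal, self-contained C sketch of what such a facility might look like. It is not the actual implementation: every name in it (struct sysreg, sysreg_class, sysid_alloc(), sysreg_register(), and so on) is hypothetical, the timestamp shift is a guess, the per-cpu trees are modeled with a plain array, and the critical sections and passive-IPI forwarding a real kernel version would need are left out. <sys/tree.h> provides the standard BSD red-black tree macros.

    #include <sys/tree.h>
    #include <stdint.h>
    #include <time.h>

    #define NCPUS   4                    /* stand-in for the kernel's ncpus */

    struct sysreg_class {                /* resource type descriptor */
        const char *name;                /* e.g. "vnode", "vmobject", ... */
        void (*release)(void *hdr);      /* called when the last ref drops */
    };

    struct sysreg {                      /* header embedded in each resource */
        RB_ENTRY(sysreg) rb_node;        /* per-cpu red-black tree linkage */
        uint64_t sysid;                  /* unique 64 bit id */
        int refcnt;                      /* common reference count */
        int cpuid;                       /* which per-cpu tree owns it */
        const struct sysreg_class *type; /* what kind of resource this is */
    };

    static int
    sysreg_cmp(struct sysreg *a, struct sysreg *b)
    {
        if (a->sysid < b->sysid)
            return (-1);
        if (a->sysid > b->sysid)
            return (1);
        return (0);
    }

    RB_HEAD(sysreg_tree, sysreg);
    RB_PROTOTYPE(sysreg_tree, sysreg, rb_node, sysreg_cmp);
    RB_GENERATE(sysreg_tree, sysreg, rb_node, sysreg_cmp);

    /* one tree and one sysid counter per cpu, so no shared locking is needed */
    static struct sysreg_tree sysreg_root[NCPUS];
    static uint64_t sysid_next[NCPUS];

    /* seed each cpu's counter from a shifted boot timestamp, offset by cpu id */
    void
    sysreg_init(void)
    {
        uint64_t base = (uint64_t)time(NULL) << 23;  /* shift amount is a guess */

        for (int cpu = 0; cpu < NCPUS; ++cpu) {
            RB_INIT(&sysreg_root[cpu]);
            sysid_next[cpu] = base + cpu;
        }
    }

    /* allocate a unique sysid: bump the per-cpu counter by ncpus */
    uint64_t
    sysid_alloc(int cpuid)
    {
        uint64_t id = sysid_next[cpuid];
        sysid_next[cpuid] += NCPUS;
        return (id);
    }

    /* register a freshly allocated resource: assign a sysid and index it */
    void
    sysreg_register(struct sysreg *sr, int cpuid, const struct sysreg_class *cls)
    {
        sr->sysid = sysid_alloc(cpuid);
        sr->refcnt = 1;
        sr->cpuid = cpuid;
        sr->type = cls;
        RB_INSERT(sysreg_tree, &sysreg_root[cpuid], sr);
    }

    /* look any resource up by sysid (this sketch just scans each cpu's tree) */
    struct sysreg *
    sysreg_lookup(uint64_t sysid)
    {
        struct sysreg key = { .sysid = sysid };

        for (int cpu = 0; cpu < NCPUS; ++cpu) {
            struct sysreg *sr = RB_FIND(sysreg_tree, &sysreg_root[cpu], &key);
            if (sr != NULL)
                return (sr);
        }
        return (NULL);
    }

    /*
     * Drop a reference; the last drop deregisters and releases the resource.
     * In the kernel, a release on the wrong cpu would be forwarded to the
     * owning cpu with a passive IPI instead of touching its tree directly.
     */
    void
    sysreg_unref(struct sysreg *sr)
    {
        if (--sr->refcnt == 0) {
            RB_REMOVE(sysreg_tree, &sysreg_root[sr->cpuid], sr);
            sr->type->release(sr);
        }
    }

A real version would presumably avoid the lookup scan by routing straight to the owning cpu's tree, and, as described above, wrap tree access in critical sections and forward cross-cpu releases with a passive IPI.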
I don't think it would be possible to overflow a 64 bit counter, even incrementing by ncpus, without at least a hundred years of uptime, and I'm just not worried about a hundred years of uptime for a single host. The uniqueness means that remote accesses will never hit the wrong resource, because a sysid is never reused. The sysid can therefore represent a stable resource from the point of view of any remote accessor, and the remote accessor can be told if/when it goes away.

And that, folks, gives us the building blocks we need to represent resources in a cluster. This also means I don't have to rewrite the APIs. Instead I can simply write new RPC APIs for accesses made via syslink ids and, poof, now all of a system's resources will become accessible remotely, with only modest effort.

So this will be the next step for me: implementing the global registration, reference counting, and allocation and disposal API. I'm gonna call it 'sysreg', and the first commits are going to occur in the next few days because I don't expect it to be very difficult to implement. I'm pretty excited.

* Implement sysreg
* Start converting structure refcount & allocation APIs to the sysreg API.
* Build a local syslink VFS and DEV interface
* Build a remote VFS and DEV interface via TCP (like NFS)
* Continue working on the things needed for clustering, like the syslink mesh, packetized messaging protocols, and so on ...

-Matt
Matthew Dillon <[EMAIL PROTECTED]>