:Today I think that the SSI goal has become less important.
:The "cluster hype" has diminished and been partially
:replaced by the "cloud hype". Today, it is extremely
:important to have excellent SMP scalability. Multi-core
:systems are common, my desktop at home is a 6-core AMD
:Phenom II X6 which costs less than 200 Euros. You can
:buy x86 machines with 4 CPU sockets, 8 cores each, plus
:...
I think these are all good points. The evolution of SMP has been highly predictable, but perhaps what has not been quite that predictable is the enormous improvement in off-cpu interconnect bandwidth over the last few years as PCI-e has really pushed into all aspects of chip design. These days you can't even find an FPGA that doesn't have support for numerous serial gigabit links going off-chip. Serial has clearly won over parallel.

This enormous improvement is one of the many reasons why something like swapcache/SSD is not only viable on today's systems, but almost a necessity if one wishes to squeeze out every last drop of performance from a machine. It used to be that one could get a good chunk of cpu power in a consumer machine but not really be able to match servers on bus bandwidth. That simply is not the case any more. Now the cheap sweet-spot on the consumer curve has well over 50 GBits of off-chip bandwidth: 4-8 SATA ports running at 3-6 GBits each, plus another 24+ PCI-e lanes on top of that. It is getting busy out there.

The only real advantage a 'server' has now is memory interconnect, and even that is being whittled away now that one can stuff 16GB+ of ECC ram into a 4-slot consumer box. The next big improvement will probably wind up being ultra-high-speed memory interconnects. We would have it already if not for RamBus and their double-blasted patent lawsuits. And after that the links are going to start running at light frequencies, which Intel I think has already demonstrated to some degree.

--

I do think I have to modulate my SSI goal a bit. The cluster filesystem is still hot... I absolutely want something to replace NFSv3 that is fully cache coherent across all clients and servers, and it *IS NOT* NFSv4. Adding a quorum protocol capability on top of that to create distributed filesystem redundancy is also still in the cards.

The important point here is that I do not believe anything but a network abstraction can create the levels of reliability needed to have truly distributed redundancy for filesystems. The single-image abstraction that RAID provides isn't good enough because it doesn't deal with bugs in the filesystem code itself or the corruption from software bugs in today's complex kernels (in general).

Actual SSI might not be in the cards any more. It is virtually impossible to do it at the vnode, device, and process level without a complete rewrite of nearly the entire system. Sure, one can migrate whole VMs, but that is more a workaround and less a core solution.

However, with a fully cache coherent remote filesystem solution we can actually get very close to SSI-like operation. If a shared mmap of a file across a cache coherent remote mount is made possible, then process migration, or at least shared memory spaces across physical machines, can certainly be made possible too.

--

I'm a bit loath to extend the per-cpu globaldata concept beyond 64 cpus (for 64-bit builds) for numerous reasons, not the least of which being that the kernel per-cpu caches don't scale well when the memory:ncpus ratio drops too low. We know this is a problem already for per-thread caches in pthreads implementations for user programs, which is why our nmalloc implementation in libc tries to be very careful not to leave too much stuff sitting around in a per-thread cache that other threads can't get to. There is an opportunity here too.
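To make the per-thread cache point a little more concrete, here is a minimal sketch of the general technique (this is NOT the actual nmalloc code; the names thr_cache, depot, and the THR_CACHE_MAX cap are made up for illustration): each thread keeps a small, capped free list it can hit without locking, and anything beyond the cap is pushed back into a shared depot where other threads can get at it.

/*
 * Hypothetical sketch of a capped per-thread object cache.
 * Excess objects spill back to a shared, mutex-protected depot
 * instead of accumulating where other threads cannot reach them.
 */
#include <pthread.h>
#include <stddef.h>

#define THR_CACHE_MAX   32              /* cap on objects hoarded per thread */

struct obj {
        struct obj *next;
};

static struct {
        pthread_mutex_t lock;
        struct obj      *free_list;
} depot = { PTHREAD_MUTEX_INITIALIZER, NULL };

static __thread struct {
        struct obj      *free_list;
        int             count;
} thr_cache;

/*
 * Free path: keep the object locally only while under the cap,
 * otherwise hand it back to the shared depot.
 */
static void
obj_free(struct obj *o)
{
        if (thr_cache.count < THR_CACHE_MAX) {
                o->next = thr_cache.free_list;
                thr_cache.free_list = o;
                thr_cache.count++;
                return;
        }
        pthread_mutex_lock(&depot.lock);
        o->next = depot.free_list;
        depot.free_list = o;
        pthread_mutex_unlock(&depot.lock);
}

/*
 * Allocation path: prefer the lock-free per-thread cache, fall back
 * to the depot.  Returns NULL when both are empty; a real allocator
 * would go to its backing store at that point.
 */
static struct obj *
obj_alloc(void)
{
        struct obj *o = thr_cache.free_list;

        if (o != NULL) {
                thr_cache.free_list = o->next;
                thr_cache.count--;
                return (o);
        }
        pthread_mutex_lock(&depot.lock);
        o = depot.free_list;
        if (o != NULL)
                depot.free_list = o->next;
        pthread_mutex_unlock(&depot.lock);
        return (o);
}

The same memory:ncpus argument applies to kernel per-cpu caches: whatever sits in a per-cpu or per-thread free list is invisible to everyone else, so the more caches there are, the smaller each cap has to be.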
There is no real need to support more than 64 *KERNEL* threads on a massively hyperthreaded cpu when just 2 per actual cpu (judged by the L1 cache topology) will yield the same kernel performance. The solution is obvious to me... multiplex the N hyperthreads per real cpu into a single globaldata structure and interlock with a spinlock. Or, alternatively, have no more than two globaldata entities per real cpu (say, in a situation where one has 4 or 8 hyperthreads). Spinlocks contending only between the hyperthreads associated with the same cpu have an overhead cost of almost nothing.

Another possible solution is to not actually transition the extra hyperthreads into the kernel core but instead hold them at the userspace<->kernelspace border and have only dedicated kernel hyperthreads run the kernel core. The user threads would just stall at the edge until the real kernel thread tells them what to do. Advantages of this include not having to save/restore the register state and allowing the real kernel threads to run full floating point (and all related compiler optimizations).

So, lots of possibilities here.

-Matt

Matthew Dillon
<dil...@backplane.com>
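To illustrate the first of those options, here is a rough sketch (hypothetical names throughout; this is not the actual DragonFly globaldata code) of folding the N hyperthreads of a physical core onto one shared globaldata entry, interlocked by a spinlock that can only ever contend between siblings of that core:

/*
 * Hypothetical sketch: collapse HT_PER_CORE hardware threads onto a
 * single shared globaldata so the kernel never sees more than 64 of
 * them.  The spinlock only ever contends between hyperthreads of the
 * same physical core, so its expected cost is close to uncontended.
 */
#define HT_PER_CORE     4       /* siblings per real core, from L1 topology */

struct spinlock {
        volatile int    lock;
};

struct shared_gd {
        struct spinlock gd_sibling_spin;        /* siblings-only contention */
        int             gd_cpuid;               /* logical kernel cpu id, < 64 */
        /* per-"cpu" run queues, kernel memory caches, etc. go here */
};

static struct shared_gd gdarray[64];

static inline void
sibling_spin_lock(struct spinlock *spin)
{
        while (__sync_lock_test_and_set(&spin->lock, 1))
                __asm__ __volatile__("pause");
}

static inline void
sibling_spin_unlock(struct spinlock *spin)
{
        __sync_lock_release(&spin->lock);
}

/*
 * Map a hardware thread to its shared globaldata.  This assumes
 * sibling hyperthreads have consecutive ids; a real implementation
 * would derive the mapping from the detected cache topology.
 */
static inline struct shared_gd *
hwthread_to_gd(int hwthread_id)
{
        return (&gdarray[hwthread_id / HT_PER_CORE]);
}

Any code path touching the shared per-core state would bracket it with sibling_spin_lock()/sibling_spin_unlock(); since the only possible contenders already share the core's caches, the lock line ping-pongs nowhere and stays about as cheap as a spinlock can be.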