:Today I think that the SSI goal has become less important.
:The "cluster hype" has diminished and been partially
:replaced by the "cloud hype".  Today, it is extremely
:important to have excellent SMP scalability.  Multi-core
:systems are common, my desktop at home is a 6-core AMD
:Phenom II X6 which costs less than 200 Euros.  You can
:buy x86 machines with 4 CPU sockets, 8 cores each, plus
:...

    I think these are all good points.  The evolution of SMP has
    been highly predictable, but perhaps what has not been quite
    that predictable is the enormous improvement in off-cpu
    interconnect bandwidth over the last few years as PCI-e has
    really pushed into all aspects of chip design.  These days
    you can't even find an FPGA that doesn't have support for
    numerous serial gigabit links going off-chip.  Serial has clearly
    won over parallel.

    This enormous improvement is one of the many reasons why something
    like swapcache/SSD is not only viable on today's systems, but almost a
    necessity if one wishes to squeeze out every last drop of performance
    from a machine.

    It used to be that one could get a good chunk of cpu power
    in a consumer machine but not really be able to match servers
    on bus bandwidth.  That simply is not the case any more.  Now
    the cheap sweet-spot on the consumer curve has well over 50 GBits
    of off-chip bandwidth: 4-8 SATA ports running at 3-6 GBits each, plus
    another 24+ PCI-e lanes on top of that.  It is getting busy out
    there.  The only real advantage a 'server' has now is memory
    interconnect, and even that is being whittled away now that one can
    stuff 16GB+ of ECC ram into a 4-slot consumer box.
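
    (Back-of-the-envelope, just to put rough numbers on that, assuming
    SATA III ports and PCI-e 2.0 lanes:

        8 SATA ports   x 6 GBits/port           =  48 GBits
        24 PCI-e lanes x 5 GT/s, 8b/10b coded  ~=  96 GBits usable
                                                  ----------
                                                  ~144 GBits

    of raw off-chip I/O on a cheap consumer board, so "well over 50 GBits"
    is if anything conservative.)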

    The next big improvement will probably wind up being ultra-high-speed
    memory interconnects.  We would have them already if not for Rambus and
    their double-blasted patent lawsuits.  And after that the links are
    going to start running at light frequencies, which I think Intel has
    already demonstrated to some degree.

    --

    I do think I have to modulate my SSI goal a bit.  The cluster
    filesystem is still hot... I absolutely want something to replace NFSv3
    that is fully cache coherent across all clients and servers, and it
    *IS NOT* NFSv4.  Adding a quorum protocol capability on top of that
    to create distributed filesystem redundancy is also still in the cards.
    The important point here is that I do not believe anything but a network
    abstraction can create the levels of reliability needed to have truly
    distributed redundancy for filesystems.
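
    To make the quorum idea concrete (a minimal sketch with invented names,
    not the actual protocol; the only real content here is the majority rule):

        /*
         * Majority-quorum rule, purely illustrative: a write only counts
         * as durable once more than half of the replicas have acked it,
         * which guarantees that any two quorums overlap in at least one
         * replica.
         */
        struct replica_set {
            int nreplicas;      /* total replicas backing the filesystem */
            int nacked;         /* replicas that have acked this write   */
        };

        static int
        write_is_durable(const struct replica_set *rs)
        {
            return (rs->nacked * 2 > rs->nreplicas);
        }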

    The single-image abstraction that RAID provides isn't good enough
    because it doesn't deal with bugs in the filesystem code itself or
    with corruption from software bugs in today's complex kernels (in
    general).

    Actual SSI might not be in the cards any more.  It is virtually
    impossible to do it at the vnode, device, and process level without
    a complete rewrite of nearly the entire system.  Sure, one can migrate
    whole VMs, but that is more a workaround than a core solution.

    However, with a fully cache coherent remote filesystem solution we
    can actually get very close to SSI-like operation.  If a shared mmap
    of a file across a cache coherent remote mount is made possible, then
    process migration, or at least shared memory spaces across physical
    machines, can certainly be made possible too.
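
    For instance, something along these lines (a userland sketch only; the
    /remote path is made up, and the coherency of the MAP_SHARED mapping is
    exactly the property the remote filesystem would have to provide):

        /*
         * Two processes on two different machines mmap the same file from
         * a (hypothetical) cache coherent remote mount.  With real
         * coherency the MAP_SHARED mapping behaves like cross-machine
         * shared memory.
         */
        #include <sys/mman.h>
        #include <fcntl.h>
        #include <stddef.h>
        #include <unistd.h>

        int
        main(void)
        {
            int fd = open("/remote/shared.dat", O_RDWR);  /* made-up path */
            if (fd < 0)
                return 1;

            size_t len = 4096;
            volatile int *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, 0);
            if (shared == MAP_FAILED) {
                close(fd);
                return 1;
            }

            /*
             * On machine A this store must become visible to machine B's
             * mapping of the same file.  NFSv3 does not give you that; a
             * fully cache coherent mount would.
             */
            shared[0] = 42;

            munmap((void *)shared, len);
            close(fd);
            return 0;
        }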

    --

    I'm a bit loath to extend the per-cpu globaldata concept beyond 64
    cpus (for 64-bit builds) for numerous reasons, not the least of which
    is that the kernel per-cpu caches don't scale well when the
    memory:ncpus ratio drops too low.  We know this is a problem already
    for per-thread caches in pthreads implementations for user programs,
    which is why our nmalloc implementation in libc tries to be very
    careful to not leave too much stuff sitting around in a per-thread
    cache that other threads can't get to.
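
    Roughly the shape of the user-level fix (an illustrative sketch only,
    not the actual nmalloc code; the cap and every name here are invented):

        /*
         * Sketch of a capped per-thread free cache: once a thread is
         * holding PER_THREAD_MAX free objects, further frees go straight
         * back to a global depot so memory never gets stranded in a cache
         * that other threads cannot reach.
         */
        #include <pthread.h>

        #define PER_THREAD_MAX  64

        struct obj {
            struct obj *next;
        };

        static struct obj *depot_head;
        static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;

        static void
        depot_push(struct obj *o)
        {
            pthread_mutex_lock(&depot_lock);
            o->next = depot_head;
            depot_head = o;
            pthread_mutex_unlock(&depot_lock);
        }

        struct thr_cache {
            struct obj *free_list;  /* per-thread, accessed without locks */
            int         count;
        };

        static void
        thr_cache_free(struct thr_cache *tc, struct obj *o)
        {
            if (tc->count >= PER_THREAD_MAX) {
                /* cache is full; don't strand it, give it back globally */
                depot_push(o);
                return;
            }
            o->next = tc->free_list;
            tc->free_list = o;
            tc->count++;
        }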

    There is an opportunity here too.  There is no real need to support
    more than 64 *KERNEL* threads on a massively hyperthreaded cpu when
    just 2 per actual cpu (judged by the L1 cache topology) will yield
    the same kernel performance. 

    The solution is obvious to me... multiplex the N hyperthreads per
    real cpu into a single globaldata structure and interlock with a
    spinlock.  Or, alternatively, have no more than two globaldata entities
    per real cpu (say in a situation where one has 4 or 8 hyperthreads
    per core).  Spinlocks that contend only between the hyperthreads
    associated with the same cpu cost almost nothing.
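
    In rough outline (a sketch only; the names mimic the globaldata idea
    but nothing here is real kernel code, and the contiguous-sibling
    numbering is an assumption real code would replace with the L1 cache
    topology):

        /*
         * Sketch: all hyperthreads of one physical core share a single
         * "globaldata" guarded by a spinlock.  Only siblings on the same
         * core ever touch that lock, so contention is essentially free.
         */
        #include <stdatomic.h>

        #define MAXCORES        64  /* hard cap on per-core structures    */
        #define HT_PER_CORE     4   /* hyperthreads sharing one structure */

        struct gdata {
            atomic_int  lock;       /* 0 = free, 1 = held                 */
            long        free_pages; /* example of per-core cached state   */
        };

        static struct gdata gdata_array[MAXCORES];

        /*
         * Map a hardware thread onto its core's shared globaldata.  This
         * assumes siblings are numbered contiguously; real code would
         * derive the mapping from the L1 cache topology instead.
         */
        static struct gdata *
        mygdata(int hw_thread_id)
        {
            return (&gdata_array[(hw_thread_id / HT_PER_CORE) % MAXCORES]);
        }

        static void
        gdata_lock(struct gdata *gd)
        {
            while (atomic_exchange_explicit(&gd->lock, 1,
                                            memory_order_acquire))
                ;   /* spin: only core siblings ever get here */
        }

        static void
        gdata_unlock(struct gdata *gd)
        {
            atomic_store_explicit(&gd->lock, 0, memory_order_release);
        }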

    Another possible solution is to not actually transition the extra
    hyperthreads into the kernel core but instead hold them at the
    userspace<->kernelspace border and have only dedicated kernel
    hyperthreads run the kernel core.  The user threads would just stall
    at the edge until the real kernel thread tells them what to do.
    Advantages of this include not having to save/restore the register
    state and allowing the real kernel threads to run full floating point
    (and all related compiler optimizations).
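
    Very roughly, the handoff could look like this (purely a sketch of the
    idea with invented names; the spin-waits stand in for whatever
    monitor/halt mechanism the hardware actually provides):

        /*
         * Sketch of the border handoff: the extra hyperthread never enters
         * the kernel core.  It publishes its request in a mailbox and
         * waits until a dedicated kernel hyperthread has executed it.
         */
        #include <stdatomic.h>

        struct border_req {
            atomic_int  state;      /* 0 = empty, 1 = posted, 2 = done */
            int         sysno;      /* syscall number                  */
            long        args[6];    /* syscall arguments               */
            long        result;     /* filled in by the kernel thread  */
        };

        /* user-facing hyperthread: park at the border, no reg save/restore */
        static long
        border_syscall(struct border_req *req, int sysno, const long args[6])
        {
            for (int i = 0; i < 6; i++)
                req->args[i] = args[i];
            req->sysno = sysno;
            atomic_store_explicit(&req->state, 1, memory_order_release);

            /* stall at the edge until the real kernel thread is done */
            while (atomic_load_explicit(&req->state,
                                        memory_order_acquire) != 2)
                ;   /* monitor/mwait or halt in a real implementation */
            return req->result;
        }

        /* dedicated kernel hyperthread: runs the kernel core, full FP ok */
        static void
        kernel_service_loop(struct border_req *req,
                            long (*dispatch)(int, long *))
        {
            for (;;) {
                while (atomic_load_explicit(&req->state,
                                            memory_order_acquire) != 1)
                    ;
                req->result = dispatch(req->sysno, req->args);
                atomic_store_explicit(&req->state, 2, memory_order_release);
            }
        }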

    So, lots of possibilities abound here.

                                        -Matt
                                        Matthew Dillon 
                                        <dil...@backplane.com>
