On 19 Feb 2007 13:04:12 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:
> LRU tends to be hell for caches in MP systems, because it writes to the cache lines too and makes them exclusive and more expensive.
That's why you let the hardware worry about LRU. You don't write to the upper layers of the splay tree when you don't have to. It's the mere traversal of the upper layers that keeps them in cache, causing the cache hierarchy to mimic the data structure hierarchy.

RCU changes the whole game, of course, because you don't write to the old copy at all; you have to clone the altered node and all of its ancestors and swap out the root node itself under a spinlock. Except you don't use a spinlock; you have a ring buffer of root nodes and atomically increment the writer index. That atomically incremented index is the only thing on which there's any write contention. (Obviously you need a completion flag on the new root node for the next writer to poll on, so the sequence is: atomic-increment ... copy and alter from leaf to root ... wmb() ... mark new root complete.)

When you share TCP sessions among CPUs, and packets associated with the same session may hit softirq on any CPU, you are going to eat a lot of interconnect bandwidth keeping the sessions coherent. (The only way out of this is to partition the tuple space by CPU at the NIC layer, with separate per-core, or perhaps per-cache, receive queues; at which point the NIC is so smart that you might as well put the DDoS handling there.) But at least it's cache coherency protocol bandwidth and not bandwidth to and from DRAM, which has much nastier latencies.

The only reason the data structure matters _at_all_ is that DDoS attacks threaten to evict the working set of real sessions from cache. That's why you add new sessions at the leaves and don't rotate them up until they're hit a second time. Of course the leaf layer can't be RCU, but it doesn't have to be; it's just a bucket of tuples. You need an auxiliary structure to hold the session handshake trackers for the leaf layer, but you assume that you're always hitting cold cache when diving into that structure and ration accesses accordingly. Maybe you even explicitly evict entries from cache after sending the SYNACK, so they don't crowd other stuff out; they go to DRAM and get pulled into the new CPU (and rotated up) if and when the next packet in the session arrives. (I'm assuming T/TCP here, so you can't skimp much on session tracker size during the handshake.)

Every software firewall I've yet seen falls over under DDoS. If you want to change that, you're going to need more than the back-of-the-napkin calculations showing that session lookup bandwidth exceeds frame throughput for min-size packets. You're going to need to strategize around exploiting the cache hierarchy already present in your commodity processor to implicitly partition real traffic from the DDoS storm. It's not a trivial problem, even in the mathematician's sense (in which all problems are either trivial or unsolved).

Cheers,
- Michael
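P.S. To make the root-swap sequence concrete, here is a minimal userspace sketch of that writer/reader protocol, assuming C11 atomics (the release fence stands in for wmb()). All of the names (root_ring, publish_insert, and so on) are made up for illustration, a plain path-copied binary search tree stands in for the splay tree, and the two genuinely hard parts are elided: reclaiming old copies and recycling ring slots after a full wrap both need RCU-style grace periods.

/*
 * Sketch only: leaks old copies, and assumes the ring never wraps while
 * a reader still holds a stale root.  Real reclamation needs grace periods.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define RING_SLOTS 64                       /* power of two: cheap masking */

struct node {
    unsigned long key;
    struct node *left, *right;
};

struct root_slot {
    struct node *_Atomic root;              /* root of one published copy */
    atomic_bool complete;                   /* set only after the fence   */
};

static struct root_slot root_ring[RING_SLOTS] = {
    [0] = { .root = NULL, .complete = true },   /* empty initial tree */
};
static atomic_uint writer_idx;              /* the only contended write */

/* Path-copying insert: clone the search path, share untouched subtrees. */
static struct node *insert_copy(const struct node *old, unsigned long key)
{
    struct node *n = malloc(sizeof(*n));
    if (!old) {
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    *n = *old;                              /* untouched subtree is shared */
    if (key < old->key)
        n->left = insert_copy(old->left, key);
    else if (key > old->key)
        n->right = insert_copy(old->right, key);
    return n;
}

/* Writer: atomic-increment, copy and alter from leaf to root, fence,
 * mark new root complete. */
static void publish_insert(unsigned long key)
{
    unsigned my = atomic_fetch_add(&writer_idx, 1) + 1;
    struct root_slot *prev = &root_ring[(my - 1) & (RING_SLOTS - 1)];
    struct root_slot *mine = &root_ring[my & (RING_SLOTS - 1)];

    /* Poll the previous writer's completion flag so our copy starts
     * from a fully published root. */
    while (!atomic_load_explicit(&prev->complete, memory_order_acquire))
        ;

    atomic_store_explicit(&mine->complete, false, memory_order_relaxed);
    atomic_store_explicit(&mine->root,
                          insert_copy(atomic_load_explicit(&prev->root,
                                          memory_order_relaxed), key),
                          memory_order_relaxed);

    /* wmb(): the cloned path must be visible before the flag is. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mine->complete, true, memory_order_relaxed);
}

/* Reader: walk back from the writer index to the newest complete root.
 * It never stores anything, so the hot upper levels of the published
 * tree stay in shared state in everybody's cache. */
static const struct node *current_root(void)
{
    unsigned idx = atomic_load_explicit(&writer_idx, memory_order_acquire);

    for (unsigned i = 0; i < RING_SLOTS; i++) {
        struct root_slot *s = &root_ring[(idx - i) & (RING_SLOTS - 1)];
        if (atomic_load_explicit(&s->complete, memory_order_acquire))
            return atomic_load_explicit(&s->root, memory_order_relaxed);
    }
    return NULL;
}

Readers pay only the acquire loads on the index and the flag; all the write traffic lands on writer_idx and the single slot being filled, which was the whole point.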