Re: [PATCH 1/2] pcnet32: only allocate init_block dma consistent
On 3/7/07, Ralf Baechle <[EMAIL PROTECTED]> wrote:
> GFP_* flags have no influence on caching or prefetching.

The zone modifier flags (like GFP_DMA) can in principle affect the cache/prefetch policy, since they affect what physical address range the memory is allocated from. I don't know whether Linux uses this effect intentionally, but I would not be the least bit surprised if virtual machines commonly treat the traditional 16MB of low physical memory differently from the rest of the address space.

Looking at the i386 implementation of dma_alloc_coherent (underlying pci_alloc_consistent), it's not clear to me how this interferes with cacheability of the allocated block. Maybe it doesn't. The alpha version of pci_alloc_consistent, on the other hand, does interesting things to make the memory visible to PCI.

Don, what arches did you have in mind when you commented on caching issues?

Cheers,
- Michael
Re: [PATCH 1/2] pcnet32: only allocate init_block dma consistent
On 3/6/07, Ralf Baechle <[EMAIL PROTECTED]> wrote:
> Price question: why would this patch make a difference under VMware? :-)

Moving the struct pcnet32_private from the GFP_DMA32 init_block to the GFP_KERNEL netdev allocation may be a win even on systems where GFP_DMA32 memory is normally cached, because the private area will get read ahead into cache when the netdev is touched. (This could be a bigger win if the most often accessed members were moved to the beginning of the pcnet32_private struct.)

On the other hand, VMware may engage in some sort of sleight of hand to keep the low 16MB or more of the VM's memory contiguously allocated and warm in the real MMU (locked hugepage TLB entries? I'm speculating). So having allocated the private area as part of a DMA-able page may have silently spared you a page fault on access.

On the third hand, the new layout will rarely be a problem if the whole netdev (including the private area) fits in one page, since if you were going to take a page fault you took it when you looked into the netdev. So it's hard to see how it could cause a performance regression unless VMware loses its timeslice (and the TLB entry for the page containing the netdev) in the middle of pcnet32_rx, etc.

Lennart is of course right that most VMware VMs are using vmxnet instead, but they're also using distro kernels. :-) I find VMware useful for certain kinds of kernel experimentation, and don't want to fool with vmxnet every time I flip a kernel config switch. Linus kernels run just fine on VMware Workstation using piix, mptspi, and pcnet32 (I'm running vanilla 2.6.20.1 right now). I would think that changes to those drivers should be regression tested under VMware, and I'm happy to help.

Cheers,
- Michael
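For readers coming into the thread cold, this is roughly the shape of the split being discussed. It is a sketch only -- the struct members, the init block layout, and the probe function below are illustrative stand-ins, not the real pcnet32 source:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/pci.h>

/* Placeholder for the chip-visible init block layout. */
struct pcnet32_init_block { u32 words[8]; };

/* Illustrative subset of the driver-private state. */
struct pcnet32_private {
	struct pcnet32_init_block *init_block;	/* DMA-coherent, the chip reads it */
	dma_addr_t init_dma_addr;		/* bus address of the init block */
	/* ... hot per-packet bookkeeping stays here, ordinarily cacheable ... */
};

static int example_probe(struct pci_dev *pdev)
{
	struct net_device *dev;
	struct pcnet32_private *lp;

	/* Private area rides along with the netdev: plain GFP_KERNEL memory. */
	dev = alloc_etherdev(sizeof(*lp));
	if (!dev)
		return -ENOMEM;
	lp = netdev_priv(dev);

	/* Only the small init block the chip actually DMAs from needs to
	 * come from the consistent-DMA allocator. */
	lp->init_block = pci_alloc_consistent(pdev, sizeof(*lp->init_block),
					      &lp->init_dma_addr);
	if (!lp->init_block) {
		free_netdev(dev);
		return -ENOMEM;
	}
	/* ... */
	return 0;
}

Whether the private area then behaves better or worse cache-wise under a hypervisor is exactly the question raised above.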
Re: [PATCH 1/2] pcnet32: only allocate init_block dma consistent
On 3/6/07, Ralf Baechle <[EMAIL PROTECTED]> wrote:
> This small change btw. delivers about ~ 3% extra performance on a very slow test system.

Has this change been tested/benchmarked under VMware? pcnet32 is the (default?) virtual device presented by VMware Workstation, and that's probably a large fraction of its use in the field these days. But then Don probably already knows that. :-)

Cheers,
- Michael
Re: Extensible hashing and RCU
On 3/4/07, David Miller <[EMAIL PROTECTED]> wrote:
> From: "Michael K. Edwards" <[EMAIL PROTECTED]>
>
> > Before I implement, I design. Before I design, I analyze. Before I
> > analyze, I prototype. Before I prototype, I gather requirements.
>
> How the heck do you ever get to writing ANYTHING if you work that way?

Oh, c'mon, the whole design process takes maybe 3 weeks for something I'm going to spend 3 months implementing. And half the time the client mistakes the prototype for the implementation and tries to run with it -- with the foreseeable consequences. Smaller projects are a lot less formal but the principle still holds: every time I've implemented without actually doing the homework leading up to a design, I've had cause to regret it. That goes double for problems that already have off-the-shelf solutions and only need improvement in ease of use and robustness in hostile environments.

> I certainly would never have written one single line of Linux kernel
> code if I had to go through that kind of sequence to actually get to
> writing code.

Lots of _code_ gets written as a side effect of each stage of that sequence. But none of that code should be mistaken for the _implementation_. Not when the implementation is going to ship inside hundreds of thousands of widgets that people are going to stick in their ears, or is going to monitor the integrity of an intercontinental telecoms network. Most shipping code, including most code that I have written and large swathes of the Linux kernel, has never really gone beyond the prototype stage. There's no shame in that; the shame would be in refusing to admit it.

> And that's definitely not the "Linux way". You code up ideas as soon as
> you come up with one that has a chance of working, and you see what
> happens. Sure, you'll throw a lot away, but at least you will "know"
> instead of "think".

I call that process "prototyping". I used to do a lot of it; but I have since found that thinking first is actually more efficient. There is very little point in prototyping the square wheel again and again and again. And especially given that Linux already has plenty of nice, round wheels, there aren't that many places left where you can impress the world by replacing a square wheel with a hexagonal one. Replacing wheels with maglev tracks requires a real design phase.

> You have to try things, "DO" stuff, not just sit around and theorize and
> design things and shoot down ideas on every negative minute detail you
> can come up with before you type any code in. That mode of development
> doesn't inspire people and get a lot of code written.

I'll cop to the "negative" part of this, at least to some degree, but not the rest. "Naive hashes DDoS easily" is not a minute detail. "Hash tables don't provide the operations characteristic of priority queues" is not a minute detail. "The benchmarks offered do not accurately model system impact" is not a minute detail. If I were not sufficiently familiar with some of Evgeniy's other contributions to the kernel to think him capable of responding to these critiques with a better design, I would not have bothered.

> I definitely do not think others should use this
> design/prototype/analyze/blah/balh way of developing as an example,
> instead I think folks should use people like Ingo Molnar as an example
> of a good Linux developer.
> People like Ingo rewrite the scheduler one night because of a tiny cool
> idea, and even if only 1 out of 10 hacks like that turn out to be
> useful, his work is invaluable and since he's actually trying to do
> things and writing lots of code this inspires other people.

Ingo Molnar is certainly a good, nay an excellent, Linux developer. His prototypes are brilliant even when they're under-designed by my way of thinking. By all means, readers should go thou and do likewise. And when someone presses an alternative design, "show me the code" is a fair response.

At present, I am not offering code, nor design, nor even a prototype. But I am starting to figure out which bits of the Linux kernel really are implementations based on a solid design. Any prototyping I wind up doing is going to be bolted firmly onto those implementations, not floating in the surrounding sea of prototype code; and any analysis I offer based on that prototype will reflect, to the best of my ability, an understanding of its weaknesses as well as its strengths. If that analysis passes community scrutiny, and if I remain interested in the project, perhaps I will go through with design and implementation as well.

This, incidentally, seems very similar to the process that Robert Olsson and Stefan Nilsson have gone through with their trie/hash project. Although I haven't tried it out yet and don't have any basis for an independe
Re: Extensible hashing and RCU
On 3/3/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> Btw, you could try to implement something you have written above to
> show its merits, so that it would not be an empty words :)

Before I implement, I design. Before I design, I analyze. Before I analyze, I prototype. Before I prototype, I gather requirements. Before I gather requirements, I think -- and the only way I know how to think about technical matters is to write down my intuitions and compare them against the sea of published research on the topic. I'm only partway through thinking about RCU and DDoS, especially as this is on the fringe of my professional expertise and the appropriate literature is not currently at my fingertips.

The only times that I make exceptions to the above sequence are (1) when someone is paying me well to do so (usually to retrofit some kind of sanity onto a pile of crap someone else wrote) and (2) when I really feel like it. At present neither exception applies here, although I may yet get so het up about threadlets that I go into a coding binge (which may or may not produce an RCU splay tree as a side effect). I wouldn't hold my breath if I were you, though; it's the first of what promises to be a string of fine weekends, and if I binge on anything this spring it's likely to be gardening.

Cheers,
- Michael
Re: Extensible hashing and RCU
On 3/2/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> Thank you for this report. (Still avoiding cache misses studies, while
> they obviously are the limiting factor)

1) The entire point of going to a tree-like structure would be to allow the leaves to age out of cache (or even forcibly evict them) when the structure bloats (generally under DDoS attack), on the theory that most of them are bogus and won't be referenced again. It's not about the speed of the data structure -- it's about managing its impact on the rest of the system.

2) The other entire point of going to a tree-like structure is that trees are drastically simpler to RCU than hashes, and more generally they don't involve individual atomic operations (RCU reaping passes, resizing, etc.) that cause big latency hiccups and evict a bunch of other stuff from cache.

3) The third entire point of going to a tree-like structure is to have a richer set of efficient operations, since you can give the nodes a second "priority"-type index and have "pluck-highest-priority-item", three-sided search, and bulk delete operations. These aren't that much harder to RCU than the basic modify-existing-node operation.

Now can we give these idiotic micro-benchmarks a rest until Robert's implementation is tuned and ready for stress-testing?

Cheers,
- Michael
Re: Extensible hashing and RCU
On 22 Feb 2007 18:49:00 -0500, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> The rehash-every-10-minutes detail is theoretically unnecessary, but
> does cover the case where a would-be attacker *does* get a chance to
> look at a machine, such as by using routing delays to measure the
> effectiveness of a collision attempt.

AOL -- and that's one of the reasons why RCUing hashes is a nightmare in practice, if you care about smooth peristalsis. Again, the resizing headache is just the canary in the coal mine.

> If you want to test your distribution for randomness, go have a look at
> Knuth volume 2 (seminumerical algorithms) and see the discussion of the
> Kolmogorov-Smirnov test. Some lumpiness is *expected* in a truly random
> distribution.

Ah, Kolmogorov-Smirnov, the astronomer's friend. Somewhere on a DAT in my garage lies a rather nice implementation of K-S tests on censored data for the Mirella/Mirage Forth image analysis package. If you want something industrial-strength, easy to use, and pretty besides, I recommend SM (http://www.astro.princeton.edu/~rhl/sm/), which will cost you $500 U.S. for a department-scale license. SM so utterly exceeds the capabilities of gnuplot it isn't even funny. But then, while you don't always get what you pay for, you rarely fail to pay for what you get, sooner or later, in one currency or another.

Cheers,
- Michael
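For anyone who wants to play along without Knuth or SM handy, a bare-bones version of that check could look like the sketch below. my_hash() is a made-up stand-in for whatever function you are actually testing, and the reference value is only the usual large-N rule of thumb:

#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the function under test; substitute the real thing and
 * feed it the keys (tuples) you actually care about. */
static uint32_t my_hash(uint32_t key)
{
	return key * 2654435761u;	/* multiplicative mix, demo only */
}

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

/* One-sample K-S statistic of u[0..n-1] against the uniform CDF on [0,1). */
static double ks_statistic(double *u, size_t n)
{
	double d = 0.0;
	size_t i;

	qsort(u, n, sizeof(*u), cmp_double);
	for (i = 0; i < n; i++) {
		double below = fabs(u[i] - (double)i / n);
		double above = fabs((double)(i + 1) / n - u[i]);
		if (below > d)
			d = below;
		if (above > d)
			d = above;
	}
	return d;
}

int main(void)
{
	enum { N = 100000 };
	static double u[N];
	size_t i;

	for (i = 0; i < N; i++)
		u[i] = my_hash((uint32_t)i) / 4294967296.0;	/* scale to [0,1) */

	/* For large N, D well above ~1.36/sqrt(N) is bad at the 5% level;
	 * D far below it means the output is suspiciously regular, which
	 * Knuth would also tell you to worry about. */
	printf("D = %g, reference 1.36/sqrt(N) = %g\n",
	       ks_statistic(u, N), 1.36 / sqrt((double)N));
	return 0;
}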
Re: Extensible hashing and RCU
On 2/22/07, David Miller <[EMAIL PROTECTED]> wrote:
> From: "Michael K. Edwards" <[EMAIL PROTECTED]>
> Date: Wed, 21 Feb 2007 13:15:51 -0800
>
> > it had better be a 2-left hash with a compact
> > overflow pool for the rare case of second collision.
>
> Just like Evgeniy, you should do your research too :-

Fair enough. :-)

> The gain for multiple-hash schemes, such as 2-left, exists only if you
> can parallelize the N lookups in hardware. This is an assumption built
> into all the papers you will read on this subject, including the one
> you reference by Mitzenmacher.

Er, no. The principal gain for the 2-left hash is that the only time you ever look at the right hash is when you have a collision in the left. The right hash is therefore substantially less occupied and the odds of second collision are lower. If you happen to be able to prefetch the left and right buckets in parallel, that's gravy.

Roughly speaking, for a single hash, P(0) = (1 - 1/k)^n \approx e^{-n/k}, where P(0) is the probability that a given bucket has 0 entries, k is the number of buckets, and n is the number of elements. One entry in the bucket replaces a factor (1 - 1/k) with a factor n/k, the second entry replaces another (1 - 1/k) with (n-1)/2k, and so forth. It's the good old birthday problem.

Take the simple example of an equal number of items and of one-entry buckets, and compare the simple and 2-left hashes. When n/k \approx 1, you have about 37% empty buckets, but when n/k \approx 2 you only have about 13% empty buckets. On the other hand, when n/k \approx 1 and the buckets only hold one entry, about 37% of the items overflow; when n/k \approx 2, about 57% do. Result: a 2-left hash with exactly as many 1-item buckets as there are items will only see about 20% of the items overflow (collide in the left hash and then collide again in the right hash), versus the 37% that would have overflowed in a single hash of the same size. (A quick numerical check is appended at the end of this message.)

In practical applications you try to make your hash buckets cache-line sized and fit 2 or 4 items into them so that overflows are farther out on the tail of the distribution. The result can be a hash that is only oversized by a factor of two or so relative to the space occupied by the filled slots, yet has an amortized cost of less than 1.5 cache misses and a fraction of a percent of items that overflow the right hash. That fraction of a percent can go into a single overflow pool (a way oversized hash or an rbtree or something else you've already coded; it's a rare case) rather than having to chain from each hash bucket.

> If you read any paper on IP route lookup algorithms, prefix your
> reading of that paper with something that reads like: And by the way we
> are targeting implementations in hardware that can parallelize and
> pipeline memory accesses to fast SRAM. Do not consider this algorithm
> seriously for running on general purpose processors in software.

I generally agree with this statement. However, the 2-left hash is a very useful data structure for lots of things other than IP route lookups, especially in memory-constrained systems where you don't want to oversize the hash table by _too_ much but you still want actual overflow chaining to be quite rare. It's much simpler than next-empty-bucket types of schemes, simple enough that I've coded it for little DSP cores, and almost as good at minimizing overflows.

> Probably a few read's of Network Algorithmics would help your
> understanding in this area as well.

Hey, I don't think a hash is the right solution in the first place.
If you're on the open Internet you have to assume a DDoS half-open connection load a thousandfold higher than the number of real traffic flows. If you don't have the memory bandwidth and cache architecture to bend over and take it (and on a commodity processor, you don't, not unless you're a hell of a lot cleverer about hinting to the cache controller than I am), you want a data structure that rotates the real connections up into a frequently traversed set. Besides, RCUing a hash of any sort sounds like a nightmare, whereas something like a splay tree is reputed to be very straightforward to make "persistent" in the functional-programming sense. I haven't coded that one myself but surely it's in the current edition of Knuth.

Network Algorithmics is fine as far as it goes, but if you want to distribute the networking stack across a NUMA box and still not have it go pear-shaped under sophisticated DDoS, I think you probably have to think it out for yourself. In any case, I am not setting myself up as some kind of networking or data structure guru, just trying to articulate what's wrong with Evgeniy's analysis and why RCUing a naive hash is a complete waste of time. Tries and similar structures do help in the case of our routing because we're goin
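The quick numerical check referenced above is nothing but the Poisson arithmetic from the 2-left discussion; the constants (one-entry buckets, strict left-then-right spill) are simplifying assumptions rather than a model of any real implementation, and the last figure comes out a shade above the "about 20%" quoted, but it tells the same story:

#include <math.h>
#include <stdio.h>

/* Fraction of items that spill out of a table of one-entry buckets,
 * assuming a statistician's-"perfect" (uniform) hash: each non-empty
 * bucket catches one item, the rest overflow.  Straight Poisson. */
static double overflow_fraction(double items, double buckets)
{
	double p_empty = exp(-items / buckets);		/* P(0) = e^{-n/k} */
	double caught = buckets * (1.0 - p_empty);	/* non-empty buckets */

	return (items - caught) / items;
}

int main(void)
{
	double n = 1e6;		/* number of items; only the ratio matters */

	printf("single hash, n/k = 1: %4.1f%% of items overflow\n",
	       100.0 * overflow_fraction(n, n));
	printf("single hash, n/k = 2: %4.1f%% of items overflow\n",
	       100.0 * overflow_fraction(n, n / 2.0));

	/* n buckets total, split into a left and a right half; only items
	 * that collide in the left table are tried against the right. */
	{
		double spilled_left = n * overflow_fraction(n, n / 2.0);
		double spilled_both = spilled_left *
			overflow_fraction(spilled_left, n / 2.0);

		printf("left-then-right, n buckets total: %4.1f%% overflow both\n",
		       100.0 * spilled_both / n);
	}
	return 0;
}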
Re: Extensible hashing and RCU
Look, Evgeniy. Eric and I may be morons but davem is not. He's telling you, again and again, that DoS attacks do happen, that to survive them you need the distribution of tuples within hash buckets to vary unpredictably from system to system and boot to boot, and that an XOR hash does not accomplish this. He has told you that the Jenkins hash works, for reasons discussed exhaustively in the link I gave you, which you can follow to statistical analysis that is beyond your expertise or mine. He has told you that your performance analysis is fundamentally flawed. Do you think you might need to drop back five and punt?

As for the distribution issues: who cares? You fundamentally can't do any better, for random input drawn from a tuple space much larger than the number of hash buckets, than a Poisson distribution. And that's enough to kill a naive hash table design, because chaining is going to cost you another cache miss for every link, and you couldn't afford the first cache miss in the first place. If you absolutely insist on a hash table (on the theory that you can't keep the hot connections warm in cache anyway, which isn't necessarily true if you use a splay tree), it had better be a 2-left hash with a compact overflow pool for the rare case of second collision. I recommend Michael Mitzenmacher's paper to you, or rather to whoever's reading along with the intention of improving the kernel without reinventing the square wheel.

Cheers,
- Michael
Re: Extensible hashing and RCU
On 2/20/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:
> Correct. That's called a "weak hash", and Jenkins is known to be a
> thoroughly weak hash. That's why you never, ever use it without a salt,
> and you don't let an attacker inspect the hash output either.

Weak in a cryptographic sense, of course. Excellent avalanche behavior, though, which is what you care about in a salted hash.

http://en.wikipedia.org/wiki/Hash_table

I know nothing about data structures and algorithms except what I read on the Internet. But you'd be amazed what's on the Internet.

Cheers,
- Michael
Re: Extensible hashing and RCU
On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> How I like personal insults - it is always fun to read about myself
> from people who never knew me :)

On this occasion, I did not set out to insult you. I set out to suggest an explanation for why cooler and grayer heads than mine are not falling over themselves to merge your designs into the mainline kernel. Did I succeed?

> I just shown a problem in jenkins hash - it is not how to find a
> differnet input for the same output - it is a _law_ which allows to
> break a hash. You will add some constant, and that law will be turned
> into something different (getting into account what was written, it
> will end up with the same law).

Correct. That's called a "weak hash", and Jenkins is known to be a thoroughly weak hash. That's why you never, ever use it without a salt, and you don't let an attacker inspect the hash output either.

> Using jenkins hash is equal to the situation, when part of you hash
> chains will be 5 times longer than median square value, with XOR one
> there is no such distribution.

Show us the numbers. Salt properly this time to reduce the artifacts that come of applying a weak hash to a poor PRNG, and histogram your results. If you don't get a Poisson distribution you probably don't know how to use gnuplot either. :-)

> Added somthing into permutations will not endup in different
> distribution, since it is permutations which are broken, not its result
> (which can be xored with something).

I can't parse this. Care to try again?

Cheers,
- Michael
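In the spirit of "show us the numbers", the harness being asked for is nothing fancier than the sketch below. hash_tuple() here is a stand-in that just emits pseudo-random bucket indices; the point is the Poisson column, which is what the real salted jhash, fed real or attacker-chosen tuples, should be compared against:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define BUCKETS	(1u << 16)
#define KEYS	(2u * BUCKETS)	/* mean occupancy m = 2 */
#define MAXOCC	12		/* last row collects everything deeper */

/* Stand-in: returns pseudo-random bits.  Substitute the salted hash
 * under test applied to captured or attacker-chosen flow tuples. */
static unsigned int hash_tuple(unsigned int key)
{
	(void)key;
	return (unsigned int)random();
}

int main(void)
{
	static unsigned int occupancy[BUCKETS];
	unsigned long histogram[MAXOCC] = { 0 };
	double m = (double)KEYS / BUCKETS;
	double expect = BUCKETS * exp(-m);	/* Poisson: k * e^-m * m^j / j! */
	unsigned int i;

	for (i = 0; i < KEYS; i++)
		occupancy[hash_tuple(i) & (BUCKETS - 1)]++;
	for (i = 0; i < BUCKETS; i++)
		histogram[occupancy[i] < MAXOCC ? occupancy[i] : MAXOCC - 1]++;

	for (i = 0; i < MAXOCC; i++) {
		printf("%2u entries: %6lu buckets (Poisson expects %8.1f)\n",
		       i, histogram[i], expect);
		expect *= m / (i + 1);	/* P(j+1) = P(j) * m / (j+1) */
	}
	return 0;
}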
Re: Extensible hashing and RCU
All right, Eric, you and me so clevvah, let's embarrass our own selves designing this thing in public instead of picking on poor Evgeniy. Which would you rather RCU, a 2-left hash or a splay tree?

The 2-left hash gets you an excellent occupation fraction before the first real collision, so you can be utterly stupid about collisions in the second hash table (just toss all double collisions into an overflow radix tree). The splay tree is a lot bigger bite to chew initially, but it gets you a hot set that's at least sometimes warm in cache, and it's easier to think about how to RCU it, and it doubles as a priority queue.

You pick.

Cheers,
- Michael
Re: Extensible hashing and RCU
On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> And here are another ones which produce the same hash value. Of course
> searching for pair for jhash('jhash is broken') will require more
> steps, but it is doable. That means that if attacker has a full control
> over one host, it can create a chain of maximum 4 entries in socket
> table (if jhash is used). If it is udp, that means that attacker
> control addresses too without syn cookies, which in turn means that
> below list can be increased to infinite.

Evgeniy, you're not gettin' it. Have you ever heard of a Poisson distribution? That's what you have to aim for in a (bog-standard) hash table -- a hash function that scrambles the key space until you have a Poisson distribution in the hash buckets no matter what pile of keys the attacker throws at you. That's a "perfect" hash function in the statistician's sense (not a "perfect hash" function in the compiler designer's sense). Yes, there will be collisions, and they will start happening much earlier than you would think if you have never heard of Poisson distributions before; that's called the "birthday problem", and it is a chestnut of a mathematical puzzle generally encountered by bright twelve-year-olds in extra-curricular math classes. The only sensible way to deal with them is to use a better data structure -- like a 2-left hash (Google it) or something tree-ish.

That is not, however, what you're not getting. What you're not getting is the use of a "salt" or "secret" or whatever you want to call it to turn a weak hash into an impenetrable defense against chosen-plaintext collision attacks. You can run your PRNG until you slag your poor botnet's little CPUs searching for a set of tuples that collide in the same bucket on your test system-under-attack. But they won't collide on mine, and they won't collide on Eric's, and they won't collide on yours the next time it's rebooted. Because even a weak hash (in the cryptographer's sense) is good enough to scramble the key space radically differently with two different salts. XOR doesn't cut it -- the salt will permute the buckets but not toss each one's contents up into the air to land in a bunch of different buckets. But Jenkins does cut it, as discussed in the source code Eric pointed out to you.

Of course you don't change the salt except when resizing the hash (which is a dumb thing to do anyway, and a sure sign that a hash table is not the right data structure). That would kinda defeat the purpose of having a hash table in the first place.

If you can't assimilate this and articulate it back to us in your own words instead of repeating "any hash that doesn't totally, utterly suck slows my meaningless artificial benchmark by 50%", you may not be the right person to design and implement a new RCU data structure in a kernel that thousands of us trust to withstand exposure to the open Internet to the best of a commodity processor's ability.

Cheers,
- Michael
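The "permute the buckets" point is easy to demonstrate with a few lines of throwaway code. The mixer below is the 32-bit finalizer from MurmurHash3 standing in for Jenkins (purely for illustration -- it is not the kernel's jhash), but the asymmetry is the same: the XOR-folded pair collides under every salt, while the mixed pair collides only about once per 65536 salts:

#include <stdio.h>
#include <stdint.h>

static uint32_t xor_bucket(uint32_t a, uint32_t b, uint32_t salt, uint32_t mask)
{
	return (a ^ b ^ salt) & mask;
}

/* Stand-in mixer (MurmurHash3's 32-bit finalizer), used here only to
 * illustrate avalanche; it is not the kernel's jhash. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

static uint32_t mix_bucket(uint32_t a, uint32_t b, uint32_t salt, uint32_t mask)
{
	return mix32(a ^ mix32(b ^ mix32(salt))) & mask;
}

int main(void)
{
	uint32_t a1 = 0x12345678, b1 = 0x9abcdef0;
	uint32_t a2 = a1 ^ 0xff, b2 = b1 ^ 0xff;	/* same XOR fold as (a1, b1) */
	uint32_t mask = (1u << 16) - 1;
	unsigned long xor_coll = 0, mix_coll = 0;
	uint32_t salt;

	for (salt = 0; salt < 100000; salt++) {
		xor_coll += xor_bucket(a1, b1, salt, mask) ==
			    xor_bucket(a2, b2, salt, mask);
		mix_coll += mix_bucket(a1, b1, salt, mask) ==
			    mix_bucket(a2, b2, salt, mask);
	}
	/* Expect 100000 XOR collisions, and roughly 100000/65536 (one or
	 * two) for the mixing hash. */
	printf("xor-fold collisions: %lu / 100000\n", xor_coll);
	printf("mixed    collisions: %lu / 100000\n", mix_coll);
	return 0;
}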
Re: Extensible hashing and RCU
On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> Jenkins _does_ have them, I showed tests half a year ago and in this
> thread too. Actually _any_ hash has them it is just a matter of time to
> find one.

I think you misunderstood me. If you are trying to DoS me from outside with a hash collision attack, you are trying to feed me packets that fall into the same hash bucket. The Jenkins hash does not have to be artifact-free, and does not have to be cryptographically strong. It just has to do a passable job of mixing a random salt into the tuple, so you don't know which string of packets to feed me in order to fill one (or a few) of my buckets. XORing salt into a folded tuple doesn't help; it just permutes the buckets.

Cheers,
- Michael
Re: Extensible hashing and RCU
On 2/20/07, David Miller <[EMAIL PROTECTED]> wrote:
> Actually someone (I think it was Evgeniy in fact) made such comparisons
> and found in his studies that not only does the current ehash xor hash
> function distribute about as well as jenkins, it's significantly
> cheaper to calculate :-)

However, it's vulnerable to hash collision attacks, which the Jenkins hash (used as a poor man's HMAC with a boot-time random salt) is not. But if you don't care about DoS ...

Cheers,
- Michael
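Concretely, the "poor man's HMAC" usage is nothing more exotic than the following sketch. This is not the stock ehash function being debated (that is the xor fold); it is only an illustration, with made-up names, of where the boot-time salt goes:

#include <linux/init.h>
#include <linux/jhash.h>
#include <linux/random.h>
#include <linux/cache.h>

static u32 ehash_salt __read_mostly;

/* Drawn once at boot, before the first lookup. */
static void __init ehash_salt_init(void)
{
	get_random_bytes(&ehash_salt, sizeof(ehash_salt));
}

/* Fold the flow tuple through jhash with the secret salt as initval;
 * without knowing the salt, an attacker cannot predict the bucket. */
static u32 ehash_bucket(__be32 laddr, __be16 lport,
			__be32 faddr, __be16 fport, u32 hash_mask)
{
	return jhash_3words((__force u32)laddr, (__force u32)faddr,
			    ((u32)(__force u16)lport << 16) | (__force u32)fport,
			    ehash_salt) & hash_mask;
}

The salt never leaves the kernel, and the bucket index is never exposed directly, so collision-hunting from outside degenerates to guesswork.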
Re: Extensible hashing and RCU
On 19 Feb 2007 13:04:12 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:
> LRU tends to be hell for caches in MP systems, because it writes to the
> cache lines too and makes them exclusive and more expensive.

That's why you let the hardware worry about LRU. You don't write to the upper layers of the splay tree when you don't have to. It's the mere traversal of the upper layers that keeps them in cache, causing the cache hierarchy to mimic the data structure hierarchy. RCU changes the whole game, of course, because you don't write to the old copy at all; you have to clone the altered node and all its ancestors and swap out the root node itself under a spinlock. Except you don't use a spinlock; you have a ring buffer of root nodes and atomically increment the writer index. That atomically incremented index is the only thing on which there's any write contention. (Obviously you need a completion flag on the new root node for the next writer to poll on, so the sequence is atomic-increment ... copy and alter from leaf to root ... wmb() ... mark new root complete.)

When you share TCP sessions among CPUs, and packets associated with the same session may hit softirq on any CPU, you are going to eat a lot of interconnect bandwidth keeping the sessions coherent. (The only way out of this is to partition the tuple space by CPU at the NIC layer with separate per-core, or perhaps per-cache, receive queues; at which point the NIC is so smart that you might as well put the DDoS handling there.) But at least it's cache coherency protocol bandwidth and not bandwidth to and from DRAM, which has much nastier latencies.

The only reason the data structure matters _at all_ is that DDoS attacks threaten to evict the working set of real sessions out of cache. That's why you add new sessions at the leaves and don't rotate them up until they're hit a second time. Of course the leaf layer can't be RCU, but it doesn't have to be; it's just a bucket of tuples. You need an auxiliary structure to hold the session handshake trackers for the leaf layer, but you assume that you're always hitting cold cache when diving into this structure and ration accesses accordingly. Maybe you even explicitly evict entries from cache after sending the SYNACK, so they don't crowd other stuff out; they go to DRAM and get pulled into the new CPU (and rotated up) if and when the next packet in the session arrives. (I'm assuming T/TCP here, so you can't skimp much on session tracker size during the handshake.)

Every software firewall I've seen yet falls over under DDoS. If you want to change that, you're going to need more than the back-of-the-napkin calculations that show that session lookup bandwidth exceeds frame throughput for min-size packets. You're going to need to strategize around exploiting the cache hierarchy already present in your commodity processor to implicitly partition real traffic from the DDoS storm. It's not a trivial problem, even in the mathematician's sense (in which all problems are either trivial or unsolved).

Cheers,
- Michael
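To make the hand-waving in the first paragraph a little more concrete, the writer side might look something like the sketch below. This is strictly illustrative pseudo-kernel code: struct node, struct flow_key, and clone_path() are made up, the reader side and the grace-period reclamation of displaced paths are omitted, and a real version would have to think harder about ring wrap.

#include <asm/atomic.h>
#include <asm/processor.h>
#include <asm/system.h>

#define ROOT_RING_SIZE	16

struct tree_root {
	struct node *root;	/* root of one complete version of the tree */
	unsigned long seq;	/* which update produced this version */
};

static struct tree_root root_ring[ROOT_RING_SIZE];
static atomic_t update_seq = ATOMIC_INIT(0);

/* Hypothetical: clones the altered node and every ancestor up to the
 * root, sharing all untouched subtrees with the old version. */
struct node *clone_path(struct node *old_root, struct flow_key *key);

static void writer_update(struct flow_key *key)
{
	unsigned long seq = atomic_inc_return(&update_seq);
	struct tree_root *prev = &root_ring[(seq - 1) % ROOT_RING_SIZE];
	struct tree_root *mine = &root_ring[seq % ROOT_RING_SIZE];

	/* The only write contention is the atomic increment above; here we
	 * just wait for the previous writer to finish publishing its path. */
	while (prev->seq != seq - 1)
		cpu_relax();
	smp_rmb();					/* pairs with smp_wmb() below */

	mine->root = clone_path(prev->root, key);	/* copy and alter, leaf to root */
	smp_wmb();					/* publish the path before the flag */
	mine->seq = seq;				/* "new root complete" */
}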
Re: Extensible hashing and RCU
On 2/18/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote:
> ... Much less vulnerable to cache eviction DDoS than a hash, because
> the hot connections get rotated up into non-leaf layers and get
> traversed enough to keep them in the LRU set.

Let me enlarge on this a bit. I used to work for a company that built a custom firewall/VPN ASIC with all sorts of special sauce in it, mostly focused on dealing with DDoS. Some really smart guys, some really good technology; I hope they grab the brass ring someday. On the scale they were dealing with, there's only one thing to do about DDoS: bend over and take it. Provision enough memory bandwidth to cold-cache every packet, every session lookup, and every packet-processing-progress structure. Massively parallelize, spinlock on on-chip SRAM, tune for the cold-cache case.

If you can't afford to do that -- and if you haven't designed your own chip, with separate cache windows for each of these use cases, you can't, because they're all retrograde loads for an LRU cache -- then a hash is not the right answer. The interaction between resizing and RCU is just the canary in the coal mine.

Cheers,
- Michael
Re: Extensible hashing and RCU
A better data structure for RCU, even with a fixed key space, is probably a splay tree. It is much less vulnerable to cache eviction DDoS than a hash, because the hot connections get rotated up into non-leaf layers and get traversed enough to keep them in the LRU set. But I am not a data structures guru; a stratified tree or van Emde Boas priority queue might be a better choice. An RCU splay tree has lots of other potential applications inside the kernel, so it would make a fun project for someone with time to burn.

Cheers,
- Michael
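For anyone who hasn't met splay trees, the "rotated up" mechanism is nothing more than tree rotations applied on access. A single right rotation looks like the toy code below (no parent bookkeeping, no locking, no RCU); a full splay is a sequence of zig / zig-zig / zig-zag steps built from it:

struct snode {
	struct snode *left;
	struct snode *right;
	/* key, flow state, ... */
};

/* Promote p's left child one level toward the root; the caller links
 * the returned node into p's former place.  Repeated along the access
 * path, this is what drags hot entries up out of the leaf layers. */
static struct snode *rotate_right(struct snode *p)
{
	struct snode *x = p->left;

	p->left = x->right;
	x->right = p;
	return x;
}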