Re: Extensible hashing and RCU
On Tue, Mar 13, 2007 at 11:08:27AM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> On Tuesday 13 March 2007 10:32, Evgeniy Polyakov wrote:
> > On Fri, Mar 02, 2007 at 11:52:47AM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > So, I ask network developers about a testing environment for socket lookup
> > benchmarking. What would be the best test case to determine the performance
> > of the lookup algorithm? Is it enough to replace the algorithm and locking,
> > create, say, one million connections, and try to run a trivial web server?
> > (That is what I'm going to test if there is no better suggestion, but I
> > only have a single-core Athlon 64 with 1 GB of RAM as a test bed and two
> > Core Duo machines as generators; probably I can use one of them as a test
> > machine too. They have gigabit adapters and are connected over a gigabit
> > switch.)
>
> One million concurrent sockets on your machines will be tricky :)
>
> $ egrep "(filp|dent|^TCP|sock_inode_cache)" /proc/slabinfo | cut -c1-40
> TCP 12 14 1152
> sock_inode_cache 423430384
> dentry_cache 36996 47850132
> filp4081 4680192
>
> That means at a minimum 1860 bytes of LOWMEM per TCP socket on a 32-bit
> kernel (2512 bytes on a 64-bit kernel).
>
> I had one bench program, but apparently I lost it :(
> It was able to open long-lived sockets (one million, given enough memory)
> and generated more or less random traffic on all sockets. Damned.
> The 'server' side had to listen on many (>16) ports because of the 65536
> port limit.

Yep, I was too optimistic about my hardware - given the size of a TCP socket, it is impossible to even create that many of them with 1 or 2 GB of RAM.

Well, I can run additional tests in userspace (ideally with hugetlb support, but given that both the socket hash table and my algorithm use essentially the same amount of RAM, it should not matter) with a more precise analysis... and just send a patch with a detailed description.
--
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
On Tuesday 13 March 2007 10:32, Evgeniy Polyakov wrote:
> On Fri, Mar 02, 2007 at 11:52:47AM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> So, I ask network developers about a testing environment for socket lookup
> benchmarking. What would be the best test case to determine the performance
> of the lookup algorithm? Is it enough to replace the algorithm and locking,
> create, say, one million connections, and try to run a trivial web server?
> (That is what I'm going to test if there is no better suggestion, but I
> only have a single-core Athlon 64 with 1 GB of RAM as a test bed and two
> Core Duo machines as generators; probably I can use one of them as a test
> machine too. They have gigabit adapters and are connected over a gigabit
> switch.)

One million concurrent sockets on your machines will be tricky :)

$ egrep "(filp|dent|^TCP|sock_inode_cache)" /proc/slabinfo | cut -c1-40
TCP 12 14 1152
sock_inode_cache 423430384
dentry_cache 36996 47850132
filp4081 4680192

That means at a minimum 1860 bytes of LOWMEM per TCP socket on a 32-bit kernel (2512 bytes on a 64-bit kernel).

I had one bench program, but apparently I lost it :(
It was able to open long-lived sockets (one million, given enough memory) and generated more or less random traffic on all sockets. Damned.
The 'server' side had to listen on many (>16) ports because of the 65536 port limit.
Re: Extensible hashing and RCU
On Fri, Mar 02, 2007 at 11:52:47AM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> So I take my words about a tree/trie implementation instead of a hash
> table for socket lookup back.

Hmm, I was a bit fast to draw a line; there are some possibilities to get faster-than-hash-table lookup using different algorithms...

So, I ask network developers about a testing environment for socket lookup benchmarking. What would be the best test case to determine the performance of the lookup algorithm? Is it enough to replace the algorithm and locking, create, say, one million connections, and try to run a trivial web server? (That is what I'm going to test if there is no better suggestion, but I only have a single-core Athlon 64 with 1 GB of RAM as a test bed and two Core Duo machines as generators; probably I can use one of them as a test machine too. They have gigabit adapters and are connected over a gigabit switch.)

--
Evgeniy Polyakov
Re: Extensible hashing and RCU
On Sun, Mar 04, 2007 at 02:02:36AM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote:
> On 3/3/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> > Btw, you could try to implement something you have written above to show
> > its merits, so that it would not be empty words :)
>
> Before I implement, I design. Before I design, I analyze. Before I
> analyze, I prototype. Before I prototype, I gather requirements.
> Before I gather requirements, I think -- and the only way I know how
> to think about technical matters is to write down my intuitions and
> compare them against the sea of published research on the topic. I'm

You forgot 'before (and whilst) I do all of the above, I write tons of words on mailing lists about nothing, thus wasting my time and others'.

--
Evgeniy Polyakov
Re: Extensible hashing and RCU
Michael K. Edwards writes:
> This, incidentally, seems very similar to the process that Robert
> Olsson and Stefan Nilsson have gone through with their trie/hash
> project. Although I haven't tried it out yet and don't have any basis
> for an independent opinion, the data and analysis provided in their
> paper are quite convincing.

I have some info about this "process" :)

I moved fib_trie.c to userland to extend it to longer and variable key lengths. The test program was doing insert/lookup/remove with random data, with blistering performance (very flat trees). So, quite happy, I moved this trie back into the kernel and started testing with "real data" - IP addresses and rDoS. The result was disastrous: very deep trees and awful network performance. Random data is very different from real data was the lesson learned, again. I gave up.

A couple of days later an idea came up. I remembered the proof from the LC-trie paper that the length of the key does not impact tree depth. So I went to Stefan: can't we just fix up the data rather than fiddling with a new, improved algorithm? The result was this "hash" header to boost tree compression.

Cheers
--ro
Re: Extensible hashing and RCU
On 3/4/07, David Miller <[EMAIL PROTECTED]> wrote:
> From: "Michael K. Edwards" <[EMAIL PROTECTED]>
> > Before I implement, I design. Before I design, I analyze. Before I
> > analyze, I prototype. Before I prototype, I gather requirements.
>
> How the heck do you ever get to writing ANYTHING if you work that way?

Oh, c'mon, the whole design process takes maybe 3 weeks for something I'm going to spend 3 months implementing. And half the time the client mistakes the prototype for the implementation and tries to run with it -- with the foreseeable consequences. Smaller projects are a lot less formal, but the principle still holds: every time I've implemented without actually doing the homework leading up to a design, I've had cause to regret it. That goes double for problems that already have off-the-shelf solutions and only need improvement in ease of use and robustness in hostile environments.

> I certainly would never have written one single line of Linux kernel
> code if I had to go through that kind of sequence to actually get to
> writing code.

Lots of _code_ gets written as a side effect of each stage of that sequence. But none of that code should be mistaken for the _implementation_. Not when the implementation is going to ship inside hundreds of thousands of widgets that people are going to stick in their ears, or is going to monitor the integrity of an intercontinental telecoms network. Most shipping code, including most code that I have written and large swathes of the Linux kernel, has never really gone beyond the prototype stage. There's no shame in that; the shame would be in refusing to admit it.

> And that's definitely not the "Linux way". You code up ideas as soon
> as you come up with one that has a chance of working, and you see what
> happens. Sure, you'll throw a lot away, but at least you will "know"
> instead of "think".

I call that process "prototyping". I used to do a lot of it, but I have since found that thinking first is actually more efficient. There is very little point in prototyping the square wheel again and again and again. And especially given that Linux already has plenty of nice, round wheels, there aren't that many places left where you can impress the world by replacing a square wheel with a hexagonal one. Replacing wheels with maglev tracks requires a real design phase.

> You have to try things, "DO" stuff, not just sit around and theorize
> and design things and shoot down ideas on every negative minute detail
> you can come up with before you type any code in. That mode of
> development doesn't inspire people and get a lot of code written.

I'll cop to the "negative" part of this, at least to some degree, but not the rest. "Naive hashes DDoS easily" is not a minute detail. "Hash tables don't provide the operations characteristic of priority queues" is not a minute detail. "The benchmarks offered do not accurately model system impact" is not a minute detail. If I were not sufficiently familiar with some of Evgeniy's other contributions to the kernel to think him capable of responding to these critiques with a better design, I would not have bothered.

> I definitely do not think others should use this
> design/prototype/analyze/blah/blah way of developing as an example;
> instead I think folks should use people like Ingo Molnar as an example
> of a good Linux developer. People like Ingo rewrite the scheduler one
> night because of a tiny cool idea, and even if only 1 out of 10 hacks
> like that turn out to be useful, his work is invaluable, and since
> he's actually trying to do things and writing lots of code, this
> inspires other people.

Ingo Molnar is certainly a good, nay an excellent, Linux developer. His prototypes are brilliant even when they're under-designed by my way of thinking. By all means, readers should go thou and do likewise. And when someone presses an alternative design, "show me the code" is a fair response. At present, I am not offering code, nor design, nor even a prototype.

But I am starting to figure out which bits of the Linux kernel really are implementations based on a solid design. Any prototyping I wind up doing is going to be bolted firmly onto those implementations, not floating in the surrounding sea of prototype code; and any analysis I offer based on that prototype will reflect, to the best of my ability, an understanding of its weaknesses as well as its strengths. If that analysis passes community scrutiny, and if I remain interested in the project, perhaps I will go through with design and implementation as well.

This, incidentally, seems very similar to the process that Robert Olsson and Stefan Nilsson have gone through with their trie/hash project. Although I haven't tried it out yet and don't have any basis for an independent opinion, the data and analysis provided in their paper are quite convincing. Any prototyping I might do would probably build on their work, perhaps adding a more explicit DDoS-survival strategy based on a priori
Re: Extensible hashing and RCU
From: "Michael K. Edwards" <[EMAIL PROTECTED]>
Date: Sun, 4 Mar 2007 02:02:36 -0800

> On 3/3/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> > Btw, you could try to implement something you have written above to show
> > its merits, so that it would not be empty words :)
>
> Before I implement, I design. Before I design, I analyze. Before I
> analyze, I prototype. Before I prototype, I gather requirements.

How the heck do you ever get to writing ANYTHING if you work that way?

I certainly would never have written one single line of Linux kernel code if I had to go through that kind of sequence to actually get to writing code.

And that's definitely not the "Linux way". You code up ideas as soon as you come up with one that has a chance of working, and you see what happens. Sure, you'll throw a lot away, but at least you will "know" instead of "think".

You have to try things, "DO" stuff, not just sit around and theorize and design things and shoot down ideas on every negative minute detail you can come up with before you type any code in. That mode of development doesn't inspire people and get a lot of code written.

I definitely do not think others should use this design/prototype/analyze/blah/blah way of developing as an example; instead I think folks should use people like Ingo Molnar as an example of a good Linux developer. People like Ingo rewrite the scheduler one night because of a tiny cool idea, and even if only 1 out of 10 hacks like that turn out to be useful, his work is invaluable, and since he's actually trying to do things and writing lots of code, this inspires other people.
Re: Extensible hashing and RCU
On 3/3/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> Btw, you could try to implement something you have written above to show
> its merits, so that it would not be empty words :)

Before I implement, I design. Before I design, I analyze. Before I analyze, I prototype. Before I prototype, I gather requirements. Before I gather requirements, I think -- and the only way I know how to think about technical matters is to write down my intuitions and compare them against the sea of published research on the topic. I'm only partway through thinking about RCU and DDoS, especially as this is on the fringe of my professional expertise and the appropriate literature is not currently at my fingertips.

The only times that I make exceptions to the above sequence are 1, when someone is paying me well to do so (usually to retrofit some kind of sanity onto a pile of crap someone else wrote) and 2, when I really feel like it. At present neither exception applies here, although I may yet get so het up about threadlets that I go into a coding binge (which may or may not produce an RCU splay tree as a side effect). I wouldn't hold my breath if I were you, though; it's the first of what promises to be a string of fine weekends, and if I binge on anything this spring it's likely to be gardening.

Cheers,
- Michael
Re: Extensible hashing and RCU
On Fri, Mar 02, 2007 at 12:45:36PM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote:
> On 3/2/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > Thank you for this report. (Still avoiding cache miss studies, while
> > they obviously are the limiting factor)
>
> 1) The entire point of going to a tree-like structure would be to
> allow the leaves to age out of cache (or even forcibly evict them)
> when the structure bloats (generally under DDoS attack), on the theory
> that most of them are bogus and won't be referenced again. It's not
> about the speed of the data structure -- it's about managing its
> impact on the rest of the system.
>
> 2) The other entire point of going to a tree-like structure is that
> they're drastically simpler to RCU than hashes, and more generally
> they don't involve individual atomic operations (RCU reaping passes,
> resizing, etc.) that cause big latency hiccups and evict a bunch of
> other stuff from cache.
>
> 3) The third entire point of going to a tree-like structure is to
> have a richer set of efficient operations, since you can give them a
> second "priority"-type index and have "pluck-highest-priority-item",
> three-sided search, and bulk delete operations. These aren't that
> much harder to RCU than the basic modify-existing-node operation.
>
> Now can we give these idiotic micro-benchmarks a rest until Robert's
> implementation is tuned and ready for stress-testing?

Mmm, you have learnt new words from other threads :)

It is not a benchmark, it is an analysis of the structure's processing cost. All you have written above is correct - it was said in this thread multiple times - but it does not change the fact that tree traversal is slower than list traversal, so to compete, a tree (or another algorithm, which would be even more interesting) implementation must take that into account and be faster under most loads.

As is, a tree/trie does not have that property (it is a feature of the algorithm), but it has other advantages you did not mention (it is extremely suitable for the routing cache, which requires wildcard support and scaling) - the ability to scale without structure recreation.

Btw, you could try to implement something you have written above to show its merits, so that it would not be empty words :)

> Cheers,
> - Michael

--
Evgeniy Polyakov
Re: Extensible hashing and RCU
On 3/2/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> Thank you for this report. (Still avoiding cache miss studies, while
> they obviously are the limiting factor)

1) The entire point of going to a tree-like structure would be to allow the leaves to age out of cache (or even forcibly evict them) when the structure bloats (generally under DDoS attack), on the theory that most of them are bogus and won't be referenced again. It's not about the speed of the data structure -- it's about managing its impact on the rest of the system.

2) The other entire point of going to a tree-like structure is that they're drastically simpler to RCU than hashes, and more generally they don't involve individual atomic operations (RCU reaping passes, resizing, etc.) that cause big latency hiccups and evict a bunch of other stuff from cache.

3) The third entire point of going to a tree-like structure is to have a richer set of efficient operations, since you can give them a second "priority"-type index and have "pluck-highest-priority-item", three-sided search, and bulk delete operations. These aren't that much harder to RCU than the basic modify-existing-node operation.

Now can we give these idiotic micro-benchmarks a rest until Robert's implementation is tuned and ready for stress-testing?

Cheers,
- Michael
Re: Extensible hashing and RCU
On Fri, Mar 02, 2007 at 10:56:23AM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> On Friday 02 March 2007 09:52, Evgeniy Polyakov wrote:
> > Ok, I've run an analysis of linked list and trie traversals and found
> > that (at least on x86) one optimized list traversal is about 4 (!)
> > times faster than one bit lookup in a trie traversal (or actually one
> > lookup in any binary tree-like structure) - that is because trie
> > traversal needs more instructions per lookup, and at least one
> > additional branch which cannot be predicted.
> >
> > Tests with rdtsc show that one bit lookup in a trie (actually, any
> > lookup in a binary tree structure) is about 3-4 times slower than one
> > lookup in a linked list.
> >
> > Since a hash table usually has up to 4 elements in each hash entry, a
> > competing binary tree/trie structure must find an entry in one lookup,
> > which is essentially impossible with usual tree/trie implementations.
> >
> > Things change dramatically when the linked list becomes too long, but
> > that should not happen with proper resizing of the hash table;
> > wildcard support also introduces additional requirements, which cannot
> > be easily satisfied by hash tables.
> >
> > So I take my words about a tree/trie implementation instead of a hash
> > table for socket lookup back.
> >
> > Interested readers can find more details on the tests, asm outputs and
> > conclusions at:
> > http://tservice.net.ru/~s0mbre/blog/2007/03/01#2007_03_01
>
> Thank you for this report. (Still avoiding cache miss studies, while
> they obviously are the limiting factor)
>
> Anyway, if data is in cache and you want optimum performance from your
> CPU, you may try to use an algorithm without conditional branches:
> (well, 4 in this case for the whole 32-bit test)

Tests were always for the no-cache-miss case.

I also ran them in kernel mode (to eliminate TLB flushes per rescheduling, and to take into account that the kernel TLB covers 8 MB while userspace covers only 4 KB), but the results were essentially the same (modulo several percent). I only tested the trie; in my implementation its memory usage is smaller than the hash table for 2^20 entries.

> gcc -O2 -S -march=i686 test1.c
>
> struct node {
>     struct node *left;
>     struct node *right;
>     int value;
> };
> struct node *head;
> int v1;
>
> #define PASS2(bit) \
>     n2 = n1->left; \
>     right = n1->right; \
>     if (value & (1 << (bit))) \
>         n2 = right; \
>     n1 = n2->left; \
>     right = n2->right; \
>     if (value & (2 << (bit))) \
>         n1 = right;
>
> main()
> {
>     int j;
>     unsigned int value = v1;
>     struct node *n1 = head, *n2, *right;
>
>     for (j = 0; j < 4; ++j) {
>         PASS2(0)
>         PASS2(2)
>         PASS2(4)
>         PASS2(6)
>         value >>= 8;
>     }
>     printf("result=%p\n", n1);
> }

This one resulted in 10*4 and 2*4 branches per loop, so a total of 32 branches (instead of 64 in the simpler code) and 160 instructions (instead of 128 in the simpler code). Given that a branch takes two times longer to execute (though that is quite a strange statement, and I must admit that I did not read the x86 processor manual at all, only the ppc32 one) according to the tests, we do not get any gain for a 32-bit value (32 lookups): 64*2+128 in the old case, 32*2+160 in the new one.

I also have an advanced trie implementation, which caches values in nodes if there are no child entries, and it _greatly_ decreases the number of lookups and memory usage for smaller sets; but in the long run, with a huge number of entries in the trie, it does not matter, since only the lowest layer caches values.

--
Evgeniy Polyakov
Re: Extensible hashing and RCU
On Friday 02 March 2007 09:52, Evgeniy Polyakov wrote:
> Ok, I've run an analysis of linked list and trie traversals and found
> that (at least on x86) one optimized list traversal is about 4 (!)
> times faster than one bit lookup in a trie traversal (or actually one
> lookup in any binary tree-like structure) - that is because trie
> traversal needs more instructions per lookup, and at least one
> additional branch which cannot be predicted.
>
> Tests with rdtsc show that one bit lookup in a trie (actually, any
> lookup in a binary tree structure) is about 3-4 times slower than one
> lookup in a linked list.
>
> Since a hash table usually has up to 4 elements in each hash entry, a
> competing binary tree/trie structure must find an entry in one lookup,
> which is essentially impossible with usual tree/trie implementations.
>
> Things change dramatically when the linked list becomes too long, but
> that should not happen with proper resizing of the hash table;
> wildcard support also introduces additional requirements, which cannot
> be easily satisfied by hash tables.
>
> So I take my words about a tree/trie implementation instead of a hash
> table for socket lookup back.
>
> Interested readers can find more details on the tests, asm outputs and
> conclusions at:
> http://tservice.net.ru/~s0mbre/blog/2007/03/01#2007_03_01

Thank you for this report. (Still avoiding cache miss studies, while they obviously are the limiting factor)

Anyway, if data is in cache and you want optimum performance from your CPU, you may try to use an algorithm without conditional branches (well, 4 in this case for the whole 32-bit test):

gcc -O2 -S -march=i686 test1.c

struct node {
    struct node *left;
    struct node *right;
    int value;
};
struct node *head;
int v1;

#define PASS2(bit) \
    n2 = n1->left; \
    right = n1->right; \
    if (value & (1 << (bit))) \
        n2 = right; \
    n1 = n2->left; \
    right = n2->right; \
    if (value & (2 << (bit))) \
        n1 = right;

main()
{
    int j;
    unsigned int value = v1;
    struct node *n1 = head, *n2, *right;

    for (j = 0; j < 4; ++j) {
        PASS2(0)
        PASS2(2)
        PASS2(4)
        PASS2(6)
        value >>= 8;
    }
    printf("result=%p\n", n1);
}

        .file   "test1.c"
        .section .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "result=%p\n"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        xorl    %ebx, %ebx
        pushl   %ecx
        subl    $16, %esp
        movl    v1, %ecx
        movl    head, %edx
        .p2align 4,,7
.L2:
        movl    4(%edx), %eax
        testb   $1, %cl
        cmove   (%edx), %eax
        testb   $2, %cl
        movl    4(%eax), %edx
        cmove   (%eax), %edx
        testb   $4, %cl
        movl    4(%edx), %eax
        cmove   (%edx), %eax
        testb   $8, %cl
        movl    4(%eax), %edx
        cmove   (%eax), %edx
        testb   $16, %cl
        movl    4(%edx), %eax
        cmove   (%edx), %eax
        testb   $32, %cl
        movl    4(%eax), %edx
        cmove   (%eax), %edx
        testb   $64, %cl
        movl    4(%edx), %eax
        cmove   (%edx), %eax
        testb   %cl, %cl
        movl    4(%eax), %edx
        cmovns  (%eax), %edx
        addl    $1, %ebx
        cmpl    $4, %ebx
        je      .L19
        shrl    $8, %ecx
        jmp     .L2
        .p2align 4,,7
.L19:
        movl    %edx, 4(%esp)
        movl    $.LC0, (%esp)
        call    printf
        addl    $16, %esp
        popl    %ecx
        popl    %ebx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .comm   head,4,4
        .comm   v1,4,4
        .ident  "GCC: (GNU) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)"
        .section .note.GNU-stack,"",@progbits
Re: Extensible hashing and RCU
On Sat, Feb 17, 2007 at 04:13:02PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> > > I noticed in an LCA talk mention that apparently extensible hashing
> > > with RCU access is an unsolved problem. Here's an idea for solving it.
> >
> > Yes, I have been playing around with the same idea for
> > doing dynamic resizing of the TCP hashtable.
> >
> > Did a prototype "toy" implementation, and I have a
> > "half-finished" patch which resizes the TCP hashtable
> > at runtime. Hmmm, your mail may be the impetus to get
> > me to finally finish this thing
>
> Why does no one want to use a trie - for socket-like loads it has
> exactly constant search/insert/delete time and scales like hell.

Ok, I've run an analysis of linked list and trie traversals and found that (at least on x86) one optimized list traversal is about 4 (!) times faster than one bit lookup in a trie traversal (or actually one lookup in any binary tree-like structure) - that is because trie traversal needs more instructions per lookup, and at least one additional branch which cannot be predicted.

Tests with rdtsc show that one bit lookup in a trie (actually, any lookup in a binary tree structure) is about 3-4 times slower than one lookup in a linked list.

Since a hash table usually has up to 4 elements in each hash entry, a competing binary tree/trie structure must find an entry in one lookup, which is essentially impossible with usual tree/trie implementations.

Things change dramatically when the linked list becomes too long, but that should not happen with proper resizing of the hash table; wildcard support also introduces additional requirements, which cannot be easily satisfied by hash tables.

So I take my words about a tree/trie implementation instead of a hash table for socket lookup back.

Interested readers can find more details on the tests, asm outputs and conclusions at:
http://tservice.net.ru/~s0mbre/blog/2007/03/01#2007_03_01

--
Evgeniy Polyakov
Re: Extensible hashing and RCU
On 22 Feb 2007 18:49:00 -0500, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> The rehash-every-10-minutes detail is theoretically unnecessary, but
> does cover the case where a would-be attacker *does* get a chance to
> look at a machine, such as by using routing delays to measure the
> effectiveness of a collision attempt.

AOL -- and that's one of the reasons why RCUing hashes is a nightmare in practice, if you care about smooth peristalsis. Again, the resizing headache is just the canary in the coal mine.

> If you want to test your distribution for randomness, go have a look
> at Knuth volume 2 (seminumerical algorithms) and see the discussion of
> the Kolmogorov-Smirnov test. Some lumpiness is *expected* in a truly
> random distribution.

Ah, Kolmogorov-Smirnov, the astronomer's friend. Somewhere on a DAT in my garage lies a rather nice implementation of K-S tests on censored data for the Mirella/Mirage Forth image analysis package. If you want something industrial-strength, easy to use, and pretty besides, I recommend SM (http://www.astro.princeton.edu/~rhl/sm/), which will cost you $500 U.S. for a department-scale license. SM so utterly exceeds the capabilities of gnuplot it isn't even funny. But then, while you don't always get what you pay for, you rarely fail to pay for what you get, sooner or later, in one currency or another.

Cheers,
- Michael
Re: Extensible hashing and RCU
> I think you misunderstood me. If you are trying to DoS me from
> outside with a hash collision attack, you are trying to feed me
> packets that fall into the same hash bucket. The Jenkins hash does
> not have to be artifact-free, and does not have to be
> cryptographically strong. It just has to do a passable job of mixing
> a random salt into the tuple, so you don't know which string of
> packets to feed me in order to fill one (or a few) of my buckets.
> XORing salt into a folded tuple doesn't help; it just permutes the
> buckets.

If you want to understand this more formally, read up on "universal families of hash functions," which is the name cryptologists give to this concept. When used according to directions, they are actually *more* secure than standard cryptographic hashes such as MD5 and SHA. The key difference is that *the attacker doesn't get to see the hash output*.

The basic pattern is:
- Here's a family of hash functions, e.g. a salted hash function.
- I pick one at random. (E.g. choose a salt.)
- Now your challenge is to generate a pair of inputs which will collide.
- Note that you never get to see sample input/output pairs of the hash function. All you know is that it's a member of the family.
- It is surprisingly easy to find families of size N such that an attacker has on the order of a 1/N chance to construct a collision.
- This remains true even if you assume that the attacker has infinite computational power.

This pattern corresponds exactly to an attacker trying to force collisions in a hash table they can't see.

As far as I know, nobody has proved the salted hash a truly universal family, but so many amazingly simple algorithms have been proved universal that it wouldn't surprise me if it were. For example, the family of all CRCs computed modulo n-bit primitive polynomials is a universal family. If you do know the polynomial, it's ridiculously easy to build a collision. If you don't, it's provably impossible.

(Footnote: the chance isn't exactly 1/N, but also depends on the size of the input relative to the size of the hash. With bigger inputs, it's easier to make them match according to more of the hashes. Ultimately, if you have N k-bit CRC polynomials, you can make them all collide with an N*k-bit input. But since N is proportional to 2^k, it's easy to make k big enough that this is impractical.)

The rehash-every-10-minutes detail is theoretically unnecessary, but does cover the case where a would-be attacker *does* get a chance to look at a machine, such as by using routing delays to measure the effectiveness of a collision attempt.

Now, as for flaming about how xor generates more uniform distributions than jhash - that's to be expected from a weak hash. By relying on non-uniform properties of the input (particularly that hosts tend to walk linearly through the source port space), you can make hash values walk linearly through your hash table and get a completely even distribution rather than a *random* one. This is great for efficiency, but it depends on letting patterns in the hash input through to the output, which is exactly the property that makes it vulnerable to a deliberate DoS attempt.

If you want to test your distribution for randomness, go have a look at Knuth volume 2 (Seminumerical Algorithms) and see the discussion of the Kolmogorov-Smirnov test. Some lumpiness is *expected* in a truly random distribution.
Re: Extensible hashing and RCU
On Thu, 22 Feb 2007 03:07:55 -0800 (PST) David Miller <[EMAIL PROTECTED]> wrote: > From: "Michael K. Edwards" <[EMAIL PROTECTED]> > Date: Thu, 22 Feb 2007 03:00:32 -0800 > > > Besides, RCUing a hash of any sort sounds like a nightmare, whereas > > something like a splay tree is reputed to be very straightforward to > > make "persistent" in the functional-programming sense. I haven't > > coded that one myself but surely it's in the current edition of Knuth. > > Knuth tends to be thorough, yet archaic. :-) > > > You are much better off obtaining all per-flow state with a > > single data structure lookup. It would be kind of nice if state for > > other layers like IPSec and NAT could live there too. > > This is the eventual idea. > > > By the way, are these flows unidirectional or bidirectional? > > The entries created by Robert's TRASH are unidirectional, they have to > be because the path from remote to localhost is different than the > path from localhost to remote :-) > > But we don't need to do a socket demux on packet output, we just > need a plain route (+ firewall, IPSEC, etc.) destination cache > entry for that. > - There are a number of other lock-less algorithms rather than simple RCU that might be worth exploring. A number of them rely on a working implementation of compare-exchange; that can be a problem since some architectures don't have a fast way of doing that. -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
From: Andi Kleen <[EMAIL PROTECTED]> Date: Thu, 22 Feb 2007 12:41:05 +0100 > > With 100k flows we typically see an aver depth 1.3 and a max depth 4. > > Right > > This includes full srcport:dstport? Yes.
Re: Extensible hashing and RCU
> With 100k flows we typically see an aver depth 1.3 and a max depth 4. Right This includes full srcport:dstport? > now there is only a dst entry in leaf nodes. So yes anything else we can put > in leaf is "free". How much would fit there? -Andi
Re: Extensible hashing and RCU
From: "Michael K. Edwards" <[EMAIL PROTECTED]> Date: Thu, 22 Feb 2007 03:00:32 -0800 > Besides, RCUing a hash of any sort sounds like a nightmare, whereas > something like a splay tree is reputed to be very straightforward to > make "persistent" in the functional-programming sense. I haven't > coded that one myself but surely it's in the current edition of Knuth. Knuth tends to be thorough, yet archaic. :-) > You are much better off obtaining all per-flow state with a > single data structure lookup. It would be kind of nice if state for > other layers like IPSec and NAT could live there too. This is the eventual idea. > By the way, are these flows unidirectional or bidirectional? The entries created by Robert's TRASH are unidirectional, they have to be because the path from remote to localhost is different than the path from localhost to remote :-) But we don't need to do a socket demux on packet output, we just need a plain route (+ firewall, IPSEC, etc.) destination cache entry for that. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
On 2/22/07, David Miller <[EMAIL PROTECTED]> wrote: From: "Michael K. Edwards" <[EMAIL PROTECTED]> Date: Wed, 21 Feb 2007 13:15:51 -0800 > it had better be a 2-left hash with a compact > overflow pool for the rare case of second collision. Just like Evgeniy, you should do your research too :- Fair enough. :-) The gain for multiple-hash schemes, such as 2-left, exists only if you can parallelize the N lookups in hardware. This is an assumption built into all the papers you will read on this subject, including the one you reference by Mitzenmacher. Er, no. The principal gain for the 2-left hash is that the only time you ever look at the right hash is when you have a collision in the left. The right hash is therefore substantially less occupied and the odds of second collision are lower. If you happen to be able to prefetch the left and right buckets in parallel, that's gravy. Roughly speaking, for a single hash, P(0) = (1 - 1/k)^n \approx e^{-n/k}, where P(0) is the probability that a given bucket has 0 entries, k is the number of buckets, and n is the number of elements. One entry in the bucket replaces a factor (1-1/k) with a factor n/k, the second entry replaces another (1-1/k) with (n-1)/2k, and so forth. It's the good old birthday problem. Take the simple example of an equal number of items and of one-entry buckets, and compare the simple and 2-left hashes. When n/k \approx 1, you have about 37% empty buckets, but when n/k \approx 2 you only have about 13% empty buckets. On the other hand, when n/k \approx 1 and the buckets only hold one entry, about 37% of the items overflow; when n/k \approx 2, about 57% do. Result: a 2-left hash with exactly as many 1-item buckets as there are items will only see about 20% of the items overflow (collide in the left hash and then collide again in the right hash), versus the 37% that would have overflowed in a single hash of the same size. 
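The arithmetic above is easy to check empirically. Below is a quick Monte Carlo sketch (my code, not from the thread; it models the insert-left-then-spill-right variant described here, not the textbook choose-the-less-loaded d-left scheme) comparing a single hash of n one-slot buckets against the same n slots split into left and right halves:

```python
import random

def single_overflow(n, seed=0):
    """Fraction of n items that overflow n one-slot buckets under a random hash."""
    rng = random.Random(seed)
    slots = [0] * n
    overflow = 0
    for _ in range(n):
        b = rng.randrange(n)
        if slots[b]:
            overflow += 1
        else:
            slots[b] = 1
    return overflow / n

def two_left_overflow(n, seed=0):
    """Same n one-item slots split into halves: try the left bucket, then the right."""
    rng = random.Random(seed)
    k = n // 2
    left, right = [0] * k, [0] * k
    overflow = 0
    for _ in range(n):
        lb, rb = rng.randrange(k), rng.randrange(k)
        if not left[lb]:
            left[lb] = 1
        elif not right[rb]:
            right[rb] = 1
        else:
            overflow += 1
    return overflow / n
```

With n = 100000 the single hash overflows about 37% of items and the split about 23%, in line with the e^{-n/k} estimates in the text (the analytic figure for this variant is 1 - (1-e^{-2})/2 spilling to the right, of which roughly 0.23n finally overflow).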
In practical applications you try to make your hash buckets cache-line sized and fit 2 or 4 items into them so that overflows are farther out on the tail of the distribution. The result can be a hash that is only oversized by a factor of two or so relative to the space occupied by the filled slots, yet has an amortized cost of less than 1.5 cache misses and has a fraction of a percent of items that overflow the right hash. That fraction of a percent can go into a single overflow pool (a way oversized hash or an rbtree or something else you've already coded; it's a rare case) rather than having to chain from each hash bucket. If you read any paper on IP route lookup algorithms, prefix your reading of that paper with something that reads like: And by the way we are targeting implementations in hardware that can parallelize and pipeline memory accesses to fast SRAM. Do not consider this algorithm seriously for running on general purpose processors in software. I generally agree with this statement. However, the 2-left hash is a very useful data structure for lots of things other than IP route lookups, especially in memory-constrained systems where you don't want to oversize the hash table by _too_ much but you still want actual overflow chaining to be quite rare. It's much simpler than next-empty-bucket types of schemes, simple enough that I've coded it for little DSP cores, and almost as good at minimizing overflows. Probably a few read's of Network Algorithmics would help your understanding in this area as well. Hey, I don't think a hash is the right solution in the first place. If you're on the open Internet you have to assume a DDoS half-open connection load a thousandfold higher than the number of real traffic flows. 
If you don't have the memory bandwidth and cache architecture to bend over and take it (and on a commodity processor, you don't, not unless you're a hell of a lot cleverer about hinting to the cache controller than I am), you want a data structure that rotates the real connections up into a frequently traversed set. Besides, RCUing a hash of any sort sounds like a nightmare, whereas something like a splay tree is reputed to be very straightforward to make "persistent" in the functional-programming sense. I haven't coded that one myself but surely it's in the current edition of Knuth. Network Algorithmics is fine as far as it goes, but if you want to distribute the networking stack across a NUMA box and still not have it go pear-shaped under sophisticated DDoS, I think you probably have to think it out for yourself. In any case, I am not setting myself up as some kind of networking or data structure guru, just trying to articulate what's wrong with Evgeniy's analysis and why RCUing a naive hash is a complete waste of time. Tries and similar structures do help in the case of our routing because we're going from a potential 32-hashtable probe down to a relatively small number of probes into a trie. That's what Robert's LC-Trie and upcoming TRASH work do.
Re: Extensible hashing and RCU
From: "Michael K. Edwards" <[EMAIL PROTECTED]> Date: Wed, 21 Feb 2007 13:15:51 -0800 > it had better be a 2-left hash with a compact > overflow pool for the rare case of second collision. Just like Evgeniy, you should do your research too :- The gain for multiple-hash schemes, such as 2-left, exists only if you can parallelize the N lookups in hardware. This is an assumption built into all the papers you will read on this subject, including the one you reference by Mitzenmacher. If you read any paper on IP route lookup algorithms, prefix your reading of that paper with something that reads like: And by the way we are targeting implementations in hardware that can parallelize and pipeline memory accesses to fast SRAM. Do not consider this algorithm seriously for running on general purpose processors in software. Probably a few read's of Network Algorithmics would help your understanding in this area as well. Tries and similar structures do help in the case of our routing because we're going from a potential 32-hashtable probe down to a relatively small number of probes into a trie. That's what Robert's LC-Trie and upcoming TRASH work do. But it won't help the same when going from hashes to a trie, as would be the case for TCP. If we can get the sockets into the TRASH entries, as Eric will experiment with, we'll be able to eliminate the TCP hash lookup in the fast path entirely. It is the only reasonable suggestion I've seen made in this area to date. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
Robert Olsson wrote: David Miller writes: > But what about if tree lookup were free :-) > > This is why I consider Robert Olsson's trash work the most promising, > if we stick sockets into his full flow identified routing cache trie > entries, we can eliminate lookup altogether. > > Just like how he already uses traffic inspection to kill cache entries > when FIN shutdown sequence is snooped, we can explicitly flush such > entries when socket is closed fully on local system. Below is current tree-stats from a "production" flowlogging application at a large university (real traffic) using this full flow lookup. With 100k flows we typically see an aver depth 1.3 and a max depth 4. Right now there is only a dst entry in leaf nodes. So yes anything else we can put in leaf is "free". trie: Aver depth: 1.31 Max depth: 4 Leaves: 99930 Internal nodes: 14925 1: 13450 2: 1465 3: 9 18: 1 Pointers: 294976 Null ptrs: 180122 Total size: 7259 kB Hi Robert This sounds *very* promising. I read the pdf presentation again and the theory seems ok. However you speak of average depth and max depth without mentioning the corresponding cache lines. Are all your structures under L1_CACHE_BYTES ? Do you have a pointer to the patch so that I can take a look at it and possibly start to think about it ? (ie adding free bits :) ) Thank you Eric
Re: Extensible hashing and RCU
David Miller writes: > But what about if tree lookup were free :-) > > This is why I consider Robert Olsson's trash work the most promising, > if we stick sockets into his full flow identified routing cache trie > entries, we can eliminate lookup altogether. > > Just like how he already uses traffic inspection to kill cache entries > when FIN shutdown sequence is snooped, we can explicitly flush such > entries when socket is closed fully on local system. Below is current tree-stats from a "production" flowlogging application at a large university (real traffic) using this full flow lookup. With 100k flows we typically see an aver depth 1.3 and a max depth 4. Right now there is only a dst entry in leaf nodes. So yes anything else we can put in leaf is "free". trie: Aver depth: 1.31 Max depth: 4 Leaves: 99930 Internal nodes: 14925 1: 13450 2: 1465 3: 9 18: 1 Pointers: 294976 Null ptrs: 180122 Total size: 7259 kB Cheers. --ro - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
Look, Evgeniy. Eric and I may be morons but davem is not. He's telling you, again and again, that DoS attacks do happen, that to survive them you need for the distribution of tuples within hash buckets to vary unpredictably from system to system and boot to boot, and that XOR hash does not accomplish this. He has told you that the Jenkins hash works, for reasons discussed exhaustively in the link I gave you, which you can follow to statistical analysis that is beyond your expertise or mine. He has told you that your performance analysis is fundamentally flawed. Do you think you might need to drop back five and punt? As for the distribution issues: who cares? You fundamentally can't do any better, for random input drawn from a tuple space much larger than the number of hash buckets, than a Poisson distribution. And that's enough to kill a naive hash table design, because chaining is going to cost you another cache miss for every link, and you couldn't afford the first cache miss in the first place. If you absolutely insist on a hash table (on the theory that you can't keep the hot connections warm in cache anyway, which isn't necessarily true if you use a splay tree), it had better be a 2-left hash with a compact overflow pool for the rare case of second collision. I recommend Michael Mitzenmacher's paper to you, or rather to whoever's reading along with the intention of improving the kernel without reinventing the square wheel. Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
From: Eric Dumazet <[EMAIL PROTECTED]> Date: Wed, 21 Feb 2007 14:19:30 +0100 > Now, when the rate of lookups/inserts/delete is high, with totally random > endpoints and cache *always* cold , 'tree structures' are not welcome (not > cache friendly) But what about if tree lookup were free :-) This is why I consider Robert Olsson's trash work the most promising, if we stick sockets into his full flow identified routing cache trie entries, we can eliminate lookup altogether. Just like how he already uses traffic inspection to kill cache entries when FIN shutdown sequence is snooped, we can explicitly flush such entries when socket is closed fully on local system. Really, possibilities are truly endless with that thing. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
On Wednesday 21 February 2007 13:41, Andi Kleen wrote: > Eric Dumazet <[EMAIL PROTECTED]> writes: > > For example, sock_wfree() uses 1.6612 % of cpu because of false sharing > > of sk_flags (dirtied each time SOCK_QUEUE_SHRUNK is set :( > > Might be easily fixable by moving the fields around a bit? For this one, it seems sk_flags is mostly read, but SOCK_QUEUE_SHRUNK is mostly written. It would make sense to move it somewhere else, to keep sk_flags shared by different cpus. Maybe using one low order bit in a related pointer ? (like the rb_color() trick). Maybe it is time for a new include/linux/bap.h (bits and pointer) include :) > > > If we want to optimize tcp, we should reorder fields to reduce number of > > cache lines, not change algos. struct sock fields are currently placed to > > reduce holes, while they should be grouped by related fields sharing > > cache lines. > > Regrouping is definitely a good thing; but I'm not sure why you are so > deadset against exploring other data structures. The promise of RCUing > and avoiding the big hash tables seems alluring to me, even if it > only breaks even in the end in terms of cycles. RCU is definitely wanted, and IP routing demonstrated the wins. rbtree was successfully plugged into epoll instead of the initial hash table implementation. Now, when the rate of lookups/inserts/deletes is high, with totally random endpoints and the cache *always* cold, 'tree structures' are not welcome (not cache friendly)
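Eric's rb_color()-style idea of borrowing a low-order bit from an aligned pointer can be sketched in a few lines. This is an illustration with plain integers standing in for kernel pointers (the names and the 8-byte alignment are my assumptions, not a proposed API): any address from an 8-byte-aligned allocator has its three low bits clear, so bit 0 can carry a flag such as SOCK_QUEUE_SHRUNK without widening the structure or dirtying sk_flags.

```python
# Aligned addresses have zero low-order bits, so bit 0 is free for a flag.
ALIGN = 8   # assume 8-byte-aligned allocations, as for typical slab objects
FLAG = 0x1  # the single flag bit we smuggle into the pointer

def pack(addr, flag):
    """Store a boolean flag in bit 0 of an ALIGN-aligned address."""
    assert addr % ALIGN == 0, "address must be aligned"
    return addr | (FLAG if flag else 0)

def unpack(word):
    """Recover (address, flag) from a tagged word."""
    return word & ~FLAG, bool(word & FLAG)
```

The same trick is what rb_color() does with the red/black bit of an rbtree parent pointer; the cost is one mask instruction on every dereference.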
Re: Extensible hashing and RCU
Eric Dumazet <[EMAIL PROTECTED]> writes: > > For example, sock_wfree() uses 1.6612 % of cpu because of false sharing of > sk_flags (dirtied each time SOCK_QUEUE_SHRUNK is set :( Might be easily fixable by moving the fields around a bit? > If we want to optimize tcp, we should reorder fields to reduce number of > cache > lines, not change algos. struct sock fields are currently placed to reduce > holes, while they should be grouped by related fields sharing cache lines. Regrouping is definitely a good thing; but I'm not sure why you are so deadset against exploring other data structures. The promise of RCUing and avoiding the big hash tables seems alluring to me, even if it only breaks even in the end in terms of cycles. -Andi
Re: Extensible hashing and RCU
From: Evgeniy Polyakov <[EMAIL PROTECTED]> Date: Wed, 21 Feb 2007 12:51:09 +0300 > Linux route cache does not change $c (third parameter), and it _seems_ > that distribution for the random $a and $b is fair, while when $c is > formed over attacker's data, random per-boot $initval does not help. It is not only initialized per-boot, it is also regenerated every 10 minutes by default, via rt_secret_rebuild().
Re: Extensible hashing and RCU
On Wed, Feb 21, 2007 at 10:38:22AM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > On Wednesday 21 February 2007 10:27, Evgeniy Polyakov wrote: > > On Wed, Feb 21, 2007 at 10:15:11AM +0100, Eric Dumazet ([EMAIL PROTECTED]) > wrote: > > > On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote: > > > > I shown that numbers 4 times already, do you read mails and links? > > > > Did you see an artifact Eric showed with his data? > > > > > > I showed all your thinking is wrong. > > > > You mix all the time distribution fairness and complexity of searching > > for collisions. Jenkins hash is unfair - having Linux random generator > > as attacker's tool we end up in the case, when jenkins hash has some > > chains with 5 times longer length. > > > > Short history: > > I showed that jenkins is unfair, you confirmed that with your data. > > Another question is about complexity of searching for collisions - you > > showed that with xor it is quite easy and with jenkins it is complex, > > then I showed that it is not that complex to find collisions in jenkins > > too. > > Again here is your 'test' > > enter_hash(u32 val) > { > counter[val & mask]++; > } > > for (i = 0 ; i < 1000 ; i++) > enter_hash(CONSTANT1) > for (i = 0 ; i < 1000 ; i++) > enter_hash(CONSTANT2) > > Oh my god, I have two chains with 1000 elems in it. 
Drop me the phone number of your drug dealer - I want that crack too :) Code is pretty simple: hash = jhash_3words(const, const, random_of_2^16, const_or_random_per_run); table[hash & (hash_size - 1)]++; > Real data are : > 1) XOR hash, with a load factor of 0.41 > > Distribution of sockets/chain length > [chain length]:number of sockets > [0]:752255 0% > [1]:208850 47.455% > [2]:54281 72.1225% > [3]:19236 85.235% > [4]:8199 92.6869% > [5]:3487 96.6485% > [6]:1489 98.6785% > [7]:515 99.4976% > [8]:192 99.8466% > [9]:53 99.955% > [10]:14 99.9868% > [11]:3 99.9943% > [12]:1 99.997% > [13]:1 100% > total : 440101 sockets, Avg lookup cost=3.07896 cache lines > > 2) Jenkins hash, same load factor > [0]:688813 0% > [1]:289874 65.8653% > [2]:60452 93.3372% > [3]:8493 99.1266% > [4]:879 99.9255% > [5]:62 99.9959% > [6]:3 100% > total : 440101 sockets, Avg lookup cost=2.4175 cache lines Please read my e-mail about fairness and the different usage models I found to be vulnerable. Your data only shows that you feel safe with the Jenkins hash, while it has problems (as does XOR). With the usage case described above, Jenkins will have entries with 50 elements in the chain - you have not received such packets yet, or you use jhash_2words(). -- Evgeniy Polyakov
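For anyone wanting to reproduce the experiment Evgeniy describes, here is a userspace transcription (mine, written from memory of the 2.6-era include/linux/jhash.h, so treat the constants and mixing steps as assumptions to be checked against the real header) of jhash_3words plus the chain-counting loop:

```python
M32 = 0xffffffff
GOLDEN_RATIO = 0x9e3779b9  # JHASH_GOLDEN_RATIO in the 2.6-era kernel header

def _mix(a, b, c):
    # Bob Jenkins' 96-bit mix, with explicit 32-bit wraparound.
    a = (a - b - c) & M32; a ^= c >> 13
    b = (b - c - a) & M32; b ^= (a << 8) & M32
    c = (c - a - b) & M32; c ^= b >> 13
    a = (a - b - c) & M32; a ^= c >> 12
    b = (b - c - a) & M32; b ^= (a << 16) & M32
    c = (c - a - b) & M32; c ^= b >> 5
    a = (a - b - c) & M32; a ^= c >> 3
    b = (b - c - a) & M32; b ^= (a << 10) & M32
    c = (c - a - b) & M32; c ^= b >> 15
    return a, b, c

def jhash_3words(a, b, c, initval):
    a = (a + GOLDEN_RATIO) & M32
    b = (b + GOLDEN_RATIO) & M32
    c = (c + initval) & M32
    return _mix(a, b, c)[2]

def chain_histogram(initval, bits=12):
    """The experiment above: hash (const, const, x) for all 16-bit x, count chains."""
    mask = (1 << bits) - 1
    table = [0] * (1 << bits)
    for x in range(1 << 16):
        table[jhash_3words(1, 2, x, initval) & mask] += 1
    return table
```

Plotting `max(chain_histogram(iv))` over many initvals is the quickest way to see whether the attacker-controlled third word really produces the long chains claimed here.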
Re: Extensible hashing and RCU
On Wed, Feb 21, 2007 at 01:34:40AM -0800, David Miller ([EMAIL PROTECTED]) wrote: > > I repeat again - add your salt into jenkins hash and I will show you > > that it has the same problems. > > So, I'm waiting for your patch for jhash_*_words(). > > The problem is that whilst XOR, with arbitrary random input > seed, can be forced to use choosen hash chains easily, with > jenkins this is not the case. > > The reason is that, due to jenkin's sophisticated mixing, each > random input produces unique "pattern" of hash chains even for > the most carefully crafted inputs. > > It is not trivial to target matching hash chains even with known > input seed, and it is impossible with unknown seed such as that > which we use in routing cache. Routing cache is safe in that regard, that found problem seems to only affect the case, when $c parameter is specially crafted - at least I can not reproduce distribution problem with first two parameters changed. But having different initial value, i.e. hash(const, const, random_from_attacker, random_per_boot) instead of hash(const, const, random_from_attacker, 0) ends up in the same problem. Attacker will not know which chain is selected, but changing that parameter will only change chain position, but distribution will be still broken. > I do not talk about distribution characteristics here, only about > attackability. It _seems_ (no full analysis) that problem is in the nature of jenkins hash - no matter how $initval is selected, its third parameter should _not_ be changed to data controlled by attacker, otherwise result is reproduced pretty easy. Linux route cache does not change $c (third parameter), and it _seems_ that distribution for the random $a and $b is fair, while when $c is formed over attacker's data, random per-boot $initval does not help. In that regard jhash_2/1words() are only safe - they have $c as zero, jhash_3words() with attackers $c and random per-boot $initval is trivially attackable. 
-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Wednesday 21 February 2007 10:27, Evgeniy Polyakov wrote: > On Wed, Feb 21, 2007 at 10:15:11AM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote: > > > I shown that numbers 4 times already, do you read mails and links? > > > Did you see an artifact Eric showed with his data? > > > > I showed all your thinking is wrong. > > You mix all the time distribution fairness and complexity of searching > for collisions. Jenkins hash is unfair - having Linux random generator > as attacker's tool we end up in the case, when jenkins hash has some > chains with 5 times longer length. > > Short history: > I showed that jenkins is unfair, you confirmed that with your data. > Another question is about complexity of searching for collisions - you > showed that with xor it is quite easy and with jenkins it is complex, > then I showed that it is not that complex to find collisions in jenkins > too. Again here is your 'test' enter_hash(u32 val) { counter[val & mask]++; } for (i = 0 ; i < 1000 ; i++) enter_hash(CONSTANT1) for (i = 0 ; i < 1000 ; i++) enter_hash(CONSTANT2) Oh my god, I have two chains with 1000 elems in it. Real data are : 1) XOR hash, with a load factor of 0.41 Distribution of sockets/chain length [chain length]:number of sockets [0]:752255 0% [1]:208850 47.455% [2]:54281 72.1225% [3]:19236 85.235% [4]:8199 92.6869% [5]:3487 96.6485% [6]:1489 98.6785% [7]:515 99.4976% [8]:192 99.8466% [9]:53 99.955% [10]:14 99.9868% [11]:3 99.9943% [12]:1 99.997% [13]:1 100% total : 440101 sockets, Avg lookup cost=3.07896 cache lines 2) Jenkins hash, same load factor [0]:688813 0% [1]:289874 65.8653% [2]:60452 93.3372% [3]:8493 99.1266% [4]:879 99.9255% [5]:62 99.9959% [6]:3 100% total : 440101 sockets, Avg lookup cost=2.4175 cache lines - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
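It is worth noting how close Eric's Jenkins numbers are to the Poisson distribution an ideal random hash would produce. His chain counts sum to exactly 2^20 buckets, so with 440101 sockets the load factor is λ ≈ 0.42, and N·e^(−λ)·λ^k/k! lands within about 1% of the observed Jenkins buckets for k = 0..3. A quick check (my code, with the bucket count inferred from the data above):

```python
import math

N_BUCKETS = 1 << 20          # inferred: the per-chain-length counts sum to exactly 2**20
N_SOCKETS = 440101
LAM = N_SOCKETS / N_BUCKETS  # load factor, ~0.42 ("0.41" in the mail)

def poisson_expected(k):
    """Expected number of buckets holding exactly k entries under an ideal random hash."""
    return N_BUCKETS * math.exp(-LAM) * LAM**k / math.factorial(k)

# Eric's measured Jenkins histogram, for comparison.
jenkins_observed = {0: 688813, 1: 289874, 2: 60452, 3: 8493, 4: 879, 5: 62, 6: 3}
```

Jenkins tracks the random ideal almost exactly; it is the XOR histogram, with its heavy tail out to chains of 13, that deviates from what an unbiased hash should look like at this load.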
Re: Extensible hashing and RCU
From: Evgeniy Polyakov <[EMAIL PROTECTED]> Date: Wed, 21 Feb 2007 11:56:08 +0300 > On Tue, Feb 20, 2007 at 12:09:59PM -0800, Michael K. Edwards ([EMAIL > PROTECTED]) wrote: > > On 2/20/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote: > > >Correct. That's called a "weak hash", and Jenkins is known to be a > > >thoroughly weak hash. That's why you never, ever use it without a > > >salt, and you don't let an attacker inspect the hash output either. > > > > Weak in a cryptographic sense, of course. Excellent avalanche > > behavior, though, which is what you care about in a salted hash. > > http://en.wikipedia.org/wiki/Hash_table > > I repeat again - add your salt into jenkins hash and I will show you > that it has the same problems. > So, I'm waiting for your patch for jhash_*_words(). The problem is that whilst XOR, with an arbitrary random input seed, can be forced to use chosen hash chains easily, with jenkins this is not the case. The reason is that, due to Jenkins's sophisticated mixing, each random input produces a unique "pattern" of hash chains even for the most carefully crafted inputs. It is not trivial to target matching hash chains even with a known input seed, and it is impossible with an unknown seed such as that which we use in the routing cache. I do not talk about distribution characteristics here, only about attackability.
Re: Extensible hashing and RCU
On Wed, Feb 21, 2007 at 10:15:11AM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote: > > I shown that numbers 4 times already, do you read mails and links? > > Did you see an artifact Eric showed with his data? > > > > I showed all your thinking is wrong. You keep mixing up distribution fairness and the complexity of searching for collisions. The Jenkins hash is unfair - with the Linux random generator as the attacker's tool we end up in the case where the jenkins hash has some chains 5 times longer. Short history: I showed that jenkins is unfair, you confirmed that with your data. Another question is about the complexity of searching for collisions - you showed that with xor it is quite easy and with jenkins it is complex, then I showed that it is not that complex to find collisions in jenkins too. > Some guys think MD5 checksum is full of artifacts, since its certainly > possible to construct two differents files having the same md5sum. > Yet, lot of people use md5 checksums. In 10 years, we probably switch to > another stronger hash. Its only a question of current state of the art. It is _NOT_ about having two inputs with the same output. It is about its distribution. I say that with random input the output will not be fairly distributed - and it was confirmed by you - having 2^16 random values ended up with some chains of length 4, while the hash table size perfectly allowed them all to have length 1, as was the case with the xor hash. > Jenkins hash is far better than XOR, at least in the tcp ehash context. How can it be better than xor when its distribution is unfair? You get longer hash chains with it compared to the xor one - _you_ showed that data too, if you do not believe my data. > Some hackers already exploited the XOR hash weak, more than two years ago. > They never succeeded since I changed to Jenkins hash. It is a _different_ problem.
It has _absolutely_ nothing in common with the distribution fairness problem found in the jenkins hash, do not mix the two. XOR has only one problem - it is quite easy to find a collision, but it has perfect distribution fairness. Fix XOR, add a random permutation, use sha/md5/whatever which has 1. fair distribution 2. strength against searching for collisions. The jenkins hash does not have at least the first, and as I showed, it is not that complex to break the second (for example the case of hash(const, const, random)). -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Wednesday 21 February 2007 09:54, Evgeniy Polyakov wrote: > I have shown those numbers 4 times already, do you read mails and links? > Did you see an artifact Eric showed with his data? > I showed all your thinking is wrong. Some guys think the MD5 checksum is full of artifacts, since it is certainly possible to construct two different files having the same md5sum. Yet, lots of people use md5 checksums. In 10 years, we will probably switch to another, stronger hash. It is only a question of the current state of the art. The Jenkins hash is far better than XOR, at least in the tcp ehash context. Some hackers already exploited the XOR hash weakness, more than two years ago. They never succeeded since I changed to the Jenkins hash.
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 12:09:59PM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote: > On 2/20/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote: > >Correct. That's called a "weak hash", and Jenkins is known to be a > >thoroughly weak hash. That's why you never, ever use it without a > >salt, and you don't let an attacker inspect the hash output either. > > Weak in a cryptographic sense, of course. Excellent avalanche > behavior, though, which is what you care about in a salted hash. > http://en.wikipedia.org/wiki/Hash_table I repeat again - add your salt into jenkins hash and I will show you that it has the same problems. So, I'm waiting for your patch for jhash_*_words(). > I know nothing about data structures and algorithms except what I read > on the Internet. But you'd be amazed what's on the Internet. > > Cheers, > - Michael -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 12:03:04PM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote: > >I have just shown a problem in the jenkins hash - it is not about finding a > >different input for the same output - it is a _law_ which allows one to > >break the hash. You will add some constant, and that law will be turned > >into something different (taking into account what was written, it will > >end up with the same law). > > Correct. That's called a "weak hash", and Jenkins is known to be a > thoroughly weak hash. That's why you never, ever use it without a > salt, and you don't let an attacker inspect the hash output either. Again, where will your salt be? I'm going to show you that a constant xor on a fairly distributed system will not change the distribution - it stays just as bad. > >Using the jenkins hash is equal to the situation where part of your hash > >chains will be 5 times longer than the median square value; with the XOR one > >there is no such distribution. > > Show us the numbers. Salt properly this time to reduce the artifacts > that come of applying a weak hash to a poor PRNG, and histogram your > results. If you don't get a Poisson distribution you probably don't > know how to use gnuplot either. :-) I have shown those numbers 4 times already, do you read mails and links? Did you see the artifact Eric showed with his data? > >Adding something into the permutations will not end up in a different > >distribution, since it is the permutations which are broken, not their > >result (which can be xored with something). > > I can't parse this. Care to try again? Where are you going to add a salt into the jenkins hash to fix its distribution? In other words - the jenkins hash is like a simple shift - that is a hash too, and it has a bad distribution too; where would an added salt ever help in that scenario? > Cheers, > - Michael -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On 2/20/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote: Correct. That's called a "weak hash", and Jenkins is known to be a thoroughly weak hash. That's why you never, ever use it without a salt, and you don't let an attacker inspect the hash output either. Weak in a cryptographic sense, of course. Excellent avalanche behavior, though, which is what you care about in a salted hash. http://en.wikipedia.org/wiki/Hash_table I know nothing about data structures and algorithms except what I read on the Internet. But you'd be amazed what's on the Internet. Cheers, - Michael
Re: Extensible hashing and RCU
On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: How I like personal insults - it is always fun to read about myself from people who never knew me :) On this occasion, I did not set out to insult you. I set out to suggest an explanation for why cooler and grayer heads than mine are not falling over themselves to merge your designs into the mainline kernel. Did I succeed? I have just shown a problem in the jenkins hash - it is not about finding a different input for the same output - it is a _law_ which allows one to break the hash. You will add some constant, and that law will be turned into something different (taking into account what was written, it will end up with the same law). Correct. That's called a "weak hash", and Jenkins is known to be a thoroughly weak hash. That's why you never, ever use it without a salt, and you don't let an attacker inspect the hash output either. Using the jenkins hash is equal to the situation where part of your hash chains will be 5 times longer than the median square value; with the XOR one there is no such distribution. Show us the numbers. Salt properly this time to reduce the artifacts that come of applying a weak hash to a poor PRNG, and histogram your results. If you don't get a Poisson distribution you probably don't know how to use gnuplot either. :-) Adding something into the permutations will not end up in a different distribution, since it is the permutations which are broken, not their result (which can be xored with something). I can't parse this. Care to try again? Cheers, - Michael
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 08:17:31PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > I showed your test was bogus. All your claims are just bogus. > I claim your 'true random data' is a joke. rand() in your program is a pure > joke. Care to reread your mail about your true random case with hash chain lengths of 3 and 4? Anyway, I have just shown that the jenkins hash is simple to crack and that finding its collisions is easy - even if you put some constant value in there it will be the same. It is math, not some special speculation about input values. > Given 48 bits of input, you *can* find a lot of addr/port to hit one > particular cache line if XOR function is used. With jhash, without knowing > the 32bits random secret, you *cant*. You seem not to want to understand that this is exactly the same as searching for a collision law. It is simple, and the results will be dangerous. > Again, you dont take into account the chain length. > > If all chains were of length <= 1, then yes, xor would be faster. In real > life, we *know* chain length can be larger, especially in DOS situations. I.e. you propose to add a hash which has a broken case for the same ip addresses and different ports, compared to the good xor? It was shown that hash(const, const, non_const) ends up with a _broken_ distribution compared to the xor hash. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 11:13:50AM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote: > On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: > >And here are some more which produce the same hash value. > >Of course searching for a pair for jhash('jhash is broken') > >will require more steps, but it is doable. > > > >That means that if an attacker has full control over one host, they can > >create a chain of maximum 4 entries in the socket table (if jhash is used). > >If it is udp, that means that the attacker controls addresses too without > >syn cookies, which in turn means that the list below can grow without > >bound. > > Evgeniy, you're not gettin' it. Have you ever heard of a Poisson > distribution? That's what you have to aim for in a (bog-standard) > hash table -- a hash function that scrambles the key space until you > have a Poisson distribution in the hash buckets no matter what pile of > keys the attacker throws at you. That's a "perfect" hash function in > the statistician's sense (not a "perfect hash" function in the > compiler designer's sense). Yes there will be collisions, and they > will start happening much earlier than you will think, if you have > never heard of Poisson distributions before; that's called the > "birthday problem" and it is a chestnut of a mathematical puzzle > generally encountered by bright twelve-year-olds in extra-curricular > math classes. The only sensible way to deal with them is to use a > better data structure -- like a 2-left hash (Google it) or something > tree-ish. > > That is not, however, what you're not getting. What you're not > getting is the use of a "salt" or "secret" or whatever you want to > call it to turn a weak hash into an impenetrable defense against > chosen-plaintext collision attacks. You can run your PRNG until you > slag your poor botnet's little CPUs searching for a set of tuples that > collide in the same bucket on your test system-under-attack. 
But they > won't collide on mine, and they won't collide on Eric's, and they > won't collide on yours the next time it's rebooted. Because even a > weak hash (in the cryptographer's sense) is good enough to scramble > the key space radically differently with two different salts. XOR > doesn't cut it -- the salt will permute the buckets but not toss each > one's contents up into the air to land in a bunch of different > buckets. But Jenkins does cut it, as discussed in the source code > Eric pointed out to you. > > Of course you don't change the salt except when resizing the hash > (which is a dumb thing to do anyway, and a sure sign that a hash table > is not the right data structure). That would kinda defeat the purpose > of having a hash table in the first place. > > If you can't assimilate this and articulate it back to us in your own > words instead of repeating "any hash that doesn't totally, utterly > suck slows my meaningless artificial benchmark by 50%", you may not be > the right person to design and implement a new RCU data structure in a > kernel that thousands of us trust to withstand exposure to the open > Internet to the best of a commodity processor's ability. How I like personal insults - it is always fun to read about myself from people who never knew me :) I have just shown a problem in the jenkins hash - it is not about finding a different input for the same output - it is a _law_ which allows one to break the hash. You will add some constant, and that law will be turned into something different (taking into account what was written, it will end up with the same law). Using the jenkins hash is equal to the situation where part of your hash chains will be 5 times longer than the median square value; with the XOR one there is no such distribution. Adding something into the permutations will not end up in a different distribution, since it is the permutations which are broken, not their result (which can be xored with something). 
> Cheers, > - Michael -- Evgeniy Polyakov
Re: Extensible hashing and RCU
All right, Eric, you and me so clevvah, let's embarrass our own selves designing this thing in public instead of picking on poor Evgeniy. Which would you rather RCU, a 2-left hash or a splay tree? 2-left hash gets you excellent occupation fraction before the first real collision, so you can be utterly stupid about collisions in the second hash table (just toss all double collisions in an overflow radix tree). Splay tree is a lot bigger bite to chew initially, but it gets you a hot set that's at least sometimes warm in cache, and it's easier to think about how to RCU it, and it doubles as a priority queue. You pick. Cheers, - Michael
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 20:06, Evgeniy Polyakov wrote: > > I added to my 'simulator_plugged_on_real_server' the average cost > > calculation, relative to the number of cache lines per lookup. > > > > ehash_size=2^20 > > xor hash : > > 386290 sockets, Avg lookup cost=3.2604 cache lines/lookup > > 393667 sockets, Avg lookup cost=3.30579 cache lines/lookup > > 400777 sockets, Avg lookup cost=3.3493 cache lines/lookup > > 404720 sockets, Avg lookup cost=3.36705 cache lines/lookup > > 406671 sockets, Avg lookup cost=3.37677 cache lines/lookup > > jenkins hash: > > 386290 sockets, Avg lookup cost=2.36763 cache lines/lookup > > 393667 sockets, Avg lookup cost=2.37533 cache lines/lookup > > 400777 sockets, Avg lookup cost=2.38211 cache lines/lookup > > 404720 sockets, Avg lookup cost=2.38582 cache lines/lookup > > 406671 sockets, Avg lookup cost=2.38679 cache lines/lookup > > > > (you can see that when the number of sockets increases, the xor hash becomes > > worse) > > > > So the jenkins hash function CPU cost is balanced by the fact its > > distribution is better. In the end you are faster. > > Very strange test - it shows that the jenkins distribution for your setup is > better than the xor one, although for the true random data they are roughly > the same, and the jenkins one has more instructions. > > But _you_ have shown that with true random data of 2^16 ports the jenkins > distribution is _worse_ than xor without any gain to buy. I showed your test was bogus. All your claims are just bogus. I claim your 'true random data' is a joke. rand() in your program is a pure joke. Given 48 bits of input, you *can* find a lot of addr/port to hit one particular cache line if XOR function is used. With jhash, without knowing the 32bits random secret, you *cant*. Again, you dont take into account the chain length. If all chains were of length <= 1, then yes, xor would be faster. In real life, we *know* chain length can be larger, especially in DOS situations. 
Re: Extensible hashing and RCU
On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: And here are some more which produce the same hash value. Of course searching for a pair for jhash('jhash is broken') will require more steps, but it is doable. That means that if an attacker has full control over one host, they can create a chain of maximum 4 entries in the socket table (if jhash is used). If it is udp, that means that the attacker controls addresses too without syn cookies, which in turn means that the list below can grow without bound. Evgeniy, you're not gettin' it. Have you ever heard of a Poisson distribution? That's what you have to aim for in a (bog-standard) hash table -- a hash function that scrambles the key space until you have a Poisson distribution in the hash buckets no matter what pile of keys the attacker throws at you. That's a "perfect" hash function in the statistician's sense (not a "perfect hash" function in the compiler designer's sense). Yes there will be collisions, and they will start happening much earlier than you will think, if you have never heard of Poisson distributions before; that's called the "birthday problem" and it is a chestnut of a mathematical puzzle generally encountered by bright twelve-year-olds in extra-curricular math classes. The only sensible way to deal with them is to use a better data structure -- like a 2-left hash (Google it) or something tree-ish. That is not, however, what you're not getting. What you're not getting is the use of a "salt" or "secret" or whatever you want to call it to turn a weak hash into an impenetrable defense against chosen-plaintext collision attacks. You can run your PRNG until you slag your poor botnet's little CPUs searching for a set of tuples that collide in the same bucket on your test system-under-attack. But they won't collide on mine, and they won't collide on Eric's, and they won't collide on yours the next time it's rebooted. 
Because even a weak hash (in the cryptographer's sense) is good enough to scramble the key space radically differently with two different salts. XOR doesn't cut it -- the salt will permute the buckets but not toss each one's contents up into the air to land in a bunch of different buckets. But Jenkins does cut it, as discussed in the source code Eric pointed out to you. Of course you don't change the salt except when resizing the hash (which is a dumb thing to do anyway, and a sure sign that a hash table is not the right data structure). That would kinda defeat the purpose of having a hash table in the first place. If you can't assimilate this and articulate it back to us in your own words instead of repeating "any hash that doesn't totally, utterly suck slows my meaningless artificial benchmark by 50%", you may not be the right person to design and implement a new RCU data structure in a kernel that thousands of us trust to withstand exposure to the open Internet to the best of a commodity processor's ability. Cheers, - Michael
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 07:55:15PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > On Tuesday 20 February 2007 19:00, Evgeniy Polyakov wrote: > > As you can see, jhash has problems in a really trivial case of 2^16, > > which in a local lan is a disaster. The only reason the jenkins hash is good > > for hashing purposes is that it is more complex than the xor one, and thus > > it is harder to find a collision law. That's all. > > And it is two times slower. > > I see no problems at all. An attacker can not exploit the fact that two (or > three) different values of sport will hit the same hash bucket. > > A hash function may have collisions. This is *designed* like that. > > The complexity of the hash function is a tradeoff. A perfect hash would be : > - Perfect distribution > - Hard (or even : impossible) to guess for an attacker. > - No CPU cost. > > There is no perfect hash function... given 96 bits in input. > So what ? hashes are 'badly broken' ? > Thats just not even funny Evgeniy. Jenkins has a _worse_ distribution than the xor one. _That_ is bad, not the fact that the hash has collisions. hash(val) = val >> 16; is a hash too, and it has an even worse distribution - it is designed even worse, so we do not use it. > The 'two times slower' is a simplistic view, or maybe you have an alien CPU, > or a CPU from the past ? It is a core duo at 3.7 ghz. Timings are printed in the test I showed in the list. > On my oprofile, rt_hash_code() uses 0.24% of cpu (about 50 x86_64 > instructions) > > Each time a cache miss is done because your bucket length is (X+1) instead of > (X), your CPU is stuck while it could have done 150 instructions. Next CPUs > will do 300 instructions per cache miss, maybe 1000 one day... yes, life is > hard. > > I added to my 'simulator_plugged_on_real_server' the average cost > calculation, > relative to the number of cache lines per lookup. 
> > ehash_size=2^20 > xor hash : > 386290 sockets, Avg lookup cost=3.2604 cache lines/lookup > 393667 sockets, Avg lookup cost=3.30579 cache lines/lookup > 400777 sockets, Avg lookup cost=3.3493 cache lines/lookup > 404720 sockets, Avg lookup cost=3.36705 cache lines/lookup > 406671 sockets, Avg lookup cost=3.37677 cache lines/lookup > jenkins hash: > 386290 sockets, Avg lookup cost=2.36763 cache lines/lookup > 393667 sockets, Avg lookup cost=2.37533 cache lines/lookup > 400777 sockets, Avg lookup cost=2.38211 cache lines/lookup > 404720 sockets, Avg lookup cost=2.38582 cache lines/lookup > 406671 sockets, Avg lookup cost=2.38679 cache lines/lookup > > (you can see that when the number of sockets increases, the xor hash becomes worse) > > So the jenkins hash function CPU cost is balanced by the fact its distribution > is better. In the end you are faster. Very strange test - it shows that the jenkins distribution for your setup is better than the xor one, although for the true random data they are roughly the same, and the jenkins one has more instructions. But _you_ have shown that with true random data of 2^16 ports the jenkins distribution is _worse_ than xor without any gain to buy. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 19:00, Evgeniy Polyakov wrote: > As you can see, jhash has problems in a really trivial case of 2^16, > which in a local lan is a disaster. The only reason the jenkins hash is good > for hashing purposes is that it is more complex than the xor one, and thus > it is harder to find a collision law. That's all. > And it is two times slower. I see no problems at all. An attacker can not exploit the fact that two (or three) different values of sport will hit the same hash bucket. A hash function may have collisions. This is *designed* like that. The complexity of the hash function is a tradeoff. A perfect hash would be : - Perfect distribution - Hard (or even : impossible) to guess for an attacker. - No CPU cost. There is no perfect hash function... given 96 bits in input. So what ? hashes are 'badly broken' ? Thats just not even funny Evgeniy. The 'two times slower' is a simplistic view, or maybe you have an alien CPU, or a CPU from the past ? On my oprofile, rt_hash_code() uses 0.24% of cpu (about 50 x86_64 instructions) Each time a cache miss is done because your bucket length is (X+1) instead of (X), your CPU is stuck while it could have done 150 instructions. Next CPUs will do 300 instructions per cache miss, maybe 1000 one day... yes, life is hard. I added to my 'simulator_plugged_on_real_server' the average cost calculation, relative to the number of cache lines per lookup. 
ehash_size=2^20
xor hash :
386290 sockets, Avg lookup cost=3.2604 cache lines/lookup
393667 sockets, Avg lookup cost=3.30579 cache lines/lookup
400777 sockets, Avg lookup cost=3.3493 cache lines/lookup
404720 sockets, Avg lookup cost=3.36705 cache lines/lookup
406671 sockets, Avg lookup cost=3.37677 cache lines/lookup
jenkins hash:
386290 sockets, Avg lookup cost=2.36763 cache lines/lookup
393667 sockets, Avg lookup cost=2.37533 cache lines/lookup
400777 sockets, Avg lookup cost=2.38211 cache lines/lookup
404720 sockets, Avg lookup cost=2.38582 cache lines/lookup
406671 sockets, Avg lookup cost=2.38679 cache lines/lookup
(you can see that when the number of sockets increases, the xor hash becomes worse)
So the jenkins hash function CPU cost is balanced by the fact that its distribution is better. In the end you are faster.
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 08:55:50PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: > Here is a dump of possible addr/port pairs which end up badly > distributed: > > 8e363a50:27652 -> c0a80001:20480 > 8e363a50:35529 -> c0a80001:20480 > 8e363a50:40919 -> c0a80001:20480 > 8e363a50:46720 -> c0a80001:20480 > > they produce the same hash value in the test described above. And here are some more which produce the same hash values. Of course searching for a pair for jhash('jhash is broken') will require more steps, but it is doable. That means that if an attacker has full control over one host, they can create a chain of maximum 4 entries in the socket table (if jhash is used). If it is udp, that means that the attacker controls addresses too without syn cookies, which in turn means that the list below can grow without bound.
8e363a50:22210 -> c0a80001:20480 10403
8e363a50:58377 -> c0a80001:20480 10403
8e363a50:9272 -> c0a80001:20480 10403
8e363a50:4173 -> c0a80001:20480 130f8
8e363a50:44401 -> c0a80001:20480 130f8
8e363a50:53439 -> c0a80001:20480 130f8
8e363a50:44525 -> c0a80001:20480 14391
8e363a50:46858 -> c0a80001:20480 14391
8e363a50:50030 -> c0a80001:20480 14391
8e363a50:40337 -> c0a80001:20480 1c66d
8e363a50:53249 -> c0a80001:20480 1c66d
8e363a50:65307 -> c0a80001:20480 1c66d
8e363a50:10433 -> c0a80001:20480 1fd1b
8e363a50:49548 -> c0a80001:20480 1fd1b
8e363a50:64835 -> c0a80001:20480 1fd1b
8e363a50:14889 -> c0a80001:20480 206ae
8e363a50:29984 -> c0a80001:20480 206ae
8e363a50:44282 -> c0a80001:20480 206ae
8e363a50:27521 -> c0a80001:20480 2a8c8
8e363a50:34493 -> c0a80001:20480 2a8c8
8e363a50:41134 -> c0a80001:20480 2a8c8
8e363a50:50387 -> c0a80001:20480 2c1fc
8e363a50:56740 -> c0a80001:20480 2c1fc
8e363a50:58943 -> c0a80001:20480 2c1fc
8e363a50:23856 -> c0a80001:20480 31ac2
8e363a50:35034 -> c0a80001:20480 31ac2
8e363a50:62638 -> c0a80001:20480 31ac2
8e363a50:15623 -> c0a80001:20480 33b81
8e363a50:24235 -> c0a80001:20480 33b81
8e363a50:38581 -> c0a80001:20480 33b81
8e363a50:23779 -> c0a80001:20480 37e65
8e363a50:42244 -> c0a80001:20480 37e65
8e363a50:6729 -> c0a80001:20480 37e65
8e363a50:11002 -> c0a80001:20480 3d06d
8e363a50:4321 -> c0a80001:20480 3d06d
8e363a50:5255 -> c0a80001:20480 3d06d
8e363a50:19326 -> c0a80001:20480 439c7
8e363a50:6187 -> c0a80001:20480 439c7
8e363a50:61932 -> c0a80001:20480 439c7
8e363a50:36916 -> c0a80001:20480 472ce
8e363a50:39670 -> c0a80001:20480 472ce
8e363a50:50520 -> c0a80001:20480 472ce
8e363a50:14229 -> c0a80001:20480 4e5f2
8e363a50:16897 -> c0a80001:20480 4e5f2
8e363a50:3340 -> c0a80001:20480 4e5f2
8e363a50:12892 -> c0a80001:20480 5d11
8e363a50:3998 -> c0a80001:20480 5d11
8e363a50:50654 -> c0a80001:20480 5d11
8e363a50:37267 -> c0a80001:20480 5e30e
8e363a50:41659 -> c0a80001:20480 5e30e
8e363a50:57118 -> c0a80001:20480 5e30e
8e363a50:27652 -> c0a80001:20480 6a284
8e363a50:35529 -> c0a80001:20480 6a284
8e363a50:40919 -> c0a80001:20480 6a284
8e363a50:46720 -> c0a80001:20480 6a284
8e363a50:1825 -> c0a80001:20480 6af47
8e363a50:3025 -> c0a80001:20480 6af47
8e363a50:49431 -> c0a80001:20480 6af47
8e363a50:17218 -> c0a80001:20480 77300
8e363a50:48400 -> c0a80001:20480 77300
8e363a50:9188 -> c0a80001:20480 77300
8e363a50:48327 -> c0a80001:20480 7cf09
8e363a50:55417 -> c0a80001:20480 7cf09
8e363a50:57221 -> c0a80001:20480 7cf09
8e363a50:10586 -> c0a80001:20480 809af
8e363a50:11371 -> c0a80001:20480 809af
8e363a50:27313 -> c0a80001:20480 809af
8e363a50:34688 -> c0a80001:20480 80bf3
8e363a50:58611 -> c0a80001:20480 80bf3
8e363a50:61056 -> c0a80001:20480 80bf3
8e363a50:10367 -> c0a80001:20480 85eae
8e363a50:3761 -> c0a80001:20480 85eae
8e363a50:57021 -> c0a80001:20480 85eae
8e363a50:10940 -> c0a80001:20480 88c52
8e363a50:26256 -> c0a80001:20480 88c52
8e363a50:7363 -> c0a80001:20480 88c52
8e363a50:10613 -> c0a80001:20480 89d75
8e363a50:54306 -> c0a80001:20480 89d75
8e363a50:59263 -> c0a80001:20480 89d75
8e363a50:16004 -> c0a80001:20480 91821
8e363a50:269 -> c0a80001:20480 91821
8e363a50:38109 -> c0a80001:20480 91821
8e363a50:1073 -> c0a80001:20480 96854
8e363a50:34201 -> c0a80001:20480 96854
8e363a50:58160 -> c0a80001:20480 96854
8e363a50:11353 -> c0a80001:20480 a17c4
8e363a50:37120 -> c0a80001:20480 a17c4
8e363a50:43332 -> c0a80001:20480 a17c4
8e363a50:26356 -> c0a80001:20480 a2e03
8e363a50:46187 -> c0a80001:20480 a2e03
8e363a50:61198 -> c0a80001:20480 a2e03
8e363a50:12881 -> c0a80001:20480 a7466
8e363a50:45272 -> c0a80001:20480 a7466
8e363a50:52661 -> c0a80001:20480 a7466
8e363a50:32863 -> c0a80001:20480 a7eeb
8e363a50:33575 -> c0a80001:20480 a7eeb
8e363a50:9977 -> c0a80001:20480 a7eeb
8e363a50:23136 -> c0a80001:20480 a9e47
8e363a50:41222 -> c0a80001:20480 a9e47
8e363a50:43554 -> c0a80001:20480 a9e47
8e363a50:3248 -> c0a80001:20480 b365
8e363a50:3417 -> c0a80001:20480 b365
8e363a50:61275 -> c0a80001:20480 b365
8e363a50:25606 -> c0a80001:20480 b511e
8e363a50:46638 -> c0a80001:20480 b511e
8e363a50:59262 -> c0a80001:20480 b511e
8e363a50:24384 -> c0a80001:20480 b571d
8e363a50:34078 ->
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 06:53:59PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > > I've attached source code and running script. > > > $ ./run.sh > > > > Yep, of course. > > Your test program is just bogus. artefacts come from your 'random' generator. > > You just increment a counter, assuming the key you search is not in the table > yet. No, that part is commented out, but it does not matter. > But obviously with only a variation of sport (16 bits), you have a maximum of > 65536 values. No need to feed 100*2^20 values as most of them are dups. > > Now if you change your program to do real lookups with the 2^16 possible > values of sport you'll see : No need to change anything - I showed that in some cases jenkins ends up with a huge problem. If the test is changed, or the sun is turned off, things will behave well, but as is they behave very badly. > jhash function : > 1 61578 > 2 1916 > 3 42 > > that is : 61578 chains of length 1 > 1916 chains of length 2 > 42 chains of length 3 > > (for reference, with HASH_FOLD, about the same results : > 1 61692 > 2 1856 > 3 44) > > Pretty good results... for the gain jhash gives us. > > Of course, the XOR hash gives a 'perfect' 65536 chains of length 1. As you can see, jhash has problems in a really trivial case of 2^16, which in a local lan is a disaster. The only reason the jenkins hash is good for hashing purposes is that it is more complex than the xor one, and thus it is harder to find a collision law. That's all. And it is two times slower. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 06:20:26PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > Hmm, I've just run the following test: > > 1. created a 2^20 hash table. > > 2. ran in a loop (100*(2^20) iterations) the following hashes: > > a. xor hash (const_ip, const_ip, random_word) > > So what ? to attack me you want to send 100*2^20 packets every minute ? :) No, I will specially craft 1000 packets which will hit the same chain. > Thats nonsense... If you really can send so many packets, My pipe is full > whatever I do of received packets. No Algo will protect me, even designed by > Einstein. Did you ever read what I wrote? It is a test, which shows that 1. jenkins has problems 2. it is two times slower than xor. How to exploit the problem in the real world is outside of this research, but it is enough to say that it is broken. > If you look again at route cache, you will see chains length are limited by > elasticity factor, that is usually 8... No need to try to reach 100 entries > in a chain. > > Yes, I can destroy Russia sending 2^10 nuclear weapons on major cities. You > really should build a bunker right now :) France only has 100 delivery vehicles (about 50 submarines and 50 Mirages) - so no, I will not :) > Now try to build an attack with 100 packets per second... and I will try to > be > smart too. Depending on the end result... Wanna buy me (or suggest) a couple of bottles of good, not expensive french wine? :) Here is a dump of possible addr/port pairs which end up badly distributed: 8e363a50:27652 -> c0a80001:20480 8e363a50:35529 -> c0a80001:20480 8e363a50:40919 -> c0a80001:20480 8e363a50:46720 -> c0a80001:20480 they produce the same hash value in the test described above. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 18:05, Evgeniy Polyakov wrote: > On Tue, Feb 20, 2007 at 07:59:07PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: > > I've attached source code and running script. > > $ ./run.sh > > Yep, of course. Your test program is just bogus. The artefacts come from your 'random' generator. You just increment a counter, assuming the key you search for is not in the table yet. But obviously with only a variation of sport (16 bits), you have a maximum of 65536 values. No need to feed 100*2^20 values as most of them are dups. Now if you change your program to do real lookups with the 2^16 possible values of sport you'll see :
jhash function :
1 61578
2 1916
3 42
that is :
61578 chains of length 1
1916 chains of length 2
42 chains of length 3
(for reference, with HASH_FOLD, about the same results :
1 61692
2 1856
3 44)
Pretty good results... for the gain jhash gives us. Of course, the XOR hash gives a 'perfect' 65536 chains of length 1.
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 17:59, Evgeniy Polyakov wrote: > On Tue, Feb 20, 2007 at 05:38:19PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > > It is secrecy, not security - an attacker will check the source and find > > > where the constant per-boot value is added and recalculate the attack vector - > > > we all were college students, it would be even more fun to crack. > > > > > > In that regard the Jenkins hash and the XOR one have _exactly_ the same attack > > > vector, only Jenkins is a bit more sophisticated. I even think that the > > > example in rt_hash_code() will end up with heavy problems when one of > > > the addresses is constant - my tests show a problem exactly in the case > > > of jhash_2words() with a random third parameter and a constant first one, > > > like in rt_hash_code(). > > > > Please define heavy problem. > > > > On most hosts, with one NIC, one IP address, most entries in cache have > > the same address (IP address of eth0 or localhost). It just works. > > > > Last time I checked, the 2^21 route cache I am using was correctly > > filled, thanks to jhash. > > > > Again, the random value is 32bits. If jhash happens to be cracked by your > > students, we just put md5 or whatever in... > > > > You can call it secrecy or whatever, fact is : it's just working, far > > better than the XOR previous hash function. > > Hmm, I've just run the following test: > 1. created a 2^20 hash table. > 2. ran in a loop (100*(2^20) iterations) the following hashes: > a. xor hash (const_ip, const_ip, random_word) So what ? to attack me you want to send 100*2^20 packets every minute ? Thats nonsense... If you really can send so many packets, my pipe is full whatever I do with received packets. No algo will protect me, even one designed by Einstein. If you look again at the route cache, you will see chain lengths are limited by the elasticity factor, that is usually 8... No need to try to reach 100 entries in a chain. Yes, I can destroy Russia sending 2^10 nuclear weapons on major cities. 
You really should build a bunker right now :) Now try to build an attack with 100 packets per second... and I will try to be smart too. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 07:59:07PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> I've attached source code and running script.
> $ ./run.sh

Yep, of course.

-- Evgeniy Polyakov

/* Note: the list archive ate the #include names and the bodies of the
 * main() loops; they are reconstructed below from the test description
 * in the parent mail (constant address pair, random port,
 * 100*hash_size iterations, then a chain-length histogram). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

typedef unsigned int u32;
typedef unsigned short u16;
typedef unsigned char u8;
typedef unsigned int __u32;
typedef unsigned short __u16;
typedef unsigned char __u8;

#include "jhash.h"

struct hash_entry {
    unsigned long counter;
};

static inline __u32 num2ip(__u8 a1, __u8 a2, __u8 a3, __u8 a4)
{
    __u32 a = 0;

    a |= a1; a <<= 8;
    a |= a2; a <<= 8;
    a |= a3; a <<= 8;
    a |= a4;

    return a;
}

static inline __u8 get_random_byte(void)
{
    return 1 + (int) (255.0 * (rand() / (RAND_MAX + 1.0)));
}

static inline __u16 get_random_word(void)
{
    return 1 + (int) (65536.0 * (rand() / (RAND_MAX + 1.0)));
}

unsigned int hash_addr1(__u32 faddr, __u16 fport, __u32 laddr, __u16 lport)
{
    unsigned int h = (laddr ^ lport) ^ (faddr ^ fport);

    h ^= h >> 16;
    h ^= h >> 8;

    return h;
}

unsigned int hash_addr(__u32 faddr, __u16 fport, __u32 laddr, __u16 lport)
{
    u32 ports;
    unsigned int h;

    ports = lport;
    ports <<= 16;
    ports |= fport;

    h = jhash_3words(faddr, laddr, ports, 123123);
#ifdef HASH_FOLD
    h ^= h >> 16;
    h ^= h >> 8;
#endif
    return h;
}

int main()
{
    struct hash_entry *table;
    __u32 saddr, daddr;
    __u16 sport, dport;
    unsigned int hash, i, *results;
    unsigned int hash_size = (1<<10), iter_num = 100;

    table = malloc(hash_size * sizeof(struct hash_entry));
    if (!table)
        return -ENOMEM;
    results = malloc(hash_size * sizeof(unsigned int));
    if (!results)
        return -ENOMEM;

    memset(table, 0, hash_size * sizeof(struct hash_entry));
    memset(results, 0, hash_size * sizeof(unsigned int));

    /* reconstructed: constant address pair, random foreign port */
    saddr = num2ip(192, 168, 0, 1);
    daddr = num2ip(192, 168, 0, 2);
    sport = 10080;

    for (i=0; i<iter_num*hash_size; ++i) {
        dport = get_random_word();
        hash = hash_addr(daddr, dport, saddr, sport);
        hash &= hash_size - 1;
        table[hash].counter++;
    }

    /* reconstructed: histogram of chain lengths */
    for (i=0; i<hash_size; ++i)
        if (table[i].counter < hash_size)
            results[table[i].counter]++;

    for (i=0; i<hash_size; ++i)
        if (results[i])
            printf("[%u]: %u\n", i, results[i]);

    return 0;
}

[Attachment: run.sh - Bourne shell script]
[Attachment: artefact.png - PNG image]
[Attachment: distribution.png - PNG image]
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 05:38:19PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> > It is secrecy, not security - attacker will check the source and find
> > where the constant per-boot value is added and recalculate the attack vector -
> > we all were college students, it would be even more fun to crack.
> >
> > In that regard the Jenkins hash and the XOR one have _exactly_ the same attack
> > vector, only Jenkins is a bit more sophisticated. I even think that the
> > example in rt_hash_code() will end up with heavy problems when one of the
> > addresses is constant - my tests show the problem exactly in the case of
> > jhash_2words() with a random third parameter and a constant one of the first,
> > like in rt_hash_code().
>
> Please define heavy problem.
>
> On most hosts, with one NIC, one IP address, most entries in cache have the
> same address (IP address of eth0 or localhost). It just works.
>
> Last time I checked, the 2^21 route cache I am using was correctly filled,
> thanks to jhash.
>
> Again, the random value is 32 bits. If jhash happens to be cracked by your
> students, we just put md5 or whatever in...
>
> You can call it secrecy or whatever, fact is: it's just working, far better
> than the previous XOR hash function.

Hmm, I've just run the following test:
1. created a 2^20 hash table.
2. ran in a loop (100*(2^20) iterations) the following hashes:
   a. xor hash (const_ip, const_ip, random_word)
   b. jhash_3words(const_ip, const_ip, random_word, 123123) - it is exactly
      the same as jhash_2words(const_ip, const_ip, random_word)
3. hash &= hash_size - 1;
4. table[hash].counter++;
5. for (i=0; i<hash_size; ++i) - collected the chain length distribution
   from table[i].counter.
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 17:20, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 05:08:12PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> > > Adding XOR with a constant value does not change the distribution.
> > > A variable salt will end up with different buckets for the same flow.
> > > It is forbidden - it is not the situation created for passwd/des
> > > decades ago.
> >
> > Adding a random hint to jhash (random value picked at boot time, not
> > known by the attacker) gives a secure hash table: an attacker
> > cannot build an attack to fill one particular hash chain.
> >
> > See net/ipv4/route.c (function rt_hash_code()) to see how it's used for
> > the route cache.
>
> It is secrecy, not security - attacker will check the source and find
> where the constant per-boot value is added and recalculate the attack vector -
> we all were college students, it would be even more fun to crack.
>
> In that regard the Jenkins hash and the XOR one have _exactly_ the same attack
> vector, only Jenkins is a bit more sophisticated. I even think that the
> example in rt_hash_code() will end up with heavy problems when one of the
> addresses is constant - my tests show the problem exactly in the case
> of jhash_2words() with a random third parameter and a constant one of the first,
> like in rt_hash_code().

Please define heavy problem.

On most hosts, with one NIC, one IP address, most entries in cache have the same address (IP address of eth0 or localhost). It just works.

Last time I checked, the 2^21 route cache I am using was correctly filled, thanks to jhash.

Again, the random value is 32 bits. If jhash happens to be cracked by your students, we just put md5 or whatever in...

You can call it secrecy or whatever, fact is: it's just working, far better than the previous XOR hash function.
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 05:08:12PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> > Adding XOR with a constant value does not change the distribution.
> > A variable salt will end up with different buckets for the same flow.
> > It is forbidden - it is not the situation created for passwd/des decades
> > ago.
>
> Adding a random hint to jhash (random value picked at boot time, not known by
> the attacker) gives a secure hash table: an attacker cannot build an
> attack to fill one particular hash chain.
>
> See net/ipv4/route.c (function rt_hash_code()) to see how it's used for the route
> cache.

It is secrecy, not security - attacker will check the source and find where the constant per-boot value is added and recalculate the attack vector - we all were college students, it would be even more fun to crack.

In that regard the Jenkins hash and the XOR one have _exactly_ the same attack vector, only Jenkins is a bit more sophisticated. I even think that the example in rt_hash_code() will end up with heavy problems when one of the addresses is constant - my tests show the problem exactly in the case of jhash_2words() with a random third parameter and a constant one of the first, like in rt_hash_code().

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 16:59, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 07:49:11AM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote:
> > On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> > > Jenkins _does_ have them, I showed tests half a year ago and in this
> > > thread too. Actually _any_ hash has them, it is just a matter of time
> > > to find one.
> >
> > I think you misunderstood me. If you are trying to DoS me from
> > outside with a hash collision attack, you are trying to feed me
> > packets that fall into the same hash bucket. The Jenkins hash does
> > not have to be artifact-free, and does not have to be
> > cryptographically strong. It just has to do a passable job of mixing
> > a random salt into the tuple, so you don't know which string of
> > packets to feed me in order to fill one (or a few) of my buckets.
> > XORing salt into a folded tuple doesn't help; it just permutes the
> > buckets.
>
> Adding XOR with a constant value does not change the distribution.
> A variable salt will end up with different buckets for the same flow.
> It is forbidden - it is not the situation created for passwd/des decades
> ago.

Adding a random hint to jhash (a random value picked at boot time, not known by the attacker) gives a secure hash table: an attacker cannot build an attack to fill one particular hash chain.

See net/ipv4/route.c (function rt_hash_code()) to see how it's used for the route cache.
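A minimal sketch of the scheme Eric is referring to. Everything here is illustrative: `boot_salt` and `route_hash()` are hypothetical names, and the mix is one simplified round in the style of Bob Jenkins' 1997 function, not the kernel's actual `jhash_2words()`. The point is only that the per-boot random value goes in as hash *input*, before mixing, so the bucket mapping is unpredictable to an attacker:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for jhash_2words(): one Jenkins-style mixing
 * round (the real kernel function does more mixing than this). */
static uint32_t jhash_2words(uint32_t a, uint32_t b, uint32_t initval)
{
    uint32_t c = initval;

    a += 0x9e3779b9;
    b += 0x9e3779b9;

    a -= b; a -= c; a ^= (c >> 13);
    b -= c; b -= a; b ^= (a << 8);
    c -= a; c -= b; c ^= (b >> 13);

    return c;
}

/* Per-boot random value, set once from a random source at boot. */
static uint32_t boot_salt;

/* Salt goes in BEFORE mixing, so buckets cannot be predicted from
 * the addresses alone.  hash_bits selects the table size. */
static uint32_t route_hash(uint32_t saddr, uint32_t daddr, unsigned hash_bits)
{
    return jhash_2words(saddr, daddr, boot_salt) & ((1u << hash_bits) - 1);
}
```

Contrast with XORing a salt *after* folding, which only renames buckets and protects nothing.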
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 16:49, Michael K. Edwards wrote:
> On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> > Jenkins _does_ have them, I showed tests half a year ago and in this
> > thread too. Actually _any_ hash has them, it is just a matter of time
> > to find one.
>
> I think you misunderstood me. If you are trying to DoS me from
> outside with a hash collision attack, you are trying to feed me
> packets that fall into the same hash bucket. The Jenkins hash does
> not have to be artifact-free, and does not have to be
> cryptographically strong. It just has to do a passable job of mixing
> a random salt into the tuple, so you don't know which string of
> packets to feed me in order to fill one (or a few) of my buckets.
> XORing salt into a folded tuple doesn't help; it just permutes the
> buckets.

Yes. I must say I had an attack like that some years ago on one particular server: some tcp ehash chains had a length > 1000.

I had to plug in the Jenkins hash to stop the attack (thanks to David :), and thanks to oprofile for letting me understand what was happening).

The attacker was controlling several thousand zombies and was able to choose its src port (knowing its src ip addr) to target *one* particular hash bucket on my web server. Each zombie was opening one tcp socket only, so a firewall could not detect them; they had an absolutely normal behavior.

XOR, combined with the 16-bit range of the src port, permits a lot of easy guessing for the attacker (since it knows the ehash_size of the target is a power of two...)
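The bucket-targeting trick Eric describes can be sketched in a few lines. This is an illustrative model, not kernel code: it uses the plain XOR fold without the final folding shifts, assumes a power-of-two table of at most 2^16 buckets, and the addresses and the `pick_sport()` helper name are made up. Because XOR is its own inverse, each zombie can solve for a source port that lands its flow in one chosen bucket:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified established-socket hash: plain XOR fold into a
 * power-of-two table (the h ^= h >> 16; h ^= h >> 8 folding of the
 * real function is omitted for clarity). */
static unsigned ehash(uint32_t laddr, uint16_t lport,
                      uint32_t faddr, uint16_t fport, unsigned mask)
{
    return (laddr ^ lport ^ faddr ^ fport) & mask;
}

/* What each zombie computes: knowing the server's address/port and
 * its own address, solve ehash() == target for its source port.
 * XOR cancels, so the low mask bits of the fold become `target`. */
static uint16_t pick_sport(uint32_t laddr, uint16_t lport,
                           uint32_t faddr, unsigned target, unsigned mask)
{
    return (uint16_t)((laddr ^ lport ^ faddr ^ target) & mask);
}
```

Every zombie ends up in the same chosen chain even though each one has a different source address; a salted multiplicative mix like jhash makes this solve step impossible without knowing the per-boot salt.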
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 07:49:11AM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote:
> On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> > Jenkins _does_ have them, I showed tests half a year ago and in this
> > thread too. Actually _any_ hash has them, it is just a matter of time
> > to find one.
>
> I think you misunderstood me. If you are trying to DoS me from
> outside with a hash collision attack, you are trying to feed me
> packets that fall into the same hash bucket. The Jenkins hash does
> not have to be artifact-free, and does not have to be
> cryptographically strong. It just has to do a passable job of mixing
> a random salt into the tuple, so you don't know which string of
> packets to feed me in order to fill one (or a few) of my buckets.
> XORing salt into a folded tuple doesn't help; it just permutes the
> buckets.

Adding XOR with a constant value does not change the distribution. A variable salt will end up with different buckets for the same flow. It is forbidden - it is not the situation created for passwd/des decades ago.

> Cheers,
> - Michael

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On 2/20/07, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

Jenkins _does_ have them, I showed tests half a year ago and in this thread too. Actually _any_ hash has them it is just a matter of time to find one.

I think you misunderstood me. If you are trying to DoS me from outside with a hash collision attack, you are trying to feed me packets that fall into the same hash bucket. The Jenkins hash does not have to be artifact-free, and does not have to be cryptographically strong. It just has to do a passable job of mixing a random salt into the tuple, so you don't know which string of packets to feed me in order to fill one (or a few) of my buckets. XORing salt into a folded tuple doesn't help; it just permutes the buckets.

Cheers,
- Michael
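Michael's point - that XORing a salt into an already-folded hash only relabels buckets - is easy to demonstrate. A sketch with assumed tuple values; the function names are illustrative. The two tuples below collide under the XOR fold (the same bit is flipped in `faddr` and `fport`, so the XOR cancels), and no post-fold salt can separate them, because `h -> (h ^ salt) & mask` is a bijection on bucket numbers:

```c
#include <assert.h>
#include <stdint.h>

/* The classic XOR fold of the 4-tuple. */
static unsigned xor_fold(uint32_t faddr, uint16_t fport,
                         uint32_t laddr, uint16_t lport)
{
    unsigned h = (laddr ^ lport) ^ (faddr ^ fport);

    h ^= h >> 16;
    h ^= h >> 8;
    return h;
}

/* Salting AFTER the fold: a bijection on bucket numbers, so any two
 * tuples that collided before the salt still collide after it,
 * whatever the salt value is. */
static unsigned salted_bucket(unsigned h, unsigned salt, unsigned mask)
{
    return (h ^ salt) & mask;
}
```

To break the attacker's precomputation the salt has to enter *before* the non-linear mixing, which is exactly what feeding it to jhash as the third word does.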
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 07:07:16AM -0800, Michael K. Edwards ([EMAIL PROTECTED]) wrote:
> On 2/20/07, David Miller <[EMAIL PROTECTED]> wrote:
> > Actually someone (I think it was Evgeniy in fact) made such
> > comparisons and found in his studies that not only does the current
> > ehash xor hash function distribute about as well as jenkins, it's
> > significantly cheaper to calculate :-)
>
> However, it's vulnerable to hash collision attacks, which the Jenkins
> hash (used as a poor man's HMAC with a boot-time random salt) is not.
> But if you don't care about DoS

Jenkins _does_ have them, I showed tests half a year ago and in this thread too. Actually _any_ hash has them, it is just a matter of time to find one.

> Cheers,
> - Michael

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On 2/20/07, David Miller <[EMAIL PROTECTED]> wrote:

Actually someone (I think it was Evgeniy in fact) made such comparisons and found in his studies that not only does the current ehash xor hash function distribute about as well as jenkins, it's significantly cheaper to calculate :-)

However, it's vulnerable to hash collision attacks, which the Jenkins hash (used as a poor man's HMAC with a boot-time random salt) is not. But if you don't care about DoS

Cheers,
- Michael
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 12:25:23PM +0300, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> And there is _no_ O(1) - O(1) is just the hash entry selection, then you
> need to add the whole chain path, which can be long enough.
> For example for the length 9 you have 1000 entries - it is exactly the size
> of the tree - the sum of all entries up to and including 2^9.
> One third of the table is accessible within 17 lookups (hash chain
> length is 1), but taking into account the size of the entry - let's say it
> is equal to
> 32+32+32 - size of the key
> 32+32+32 - size of the left/right/parent pointer
> so 192 bytes per entry - taking into account that 1/4 of the 1-2 MB cache is

I just realized that it is in _BITS_, not in bytes, so a typical trie/tree entry is about 24 bytes - less than one cache line!

-- Evgeniy Polyakov
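The corrected arithmetic - three 32-bit key words plus three links - can be checked with a sketch struct (an illustrative layout, not Evgeniy's actual entry). On a 32-bit kernel it is 24 bytes as he says; even with 64-bit pointers and alignment padding it still fits in a single 64-byte cache line:

```c
#include <assert.h>
#include <stdint.h>

/* 96 bits of key (e.g. saddr, daddr, ports) + left/right/parent links.
 * 32-bit pointers: 12 + 12 = 24 bytes.  64-bit pointers: 40 bytes
 * after alignment - still one cache line either way. */
struct trie_entry {
    uint32_t key[3];
    struct trie_entry *left, *right, *parent;
};
```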
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 12:30:18PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> On Tuesday 20 February 2007 12:10, Eric Dumazet wrote:
> > > Yep, it happened to be my tests :)
> > > The Jenkins hash was slower and had significant artifacts for some usage
> > > cases, which ended up with extremely long chain lengths.
> > > One can find more details at
> > > http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
> > > http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01
> >
> > Please explain why you chose
> >   h = jhash_2words(faddr, laddr, ports);
> >   h ^= h >> 16;
> >   h ^= h >> 8;
> >
> > jhash is very good, no need to try to be smarter, shuffling some bytes...
> > and adding artifacts.
>
> I checked with my simulator and got no differences with the extra ops, at
> least no artifacts. Maybe this is related to the fact my hash size is 2^20 ?
>
> If we use the Jenkins hash:
> [0]:617469 0%
> [1]:326671 58.7654%
> [2]:86704 89.9601%
> [3]:15387 98.264%
> [4]:2103 99.7773%
> [5]:216 99.9716%
> [6]:24 99.9975%
> [7]:2 100%
>
> If we use the Jenkins hash (plus Evgeniy Polyakov's shifts):
> [0]:617553 0%
> [1]:326403 58.7172%
> [2]:86902 89.9831%
> [3]:15462 98.3275%
> [4]:2012 99.7753%
> [5]:216 99.9696%
> [6]:27 99.9987%
> [7]:1 100%

Yes, they are the same - but artifacts are artifacts in the sense that they appear only in special sets; as far as I recall, one of the parameters must be constant (i.e. an address or port pair).

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 12:34:59PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> > Taking into account that the network stack with NAPI schedules several
> > packets to be queued into a socket and all that happens without any
> > influence from userspace, the trie/tree wins again in that regard that the
> > majority of the tree will be in the cache already.
>
> Nope, you receive 100 packets for 100 different flows. Only the code may be
> in cache (minus for the first packet), and the very top of the tree...
>
> Forget cache. Forget it.

The first packet populates half of the top of the tree/trie on its way to the bottom - another one goes to a completely different location, but it still needs to go through some of the just-populated entries - it is the tree/trie nature - the probability of a hit halves at each level (for a binary tree; for a trie it is very different) until the first 'miss', but it is still non-zero.

But generally I agree that talking about caches can only be valid when all other issues are resolved completely. So, we need a patch.

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 12:29, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 12:09:51PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> > If we want to optimize tcp, we should reorder fields to reduce number of
> > cache lines, not change algos. struct sock fields are currently placed to
> > reduce holes, while they should be grouped by related fields sharing
> > cache lines.
>
> Taking into account that the network stack with NAPI schedules several
> packets to be queued into a socket and all that happens without any
> influence from userspace, the trie/tree wins again in that regard that the
> majority of the tree will be in the cache already.

Nope, you receive 100 packets for 100 different flows. Only the code may be in cache (minus for the first packet), and the very top of the tree...

Forget cache. Forget it.

> The hash table has its fast access only in theory; practice adds caches,
> NAPI and a lot of other stuff. Even a simple test (maybe broken, but it
> is equally broken for both trie and hash, even worse for the trie)
> shows that the hash table does not behave as well as expected and is
> close to the trie.
>
> I'm going back to the drawing board to design a simple trie algo/patch
> suitable for replacing the hash table selection, so that we could test it
> in real life examples.
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 12:10, Eric Dumazet wrote:
> > Yep, it happened to be my tests :)
> > The Jenkins hash was slower and had significant artifacts for some usage
> > cases, which ended up with extremely long chain lengths.
> > One can find more details at
> > http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
> > http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01
>
> Please explain why you chose
>   h = jhash_2words(faddr, laddr, ports);
>   h ^= h >> 16;
>   h ^= h >> 8;
>
> jhash is very good, no need to try to be smarter, shuffling some bytes...
> and adding artifacts.

I checked with my simulator and got no differences with the extra ops, at least no artifacts. Maybe this is related to the fact my hash size is 2^20 ?

If we use the Jenkins hash:
[0]:617469 0%
[1]:326671 58.7654%
[2]:86704 89.9601%
[3]:15387 98.264%
[4]:2103 99.7773%
[5]:216 99.9716%
[6]:24 99.9975%
[7]:2 100%

If we use the Jenkins hash (plus Evgeniy Polyakov's shifts):
[0]:617553 0%
[1]:326403 58.7172%
[2]:86902 89.9831%
[3]:15462 98.3275%
[4]:2012 99.7753%
[5]:216 99.9696%
[6]:27 99.9987%
[7]:1 100%
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 12:09:51PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> If we want to optimize tcp, we should reorder fields to reduce number of cache
> lines, not change algos. struct sock fields are currently placed to reduce
> holes, while they should be grouped by related fields sharing cache lines.

Taking into account that the network stack with NAPI schedules several packets to be queued into a socket and all that happens without any influence from userspace, the trie/tree wins again in that regard that the majority of the tree will be in the cache already.

The hash table has its fast access only in theory; practice adds caches, NAPI and a lot of other stuff. Even a simple test (maybe broken, but it is equally broken for both trie and hash, even worse for the trie) shows that the hash table does not behave as well as expected and is close to the trie.

I'm going back to the drawing board to design a simple trie algo/patch suitable for replacing the hash table selection, so that we could test it in real life examples.

-- Evgeniy Polyakov
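The kind of structure Evgeniy is proposing can be sketched minimally. This is an assumption about the shape of such a patch, not his actual code: a binary trie over a 32-bit key, one key bit consumed per level, with the key stored in every node so a lookup can stop as soon as it hits a match:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal binary trie keyed on 32 bits; illustrative sketch only.
 * Each node holds a full key, so lookup terminates early on a hit. */
struct tnode {
    uint32_t key;
    struct tnode *child[2];
};

/* Descend by the key's bits, highest bit first, until a free slot
 * or an existing node with the same key is found. */
static void trie_insert(struct tnode **slot, uint32_t key)
{
    int bit = 31;

    while (*slot) {
        if ((*slot)->key == key)
            return;                 /* already present */
        slot = &(*slot)->child[(key >> bit--) & 1];
    }
    *slot = calloc(1, sizeof(**slot));
    if (*slot)
        (*slot)->key = key;
}

/* Same descent; two distinct 32-bit keys must diverge within 32
 * levels, so the walk always terminates. */
static struct tnode *trie_lookup(struct tnode *n, uint32_t key)
{
    int bit = 31;

    while (n && n->key != key)
        n = n->child[(key >> bit--) & 1];
    return n;
}
```

A real socket-lookup version would key on the whole 96-bit tuple and balance or path-compress the levels, but the read side here is a plain pointer walk, which is what makes the trivial RCUfication mentioned later in the thread plausible.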
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 12:10:22PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> Please explain why you chose
>   h = jhash_2words(faddr, laddr, ports);
>   h ^= h >> 16;
>   h ^= h >> 8;
>
> jhash is very good, no need to try to be smarter, shuffling some bytes... and
> adding artifacts.

If the distribution is fair, shifting it still produces a fair distribution. In the discussion linked from that test I also produced results without the shifts, which were exactly the same.

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 11:44, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 11:04:15AM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> > You totally miss the fact that the 1-2-4 MB cache is not available for
> > you at all. It is filled by User accesses. I dont care about DOS. I care
> > about real servers, servicing tcp clients. The TCP service/stack should
> > not take more than 10% of CPU (cycles and caches). The User application
> > is certainly more important because it hosts the real added value.
>
> A TCP socket is 4k in size; one tree entry can be reduced to 200 bytes?
> No one talks about _that_ cache miss, it is considered OK to have, but a
> tree cache miss becomes the worst thing ever.
> In softirq we process the socket's state, lock, reference counter, several
> pointers, and if we are happy - the whole TCP state machine fields - and
> most of it stays there when the kernel part is over - userspace issues
> syscalls which must populate it back. Why don't we see that it is moved
> into the cache each time a syscall is invoked? Because it stays in the
> cache, together with the part of the hash table associated with the most
> recently used hash entries - which should not be there; part of the tree
> could be there instead.

No, I see cache misses everywhere... This is because my machines are doing real work in user land. They are not lab machines. Even if I had cpus with 16-32MB cache, it would be the same, because User land wants GBs ...
For example, sock_wfree() uses 1.6612 % of cpu because of false sharing of sk_flags (dirtied each time SOCK_QUEUE_SHRUNK is set :(

803c2850 <sock_wfree>: /* sock_wfree total: 714241 1.6613 */
  1307  0.0030 :803c2850: push %rbp
 55056  0.1281 :803c2851: mov %rsp,%rbp
    94 2.2e-04 :803c2854: push %rbx
               :803c2855: sub $0x8,%rsp
  1090  0.0025 :803c2859: mov 0x10(%rdi),%rbx
     3 7.0e-06 :803c285d: mov 0xb8(%rdi),%eax
    38 8.8e-05 :803c2863: lock sub %eax,0x90(%rbx)
                          /* HOT : access to sk_flags */
 81979  0.1907 :803c286a: mov 0x100(%rbx),%eax
512119  1.1912 :803c2870: test $0x2,%ah
   262 6.1e-04 :803c2873: jne 803c2880
   142 3.3e-04 :803c2875: mov %rbx,%rdi
 14467  0.0336 :803c2878: callq *0x200(%rbx)
    63 1.5e-04 :803c287e: data16
               :803c287f: nop
  9046  0.0210 :803c2880: lock decl 0x28(%rbx)
 29792  0.0693 :803c2884: sete %al
    56 1.3e-04 :803c2887: test %al,%al
   789  0.0018 :803c2889: je 803c2893
               :803c288b: mov %rbx,%rdi
   144 3.3e-04 :803c288e: callq 803c0f90
  1685  0.0039 :803c2893: add $0x8,%rsp
  2462  0.0057 :803c2897: pop %rbx
   684  0.0016 :803c2898: leaveq
  2963  0.0069 :803c2899: retq

This is why tcp lookups should not take more than 1% themselves: other parts of the stack *want* to make many cache misses too.

If we want to optimize tcp, we should reorder fields to reduce the number of cache lines, not change algos. struct sock fields are currently placed to reduce holes, while they should be grouped by related fields sharing cache lines.
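The field-grouping Eric suggests can be illustrated with a toy struct (this is not the real struct sock layout; the names and the 64-byte line size are assumptions). The read-mostly lookup fields share one cache line, while the write-hot accounting that caused the sk_flags false sharing sits on its own line, so softirq writes do not invalidate the line the lookup path reads:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

struct demo_sock {
    /* read-mostly on the lookup path: kept together in one line */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    char pad0[CACHELINE - 2 * sizeof(uint32_t) - 2 * sizeof(uint16_t)];

    /* write-hot fields: dirtied on every send-buffer update, pushed
     * onto their own line so they do not false-share with the keys */
    int wmem_alloc;
    unsigned long flags;
} __attribute__((aligned(CACHELINE)));
```

The explicit padding plays the role that `____cacheline_aligned_in_smp` annotations play in kernel structs; the trade-off is the holes it reintroduces, which is exactly the tension Eric mentions.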
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 11:30, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 02:12:09AM -0800, David Miller ([EMAIL PROTECTED]) wrote:
> > From: Eric Dumazet <[EMAIL PROTECTED]>
> > Date: Tue, 20 Feb 2007 11:04:15 +0100
> >
> > > Using a Jenkins hash permits a better hash distribution for a little
> > > cpu cost. I will post later a distribution simulation based on the
> > > data gathered from the same real server.
> >
> > Actually someone (I think it was Evgeniy in fact) made such
> > comparisons and found in his studies that not only does the current
> > ehash xor hash function distribute about as well as jenkins, it's
> > significantly cheaper to calculate :-)
>
> Yep, it happened to be my tests :)
> The Jenkins hash was slower and had significant artifacts for some usage
> cases, which ended up with extremely long chain lengths.
> One can find more details at
> http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
> http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01

Please explain why you chose
  h = jhash_2words(faddr, laddr, ports);
  h ^= h >> 16;
  h ^= h >> 8;

jhash is very good, no need to try to be smarter, shuffling some bytes... and adding artifacts.
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 11:12, David Miller wrote:
> From: Eric Dumazet <[EMAIL PROTECTED]>
> Date: Tue, 20 Feb 2007 11:04:15 +0100
>
> > Using a Jenkins hash permits a better hash distribution for a little
> > cpu cost. I will post later a distribution simulation based on the
> > data gathered from the same real server.
>
> Actually someone (I think it was Evgeniy in fact) made such
> comparisons and found in his studies that not only does the current
> ehash xor hash function distribute about as well as jenkins, it's
> significantly cheaper to calculate :-)
>
> If you find jenkins is better, great, but I hope it works that
> way for many workloads.

Here are results from a real workload:

ehash_addr=0x81047600 ehash_size=1048576
330241 chains, 2441 twchains

Distribution of sockets/chain length
[chain length]:number of sockets
[0]:718335 0%
[1]:223485 42.1436%
[2]:60655 65.0196%
[3]:22451 77.7207%
[4]:11164 86.1416%
[5]:6376 92.1534%
[6]:3236 95.8148%
[7]:1600 97.9268%
[8]:793 99.1231%
[9]:286 99.6085%
[10]:111 99.8178%
[11]:56 99.934%
[12]:20 99.9793%
[13]:4 99.9891%
[14]:2 99.9943%
[15]:2 100%
total : 530294 sockets

Imagine we double the hash size (2097152). The distribution would be:
[0]:1698209 0%
[1]:309255 58.3177%
[2]:61638 81.5644%
[3]:18593 92.0829%
[4]:6479 96.97%
[5]:2092 98.9425%
[6]:655 99.6836%
[7]:184 99.9265%
[8]:37 99.9823%
[9]:6 99.9925%
[10]:4 100%

If we use the Jenkins hash: jhash_3words(daddr, saddr, (dport<<16)|sport, 0);
[0]:632435 0%
[1]:319916 60.328%
[2]:80465 90.6754%
[3]:13816 98.4914%
[4]:1735 99.8001%
[5]:196 99.9849%
[6]:11 99.9974%
[7]:2 100%
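As a sanity check on Eric's numbers (this analysis is not in the original mail): an ideal uniform hash gives chain lengths following a Poisson distribution with lambda = sockets/buckets. For his 530294 sockets in 2^20 buckets, lambda is about 0.506, and the Poisson prediction lands very close to the measured jhash column - which is precisely what "jhash behaves like a random function" means. A small self-contained check, avoiding libm by computing exp(-x) from its series:

```c
#include <assert.h>

/* exp(-x) via its Taylor series, accurate for the small x used here;
 * done by hand only so the example needs no libm. */
static double exp_neg(double x)
{
    double term = 1.0, sum = 1.0;
    int k;

    for (k = 1; k < 40; k++) {
        term *= -x / k;
        sum += term;
    }
    return sum;
}

/* P(chain length == k) for an ideal hash with load factor lambda. */
static double poisson_pmf(double lambda, int k)
{
    double p = exp_neg(lambda);
    int i;

    for (i = 1; i <= k; i++)
        p *= lambda / i;
    return p;
}
```

`poisson_pmf(530294.0/1048576, 0) * 1048576` comes out around 632 thousand empty buckets, against Eric's measured 632435 - agreement to within a fraction of a percent. The XOR column with its 718335 empty buckets and chains up to length 15 deviates visibly from this ideal.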
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 11:04:15AM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> You totally miss the fact that the 1-2-4 MB cache is not available for you at
> all. It is filled by User accesses. I dont care about DOS. I care about real
> servers, servicing tcp clients. The TCP service/stack should not take more
> than 10% of CPU (cycles and caches). The User application is certainly more
> important because it hosts the real added value.

A TCP socket is 4k in size; one tree entry can be reduced to 200 bytes? No one talks about _that_ cache miss, it is considered OK to have, but a tree cache miss becomes the worst thing ever.

In softirq we process the socket's state, lock, reference counter, several pointers, and if we are happy - the whole TCP state machine fields - and most of it stays there when the kernel part is over - userspace issues syscalls which must populate it back. Why don't we see that it is moved into the cache each time a syscall is invoked? Because it stays in the cache, together with the part of the hash table associated with the most recently used hash entries - which should not be there; part of the tree could be there instead.

> As soon as softirq finishes its job, the CPU returns to User Space, blowing out
> your top-level entries from the cache. Next ms (tg3 for example, batching XX
> packets per interrupt), the softirq handler must repopulate the cache again and
> again.
>
> Is the linux kernel aimed at constructing firewalls ? Certainly not. The linux
> kernel is a generalist kernel. Of course your tree algo might be better in
> certain situations, but those situations are not the norm. In these
> situations, we might reserve 50% of the cpu cache to hold the top levels of the
> route cache and tcp cache, but provide no power to user programs.
>
> With ehash, pressure on cpu caches is small and the User application can make
> progress.

With a tree it is exactly the same - but instead of unneeded entries one can put the head of the tree there.
To prove myself right (or wrong) and make me eat my hat I will implement a simple rb-tree-based socket lookup (which does not allow simple RCUfication though) or a trie (without wildcard support), which allows trivial RCUfication. Would you care to test patches or provide a testcase repeatable in my lab (a couple of 2-way core duo, athlon 3500 and via epia)?

> Using a Jenkins hash permits a better hash distribution for a little cpu cost.
> I will post later a distribution simulation based on the data gathered from
> the same real server.

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 02:12:09AM -0800, David Miller ([EMAIL PROTECTED]) wrote: > From: Eric Dumazet <[EMAIL PROTECTED]> > Date: Tue, 20 Feb 2007 11:04:15 +0100 > > > Using a Jenkins hash permits a better hash distribution for a little > > cpu cost. I will post later a distribution simulation based on the > > data gathered from the same real server. > > Actually someone (I think it was Evgeniy in fact) made such > comparisons and found in his studies that not only does the current > ehash xor hash function distribute about as well as jenkins, it's > significantly cheaper to calculate :-) Yep, those happened to be my tests :) The Jenkins hash was slower and, for some usage patterns, showed significant artifacts that ended up in extremely long chains. One can find more details at http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14 http://tservice.net.ru/~s0mbre/blog/2006/06/01#2006_06_01 -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Tue, Feb 20, 2007 at 01:57:45AM -0800, David Miller ([EMAIL PROTECTED]) wrote: > > What you miss is the fact that upper layers of the tree are always in the > > cache. So accessing them costs nothing compared to a new hash table > > entry every time. > > It will not be on a real computer spending reasonable if not > predominant time in userspace. > > I agree with others here in this thread, in that your test program is > not accurate at all in several aspects as it makes many incorrect > assumptions about the environment in which these lookups run: > > 1) cache must be assumed cold, ALWAYS > 2) large PTE mappings must be used to simulate kernel costs > and timings accurately > > Even on a computer not spending much time in userspace, the cache will > be cold too in most of these paths. Do you remember the cpu cycle > counter experiment I ran in the tg3 driver a few years back? > > That test showed that in the NAPI loop, the second packet was always processed > some 40 times faster than the first one, and it's all because of cache > warming done by the first packet. > > It is a very essential and sometimes non-intuitive aspect of packet > input processing. You must code for the cold cache case. Therefore > you cannot run silly loops in userspace to measure the performance of > a flow demux algorithm, it is a completely pointless exercise in fact > :-) That is not correct. The table for 2^20 entries is about 4 MB in size; we get a half or a third of it into the cache in my userspace test - and it still performs badly (well, it is noticeably better than all other approaches so far, but not as good as expected). With my trie test, a _smaller_ part of the trie is in the cache due to the size of the node, which includes _a lot_ of additional information used for wildcard processing (needed for netfilter and/or the route implementation in netchannels). 
The above words are _only_ correct for hash tables - yes, in that case the cache is always cold, since each new flow goes into a completely different entry. With a tree/trie (and _especially_ on SMP) access to a low-level entry _requires_ all the entries above it to be in the cache, so until the load fully flushes it away, there will be a win for the tree implementation. If the load is such that it flushes the whole cache each time, then we are going to die anyway, because populating struct sock and/or the TCP socket is much worse than the several tree/trie entries touched in the lookup. > > And there is _no_ O(1) - O(1) is just a hash entry selection, then you > > need to add the whole chain path, which can be long enough. > > For example for the length 9 you have 1000 entries - it is exactly size > > of the tree - sum of all entries up to and including 2^9. > > The hash table should grow so that the number of hash chains matches > the number of active sockets. That is how we program every dynamically > grown hash table algorithm in the kernel, and we would do the same > for TCP once dynamic growth is possible. Eric's test shows that only one third of the sockets are in chains of length 1, so it does not work that way - which is neither bad nor good; table scaling is not a trivial operation indeed. > It is an assumption of every hash algorithm. We chose poor sizes at > boot time on Eric's computer, or there is some limit preventing us from > using a proper size. It does not make a flaw in the hash approach. > > People don't use trees or tries for socket demux in any modern > operating system, there is a reason and it is not because this > approach has not been tried. :-) Trees are _always_ used for huge route tables - and not just because wildcards are too ugly to implement in hash tables :). 
I would agree that routers do not perform any real work besides packet forwarding, so they do have their caches filled with all the needed information (especially if all of the above is implemented in hardware and thus completely different), but it does not mean that existing systems, with their cache issues, cannot scale with trees. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
From: Eric Dumazet <[EMAIL PROTECTED]> Date: Tue, 20 Feb 2007 11:04:15 +0100 > Using a Jenkins hash permits a better hash distribution for a little > cpu cost. I will post later a distribution simulation based on the > data gathered from the same real server. Actually someone (I think it was Evgeniy in fact) made such comparisons and found in his studies that not only does the current ehash xor hash function distribute about as well as jenkins, it's significantly cheaper to calculate :-) If you find jenkins is better, great, but I hope it works that way for many workloads.
Re: Extensible hashing and RCU
On Tuesday 20 February 2007 10:25, Evgeniy Polyakov wrote: > On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > On Monday 19 February 2007 16:14, Eric Dumazet wrote: > > > Because O(1) is different of O(log(N)) ? > > > if N = 2^20, it certainly makes a difference. > > > Yes, 1% of chains might have a length > 10, but yet 99% of the lookups > > > are touching less than 4 cache lines. > > > With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's > > > 4, I will be very pleased. > > > > Here is the tcp ehash chain length distribution on a real server : > > ehash_addr=0x81047600 > > ehash_size=1048576 > > 333835 used chains, 3365 used twchains > > Distribution of sockets/chain length > > [chain length]:number of sockets > > [1]:221019 37.4645% > > [2]:56590 56.6495% > > [3]:21250 67.4556% > > [4]:12534 75.9541% > > [5]:8677 83.3082% > > [6]:5862 89.2701% > > [7]:3640 93.5892% > > [8]:2219 96.5983% > > [9]:1083 98.2505% > > [10]:539 99.1642% > > [11]:244 99.6191% > > [12]:112 99.8469% > > [13]:39 99.9329% > > [14]:16 99.9708% > > [15]:6 99.9861% > > [16]:3 99.9942% > > [17]:2 100% > > total : 589942 sockets > > > > So even with a lazy hash function, 89 % of lookups are satisfied with > > less than 6 compares. > > This is only 2^19 sockets. > > What you miss is the fact, that upper layers of the tree are always in the > cache. So its access cost nothing compared to every time new entry in > the hash table. > > And there is _no_ O(1) - O(1) is just a hash entry selection, then you > need to add the whole chain path, which can be long enough. > For example for the length 9 you have 1000 entries - it is exactly size > of the tree - sum of all entries upto and including 2^9. 
> One third of the table is accessible within 17 lookups (hash chain > length is 1), but given into account size of the entry - let's say it > is equal to > 32+32+32 - size of the key > 32+32+32 - size of the left/right/parent pointer > so 192 bytes per entry - given into account that 1/4 of the 1-2 MB cache is > filled with it, we get about 1300 top-level entries in the hash, so > about _10_ first lookups are in the cache _always_ due to nature of the > tree lookup. You totally miss the fact that the 1-2-4 MB cache is not available for you at all. It is filled by User accesses. I don't care about DOS. I care about real servers, servicing tcp clients. The TCP service/stack should not take more than 10% of CPU (cycles and caches). The User application is certainly more important because it hosts the real added value. As soon as softirq has finished its job, CPU returns to User Space, blowing out your top-level entries from the cache. Next ms (tg3 for example, batching XX packets per interrupt), the softirq handler must repopulate the cache again and again. Is the linux kernel aimed at constructing firewalls ? Certainly not. The Linux kernel is a generalist kernel. Of course your tree algo might be better in certain situations, but those situations are not the norm. In these situations, we might reserve 50% of cpu cache to hold the top level of the route cache and tcp cache, but provide no power to user programs. With ehash, pressure on cpu caches is small and the User application can make progress. Using a Jenkins hash permits a better hash distribution for a little cpu cost. I will post later a distribution simulation based on the data gathered from the same real server.
Re: Extensible hashing and RCU
From: Evgeniy Polyakov <[EMAIL PROTECTED]> Date: Tue, 20 Feb 2007 12:25:23 +0300 > What you miss is the fact that upper layers of the tree are always in the > cache. So accessing them costs nothing compared to a new entry in > the hash table every time. It will not be on a real computer spending reasonable if not predominant time in userspace. I agree with others here in this thread, in that your test program is not accurate at all in several aspects as it makes many incorrect assumptions about the environment in which these lookups run: 1) cache must be assumed cold, ALWAYS 2) large PTE mappings must be used to simulate kernel costs and timings accurately Even on a computer not spending much time in userspace, the cache will be cold too in most of these paths. Do you remember the cpu cycle counter experiment I ran in the tg3 driver a few years back? That test showed that in the NAPI loop, the second packet was always processed some 40 times faster than the first one, and it's all because of cache warming done by the first packet. It is a very essential and sometimes non-intuitive aspect of packet input processing. You must code for the cold cache case. Therefore you cannot run silly loops in userspace to measure the performance of a flow demux algorithm, it is a completely pointless exercise in fact :-) > And there is _no_ O(1) - O(1) is just a hash entry selection, then you > need to add the whole chain path, which can be long enough. > For example for the length 9 you have 1000 entries - it is exactly size > of the tree - sum of all entries up to and including 2^9. The hash table should grow so that the number of hash chains matches the number of active sockets. That is how we program every dynamically grown hash table algorithm in the kernel, and we would do the same for TCP once dynamic growth is possible. It is an assumption of every hash algorithm. We chose poor sizes at boot time on Eric's computer, or there is some limit preventing us from using a proper size. 
It does not make a flaw in the hash approach. People don't use trees or tries for socket demux in any modern operating system; there is a reason, and it is not because this approach has not been tried. :-)
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > On Monday 19 February 2007 16:14, Eric Dumazet wrote: > > > > Because O(1) is different of O(log(N)) ? > > if N = 2^20, it certainly makes a difference. > > Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are > > touching less than 4 cache lines. > > With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's 4, > > I will be very pleased. > > > > Here is the tcp ehash chain length distribution on a real server : > ehash_addr=0x81047600 > ehash_size=1048576 > 333835 used chains, 3365 used twchains > Distribution of sockets/chain length > [chain length]:number of sockets > [1]:221019 37.4645% > [2]:56590 56.6495% > [3]:21250 67.4556% > [4]:12534 75.9541% > [5]:8677 83.3082% > [6]:5862 89.2701% > [7]:3640 93.5892% > [8]:2219 96.5983% > [9]:1083 98.2505% > [10]:539 99.1642% > [11]:244 99.6191% > [12]:112 99.8469% > [13]:39 99.9329% > [14]:16 99.9708% > [15]:6 99.9861% > [16]:3 99.9942% > [17]:2 100% > total : 589942 sockets > > So even with a lazy hash function, 89 % of lookups are satisfied with less > than 6 compares. This is only 2^19 sockets. What you miss is the fact that the upper layers of the tree are always in the cache, so accessing them costs nothing compared to a new entry in the hash table every time. And there is _no_ O(1) - O(1) is just the hash entry selection; then you need to add the whole chain path, which can be long enough. For example, for length 9 you have 1000 entries - it is exactly the size of the tree: the sum of all entries up to and including 2^9. 
One third of the table is accessible within 17 lookups (hash chain length is 1), but take into account the size of the entry - let's say it is equal to 32+32+32 - size of the key 32+32+32 - size of the left/right/parent pointers so 192 bytes per entry. Given that 1/4 of the 1-2 MB cache is filled with it, we get about 1300 top-level entries in the cache, so about the first _10_ lookups are in the cache _always_ due to the nature of the tree lookup. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
Evgeniy Polyakov <[EMAIL PROTECTED]> writes: > > My experiment shows almost 400 nsecs without _any_ locks - they are > removed completely - it is pure hash selection/list traverse time. Are you sure you're not measuring TLB misses too? In user space you likely use 4K pages. The kernel would use 2MB pages. I would suggest putting the tables into hugetlbfs allocated memory in your test program. -Andi
Re: Extensible hashing and RCU
On 19 Feb 2007 13:04:12 +0100, Andi Kleen <[EMAIL PROTECTED]> wrote: LRU tends to be hell for caches in MP systems, because it writes to the cache lines too and makes them exclusive and more expensive. That's why you let the hardware worry about LRU. You don't write to the upper layers of the splay tree when you don't have to. It's the mere traversal of the upper layers that keeps them in cache, causing the cache hierarchy to mimic the data structure hierarchy. RCU changes the whole game, of course, because you don't write to the old copy at all; you have to clone the altered node and all its ancestors and swap out the root node itself under a spinlock. Except you don't use a spinlock; you have a ring buffer of root nodes and atomically increment the writer index. That atomically incremented index is the only thing on which there's any write contention. (Obviously you need a completion flag on the new root node for the next writer to poll on, so the sequence is atomic-increment ... copy and alter from leaf to root ... wmb() ... mark new root complete.) When you share TCP sessions among CPUs, and packets associated with the same session may hit softirq in any CPU, you are going to eat a lot of interconnect bandwidth keeping the sessions coherent. (The only way out of this is to partition the tuple space by CPU at the NIC layer with separate per-core, or perhaps per-cache, receive queues; at which point the NIC is so smart that you might as well put the DDoS handling there.) But at least it's cache coherency protocol bandwidth and not bandwidth to and from DRAM, which has much nastier latencies. The only reason the data structure matters _at_all_ is that DDoS attacks threaten to evict the working set of real sessions out of cache. That's why you add new sessions at the leaves and don't rotate them up until they're hit a second time. Of course the leaf layer can't be RCU, but it doesn't have to be; it's just a bucket of tuples. 
You need an auxiliary structure to hold the session handshake trackers for the leaf layer, but you assume that you're always hitting cold cache when diving into this structure and ration accesses accordingly. Maybe you even explicitly evict entries from cache after sending the SYNACK, so they don't crowd other stuff out; they go to DRAM and get pulled into the new CPU (and rotated up) if and when the next packet in the session arrives. (I'm assuming T/TCP here, so you can't skimp much on session tracker size during the handshake.) Every software firewall I've seen yet falls over under DDoS. If you want to change that, you're going to need more than the back-of-the-napkin calculations that show that session lookup bandwidth exceeds frame throughput for min-size packets. You're going to need to strategize around exploiting the cache hierarchy already present in your commodity processor to implicitly partition real traffic from the DDoS storm. It's not a trivial problem, even in the mathematician's sense (in which all problems are either trivial or unsolved). Cheers, - Michael
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 01:26:42PM -0500, Benjamin LaHaise wrote: > On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote: > > So even with a lazy hash function, 89 % of lookups are satisfied with less > > than 6 compares. > > Which sucks, as those are typically going to be cache misses (costing many > hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions, > and must be removed under a heavy load. Worst case handling is very > important next to common case. I should clarify. Back of the napkin calculations show that there are only 157 cycles on a 3GHz processor in which to decide what happens to a packet, which means 1 cache miss is more than enough. In theory we can get pretty close to line rate with quad core processors, but it definitely needs some of the features that newer chipsets have for stuffing packets directly into the cache. I would venture a guess that we also need to intelligently partition packets so that we make the most use of available cache resources. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote: > So even with a lazy hash function, 89 % of lookups are satisfied with less > than 6 compares. Which sucks, as those are typically going to be cache misses (costing many hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions, and must be removed under a heavy load. Worst case handling is very important next to common case. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: Extensible hashing and RCU
On Monday 19 February 2007 16:14, Eric Dumazet wrote: > > Because O(1) is different of O(log(N)) ? > if N = 2^20, it certainly makes a difference. > Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are > touching less than 4 cache lines. > With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's 4, > I will be very pleased. > Here is the tcp ehash chain length distribution on a real server :
ehash_addr=0x81047600
ehash_size=1048576
333835 used chains, 3365 used twchains
Distribution of sockets/chain length
[chain length]:number of sockets
[1]:221019 37.4645%
[2]:56590 56.6495%
[3]:21250 67.4556%
[4]:12534 75.9541%
[5]:8677 83.3082%
[6]:5862 89.2701%
[7]:3640 93.5892%
[8]:2219 96.5983%
[9]:1083 98.2505%
[10]:539 99.1642%
[11]:244 99.6191%
[12]:112 99.8469%
[13]:39 99.9329%
[14]:16 99.9708%
[15]:6 99.9861%
[16]:3 99.9942%
[17]:2 100%
total : 589942 sockets
So even with a lazy hash function, 89 % of lookups are satisfied with less than 6 compares.
Re: Extensible hashing and RCU
On Monday 19 February 2007 15:25, Evgeniy Polyakov wrote: > On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > > Forget about cache misses and cache lines - we have a hash table, only > > > part of which is used (part for time-wait sockets, part for established > > > ones). > > > > No you didnt not read my mail. Current ehash is not as decribed by you. > > I did. > And I also said that my tests do not have timewait sockets at all - I > removed sk_for_each and so on, which should effectively increase lookup > time twice on busy system with lots of created/removed sockets per > timeframe (that is theory from my side already). > Anyway, I ran the same test with increased table too. > > > > Anyway, even with 2^20 (i.e. when the whole table is only used for > > > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > > > Duo (only one CPU is used) with 2 GB of ram. > > > > Your tests are user land, so unfortunatly are biased... > > > > (Unless you use hugetlb data ?) > > No I do not. But the same can be applied to trie test - it is also > performed in userspace and thus suffers from possible swapping/cache > flushing and so on. > > And I doubt moving test into kernel will suddenly end up with 10 times > increased rates. At least some architectures pay a high price using vmalloc() instead of kmalloc(), and TLB misses means something for them. Not everybody has the latest Intel cpu. Normally, ehash table is using huge pages. > > Anyway, trie test (broken implementation) is two times slower than hash > table (resized already), and it does not include locking isses of the > hash table access and further scalability issues. > You mix apples and oranges. We already know locking has nothing to do with hashing or trie-ing. We *can* put RCU on top of the existing ehash. We also can add hash resizing if we really care. 
> I think I need to fix my trie implementation to fully show its > potential, but the original question was why a tree/trie-based implementation > is not considered at all although it promises better performance and > scalability. Because you mix performance and scalability. That's not exactly the same thing. Sometimes high performance means *suboptimal* scalability. Because O(1) is different from O(log(N)) ? if N = 2^20, it certainly makes a difference. Yes, 1% of chains might have a length > 10, but yet 99% of the lookups are touching less than 4 cache lines. With a binary tree, log(2^20) is 20. or maybe not ? If you tell me it's 4, I will be very pleased.
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 03:14:02PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > Forget about cache misses and cache lines - we have a hash table, only > > part of which is used (part for time-wait sockets, part for established > > ones). > > No, you did not read my mail. The current ehash is not as you describe. I did. And I also said that my tests do not have timewait sockets at all - I removed sk_for_each and so on, which should effectively increase lookup time twice on a busy system with lots of created/removed sockets per timeframe (that is theory from my side already). Anyway, I ran the same test with the increased table too. > > Anyway, even with 2^20 (i.e. when the whole table is only used for > > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > > Duo (only one CPU is used) with 2 GB of ram. > > Your tests are user land, so unfortunately are biased... > > (Unless you use hugetlb data ?) No I do not. But the same can be applied to the trie test - it is also performed in userspace and thus suffers from possible swapping/cache flushing and so on. And I doubt moving the test into the kernel will suddenly end up with 10 times increased rates. Anyway, the trie test (broken implementation) is two times slower than the hash table (already resized), and it does not include the locking issues of hash table access and further scalability issues. I think I need to fix my trie implementation to fully show its potential, but the original question was why a tree/trie-based implementation is not considered at all although it promises better performance and scalability. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Monday 19 February 2007 14:56, Evgeniy Polyakov wrote: > On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > > On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > > > > > Theory again. > > > > Theory is nice, but I personally prefer oprofile :) > > I base my comments on real facts. > > We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, > > one for exclusive access intent) > > I said that your words are theory in previous mails :) > > Current code works 10 times worse than you expect. > > > > Existing table does not scale that good - I created (1<<20)/2 (to cover > > > only established part) entries table and filled it with 1 million of > > > random entries - search time is about half of a microsecond. > > > > I use exactly 2^20 slots, not 2^19 (see commit > > dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed layout of > > ehash table so that two chains (established/timewait) are on the same > > cache line. every cache miss *counts*) > > Forget about cache misses and cache lines - we have a hash table, only > part of which is used (part for time-wait sockets, part for established > ones). No, you did not read my mail. The current ehash is not as you describe it. > > Anyway, even with 2^20 (i.e. when the whole table is only used for > established sockets) search time is about 360-370 nsec on 3.7 GHz Core > Duo (only one CPU is used) with 2 GB of ram. Your tests are user land, so unfortunately they are biased... (Unless you use hugetlb data ?)
Re: Extensible hashing and RCU
Actually for socket code any other binary tree will work perfectly well - socket code does not have wildcards (except listening sockets), so it is possible to combine all values into one search key used in a flat one-dimensional tree - it scales like hell and still allows very fast lookups. As for cache usage - such trees can be combined with different protocols to increase cache locality. The only reason I implemented a trie is that netchannels support wildcards; that is how netfilter is implemented on top of them. A tree with lazy deletion (i.e. without deletion at all) can be moved to RCU very easily. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 02:38:13PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > > > Theory again. > > Theory is nice, but I personally prefer oprofile :) > I base my comments on real facts. > We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one > for exclusive access intent) I said that your words were theory in previous mails :) The current code works 10 times worse than you expect. > > Existing table does not scale that good - I created (1<<20)/2 (to cover > > only established part) entries table and filled it with 1 million of random > > entries - search time is about half of a microsecond. > > I use exactly 2^20 slots, not 2^19 (see commit > dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed layout of ehash > table so that two chains (established/timewait) are on the same cache line. > every cache miss *counts*) Forget about cache misses and cache lines - we have a hash table, only part of which is used (part for time-wait sockets, part for established ones). Anyway, even with 2^20 (i.e. when the whole table is only used for established sockets) search time is about 360-370 nsec on a 3.7 GHz Core Duo (only one CPU is used) with 2 GB of ram. > http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html > > (Of course, you may have to change MAX_ORDER to 14 or else the hash table > hits > the MAX_ORDER limit) > > Search time under 100 ns, for real traffic (kind of random... but not quite) > Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally > in... My experiment shows almost 400 nsecs without _any_ locks - they are removed completely - it is pure hash selection/list traverse time. > In your tests, please make sure a User process is actually doing real work on > each CPU, ie evicting cpu caches every ms... 
> > The rule is : On a normal machine, cpu caches contain UserMode data, not > kernel data. (as a typical machine spends 15% of its cpu time in kernel land, > and 85% in User land). You can assume kernel text is in cache, but even this > assumption may be wrong. In my tests _only_ the hash tables are in memory (well, with some bits of other stuff) - I use exactly the same approach for both the trie and hash table tests - the table/trie is allocated, filled, and lookups of random values are performed in a loop. It is done in userspace - I just moved list.h, inet_hashtables.h and other needed files into a separate project and compiled them (with locks, atomic operations and other pure kernel stuff removed). So the actual time is even higher for the hash table - at least it requires locks, while the trie implementation works with RCU. -- Evgeniy Polyakov
Re: Extensible hashing and RCU
On Monday 19 February 2007 12:41, Evgeniy Polyakov wrote: > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. > > Theory again. Theory is nice, but I personally prefer oprofile :) I base my comments on real facts. We *want* 50 ns tcp lookups (2 cache line misses, one with reader intent, one for exclusive access intent) > > Existing table does not scale that good - I created (1<<20)/2 (to cover > only established part) entries table and filled it with 1 million of random > entries - search time is about half of a microsecond. I use exactly 2^20 slots, not 2^19 (see commit dbca9b2750e3b1ee6f56a616160ccfc12e8b161f , where I changed layout of ehash table so that two chains (established/timewait) are on the same cache line. every cache miss *counts*) http://www.mail-archive.com/netdev@vger.kernel.org/msg31096.html (Of course, you may have to change MAX_ORDER to 14 or else the hash table hits the MAX_ORDER limit) Search time under 100 ns, for real traffic (kind of random... but not quite). Most of this time is taken by the rwlock, so expect 50 ns once RCU is finally in... In your tests, please make sure a User process is actually doing real work on each CPU, i.e. evicting cpu caches every ms... The rule is: On a normal machine, cpu caches contain UserMode data, not kernel data (as a typical machine spends 15% of its cpu time in kernel land, and 85% in User land). You can assume kernel text is in cache, but even this assumption may be wrong.
Re: Extensible hashing and RCU
Andi Kleen writes:
> > If not, you lose.
>
> It all depends on if the higher levels on the trie are small
> enough to be kept in cache. Even with two cache misses it might
> still break even, but have better scalability.

Yes, the trick is to keep the root large to allow a very flat tree and few cache misses. Stefan Nilsson (author of the LC-trie) and I were able to improve the LC-trie quite a bit; we called this trie+hash -> trash. The paper discusses trie/hash... (you've seen it)
http://www.nada.kth.se/~snilsson/public/papers/trash/

> Another advantage would be to eliminate the need for large memory
> blocks, which cause problems too e.g. on NUMA. It certainly would
> save quite some memory if the tree levels are allocated on demand
> only. However breaking it up might also cost more TLB misses,
> but those could be eliminated by preallocating the tree in
> the same way as the hash today. Don't know if it's needed or not.
>
> I guess someone needs to code it up and try it.

I've implemented trie/trash as a replacement for the dst hash, doing full key lookup for ipv4 (unicache), to start with. It is still focusing on the nasty parts, packet forwarding, as we don't want to break this. So the benefits of full flow lookup are not accounted for. I.e. the full flow lookup could give the socket at no cost and do some conntrack support like Evgeniy did in his netchannel patches.

Below, some recent comparisons and profiles for the packet forwarding. Input: 2 * 65k concurrent flows, eth0->eth1, eth2->eth3, in forwarding, on separate CPUs. Opteron 2218 (2.6 GHz), net-2.6.21 git. Numbers are very approximate but should still be representative. Profiles are collected.
Performance comparison
--
Table below holds: dst-entries in use, lookup hits, slow path and total pps

Flowlen 40:
  250k   1020 +  21 = 1041 pps   Vanilla rt_hash=32k
  1M      950 +  29 =  979 pps   Vanilla rt_hash=131k
  260k    980 +  24 = 1004 pps   Unicache

Flowlen 4 (rdos):
  290k    560 + 162 =  722 kpps  Vanilla rt_hash=32k
  1M      400 + 165 =  565 kpps  Vanilla rt_hash=131k
  230k    570 + 170 =  740 kpps  Unicache

unicache flen=4 pkts
c02df84f  5257  7.72078   tkey_extract_bits
c023151a  5230  7.68112   e1000_clean_rx_irq
c02df908  3306  4.85541   tkey_equals
c014cf31  3296  4.84072   kfree
c02f8c3b  3067  4.5044    ip_route_input
c02fbdf0  2948  4.32963   ip_forward
c023024e  2809  4.12548   e1000_xmit_frame
c02e06f1  2792  4.10052   trie_lookup
c02fd764  2159  3.17085   ip_output
c032591c  1965  2.88593   fn_trie_lookup
c014cd82  1456  2.13838   kmem_cache_alloc
c02fa941  1337  1.96361   ip_rcv
c014ced0  1334  1.9592    kmem_cache_free
c02e1538  1289  1.89311   unicache_tcp_establish
c02e2d70  1218  1.78884   dev_queue_xmit
c02e31af  1074  1.57735   netif_receive_skb
c02f8484  1053  1.54651   ip_route_input_slow
c02db552   987  1.44957   __alloc_skb
c02e626f   913  1.34089   dst_alloc
c02edaad   828  1.21606   __qdisc_run
c0321ccf   810  1.18962   fib_get_table
c02e14c1   782  1.1485    match_pktgen
c02e6375   766  1.125     dst_destroy
c02e10e8   728  1.06919   unicache_hash_code
c0231242   647  0.950227  e1000_clean_tx_irq
c02f7d23   625  0.917916  ipv4_dst_destroy

unicache flen=40 pkts -
c023151a  6742  10.3704   e1000_clean_rx_irq
c02df908  4553  7.00332   tkey_equals
c02fbdf0  4455  6.85258   ip_forward
c02e06f1  4067  6.25577   trie_lookup
c02f8c3b  3951  6.07734   ip_route_input
c02df84f  3929  6.0435    tkey_extract_bits
c023024e  3538  5.44207   e1000_xmit_frame
c014cf31  3152  4.84834   kfree
c02fd764  2711  4.17      ip_output
c02e1538  1930  2.96868   unicache_tcp_establish
c02fa941  1696  2.60875   ip_rcv
c02e31af  1466  2.25497   netif_receive_skb
c02e2d70  1412  2.17191   dev_queue_xmit
c014cd82  1397  2.14883   kmem_cache_alloc
c02db552  1394  2.14422   __alloc_skb
c02edaad  1032  1.5874    __qdisc_run
c02ed6b8   957  1.47204   eth_header
c02e15dd   904  1.39051   unicache_garbage_collect_active
c02db94e   861  1.32437   kfree_skb
c0231242   794  1.22131   e1000_clean_tx_irq
c022fd58   778  1.1967    e1000_tx_map
c014ce73   756  1.16286   __kmalloc
c014ced0   740  1.13825   kmem_cache_free
c02e14c1   701  1.07826   match_pktgen
c023002c   621  0.955208  e1000_tx_queue
c02e78fa   519  0.798314  neigh_resolve_output

Vanilla w. flen=4 pkts rt_hash=32k --
c02f6852 15704  22.9102   ip_route_input
c023151a  5324  7.76705   e1000_clean_rx_irq
c02f84a1  4457  6.5022    ip_rcv
c02f9950  3065  4.47145   ip_forward
c023024e  2630  3.83684   e1000_xmit_frame
c0323380  2343  3.41814   fn_trie_lookup
c02fb2c4  2181  3.1818    ip_output
c02f4a3b  1839  2.68287   rt_intern_hash
c02f4480  1762  2.57054   rt_may_expire
c02f60
Re: Extensible hashing and RCU
On Sun, Feb 18, 2007 at 09:21:30PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> Evgeniy Polyakov a écrit :
> >On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet
> >([EMAIL PROTECTED]) wrote:
> >>>Why anyone do not want to use trie - for socket-like loads it has
> >>>exactly constant search/insert/delete time and scales as hell.
> >>>
> >>Because we want to be *very* fast. You cannot beat hash table.
> >>
> >>Say you have 1.000.000 tcp connections, with 50.000 incoming packets per
> >>second to *random* streams...
> >
> >What is really good in trie, that you may have upto 2^32 connections
> >without _any_ difference in lookup performance of random streams.
>
> So are you speaking of one memory cache miss per lookup ?
> If not, you lose.

With a trie, a big part of it _does_ live in cache, compared to a hash table where similar addresses end up in completely different hash entries.

> >>With a 2^20 hashtable, a lookup uses one cache line (the hash head
> >>pointer) plus one cache line to get the socket (you need it to access its
> >>refcounter)
> >>
> >>Several attempts were done in the past to add RCU to ehash table (last
> >>done by Benjamin LaHaise last March). I believe this was delayed a bit,
> >>because David would like to be able to resize the hash table...
> >
> >This is a theory.
>
> Not theory, but actual practice, on a real machine.
>
> # cat /proc/net/sockstat
> sockets: used 918944
> TCP: inuse 925413 orphan 7401 tw 4906 alloc 926292 mem 304759
> UDP: inuse 9
> RAW: inuse 0
> FRAG: inuse 9 memory 18360

Theory is a speculation about performance. A highly cache-optimized bubble sort is still much worse than a cache-unoptimized binary tree.

> >Practice includes cost for hashing, locking, and list traversal
> >(each pointer is in own cache line btw, which must be fetched) and plus
> >the same for time wait sockets (if we are unlucky).
> >
> >No need to talk about price of cache miss when there might be more
> >serious problems - for example length of the linked list to traverse each
> >time new packet is received.
> >
> >For example lookup time in trie with 1.6 millions random 3-dimensional
> >32bit (saddr/daddr/ports) entries is about 1 microsecond on amd athlon64
> >3500 cpu (test was ran in userspace emulator though).
>
> 1 microsecond ? Are you kidding ? We want no more than 50 ns.

Theory again. The existing table does not scale that well - I created a (1<<20)/2 (to cover only the established part) entries table and filled it with 1 million random entries - search time is about half a microsecond. Wanna see the code? I copied the Linux hash table magic into userspace and ran the same inet_hash() and inet_lookup() in a loop. Result above. The trie is still 2 times worse, but I've just found a bug in my implementation.

-- Evgeniy Polyakov
Re: Extensible hashing and RCU
"Michael K. Edwards" <[EMAIL PROTECTED]> writes: > A better data structure for RCU, even with a fixed key space, is > probably a splay tree. Much less vulnerable to cache eviction DDoS > than a hash, because the hot connections get rotated up into non-leaf > layers and get traversed enough to keep them in the LRU set. LRU tends to be hell for caches in MP systems, because it writes to the cache lines too and makes them exclusive and more expensive. -Andi
Re: Extensible hashing and RCU
Eric Dumazet <[EMAIL PROTECTED]> writes: > > So are you speaking of one memory cache miss per lookup ? Actually two: if the trie'ing allows RCUing you would save the spinlock cache line too. This would increase the break-even budget for the trie. > If not, you lose. It all depends on if the higher levels on the trie are small enough to be kept in cache. Even with two cache misses it might still break even, but have better scalability. Another advantage would be to eliminate the need for large memory blocks, which cause problems too e.g. on NUMA. It certainly would save quite some memory if the tree levels are allocated on demand only. However breaking it up might also cost more TLB misses, but those could be eliminated by preallocating the tree in the same way as the hash today. Don't know if it's needed or not. I guess someone needs to code it up and try it. -Andi
Re: Extensible hashing and RCU
On 2/18/07, Michael K. Edwards <[EMAIL PROTECTED]> wrote: ... Much less vulnerable to cache eviction DDoS than a hash, because the hot connections get rotated up into non-leaf layers and get traversed enough to keep them in the LRU set. Let me enlarge on this a bit. I used to work for a company that built a custom firewall/VPN ASIC with all sorts of special sauce in it, mostly focused on dealing with DDoS. Some really smart guys, some really good technology, I hope they grab the brass ring someday. On the scale they were dealing with, there's only one thing to do about DDoS: bend over and take it. Provision enough memory bandwidth to cold-cache every packet, every session lookup, and every packet-processing-progress structure. Massively parallelize, spinlock on on-chip SRAM, tune for the cold-cache case. If you can't afford to do that -- and if you haven't designed your own chip, with separate cache windows for each of these use cases, you can't, because they're all retrograde loads for an LRU cache -- then a hash is not the right answer. The interaction between resizing and RCU is just the canary in the coal mine. Cheers, - Michael
Re: Extensible hashing and RCU
A better data structure for RCU, even with a fixed key space, is probably a splay tree. Much less vulnerable to cache eviction DDoS than a hash, because the hot connections get rotated up into non-leaf layers and get traversed enough to keep them in the LRU set. But I am not a data structures guru; a stratified tree or van Emde-Boas priority queue might be a better choice. RCU splay tree has lots of other potential applications inside the kernel, so it would make a fun project for someone with time to burn. Cheers, - Michael
Re: Extensible hashing and RCU
Evgeniy Polyakov a écrit :
> On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet
> ([EMAIL PROTECTED]) wrote:
> > > Why anyone do not want to use trie - for socket-like loads it has
> > > exactly constant search/insert/delete time and scales as hell.
> >
> > Because we want to be *very* fast. You cannot beat hash table.
> >
> > Say you have 1.000.000 tcp connections, with 50.000 incoming packets per
> > second to *random* streams...
>
> What is really good in trie, that you may have upto 2^32 connections
> without _any_ difference in lookup performance of random streams.

So are you speaking of one memory cache miss per lookup ? If not, you lose.

> > With a 2^20 hashtable, a lookup uses one cache line (the hash head
> > pointer) plus one cache line to get the socket (you need it to access its
> > refcounter)
> >
> > Several attempts were done in the past to add RCU to ehash table (last
> > done by Benjamin LaHaise last March). I believe this was delayed a bit,
> > because David would like to be able to resize the hash table...
>
> This is a theory.

Not theory, but actual practice, on a real machine.

# cat /proc/net/sockstat
sockets: used 918944
TCP: inuse 925413 orphan 7401 tw 4906 alloc 926292 mem 304759
UDP: inuse 9
RAW: inuse 0
FRAG: inuse 9 memory 18360

> Practice includes cost for hashing, locking, and list traversal
> (each pointer is in own cache line btw, which must be fetched) and plus
> the same for time wait sockets (if we are unlucky).
>
> No need to talk about price of cache miss when there might be more
> serious problems - for example length of the linked list to traverse each
> time new packet is received.
>
> For example lookup time in trie with 1.6 millions random 3-dimensional
> 32bit (saddr/daddr/ports) entries is about 1 microsecond on amd athlon64
> 3500 cpu (test was ran in userspace emulator though).

1 microsecond ? Are you kidding ? We want no more than 50 ns. You can check on this dual cpu machine, tcp_v4_rcv() uses 2.29 % of cpu.
CPU: AMD64 processors, speed 1992.67 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 10

samples  %       symbol name
2009510  4.6863  memcpy_c
1668842  3.8918  tg3_start_xmit_dma_bug
1485844  3.4651  tg3_poll
1293558  3.0167  kmem_cache_free
1232862  2.8751  kfree
1131012  2.6376  free_block
1000671  2.3336  ip_route_input
982655   2.2916  tcp_v4_rcv
942.2284 __alloc_skb
863753   2.0143  tcp_ack
863222   2.0131  tcp_recvmsg
834680   1.9465  fget_light
801445   1.8690  lock_sock_nested
793699   1.8510  tcp_sendmsg
764689   1.7833  copy_user_generic_string
743515   1.7339  ip_queue_xmit
712314   1.6612  sock_wfree
650486   1.5170  tcp_rcv_established
Re: Extensible hashing and RCU
On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> >Why anyone do not want to use trie - for socket-like loads it has
> >exactly constant search/insert/delete time and scales as hell.
> >
>
> Because we want to be *very* fast. You cannot beat hash table.
>
> Say you have 1.000.000 tcp connections, with 50.000 incoming packets per
> second to *random* streams...

What is really good in a trie is that you may have up to 2^32 connections without _any_ difference in lookup performance of random streams.

> With a 2^20 hashtable, a lookup uses one cache line (the hash head pointer)
> plus one cache line to get the socket (you need it to access its refcounter)
>
> Several attempts were done in the past to add RCU to ehash table (last done
> by Benjamin LaHaise last March). I believe this was delayed a bit, because
> David would like to be able to resize the hash table...

This is a theory. Practice includes the cost of hashing, locking, and list traversal (each pointer is in its own cache line btw, which must be fetched), plus the same for time wait sockets (if we are unlucky).

No need to talk about the price of a cache miss when there might be more serious problems - for example the length of the linked list to traverse each time a new packet is received.

For example, lookup time in a trie with 1.6 million random 3-dimensional 32bit (saddr/daddr/ports) entries is about 1 microsecond on an amd athlon64 3500 cpu (the test was run in a userspace emulator though). Data obtained from netchannels research:
http://tservice.net.ru/~s0mbre/blog/2006/12/02#2006_12_02

I think I need to set up a similar emulator for a hash table of the above size to check hash/list goodness.

> I am not really interested in hash resizing, because an admin can size it
> at boot time. But RCU is definitly *wanted*

The trie in that regard is a true winner - its RCU'fication is trivial.

> Note : It would be good to also use RCU for UDP, because the current rwlock
> protecting udp_hash[] is a scalability problem.
-- Evgeniy Polyakov
Re: Extensible hashing and RCU
Evgeniy Polyakov a écrit :
> On Mon, Feb 05, 2007 at 10:02:53AM -0800, [EMAIL PROTECTED]
> ([EMAIL PROTECTED]) wrote:
> > On Sat, 4 Feb 2007 [EMAIL PROTECTED] wrote:
> >
> > > I noticed in an LCA talk mention that apparently extensible hashing
> > > with RCU access is an unsolved problem. Here's an idea for solving it.
> >
> > Yes, I have been playing around with the same idea for doing dynamic
> > resizing of the TCP hashtable.
> >
> > Did a prototype "toy" implementation, and I have a "half-finished" patch
> > which resizes the TCP hashtable at runtime. Hmmm, your mail may be the
> > impetus to get me to finally finish this thing
>
> Why anyone do not want to use trie - for socket-like loads it has
> exactly constant search/insert/delete time and scales as hell.

Because we want to be *very* fast. You cannot beat hash table.

Say you have 1.000.000 tcp connections, with 50.000 incoming packets per second to *random* streams...

With a 2^20 hashtable, a lookup uses one cache line (the hash head pointer) plus one cache line to get the socket (you need it to access its refcounter)

Several attempts were done in the past to add RCU to ehash table (last done by Benjamin LaHaise last March). I believe this was delayed a bit, because David would like to be able to resize the hash table...

I am not really interested in hash resizing, because an admin can size it at boot time. But RCU is definitely *wanted*

Note : It would be good to also use RCU for UDP, because the current rwlock protecting udp_hash[] is a scalability problem.
Re: Extensible hashing and RCU
On Mon, Feb 05, 2007 at 10:02:53AM -0800, [EMAIL PROTECTED] ([EMAIL PROTECTED]) wrote:
> On Sat, 4 Feb 2007 [EMAIL PROTECTED] wrote:
>
> >I noticed in an LCA talk mention that apparently extensible hashing
> >with RCU access is an unsolved problem. Here's an idea for solving it.
> >
>
> Yes, I have been playing around with the same idea for
> doing dynamic resizing of the TCP hashtable.
>
> Did a prototype "toy" implementation, and I have a
> "half-finished" patch which resizes the TCP hashtable
> at runtime. Hmmm, your mail may be the impetus to get
> me to finally finish this thing

Why does no one want to use a trie? For socket-like loads it has essentially constant search/insert/delete time and scales like hell.

> --
> Arthur

-- Evgeniy Polyakov
Re: [RFC/TOY]Extensible hashing and RCU
> For purposes of discussion, I've attached a "toy" implementation
> for doing dynamic resizing of a hashtable. It is useless, except
> as a proof of concept.
>
> I think this is very similar to what you are describing, no?

Er... not quite; that has a lot more overhead than what I was thinking about. I have used the trick of distinguishable sentinel values in a doubly-linked list to maintain read cursors while it's being updated, but I don't think that's necessary here. (You can also encode the nh_type flag in the lsbit of the pointer if you're being sneaky. That will attract curses from the memleak detector, though.)

In particular, I was imagining a singly linked list. To delete an item, use the hash to find the head pointer and walk it to find the pointer to be fixed up. Since the chains are short anyway, this is entirely reasonable.

Less fundamental comments include:

1) Is the seqlock in get_nh() and nh_replace() really required? Is there any architecture that doesn't have atomic pointer stores? If you wanted to store the table size in a fixed location as well, I could see the need...

2) I think the whole __nh_sort_chain business will royally confuse anyone walking the chain while it happens. This is exactly what I was working to avoid. The partial sorting in __nh_insert isn't good enough. Instead, try:

/* Return true if bitrev(x) > bitrev(y) */
static bool bitrev_gt(unsigned long x, unsigned long y)
{
	unsigned long mask = x ^ y;	/* Find the bits that differ */
	mask ^= mask-1;	/* Find lsbit of difference (and all lower bits) */
	return (x & mask) > (y & mask);
}

static void __nh_insert(struct nh_entry *entry, struct nh_head *head)
{
	struct list_head *p, *n;
	unsigned long const hashval = nh_hashval(entry->data);

	/*
	 * Insert the new entry just before the first element of the list
	 * that its hash value is not greater than (bit-reversed).
	 */
	p = &head->list;
	list_for_each_rcu(n, &head->list) {
		struct nh_entry *t = container_of(n, struct nh_entry, nh_list);

		if (t->nh_type == NH_ENTRY &&
		    !bitrev_gt(hashval, nh_hashval(t->data)))
			break;
		p = n;
	}
	__list_add_rcu(&entry->nh_list, p, n);
}

static int nh_add(unsigned long data)
{
	struct nh_entry *entry = kmalloc(sizeof *entry, GFP_KERNEL);
	struct nh *nh;

	if (!entry)
		return -ENOMEM;
	entry->nh_type = NH_ENTRY;
	entry->data = data;
	rcu_read_lock();
	nh = get_nh();	/* or nh = __nh */
	if (nh) {
		struct nh_head *h =
			&nh->hash[nh_bucket(nh_hashval(data), nh->nentries)];

		spin_lock(&h->lock);
		__nh_insert(entry, h);
		spin_unlock(&h->lock);
	}
	rcu_read_unlock();
	return 0;
}

Then there's no need for __nh_sort_chain at all. Alternatively, if the upper bits of nh_hashval are as good as the lower bits, just index the hash table on them.

3) Code inside a mutex like nh_resize() can use plain list_for_each(); the _rcu variant is only required if there can be simultaneous mutation.

That's a nice module framework. I'll see if I can write some code of the sort I was thinking about.

FWIW, I figured out a way around the need to delay deletion for two RCU intervals. Suppose that an expansion is pending, and we have just stretched the table from 16 entries to 32, and the following hash values are stored. (Note the bit-reversed order.)

old[3] --\
new[3] ---+-> 0x03 -> 0x43 -> 0x23 -> 0x63 -> 0x13 -> 0x53 -> 0x33 -> 0x73
                                               ^
new[3+16]--------------------------------------/

After an RCU period, you can throw away the old[] array and NUL-terminate the new[i] list after 0x63. But until then, you need to leave the list alone to accommodate people who are looking for 0x53 via the old head. The tricky case comes when you delete 0x13. If you only delete it from the new[3+16] list, you can't discard it until the RCU quiescent period after the one which disconnects the 0x63->0x13 link.
The solution is actually very simple: notice when you're
- deleting the first entry in a list,
- while an expansion is pending,
- and the list is in the second half of the expanded table,
then unlink the entry from BOTH the new head and the old list. It's a bit more work, and requires some lock-ordering care, but it lets you queue the node for RCU cleanup the normal way.
Re: [RFC/TOY]Extensible hashing and RCU
On Sat, 4 Feb 2007 [EMAIL PROTECTED] wrote:
> I noticed in an LCA talk mention that apparently extensible hashing
> with RCU access is an unsolved problem. Here's an idea for solving it.
> I'm assuming the table is a power of 2 in size with open chaining for
> collisions. When the chains get too long, the table is doubled. When
> the chains get too short, the table size is halved.

For purposes of discussion, I've attached a "toy" implementation for doing dynamic resizing of a hashtable. It is useless, except as a proof of concept. I think this is very similar to what you are describing, no?

-- Arthur

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#define _NETHASH_BUFSZ_ 16
#define MODULE_NAME "nethash"
#define _DEBUG_IT_

#ifdef _DEBUG_IT_
#define nprintk(x...) printk(KERN_ALERT x);
#else /* !_DEBUG_IT_ */
#define nprintk(x)
#endif /* _DEBUG_IT_ */

static struct proc_dir_entry *nh_proc_dir;

enum nh_type {
	NH_ENTRY,
	NH_HEAD
};

struct nh_item {
	/* the list head must be first followed by nh_type */
	struct list_head	nh_list;
	enum nh_type		nh_type;
};

struct nh_entry {
	/* the list head must be first followed by nh_type */
	struct list_head	nh_list;
	enum nh_type		nh_type;
	unsigned long		data;
	struct rcu_head		rcu_head;
};

struct nh_head {
	/* the list head must be first followed by nh_type */
	struct list_head	list;
	enum nh_type		nh_type;
	spinlock_t		lock;
};

struct nh {
	unsigned long	nentries;
	struct nh_head	*hash;
	struct rcu_head	rcu_head;
};

static struct nh *__nh;
static DEFINE_SEQLOCK(nethash_seq);
static DEFINE_MUTEX(nethash_resize_mutex);

extern void *system_hash_alloc(unsigned long sz);
extern void system_hash_free(void *v, unsigned long sz);

/* XXX nh_dump() is for debug only
 * it must be called under rcu_read_lock() */
static int nh_dump(struct nh *nh, const char *debug_str);
static struct nh *get_nh(void);

static unsigned long nh_hashval(unsigned long data)
{
	return data;
}

static void nh_entry_free(struct rcu_head *head)
{
	struct nh_entry *entry = container_of(head, struct nh_entry, rcu_head);

	nprintk("%s: freeing entry with data %lu\n", __FUNCTION__, entry->data);
	kfree(entry);
}

static void nh_free(struct rcu_head *head)
{
	struct nh *nh = container_of(head, struct nh, rcu_head);
	unsigned long nentries = nh->nentries;
	unsigned long size = sizeof(struct nh_head) * nentries;

	nprintk("%s: freeing nh of size %lu\n", __FUNCTION__, size);
	system_hash_free((void *)nh->hash, size);
	nprintk("%s: freeing nh\n", __FUNCTION__);
	kfree(nh);
}

/* called with head's spin_lock held */
static int __nh_insert(struct nh_entry *entry, struct nh_head *head,
		       unsigned long nentries)
{
	struct list_head *prev;	/* insert entry after prev */
	struct list_head *list;

	nprintk("%s: linking entry with data %lu\n", __FUNCTION__, entry->data);

	/* find the appropriate spot to place entry */
	prev = &head->list;
	if (nh_hashval(entry->data) & nentries) {
		list_for_each_rcu(list, &head->list) {
			struct nh_entry *tmp;
			struct nh_item *item = (struct nh_item *)list;

			if (item->nh_type != NH_ENTRY)
				continue;
			tmp = list_entry(list, struct nh_entry, nh_list);
			prev = &tmp->nh_list;
			if (nh_hashval(tmp->data) & nentries) {
				/* put entry after nh */
				break;
			}
		}
	}
	list_add_rcu(&entry->nh_list, prev);
	return 0;
}

/* called with head's lock held */
static void __nh_sort_chain(struct nh_head *head, unsigned long nentries)
{
	struct list_head tmp, *list = &head->list;
	struct nh_entry *entry;

	INIT_LIST_HEAD(&tmp);
	list_splice_init(list, &tmp);
	while (!list_empty(&tmp)) {
		struct list_head *first = tmp.next;

		list_del(first);
		entry = (struct nh_entry *)first;
		__nh_insert(entry, head, nentries);
	}
}

static void nh_fixup_grow(struct rcu_head *head)
{
	struct nh *nh;
	struct nh *oh = container_of(head, struct nh, rcu_head);
	unsigned long i, oentries = oh->nentries;

	rcu_read_lock();
	nh = get_nh();
	/* this is an rcu_callback - only the "new" nh is
	 * visible elsewhere now, so make sure to access
	 * the hashtable using the proper (new) locks,...
	 */
	for (i = 0; i <