On Sun, Feb 18, 2007 at 09:21:30PM +0100, Eric Dumazet ([EMAIL PROTECTED]) wrote: > Evgeniy Polyakov a e'crit : > >On Sun, Feb 18, 2007 at 07:46:22PM +0100, Eric Dumazet > >([EMAIL PROTECTED]) wrote: > >>>Why anyone do not want to use trie - for socket-like loads it has > >>>exactly constant search/insert/delete time and scales as hell. > >>> > >>Because we want to be *very* fast. You cannot beat hash table. > >> > >>Say you have 1.000.000 tcp connections, with 50.000 incoming packets per > >>second to *random* streams... > > > >What is really good in trie, that you may have upto 2^32 connections > >without _any_ difference in lookup performance of random streams. > > So are you speaking of one memory cache miss per lookup ? > If not, you loose.
With trie big part of it _does_ live in cache compared to hash table where similar addresses ends up in a completely different hash entries. > >>With a 2^20 hashtable, a lookup uses one cache line (the hash head > >>pointer) plus one cache line to get the socket (you need it to access its > >>refcounter) > >> > >>Several attempts were done in the past to add RCU to ehash table (last > >>done by Benjamin LaHaise last March). I believe this was delayed a bit, > >>because David would like to be able to resize the hash table... > > > >This is a theory. > > Not theory, but actual practice, on a real machine. > > # cat /proc/net/sockstat > sockets: used 918944 > TCP: inuse 925413 orphan 7401 tw 4906 alloc 926292 mem 304759 > UDP: inuse 9 > RAW: inuse 0 > FRAG: inuse 9 memory 18360 Theory is a speculation about performance. Highly cache usage optimized bubble sorting still much worse than cache usage non-optimized binary tree. > >Practice includes cost for hashing, locking, and list traversal > >(each pointer is in own cache line btw, which must be fetched) and plus > >the same for time wait sockets (if we are unlucky). > > > >No need to talk about price of cache miss when there might be more > >serious problems - for example length of the linked list to traverse each > >time new packet is received. > > > >For example lookup time in trie with 1.6 millions random 3-dimensional > >32bit (saddr/daddr/ports) entries is about 1 microsecond on amd athlon64 > >3500 cpu (test was ran in userspace emulator though). > > 1 microsecond ? Are you kidding ? We want no more than 50 ns. Theory again. Existing table does not scale that good - I created (1<<20)/2 (to cover only established part) entries table and filled it with 1 million of random entries -search time is about half of microsecod. Wanna see a code? I copied Linux hash table magic into userspace and run the same inet_hash() and inet_lookup() in a loop. Result above. Trie is still 2 times worse, but I've just found a bug in my implementation. -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html