Hello Jean-Michel, thanks for your input. I appreciate it.
On Tue, Mar 19, 2002 at 03:56:32PM +0100, Jean-Michel Hemstedt wrote: > > I'm not a conntrack specialist, neither a kernel hacker, but I've > some experience with ip hash caches in access servers (BRAS) > that may be useful(?): > > some additional stats: > - HDA: cache hit depth average: the number of iterations in the > bucket's list to get the matching collision entry. > - MDA: cache miss depth average: the number of iterations required > without matching a cache entry (new connection). These two measures make an excellent dynamic measure in addition to what I described. However, they require modification of the standard LIST_FIND() macro used when traversing the chain, which is a bit awkward - and they require keeping properly locked / cpu-local counters in the critical path. I wouldn't want this on a highly loaded machine, at least not for a longer time. The measures are important because of their dynamical nature. No static looking at the chains can capture effects related to the _order_ of the lists. These measures can. One possible optimization for the future, would be to move the bucket pointer to the "last found" element, giving an easy 1-off-cache per bucket. The effect of that cannot be measured with a static examination; it should be clearly visible under your dynamic measure. I'll see that I make this "seperately optional". > some questions: > - have you an efficicent 'freelist' implementation? What I've seen about > kmem_cache_free and kmem_cache_alloc doesn't look like a simple > pointer dereference... Am I wrong? The kmem_cache() stuff is supposed to be the best possible implementation of such a freelist on an SMP system. I am not aware that it has inefficiencies. > - wouldn't it be worth to have a "cache promotion" mechanism? What do you mean with that? > regarding [hashsize=conntrack_max/2], I vote for! > An alternate solution would be to have a dynamic hash resize > each time the average number of collisions exceeds a treshold > (and no down resize, except maybe asynchroneously). This is REALLY AWFUL in an SMP setting, because you need to keep the whole system write locked while you rebuild the hashes. You really don't want to do that. > One last word: the hash function you're using is the best compromise > between unpredictable ipv4 traffic, cache symetry, uniformity and > computation time. I'm pretty ignorant wrt theory and attributes of hash functions, so I'm damned to check this by experiment. But it's reassuring to hear you say the function is basically OK. > I wouldn't change it too much, but there are two > propositions possible: > - if you keep the modulo method (%), use a prime number far from a > power of 2 for 'ip_conntrack_htable_size'. Hmm. Can you give a short idea of why "far from a power of 2" is important? In a context completely unrelated to iptables, I had good results using the dynamically growing array approach you described (this was userlevel batch jobs, don't care about latency, only throughput). I precalculated the largest prime just below any power of 2, and used that as the array size. Works out very well. I agree that, since we already use a full division when calculating the hash function, we may as well use a power-of-two hashsize. This will waste some room in the last OS page of the array, but that's irrelevant given the overall size of the array. best regards Patrick