Re: [Q] connection tracking scaling

Patrick Schaaf Tue, 19 Mar 2002 09:04:49 -0800

Hello Jean-Michel,

thanks for your input. I appreciate it.

On Tue, Mar 19, 2002 at 03:56:32PM +0100, Jean-Michel Hemstedt wrote:
> 
> I'm not a conntrack specialist, neither a kernel hacker, but I've
> some experience with ip hash caches in access servers (BRAS)
> that may be useful(?):
> 
> some additional stats:
> - HDA: cache hit depth average: the number of iterations in the
>   bucket's list to get the matching collision entry.
> - MDA: cache miss depth average: the number of iterations required
>   without matching a cache entry (new connection).

These two measures make an excellent dynamic measure in addition to
what I described. However, they require modification of the standard
LIST_FIND() macro used when traversing the chain, which is a bit
awkward - and they require keeping properly locked / cpu-local
counters in the critical path. I wouldn't want this on a highly
loaded machine, at least not for a longer time.

The measures are important because of their dynamical nature. No static
looking at the chains can capture effects related to the _order_ of the
lists. These measures can. One possible optimization for the future,
would be to move the bucket pointer to the "last found" element, giving
an easy 1-off-cache per bucket. The effect of that cannot be measured
with a static examination; it should be clearly visible under your
dynamic measure.

I'll see that I make this "seperately optional".

> some questions:
> - have you an efficicent 'freelist' implementation? What I've seen about
>   kmem_cache_free and kmem_cache_alloc doesn't look like a simple
>   pointer dereference... Am I wrong?

The kmem_cache() stuff is supposed to be the best possible implementation
of such a freelist on an SMP system. I am not aware that it has inefficiencies.

> - wouldn't it be worth to have a "cache promotion" mechanism?

What do you mean with that?

> regarding [hashsize=conntrack_max/2], I vote for!
> An alternate solution would be to have a dynamic hash resize
> each time the average number of collisions exceeds a treshold
> (and no down resize, except maybe asynchroneously).

This is REALLY AWFUL in an SMP setting, because you need to keep the
whole system write locked while you rebuild the hashes. You really
don't want to do that.

> One last word: the hash function you're using is the best compromise
> between unpredictable ipv4 traffic, cache symetry, uniformity and
> computation time.

I'm pretty ignorant wrt theory and attributes of hash functions,
so I'm damned to check this by experiment. But it's reassuring to
hear you say the function is basically OK.

> I wouldn't change it too much, but there are two
> propositions possible:
> - if you keep the modulo method (%), use a prime number far from a
>   power of 2 for 'ip_conntrack_htable_size'.

Hmm. Can you give a short idea of why "far from a power of 2" is important?
In a context completely unrelated to iptables, I had good results using
the dynamically growing array approach you described (this was userlevel
batch jobs, don't care about latency, only throughput). I precalculated
the largest prime just below any power of 2, and used that as the array
size. Works out very well.

I agree that, since we already use a full division when calculating
the hash function, we may as well use a power-of-two hashsize. This will
waste some room in the last OS page of the array, but that's irrelevant
given the overall size of the array.

best regards
  Patrick

Re: [Q] connection tracking scaling

Reply via email to