Re: [Q] connection tracking scaling

2002-03-19 Thread Harald Welte

On Mon, Mar 18, 2002 at 10:58:20PM +0100, Patrick Schaaf wrote:
 
> Hashing with chaining is fine, but for high performance, you want the
> chains only as a backdrop for the occasional hash collision. The "planned"
> oversubscription of the ip_conntrack hash table (1:8 hashsize/conntrack_max)
> does not perform well when conntrack_max is near. This will become more
> apparent as more people try to use conntracking at the line rate their
> hardware permits.

I totally agree.

> On machines where I expect many connections, I'd use a hashsize
> near the number of expected connections, and make conntrack_max
> only about two times that value.

But this obviously only helps if the hash function is distributing
the conntrack entries equally among the hash buckets.  I wouldn't be 
so sure if this really does happen when the hash becomes wider than a
certain point.

> How does the core team feel about this issue? I hereby suggest changing
> the default calculation to have hashsize == conntrack_max/2. Were there
> good reasons to do different?

This would be fine with me, but rather than just blindly doing that,
I'd be more interested in how good our hash function is with real world
traffic.  And real-world traffic usually means narrow source ip ranges
(because most people firewall a couple of Class-C's) and narrow source
port ranges (let's assume lots of users aren't causing too many connections
and thus the source port range stays close to the startup default port (32k?))

The destination ports are most definitely also not very distributed, since
most people will do the same services (http, ftp, smtp, or whatever is used
from within this organization).

> best regards
>   Patrick

-- 
Live long and prosper
- Harald Welte / [EMAIL PROTECTED]   http://www.gnumonks.org/

GCS/E/IT d- s-: a-- C+++ UL$ P+++ L$ E--- W- N++ o? K- w--- O- M+ 
V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*)




Re: [Q] connection tracking scaling

2002-03-19 Thread Patrick Schaaf

Hi Harald,

> > On machines where I expect many connections, I'd use a hashsize
> > near the number of expected connections, and make conntrack_max
> > only about two times that value.
> 
> But this obviously only helps if the hash function is distributing
> the conntrack entries equally among the hash buckets.  I wouldn't be 
> so sure if this really does happen when the hash becomes wider than a
> certain point.

In theory, a good hash function would distribute well independent of
the size of the bucket array.

In practice, theory is theory...

I agree that the hash function needs scrutiny. Do you (or somebody else
here) have a good collection of real world /proc/net/ip_conntrack excerpts,
maybe coming from the development of ctnetlink? I'll cook up a "hash
occupation simulator" for user level, where you can pipe in a conntrack
table, and get reports about the distribution of chain sizes.

As I don't run realworld conntracking firewalls with lots of connections
(I'm using the stuff almost exclusively on servers), I need the help of
you all to get good input test data, here.

> > How does the core team feel about this issue? I hereby suggest changing
> > the default calculation to have hashsize == conntrack_max/2. Were there
> > good reasons to do different?
> 
> This would be fine with me, but rather than just blindly doing that,
> I'd be more interested in how good our hash function is with real world
> traffic.

Jep.

> And real-world traffic usually means narrow source ip ranges
> (because most people firewall a couple of Class-C's) and narrow source
> port ranges (let's assume lots of users aren't causing too many connections
> and thus the source port range stays close to the startup default port (32k?))
> The destination ports are most definitely also not very distributed, since
> most people will do the same services (http, ftp, smtp, or whatever is used
> from within this organization).

These are all important constraints. However, the problem is not insoluble.
A good cryptographic hash (not that I'd want to have SHA or MD5 for
this purpose :-) wouldn't care about such issues. The art is to find a fast
hash function that still is not very sensitive to the inputs.

best regards
  Patrick




Re: 2.4.18 patch-o-matic crashing with H323

2002-03-19 Thread Marc Haber

On Tue, 19 Mar 2002 08:56:03 +0100 (CET), Jozsef Kadlecsik
<[EMAIL PROTECTED]> wrote:
>Do you have an SMP machine?

No, it is an old faithful single processor P133.

>Can the crash be reproduced at will?

No. Actually, I have been using the test bed for my phone connectivity
for almost a week before the problem showed for the first time.

>Debugging info could really help us: please switch on debugging in
>ip_conntrack_core.c, ip_nat_core.c and ip_conntrack_h323.c, ip_nat_h323.c,
>recompile/reinstall the modules.

How do I switch on debugging? Is it a matter of changing
#if 0
#define DEBUGP printk
#else
#define DEBUGP(format, args...)
#endif

to #if 1, or do I need to make additional changes?

>Thank you the report,

You're welcome.

Greetings
Marc

Please Cc: me or netfilter@lists, because I don't subscribe to
netfilter-devel due to lack of cluons.

-- 
-- !! No courtesy copies, please !! -
Marc Haber  |   " Questions are the | Mailadresse im Header
Karlsruhe, Germany  | Beginning of Wisdom " | Fon: *49 721 966 32 15
Nordisch by Nature  | Lt. Worf, TNG "Rightful Heir" | Fax: *49 721 966 31 29




Re: New DSCP target in CVS

2002-03-19 Thread Takuya Satoh

> > Hi,
> > Does the old FTOS target zero the ECN bits?
>
> Quick Answer = YES
>
> Long Answer -> FTOS takes whatever HEX code you specify and overwrites
> _all_ 8 bits of the TOS field. So if you only specify say 0xf0 then the
> ECN are overwritten with 0's. make sense?

Perfectly clear, thanks.  So the FTOS target (but not the new DSCP) can be
also used to selectively remove the ECN-enabled bit from syn packets going
to some "bad" hosts throwing away any ECN-enabled connection (until the new
ECN target is finished ...).
Taka






Re: New DSCP target in CVS

2002-03-19 Thread Maciej Soltysiak

> Perfectly clear, thanks.  So the FTOS target (but not the new DSCP) can be
> also used to selectively remove the ECN-enabled bit from syn packets going
> to some "bad" hosts throwing away any ECN-enabled connection (until the new
> ECN target is finished ...).
Hmm, but you will overwrite TOS bits if set originally :(

> Taka
Maciek





Re: [Q] connection tracking scaling

2002-03-19 Thread Harald Welte

On Tue, Mar 19, 2002 at 09:49:20AM +0100, Patrick Schaaf wrote:
> Hi Harald,
> 
> > > On machines where I expect many connections, I'd use a hashsize
> > > near the number of expected connections, and make conntrack_max
> > > only about two times that value.
> > 
> > But this obviously only helps if the hash function is distributing
> > the conntrack entries equally among the hash buckets.  I wouldn't be 
> > so sure if this really does happen when the hash becomes wider than a
> > certain point.
> 
> In theory, a good hash function would distribute well independent of
> the size of the bucket array.
> 
> In practice, theory is theory...
> 
> I agree that the hash function needs scrutiny. Do you (or somebody else
> here) have a good collection of real world /proc/net/ip_conntrack excerpts,
> maybe coming from the development of ctnetlink? I'll cook up a "hash
> occupation simulator" for user level, where you can pipe in a conntrack
> table, and get reports about the distribution of chain sizes.

The problem is that reading out /proc/net/ip_conntrack is slow and cpu-burning
as hell and I don't want to do this on firewalls with large numbers of 
conntrack entries.

I'd rather like to have this information to be gathered at runtime within
the kernel, where one could read out the current hash occupation via /proc
or some ioctl.

> As I don't run realworld conntracking firewalls with lots of connections
> (I'm using the stuff almost exclusively on servers), I need the help of
> you all to get good input test data, here.

I guess if there is a patch in p-o-m, some people would run a cronjob which
extracts this occupation info every minute or so and send a gzipped summary to
us.  At least I could do this on a bunch of small to medium-sized firewalls,
and there certainly are volunteers on the list as well :)

> These are all important constraints. However, the problem is not insoluble.
> Given a good cryptographic hash (not that I'd want to have SHA or MD5 for
> this purpose :-) won't care about such issues. The art is to find a fast
> hash function that still is not very sensitive to the inputs.

Exactly.

> best regards
>   Patrick

-- 
Live long and prosper
- Harald Welte / [EMAIL PROTECTED]   http://www.gnumonks.org/





Re: 2.4.18 patch-o-matic crashing with H323

2002-03-19 Thread Jozsef Kadlecsik

On Tue, 19 Mar 2002, Marc Haber wrote:

> On Tue, 19 Mar 2002 08:56:03 +0100 (CET), Jozsef Kadlecsik
> <[EMAIL PROTECTED]> wrote:
> >Do you have an SMP machine?
>
> No, it is an old faithful single processor P133.

Ack!

> >Can the crash be reproduced at will?
>
> No. Actually, I have been using the test bed for my phone connectivity
> for almost a week before the problem showed for the first time.

Even if you try to reproduce the situation, i.e. you hang up a phone call
and at the same time someone calls you?

> >Debugging info could really help us: please switch on debugging in
> >ip_conntrack_core.c, ip_nat_core.c and ip_conntrack_h323.c, ip_nat_h323.c,
> >recompile/reinstall the modules.
>
> How do I switch on debugging? Is it a matter of changing
> #if 0
> #define DEBUGP printk
> #else
> #define DEBUGP(format, args...)
> #endif
>
> to #if 1, or do I need to make additional changes?

Yes, that's all that is required to switch on debugging.
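
For illustration only, here is a minimal user-space sketch of the switch in its enabled state. The printk() stand-in, the printk_calls counter and track_packet() are assumptions made up so that the fragment builds outside the kernel tree; in the kernel sources only the "#if 0" line changes.

```c
#include <stdarg.h>
#include <stdio.h>

/* Counts emitted messages; exists only for this demonstration. */
int printk_calls;

/* User-space stand-in for the kernel's printk() (an assumption so that
 * the sketch compiles outside the kernel tree). */
int printk(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);
    va_end(ap);
    printk_calls++;
    return n;
}

#if 1   /* was "#if 0": debugging is now switched on */
#define DEBUGP printk
#else
#define DEBUGP(format, args...)
#endif

/* Hypothetical call site: with the switch on, this emits one message;
 * with "#if 0" the DEBUGP line expands to nothing. */
void track_packet(unsigned int sport, unsigned int dport)
{
    DEBUGP("conntrack: packet %u -> %u\n", sport, dport);
}
```

The macro has to be flipped in each of the four files named above, since each file carries its own copy of the switch.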

Regards,
Jozsef
-
E-mail  : [EMAIL PROTECTED], [EMAIL PROTECTED]
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
  H-1525 Budapest 114, POB. 49, Hungary





Re: [Q] connection tracking scaling

2002-03-19 Thread Patrick Schaaf

> I'd rather like to have this information to be gathered at runtime within
> the kernel, where one could read out the current hash occupation via /proc
> or some ioctl.

OK, that's what I wanted to hear :-)

Actually, the interesting statistics for a hash are not that large, and all
aggregate:

- bucket occupation: number of used buckets, vs. number of all buckets
- average chain length over all buckets
- average chain length over the used buckets
- counting of a classification of the chain lengths:
  - number of 0-entry buckets
  - number of 1-entry buckets
  - number of 2-entry buckets
  - number of 4-entry buckets
  - number of 8-entry buckets
  - number of 16-entry buckets
  - number of more-than-16-entry buckets

That's 10 values, and will at most double when I think more about it.
I propose to gather these stats on the fly, and simply printk() them
at a chosen interval:

echo 300 >/proc/net/ip_conntrack_showstat

would generate one printk() every 300 seconds. Echoing 0 would disable
the statistics gathering altogether.
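
A rough user-level sketch of how the aggregate values listed above could be computed from an array of chain lengths. All names are hypothetical, and the class boundaries (reading "4-entry" as 3..4, "8-entry" as 5..8, "16-entry" as 9..16) are an assumption about what the classification is meant to be; the eventual kernel patch may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Aggregate hash statistics; struct and field names are made up here. */
struct hash_stats {
    size_t buckets;     /* number of all buckets */
    size_t used;        /* number of non-empty buckets */
    size_t entries;     /* total entries over all chains */
    size_t classes[7];  /* chains of 0, 1, 2, <=4, <=8, <=16, >16 entries */
};

/* Map a chain length to its class index (boundaries are an assumption). */
static int chain_class(size_t len)
{
    if (len == 0)  return 0;
    if (len == 1)  return 1;
    if (len == 2)  return 2;
    if (len <= 4)  return 3;
    if (len <= 8)  return 4;
    if (len <= 16) return 5;
    return 6;
}

/* Walk the per-bucket chain lengths once and fill in the statistics. */
void collect_stats(const size_t *chain_len, size_t nbuckets,
                   struct hash_stats *st)
{
    assert(chain_len != NULL && st != NULL);
    memset(st, 0, sizeof(*st));
    st->buckets = nbuckets;
    for (size_t i = 0; i < nbuckets; i++) {
        st->entries += chain_len[i];
        if (chain_len[i] > 0)
            st->used++;
        st->classes[chain_class(chain_len[i])]++;
    }
}

/* Average chain length over all buckets. */
double avg_all(const struct hash_stats *st)
{
    return st->buckets ? (double)st->entries / (double)st->buckets : 0.0;
}

/* Average chain length over the used buckets only. */
double avg_used(const struct hash_stats *st)
{
    return st->used ? (double)st->entries / (double)st->used : 0.0;
}
```

For chain lengths {0, 1, 1, 2, 3, 9, 20, 0} this gives 6 used buckets out of 8, an average of 4.5 over all buckets and 6.0 over the used ones.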

I think I can hack this up, today. Having the flu must be good for something...

later
  Patrick




Re: TPROXY

2002-03-19 Thread Jean-Michel Hemstedt

----- Original Message -----
From: "Balazs Scheidler" <[EMAIL PROTECTED]>
To: "Jean-Michel Hemstedt" <[EMAIL PROTECTED]>
Sent: Tuesday, 19 March, 2002 08:50
Subject: Re: TPROXY


> On Wed, Mar 13, 2002 at 01:19:30PM +0100, Jean-Michel Hemstedt wrote:
> > hello,
> >
> > I'm quite new to netfilter, and I would like to use/write an extension
> > capable of rewriting HTTP/HTTPS requests from non-proxy aware clients to
> > remote non-transparent aware proxy (the netfilter box being in the middle
> > and acting as a default gw for both sides).
> >
> > This implies:
> > - for HTTP: most HTTP requests methods (GET,POST,...) need to be rewritten
> >   with the full URL (taken from the non redirected ip.dst for HTTP/0.9 or
> >   from the 'Host' field for HTTP/1.x)
> > - for HTTPS: *insert* an HTTP CONNECT transaction in the TCP stream (just
> >   after the TCP establishment), which means that the ip packets can't
> >   simply be redirected, unless playing with (cracking) the tcp.seq_num in
> >   netfilter.
> >
> > The first case is not a problem (kind of REDIRECT target)
> > For the second case (HTTPS), I was thinking of using the ip_nonlocal_bind
> > option, but I read in the kernel archives that the connect() was "broken"
> > for non-local bind in 2.4.x. I would also avoid user space QUEUEing since
> > I noticed that the throughput was simply divided by 2 (just for normal
> > forwarding!).
> >
> > I think that your TPROXY target is well suited for the HTTPS case
> > (terminating the tcp sessions of the client on the netfilter box and
> > originating tcp session to the proxy from the netfilter box as if they
> > were originating from the client, using a kind of *ip_nonlocal_bind*
> > mechanism). right?
>
> TPROXY is not yet ready, it is lacking several important features. I posted
> it on the -devel list to receive feedback.
>
> Both of your problems can be solved by REDIRECT, you only need two different
> programs (or a single program performing both operations). Just listen on a
> random port (say 50080), and redirect all traffic to this port:
>
> iptables -t nat -A PREROUTING -p tcp -d 0/0 --dport 80 -j REDIRECT --to-port 50080
> iptables -t nat -A PREROUTING -p tcp -d 0/0 --dport 443 -j REDIRECT --to-port 50443
>

REDIRECT could work in case of a collocated proxy, and only if we have control
of the proxy, i.e. Apache
(btw: I'm currently trying to find a clean and reusable way to extend
transparent HTTP mod_tprox and add HTTPS transparent proxying to Apache for
Linux).

But I'm afraid REDIRECT doesn't fit remote proxies which rely on the original
source ip of the client to perform some checks. In that case we need the
50080/50443 applications of your example to forward the modified requests to
the remote proxy with the source.ip of the original client.

I see 3 possible ways to do that:

1) the 50080/50443 applications use libipt and for each new client request,
before doing a new connect() to the remote proxy, they create a new iptables
rule doing SNAT based on the --sport they chose for their bind(). And when the
connection is released, they remove the created rule. This solution is very
inefficient, and not scalable.

2) the 50080/50443 applications rely on the TPROXY framework and use
nonlocal_bind.

3) ??INTERCEPT?? = REDIRECT(PREROUTING)+SNAT(OUTPUT/POSTROUTING?)
i.e:
A:pclient1 <> [REDIRECT--dport80] <>Bserver:50080<>Bclient:pclient2 <> [SNAT(--to A:pclient)] <> C:80
=>
the INTERCEPT would REDIRECT the packets from the client to the local stack and
pass a 'rdmark' to the user space application, retrievable via
getsockopt(rdmark).
Then the application rewrites the packets and, in order to forward them to the
remote proxy, it creates a new client socket to the remote proxy and uses
setsockopt(rdmark) to instruct netfilter to do SNAT on the outgoing packets
(OUTPUT/POSTROUTING?). Netfilter uses the 'rdmark' to retrieve from the
redirect table the '--to' information (the source.ip before the redirect).
When a packet comes back from the remote proxy, the reverse SNAT redirects the
packet to the local client, which passes it to the local server, which sends
the modified packets back to the original client...

(PS: I don't think the MARK target is suited for that kind of mechanism)
(PPS: the user space applications would move as LKM in a second phase)

do you (or anyone else) see any other way to do it?


> One of your proxies will be listening on 50080 the other on 50443. The first
> performing non-transparent/transparent rewrite the other CONNECT
> encapsulation.
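
As a rough illustration of the first rewrite (server-request to proxy-request), the request line can be turned into its absolute-URL form once the host is known, e.g. from the 'Host' field. This is only a sketch: the function name and buffer sizes are made up, and a real proxy would also have to handle headers, HTTP/0.9 requests and malformed input.

```c
#include <stddef.h>
#include <stdio.h>

/* Rewrite an origin-form request line such as "GET /index.html HTTP/1.1"
 * into the absolute form a non-transparent proxy expects:
 * "GET http://host/index.html HTTP/1.1".
 * Returns 0 on success, -1 if the line cannot be parsed or does not fit. */
int to_proxy_request_line(const char *line, const char *host,
                          char *out, size_t outlen)
{
    char method[16], path[1024], version[16];
    int n;

    /* Split into method, path and version; field widths guard the buffers. */
    if (sscanf(line, "%15s %1023s %15s", method, path, version) != 3)
        return -1;
    if (path[0] != '/')     /* already absolute, or not a path at all */
        return -1;

    n = snprintf(out, outlen, "%s http://%s%s %s",
                 method, host, path, version);
    if (n < 0 || (size_t)n >= outlen)
        return -1;          /* output was truncated */
    return 0;
}
```

With host "example.org", the line "GET /index.html HTTP/1.1" becomes "GET http://example.org/index.html HTTP/1.1".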
>
> By the way the first one is easy to do with Zorp. Its HttpProxy is able to
> rewrite server-requests to proxy-requests.
>
> CONNECT encapsulation is not supported I'm afraid.
>
> > Have you received any feedback on your TPROXY target?
>
> not much.
>

I hope I'll be able to contribute, but I'll first need to better understand
what are all the features of Netfilter and how they can interact between
each other...

In the mean time, if you have any update, i'd like to

Re: [Q] connection tracking scaling

2002-03-19 Thread Harald Welte

On Tue, Mar 19, 2002 at 12:16:47PM +0100, Patrick Schaaf wrote:
> > I'd rather like to have this information to be gathered at runtime within
> > the kernel, where one could read out the current hash occupation via /proc
> > or some ioctl.
> 
> OK, that's what I wanted to hear :-)

Well, it's IMHO the right way to do this :)

> That's 10 values, and will at most double when I think more about it.
> I propose to gather these stats on the fly, and simply printk() them
> at a chosen interval:
> 
>   echo 300 >/proc/net/ip_conntrack_showstat

please make it /proc/sys/net/ipv4/ip_conntrack_stat_interval or something
more descriptive.

> would generate one printk() every 300 seconds. Echoing 0 would disable
> the statistics gathering altogether.
> 
> I think I can hack this up, today. Having the flu must be good for
> something...

Thanks, no need to hurry.  We have lived without this for multiple years,
you know ;)

Thanks in advance.

BTW: Gute Besserung.

> later
>   Patrick

-- 
Live long and prosper
- Harald Welte / [EMAIL PROTECTED]   http://www.gnumonks.org/





Re: [Q] connection tracking scaling

2002-03-19 Thread Paul P Komkoff Jr

Replying to Patrick Schaaf:
> 
> I think I can hack this up, today. Having the flu must be good for something...

It seems that we caught the flu both at the same time.
I will try to 'brain-analyze' hashfn here. Maybe, I shall try radix-tree
approach or something ...

-- 
Paul P 'Stingray' Komkoff 'Greatest' Jr // (icq)23200764 // (irc)Spacebar
  PPKJ1-RIPE // (smtp)[EMAIL PROTECTED] // (http)stingr.net // (pgp)0xA4B4ECA4




Re: [Q] connection tracking scaling

2002-03-19 Thread Martin Josefsson

On Tue, 19 Mar 2002, Patrick Schaaf wrote:

> I agree that the hash function needs scrutiny. Do you (or somebody else
> here) have a good collection of real world /proc/net/ip_conntrack excerpts,
> maybe coming from the development of ctnetlink? I'll cook up a "hash
> occupation simulator" for user level, where you can pipe in a conntrack
> table, and get reports about the distribution of chain sizes.
> 
> As I don't run realworld conntracking firewalls with lots of connections
> (I'm using the stuff almost exclusively on servers), I need the help of
> you all to get good input test data, here.

Just as a small sidenote... dumping the entire hashtable is very broken in
ctnetlink, the patch in cvs will kill your kernel. I've fixed at least the
bug that crashes kernel but the dumping still doesn't work, you only get
the first 46 (or was it 47?) connections, one full skb so I need to
implement multi packet responses... I've also fixed a few other bugs that
could crash kernel and I've cleaned it up a bit. All the same bugs are
present in nfnetlink, I'll supply a patch for cvs soon, there's an SMP
race in ctnetlink that I'd like to fix first.

And as Harald wrote, /proc/net/ip_conntrack is very slow, it's O(n^2).
And there's no guarantee that you actually get all connections when reading
it.

I have a router with 73.000 entries in conntrack, I can see if I can copy
/proc/net/ip_conntrack sometime during the night (it will disturb routing
a _lot_; this was the reason I started playing with ctnetlink and fixing
it up). I'm using oidentd on NAT routers here and every lookup took a very
long time (several seconds), and during that time the routing latency went
up to 100-300ms through the router. Now with ctnetlink and a hacked
oidentd I don't see anything happening to the latency while serving a lot
of ident requests.

> > And real-world traffic usually means narrow source ip ranges
> > (because most people firewall a couple of Class-C's) and narrow source
> > port ranges (let's assume lots of users aren't causing too many connections
> > and thus the source port range stays close to the startup default port (32k?))
> > The destination ports are most definitely also not very distributed, since
> > most people will do the same services (http, ftp, smtp, or whatever is used
> > from within this organization).

I have 1000 clients in total behind three routers, we have 2 C-class
subnets behind each router + an internal network that's NAT'd.

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with 
experience.





Hashed jump, or other dynamic jump operations in general

2002-03-19 Thread Henrik Nordstrom

Hi.

Looking into various ways of managing large rulebases using automated tools, 
and was thinking, would it make sense to have a hashed jump operation?

I.e. in one operation, jump to one of 2^n chains depending on a 2^n sized 
hash of a selected criteria (source, destination ip/port, protocol, etc..)

Another option is obviously to create search trees by match -> jump with a 
set of intermediary chains.

If one would attempt to implement such kinds of multi-tiered jumps (hashed or 
whatever internal selection criteria), any ideas on how to proceed? I.e. how 
to write a custom "jump" like target?

   - How to reference the possible target chains? I.e. how to in the 
userspace tool convert user friendly chain names to their kernel name?
   - How to tell the core that processing should jump to the selected chain 
once the custom target has figured out which chain to jump to?

Regards
Henrik Nordström




Re: [Q] connection tracking scaling

2002-03-19 Thread Jean-Michel Hemstedt

----- Original Message -----
From: "Patrick Schaaf" <[EMAIL PROTECTED]>
To: "Harald Welte" <[EMAIL PROTECTED]>; "Patrick Schaaf" <[EMAIL PROTECTED]>;
"Martin Josefsson" <[EMAIL PROTECTED]>; "Aviv Bergman"
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, 19 March, 2002 12:16
Subject: Re: [Q] connection tracking scaling


> > I'd rather like to have this information to be gathered at runtime within
> > the kernel, where one could read out the current hash occupation via /proc
> > or some ioctl.
>
> OK, that's what I wanted to hear :-)
>
> Actually, the interesting statistics for a hash are not that large, and all
> aggregate:
>
> - bucket occupation: number of used buckets, vs. number of all buckets
> - average chain length over all buckets
> - average chain length over the used buckets
> - counting of a classification of the chain lengths:
>   - number of 0-entry buckets
>   - number of 1-entry buckets
>   - number of 2-entry buckets
>   - number of 4-entry buckets
>   - number of 8-entry buckets
>   - number of 16-entry buckets
>   - number of more-than-16-entry buckets
>
> That's 10 values, and will at most double when I think more about it.
> I propose to gather these stats on the fly, and simply printk() them
> at a chosen interval:

I'm not a conntrack specialist, nor a kernel hacker, but I have
some experience with ip hash caches in access servers (BRAS)
that may be useful(?):

some additional stats:
- HDA: cache hit depth average: the number of iterations in the
  bucket's list to get the matching collision entry.
- MDA: cache miss depth average: the number of iterations required
  without matching a cache entry (new connection).

HDA is meaningful if you have a bad cache distribution or a small
CIS/CTS ratio (Cache Index Size = number of hash buckets / Cache
Total Size = total number of conntrack tuples cachable). It also provides
good information on traffic type and cache efficiency: let's
assume you have realtime traffic (RTP) and bursty traffic (HTTP/1.1
with keep-alive) at the same time, and that the tuples for both types
of traffic are under the same hash key. Now if your RT tuple is at the
end of the collision list, or after the bursty entries, you will need
frequent extra iterations to get your RT tuple... The workaround
for that is "collision promotion": you keep a hit counter in each tuple
and just swap the most frequently accessed tuple one position ahead.
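
A minimal sketch of such a "collision promotion" on a singly linked collision chain. The struct and function names are hypothetical and much simpler than a real conntrack tuple, and locking is ignored entirely.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified collision-list entry. */
struct tuple {
    unsigned int key;       /* stands in for the full tuple comparison */
    unsigned long hits;     /* per-entry hit counter */
    struct tuple *next;
};

/* Look up 'key' in the chain starting at *head. On a hit, bump the hit
 * counter and, if the entry is now hit more often than its predecessor,
 * swap it one position ahead. Returns the entry, or NULL on a miss. */
struct tuple *lookup_promote(struct tuple **head, unsigned int key)
{
    struct tuple **pprev = NULL;    /* link pointing at the predecessor */
    struct tuple **link = head;     /* link pointing at the current entry */

    while (*link != NULL) {
        struct tuple *cur = *link;

        if (cur->key == key) {
            cur->hits++;
            if (pprev != NULL && cur->hits > (*pprev)->hits) {
                struct tuple *prev = *pprev;

                prev->next = cur->next;  /* predecessor skips over cur */
                cur->next = prev;        /* cur now precedes it */
                *pprev = cur;            /* incoming link points at cur */
            }
            return cur;
        }
        pprev = link;
        link = &cur->next;
    }
    return NULL;
}
```

Each hit moves an entry at most one position, so a single burst cannot push a steadily used entry far down the list.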

some questions:
- do you have an efficient 'freelist' implementation? What I've seen of
  kmem_cache_free and kmem_cache_alloc doesn't look like a simple
  pointer dereference... Am I wrong?
- wouldn't it be worthwhile to have a "cache promotion" mechanism?

regarding [hashsize=conntrack_max/2], I vote for it!
An alternative solution would be to have a dynamic hash resize
each time the average number of collisions exceeds a threshold
(and no down resize, except maybe asynchronously).
But given my experience I would say that ip hash distribution is not at all
predictable (unless you know where in the net path your box will be, and
what traffic type (VoIP, HTTP, eDonkey, ...) your box will have to handle,
and even then, your predictions will not be valid for more than 6 months!).
Therefore, the common way to handle unpredictable distribution is to
define:
[ max hash index size >= max number of cache tuples]
with a dynamic hash index resize

One last word: the hash function you're using is the best compromise
between unpredictable ipv4 traffic, cache symmetry, uniformity and
computation time. I wouldn't change it too much, but there are two
possible propositions:
- if you keep the modulo method (%), use a prime number far from a
  power of 2 for 'ip_conntrack_htable_size'.
- if modulo is too slow, use the bitmasking method (&) with hsize being
  a power of 2, and with 2 bitshifts
  ((key + (key >> 20) + (key >> 12)) & (hsize - 1)), but
  this method is not as efficient as the modulo method, and must be
  reconsidered for ipv6.
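
The two bucket-selection methods can be sketched as follows. The function names are made up, and 'key' stands for whatever 32-bit value the real hash function has already produced from the tuple. Note that the shifts need explicit parentheses in C, and that the mask has to be (size - 1), not the size itself.

```c
#include <assert.h>
#include <stdint.h>

/* Modulo method: works for any table size, ideally a prime. */
uint32_t bucket_mod(uint32_t key, uint32_t nbuckets)
{
    assert(nbuckets > 0);
    return key % nbuckets;
}

/* Bitmask method: nbuckets must be a power of two. The two shifts fold
 * higher bits of the key into the low bits before masking, so keys that
 * differ only in their upper bits can still land in different buckets. */
uint32_t bucket_mask(uint32_t key, uint32_t nbuckets)
{
    assert(nbuckets > 0 && (nbuckets & (nbuckets - 1)) == 0);
    return (key + (key >> 20) + (key >> 12)) & (nbuckets - 1);
}
```

For example, with 1024 buckets the keys 0x00100000 and 0x00200000 would both map to bucket 0 under a plain mask, but the folded version separates them (buckets 257 and 514).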

hope this may help...

>
> echo 300 >/proc/net/ip_conntrack_showstat
>
> would generate one printk() every 300 seconds. Echoing 0 would disable
> the statistics gathering altogether.
>
> I think I can hack this up, today. Having the flu must be good for
> something...
>
> later
>   Patrick
>
>





Re: [Q] connection tracking scaling

2002-03-19 Thread Patrick Schaaf

Hello Jean-Michel,

thanks for your input. I appreciate it.

On Tue, Mar 19, 2002 at 03:56:32PM +0100, Jean-Michel Hemstedt wrote:
> 
> I'm not a conntrack specialist, neither a kernel hacker, but I've
> some experience with ip hash caches in access servers (BRAS)
> that may be useful(?):
> 
> some additional stats:
> - HDA: cache hit depth average: the number of iterations in the
>   bucket's list to get the matching collision entry.
> - MDA: cache miss depth average: the number of iterations required
>   without matching a cache entry (new connection).

These two measures make an excellent dynamic measure in addition to
what I described. However, they require modification of the standard
LIST_FIND() macro used when traversing the chain, which is a bit
awkward - and they require keeping properly locked / cpu-local
counters in the critical path. I wouldn't want this on a highly
loaded machine, at least not for a longer time.

The measures are important because of their dynamic nature. No static
look at the chains can capture effects related to the _order_ of the
lists. These measures can. One possible optimization for the future
would be to move the bucket pointer to the "last found" element, giving
an easy 1-off-cache per bucket. The effect of that cannot be measured
with a static examination; it should be clearly visible under your
dynamic measure.

I'll see that I make this "separately optional".

> some questions:
> - have you an efficient 'freelist' implementation? What I've seen about
>   kmem_cache_free and kmem_cache_alloc doesn't look like a simple
>   pointer dereference... Am I wrong?

The kmem_cache() stuff is supposed to be the best possible implementation
of such a freelist on an SMP system. I am not aware that it has inefficiencies.

> - wouldn't it be worth to have a "cache promotion" mechanism?

What do you mean with that?

> regarding [hashsize=conntrack_max/2], I vote for!
> An alternate solution would be to have a dynamic hash resize
> each time the average number of collisions exceeds a threshold
> (and no down resize, except maybe asynchronously).

This is REALLY AWFUL in an SMP setting, because you need to keep the
whole system write locked while you rebuild the hashes. You really
don't want to do that.

> One last word: the hash function you're using is the best compromise
> between unpredictable ipv4 traffic, cache symmetry, uniformity and
> computation time.

I'm pretty ignorant wrt theory and attributes of hash functions,
so I'm damned to check this by experiment. But it's reassuring to
hear you say the function is basically OK.

> I wouldn't change it too much, but there are two
> propositions possible:
> - if you keep the modulo method (%), use a prime number far from a
>   power of 2 for 'ip_conntrack_htable_size'.

Hmm. Can you give a short idea of why "far from a power of 2" is important?
In a context completely unrelated to iptables, I had good results using
the dynamically growing array approach you described (this was userlevel
batch jobs, don't care about latency, only throughput). I precalculated
the largest prime just below any power of 2, and used that as the array
size. Works out very well.
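
For what it's worth, precalculating such sizes is cheap. A small sketch using plain trial division (function names are made up; this is fine for a one-time setup computation, not for a hot path):

```c
#include <stdint.h>

/* Trial-division primality test; adequate for one-off table sizing. */
static int is_prime(uint32_t n)
{
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (uint32_t d = 3; (uint64_t)d * d <= n; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Largest prime strictly below 2^bits, for 2 <= bits <= 31.
 * Returns 0 if 'bits' is outside that range. */
uint32_t prime_below_pow2(unsigned int bits)
{
    uint32_t n;

    if (bits < 2 || bits > 31)
        return 0;
    n = ((uint32_t)1 << bits) - 1;
    while (!is_prime(n))
        n--;
    return n;
}
```

This yields, for instance, 8191 for 2^13 and 65521 for 2^16.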

I agree that, since we already use a full division when calculating
the hash function, we may as well use a power-of-two hashsize. This will
waste some room in the last OS page of the array, but that's irrelevant
given the overall size of the array.

best regards
  Patrick




Re: [Q] connection tracking scaling

2002-03-19 Thread Patrick Schaaf

> I agree that, since we already use a full division when calculating
> the hash function, we may as well use a power-of-two hashsize. This will
> waste some room in the last OS page of the array, but that's irrelevant
> given the overall size of the array.

Damn. The second line is of course supposed to read "we may as well
use a prime hashsize".

:)
  Patrick




Re: 2.4.18 patch-o-matic crashing with H323

2002-03-19 Thread Marc Haber

On Tue, 19 Mar 2002 11:56:50 +0100 (CET), Jozsef Kadlecsik
<[EMAIL PROTECTED]> wrote:
>On Tue, 19 Mar 2002, Marc Haber wrote:
>> No. Actually, I have been using the test bed for my phone connectivity
>> for almost a week before the problem showed for the first time.
>
>Even if you try to reproduce the situation, i.e. you hangup a phone call
>and at the same time someone calls you?

I got called back by the same guy that I called previously, so the
line _must_ have been idle for some seconds.

>> How do I switch on debugging? Is it a matter of changing
>> #if 0
>> #define DEBUGP printk
>> #else
>> #define DEBUGP(format, args...)
>> #endif
>>
>> to #if 1, or do I need to make additional changes?
>
>Yes, that's all what is required to switch on debugging.

OK, will do so and get back after the next crash.

Greetings
Marc

-- 
-- !! No courtesy copies, please !! -
Marc Haber  |   " Questions are the | Mailadresse im Header
Karlsruhe, Germany  | Beginning of Wisdom " | Fon: *49 721 966 32 15
Nordisch by Nature  | Lt. Worf, TNG "Rightful Heir" | Fax: *49 721 966 31 29




Re: New DSCP target in CVS

2002-03-19 Thread Matthew G. Marsh

On Tue, 19 Mar 2002, Maciej Soltysiak wrote:

> > Perfectly clear, thanks.  So the FTOS target (but not the new DSCP) can be
> > also used to selectively remove the ECN-enabled bit from syn packets going
> > to some "bad" hosts throwing away any ECN-enabled connection (until the new
> > ECN target is finished ...).
> Hmm, but you will overwrite TOS bits if set originally :(

Yes - And I have previously wondered if it was worth figuring out how to
maintain or mask previous TOS settings. Hmmm - one of these days maybe. I
always use FTOS to test routing and force certain behaviour on DiffServ
networks. So I never really cared that it stomped all over the TOS field.
But now with the shoehorning (aka TOS split to DS and ECN and ...) I
wonder if maybe a more flexible multiset mechanism that allows you to mask
old values for addition or changing would work. Kind of letting you
choose which settings to allow.
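A minimal sketch of what such a masked update could look like, written in Python for brevity. It assumes the RFC 2474 / RFC 3168 layout of the former TOS byte (upper six bits DSCP, lower two bits ECN). The names `set_masked`, `overwrite`, `DSCP_MASK` and `ECN_MASK` are made up for illustration and are not taken from the FTOS or DSCP target code.

```python
# Layout of the former IPv4 TOS byte (RFC 2474 / RFC 3168):
# upper six bits carry the DSCP, lower two bits carry ECN.
DSCP_MASK = 0xFC
ECN_MASK = 0x03


def set_masked(old_tos: int, new_bits: int, mask: int) -> int:
    """Replace only the bits selected by mask; keep all other bits of old_tos."""
    return (old_tos & ~mask & 0xFF) | (new_bits & mask)


def overwrite(old_tos: int, new_tos: int) -> int:
    """Whole-byte overwrite: every previous bit is lost."""
    return new_tos & 0xFF


if __name__ == "__main__":
    old = (46 << 2) | 0x02                              # DSCP EF with ECT(0) set
    print(hex(set_masked(old, 0x00, ECN_MASK)))         # clear ECN, keep DSCP
    print(hex(set_masked(old, 10 << 2, DSCP_MASK)))     # set AF11, keep ECN
    print(hex(overwrite(old, 10 << 2)))                 # set AF11, ECN is lost
```

With a mask per rule, the user decides which part of the byte a rule may touch, which is the "choose which settings to allow" idea above.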

> > Taka
> Maciek

--
Matthew G. Marsh,  President
Paktronix Systems LLC
1506 North 59th Street
Omaha  NE  68104
Phone: (402) 932-7250 x101
Email: [EMAIL PROTECTED]
WWW:  http://www.paktronix.com
--





Re: [Q] connection tracking scaling

2002-03-19 Thread Jean-Michel Hemstedt

- Original Message -
From: "Patrick Schaaf" <[EMAIL PROTECTED]>
To: "Jean-Michel Hemstedt" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, 19 March, 2002 17:42
Subject: Re: [Q] connection tracking scaling


> Hello Jean-Michel,
>
> thanks for your input. I appreciate it.
>
> On Tue, Mar 19, 2002 at 03:56:32PM +0100, Jean-Michel Hemstedt wrote:
> >
> > I'm not a conntrack specialist, neither a kernel hacker, but I've
> > some experience with ip hash caches in access servers (BRAS)
> > that may be useful(?):
> >
> > some additional stats:
> > - HDA: cache hit depth average: the number of iterations in the
> >   bucket's list to get the matching collision entry.
> > - MDA: cache miss depth average: the number of iterations required
> >   without matching a cache entry (new connection).
>
> These two measures make an excellent dynamic measure in addition to
> what I described. However, they require modification of the standard
> LIST_FIND() macro used when traversing the chain, which is a bit
> awkward - and they require keeping properly locked / cpu-local
> counters in the critical path. I wouldn't want this on a highly
> loaded machine, at least not for a longer time.
>
Everything is relative: if the overhead of maintaining those local counters
is too big compared to the use we can make of them (col prom, coarser stats,
path switching, ...) then just forget about it.
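To make the two measures concrete, here is a small user-space sketch in Python, not kernel code. It models a chained hash table that counts lookup depth separately for hits and misses. The class name `CountingTable` and its methods are hypothetical; locking and per-CPU counters, which are the real cost discussed above, are deliberately left out.

```python
class CountingTable:
    """Chained hash table that records lookup depth for hits and misses."""

    def __init__(self, nbuckets: int):
        self.buckets = [[] for _ in range(nbuckets)]
        self.hits = 0
        self.hit_depth = 0      # sum of chain positions of all hits
        self.misses = 0
        self.miss_depth = 0     # sum of chain lengths walked on misses

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key) -> None:
        # New entries go to the head of the chain.
        self._chain(key).insert(0, key)

    def lookup(self, key) -> bool:
        chain = self._chain(key)
        for depth, entry in enumerate(chain, start=1):
            if entry == key:
                self.hits += 1
                self.hit_depth += depth
                return True
        self.misses += 1
        self.miss_depth += len(chain)
        return False

    def hda(self) -> float:
        """Hit depth average: iterations per successful lookup."""
        return self.hit_depth / self.hits if self.hits else 0.0

    def mda(self) -> float:
        """Miss depth average: iterations per failed lookup."""
        return self.miss_depth / self.misses if self.misses else 0.0
```

Because the counters are updated on every lookup, they reflect the order of entries within a chain, which a static dump of chain lengths cannot show.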

> The measures are important because of their dynamical nature. No static
> looking at the chains can capture effects related to the _order_ of the
> lists. These measures can. One possible optimization for the future,
> would be to move the bucket pointer to the "last found" element, giving
> an easy 1-off-cache per bucket. The effect of that cannot be measured
> with a static examination; it should be clearly visible under your
> dynamic measure.
>
> I'll see that I make this "separately optional".
>

Of course, each box has its own requirements. A configurable option
in such a generic kernel framework is always better than a built-in
*useable* option.

> > some questions:
> > - have you an efficient 'freelist' implementation? What I've seen about
> >   kmem_cache_free and kmem_cache_alloc doesn't look like a simple
> >   pointer dereference... Am I wrong?
>
> The kmem_cache() stuff is supposed to be the best possible implementation
> of such a freelist on an SMP system. I am not aware that it has
> inefficiencies.
>
and surely me neither ;o)

> > - wouldn't it be worth to have a "cache promotion" mechanism?
>
> What do you mean with that?

(you wiped out my explanation in my previous mail).
It's the mechanism by which the tuples in a collision list are re-sorted
by hit frequency, trying to reduce the number of iterations required to
get a match. It's only useful if you insert new tuples at the beginning
of the list, or if you have, let's say, a control channel and a data channel
in the same list. In the latter example, if the entry corresponding to the
data channel is after the entry corresponding to your control channel, then
you waste one iteration per data packet.
Example: stream A=1pps, stream B=100pps, lifetime of A = lifetime of B = T
=> if A after B, num iterations = T*(1*100 + 2*1) = T*(102)
=> if B after A, num iterations = T*(1*1 + 2*100) = T*(201)
But this mechanism requires a per-tuple hit counter, and some additional
checks and pointer reallocation in the fast path... It is thus useful only
in specific cases (e.g. a media gateway, sip/h323 proxy, real-time streaming
env.)
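The arithmetic above can be checked with a few lines of Python. This is only a model of the cost, assuming each packet costs as many iterations as its entry's position in the chain. The helper names `iterations_per_second` and `promote` are made up for illustration.

```python
def iterations_per_second(chain, rates):
    """Chain-walk steps per second.

    chain: entry names in list order (first element is the chain head).
    rates: packets per second for each entry.
    A packet for the entry at 1-based position p costs p iterations.
    """
    return sum(pos * rates[name] for pos, name in enumerate(chain, start=1))


def promote(chain, hit_counts):
    """Re-sort a chain so that the most frequently hit entries come first."""
    return sorted(chain, key=lambda name: hit_counts[name], reverse=True)


if __name__ == "__main__":
    rates = {"A": 1, "B": 100}          # control stream A, data stream B
    for chain in (["B", "A"], ["A", "B"]):
        print(chain, iterations_per_second(chain, rates))
    print(promote(["A", "B"], rates))   # hit counts are proportional to rates
```

Multiplying the per-second figure by the lifetime T gives the totals quoted above.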

>
> > regarding [hashsize=conntrack_max/2], I vote for!
> > An alternate solution would be to have a dynamic hash resize
> > each time the average number of collisions exceeds a threshold
> > (and no down resize, except maybe asynchronously).
>
> This is REALLY AWFUL in an SMP setting, because you need to keep the
> whole system write locked while you rebuild the hashes. You really
> don't want to do that.

ok, ok, I haven't said anything ;o)... but FYI the "prime double hash" is
supposed to address the problem of smooth hash expansion by gradually
moving the entries from one table to the other (I've not tried it).
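For illustration only, here is a generic incremental rehash in Python. It is not necessarily the "prime double hash" scheme, whose details are not given in this thread, and it ignores locking entirely. The idea shown is that a larger table is allocated once and each later operation migrates one old bucket, so no single operation has to rebuild the whole hash. The class name `IncrementalTable` and the `max_load` trigger (average chain length) are assumptions.

```python
class IncrementalTable:
    """Chained hash set that grows by migrating one old bucket per add()."""

    def __init__(self, nbuckets: int = 4, max_load: float = 2.0):
        self.new = [[] for _ in range(nbuckets)]
        self.old = None          # previous table while a migration is running
        self.cursor = 0          # index of the next old bucket to migrate
        self.count = 0
        self.max_load = max_load

    @staticmethod
    def _chain(table, key):
        return table[hash(key) % len(table)]

    def _step(self) -> None:
        """Move one bucket of the old table into the new one."""
        if self.old is None:
            return
        for key in self.old[self.cursor]:
            self._chain(self.new, key).append(key)
        self.old[self.cursor] = []
        self.cursor += 1
        if self.cursor == len(self.old):
            self.old = None      # migration finished

    def __contains__(self, key) -> bool:
        if key in self._chain(self.new, key):
            return True
        # During a migration the key may still live in the old table.
        return self.old is not None and key in self._chain(self.old, key)

    def add(self, key) -> None:
        self._step()
        if key in self:
            return
        self._chain(self.new, key).append(key)
        self.count += 1
        # Start a new migration only when none is in progress.
        if self.old is None and self.count > self.max_load * len(self.new):
            self.old = self.new
            self.new = [[] for _ in range(2 * len(self.old))]
            self.cursor = 0
```

The price is that lookups must check two tables while a migration is running.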

>
> > One last word: the hash function you're using is the best compromise
> > between unpredictable ipv4 traffic, cache symmetry, uniformity and
> > computation time.
>
> I'm pretty ignorant wrt theory and attributes of hash functions,
> so I'm damned to check this by experiment. But it's reassuring to
> hear you say the function is basically OK.
>
thanks to the one who chose it (Rusty?). I've got only
one remark: modulo operation consumes lots of CPU cycles especially
if the operand is prime! and its implementation is architecture
dependent (so not predictable) whereas bit shift operations are totally
controllable and give pretty good results if used correctly... Bitshift has
also the advantage of being able to define a hashsize=2^n.
If memory is not an issue and if CPU cycles are critical ($/Hz >> $/MB)
then we'd better use a bitshift hash 
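A short Python sketch of the trade-off, under the assumption that "bitshift hash" here means reducing the hash value with shifts and a mask instead of a division. With hashsize=2^n the mask gives the same bucket as the modulo. The function names and the `fold16` mixing step are illustrative only and are not the ip_conntrack hash.

```python
def bucket_mod(h: int, size: int) -> int:
    """Bucket index by modulo; works for any table size, including primes."""
    return h % size


def bucket_mask(h: int, size: int) -> int:
    """Bucket index by bit mask; only valid when size is a power of two."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return h & (size - 1)


def fold16(h: int) -> int:
    """Fold the upper 16 bits of a 32-bit value into the lower 16 bits.

    A mask only looks at the low bits, so some mixing of this kind is
    needed if the high bits of h carry information.
    """
    h &= 0xFFFFFFFF
    return (h ^ (h >> 16)) & 0xFFFF


if __name__ == "__main__":
    size = 1 << 13                       # hashsize = 2^13
    h = 0x12345678
    print(bucket_mod(h, size), bucket_mask(h, size))
    print(bucket_mask(fold16(h), size))
```

The mask variant cannot be combined with a prime hashsize, so the two suggestions in this thread are alternatives.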




Re: Hashed jump, or other dynamic jump operations in general

2002-03-19 Thread Henrik Nordstrom

On Tuesday 19 March 2002 18:47, Paul P Komkoff Jr wrote:

> Have you seen hashed jump in routing code ? packet classifier, etc.
> Unfortunately I've not got it here but you can look at lartc on the
> net and find an example there.

Yes, but what I am interested in at this moment is how to utilize
similar techniques for optimizing large iptables rulesets. My routing
tables rarely contain more than a handful of rules even in the most
complex environments..

The tc code is quite different from the iptables code.

The method of selecting which chain to jump to is not a big deal;
the question is more how to make iptables perform the actual jump
in the best manner, and how to build the argument list to the jump
target in a manner that makes sense to the userspace iptables
program..

Regards
Henrik Nordström




Re: TPROXY

2002-03-19 Thread Henrik Nordstrom

[cannot claim I have been following the thread closely, mostly
guessing on what you are actually trying to achieve here.. so I may
be way off]

On Tuesday 19 March 2002 12:19, Jean-Michel Hemstedt wrote:

> REDIRECT could work in case of collocated proxy, and only if we
> have control on the proxy, i.e. Apache;
> (btw: I'm currently trying to find a clean and reusable way to
> extend transparent HTTP mod_tprox and add HTTPS transp proxy to
> Apache for Linux).

REDIRECT works only if you have a user space proxy running on
the machine doing REDIRECT. This is by definition of REDIRECT.

> But I'm afraid REDIRECT doesn't fit for remote proxies which rely
> on the original source ip of the client to perform some checks. In
> that case we need the 50080/50443 applications of your example to
> forward the modified requests to the remote proxy with the
> source.ip of the original client.

If the remote proxy is on the same LAN segment or if you can set up a 
GRE tunnel or something similar to the proxy server then you can use 
CONNMARK for this purpose to route the packets unmodified to the 
proxy and then do the final interception there. When you see the NEW 
session in mangle, mark it, then use fwmark based routing to route 
the packets of that session to the "close by" proxy.

> 1) the 50080/50443 applications use libipt and for each new client
> request, before doing a new connect() to the remote proxy, they
> create a new iptables rule doing SNAT based on the --sport they
> chose for their bind(). And when the connection is released, they
> remove the created rule. This solution is very inefficient, and not
> scalable.

Yuck.

I would rather go for a single daemon using a custom protocol to
forward the information to the origin server, such as the
(incidentally) named TPROXY extension I was once playing with for
Squid, archived somewhere on my "old Squid patches" page
on how to manage remote interception of traffic.

But sure, this is a limitation of the "transparent proxy" 
capabilities of the current iptables framework.

I think some aspects of SOCKS can also be used for this purpose.

> 2) the 50080/50443 applications rely on the TPROXY framework and use
> nonlocal_bind.

Except that nonlocal_bind does not yet work in TPROXY, does it?

> 3) ??INTERCEPT?? = REDIRECT(PREROUTING)+SNAT(OUTPUT/POSTROUTING?)
> i.e:
> A:pclient1 <> [REDIRECT--dport80] <>Bserver:50080<>Bclient:pclient2
> <> [SNAT(--to A:pclient)] <> C:80
> =>
> the INTERCEPT would REDIRECT the packets from the client to the
> local stack and pass a 'rdmark' to the user space application
> retrievable via getsockopt(rdmark).
> Then, the application rewrites the packets and in order to forward
> them to the remote proxy, it creates a new client socket to the
> remote-proxy and uses setsockopt(rdmark) to instruct netfilter to
> do SNAT on the outgoing packets (OUTPUT/POSTROUTING?). Netfilter
> uses the 'rdmark' to retrieve from the redirect table the '--to'
> information (the source.ip before the redirect).
> When a packet comes back from the remote-proxy the reverse SNAT
> redirects the packets to the local client, which passes the packet
> to the local server which sends back to the original client the
> modified packets...

Very much sounds like CONNMARK is what you are after here.. Allows 
you to selectively reroute individual tracked sessions without 
needing to rely on NAT. But if you need to rewrite the payload then 
CONNMARK obviously won't help you..

> > Zorp supports HTTPS, but it doesn't encapsulate it into CONNECT.
> > It simply decrypts ongoing traffic, checks HTTP within it, and
> > sends it on reencrypted. But for this to work you'd need to run
> > Zorp on your firewall (where it was meant to run)

At the cost of totally invalidating SSL in terms of proxying.

  - Client can no longer verify the authenticity of the origin server 
further than the proxy.
  - Servers can no longer authenticate or verify the client.

Typical man-in-the-middle scenario.

I assume we are talking about what is termed by the IETF WREC
group "surrogate" servers rather than proxies here.. If not then
decrypting proxied SSL traffic is a serious breach of security.

Regards
Henrik Nordström




[ANNOUNCE] newnat release candidate (newnat13)

2002-03-19 Thread Harald Welte

Hi!

newnat13 will hopefully be the last version...

I've now made the following changes since newnat8:

1) ipchains.o and ipfwadm.o did contain unresolved symbols with newnat!
2) re-order members of struct ip_{conntrack,nat}_helper and changed other
   patches accordingly
3) ported helper match
4) ported pptp conntrack+nat to newnat (minimum port, still no multiple
   calls possible)
5) renamed some #defines from ip_conntrack_talk.h since there were collisions
   with IRDA (*sigh*, macro namespace is global)

I'd like everybody (esp. Jozsef, if he has time) to have a look at
the code and test newnat from current CVS.

It applies cleanly against 2.4.18 and 2.4.19-pre3 (needs all 'submitted'
patches from p-o-m).

If I don't receive any further complaints and none of my test boxes 
crashes, I'll submit 0-newnat13.patch to the mainstream kernel at
6pm CET (GMT+1) tomorrow.

Thanks.

-- 
Live long and prosper
- Harald Welte / [EMAIL PROTECTED]   http://www.gnumonks.org/

GCS/E/IT d- s-: a-- C+++ UL$ P+++ L$ E--- W- N++ o? K- w--- O- M+ 
V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*)