FYI: QUEUE ipqmpd bugs

2002-07-05 Thread Jean-Michel Hemstedt

FYI,

I upgraded to iptables-1.2.6a (user & kernel-2.4.18 patches)
and got the following (maybe known) problems:

- QUEUE target does not work when the kernel is compiled with CONFIG_IP_NF_QUEUE=m
  => the packets are queued, but ipq_create_handle() fails with:
     can't create netlink socket
     ERROR: Unable to create netlink socket: Connection refused
     (problem with exported symbols?)
  => quick fix: compile the kernel with CONFIG_IP_NF_QUEUE=y
  (a minimal libipq sketch follows below)

- ipqmpd-0.3: default verdict NF_ACCEPT is not applied when no
  process has attached to it. In fact ipqmpd starts, but it seems
  that it never receives any packet (in ipq_inp). When one process
  attaches to it, with a mark different from the queued packet, then
  the default NF_ACCEPT is applied correctly. When all processes have
  detached from ipqmpd, the default NF_ACCEPT continues to be applied
  correctly.
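
For reference, here is roughly what a minimal libipq client (the kind of
process ipqmpd multiplexes) looks like; a sketch only, with error handling
reduced to the call that fails above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <linux/netfilter.h>
    #include <libipq.h>

    #define BUFSIZE 2048

    int main(void)
    {
        unsigned char buf[BUFSIZE];
        struct ipq_handle *h;

        h = ipq_create_handle(0, PF_INET);   /* fails with CONFIG_IP_NF_QUEUE=m */
        if (!h) {
            ipq_perror("ipq_create_handle");
            exit(1);
        }
        if (ipq_set_mode(h, IPQ_COPY_PACKET, BUFSIZE) < 0)
            ipq_perror("ipq_set_mode");

        for (;;) {
            if (ipq_read(h, buf, BUFSIZE, 0) < 0)
                break;
            if (ipq_message_type(buf) == IPQM_PACKET) {
                ipq_packet_msg_t *m = ipq_get_packet(buf);
                /* accept everything; a real client would inspect m->payload */
                ipq_set_verdict(h, m->packet_id, NF_ACCEPT, 0, NULL);
            }
        }
        ipq_destroy_handle(h);
        return 0;
    }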

kr,
___
-jmhe-   He who expects nothing shall never be disappointed






Re: performance issues (nat / conntrack)

2002-06-26 Thread Jean-Michel Hemstedt

  (strange thing is that ethernet irqs reported by procinfo are
   decreasing when the machine is overloaded. I suppose that it
   means either that irqs are not even caught by the kernel/driver,
   which is quite worrying, or that the irq counters refer to
   'processed' interrupts)
 
 are you using a driver which uses the irq mitigation interface of 2.4.x
 or the NAPI of 2.5.x ?

no. (is it available for non Gigabit ethernets?). I'm using 3c905-TX.

one explanation could be that SYN packets are 'lost'/not processed, and
there are thus no additional SYN/ACK,ACK (at least), and thus no extra
flow (each connection was made of about 10 packets). Since packets
don't get forwarded, the same applies for the other interface.
It also means that most retransmissions get lost.

 
  Harald: could you ask your kernel specialists what the weak point of
   kmem_cache_alloc() is (locks, allocs, ...), and how we could possibly
   improve it (batch alloc, but isn't that already the case)?
 
 I don't think that the slab allocator is the bottleneck. please show me
 profiling data pointing this out.
 

yesss, facts... but in the meantime, grabbing info/experience/feelings
from 'non-involved' people could be instructive. what do they say?

 - Harald Welte

kr,
-jmhe-





Re: performance issues (nat / conntrack)

2002-06-25 Thread Jean-Michel Hemstedt
 generator creating traffic
  from one source to one destination ipa, with only source port variation
  (but given my configured hash table size and the hash function itself
  it shouldn't have been an issue).

 I think because only the source port varies, this is an important issue in
 your setup. You actually tested the hash functions and could bomb some
 hash entries. The overall effect was a DoS against conntrack.

ok, here we go:

static inline u_int32_t
hash_conntrack(const struct ip_conntrack_tuple *tuple)
{
#if 0
        dump_tuple(tuple);
#endif
        /* ntohl because more differences in low bits. */
        /* To ensure that halves of the same connection don't hash
           clash, we add the source per-proto again. */
        return (ntohl(tuple->src.ip + tuple->dst.ip
                      + tuple->src.u.all + tuple->dst.u.all
                      + tuple->dst.protonum)
                + ntohs(tuple->src.u.all))
               % ip_conntrack_htable_size;
}

src.u.all & dst.u.all refer (unless there's a bug) to src.tcp.port
and dst.tcp.port respectively. So, if only src.port varies linearly
(let's say between 32000 and 64000), and if ip_conntrack_htable_size
= 32768 (kernel: ip_conntrack (32768 buckets, 262144 max)), then
we should have a maximum of 2 collisions per bucket (unless there's a type
overflow somewhere).

This was my test setup, but since I haven't verified the conntrack hash
distribution, I didn't want to argue on that. To measure that, we should
maintain hash counters such as max collisions, average collisions per
key, hit/miss depth average, number of hit/miss per second, etc...
I've planned to do that along with profiling, but unfortunately not in
the 2 coming weeks.
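
As a cheap first check (before instrumenting the kernel), the distribution
can be simulated in user space by replaying the hash over the tuples my
generator produces. A sketch, assuming a htable size of 32768, arbitrary
fixed addresses, and only the source port varying as in my test (the
byte-order handling of the port fields matters, so treat the output as
indicative only):

    #include <stdio.h>
    #include <sys/types.h>
    #include <netinet/in.h>

    #define HTABLE_SIZE 32768

    /* user-space copy of hash_conntrack(), fed with network-order values
       as the kernel tuple stores them */
    static unsigned int hash(u_int32_t sip, u_int32_t dip, u_int16_t sport,
                             u_int16_t dport, u_int8_t proto)
    {
        return (ntohl(sip + dip + sport + dport + proto) + ntohs(sport))
               % HTABLE_SIZE;
    }

    int main(void)
    {
        static unsigned int bucket[HTABLE_SIZE];
        unsigned int sport, max = 0;
        u_int32_t sip = htonl(0x0a000001);   /* 10.0.0.1, arbitrary */
        u_int32_t dip = htonl(0x0a000002);   /* 10.0.0.2, arbitrary */

        for (sport = 32000; sport <= 64000; sport++) {
            unsigned int h = hash(sip, dip, htons(sport), htons(80), IPPROTO_TCP);
            if (++bucket[h] > max)
                max = bucket[h];
        }
        printf("max chain length: %u\n", max);
        return 0;
    }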

--

last points I wanted to clarify:

 From: Patrick Schaaf [EMAIL PROTECTED]
 On Sun, Jun 23, 2002 at 09:46:29PM -0700, Don Cohen wrote:
From: Jean-Michel Hemstedt [EMAIL PROTECTED]
   Since in my test, each connection is ephemeral (10ms) ...
 
  One question here is whether the traffic generator is acting like
  a real set of users or like an attacker.  A real user would not keep
  trying to make connections at the same rate if the previous attempts
  were not being served.  I suspect you're acting more like an attacker.

 He definitely is. The test he described is completely artificial, and does
 not represent any normal real world workload.

 Nevertheless, it does point out a valid optimization chance. We discussed
 that months ago, and it's still there.

No, I don't think so.
1) the hash is not at fault (see above)
   (btw, as discussed in 'connection tracking scaling' [19 March 2002]
   I don't see ways to really optimize it unless you go for
   multidimensional hashes described in theoretical papers, or if
   you make traffic assumptions, which is most likely impossible
   in such a generic framework...) However, I don't understand
   why we add src.port twice in the hash function.
2) My test was artificial, but not unrealistic: one endpoint sustaining
   1000 conn/s whatever the responsiveness of the target, or 1 users
   trying to connect through the gw in a time lapse of 10 seconds, is
   similar.
   Now, if some of you are telling me that I'm not allowed, or that I'm nuts
   to place my box in front of 1 users, that's another debate.
   I'm not talking about dimensioning, I'm talking about relative performance
   and strange weaknesses.

kr,
-jmhe-






Re: performance issues (nat / conntrack)

2002-06-25 Thread Jean-Michel Hemstedt

  loading a module, doesn't mean using it (lsmod reports it as 'unused'
  in my tests). So, does it really 'sounds as expected', when you see
 
 From where do you think that the module usage counter reports how many
 packets/connections are handled (currently? totally?) by the module?
 There is no connection whatsoever!

The module usage counter increases when a TARGET needs it (e.g. ipt_REDIRECT).
In this test, no rule was defined, and no target module was loaded,
so I did not expect NAT to process any packet.

 
o The cumulative effect should be reconsidered.
 
  - I can't explain the last one, but when the table is exhausted
conntrack drops new packets, right? What I noticed is that at that
moment, the cpu load suddenly hit 100%, and the machine did not
recover, unless I killed the load generator
 
 That is unusual and should be tested further.

I suppose that due to the load, packets are dropped not because of conntrack
but because they simply can't be processed; conntrack therefore misses packets
of existing connections (such as FIN, RST) and thus can't recover before its
timeouts expire.

 
   ? What 'nat table' are  you talking about?  Do you understand how NAT
   works and how it interacts with connection tracking?
 
  Just to recall my test: I generated an amount of new connections
  per second passing through a forwarding machine without any iptables
  module and measured the cpu load/responsiveness and other things...
  Then while the machine was sustaining this amount of new conn/s, i did
  'insmod ip_conntrack [size]', saw the cpu load increasing, and finally
  just did 'iptables -t nat -L' to load the nat module without any rule,
  and saw again the cpu load increasing. With 500conn/s, the cpu load went
  from 10% -> ~50/70% -> 100% (machine unavailable).
 
 According to your first mail, the machine has 256M RAM and you issued
 
 insmod ip_conntrack 16384
 
 That requires 16384*8*~600byte ~= 75MB non-swappable RAM.
 
 When you issued iptables -t nat -L, the system tried to reserve an additional
 2x75MB. That's in total pretty near to all your available physical RAM
 and the machine might have died in swapping.
 

exact!
That's why I looked (but not closely) at swap-in/swap-out in procinfo,
but didn't notice anything (0 most of the time on a 10 sec average).
But I agree that I was close to the limit, and even over it when I tried 32K.
Despite that, it's not so surprising to see so few swaps, since my table
was not full (max 4000 up to 1 concurrent tuples).

But this raises two additional problems:
1) the hash index size and the hash total size should be configurable
separately (get rid of that factor 8, and use a free list for the tuple
allocation; see the sketch below).
2) NAT hash sizes should also be configurable independently from conntrack.
Normally the nat hashes are smaller than the conntrack hash, since conntrack
is based on ports, while nat is not.
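
Just to illustrate point 1): what I have in mind is a bucket array sized
independently from a preallocated tuple pool managed as a free list,
roughly like this (pure sketch, names invented, locking omitted):

    #include <linux/list.h>

    struct ct_entry {
        struct list_head hash_link;   /* chaining inside a hash bucket  */
        struct list_head free_link;   /* chaining inside the free list  */
        /* ... tuple, timeout, ...                                      */
    };

    static struct list_head *ct_hash; /* ct_htable_size buckets         */
    static LIST_HEAD(ct_free);        /* ct_max preallocated entries    */

    static struct ct_entry *ct_alloc(void)
    {
        struct ct_entry *e;

        if (list_empty(&ct_free))
            return NULL;              /* table full, no alloc in fast path */
        e = list_entry(ct_free.next, struct ct_entry, free_link);
        list_del(&e->free_link);
        return e;
    }

    static void ct_release(struct ct_entry *e)
    {
        list_del(&e->hash_link);
        list_add(&e->free_link, &ct_free);
    }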

PS: could anybody redo similar tests so that we can compare the results
and stop killing the messenger, please? ;o)


 Regards,
 Jozsef
 -
 E-mail  : [EMAIL PROTECTED], [EMAIL PROTECTED]
 WWW-Home: http://www.kfki.hu/~kadlec
 Address : KFKI Research Institute for Particle and Nuclear Physics
   H-1525 Budapest 114, POB. 49, Hungary
 
 
 
 





Re: conntrack hash function

2002-06-25 Thread Jean-Michel Hemstedt

hi,

just FYI, hash was already discussed in a previous thread:
'connection tracking scaling' [19 March 2002]
sorry if you were aware of it.


  From: Jean-Michel Hemstedt [EMAIL PROTECTED]
 
   static inline u_int32_t
   hash_conntrack(const struct ip_conntrack_tuple *tuple)
   {
   #if 0
           dump_tuple(tuple);
   #endif
           /* ntohl because more differences in low bits. */
           /* To ensure that halves of the same connection don't hash
              clash, we add the source per-proto again. */
           return (ntohl(tuple->src.ip + tuple->dst.ip
                         + tuple->src.u.all + tuple->dst.u.all
                         + tuple->dst.protonum)
                   + ntohs(tuple->src.u.all))
                  % ip_conntrack_htable_size;
   }
 
 A few questions here:
 - Why make the two halves of the connection hash to different buckets?
   I'd think you'd want to consider the two halves to be the same
   connection.  So you want them to hash the same.  It would make the
   comparison a little more expensive, but save half the space.

and would even avoid a second key computation (modulo may be costly on
prime numbers for instance) and it would also avoid a second collision
list scan (interesting in case of a large number of collisions).

 - % table size seems not quite ideal.  Especially since the table size
   is likely a power of 2, which means that you effectively ignore all
   but the low order bits of the addresses and ports.

yes, modulo is slower but more robust than bit masking (&), and
is generally considered a good tradeoff between computation time
and element distribution (see the sketch below).

Another common practice consists in bit-shifting src and dst
(e.g. ipa + (ipa >> 20) + (ipa << 13)) to enforce domain LSB variations.

 Perhaps one less than table size if it's even, which will lose the
 use of one bucket but then use all of the data bits in the hash.  
 Of course, the low order bits might well be good enough.
 Then again, that depends on what the -per-proto data looks like,
 and for some protos this might not vary in the low order bits.
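
To make the trade-off concrete, the two variants discussed above look like
this (a sketch; a prime-ish size for the modulo case, a power of two for
the mask case):

    #include <sys/types.h>

    /* modulo: works for any table size, one (slow-ish) division per lookup */
    static unsigned int hash_mod(u_int32_t key, unsigned int htable_size)
    {
        return key % htable_size;
    }

    /* mask: htable_size must be a power of two; the shifts fold high-order
       bits into the index so that not only the low-order bits count */
    static unsigned int hash_mask(u_int32_t key, unsigned int htable_size)
    {
        return (key + (key >> 20) + (key >> 12)) & (htable_size - 1);
    }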

-jmhe-





Re: performance issues (nat / conntrack)

2002-06-23 Thread Jean-Michel Hemstedt



I know this debate is not new... I just didn't expect such a (90%, see
below) perf drop and unavailability risk. That's why I'm only reporting
it, hoping secretly that experienced hackers will consider it seriously.
;o)

Note: I don't want to play with words, but if you prefer, consider
 'load generator' as 'malicious DoS user', and 'perf issue' as
  'DoS vulnerability', as Don Cohen cleverly suggested :-/
 (for me it's the same problem, except that a DoS is occasional while
  perf is what we may expect in a normal situation)

 
  I'm doing some tcp benches on a netfilter enabled box and noticed
  huge and surprising perf decrease when loading iptable_nat module.

 Sounds as expected.

loading a module doesn't mean using it (lsmod reports it as 'unused'
in my tests). So, does it really 'sound as expected' when you see
your cpu load hitting 100% and most packets dropped just after having
done 'iptables -t nat -L' on a system that was at 1% CPU load handling 'only'
10kpps and forwarding about 1000 new TCP connections/s?


  - ip_conntrack is of course also loading the system, but with huge memory
  and a large bucket size, the problem can be solved. The big issue with
  ip_conntrack is the state timeouts: with the default ones it simply kills
  the system and drops all the traffic, because the ip_conntrack table
  becomes quickly full, and it seems that there is no way to recover from
  that situation... Keeping unused entries (time_close) even 1 minute in
  the cache is really not suitable for configurations handling a (relatively)
  large number of connections/s.

 what is a 'relatively' large number of connections? I've seen a couple
 of netfilter firewalls dealing with 20+ tracked connections.

200K concurrent established connections, maybe... but surely not NEW
connections/second.
See previous results: with only ip_conntrack loaded (no nat), I hardly
reached 500 (new) conn/s.


  o The cumulative effect should be reconsidered.

 could you please try to explain what you mean?

There are 3 aspects:
- table exhaustion (can be fixed with large memory) as long as the
  hash is correctly distributed (few collisions)
- concurrent timers (1 per conntrack tuple??)
- I can't explain the last one, but when the table is exhausted
  conntrack drops new packets, right? What I noticed is that at that
  moment, the cpu load suddenly hit 100%, and the machine did not
  recover, unless I killed the load generator


  o Are there ways/plans to tune the timeouts dynamically? and what are
the valid/invalid ranges of timeouts?

 No, see the mailinglist archives for the reason why.

If you refer to your mail of 18 January 2001, I think that this timeout
should also be reviewed ;o)... Waiting for somebody with the time and
the ability to do a redesign was quite idealistic, while a quick patch
for configurable timeouts per rule (i.e. http timeouts different from smtp
ones, as suggested by Denis Ducamp) would have been more realistic.


  o looking at the code, it seems that one timer is started by tuple...
wouldn't it be more efficient to have a unique periodic callback
scanning the whole or part of the table for aged entries?

 I think somebody (Martin Josefsson?) is currently looking into optimizing

  - The annoying point is iptable_nat: normally the number of entries in
  the nat table is much lower than the number of entries in the conntrack
  table. So even if the hash function itself could be less efficient than
  the ip_conntrack one (because it takes less arguments: src+dst+proto),
  the load of nat, should be much lower than the load of conntrack.
  o So... why is it the opposite??

 ? What 'nat table' are  you talking about?  Do you understand how NAT
 works and how it interacts with connection tracking?

Actually, that's also what I would like to know ;o)
bysource or byipsproto hash tables, pointing to ip_nat_hash tuples
pointing to an ip_conntrack entry. But I don't understand where the
extra processing comes from when there are no (nat) rules defined.
Just to recall my test: I generated an amount of new connections
per second passing through a forwarding machine without any iptables
module and measured the cpu load/responsiveness and other things...
Then while the machine was sustaining this amount of new conn/s, I did
'insmod ip_conntrack [size]', saw the cpu load increasing, and finally
just did 'iptables -t nat -L' to load the nat module without any rule,
and saw again the cpu load increasing. With 500conn/s, the cpu load went
from 10% -> ~50/70% -> 100% (machine unavailable).


  o Are there ways to tune the nat performances?

 no. NAT (and esp. NAT performance) is not a very strong point of netfilter.
 Everybody agrees that NAT is evil and it should be avoided in all
circumstances.
 Rusty didn't want to become NAT/masquerading maintainer in the first place,
 but rather concentrate on packet filtering.

wow! what is the alternative for 'Everybody' using REDIRECT?


 The NAT subsystem has a 

Re: performance issues (nat / conntrack)

2002-06-23 Thread Jean-Michel Hemstedt



  I'm doing some tcp benches on a netfilter enabled box and noticed
   huge and surprising perf decrease when loading iptable_nat module. 
 Rather similar to the results I posted about a week ago.

oops, sorry, it seems we performed our tests at the same time ;o)

 
   - Another (old) question: why are conntrack or nat active when there are
   no rules configured (using them or not)? 
 I noticed this too.  After a test using conntrack, the next test
 without using conntrack would perform poorly unless I did rmmod.

yes, minor issue if documented...

 
   Since in my test, each connection is ephemeral (10ms) ...
 When all works correctly, the end of each connection should be noticed
 by conntrack and the connection removed from the table, right?

yep

 In which case the table should never get very full.

ideally yes, but from the conntrack machine's perspective, the rest of
the world should not be considered reliable... and in fact it is not.
So, timeouts should be reviewed, especially since we know that the average
tcp connection duration on the www is about 20 seconds.

 So I'm guessing that large number of entries in conntrack table is
 evidence that packets are being lost.  

not only: a crashed client breaking the tcp sequence also causes
garbage entries in conntrack.

 In particular, if the syn
 packet arrives but is never forwarded, you get one of those conntrack
 entries where conntrack thinks (incorrectly) the syn has been
 forwarded so it's waiting for the reply.  Ideally the entry should
 not be added to the table until the packet goes out.

??? or is served locally ???

 
 Just wondering, how did you measure cpu load?
 

procinfo -n10, with [d] for showing differences, which in fact computes
   the differences of cumulated cpu time (taken from
   /proc/meminfo) over the given period: (tsys1-tsys0)/T
   (I was too lazy to write a 'while' script...)


Maybe my mail was not clear... I've been surprised by 2 issues:
1) conntrack timeout garbage (which was addressed by your mail)
2) nat performance killing: I really don't understand it, especially
   when there's no rule active on it, and thus no translation active.

I can admit the overhead of conntrack because of the number of
entries and criteria it has to manage, but this one can be
dimensioned and understood.
But what about NAT??? In my opinion, the NAT overhead should only
be a delta against the conntrack overhead. But what I noticed is
an overhead as big as the conntrack overhead! why?

___
-jmhe-   He who expects nothing shall never be disappointed







performance issues (nat / conntrack)

2002-06-20 Thread Jean-Michel Hemstedt

dear netdevels,

I'm doing some tcp benches on a netfilter enabled box and noticed a
huge and surprising perf decrease when loading the iptable_nat module.

- ip_conntrack is of course also loading the system, but with huge memory
and a large bucket size, the problem can be solved. The big issue with
ip_conntrack is the state timeouts: with the default ones it simply kills
the system and drops all the traffic, because the ip_conntrack table
becomes quickly full, and it seems that there is no way to recover from
that situation... Keeping unused entries (time_close) even 1 minute in
the cache is really not suitable for configurations handling a (relatively)
large number of connections/s.
o The cumulative effect should be reconsidered.
o Are there ways/plans to tune the timeouts dynamically? and what are
  the valid/invalid ranges of timeouts?
o looking at the code, it seems that one timer is started per tuple...
  wouldn't it be more efficient to have a unique periodic callback
  scanning the whole or part of the table for aged entries? (see the
  sketch below)
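
A rough 2.4-style sketch of that periodic callback (names invented,
locking and the actual bucket walk omitted):

    #include <linux/timer.h>
    #include <linux/sched.h>        /* jiffies, HZ */

    static struct timer_list ct_gc_timer;

    static void ct_gc_scan(unsigned long data)
    {
        /* walk one slice of the hash table per tick and expire entries
           whose per-tuple deadline has passed, instead of arming one
           kernel timer per tuple */

        ct_gc_timer.expires = jiffies + HZ;   /* rescan in one second */
        add_timer(&ct_gc_timer);
    }

    static void ct_gc_init(void)
    {
        init_timer(&ct_gc_timer);
        ct_gc_timer.function = ct_gc_scan;
        ct_gc_timer.data = 0;
        ct_gc_timer.expires = jiffies + HZ;
        add_timer(&ct_gc_timer);
    }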

- The annoying point is iptable_nat: normally the number of entries in
the nat table is much lower than the number of entries in the conntrack
table. So even if the hash function itself could be less efficient than
the ip_conntrack one (because it takes fewer arguments: src+dst+proto),
the load of nat should be much lower than the load of conntrack.
o So... why is it the opposite??
o Are there ways to tune the nat performances?

- Another (old) question: why are conntrack or nat active when there are
no rules configured (using them or not)? If not fixed, it should at
least be documented... Somebody doing iptables -t nat -L takes the risk
of killing their system if it's already under load... In the same spirit,
iptables -F should unload all unused modules (the ip_tables module itself
doesn't hurt). Just one quick fix: replace the 'iptables' executable by
an 'iptables' script calling the exe (located somewhere else) and
doing an rmmod at the end...

comments are welcome;


here is my test bed:

tested target:
 -kernel 2.4.18 + non_local_bind + small conntrack timeouts...
 -PIII~500MHz, RAM=256MB
 -2*100Mb/s NIC

The target acts as a forwarding gateway between a load generator client
running httperf, and an apache proxy serving cached pages. 100Mb/s NICs
and request/response sizes ensure that BW and packet collisions are not
an issue.

Since in my test each connection is ephemeral (10ms), I recompiled the
kernel with very short conntrack timeouts (i.e. 1 sec for close_wait,
and about 60 sec for established!) This was also the only way to restrict
the conntrack hash table size (given my RAM) and avoid exaggerated hash
collisions. Another limitation comes from my load generator creating traffic
from one source to one destination ipa, with only source port variation
(but given my configured hash table size and the hash function itself
it shouldn't have been an issue).
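
For the record, the values I shortened live in the per-state timeout table
of ip_conntrack_proto_tcp.c. A sketch of the kind of edit, with only the
two entries I touched shown and the other states elided (the stock values
are not reproduced here):

    #define SECS * HZ

    static unsigned long tcp_timeouts[] = {
            /* ... other TCP_CONNTRACK_* states kept at their stock values ... */
            60 SECS,   /* TCP_CONNTRACK_ESTABLISHED: cut to ~60s for the bench */
            /* ... */
            1 SECS,    /* TCP_CONNTRACK_CLOSE_WAIT: cut to ~1s for the bench   */
            /* ... */
    };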

results are averages from procinfo -n10 [d]

test results:

1) target = forwarding only (no iptables module or rule)
 -  rate  : 100conn/s (=request-response/s)
 - CPU load  : 0% system
 - context   : 7  context/s
 - irq(eth0/eth1): 0.9 / 0.9  kpps   (# of packet/sec = #irq/s)

 -  rate  : 500conn/s
 - CPU load  : 10%system
 - context   : 18-100context/s (varying!)
 - irq(eth0/eth1): 4.4 / 4.4  kpps

 -  rate (max): 1050   conn/s (max from my load generator)
 - CPU load  : 25%system
 - context   : 1000   context/s
 - irq(eth0/eth1): 10 / 10kpps

2) (1) + insmod ip_conntrack 16384 (no rules)

 -  rate  : 100conn/s
 - CPU load  : 0.8%   system
 - context   : 7  context/s
 - irq(eth0/eth1): 0.9 / 0.9  kpps
 - conntrack size: 970concurrent entries

 -  rate  : 250conn/s
 - CPU load  : 10%system
 - context   : 12 context/s
 - irq(eth0/eth1): 2.2 / 2.2  kpps
 - conntrack size: 2390   concurrent entries

 -  rate  : 500conn/s
 - CPU load  : 30-70% system  (varying)
 - context   : 45-90  context/s
 - irq(eth0/eth1): 4 / 4  kpps
 - conntrack size: 4770   concurrent entries

3) (2) + iptables -t nat -L  [=iptable_nat] (no rules)
 -  rate  : 100conn/s
 - CPU load  : 1% system
 - context   : 8  context/s
 - irq(eth0/eth1): 0.9 / 0.9  kpps
 - conntrack size: 970concurrent entries

 -  rate  : 250conn/s
 - CPU load  : 40%system
 - context   : 20 context/s
 - irq(eth0/eth1): 2.2 / 2.2  kpps
 - conntrack size: 2390   concurrent entries

 -  rate  (max)   : 420conn/s (all failed)
 - CPU load  : 97%system
 - context   : 28 context/s
 - irq(eth0/eth1): 3.1 / 4.1  kpps
 - conntrack size: 4050   concurrent entries

 -  rate (killing): [500]->0  conn/s (all failed)
 - CPU load  : 

Re: TPROXY

2002-03-27 Thread Jean-Michel Hemstedt



 On Wed, Mar 27, 2002 at 10:15:56AM +0100, Henrik Nordstrom wrote:
  On Tuesdayen den 26 March 2002 16.33, Balazs Scheidler wrote:
 
   Providing a client certificate to the server is not very common, if it is
   required a tunnel can be opened to that _specific_ server, and nothing
   else.
  
   So using a real decrypting HTTPS proxy for general https traffic, and
   opening holes to specific destinations is definitely more secure than a
   simple 'pass-through' hole in the firewall.
 
  You missed the point here. Using a decryption HTTPS proxy invalidates both
  the use of client certificates AND the use of server certificates, which
  makes the use of SSL somewhat pointless. Further, unless the proxy runs its
  own CA trusted by the browsers, the users will always be warned that the
  server certificate is invalid when using such a proxy.

 I think you missed the point here. Of course the firewall verifies the
 server's certificate using its own trusted list of CAs.

 The user is not capable of deciding whether a certificate presented to him
 really belongs to the given server. They simply press 'continue' without
 thinking that the server they are communicating with is fake.

 Of course if you AND your users know what the hell a certificate is, they
 can decide but I think you are a minority.


We are far from TPROXY, but here is my point of view:

- HTTPS decrypting proxy is an (mitm-attack) alternative if you want
  to block all CONNECT operations in your proxy. But it sounds
  like an abuse protection against inside users. And unfortunately,
  for the user itself, as mentioned above, it will block services
  such as home banking as well.

- If your proxy allows CONNECT requests, then virtually anything
  can pass through it, and HTTPS decrypting proxy does not make sense.

Then, if you are really concerned about insider attacks, what about a
session/tunnel timer, which could be a possible (ugly) protection
against wormhole kinds of attacks, without invalidating ssl?

-jmhe-





Re: TPROXY

2002-03-20 Thread Jean-Michel Hemstedt

Henrik,
just to recap the goal:

I have:
- non-proxy aware clients (not controlable)
- non-transparent aware proxy (not controlable,
  and even not on Linux, it is not in-housed)

and in the middle:
- one (or more) default gateway, the netfilter box.

=> goal:
1) HTTP: rewrite the HTTP requests (PDU) so that they
  can be handled by the proxy.
2) HTTPS: insert the CONNECT transactions so that the
  proxy can create its https tunnel to the orig-server.
 (and there is no mitma issue)
3) for both: keep the source ip addresses of the clients
  in the modified forwarded packets, so that the proxy
  can do simple source-based authentication (possibly
  with the collaboration of external elements such as
  radius, but authentication is out of scope here).

I appreciate your propositions, but since we don't see
the origin-server, since we are forced to pass the requests
through the proxy, since the proxy is not controllable, since
the PDU needs to be rewritten, and since the stream itself
needs to be modified (https), none of them (CONNMARK
or GRE tunnel) seems to be applicable.

The big issue is point 3 above, given that 1 and 2 need to
be handled. nonlocal_bind or contextual SNAT could be the
solutions... But my NF level of experience is too weak for the
moment to see how it could be achieved, or how to reuse
existing mechanisms (e.g. how to make NAT and REDIRECT
collaborate, or how to crack nonlocal_bind protection).

best regards.

- Original Message -
From: Henrik Nordstrom [EMAIL PROTECTED]
To: Jean-Michel Hemstedt [EMAIL PROTECTED]; Balazs
Scheidler [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, 20 March, 2002 00:12
Subject: Re: TPROXY


 [cannot claim I have been following the thread closely, mostly
 guessing on what you are actually trying to acheive here.. so I may
 be way off]

 On Tuesday 19 March 2002 12:19, Jean-Michel Hemstedt wrote:

  REDIRECT could work in the case of a collocated proxy, and only if we
  have control over the proxy, i.e. Apache;
  (btw: I'm currently trying to find a clean and reusable way to
  extend transparent HTTP mod_tprox and add HTTPS transp proxy to
  Apache for Linux).

 REDIRECT works only if you have a user space proxy running on
 the machine doing REDIRECT. This is per definition of REDIRECT.

  But I'm afraid REDIRECT doesn't fit for remote proxies which rely
  on the original source ip of the client to perform some checks.
  In that case we need the 50080/50443 applications of your example
  to forward the modified requests to the remote proxy with the
  source.ip of the original client.

 If the remote proxy is on the same LAN segment or if you can set up a
 GRE tunnel or something similar to the proxy server then you can use
 CONNMARK for this purpose to route the packets unmodified to the
 proxy and then do the final interception there. When you see the NEW
 session in mangle, mark it, then use fwmark based routing to route
 the packets of that session to the close by proxy.

  1) the 50080/50443 applications use libipt and, for each new client
  request, before doing a new connect() to the remote proxy, they create
  a new iptables rule doing SNAT based on the --sport they chose for
  their bind(). And when the connection is released, they remove the
  created rule. This solution is very inefficient, and not scalable.

 Yuck.

 I would rather go for a single daemon using a custom protocol to
 forward the information to the origin server, such as the
 (incidentally) named TPROXY extension I was once playing with for
 Squid, archived somewhere on my old Squid patches page
 http://devel.squid-cache.org/hno/patche-old.html on how to manage
 remote interception of traffic.

 But sure, this is a limitation of the transparent proxy
 capabilities of the current iptables framework.

 I think some aspects of SOCKS can also be used for this purpose.

  2) the 50080/50443 applications rely on the TPROXY framework and use
  nonlocal_bind.

 Except that nonlocal_bind do not yet work in TPROXY, does it?

  3) ??INTERCEPT?? = REDIRECT(PREROUTING)+SNAT(OUTPUT/POSTROUTING?)
  i.e.:
  A:pclient1 -> [REDIRECT --dport 80] -> B(server):50080 ... B(client):pclient2
   -> [SNAT(--to A:pclient)] -> C:80
  =>
  the INTERCEPT would REDIRECT the packets from the client to the
  local stack and pass a 'rdmark' to the user space application,
  retrievable via getsockopt(rdmark).
  Then, the application rewrites the packets and, in order to forward
  them to the remote proxy, it creates a new client socket to the
  remote-proxy and uses setsockopt(rdmark) to instruct netfilter to do
  SNAT on the outgoing packets (OUTPUT/POSTROUTING?). Netfilter uses the
  'rdmark' to retrieve from the redirect table the '--to' information
  (the source.ip before the redirect).
  When a packet comes back from the remote-proxy, the reverse SNAT
  redirects the packets to the local client, which passes the packet to
  the local server, which sends back to the original client the modified
  packets...

 Very much

Re: TPROXY

2002-03-19 Thread Jean-Michel Hemstedt

- Original Message -
From: Balazs Scheidler [EMAIL PROTECTED]
To: Jean-Michel Hemstedt [EMAIL PROTECTED]
Sent: Tuesday, 19 March, 2002 08:50
Subject: Re: TPROXY


 On Wed, Mar 13, 2002 at 01:19:30PM +0100, Jean-Michel Hemstedt wrote:
  hello,
 
  I'm quite new to netfilter, and I would like to use/write an extension
  capable of rewriting HTTP/HTTPS requests from non-proxy aware clients
  to a remote non-transparent-aware proxy (the netfilter box being in the
  middle and acting as a default gw for both sides).
 
  This implies:
  - for HTTP: most HTTP request methods (GET, POST, ...) need to be rewritten
    with the full URL (taken from the non-redirected ip.dst for HTTP/0.9 or
    from the 'Host' field for HTTP/1.x)
  - for HTTPS: *insert* an HTTP CONNECT transaction in the TCP stream (just
    after the TCP establishment), which means that the ip packets can't simply
    be redirected, unless playing with (cracking) the tcp.seq_num in netfilter.
 
  The first case is not a problem (kind of REDIRECT target).
  For the second case (HTTPS), I was thinking of using the ip_nonlocal_bind
  option, but I read in the kernel archives that connect() was broken for
  non-local bind in 2.4.x. I would also avoid user space QUEUEing since I
  noticed that the throughput was simply divided by 2 (just for normal
  forwarding!).
 
  I think that your TPROXY target is well suited for the HTTPS case
  (terminating the tcp sessions of the client on the netfilter box and
  originating tcp sessions to the proxy from the netfilter box as if they
  were originating from the client, using a kind of *ip_nonlocal_bind*
  mechanism). right?

 TPROXY is not yet ready, it is lacking several important features. I posted
 it on the -devel list to receive feedback.

 Both of your problems can be solved by REDIRECT, you only need two different
 programs (or a single program performing both operations). Just listen on a
 random port (say 50080), and redirect all traffic to this port:

  iptables -t nat -A PREROUTING -p tcp -d 0/0 --dport 80 -j REDIRECT --to-port 50080
  iptables -t nat -A PREROUTING -p tcp -d 0/0 --dport 443 -j REDIRECT --to-port 50443


REDIRECT could work in the case of a collocated proxy, and only if we have
control over the proxy, i.e. Apache;
(btw: I'm currently trying to find a clean and reusable way to extend
transparent HTTP mod_tprox and add HTTPS transp proxy to Apache for Linux).

But I'm afraid REDIRECT doesn't fit for remote proxies which rely on the
original source ip of the client to perform some checks. In that case we
need the 50080/50443 applications of your example to forward the modified
requests to the remote proxy with the source.ip of the original client.

I see 3 possible ways to do that:

1) the 50080/50443 applications use libipt and, for each new client request,
before doing a new connect() to the remote proxy, they create a new iptables
rule doing SNAT based on the --sport they chose for their bind(). And when
the connection is released, they remove the created rule. This solution is
very inefficient, and not scalable.

2) the 50080/50443 applications rely on the TPROXY framework and use nonlocal_bind.

3) ??INTERCEPT?? = REDIRECT(PREROUTING)+SNAT(OUTPUT/POSTROUTING?)
i.e.:
A:pclient1 -> [REDIRECT --dport 80] -> B(server):50080 ... B(client):pclient2
 -> [SNAT(--to A:pclient)] -> C:80
=>
the INTERCEPT would REDIRECT the packets from the client to the local stack and
pass a 'rdmark' to the user space application, retrievable via
getsockopt(rdmark).
Then, the application rewrites the packets and, in order to forward them to the
remote proxy, it creates a new client socket to the remote-proxy and uses
setsockopt(rdmark) to instruct netfilter to do SNAT on the outgoing packets
(OUTPUT/POSTROUTING?). Netfilter uses the 'rdmark' to retrieve from the
redirect table the '--to' information (the source.ip before the redirect).
When a packet comes back from the remote-proxy, the reverse SNAT redirects the
packets to the local client, which passes the packet to the local server, which
sends back to the original client the modified packets...
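
None of the socket options below exist; this is only a sketch of what the
application side of that hypothetical INTERCEPT/rdmark flow could look like,
with invented option names:

    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* hypothetical option numbers: nothing like this exists in the kernel */
    #define IP_RDMARK_GET  200   /* read the mark left by the REDIRECT side  */
    #define IP_RDMARK_SET  201   /* tag an outgoing socket for reverse SNAT  */

    int forward_to_proxy(int client_fd, struct sockaddr_in *proxy)
    {
        unsigned long rdmark;
        socklen_t len = sizeof(rdmark);
        int proxy_fd;

        /* 1. learn which intercepted flow this accepted connection belongs to */
        if (getsockopt(client_fd, SOL_IP, IP_RDMARK_GET, &rdmark, &len) < 0)
            return -1;

        proxy_fd = socket(AF_INET, SOCK_STREAM, 0);

        /* 2. ask netfilter to SNAT this outgoing connection back to the
              original client address associated with 'rdmark' */
        if (setsockopt(proxy_fd, SOL_IP, IP_RDMARK_SET, &rdmark, sizeof(rdmark)) < 0
            || connect(proxy_fd, (struct sockaddr *)proxy, sizeof(*proxy)) < 0) {
            close(proxy_fd);
            return -1;
        }

        /* 3. rewrite the PDU and relay bytes between client_fd and proxy_fd */
        return proxy_fd;
    }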

(PS: I don't think the MARK target is suited for that kind of mechanism)
(PPS: the user space applications would move as LKM in a second phase)

do you (or anyone else) see any other way to do it?


 One of your proxies will be listening on 50080 the other on 50443. The first
 performing non-transparent/transparent rewrite the other CONNECT
 encapsulation.

 By the way the first one is easy to do with Zorp. Its HttpProxy is able to
 rewrite server-requests to proxy-requests.

 CONNECT encapsulation is not supported I'm afraid.

  Have you received any feedback on your TPROXY target?

 not much.


I hope I'll be able to contribute, but I'll first need to better understand
what all the features of Netfilter are and how they can interact with
each other...

In the meantime, if you have any update, I'd like to have a look at it.

 
  Have you heard of any similar HTTPS-real-transp-proxy implementations

Re: [Q] connection tracking scaling

2002-03-19 Thread Jean-Michel Hemstedt

- Original Message -
From: Patrick Schaaf [EMAIL PROTECTED]
To: Harald Welte [EMAIL PROTECTED]; Patrick Schaaf [EMAIL PROTECTED];
Martin Josefsson [EMAIL PROTECTED]; Aviv Bergman
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, 19 March, 2002 12:16
Subject: Re: [Q] connection tracking scaling


  I'd rather like to have this information gathered at runtime within
  the kernel, where one could read out the current hash occupation via
  /proc or some ioctl.

 OK, that's what I wanted to hear :-)

 Actually, the interesting statistics for a hash are not that large, and
 all aggregate:

 - bucket occupation: number of used buckets, vs. number of all buckets
 - average chain length over all buckets
 - average chain length over the used buckets
 - counting of a classification of the chain lengths:
 - number of 0-entry buckets
 - number of 1-entry buckets
 - number of 2-entry buckets
 - number of 4-entry buckets
 - number of 8-entry buckets
 - number of 16-entry buckets
 - number of more-than-16-entry buckets

 That's 10 values, and will at most double when I think more about it.
 I propose to gather these stats on the fly, and simply printk() them
 at a chosen interval:

I'm not a conntrack specialist, nor a kernel hacker, but I have
some experience with ip hash caches in access servers (BRAS)
that may be useful(?):

some additional stats:
- HDA: cache hit depth average: the number of iterations in the
  bucket's list to get the matching collision entry.
- MDA: cache miss depth average: the number of iterations required
  without matching a cache entry (new connection).

HDA is meaningful if you have a bad cache distribution or a small
CIS/CTS ratio (Cache Index Size = number of hash buckets / Cache
Total Size = total number of conntrack tuples cachable). It also provides
good information on traffic type and cache efficiency: in fact, let's
assume you have realtime traffic (RTP) and bursty traffic (HTTP/1.1
with keep-alive) at the same time, and that the tuples for both types
of traffic are under the same hash key. Now if your RT tuple is at the
end of the collision list, or after the bursty entries, you will need
frequent extra iterations to get your RT tuple... The workaround
for that is collision promotion: you keep a hit counter in each tuple
and just swap the most frequently accessed tuple one position ahead.
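
A sketch of that promotion step on a singly-linked collision list (names
invented; the real conntrack lists are list_head based and would need
locking):

    struct tuple {
        struct tuple *next;
        unsigned long hits;
        /* ... key fields ... */
    };

    /* 'pprev_link' is the link that currently points at 'prev', i.e. the
       bucket head or the preceding element's next field; 'prev' is the
       element just before the tuple that matched ('hit') */
    static void promote(struct tuple **pprev_link, struct tuple *prev,
                        struct tuple *hit)
    {
        hit->hits++;
        if (!prev || hit->hits <= prev->hits)
            return;

        prev->next = hit->next;   /* unlink the matched tuple       */
        hit->next = prev;         /* reinsert it one position ahead */
        *pprev_link = hit;        /* relink its new predecessor     */
    }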

some questions:
- have you an efficient 'freelist' implementation? What I've seen about
  kmem_cache_free and kmem_cache_alloc doesn't look like a simple
  pointer dereference... Am I wrong?
- wouldn't it be worth having a cache promotion mechanism?

regarding [hashsize=conntrack_max/2], I vote for!
An alternate solution would be to have a dynamic hash resize
each time the average number of collisions exceeds a threshold
(and no down resize, except maybe asynchronously).
But given my experience I would say that ip hash distribution is not at all
predictable (unless you know where in the net path your box will be, and
what traffic type (VoIP, HTTP, eDonkey, ...) your box will have to handle,
and even then, your predictions will not be valid for more than 6 months!).
Therefore, the common way to handle unpredictable distribution is to
define:
[ max hash index size = max number of cache tuples]
with a dynamic hash index resize

One last word: the hash function you're using is the best compromise
between unpredictable ipv4 traffic, cache symmetry, uniformity and
computation time. I wouldn't change it too much, but there are two
possible propositions:
- if you keep the modulo method (%), use a prime number far from a
  power of 2 for 'ip_conntrack_htable_size'.
- if modulo is too slow, use the bitmasking method (&) with hsize being
  a power of 2, and with 2 bitshifts ((key + (key >> 20) + (key >> 12)) & hsize),
  but this method is not as efficient as the modulo method, and must be
  reconsidered for ipv6.

hope this may help...


 echo 300 > /proc/net/ip_conntrack_showstat

 would generate one printk() every 300 seconds. Echoing 0 would disable
 the statistics gathering altogether.

 I think I can hack this up today. Having the flu must be good for
 something...

 later
   Patrick







Re: [Q] connection tracking scaling

2002-03-19 Thread Jean-Michel Hemstedt

- Original Message -
From: Patrick Schaaf [EMAIL PROTECTED]
To: Jean-Michel Hemstedt [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, 19 March, 2002 17:42
Subject: Re: [Q] connection tracking scaling


 Hello Jean-Michel,

 thanks for your input. I appreciate it.

 On Tue, Mar 19, 2002 at 03:56:32PM +0100, Jean-Michel Hemstedt wrote:
 
  I'm not a conntrack specialist, neither a kernel hacker, but I've
  some experience with ip hash caches in access servers (BRAS)
  that may be useful(?):
 
  some additional stats:
  - HDA: cache hit depth average: the number of iterations in the
bucket's list to get the matching collision entry.
  - MDA: cache miss depth average: the number of iterations required
without matching a cache entry (new connection).

 These two measures make an excellent dynamic measure in addition to
 what I described. However, they require modification of the standard
 LIST_FIND() macro used when traversing the chain, which is a bit
 awkward - and they require keeping properly locked / cpu-local
 counters in the critical path. I wouldn't want this on a highly
 loaded machine, at least not for a longer time.

everything is relative: if the overhead of maintaining those local counters
is too big compared to the use we can make of them (collision promotion,
coarser stats, path switching, ...) then just forget about it.

 The measures are important because of their dynamical nature. No static
 looking at the chains can capture effects related to the _order_ of the
 lists. These measures can. One possible optimization for the future,
 would be to move the bucket pointer to the last found element, giving
 an easy 1-off-cache per bucket. The effect of that cannot be measured
 with a static examination; it should be clearly visible under your
 dynamic measure.

 I'll see that I make this seperately optional.


of course, each box has its own requirements. A configurable option
in such a generic kernel framework is always better than a built-in
*useable* option.

  some questions:
  - have you an efficicent 'freelist' implementation? What I've seen
about
kmem_cache_free and kmem_cache_alloc doesn't look like a simple
pointer dereference... Am I wrong?

 The kmem_cache() stuff is supposed to be the best possible implementation
 of such a freelist on an SMP system. I am not aware that it has
inefficiencies.

and surely me neither ;o)

  - wouldn't it be worth to have a cache promotion mechanism?

 What do you mean with that?

(you wiped out my explanation in my previous mail).
It's the mechanism by which the tuples in a collision list are resorted
by hit frequency, trying to reduce the number of iterations required to
get a match. It's only useful if you insert new tuples at the beginning
of the list, or if you have, let's say, a control channel and a data channel
in the same list. In the latter example, if the entry corresponding to the
data channel is after the entry corresponding to your control channel, then you
waste one iteration per data packet.
example? stream A=1pps, stream B=100pps, lifetime of A = lifetime of B = T
=> if A after B, num iterations = T*(1*100 + 2*1) = T*(102)
=> if B after A, num iterations = T*(1*1 + 2*100) = T*(201)
But this mechanism requires a per-tuple hit counter, and some additional
checks and pointer reallocation in the fast path... This is thus useful in
specific cases (i.e. a media gateway, sip/h323 proxy, real-time streaming
env.)


  regarding [hashsize=conntrack_max/2], I vote for!
  An alternate solution would be to have a dynamic hash resize
  each time the average number of collisions exceeds a treshold
  (and no down resize, except maybe asynchroneously).

 This is REALLY AWFUL in an SMP setting, because you need to keep the
 whole system write locked while you rebuild the hashes. You really
 don't want to do that.

ok, ok, I haven't said anything ;o)... but FYI the prime double hash is
supposed to address the problem of smooth hash expansion by gradually
moving the entries from one table to the other (I've not tried it).


  One last word: the hash function you're using is the best compromise
  between unpredictable ipv4 traffic, cache symetry, uniformity and
  computation time.

 I'm pretty ignorant wrt theory and attributes of hash functions,
 so I'm damned to check this by experiment. But it's reassuring to
 hear you say the function is basically OK.

thanks to the one who chose it (Rusty?). I've got only
one remark: the modulo operation consumes lots of CPU cycles, especially
if the operand is prime! And its implementation is architecture
dependent (so not predictable), whereas bit shift operations are totally
controllable and give pretty good results if used correctly... Bitshift also
has the advantage of allowing a hashsize = 2^n.
If memory is not an issue and CPU cycles are critical ($/Hz > $/MB)
then we'd better use a bitshift hash function (personally this was the
one I used). Anyway, I would be interested in bench results

TPROXY

2002-03-14 Thread Jean-Michel Hemstedt



hello,

- is there any update regarding TPROXY since 13/Feb/2002?
- is TPROXY intended to replace 'slessdir' and 'IP_INTERCEPT'?
- will it be included in the kernel someday (which version?)?
- does it provide the definitive patch for nonlocal binding?
- are there examples on how to use it (apart from the comments in the diff)?

any help is welcome,

and thanks for your job Balazs!

-jm-