FYI: QUEUE ipqmpd bugs
FYI, I upgraded to iptables-1.2.6a (user + kernel-2.4.18 patches) and got the following (maybe known) problems:

- The QUEUE target is NOK with a kernel compiled with CONFIG_IP_NF_QUEUE=m: the packets are queued, but ipq_create_handle() fails with "can't create netlink socket" / "ERROR: Unable to create netlink socket: Connection refused" (problem with exported symbols?). Quick fix: compile the kernel with CONFIG_IP_NF_QUEUE=y.

- ipqmpd-0.3: the default verdict NF_ACCEPT is not applied when no process has attached to it. In fact ipqmpd starts, but it seems that it never receives any packet (in ipq_inp). When one process attaches to it, with a mark different from the queued packet's, the default NF_ACCEPT is applied correctly. When all processes have detached from ipqmpd, the default NF_ACCEPT continues to be applied correctly.

kr,
-jmhe-
He who expects nothing shall never be disappointed
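For context, a minimal libipq client looks roughly like this (a sketch: it assumes the two-argument ipq_create_handle() of the iptables-1.2.x libipq; BUFSIZE and the accept-everything verdict are illustrative). ipq_create_handle() is exactly the call that fails above when ip_queue is built as a module:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/netfilter.h>
    #include <libipq.h>

    #define BUFSIZE 2048

    static void die(struct ipq_handle *h)
    {
        ipq_perror("ipq-sketch");
        if (h)
            ipq_destroy_handle(h);
        exit(1);
    }

    int main(void)
    {
        unsigned char buf[BUFSIZE];
        struct ipq_handle *h;

        /* this is the call that fails with "Connection refused"
         * when CONFIG_IP_NF_QUEUE=m shows the symptom above */
        h = ipq_create_handle(0, PF_INET);
        if (!h)
            die(NULL);
        if (ipq_set_mode(h, IPQ_COPY_PACKET, BUFSIZE) < 0)
            die(h);

        for (;;) {
            if (ipq_read(h, buf, BUFSIZE, 0) < 0)
                die(h);
            if (ipq_message_type(buf) == NLMSG_ERROR) {
                fprintf(stderr, "error: %s\n",
                        strerror(ipq_get_msgerr(buf)));
            } else if (ipq_message_type(buf) == IPQM_PACKET) {
                ipq_packet_msg_t *m = ipq_get_packet(buf);
                /* accept everything, i.e. the default verdict */
                ipq_set_verdict(h, m->packet_id, NF_ACCEPT, 0, NULL);
            }
        }
    }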
Re: performance issues (nat / conntrack)
> > (strange thing is that ethernet irq's reported by procinfo are
> > decreasing when the machine is overloaded. I suppose it means either
> > that irq's are not even caught by the kernel/driver, which is quite
> > worrying, or that the irq counters refer to 'processed' interrupts)
>
> are you using a driver which uses the irq mitigation interface of
> 2.4.x or the NAPI of 2.5.x?

no (is it available for non-Gigabit ethernets?). I'm using a 3c905-TX.

One explanation could be that SYN packets are 'lost'/not processed, and there are thus no additional SYN/ACK, ACK (at least), and thus no extra flow (each connection was made of about 10 packets). Since packets don't get forwarded, the same applies for the other interface. It also means that most retransmissions get lost.

> > Harald: could you ask your kernel specialist what the weak point of
> > kmem_cache_alloc() is? (locks, allocs, ...), and how we could
> > possibly improve it (batch alloc, but isn't that already the case?)
>
> I don't think that the slab allocator is the bottleneck. Please show
> me profiling data pointing this out.

yesss, facts... but in the mean time, grabbing info/experience/feelings from 'non involved' people could be instructive. What do they say?

kr,
-jmhe-
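On the kmem_cache_alloc() question, for readers unfamiliar with the 2.4 slab interface, this is the usage pattern conntrack-like code follows (a sketch with illustrative names, not the actual conntrack code); the cost being debated is essentially what these two calls do under lock on SMP:

    #include <linux/slab.h>
    #include <linux/errno.h>

    /* illustrative names; not the actual conntrack symbols */
    struct my_tuple { int dummy; /* ... per-connection state ... */ };

    static kmem_cache_t *my_tuple_cache;

    static int my_pool_init(void)
    {
        /* objects are carved out of slabs in batches up front, so the
         * common alloc/free path is mostly freelist pointer juggling */
        my_tuple_cache = kmem_cache_create("my_tuple_cache",
                                           sizeof(struct my_tuple), 0,
                                           SLAB_HWCACHE_ALIGN, NULL, NULL);
        return my_tuple_cache ? 0 : -ENOMEM;
    }

    static struct my_tuple *my_tuple_alloc(void)
    {
        /* GFP_ATOMIC: may be called from softirq context, must not sleep */
        return kmem_cache_alloc(my_tuple_cache, GFP_ATOMIC);
    }

    static void my_tuple_free(struct my_tuple *t)
    {
        kmem_cache_free(my_tuple_cache, t);
    }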
Re: performance issues (nat / conntrack)
> > generator creating traffic from one source to one destination IP
> > address, with only source port variation (but given my configured
> > hash table size and the hash function itself it shouldn't have been
> > an issue).
>
> I think because only the source port varies, this is an important
> issue in your setup. You actually tested the hash function and could
> bomb some hash entries. The overall effect was a DoS against
> conntrack.

ok, here we go:

    static inline u_int32_t
    hash_conntrack(const struct ip_conntrack_tuple *tuple)
    {
    #if 0
            dump_tuple(tuple);
    #endif
            /* ntohl because more differences in low bits. */
            /* To ensure that halves of the same connection don't hash
               clash, we add the source per-proto again. */
            return (ntohl(tuple->src.ip + tuple->dst.ip
                          + tuple->src.u.all + tuple->dst.u.all
                          + tuple->dst.protonum)
                    + ntohs(tuple->src.u.all))
                   % ip_conntrack_htable_size;
    }

src.u.all and dst.u.all refer (unless there's a bug) to src.tcp.port and dst.tcp.port respectively. So, if only src.port varies linearly (let's say between 32000 and 64000), and if ip_conntrack_htable_size = 32768 (kernel: ip_conntrack (32768 buckets, 262144 max)), then we should have at most 2 collisions per bucket (unless there's a type overflow somewhere). This was my test setup, but since I haven't verified the conntrack hash distribution, I didn't want to argue on that.

To measure that, we should maintain hash counters such as max collisions, average collisions per key, hit/miss depth average, number of hits/misses per second, etc. (see the harness sketch at the end of this message). I've planned to do that along with profiling, but unfortunately not in the two coming weeks.

--

Last points I wanted to clarify:

> From: Patrick Schaaf [EMAIL PROTECTED]
> > On Sun, Jun 23, 2002 at 09:46:29PM -0700, Don Cohen wrote:
> > > From: Jean-Michel Hemstedt [EMAIL PROTECTED]
> > > > Since in my test, each connection is ephemeral (10ms) ...
> > >
> > > One question here is whether the traffic generator is acting like
> > > a real set of users or like an attacker. A real user would not
> > > keep trying to make connections at the same rate if the previous
> > > attempts were not being served. I suspect you're acting more like
> > > an attacker.
> >
> > He definitely is. The test he described is completely artificial,
> > and does not represent any normal real world workload. Nevertheless,
> > it does point out a valid optimization chance. We discussed that
> > months ago, and it's still there.

No, I don't think so.

1) The hash is not in cause (see above). (btw, as discussed in 'connection tracking scaling' [19 March 2002], I don't see ways to really optimize it unless you go for the multidimensional hashes described in theoretical papers, or unless you make traffic assumptions, which is most likely impossible in such a generic framework...) However, I don't understand why we are adding the src.port twice in the hash function.

2) My test was artificial, but not unrealistic: one endpoint sustaining 1000 conn/s whatever the responsiveness of the target, or 10000 users trying to connect through the gw in a time lapse of 10 seconds, is similar. Now, if some of you are telling me that I'm not allowed, or that I'm nuts, to place my box in front of 10000 users, that's another debate. I'm not talking about dimensioning, I'm talking about relative performance, and strange weaknesses.

kr,
-jmhe-
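Since nobody has verified the distribution yet, here is a small user-space harness that replays hash_conntrack() over exactly this traffic pattern: one IP pair (the addresses are arbitrary placeholders), dst port 80, src port sweeping 32000-64000, 32768 buckets. A sketch, not a kernel patch:

    #include <stdio.h>
    #include <arpa/inet.h>

    #define HTABLE_SIZE 32768

    /* same arithmetic as the kernel's hash_conntrack() above;
     * ports are passed in network byte order, as u.all is */
    static unsigned int hash(unsigned int sip, unsigned int dip,
                             unsigned short sport, unsigned short dport,
                             unsigned char proto)
    {
        return (ntohl(sip + dip + sport + dport + proto)
                + ntohs(sport)) % HTABLE_SIZE;
    }

    int main(void)
    {
        static unsigned int bucket[HTABLE_SIZE];
        unsigned int sip = inet_addr("10.0.0.1");  /* placeholder IPs */
        unsigned int dip = inet_addr("10.0.0.2");
        unsigned int max = 0, used = 0, i;
        int p;

        for (p = 32000; p <= 64000; p++)          /* the test's port sweep */
            bucket[hash(sip, dip, htons(p), htons(80), 6)]++;

        for (i = 0; i < HTABLE_SIZE; i++) {
            if (bucket[i]) used++;
            if (bucket[i] > max) max = bucket[i];
        }
        printf("entries=%d used_buckets=%u max_chain=%u\n",
               64000 - 32000 + 1, used, max);
        return 0;
    }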
Re: performance issues (nat / conntrack)
> > loading a module doesn't mean using it (lsmod reports it as 'unused'
> > in my tests). So, does it really 'sound as expected', when you see
>
> From where do you think that the module usage counter reports how
> many packets/connections are handled (currently? totally?) by the
> module? There is no connection whatsoever!

The module usage counter increases when a TARGET needs it (i.e. ipt_REDIRECT). In this test, no rule was defined, and no target module was loaded. So I did not expect NAT to process any packet.

> > o The cumulative effect should be reconsidered.
> > - I can't explain the last one, but when the table is exhausted
> >   conntrack drops new packets, right? What I noticed is that at
> >   that moment, the cpu load suddenly hit 100%, and the machine did
> >   not recover, unless I killed the load generator
>
> That is unusual and should be tested further.

I suppose that due to the load, packets are dropped not because of conntrack but because they simply can't be processed, and thus conntrack misses packets of existing connections (such as FIN, RST) and can't recover, due to its timeouts.

> ? What 'nat table' are you talking about? Do you understand how NAT
> works and how it interacts with connection tracking?

Just to recall my test: I generated an amount of new connections per second passing through a forwarding machine without any iptables module and measured the cpu load/responsiveness and other things... Then, while the machine was sustaining this amount of new conn/s, I did 'insmod ip_conntrack [size]', saw the cpu load increase, and finally just did 'iptables -t nat -L' to load the nat module without any rule, and saw the cpu load increase again. With 500 conn/s, the cpu load went from 10% -> ~50/70% -> 100% (machine unavailable).

> According to your first mail, the machine has 256M RAM and you issued
>
>   insmod ip_conntrack 16384
>
> That requires 16384*8*~600 byte ~= 75MB of non-swappable RAM. When
> you issued 'iptables -t nat -L', the system tried to reserve another
> 2x75MB. That's in total pretty near to all your available physical
> RAM, and the machine might have died swapping.

Exact! That's why I looked (but not closely) at swap-in/swap-out in procinfo, but didn't notice anything (0 most of the time on a 10 sec average). But I agree that I was close to the limit, and even over it when I tried 32K. Despite that, it is not so surprising to see so few swaps, since my table was not full (max 4000, up to 10000 concurrent tuples). But this raises one additional problem:

1) the hash index size and the hash total size should be configurable separately (get rid of that factor 8, and use a free list for the tuple allocation; see the sketch below).
2) NAT hash sizes should also be configurable independently from conntrack. Normally the nat hashes are smaller than the conntrack hash, since conntrack is based on ports, while nat is not.

PS: could anybody redo similar tests so that we can compare the results and stop killing the messenger, please? ;o)

> Regards, Jozsef
> -
> E-mail  : [EMAIL PROTECTED], [EMAIL PROTECTED]
> WWW-Home: http://www.kfki.hu/~kadlec
> Address : KFKI Research Institute for Particle and Nuclear Physics
>           H-1525 Budapest 114, POB. 49, Hungary
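The free list mentioned in point 1 could look like this: a single pre-allocated pool, sized independently of the bucket count, with O(1) LIFO alloc/free. A user-space mock-up with illustrative names and no locking, just to show the shape of the proposal:

    #include <stdlib.h>

    struct tuple {
        struct tuple *next;   /* chains the free list (or a hash bucket) */
        /* ... conntrack payload ... */
    };

    static struct tuple *free_list;

    static int pool_init(unsigned int max_tuples)
    {
        struct tuple *pool = calloc(max_tuples, sizeof(*pool));
        unsigned int i;

        if (!pool)
            return -1;
        for (i = 0; i < max_tuples; i++) {   /* thread the pool together */
            pool[i].next = free_list;
            free_list = &pool[i];
        }
        return 0;
    }

    static struct tuple *tuple_alloc(void)   /* O(1), no allocator involved */
    {
        struct tuple *t = free_list;
        if (t)
            free_list = t->next;
        return t;
    }

    static void tuple_free(struct tuple *t)
    {
        t->next = free_list;
        free_list = t;
    }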
Re: conntrack hash function
hi,

just FYI, the hash was already discussed in a previous thread: 'connection tracking scaling' [19 March 2002]. Sorry if you were aware of it.

From: Jean-Michel Hemstedt [EMAIL PROTECTED]

    static inline u_int32_t
    hash_conntrack(const struct ip_conntrack_tuple *tuple)
    {
    #if 0
            dump_tuple(tuple);
    #endif
            /* ntohl because more differences in low bits. */
            /* To ensure that halves of the same connection don't hash
               clash, we add the source per-proto again. */
            return (ntohl(tuple->src.ip + tuple->dst.ip
                          + tuple->src.u.all + tuple->dst.u.all
                          + tuple->dst.protonum)
                    + ntohs(tuple->src.u.all))
                   % ip_conntrack_htable_size;
    }

> A few questions here:
> - Why make the two halves of the connection hash to different
>   buckets? I'd think you'd want to consider the two halves to be the
>   same connection. So you want them to hash the same. It would make
>   the comparison a little more expensive, but save half the space.

and it would even avoid a second key computation (modulo may be costly on prime numbers, for instance), and it would also avoid a second collision list scan (interesting in case of a large number of collisions); see the sketch below.

> - % table size seems not quite ideal. Especially since the table size
>   is likely a power of 2, which means that you effectively ignore all
>   but the low order bits of the addresses and ports.

yes, modulo is slower but more robust than bit masking (&), and is generally considered a good tradeoff between computation time and element distribution. Another common practice consists in bitshifting src and dst (i.e. ipa + (ipa >> 20) + (ipa >> 13)) to enforce LSB variations across the domain.

> Perhaps one less than table size if it's even, which will lose the
> use of one bucket but then use all of the data bits in the hash. Of
> course, the low order bits might well be good enough. Then again,
> that depends on what the ->per-proto data looks like, and for some
> protos this might not vary in the low order bits.

-jmhe-
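For illustration, a direction-independent variant along the lines of the first question could look like this (a sketch; the endpoint struct and names are illustrative). Because the combination is commutative, hash(a,b) == hash(b,a), at the price of comparing a candidate tuple against both orientations during lookup:

    #include <arpa/inet.h>

    struct endpoint { unsigned int ip; unsigned short port; };

    /* addition is commutative, so swapping a and b gives the
     * same bucket; the lookup must then try both orientations */
    static unsigned int
    hash_symmetric(const struct endpoint *a, const struct endpoint *b,
                   unsigned char proto, unsigned int htable_size)
    {
        return (ntohl(a->ip + b->ip) + ntohs(a->port) + ntohs(b->port)
                + proto) % htable_size;
    }

Note how it drops the extra "+ ntohs(src.u.all)" term that the kernel function adds precisely to keep the two halves apart.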
Re: performance issues (nat / conntrack)
I know this debate is not new... I just didn't expect such a perf drop (90%, see below) and such an unavailability risk. That's why I'm only reporting it, hoping secretly that experienced hackers will consider it seriously. ;o)

Note: I don't want to play with words, but if you prefer, consider 'load generator' as 'malicious DoS user', and 'perf issue' as 'DoS vulnerability', as Don Cohen cleverly suggested :-/ (for me it's the same problem, except that a DoS is a one-off event while perf is what we may expect in a normal situation).

> > I'm doing some tcp benches on a netfilter enabled box and noticed a
> > huge and surprising perf decrease when loading the iptable_nat
> > module.
>
> Sounds as expected.

Loading a module doesn't mean using it (lsmod reports it as 'unused' in my tests). So, does it really 'sound as expected', when you see your cpu load hitting 100%, and most packets dropped, just after having done 'iptables -t nat -L' on a system with 1% CPU load handling 'only' 10kpps and forwarding about 1000 new TCP connections/s?

> > - ip_conntrack is of course also loading the system, but with huge
> >   memory and a large bucket size, the problem can be solved. The
> >   big issue with ip_conntrack are the state timeouts: it simply
> >   kills the system and drops all the traffic with the default ones,
> >   because the ip_conntrack table quickly becomes full, and it seems
> >   that there is no way to recover from that situation... Keeping
> >   unused entries (time_close) even 1 minute in the cache is really
> >   not suitable for configurations handling a (relatively) large
> >   number of connections/s.
>
> what is a 'relatively' large number of connections? I've seen a
> couple of netfilter firewalls dealing with 200000+ tracked
> connections.

200K concurrent established connections, maybe... but surely not NEW connections/second. See previous results: with only ip_conntrack loaded (no nat), I hardly reached 500 (new) conn/s.

> > o The cumulative effect should be reconsidered.
>
> could you please try to explain what you mean?

There are 3 aspects:
- table exhaustion (can be fixed with large memory), as long as the hash is correctly distributed (few collisions)
- concurrent timers (1 per conntrack tuple?? see the sketch after this message)
- I can't explain the last one, but when the table is exhausted conntrack drops new packets, right? What I noticed is that at that moment, the cpu load suddenly hit 100%, and the machine did not recover unless I killed the load generator.

> > o Are there ways/plans to tune the timeouts dynamically? and what
> >   are the valid/invalid ranges of timeouts?
>
> No, see the mailinglist archives for the reason why.

If you refer to your mail of 18 January 2001, I think that this timeout should also be reviewed ;o)... Waiting for somebody having the time and the ability to do a redesign was quite idealistic, while a quick patch for configurable timeouts per rule (i.e. http timeouts different from smtp ones, as suggested by Denis Ducamp) would have been more realistic.

> > o looking at the code, it seems that one timer is started per
> >   tuple... wouldn't it be more efficient to have a unique periodic
> >   callback scanning the whole or part of the table for aged
> >   entries?
>
> I think somebody (Martin Josefsson?) is currently looking into
> optimizing

> > - The annoying point is iptable_nat: normally the number of entries
> >   in the nat table is much lower than the number of entries in the
> >   conntrack table. So even if the hash function itself could be
> >   less efficient than the ip_conntrack one (because it takes fewer
> >   arguments: src+dst+proto), the load of nat should be much lower
> >   than the load of conntrack.
> >   o So... why is it the opposite??
>
> ? What 'nat table' are you talking about? Do you understand how NAT
> works and how it interacts with connection tracking?

Actually, that's also what I would like to know ;o) The bysource and byipsproto hash tables, pointing to ip_nat_hash tuples, pointing to an ip_conntrack entry. But I don't understand where the extra processing comes from when there are no (nat) rules defined.

Just to recall my test: I generated an amount of new connections per second passing through a forwarding machine without any iptables module and measured the cpu load/responsiveness and other things... Then, while the machine was sustaining this amount of new conn/s, I did 'insmod ip_conntrack [size]', saw the cpu load increase, and finally just did 'iptables -t nat -L' to load the nat module without any rule, and saw the cpu load increase again. With 500 conn/s, the cpu load went from 10% -> ~50/70% -> 100% (machine unavailable).

> > o Are there ways to tune the nat performances?
>
> no. NAT (and esp. NAT performance) is not a very strong point of
> netfilter. Everybody agrees that NAT is evil and it should be avoided
> in all circumstances. Rusty didn't want to become NAT/masquerading
> maintainer in the first place, but rather concentrate on packet
> filtering.

wow! what is the alternative for 'Everybody' using REDIRECT?

> The NAT subsystem has a
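A sketch of the 'unique periodic callback' idea referenced in this message, in 2.4 timer_list style (illustrative names such as gc_scan/evict; not the actual conntrack implementation): one timer re-arms itself and sweeps the table, instead of one timer per tuple:

    #include <linux/timer.h>
    #include <linux/sched.h>   /* jiffies, HZ */

    #define GC_INTERVAL HZ     /* scan once per second */

    static struct timer_list gc_timer;

    static void gc_scan(unsigned long data)
    {
        /* walk (a slice of) the hash table, evicting entries whose
         * deadline has passed, e.g.:
         *   if (time_after(jiffies, t->timeout)) evict(t);
         * One timer in total, instead of one per tuple. */

        mod_timer(&gc_timer, jiffies + GC_INTERVAL);  /* re-arm */
    }

    static void gc_start(void)
    {
        init_timer(&gc_timer);
        gc_timer.function = gc_scan;
        gc_timer.data = 0;
        gc_timer.expires = jiffies + GC_INTERVAL;
        add_timer(&gc_timer);
    }

The tradeoff is eviction latency (up to GC_INTERVAL late) against removing per-tuple add_timer/del_timer work from the packet path.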
Re: performance issues (nat / conntrack)
> > I'm doing some tcp benches on a netfilter enabled box and noticed a
> > huge and surprising perf decrease when loading the iptable_nat
> > module.
>
> Rather similar to the results I posted about a week ago.

oops, sorry, it seems we performed our tests at the same time ;o)

> > - Another (old) question: why are conntrack or nat active when
> >   there are no rules configured (using them or not)?
>
> I noticed this too. After a test using conntrack, the next test
> without using conntrack would perform poorly unless I did rmmod.

yes, a minor issue if documented...

> > Since in my test, each connection is ephemeral (10ms) ...
>
> When all works correctly, the end of each connection should be
> noticed by conntrack and the connection removed from the table,
> right?

yep

> In which case the table should never get very full.

ideally yes, but from the conntrack machine's perspective, the rest of the world should not be considered reliable... and in fact it is not. So, timeouts should be reviewed, especially if we know that the average tcp connection duration on the www is about 20 seconds.

> So I'm guessing that a large number of entries in the conntrack table
> is evidence that packets are being lost.

not only: a crashed client breaking the tcp sequence also causes garbage entries in conntrack.

> In particular, if the syn packet arrives but is never forwarded, you
> get one of those conntrack entries where conntrack thinks
> (incorrectly) the syn has been forwarded, so it's waiting for the
> reply. Ideally the entry should not be added to the table until the
> packet goes out.

??? or is served locally ???

> Just wondering, how did you measure cpu load?

procinfo -n10, with [d] for showing differences, which in fact computes the differences of the cumulated cpu times (taken from /proc/stat) over the given period: (tsys1-tsys0)/T. (I was too lazy to write a 'while' script... a small equivalent program follows this message.)

Maybe my mail was not clear... I've been surprised by 2 issues:
1) conntrack timeout garbage (which was addressed by your mail)
2) nat performance killing: I really don't understand it, especially when there's no rule active on it, and thus no translation active. I can accept the overhead of conntrack, because of the number of entries and criteria it has to manage, and because it can be dimensioned and understood. But what about NAT??? In my opinion, the NAT overhead should only be a delta on top of the conntrack overhead. But what I noticed is an overhead as big as the conntrack overhead! why?

-jmhe-
He who expects nothing shall never be disappointed
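The measurement described above, (tsys1-tsys0)/T from the cumulative counters, is equivalent to this little program (a sketch; the 10-second interval mirrors procinfo -n10):

    #include <stdio.h>
    #include <unistd.h>

    /* read the cumulative system-time counter (in clock ticks)
     * from the first "cpu" line of /proc/stat */
    static long read_sys_jiffies(void)
    {
        long user, nice, sys;
        FILE *f = fopen("/proc/stat", "r");
        if (!f)
            return -1;
        fscanf(f, "cpu %ld %ld %ld", &user, &nice, &sys);
        fclose(f);
        return sys;
    }

    int main(void)
    {
        long hz = sysconf(_SC_CLK_TCK), t0, t1;
        int interval = 10;                /* seconds, as in procinfo -n10 */

        t0 = read_sys_jiffies();
        sleep(interval);
        t1 = read_sys_jiffies();
        printf("system load: %.1f%%\n",
               100.0 * (t1 - t0) / (hz * interval));
        return 0;
    }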
performance issues (nat / conntrack)
dear netdevels,

I'm doing some tcp benches on a netfilter enabled box and noticed a huge and surprising perf decrease when loading the iptable_nat module.

- ip_conntrack is of course also loading the system, but with huge memory and a large bucket size, the problem can be solved. The big issue with ip_conntrack are the state timeouts: it simply kills the system and drops all the traffic with the default ones, because the ip_conntrack table quickly becomes full, and it seems that there is no way to recover from that situation... Keeping unused entries (time_close) even 1 minute in the cache is really not suitable for configurations handling a (relatively) large number of connections/s.
  o The cumulative effect should be reconsidered.
  o Are there ways/plans to tune the timeouts dynamically? and what are the valid/invalid ranges of timeouts?
  o looking at the code, it seems that one timer is started per tuple... wouldn't it be more efficient to have a unique periodic callback scanning the whole or part of the table for aged entries?

- The annoying point is iptable_nat: normally the number of entries in the nat table is much lower than the number of entries in the conntrack table. So even if the hash function itself could be less efficient than the ip_conntrack one (because it takes fewer arguments: src+dst+proto), the load of nat should be much lower than the load of conntrack.
  o So... why is it the opposite??
  o Are there ways to tune the nat performances?

- Another (old) question: why are conntrack or nat active when there are no rules configured (using them or not)? If not fixed, it should at least be documented... Somebody doing 'iptables -t nat -L' takes the risk of killing their system if it's already under load... In the same spirit, 'iptables -F' should unload all unused modules (the ip_tables module doesn't hurt). Just one quick fix: replace the 'iptables' executable by an 'iptables' script calling the exe (located somewhere else) and doing an rmmod at the end...

comments are welcome; here is my test bed:

tested target:
- kernel 2.4.18 + non_local_bind + small conntrack timeouts...
- PIII ~500MHz, RAM = 256MB
- 2 * 100Mb/s NIC

The target acts as a forwarding gateway between a load generator client running httperf, and an apache proxy serving cached pages. 100Mb/s NICs and request/response sizes ensure that bandwidth and packet collisions are not an issue. Since in my test each connection is ephemeral (10ms), I recompiled the kernel with very short conntrack timeouts (i.e. 1 sec for close_wait, and about 60 sec for established!); see the sketch after the results. This was also the only way to restrict the conntrack hash table size (given my RAM) and avoid exaggerated hash collisions. Another limitation comes from my load generator, creating traffic from one source to one destination ip address, with only source port variation (but given my configured hash table size and the hash function itself, it shouldn't have been an issue).

results are averages from procinfo -n10 [d]

test results:

1) target = forwarding only (no iptables module or rule)
   - rate            : 100 conn/s (= request-response/s)
   - CPU load        : 0% system
   - context         : 7 context/s
   - irq (eth0/eth1) : 0.9 / 0.9 kpps (# of packets/s = # of irqs/s)

   - rate            : 500 conn/s
   - CPU load        : 10% system
   - context         : 18-100 context/s (varying!)
   - irq (eth0/eth1) : 4.4 / 4.4 kpps

   - rate (max)      : 1050 conn/s (max from my load generator)
   - CPU load        : 25% system
   - context         : 1000 context/s
   - irq (eth0/eth1) : 10 / 10 kpps

2) (1) + insmod ip_conntrack 16384 (no rules)
   - rate            : 100 conn/s
   - CPU load        : 0.8% system
   - context         : 7 context/s
   - irq (eth0/eth1) : 0.9 / 0.9 kpps
   - conntrack size  : 970 concurrent entries

   - rate            : 250 conn/s
   - CPU load        : 10% system
   - context         : 12 context/s
   - irq (eth0/eth1) : 2.2 / 2.2 kpps
   - conntrack size  : 2390 concurrent entries

   - rate            : 500 conn/s
   - CPU load        : 30-70% system (varying)
   - context         : 45-90 context/s
   - irq (eth0/eth1) : 4 / 4 kpps
   - conntrack size  : 4770 concurrent entries

3) (2) + iptables -t nat -L [= iptable_nat] (no rules)
   - rate            : 100 conn/s
   - CPU load        : 1% system
   - context         : 8 context/s
   - irq (eth0/eth1) : 0.9 / 0.9 kpps
   - conntrack size  : 970 concurrent entries

   - rate            : 250 conn/s
   - CPU load        : 40% system
   - context         : 20 context/s
   - irq (eth0/eth1) : 2.2 / 2.2 kpps
   - conntrack size  : 2390 concurrent entries

   - rate (max)      : 420 conn/s (all failed)
   - CPU load        : 97% system
   - context         : 28 context/s
   - irq (eth0/eth1) : 3.1 / 4.1 kpps
   - conntrack size  : 4050 concurrent entries

   - rate (killing)  : [500]->0 conn/s (all failed)
   - CPU load        :
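Regarding "recompiled the kernel with very short conntrack timeouts": the per-state TCP timeouts are compile-time constants in net/ipv4/netfilter/ip_conntrack_proto_tcp.c, so the tweak amounts to editing that table and rebuilding. The shape below is reproduced from memory of a 2.4 tree and the values are only illustrative; check your own source before editing:

    #define SECS  * HZ
    #define MINS  * 60 SECS
    #define HOURS * 60 MINS
    #define DAYS  * 24 HOURS

    static unsigned long tcp_timeouts[] = {
        30 MINS,   /* TCP_CONNTRACK_NONE */
        5 DAYS,    /* TCP_CONNTRACK_ESTABLISHED */
        2 MINS,    /* TCP_CONNTRACK_SYN_SENT */
        60 SECS,   /* TCP_CONNTRACK_SYN_RECV */
        2 MINS,    /* TCP_CONNTRACK_FIN_WAIT */
        2 MINS,    /* TCP_CONNTRACK_TIME_WAIT */
        10 SECS,   /* TCP_CONNTRACK_CLOSE */
        60 SECS,   /* TCP_CONNTRACK_CLOSE_WAIT */
        30 SECS,   /* TCP_CONNTRACK_LAST_ACK */
        2 MINS,    /* TCP_CONNTRACK_LISTEN */
    };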
Re: TPROXY
> On Wed, Mar 27, 2002 at 10:15:56AM +0100, Henrik Nordstrom wrote:
> > On Tuesday, 26 March 2002 16.33, Balazs Scheidler wrote:
> > > Providing a client certificate to the server is not very common;
> > > if it is required, a tunnel can be opened to that _specific_
> > > server, and nothing else. So using a real decrypting HTTPS proxy
> > > for general https traffic, and opening holes to specific
> > > destinations, is definitely more secure than a simple
> > > 'pass-through' hole in the firewall.
> >
> > You missed the point here. Using a decrypting HTTPS proxy
> > invalidates both the use of client certificates AND the use of
> > server certificates, which makes the use of SSL somewhat pointless.
> > Further, unless the proxy runs its own CA trusted by the browsers,
> > the users will always be warned that the server certificate is
> > invalid when using such a proxy.
>
> I think you missed the point here. Of course the firewall verifies
> the server's certificate using its own trusted list of CAs. The user
> is not capable of deciding whether a certificate presented to him
> really belongs to the given server. They simply press 'continue'
> without thinking that the server they are communicating with is fake.
> Of course, if you AND your users know what the hell a certificate is,
> they can decide, but I think you are a minority.

We are far from TPROXY, but here is my point of view:

- An HTTPS decrypting proxy is a (man-in-the-middle) alternative if you want to block all CONNECT operations in your proxy. But it sounds like an abuse protection against inside users. And unfortunately, as mentioned above, it will block services such as home banking for the user as well.

- If your proxy allows CONNECT requests, then virtually anything can pass through it, and an HTTPS decrypting proxy does not make sense.

Then, if you are really concerned by insider attacks, what about a session/tunnel timer, which could be a possible (ugly) protection against wormhole kinds of attacks, without invalidating ssl?

-jmhe-
Re: TPROXY
Henrik,

just to recap the goal, I have:
- non-proxy-aware clients (not controllable)
- a non-transparent-aware proxy (not controllable, not even on Linux; it is not in-house)
and in the middle:
- one (or more) default gateway, the netfilter box.

=> goal:
1) HTTP: rewrite the HTTP requests (PDU) so that they can be handled by the proxy.
2) HTTPS: insert the CONNECT transactions so that the proxy can create its https tunnel to the orig-server (and there is no mitm issue).
3) for both: keep the source ip addresses of the clients in the modified forwarded packets, so that the proxy can do simple source-based authentication (possibly with the collaboration of external elements such as radius, but authentication is out of scope here).

I appreciate your propositions, but since we don't see the origin-server, since we are forced to pass the requests through the proxy, since the proxy is not controllable, since the PDU needs to be rewritten, and since the stream itself needs to be modified (https), none of them (CONNMARK or GRE tunnel) seems to be applicable. The big issue is point 3 above, given that 1 and 2 need to be handled. nonlocal_bind or contextual SNAT could be the solutions... But my NF level of experience is too weak for the moment to see how it could be achieved, or how to reuse existing mechanisms (i.e. how to make NAT and REDIRECT collaborate, or how to crack the nonlocal_bind protection).

best regards.

----- Original Message -----
From: Henrik Nordstrom [EMAIL PROTECTED]
To: Jean-Michel Hemstedt [EMAIL PROTECTED]; Balazs Scheidler [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, 20 March, 2002 00:12
Subject: Re: TPROXY

> [cannot claim I have been following the thread closely, mostly
> guessing on what you are actually trying to achieve here.. so I may
> be way off]
>
> On Tuesday 19 March 2002 12:19, Jean-Michel Hemstedt wrote:
> > REDIRECT could work in case of a collocated proxy, and only if we
> > have control over the proxy, i.e. Apache (btw: I'm currently trying
> > to find a clean and reusable way to extend transparent HTTP
> > mod_tprox and add HTTPS transparent proxying to Apache for Linux).
>
> REDIRECT works only if you have a user space proxy running on the
> machine doing REDIRECT. This is per definition of REDIRECT.
>
> > But I'm afraid REDIRECT doesn't fit for remote proxies which rely
> > on the original source ip of the client to perform some checks. In
> > that case we need the 50080/50443 applications of your example to
> > forward the modified requests to the remote proxy with the
> > source.ip of the original client.
>
> If the remote proxy is on the same LAN segment, or if you can set up
> a GRE tunnel or something similar to the proxy server, then you can
> use CONNMARK for this purpose to route the packets unmodified to the
> proxy and then do the final interception there. When you see the NEW
> session in mangle, mark it, then use fwmark based routing to route
> the packets of that session to the close-by proxy.
>
> > 1) the 50080/50443 applications use libipt and, for each new client
> >    request, before doing a new connect() to the remote proxy, they
> >    create a new iptables rule doing SNAT based on the --sport they
> >    chose for their bind(). And when the connection is released,
> >    they remove the created rule. This solution is very inefficient,
> >    and not scalable.
>
> Yuck. I would rather go for a single daemon using a custom protocol
> to forward the information to the origin server, such as the
> (incidentally) named TPROXY extension I was once playing with for
> Squid, archived somewhere on my old Squid patches page
> http://devel.squid-cache.org/hno/patche-old.html, on how to manage
> remote interception of traffic. But sure, this is a limitation of
> the transparent proxy capabilities of the current iptables framework.
> I think some aspects of SOCKS can also be used for this purpose.
>
> > 2) the 50080/50443 applications rely on the TPROXY framework and
> >    use nonlocal_bind.
>
> Except that nonlocal_bind does not yet work in TPROXY, does it?
>
> > 3) ??INTERCEPT?? = REDIRECT(PREROUTING) + SNAT(OUTPUT/POSTROUTING?)
> >    i.e.:
> >
> >    A:pclient1 --[REDIRECT --dport 80]--> B server:50080
> >    B client:pclient2 --[SNAT --to A:pclient]--> C:80
> >
> >    => the INTERCEPT would REDIRECT the packets from the client to
> >    the local stack and pass a 'rdmark' to the user space
> >    application, retrievable via getsockopt(rdmark). Then the
> >    application rewrites the packets and, in order to forward them
> >    to the remote proxy, it creates a new client socket to the
> >    remote proxy and uses setsockopt(rdmark) to instruct netfilter
> >    to do SNAT on the outgoing packets (OUTPUT/POSTROUTING?).
> >    Netfilter uses the 'rdmark' to retrieve from the redirect table
> >    the '--to' information (the source.ip before the redirect).
> >    When a packet comes back from the remote proxy, the reverse SNAT
> >    redirects the packets to the local client, which passes the
> >    packet to the local server, which sends the modified packets
> >    back to the original client...
>
> Very much
Re: TPROXY
----- Original Message -----
From: Balazs Scheidler [EMAIL PROTECTED]
To: Jean-Michel Hemstedt [EMAIL PROTECTED]
Sent: Tuesday, 19 March, 2002 08:50
Subject: Re: TPROXY

> On Wed, Mar 13, 2002 at 01:19:30PM +0100, Jean-Michel Hemstedt wrote:
> > hello, I'm quite new to netfilter, and I would like to use/write an
> > extension capable of rewriting HTTP/HTTPS requests from non-proxy
> > aware clients to a remote non-transparent-aware proxy (the
> > netfilter box being in the middle and acting as a default gw for
> > both sides). This implies:
> > - for HTTP: most HTTP request methods (GET, POST, ...) need to be
> >   rewritten with the full URL (taken from the non-redirected ip.dst
> >   for HTTP/0.9, or from the 'Host' field for HTTP/1.x)
> > - for HTTPS: *insert* an HTTP CONNECT transaction in the TCP stream
> >   (just after the TCP establishment), which means that the ip
> >   packets can't simply be redirected, unless playing with
> >   (cracking) the tcp.seq_num in netfilter.
> > The first case is not a problem (kind of REDIRECT target). For the
> > second case (HTTPS), I was thinking of using the ip_nonlocal_bind
> > option, but I read in the kernel archives that connect() was broken
> > for non-local bind in 2.4.x. I would also avoid user space
> > QUEUEing, since I noticed that the throughput was simply divided by
> > 2 (just for normal forwarding!). I think that your TPROXY target is
> > well suited for the HTTPS case (terminating the tcp sessions of the
> > client on the netfilter box and originating tcp sessions to the
> > proxy from the netfilter box as if they were originating from the
> > client, using a kind of *ip_nonlocal_bind* mechanism). right?
>
> TPROXY is not yet ready, it is lacking several important features. I
> posted it on the -devel list to receive feedback.
>
> Both of your problems can be solved by REDIRECT, you only need two
> different programs (or a single program performing both operations).
> Just listen on a random port (say 50080), and redirect all traffic
> to this port:
>
>   iptables -t nat -A PREROUTING -p tcp -d 0/0 --dport 80 -j REDIRECT --to-port 50080
>   iptables -t nat -A PREROUTING -p tcp -d 0/0 --dport 443 -j REDIRECT --to-port 50443

REDIRECT could work in case of a collocated proxy, and only if we have control over the proxy, i.e. Apache (btw: I'm currently trying to find a clean and reusable way to extend transparent HTTP mod_tprox and add HTTPS transparent proxying to Apache for Linux). But I'm afraid REDIRECT doesn't fit for remote proxies which rely on the original source ip of the client to perform some checks. In that case we need the 50080/50443 applications of your example to forward the modified requests to the remote proxy with the source.ip of the original client. I see 3 possible ways to do that:

1) the 50080/50443 applications use libipt and, for each new client request, before doing a new connect() to the remote proxy, they create a new iptables rule doing SNAT based on the --sport they chose for their bind(). And when the connection is released, they remove the created rule. This solution is very inefficient, and not scalable.

2) the 50080/50443 applications rely on the TPROXY framework and use nonlocal_bind.

3) ??INTERCEPT?? = REDIRECT(PREROUTING) + SNAT(OUTPUT/POSTROUTING?), i.e.:

   A:pclient1 --[REDIRECT --dport 80]--> B server:50080
   B client:pclient2 --[SNAT --to A:pclient]--> C:80

   => the INTERCEPT would REDIRECT the packets from the client to the local stack and pass a 'rdmark' to the user space application, retrievable via getsockopt(rdmark). Then the application rewrites the packets and, in order to forward them to the remote proxy, it creates a new client socket to the remote proxy and uses setsockopt(rdmark) to instruct netfilter to do SNAT on the outgoing packets (OUTPUT/POSTROUTING?). Netfilter uses the 'rdmark' to retrieve from the redirect table the '--to' information (the source.ip before the redirect). When a packet comes back from the remote proxy, the reverse SNAT redirects the packets to the local client, which passes the packet to the local server, which sends the modified packets back to the original client...

   (PS: I don't think the MARK target is suited for that kind of mechanism)
   (PPS: the user space applications would move to an LKM in a second phase)

do you (or anyone else) see any other way to do it? (see also the note on SO_ORIGINAL_DST after this message)

> One of your proxies will be listening on 50080, the other on 50443,
> the first performing non-transparent/transparent rewriting, the
> other CONNECT encapsulation. By the way, the first one is easy to do
> with Zorp. Its HttpProxy is able to rewrite server-requests to
> proxy-requests. CONNECT encapsulation is not supported, I'm afraid.
>
> > Have you received any feedback on your TPROXY target?
>
> not much.

I hope I'll be able to contribute, but I'll first need to better understand what all the features of netfilter are and how they can interact with each other... In the mean time, if you have any update, I'd like to have a look at it. Have you heard of any similar HTTPS-real-transp-proxy implementations
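One piece that makes the REDIRECT approach above workable in practice: a proxy bound on 50080/50443 can recover the original (pre-REDIRECT) destination with the SO_ORIGINAL_DST getsockopt that netfilter provides; getsockname() on the accepted socket would only return the rewritten local address. A minimal sketch (it does not solve the source-ip problem discussed above, only the destination side):

    #include <stdio.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <linux/netfilter_ipv4.h>   /* SO_ORIGINAL_DST */

    /* fills in the address the client actually connected to,
     * before the REDIRECT rule rewrote it */
    static int get_original_dst(int client_fd, struct sockaddr_in *dst)
    {
        socklen_t len = sizeof(*dst);
        return getsockopt(client_fd, SOL_IP, SO_ORIGINAL_DST, dst, &len);
    }

    /* usage (sketch):
     *   int fd = accept(listen_fd, ...);
     *   struct sockaddr_in dst;
     *   if (get_original_dst(fd, &dst) == 0)
     *       printf("client wanted %s:%d\n",
     *              inet_ntoa(dst.sin_addr), ntohs(dst.sin_port));
     */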
Re: [Q] connection tracking scaling
----- Original Message -----
From: Patrick Schaaf [EMAIL PROTECTED]
To: Harald Welte [EMAIL PROTECTED]; Patrick Schaaf [EMAIL PROTECTED]; Martin Josefsson [EMAIL PROTECTED]; Aviv Bergman [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, 19 March, 2002 12:16
Subject: Re: [Q] connection tracking scaling

> > I'd rather like to have this information gathered at runtime within
> > the kernel, where one could read out the current hash occupation
> > via /proc or some ioctl.
>
> OK, that's what I wanted to hear :-)
>
> Actually, the interesting statistics for a hash are not that large,
> and all aggregate:
> - bucket occupation: number of used buckets vs. number of all buckets
> - average chain length over all buckets
> - average chain length over the used buckets
> - counting of a classification of the chain lengths:
>   - number of 0-entry buckets
>   - number of 1-entry buckets
>   - number of 2-entry buckets
>   - number of 4-entry buckets
>   - number of 8-entry buckets
>   - number of 16-entry buckets
>   - number of more-than-16-entry buckets
>
> That's 10 values, and will at most double when I think more about it.
> I propose to gather these stats on the fly, and simply printk() them
> at a chosen interval:

I'm not a conntrack specialist, nor a kernel hacker, but I have some experience with ip hash caches in access servers (BRAS) that may be useful(?):

Some additional stats (a mock-up of the classification follows this message):
- HDA: cache hit depth average: the number of iterations in the bucket's list needed to get the matching collision entry.
- MDA: cache miss depth average: the number of iterations required without matching a cache entry (new connection).

HDA is meaningful if you have a bad cache distribution or a small CIS/CTS ratio (Cache Index Size = number of hash buckets / Cache Total Size = total number of conntrack tuples cachable). It also provides good information on traffic type and cache efficiency. In fact, let's assume you have realtime traffic (RTP) and bursty traffic (HTTP/1.1 with keep-alive) at the same time, and that the tuples for both types of traffic fall under the same hash key. Now, if your RT tuple is at the end of the collision list, or after the bursty entries, you will need frequent extra iterations to get your RT tuple... The workaround for that is collision promotion: you keep a hit counter in each tuple and just swap the most frequently accessed tuple one position ahead.

Some questions:
- have you an efficient 'freelist' implementation? What I've seen of kmem_cache_free and kmem_cache_alloc doesn't look like a simple pointer dereference... Am I wrong?
- wouldn't it be worth having a cache promotion mechanism?

Regarding [hashsize = conntrack_max/2], I vote for! An alternate solution would be to have a dynamic hash resize each time the average number of collisions exceeds a threshold (and no downward resize, except maybe asynchronously). But given my experience, I would say that ip hash distribution is not at all predictable (unless you know where in the net path your box will be, and what traffic type (VoIP, HTTP, eDonkey, ...) your box will have to handle, and even then, your predictions will not be valid for more than 6 months!). Therefore, the common way to handle unpredictable distribution is to define [max hash index size = max number of cache tuples], with a dynamic hash index resize.

One last word: the hash function you're using is the best compromise between unpredictable ipv4 traffic, cache symmetry, uniformity and computation time. I wouldn't change it too much, but there are two possible propositions:
- if you keep the modulo method (%), use a prime number far from a power of 2 for 'ip_conntrack_htable_size';
- if modulo is too slow, use the bitmasking method (&) with hsize being a power of 2, and with 2 bitshifts: ((key + (key >> 20) + (key >> 12)) & (hsize - 1)); but this method is not as efficient as the modulo method, and must be reconsidered for ipv6.

hope this may help...

> echo 300 > /proc/net/ip_conntrack_showstat
>
> would generate one printk() every 300 seconds. Echoing 0 would
> disable the statistics gathering altogether.
>
> I think I can hack this up, today. Having the flu must be good for
> something...
>
> later
> Patrick
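The classification Patrick proposes maps onto a handful of counters. As a user-space mock-up (illustrative names, operating on a snapshot of per-bucket chain lengths rather than on the live kernel table):

    #include <stdio.h>

    static void hash_stats(const unsigned int *chain_len,
                           unsigned int nbuckets)
    {
        unsigned int bins[7] = {0};   /* 0, 1, 2, <=4, <=8, <=16, >16 */
        unsigned long entries = 0, used = 0, i;

        for (i = 0; i < nbuckets; i++) {
            unsigned int l = chain_len[i];
            entries += l;
            if (l) used++;
            if (l == 0)       bins[0]++;
            else if (l == 1)  bins[1]++;
            else if (l == 2)  bins[2]++;
            else if (l <= 4)  bins[3]++;
            else if (l <= 8)  bins[4]++;
            else if (l <= 16) bins[5]++;
            else              bins[6]++;
        }
        printf("occupation: %lu/%u buckets\n", used, nbuckets);
        printf("avg chain (all): %.2f  (used): %.2f\n",
               (double)entries / nbuckets,
               used ? (double)entries / used : 0.0);
        printf("bins 0/1/2/4/8/16/>16: %u %u %u %u %u %u %u\n",
               bins[0], bins[1], bins[2], bins[3],
               bins[4], bins[5], bins[6]);
    }

In the kernel the same loop would feed a printk() at the interval chosen via the proposed /proc/net/ip_conntrack_showstat knob.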
Re: [Q] connection tracking scaling
----- Original Message -----
From: Patrick Schaaf [EMAIL PROTECTED]
To: Jean-Michel Hemstedt [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, 19 March, 2002 17:42
Subject: Re: [Q] connection tracking scaling

> Hello Jean-Michel,
>
> thanks for your input. I appreciate it.
>
> On Tue, Mar 19, 2002 at 03:56:32PM +0100, Jean-Michel Hemstedt wrote:
> > I'm not a conntrack specialist, nor a kernel hacker, but I have
> > some experience with ip hash caches in access servers (BRAS) that
> > may be useful(?):
> >
> > Some additional stats:
> > - HDA: cache hit depth average: the number of iterations in the
> >   bucket's list needed to get the matching collision entry.
> > - MDA: cache miss depth average: the number of iterations required
> >   without matching a cache entry (new connection).
>
> These two measures make an excellent dynamic measure in addition to
> what I described. However, they require modification of the standard
> LIST_FIND() macro used when traversing the chain, which is a bit
> awkward - and they require keeping properly locked / cpu-local
> counters in the critical path. I wouldn't want this on a highly
> loaded machine, at least not for a longer time.

everything is relative: if the overhead of maintaining those local counters is too big compared to the use we can make of them (collision promotion, coarser stats, path switching, ...), then just forget about it.

> The measures are important because of their dynamical nature. No
> static looking at the chains can capture effects related to the
> _order_ of the lists. These measures can. One possible optimization
> for the future would be to move the bucket pointer to the last found
> element, giving an easy 1-off-cache per bucket. The effect of that
> cannot be measured with a static examination; it should be clearly
> visible under your dynamic measure. I'll see that I make this
> separately optional.

of course, each box has its own requirements. A configurable option in such a generic kernel framework is always better than a built-in *useable* option.

> > some questions:
> > - have you an efficient 'freelist' implementation? What I've seen
> >   of kmem_cache_free and kmem_cache_alloc doesn't look like a
> >   simple pointer dereference... Am I wrong?
>
> The kmem_cache() stuff is supposed to be the best possible
> implementation of such a freelist on an SMP system. I am not aware
> that it has inefficiencies.

and surely me neither ;o)

> > - wouldn't it be worth having a cache promotion mechanism?
>
> What do you mean with that?

(you wiped out my explanation from my previous mail). It's the mechanism by which the tuples in a collision list are re-sorted by hit frequency, trying to reduce the number of iterations required to get a match. It's only useful if you insert new tuples at the beginning of the list, or if you have, let's say, a control channel and a data channel in the same list. In the latter example, if the entry corresponding to the data channel is after the entry corresponding to your control channel, then you waste one iteration per data packet.

example? stream A = 1 pps, stream B = 100 pps, lifetime of A = lifetime of B = T
=> if A after B: num iterations = T*(1*100 + 2*1) = T*102
=> if B after A: num iterations = T*(1*1 + 2*100) = T*201

But this mechanism requires a per-tuple hit counter, and some additional checks and pointer reshuffling in the fast path... It is thus useful in specific cases (i.e. a media gateway, sip/h323 proxy, real-time streaming environments). (see the sketch after this message)

> > regarding [hashsize=conntrack_max/2], I vote for!
> >
> > An alternate solution would be to have a dynamic hash resize each
> > time the average number of collisions exceeds a threshold (and no
> > downward resize, except maybe asynchronously).
>
> This is REALLY AWFUL in an SMP setting, because you need to keep the
> whole system write locked while you rebuild the hashes. You really
> don't want to do that.

ok, ok, I haven't said anything ;o)... but FYI the prime double hash is supposed to address the problem of smooth hash expansion by moving the entries gradually from one table to the other (I've not tried it).

> > One last word: the hash function you're using is the best
> > compromise between unpredictable ipv4 traffic, cache symmetry,
> > uniformity and computation time.
>
> I'm pretty ignorant wrt theory and attributes of hash functions, so
> I'm damned to check this by experiment. But it's reassuring to hear
> you say the function is basically OK.

thanks to the one who chose it (Rusty?). I've got only one remark: the modulo operation consumes lots of CPU cycles, especially if the operand is prime! and its implementation is architecture dependent (so not predictable), whereas bit shift operations are totally controllable and give pretty good results if used correctly... Bitshift also has the advantage of allowing a hashsize = 2^n. If memory is not an issue and if CPU cycles are critical ($/Hz > $/MB), then we'd better use a bitshift hash function (personally, this is the one I used). Anyway, I would be interested in bench results
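For completeness, the 'collision promotion' mechanism described above, in a user-space, singly-linked mock-up (illustrative names): each entry keeps a hit counter, and on a hit the entry is swapped one position toward the head of its bucket list when it becomes hotter than its predecessor:

    struct entry {
        struct entry *next;
        unsigned long hits;
        int key;                      /* stand-in for the tuple match */
    };

    static struct entry *lookup_promote(struct entry **bucket, int key)
    {
        struct entry *prev = NULL, *preprev = NULL, *e;

        for (e = *bucket; e; preprev = prev, prev = e, e = e->next) {
            if (e->key != key)
                continue;
            e->hits++;
            /* hotter than the entry in front of us: swap one ahead,
             * turning ...preprev -> prev -> e... into
             *         ...preprev -> e -> prev...               */
            if (prev && e->hits > prev->hits) {
                prev->next = e->next;
                e->next = prev;
                if (preprev)
                    preprev->next = e;
                else
                    *bucket = e;
            }
            return e;
        }
        return NULL;
    }

Swapping only one position per hit (rather than a full move-to-front) keeps the fast-path cost bounded while still letting hot entries, such as the 100 pps stream in the example, migrate toward the head over time.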
TPROXY
hello,

- is there any update regarding TPROXY since 13/Feb/2002?
- is TPROXY intended to replace 'slessdir' and 'IP_INTERCEPT'?
- will it be included in the kernel someday (which version?)?
- does it provide the definitive patch for nonlocal binding?
- are there examples on how to use it (apart from the comments in the diff)?

any help is welcome, and thanks for your job Balazs!

-jm-