Re: [squid-users] Complicate ACL affect performance?
On Sun, Oct 19, 2008 at 04:51:16PM +1300, Amos Jeffries wrote:
>> Fair test would be reversing the hostname, which is very cheap operation. ;)
> No. Because most users will not write their ACL regex normally, and the regex has to match a forward-coded domain anyway. The squid algorithm works on forward-coded domains. A fair test therefore uses each method's native comparison style, with forward-coded domains as input. dstdomain does not even really use the terminator equivalent to $ in its matches, though it is assumed.

No, the idea was to test the best-case scenario. At least for me.

> Your initial claim was that simply assembling the regex was faster than dstdomain comparison.

Sorry, you must have been reading this thread too fast.

Me:
"Sometimes you just need to block more specific URLs"
"how to use them efficiently IF NEEDED"

It was the OTHER Henrik who was curious about dstdomain/regex speed. :-)

> You implied it very strongly with your statement that we should stop recommending dstdomain for domain-only ACL. The informed developers have never said NO regex. Only pointed out uses where it's not worth using.

I have never said that you should stop using dstdomain. What statement specifically are you referring to?

I was merely pointing out that "avoid regex" was a bit too generic a response when someone asked about high-speed ACLs. We don't know if the original poster needed them. If you need to block specific URLs, obviously you can't just start using dstdomain instead.
Re: [squid-users] Complicate ACL affect performance?
On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
> On Thu, 2008-10-16 at 12:02 +0300, Henrik K wrote:
>> Optimizing 1000 x www.foo.bar/randomstuff into a _single_ www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear. Even if it's all random servers, there are only ~30 characters from which branches are created.
> Right.
> Would be interesting to see how 50K dstdomain compares to 50k host patterns merged into a single dstdomain_regex pattern in terms of CPU usage. Probably a little tweaking of Squid is needed to support such large patterns, but that's trivial. (squid.conf parser is limited to 4096 characters per line, including folding)

Not sure what the splay code does in Squid, didn't have time to grab it. But a simple test with Perl:

- Grepped some hostnames from wwwlogs etc
- Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
- Ran 100000 hostnames on it in 4 seconds (25000 hosts/sec on 2.8GHz CPU)

It's pretty powerful stuff.
Re: [squid-users] Complicate ACL affect performance?
On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
> On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
>> On Thu, 2008-10-16 at 12:02 +0300, Henrik K wrote:
>>> Optimizing 1000 x www.foo.bar/randomstuff into a _single_ www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear. Even if it's all random servers, there are only ~30 characters from which branches are created.
>> Right.
>> Would be interesting to see how 50K dstdomain compares to 50k host patterns merged into a single dstdomain_regex pattern in terms of CPU usage. Probably a little tweaking of Squid is needed to support such large patterns, but that's trivial. (squid.conf parser is limited to 4096 characters per line, including folding)
> Not sure what the splay code does in Squid, didn't have time to grab it. But a simple test with Perl:
> - Grepped some hostnames from wwwlogs etc
> - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
> - Ran 100000 hostnames on it in 4 seconds (25000 hosts/sec on 2.8GHz CPU)
> It's pretty powerful stuff.

Oops, I even did it slightly wrong. Doing it correctly, using ^hostname$ instead of plain hostname in the regex, results in 1.2 seconds; that's 80000+ hosts/sec.
Re: [squid-users] Complicate ACL affect performance?
Henrik K wrote:
> On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
>> On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
>>> On Thu, 2008-10-16 at 12:02 +0300, Henrik K wrote:
>>>> Optimizing 1000 x www.foo.bar/randomstuff into a _single_ www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear. Even if it's all random servers, there are only ~30 characters from which branches are created.
>>> Right.
>>> Would be interesting to see how 50K dstdomain compares to 50k host patterns merged into a single dstdomain_regex pattern in terms of CPU usage. Probably a little tweaking of Squid is needed to support such large patterns, but that's trivial. (squid.conf parser is limited to 4096 characters per line, including folding)
>> Not sure what the splay code does in Squid, didn't have time to grab it. But a simple test with Perl:
>> - Grepped some hostnames from wwwlogs etc
>> - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
>> - Ran 100000 hostnames on it in 4 seconds (25000 hosts/sec on 2.8GHz CPU)
>> It's pretty powerful stuff.
> Oops, did it even slightly wrong. Doing it correctly, using ^hostname$ instead of plain hostname in the regex, results in 1.2 seconds; that's 80000+ hosts/sec.

Still out slightly. A fair test of that against the Squid splay tree would still leave out the ^, so that it matches any given *.example.com$

Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9
Re: [squid-users] Complicate ACL affect performance?
On Sat, Oct 18, 2008 at 11:54:52PM +1300, Amos Jeffries wrote:
> Henrik K wrote:
>> On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
>>> Not sure what the splay code does in Squid, didn't have time to grab it. But a simple test with Perl:
>>> - Grepped some hostnames from wwwlogs etc
>>> - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
>>> - Ran 100000 hostnames on it in 4 seconds (25000 hosts/sec on 2.8GHz CPU)
>>> It's pretty powerful stuff.
>> Oops, did it even slightly wrong. Doing it correctly, using ^hostname$ instead of plain hostname in the regex, results in 1.2 seconds; that's 80000+ hosts/sec.
> Still out slightly. The fair test for that vs squid splay tree would be still missing the ^ to match any given *.example.com$

A fair test would be reversing the hostname, which is a very cheap operation. ;)

(^|\.)example\.com$ .. runtime 2.2 secs
^moc\.elpmaxe(\.|$) .. runtime 1.3 secs

No one is suggesting that dstdomain should be replaced by regexes though. This just proves that if you need them, they can be used efficiently.
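For readers who want to check that the two patterns quoted above accept the same hosts, here is a minimal Python sketch. It is not the Perl benchmark from the thread and says nothing about timing; the helper names `match_forward` and `match_reversed` are made up for this illustration.

```python
import re

# The two patterns from the message above: one applied to the hostname as-is,
# one applied to the byte-wise reversed hostname.
FORWARD = re.compile(r"(^|\.)example\.com$")
REVERSED = re.compile(r"^moc\.elpmaxe(\.|$)")

def match_forward(host):
    """Match example.com and its subdomains on the forward-coded name."""
    return FORWARD.search(host) is not None

def match_reversed(host):
    """Reverse the hostname first, then match from the start of the string."""
    return REVERSED.search(host[::-1]) is not None

if __name__ == "__main__":
    for host in ["example.com", "www.example.com", "badexample.com",
                 "example.com.evil.org", "example.org"]:
        print(host, match_forward(host), match_reversed(host))
```

The reversed form lets the regex engine reject most non-matching hosts at the first character, which is one plausible reason for the lower runtime reported above.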
Re: [squid-users] Complicate ACL affect performance?
On Sat, 2008-10-18 at 12:58 +0300, Henrik K wrote:
> Doing it correctly, using ^hostname$ instead of plain hostname in the regex, results in 1.2 seconds; that's 80000+ hosts/sec.

The interesting pattern match to compare with is to apply

s/^www\.//

to the hostnames before making patterns. Then, for each hostname, use

(\.|^)hostname$

or expand it into two patterns, depending on how well Regexp::Assemble handles this case:

\.hostname$
^hostname$

Blacklists have quite a large proportion of domain matches, i.e. entries matching a complete domain.

Quite likely regex will handle this much better if you reverse the hostnames, resulting in patterns of the form

^emantsoh(\.|$)

Regards
Henrik
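The preprocessing described above could be sketched roughly as follows in Python. This is only an illustration of the steps (strip a leading `www.`, then emit either a forward or a reversed domain pattern); the function names are hypothetical and nothing here comes from Squid or Regexp::Assemble.

```python
import re

def strip_www(hostname):
    # Equivalent of s/^www\.// on the hostname.
    return re.sub(r"^www\.", "", hostname)

def domain_pattern(hostname):
    # Forward form: matches the domain itself and any subdomain of it.
    return r"(\.|^)" + re.escape(strip_www(hostname)) + "$"

def reversed_domain_pattern(hostname):
    # Reversed form: meant to be applied to the reversed hostname string.
    return "^" + re.escape(strip_www(hostname)[::-1]) + r"(\.|$)"

if __name__ == "__main__":
    print(domain_pattern("www.example.com"))
    print(reversed_domain_pattern("www.example.com"))
```

Whether a single alternation or two separate patterns per host assembles better is left open in the message above, so the sketch only generates the combined form.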
Re: [squid-users] Complicate ACL affect performance?
On Sat, 2008-10-18 at 14:26 +0300, Henrik K wrote:
> Fair test would be reversing the hostname, which is very cheap operation. ;)
> (^|\.)example\.com$ .. runtime 2.2 secs
> ^moc\.elpmaxe(\.|$) .. runtime 1.3 secs

Heh, and I should learn to read the whole thread before responding ;-)

Regards
Henrik
Re: [squid-users] Complicate ACL affect performance?
Henrik K wrote:
> On Sat, Oct 18, 2008 at 11:54:52PM +1300, Amos Jeffries wrote:
>> Henrik K wrote:
>>> On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
>>>> Not sure what the splay code does in Squid, didn't have time to grab it.

It produces a very inefficient unsorted but alphabetically ordered trinary tree.

>>>> But a simple test with Perl:
>>>> - Grepped some hostnames from wwwlogs etc
>>>> - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
>>>> - Ran 100000 hostnames on it in 4 seconds (25000 hosts/sec on 2.8GHz CPU)
>>>> It's pretty powerful stuff.
>>> Oops, did it even slightly wrong. Doing it correctly, using ^hostname$ instead of plain hostname in the regex, results in 1.2 seconds; that's 80000+ hosts/sec.
>> Still out slightly. The fair test for that vs squid splay tree would be still missing the ^ to match any given *.example.com$
> Fair test would be reversing the hostname, which is very cheap operation. ;)

No. Because most users will not write their ACL regex normally, and the regex has to match a forward-coded domain anyway. The squid algorithm works on forward-coded domains. A fair test therefore uses each method's native comparison style, with forward-coded domains as input. dstdomain does not even really use the terminator equivalent to $ in its matches, though it is assumed.

Your initial claim was that simply assembling the regex was faster than dstdomain comparison. You've provided the regex numbers. I'm working on the sourcelayout project, which should simplify the code so we can easily build a benchmark test app for dstdomain sometime soon.

Just a guesstimate (not knowing the average domain length you used, my numbers assume maximum-length 256-byte domain names): I expect it matches at over 200k domains per second on a single-CPU 2.8GHz machine.

> (^|\.)example\.com$ .. runtime 2.2 secs
> ^moc\.elpmaxe(\.|$) .. runtime 1.3 secs
> No one is suggesting that dstdomain should be replaced by regexs though. This just proves that if you need them, they can be used efficiently.

You implied it very strongly with your statement that we should stop recommending dstdomain for domain-only ACL. The informed developers have never said NO regex. Only pointed out uses where it's not worth using.

One of the major optimizations I promote myself is adding a src ACL to each access line, to limit how often regex or other 'slow' ACLs get tested in the first place.

Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9
Re: [squid-users] Complicate ACL affect performance?
<snip>
> No. Because most users will not write their ACL regex normally, and the regex has to match a forward-coded domain anyway. The squid algorithm works on forward-coded domains.

Oops. I meant to write:

Because most users will write their ACL regex normally (won't even think to write the regex in reverse byte-wise).

Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9
Re: [squid-users] Complicate ACL affect performance?
On Thu, 2008-10-16 at 12:02 +0300, Henrik K wrote:
> Optimizing 1000 x www.foo.bar/randomstuff into a _single_ www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear. Even if it's all random servers, there are only ~30 characters from which branches are created.

Right.

Would be interesting to see how 50K dstdomain compares to 50k host patterns merged into a single dstdomain_regex pattern in terms of CPU usage. Probably a little tweaking of Squid is needed to support such large patterns, but that's trivial. (squid.conf parser is limited to 4096 characters per line, including folding)

Regards
Henrik
Re: [squid-users] Complicate ACL affect performance?
On Thu, Oct 16, 2008 at 01:56:59AM +0800, howard chen wrote:
> Hello,
> On Wed, Oct 15, 2008 at 10:14 PM, Henrik K wrote:
>> On Wed, Oct 15, 2008 at 03:42:20PM +0200, Henrik Nordstrom wrote:
>>>> Any suggestion for having large ACL in a high traffic server?
>>> Avoid using regex based acls.
>> It's fine if you use Perl + Regexp::Assemble to optimize them. And link Squid with PCRE. Sometimes you just need to block more specific URLs.
> What do you mean link Squid with PCRE?

http://www.pcre.org/

When compiling, add -lpcreposix -lpcre to LDFLAGS. It overrides your system library; it is faster, and I don't have any memory leaks anymore.

If you want to read long regexps from an include file, you need to patch a bit:

http://www.squid-cache.org/bugs/show_bug.cgi?id=2215
(src/cache_cf.c - strtokFile() - change all 256 to 65535)
Re: [squid-users] Complicate ACL affect performance?
On Wed, 2008-10-15 at 17:14 +0300, Henrik K wrote:
>> Avoid using regex based acls.
> It's fine if you use Perl + Regexp::Assemble to optimize them. And link Squid with PCRE. Sometimes you just need to block more specific URLs.

No it's not. Even optimized regexes are several orders of magnitude more complex to evaluate than the structured acls.

The lookup time of dstdomain is logarithmic in the number of entries.

The lookup time of regex acls is linear in the number of entries.

Regards
Henrik
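As a rough illustration of the logarithmic versus linear claim above, here is a minimal Python sketch. It is not Squid's code: a sorted list with binary search stands in for the dstdomain tree, it only does exact-name lookups (no subdomain matching), and a list of separately compiled patterns stands in for one regex ACL entry per host. All function names are made up for this example.

```python
import bisect
import re

def make_structured_lookup(domains):
    """Exact-match lookup in O(log n) comparisons via binary search."""
    table = sorted(domains)

    def lookup(host):
        i = bisect.bisect_left(table, host)
        return i < len(table) and table[i] == host

    return lookup

def make_regex_list_lookup(domains):
    """One anchored regex per entry; a miss has to try all n patterns."""
    patterns = [re.compile("^" + re.escape(d) + "$") for d in domains]

    def lookup(host):
        return any(p.search(host) for p in patterns)

    return lookup

if __name__ == "__main__":
    domains = ["host%d.example.com" % i for i in range(50000)]
    fast = make_structured_lookup(domains)
    slow = make_regex_list_lookup(domains)
    print(fast("host123.example.com"), slow("host123.example.com"))
    print(fast("nope.example.org"), slow("nope.example.org"))
```

With 50000 entries a miss costs roughly 16 string comparisons in the first lookup and 50000 regex attempts in the second, which is the gap being described; merging the patterns into one regex changes the second figure.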
Re: [squid-users] Complicate ACL affect performance?
On Thu, Oct 16, 2008 at 10:10:23AM +0200, Henrik Nordstrom wrote:
> On Wed, 2008-10-15 at 17:14 +0300, Henrik K wrote:
>>> Avoid using regex based acls.
>> It's fine if you use Perl + Regexp::Assemble to optimize them. And link Squid with PCRE. Sometimes you just need to block more specific URLs.
> No it's not. Even optimized regexes are several orders of magnitude more complex to evaluate than the structured acls.
> The lookup time of dstdomain is logarithmic in the number of entries.
> The lookup time of regex acls is linear in the number of entries.

It's fine that you advocate avoiding regex, but a much better way is to actually tell people what's wrong and how to use them efficiently if needed.

Of course you shouldn't have a separate regex for every URL. I suggest you look at what Regexp::Assemble does.

Optimizing 1000 x www.foo.bar/randomstuff into a _single_ www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear. Even if it's all random servers, there are only ~30 characters from which branches are created.
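To make the merging idea above concrete, here is a small Python sketch that folds a list of literal strings into one regex by sharing common prefixes in a trie. It is only an approximation of the general technique, not Regexp::Assemble's actual algorithm (which is Perl and handles far more than literals); `build_trie`, `trie_to_regex` and `assemble` are names invented for this example.

```python
import re

def build_trie(words):
    """Insert each word into a nested-dict trie; the key '' marks a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def trie_to_regex(node):
    """Return a regex fragment matching every suffix stored below this node."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(node) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, everything after it is optional.
    return group + "?" if ends_here else group

def assemble(words):
    """Compile one anchored regex matching exactly the given literal strings."""
    return re.compile("^" + trie_to_regex(build_trie(words)) + "$")

if __name__ == "__main__":
    urls = ["www.foo.bar/regex", "www.foo.bar/rand", "www.foo.bar/random",
            "www.foo.bar/fubar", "www.foo.bar/fubaz"]
    print(assemble(urls).pattern)
```

Because the shared prefix is written only once, a backtracking engine tests it once per input instead of once per entry, which is the reason given above for the merged pattern not scaling linearly.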
Re: [squid-users] Complicate ACL affect performance?
On Wed, 2008-10-15 at 19:29 +0800, howard chen wrote:
> I have some quite complicated ACLs in the config file. I just wonder if it is possible to see how much CPU time is used in running the ACLs, and how much CPU time is used in serving cache/proxy.

Not easily.

> Any suggestion for having large ACL in a high traffic server?

Avoid using regex based acls. Use the structured acls such as dstdomain.

Regards
Henrik
[squid-users] Complicate ACL affect performance?
Hello,

I have some quite complicated ACLs in the config file. I just wonder if it is possible to see how much CPU time is used in running the ACLs, and how much CPU time is used in serving cache/proxy.

Any suggestion for having large ACL in a high traffic server?

Thanks.
Re: [squid-users] Complicate ACL affect performance?
On Wed, Oct 15, 2008 at 03:42:20PM +0200, Henrik Nordstrom wrote:
>> Any suggestion for having large ACL in a high traffic server?
> Avoid using regex based acls.

It's fine if you use Perl + Regexp::Assemble to optimize them. And link Squid with PCRE. Sometimes you just need to block more specific URLs.
Re: [squid-users] Complicate ACL affect performance?
Hello,

On Wed, Oct 15, 2008 at 10:14 PM, Henrik K wrote:
> On Wed, Oct 15, 2008 at 03:42:20PM +0200, Henrik Nordstrom wrote:
>>> Any suggestion for having large ACL in a high traffic server?
>> Avoid using regex based acls.
> It's fine if you use Perl + Regexp::Assemble to optimize them. And link Squid with PCRE. Sometimes you just need to block more specific URLs.

What do you mean by "link Squid with PCRE"?

Thanks.