Re: [squid-users] Complicate ACL affect performance?

2008-10-19 Thread Henrik K
On Sun, Oct 19, 2008 at 04:51:16PM +1300, Amos Jeffries wrote:

 Fair test would be reversing the hostname, which is very cheap operation. ;)

 No. Because most users will not write their ACL regex normally, and the  
 regex has to match a forward-coded domain anyway. The squid algorithm  
 works on forward-coded domains.

 A fair test, therefore, uses each method's native comparison style from
 forward-coded domains as input. dstdomain does not even really use the
 terminator equivalent to $ in its matches, though it is assumed.

No, the idea was to test the best-case scenario. At least for me.

 Your initial claim was that simply assembling the regex was faster than  
 dstdomain comparison.

Sorry, you must have been reading this thread too fast.

Me:
Sometimes you just need to block more specific URLS
how to use them efficiently IF NEEDED

It was the OTHER Henrik who was curious about dstdomain/regex speed. :-)

 You implied it very strongly with your statement that we should stop  
 recommending dstdomain for domain-only ACL. The informed developers have  
 never said NO regex. Only pointed out uses where it's not worth using.

I never said that you should stop using dstdomain. What statement
specifically are you referring to?

I was merely pointing out that "avoid regex" was a bit too generic a response
when someone asked about high-speed ACLs. We don't know if the original
poster needed them. If you need to block specific URLs, obviously you can't
just start using dstdomain instead.



Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Henrik K
On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
 On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:
 
  Optimizing 1000 x www.foo.bar/randomstuff into a _single_
  www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear.
  Even if it's all random servers, there are only ~30 characters from which
  branches are created.
 
 Right. 
 
 Would be interesting to see how 50K dstdomain compares to 50k host
 patterns merged into a single dstdomain_regex pattern in terms of CPU
 usage. Probably a little tweaking of Squid is needed to support such
 large patterns, but that's trivial. (squid.conf parser is limited to
 4096 characters per line, including folding)

Not sure what the splay code does in Squid, didn't have time to grab it.
But a simple test with Perl:

- Grepped some hostnames from wwwlogs etc
- Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
- Ran 100000 hostnames through it in 4 seconds (25000 hosts/sec on a 2.8GHz CPU)

It's pretty powerful stuff.



Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Henrik K
On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
 On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
  On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:
  
   Optimizing 1000 x www.foo.bar/randomstuff into a _single_
   www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear.
   Even if it's all random servers, there are only ~30 characters from which
   branches are created.
  
  Right. 
  
  Would be interesting to see how 50K dstdomain compares to 50k host
  patterns merged into a single dstdomain_regex pattern in terms of CPU
  usage. Probably a little tweaking of Squid is needed to support such
  large patterns, but that's trivial. (squid.conf parser is limited to
  4096 characters per line, including folding)
 
 Not sure what the splay code does in Squid, didn't have time to grab it.
 But a simple test with Perl:
 
 - Grepped some hostnames from wwwlogs etc
 - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
 - Ran 100000 hostnames through it in 4 seconds (25000 hosts/sec on a 2.8GHz CPU)
 
 It's pretty powerful stuff.

Oops, did it slightly wrong.

By doing it correctly, using ^hostname$ instead of plain hostname in the regex,
it takes 1.2 seconds; that's 80000+ hosts/sec.



Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Amos Jeffries

Henrik K wrote:

On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:

On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:

On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:


Optimizing 1000 x www.foo.bar/randomstuff into a _single_
www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear.
Even if it's all random servers, there are only ~30 characters from which
branches are created.
Right. 


Would be interesting to see how 50K dstdomain compares to 50k host
patterns merged into a single dstdomain_regex pattern in terms of CPU
usage. Probably a little tweaking of Squid is needed to support such
large patterns, but that's trivial. (squid.conf parser is limited to
4096 characters per line, including folding)

Not sure what the splay code does in Squid, didn't have time to grab it.
But a simple test with Perl:

- Grepped some hostnames from wwwlogs etc
- Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
- Ran 100000 hostnames through it in 4 seconds (25000 hosts/sec on a 2.8GHz CPU)

It's pretty powerful stuff.


Oops, did it slightly wrong.

By doing it correctly, using ^hostname$ instead of plain hostname in the regex,
it takes 1.2 seconds; that's 80000+ hosts/sec.



Still slightly off. The fair test for that vs the Squid splay tree would
still leave out the ^, so as to match any given *.example.com$



Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9


Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Henrik K
On Sat, Oct 18, 2008 at 11:54:52PM +1300, Amos Jeffries wrote:
 Henrik K wrote:
 On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
 Not sure what the splay code does in Squid, didn't have time to grab it.
 But a simple test with Perl:

 - Grepped some hostnames from wwwlogs etc
 - Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
 - Ran 100000 hostnames through it in 4 seconds (25000 hosts/sec on a 2.8GHz CPU)

 It's pretty powerful stuff.

 Oops, did it slightly wrong.

 By doing it correctly, using ^hostname$ instead of plain hostname in the regex,
 it takes 1.2 seconds; that's 80000+ hosts/sec.


 Still slightly off. The fair test for that vs the Squid splay tree would
 still leave out the ^, so as to match any given *.example.com$

Fair test would be reversing the hostname, which is a very cheap operation. ;)

(^|\.)example\.com$  .. runtime 2.2 secs
^moc\.elpmaxe(\.|$)  .. runtime 1.3 secs

No one is suggesting that dstdomain should be replaced by regexes though.
This just proves that if you need them, they can be used efficiently.
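
The two forms timed above can be checked against each other with a small sketch. This is a Python stand-in, not the Perl test from this post: the host list and function names are made up for illustration, and the timings it prints will not be comparable to the figures quoted here.

```python
import re
import timeit

# The two patterns from the post: suffix match on the forward hostname,
# prefix match on the reversed hostname.
forward = re.compile(r"(^|\.)example\.com$")
reverse = re.compile(r"^moc\.elpmaxe(\.|$)")

def match_forward(host):
    return forward.search(host) is not None

def match_reversed(host):
    # Reversing the string is the cheap extra step discussed in the thread.
    return reverse.search(host[::-1]) is not None

# Hypothetical sample hosts, chosen to cover hits and near misses.
hosts = ["example.com", "www.example.com", "notexample.com",
         "example.com.evil.org"]

# Both forms should agree on every host.
for h in hosts:
    assert match_forward(h) == match_reversed(h)

if __name__ == "__main__":
    sample = hosts * 25000  # 100000 lookups per form
    for fn in (match_forward, match_reversed):
        secs = timeit.timeit(lambda: [fn(h) for h in sample], number=1)
        print(fn.__name__, round(secs, 3), "sec")
```

The reversed form lets the regex engine reject a non-matching host at the first character instead of scanning for a suffix, which is the likely reason for the difference measured above.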



Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Henrik Nordstrom
On lör, 2008-10-18 at 12:58 +0300, Henrik K wrote:

 By doing it correctly, using ^hostname$ instead of plain hostname in the regex,
 it takes 1.2 seconds; that's 80000+ hosts/sec.

The interesting pattern match to compare with is

s/^www\.// on the hostnames before making patterns

Then for each hostname
(\.|^)hostname$

or expanded into two patterns, depending on how well Regexp::Assemble handles
this case.

   \.hostname$
   ^hostname$

Blacklists have a quite large proportion of domain matches, i.e. entries
matching a complete domain.

Quite likely regex will handle this much better if you reverse the
hostnames, resulting in patterns of the form

 ^emantsoh(\.|$)

Regards
Henrik




Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Henrik Nordstrom
On lör, 2008-10-18 at 14:26 +0300, Henrik K wrote:

 Fair test would be reversing the hostname, which is very cheap operation. ;)
 
 (^|\.)example\.com$  .. runtime 2.2 secs
 ^moc\.elpmaxe(\.|$)  .. runtime 1.3 secs

Heh, and I should learn to read the whole thread before responding ;-)

Regards
Henrik




Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Amos Jeffries

Henrik K wrote:

On Sat, Oct 18, 2008 at 11:54:52PM +1300, Amos Jeffries wrote:

Henrik K wrote:

On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:

Not sure what the splay code does in Squid, didn't have time to grab it.


Produces a very inefficient unsorted but alphabetically ordered trinary 
tree.



But a simple test with Perl:

- Grepped some hostnames from wwwlogs etc
- Regexp::Assemble'd 50000 unique hostnames (= 560kB regex, took 22 sec)
- Ran 100000 hostnames through it in 4 seconds (25000 hosts/sec on a 2.8GHz CPU)

It's pretty powerful stuff.

Oops, did it slightly wrong.

By doing it correctly, using ^hostname$ instead of plain hostname in the regex,
it takes 1.2 seconds; that's 80000+ hosts/sec.

Still slightly off. The fair test for that vs the Squid splay tree would
still leave out the ^, so as to match any given *.example.com$


Fair test would be reversing the hostname, which is very cheap operation. ;)


No. Because most users will not write their ACL regex normally, and the 
regex has to match a forward-coded domain anyway. The squid algorithm 
works on forward-coded domains.


A fair test, therefore, uses each method's native comparison style from
forward-coded domains as input. dstdomain does not even really use the
terminator equivalent to $ in its matches, though it is assumed.


Your initial claim was that simply assembling the regex was faster than 
dstdomain comparison.
You've provided the regex numbers. I'm working on the sourcelayout 
project, which should simplify the code so we can build a benchmark test 
app for dstdomain easily sometime soon.


Just a guesstimate (not knowing the avg domain length you used, my 
numbers assume max-length 256byte domain names). I expect it matches at 
over 200k domains per second on a single-CPU 2.8GHz machine.




(^|\.)example\.com$  .. runtime 2.2 secs
^moc\.elpmaxe(\.|$)  .. runtime 1.3 secs

No one is suggesting that dstdomain should be replaced by regexs though.
This just proves that if you need them, they can be used efficiently.


You implied it very strongly with your statement that we should stop 
recommending dstdomain for domain-only ACL. The informed developers have 
never said NO regex. Only pointed out uses where it's not worth using.
One of the major optimizations I myself promote is adding a src ACL on
each access line to restrict how often regex or other 'slow' ACLs get
tested to start with.
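
A squid.conf fragment along these lines illustrates the idea. The ACL names, the network, and the file path are hypothetical; only the ordering matters.

```
# Hypothetical names, network and file path.
acl localnet src 192.168.0.0/16
acl badurls url_regex -i "/etc/squid/badurls.regex"

# ACLs on one access line are ANDed and tested left to right, so the
# cheap src check runs first and the regex is only evaluated for
# requests from clients that already matched localnet.
http_access deny localnet badurls
```

Requests from outside the listed network fail the first ACL, so the regex list is never consulted for them on this line.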


Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9


Re: [squid-users] Complicate ACL affect performance?

2008-10-18 Thread Amos Jeffries

snip


No. Because most users will not write their ACL regex normally, and the 
regex has to match a forward-coded domain anyway. The squid algorithm 
works on forward-coded domains.




Oops. I meant to write: Because most users will write their ACL regex
normally (won't even think to write the regex reversed byte-wise).


Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9


Re: [squid-users] Complicate ACL affect performance?

2008-10-17 Thread Henrik Nordstrom
On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:

 Optimizing 1000 x www.foo.bar/randomstuff into a _single_
 www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear.
 Even if it's all random servers, there are only ~30 characters from which
 branches are created.

Right. 

Would be interesting to see how 50K dstdomain compares to 50k host
patterns merged into a single dstdomain_regex pattern in terms of CPU
usage. Probably a little tweaking of Squid is needed to support such
large patterns, but that's trivial. (squid.conf parser is limited to
4096 characters per line, including folding)

Regards
Henrik




Re: [squid-users] Complicate ACL affect performance?

2008-10-16 Thread Henrik K
On Thu, Oct 16, 2008 at 01:56:59AM +0800, howard chen wrote:
 Hello,
 
 On Wed, Oct 15, 2008 at 10:14 PM, Henrik K [EMAIL PROTECTED] wrote:
  On Wed, Oct 15, 2008 at 03:42:20PM +0200, Henrik Nordstrom wrote:
 
   Any suggestion for having large ACL in a high traffic server?
 
  Avoid using regex based acls.
 
  It's fine if you use Perl + Regexp::Assemble to optimize them. And link
  Squid with PCRE. Sometimes you just need to block more specific URLs.
 
 
 
 What do you mean by linking Squid with PCRE?

http://www.pcre.org/

When compiling, add -lpcreposix -lpcre to LDFLAGS. It overrides your
system library, is faster, and I don't have any memory leaks anymore.

If you want to read long regexps from an include file, you need to
patch a bit: http://www.squid-cache.org/bugs/show_bug.cgi?id=2215
(src/cache_cf.c - strtokFile() - change all 256 to 65535)



Re: [squid-users] Complicate ACL affect performance?

2008-10-16 Thread Henrik Nordstrom
On ons, 2008-10-15 at 17:14 +0300, Henrik K wrote:
  Avoid using regex based acls.
 
 It's fine if you use Perl + Regexp::Assemble to optimize them. And link
 Squid with PCRE. Sometimes you just need to block more specific URLs.

No it's not. Even optimized regexes are several orders of magnitude more
complex to evaluate than the structured acls.

The lookup time of dstdomain is logarithmic in the number of entries.

The lookup time of regex acls is linear in the number of entries.
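
The difference in growth can be sketched as follows. This is a Python illustration only: a sorted list with binary search stands in for Squid's actual dstdomain structure, and the domain list and function names are invented for the example.

```python
import bisect
import re

# Hypothetical blocked domains, kept sorted so binary search works.
domains = sorted(["example.com", "example.net", "foo.bar", "squid-cache.org"])

# One separate regex per entry, as in an unoptimized regex ACL.
patterns = [re.compile(r"(^|\.)" + re.escape(d) + "$") for d in domains]

def in_sorted(sorted_list, key):
    """Binary search: O(log n) comparisons for n entries."""
    i = bisect.bisect_left(sorted_list, key)
    return i < len(sorted_list) and sorted_list[i] == key

def structured_lookup(host):
    """Check the host and each of its parent domains with a binary search."""
    labels = host.split(".")
    return any(in_sorted(domains, ".".join(labels[i:]))
               for i in range(len(labels)))

def regex_lookup(host):
    """Try every pattern in turn: the cost grows linearly with the list."""
    return any(p.search(host) for p in patterns)

for host in ["www.example.com", "foo.bar", "notexample.com", "example.org"]:
    print(host, structured_lookup(host), regex_lookup(host))
```

Doubling the number of entries adds one comparison per search in the first function but doubles the worst-case number of regex evaluations in the second.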

Regards
Henrik




Re: [squid-users] Complicate ACL affect performance?

2008-10-16 Thread Henrik K
On Thu, Oct 16, 2008 at 10:10:23AM +0200, Henrik Nordstrom wrote:
 On ons, 2008-10-15 at 17:14 +0300, Henrik K wrote:
   Avoid using regex based acls.
  
  It's fine if you use Perl + Regexp::Assemble to optimize them. And link
  Squid with PCRE. Sometimes you just need to block more specific URLs.
 
 No it's not. Even optimized regexes are several orders of magnitude more
 complex to evaluate than the structured acls.
 
 The lookup time of dstdomain is logarithmic in the number of entries.
 
 The lookup time of regex acls is linear in the number of entries.

It's fine that you advocate avoiding regex, but a much better way is to
actually tell people what's wrong and how to use them efficiently if needed.

Of course you shouldn't have a separate regex for every URL. I suggest you
look at what Regexp::Assemble does.

Optimizing 1000 x www.foo.bar/randomstuff into a _single_
www.foobar.com/(r(egex|and(om)?)|fuba[rz]) regex is nowhere near linear.
Even if it's all random servers, there are only ~30 characters from which
branches are created.
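
The kind of merging described here can be sketched with a prefix tree. This is a minimal Python illustration of the idea, not the Regexp::Assemble implementation (which does more, such as character classes); the function names and URL list are made up.

```python
import re

def build_trie(words):
    """Insert each word into a nested-dict trie; the key '' marks a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def trie_to_regex(node):
    """Turn a trie node into a regex fragment that shares common prefixes."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""                      # leaf: nothing more to match
    if len(branches) == 1 and not ends_here:
        return branches[0]             # single path: no group needed
    group = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, the rest is optional.
    return group + "?" if ends_here else group

# Hypothetical URL list, modelled on the example in this post.
urls = ["www.foobar.com/regex", "www.foobar.com/rand", "www.foobar.com/random",
        "www.foobar.com/fubar", "www.foobar.com/fubaz"]

pattern = re.compile(trie_to_regex(build_trie(urls)))
print(pattern.pattern)
print(all(pattern.fullmatch(u) for u in urls))
```

Because the shared prefix is matched once and each branch point offers only as many alternatives as there are distinct next characters, matching cost follows the length of the input rather than the number of merged URLs.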



Re: [squid-users] Complicate ACL affect performance?

2008-10-15 Thread Henrik Nordstrom
On ons, 2008-10-15 at 19:29 +0800, howard chen wrote:

 I have some quite complicated ACLs in the config file. I just wonder if
 it is possible to see how much CPU time is used in running the ACLs,
 and how much CPU time is used in serving cache/proxy.

Not easily.

 Any suggestion for having large ACL in a high traffic server?

Avoid using regex based acls. Use the structured acls such as dstdomain.

Regards
Henrik




[squid-users] Complicate ACL affect performance?

2008-10-15 Thread howard chen
Hello,

I have some quite complicated ACLs in the config file. I just wonder if
it is possible to see how much CPU time is used in running the ACLs,
and how much CPU time is used in serving cache/proxy.

Any suggestion for having large ACL in a high traffic server?

Thanks.


Re: [squid-users] Complicate ACL affect performance?

2008-10-15 Thread Henrik K
On Wed, Oct 15, 2008 at 03:42:20PM +0200, Henrik Nordstrom wrote:
 
  Any suggestion for having large ACL in a high traffic server?
 
 Avoid using regex based acls.

It's fine if you use Perl + Regexp::Assemble to optimize them. And link
Squid with PCRE. Sometimes you just need to block more specific URLs.



Re: [squid-users] Complicate ACL affect performance?

2008-10-15 Thread howard chen
Hello,

On Wed, Oct 15, 2008 at 10:14 PM, Henrik K [EMAIL PROTECTED] wrote:
 On Wed, Oct 15, 2008 at 03:42:20PM +0200, Henrik Nordstrom wrote:

  Any suggestion for having large ACL in a high traffic server?

 Avoid using regex based acls.

 It's fine if you use Perl + Regexp::Assemble to optimize them. And link
 Squid with PCRE. Sometimes you just need to block more specific URLs.



What do you mean by linking Squid with PCRE?




Thanks.