Henrik K wrote:
On Sat, Oct 18, 2008 at 12:44:46PM +0300, Henrik K wrote:
On Fri, Oct 17, 2008 at 10:24:21PM +0200, Henrik Nordstrom wrote:
On tor, 2008-10-16 at 12:02 +0300, Henrik K wrote:

Optimizing 1000 x "www.foo.bar/<randomstuff>" into a _single_
"www.foo.bar/(r(egex|and(om)?)|fuba[rz])" regex is nowhere near linear.
Even if it's all random servers, there are only ~30 characters from which
branches can be created.
Right.
It would be interesting to see how 50k dstdomain entries compare to 50k host
patterns merged into a single dstdom_regex pattern in terms of CPU
usage. A little tweaking of Squid is probably needed to support such
large patterns, but that's trivial. (The squid.conf parser is limited to
4096 characters per line, including folding.)
Not sure what the splay code does in Squid, didn't have time to grab it.
But a simple test with Perl:

- Grepped some hostnames from web logs etc.
- Regexp::Assemble'd 50000 unique hostnames (= 560 kB regex, took 22 sec)
- Ran 100000 hostnames against it in 4 seconds (25000 hosts/sec on a 2.8 GHz CPU)

It's pretty powerful stuff.
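
For anyone who wants to try the idea without Perl, here is a minimal sketch in Python. It is not Regexp::Assemble and does not claim to reproduce that module's algorithm; it simply merges hostnames into a character trie and emits one pattern with shared prefixes, which is roughly the effect described above. The function names build_trie and trie_to_regex are made up for this illustration.

```python
import re

def build_trie(words):
    """Merge words into a nested-dict character trie (hypothetical helper)."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks "a word ends here"
    return trie

def trie_to_regex(node):
    """Turn the trie into a single regex that shares common prefixes."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(k for k in node if k != "")
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, everything after it is optional.
    return body + "?" if ends_here else body

hosts = ["www.example.com", "www.example.org", "mail.example.com"]
merged = re.compile("^" + trie_to_regex(build_trie(hosts)) + "$")

print(trie_to_regex(build_trie(["foo", "foobar", "fubar"])))
print(bool(merged.match("www.example.org")))
```

The first print shows how three words collapse into one branching pattern; timing a 50000-host list the same way would need a real hostname sample, so no numbers are claimed here.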

Oops, I did it slightly wrong.

Doing it correctly, with ^hostname$ instead of a plain hostname in the regex,
takes 1.2 seconds; that's 80000+ hosts/sec.
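
A small Python sketch of what the anchoring changes, under the assumption that the test searched each hostname against the pattern. Without ^ and $ the pattern can match anywhere inside the subject, so it both accepts hostnames that merely contain a listed name and, on a miss, typically has to retry from every offset; that retrying is a plausible (not measured here) reason for the speed difference. The hostnames are placeholders.

```python
import re

# Two listed hostnames, written out by hand for clarity.
alternation = r"(?:www\.example\.com|mail\.example\.org)"

unanchored = re.compile(alternation)
anchored = re.compile("^" + alternation + "$")

tricky = "xwww.example.com.evil.test"

# Unanchored: finds the listed name buried inside a different hostname.
print(bool(unanchored.search(tricky)))

# Anchored: only an exact hostname matches.
print(bool(anchored.search(tricky)))
print(bool(anchored.search("www.example.com")))
```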


Still out slightly. A fair test of that against the Squid splay tree would still leave out the ^, so that the pattern matches any given *.example.com$
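
A rough Python sketch of that kind of suffix match, assuming the goal is to mimic a dstdomain entry such as .example.com, which covers the domain itself and any subdomain. The pattern is anchored only at the end, and the leading (?:^|\.) is an addition of this sketch to stop "notexample.com" from matching; suffix_pattern is a hypothetical helper name.

```python
import re

def suffix_pattern(domains):
    """Build one end-anchored pattern matching each domain and its subdomains."""
    alternation = "|".join(re.escape(d) for d in sorted(domains))
    # Match at the start of the host or right after a label-separating dot.
    return re.compile(r"(?:^|\.)(?:" + alternation + r")$")

pattern = suffix_pattern(["example.com", "example.org"])

for host in ["example.com", "www.example.com", "a.b.example.org",
             "notexample.com", "example.com.evil.test"]:
    print(host, bool(pattern.search(host)))
```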


Amos
--
Please use Squid 2.7.STABLE4 or 3.0.STABLE9
