"Paul A. Rombouts" <[EMAIL PROTECTED]> writes:
> I've been studying parts of the wwwoffle source code the last few weeks. I
> greatly admire the overall design of WWWOFFLE, but I find some of the details
> could be improved.
> One of these features I'd like to discuss here is wildcard pattern matching.
> I was a bit annoyed by the 2 '*' limit imposed by WWWOFFLE, so I had a look at
> the code to see if it could be improved. I thought up an algorithm that was more
> general (the number of *s allowed is practically unlimited), more compact and
> somewhat faster. Actually using this rewritten pattern matching function
> was a bit more involved that I had anticipated at first, because I used a
> different representation of wildcard patterns (not one simple string but a
> sequence of strings). I had to make changes in all the places where wildcard
> patterns are used, but I eventually managed to pull it off.
When I code I use the principle of Keeping It Simple. This doesn't
always mean that the shortest and fastest piece of code is always the
best. There are times that taking a black box approach will win even
though the code is longer and slower. If you need to change every
place where the wildcard pattern is used then perhaps it is not simple
enough.
The algorithm that I use may seem long, but the path through the code
is without complex loops and with only a few conditional statements.
I hope that it would be obvious to anybody that read it how it worked
and that it did work correctly.
> My version still tries every pattern in sequence, just as the original WWWOFFLE
> code. To do truly fast pattern matching you would probably need to translate a
> sequence of patterns into a finite state machine, but I doubt this is worth the
> effort.
Using the principle of Keeping It Simple I agree that this is never
likely to be worth doing for a program like this. In a program that
only does pattern matching it would be an interesting (if complex)
optimisation.
> Nevertheless, I think my version will still be interesting for people with a
> large number of patterns in their configuration file (particularly the DonGet
> section), or who find the 2 '*' limit annoying.
I have had a look at my implementation and made a few optimisations.
In particular I have removed the memory allocation which will increase
the speed.
I personally can't think of a good reason or example URL that would
require more than 2 '*' to define it except for very weird cases (for
example "*.*.*.foo.com" to exclude all hostnames with 5 parts to them
but not those with 4 parts which "*.*.foo.com" also matches).
The 2 '*' are limited to each of the host part and path part of the
URL-SPECIFICATION, you can have a total of more than 2 '*' in the URL
in total.
> If someone would like to try out my version, just ask me nicely and I will send
> you a patch file.
Does the patch work with version 2.7-beta? There has been a lot of
change of the configuration file reading, especially the UrlSpec data
type.
I for one am interested to see the algorithm.
--
Andrew.
----------------------------------------------------------------------
Andrew M. Bishop [EMAIL PROTECTED]
http://www.gedanken.demon.co.uk/
WWWOFFLE users page:
http://www.gedanken.demon.co.uk/wwwoffle/version-2.6/user.html