>> It seems like all you have to do to get around the &nbsp; etc. problem
>> is to wait a little longer before applying the fixup -- allow the
>> semicolon to match in the hostname search and then strip it out.
>
> My bad.. I guess the plugin currently only fixes up '&#\d\d\d' encoding,
> not &nbsp; etc. Maybe I'll work on that...
This seems to be a big improvement, at least on the 3 million lines of
random traffic I tested with, and it's a smaller patch:
--- uribl.orig 2010-07-23 17:06:10.894320796 -0500
+++ uribl 2010-07-23 19:57:39.304321519 -0500
@@ -289,7 +289,13 @@
# Undo URI escape munging
$l =~ s/[=%]([0-9A-Fa-f]{2,2})/chr(hex($1))/ge;
# Undo HTML entity munging (e.g. in parameterized redirects)
- $l =~ s/&#(\d{2,3});?/chr($1)/ge;
+ $l =~ s/&#(\d{2,4});?/chr($1)/ge;
+ # Un-encode a few common important named entities and discard the rest
+ $l =~ s/&nbsp;/ /go;
+ $l =~ s/&amp;/&/go;
+ $l =~ s/&gt;/>/go;
+ $l =~ s/&lt;/</go;
+ $l =~ s/&\w{2,6};//go;
# Dodge inserted-semicolon munging
$l =~ tr/;//d;
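
For anyone who wants to try the decoding order outside the plugin, here is a rough standalone sketch. The `demunge` helper name and the sample lines are made up for illustration and are not part of uribl; the body just repeats the substitutions from the patched hunk in the same order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the uribl plugin): applies the same
# substitutions as the patched hunk, in the same order.
sub demunge {
    my ($l) = @_;
    # Undo URI escape munging
    $l =~ s/[=%]([0-9A-Fa-f]{2,2})/chr(hex($1))/ge;
    # Undo HTML numeric entity munging (semicolon optional)
    $l =~ s/&#(\d{2,4});?/chr($1)/ge;
    # Un-encode a few common named entities and discard the rest
    $l =~ s/&nbsp;/ /g;
    $l =~ s/&amp;/&/g;
    $l =~ s/&gt;/>/g;
    $l =~ s/&lt;/</g;
    $l =~ s/&\w{2,6};//g;
    # Dodge inserted-semicolon munging
    $l =~ tr/;//d;
    return $l;
}

# Made-up sample lines, for illustration only
for my $in ('http://www.example&#46;com/',
            'http://www.example.com&nbsp;rest',
            'http://www.exa&shy;mple.com/?a=1&amp;b=2') {
    print demunge($in), "\n";
}
```

With these inputs the numeric entity becomes a dot, `&nbsp;` becomes a plain space, `&amp;` becomes `&`, and the unrecognized `&shy;` is dropped, so the hostname comes out intact each time.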