Hi Keith,

Thanks for your time and knowledge!  Your explanation of how to contract
the rule is excellent and really helped me understand the process much
better.

This exercise has been a very good learning experience for me thanks to
Jennifer Wheeler.  She authored a set of rules
(http://spamhammers.nxtek.net) to catch word obfuscation by HTML tags,
script tags, and HTML encoding.  I have been trying to reduce the number of
rules and find possible holes.  I think that all who have contributed to
this thread deserve credit for the ground gained.  Your insight has really
helped!

Thanks Again,
Larry



> -----Original Message-----
> From: Keith C. Ivey
> Sent: Saturday, October 11, 2003 6:02 PM
> To: [EMAIL PROTECTED]
> Subject: RE: [SAtalk] Popcorn, Backhair, and Weeds
> 
> 
> Larry Gilson <[EMAIL PROTECTED]> wrote:
> 
> > I had the following HTML tag OBFU rule (variant of yours):
> >   /(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)/
> 
> There's a lot of clutter in that that makes it harder to 
> follow.  Let's try paring it down.  First, '<' and '>' are not 
> special on their own in regexes, so there's no need to 
> backslash them:
> 
> /(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)/
> 
> When you have an alternation -- something like '(a|b|c)' -- 
> where all the alternatives are single characters, it's better 
> to write it as a character class -- something like '[abc]'.  
> Also, '\s' and '<' are both included in '\W', so that last 
> alternation is equivalent to just '\W':
> 
> /[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W/
> 
> Now, nongreedy matching serves no purpose when the thing 
> following it can't be matched by the thing being repeated.  In 
> this case you have '\w{1,5}?' followed by '<', but '<' can't 
> match '\w', so there's no difference between greedy and 
> nongreedy matching there.  The matching for the series of '\w' 
> characters has to go all the way to the '<' -- it can't stop 
> short.  Similarly, the '\W' at the end can never match the '\w' 
> preceding it, so that '?' is also pointless:
> 
> /[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W/
> 
> That regex is equivalent to your original one, and may help you 
> see better why it's not matching as you expect.  It's looking 
> for
> 
>    a '>' or whitespace character (space, tab, carriage return,
>       line feed, form feed),
>    followed by 1 to 5 word characters (letters, numbers, and
>       underscores),
>    followed by '<',
>    followed by an optional '/',
>    followed by an optional single whitespace character,
>    followed by 6 to 150 word or whitespace characters,
>    followed by an optional '/',
>    followed by an optional single whitespace character,
>    followed by '>',
>    followed by 1 to 7 word characters,
>    followed by a nonword character (anything other than
>       letters, numbers, and underscore).
> 
> I'm not clear on what you want to match, but that's probably 
> not it.
> 
> -- 
> Keith C. Ivey <[EMAIL PROTECTED]>
> Washington, DC
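
For anyone who wants to check the simplification steps quoted above, here is
a minimal sketch. It assumes Python's re module as a stand-in for the Perl
regex engine that SpamAssassin uses (the two should agree on these constructs
for plain ASCII input), and the sample strings are made up for illustration;
they are not taken from real spam or from the rule set mentioned above.

```python
import re

# The four stages of the rule, copied from the quoted message, plus an
# annotated form of the last stage that mirrors the step-by-step description.
VARIANTS = [
    re.compile(r"(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)"),
    re.compile(r"(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)"),
    re.compile(r"[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W"),
    re.compile(r"[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W"),
    re.compile(r"""
        [>\s]            # a '>' or one whitespace character
        \w{1,5}          # 1 to 5 word characters
        <                # a literal '<'
        /?               # an optional '/'
        \s?              # an optional single whitespace character
        [\w\s]{6,150}    # 6 to 150 word or whitespace characters
        /?               # an optional '/'
        \s?              # an optional single whitespace character
        >                # a literal '>'
        \w{1,7}          # 1 to 7 word characters
        \W               # one nonword character
    """, re.VERBOSE),
]

# Hypothetical test strings (assumptions, for illustration only).
SAMPLES = [
    " vi<font color red>agra ",     # tag body is only word/space characters
    ' vi<font color="red">agra ',   # '=' and '"' are not in [\w\s]
    " vi<b>agra ",                  # tag body is shorter than 6 characters
    " viagra ",                     # no tag at all
]


def spans(text):
    """Return the match span (or None) for each variant, in order."""
    result = []
    for pattern in VARIANTS:
        m = pattern.search(text)
        result.append(m.span() if m else None)
    return result


if __name__ == "__main__":
    for sample in SAMPLES:
        found = spans(sample)
        # Every variant should report the same span if the rewrites are
        # really equivalent.
        status = "agree" if len(set(found)) == 1 else "DISAGREE"
        print(f"{sample!r}: {found[0]} ({status})")
```

On these samples all five forms agree, and only the first string matches.
That lines up with Keith's point: the middle part accepts nothing but word
and whitespace characters, so a tag with quoted attributes, or a short tag
like <b>, slips past the rule.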


