Hi Keith,

Thanks for your time and knowledge! The explanation of your rule contraction is excellent and really helped me understand the process much better.
This exercise has been a very good learning experience for me, thanks to Jennifer Wheeler. She authored a set of rules (http://spamhammers.nxtek.net) to catch word obfuscation by HTML tags, script tags, and HTML encoding. I have been trying to reduce the number of rules and find possible holes. I think that all who have contributed to this thread deserve credit for the ground gained. Your insight has really helped!

Thanks again,
Larry

> -----Original Message-----
> From: Keith C. Ivey
> Sent: Saturday, October 11, 2003 6:02 PM
> To: [EMAIL PROTECTED]
> Subject: RE: [SAtalk] Popcorn, Backhair, and Weeds
>
>
> Larry Gilson <[EMAIL PROTECTED]> wrote:
>
> > I had the following HTML tag OBFU rule (variant of yours):
> > /(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)/
>
> There's a lot of clutter in that that makes it harder to
> follow. Let's try paring it down. First, '<' and '>' are not
> special on their own in regexes, so there's no need to
> backslash them:
>
> /(>|\s)\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?(\s|\W|<)/
>
> When you have an alternation -- something like '(a|b|c)' --
> where all the alternatives are single characters, it's better
> to write it as a character class -- something like '[abc]'.
> Also, '\s' and '<' are both included in '\W', so that last
> alternation is equivalent to just '\W':
>
> /[>\s]\w{1,5}?<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}?\W/
>
> Now, nongreedy matching serves no purpose when the thing
> following it can't be matched by the thing being repeated. In
> this case you have '\w{1,5}?' followed by '<', but '<' can't
> match '\w', so there's no difference between greedy and
> nongreedy matching there. The matching for the series of '\w'
> characters has to go all the way to the '<' -- it can't stop
> short. Similarly, the '\W' at the end can never match the '\w'
> preceding it, so that '?' is also pointless:
>
> /[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W/
>
> That regex is equivalent to your original one, and may help you
> see better why it's not matching as you expect. It's looking
> for
>
>   a '>' or whitespace character (space, tab, carriage return,
>     line feed, form feed),
>   followed by 1 to 5 word characters (letters, numbers, and
>     underscores),
>   followed by '<',
>   followed by an optional '/',
>   followed by an optional single whitespace character,
>   followed by 6 to 150 word or whitespace characters,
>   followed by an optional '/',
>   followed by an optional single whitespace character,
>   followed by '>',
>   followed by 1 to 7 word characters,
>   followed by a nonword character (anything other than
>     letters, numbers, and underscore).
>
> I'm not clear on what you want to match, but that's probably
> not it.
>
> --
> Keith C. Ivey <[EMAIL PROTECTED]>
> Washington, DC

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
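
For anyone following the thread who wants to spot-check the contraction Keith describes, here is a minimal sketch in Python. It is a stand-in only: the rules themselves are Perl-style regexes, and Python's `re` module happens to accept both patterns unchanged. The `re.ASCII` flag is my assumption for keeping `\w` and `\s` limited to ASCII, and the sample strings are made up for illustration, not taken from real spam. Agreement on a handful of samples is a sanity check, not a proof of equivalence.

```python
import re

# Larry's original rule and Keith's final pared-down version,
# copied from the message above (without the surrounding slashes).
ORIGINAL = r"(\>|\s)\w{1,5}?\<\/?\s?[\w\s]{6,150}\/?\s?\>\w{1,7}?(\s|\W|\<)"
SIMPLIFIED = r"[>\s]\w{1,5}<\/?\s?[\w\s]{6,150}\/?\s?>\w{1,7}\W"

# Hypothetical test strings, invented for this sketch.
SAMPLES = [
    " via<font color>gra ",        # word split by a fake tag
    ">che<bogus tag>ap!",          # starts with '>' and ends with punctuation
    " via<b>gra ",                 # tag body shorter than 6 characters
    " toolongword<font color>x ",  # more than 5 word characters before '<'
    "via<font color>gra ",         # no leading '>' or whitespace
    " via<font color>gra",         # no trailing nonword character
]


def match_span(pattern, text):
    """Return the (start, end) span of the first match, or None."""
    m = re.search(pattern, text, re.ASCII)
    return m.span() if m else None


# Collect the span found by each pattern for every sample.
results = [(s, match_span(ORIGINAL, s), match_span(SIMPLIFIED, s)) for s in SAMPLES]

for sample, original_span, simplified_span in results:
    status = "same" if original_span == simplified_span else "DIFFERENT"
    print(f"{sample!r:32} {original_span} {simplified_span} {status}")
```

Comparing spans rather than just match/no-match also checks the claim about the nongreedy quantifiers: if dropping the '?' changed how far a match extends, the spans would differ.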