Chris Santerre wrote:
m{<td([^>"]+|"[^"]*")*>(<([^>"]+|"[^"]*")*>)*[a-z]{1,2}(<([^>"]+|"[^"]*")*>)*</td([^>"]+|"[^"]*")*>}i
The other problem with the pattern as written (with no *) is that the subpatterns don't match plain <td> or </td>, since they require at least one character between the td and the >.
One of the things the SARE group has realized, is that using '*' in any regex is a bad idea. Trust me on that one. We avoid it like the plague.
I'm sure that '*' causes problems in certain contexts, but a blanket prohibition on it seems excessive. I know it's particularly problematic when applied to something that can match an empty string, but that's not the case here (and the problem would apply just as much with '+'). Actually, the real danger with '*' is probably having it in a context where another '*' or a '+' applies to it -- something like '([^"]*|"")*', which should be '([^"]+|"")*'.
The worst effect of avoiding '*' in this case is that the original regex contains '</td([^>]+|"[^"]+)>', which doesn't match plain '</td>', which is surely going to be the most common way to close a table cell.
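This is easy to check. Here is a rough sketch in Python rather than Perl (the `re` syntax for these constructs is the same). The names `no_star` and `with_star` are mine; the first pattern is the closing-tag piece quoted above, and the second is the '*' version of the same piece.

```python
import re

# Closing-tag subpattern from the original rule (written without '*').
no_star = re.compile(r'</td([^>]+|"[^"]+)>', re.I)
# Closing-tag subpattern from the '*' version of the regex.
with_star = re.compile(r'</td([^>"]+|"[^"]*")*>', re.I)

# Print whether each closing tag is matched in full by each pattern.
for tag in ['</td>', '</td >', '</TD>']:
    print(tag, bool(no_star.fullmatch(tag)), bool(with_star.fullmatch(tag)))
```

The pattern without '*' only accepts a closing tag that has at least one character between the 'td' and the '>', so the plain '</td>' is rejected.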
In any case, as originally written the '|"[^"]+' part of the regex is useless unless spammers are really using things like
<td"some text>
It doesn't match things like
<td foo=">">
which is apparently what was intended, otherwise there wouldn't be much point in not just leaving it at plain '[^>]+'.
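The two examples can be tried out in the same way. This is only a sketch in Python, and it assumes the original opening-tag subpattern has the same form as the closing-tag one quoted earlier, which is a guess on my part. The names `original`, `plain`, and `revised` are just labels for the comparison.

```python
import re

# ASSUMED form of the original opening-tag subpattern (mirrors the closing one).
original = re.compile(r'<td([^>]+|"[^"]+)>', re.I)
# The same subpattern with the '|"[^"]+' alternative dropped.
plain = re.compile(r'<td[^>]+>', re.I)
# Opening-tag subpattern from the '*' version of the regex.
revised = re.compile(r'<td([^>"]+|"[^"]*")*>', re.I)

samples = ['<td"some text>', '<td foo=">">', '<td>']
for s in samples:
    # One True/False per pattern: does it match the whole tag?
    print(s, [bool(p.fullmatch(s)) for p in (original, plain, revised)])
```

On these samples `original` and `plain` give the same answers, and only `revised` accepts a tag with a quoted '>' in an attribute value.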
-- Keith C. Ivey <[EMAIL PROTECTED]> Washington, DC