Chris Santerre wrote:

m{<td([^>"]+|"[^"]*")*>(<([^>"]+|"[^"]*")*>)*[a-z]{1,2}(<([^>"]+|"[^"]*")*>)*</td([^>"]+|"[^"]*")*>}i



The other problem with the pattern as written (with no *) is that the subpatterns don't match plain <td> or </td>, since they require at least one character between the td and the >.


One of the things the SARE group has realized, is that using '*' in any
regex is a bad idea. Trust me on that one. We avoid it like the plague.

I'm sure that '*' causes problems in certain contexts, but a blanket prohibition on it seems excessive. I know it's particularly problematic when applied to something that can match an empty string, but that's not the case here (and the problem would apply just as much with '+'). Actually the real danger with '*' is probably having it in a context where another '*' or a '+' applies to it -- something like '([^"]*|"")*', which should be '([^"]+|"")*'.
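
As a rough illustration of that last point, here is a minimal sketch using Python's re module (Python rather than Perl is just a convenience, and the variable names are made up for the example). It suggests that the two patterns accept the same strings, and that the only difference is whether the body of the outer loop can match the empty string. How a given engine copes with an empty-matching loop body varies, so treat this as a sketch, not a statement about Perl's behaviour.

```python
import re

# The two patterns from the paragraph above, unchanged.
nested_star = re.compile(r'([^"]*|"")*')     # '*' applied to something with '*' inside
star_over_plus = re.compile(r'([^"]+|"")*')  # suggested replacement

# The loop bodies on their own, to see which one can match nothing at all.
body_with_star = re.compile(r'[^"]*|""')
body_with_plus = re.compile(r'[^"]+|""')

samples = ['', 'ab', 'ab""cd', 'ab"cd']
for s in samples:
    same = bool(nested_star.fullmatch(s)) == bool(star_over_plus.fullmatch(s))
    print(repr(s), bool(star_over_plus.fullmatch(s)), 'same result:', same)

# Only the first loop body matches the empty string.
print(bool(body_with_star.fullmatch('')), bool(body_with_plus.fullmatch('')))
```

So on these samples nothing is lost by using '+' inside the group; the outer '*' still allows zero repetitions.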


The worst effect of avoiding '*' in this case is that the original regex contains '</td([^>]+|"[^"]+)>', which doesn't match plain '</td>', which is surely going to be the most common way to close a table cell.
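
This is easy to check. A minimal sketch in Python's re module follows; the use of Python instead of Perl, the case-insensitive flag on the original subpattern, and the helper name `matches` are assumptions made for the example.

```python
import re

# Closing-tag subpattern quoted above from the original rule (no '*').
original_close = re.compile(r'</td([^>]+|"[^"]+)>', re.I)
# Closing-tag subpattern from the regex at the top of this message.
starred_close = re.compile(r'</td([^>"]+|"[^"]*")*>', re.I)

def matches(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None

for s in ['</td>', '</TD>', '</td >']:
    print(repr(s), matches(original_close, s), matches(starred_close, s))
```

The original subpattern only matches the last sample, where a space sits between the 'td' and the '>'; the starred version matches all three.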

In any case, as originally written the '|"[^"]+' part of the regex is useless unless spammers are really using things like

   <td"some text>

It doesn't match things like

   <td foo=">">

which is apparently what was intended, otherwise there wouldn't be much point in not just leaving it at plain '[^>]+'.
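
A small sketch of the difference, again in Python's re module as a convenience. The opening-tag subpattern below is an assumption: it is written by analogy with the closing-tag one quoted above, since the original rule is not reproduced in full here. The helper name `matched_text` is also made up for the example.

```python
import re

# ASSUMED form of the original opening-tag subpattern, by analogy with the
# closing-tag subpattern quoted above.
original_open = re.compile(r'<td([^>]+|"[^"]+)>', re.I)
# Opening-tag subpattern from the regex at the top of this message.
quote_aware_open = re.compile(r'<td([^>"]+|"[^"]*")*>', re.I)

def matched_text(pattern, text):
    """Return the text matched by the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

tag = '<td foo=">">'
print(matched_text(original_open, tag))     # stops at the '>' inside the quotes
print(matched_text(quote_aware_open, tag))  # covers the whole tag

odd = '<td"some text>'
print(matched_text(original_open, odd))
print(matched_text(quote_aware_open, odd))
```

Under that assumption, the original subpattern ends its match at the quoted '>' and never sees the real end of the tag, while the quote-aware version treats "..." as a unit and matches the tag in full.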

--
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC
