Theo Van Dinter wrote:
On Thu, Jul 22, 2004 at 12:09:14AM +0200, Jesse Houwing wrote:
No, it looks for any URI with a = in the hostname (and excludes the quoted-printable =), so:
This is the rule in question:
uri SARE_URI_EQUALS
m{^(?:(?:h|%[46]8)(?:t|%[57]4){2}(?:p|%[57]0)(?:s|%[57]3)?(?::|%3a)?(?:%5c|\\|%2f|/){0,2})[^/\?;]+=(?!(?:..)?$).*}i
Hrm. I have no idea what this is actually trying to
match. The first (?: bit isn't necessary, btw. Looks like a
URL with a = somewhere in the host section? ie: something like
'http://penistone=2eopoloveok=2ecom/3/' in a quoted-printable part?
(this is the only set of matches I could find with your RE)
http://www.iamahost=butthisismyrealname.com/ would match.
http://www.butthisismyreal= would not, and neither would http://www.butthisismyreal=20
This is an Internet Explorer parsing bug I'm trying to detect here, and it is abused quite often in spam. Any chars before the = sign are discarded and the hostname after the = is used instead, but to the user the host before the = is shown (nifty).
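To make the matching behaviour concrete, here is a small Python sketch that runs the SARE_URI_EQUALS pattern against the URLs mentioned in this thread. The pattern is copied from the rule above. The helper name matches_uri_equals is made up for illustration, and Python's re module stands in for the Perl regex engine that SpamAssassin actually uses (the pattern does not appear to rely on any Perl-only syntax).

```python
import re

# Pattern copied from the SARE_URI_EQUALS rule; Perl's /i becomes re.IGNORECASE.
SARE_URI_EQUALS = re.compile(
    r"^(?:(?:h|%[46]8)(?:t|%[57]4){2}(?:p|%[57]0)(?:s|%[57]3)?(?::|%3a)?"
    r"(?:%5c|\\|%2f|/){0,2})[^/\?;]+=(?!(?:..)?$).*",
    re.IGNORECASE,
)


def matches_uri_equals(uri: str) -> bool:
    """Hypothetical helper: True if the rule's pattern matches the URI."""
    return SARE_URI_EQUALS.search(uri) is not None


if __name__ == "__main__":
    samples = [
        "http://www.iamahost=butthisismyrealname.com/",  # = inside the host
        "http://www.butthisismyreal=",                   # trailing QP soft break
        "http://www.butthisismyreal=20",                 # trailing QP escape
        "http://penistone=2eopoloveok=2ecom/3/",         # undecoded QP (2.6 view)
        "http://penistone.opoloveok.com/3/",             # decoded QP (3.0 view)
    ]
    for uri in samples:
        print(matches_uri_equals(uri), uri)
```

The negative lookahead (?!(?:..)?$) is what excludes a = sitting at the very end of the URI or followed by exactly two characters, which is how leftover quoted-printable at the end of a line tends to look.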
But it seems to do it too harshly. I'll try to find an example from my corpus that should be tagged but isn't in this case.
If not, please post an example and I'll be happy to help debug. (I don't think this is a 3.0 bug though. See below.)
If so, however: yeah, that'll be different. In 2.6:
http://penistone=2eopoloveok=2ecom/3/
vs 3.0:
http://penistone.opoloveok.com/3/
which is caused by 2.6 doing a very half-assed attempt at decoding the quoted-printable part, so you get the QP bits in the URI. 3.0 does the decoding properly (thanks total MIME parser rewrite!), so you end up with the URI you're supposed to get, properly decoded.
Specifically, in PerMsgStatus::get_decoded_body_text_array(), which 2.6x uses to get the uri list from, the un-quoted-printable code is:
s/\=([0-9A-F]{2})/chr(hex($1))/ge;
which clearly has one flaw: it's looking for case-sensitive A-F! D'oh! Therefore, it doesn't match the URI above (uses lowercase). 3.0 does the right thing here. :)
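The case-sensitivity problem is easy to reproduce outside SpamAssassin. The following is a Python sketch only: decode_qp_26 mirrors the substitution quoted above, and decode_qp_fixed is the same substitution with case-insensitive hex digits. Both function names are invented for this example, and the second one is not the 3.0 code, which goes through the rewritten MIME parser instead.

```python
import re


def _hex_to_char(match: re.Match) -> str:
    # Equivalent of Perl's chr(hex($1)).
    return chr(int(match.group(1), 16))


def decode_qp_26(text: str) -> str:
    """Mirror of s/\\=([0-9A-F]{2})/chr(hex($1))/ge (uppercase hex only)."""
    return re.sub(r"=([0-9A-F]{2})", _hex_to_char, text)


def decode_qp_fixed(text: str) -> str:
    """Same substitution, but accepting lowercase hex digits too."""
    return re.sub(r"=([0-9A-Fa-f]{2})", _hex_to_char, text)


if __name__ == "__main__":
    uri = "http://penistone=2eopoloveok=2ecom/3/"
    print(decode_qp_26(uri))     # lowercase =2e is left untouched
    print(decode_qp_fixed(uri))  # =2e becomes "."
```

With the uppercase-only version the =2e escapes survive, so the URI still contains a = in the host and the rule fires. With the case-insensitive version the URI comes out as http://penistone.opoloveok.com/3/ and the rule no longer has anything to match.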
Jesse
