>...
>Is foo.tld=bar a valid hostname part in a URI? I doubt that. now, would
>a MUA show that as a URI followed by "bar"?
>
>I think that SA should provide an option to enable/disable:
>uri_broken_mua, so that people not caring for "broken" MUAs can avoid
>such false positives.
>

        How about the case of "http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F"
inside of HTML?   i.e. http://www.cnn.com/2003/ - from a "phishing spam",
the full line was:

=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E

which itself was a continuation of a previous line.  If you allow for more
than just ASCII or UTF-8, there are quite a few "words" that can be built
from the first six letters of the alphabet (the hex digits A-F) - and many more if
we include "elite-speak".  The above example need not have been a "phish"
using cnn.com, but just as easily could have been a spamvertised domain or
have been valid non-spam HTML.  Unfortunately the case of MUAs accepting
non-standard (i.e. illegal) HTML constructs is the most common case (e.g.
Outlook and OE as well as many more MUAs which *need* to be able to read
the same emails under MS Win*).  And still more cases of URIs exist, which
are not parsed by SA, but can have constructs like these with embedded
domain names (e.g. "Message-ID:" lines).  Life would be much easier if all
URIs were contained within '<' and '>' (as at least one "standard" requires).
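
        To make the example above concrete, here is a minimal sketch that
decodes it with Python's standard quopri module.  This is only an
illustration of the encoding, not of how SA parses URIs; the variable
names are mine.  The second call shows why "foo.tld=bar" is ambiguous:
"=ba" happens to be a valid hex escape.

```python
import quopri

# The full line quoted above, as raw quoted-printable bytes.
encoded = (b"=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica"
           b"=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E")

decoded = quopri.decodestring(encoded).decode("ascii")
print(decoded)
# <http://www.cnn.com/2003/WORLD/africa/07/20/kenya.crash/index.html>

# "=ba" is two hex digits, so a QP decoder turns it into the byte 0xBA
# and leaves only the "r" behind.
ambiguous = quopri.decodestring(b"foo.tld=bar")
print(ambiguous)
```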

        The problem is that sometimes '=' is a word break, and sometimes
it is used as either a continuation or a meta-character.  Find a rule with
a very good success rate at disambiguating these cases (for example, an '=' as
the final character on a line can probably almost never be ignored), and file
a Bugzilla;  I'm sure the developers would at least look at whatever you
come up with.  Remember to also handle '%', '#' and '$' while you're at
it:-)
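
        As a starting point only, a rough sketch of such a rule might
look like the following.  The function name and the three labels are
hypothetical, it knows nothing about SA's internals, and it does not
touch '%', '#' or '$'.  It treats a trailing '=' as a soft line break,
'=' followed by two hex digits as an escape, and anything else as a
literal (a possible word break).

```python
import re

# '=' followed by exactly two hex digits, e.g. "=2E" or "=ba".
HEX_ESCAPE = re.compile(r"=[0-9A-Fa-f]{2}")

def classify_equals(line):
    """Return (index, kind) for every '=' in one line of text."""
    text = line.rstrip("\r\n")
    results = []
    i = 0
    while i < len(text):
        if text[i] != "=":
            i += 1
        elif i == len(text) - 1:
            # Last character on the line: a continuation marker.
            results.append((i, "soft-break"))
            i += 1
        elif HEX_ESCAPE.match(text, i):
            # Looks like an encoded byte; skip past both hex digits.
            results.append((i, "escape"))
            i += 3
        else:
            # Not followed by two hex digits: treat as a plain '='.
            results.append((i, "literal"))
            i += 1
    return results

print(classify_equals("www=2Ecnn=2Ecom="))
print(classify_equals("foo.tld=bar"))
```

Note that "foo.tld=bar" still comes out as an escape here, which is
exactly the false-positive case being discussed; a real rule would need
more context than one line.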


        Paul Shupak
        [EMAIL PROTECTED]
