>...
>Is foo.tld=bar a valid hostname part in a URI? I doubt that. now, would
>a MUA show that as a URI followed by "bar"?
>
>I think that SA should provide an option to enable/disable:
>uri_broken_mua, so that people not caring for "broken" MUAs can avoid
>such false positives.
>
How about the case of "http=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2F" inside of
HTML, i.e. http://www.cnn.com/2003/? This is from a "phishing spam"; the
full line was:

=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E

which itself was a continuation of a previous line.

If you allow for more than just ASCII or UTF-8, there are quite a few
"words" that can be built from the first six letters of the alphabet, and
a much greater number if we include "elite-speak". The above example need
not have been a "phish" using cnn.com; it could just as easily have been a
spamvertised domain or valid non-spam HTML.

Unfortunately, MUAs accepting non-standard (i.e. illegal) HTML constructs
is the most common case (e.g. Outlook and OE, as well as many more MUAs
which *need* to be able to read the same emails under MS Win*). And still
more cases of URIs exist which are not parsed by SA, but can have
constructs like these with embedded domain names (e.g. "Message-ID:"
lines). Life would be much easier if all URIs were contained within '<'
and '>' (as at least one "standard" requires).

The problem is that sometimes '=' is a word break, and sometimes it is
used as either a continuation or a meta-character. Find a rule with a very
good rate at disambiguating these cases (for example, an '=' as the final
character on a line can almost never be ignored) and file a Bugzilla; I'm
sure the developers would at least look at whatever you come up with.
Remember to also handle '%', '#' and '$' while you're at it :-)

Paul Shupak
[EMAIL PROTECTED]
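
P.S. To make the '=' ambiguity concrete, here is a minimal sketch in Python (not SA's Perl, and not how SA actually parses URIs). The function name `decode_qp` is hypothetical, and the decoder is deliberately naive: it treats every '=' followed by two hex digits as an escape and every '=' at end of line as a continuation, which is exactly the guess a lenient MUA might make.

```python
import re

# '=' followed by exactly two hex digits is treated as an encoded byte.
_HEX_ESCAPE = re.compile(r"=([0-9A-Fa-f]{2})")


def decode_qp(text: str) -> str:
    """Naively decode quoted-printable text (illustrative only)."""
    # An '=' as the last character on a line is a soft line break:
    # remove it together with the newline so the two lines are joined.
    text = text.replace("=\r\n", "").replace("=\n", "")
    # Replace each remaining =XX escape with the character it encodes.
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


phish = ("=3Chttp=3A=2F=2Fwww=2Ecnn=2Ecom=2F2003=2FWORLD=2Fafrica"
         "=2F07=2F20=2Fkenya=2Ecrash=2Findex=2Ehtml=3E")

# The encoded line from the spam decodes to a bracketed URI:
print(decode_qp(phish))
# <http://www.cnn.com/2003/WORLD/africa/07/20/kenya.crash/index.html>

# "ba" happens to be valid hex, so "=ba" is consumed as one byte (0xBA)
# and only the "r" of "bar" survives as a literal character.
print(repr(decode_qp("foo.tld=bar")))

# A trailing '=' joins a hostname that was split across two lines.
print(decode_qp("www.exam=\nple.com"))
```

The second call is the interesting one: whether "foo.tld=bar" is a hostname followed by "bar", or "foo.tld" followed by one encoded byte and an "r", depends entirely on which reading the decoder picks.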