Re: SARE_URI_EQUALS false positives

List Mail User Mon, 26 Dec 2005 16:07:13 -0800

>...
        Mouss,

>List Mail User wrote:
>>      updated.by - check http://www.tld.by/cgi-bin/registry.cgi
>> 
>>      You'll see that updated.by is a registered domain!  Therefore
>> "updated.by" is indeed a URI. QED
>
>the question is: if foo.example-DEMUNGED is listed in uribl/surbl, does
>that make it a bad string in mail?
>
>If it appears as http://somethin.foo.example-DEMUNGED, or even as a
>textual www.foo.example-DEMUNGED, we may consider it "risky"
>
>But if it appears as:
>       telnet smtp.foo.example-DEMUNGED
>or
>       Dec 26 23:41:53 bobo postfix/smtpd[7560]: connect from
>foo.example-DEMUNGED[192.0.2.56]
>...
>
>then checking *BLs is questionable. There are more chances to block
>someone reporting a spammy session or asking for help than seeing a
>spammer advertize his site via a log line...
>
>I believe this is the most important issue that uribl encounters: is the
>URI used to advertize or is it an example/report/...? if we solve this,
>we'll feel very happy.
>


        There are several parts to the answer, but the first and most
important part can be phrased as: barring a special case, yes, a spam
domain in mail is bad (period).

        Now, there are more than a few special cases.  One immediate
case is that no abuse@ email account should be doing content filtering.
Another obvious case is that any person or mailing list which discusses
spam needs to be whitelisted, set up to bypass filtering, or otherwise
configured so that it does not trip spam filters.  The case you listed of
an "example/report" would/should always come under these situations, but
there are still others.  If you file a complaint with any party about an
abuse situation, you should be prepared to have your own message quoted
back to you (this one has to include organizations like ICANN, InterNIC,
ARIN, RIPE, etc.).  If you discuss spam or abuse with another person or
on a list, again you should be prepared to be answered similarly (this
case I have been guilty of forgetting more than once).  There are still
more possible cases that can be hard to expect - e.g. I recently got
an email from a hosting service that I have locally BL'd which was sent
addressed to customers (I am *not* one), but which I was copied on (I have
spoken by telephone and email to the business' managers and staff on a few
occasions) - fortunately they sent it to an account which is only used for
certain types of complaints and communication, and which bypasses the BLs
at the MTA level (still hits SA).  Also, there are some companies/newsletters
which may be on quite a few BLs, but are solicited mail at my site, so they
*must* be whitelisted (at the MTA, in SA, in DCC, etc.).  If you accept
requests for help (with abuse issues, or even allowing such things), you
should either be using a dedicated account or be prepared to FP on the
emails.  (Yes, I know not everyone "controls" one or more domains or can
create special-purpose accounts trivially.)

        Even the simplest case of a bare domain name is clearly bad.  How
can you distinguish (without building/writing a natural language parser)
between saying "I got spam from example.com for ..." and
"Copy example.com into your browser to see our specials..."?  The second
format is fairly common in spam.  You could try to somehow score a bare
name differently, but then what if it is embedded in a scripting language
or HTML, or obfuscated with character translations (e.g. %45xample%2Eco%4D
or similar)?  This kind of style can still be "dangerous".
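
        A minimal Python sketch of the point about obfuscation, under
stated assumptions: LISTED is a hypothetical local set standing in for a
real URI blocklist lookup, and HOST_RE is a deliberately loose host
pattern.  Neither is part of SA or of any SURBL client.  The sketch
undoes %XX escapes before matching, and it returns the same hit for a
complaint as for an advertisement:

```python
import re
from urllib.parse import unquote

# Hypothetical local blocklist; a real setup would query a URI DNSBL.
LISTED = {"example.com"}

# Loose pattern for a dotted, host-like token. This is not a URI parser.
HOST_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")


def listed_domains(text):
    """Return listed domains found in text after undoing %XX escapes."""
    decoded = unquote(text).lower()
    found = {m.group(0) for m in HOST_RE.finditer(decoded)}
    return sorted(found & LISTED)


if __name__ == "__main__":
    print(listed_domains("I got spam from example.com for ..."))
    print(listed_domains("Copy %45xample%2Eco%4D into your browser"))
```

Decoding catches the obfuscated form, but both calls report the same
domain, so the lookup by itself says nothing about the intent of the
sender.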

        There are many examples of non-distributed rules (i.e. not part of
SA distributions) which conflict with common styles of email writing and
quoting (e.g. the SARE chicken-pox rules firing on large chunks of source
code is a common example).  Most of the "standard" SA rules are "safe" under
normal conditions, but if some automated tool creates text containing a
string which happens to be formatted the same as a spam domain, there will
be a conflict (e.g. if "updated.by" were spammy - or even if a local rule
penalized non-"RFC compliant" TLDs, since ".by" doesn't have a whois server,
so any string with ".by", ".my", ".de", ".mx", etc. at its end could cause
problems).
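
        A short Python sketch of how such a conflict arises.  The
pattern is an illustration only; it is not the actual SARE_URI_EQUALS
rule or any SA rule, and CC_TLDS is a hypothetical short list built from
the TLDs named above.  Any word followed by a dot and one of those
suffixes is reported as if it were a bare domain:

```python
import re

# Illustrative subset of country-code TLDs, taken from the text above.
CC_TLDS = ("by", "my", "de", "mx")

# Naive bare-domain pattern: one label, a dot, then a listed TLD.
BARE_RE = re.compile(
    r"\b[a-z0-9-]+\.(?:%s)\b" % "|".join(CC_TLDS),
    re.IGNORECASE,
)


def bare_domain_hits(text):
    """Return every token in text that looks like label.cctld."""
    return BARE_RE.findall(text)


if __name__ == "__main__":
    # Tool-generated text with no space after the period.
    print(bare_domain_hits("Record last updated.by admin"))
```

The generated string "updated.by" is returned as a hit even though
nothing in the text was meant as a domain.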

        I don't think you can find any way to tell if something is actually
advertising even if you did have a natural language parser.  Consider the
case where the mail contains an image of a watch, pills or a scantily clad
young woman, random text (not random words, but "literary" chaff) and a bare
domain name.  To a human it may be obvious what is happening, but you'd need
a very complex recognizer to get a computer to "know" it was advertising.
It could be a picture of your cousin with the poem she won a prize for and
the domain it is "published" at, sent by a relative (example from mail I've
actually received), or it could be an advertisement for child pornography.
How can you tell (especially when it comes from a DUL host via a cable ISP)?

        Not an easy case, and not one I expect to be solved in my lifetime.


        Paul Shupak
        [EMAIL PROTECTED]
