Jeff Chan <[EMAIL PROTECTED]> writes:

> For name-based URIs that's very different from my intended use for
> SURBL so I may have been partially in error in suggesting that an
> unmodified URIDNSBL use SURBL directly.

Yeah, I didn't expect it would work based on the explanation on the
SURBL web page, but I figured I'd give it a try anyway.  No harm, no
foul.  I think we'll need to add another method to the URIDNSBL plugin
to support direct RHS query blacklists like SURBL.
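
For anyone following along, here is a rough Python sketch (not
SpamAssassin code) of how the two kinds of query names differ.  An
address-based list is queried with the reversed IPv4 octets in front of
the zone, while a name-based (RHS) list is queried with the domain
itself in front of the zone.  The zone names "ip-list.example" and
"rhs-list.example" and both function names are placeholders I made up
for illustration.

```python
import ipaddress


def ip_query_name(ip: str, zone: str) -> str:
    """Address-based query: reverse the IPv4 octets, then append the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def rhs_query_name(domain: str, zone: str) -> str:
    """Name-based (RHS) query: put the domain itself in front of the zone."""
    return domain.strip(".").lower() + "." + zone


# Placeholder zones, not real blacklists.
ip_name = ip_query_name("192.0.2.10", "ip-list.example")
rhs_name = rhs_query_name("Random.COM.", "rhs-list.example")
```

A plugin method for SURBL would only need the second form, applied to
the domain pulled out of the URI.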
 
> Presently the RBL only has about 250 records; perhaps that's on the
> small side.

250 seems small relative to the number of domains I see in spam each day
(very roughly about 4 domains mentioned per email, average of 2 domains
in each spam unique to a week-long period).

> One improvement might be to encode the frequency data in the RBL so
> that more frequently reported domains could be used to give higher
> scores.

We could do that, but let's see where we are once we start doing direct
lookups and if perhaps you increase your timeout and lower your
threshold to increase the number of records somewhat.
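
If we did go that way, one possibility (only a sketch, and not anything
SURBL does today as far as I know) would be to use the common DNS list
convention of returning different 127.0.0.N answers, with the last octet
standing for a report-count bucket.  The thresholds, octet values, and
scoring below are all invented for illustration.

```python
# Hypothetical buckets: (minimum number of reports, last octet returned).
BUCKETS = [(100, 4), (10, 3), (1, 2)]


def encode_count(count):
    """Map a report count to a 127.0.0.N answer, or None if not listed."""
    for minimum, octet in BUCKETS:
        if count >= minimum:
            return f"127.0.0.{octet}"
    return None


def score_for(answer, base=1.0):
    """Turn the last octet of the answer back into a rule score."""
    octet = int(answer.rsplit(".", 1)[1])
    return base * (octet - 1)
```

On the SpamAssassin side this would probably end up as separate rules
per return value rather than a computed score, but the idea is the same.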

The key thing with the threshold is that we want SURBL to be accurate as
a spam rule.  Joe jobs are something you want to think about now as
opposed to later.

One way you could reduce the possibility of joe jobs is to remove old
domains, ones that have been around a while.  Stuff like amazon.com,
ebay.com, etc. have been around for a long time.  SenderBase has easily
accessed data for this (first email from domain was initialized long
enough ago to be useful now) and there are also the whois records.  You
could also build up a whitelist for repeated joe jobs.

You might also want to increase the timeout on domains that appear again
and again.
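
Roughly what I have in mind, as a Python sketch.  All of the names and
numbers here are assumptions of mine: "first_seen" stands for whatever
date you get from SenderBase or whois, and the one-year age cutoff, the
4-day base timeout, and the 32-day cap are arbitrary.

```python
from datetime import date, timedelta

# Hypothetical policy values, chosen only for illustration.
MIN_AGE_TO_EXEMPT = timedelta(days=365)  # older domains are never listed
BASE_TIMEOUT = timedelta(days=4)         # lifetime of a first listing
MAX_TIMEOUT = timedelta(days=32)         # cap for repeat offenders


def should_list(domain, first_seen, today, whitelist):
    """Skip whitelisted domains and domains that have been around a while."""
    if domain in whitelist:
        return False
    return (today - first_seen) < MIN_AGE_TO_EXEMPT


def listing_timeout(times_relisted):
    """Double the timeout each time a domain reappears, up to the cap."""
    return min(BASE_TIMEOUT * (2 ** times_relisted), MAX_TIMEOUT)
```

The age check handles the amazon.com and ebay.com cases without any
manual work, and the whitelist catches whatever slips past it.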
 
> As another example of difference about my views on the use of the
> SURBL data, off-list Sidney brought up the question of processing
> deliberately randomized host names that spammers sometimes use and how
> that could confuse or defeat a spam message body domain RBL.  He
> implied that such deliberate attempts at randomization might be a
> reason my data was not working too well with URIDNSBL, and I partially
> agree.  This observation points out potential differences in how the
> data might best be used.

Yes, but the SBL rule works pretty well, so I don't think randomized
host names are a problem yet.
 
> My take on the randomized host or subdomain problem highlights
> a different viewpoint we took into consideration when designing
> our data structure.

I *think* we also currently only do queries of the domain itself, so
it shouldn't be an issue.

> Instead of checking every randomized FQDN against the RBL, we prefer
> to try to strip off the random portion and pass only the basic,
> unchanging domain.  The SURBL data only gets the parent of these
> randomized FQDNs since it builds its (inverted) tree from the root
> (TLD) direction down toward the leaves.  (It actually starts counting
> reports from the second level, not the top level, which would be way
> too broad.)  It accumulates a count of the children under the second
> level so that:
> 
>  lkjhlkjh.random.com
>  089yokhl.random.com
>  asdsdfsd.random.com
> 
> gives one entry for each FQDN, but gives the useful and desirable
> count of *3* for random.com.  The randomizers *cannot hide* from
> this approach.  The non-random child portion of their domains
> shows up clearly and conspicuously as a parent domain with an
> increased count (3 is greater than 1).  Every time a spammer gets
> reported using a randomized host or subdomain name, it increases
> the count of their parent domain.  In the words of the original,
> Apple II version of Castle Wolfenstein, "You're caught."

This is a good idea.
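
A minimal Python sketch of the counting, using your three example
hosts.  Note that this uses a flat counter keyed on the last two labels
rather than the inverted tree you describe, and the naive two-label
split is my simplification (it would get names under something like
co.uk wrong).

```python
from collections import Counter


def second_level(fqdn):
    """Return the last two labels of a host name (naive split)."""
    labels = fqdn.strip(".").lower().split(".")
    return ".".join(labels[-2:])


def count_parents(reported_hosts):
    """Tally each reported host under its second-level parent domain."""
    return Counter(second_level(host) for host in reported_hosts)


reports = [
    "lkjhlkjh.random.com",
    "089yokhl.random.com",
    "asdsdfsd.random.com",
]
counts = count_parents(reports)
```

Each randomized host adds one to the count for random.com, which is the
behavior you describe.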
 
> My suggested alternative approach to parsing spam URIs would be to
> start with the second level domains, compare those against SURBL,
> try the third levels next, up to some limit. (Levels 1 with 2,
> then 1 through 3 are probably enough, i.e. two DNS queries into
> the SURBL domain).  Since the DNS RBL lookups are all cached and
> very fast there should not be too much of a performance penalty
> for this.

Whatever we do, we really want to do all the queries at once as early as
possible in the message check for performance reasons.
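
Something like the following Python sketch, which builds the two-label
and three-label candidates for every host up front and then fires off
all the lookups together.  This is not how the URIDNSBL plugin is
written; the zone "surbl.example" is a placeholder, the function names
are mine, and a thread pool is just one convenient way to show "all at
once".

```python
from concurrent.futures import ThreadPoolExecutor
import socket

ZONE = "surbl.example"  # placeholder, not the real SURBL zone name


def candidates(host, max_levels=3):
    """Return the 2-label parent, then the 3-label parent, up to max_levels."""
    labels = host.strip(".").lower().split(".")
    top = min(len(labels), max_levels)
    return [".".join(labels[-n:]) for n in range(2, top + 1)]


def is_listed(name, resolve=socket.gethostbyname):
    """Treat a name as listed if its query name resolves at all."""
    try:
        resolve(f"{name}.{ZONE}")
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def check_hosts(hosts, resolve=socket.gethostbyname):
    """Gather every candidate from every host and query them concurrently."""
    names = sorted({c for h in hosts for c in candidates(h)})
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda n: is_listed(n, resolve), names))
    return dict(zip(names, results))
```

Collecting the names into a set first also means a domain that shows up
ten times in one message is only queried once.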

> Probably it's less of a penalty than trying to resolve spam body FQDNs
> into numeric addresses, then do reverse lookups or name server record
> checks on the addresses, etc.

Definitely.

> Implementing this approach may require a new code branch off of
> URIDNSBL to be started.  But I'm convinced my approach may have
> some definite merit if implemented.

I think it belongs in the URIDNSBL code, but another plugin would
perhaps be okay.
 
> I've never written any SA code, so could I convince someone to
> consider implementing this approach or give me a pointer to learn how
> to do it?

It sounds like Justin is thinking about it, and perhaps Sidney is
interested.  If you want to do it yourself, my advice would be to check
out the SVN tree and start hacking.  :-)

Daniel
 
-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
