On Mon, 17 Nov 2003, Justin Mason wrote:

> BTW, given that a URI DB cannot use regular expressions, or patterns,
> would this really be useful?
>
> Basically with a DB you only gain efficiency when looking up exact
> strings.  So for this to be useful against URIs, you'd have to pick out
> *just* the domain part of the URI and look it up. e.g.:
>
> http://www.stearns.org/sa-blacklist/sa-blacklist.2003111402.uri.cf
>
> would be looked up as "www.stearns.org" or "stearns.org".)

The parser in the Bayes routine (tokenize_line in Bayes.pm) creates 'UD:'
lookup tokens for each component of the domain name. So for the above
example, it would create:
        UD:www.stearns.org
        UD:stearns.org
        UD:org

Thus the DB would only need to contain one entry for the lowest common
denominator [1], i.e. stearns.org.
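
To make the idea concrete, here is a minimal sketch in Python (not the Perl
of Bayes.pm, and not the actual tokenize_line code). The function names
domain_suffixes, ud_tokens and uri_is_listed are made up for illustration,
and a plain in-memory set stands in for the DB:

```python
from urllib.parse import urlsplit


def domain_suffixes(uri):
    """Return every trailing part of the URI's host name, longest first."""
    host = urlsplit(uri).hostname or ""
    labels = [label for label in host.lower().split(".") if label]
    return [".".join(labels[i:]) for i in range(len(labels))]


def ud_tokens(uri):
    """Build 'UD:' style tokens, one per domain suffix (illustrative only)."""
    return ["UD:" + suffix for suffix in domain_suffixes(uri)]


def uri_is_listed(uri, listed_domains):
    """True if any suffix of the host is an exact key in listed_domains."""
    return any(suffix in listed_domains for suffix in domain_suffixes(uri))


uri = "http://www.stearns.org/sa-blacklist/sa-blacklist.2003111402.uri.cf"
print(ud_tokens(uri))
print(uri_is_listed(uri, {"stearns.org"}))
```

Each suffix is an exact-string lookup, so a single "stearns.org" entry
covers www.stearns.org and any other host under it without a pattern.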

> I suspect doing this with a DB lookup may not be such a win, compared
> to using a local eval test that parses a config file and creates an
> in-memory hash table.
>
> - --j.

Au contraire, a DB lookup is a big win compared to a regex match for
speed/memory consumption. The Bayesian engine does hundreds of lookups
per message against a database that has tens or hundreds of thousands
(50k~200k) of entries. Other people on this list have found that with
regex matches (e.g. 'evilrules'), a set of just a few thousand patterns
makes a major hit in processor load.

One of the big advantages of using a DB type system is that it can be
updated 'hot' on a running system. A system based upon parsing a config
file and creating an in-memory hash table would require restarting spamd
every time an update was made.

If we want to have any hope of automating such a system, it needs to be
updatable 'hot' (note how Bayes operates).
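
As a rough illustration of what 'hot' means here, the sketch below uses
Python's sqlite3 module purely as a stand-in for whatever DB backend would
really be used. The table name uri_domains and the helper functions are
assumptions made up for this example. One connection plays the long-running
scanner, the other plays a separate update job:

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open (and if needed create) the hypothetical domain table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS uri_domains (domain TEXT PRIMARY KEY)"
    )
    conn.commit()
    return conn


def add_domain(conn, domain):
    """Insert one domain and commit so other connections can see it."""
    conn.execute(
        "INSERT OR IGNORE INTO uri_domains (domain) VALUES (?)",
        (domain.lower(),),
    )
    conn.commit()


def is_listed(conn, domain):
    """Exact-match lookup of a single domain."""
    rows = conn.execute(
        "SELECT 1 FROM uri_domains WHERE domain = ?", (domain.lower(),)
    ).fetchall()
    return bool(rows)


db_path = os.path.join(tempfile.mkdtemp(), "uri_domains.db")
scanner = open_db(db_path)  # stands in for the running daemon
updater = open_db(db_path)  # stands in for a separate update job

before = is_listed(scanner, "stearns.org")
add_domain(updater, "stearns.org")
after = is_listed(scanner, "stearns.org")
print(before, after)
```

The scanner connection is never reopened or restarted, yet its next lookup
sees the entry the updater committed.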

Yes, you are right in that a URI DB cannot use regular expressions or
patterns. However, if we're just looking for a 'catcher' for spammer
sites in URIs, that's probably not necessary. We just want to grab a
host/site name out of a spam and slam it in there. Ask people such as
Chris how much time they spent "regex"ing each entry in the 'evilrules'
set. Speed of update and search are far more important IMHO.

I envision this working in a couple of possible ways: either updated from
a central site (e.g. the rules emporium) via wget/rsync etc., or by a local
engine that would use some kind of heuristics on suspect host names found
in potential spam (do DNS lookups, check for IPs that point to spammer
nets, look at 'whois' data for spammer hosting, look at DNS TTLs, etc.).

Part of my motivation is a local "competition". Our central campus IT
group looked at SA and then decided that it was too much work to manage,
so they spent money and bought ActiveState's PureMessage product
(which is based upon a commercialization of SA; many of the header
tags even match ;).
Part of our mail streams through the central servers, so I get to compare
the SA scoring against the PMX scores. Most of the time SA does a better
job (fewer FP/FN), but sometimes PMX "wins", and when it does it is
usually because of a 'sparse' spam that has just a few URL
images (and a bunch of Bayes fodder). The PMX score will often be
pushed up by a rule that is labeled: KNOWN_ADVERT_URL

So my guess is that PMX already has something like this. I want it TOO!

Dave

[1] In a mathematical context, 'lowest common denominator' makes no sense.
The number 1 is always the lowest common denominator for any value.
Mathematically we're looking for the GCD ('greatest common divisor').

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{


