On Mon, 17 Nov 2003, Justin Mason wrote:

> BTW, given that a URI DB cannot use regular expressions, or patterns,
> would this really be useful?
>
> Basically with a DB you only gain efficiency when looking up exact
> strings. So for this to be useful against URIs, you'd have to pick out
> *just* the domain part of the URI and look it up. e.g.:
>
>    http://www.stearns.org/sa-blacklist/sa-blacklist.2003111402.uri.cf
>
> would be looked up as "www.stearns.org" or "stearns.org".)

The parser in the Bayes routine (tokenize_line in Bayes.pm) creates 'UD:'
lookup tokens for each component of the domain name. So for the above
example, it would create:

   UD:www.stearns.org
   UD:stearns.org
   UD:org

Thus the DB would only need to contain one entry for the lowest common
denominator [1], i.e. stearns.org.

> I suspect doing this with a DB lookup may not be such a win, compared
> to using a local eval test that parses a config file and creates an
> in-memory hash table.
>
> - --j.

Au contraire, a DB lookup is a big win compared to a regex match for
speed and memory consumption. The Bayesian engine does hundreds of lookups
per message against a database that has tens or hundreds of thousands
(50k~200k) of entries. Other people on this list have found that with
regex matches (e.g. 'evilrules'), a set of just a few thousand patterns
makes a major hit on processor load.

One of the big advantages of using a DB type system is that it can be
updated 'hot' on a running system. A system based upon parsing a config
file and creating an in-memory hash table would require restarting spamd
every time an update was made. If we want to have any hope of automating
such a system, it needs to be updatable 'hot' (note how Bayes operates).

Yes, you are right that a URI DB cannot use regular expressions or
patterns. However, if we're just looking for a 'catcher' for spammer
sites in URIs, that's probably not necessary. We just want to grab a
host/site name out of a spam and slam it in there. Ask people such as
Chris how much time he spent "regex"ing each entry in his 'evilrules'
set. Speed of update and search are far more important IMHO.
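
The domain-component lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not SpamAssassin's actual Perl code: the function names domain_tokens and lookup are hypothetical, and storing the 'UD:'-prefixed strings as the keys of the URI DB is an assumption made only for the example.

```python
from urllib.parse import urlsplit


def domain_tokens(uri):
    """Return a 'UD:' token for each trailing component of the URI's host name."""
    host = (urlsplit(uri).hostname or "").rstrip(".")
    labels = host.split(".") if host else []
    # www.stearns.org -> www.stearns.org, stearns.org, org
    return ["UD:" + ".".join(labels[i:]) for i in range(len(labels))]


def lookup(uri, db):
    """Return the first token of the URI found in db, or None.

    db is any mapping keyed by bytes (assumed layout), so each check is an
    exact-string lookup and no pattern matching is involved.
    """
    for token in domain_tokens(uri):
        if token.encode("utf-8") in db:
            return token
    return None
```

With a DB holding only the key UD:stearns.org, the example URI from the quoted text would match on its second token. The db argument could be an on-disk database opened with something like dbm.open(path, "r"); reopening it between messages is one way an update might be picked up without a restart, though that depends on the backend.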

I envision this working in a couple of possible ways: either updated
from a central site (e.g. the rules emporium) via wget/rsync etc., or by
a local engine that would use some kind of heuristics on suspect host
names found in potential spam (do DNS lookups, check for IPs that point
to spammer nets, look at 'whois' data for spammer hosting, look at DNS
TTLs, etc.).

Part of my motivation is a local "competition". Our central campus IT
group looked at SA and then decided that it was too much work to manage,
so they spent money and bought ActiveState's PureMessage product (which
is based upon a commercialization of SA; many of the header tags even
match ;). Part of our mail streams through the central servers, so I get
to compare the SA scoring against the PMX scores. Most of the time SA
does a better job (fewer FP/FN), but sometimes PMX "wins", and when it
does it is usually because of a 'sparse' spam that has just a few URL
images (and a bunch of Bayes fodder). The PMX score will often be pushed
up by a rule that is labeled:

   KNOWN_ADVERT_URL

So my guess is that PMX already has something like this. I want it TOO!

Dave

[1] In a mathematical context, 'lowest common denominator' makes no
sense. The number 1 is always the lowest common denominator for any
value. Mathematically we're looking for the GCD ('greatest common
divisor').

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk