Sarang Shrivastava wrote:
Yes, indeed CRM114 has a lot of criteria for categorizing data, and that can be done via a host of methods, including regexes, approximate regexes,
a Hidden Markov Model, Orthogonal Sparse Bigrams, WINNOW, Correlation,
KNN/Hyperspace, or Bit Entropy.

We can take ideas from them and develop our own plugin that has the
capability to compete with CRM114. After all, there is no place like home. I look forward to working on these, provided that my proposal gets accepted.

As with Kevin, AI is not my field of expertise either, although in
a nearby laboratory at our institute there is quite a strong group of
researchers working in that area (but with less interest in open-source
projects than myself). I think such algorithms can potentially offer a
substantial breath of fresh air to SpamAssassin and are well worth exploring.


A thought to consider: does using custom plugins hinder the performance of SA in terms of speed? No doubt CRM114 is good at classifying spam and
ham, but does it hamper the speed at all?
What do you guys say about including these in SA itself, if possible?

A SpamAssassin plugin is just a Perl module, loaded with the rest of
the SpamAssassin framework. There is no additional overhead in calling
methods in a plugin; it's just a normal Perl subroutine call. In fact most
of the existing content-checking methods that come with SpamAssassin
are already packaged as plugins. The rest of SpamAssassin is utility
routines, parsing and interfacing code and such.

If a plugin (i.e. a Perl module) cannot do all the work by itself
but needs to invoke some external service, e.g. run some program
or connect to some service or use a database, that does add some
overhead. Done carefully, such overhead can still be relatively small
and manageable.

If some database is needed, SpamAssassin can currently use several
of them, from file-based ones (e.g. BerkeleyDB) to SQL, LDAP, and Redis.
Btw, Quanah Gibson-Mount (from Zimbra) is a strong advocate for trying
LMDB (http://symas.com/mdb/), which the Postfix mailer can now also
use, so it may be worth considering when an in-memory database
(such as Redis or Memcache) may not suffice.

Some databases (like SQL or Redis) offer server-side code execution
(stored procedures, or Lua in Redis). This can significantly reduce
the number of round-trip query/response accesses to a database, which
can be valuable for performance.
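
To make the round-trip point concrete, here is a rough toy sketch in
Python. It is not real Redis or SQL client code and not how SpamAssassin
does it: ToyServer, run_script and the token counts are all hypothetical
names and data, made up only to count simulated request/response pairs.

```python
# Toy stand-in for a remote database; NOT a real Redis or SQL client.
# All names and data here are hypothetical and for illustration only.
class ToyServer:
    def __init__(self, data):
        self.data = dict(data)
        self.round_trips = 0  # simulated network request/response pairs

    def get(self, key):
        """Fetch one value; costs one round trip per call."""
        self.round_trips += 1
        return self.data.get(key, 0)

    def run_script(self, func, *args):
        """Simulate server-side code execution: the function runs next
        to the data, so the whole job costs a single round trip."""
        self.round_trips += 1
        return func(self.data, *args)


def sum_counts_client_side(server, tokens):
    # One query per token: round trips grow with the number of tokens.
    return sum(server.get(t) for t in tokens)


def sum_counts_server_side(server, tokens):
    # Ship the whole token list once and let the "server" do the loop.
    return server.run_script(
        lambda data, toks: sum(data.get(t, 0) for t in toks), tokens)


tokens = ["viagra", "meeting", "invoice", "unknown"]
counts = {"viagra": 40, "meeting": 3, "invoice": 7}

a = ToyServer(counts)
total_a = sum_counts_client_side(a, tokens)

b = ToyServer(counts)
total_b = sum_counts_server_side(b, tokens)

print(total_a, a.round_trips)
print(total_b, b.round_trips)
```

Both variants compute the same total, but the first pays one round trip
per token while the second pays one in total. With a real database the
saving depends on network latency and on how much work the script does
on the server.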

  Mark
