Dear clapf users,

I am about to release the latest version of clapf. The 0.3.29
release contains many improvements and changes:

- Full mysql support

clapf can store and query the tokens directly in a mysql
database. This gives more flexibility and convenience than
the read-only CDB format: you no longer have to refresh the
tokens.cdb file periodically as before.

- Per user token data set

The direct mysql token queries allow us to create personalised
token databases. You can build a global data set during the
initial training and then let your users train their own data
sets without affecting other users'. Even better: clapf can
merge the global data set with the user's own data set at run
time.
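
To illustrate the run-time merge, here is a minimal Python sketch. It is not clapf's actual code (clapf is written in C and reads the counts from mysql); the function name and the idea of simply summing the (spam hits, ham hits) pairs are assumptions made for illustration.

```python
def merge_token_counts(global_counts, user_counts):
    """Combine per-token (spam_hits, ham_hits) pairs from two data sets.

    Assumption: "merging" means adding the user's counts to the global
    counts. Neither input dictionary is modified.
    """
    merged = dict(global_counts)
    for token, (spam_hits, ham_hits) in user_counts.items():
        g_spam, g_ham = merged.get(token, (0, 0))
        merged[token] = (g_spam + spam_hits, g_ham + ham_hits)
    return merged


# Hypothetical data: the global set and one user's personal set.
global_set = {"viagra": (10, 0), "meeting": (0, 7)}
user_set = {"viagra": (2, 1), "hello": (0, 3)}
merged = merge_token_counts(global_set, user_set)
```

With this approach a user's own training only shifts the counts seen by that user, while the global data set stays untouched.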

- New training method for a faster learning curve

Training while analysing incoming email is also possible with
the Train Until Mature (TUM) method. If it is enabled and clapf
is certain that a message is spam (or ham), it trains on those
tokens of the message that have fewer than 25 spam or ham hits
in the database. This makes it possible to learn spammers' new
tricks without first encountering a false negative.
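
The TUM decision can be sketched roughly as follows. This is an illustration in Python, not clapf's implementation: the certainty limits (0.99 and 0.01), the function name, and the choice to compare the hit count of the class being trained against 25 are all assumptions.

```python
MATURITY_HITS = 25  # from the announcement: tokens below 25 hits are still trained


def tum_tokens_to_train(message_tokens, token_hits, spam_probability,
                        spam_limit=0.99, ham_limit=0.01):
    """Return (label, tokens_to_train) for one analysed message.

    token_hits maps token -> (spam_hits, ham_hits).
    spam_limit and ham_limit are hypothetical certainty thresholds.
    """
    if spam_probability >= spam_limit:
        label = "spam"
    elif spam_probability <= ham_limit:
        label = "ham"
    else:
        return None, []  # not certain enough: do not train at all

    immature = []
    for token in message_tokens:
        spam_hits, ham_hits = token_hits.get(token, (0, 0))
        hits = spam_hits if label == "spam" else ham_hits
        if hits < MATURITY_HITS:
            immature.append(token)
    return label, immature


# Hypothetical counts: "pills" is already mature, "p1lls" is a new trick.
hits = {"pills": (30, 0), "p1lls": (3, 0)}
result = tum_tokens_to_train(["pills", "p1lls"], hits, spam_probability=0.999)
```

In this sketch only the new token is trained, so an already mature token is not inflated further.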

- Much easier and more convenient database training

Another big improvement is that users can train the token database
by forwarding a mail as an attachment to one of two special email
addresses (ham+usern...@domain.com and spam+usern...@domain.com).
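
A minimal Python sketch of how such a training address could be split into its parts. The function name and the example address are hypothetical; clapf's real parsing may differ.

```python
def parse_training_address(address):
    """Split 'ham+user@domain' or 'spam+user@domain' into its parts.

    Returns (action, user, domain), or None if the address is not a
    training address.
    """
    local, at, domain = address.partition("@")
    if not at or not domain:
        return None
    action, plus, user = local.partition("+")
    if plus and user and action in ("ham", "spam"):
        return action, user, domain
    return None


# Hypothetical recipient address of a forwarded message.
parsed = parse_training_address("spam+alice@example.org")
```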

- Hashed tokens

The token database can be smaller (and faster) with hashed tokens.
Instead of storing the actual token string, I store only its 64-bit
numeric representation (computed with the APHash algorithm).

You can try the APHash function with the aphash utility, which
computes the hash value of an arbitrary string.
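
For readers unfamiliar with it, here is a Python sketch of the AP hash (by Arash Partow) widened to 64 bits. The commonly published version is 32-bit with the seed 0xAAAAAAAA; the 64-bit seed and masking below are assumptions, so the values it produces are not guaranteed to match what clapf or the aphash utility computes.

```python
MASK64 = (1 << 64) - 1


def aphash64(data: bytes) -> int:
    """AP hash over a byte string, truncated to 64 bits.

    Assumption: the 32-bit seed 0xAAAAAAAA is widened to 64 bits.
    """
    h = 0xAAAAAAAAAAAAAAAA
    for i, byte in enumerate(data):
        if i & 1 == 0:
            # even position: mix with a left shift and a multiply
            h ^= ((h << 7) ^ (byte * (h >> 3))) & MASK64
        else:
            # odd position: mix with an inverted shift-and-add
            h ^= (~((h << 11) + (byte ^ (h >> 5)))) & MASK64
    return h


# Hash a token the way a hashed token table would key it.
token_key = aphash64(b"viagra")
```

The point of the design is that the database stores a fixed-size integer key per token instead of a variable-length string.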

I recommend that you recreate the t_token table, although you can keep
the old format with the --enable-old-format configure option.

If you want the new (and better) format but you cannot rebuild your
token database from scratch, I may write a migration tool for you.
Just let me know if you need it.

- Added an HTML decoder

I added an HTML decoder to recover the original message text when
spammers HTML-encode it.
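
To show the kind of obfuscation this addresses, the Python sketch below decodes HTML character references with the standard library's html.unescape. It only illustrates the idea; it is not clapf's decoder, and the sample text is made up.

```python
import html


def decode_html_entities(text):
    """Turn HTML character references back into plain characters."""
    return html.unescape(text)


# Hypothetical obfuscated text: "vi" is written as numeric references.
decoded = decode_html_entities("&#118;&#105;agra &amp; more")
```

After decoding, the tokenizer sees the plain word rather than a string of entities that would never match a trained token.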

- Misc. fixes

The configure script now honors the --sysconfdir option. There are also
some parser improvements and other fixes.


I am thinking about dropping ldap support (for user settings) and
removing the cdb and hash database support code. Please let me know
if you still want these features; otherwise I will remove them
from the source.

If you want to try the latest development version of clapf, use the
nightly build (http://clapf.acts.hu/clapf-nightly.tar.gz).


If you have any questions or further ideas, don't hesitate to
contact me. If everything goes well, I will release the final 0.3.29
next week.

For the Hungarian users: I wrote a Wiki page about installing clapf. You
may find it at http://wiki.hup.hu/index.php/Clapf

Digitally yours,

SJ.
