Sidney Markowitz <[EMAIL PROTECTED]> writes:

> Looking over the DSPAM docs recently, I saw that it converts all tokens 
> to an 8 byte integer hash using CRC64 and then works only with those 
> fixed length 8 byte numbers instead of variable length strings. CRC64 
> may not yield perfectly unique results, but it is certainly close enough 
> for the Bayes statistics.

You might also want to check out CRM114.
 
> Does anyone here have the experience with SpamAssassin's Bayes 
> processing to be able to guess how much of a difference it would make, 
> if any, if the Bayes db stored fixed length 8 byte integers instead of 
> strings and all the comparisons were of 8 byte integers? How much would 
> that change storage requirements? How would it change I/O requirements 
> reading from and writing to the database?

It would lower storage requirements somewhat, and a fixed key length
would probably enable significant database speed and size improvements.
Our average token size is about 12 bytes right now, so an 8-byte key
would save roughly 4 bytes per token.
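
To get a rough feel for the key-size difference, here is a minimal
Python sketch. It is an illustration only: it uses an 8-byte BLAKE2b
digest from the standard library as a stand-in for CRC64 (Python has no
built-in CRC64), and the sample tokens and the helper names `token_key`
and `key_size_report` are made up for the example, not taken from
SpamAssassin or DSPAM.

```python
import hashlib
import struct


def token_key(token: str) -> int:
    """Map a token to a fixed-length 8-byte key, returned as an unsigned int.

    Assumption: BLAKE2b truncated to 8 bytes stands in for CRC64 here.
    """
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return struct.unpack(">Q", digest)[0]


def key_size_report(tokens):
    """Compare total key bytes for string keys versus 8-byte hashed keys."""
    unique = set(tokens)
    string_bytes = sum(len(t.encode("utf-8")) for t in unique)
    # Colliding tokens would share one key, so count distinct hashes.
    hashed_bytes = 8 * len({token_key(t) for t in unique})
    return string_bytes, hashed_bytes


# Hypothetical sample tokens, for illustration only.
sample = ["viagra", "unsubscribe", "HX-Mailer:Outlook", "mortgage", "N:H*r:sk:mail"]
string_bytes, hashed_bytes = key_size_report(sample)
print(f"string keys: {string_bytes} bytes, hashed keys: {hashed_bytes} bytes")
```

This counts key bytes only. Per-record overhead in the database, and
any speed gain from fixed-length comparisons, would have to be measured
against a real Bayes db.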

> I'm not putting this in Bugzilla as an RFE yet, because first I would 
> like to get a sense if this seems worth pursuing.

I think it's definitely worth testing.

I also think the dobly noise reduction technique is worth trying
(testing it is already covered by an RFE bug), as is testing windowing
over 2 (or more) tokens.
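
For the windowing idea, a minimal Python sketch follows. It assumes
that windowing over N tokens means emitting every run of up to N
adjacent tokens as an extra feature alongside the single tokens. The
function name `windowed_tokens` and the space used to join tokens are
hypothetical choices for the example, not SpamAssassin's tokenizer.

```python
def windowed_tokens(tokens, width=2):
    """Return single tokens plus every run of 2..width adjacent tokens."""
    features = []
    for size in range(1, width + 1):
        # Slide a window of `size` tokens across the token list.
        for start in range(len(tokens) - size + 1):
            features.append(" ".join(tokens[start:start + size]))
    return features


print(windowed_tokens(["buy", "cheap", "meds"], width=2))
```

Wider windows multiply the number of distinct tokens, which is one more
reason a fixed-length key could matter for database size.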

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
