Sidney Markowitz <[EMAIL PROTECTED]> writes:

> Looking over the DSPAM docs recently, I saw that it converts all tokens
> to an 8 byte integer hash using CRC64 and then works only with those
> fixed length 8 byte numbers instead of variable length strings. CRC64
> may not yield perfectly unique results, but it is certainly close enough
> for the Bayes statistics.

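To make the quoted idea concrete, here is a minimal sketch of mapping variable-length tokens to fixed 8 byte keys. It is not DSPAM's code: Python's standard library has no CRC64, so an 8 byte BLAKE2b digest stands in for it, and the names token_key, pack_key and expected_collisions are made up for illustration. The collision estimate assumes an ideal 64-bit hash, which CRC64 only approximates.

```python
import hashlib
import struct


def token_key(token: str) -> int:
    """Map a token to an unsigned 64-bit integer.

    Assumption: BLAKE2b truncated to 8 bytes is a stand-in for CRC64.
    """
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def pack_key(key: int) -> bytes:
    """Serialize a key as exactly 8 bytes (big-endian unsigned 64-bit)."""
    return struct.pack(">Q", key)


def expected_collisions(n_tokens: int) -> float:
    """Birthday-style estimate of colliding pairs for an ideal 64-bit hash."""
    return n_tokens * (n_tokens - 1) / 2 / 2**64


if __name__ == "__main__":
    for tok in ["viagra", "meeting", "unsubscribe"]:
        key = token_key(tok)
        print(tok, key, len(pack_key(key)))
    print(expected_collisions(1_000_000))
```

Every key packs to the same 8 bytes no matter how long the token is, which is what makes a fixed key length possible in the database. The price is that the original token string cannot be recovered from its key.
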
You might also want to check out CRM114.

> Does anyone here have the experience with SpamAssassin's Bayes
> processing to be able to guess how much of a difference it would make,
> if any, if the Bayes db stored fixed length 8 byte integers instead of
> strings and all the comparisons were of 8 byte integers? How much would
> that change storage requirements? How would it change I/O requirements
> reading from and writing to the database?

It would lower storage requirements somewhat and would probably enable
significant database speed/size improvements based on using a fixed key
length. Our average token size is about 12 bytes right now.

> I'm not putting this in Bugzilla as an RFE yet, because first I would
> like to get a sense if this seems worth pursuing.

I think it's definitely worth testing. I also think the dobly noise
reduction technique is worth trying (and testing this is already in an
RFE bug) as well as testing 2 (or more) token-wide windowing.

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
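
For the token-wide windowing mentioned above, one possible reading is a sliding window that joins each run of adjacent tokens into a single combined token. The sketch below assumes that reading. The function name windowed_tokens and the space used as a joiner are illustrative choices, not anything taken from SpamAssassin.

```python
from typing import List


def windowed_tokens(tokens: List[str], width: int = 2) -> List[str]:
    """Return every run of `width` adjacent tokens joined into one token.

    Assumption: "windowing" means a sliding window over the token
    sequence; a message shorter than the window yields no combined tokens.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    return [
        " ".join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
    ]


if __name__ == "__main__":
    print(windowed_tokens(["buy", "cheap", "pills", "now"], width=2))
```

Combined tokens are longer than single ones, so a windowed database would probably gain more from fixed-length keys than the 12 byte average quoted above suggests.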
