On Thu, Mar 04, 2004 at 06:45:18PM -0800, Dan Quinlan wrote:
> It would lower storage requirements somewhat and would probably enable
> significant database speed/size improvements based on using a fixed key
> length. Our average token size is about 12 bytes right now.
I think there would be some, but I don't think a "significant" speedup. It'd be nice for storage, but we'd have additional overhead from doing all the hashes, and I don't think overall I/O would go down a lot, which is the main time suck in our Bayes code iirc. Also, some people (myself included) would miss doing "sa-learn --dump data" and seeing actual words (ilikespam) and such instead of "0x000003bf"...

> I think it's definitely worth testing.
>
> I also think the dobly noise reduction technique is worth trying (and
> testing this is already in a RFE bug) as well as testing 2 (or more)
> token-wide windowing.

Definitely, but we either need to test this RSN or punt to 3.1, IMO.

-- 
Randomly Generated Tagline:
"Walk softly and carry a +6 two-handed sword."
         - From a screen saver
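[Editor's note: to make the trade-off discussed above concrete, here is a minimal sketch, in Python rather than SpamAssassin's Perl, of what a fixed-width hashed token key might look like next to the raw variable-length token. The 5-byte truncation, function names, and sample tokens are illustrative assumptions, not what the project actually implemented.]

```python
# Sketch only: contrast variable-length raw-token keys with fixed-length
# hashed keys for a Bayes token store. Not SpamAssassin code.
import hashlib

def raw_key(token: str) -> bytes:
    # Variable-length key: the token itself (~12 bytes on average per the thread).
    return token.encode("utf-8")

def hashed_key(token: str, width: int = 5) -> bytes:
    # Fixed-length key: a SHA-1 digest truncated to `width` bytes (assumed width).
    # Saves space on long tokens, but costs a hash per lookup and makes
    # database dumps show hex values instead of readable words.
    return hashlib.sha1(token.encode("utf-8")).digest()[:width]

if __name__ == "__main__":
    for tok in ("ilikespam", "viagra", "subject:hello"):
        print(tok, raw_key(tok), hashed_key(tok).hex())
```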
