On Thu, Mar 04, 2004 at 06:45:18PM -0800, Dan Quinlan wrote:
> It would lower storage requirements somewhat and would probably enable
> significant database speed/size improvements based on using a fixed key
> length.  Our average token size is about 12 bytes right now.

I think there would be some speedup, but not a "significant" one.
It'd be nice for storage, but we'd pick up extra overhead from computing
all the hashes.  I don't think overall I/O would go down much, and IIRC
that's the main time sink in our Bayes code.

Also, some people (myself included) would miss running "sa-learn --dump
data" and seeing actual words (ilikespam) and such instead of "0x000003bf"...

> I think it's definitely worth testing.
> 
> I also think the dobly noise reduction technique is worth trying (and
> testing this is already in an RFE bug) as well as testing 2 (or more)
> token-wide windowing.

Definitely, but we need to either test this RSN or punt it to 3.1, IMO.
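
For anyone who hasn't seen the windowing idea before, it's basically
this (again just an illustrative sketch, not a proposed implementation):

  def window_tokens(tokens, width=2):
      # Slide a width-N window over the token stream and join each
      # window into one compound token, capturing adjacent-word context.
      for i in range(len(tokens) - width + 1):
          yield " ".join(tokens[i:i + width])

  print(list(window_tokens(["buy", "cheap", "meds", "now"])))
  # -> ['buy cheap', 'cheap meds', 'meds now']

The compound tokens would then go into the Bayes DB alongside the
single-word ones, which is part of why the key-size question matters.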

-- 
Randomly Generated Tagline:
"Walk softly and carry a +6 two-handed sword."     - From a screen saver
