On Fri, 2002-12-06 at 21:09, Neal Richter wrote:
> According to the literature, if you go with a stemmed index exclusively,
> the index efficiency goes up by ABOUT 20-30%. This estimate is very data
> and language dependent.
>
> I research and implement this kind of stuff at work... I'd be happy to
> post links to a couple research papers if people are interested.

Yes, please do, Neal. I am really interested, and what you have said so far is interesting as well! :-)

> Here's a proposal for 'intelligent stemming' in HtDig:
>
> 1. Fix index efficiency.

Yep.

> 2. Add a configuration switch to disable stemming ;-)

Good.

> 3. Implement the stemming algorithm to ADD additional rows to the index
>    with stemmed versions of the words (with a row flag to signify
>    this).

Perfect.

> 4. During result ranking we rank the results with an algorithm like
>    this:
>
>    If num documents is LARGE
>       unstemmed rows are 80%, stemmed rows are 20% of the 'score'
>
>    If num documents is MEDIUM
>       unstemmed rows are 60%, stemmed rows are 40% of the 'score'
>
>    If num documents is SMALL
>       unstemmed rows are 30%, stemmed rows are 70% of the 'score'

I like it, though I think it would be good to let users set those values themselves, for example by choosing a more general or a more specific index.

> I also don't support doing anything about stemming until we fix the index
> (which I'm working on). It will negatively impact the size too much for
> large indexes.

I agree ... baby steps. :-)

Thanks for your message. Could you please point us to some references or resources to read? I'd love that!

Ciao and thanks,
-Gabriele
-- 
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
[EMAIL PROTECTED] | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;
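
For reference, the weighting in point 4 of the quoted proposal, together with the suggestion of letting users override the percentages, could be sketched roughly as follows. This is only an illustration in Python, not ht://Dig code: the function names, the `weights` override, and the thresholds that separate SMALL, MEDIUM and LARGE are all assumptions, since the proposal does not define them.

```python
# Hypothetical cut-offs: the proposal does not say what counts as
# LARGE or MEDIUM, so these numbers are placeholders only.
LARGE_THRESHOLD = 100_000
MEDIUM_THRESHOLD = 10_000


def stem_weights(num_documents):
    """Return (unstemmed_weight, stemmed_weight) for a collection size,
    using the 80/20, 60/40 and 30/70 splits from the proposal."""
    if num_documents >= LARGE_THRESHOLD:
        return 0.8, 0.2
    if num_documents >= MEDIUM_THRESHOLD:
        return 0.6, 0.4
    return 0.3, 0.7


def combined_score(unstemmed_score, stemmed_score, num_documents, weights=None):
    """Blend the scores from unstemmed and stemmed index rows.

    `weights` is an optional (unstemmed, stemmed) pair that lets a user
    override the size-based defaults.
    """
    w_unstemmed, w_stemmed = weights or stem_weights(num_documents)
    return w_unstemmed * unstemmed_score + w_stemmed * stemmed_score
```

With these placeholder thresholds, a small collection leans mostly on stemmed matches, a large one mostly on exact matches, and passing `weights=(0.5, 0.5)` bypasses the size-based choice entirely.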
