Re: [htdig-dev] stemming

Gabriele Bartolini Sun, 08 Dec 2002 23:53:47 -0800

On Fri, 2002-12-06 at 21:09, Neal Richter wrote:
>   According to the literature, if you go with a stemmed index exclusively,
> the index efficiency goes up by ABOUT 20-30%.  This estimate is very data
> and language dependent.
> 
>   I research and implement this kind of stuff at work...  I'd be happy to
> post links to a couple research papers if people are interested.


Yes, please do, Neal. I am really interested, and what you have said so
far is interesting as well! :-)

>   Here's a proposal for 'intelligent stemming' in HtDig:
> 
>   1.  Fix index efficiency.

Yep

>   2.  Add a configuration switch to disable stemming ;-)

Good.

>   3.  Implement the stemming algorithm to ADD additional rows to the index
>       with stemmed versions of the words (with a row flag to signify
>       this).

Perfect
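
To make point 3 concrete, here is a minimal sketch of what "additional
rows with a row flag" could look like. It is not ht://Dig code: the
IndexRow layout, the toy_stem() suffix stripper and index_document() are
all hypothetical names made up for illustration, and a real stemmer
(Porter, for example) would replace toy_stem().

```python
from typing import List, NamedTuple


class IndexRow(NamedTuple):
    word: str
    doc_id: int
    stemmed: bool  # row flag: True when 'word' is a stem, not the literal token


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_document(doc_id: int, text: str) -> List[IndexRow]:
    """Emit one unstemmed row per token, plus a flagged row when the stem differs."""
    rows: List[IndexRow] = []
    for token in text.lower().split():
        rows.append(IndexRow(token, doc_id, False))
        stem = toy_stem(token)
        if stem != token:
            rows.append(IndexRow(stem, doc_id, True))
    return rows
```

Because the stemmed rows are added rather than substituted, the index
grows, which is why this waits on the index fix mentioned further down.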

>   4.  During result ranking we rank the results with an algorithm like
>       this:
> 
>       If num documents is LARGE
>          unstemmed rows are 80%, stemmed rows are 20% of the 'score'
> 
>       If num documents is MEDIUM
>          unstemmed rows are 60%, stemmed rows are 40% of the 'score'
> 
>       If num documents is SMALL
>          unstemmed rows are 30%, stemmed rows are 70% of the 'score'

I like it, although I think it wouldn't be bad to let users set those
values somehow, for example by choosing a more general or a more
specific index.
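
The weighting in point 4, with the user override suggested here, could
be sketched roughly as follows. Everything in it beyond the three
percentage pairs is an assumption: the thread does not define what
counts as SMALL, MEDIUM or LARGE, so the small_max and medium_max
thresholds are placeholders, and stem_weights() and blended_score() are
hypothetical names, not ht://Dig functions.

```python
from typing import Optional, Tuple


def stem_weights(num_docs: int,
                 small_max: int = 1_000,
                 medium_max: int = 100_000) -> Tuple[float, float]:
    """Return (unstemmed_weight, stemmed_weight) for a given document count.

    The thresholds are placeholder assumptions; only the weight pairs
    come from the proposal.
    """
    if num_docs <= small_max:
        return (0.30, 0.70)   # SMALL: lean on stemmed rows
    if num_docs <= medium_max:
        return (0.60, 0.40)   # MEDIUM
    return (0.80, 0.20)       # LARGE: lean on exact (unstemmed) rows


def blended_score(unstemmed_score: float,
                  stemmed_score: float,
                  num_docs: int,
                  weights: Optional[Tuple[float, float]] = None) -> float:
    """Combine the two partial scores; 'weights' lets a user override the defaults."""
    w_unstemmed, w_stemmed = weights if weights is not None else stem_weights(num_docs)
    return w_unstemmed * unstemmed_score + w_stemmed * stemmed_score
```

Passing weights=(0.9, 0.1) would give a more specific index, while
weights=(0.2, 0.8) would give a more general one.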

> I also don't support doing anything about stemming until we fix the index
> (which I'm working on).  It will negatively impact the size too much for
> large indexes.

I agree ... baby steps. :-)


Thanks for your message. Could you please point us to some references or
resources to read? I'd love that!

Ciao and thanks,
-Gabriele

-- 
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
[EMAIL PROTECTED] | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;
