Re: Which stemmer?

Michael Sokolov Thu, 15 Nov 2012 22:19:17 -0800

On 11/15/2012 1:06 PM, Tom Burton-West wrote:

This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (Krovetz, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).


http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Thanks for the reference - that was very enlightening. The paper explains why many terms are not stemmed by KStem as one might expect: words that are found in the dictionary (by which I think they mean words that have their own senses, whose definitions do not include the stem word) are not stemmed, since KStem assumes that they have their own particular meanings and are not derived *purely by inflection*.

The dictionary they used is the Longman dictionary, which is available for free online. I looked up "dog" (http://www.ldoceonline.com/dictionary/dog_1) and found that there is a sense there (sense 13) whose definition reads:



   dogs [plural] American English informal: feet

This sense doesn't mention the stem word "dog" - it clearly has a different meaning than the main dog entry, so I guess the thinking behind this is: if the person was searching for "dogs" (meaning feet) they wouldn't want to find text with "dog" (meaning man's best friend). Of course in this case, "dog" singular presumably could mean foot as well, so the inference seems faulty, although perhaps that never occurs? Honestly I've never heard of anyone using "dogs" to mean feet either, but hey nobody's perfect.

This entry: http://www.ldoceonline.com/dictionary/bound_4 probably explains the reason "bounds" doesn't stem to "bound".

In the Lucene KStemmer code, this translates into the word appearing in one of the dictionary data files. If a word appears there (as "dogs" and "bounds" do), it won't be stemmed. I suppose a possible approach here would be to send the client the dictionary of non-stemming words and let them remove some, but then you'd have to compile your own KStemmer variant.

Perhaps a nice feature to add to KStemmer would be to have it read a list of exception words at run-time that would be removed from its dictionary in order to allow them to be stemmed.
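
To make the idea concrete, here is a rough sketch of what I have in mind. This is not Lucene's KStemmer and uses none of its API: the class name, the constructor arguments and the toy "strip a trailing s" rule are all made up for illustration. It only shows the mechanism: words in the dictionary are left alone, and a list supplied at run-time is subtracted from that dictionary first.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: NOT Lucene's KStemmer, and not its API.
final class DictionaryGuardedStemmer {
    // Words treated as having their own dictionary sense; these are left unstemmed.
    private final Set<String> protectedWords;

    // dictionary: words that would normally be protected from stemming.
    // runtimeExceptions: words supplied at run-time that should be stemmed anyway.
    DictionaryGuardedStemmer(Set<String> dictionary, Set<String> runtimeExceptions) {
        this.protectedWords = new HashSet<>(dictionary);
        // Remove the run-time exceptions so that they fall through to the stemming rule.
        this.protectedWords.removeAll(runtimeExceptions);
    }

    String stem(String word) {
        // A word still in the dictionary is returned unchanged.
        if (protectedWords.contains(word)) {
            return word;
        }
        // Toy rule standing in for real inflection handling: strip a plural "s".
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```

With "dogs" and "bounds" as the dictionary and only "dogs" in the run-time list, stem("dogs") gives "dog" while stem("bounds") still comes back unchanged.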


-Mike
