On 11/15/2012 1:06 PM, Tom Burton-West wrote:
This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (Krovetz, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Thanks for the reference; that was very enlightening. The paper explains why KStem does not stem many terms the way one might expect. Words that are found in the dictionary are not stemmed. By "found in the dictionary" I think they mean words that have their own senses whose definitions do not include the stem word. KStem assumes that such words have their own particular meanings and are not derived *purely by inflection*.

The dictionary they used is the Longman dictionary, which is available for free online. I looked up "dog" http://www.ldoceonline.com/dictionary/dog_1 and found that there is a sense there (sense 13) whose definition reads:


   dogs [plural] American English informal: feet

This sense doesn't mention the stem word "dog", and it clearly has a different meaning than the main dog entry. So I guess the thinking is: someone searching for "dogs" (meaning feet) wouldn't want to find text with "dog" (meaning man's best friend). Of course, "dog" in the singular could presumably mean foot as well, so the inference seems faulty, although perhaps that usage never occurs. Honestly, I've never heard anyone use "dogs" to mean feet either, but hey, nobody's perfect.

This entry: http://www.ldoceonline.com/dictionary/bound_4 probably explains the reason "bounds" doesn't stem to "bound".

In the Lucene KStemmer code, this translates into the word appearing in one of the dictionary data files. If a word appears there (as "dogs" and "bounds" do), it won't be stemmed. One possible approach would be to send the client the list of words that are exempt from stemming and let them remove some, but then you'd have to compile your own KStemmer variant.
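
To make the dictionary check concrete, here is a toy sketch in Python. It is not Lucene's code: the word list, the name toy_stem, and the single plural rule are all made up for illustration, and the real stemmer has many more rules and a much larger dictionary.

```python
# Toy model of the dictionary check; NOT the actual Lucene KStemmer logic.
# The word list and the single plural rule are assumptions for illustration.
PROTECTED = {"dog", "dogs", "bound", "bounds", "cat"}


def toy_stem(word, dictionary=PROTECTED):
    """Return the word unchanged if it is in the dictionary, else strip a plural 's'."""
    if word in dictionary:
        # Dictionary words are assumed to carry their own meaning.
        return word
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


print(toy_stem("dogs"))  # in the dictionary, so left alone
print(toy_stem("cats"))  # not in the dictionary, so the plural rule applies
```

The point is only that the lookup happens before any suffix rule gets a chance to run, which is why "dogs" and "bounds" come back unchanged.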

Perhaps a nice feature to add to KStemmer would be to have it read a list of exception words at run-time that would be removed from its dictionary in order to allow them to be stemmed.
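
A rough sketch of what I have in mind, in Python rather than Java to keep it short. Everything here is hypothetical: load_exceptions, make_stemmer, the one-word-per-line file format, and the tiny built-in dictionary are my own inventions, and the stemming rule is a stand-in for the real KStem rules.

```python
import io


def load_exceptions(lines):
    """Read one word per line, skipping blank lines and '#' comments."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def make_stemmer(dictionary, exceptions):
    """Build a stemmer whose protected dictionary excludes the exception words."""
    effective = set(dictionary) - set(exceptions)

    def stem(word):
        if word in effective:
            return word  # still protected by the dictionary
        # Stand-in for the real suffix rules: strip a simple plural 's'.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            return word[:-1]
        return word

    return stem


# Hypothetical built-in dictionary and a run-time exception list.
builtin = {"dog", "dogs", "bound", "bounds"}
exception_file = io.StringIO("# words to allow stemming for\ndogs\n")
stem = make_stemmer(builtin, load_exceptions(exception_file))
print(stem("dogs"), stem("bounds"))
```

With "dogs" in the exception list it is removed from the protected set and gets stemmed, while "bounds" stays protected. The built-in data is never modified, so nobody would need to recompile anything.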

-Mike
