On 11/15/2012 1:06 PM, Tom Burton-West wrote:
This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (*Krovetz*, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).
http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
Thanks for the reference - that was very enlightening. The paper
explains why KStem leaves many terms unstemmed where one might expect
otherwise: a word that has its own dictionary entry (by which I think
they mean a sense whose definition does not mention the stem word) is
not stemmed, since KStem assumes it has its own particular meaning and
is not derived *purely by inflection*.
The dictionary they used is the Longman dictionary, which is available
for free online. I looked up "dog"
http://www.ldoceonline.com/dictionary/dog_1 and found that there is a
sense there (sense 13) whose definition reads:
dogs [plural], American English, informal: feet
This sense doesn't mention the stem word "dog" - it clearly has a
different meaning from the main dog entry, so I guess the thinking
behind this is: if the person was searching for "dogs" (meaning feet)
they wouldn't want to find text with "dog" (meaning man's best friend).
Of course in this case, "dog" singular presumably could mean foot as
well, so the inference seems faulty, although perhaps that never
occurs? Honestly I've never heard of anyone using "dogs" to mean feet
either, but hey nobody's perfect.
This entry: http://www.ldoceonline.com/dictionary/bound_4 probably
explains the reason "bounds" doesn't stem to "bound".
In the Lucene KStemmer code, this translates into the word appearing in
one of the dictionary data files. If a word appears there (as "dogs"
and "bounds" do), it won't be stemmed. I suppose one possible approach
here would be to give the client the list of unstemmed dictionary words
and let them remove some, but then you'd have to compile your own
KStemmer variant.
Perhaps a nice feature to add to KStemmer would be to have it read a
list of exception words at run-time that would be removed from its
dictionary in order to allow them to be stemmed.
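As a rough illustration of that idea, here is a minimal Python sketch.
It is not Lucene's KStemmer: the class DictionaryStemmer, the helpers
naive_stem and load_exceptions, and the one-word-per-line exception
format are all hypothetical, and the toy plural rule only stands in for
KStem's real morphology.

```python
def naive_stem(word):
    """Toy stand-in for real stemming: strip a single trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def load_exceptions(lines):
    """Parse an exception list: one word per line, '#' starts a comment."""
    words = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            words.add(entry)
    return words


class DictionaryStemmer:
    """Leaves dictionary words alone unless they are listed as exceptions."""

    def __init__(self, dictionary, exceptions=()):
        # Exceptions are removed from the dictionary once, at construction
        # time, so they become eligible for stemming again.
        self.protected = {w.lower() for w in dictionary} - {
            w.lower() for w in exceptions
        }

    def stem(self, word):
        lower = word.lower()
        if lower in self.protected:
            return lower  # dictionary word: assumed to have its own meaning
        return naive_stem(lower)


if __name__ == "__main__":
    dictionary = ["dogs", "bounds"]  # assumed sample of dictionary entries
    exceptions = load_exceptions(["# words we do want stemmed", "dogs"])

    default = DictionaryStemmer(dictionary)
    custom = DictionaryStemmer(dictionary, exceptions)

    print(default.stem("dogs"))   # dogs
    print(custom.stem("dogs"))    # dog
    print(custom.stem("bounds"))  # bounds
```

The point of the design is that the shipped dictionary stays untouched:
the exception list is applied when the stemmer is constructed, so no
recompilation would be needed to change which words get stemmed.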
-Mike