One other factor to keep in mind is that the customer should never "look" at
the actual stem term, such as "countri" or "gener", because it can freak
them out a little, for no good reason. The goal of stemming is to determine
which set of words/terms will be treated as equivalent on a query, and
this is independent of what gets returned for a stored field. The stem is
simply the means to THAT end.
The fact that "dog" and "dogs" are not equivalent in KStem is in fact
disheartening, at least to me, but it may not be problematic in some use
cases.
-- Jack Krupansky
-----Original Message-----
From: Scott Smith
Sent: Thursday, November 15, 2012 11:57 AM
To: java-user@lucene.apache.org
Subject: RE: Which stemmer?
Thanks for the suggestions. I think Erick is correct as well. I'll let the
customer decide.
Here's an updated list. FYI, "minStem" was the English Minimal Stemmer; I
changed the label to EngMinStem. Interesting to see where the minimal stemmer
and Porter agree (and KStem doesn't). You may also find the "dog" examples
interesting. I also found the "invest*" list entertaining.
original      porter        kstem         EngMinStem
------------  ------------  ------------  ------------
country       countri       country       country
countries     countri       country       country
country's     country'      country's     country'
run           run           run           run
runs          run           runs          run
running       run           running       running
read          read          read          read
reading       read          reading       reading
reader        reader        reader        reader
association   associ        association   association
associate     associ        associate     associate
listing       list          list          listing
water         water         water         water
watered       water         water         watered
sure          sure          sure          sure
surely        sure          surely        surely
invest        invest        invest        invest
investing     invest        invest        investing
investment    invest        investment    investment
investments   invest        investment    investment
invests       invest        invest        invest
investor      investor      invest        investor
invester      invest        invest        invester
investors     investor      invest        investor
investers     invest        invest        invester
organization  organ         organization  organization
organize      organ         organize      organize
organic       organ         organic       organic
generous      gener         generous      generous
generic       gener         generic       generic
dog           dog           dog           dog
dog's         dog'          dog's         dog'
dogs          dog           dogs          dog
dogs'         dog           dogs          dog
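
One way to hand this list to a customer is as groups of words that will match each other, rather than as raw stems. What follows is only a minimal sketch under stated assumptions: a few rows of the table are copied by hand into an array, the class and method names (StemGroups, groupByStem, equivalenceClasses) are hypothetical, and no Lucene API is called.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: groups words by the stem that a
// given stemmer produced, using rows copied from the table in this message.
class StemGroups {
    // Columns: original, porter, kstem, EngMinStem.
    static final String[][] ROWS = {
        {"dog", "dog", "dog", "dog"},
        {"dog's", "dog'", "dog's", "dog'"},
        {"dogs", "dog", "dogs", "dog"},
        {"dogs'", "dog", "dogs", "dog"},
        {"organization", "organ", "organization", "organization"},
        {"organize", "organ", "organize", "organize"},
        {"organic", "organ", "organic", "organic"},
    };

    // column: 1 = porter, 2 = kstem, 3 = EngMinStem.
    // Returns stem -> original words, in table order.
    static Map<String, List<String>> groupByStem(String[][] rows, int column) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[column], k -> new ArrayList<>()).add(row[0]);
        }
        return groups;
    }

    // Keeps only groups with more than one word and drops the stems,
    // so the reader sees which words match but never sees "organ" or "dog'".
    static List<List<String>> equivalenceClasses(String[][] rows, int column) {
        List<List<String>> classes = new ArrayList<>();
        for (List<String> words : groupByStem(rows, column).values()) {
            if (words.size() > 1) {
                classes.add(words);
            }
        }
        return classes;
    }

    public static void main(String[] args) {
        String[] names = {"porter", "kstem", "EngMinStem"};
        for (int c = 1; c <= 3; c++) {
            System.out.println(names[c - 1] + ": " + equivalenceClasses(ROWS, c));
        }
    }
}
```

For these seven rows, the porter column groups "dog", "dogs" and "dogs'" together and also groups "organization", "organize" and "organic"; the kstem column only groups "dogs" with "dogs'"; the EngMinStem column only groups "dog", "dogs" and "dogs'".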
Now, if someone would answer my question on the Solr list ("Custom Solr
Indexer/Search"), my day would be complete ;-).
Thanks for the continued help.
Scott
-----Original Message-----
From: Tom Burton-West [mailto:tburt...@umich.edu]
Sent: Thursday, November 15, 2012 11:06 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?
I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.
All stemmers both overstem and understem. Understemming means that some
forms of a word won't get searched. For example, without stemming,
searching for "dogs" would not retrieve documents containing the word "dog".
Generally there is a precision/recall trade-off where reducing understemming
increases overstemming. The problem with aggressive stemmers like the
Porter stemmer is that they overstem.
The original Porter stemmer, for example, would stem "organization" and
"organic" both to "organ", and "generalization", "generous", and "generic" to
"gener".
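
The trade-off can be made concrete by counting both kinds of error over a handful of word pairs. This is a rough sketch, not a real evaluation: the stems are the ones reported for Porter and KStem elsewhere in this thread, the judgment of which pairs "should" match is an assumption made for the example, and the class and method names (StemErrors, understemmed, overstemmed) are hypothetical.

```java
import java.util.Map;

// Hypothetical tally of stemming errors; not part of Lucene.
class StemErrors {
    // Stems as reported in this thread for the Porter and KStem stemmers.
    static final Map<String, String> PORTER = Map.of(
        "dog", "dog", "dogs", "dog",
        "organization", "organ", "organic", "organ",
        "generous", "gener", "generic", "gener");
    static final Map<String, String> KSTEM = Map.of(
        "dog", "dog", "dogs", "dogs",
        "organization", "organization", "organic", "organic",
        "generous", "generous", "generic", "generic");

    // Assumed relevance judgments: pairs a searcher would want to match,
    // and pairs that are unrelated in meaning.
    static final String[][] SHOULD_MATCH = {{"dog", "dogs"}};
    static final String[][] SHOULD_NOT_MATCH = {
        {"organization", "organic"},
        {"generous", "generic"},
    };

    // Two words are conflated when the stemmer maps them to the same stem.
    static boolean conflated(Map<String, String> stems, String a, String b) {
        return stems.get(a).equals(stems.get(b));
    }

    // Understemming: related words that were left apart (hurts recall).
    static int understemmed(Map<String, String> stems) {
        int count = 0;
        for (String[] pair : SHOULD_MATCH) {
            if (!conflated(stems, pair[0], pair[1])) count++;
        }
        return count;
    }

    // Overstemming: unrelated words that were merged (hurts precision).
    static int overstemmed(Map<String, String> stems) {
        int count = 0;
        for (String[] pair : SHOULD_NOT_MATCH) {
            if (conflated(stems, pair[0], pair[1])) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("porter: under=" + understemmed(PORTER) + " over=" + overstemmed(PORTER));
        System.out.println("kstem:  under=" + understemmed(KSTEM) + " over=" + overstemmed(KSTEM));
    }
}
```

On these three pairs Porter overstems twice and never understems, while KStem never overstems but understems once ("dog" vs. "dogs"). A real comparison would need a much larger set of pairs drawn from the customer's own content.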
For background on the Porter stemmers and lots of examples see these pages:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
http://snowball.tartarus.org/algorithms/english/stemmer.html
This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing Morphology
as an Inference Process" (Krovetz, R., Proceedings of the Sixteenth
Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, 191-203, 1993).
http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
Tom
http://www.hathitrust.org/blogs/large-scale-search
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org