One other factor to keep in mind is that the customer should never "look" at the actual stem term - such as "countri" or "gener" - because it can freak them out a little, for no good reason. I mean, the goal of stemming is to determine what set of words/terms will be treated as equivalent on a query, and this is independent of what gets returned for a stored field. The stem is simply the means to THAT end.
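
One way to act on this is to show the customer the sets of equivalent words rather than the stems. The following is only a rough sketch, not Lucene code: the stems are copied by hand from the porter and kstem columns of the table quoted further down in this thread, and the helper name equivalence_sets is made up for illustration.

```python
from collections import defaultdict

# Word -> stem, copied by hand from the table quoted later in this thread.
PORTER = {
    "country": "countri", "countries": "countri",
    "generous": "gener", "generic": "gener",
    "dog": "dog", "dogs": "dog",
}
KSTEM = {
    "country": "country", "countries": "country",
    "generous": "generous", "generic": "generic",
    "dog": "dog", "dogs": "dogs",
}

def equivalence_sets(stems):
    """Return the groups of words that share a stem, without exposing the stem.

    Hypothetical helper: words that stand alone are omitted, since they
    are only equivalent to themselves.
    """
    groups = defaultdict(list)
    for word, stem in stems.items():
        groups[stem].append(word)
    return sorted(sorted(words) for words in groups.values() if len(words) > 1)

if __name__ == "__main__":
    print("porter:", equivalence_sets(PORTER))
    print("kstem: ", equivalence_sets(KSTEM))
```

The customer then sees lists such as "countries, country" and never the string "countri".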

The fact that "dog" and "dogs" are not equivalent in KStem is in fact disheartening, at least to me, but it may not be problematic in some use cases.

-- Jack Krupansky

-----Original Message----- From: Scott Smith
Sent: Thursday, November 15, 2012 11:57 AM
To: java-user@lucene.apache.org
Subject: RE: Which stemmer?

Thanks for the suggestions. I think Erick is correct as well. I'll let the customer decide.

Here's an updated list. Fyi--the minStem was the English Minimal Stemmer--I changed the label. Interesting to see where the minimal stemmer and porter agree (and KStemmer doesn't). You may also find the "dog" examples interesting. I also found the "invest*" list entertaining.

    original        porter         kstem    EngMinStem
------------  ------------  ------------  ------------
     country       countri       country       country
   countries       countri       country       country
   country's      country'     country's      country'
         run           run           run           run
        runs           run          runs           run
     running           run       running       running
        read          read          read          read
     reading          read       reading       reading
      reader        reader        reader        reader
 association        associ   association   association
   associate        associ     associate     associate
     listing          list          list       listing
       water         water         water         water
     watered         water         water       watered
        sure          sure          sure          sure
      surely          sure        surely        surely
      invest        invest        invest        invest
   investing        invest        invest     investing
  investment        invest    investment    investment
 investments        invest    investment    investment
     invests        invest        invest        invest
    investor      investor        invest      investor
    invester        invest        invest      invester
   investors      investor        invest      investor
   investers        invest        invest      invester
organization         organ  organization  organization
    organize         organ      organize      organize
     organic         organ       organic       organic
    generous         gener      generous      generous
     generic         gener       generic       generic
         dog           dog           dog           dog
       dog's          dog'         dog's          dog'
        dogs           dog          dogs           dog
       dogs'           dog          dogs           dog

Now, if someone would answer my question on the Solr list ("Custom Solr Indexer/Search"), my day would be complete ;-).

Thanks for the continued help.

Scott

-----Original Message-----
From: Tom Burton-West [mailto:tburt...@umich.edu]
Sent: Thursday, November 15, 2012 11:06 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I agree with Erick that you probably need to give your client a list of concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem. Understemming means that some forms of a word won't get searched. For example, without stemming, searching for "dogs" would not retrieve documents containing the word "dog". Overstemming is the opposite: unrelated words are reduced to the same stem and so match each other. Generally there is a precision/recall tradeoff where reducing understemming increases overstemming. The problem with aggressive stemmers like the Porter stemmer is that they overstem.
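
To make the two kinds of error concrete, here is a minimal sketch that counts them for a handful of words. The stems are copied by hand from the porter and kstem columns of the table in this thread. The INTENDED groups are an assumption added purely for illustration (one person's judgment of which words ought to match), and the function name stemming_errors is hypothetical.

```python
from itertools import combinations

# Word -> stem, copied by hand from the table in this thread.
STEMS = {
    "porter": {
        "dog": "dog", "dogs": "dog",
        "organization": "organ", "organic": "organ",
        "generous": "gener", "generic": "gener",
    },
    "kstem": {
        "dog": "dog", "dogs": "dogs",
        "organization": "organization", "organic": "organic",
        "generous": "generous", "generic": "generic",
    },
}

# ASSUMPTION: hand-written judgments of which words should match each other.
INTENDED = [{"dog", "dogs"}, {"organization"}, {"organic"},
            {"generous"}, {"generic"}]

def should_match(a, b):
    """True if both words sit in the same intended group."""
    return any(a in group and b in group for group in INTENDED)

def stemming_errors(stems):
    """Return (understemmed_pairs, overstemmed_pairs) for one stemmer."""
    under, over = [], []
    for a, b in combinations(sorted(stems), 2):
        merged = stems[a] == stems[b]
        wanted = should_match(a, b)
        if wanted and not merged:
            under.append((a, b))   # hurts recall
        elif merged and not wanted:
            over.append((a, b))    # hurts precision
    return under, over

if __name__ == "__main__":
    for name, stems in STEMS.items():
        under, over = stemming_errors(stems)
        print(name, "understemmed:", under, "overstemmed:", over)
```

On this small sample the porter column only overstems and the kstem column only understems, which is the tradeoff in miniature. A different set of intended groups would give different counts.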

The original Porter stemmer, for example, would stem "organization" and "organic" both to "organ", and "generalization", "generous" and "generic" to "gener".

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer understems or overstems and explains the logic of Kstem: "Viewing Morphology as an Inference Process" (Krovetz, R., Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

