Thanks for the suggestions. I think Erick is correct as well. I'll let the 
customer decide.

Here's an updated list.  FYI, the column previously labeled minStem is the 
English Minimal Stemmer; I changed the label to EngMinStem.  It is 
interesting to see where the minimal stemmer and Porter agree (and KStemmer 
doesn't).  You may also find the "dog" examples interesting.  I also found 
the "invest*" list entertaining.

   original       porter        kstem   EngMinStem
-----------  -----------  -----------  -----------
    country      countri      country      country
  countries      countri      country      country
  country's     country'    country's     country'
        run          run          run          run
       runs          run         runs          run
    running          run      running      running
       read         read         read         read
    reading         read      reading      reading
     reader       reader       reader       reader
association       associ  association  association
  associate       associ    associate    associate
    listing         list         list      listing
      water        water        water        water
    watered        water        water      watered
       sure         sure         sure         sure
     surely         sure       surely       surely
     invest       invest       invest       invest
  investing       invest       invest    investing
 investment       invest   investment   investment
investments       invest   investment   investment
    invests       invest       invest       invest
   investor     investor       invest     investor
   invester       invest       invest     invester
  investors     investor       invest     investor
  investers       invest       invest     invester
organization        organ  organization  organization
   organize        organ     organize     organize
    organic        organ      organic      organic
   generous        gener     generous     generous
    generic        gener      generic      generic
        dog          dog          dog          dog
      dog's         dog'        dog's         dog'
       dogs          dog         dogs          dog
      dogs'          dog         dogs          dog
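
One way to read the table is to group the originals by the stem each stemmer 
gives them: any group with more than one member is a set of words that will 
match each other at search time.  Below is a minimal Java sketch of that idea.  
It does not call Lucene; the rows are copied by hand from the table above, and 
the class and method names (StemGroups, groupByStem) are made up for 
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups words by stem, using a few rows copied from the table above. */
final class StemGroups {

    // Column order: original, porter, kstem, EngMinStem.
    // Values are hand-copied from the table; nothing is computed by Lucene here.
    static final String[][] ROWS = {
        {"organization", "organ",    "organization", "organization"},
        {"organize",     "organ",    "organize",     "organize"},
        {"organic",      "organ",    "organic",      "organic"},
        {"generous",     "gener",    "generous",     "generous"},
        {"generic",      "gener",    "generic",      "generic"},
        {"investor",     "investor", "invest",       "investor"},
        {"invester",     "invest",   "invest",       "invester"},
        {"investors",    "investor", "invest",       "investor"},
        {"dog",          "dog",      "dog",          "dog"},
        {"dogs",         "dog",      "dogs",         "dog"},
    };

    static final int PORTER = 1;
    static final int KSTEM = 2;
    static final int ENG_MIN = 3;

    /** Maps each stem in the chosen column to the originals that reduce to it. */
    static Map<String, List<String>> groupByStem(int column) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] row : ROWS) {
            groups.computeIfAbsent(row[column], k -> new ArrayList<>()).add(row[0]);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println("porter:     " + groupByStem(PORTER));
        System.out.println("kstem:      " + groupByStem(KSTEM));
        System.out.println("EngMinStem: " + groupByStem(ENG_MIN));
    }
}
```

With these rows, the porter column puts "organization", "organize" and 
"organic" in one group under "organ", while kstem keeps them apart but merges 
"investor", "invester" and "investors" under "invest".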

Now, if someone would answer my question on the Solr list ("Custom Solr 
Indexer/Search"), my day would be complete ;-).

Thanks for the continued help.

Scott

-----Original Message-----
From: Tom Burton-West [mailto:tburt...@umich.edu] 
Sent: Thursday, November 15, 2012 11:06 AM
To: java-user@lucene.apache.org
Subject: Re: Which stemmer?

I agree with Erick that you probably need to give your client a list of 
concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem.  Understemming means that some 
forms of a word won't get searched.  For example, without stemming, searching 
for "dogs" would not retrieve documents containing the word "dog".  
Overstemming is the opposite problem: unrelated words are reduced to the same 
stem and so match each other.  Generally there is a precision/recall tradeoff, 
where reducing understemming increases overstemming.  The problem with 
aggressive stemmers like the Porter stemmer is that they overstem.

The original Porter stemmer, for example, would stem "organization" and 
"organic" both to "organ", and "generalization", "generous", and "generic" to 
"gener".

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer understems 
or overstems and explains the logic of Kstem: "Viewing Morphology as an 
Inference Process" (Krovetz, R., Proceedings of the Sixteenth Annual 
International ACM SIGIR Conference on Research and Development in Information 
Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search

