Re: sanity check on how stemming, stopwords, and snowball analyzer works together

Mark Miller Mon, 15 Oct 2007 07:38:17 -0700

Sounds right to me.

The other option I think you have is to not use the MoreLikeThis stopword functionality. Instead add the stopwords to the analyzer that you pass to MoreLikeThis. That way you can ensure that the analyzer applies the stopword list before stemming (the MoreLikeThis stopword removal is implemented so that stopwords are removed after stemming). Then you just have to add 'developer' to the stop list, and you can forget about handling stemmed forms.
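
A minimal sketch of the ordering described above (stopword removal first, stemming second), assuming a toy suffix-stripping stemmer in place of Snowball. The class StopBeforeStem and the methods toyStem and analyze are hypothetical names for illustration, not Lucene API. Note that in this ordering the stop list is matched against the unstemmed tokens, so only the exact surface form "developer" is dropped, which is consistent with the behaviour Donna reports in the quoted message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopBeforeStem {
    // Hypothetical toy stemmer standing in for Snowball: strips one common suffix.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "er"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lowercase each token, drop exact stopword matches, then stem the survivors.
    static List<String> analyze(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(lower)) {
                out.add(toyStem(lower));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Developer", "developed", "search", "index");
        // "developer" is removed before stemming; "developed" is not in the
        // stop list, so it survives and is stemmed to "develop".
        System.out.println(analyze(tokens, Set.of("developer")));
        // prints [develop, search, index]
    }
}
```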


Your method should also work though.

- Mark

Donna L Gresh wrote:

Could those "in the know" comment on my current understanding of stemming and stopwords using the snowball analyzer?
In my application, I am using the MoreLikeThis class to find similar documents to an input "text blob". There are words in the input text blob which are "uninteresting" for my application, so I create a list of these words. These words are "uninteresting" no matter what their tense or usage, for example, "develop", "developing", "developed", and "developer" are all uninteresting and I do not want them included in the search query created by the MoreLikeThis class.
My index documents are stemmed using the Snowball analyzer. I do not use any stopwords when the documents are indexed (as I would like the choice of stopwords to be under user control at search time).
I would like the user to be able to provide to the search application a list of "uninteresting" words, and for obvious reasons would like to force them to provide only, say, "developer" and have the application understand that all variants should be ignored (and I don't want to force them to try to guess what the stemmed version of "developer" is).
My first try was to use MoreLikeThis with the Snowball analyzer and a simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and MoreLikeThis.setStopWords). However, it appears that the stopwords provided to the MoreLikeThis class are compared in an exact way to the token stream output by the Snowball filter (where the words have been stemmed), so "developer" will not match anything, and all variants pass through. Even if I provide the list of unstemmed stopwords to the snowball analyzer instead, they are used "as-is" with no stemming performed, so "developer" will not remove "developed".
Apparently the following is necessary for my application:
Construct a snowball analyzer with no stopwords. Use the unstemmed stopword list with the analyzer to construct a stemmed version of the set of stopwords. Use this set of stemmed stopwords as the stopwords input to the MoreLikeThis class (where the tokens are compared to the stemmed versions after being output from the Snowball analyzer).
Is my understanding correct?
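
The procedure described above can be sketched roughly as follows, assuming a toy suffix-stripping stemmer in place of the Snowball analyzer. The class StemmedStopList and the methods toyStem, stemStopWords, and stemThenStop are hypothetical names for illustration, not Lucene API. In a real application the stemmed set would presumably be what gets passed to MoreLikeThis.setStopWords.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StemmedStopList {
    // Hypothetical toy stemmer standing in for Snowball: strips one common suffix.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "er"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Run the user's unstemmed stop list through the same stemmer used on the text.
    static Set<String> stemStopWords(Collection<String> userStopWords) {
        Set<String> stemmed = new HashSet<>();
        for (String word : userStopWords) {
            stemmed.add(toyStem(word.toLowerCase(Locale.ROOT)));
        }
        return stemmed;
    }

    // Stem each token first, then drop it if its stem is in the stop set
    // (stopword removal applied after stemming).
    static List<String> stemThenStop(List<String> tokens, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String stem = toyStem(token.toLowerCase(Locale.ROOT));
            if (!stopSet.contains(stem)) {
                out.add(stem);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
            List.of("develop", "developing", "developed", "developer", "search");

        // Unstemmed stop list compared against stems: "developer" matches nothing.
        System.out.println(stemThenStop(tokens, Set.of("developer")));
        // prints [develop, develop, develop, develop, search]

        // Stemmed stop list: every variant collapses to "develop" and is removed.
        System.out.println(stemThenStop(tokens, stemStopWords(List.of("developer"))));
        // prints [search]
    }
}
```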

Donna

