[ https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798388#action_12798388 ]
Simon Willnauer commented on LUCENE-2198:
-----------------------------------------

bq. So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment.

While I was reading your proposal, a possibly more flexible design came to mind. We could introduce a StemAttribute with a method public boolean stem() that every stemmer consults to decide whether a token should be stemmed. That way we decouple the decision of whether a token should be stemmed from the stemming algorithm. This also lets custom filters set the value for reasons other than a term being in a set. The default value would of course be true, but it could be changed on any condition.

Inside an analyzer we can add a filter based on a CharArraySet right before the stemmer. If the set is empty or null, we simply leave the filter out.

> support protected words in Stemming TokenFilters
> ------------------------------------------------
>
>                 Key: LUCENE-2198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2198
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Robert Muir
>            Priority: Minor
>
> This is from LUCENE-1515
> I propose that all stemming TokenFilters have an 'exclusion set' that
> bypasses any stemming for words in this set.
> Some stemming tokenfilters have this, some do not.
> This would be one way for Karl to implement his new Swedish stemmer (as a
> text file of ignore words).
> Additionally, it would remove duplication between Lucene and Solr, as they
> reimplement snowballfilter since it does not have this functionality.
> Finally, I think this is a pretty common use case, where people want to
> ignore things like proper nouns in the stemming.
> As an alternative design I considered a case where we generalized this to
> CharArrayMap (and ignoring words would mean mapping them to themselves),
> which would also provide a mechanism to override the stemming algorithm.
> But I think this is too expert, could be its own filter, and the only example of
> this I can find is in the Dutch stemmer.
> So I think we should just provide ignore with CharArraySet, but if you feel
> otherwise please comment.
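
To make the StemAttribute idea from the comment concrete, here is a minimal plain-Java sketch. It deliberately does not use Lucene's TokenStream or attribute classes: Token, markProtected, toyStem and analyze are hypothetical names invented for illustration, a HashSet stands in for CharArraySet, and the "stemmer" only strips a trailing "s". It shows just the decoupling: one stage sets a per-token flag, and the stemmer checks the flag without knowing why it was set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StemFlagSketch {

    /** Hypothetical stand-in for a token that carries the proposed stem flag. */
    static final class Token {
        final String term;
        boolean stem = true; // default: every token may be stemmed

        Token(String term) {
            this.term = term;
        }
    }

    /** Marker stage: clears the flag for terms found in the protected set. */
    static void markProtected(List<Token> tokens, Set<String> protectedWords) {
        for (Token t : tokens) {
            if (protectedWords.contains(t.term)) {
                t.stem = false;
            }
        }
    }

    /** Toy stemmer (assumption, not a real algorithm): strips one trailing "s". */
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** Runs marker then stemmer; the marker is skipped for a null or empty set. */
    public static List<String> analyze(List<String> terms, Set<String> protectedWords) {
        List<Token> tokens = new ArrayList<>();
        for (String s : terms) {
            tokens.add(new Token(s));
        }
        if (protectedWords != null && !protectedWords.isEmpty()) {
            markProtected(tokens, protectedWords);
        }
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            // The stemmer only consults the flag; it never sees the set itself.
            out.add(t.stem ? toyStem(t.term) : t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protectedWords = new HashSet<>(Arrays.asList("Jones"));
        System.out.println(analyze(Arrays.asList("cats", "Jones", "dogs"), protectedWords));
        // prints [cat, Jones, dog]
    }
}
```

Because the stemmer depends only on the flag, any other stage (for example one driven by part-of-speech tags) could clear it without touching the stemming code.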