[jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters

Simon Willnauer (JIRA) Sat, 09 Jan 2010 10:41:18 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798388#action_12798388
 ]


Simon Willnauer commented on LUCENE-2198:
-----------------------------------------

bq. So I think we should just provide ignore with CharArraySet, but if you feel 
otherwise please comment.
While reading your proposal, a possibly more flexible design came to mind. We 
could introduce a StemAttribute with a method public boolean stem() that every 
stemmer consults to decide whether a token should be stemmed. That way we 
decouple the decision of whether to stem a token from the stemming algorithm 
itself. It also lets custom filters set the value for reasons other than a 
term being in a set. 
The default value would of course be true, but it can be changed on any 
condition. Inside an analyzer we can add a filter right before the stemmer 
that sets the value based on a CharArraySet. If the set is empty or null, we 
simply leave the filter out. 
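
To make the idea concrete, here is a rough, Lucene-free sketch in plain Java. 
It is not the proposed patch: Token, markProtected, stemAll and analyze are 
hypothetical names, the boolean field stands in for the suggested 
StemAttribute, a plain Set stands in for CharArraySet, and the "stemmer" just 
strips a trailing "s". It only shows that the marking step and the stemmer 
need to share nothing but the flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class StemFlagSketch {

    /** Hypothetical stand-in for a token carrying the proposed stem flag. */
    static final class Token {
        final String term;
        boolean stem = true; // default: every token is stemmed

        Token(String term) {
            this.term = term;
        }
    }

    /**
     * Plays the role of the filter placed right before the stemmer.
     * Any condition can clear the flag, not only membership in a set.
     */
    static void markProtected(List<Token> tokens, Predicate<String> isProtected) {
        for (Token t : tokens) {
            if (isProtected.test(t.term)) {
                t.stem = false;
            }
        }
    }

    /** Toy stemmer (strips one trailing "s"); it looks only at the flag. */
    static List<String> stemAll(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            boolean strip = t.stem && t.term.length() > 1 && t.term.endsWith("s");
            out.add(strip ? t.term.substring(0, t.term.length() - 1) : t.term);
        }
        return out;
    }

    /** Assembles the chain; the marking step is skipped for a null or empty set. */
    static List<String> analyze(List<String> terms, Set<String> protectedWords) {
        List<Token> tokens = new ArrayList<>();
        for (String s : terms) {
            tokens.add(new Token(s));
        }
        if (protectedWords != null && !protectedWords.isEmpty()) {
            markProtected(tokens, protectedWords::contains);
        }
        return stemAll(tokens);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("cats", "Paris", "dogs");
        // prints [cat, Paris, dog]
        System.out.println(analyze(terms, Collections.singleton("Paris")));
    }
}
```

Because markProtected takes a predicate, the same stemmer would work unchanged 
with a filter that protects tokens for some other reason, for example by 
token type or capitalization.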



> support protected words in Stemming TokenFilters
> ------------------------------------------------
>
>                 Key: LUCENE-2198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2198
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Robert Muir
>            Priority: Minor
>
> This is from LUCENE-1515
> I propose that all stemming TokenFilters have an 'exclusion set' that 
> bypasses any stemming for words in this set.
> Some stemming tokenfilters have this, some do not.
> This would be one way for Karl to implement his new Swedish stemmer (as a 
> text file of ignore words).
> Additionally, it would remove duplication between Lucene and Solr, as they 
> reimplement SnowballFilter since it does not have this functionality.
> Finally, I think this is a pretty common use case, where people want to 
> ignore things like proper nouns in the stemming.
> As an alternative design I considered a case where we generalized this to 
> CharArrayMap (and ignoring words would mean mapping them to themselves), 
> which would also provide a mechanism to override the stemming algorithm. But 
> I think this is too expert, could be its own filter, and the only example of 
> this I can find is in the Dutch stemmer.
> So I think we should just provide ignore with CharArraySet, but if you feel 
> otherwise please comment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


