[jira] Commented: (LUCENE-2198) support protected words in Stemming TokenFilters

Uwe Schindler (JIRA) Mon, 18 Jan 2010 03:04:20 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801734#action_12801734
 ]


Uwe Schindler commented on LUCENE-2198:
---------------------------------------

bq. I also don't understand the arguments about type safety. Token bundles 
multiple attributes together w/o loss of type safety, right?

By "type safety" I do not mean attributes in general (they are type safe with 
any impl). FlagsAttribute itself is not type safe, because anybody can 
store/update any bit in this integer. If two different filters update the same 
bit but mean different things by it, things break. Other filters may also 
simply overwrite the flags without using bit operations (because the API has 
no support for them).
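
A minimal sketch of the two failure modes described above. It does not use 
Lucene: IntFlags is a hypothetical stand-in for an int-valued flags attribute, 
and the filter methods and bit constants are made up for illustration.

```java
// Hypothetical stand-in for an int-valued flags attribute (not a Lucene class).
final class IntFlags {
    private int flags;
    int getFlags() { return flags; }
    void setFlags(int flags) { this.flags = flags; }
}

final class FlagCollisionDemo {
    // Two independently written filters happen to claim the same bit.
    static final int PROTECTED_WORD = 1; // filter A: "do not stem this token"
    static final int SYNONYM = 1;        // filter B: "token was added as a synonym"
    static final int OTHER = 4;          // filter C: some unrelated marker

    static void markProtected(IntFlags f) { f.setFlags(f.getFlags() | PROTECTED_WORD); }
    static void markSynonym(IntFlags f)   { f.setFlags(f.getFlags() | SYNONYM); }

    // Filter C assigns the whole int instead of OR-ing its bit in.
    static void markOther(IntFlags f)     { f.setFlags(OTHER); }

    // What a stemmer would check before deciding to skip a token.
    static boolean stemmerShouldSkip(IntFlags f) {
        return (f.getFlags() & PROTECTED_WORD) != 0;
    }

    public static void main(String[] args) {
        // Failure 1: same bit, different meaning. A synonym is read as protected.
        IntFlags token1 = new IntFlags();
        markSynonym(token1);
        System.out.println(stemmerShouldSkip(token1)); // prints "true"

        // Failure 2: plain assignment wipes the bit another filter set earlier.
        IntFlags token2 = new IntFlags();
        markProtected(token2);
        markOther(token2);
        System.out.println(stemmerShouldSkip(token2)); // prints "false"
    }
}
```

Nothing in the compiler catches either case, which is the sense in which a 
shared integer is not type safe.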

bq. If we have more than one boolean attribute in lucene in future, we can 
extend DEFAULT_ATTRIBUTE_FACTORY to support this.

My idea is to have a default AttributeImpl for boolean attributes that supports 
things like set/get of a bit (like BitSet). You subclass it, e.g. to generate a 
combined impl for 4 boolean interfaces we may have in the future in Lucene core. 
In the ctor you pass the bitmasks, and the impls of all boolean getters/setters 
delegate to the generic BitSet-like methods. Clone and copyTo are then simple, 
as copyTo only copies the word if the target AttributeImpl is the same class 
(like Token.copyTo).
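
One way this idea could look, as a sketch only. It is written without Lucene 
so it stays self-contained: BooleanBitsImpl stands in for the proposed base 
AttributeImpl, and the interfaces KeywordFlag and ProperNounFlag, the class 
CombinedFlagsImpl and all method names are hypothetical. The fallback for a 
target of a different class is left out and simply throws.

```java
// Hypothetical boolean attribute interfaces; names are made up for this sketch.
interface KeywordFlag { boolean isKeyword(); void setKeyword(boolean value); }
interface ProperNounFlag { boolean isProperNoun(); void setProperNoun(boolean value); }

// Generic base: all boolean state lives in one word, accessed through masks.
abstract class BooleanBitsImpl implements Cloneable {
    private long bits;

    protected final boolean getBit(long mask) { return (bits & mask) != 0; }

    protected final void setBit(long mask, boolean value) {
        bits = value ? (bits | mask) : (bits & ~mask);
    }

    public void clear() { bits = 0L; }

    // Copies the whole word, but only into an instance of the same class.
    public void copyTo(BooleanBitsImpl target) {
        if (target.getClass() != getClass()) {
            throw new IllegalArgumentException("target must be " + getClass().getName());
        }
        target.bits = bits;
    }

    @Override
    public BooleanBitsImpl clone() {
        try {
            return (BooleanBitsImpl) super.clone(); // shallow copy is enough for one long
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

// Combined impl for several boolean interfaces; the masks are passed in the ctor.
final class CombinedFlagsImpl extends BooleanBitsImpl implements KeywordFlag, ProperNounFlag {
    private final long keywordMask;
    private final long properNounMask;

    CombinedFlagsImpl(long keywordMask, long properNounMask) {
        this.keywordMask = keywordMask;
        this.properNounMask = properNounMask;
    }

    // Every getter/setter just delegates to the generic bit methods.
    @Override public boolean isKeyword() { return getBit(keywordMask); }
    @Override public void setKeyword(boolean value) { setBit(keywordMask, value); }
    @Override public boolean isProperNoun() { return getBit(properNounMask); }
    @Override public void setProperNoun(boolean value) { setBit(properNounMask, value); }
}
```

Because each consumer only sees its own interface, it cannot touch a bit that 
belongs to another attribute, while all of them still share a single word of 
storage.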

> support protected words in Stemming TokenFilters
> ------------------------------------------------
>
>                 Key: LUCENE-2198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2198
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2198.patch, LUCENE-2198.patch
>
>
> This is from LUCENE-1515
> I propose that all stemming TokenFilters have an 'exclusion set' that 
> bypasses any stemming for words in this set.
> Some stemming tokenfilters have this, some do not.
> This would be one way for Karl to implement his new Swedish stemmer (as a 
> text file of ignore words).
> Additionally, it would remove duplication between Lucene and Solr, as they 
> reimplement SnowballFilter since it does not have this functionality.
> Finally, I think this is a pretty common use case, where people want to 
> ignore things like proper nouns in the stemming.
> As an alternative design I considered a case where we generalized this to 
> CharArrayMap (and ignoring words would mean mapping them to themselves), 
> which would also provide a mechanism to override the stemming algorithm. But 
> I think this is too expert, could be its own filter, and the only example of 
> this I can find is in the Dutch stemmer.
> So I think we should just provide ignore with CharArraySet, but if you feel 
> otherwise please comment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

