[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449993#comment-16449993
 ] 

David Smiley commented on LUCENE-8273:
--------------------------------------

Nice!

Could you add a test with a filter that may produce multiple terms instead of 
just one-to-one?  And maybe also try the scenario where the filter swallows the 
token (e.g. WDF sees a token that is simply a symbol).  The documentation is 
OK, but I was confused about what usage would look like in practice until I 
looked at the test, so a simple example in the class javadocs could shed light 
on this. 
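
To make the two test scenarios above concrete, here is a minimal sketch in 
plain Java. It is a toy model and not the Lucene API or the code in the 
attached patch: a "filter" is modeled as a function from one term to zero or 
more terms, and the names BypassSketch, splitOnDelimiters and 
applyConditionally are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model only (assumption): terms are plain Strings, not Lucene attributes.
final class BypassSketch {

    // Hypothetical stand-in for a splitting filter such as WDF: splits on
    // non-alphanumeric characters and drops empty parts, so a term made up
    // only of symbols produces no output at all (it is "swallowed").
    static List<String> splitOnDelimiters(String term) {
        List<String> parts = new ArrayList<>();
        for (String part : term.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    // Applies the wrapped filter only to terms accepted by shouldFilter;
    // every other term bypasses the filter and is passed through unchanged.
    static List<String> applyConditionally(List<String> terms,
                                           Predicate<String> shouldFilter,
                                           Function<String, List<String>> filter) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (shouldFilter.test(term)) {
                out.addAll(filter.apply(term)); // may add 0, 1, or many terms
            } else {
                out.add(term);
            }
        }
        return out;
    }
}
```

With the condition "term contains a hyphen", the input wi-fi, -, C++, router 
comes out as wi, fi, C++, router: wi-fi is split into two terms, the lone 
hyphen is swallowed by the wrapped filter, and C++ bypasses the filter 
untouched even though the filter alone would have reduced it to C.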

With such a general utility, I wonder if the existing TokenFilters that have 
precondition checks (e.g. stemmers that check conditions) still need to bother 
doing this, since you could wrap the stemmer with the BypassingTokenFilter 
here and have it check whether the word is in a list.  Then we wouldn't even 
need KeywordAttribute!  I realize this stretches your simple proposal quite 
far, but I think it's worth discussing for 8.0.
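
The idea of replacing a stemmer's own precondition check with an external 
word-list check can be sketched the same way. Again this is a plain-Java toy, 
not Lucene code: toyStem is a deliberately naive stand-in for a real stemmer, 
and ProtectedWordsSketch and stemUnlessProtected are hypothetical names.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model only (assumption): no Lucene classes or attributes are involved.
final class ProtectedWordsSketch {

    // Naive stand-in for a stemmer: strips one trailing "s" from terms longer
    // than three characters. Note it contains no protected-word check itself.
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // The bypass decision lives outside the stemmer: terms in the protected
    // set skip stemming, all other terms are stemmed.
    static List<String> stemUnlessProtected(List<String> terms,
                                            Set<String> protectedWords) {
        return terms.stream()
                .map(t -> protectedWords.contains(t) ? t : toyStem(t))
                .collect(Collectors.toList());
    }
}
```

With lucas in the protected set, the input filters, lucas, tokens becomes 
filter, lucas, token, whereas the unwrapped toy stemmer would have turned 
lucas into luca.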

An alternative to your BypassingTokenFilter would be to create an intermediate 
base class between TokenFilter and the existing TokenFilters that bypass (e.g. 
stemmers, plus ones that ought to, like WDF).  But thinking about this more, 
that seems like a bigger, more disruptive change, and it wouldn't cast as wide 
a net as BypassingTokenFilter, which can wrap any filter, even one whose 
author forgot to consider bypassing.

> Add a BypassingTokenFilter
> --------------------------
>
>                 Key: LUCENE-8273
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8273
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.


