[
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449993#comment-16449993
]
David Smiley commented on LUCENE-8273:
--------------------------------------
Nice!

Could you add a test with a filter that may produce multiple terms instead of
just one-to-one? And maybe try the scenario where the filter swallows the token
(e.g. WDF sees a token that is simply a symbol). The documentation is OK, but I
was confused about what usage would look like in practice until I looked at the
test, so maybe a simple example in the class javadocs could shed light on this.
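
To make the two requested scenarios concrete, here is a minimal sketch of the
bypass idea in plain Java. It does not use Lucene or the API from the attached
patch: BypassSketch, applyConditionally and splitOnDelimiters are hypothetical
names, and the splitter is only a toy stand-in for WDF. A token is handed to
the wrapped filter only if the predicate accepts it; the filter may return
several terms or none at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class BypassSketch {

    /**
     * Hypothetical helper: applies `filter` only to tokens accepted by
     * `shouldFilter`; all other tokens bypass the filter unchanged.
     */
    public static List<String> applyConditionally(List<String> tokens,
                                                  Predicate<String> shouldFilter,
                                                  Function<String, List<String>> filter) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (shouldFilter.test(token)) {
                // The filter may emit zero, one, or many terms for one input.
                out.addAll(filter.apply(token));
            } else {
                out.add(token);
            }
        }
        return out;
    }

    /**
     * Toy stand-in for a word-delimiter filter (an assumption, not WDF itself):
     * splits on runs of non-alphanumeric characters and drops empty parts,
     * so a token that is only a symbol is swallowed.
     */
    public static List<String> splitOnDelimiters(String token) {
        List<String> parts = new ArrayList<>();
        for (String part : token.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("wi-fi", "C++", "-", "router");
        // Only tokens containing a hyphen reach the splitter.
        List<String> result = applyConditionally(
                tokens, t -> t.contains("-"), BypassSketch::splitOnDelimiters);
        System.out.println(result); // prints [wi, fi, C++, router]
    }
}
```

Here "wi-fi" becomes two terms, the lone "-" is swallowed, and "C++" bypasses
the splitter entirely because the predicate rejects it.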

With such a general utility, I wonder if the existing TokenFilters that have
precondition checks (e.g. stemmers that check conditions) needn't bother doing
this anymore, since you could wrap the stemmer with the BypassingTokenFilter
here, using a check for whether the word is in a list. Then we wouldn't even
need KeywordAttribute! I realize this takes your simple proposal very far, but
I think it's worth discussing for 8.0.
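
As a rough illustration of the word-list idea, the sketch below skips a stemmer
for terms found in a protected set, with no per-token keyword flag involved.
It is plain Java rather than Lucene code: ProtectedWordsSketch, toyStem and
stemUnlessProtected are hypothetical names, and the "stemmer" only strips a
trailing "s".

```java
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class ProtectedWordsSketch {

    /** Toy stemmer (an assumption for illustration): strips one trailing "s". */
    public static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /**
     * Hypothetical wrapper: terms in `protectedWords` bypass the stemmer;
     * every other term is passed through it.
     */
    public static List<String> stemUnlessProtected(List<String> terms,
                                                   Set<String> protectedWords,
                                                   UnaryOperator<String> stemmer) {
        return terms.stream()
                .map(t -> protectedWords.contains(t) ? t : stemmer.apply(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("cats", "kings", "news");
        List<String> result = stemUnlessProtected(
                terms, Set.of("news"), ProtectedWordsSketch::toyStem);
        System.out.println(result); // prints [cat, king, news]
    }
}
```

The stemmer itself carries no precondition check; the decision to skip it
lives entirely in the wrapper.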

An alternative to your BypassingTokenFilter is creating an intermediate base
class between TokenFilter and the existing TokenFilters that bypass (e.g.
stemmers, plus ones that ought to, like WDF). But thinking about this more,
that seems like a bigger, more disruptive change, and it wouldn't cast as wide
a net as BypassingTokenFilter, which can wrap any filter, even one whose author
forgot to consider being bypassed.

> Add a BypassingTokenFilter
> --------------------------
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265. It would be useful to be able to wrap a TokenFilter
> in such a way that it could optionally be bypassed based on the current state
> of the TokenStream. This could be used to, for example, only apply
> WordDelimiterFilter to terms that contain hyphens.