[ 
https://issues.apache.org/jira/browse/LUCENE-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509545#comment-16509545
 ] 

Mike Sokolov commented on LUCENE-8352:
--------------------------------------

I currently rely on a custom TokenStreamComponents, created in order to 
override setReader (that does seem to be the only reason to subclass this 
class?). We don't use the Analyzer-wrapping pattern, so I guess we were lucky 
enough to avoid the trap you describe here. If this were made private, I'd 
want some other extension mechanism opened up for cases where you need to 
take some action for each instance being indexed. In our case we pass some 
metadata along with the actual text to be analyzed, and that metadata informs 
the analysis process.
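
To make the use case concrete, here is a minimal sketch of the kind of 
setReader override I mean. It deliberately does not use the real Lucene 
classes: Components is a simplified stand-in for TokenStreamComponents, and 
MetadataReader and MetadataAwareComponents are hypothetical names invented 
for this illustration. How the metadata actually travels with the text is an 
assumption here (a Reader subclass that carries a map).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;

// Simplified stand-in for Lucene's TokenStreamComponents; NOT the real class.
class Components {
    protected Reader reader;

    // Called once per value being analyzed, mirroring the setReader hook.
    protected void setReader(Reader reader) {
        this.reader = reader;
    }
}

// Hypothetical Reader that carries metadata alongside the text (an assumption).
class MetadataReader extends StringReader {
    final Map<String, String> metadata;

    MetadataReader(String text, Map<String, String> metadata) {
        super(text);
        this.metadata = metadata;
    }
}

// Subclass that uses the setReader hook to pick up per-instance metadata.
class MetadataAwareComponents extends Components {
    Map<String, String> currentMetadata = Map.of();

    @Override
    protected void setReader(Reader reader) {
        super.setReader(reader);
        // Reset to empty when the incoming reader carries no metadata.
        currentMetadata = (reader instanceof MetadataReader)
                ? ((MetadataReader) reader).metadata
                : Map.of();
    }
}

class SetReaderDemo {
    public static void main(String[] args) {
        MetadataAwareComponents components = new MetadataAwareComponents();
        components.setReader(new MetadataReader("some text", Map.of("lang", "en")));
        System.out.println(components.currentMetadata.get("lang"));
    }
}
```

The point is only that setReader is the one place that runs for every 
instance being analyzed, which is why losing it would hurt.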

I'm having difficulty seeing how to add something that would pass additional 
metadata down into the analysis chain without some fairly major impact though.

I think that overriding setReader currently provides some value that is 
challenging to achieve in any other way, so I would be in favor of keeping it 
public and looking into fixing the wrapping situation instead. For example, 
what if wrapComponents actually *wrapped* the original components instead of 
replacing them? Or, as you say, we could explore the idea of marking Analyzers 
as unwrappable. Perhaps AnalyzerWrapper should examine the type of 
TokenStreamComponents (or some new "discardable" method could be added) to 
decide whether it is safe to replace them?
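
A rough sketch of the difference between replacing and wrapping, again using 
simplified stand-ins rather than the real Lucene API: BaseComponents models 
TokenStreamComponents, and CountingComponents, DelegatingComponents, 
replacingWrap and delegatingWrap are all hypothetical names made up for this 
example. It only shows the delegation idea, not how wrapComponents would 
really need to change.

```java
import java.io.Reader;
import java.io.StringReader;

// Simplified stand-in for Lucene's TokenStreamComponents; NOT the real class.
class BaseComponents {
    protected void setReader(Reader reader) {
        // The real class hands the reader to its Tokenizer here.
    }
}

// A custom subclass whose setReader hook we want to keep firing.
class CountingComponents extends BaseComponents {
    int setReaderCalls = 0;

    @Override
    protected void setReader(Reader reader) {
        setReaderCalls++;
        super.setReader(reader);
    }
}

// Hypothetical alternative: keep a reference to the original components
// and forward setReader to them instead of discarding them.
class DelegatingComponents extends BaseComponents {
    private final BaseComponents wrapped;

    DelegatingComponents(BaseComponents wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    protected void setReader(Reader reader) {
        wrapped.setReader(reader);
    }
}

class WrapDemo {
    // Models the current behaviour: the original is replaced by a vanilla instance.
    static BaseComponents replacingWrap(BaseComponents original) {
        return new BaseComponents();
    }

    // Models the suggestion: the original is wrapped, so its hook survives.
    static BaseComponents delegatingWrap(BaseComponents original) {
        return new DelegatingComponents(original);
    }

    public static void main(String[] args) {
        CountingComponents custom = new CountingComponents();
        replacingWrap(custom).setReader(new StringReader("a"));
        System.out.println(custom.setReaderCalls); // 0: the custom hook was skipped
        delegatingWrap(custom).setReader(new StringReader("b"));
        System.out.println(custom.setReaderCalls); // 1: the custom hook ran
    }
}
```

In the real classes the wrapper would also have to expose the wrapped token 
stream, so this is only the setReader half of the problem.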

> Make TokenStreamComponents final
> --------------------------------
>
>                 Key: LUCENE-8352
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8352
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
>
> The current design is a little trappy. Any specialised subclasses of 
> TokenStreamComponents _(see StandardAnalyzer, ClassicAnalyzer, 
> UAX29URLEmailAnalyzer)_ are discarded by any subsequent Analyzers that wrap 
> them _(see LimitTokenCountAnalyzer, QueryAutoStopWordAnalyzer, 
> ShingleAnalyzerWrapper and other examples in elasticsearch)_. 
> The current design means each AnalyzerWrapper.wrapComponents() implementation 
> discards any custom TokenStreamComponents and replaces it with one of its own 
> choosing (a vanilla TokenStreamComponents class from examples I've seen).
> This is a trap I fell into when writing a custom TokenStreamComponents with a 
> custom setReader() and I wondered why it was not being triggered when wrapped 
> by other analyzers.
> If AnalyzerWrapper is designed to encourage composition, it's arguably a 
> mistake to also permit custom TokenStreamComponents subclasses, because the 
> composition process does not preserve the choice of custom classes and any 
> behaviours they might add. For this reason we should not encourage extensions 
> to TokenStreamComponents (or if TSC extensions are required we should somehow 
> mark an Analyzer as "unwrappable" to prevent lossy compositions).


