[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890090#comment-15890090 ]
Michael McCandless commented on LUCENE-7719:
--------------------------------------------

Wow, this is a great catch [~dmitrymalinin]! Thank you for opening the precursor issue.

{{AutomatonQuery.getAutomaton}} really must return a UTF8-oriented automaton because that matches how the terms are indexed into Lucene, and what the automaton will be intersected with, to run the query. We should fix the javadocs to say this. And it is sort of annoying that these differences are not strongly typed, but the {{Automaton}} class is really agnostic to what ints you are putting onto its transitions.

But, yeah, for highlighting, we are operating in UTF16 space, and so I think we need some way to have the {{CharacterRunAutomaton}} interface on top of a UTF8 automaton? Maybe we should abstract out a separate interface that {{MultiTermHighlighting}} would use? It seems it only uses the {{run}} method, to test if a given term is accepted? And then, as you suggested, we could easily convert the incoming char[] to UTF8 BytesRef and use the {{ByteRunAutomaton.run}} on that.

> UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-7719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7719
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: David Smiley
>
> In MultiTermHighlighting, a CharacterRunAutomaton is being created that takes
> the result of AutomatonQuery.getAutomaton that in turn is byte oriented, not
> character oriented. For ASCII terms, this is safe but it's not for
> multi-byte characters. This is most likely going to rear its head with a
> WildcardQuery, but due to special casing in MultiTermHighlighting,
> PrefixQuery isn't affected. Nonetheless it'd be nice to get a general fix in
> so that MultiTermHighlighting can remove special cases for PrefixQuery and
> TermRangeQuery (both subclass AutomatonQuery).
> AFAICT, this bug was likely in the PostingsHighlighter since inception.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
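
To make the mismatch described in the comment concrete, here is a minimal, stdlib-only sketch. It does not use Lucene: the class {{Utf8RunSketch}} and all of its methods are hypothetical stand-ins, and the toy "automaton" is just a prefix check over int labels, chosen because (like {{Automaton}}) it does not know whether those ints are UTF8 bytes or UTF16 chars. It only illustrates why running UTF16 chars against UTF8 labels fails for multi-byte characters, and why converting the incoming char[] to UTF8 first would avoid that.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration only; none of these names are Lucene APIs.
final class Utf8RunSketch {

    // Toy "automaton": accepts any input that starts with the given labels.
    // It is agnostic to whether the ints are bytes or chars.
    static boolean acceptsPrefix(int[] labels, int[] input) {
        if (input.length < labels.length) {
            return false;
        }
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] != input[i]) {
                return false;
            }
        }
        return true;
    }

    // Encode a string as UTF-8 and widen each byte to an unsigned int label.
    static int[] utf8Labels(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xFF;
        }
        return labels;
    }

    // Problematic path: feed UTF-16 code units straight into UTF-8 labels.
    static boolean runOnChars(int[] utf8Automaton, char[] term, int offset, int length) {
        int[] input = new int[length];
        for (int i = 0; i < length; i++) {
            input[i] = term[offset + i];
        }
        return acceptsPrefix(utf8Automaton, input);
    }

    // Suggested path: convert the char[] to UTF-8 first, then run on the bytes.
    static boolean runOnUtf8(int[] utf8Automaton, char[] term, int offset, int length) {
        String s = new String(term, offset, length);
        return acceptsPrefix(utf8Automaton, utf8Labels(s));
    }
}
```

With labels built from "caf\u00e9", the term "caf\u00e9s" is rejected by {{runOnChars}} (the single char 0xE9 is compared against the UTF8 byte 0xC3) but accepted by {{runOnUtf8}}. For pure ASCII terms both paths agree, which matches the observation that ASCII terms are safe.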