[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-7719: --------------------------------- Attachment: LUCENE_7719.patch Here is a patch for consideration. First of all, I realized that this bug is very minor because only _some_ {{AutomatonQuery.getAutomaton}} are binary (I thought they all were); others are char based. The ones that are binary in Lucene are {{PrefixQuery}} and {{TermRangeQuery}}, both of which are special cased in {{MultiTermHighlighting}}. So practically speaking I think the only users to see this bug would be anyone building a custom AutomatonQuery. Nonetheless this should be cleaned up and actually tested. The patch: * Adds a fairly thorough test with lots of randomized unicode, testing PrefixQuery, TermRangeQuery, WildcardQuery, and FuzzyQuery * removes the special casing of AutomatonQuery subclasses in MultiTermHighlighting. AQ is handled generically now. * added {{AutomatonQuery.isAutomatonBinary()}} with a new field to match. * if the AQ.getAutomaton is _not_ binary, we follow a simple/obvious path * if the automaton is binary, then I produce a CharRunAutomaton implementing run() to navigate the Automaton char by char. It makes a BytesRunAutomaton on the automaton. I inlined the one and two byte UTF char logic from {{UnicodeUtil.UTF16toUTF8}}. If the char needs more bytes, then I call out to UTF16toUTF8 and work off the generated byte array for the remaining chars. I think this is more maintainable, albeit slower, than reproducing the logic. There is a TODO on making the BytesRunAutomaton on the binary Automaton. This is redundant since the AutomatonQuery already has a compiledAutomaton which in turn already has a BytesRunAutomaton. The construction cost seems less than trivial as it determinizes the automaton and does other work. Is this a big deal? It seems kinda sorta; it depends. > UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars > ----------------------------------------------------------------------------- > > Key: LUCENE-7719 > URL: https://issues.apache.org/jira/browse/LUCENE-7719 > Project: Lucene - Core > Issue Type: Bug > Components: modules/highlighter > Reporter: David Smiley > Assignee: David Smiley > Attachments: LUCENE_7719.patch > > > In MultiTermHighlighting, a CharacterRunAutomaton is being created that takes > the result of AutomatonQuery.getAutomaton that in turn is byte oriented, not > character oriented. For ASCII terms, this is safe but it's not for > multi-byte characters. This is most likely going to rear it's head with a > WildcardQuery, but due to special casing in MultiTermHighlighting, > PrefixQuery isn't affected. Nonetheless it'd be nice to get a general fix in > so that MultiTermHighlighting can remove special cases for PrefixQuery and > TermRangeQuery (both subclass AutomatonQuery). > AFAICT, this bug was likely in the PostingsHighlighter since inception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org