[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957326#comment-15957326 ]
David Smiley edited comment on LUCENE-7719 at 4/29/17 12:19 PM: ---------------------------------------------------------------- Here is a patch for consideration. First of all, I realized that this bug is very minor because only _some_ {{AutomatonQuery.getAutomaton}} are binary (I thought they all were); others are char based. The ones that are binary in Lucene are {{PrefixQuery}} and {{TermRangeQuery}}, both of which are special cased in {{MultiTermHighlighting}}. So practically speaking I think the only users to see this bug would be anyone building a custom AutomatonQuery. Nonetheless this should be cleaned up and actually tested. The patch: * Adds a fairly thorough test with lots of randomized unicode, testing PrefixQuery, TermRangeQuery, WildcardQuery, and FuzzyQuery * removes the special casing of AutomatonQuery subclasses in MultiTermHighlighting. AQ is handled generically now. * added {{AutomatonQuery.isAutomatonBinary()}} with a new field to match. * if the AQ.getAutomaton is _not_ binary, we follow a simple/obvious path * if the automaton is binary, then I produce a CharRunAutomaton implementing run() to navigate the Automaton char by char. It makes a BytesRunAutomaton on the automaton. I inlined the one and two byte UTF char logic from {{UnicodeUtil.UTF16toUTF8}}. If the char needs more bytes, then I call out to UTF16toUTF8 and work off the generated byte array for the remaining chars. I think this is more maintainable, albeit slower, than reproducing the logic. I added a perf TODO in MultiTermHighlighting.java to have CompiledAutomaton expose the ByteRunAutomaton so that we don't need to rebuild it here. The construction cost seems less than trivial as it determinizes the automaton and does other work. Is this a big deal? It seems kinda sorta; it depends. was (Author: dsmiley): Here is a patch for consideration. First of all, I realized that this bug is very minor because only _some_ {{AutomatonQuery.getAutomaton}} are binary (I thought they all were); others are char based. The ones that are binary in Lucene are {{PrefixQuery}} and {{TermRangeQuery}}, both of which are special cased in {{MultiTermHighlighting}}. So practically speaking I think the only users to see this bug would be anyone building a custom AutomatonQuery. Nonetheless this should be cleaned up and actually tested. The patch: * Adds a fairly thorough test with lots of randomized unicode, testing PrefixQuery, TermRangeQuery, WildcardQuery, and FuzzyQuery * removes the special casing of AutomatonQuery subclasses in MultiTermHighlighting. AQ is handled generically now. * added {{AutomatonQuery.isAutomatonBinary()}} with a new field to match. * if the AQ.getAutomaton is _not_ binary, we follow a simple/obvious path * if the automaton is binary, then I produce a CharRunAutomaton implementing run() to navigate the Automaton char by char. It makes a BytesRunAutomaton on the automaton. I inlined the one and two byte UTF char logic from {{UnicodeUtil.UTF16toUTF8}}. If the char needs more bytes, then I call out to UTF16toUTF8 and work off the generated byte array for the remaining chars. I think this is more maintainable, albeit slower, than reproducing the logic. There is a TODO on making the BytesRunAutomaton on the binary Automaton. This is redundant since the AutomatonQuery already has a compiledAutomaton which in turn already has a BytesRunAutomaton. The construction cost seems less than trivial as it determinizes the automaton and does other work. Is this a big deal? It seems kinda sorta; it depends. > UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars > ----------------------------------------------------------------------------- > > Key: LUCENE-7719 > URL: https://issues.apache.org/jira/browse/LUCENE-7719 > Project: Lucene - Core > Issue Type: Bug > Components: modules/highlighter > Reporter: David Smiley > Assignee: David Smiley > Priority: Minor > Attachments: LUCENE_7719.patch > > > In MultiTermHighlighting, a CharacterRunAutomaton is being created that takes > the result of AutomatonQuery.getAutomaton that in turn is byte oriented, not > character oriented. For ASCII terms, this is safe but it's not for > multi-byte characters. This is most likely going to rear it's head with a > WildcardQuery, but due to special casing in MultiTermHighlighting, > PrefixQuery isn't affected. Nonetheless it'd be nice to get a general fix in > so that MultiTermHighlighting can remove special cases for PrefixQuery and > TermRangeQuery (both subclass AutomatonQuery). > AFAICT, this bug was likely in the PostingsHighlighter since inception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org