[ 
https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957326#comment-15957326
 ] 

David Smiley edited comment on LUCENE-7719 at 4/29/17 12:19 PM:
----------------------------------------------------------------

Here is a patch for consideration.

First of all, I realized that this bug is very minor because only _some_ 
{{AutomatonQuery.getAutomaton}} are binary (I thought they all were); others 
are char based.  The ones that are binary in Lucene are {{PrefixQuery}} and 
{{TermRangeQuery}}, both of which are special cased in 
{{MultiTermHighlighting}}.  So practically speaking I think the only users to 
see this bug would be anyone building a custom AutomatonQuery.  Nonetheless 
this should be cleaned up and actually tested.

The patch:
* Adds a fairly thorough test with lots of randomized unicode, testing 
PrefixQuery, TermRangeQuery, WildcardQuery, and FuzzyQuery
* removes the special casing of AutomatonQuery subclasses in 
MultiTermHighlighting.  AQ is handled generically now.
* added {{AutomatonQuery.isAutomatonBinary()}} with a new field to match.
* if the AQ.getAutomaton is _not_ binary, we follow a simple/obvious path
* if the automaton is binary, then I produce a CharRunAutomaton implementing 
run() to navigate the Automaton char by char.  It makes a BytesRunAutomaton on 
the automaton.  I inlined the one and two byte UTF char logic from 
{{UnicodeUtil.UTF16toUTF8}}.  If the char needs more bytes, then I call out to 
UTF16toUTF8 and work off the generated byte array for the remaining chars.  I 
think this is more maintainable, albeit slower, than reproducing the logic.

I added a perf TODO in MultiTermHighlighting.java to have CompiledAutomaton 
expose the ByteRunAutomaton so that we don't need to rebuild it here.  The 
construction cost seems less than trivial as it determinizes the automaton and 
does other work.  Is this a big deal?  It seems kinda sorta; it depends.  


was (Author: dsmiley):
Here is a patch for consideration.

First of all, I realized that this bug is very minor because only _some_ 
{{AutomatonQuery.getAutomaton}} are binary (I thought they all were); others 
are char based.  The ones that are binary in Lucene are {{PrefixQuery}} and 
{{TermRangeQuery}}, both of which are special cased in 
{{MultiTermHighlighting}}.  So practically speaking I think the only users to 
see this bug would be anyone building a custom AutomatonQuery.  Nonetheless 
this should be cleaned up and actually tested.

The patch:
* Adds a fairly thorough test with lots of randomized unicode, testing 
PrefixQuery, TermRangeQuery, WildcardQuery, and FuzzyQuery
* removes the special casing of AutomatonQuery subclasses in 
MultiTermHighlighting.  AQ is handled generically now.
* added {{AutomatonQuery.isAutomatonBinary()}} with a new field to match.
* if the AQ.getAutomaton is _not_ binary, we follow a simple/obvious path
* if the automaton is binary, then I produce a CharRunAutomaton implementing 
run() to navigate the Automaton char by char.  It makes a BytesRunAutomaton on 
the automaton.  I inlined the one and two byte UTF char logic from 
{{UnicodeUtil.UTF16toUTF8}}.  If the char needs more bytes, then I call out to 
UTF16toUTF8 and work off the generated byte array for the remaining chars.  I 
think this is more maintainable, albeit slower, than reproducing the logic.

There is a TODO on making the BytesRunAutomaton on the binary Automaton.  This 
is redundant since the AutomatonQuery already has a compiledAutomaton which in 
turn already has a BytesRunAutomaton.  The construction cost seems less than 
trivial as it determinizes the automaton and does other work.  Is this a big 
deal?  It seems kinda sorta; it depends.  

> UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-7719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7719
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>         Attachments: LUCENE_7719.patch
>
>
> In MultiTermHighlighting, a CharacterRunAutomaton is being created that takes 
> the result of AutomatonQuery.getAutomaton that in turn is byte oriented, not 
> character oriented.  For ASCII terms, this is safe but it's not for 
> multi-byte characters.  This is most likely going to rear it's head with a 
> WildcardQuery, but due to special casing in MultiTermHighlighting, 
> PrefixQuery isn't affected.  Nonetheless it'd be nice to get a general fix in 
> so that MultiTermHighlighting can remove special cases for PrefixQuery and 
> TermRangeQuery (both subclass AutomatonQuery).
> AFAICT, this bug was likely in the PostingsHighlighter since inception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to