[ 
https://issues.apache.org/jira/browse/LUCENE-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888807#comment-15888807
 ] 

David Smiley commented on LUCENE-7717:
--------------------------------------

Here's my take on it:  The UnifiedHighlighter (and PostingsHighlighter from 
which it derives) processes the MultiTermQueries (e.g. wildcards) in the query 
and creates multiple {{CharacterRunAutomaton}} intended to match the same 
things.  {{CharacterRunAutomaton}} takes an {{Automaton}} as input, and when it 
does its processing, it matches the character code points (integers from 0 to 
0x10FFFF) against the integers in the Automaton.  However, this strategy 
assumes that the Automaton was constructed based on character code points.  But 
{{AutomatonQuery.getAutomaton}} is intended to match byte by byte (integers 0 
to 255).  {{PrefixQuery.toAutomaton}} will get 2 bytes for the "я" in 
BytesRef form, and add 2 states.  This does not line up with the assumptions of 
CharacterRunAutomaton.
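
To make the mismatch concrete, here is a small JDK-only sketch (no Lucene classes involved; the class and method names are made up for illustration).  It shows that "я" is a single code point but two UTF-8 bytes, so an automaton built from the bytes has two transitions where a code-point matcher would step only once:

```java
import java.nio.charset.StandardCharsets;

// Illustration only: plain JDK, hypothetical class name, not Lucene code.
public class CodePointVsUtf8 {

    /** Number of Unicode code points in s (what a code-point matcher steps over). */
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes in the UTF-8 encoding of s (what a byte-based automaton steps over). */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u044F"; // "я", U+044F
        System.out.println("code points: " + codePointCount(s));
        System.out.println("UTF-8 bytes: " + utf8Length(s));
        // Print each UTF-8 byte as an unsigned value; none equals the code point 0x44F.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("0x%02X%n", b & 0xFF);
        }
    }
}
```

Feeding the single integer 0x44F into transitions labeled with the byte values can never reach an accept state, which is consistent with the missing highlight in the report.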

A short-term, immediate "fix" is simply to put AutomatonQuery last in the 
if-else list as Dmitry indicated.  As such, PrefixQuery will work again.  This 
was broken by LUCENE-6367 (Lucene 5.1).  TermRangeQuery, which also now extends 
AutomatonQuery, will likewise work -- broken by LUCENE-5879 (Lucene 5.2).  
Again, back when MultiTermHighlighting was first written, neither of those 
queries extended AutomatonQuery.  _But there will be bugs for other types of 
AutomatonQuery (namely WildcardQuery and RegexpQuery) that have yet to be 
reported._
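
Why the ordering matters can be sketched with stand-in classes.  The names below are hypothetical and are not the actual MultiTermHighlighting code; the sketch only assumes an if-else chain of instanceof checks in which one query type extends another:

```java
// Illustration only: hypothetical stand-ins, not Lucene's query classes.
public class DispatchOrder {

    public static class AutomatonLikeQuery {}

    /** Plays the role of a query type that was later changed to extend the general one. */
    public static class PrefixLikeQuery extends AutomatonLikeQuery {}

    /** General check first: the subclass never reaches its own branch. */
    public static String generalFirst(Object q) {
        if (q instanceof AutomatonLikeQuery) {
            return "automaton";
        } else if (q instanceof PrefixLikeQuery) {
            return "prefix"; // never reached for a PrefixLikeQuery
        }
        return "other";
    }

    /** General check last: the subclass gets its dedicated handling again. */
    public static String specificFirst(Object q) {
        if (q instanceof PrefixLikeQuery) {
            return "prefix";
        } else if (q instanceof AutomatonLikeQuery) {
            return "automaton";
        }
        return "other";
    }

    public static void main(String[] args) {
        Object q = new PrefixLikeQuery();
        System.out.println(generalFirst(q));
        System.out.println(specificFirst(q));
    }
}
```

Query types that have no dedicated branch still fall through to the general one, which is why this reordering alone does not help them.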

[~rcmuir] or [~mikemccand] I wonder if you have any thoughts on how to fix 
this.  An idea I have is to _not_ use a CharacterRunAutomaton in the 
UnifiedHighlighter; use a ByteRunAutomaton instead.  Then, add 
{{ByteRunAutomaton.run(char[] ...etc)}} that converts each character to the 
equivalent UTF8 bytes to match.  Even with that, I wonder if this points to 
areas to improve the automata API so that people don't bump into this trap in 
the future.  For example, maybe have the Automaton self-report whether it's byte 
oriented, Unicode code point oriented, or something custom.  Then, RunAutomaton 
could throw an exception if there is a mismatch.  However, that would be a 
runtime error; maybe the Automaton could be typed.
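
A rough sketch of the "convert characters to UTF-8 bytes, then match bytes" idea follows.  It is JDK-only, and {{BytePrefixMatcher}} is a hypothetical stand-in for a byte-oriented run automaton that accepts any input starting with a given prefix.  It is not ByteRunAutomaton, and no such {{run(char[] ...)}} overload is claimed to exist in Lucene today:

```java
import java.nio.charset.StandardCharsets;

// Illustration only: a hypothetical byte-oriented prefix matcher, not Lucene code.
public class BytePrefixMatcher {

    private final byte[] prefix; // UTF-8 bytes the input must start with

    public BytePrefixMatcher(String prefix) {
        this.prefix = prefix.getBytes(StandardCharsets.UTF_8);
    }

    /** Byte-oriented match: compares the input to the prefix byte by byte. */
    public boolean run(byte[] s, int offset, int length) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (s[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Character-oriented entry point: encode to UTF-8 first, then match the bytes. */
    public boolean run(char[] s, int offset, int length) {
        byte[] utf8 = new String(s, offset, length).getBytes(StandardCharsets.UTF_8);
        return run(utf8, 0, utf8.length);
    }

    public static void main(String[] args) {
        BytePrefixMatcher m = new BytePrefixMatcher("\u044F"); // prefix "я"
        char[] text = "\u044F\u0431\u043B\u043E\u043A\u043E".toCharArray(); // "яблоко"
        System.out.println(m.run(text, 0, text.length));
    }
}
```

This version allocates a new byte array per call, so a real implementation would presumably encode incrementally while stepping through the states.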

Anyway, what I'd like to do is a short-term fix that addresses many common 
cases and the title of this issue.  And then do a more thorough fix in a 
follow-on issue.  [~ichattopadhyaya] do you think this could go into 6.4.2 or 
are you only looking for "critical" issues?  It's debatable what's critical and 
not.  This bug has been around since 5.1 so perhaps it isn't.

(a patch will follow shortly)


> UnifiedHighlighter don't work with russian PrefixQuery
> ------------------------------------------------------
>
>                 Key: LUCENE-7717
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7717
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 6.3, 6.4.1
>            Reporter: Dmitry Malinin
>            Assignee: David Smiley
>         Attachments: LUCENE-7717.patch
>
>
> UnifiedHighlighter highlighter = new UnifiedHighlighter(null, new 
> StandardAnalyzer());
> Query query = new PrefixQuery(new Term("title", "я"));
> String testData = "я";
> Object snippet = highlighter.highlightWithoutSearcher(fieldName, query, 
> testData, 1);
> System.out.printf("testData=[%s] Query=%s snippet=[%s]\n", testData, query, 
> snippet==null?null:snippet.toString());


