[
https://issues.apache.org/jira/browse/LUCENE-7717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888807#comment-15888807
]
David Smiley commented on LUCENE-7717:
--------------------------------------
Here's my take on it: The UnifiedHighlighter (and the PostingsHighlighter from
which it derives) processes the MultiTermQueries (e.g. wildcards) in the query
and creates multiple {{CharacterRunAutomaton}} instances intended to match the
same things. {{CharacterRunAutomaton}} takes an {{Automaton}} as input, and
when it does its processing, it matches the character code points (integers
from 0 to 0x10FFFF) against the integers in the Automaton. However, this
strategy assumes that the Automaton was constructed based on character code
points. But {{AutomatonQuery.getAutomaton}} is intended to match byte by byte
(integers 0 to 255). {{PrefixQuery.toAutomaton}} will get 2 bytes for the "я"
in BytesRef form, and add 2 states. This does not line up with the
assumptions of CharacterRunAutomaton.
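To make the size mismatch concrete, here is a minimal standalone sketch (plain
JDK, no Lucene classes; the class and method names are made up for
illustration) showing that "я" is a single code point but two UTF-8 bytes:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene.
public class CodePointsVsBytes {

    /** Number of Unicode code points in the string. */
    public static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes in the UTF-8 encoding of the string. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String prefix = "\u044F"; // "я", CYRILLIC SMALL LETTER YA
        System.out.println("code points: " + codePointCount(prefix)); // 1
        System.out.println("UTF-8 bytes: " + utf8Length(prefix));     // 2
        for (byte b : prefix.getBytes(StandardCharsets.UTF_8)) {
            // Prints 0xD1 then 0x8F: the two labels a byte-oriented
            // automaton would carry for this one character.
            System.out.printf("  0x%02X%n", b & 0xFF);
        }
    }
}
```

An automaton built from the two byte labels can never be satisfied by a
matcher that feeds it the single integer 0x44F.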
An immediate short-term "fix" is simply to put the AutomatonQuery check last
in the if-else list, as Dmitry indicated. With that, PrefixQuery will work
again; it was broken by LUCENE-6367 (Lucene 5.1). TermRangeQuery, which also
now extends AutomatonQuery, will likewise work -- broken by LUCENE-5879
(Lucene 5.2).
Again, back when MultiTermHighlighting was first written, neither of those
queries extended AutomatonQuery. _But there will be bugs for other types of
AutomatonQuery (namely WildcardQuery and RegexpQuery) that have yet to be
reported._
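The ordering problem can be sketched in isolation. The classes below are
hypothetical stand-ins, not the real Lucene query classes or the actual
MultiTermHighlighting code; they only illustrate why a superclass check
placed first swallows its subclasses:

```java
// Hypothetical stand-ins for illustration only; not Lucene classes.
public class DispatchOrderSketch {

    public static class AutomatonLikeQuery {}

    /** Plays the role of a query that later started extending the general type. */
    public static class PrefixLikeQuery extends AutomatonLikeQuery {}

    /** General type tested first: the prefix branch can never be reached. */
    public static String dispatchGeneralFirst(Object query) {
        if (query instanceof AutomatonLikeQuery) {
            return "generic automaton branch";
        } else if (query instanceof PrefixLikeQuery) {
            return "prefix branch";
        }
        return "unsupported";
    }

    /** General type tested last: specific subclasses get their own handling. */
    public static String dispatchSpecificFirst(Object query) {
        if (query instanceof PrefixLikeQuery) {
            return "prefix branch";
        } else if (query instanceof AutomatonLikeQuery) {
            return "generic automaton branch";
        }
        return "unsupported";
    }

    public static void main(String[] args) {
        Object q = new PrefixLikeQuery();
        System.out.println(dispatchGeneralFirst(q));  // generic automaton branch
        System.out.println(dispatchSpecificFirst(q)); // prefix branch
    }
}
```

Reordering only helps subclasses that have their own branch; anything that
falls through to the generic branch still hits the byte vs. code point
mismatch.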
[~rcmuir] or [~mikemccand], I wonder if you have any thoughts on how to fix
this. An idea I have is to _not_ use a CharacterRunAutomaton in the
UnifiedHighlighter; use a ByteRunAutomaton instead. Then, add
{{ByteRunAutomaton.run(char[] ...etc)}} that converts each character to the
equivalent UTF8 bytes to match. Even with that, I wonder if this points to
areas to improve the automata API so that people don't bump into this trap in
the future. For example, maybe have each Automaton self-report whether it is
byte oriented, Unicode code point oriented, or something custom. Then,
RunAutomaton could throw an exception if there is a mismatch. However, that
would be a runtime error; maybe the Automaton could be typed instead.
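As a rough sketch of the "convert the characters to UTF8 bytes, then match
byte by byte" idea: the class below is hypothetical and uses a plain byte
array in place of a real automaton, so it is not ByteRunAutomaton and says
nothing about how the final patch will look. It only shows the conversion
step and why matching code points against byte labels fails:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; a byte[] stands in for a byte-oriented automaton.
public class BytePrefixMatcherSketch {

    private final byte[] prefixBytes; // one transition label (0..255) per state

    public BytePrefixMatcherSketch(String prefix) {
        this.prefixBytes = prefix.getBytes(StandardCharsets.UTF_8);
    }

    /** Byte-oriented run: accepts input that starts with the prefix bytes. */
    public boolean run(byte[] input, int offset, int length) {
        if (length < prefixBytes.length) {
            return false;
        }
        for (int i = 0; i < prefixBytes.length; i++) {
            if (input[offset + i] != prefixBytes[i]) {
                return false;
            }
        }
        return true;
    }

    /** Char-oriented entry point: encode to UTF-8 first, then reuse the byte run. */
    public boolean run(char[] chars, int offset, int length) {
        byte[] utf8 = new String(chars, offset, length).getBytes(StandardCharsets.UTF_8);
        return run(utf8, 0, utf8.length);
    }

    /** The mismatch: code points compared directly against byte labels. */
    public boolean runCodePointsAgainstByteLabels(String s) {
        int[] codePoints = s.codePoints().toArray();
        if (codePoints.length < prefixBytes.length) {
            return false;
        }
        for (int i = 0; i < prefixBytes.length; i++) {
            if (codePoints[i] != (prefixBytes[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BytePrefixMatcherSketch matcher = new BytePrefixMatcherSketch("\u044F"); // "я"
        String word = "\u044F\u0431\u043B\u043E\u043A\u043E"; // "яблоко"
        char[] text = word.toCharArray();
        System.out.println(matcher.run(text, 0, text.length));            // true
        System.out.println(matcher.runCodePointsAgainstByteLabels(word)); // false
    }
}
```

For pure ASCII input the two approaches agree, because each ASCII code point
equals its single UTF-8 byte, which fits with the bug surfacing on a Russian
prefix.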
Anyway, what I'd like to do is make a short-term fix that addresses many
common cases and the title of this issue, and then do a more thorough fix in
a follow-on issue. [~ichattopadhyaya], do you think this could go into 6.4.2,
or are you only looking for "critical" issues? It's debatable what's critical
and what's not. This bug has been around since 5.1, so perhaps it isn't.
(a patch will follow shortly)
> UnifiedHighlighter don't work with russian PrefixQuery
> ------------------------------------------------------
>
> Key: LUCENE-7717
> URL: https://issues.apache.org/jira/browse/LUCENE-7717
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/highlighter
> Affects Versions: 6.3, 6.4.1
> Reporter: Dmitry Malinin
> Assignee: David Smiley
> Attachments: LUCENE-7717.patch
>
>
> UnifiedHighlighter highlighter =
>     new UnifiedHighlighter(null, new StandardAnalyzer());
> Query query = new PrefixQuery(new Term("title", "я"));
> String testData = "я";
> Object snippet =
>     highlighter.highlightWithoutSearcher(fieldName, query, testData, 1);
> System.out.printf("testData=[%s] Query=%s snippet=[%s]\n",
>     testData, query, snippet==null?null:snippet.toString());