[
https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957326#comment-15957326
]
David Smiley edited comment on LUCENE-7719 at 4/29/17 12:19 PM:
----------------------------------------------------------------
Here is a patch for consideration.
First of all, I realized that this bug is very minor because only _some_
{{AutomatonQuery.getAutomaton}} are binary (I thought they all were); others
are char based. The ones that are binary in Lucene are {{PrefixQuery}} and
{{TermRangeQuery}}, both of which are special cased in
{{MultiTermHighlighting}}. So practically speaking I think the only users to
see this bug would be anyone building a custom AutomatonQuery. Nonetheless
this should be cleaned up and actually tested.
The patch:
* Adds a fairly thorough test with lots of randomized unicode, testing
PrefixQuery, TermRangeQuery, WildcardQuery, and FuzzyQuery
* removes the special casing of AutomatonQuery subclasses in
MultiTermHighlighting. AQ is handled generically now.
* added {{AutomatonQuery.isAutomatonBinary()}} with a new field to match.
* if the AQ.getAutomaton is _not_ binary, we follow a simple/obvious path
* if the automaton is binary, then I produce a CharRunAutomaton implementing
run() to navigate the Automaton char by char. It makes a BytesRunAutomaton on
the automaton. I inlined the one and two byte UTF char logic from
{{UnicodeUtil.UTF16toUTF8}}. If the char needs more bytes, then I call out to
UTF16toUTF8 and work off the generated byte array for the remaining chars. I
think this is more maintainable, albeit slower, than reproducing the logic.
I added a perf TODO in MultiTermHighlighting.java to have CompiledAutomaton
expose the ByteRunAutomaton so that we don't need to rebuild it here. The
construction cost seems less than trivial as it determinizes the automaton and
does other work. Is this a big deal? It seems kinda sorta; it depends.
was (Author: dsmiley):
Here is a patch for consideration.
First of all, I realized that this bug is very minor because only _some_
{{AutomatonQuery.getAutomaton}} are binary (I thought they all were); others
are char based. The ones that are binary in Lucene are {{PrefixQuery}} and
{{TermRangeQuery}}, both of which are special cased in
{{MultiTermHighlighting}}. So practically speaking I think the only users to
see this bug would be anyone building a custom AutomatonQuery. Nonetheless
this should be cleaned up and actually tested.
The patch:
* Adds a fairly thorough test with lots of randomized unicode, testing
PrefixQuery, TermRangeQuery, WildcardQuery, and FuzzyQuery
* removes the special casing of AutomatonQuery subclasses in
MultiTermHighlighting. AQ is handled generically now.
* added {{AutomatonQuery.isAutomatonBinary()}} with a new field to match.
* if the AQ.getAutomaton is _not_ binary, we follow a simple/obvious path
* if the automaton is binary, then I produce a CharRunAutomaton implementing
run() to navigate the Automaton char by char. It makes a BytesRunAutomaton on
the automaton. I inlined the one and two byte UTF char logic from
{{UnicodeUtil.UTF16toUTF8}}. If the char needs more bytes, then I call out to
UTF16toUTF8 and work off the generated byte array for the remaining chars. I
think this is more maintainable, albeit slower, than reproducing the logic.
There is a TODO on making the BytesRunAutomaton on the binary Automaton. This
is redundant since the AutomatonQuery already has a compiledAutomaton which in
turn already has a BytesRunAutomaton. The construction cost seems less than
trivial as it determinizes the automaton and does other work. Is this a big
deal? It seems kinda sorta; it depends.
> UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
> -----------------------------------------------------------------------------
>
> Key: LUCENE-7719
> URL: https://issues.apache.org/jira/browse/LUCENE-7719
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Minor
> Attachments: LUCENE_7719.patch
>
>
> In MultiTermHighlighting, a CharacterRunAutomaton is being created that takes
> the result of AutomatonQuery.getAutomaton that in turn is byte oriented, not
> character oriented. For ASCII terms, this is safe but it's not for
> multi-byte characters. This is most likely going to rear it's head with a
> WildcardQuery, but due to special casing in MultiTermHighlighting,
> PrefixQuery isn't affected. Nonetheless it'd be nice to get a general fix in
> so that MultiTermHighlighting can remove special cases for PrefixQuery and
> TermRangeQuery (both subclass AutomatonQuery).
> AFAICT, this bug was likely in the PostingsHighlighter since inception.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]