[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052653#comment-16052653 ]

ASF subversion and git services commented on LUCENE-7719:
---------------------------------------------------------

Commit 42fdb549270b41ae164b90ea7bc001ceb7848b6d in lucene-solr's branch refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=42fdb54 ]

LUCENE-7719: tests: Eliminate needless SuppressSysoutChecks and address lint warning

> UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
> -----------------------------------------------------------------------------
>
>          Key: LUCENE-7719
>          URL: https://issues.apache.org/jira/browse/LUCENE-7719
>      Project: Lucene - Core
>   Issue Type: Bug
>   Components: modules/highlighter
>     Reporter: David Smiley
>     Assignee: David Smiley
>     Priority: Minor
>  Attachments: LUCENE_7719.patch
>
>
> In MultiTermHighlighting, a CharacterRunAutomaton is being created that takes
> the result of AutomatonQuery.getAutomaton, which in turn is byte oriented, not
> character oriented. For ASCII terms this is safe, but it is not for
> multi-byte characters. This is most likely going to rear its head with a
> WildcardQuery, but due to special casing in MultiTermHighlighting,
> PrefixQuery isn't affected. Nonetheless it'd be nice to get a general fix in
> so that MultiTermHighlighting can remove special cases for PrefixQuery and
> TermRangeQuery (both subclass AutomatonQuery).
> AFAICT, this bug was likely in the PostingsHighlighter since inception.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
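The mismatch described in the issue can be illustrated without Lucene. The following is only a sketch under stated assumptions: `ByteVsCharSketch` and its methods are hypothetical names, and `acceptsExactly` is a toy stand-in for a byte-oriented automaton that accepts a single term. It is not the code in MultiTermHighlighting. It shows why feeding UTF-16 chars to something that expects UTF-8 bytes happens to work for ASCII and fails for a multi-byte character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: none of these names come from Lucene.
final class ByteVsCharSketch {

    /** UTF-8 bytes of s as unsigned ints: the alphabet a byte-oriented automaton expects. */
    static int[] utf8Symbols(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] symbols = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            symbols[i] = bytes[i] & 0xFF;
        }
        return symbols;
    }

    /** UTF-16 code units of s as ints: what a char-oriented runner would feed in. */
    static int[] utf16Symbols(String s) {
        int[] symbols = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            symbols[i] = s.charAt(i);
        }
        return symbols;
    }

    /** Hypothetical byte-oriented acceptor: the input must equal the term's UTF-8 bytes. */
    static boolean acceptsExactly(String term, int[] input) {
        return Arrays.equals(utf8Symbols(term), input);
    }

    public static void main(String[] args) {
        String ascii = "cafe";
        String accented = "caf\u00e9"; // the last character needs two bytes in UTF-8

        // ASCII: every char equals its single UTF-8 byte, so the mismatch is invisible.
        System.out.println(acceptsExactly(ascii, utf16Symbols(ascii)));       // true
        // Multi-byte: four chars are compared against five bytes, so the term is rejected.
        System.out.println(acceptsExactly(accented, utf16Symbols(accented))); // false
        // Feeding UTF-8 bytes instead restores the match.
        System.out.println(acceptsExactly(accented, utf8Symbols(accented)));  // true
    }
}
```

The same reasoning would apply to any transition labels above 0x7F, which is why purely ASCII test data would not expose the problem.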
[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052652#comment-16052652 ]

ASF subversion and git services commented on LUCENE-7719:
---------------------------------------------------------

Commit d0b9d3459fd097dba677cdda170632f6fca5e042 in lucene-solr's branch refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d0b9d34 ]

LUCENE-7719: Generalize UnifiedHighlighter's support for AutomatonQuery
[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048646#comment-16048646 ]

David Smiley commented on LUCENE-7719:
--------------------------------------

Ping [~mikemccand] since you've been involved with AutomatonQuery and automata in general. If you're too busy, then I think the change to AutomatonQuery is innocent enough that I'm comfortable committing the patch as-is. I'm not sure whether this will make 7.0, but I don't think it matters: there is no back-compat or API issue.
[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989866#comment-15989866 ]

David Smiley commented on LUCENE-7719:
--------------------------------------

[~mikemccand] What do you think of the patch? In particular, I wonder what you think of:
* AutomatonQuery.isAutomatonBinary(). This is a very simple/innocent addition. It's a shame Automaton.isBinary (or something similar) doesn't exist.
* See my TODO last paragraph above. Also note that even if that were done, the CompiledAutomaton isn't exposed by AutomatonQuery anyway, so we'd need an accessor. Perhaps alternatively AutomatonQuery might expose both a CharRunAutomaton and ByteRunAutomaton (i.e. move some of the code in this patch to there)? If that wouldn't potentially be useful to other users then never mind.
* The approach to convert chars to bytes at each step
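The patch itself is not reproduced in this thread, so the following is only one possible reading of "convert chars to bytes at each step", written as a self-contained sketch. `SteppingSketch`, `LinearDfa`, `forTerm`, `encodeUtf8` and `runChars` are all hypothetical names, and the toy DFA accepts just one term. The idea shown is that each code point is encoded to UTF-8 as it is read and the resulting bytes are fed one at a time to a byte-level `step` function, so no intermediate byte array for the whole term is needed. The sketch assumes well-formed UTF-16 input (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: LinearDfa and its step method are hypothetical, not Lucene API.
final class SteppingSketch {

    /** Toy byte-level DFA that accepts exactly one UTF-8 byte sequence. */
    static final class LinearDfa {
        private final int[] expected;

        LinearDfa(int[] expectedBytes) {
            this.expected = expectedBytes.clone();
        }

        /** Returns the next state, or -1 when there is no transition for this label. */
        int step(int state, int label) {
            boolean ok = state >= 0 && state < expected.length && expected[state] == label;
            return ok ? state + 1 : -1;
        }

        boolean isAccept(int state) {
            return state == expected.length;
        }
    }

    /** Builds the toy DFA from a term's UTF-8 bytes. */
    static LinearDfa forTerm(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xFF;
        }
        return new LinearDfa(labels);
    }

    /** Encodes one code point as UTF-8 into out (length >= 4); returns the byte count. */
    static int encodeUtf8(int cp, int[] out) {
        if (cp < 0x80) {
            out[0] = cp;
            return 1;
        }
        if (cp < 0x800) {
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        }
        if (cp < 0x10000) {
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        }
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }

    /** Runs the byte-level DFA over chars, converting to UTF-8 one code point at a time. */
    static boolean runChars(LinearDfa dfa, char[] chars, int offset, int length) {
        int[] buf = new int[4];
        int state = 0;
        int end = offset + length;
        for (int i = offset; i < end; ) {
            int cp = Character.codePointAt(chars, i, end); // joins surrogate pairs
            i += Character.charCount(cp);
            int n = encodeUtf8(cp, buf);
            for (int j = 0; j < n; j++) {
                state = dfa.step(state, buf[j]);
                if (state == -1) {
                    return false; // dead state: stop early
                }
            }
        }
        return dfa.isAccept(state);
    }
}
```

A trade-off of stepping this way, compared with encoding the whole term first, is that it avoids an allocation per term at the cost of doing the UTF-8 encoding by hand.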
[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
[ https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890090#comment-15890090 ]

Michael McCandless commented on LUCENE-7719:
--------------------------------------------

Wow, this is a great catch [~dmitrymalinin]! Thank you for opening the precursor issue.

{{AutomatonQuery.getAutomaton}} really must return a UTF8-oriented automaton because that matches how the terms are indexed into Lucene, and what the automaton will be intersected with, to run the query. We should fix the javadocs to say this.

And it is sort of annoying that these differences are not strongly typed, but the {{Automaton}} class is really agnostic to what ints you are putting onto its transitions.

But, yeah, for highlighting, we are operating in UTF16 space, and so I think we need some way to have the {{CharacterRunAutomaton}} interface on top of a UTF8 automaton?

Maybe we should abstract out a separate interface that {{MultiTermHighlighting}} would use? It seems it only uses the {{run}} method, to test if a given term is accepted? And then, as you suggested, we could easily convert the incoming char[] to UTF8 BytesRef and use the {{ByteRunAutomaton.run}} on that.
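The suggestion above (a narrow interface exposing only {{run}}, implemented by converting the incoming char[] to UTF-8 and delegating to a byte-level runner) could be sketched roughly as follows. This is an assumption-laden illustration, not the committed change: `TermMatcher`, `ByteMatcher`, `Utf8AdaptingMatcher` and `utf8Prefix` are hypothetical names, `ByteMatcher` merely stands in for something shaped like {{ByteRunAutomaton.run}}, and a plain byte[] from `String.getBytes` is used in place of a BytesRef so that the example needs no Lucene dependency.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical minimal interface: the single operation the highlighter is said to need. */
interface TermMatcher {
    boolean run(char[] chars, int offset, int length);
}

/** Hypothetical stand-in for a byte-oriented runner. */
interface ByteMatcher {
    boolean run(byte[] bytes, int offset, int length);
}

/** Adapts a byte-oriented matcher to char[] input by encoding the term as UTF-8 first. */
final class Utf8AdaptingMatcher implements TermMatcher {
    private final ByteMatcher delegate;

    Utf8AdaptingMatcher(ByteMatcher delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean run(char[] chars, int offset, int length) {
        // Convert the UTF-16 slice to UTF-8 so that it matches the delegate's alphabet.
        byte[] utf8 = new String(chars, offset, length).getBytes(StandardCharsets.UTF_8);
        return delegate.run(utf8, 0, utf8.length);
    }

    /** Toy byte-level matcher for the demo: accepts any input starting with the prefix's UTF-8 bytes. */
    static ByteMatcher utf8Prefix(String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        return (bytes, offset, length) -> {
            if (length < p.length) {
                return false;
            }
            for (int i = 0; i < p.length; i++) {
                if (bytes[offset + i] != p[i]) {
                    return false;
                }
            }
            return true;
        };
    }

    public static void main(String[] args) {
        TermMatcher matcher = new Utf8AdaptingMatcher(utf8Prefix("\u00e9"));
        char[] term = "\u00e9cole".toCharArray();
        System.out.println(matcher.run(term, 0, term.length)); // true
    }
}
```

Keeping the interface this small means the highlighter would not need to know whether the underlying automaton is char oriented or byte oriented; only the adapter does.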