[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars

2017-06-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052653#comment-16052653
 ] 

ASF subversion and git services commented on LUCENE-7719:
-

Commit 42fdb549270b41ae164b90ea7bc001ceb7848b6d in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=42fdb54 ]

LUCENE-7719: tests: Eliminate needless SuppressSysoutChecks and address lint 
warning


> UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars
> -
>
> Key: LUCENE-7719
> URL: https://issues.apache.org/jira/browse/LUCENE-7719
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/highlighter
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Minor
> Attachments: LUCENE_7719.patch
>
>
> In MultiTermHighlighting, a CharacterRunAutomaton is created from the result 
> of AutomatonQuery.getAutomaton, which is byte oriented, not character 
> oriented.  For ASCII terms this is safe, but it is not for multi-byte 
> characters.  The bug is most likely to rear its head with a WildcardQuery; 
> PrefixQuery isn't affected because of special casing in 
> MultiTermHighlighting.  Nonetheless it'd be nice to get a general fix in 
> so that MultiTermHighlighting can remove the special cases for PrefixQuery and 
> TermRangeQuery (both subclass AutomatonQuery).
> AFAICT, this bug has likely been in the PostingsHighlighter since inception.
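
To make the mismatch described above concrete, here is a minimal, self-contained sketch. It does not use Lucene at all: the class ByteVsCharDemo and its methods are hypothetical, and the "automaton" is just a fixed sequence of transition labels. The only point is that labels built from UTF-8 bytes line up with UTF-16 chars for ASCII terms and stop lining up as soon as a multi-byte character appears.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not Lucene code.
final class ByteVsCharDemo {
    // Toy stand-in for running an automaton: accept iff the input labels
    // equal the expected transition labels exactly.
    static boolean acceptsLabels(int[] expected, int[] input) {
        return Arrays.equals(expected, input);
    }

    // Labels as a byte-oriented automaton would see them: UTF-8 bytes, 0..255.
    static int[] utf8Labels(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xFF;
        }
        return labels;
    }

    // Labels as a char-oriented runner would feed them: UTF-16 code units.
    static int[] utf16Labels(String s) {
        int[] labels = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            labels[i] = s.charAt(i);
        }
        return labels;
    }

    public static void main(String[] args) {
        // "abc" is pure ASCII; "caf\u00e9" ends in a two-byte UTF-8 character.
        for (String term : new String[] {"abc", "caf\u00e9"}) {
            int[] automatonLabels = utf8Labels(term);
            boolean viaChars = acceptsLabels(automatonLabels, utf16Labels(term));
            boolean viaBytes = acceptsLabels(automatonLabels, utf8Labels(term));
            System.out.println(term.length() + " chars, " + automatonLabels.length
                + " bytes: viaChars=" + viaChars + " viaBytes=" + viaBytes);
        }
    }
}
```

For the ASCII term both ways of feeding the input agree; for the term with the accented character only the byte-wise input is accepted, which is the shape of the bug reported here.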



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars

2017-06-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052652#comment-16052652
 ] 

ASF subversion and git services commented on LUCENE-7719:
-

Commit d0b9d3459fd097dba677cdda170632f6fca5e042 in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d0b9d34 ]

LUCENE-7719: Generalize UnifiedHighlighter's support for AutomatonQuery





[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars

2017-06-13 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048646#comment-16048646
 ] 

David Smiley commented on LUCENE-7719:
--

Ping [~mikemccand] since you've been involved with AutomatonQuery and automata 
in general. If you're too busy then I think the change to AutomatonQuery is 
innocent enough so I'm comfortable committing the patch as-is.

I'm not sure if this will make 7.0 or not but I don't think it matters -- no 
back-compat issue / API issue.




[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars

2017-04-29 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989866#comment-15989866
 ] 

David Smiley commented on LUCENE-7719:
--

[~mikemccand] What do you think of the patch?  In particular, I wonder what you 
think of:
* AutomatonQuery.isAutomatonBinary().  This is a very simple/innocent addition.  
It's a shame Automaton.isBinary (or something similar) doesn't exist.
* See my TODO in the last paragraph above.  Also note that even if that were 
done, the CompiledAutomaton isn't exposed by AutomatonQuery anyway, so we'd need 
an accessor.  Alternatively, AutomatonQuery might expose both a 
CharacterRunAutomaton and a ByteRunAutomaton (i.e. move some of the code in this 
patch to there)?  If that wouldn't potentially be useful to other users, then 
never mind.
* The approach of converting chars to bytes at each step.
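
One possible reading of "convert chars to bytes at each step" can be sketched as follows. This is not the code in the attached patch. The ByteStepper interface, the literalThenAnything helper, and the class name are all hypothetical stand-ins for a byte-labelled automaton; the sketch only shows the idea of encoding each code point to UTF-8 and advancing the automaton one byte at a time.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the patch and not Lucene API.
final class StepwiseUtf8Sketch {
    /** Assumed minimal view of a byte-labelled automaton; state 0 is initial. */
    interface ByteStepper {
        /** Returns the next state, or -1 if there is no transition. */
        int step(int state, int byteLabel);
        boolean isAccept(int state);
    }

    /** Walks the text code point by code point, feeding UTF-8 bytes to the stepper. */
    static boolean run(ByteStepper automaton, CharSequence text) {
        int state = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = Character.codePointAt(text, i);
            i += Character.charCount(codePoint);
            // Note: an unpaired surrogate is encoded as '?' by getBytes.
            byte[] utf8 = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                state = automaton.step(state, b & 0xFF);
                if (state == -1) {
                    return false; // dead end: no transition on this byte
                }
            }
        }
        return automaton.isAccept(state);
    }

    /** Toy automaton: the UTF-8 bytes of a literal, then any suffix. */
    static ByteStepper literalThenAnything(String literal) {
        final byte[] expected = literal.getBytes(StandardCharsets.UTF_8);
        return new ByteStepper() {
            @Override
            public int step(int state, int byteLabel) {
                if (state == expected.length) {
                    return state; // already past the literal: accept any byte
                }
                return (expected[state] & 0xFF) == byteLabel ? state + 1 : -1;
            }

            @Override
            public boolean isAccept(int state) {
                return state == expected.length;
            }
        };
    }

    public static void main(String[] args) {
        ByteStepper automaton = literalThenAnything("\u00fcber");
        System.out.println(run(automaton, "\u00fcberall")); // literal matched, suffix ignored
        System.out.println(run(automaton, "uber"));         // first byte differs
    }
}
```

Allocating a String per code point is wasteful; a real implementation would presumably encode into a reusable buffer. The sketch favours clarity over speed.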





[jira] [Commented] (LUCENE-7719) UnifiedHighlighter doesn't handle some AutomatonQuery's with multi-byte chars

2017-03-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890090#comment-15890090
 ] 

Michael McCandless commented on LUCENE-7719:


Wow, this is a great catch [~dmitrymalinin]!  Thank you for opening the 
precursor issue.

{{AutomatonQuery.getAutomaton}} really must return a UTF8-oriented
automaton because that matches how the terms are indexed into Lucene,
and what the automaton will be intersected with, to run the query.

We should fix the javadocs to say this.

And it is sort of annoying that these differences are not strongly
typed, but the {{Automaton}} class is really agnostic to what ints you are
putting onto its transitions.

But, yeah, for highlighting, we are operating in UTF16 space, and so I
think we need some way to have the {{CharacterRunAutomaton}} interface
on top of a UTF8 automaton?  Maybe we should abstract out a separate
interface that {{MultiTermHighlighting}} would use?  It seems it only
uses the {{run}} method, to test if a given term is accepted?  And
then, as you suggested, we could easily convert the incoming char[] to
UTF8 BytesRef and use the {{ByteRunAutomaton.run}} on that.
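
The suggestion above (a narrow interface exposing only a run method, backed by a UTF-8 automaton) might look roughly like the sketch below. It is an illustration under assumptions, not Lucene code: CharsMatcher, BytesMatcher, overUtf8, and utf8Prefix are hypothetical names, and BytesMatcher merely stands in for whatever byte-level matcher is used in practice.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Lucene API.
final class Utf8AdapterSketch {
    /** Assumed: the single operation the highlighter needs on a term. */
    interface CharsMatcher {
        boolean run(char[] chars, int offset, int length);
    }

    /** Assumed stand-in for a byte-level run automaton. */
    interface BytesMatcher {
        boolean run(byte[] bytes, int offset, int length);
    }

    /** Adapts a UTF-8 matcher to char input by encoding the term first. */
    static CharsMatcher overUtf8(BytesMatcher byteMatcher) {
        return (chars, offset, length) -> {
            byte[] utf8 = new String(chars, offset, length)
                .getBytes(StandardCharsets.UTF_8);
            return byteMatcher.run(utf8, 0, utf8.length);
        };
    }

    /** Toy byte matcher: accepts input starting with the prefix's UTF-8 bytes. */
    static BytesMatcher utf8Prefix(String prefix) {
        final byte[] expected = prefix.getBytes(StandardCharsets.UTF_8);
        return (bytes, offset, length) -> {
            if (length < expected.length) {
                return false;
            }
            for (int i = 0; i < expected.length; i++) {
                if (bytes[offset + i] != expected[i]) {
                    return false;
                }
            }
            return true;
        };
    }

    public static void main(String[] args) {
        CharsMatcher matcher = overUtf8(utf8Prefix("caf\u00e9"));
        char[] accented = "caf\u00e9s".toCharArray();
        char[] plain = "cafe".toCharArray();
        System.out.println(matcher.run(accented, 0, accented.length)); // true
        System.out.println(matcher.run(plain, 0, plain.length));       // false
    }
}
```

The caller keeps working with char[] terms, and the UTF-16 to UTF-8 conversion is confined to one place, which is the separation proposed in the comment.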

