[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r451581237 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -499,10 +508,29 @@ public RegExp(String s) throws IllegalArgumentException { * regular expression */ public RegExp(String s, int syntax_flags) throws IllegalArgumentException { +this(s, syntax_flags, 0); + } + /** + * Constructs new RegExp from a string. + * + * @param s regexp string + * @param syntax_flags boolean 'or' of optional syntax constructs to be + * enabled + * @param match_flags boolean 'or' of match behavior options such as case insensitivity + * @exception IllegalArgumentException if an error occurred while parsing the + * regular expression + */ + public RegExp(String s, int syntax_flags, int match_flags) throws IllegalArgumentException { +// (for BWC reasons we don't validate invalid bits, just trim instead) +syntax_flags = syntax_flags & 0xff; Review comment: As far as I can see, no. The change would be to remove that line and replace with if (syntax_flags > ALL) { throw new IllegalArgumentException("Illegal syntax flag"); } This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r445538877 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -439,12 +441,19 @@ /** * Syntax flag, enables all optional regexp syntax. */ - public static final int ALL = 0x; - + public static final int ALL = 0xff; + /** * Syntax flag, enables no optional regexp syntax. */ public static final int NONE = 0x; + + //- Matching flags ( > 0xff ) -- + + /** + * Allows case insensitive matching of ASCII characters. + */ + public static final int ASCII_CASE_INSENSITIVE = 0x0100; Review comment: My assumption was this class was lenient to syntax flags > 0xff before this change so should remain so for BWC reasons This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r445497621 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -439,12 +441,19 @@ /** * Syntax flag, enables all optional regexp syntax. */ - public static final int ALL = 0x; - + public static final int ALL = 0xff; + /** * Syntax flag, enables no optional regexp syntax. */ public static final int NONE = 0x; + + //- Matching flags ( > 0xff ) -- + + /** + * Allows case insensitive matching of ASCII characters. + */ + public static final int ASCII_CASE_INSENSITIVE = 0x0100; Review comment: UNICODE_CASE_INSENSITIVE or just CASE_INSENSITIVE? Either way it would cover ASCII and all other UNICODE characters. Admittedly slightly odd that the two flags overlap but the alternative is people may assume that a non-qualified name like "CASE_INSENSITIVE" would cover all the bases when we only currently support ASCII. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r445490501 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -439,12 +441,19 @@ /** * Syntax flag, enables all optional regexp syntax. */ - public static final int ALL = 0x; - + public static final int ALL = 0xff; + /** * Syntax flag, enables no optional regexp syntax. */ public static final int NONE = 0x; + + //- Matching flags ( > 0xff ) -- + + /** + * Allows case insensitive matching of ASCII characters. + */ + public static final int ASCII_CASE_INSENSITIVE = 0x0100; Review comment: I thought it might be useful if the flag name reflected the current limitations? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444284423 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -743,6 +792,30 @@ private Automaton toAutomatonInternal(Map automata, } return a; } + private Automaton toCaseInsensitiveChar(int codepoint, int maxDeterminizedStates) { +Automaton case1 = Automata.makeChar(codepoint); +int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint); +Automaton result; +if (altCase != codepoint) { + result = Operations.union(case1, Automata.makeChar(altCase)); + result = MinimizationOperations.minimize(result, maxDeterminizedStates); +} else { + result = case1; +} +return result; + } Review comment: An alternative would be an overhaul of RegExp. * Introducing a Builder class for the parser with named properties for settings * separating the RegExp parser logic from the parsed objects (currently they are the same class). * separating rendering functions (toString, to Automaton, toStringTree) from the parsed objects. I'm not sure if we're at the tipping point where all of that would make sense. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444201265 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -743,6 +792,30 @@ private Automaton toAutomatonInternal(Map automata, } return a; } + private Automaton toCaseInsensitiveChar(int codepoint, int maxDeterminizedStates) { +Automaton case1 = Automata.makeChar(codepoint); +int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint); +Automaton result; +if (altCase != codepoint) { + result = Operations.union(case1, Automata.makeChar(altCase)); + result = MinimizationOperations.minimize(result, maxDeterminizedStates); +} else { + result = case1; +} +return result; + } Review comment: The one downside is RegExpQuery already has a (Term, int, int) constructor for syntax flags and maxDeterminizedStates so I added a (int, int, Term) variation for passing the syntax and match flags. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444182510 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -743,6 +792,30 @@ private Automaton toAutomatonInternal(Map automata, } return a; } + private Automaton toCaseInsensitiveChar(int codepoint, int maxDeterminizedStates) { +Automaton case1 = Automata.makeChar(codepoint); +int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint); +Automaton result; +if (altCase != codepoint) { + result = Operations.union(case1, Automata.makeChar(altCase)); + result = MinimizationOperations.minimize(result, maxDeterminizedStates); +} else { + result = case1; +} +return result; + } Review comment: Yes, that's what I was thinking This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444134039 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java ## @@ -743,6 +792,30 @@ private Automaton toAutomatonInternal(Map automata, } return a; } + private Automaton toCaseInsensitiveChar(int codepoint, int maxDeterminizedStates) { +Automaton case1 = Automata.makeChar(codepoint); +int altCase = Character.isLowerCase(codepoint) ? Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint); +Automaton result; +if (altCase != codepoint) { + result = Operations.union(case1, Automata.makeChar(altCase)); + result = MinimizationOperations.minimize(result, maxDeterminizedStates); +} else { + result = case1; +} +return result; + } Review comment: Would that be an argument for having a bit mask for case sensitivity flags instead of a Boolean? Java has the UNICODE_CASE flag which differs from the ASCII one and it might be an idea to leave something in the API open to future work. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option
markharwood commented on a change in pull request #1541: URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r431891105 ## File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestRegExp.java ## @@ -88,10 +88,14 @@ public void testRepeatWithEmptyLanguage() throws Exception { assertTrue(a.toString().length() > 0); } + + boolean caseSensitiveQuery = true; + public void testCoreJavaParity() { // Generate random doc values and random regular expressions // and check for same matching behaviour as Java's Pattern class. for (int i = 0; i < 1000; i++) { + caseSensitiveQuery = true; Review comment: The randomisation comes from the choice of string mutation later that produces the regex - this is just initialising the variable. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org