[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-07-08 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r451581237



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -499,10 +508,29 @@ public RegExp(String s) throws IllegalArgumentException {
*  regular expression
*/
   public RegExp(String s, int syntax_flags) throws IllegalArgumentException {
+this(s, syntax_flags, 0);
+  }
+  /**
+   * Constructs new RegExp from a string.
+   * 
+   * @param s regexp string
+   * @param syntax_flags boolean 'or' of optional syntax constructs to be
+   *  enabled
+   * @param match_flags boolean 'or' of match behavior options such as case 
insensitivity
+   * @exception IllegalArgumentException if an error occurred while parsing the
+   *  regular expression
+   */
+  public RegExp(String s, int syntax_flags, int match_flags) throws 
IllegalArgumentException {
+// (for BWC reasons we don't validate invalid bits, just trim instead)
+syntax_flags  = syntax_flags & 0xff;

Review comment:
   As far as I can see, no. The change would be to remove that line and 
replace with
   
   if (syntax_flags >  ALL) {
 throw new IllegalArgumentException("Illegal syntax flag");
   }
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-06-25 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r445538877



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -439,12 +441,19 @@
   /**
* Syntax flag, enables all optional regexp syntax.
*/
-  public static final int ALL = 0x;
-  
+  public static final int ALL = 0xff;
+  
   /**
* Syntax flag, enables no optional regexp syntax.
*/
   public static final int NONE = 0x;
+  
+  //-  Matching flags ( > 0xff )  --
+  
+  /**
+   * Allows case insensitive matching of ASCII characters.
+   */
+  public static final int ASCII_CASE_INSENSITIVE = 0x0100;

Review comment:
   My assumption was this class was lenient to syntax flags > 0xff before 
this change so should remain so for BWC reasons





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-06-25 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r445497621



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -439,12 +441,19 @@
   /**
* Syntax flag, enables all optional regexp syntax.
*/
-  public static final int ALL = 0x;
-  
+  public static final int ALL = 0xff;
+  
   /**
* Syntax flag, enables no optional regexp syntax.
*/
   public static final int NONE = 0x;
+  
+  //-  Matching flags ( > 0xff )  --
+  
+  /**
+   * Allows case insensitive matching of ASCII characters.
+   */
+  public static final int ASCII_CASE_INSENSITIVE = 0x0100;

Review comment:
   UNICODE_CASE_INSENSITIVE or just CASE_INSENSITIVE?
   Either way it would cover ASCII and all other UNICODE characters. 
   Admittedly slightly odd that the two flags overlap but the alternative is 
people may assume that a non-qualified name like "CASE_INSENSITIVE" would cover 
all the bases when we only currently support ASCII.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-06-25 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r445490501



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -439,12 +441,19 @@
   /**
* Syntax flag, enables all optional regexp syntax.
*/
-  public static final int ALL = 0x;
-  
+  public static final int ALL = 0xff;
+  
   /**
* Syntax flag, enables no optional regexp syntax.
*/
   public static final int NONE = 0x;
+  
+  //-  Matching flags ( > 0xff )  --
+  
+  /**
+   * Allows case insensitive matching of ASCII characters.
+   */
+  public static final int ASCII_CASE_INSENSITIVE = 0x0100;

Review comment:
   I thought it might be useful if the flag name reflected the current 
limitations?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-06-23 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444284423



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -743,6 +792,30 @@ private Automaton 
toAutomatonInternal(Map automata,
 }
 return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int 
maxDeterminizedStates) {
+Automaton case1 = Automata.makeChar(codepoint);
+int altCase = Character.isLowerCase(codepoint) ? 
Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+Automaton result;
+if (altCase != codepoint) {
+  result = Operations.union(case1, Automata.makeChar(altCase));
+  result = MinimizationOperations.minimize(result, maxDeterminizedStates); 
 
+} else {
+  result = case1;  
+}  
+return result;
+  }

Review comment:
   An alternative would be an overhaul of RegExp.
   * Introducing a Builder class for the parser with named properties for 
settings
   * separating the RegExp parser logic from the  parsed objects (currently 
they are the same class). 
   * separating rendering functions (toString, to Automaton, toStringTree) from 
the parsed objects.
   
   I'm not sure if we're at the tipping point where all of that would make 
sense.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-06-23 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444201265



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -743,6 +792,30 @@ private Automaton 
toAutomatonInternal(Map automata,
 }
 return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int 
maxDeterminizedStates) {
+Automaton case1 = Automata.makeChar(codepoint);
+int altCase = Character.isLowerCase(codepoint) ? 
Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+Automaton result;
+if (altCase != codepoint) {
+  result = Operations.union(case1, Automata.makeChar(altCase));
+  result = MinimizationOperations.minimize(result, maxDeterminizedStates); 
 
+} else {
+  result = case1;  
+}  
+return result;
+  }

Review comment:
   The one downside is RegExpQuery already has a (Term, int, int) 
constructor for syntax flags and maxDeterminizedStates so I added a (int, int, 
Term) variation for passing the syntax and match flags.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-06-23 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444182510



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -743,6 +792,30 @@ private Automaton 
toAutomatonInternal(Map automata,
 }
 return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int 
maxDeterminizedStates) {
+Automaton case1 = Automata.makeChar(codepoint);
+int altCase = Character.isLowerCase(codepoint) ? 
Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+Automaton result;
+if (altCase != codepoint) {
+  result = Operations.union(case1, Automata.makeChar(altCase));
+  result = MinimizationOperations.minimize(result, maxDeterminizedStates); 
 
+} else {
+  result = case1;  
+}  
+return result;
+  }

Review comment:
   Yes, that's what I was thinking





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-06-23 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r444134039



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -743,6 +792,30 @@ private Automaton 
toAutomatonInternal(Map automata,
 }
 return a;
   }
+  private Automaton toCaseInsensitiveChar(int codepoint, int 
maxDeterminizedStates) {
+Automaton case1 = Automata.makeChar(codepoint);
+int altCase = Character.isLowerCase(codepoint) ? 
Character.toUpperCase(codepoint) : Character.toLowerCase(codepoint);
+Automaton result;
+if (altCase != codepoint) {
+  result = Operations.union(case1, Automata.makeChar(altCase));
+  result = MinimizationOperations.minimize(result, maxDeterminizedStates); 
 
+} else {
+  result = case1;  
+}  
+return result;
+  }

Review comment:
   Would that be an argument for having a bit mask for case sensitivity 
flags instead of a Boolean? Java has the UNICODE_CASE flag which differs from 
the ASCII one and it might be an idea to leave something in the API open to 
future work.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] markharwood commented on a change in pull request #1541: RegExp - add case insensitive matching option

2020-05-28 Thread GitBox


markharwood commented on a change in pull request #1541:
URL: https://github.com/apache/lucene-solr/pull/1541#discussion_r431891105



##
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestRegExp.java
##
@@ -88,10 +88,14 @@ public void testRepeatWithEmptyLanguage() throws Exception {
 assertTrue(a.toString().length() > 0);
   }
   
+  
+  boolean caseSensitiveQuery = true;
+  
   public void testCoreJavaParity() {
 // Generate random doc values and random regular expressions
 // and check for same matching behaviour as Java's Pattern class.
 for (int i = 0; i < 1000; i++) {
+  caseSensitiveQuery = true;  

Review comment:
   The randomisation comes from the choice of string mutation later that 
produces the regex - this is just initialising the variable.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org