[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-27 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r318197341
 
 

 ##
 File path: 
parquet/src/test/java/org/apache/iceberg/parquet/TestDictionaryRowGroupFilter.java
 ##
 @@ -231,6 +236,41 @@ public void testRequiredColumn() {
 Assert.assertFalse("Should skip: required columns are always non-null", 
shouldRead);
   }
 
+  @Test
+  public void testStartsWith() {
+boolean shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("non_dict", "re"))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertTrue("Should read: no dictionary", shouldRead);
+
+shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("required", "re"))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertTrue("Should read: dictionary contains a matching entry", 
shouldRead);
+
+shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("required", "req"))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertTrue("Should read: dictionary contains a matching entry", 
shouldRead);
+
+shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("some_nulls", "so"))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertTrue("Should read: dictionary contains a matching entry", 
shouldRead);
+
+shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("no_stats", UUID.randomUUID().toString()))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertFalse("Should skip: no stats but dictionary is present", 
shouldRead);
+
+shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("required", "reqs"))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertFalse("Should skip: no match in dictionary", shouldRead);
+
+shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("some_nulls", "somex"))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertFalse("Should skip: no match in dictionary", shouldRead);
+
+shouldRead = new ParquetDictionaryRowGroupFilter(SCHEMA, 
startsWith("no_nulls", "xxx"))
+.shouldRead(PARQUET_SCHEMA, ROW_GROUP_METADATA, DICTIONARY_STORE);
+Assert.assertFalse("Should skip: no match in dictionary", shouldRead);
 
 Review comment:
   Can you add tests for `all_nulls` and `not_in_file` columns?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-27 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r318196009
 
 

 ##
 File path: 
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
 ##
 @@ -339,6 +341,47 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
   return ROWS_MIGHT_MATCH;
 }
 
+@Override
+@SuppressWarnings("unchecked")
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  Long valueCount = valueCounts.get(id);
+  if (valueCount == null) {
+// the column is not present and is all nulls
+return ROWS_CANNOT_MATCH;
+  }
+
+  Statistics colStats = (Statistics) stats.get(id);
+  if (colStats != null && !colStats.isEmpty()) {
+if (!colStats.hasNonNullValue()) {
+  return ROWS_CANNOT_MATCH;
+}
+
+Binary prefixAsBinary = 
Binary.fromConstantByteBuffer(lit.toByteBuffer());
+
+PrimitiveComparator comparator = 
PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR;
+
+Binary lower = colStats.genericGetMin();
+// truncate lower bound so that its length in bytes is not greater 
than the length of prefix
+int lowerLength = Math.min(prefixAsBinary.length(), lower.length());
+int lowerCmp = comparator.compare(lower.slice(0, lowerLength), 
prefixAsBinary);
 
 Review comment:
   I just looked at Parquet and I think it is going to be cheaper to use 
`toByteBuffer().slice(...)` instead of `slice(...)` because the slice 
implementation for most of the `Binary` implementations calls 
`getBytesUnsafe()` that returns a `byte[]` and usually copies data.
   
   Another minor point is that I think it would be better to use Iceberg's 
comparators instead of Parquet's comparators to ensure we have the same 
behavior everywhere. So I think this should be updated to convert the binary to 
a byte buffer, slice that, and then compare the two using Iceberg's 
`Comparators.unsignedBytes()`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-27 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r318191642
 
 

 ##
 File path: 
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
 boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 
5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
startsWith("required", "a"), true).eval(FILE);
+Assert.assertTrue("Should read: no stats", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"a"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aaa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"1s"), true).eval(FILE_3);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"1str1x"), true).eval(FILE_3);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"ff"), true).eval(FILE_4);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aB"), true).eval(FILE_2);
 
 Review comment:
   Okay, that works for me. You might add a comment to show your intent in the 
future, but it's not a big deal here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-27 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r318178750
 
 

 ##
 File path: 
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
 boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 
5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
startsWith("required", "a"), true).eval(FILE);
+Assert.assertTrue("Should read: no stats", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"a"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aaa"), true).eval(FILE_2);
 
 Review comment:
   You're right, I think that behavior is correct.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317842139
 
 

 ##
 File path: spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java
 ##
 @@ -153,6 +156,11 @@ public static Expression convert(Filter filter) {
   }
   return null;
 }
+
+case STARTS_WITH: {
+  StringStartsWith stringStartsWith = (StringStartsWith) filter;
+  return startsWith(stringStartsWith.attribute(), 
stringStartsWith.value());
+}
 
 Review comment:
   Is there a good way to test this conversion?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317841160
 
 

 ##
 File path: 
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
 boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 
5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
startsWith("required", "a"), true).eval(FILE);
+Assert.assertTrue("Should read: no stats", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"a"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aaa"), true).eval(FILE_2);
 
 Review comment:
   I don't think there is an equivalent test for the upper bound, where the 
bound is shorter so the startsWith prefix is truncated. Can you add one?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317841001
 
 

 ##
 File path: 
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
 boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 
5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
startsWith("required", "a"), true).eval(FILE);
+Assert.assertTrue("Should read: no stats", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"a"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aaa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"1s"), true).eval(FILE_3);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"1str1x"), true).eval(FILE_3);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"ff"), true).eval(FILE_4);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aB"), true).eval(FILE_2);
 
 Review comment:
   Nit: For the purpose of testing, I think it is more clear to use all lower 
case or all upper case. It may not be obvious that `"B" < "a"`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317840857
 
 

 ##
 File path: 
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
 boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 
5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, 
startsWith("required", "a"), true).eval(FILE);
+Assert.assertTrue("Should read: no stats", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"a"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aaa"), true).eval(FILE_2);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"1s"), true).eval(FILE_3);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"1str1x"), true).eval(FILE_3);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"ff"), true).eval(FILE_4);
+Assert.assertTrue("Should read: range matches", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"aB"), true).eval(FILE_2);
+Assert.assertFalse("Should not read: range doesn't match", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"dWX"), true).eval(FILE_2);
+Assert.assertFalse("Should not read: range doesn't match", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"5"), true).eval(FILE_3);
+Assert.assertFalse("Should not read: range doesn't match", shouldRead);
+
+shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", 
"イロホヘト"), true).eval(FILE_4);
 
 Review comment:
   It isn't very clear whether this should or shouldn't match. Could you modify 
the original string to produce values? For example, 
`UnicodeUtil.truncateString(original, 4)` will increment the 4th codepoint so 
that it is more obviously outside the range.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317836527
 
 

 ##
 File path: 
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
 ##
 @@ -294,6 +294,11 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
   return ROWS_MIGHT_MATCH;
 }
 
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  return ROWS_MIGHT_MATCH;
 
 Review comment:
   I think this can be implemented. If any dictionary value starts with the 
prefix, then some row will match. But if all dictionary values do not start 
with the prefix, then the row group can be skipped. (This would also need to 
check `isFallback` and return `ROWS_MIGHT_MATCH` if so)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-24 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317372354
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -254,5 +258,43 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 public  Boolean notIn(BoundReference ref, Literal lit) {
   return ROWS_MIGHT_MATCH;
 }
+
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  String prefix = (String) lit.value();
+  ByteBuffer prefixAsBytes = 
ByteBuffer.wrap(prefix.getBytes(StandardCharsets.UTF_8));
 
 Review comment:
   Maybe we should add `toByteBuffer` and `fromByteBuffer` methods to 
`Literal`. That would make conversion to the single-value serialization easier 
(and public) and would solve this problem.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317277032
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -254,5 +258,43 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 public  Boolean notIn(BoundReference ref, Literal lit) {
   return ROWS_MIGHT_MATCH;
 }
+
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  String prefix = (String) lit.value();
+  ByteBuffer prefixAsBytes = 
ByteBuffer.wrap(prefix.getBytes(StandardCharsets.UTF_8));
+
+  Comparator comparator = Comparators.unsignedBytes();
+
+  if (lowerBounds != null && lowerBounds.containsKey(id)) {
+ByteBuffer lower = lowerBounds.get(id);
+// take the same number of bytes from prefix and lower bound
+int length = Math.min(prefixAsBytes.remaining(), lower.remaining());
+
+int cmp = comparator.compare(
+length < lower.remaining() ? BinaryUtil.truncateBinary(lower, 
length) : lower,
 
 Review comment:
   Yeah, I think so.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317276968
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -254,5 +258,43 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 public  Boolean notIn(BoundReference ref, Literal lit) {
   return ROWS_MIGHT_MATCH;
 }
+
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  String prefix = (String) lit.value();
+  ByteBuffer prefixAsBytes = 
ByteBuffer.wrap(prefix.getBytes(StandardCharsets.UTF_8));
 
 Review comment:
   Maybe make it so that a `StringLiteral` can cache and return the bytes 
representation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317276702
 
 

 ##
 File path: 
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
 ##
 @@ -339,6 +342,48 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
   return ROWS_MIGHT_MATCH;
 }
 
+@Override
+@SuppressWarnings("unchecked")
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  Long valueCount = valueCounts.get(id);
+  if (valueCount == null) {
+// the column is not present and is all nulls
+return ROWS_CANNOT_MATCH;
+  }
+
+  Statistics colStats = (Statistics) stats.get(id);
 
 Review comment:
   Not yet. Parquet stops writing stats if they are too large. I think the 
default limit is 4k.
   
   I think I missed something earlier. I didn't catch that the length you're 
truncating to is the length in bytes of the prefix. In that case, binary 
comparison should work because UTF-8 unsigned byte-wise comparison has the same 
result as the default UTF-8 ordering comparison. So as long as we're truncating 
to the same length in bytes and comparing UTF-8 encoded values, I think you're 
right about this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317265081
 
 

 ##
 File path: 
parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java
 ##
 @@ -339,6 +342,48 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
   return ROWS_MIGHT_MATCH;
 }
 
+@Override
+@SuppressWarnings("unchecked")
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  Long valueCount = valueCounts.get(id);
+  if (valueCount == null) {
+// the column is not present and is all nulls
+return ROWS_CANNOT_MATCH;
+  }
+
+  Statistics colStats = (Statistics) stats.get(id);
 
 Review comment:
   I don't think this is safe because truncation happens differently for 
strings and binary.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317264772
 
 

 ##
 File path: 
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -42,18 +42,19 @@
 import static org.apache.iceberg.expressions.Expressions.notEqual;
 import static org.apache.iceberg.expressions.Expressions.notNull;
 import static org.apache.iceberg.expressions.Expressions.or;
+import static org.apache.iceberg.expressions.Expressions.startsWith;
 import static org.apache.iceberg.types.Conversions.toByteBuffer;
 import static org.apache.iceberg.types.Types.NestedField.optional;
 import static org.apache.iceberg.types.Types.NestedField.required;
 
 public class TestInclusiveMetricsEvaluator {
   private static final Schema SCHEMA = new Schema(
   required(1, "id", IntegerType.get()),
-  optional(2, "no_stats", Types.IntegerType.get()),
-  required(3, "required", Types.StringType.get()),
-  optional(4, "all_nulls", Types.StringType.get()),
-  optional(5, "some_nulls", Types.StringType.get()),
-  optional(6, "no_nulls", Types.StringType.get())
+  optional(2, "no_stats", IntegerType.get()),
+  required(3, "required", StringType.get()),
+  optional(4, "all_nulls", StringType.get()),
+  optional(5, "some_nulls", StringType.get()),
+  optional(6, "no_nulls", StringType.get())
 
 Review comment:
   Nit: this is an unnecessary change that could cause commit conflicts.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317264545
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -254,5 +258,43 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 public  Boolean notIn(BoundReference ref, Literal lit) {
   return ROWS_MIGHT_MATCH;
 }
+
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  String prefix = (String) lit.value();
+  ByteBuffer prefixAsBytes = 
ByteBuffer.wrap(prefix.getBytes(StandardCharsets.UTF_8));
+
+  Comparator comparator = Comparators.unsignedBytes();
+
+  if (lowerBounds != null && lowerBounds.containsKey(id)) {
+ByteBuffer lower = lowerBounds.get(id);
+// take the same number of bytes from prefix and lower bound
+int length = Math.min(prefixAsBytes.remaining(), lower.remaining());
+
+int cmp = comparator.compare(
+length < lower.remaining() ? BinaryUtil.truncateBinary(lower, 
length) : lower,
 
 Review comment:
   Also, the `truncate` methods will return the original value if there is no 
need to truncate, so you can simplify this code.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317264055
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -254,5 +258,43 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 public  Boolean notIn(BoundReference ref, Literal lit) {
   return ROWS_MIGHT_MATCH;
 }
+
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  String prefix = (String) lit.value();
+  ByteBuffer prefixAsBytes = 
ByteBuffer.wrap(prefix.getBytes(StandardCharsets.UTF_8));
+
+  Comparator comparator = Comparators.unsignedBytes();
+
+  if (lowerBounds != null && lowerBounds.containsKey(id)) {
+ByteBuffer lower = lowerBounds.get(id);
+// take the same number of bytes from prefix and lower bound
+int length = Math.min(prefixAsBytes.remaining(), lower.remaining());
+
+int cmp = comparator.compare(
+length < lower.remaining() ? BinaryUtil.truncateBinary(lower, 
length) : lower,
 
 Review comment:
   It isn't safe to call `truncateBinary` on a string value because it 
truncates by bytes. Strings are truncated by unicode codepoints instead.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317257280
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -254,5 +258,43 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 public  Boolean notIn(BoundReference ref, Literal lit) {
   return ROWS_MIGHT_MATCH;
 }
+
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
+  int id = ref.fieldId();
+
+  String prefix = (String) lit.value();
+  ByteBuffer prefixAsBytes = 
ByteBuffer.wrap(prefix.getBytes(StandardCharsets.UTF_8));
 
 Review comment:
   Can we cache these instead of converting in each call to evaluate? Metrics 
evaluators are called in a tight loop when planning jobs.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-23 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith 
in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317256796
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -254,5 +258,43 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 public  Boolean notIn(BoundReference ref, Literal lit) {
   return ROWS_MIGHT_MATCH;
 }
+
+@Override
+public  Boolean startsWith(BoundReference ref, Literal lit) {
 
 Review comment:
   I think if a column is all nulls, then `startsWith` should not match. I 
think it was an oversight not to add these checks in 
`InclusiveMetricsEvaluator`. Could you open an issue to fix this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org