Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21623#discussion_r198244664
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
    +        Option(prefix).map { v =>
    +          FilterApi.userDefined(binaryColumn(name),
    +            new UserDefinedPredicate[Binary] with Serializable {
    +              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
    +              private val size = strToBinary.length
    +
    +              override def canDrop(statistics: Statistics[Binary]): Boolean = {
    +                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
    +                val max = statistics.getMax
    +                val min = statistics.getMin
    +                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    +                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
    +              }
    +
    +              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
    --- End diff --
    
    Sorry, I meant that if the min and max both *include* the prefix (that is, both start with it), then `inverseCanDrop` should be able to drop the range. When both min and max match, every value between them must also match the filter. So if we are looking for values that do *not* match the filter, we can eliminate the row group.
    
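    The example is prefix=CCC with values between min=CCCZ and max=CCCa (in unsigned byte order, Z sorts before a): all values start with CCC, so the entire row group can be skipped when filtering for values that do not start with the prefix.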
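
    A minimal sketch of the check described above, written against plain byte arrays rather than Parquet's `Binary` and `Statistics` types so that it runs on its own. The object name `PrefixPruning` and its helpers are hypothetical and not part of the PR; the comparison is assumed to mirror the unsigned lexicographic comparator used in the diff. Because `Z` sorts before `a` in byte order, the sample statistics use min=CCCZ and max=CCCa.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-alone helper; not the code from the PR.
object PrefixPruning {
  // Unsigned lexicographic comparison of two byte arrays.
  def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = (a(i) & 0xff) - (b(i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  // Keep at most the first n bytes of a value.
  private def truncate(x: Array[Byte], n: Int): Array[Byte] =
    x.take(math.min(n, x.length))

  // startsWith(prefix): drop when the whole [min, max] range lies
  // strictly below or strictly above the prefix.
  def canDrop(min: String, max: String, prefix: String): Boolean = {
    val p = prefix.getBytes(UTF_8)
    compareUnsigned(truncate(max.getBytes(UTF_8), p.length), p) < 0 ||
      compareUnsigned(truncate(min.getBytes(UTF_8), p.length), p) > 0
  }

  // NOT startsWith(prefix): drop when min and max both start with the
  // prefix, since every value between them must then start with it too.
  def inverseCanDrop(min: String, max: String, prefix: String): Boolean = {
    val p = prefix.getBytes(UTF_8)
    compareUnsigned(truncate(min.getBytes(UTF_8), p.length), p) == 0 &&
      compareUnsigned(truncate(max.getBytes(UTF_8), p.length), p) == 0
  }
}
```

    With prefix=CCC, `PrefixPruning.inverseCanDrop("CCCZ", "CCCa", "CCC")` is true, while a range such as min=CCB, max=CCCa is kept because its min does not start with the prefix.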


---
