GitHub user maropu commented on the issue:

    https://github.com/apache/spark/pull/21070
  
    @rdblue Aha, thanks for the explanation! 
    
    > I think that you were expecting a string comparison case to have a significant benefit over
    > non-pushdown. But I would only expect that if ORC had a similar benefit.
    > That's because this is dependent on the clustering of values in the file so that Parquet can
    > eliminate row groups. If ORC didn't have a benefit, then I would expect that the data just
    > isn't clustered in a way that helps.
    
    The `string` case used the same test data set (monotonically increasing ids)
    as the `int` case, but only the `string` case got no benefit from push-down.
    Is the logic to eliminate row groups different between the `int` and `string`
    cases, even though we use the same dataset?
    
https://github.com/maropu/spark/blob/465aa420b1399aba7199aa2868ad6ae58d877d50/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala#L70
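
To make the question concrete, here is a toy model of min/max row-group elimination. It is only a sketch under stated assumptions, not Parquet's or ORC's actual reader logic: `row_group_stats` and `groups_to_read` are hypothetical helpers, the group size of 100 is arbitrary, and the string column is assumed to hold the decimal text of each id, compared lexicographically.

```python
import random

def row_group_stats(values, group_size):
    """Split values into consecutive row groups and record (min, max) for each."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]

def groups_to_read(stats, target):
    """Count row groups whose [min, max] range may contain `target`."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)

ids = list(range(1000))              # monotonically increasing ids (assumed layout)
shuffled = ids[:]
random.Random(0).shuffle(shuffled)   # same values, but no clustering

clustered_int = groups_to_read(row_group_stats(ids, 100), 500)
shuffled_int = groups_to_read(row_group_stats(shuffled, 100), 500)

# Assumption: the string column is the decimal text of each id,
# ordered lexicographically (so "500" sorts before "99").
clustered_str = groups_to_read(row_group_stats([str(i) for i in ids], 100), "500")

print(clustered_int, shuffled_int, clustered_str)
```

In this toy model the clustered `int` column needs 1 of 10 row groups and the `string` column needs 2 of 10, while the shuffled column loses most of the benefit. So lexicographic ordering alone would not explain a complete loss of benefit here; whether the real reader applies string statistics at all is a separate question that this sketch does not answer.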



---

