GitHub user maropu commented on the issue:

https://github.com/apache/spark/pull/21070

@rdblue Aha, thanks for the explanation!

> I think that you were expecting a string comparison case to have a significant benefit over
> non-pushdown. But I would only expect that if ORC had a similar benefit.
> That's because this is dependent on the clustering of values in the file so that Parquet can
> eliminate row groups. If ORC didn't have a benefit, then I would expect that the data just
> isn't clustered in a way that helps.

The `string` case used the same test data set (monotonically increasing ids) as the `int` case, but only the `string` case got no benefit from push-down. Is the logic for eliminating row groups different between the `int` and `string` cases, even though we use the same dataset?

https://github.com/maropu/spark/blob/465aa420b1399aba7199aa2868ad6ae58d877d50/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala#L70
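
To make the question concrete, here is a minimal sketch of min/max row-group elimination over clustered ids. It is not Parquet's or ORC's actual code: `RowGroupStats`, `build_stats`, and `groups_to_read` are hypothetical names, the group size of 1,000 is arbitrary, and it assumes the string column is simply the ids rendered as text.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics: only min and max are kept."""
    min_value: Any
    max_value: Any


def build_stats(values: Sequence[Any], group_size: int) -> List[RowGroupStats]:
    """Split values into consecutive row groups and record min/max for each."""
    stats = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        stats.append(RowGroupStats(min(chunk), max(chunk)))
    return stats


def groups_to_read(stats: Sequence[RowGroupStats], target: Any) -> List[int]:
    """Return indices of row groups whose [min, max] range may contain target.

    Every other group is eliminated for an equality filter on target.
    """
    return [i for i, s in enumerate(stats) if s.min_value <= target <= s.max_value]


# Monotonically increasing ids, as in the benchmark's data set.
ids = list(range(10_000))

# Same data, once as integers and once as text (assumed cast, see lead-in).
int_stats = build_stats(ids, 1_000)
str_stats = build_stats([str(i) for i in ids], 1_000)

int_hits = groups_to_read(int_stats, 5_000)   # numeric ordering
str_hits = groups_to_read(str_stats, "5000")  # lexicographic ordering
```

In this sketch the integer filter keeps 1 of 10 row groups and the string filter keeps 2 of 10 (lexicographic ordering makes the first group's range `"0"` to `"999"` also cover `"5000"`). So if min/max statistics are available for both columns, the elimination logic alone would not be expected to remove the benefit for strings. Whether the reader actually has usable statistics for the string column is a separate question that this sketch does not address.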