[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
wangyum commented on pull request #29642: URL: https://github.com/apache/spark/pull/29642#issuecomment-841794509 @dongjoon-hyun I think the [current benchmark](https://github.com/apache/spark/blob/7158e7f986630d4f67fb49a206d408c5f4384991/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L282-L297) is enough. I have added the updated benchmark results to the PR description. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
wangyum commented on pull request #29642: URL: https://github.com/apache/spark/pull/29642#issuecomment-841178201 @dongjoon-hyun @cloud-fan Please see the latest benchmark result: https://github.com/apache/spark/pull/29642/commits/27a2bf615eb158c7c25aa5bfaa04caa939c237da
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
wangyum commented on pull request #29642: URL: https://github.com/apache/spark/pull/29642#issuecomment-840985630 @dongjoon-hyun I think this performance issue is not caused by this change. This PR only changes the `In` predicate. It is also slow without this change:
```
OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1047-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Select 0 string row (value IS NULL):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Parquet Vectorized                                10623          10994         272          1.5         675.4       1.0X
Parquet Vectorized (Pushdown)                       627            657          24         25.1          39.9      16.9X
Native ORC Vectorized                              7490           7653         203          2.1         476.2       1.4X
Native ORC Vectorized (Pushdown)                    553            606          34         28.4          35.2      19.2X
```
https://github.com/wangyum/spark/runs/2580852093
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
wangyum commented on pull request #29642: URL: https://github.com/apache/spark/pull/29642#issuecomment-840259632 @dongjoon-hyun Do you have more comments?
[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown
wangyum commented on pull request #29642: URL: https://github.com/apache/spark/pull/29642#issuecomment-835972741 @dongjoon-hyun This PR only improves the `In` predicate. I have added a description of the improvement to the PR description.