[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-17 Thread GitBox


wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841794509


   @dongjoon-hyun I think the [current 
benchmark](https://github.com/apache/spark/blob/7158e7f986630d4f67fb49a206d408c5f4384991/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala#L282-L297)
 is enough. I have added the benchmark results to the PR description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-14 Thread GitBox


wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-841178201


   @dongjoon-hyun @cloud-fan Please see the latest benchmark result: 
https://github.com/apache/spark/pull/29642/commits/27a2bf615eb158c7c25aa5bfaa04caa939c237da





[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-13 Thread GitBox


wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840985630


   @dongjoon-hyun I think this performance issue is not caused by this change. 
This PR only changes the `In` predicate. It is also slow without this change:
   
   ```
   OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1047-azure
   Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
   Select 0 string row (value IS NULL):     Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
   -----------------------------------------------------------------------------------------------------------------
   Parquet Vectorized                               10623         10994        272        1.5        675.4      1.0X
   Parquet Vectorized (Pushdown)                      627           657         24       25.1         39.9     16.9X
   Native ORC Vectorized                             7490          7653        203        2.1        476.2      1.4X
   Native ORC Vectorized (Pushdown)                   553           606         34       28.4         35.2     19.2X
   ```
   https://github.com/wangyum/spark/runs/2580852093
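
For anyone reading the numbers above, here is a minimal sketch of how the `Relative` column can be cross-checked. It assumes `Relative` is the first row's best time divided by each row's best time, which is consistent with the figures shown. The helper name `relative_speedups` is hypothetical and is not part of Spark.

```python
# Best Time (ms) per case, copied from the benchmark output above.
best_ms = {
    "Parquet Vectorized": 10623,
    "Parquet Vectorized (Pushdown)": 627,
    "Native ORC Vectorized": 7490,
    "Native ORC Vectorized (Pushdown)": 553,
}


def relative_speedups(best_ms, baseline):
    """Return baseline best time / each case's best time, rounded to 1 decimal.

    Assumption: this mirrors how the Relative column is derived; it is a
    reading aid, not Spark's own benchmark code.
    """
    base = best_ms[baseline]
    return {name: round(base / ms, 1) for name, ms in best_ms.items()}


speedups = relative_speedups(best_ms, "Parquet Vectorized")
for name, factor in speedups.items():
    print(f"{name}: {factor}X")
```

The values this produces can be compared directly against the `Relative` column; the large gap between the pushdown and non-pushdown rows is already present in this `value IS NULL` case, which does not involve the `In` predicate.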





[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-12 Thread GitBox


wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-840259632


   @dongjoon-hyun Do you have more comments?





[GitHub] [spark] wangyum commented on pull request #29642: [SPARK-32792][SQL] Improve Parquet In filter pushdown

2021-05-09 Thread GitBox


wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-835972741


   @dongjoon-hyun This PR only improves the `In` predicate. I have added the 
improvement details to the PR description.

