[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-26 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-786837732


   Opened #31667 with the fix. Compared to the old PR, there's only a one-line 
change:
   ```scala
   reader.setRequestedSchema(requestedSchema);
   ```
   Please take a look. Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-25 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-786443278


   I tried @LuciferYang's suggestion with the TPC-DS benchmark and it fixed the 
perf regression. For instance, for q9 above, here's what I got:
   
   without the PR | with the PR | with the PR + requestSchema fix
   --- | --- | ---
   5940 | 11260 | 5281
   
   I've gathered all the results in this 
[gist](https://gist.github.com/sunchao/533c1c9a5fb6adc5f7c72b1c465e974d). The 
benchmark was run with scale factor 5.
   
   @maropu @HyukjinKwon @srowen let me know what you think. I can update this 
with the fix.
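To put the q9 numbers in the table above in relative terms, here is a minimal sketch that divides each timing by the "without the PR" baseline. The class and method names are hypothetical, and the table does not state units, so only the unit-free ratios are computed.

```java
// Ratios of the q9 timings quoted above, relative to the run without the PR.
// Class and method names are illustrative only; they are not part of Spark.
final class Q9Ratios {
    // How many times the baseline duration the measured run took.
    static double relativeToBaseline(double baseline, double measured) {
        return measured / baseline;
    }

    public static void main(String[] args) {
        double withoutPr = 5940;     // "without the PR"
        double withPr = 11260;       // "with the PR"
        double withPrAndFix = 5281;  // "with the PR + requestSchema fix"

        System.out.printf("with the PR:       %.2fx of baseline%n",
                relativeToBaseline(withoutPr, withPr));
        System.out.printf("with the PR + fix: %.2fx of baseline%n",
                relativeToBaseline(withoutPr, withPrAndFix));
    }
}
```

A ratio above 1 means the run was slower than the baseline; a ratio below 1 means it was faster.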






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-23 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-784706735


   Interesting. I'll take a look at the issue. Thanks for reporting.






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-02-18 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-781504812


   Thanks all for reviewing and merging!






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2021-01-29 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-769938181


   Re-opening this since Spark has upgraded to Parquet 1.11 :-)






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-31 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-683655834


   Thanks again @HyukjinKwon. Closing this for now.






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-31 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-683635947


   Thanks @HyukjinKwon - this is good to know. Since it seems Spark could take 
a while to get to 1.11, should I close this for now then? 






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-30 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-683451979


   > The Parquet reader is a performance-critical component in Spark SQL. We'd 
better make sure there is no performance regression due to this change. Should 
we run a benchmark to check it?
   
   @viirya I used `FilterPushdownBenchmark` for this and I don't see much 
difference. Here are the first few results:
   
   Before:
   ```
   [info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
   [info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
   [info] Select 0 string row ('7864320' < value < '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
   [info] -----------------------------------------------------------------------------------------------------------------------------------
   [info] Parquet Vectorized                                             5632           5752         120         2.8         358.1       1.0X
   [info] Parquet Vectorized (Pushdown)                                   491            506          18        32.0          31.2      11.5X
   [info] Native ORC Vectorized                                          4300           4335          25         3.7         273.4       1.3X
   [info] Native ORC Vectorized (Pushdown)                                525            530           6        30.0          33.4      10.7X

   [info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
   [info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
   [info] Select 1 string row (value = '7864320'):              Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
   [info] -----------------------------------------------------------------------------------------------------------------------------------
   [info] Parquet Vectorized                                             5594           5757         101         2.8         355.7       1.0X
   [info] Parquet Vectorized (Pushdown)                                   472            491          14        33.3          30.0      11.9X
   [info] Native ORC Vectorized                                          4320           4387          42         3.6         274.7       1.3X
   [info] Native ORC Vectorized (Pushdown)                                512            524           8        30.7          32.6      10.9X
   ```
   After:
   ```
   [info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
   [info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
   [info] Select 0 string row ('7864320' < value < '7864320'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
   [info] -----------------------------------------------------------------------------------------------------------------------------------
   [info] Parquet Vectorized                                             5539           5635          87         2.8         352.1       1.0X
   [info] Parquet Vectorized (Pushdown)                                   456            461           6        34.5          29.0      12.1X
   [info] Native ORC Vectorized                                          4243           4282          35         3.7         269.8       1.3X
   [info] Native ORC Vectorized (Pushdown)                                511            523          13        30.8          32.5      10.8X

   [info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
   [info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
   [info] Select 1 string row (value = '7864320'):              Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
   [info] -----------------------------------------------------------------------------------------------------------------------------------
   [info] Parquet Vectorized                                             5509           5576          55         2.9         350.2       1.0X
   [info] Parquet Vectorized (Pushdown)                                   462            475           8        34.1          29.4      11.9X
   [info] Native ORC Vectorized                                          4230           4294          44         3.7         268.9       1.3X
   [info] Native ORC Vectorized (Pushdown)                                501            510          11        31.4          31.8      11.0X
   ```
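To quantify "not much difference", the following sketch computes the percent change in Best Time(ms) between the Before and After runs for the two Parquet rows of each case. The numbers are copied from the output above; the class name, method name, and row labels are hypothetical, and this is only arithmetic over the quoted figures, not part of `FilterPushdownBenchmark`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Relative change in "Best Time(ms)" between the Before and After runs quoted above.
// Names here are illustrative only; they do not exist in Spark.
final class BestTimeDelta {
    // Percent change from before to after; a negative value means After was faster.
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Each entry holds {before, after} best times in ms, taken from the benchmark output.
        Map<String, double[]> bestTimes = new LinkedHashMap<>();
        bestTimes.put("Select 0 / Parquet Vectorized", new double[] {5632, 5539});
        bestTimes.put("Select 0 / Parquet Vectorized (Pushdown)", new double[] {491, 456});
        bestTimes.put("Select 1 / Parquet Vectorized", new double[] {5594, 5509});
        bestTimes.put("Select 1 / Parquet Vectorized (Pushdown)", new double[] {472, 462});

        for (Map.Entry<String, double[]> entry : bestTimes.entrySet()) {
            double[] t = entry.getValue();
            System.out.printf("%-45s %+.2f%%%n", entry.getKey(), percentChange(t[0], t[1]));
        }
    }
}
```

Because these come from single runs with stdev values of up to about 100 ms on the non-pushdown rows, small changes should be read as noise rather than as a measured speedup.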
   
   
   






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-683210745


   > The Parquet reader is a performance-critical component in Spark SQL. We'd 
better make sure there is no performance regression due to this change. Should 
we run a benchmark to check it?
   
   Sure, let me do that. Should I just run `DataSourceReadBenchmark`?






[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase

2020-08-28 Thread GitBox


sunchao commented on pull request #29542:
URL: https://github.com/apache/spark/pull/29542#issuecomment-682996358


   @HyukjinKwon this PR is ready - can you please take another look? Thanks!


