[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-786837732

Opened #31667 with the fix. Compared to the old PR, there is only a one-line change:

```scala
reader.setRequestedSchema(requestedSchema);
```

Please take a look. Thanks!

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-786443278

I tried @LuciferYang's suggestion with the TPC-DS benchmark and it fixed the perf regression. For instance, for the q9 above, here's what I got:

| without the PR | with the PR | with the PR + requestSchema fix |
|---|---|---|
| 5940 | 11260 | 5281 |

I've gathered all the results in this [gist](https://gist.github.com/sunchao/533c1c9a5fb6adc5f7c72b1c465e974d). The benchmark was run with scale factor 5.

@maropu @HyukjinKwon @srowen let me know what you think. I can update this with the fix.
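To put the q9 figures quoted in this comment in relative terms, here is a minimal sketch that divides each measurement by the "without the PR" baseline. The class and method names (`Q9Ratios`, `ratio`) are hypothetical and not part of Spark; the three numbers are copied from the comment as reported, and their unit is not stated there.

```java
import java.util.Locale;

public class Q9Ratios {
    // Ratio of a measurement to the baseline; a value above 1 means slower than baseline.
    static double ratio(double measured, double baseline) {
        return measured / baseline;
    }

    public static void main(String[] args) {
        // Figures for q9 as reported in the comment (unit not stated there).
        double withoutPr = 5940;
        double withPr = 11260;
        double withPrAndFix = 5281;

        // Locale.ROOT keeps the decimal separator stable across machines.
        System.out.printf(Locale.ROOT, "with the PR: %.2fx of baseline%n",
                ratio(withPr, withoutPr));            // prints 1.90x
        System.out.printf(Locale.ROOT, "with the PR + fix: %.2fx of baseline%n",
                ratio(withPrAndFix, withoutPr));      // prints 0.89x
    }
}
```

Read this way, the PR alone took roughly 1.9 times as long as the baseline on q9, and the PR with the requested-schema fix came in slightly under the baseline. This is a single query at one scale factor, so it should not be read as a general speedup claim.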
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-784706735

Interesting. I'll take a look at the issue. Thanks for reporting.
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-781504812

Thanks all for reviewing and merging!
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-769938181

Re-open this since Spark has upgraded to Parquet 1.11 :-)
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-683655834

Thanks again @HyukjinKwon. Closing this for now.
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-683635947

Thanks @HyukjinKwon - this is good to know. Since it seems Spark could take a while to get to 1.11, should I close this for now then?
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-683451979

> Parquet reader is performance-wise important component in Spark SQL. We better to make sure no performance regression due to this change. Should we run a benchmark to check it?

@viirya So I used `FilterPushdownBenchmark` for this and I don't see much difference. Taking the first few:

Before:

```
[info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
[info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
[info] Select 0 string row ('7864320' < value < '7864320'):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] ------------------------------------------------------------------------------------------------------------------------------
[info] Parquet Vectorized                                             5632          5752        120        2.8        358.1      1.0X
[info] Parquet Vectorized (Pushdown)                                   491           506         18       32.0         31.2     11.5X
[info] Native ORC Vectorized                                          4300          4335         25        3.7        273.4      1.3X
[info] Native ORC Vectorized (Pushdown)                                525           530          6       30.0         33.4     10.7X

[info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
[info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
[info] Select 1 string row (value = '7864320'):              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] ------------------------------------------------------------------------------------------------------------------------------
[info] Parquet Vectorized                                             5594          5757        101        2.8        355.7      1.0X
[info] Parquet Vectorized (Pushdown)                                   472           491         14       33.3         30.0     11.9X
[info] Native ORC Vectorized                                          4320          4387         42        3.6        274.7      1.3X
[info] Native ORC Vectorized (Pushdown)                                512           524          8       30.7         32.6     10.9X
```

After:

```
[info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
[info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
[info] Select 0 string row ('7864320' < value < '7864320'):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] ------------------------------------------------------------------------------------------------------------------------------
[info] Parquet Vectorized                                             5539          5635         87        2.8        352.1      1.0X
[info] Parquet Vectorized (Pushdown)                                   456           461          6       34.5         29.0     12.1X
[info] Native ORC Vectorized                                          4243          4282         35        3.7        269.8      1.3X
[info] Native ORC Vectorized (Pushdown)                                511           523         13       30.8         32.5     10.8X

[info] OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
[info] Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
[info] Select 1 string row (value = '7864320'):              Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] ------------------------------------------------------------------------------------------------------------------------------
[info] Parquet Vectorized                                             5509          5576         55        2.9        350.2      1.0X
[info] Parquet Vectorized (Pushdown)                                   462           475          8       34.1         29.4     11.9X
[info] Native ORC Vectorized                                          4230          4294         44        3.7        268.9      1.3X
[info] Native ORC Vectorized (Pushdown)                                501           510         11       31.4         31.8     11.0X
```
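One way to quantify "not much difference" is the percent change in Best Time(ms) between the Before and After runs for the Parquet rows. The sketch below is only an illustration: `BenchmarkDelta` and `percentChange` are hypothetical names, and the input pairs are copied from the `FilterPushdownBenchmark` output in this comment.

```java
import java.util.Locale;

public class BenchmarkDelta {
    // Percent change from "before" to "after"; a negative value means "after" was faster.
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] labels = {
            "Select 0 rows, Parquet Vectorized",
            "Select 0 rows, Parquet Vectorized (Pushdown)",
            "Select 1 row, Parquet Vectorized",
            "Select 1 row, Parquet Vectorized (Pushdown)"
        };
        // {before, after} Best Time(ms) pairs taken from the benchmark output in the comment.
        double[][] bestTimeMs = {
            {5632, 5539},
            {491, 456},
            {5594, 5509},
            {472, 462}
        };
        for (int i = 0; i < labels.length; i++) {
            double change = percentChange(bestTimeMs[i][0], bestTimeMs[i][1]);
            System.out.printf(Locale.ROOT, "%-46s %+.1f%%%n", labels[i], change);
        }
    }
}
```

For these four rows the After run is between about 1.5% and 7.1% lower in Best Time(ms). These are single local runs on a laptop, so differences of this size may well be noise, which is consistent with the comment's conclusion that there is no visible regression.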
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-683210745

> Parquet reader is performance-wise important component in Spark SQL. We better to make sure no performance regression due to this change. Should we run a benchmark to check it?

Sure let me do that. Should I just run `DataSourceReadBenchmark`?
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-682996358

@HyukjinKwon this PR is ready - can you please take another look? Thanks!