[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user c21 commented on the issue: https://github.com/apache/spark/pull/23163 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user c21 commented on the issue: https://github.com/apache/spark/pull/23163

cc @cloud-fan and @gatorsmile: I think this PR is ready for review. Could you take a look when you have time? Thanks!

The test failure (an unknown error code, -9) seems to be unrelated to my change.
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user c21 commented on the issue: https://github.com/apache/spark/pull/23163

@gatorsmile:

> Any perf number?

On my employer's workload, we see a >20% reduction in reserved CPU time (executor wall clock time) and a >20% reduction in disk spill size after rolling out the change to use concurrent writers instead of sort (i.e. the hash-based write in this PR). I am not sure whether these are the performance numbers you were looking for. Let me know if anything else is needed. Thanks.

In addition, I updated the PR, as I found I need to change `BasicWriteTaskStatsTracker` as well.
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user c21 commented on the issue: https://github.com/apache/spark/pull/23163

cc the people who have the most context for review: @cloud-fan, @tejasapatil and @sameeragarwal. Thanks!
[GitHub] spark pull request #23163: [SPARK-26164][SQL] Allow FileFormatWriter to writ...
GitHub user c21 opened a pull request: https://github.com/apache/spark/pull/23163

[SPARK-26164][SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

## What changes were proposed in this pull request?

Currently Spark always requires a local sort on the partition/bucket columns before writing to the output table (see `write.requiredOrdering` in `FileFormatWriter.scala`). This sort is unnecessary and can be avoided by keeping multiple output writers open concurrently in `FileFormatDataWriter.scala`.

This PR first does a hash-based write, then falls back to the sort-based write (the current implementation) when the number of open writers exceeds a threshold (controlled by a config). Specifically:

1. (hash-based write) Maintain a mapping between file path and output writer, and re-use the writer when writing each input row. If the number of open output writers exceeds the threshold (which can be changed by a config), go to 2.
2. (sort-based write) Sort the rest of the input rows (using the same sorter as `SortExec`). Then write the sorted rows; each writer can be closed on the fly once there are no more rows for its file path.

## How was this patch tested?

Added a unit test in `DataFrameReaderWriterSuite.scala`. Existing tests such as `SQLMetricsSuite.scala` already exercise the code path of the executor write metrics.
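For readers who want to see the two steps under "What changes were proposed" end to end, here is a minimal in-memory sketch. It is not the code in this PR: `FakeWriter`, `write_partitions`, `max_open_writers`, and the `(key, value)` row shape are all hypothetical names chosen for illustration, and the choice to keep step-1 writers open and re-use them during step 2 is an assumption of the sketch, not something the description above confirms.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (partition key, payload): hypothetical row shape


class FakeWriter:
    """Stand-in for an output writer; it only collects rows in memory."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.rows: List[int] = []
        self.closed = False

    def write(self, value: int) -> None:
        assert not self.closed, "write after close"
        self.rows.append(value)

    def close(self) -> None:
        self.closed = True


def write_partitions(rows: Iterable[Row], max_open_writers: int):
    """Return (output, peak_open, fell_back) for the simulated write."""
    it = iter(rows)
    open_writers: Dict[str, FakeWriter] = {}
    output: Dict[str, List[int]] = {}
    peak_open = 0
    fell_back = False
    remaining: List[Row] = []

    def get_writer(key: str) -> FakeWriter:
        nonlocal peak_open
        if key not in open_writers:
            open_writers[key] = FakeWriter(key)
            peak_open = max(peak_open, len(open_writers))
        return open_writers[key]

    def close_writer(key: str) -> None:
        writer = open_writers.pop(key)
        writer.close()
        output[key] = writer.rows

    # Step 1 (hash-based): one open writer per key, re-used for every row.
    for key, value in it:
        if key not in open_writers and len(open_writers) >= max_open_writers:
            # Opening another writer would pass the threshold: fall back.
            fell_back = True
            remaining = [(key, value)] + list(it)
            break
        get_writer(key).write(value)

    # Step 2 (sort-based): sort the rest by key (stable sort), write each
    # run, and close its writer as soon as the run for that key ends.
    by_key = lambda row: row[0]
    for key, run in groupby(sorted(remaining, key=by_key), key=by_key):
        writer = get_writer(key)  # assumption: re-use a step-1 writer if open
        for _, value in run:
            writer.write(value)
        close_writer(key)

    # Close writers from step 1 that saw no further rows.
    for key in list(open_writers):
        close_writer(key)
    return output, peak_open, fell_back
```

In this sketch, at most one writer beyond the step-1 set is open at any time during step 2, so the number of open writers stays bounded by `max_open_writers + 1` regardless of how many distinct keys the input has.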
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/c21/spark more-writers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23163.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23163

commit c2e81eb2f9cdc1b290c098d228d477f325a24101
Author: Cheng Su
Date: 2018-11-28T09:46:35Z

    Allow FileFormatWriter to write multiple partitions/buckets without sort