GitHub user c21 opened a pull request: https://github.com/apache/spark/pull/23163
[SPARK-26164][SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

## What changes were proposed in this pull request?

Currently Spark always requires a local sort on the partition/bucket columns before writing to an output table (see `write.requiredOrdering` in `FileFormatWriter.scala`). This sort is unnecessary and can be avoided by keeping multiple output writers open concurrently in `FileFormatDataWriter.scala`.

This PR first performs a hash-based write, then falls back to a sort-based write (the current implementation) when the number of open writers exceeds a threshold (controlled by a config). Specifically:

1. (hash-based write) Maintain a mapping from file path to output writer, and reuse the writer when writing each input row. If the number of open output writers exceeds the threshold (configurable), fall back to step 2.
2. (sort-based write) Sort the remaining input rows (using the same sorter as `SortExec`), then write the sorted rows, closing each writer on the fly once no more rows remain for its file path.

## How was this patch tested?

Added a unit test in `DataFrameReaderWriterSuite.scala`. Existing tests such as `SQLMetricsSuite.scala` already exercise the code path for executor write metrics.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/c21/spark more-writers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23163.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23163

----

commit c2e81eb2f9cdc1b290c098d228d477f325a24101
Author: Cheng Su <chengsu@...>
Date: 2018-11-28T09:46:35Z

    Allow FileFormatWriter to write multiple partitions/buckets without sort

----
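For illustration only, here is a minimal, self-contained Scala sketch of the two-phase write strategy described in the PR. It is not Spark's actual `FileFormatDataWriter` code: `Row`, `PartitionWriter`, `ConcurrentOutputWriter`, and `maxOpenWriters` are hypothetical stand-ins, and the fallback uses a simple in-memory sort rather than the external sorter shared with `SortExec`.

```scala
// Illustrative sketch only: a simplified model of "hash-based write,
// then fall back to sort-based write". All names here are hypothetical.
import scala.collection.mutable

final case class Row(partitionKey: String, value: String)

// A toy writer that just buffers values for one partition key.
final class PartitionWriter(val key: String) {
  private val buffer = mutable.ArrayBuffer.empty[String]
  def write(v: String): Unit = buffer += v
  def close(): Unit = println(s"closing writer for '$key' with ${buffer.size} rows")
}

final class ConcurrentOutputWriter(maxOpenWriters: Int) {
  private val writers = mutable.Map.empty[String, PartitionWriter]

  def writeAll(rows: Iterator[Row]): Unit = {
    // Phase 1: hash-based write. Keep one writer per partition key and
    // reuse it, until too many writers are open at once.
    var fellBack = false
    val remaining = mutable.ArrayBuffer.empty[Row]
    while (rows.hasNext && !fellBack) {
      val row = rows.next()
      writers.get(row.partitionKey) match {
        case Some(w) => w.write(row.value)
        case None if writers.size < maxOpenWriters =>
          val w = new PartitionWriter(row.partitionKey)
          writers(row.partitionKey) = w
          w.write(row.value)
        case None =>
          // Threshold exceeded: stop hashing, switch to the sort-based path.
          remaining += row
          fellBack = true
      }
    }

    if (fellBack) {
      // Phase 2: sort-based write. Sort the remaining rows by key so each
      // partition's rows arrive contiguously; close each writer as soon as
      // its key has no more rows.
      remaining ++= rows
      val sorted = remaining.sortBy(_.partitionKey)
      var current: Option[PartitionWriter] = None
      for (row <- sorted) {
        if (!current.exists(_.key == row.partitionKey)) {
          current.foreach(_.close())
          // Reuse a writer left open from phase 1 if one exists.
          current = Some(writers.remove(row.partitionKey)
            .getOrElse(new PartitionWriter(row.partitionKey)))
        }
        current.get.write(row.value)
      }
      current.foreach(_.close())
    }

    // Close any writers still open from the hash-based phase.
    writers.values.foreach(_.close())
    writers.clear()
  }
}

object Demo extends App {
  val rows = Iterator(
    Row("a", "1"), Row("b", "2"), Row("a", "3"),
    Row("c", "4"), Row("d", "5"), Row("b", "6"))
  new ConcurrentOutputWriter(maxOpenWriters = 2).writeAll(rows)
}
```

The point of the design, as described in the PR, is that the cost of a sort is only paid when the number of concurrently open writers grows too large; when data has few distinct partitions/buckets per task, the hash-based phase handles everything and no sort is needed.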