GitHub user c21 opened a pull request:

    https://github.com/apache/spark/pull/23163

    [SPARK-26164][SQL] Allow FileFormatWriter to write multiple 
partitions/buckets without sort

    ## What changes were proposed in this pull request?
    
    Currently Spark always requires a local sort on the partition/bucket columns 
before writing to an output table (see `write.requiredOrdering` in 
`FileFormatWriter.scala`). This sort is unnecessary and can be avoided by keeping 
multiple output writers open concurrently in `FileFormatDataWriter.scala`.
    
    This PR first performs a hash-based write, then falls back to a sort-based 
write (the current implementation) when the number of open writers exceeds a 
threshold (controlled by a config). Specifically:
    
    1. (hash-based write) Maintain a mapping from file path to output writer, and 
reuse the writer for each incoming row. If the number of open output writers 
exceeds a threshold (configurable), go to 2.
    
    2. (sort-based write) Sort the remaining input rows (using the same sorter as 
`SortExec`), then write them out in order, closing each writer on the fly once no 
more rows remain for its file path.
    
    ## How was this patch tested?
    
    Added a unit test in `DataFrameReaderWriterSuite.scala`. Existing tests such 
as `SQLMetricsSuite.scala` already exercise the code path for executor write 
metrics.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/c21/spark more-writers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23163
    
----
commit c2e81eb2f9cdc1b290c098d228d477f325a24101
Author: Cheng Su <chengsu@...>
Date:   2018-11-28T09:46:35Z

    Allow FileFormatWriter to write multiple partitions/buckets without sort

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
