[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-12-03 Thread c21
Github user c21 commented on the issue:

https://github.com/apache/spark/pull/23163
  
Jenkins, retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-12-02 Thread c21
Github user c21 commented on the issue:

https://github.com/apache/spark/pull/23163
  
Jenkins, retest this please





[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-12-01 Thread c21
Github user c21 commented on the issue:

https://github.com/apache/spark/pull/23163
  
cc @cloud-fan and @gatorsmile:

I think this PR is ready for review. Could you take a look when you have 
time? Thanks!
The test failure (an unknown error code, -9) seems unrelated to my change.





[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-11-28 Thread c21
Github user c21 commented on the issue:

https://github.com/apache/spark/pull/23163
  
@gatorsmile:
> Any perf number?

From my employer's production workload, we saw a >20% reduction in reserved 
CPU time (executor wall-clock time) and a >20% reduction in disk spill size 
after rolling out the change to use concurrent writers instead of a sort 
(i.e. the hash-based write in this PR).

I am not sure whether this is the performance number you were looking for; 
let me know if anything else is needed. Thanks.

In addition, I updated the PR, since I found I needed to change 
`BasicWriteTaskStatsTracker` as well.





[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-11-28 Thread c21
Github user c21 commented on the issue:

https://github.com/apache/spark/pull/23163
  
cc people who have the most context for review - @cloud-fan, @tejasapatil and 
@sameeragarwal. Thanks!





[GitHub] spark pull request #23163: [SPARK-26164][SQL] Allow FileFormatWriter to writ...

2018-11-28 Thread c21
GitHub user c21 opened a pull request:

https://github.com/apache/spark/pull/23163

[SPARK-26164][SQL] Allow FileFormatWriter to write multiple 
partitions/buckets without sort

## What changes were proposed in this pull request?

Currently Spark always requires a local sort on the partition/bucket columns 
before writing to an output table (see `write.requiredOrdering` in 
`FileFormatWriter.scala`). This sort is unnecessary and can be avoided by 
keeping multiple output writers open concurrently in 
`FileFormatDataWriter.scala`.

This PR first performs a hash-based write, then falls back to a sort-based 
write (the current implementation) when the number of open writers exceeds a 
threshold (controlled by a config). Specifically:

1. (hash-based write) Maintain a mapping from file path to output writer, and 
re-use the writer for each input row. If the number of open output writers 
exceeds the threshold (configurable), go to step 2.

2. (sort-based write) Sort the remaining input rows (using the same sorter as 
`SortExec`), then write the sorted rows, closing each writer as soon as no 
more rows remain for its file path.

## How was this patch tested?

Added a unit test in `DataFrameReaderWriterSuite.scala`. Existing tests such 
as `SQLMetricsSuite.scala` already exercise the code path for executor write 
metrics.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/c21/spark more-writers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23163


commit c2e81eb2f9cdc1b290c098d228d477f325a24101
Author: Cheng Su 
Date:   2018-11-28T09:46:35Z

Allow FileFormatWriter to write multiple partitions/buckets without sort



