[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...

2018-12-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23228#discussion_r240036698
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala ---
@@ -33,10 +33,10 @@ import org.apache.spark.shuffle._
  * Sort-based shuffle has two different write paths for producing its map 
output files:
  *
  *  - Serialized sorting: used when all three of the following conditions 
hold:
- *1. The shuffle dependency specifies no aggregation or output 
ordering.
+ *1. The shuffle dependency specifies no map-side combine.
--- End diff --

looks right to me, according to 
https://github.com/apache/spark/blob/d5dadbf30d5429c36ec3d5c2845a71c2717fd6f3/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala#L195


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...

2018-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23228#discussion_r239215561
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala ---
@@ -33,10 +33,10 @@ import org.apache.spark.shuffle._
  * Sort-based shuffle has two different write paths for producing its map 
output files:
  *
  *  - Serialized sorting: used when all three of the following conditions 
hold:
- *1. The shuffle dependency specifies no aggregation or output 
ordering.
+ *1. The shuffle dependency specifies no map-side combine.
--- End diff --

Does this sound right @JoshRosen ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...

2018-12-05 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/23228

[MINOR][DOC]The condition description of serialized shuffle is not very 
accurate

## What changes were proposed in this pull request?
`1. The shuffle dependency specifies no aggregation or output ordering.`
If the shuffle dependency specifies aggregation, but it only aggregates at 
the reducer side, serialized shuffle can still be used.
`3. The shuffle produces fewer than 16777216 output partitions.`
If the number of output partitions is 16777216 , we can use serialized 
shuffle.
## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark SerializedShuffle_doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23228


commit d5dadbf30d5429c36ec3d5c2845a71c2717fd6f3
Author: liuxian 
Date:   2018-12-05T08:55:20Z

fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org