[jira] [Comment Edited] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-02 Thread CanBin Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033988#comment-16033988
 ] 

CanBin Zheng edited comment on SPARK-20943 at 6/2/17 7:00 AM:
--

Look at these two cases: in each, either an Aggregator or an Ordering is defined while mapSideCombine is false, yet both run with BypassMergeSortShuffleWriter:
{code}
// Aggregator defined
@Test
def testGroupByKeyUsingBypassMergeSort(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
  rdd.collect()
}

// Ordering defined
@Test
def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data).map((_, 1))
  val ord = implicitly[Ordering[String]]
  val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new HashPartitioner(2)).setKeyOrdering(ord)
  shuffledRDD.collect()
}
{code}


was (Author: canbinzheng):
Look at these two cases.
{code}
// Aggregator defined
@Test
def testGroupByKeyUsingBypassMergeSort(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
  rdd.collect()
}

// Ordering defined
@Test
def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data).map((_, 1))
  val ord = implicitly[Ordering[String]]
  val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new HashPartitioner(2)).setKeyOrdering(ord)
  shuffledRDD.collect()
}
{code}

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> There are comments in BypassMergeSortShuffleWriter.java describing when this 
> write path is selected; the three required conditions are described 
> as follows:  
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> Obviously, the written conditions are partially wrong and misleading; the 
> correct conditions are:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
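The corrected selection rule can be sketched as a small standalone predicate. This is a simplified model of the logic described above, not Spark's actual implementation: the `mapSideCombine` flag, the partition count, and the default threshold of 200 come from Spark, but the `BypassCheck` object and this signature are illustrative only.

```scala
// Simplified model of when the bypass-merge-sort shuffle path is chosen.
// Having an Aggregator or an Ordering alone does not disqualify a shuffle;
// only map-side combine, or too many partitions, does.
object BypassCheck {
  // Default value of spark.shuffle.sort.bypassMergeThreshold
  val DefaultBypassMergeThreshold = 200

  def shouldBypassMergeSort(mapSideCombine: Boolean,
                            numPartitions: Int,
                            threshold: Int = DefaultBypassMergeThreshold): Boolean =
    !mapSideCombine && numPartitions < threshold

  def main(args: Array[String]): Unit = {
    // groupByKey(2): Aggregator defined but mapSideCombine = false -> bypass path
    assert(shouldBypassMergeSort(mapSideCombine = false, numPartitions = 2))
    // reduceByKey-style shuffle: mapSideCombine = true -> sort-based path
    assert(!shouldBypassMergeSort(mapSideCombine = true, numPartitions = 2))
    println("ok")
  }
}
```

Both test cases in the comment fall into the bypass branch: mapSideCombine is false and two partitions are well under the threshold, so the bypass writer runs despite the Aggregator or Ordering being defined.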



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


