[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2015-03-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386541#comment-14386541
 ] 

Sean Owen commented on SPARK-3441:
--

This is mentioned in the change for https://github.com/apache/spark/pull/5074, 
but I think the work here is to explain the rationale and partitioner details 
more deeply in the scaladoc.

 Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style 
 shuffle
 ---

 Key: SPARK-3441
 URL: https://issues.apache.org/jira/browse/SPARK-3441
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Patrick Wendell
Assignee: Sandy Ryza

 I think it would be good to say something like this in the doc for 
 repartitionAndSortWithinPartitions, and perhaps also in the doc for groupBy:
 {code}
 This can be used to enact a Hadoop-style shuffle along with a call to 
 mapPartitions, e.g.:
rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
 {code}
 It might also be nice to add a version that doesn't take a partitioner, and/or 
 to mention this in the groupBy javadoc. I guess it depends a bit on whether we 
 consider this to be an API we want people to use more widely, or whether we 
 just consider it a narrow stable API mostly for Hive-on-Spark. If we want 
 people to consider this API when porting workloads from Hadoop, then it might 
 be worth documenting better.
 What do you think [~rxin] and [~matei]?
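
To make the pattern above concrete, here is a minimal, hypothetical sketch of the post-shuffle layout the description refers to, written against plain Scala collections rather than the Spark API: records are hash-partitioned by key (the placement a built-in HashPartitioner would give, so users need not write their own partitioner), then each partition is sorted by key. The object name, data, and partition count are made up for illustration; in Spark, rdd.repartitionAndSortWithinPartitions(part) produces this layout in a single pass.

```scala
object ShuffleSketch {
  // Hash-partition records by key, then sort each partition by key.
  // This simulates the per-partition layout that a Hadoop-style shuffle
  // (and repartitionAndSortWithinPartitions) guarantees downstream.
  def shuffle[K, V](records: Seq[(K, V)], numPartitions: Int)
                   (implicit ord: Ordering[K]): Vector[Vector[(K, V)]] = {
    val empty = Vector.fill(numPartitions)(Vector.empty[(K, V)])
    records.foldLeft(empty) { case (parts, (k, v)) =>
      // Non-negative modulus picks the partition, as a hash partitioner would.
      val p = (k.hashCode % numPartitions + numPartitions) % numPartitions
      parts.updated(p, parts(p) :+ ((k, v)))
    }.map(_.sortBy(_._1)) // sort within each partition, not globally
  }
}
```

A mapPartitions-style consumer can then iterate each inner Vector and rely on all records for a key being adjacent and in key order, which is exactly what a Hadoop reducer relies on.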



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2015-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365896#comment-14365896
 ] 

Apache Spark commented on SPARK-3441:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5074




[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2015-02-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336948#comment-14336948
 ] 

Sean Owen commented on SPARK-3441:
--

Since another shuffle-related doc ticket came up for action 
(http://issues.apache.org/jira/browse/SPARK-5750), I wonder whether this is 
still live. Is it just a matter of documenting something, or does it need a 
code change?




[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2014-09-08 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125936#comment-14125936
 ] 

Sandy Ryza commented on SPARK-3441:
---

I'll add a mention that this can be used to get a Hadoop-style shuffle.  Because 
we opted not to provide the grouping, the API isn't ideal for porting workloads 
from Hadoop: the amount of code required to replicate the Hadoop shuffle 
semantics is probably more than we can fit in the Javadoc.  I'm not opposed to 
adding a note that it can be used as a building block.

I considered a version without a partitioner, but I couldn't think of a 
situation where one would care that records are sorted within a partition, but 
not need to be specific about what keys end up in what partitions.  Anything 
you can think of that I'm missing?




[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2014-09-08 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126175#comment-14126175
 ] 

Matei Zaharia commented on SPARK-3441:
--

I agree that we should have more of a doc here to explain the rationale; I 
missed it earlier.

BTW, for a partitioner we should note that people can pass `new 
HashPartitioner`; there's no need to write your own.

One case where you may not care about giving a Partitioner is if you just want 
to do some kind of groupBy / join that spills externally. So that may also be 
useful but since this is a pretty advanced API, I don't think it matters that 
much.




[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2014-09-08 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126192#comment-14126192
 ] 

Sandy Ryza commented on SPARK-3441:
---

bq. One case where you may not care about giving a Partitioner is if you just 
want to do some kind of groupBy / join that spills externally.
You mean so that the values for a single key can be disk-backed?  Eventually we 
want join and groupByKey to handle this themselves, right?




[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2014-09-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126308#comment-14126308
 ] 

Patrick Wendell commented on SPARK-3441:


Hey [~sandyr] - what do you mean by grouping? Just that the user has to write 
their own code to detect the boundaries between keys? I wonder if we could 
write a simple wrapper that does that (though @rxin pointed out to me this 
would be bad if a user cached it).
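
The wrapper being discussed here could be sketched, outside Spark, as a plain Scala iterator transform; the object and method names are made up for illustration, and the input is assumed to already be sorted by key within a partition (which repartitionAndSortWithinPartitions guarantees). It detects boundaries between runs of equal keys and emits one (key, values) pair per run:

```scala
object GroupSorted {
  // Given an iterator sorted by key, lazily emit (key, values-for-that-key)
  // pairs by scanning until the key changes. Each call to next() materializes
  // one key's values in memory, which is the disk-backing question from
  // SPARK-2978; and since this wraps a live iterator, the result is
  // single-pass -- the caching caveat mentioned above.
  def groupByBoundaries[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Seq[V])] =
    new Iterator[(K, Seq[V])] {
      private val buf = sorted.buffered
      def hasNext: Boolean = buf.hasNext
      def next(): (K, Seq[V]) = {
        val key = buf.head._1                     // start of a new key run
        val values = Vector.newBuilder[V]
        while (buf.hasNext && buf.head._1 == key) // consume until boundary
          values += buf.next()._2
        (key, values.result())
      }
    }
}
```

Nothing here is Spark-specific, which is part of why the semantics (in-memory Seq vs. something disk-backed) would need to be pinned down before exposing it as API.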




[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2014-09-08 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126454#comment-14126454
 ] 

Sandy Ryza commented on SPARK-3441:
---

Right.  It's not much work, but there are some questions (posted on SPARK-2978) 
about exactly what the semantics of such a wrapper should be.  The concern was 
that we would want to make groupByKey consistent with it when it supports 
disk-backed keys, and we didn't feel comfortable locking that behavior down 
right now.  I'm happy to add a wrapper if we can come to a decision there.
