[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-04 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121115#comment-14121115 ]

Apache Spark commented on SPARK-2978:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/2274

 Provide an MR-style shuffle transformation
 -------------------------------------------

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
 Issue Type: New Feature
 Components: Spark Core
 Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in
 general, I think it would be useful to provide a transformation with the
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple of ways it could make sense to expose this (a rough sketch follows
 below):
 * Add a new operator: groupAndSortByKey, groupByKeyAndSortWithinPartition, or
 hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition





[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117999#comment-14117999 ]

Reynold Xin commented on SPARK-2978:


[~sandyryza] Instead of adding a new API, what if the Hive project just creates a 
utility function that partitions, sorts, and then walks down the sorted list to 
provide grouping similar to MR?

The reason I'm asking is that eventually we would want to make groupByKey itself 
support sort and spill. But that is fairly tricky to design, as you've already 
pointed out, so it could take a while to finalize that API. 
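
A minimal sketch of what such a Hive-side utility might look like, assuming it
used ShuffledRDD directly (a @DeveloperApi class) with a key ordering so records
arrive sorted within each partition, then walked the sorted stream buffering each
key's values. The helper name partitionSortAndGroup is made up for illustration.

{code}
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Shuffle with a key ordering so each partition arrives sorted, then walk the
// sorted stream and group adjacent records that share a key.
def partitionSortAndGroup[K: Ordering: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], part: Partitioner): RDD[(K, Seq[V])] = {
  val ord = implicitly[Ordering[K]]
  val sorted: RDD[(K, V)] = new ShuffledRDD[K, V, V](rdd, part).setKeyOrdering(ord)
  sorted.mapPartitions({ iter =>
    val buffered = iter.buffered
    new Iterator[(K, Seq[V])] {
      override def hasNext: Boolean = buffered.hasNext
      override def next(): (K, Seq[V]) = {
        val key = buffered.head._1
        val values = ArrayBuffer[V]()
        // Equal keys are adjacent after the sort, so one pass per group suffices.
        while (buffered.hasNext && ord.equiv(buffered.head._1, key)) {
          values += buffered.next()._2
        }
        (key, values.toSeq)
      }
    }
  }, preservesPartitioning = true)
}
{code}

Whether buffering each key's values like this is acceptable, or whether the
values should be exposed lazily, is exactly the semantics question discussed
elsewhere in this thread.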




[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Sandy Ryza (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118286#comment-14118286 ]

Sandy Ryza commented on SPARK-2978:
---

IIUC, that would require using ShuffledRDD directly.  Would we be comfortable 
taking off the @DeveloperApi tag?

Another option, which would let us avoid making the groupBy decision for now, 
would be to expose a repartitionAndSortWithinPartition transform.  Hive would 
then handle the grouping on the sorted stream.




[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118917#comment-14118917 ]

Reynold Xin commented on SPARK-2978:


I talked to [~pwendell] about this. How about the following?

We can add two APIs to OrderedRDDFunctions (and whatever the Java equivalent 
is):
- sortWithinPartition
- repartitionAndSortWithinPartition

The first one is obvious, while the second is functionally equivalent to a 
repartition followed by sortWithinPartition. The second exists as an 
optimization because it can push the sorting code into ShuffledRDD. A sketch of 
both appears below.
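
A minimal sketch of the two proposed operators, written here as free functions
rather than methods on OrderedRDDFunctions. The names follow this comment (the
API that eventually shipped under this ticket in Spark 1.2 is
repartitionAndSortWithinPartitions), and sortWithinPartition is shown as a naive
in-memory sort rather than the external sort it would really need.

{code}
import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.{RDD, ShuffledRDD}

object OrderedOpsSketch {
  // Sort by key within each existing partition; no shuffle involved.
  // (Naive: sorts each partition in memory rather than spilling to disk.)
  def sortWithinPartition[K: Ordering: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): RDD[(K, V)] =
    rdd.mapPartitions(iter => iter.toArray.sortBy(_._1).iterator, preservesPartitioning = true)

  // Functionally a repartition followed by sortWithinPartition, but the key
  // ordering is handed to ShuffledRDD so the sort happens inside the shuffle.
  def repartitionAndSortWithinPartition[K: Ordering: ClassTag, V: ClassTag](
      rdd: RDD[(K, V)], partitioner: Partitioner): RDD[(K, V)] =
    new ShuffledRDD[K, V, V](rdd, partitioner).setKeyOrdering(implicitly[Ordering[K]])
}
{code}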






[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Sandy Ryza (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119080#comment-14119080 ]

Sandy Ryza commented on SPARK-2978:
---

What's the thinking behind adding sortWithinPartition?  It shouldn't be 
difficult to add, but I can't think of a situation where it would be useful 
without a repartition beforehand.




[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119084#comment-14119084 ]

Reynold Xin commented on SPARK-2978:


It has just been asked for multiple times by various users.  I think the use 
case is to provide a robust external sort implementation without exposing the 
ExternalSorter API.  It doesn't need to be part of this change.





[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Sandy Ryza (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119091#comment-14119091 ]

Sandy Ryza commented on SPARK-2978:
---

Ah ok, sounds good.




[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-21 Thread Sandy Ryza (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105126#comment-14105126 ]

Sandy Ryza commented on SPARK-2978:
---

So I started looking into this a little more and wanted to bring up a semantics 
issue I came across.

The proposed implementation would be to use a path similar to the one sortByKey 
uses in each reduce task, and then wrap the Iterator over sorted records with an 
Iterator that groups them, i.e. wrap the Iterator[(K, V)] in an 
Iterator[(K, Iterator[V])].  The question is how to handle the validity of an 
inner V iterator with respect to the outer Iterator.  The options as I see them 
are:
1. Calling next() or hasNext() on the outer iterator invalidates the current 
inner V iterator.
2. The inner V iterator must be exhausted before calling next() or hasNext() on 
the outer iterator.
3. On each next() call on the outer iterator, scan over all the values for that 
key and put them in a separate buffer. 

The MapReduce approach, where the outer iterator is replaced by a sequence of 
calls to the reduce function, is similar to (1); a sketch of those semantics 
follows below.

When the Iterators returned by groupByKey are eventually disk-backed, we'll 
face the same issue, so we probably want to make the semantics there consistent 
with whatever we decide here.
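
As a concrete illustration of option (1), here is a minimal sketch that assumes
the wrapped Iterator[(K, V)] is already sorted by key within the partition; the
class name GroupedIterator is made up for illustration. Advancing the outer
iterator first drains whatever the caller has not consumed from the current
inner iterator, mirroring the MR reduce() values iterator.

{code}
// Wraps a key-sorted Iterator[(K, V)] as an Iterator[(K, Iterator[V])] with
// option-(1) semantics: each inner iterator is only valid until the outer
// iterator is advanced (next()/hasNext() drain any unconsumed values).
class GroupedIterator[K, V](iter: Iterator[(K, V)])(implicit ord: Ordering[K])
    extends Iterator[(K, Iterator[V])] {
  private val buffered = iter.buffered
  private var currentValues: Iterator[V] = Iterator.empty

  override def hasNext: Boolean = {
    // Invalidate the previous inner iterator by skipping anything unconsumed.
    while (currentValues.hasNext) currentValues.next()
    buffered.hasNext
  }

  override def next(): (K, Iterator[V]) = {
    if (!hasNext) throw new NoSuchElementException("end of grouped iterator")
    val key = buffered.head._1
    currentValues = new Iterator[V] {
      override def hasNext: Boolean = buffered.hasNext && ord.equiv(buffered.head._1, key)
      override def next(): V = buffered.next()._2
    }
    (key, currentValues)
  }
}
{code}

Option (2) would instead fail fast if the inner iterator still had elements, and
option (3) would eagerly copy each key's values into a separate buffer.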





[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-21 Thread Sandy Ryza (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105128#comment-14105128 ]

Sandy Ryza commented on SPARK-2978:
---

[~jerryshao], if I understand correctly, ShuffledRDD already supports what's 
needed here, and satisfying that need is independent of whether we sort on the 
map side.  That said, I think the changes you proposed in SPARK-2926 could 
definitely make this more performant, and we would likely see the same 
improvements you benchmarked for sortByKey.




[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-12 Thread Saisai Shao (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093866#comment-14093866 ]

Saisai Shao commented on SPARK-2978:


Hi Sandy,

A simple question: do you mean to add some new operators on top of ShuffledRDD 
generically, or only for the sort-based shuffle?

These operators seem specific to the MR style of shuffle.
