[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946
 ] 

koert kuipers edited comment on SPARK-3655 at 4/28/15 8:18 PM:
---------------------------------------------------------------

since the last pullreq for this ticket i created spark-sorted (based on 
suggestions from imran), a small library for spark that supports the target 
features of this ticket, but without the burden of having to be fully 
compatible with the current spark api conventions (with regards to ordering 
being implicit).
i also got a chance to catch up with sandy at spark summit east and we 
exchanged some emails afterward about this jira ticket and possible design 
choices.

so based on those experiences i think there are better alternatives than the 
current pullreq (https://github.com/apache/spark/pull/3632), and i will close 
it. the pullreq does bring secondary sort to spark, but only in memory, which 
is a very limited feature (since if the values can be stored in memory then 
sorting after the shuffle isn't really that hard, just wasteful).
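
to make the "only in memory" point concrete, here is a minimal sketch on plain scala collections (no spark involved). the object and method names are hypothetical and only model the idea: every value of a key is materialized before it is sorted, which is exactly the limitation described above.

```scala
object InMemorySecondarySort {
  // Hypothetical local model of "group first, then sort each group in memory".
  // Seq stands in for an RDD; nothing here is Spark API.
  def sortValuesInMemory[K, V](pairs: Seq[(K, V)])(implicit ord: Ordering[V]): Map[K, Seq[V]] =
    pairs
      .groupBy(_._1) // collects all pairs of a key into one in-memory collection
      .map { case (k, kvs) => k -> kvs.map(_._2).sorted } // then sorts that collection
}
```

this works as long as each key's values fit in memory, and it sorts after grouping rather than letting the shuffle do the sorting.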

instead of the current pullreq i see 2 alternatives:
1) a new pullreq that introduces the mapStream api, which is very similar to 
the reduce operation as we know it in hadoop: a sorted streaming reduce. its 
signature would be something like this on RDD[(K, V)]:
{noformat}
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])
      (implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but 
on a class conversion, similar to how PairRDDFunctions works.)

2) don't do anything. the functionality this jira targets is already available 
in the small spark-sorted library which is available on spark-packages, and 
that's good enough.
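
to illustrate the semantics intended for alternative 1, here is a rough local model on plain scala collections. it is a sketch under assumptions, not the proposed implementation: the partitioner is dropped, Seq stands in for RDD, each key's values are sliced out of an already sorted vector rather than truly streamed, and the object name is hypothetical.

```scala
object MapStreamSketch {
  // Hypothetical local model of mapStreamByKey: no Spark and no partitioner.
  def mapStreamByKey[K, V, W](pairs: Seq[(K, V)])(f: Iterator[V] => Iterator[W])(
      implicit o1: Ordering[K], o2: Ordering[V]): Seq[(K, W)] = {
    // Sorting by (key, value) plays the role of the sort-based shuffle.
    val sorted = pairs.sorted(Ordering.Tuple2(o1, o2)).toVector
    val out = Seq.newBuilder[(K, W)]
    var i = 0
    while (i < sorted.length) {
      val key = sorted(i)._1
      // Find the end of the run of consecutive pairs that share this key.
      var j = i
      while (j < sorted.length && o1.equiv(sorted(j)._1, key)) j += 1
      // Hand the key's values to f as an iterator in sorted order.
      val values = sorted.slice(i, j).iterator.map(_._2)
      f(values).foreach(w => out += (key -> w))
      i = j
    }
    out.result()
  }
}
```

for example, `MapStreamSketch.mapStreamByKey(pairs)(_.take(1))` keeps the smallest value per key, and `vs => vs.scanLeft(0)(_ + _).drop(1)` produces a running sum over each key's sorted values.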



> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>
>                 Key: SPARK-3655
>                 URL: https://issues.apache.org/jira/browse/SPARK-3655
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
