[ https://issues.apache.org/jira/browse/SPARK-22568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-22568.
-------------------------------
    Resolution: Not A Problem

I think this is more of a usage question, so it belongs on the mailing list.
You can indeed filter by each distinct key individually; that doesn't require 
{{collect}}ing the data to the driver.
You can already group by the key. You can also hash-partition by the key and 
sort within partitions in a single operation, which lets you encounter all the 
values for one key at a time while traversing the partitions.
I think there are already plenty of tools to do the kind of thing you mention.
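
For illustration, a minimal sketch of the hash-and-sort approach mentioned 
above, assuming it refers to {{repartitionAndSortWithinPartitions}}; the object 
name and sample data are illustrative, not from the original report:

{code:scala}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SplitByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("split-by-key-sketch").setMaster("local[*]"))

    // Hypothetical pair RDD of (category, element).
    val pairs = sc.parallelize(
      Seq("a" -> 1, "b" -> 2, "a" -> 3, "b" -> 4, "a" -> 5))

    // One shuffle: hash-partition by key and sort by key within each
    // partition, so all pairs with the same key end up contiguous.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

    // Traverse each partition: keys arrive in sorted runs, so all values for
    // one key are seen together without materializing whole groups in memory.
    sorted.foreachPartition { iter =>
      iter.foreach { case (category, value) => println(s"$category -> $value") }
    }

    sc.stop()
  }
}
{code}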

> Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey
> ------------------------------------------------------------------------
>
>                 Key: SPARK-22568
>                 URL: https://issues.apache.org/jira/browse/SPARK-22568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Éderson Cássio
>              Labels: features, performance, usability
>
> Sorry for any mistakes in filling out this big form... it's my first issue here :)
> Recently, I had the need to separate an RDD by some categorization. I was 
> able to accomplish that in a few ways.
> First, the obvious: mapping each element to a pair, with the key being the 
> category of the element. Then, using the good ol' {{groupByKey}}.
> Listening to advice to avoid {{groupByKey}}, I failed to find another way 
> that was more efficient. I ended up (a) obtaining the distinct list of 
> element categories, (b) {{collect}}ing them and (c) making a call to 
> {{filter}} for each category. Of course, before all that I {{cache}}d my 
> initial RDD (see the sketch after this quoted description).
> So, I started to speculate: maybe it would be possible to make a number of 
> RDDs from an initial pair RDD _without the need to shuffle the data_. It 
> could be done by a kind of _local repartition_: first, each partition is 
> split into several by key; then the master groups the partitions with the 
> same key into a new RDD. The operation would return a List or array 
> containing the new RDDs.
> It's just a conjecture; I don't know whether it would be feasible in the 
> current Spark Core architecture. But it would be great if it could be done.
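
For reference, a minimal sketch of the workaround described in steps (a)-(c) 
above: cache the input, collect only the distinct keys, then run one 
{{filter}} per key. The helper name {{splitByKey}} is hypothetical; the method 
can be pasted into a spark-shell session:

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper reproducing the reporter's workaround: split a pair RDD
// into one RDD per distinct key. Only the keys are collected to the driver;
// the values themselves stay distributed.
def splitByKey[K: ClassTag, V: ClassTag](pairs: RDD[(K, V)]): Map[K, RDD[V]] = {
  pairs.persist(StorageLevel.MEMORY_AND_DISK)       // cache the initial RDD
  val categories = pairs.keys.distinct().collect()  // (a) + (b): keys only
  categories.map(k => k -> pairs.filter(_._1 == k).values).toMap // (c): one filter per key
}
{code}

Note that each returned RDD still launches a separate scan of the cached data 
when an action runs on it, which is why the single-shuffle sort-within- 
partitions approach sketched earlier usually scales better.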



