[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230390#comment-14230390
 ] 

Patrick Wendell commented on SPARK-4644:
----------------------------------------

I would push back a bit on what you said about groupByKey [~sdkfjslakdfj]. I 
think solving groupByKey is pretty important; it's probably the most common 
user frustration with Spark. There are cases where the user is streaming 
through the data (e.g. they are doing groupByKey and then writing the results 
out to HDFS or persisting them at the DISK_ONLY level), and cases where it's 
hard for them to significantly reduce the amount of data. So I wouldn't rule 
out solving this in a nice way across all of our operators that, in the 
current architecture, suffer from this issue.
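
To make the "streaming through the data" case concrete, here is a minimal sketch in plain Python (not Spark code, and not a description of how Spark implements groupByKey). It assumes the pairs are already sorted by key, and the helper names stream_groups and write_counts are hypothetical. Each group's values are exposed as a lazy iterator, so a consumer that only writes results out never needs all values for one key in memory at once.

```python
from itertools import groupby
from operator import itemgetter


def stream_groups(sorted_pairs):
    """Yield (key, lazy iterator of values) from pairs already sorted by key.

    Assumption: the input is sorted by key, e.g. by an external sort.
    The values of a group are never collected into a list here.
    """
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, (value for _, value in group)


def write_counts(sorted_pairs):
    """Consume each group one value at a time, as a streaming writer would."""
    out = []
    for key, values in stream_groups(sorted_pairs):
        count = 0
        for _ in values:  # only one value is held at a time
            count += 1
        out.append((key, count))
    return out
```

The trade-off in this sketch is that each group's iterator must be consumed before moving to the next key, which suits the write-out case described above but not random access to a group's values.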

> Implement skewed join
> ---------------------
>
>                 Key: SPARK-4644
>                 URL: https://issues.apache.org/jira/browse/SPARK-4644
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>         Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books that are liked by most of its users. Running ALS on such 
> skewed data will raise an OutOfMemory error if some book has more users 
> than can fit into memory. To solve this, we propose a skewed join 
> implementation.
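
The attached design doc is not reproduced in this message, so the following is only an illustration of one common workaround for join skew, key salting, and not necessarily the approach the issue proposes. It is a plain-Python sketch; the function salted_join and its parameters (hot_keys, num_salts, seed) are hypothetical names chosen for this example. Rows of the large side that carry a hot key are spread over several salted sub-keys, and the matching rows of the small side are replicated once per salt so every pair still meets.

```python
import random
from collections import defaultdict


def salted_join(large, small, hot_keys, num_salts, seed=0):
    """Inner-join two lists of (key, value) pairs, salting the hot keys.

    Returns the joined rows and the large-side buckets, so the effect of
    salting on bucket sizes can be inspected.
    """
    rng = random.Random(seed)

    # Small side: replicate rows with a hot key once per salt value.
    small_index = defaultdict(list)
    for key, value in small:
        salts = range(num_salts) if key in hot_keys else (0,)
        for salt in salts:
            small_index[(key, salt)].append(value)

    # Large side: give each hot-key row a random salt; other rows use salt 0.
    buckets = defaultdict(list)
    for key, value in large:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        buckets[(key, salt)].append(value)

    # Join bucket by bucket; no bucket holds every row of a hot key.
    joined = []
    for (key, salt), left_values in buckets.items():
        for left in left_values:
            for right in small_index.get((key, salt), []):
                joined.append((key, left, right))
    return joined, buckets
```

In the book example above, the popular books would be the hot keys: their user lists get split across num_salts buckets instead of landing in a single one, at the cost of replicating the other side's rows for those keys.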



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

