[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591929#comment-14591929
 ] 

koert kuipers commented on SPARK-4644:
--------------------------------------

i believe (plz correct me if i am wrong) that spark no longer OOMs inside 
cogroup. so to me the main issue with skew at this point is the uneven workload 
distribution. the block join seems a basic way to address this.

a more advanced proposal would be a skew join that only does the 
replication/distribution for keys with lots of values. a count-min-sketch could 
be used to estimate the key distribution, and from that point on the logic 
would be similar to the block join but with the left and right replication 
based off key counts. scalding already has this too...

> Implement skewed join
> ---------------------
>
>                 Key: SPARK-4644
>                 URL: https://issues.apache.org/jira/browse/SPARK-4644
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>         Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise a OutOfMemory error, if some book has too many users 
> which cannot be fit into memory. To solve it, we propose a skewed join 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to