[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228308#comment-14228308 ]
Shixiong Zhu commented on SPARK-4644:
-------------------------------------

I disagree with using `broadcast join`, for two reasons:

1. `broadcast join` is part of Spark SQL, so it is not convenient for people who only want to use Spark Core. Some users (such as ALS in MLlib) already use the `join` of Spark Core, and I don't think forcing them to rewrite their code with Spark SQL is a good idea.
2. `broadcast join` assumes that only one of the two tables has skewed keys. If both tables have skewed keys, how should that be handled?

I only know a little about Spark SQL, so please let me know if I have made any mistakes.

> Implement skewed join
> ---------------------
>
>                 Key: SPARK-4644
>                 URL: https://issues.apache.org/jira/browse/SPARK-4644
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>         Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have
> several books that are liked by most of its users. Running ALS on such
> skewed data will raise an OutOfMemory error if some book has more users
> than can fit into memory. To solve this, we propose a skewed join
> implementation.
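
To make point 2 concrete, here is a rough sketch of one common way to join two inputs that both have skewed keys, sometimes called grid salting. It is written in plain Python over lists of (key, value) pairs, not against the Spark API, and it is not taken from the attached design doc. The names `grid_salted_join`, `plain_join`, `hot_keys`, `n` and `m` are made up for this illustration, and the set of hot keys is assumed to be known in advance.

```python
import random
from collections import defaultdict


def plain_join(left, right):
    """Reference inner join: all right-side rows of a key are grouped together."""
    right_by_key = defaultdict(list)
    for k, w in right:
        right_by_key[k].append(w)
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]


def grid_salted_join(left, right, hot_keys, n=3, m=3, seed=0):
    """Inner join that spreads each hot key over an n x m grid of buckets.

    A hot left row picks one grid row i and is copied to every column j.
    A hot right row picks one grid column j and is copied to every row i.
    Any (left row, right row) pair for a hot key therefore meets in
    exactly one bucket (key, i, j), so no result is lost or duplicated.
    """
    rng = random.Random(seed)
    left_buckets = defaultdict(list)
    right_buckets = defaultdict(list)

    for k, v in left:
        if k in hot_keys:
            i = rng.randrange(n)
            for j in range(m):
                left_buckets[(k, i, j)].append(v)
        else:
            # Non-skewed keys keep a single bucket.
            left_buckets[(k, 0, 0)].append(v)

    for k, w in right:
        if k in hot_keys:
            j = rng.randrange(m)
            for i in range(n):
                right_buckets[(k, i, j)].append(w)
        else:
            right_buckets[(k, 0, 0)].append(w)

    # Each bucket is joined independently; in a distributed setting
    # every bucket could be processed by a different task.
    out = []
    for bucket, vs in left_buckets.items():
        k = bucket[0]
        ws = right_buckets.get(bucket, [])
        for v in vs:
            for w in ws:
                out.append((k, (v, w)))
    return out
```

With this scheme a bucket for a hot key holds roughly 1/n of that key's left rows and 1/m of its right rows, at the cost of copying each hot left row m times and each hot right row n times. Whether that trade-off is acceptable depends on the data, and the design doc may well take a different approach.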