[ https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069883#comment-14069883 ]
Apache Spark commented on SPARK-2612:
-------------------------------------

User 'renozhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/1521

> ALS has data skew for popular product
> -------------------------------------
>
>                 Key: SPARK-2612
>                 URL: https://issues.apache.org/jira/browse/SPARK-2612
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Peng Zhang
>
> Rating inputs usually contain some popular products that are related to
> many users.
> groupByKey() in updateFeatures() may cause one extra shuffle stage that
> gathers the data of a popular product into one task, because its RDD's
> partitioner may not be used as the join() partitioner.
> The following join() then needs to shuffle the aggregated product data. The
> shuffle block can easily be bigger than 2 GB, and the shuffle fails as
> mentioned in SPARK-1476.
> Increasing the number of blocks doesn't help.
> IMHO, groupByKey() should use the same partitioner as the other RDD in
> join(). Then groupByKey() and join() will be in the same stage, and shuffle
> data from many previous tasks will not trigger the "2G" limit.
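
The proposal in the description (group with the same partitioner that the join uses) can be sketched roughly as below. This is a minimal illustration only, not the ALS code and not the change in the pull request: the object name, the sample data, the `productInfo` RDD and the partition count of 4 are all made up for the example. It assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (required on Spark 1.0.x)

// Hypothetical stand-alone sketch; none of these names come from MLlib's ALS.
object CoPartitionedGroupJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("copartition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up (productId, userId) pairs; product 1 plays the "popular" product.
    val ratings = sc.parallelize(Seq((1, 10), (1, 11), (1, 12), (2, 10), (3, 11)))

    // Made-up per-product side data, already laid out by an explicit partitioner.
    val partitioner = new HashPartitioner(4) // partition count is an arbitrary choice
    val productInfo = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
      .partitionBy(partitioner)

    // Group with the partitioner the join side already uses, instead of
    // letting groupByKey() pick its own.
    val grouped = ratings.groupByKey(partitioner)

    // Both inputs now carry an equal partitioner, so the join can depend on
    // them narrowly and does not need to reshuffle the grouped data.
    val joined = grouped.join(productInfo, partitioner)

    assert(grouped.partitioner == productInfo.partitioner)
    println(joined.toDebugString) // lineage, useful for checking stage boundaries

    sc.stop()
  }
}
```

Printing the lineage with `toDebugString` is one way to check whether the grouped RDD and the join end up in the same stage. Note that this only avoids re-shuffling the already aggregated data; all values for a popular key still end up in a single partition.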