[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612247#comment-14612247 ]
Xiangrui Meng commented on SPARK-8708:
--------------------------------------

[~antonymayi] We merged a fix by [~viirya] in the master branch. Could you help test the performance on your dataset and report back? Thanks!

> MatrixFactorizationModel.predictAll() populates single partition only
> ---------------------------------------------------------------------
>
>                 Key: SPARK-8708
>                 URL: https://issues.apache.org/jira/browse/SPARK-8708
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Antony Mayi
>            Assignee: Liang-Chi Hsieh
>             Fix For: 1.5.0
>
>
> When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all
> values pushed into a single partition despite quite high parallelism.
> This degrades the performance of further processing. I can obviously run
> .partitionBy() to rebalance it, but that is still too costly (e.g. when running
> .predictAll() in a loop for thousands of products); it should be possible to
> do this automatically on the model.
> Below is an example on a tiny sample (the same happens on a large dataset):
> {code:title=pyspark}
> >>> from operator import itemgetter
> >>> from pyspark.mllib.recommendation import ALS
> >>> r1 = (1, 1, 1.0)
> >>> r2 = (1, 2, 2.0)
> >>> r3 = (2, 1, 2.0)
> >>> r4 = (2, 2, 2.0)
> >>> r5 = (3, 1, 1.0)
> >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
> >>> ratings.getNumPartitions()
> 5
> >>> users = ratings.map(itemgetter(0)).distinct()
> >>> model = ALS.trainImplicit(ratings, 1, seed=10)
> >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
> >>> predictions_for_2.glom().map(len).collect()
> [0, 0, 3, 0, 0]
> {code}
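
The report does not say why every prediction lands in one partition. One plausible reading, offered here as an assumption rather than a confirmed diagnosis, is that the result ends up hash-partitioned by product ID: every pair in the example asks for product 2, so all rows share a single key. The pure-Python sketch below imitates that behaviour with a hypothetical helper, partition_sizes, and does not call Spark at all.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records a hash partitioner would send to each partition.

    Hypothetical helper for illustration only; it mimics
    `hash(key) % num_partitions` and is not part of the Spark API.
    """
    counts = Counter(hash(key) % num_partitions for key in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


# (user, product) pairs from the example: three users, all asking for product 2.
pairs = [(1, 2), (2, 2), (3, 2)]

# Assumption: records are keyed by product, so every record shares the key 2.
by_product = partition_sizes([product for _, product in pairs], 5)

# Keying by user instead spreads the same records over several partitions.
by_user = partition_sizes([user for user, _ in pairs], 5)

print(by_product)
print(by_user)
```

Under that assumption the product-keyed case gives [0, 0, 3, 0, 0], the same shape as the glom() output in the report, while keying the same three records by user gives [0, 1, 1, 1, 0].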