[ https://issues.apache.org/jira/browse/SPARK-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983644#comment-14983644 ]
Bryan Cutler edited comment on SPARK-10158 at 10/31/15 7:05 AM:
----------------------------------------------------------------

I think the best way to handle this from the PySpark side is to add something like the following to {{ALS._prepare}} ([link|https://github.com/apache/spark/blob/master/python/pyspark/mllib/recommendation.py#L215]), which is called before training:

{noformat}
MAX_ID_VALUE = ratings.ctx._gateway.jvm.Integer.MAX_VALUE
if ratings.filter(lambda x: x.user > MAX_ID_VALUE or x.product > MAX_ID_VALUE).count() > 0:
    raise ValueError("Rating IDs must be less than max Java int %s." % str(MAX_ID_VALUE))
{noformat}

But any operations on the data are probably not worth the performance hit for this issue.

Edit: I meant the above as an alternative to checking values against 2^31 explicitly, which could be done in the {{Rating}} constructor but seems like too much of a hack to me.

was (Author: bryanc):
The only way I can see to handle this from the PySpark side is to add something like the following to {{ALS._prepare}} ([link|https://github.com/apache/spark/blob/master/python/pyspark/mllib/recommendation.py#L215]), which is called before training:

{noformat}
MAX_ID_VALUE = ratings.ctx._gateway.jvm.Integer.MAX_VALUE
if ratings.filter(lambda x: x.user > MAX_ID_VALUE or x.product > MAX_ID_VALUE).count() > 0:
    raise ValueError("Rating IDs must be less than max Java int %s." % str(MAX_ID_VALUE))
{noformat}

But any operations on the data are probably not worth the performance hit for this issue.

> ALS should print better errors when given Long IDs
> --------------------------------------------------
>
>                 Key: SPARK-10158
>                 URL: https://issues.apache.org/jira/browse/SPARK-10158
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib, PySpark
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> See [SPARK-10115] for the very confusing messages you get when you try to use
> ALS with Long IDs. We should catch and identify these errors and print
> meaningful error messages.
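
For readers who want to try the range check from the comment without a Spark cluster, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code that went into Spark: the {{Rating}} namedtuple is a local stand-in for {{pyspark.mllib.recommendation.Rating}}, the helper names {{id_out_of_range}} and {{check_rating_ids}} are hypothetical, and the Java int maximum is hard-coded as 2^31 - 1 rather than read from the JVM gateway. Like the snippet in the comment, it only checks the upper bound.

```python
from collections import namedtuple

# Local stand-in for pyspark.mllib.recommendation.Rating (assumption:
# only the three field names matter for this check).
Rating = namedtuple("Rating", ["user", "product", "rating"])

# Value of Java's Integer.MAX_VALUE, hard-coded so no JVM is needed.
JAVA_INT_MAX = 2 ** 31 - 1


def id_out_of_range(r):
    """Return True if the user or product ID exceeds the Java int maximum."""
    return r.user > JAVA_INT_MAX or r.product > JAVA_INT_MAX


def check_rating_ids(ratings):
    """Raise ValueError on the first rating whose ID exceeds the Java int maximum.

    `ratings` is any iterable of Rating-like records. Hypothetical helper:
    it stops at the first offending record instead of counting all of them.
    """
    for r in ratings:
        if id_out_of_range(r):
            raise ValueError(
                "Rating IDs must not exceed max Java int %d, got %r"
                % (JAVA_INT_MAX, r)
            )


if __name__ == "__main__":
    # IDs up to and including 2^31 - 1 pass silently.
    check_rating_ids([Rating(1, 2, 5.0), Rating(JAVA_INT_MAX, 3, 1.0)])
    try:
        check_rating_ids([Rating(2 ** 31, 1, 4.0)])
    except ValueError as e:
        print(e)
```

The predicate {{id_out_of_range}} has the same shape as the lambda passed to {{ratings.filter}} in the comment, so it could be reused there. Applied to an RDD, the check still costs an extra pass over the data, which is the performance concern raised above.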