[ https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541433#comment-14541433 ]
Joseph K. Bradley commented on SPARK-7127:
------------------------------------------

The mapPartitions function is an RDD method, so it returns an RDD rather than a DataFrame. By using it, you end up creating 2 RDDs/DataFrames which must then be joined. You're trying to do that with "withColumn", but you would have to use a join.

However, a better approach is to stick with DataFrame-only methods, which return DataFrames rather than RDDs. To do that, you can broadcast the model and then use it in a UDF. (Search the spark.ml code for "callUDF" method invocations for examples.) That UDF can be used with "withColumn" to add the prediction column.

> Broadcast spark.ml tree ensemble models for predict
> ---------------------------------------------------
>
>                 Key: SPARK-7127
>                 URL: https://issues.apache.org/jira/browse/SPARK-7127
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast
> models and then predict. This will mean overriding transform().
> Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
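
The broadcast-plus-UDF approach described in the comment could be sketched roughly as follows. This is a minimal illustration rather than the spark.ml implementation: `ToyModel`, the object name, the column names, and the sample data are all hypothetical, and it uses the `udf` helper from `org.apache.spark.sql.functions` (available in Spark 2.x and later) instead of the `callUDF` calls mentioned for Spark 1.4.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for a trained model; NOT a spark.ml class.
// Case classes are Serializable, so instances can be broadcast.
case class ToyModel(weights: Array[Double]) {
  // Weighted sum of the features; a placeholder for real prediction logic.
  def predict(features: Vector): Double =
    features.toArray.zip(weights).map { case (x, w) => x * w }.sum
}

object BroadcastPredictSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-predict-sketch")
      .getOrCreate()

    // Assumed input: an id column and a vector-valued "features" column.
    val df = spark.createDataFrame(Seq(
      (0L, Vectors.dense(1.0, 2.0)),
      (1L, Vectors.dense(3.0, 4.0))
    )).toDF("id", "features")

    // Ship the model to each executor once instead of once per task closure.
    val bcModel = spark.sparkContext.broadcast(ToyModel(Array(0.5, -1.0)))

    // The UDF reads the broadcast value on the executor side.
    val predictUDF = udf { (features: Vector) => bcModel.value.predict(features) }

    // withColumn keeps the result a DataFrame, so no RDD round trip or join is needed.
    val predictions = df.withColumn("prediction", predictUDF(col("features")))
    predictions.show()

    spark.stop()
  }
}
```

Because `withColumn` appends the prediction to the existing rows, the original columns stay aligned with the predictions, which is what the `mapPartitions` route could only achieve with a separate join.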