[ 
https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541433#comment-14541433
 ] 

Joseph K. Bradley commented on SPARK-7127:
------------------------------------------

mapPartitions is an RDD method, so it returns an RDD rather than a DataFrame.  
By using it, you end up with two separate datasets (the original DataFrame and 
the RDD of predictions) which must then be joined.  You're trying to do that 
with "withColumn," but withColumn can only add a column computed from the same 
DataFrame, so you would have to use join instead.

However, a better approach is to stick with DataFrame-only methods, which 
return DataFrames rather than RDDs.  To do that, broadcast the model and then 
use it inside a UDF.  (Search the spark.ml code for "callUDF" invocations for 
examples.)  That UDF can then be passed to "withColumn" to add the prediction 
column.
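
A minimal sketch of the broadcast-plus-UDF pattern described above, not the 
actual spark.ml implementation.  It assumes the Spark 2.x or later API 
(SparkSession and functions.udf rather than callUDF).  ToyModel, the column 
names "id", "features" and "prediction", and the sample rows are all 
hypothetical stand-ins for a real tree ensemble model and dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for a tree ensemble model: any serializable object
// exposing a predict method would work the same way.
case class ToyModel(weights: Array[Double]) {
  def predict(features: Seq[Double]): Double =
    features.zip(weights).map { case (x, w) => x * w }.sum
}

object BroadcastPredictSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-predict-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: an id plus an array-typed feature column.
    val df = Seq(
      (1, Seq(1.0, 2.0)),
      (2, Seq(2.0, 4.0))
    ).toDF("id", "features")

    // Broadcast the model so each executor receives it once, instead of
    // serializing it into every task closure.
    val bcModel = spark.sparkContext.broadcast(ToyModel(Array(2.0, 0.5)))

    // The UDF reads the model through the broadcast handle on the executor.
    val predictUDF = udf((features: Seq[Double]) => bcModel.value.predict(features))

    // withColumn keeps everything in the DataFrame API: no RDD, no join.
    val predictions = df.withColumn("prediction", predictUDF(col("features")))
    predictions.show()

    spark.stop()
  }
}
```

Because the UDF is applied row by row to a column of the same DataFrame, the 
prediction column lines up with its input rows automatically, which is what 
the mapPartitions route could only achieve with an explicit join.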

> Broadcast spark.ml tree ensemble models for predict
> ---------------------------------------------------
>
>                 Key: SPARK-7127
>                 URL: https://issues.apache.org/jira/browse/SPARK-7127
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast 
> models and then predict.  This will mean overriding transform().
> Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction.


