The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue: extra columns are "carried over" through training and prediction, and Spark SQL's execution plan optimization decides which columns are actually needed. For the current set of APIs, we can add `predictOnValues` to models, which carries the input keys over to the output. StreamingKMeans and StreamingLinearRegression implement this method. -Xiangrui
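A minimal sketch of that key-preserving pattern, here in PySpark (the Scala API is analogous); `training` and `keyed` are assumed to be existing DStreams, and the parameter values are placeholders:

```python
from pyspark.mllib.clustering import StreamingKMeans

# training: DStream of feature vectors
# keyed:    DStream of (id, vector) pairs, where id is any unique key
model = (StreamingKMeans()
         .setK(3)
         .setDecayFactor(1.0)
         .setRandomCenters(dim=2, weight=0.0, seed=42))

model.trainOn(training)

# predictOnValues applies the model to the value of each pair and
# passes the key through unchanged, so every prediction remains
# joinable with the record it came from:
predictions = model.predictOnValues(keyed)  # DStream of (id, clusterIndex)
```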
On Tue, Nov 4, 2014 at 2:30 AM, jamborta <jambo...@gmail.com> wrote:
> Hi all,
>
> There are a few algorithms in pyspark where the prediction part is
> implemented in Scala (e.g. ALS, decision trees), so it is not easy to
> manipulate the prediction methods.
>
> I think it is a very common scenario that the user would like to generate
> predictions for a dataset such that each predicted value is identifiable
> (e.g. has a unique id attached to it). This is not possible in the current
> implementation, as the predict functions take a feature vector and return
> the predicted values in an order that, I believe, is not guaranteed, so
> there is no way to join the predictions back to the original data they
> were generated from.
>
> Is there a way around this at the moment?
>
> thanks,
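[Editor's note: one workaround sometimes suggested for the batch case is to split the ids and features from the same parent RDD and zip the predictions back on. A hedged sketch, assuming `data` is an RDD of (id, features) pairs and `model` is an already-trained batch model:

```python
# Both RDDs below come from `data` via map() alone, so they have the
# same partitioning and the same element count per partition, which
# RDD.zip requires. This relies on predict() mapping inputs to outputs
# one-to-one and in order; it is fragile (PySpark's serializer batching
# can break zip), which is part of the motivation for the key-preserving
# APIs discussed above.
ids = data.map(lambda row: row[0])
features = data.map(lambda row: row[1])

predictions = model.predict(features)   # RDD of predicted values
id_predictions = ids.zip(predictions)   # RDD of (id, prediction) pairs
```
]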