Re: pass unique ID to mllib algorithms pyspark

2014-11-05 Thread Tamas Jambor
Hi Xiangrui,

Thanks for the reply. Is this still due to be released in 1.2
(SPARK-3530 is still open)?
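
In the meantime, would zipping the IDs back onto the predictions be a
reasonable stopgap? Something like the rough sketch below (variable names
made up, not tested), which assumes that model.predict(rdd) behaves as a
one-to-one, order-preserving map over its input. I realise that ordering
guarantee is exactly what is in question here:

    from pyspark.mllib.tree import DecisionTree

    # data: RDD of (id, LabeledPoint) pairs, ids being whatever unique keys we have
    ids = data.map(lambda kv: kv[0])
    points = data.map(lambda kv: kv[1])

    model = DecisionTree.trainClassifier(points, numClasses=2,
                                         categoricalFeaturesInfo={})

    predictions = model.predict(points.map(lambda lp: lp.features))

    # join the IDs back on by position; only valid if predict preserved
    # partitioning and per-partition ordering
    id_predictions = ids.zip(predictions)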

Thanks,

On Wed, Nov 5, 2014 at 3:21 AM, Xiangrui Meng men...@gmail.com wrote:
 The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
 this issue. We carry extra columns through training and prediction
 and then leverage Spark SQL's execution plan optimization to decide
 which columns are really needed. For the current set of APIs, we can
 add `predictOnValues` to models, which carries over the input keys.
 StreamingKMeans and StreamingLinearRegression implement this method.
 -Xiangrui
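
To make sure I understand the `predictOnValues` idea: the model takes
(key, feature vector) pairs and returns (key, prediction) pairs, so the
key travels with each record. A rough illustrative sketch (I believe the
streaming models that implement this are Scala-only at the moment, so
the Python below only shows the shape of the call, with made-up names):

    keyed = examples.map(lambda x: (x[0], x[1]))    # (id, feature vector) pairs
    predictions = model.predictOnValues(keyed)      # (id, predicted value) pairs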

 On Tue, Nov 4, 2014 at 2:30 AM, jamborta jambo...@gmail.com wrote:
 Hi all,

 There are a few algorithms in pyspark where the prediction part is
 implemented in Scala (e.g. ALS, decision trees), which makes it hard to
 customise the prediction methods.

 I think it is a very common scenario that the user would like to generate
 predictions for a dataset such that each predicted value is identifiable
 (e.g. has a unique ID attached to it). This is not possible in the current
 implementation: the predict functions take feature vectors and return the
 predicted values in an order that is, I believe, not guaranteed, so there is
 no way to join them back to the original data the predictions were generated
 from.
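
 For example (a rough sketch), with a decision tree:

     predictions = model.predict(data.map(lambda lp: lp.features))
     # an RDD of bare predicted values; nothing links each value back to
     # the row (or unique ID) it was computed from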

 Is there a way around this at the moment?

 thanks,



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/pass-unique-ID-to-mllib-algorithms-pyspark-tp18051.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


