Dang, I can't seem to find the JIRA now, but I am sure we had a discussion with Matei about this, and the conclusion was that RDD order is not guaranteed unless a sort is involved.

On Mar 17, 2015 12:14 AM, "Joseph Bradley" <jos...@databricks.com> wrote:
> This was brought up again in
> https://issues.apache.org/jira/browse/SPARK-6340
> so I'll answer one item that was asked about: the reliability of zipping
> RDDs. Basically, it should be reliable, and if it is not, then it should be
> reported as a bug. This general approach should work (with explicit types
> to make it clear):
>
> val data: RDD[LabeledPoint] = ...
> val labels: RDD[Double] = data.map(_.label)
> val features1: RDD[Vector] = data.map(_.features)
> val features2: RDD[Vector] = new HashingTF(numFeatures=100).transform(features1)
> val features3: RDD[Vector] = idfModel.transform(features2)
> val finalData: RDD[LabeledPoint] = labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }
>
> If you run into problems with zipping like this, please report them!
>
> Thanks,
> Joseph
>
> On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Hopefully the new pipeline API addresses this problem. We have a code
>> example here:
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
>>
>> -Xiangrui
>>
>> On Mon, Dec 29, 2014 at 5:22 AM, andy petrella <andy.petre...@gmail.com> wrote:
>> > Here is what I did for this case: https://github.com/andypetrella/tf-idf
>> >
>> > On Mon, Dec 29, 2014 at 11:31, Sean Owen <so...@cloudera.com> wrote:
>> >
>> >> Given (label, terms) you can just transform the values to a TF vector,
>> >> then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
>> >> make a LabeledPoint from (label, vector) pairs. Is that what you're
>> >> looking for?
>> >>
>> >> On Mon, Dec 29, 2014 at 3:37 AM, Yao <y...@ford.com> wrote:
>> >> > I found the TF-IDF feature extraction and all the MLlib code that
>> >> > works with pure Vector RDDs very difficult to work with, due to the
>> >> > lack of ability to associate a vector back to the original data.
>> >> > Why can't Spark MLlib support LabeledPoint?
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
>> >> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> > For additional commands, e-mail: user-h...@spark.apache.org
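
For anyone reading this thread later: one way to sidestep the ordering question entirely is the approach Sean describes in the quoted message, where the label travels with its document through every step, so no zip is needed. Below is a minimal sketch of that idea, not a definitive implementation. The helper name `tfidfLabeledPoints`, the `(label, terms)` input shape, and the default of 100 features are assumptions made for illustration. It also assumes Spark 1.3 or later, where `IDFModel.transform` accepts a single `Vector`.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of MLlib): builds TF-IDF LabeledPoints
// from (label, terms) pairs without zipping two RDDs together.
def tfidfLabeledPoints(
    docs: RDD[(Double, Seq[String])],
    numFeatures: Int = 100): RDD[LabeledPoint] = {
  val hashingTF = new HashingTF(numFeatures)

  // Hash each document's terms individually, so every label stays
  // attached to its own term-frequency vector.
  val labeledTf: RDD[(Double, Vector)] =
    docs.map { case (label, terms) => (label, hashingTF.transform(terms)) }.cache()

  // IDF needs one pass over the TF vectors alone to count document frequencies.
  val idfModel = new IDF().fit(labeledTf.map(_._2))

  // Rescale each TF vector on its own (single-vector transform, assumed Spark 1.3+).
  labeledTf.map { case (label, tf) => LabeledPoint(label, idfModel.transform(tf)) }
}
```

Because each record is transformed on its own, the result does not depend on two RDDs having matching partitioning or element order. The trade-off is that the IDF model is captured in the task closure rather than broadcast, which should be fine for small feature dimensions but is worth revisiting for large ones.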