FWIW the JIRA I was thinking about is https://issues.apache.org/jira/browse/SPARK-3098
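
One way to sidestep the ordering question altogether is to keep each label in the same record as its features, rather than splitting into two RDDs and zipping them back together. Below is a minimal sketch of that idea, not a definitive implementation. It assumes Spark 1.3 or later (where IDFModel.transform accepts a single Vector and pair-RDD operations need no extra import) with spark-mllib on the classpath. The object name LabeledTfIdf, the helper labeledTfIdf, the sample documents and the local[2] master are all made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are illustrative, not part of MLlib.
object LabeledTfIdf {

  // Turn (label, terms) pairs into LabeledPoints without zipping separate RDDs.
  def labeledTfIdf(docs: RDD[(Double, Seq[String])], numFeatures: Int): RDD[LabeledPoint] = {
    val hashingTF = new HashingTF(numFeatures)

    // HashingTF.transform also accepts a single document, so the label can
    // stay in the same record as its term-frequency vector.
    val tf: RDD[(Double, Vector)] =
      docs.mapValues(terms => hashingTF.transform(terms)).cache()

    // IDF.fit needs an RDD[Vector], so only the values are passed in.
    val idfModel = new IDF().fit(tf.values)

    // Assumption: Spark 1.3+, where IDFModel.transform(v: Vector) is available.
    tf.map { case (label, v) => LabeledPoint(label, idfModel.transform(v)) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LabeledTfIdf").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: (label, terms).
      val docs = sc.parallelize(Seq(
        (1.0, Seq("spark", "mllib", "tf", "idf")),
        (0.0, Seq("hello", "world", "hello"))
      ))
      labeledTfIdf(docs, numFeatures = 100).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because label and vector are never separated, the result does not depend on any ordering guarantee, at the cost of caching the intermediate TF pairs so they are not recomputed after fitting the IDF model.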

On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

> I vaguely remember that JIRA and AFAIK Matei's point was that the order is
> not guaranteed *after* a shuffle. If you only use operations like map which
> preserve partitioning, ordering should be guaranteed from what I know.
>
> On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Dang, I can't seem to find the JIRA now, but I am sure we had a discussion
>> with Matei about this and the conclusion was that RDD order is not
>> guaranteed unless a sort is involved.
>>
>> On Mar 17, 2015 12:14 AM, "Joseph Bradley" <jos...@databricks.com> wrote:
>>
>>> This was brought up again in
>>> https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one
>>> item which was asked about the reliability of zipping RDDs. Basically, it
>>> should be reliable, and if it is not, then it should be reported as a bug.
>>> This general approach should work (with explicit types to make it clear):
>>>
>>> val data: RDD[LabeledPoint] = ...
>>> val labels: RDD[Double] = data.map(_.label)
>>> val features1: RDD[Vector] = data.map(_.features)
>>> val features2: RDD[Vector] =
>>>   new HashingTF(numFeatures=100).transform(features1)
>>> val features3: RDD[Vector] = idfModel.transform(features2)
>>> val finalData: RDD[LabeledPoint] =
>>>   labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }
>>>
>>> If you run into problems with zipping like this, please report them!
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> Hopefully the new pipeline API addresses this problem.
>>>> We have a code example here:
>>>>
>>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
>>>>
>>>> -Xiangrui
>>>>
>>>> On Mon, Dec 29, 2014 at 5:22 AM, andy petrella <andy.petre...@gmail.com> wrote:
>>>> > Here is what I did for this case: https://github.com/andypetrella/tf-idf
>>>> >
>>>> > On Mon, Dec 29, 2014 at 11:31, Sean Owen <so...@cloudera.com> wrote:
>>>> >
>>>> >> Given (label, terms) you can just transform the values to a TF vector,
>>>> >> then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
>>>> >> make a LabeledPoint from (label, vector) pairs. Is that what you're
>>>> >> looking for?
>>>> >>
>>>> >> On Mon, Dec 29, 2014 at 3:37 AM, Yao <y...@ford.com> wrote:
>>>> >> > I found the TF-IDF feature extraction and all the MLlib code that works
>>>> >> > with pure Vector RDDs very difficult to work with, due to the lack of
>>>> >> > ability to associate a vector back to the original data. Why can't
>>>> >> > Spark MLlib support LabeledPoint?
>>>> >> >
>>>> >> > --
>>>> >> > View this message in context:
>>>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
>>>> >> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>> >> >
>>>> >> > ---------------------------------------------------------------------
>>>> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> >> > For additional commands, e-mail: user-h...@spark.apache.org