I vaguely remember that JIRA, and AFAIK Matei's point was that order is not guaranteed *after* a shuffle. If you only use operations like map, which preserve partitioning, ordering should be guaranteed as far as I know.
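To make the distinction concrete, here is a minimal sketch, not an authoritative statement of Spark's guarantees. It assumes Spark 1.3+ (so the pair-RDD implicits are in scope without extra imports) and a local master; the object name `ZipOrderingSketch` and the toy data are made up for illustration. It zips two RDDs derived from the same parent by `map` only, then shows one way to stay safe when a shuffle is involved: attach an explicit index before the shuffle and join on it instead of zipping.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipOrderingSketch {
  /** Returns (pairs from zip after map-only ops, pairs re-associated by index after a shuffle). */
  def run(): (Seq[(Double, String)], Seq[(Double, String)]) = {
    val conf = new SparkConf().setAppName("zip-ordering-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical toy data: (label, text) pairs spread over two partitions.
      val data: RDD[(Double, String)] =
        sc.parallelize(Seq((1.0, "a"), (0.0, "b"), (1.0, "c"), (0.0, "d")), numSlices = 2)

      // Both children come from the same parent through map only, so they keep
      // the same partitioning and the same element order within each partition.
      val labels: RDD[Double] = data.map(_._1)
      val texts: RDD[String] = data.map(_._2.toUpperCase)
      val zipped = labels.zip(texts).collect().toSeq

      // If one side has to go through a shuffle (repartition here), tag each
      // element with its index *before* the shuffle and join on that index.
      val indexedLabels: RDD[(Long, Double)] = labels.zipWithIndex().map(_.swap)
      val indexedTexts: RDD[(Long, String)] = texts.zipWithIndex().map(_.swap).repartition(3)
      val joined = indexedLabels.join(indexedTexts).sortByKey().values.collect().toSeq

      (zipped, joined)
    } finally {
      sc.stop()
    }
  }
}
```

Zipping `labels` directly with a repartitioned `texts` is not an option in this sketch: `zip` requires both RDDs to have the same number of partitions and the same number of elements in each partition.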
On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen <so...@cloudera.com> wrote:

> Dang I can't seem to find the JIRA now but I am sure we had a discussion
> with Matei about this and the conclusion was that RDD order is not
> guaranteed unless a sort is involved.
>
> On Mar 17, 2015 12:14 AM, "Joseph Bradley" <jos...@databricks.com> wrote:
>
>> This was brought up again in
>> https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one
>> item which was asked about the reliability of zipping RDDs. Basically, it
>> should be reliable, and if it is not, then it should be reported as a bug.
>> This general approach should work (with explicit types to make it clear):
>>
>> val data: RDD[LabeledPoint] = ...
>> val labels: RDD[Double] = data.map(_.label)
>> val features1: RDD[Vector] = data.map(_.features)
>> val features2: RDD[Vector] = new HashingTF(numFeatures=100).transform(features1)
>> val features3: RDD[Vector] = idfModel.transform(features2)
>> val finalData: RDD[LabeledPoint] = labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }
>>
>> If you run into problems with zipping like this, please report them!
>>
>> Thanks,
>> Joseph
>>
>> On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>> Hopefully the new pipeline API addresses this problem. We have a code
>>> example here:
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
>>>
>>> -Xiangrui
>>>
>>> On Mon, Dec 29, 2014 at 5:22 AM, andy petrella <andy.petre...@gmail.com>
>>> wrote:
>>> > Here is what I did for this case: https://github.com/andypetrella/tf-idf
>>> >
>>> > On Mon, Dec 29, 2014 at 11:31, Sean Owen <so...@cloudera.com> wrote:
>>> >
>>> >> Given (label, terms) you can just transform the values to a TF vector,
>>> >> then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
>>> >> make a LabeledPoint from (label, vector) pairs. Is that what you're
>>> >> looking for?
>>> >>
>>> >> On Mon, Dec 29, 2014 at 3:37 AM, Yao <y...@ford.com> wrote:
>>> >> > I found the TF-IDF feature extraction and all the MLlib code that
>>> >> > work with pure Vector RDD very difficult to work with due to the
>>> >> > lack of ability to associate vector back to the original data.
>>> >> > Why can't Spark MLlib support LabeledPoint?
>>> >> >
>>> >> > --
>>> >> > View this message in context:
>>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
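
For the original question at the bottom of the thread (keeping labels attached to their TF-IDF vectors), one way to avoid zipping altogether is to carry the label through every step as the key of a pair RDD, along the lines Sean suggested. This is only a sketch under assumptions: it relies on the mllib `HashingTF.transform(document)` and `IDFModel.transform(vector)` single-item overloads (the latter needs Spark 1.3+), and the object name `LabeledTfIdfSketch`, the toy documents and `numFeatures = 100` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object LabeledTfIdfSketch {
  /** Builds LabeledPoints with TF-IDF features while the label stays attached as the key. */
  def run(): Seq[LabeledPoint] = {
    val conf = new SparkConf().setAppName("labeled-tfidf-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical toy corpus: (label, terms) pairs.
      val docs: RDD[(Double, Seq[String])] = sc.parallelize(Seq(
        (1.0, Seq("spark", "rdd", "zip")),
        (0.0, Seq("mllib", "tf", "idf", "spark"))
      ))

      // Term frequencies: hash each document's terms, leaving the label untouched.
      val hashingTF = new HashingTF(numFeatures = 100)
      val tf: RDD[(Double, Vector)] =
        docs.mapValues(terms => hashingTF.transform(terms)).cache()

      // Fit IDF on the vectors only, then rescale each vector next to its label.
      val idfModel = new IDF().fit(tf.values)
      val points: RDD[LabeledPoint] = tf
        .mapValues(v => idfModel.transform(v))
        .map { case (label, features) => LabeledPoint(label, features) }

      points.collect().toSeq
    } finally {
      sc.stop()
    }
  }
}
```

Because the label travels with its vector in the same record, nothing here depends on element order across RDDs.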