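Yes, I don't think relying on zip() to keep labels and features aligned is entirely reliable in general. I would emit (label, features) pairs and then transform the values, so that each label travels with its own features.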
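As a rough illustration of the pairing idea, here is a minimal plain-Python sketch (no Spark involved). The helper names fit_standardizer, scale and map_values are hypothetical, and the hand-written standardization is only a stand-in for a real scaler, not MLlib's StandardScaler. On a pair RDD the same shape would be expressed with mapValues; whether a given MLlib model can be applied per record like this is not something the sketch addresses.

```python
import math

def fit_standardizer(pairs):
    """Compute per-column mean and population std from (label, features) pairs."""
    feats = [f for _, f in pairs]
    n = len(feats)
    dims = len(feats[0])
    means = [sum(f[j] for f in feats) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((f[j] - means[j]) ** 2 for f in feats) / n)
        for j in range(dims)
    ]
    return means, stds

def scale(features, means, stds):
    """Standardize one feature vector; zero-spread columns are mapped to 0.0."""
    return [
        (x - m) / s if s > 0 else 0.0
        for x, m, s in zip(features, means, stds)
    ]

def map_values(fn, pairs):
    """Apply fn to the value of each (key, value) pair, leaving the key untouched."""
    return [(k, fn(v)) for k, v in pairs]

# Hypothetical toy data: (label, features) kept together in one record.
data = [(0.0, [1.0, 10.0]), (1.0, [3.0, 10.0]), (0.0, [5.0, 10.0])]

means, stds = fit_standardizer(data)
scaled = map_values(lambda f: scale(f, means, stds), data)

for label, features in scaled:
    print(label, features)
```

Because the label never leaves its record, there is nothing to re-align afterwards. By contrast, the zip-based version from the documentation relies on the two RDDs having the same number of partitions and the same number of elements in each partition, in matching order.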
In practice, the zip-based approach from the docs may happen to work fine in simple cases.

On Sun, Mar 15, 2015 at 3:51 AM, kian.ho <hui.kian.ho+sp...@gmail.com> wrote:
> Hi, I was taking a look through the mllib examples in the official spark
> documentation and came across the following:
> http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2
>
> specifically the lines:
>
> label = data.map(lambda x: x.label)
> features = data.map(lambda x: x.features)
> ...
> ...
> data1 = label.zip(scaler1.transform(features))
>
> my question:
> wouldn't it be possible that some labels in the pairs returned by the
> label.zip(..) operation are not paired with their original features? i.e.
> are the original orderings of `labels` and `features` preserved after the
> scaler1.transform(..) and label.zip(..) operations?
>
> This issue was also mentioned in
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html
>
> I would greatly appreciate some clarification on this, as I've run into this
> issue whilst experimenting with feature extraction for text classification,
> where (correct me if I'm wrong) there is no built-in mechanism to keep track
> of document-ids through the HashingTF and IDF fitting and transformations.
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>