Re: order preservation with RDDs
For those still interested, I raised this issue on JIRA and received an official response:
https://issues.apache.org/jira/browse/SPARK-6340

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052p22088.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: order preservation with RDDs
Yes, I don't think this is entirely reliable in general. I would emit (label, features) pairs and then transform the values. In practice, the zip-based approach may happen to work fine in simple cases.

On Sun, Mar 15, 2015 at 3:51 AM, kian.ho hui.kian.ho+sp...@gmail.com wrote:

> Hi,
>
> I was taking a look through the MLlib examples in the official Spark
> documentation and came across the following:
> http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2
>
> specifically the lines:
>
>     label = data.map(lambda x: x.label)
>     features = data.map(lambda x: x.features)
>     ...
>     ...
>     data1 = label.zip(scaler1.transform(features))
>
> My question: wouldn't it be possible that some labels in the pairs
> returned by the label.zip(..) operation are not paired with their
> original features? i.e. are the original orderings of `labels` and
> `features` preserved after the scaler1.transform(..) and label.zip(..)
> operations?
>
> This issue was also mentioned in
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html
>
> I would greatly appreciate some clarification on this, as I've run into
> this issue whilst experimenting with feature extraction for text
> classification, where (correct me if I'm wrong) there is no built-in
> mechanism to keep track of document-ids through the HashingTF and IDF
> fitting and transformations.
>
> Thanks.
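As a rough illustration of the suggestion in the reply above (keep each label attached to its features and transform only the value side), here is a plain-Python sketch. It uses lists instead of RDDs so it runs without Spark; `fit_scaler`, `scale`, and `scale_pairs` are hypothetical helpers, not MLlib's StandardScaler, and the population standard deviation is used only for simplicity.

```python
import math


def fit_scaler(rows):
    """Compute the per-column mean and population standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(dims)
    ]
    return means, stds


def scale(features, means, stds):
    """Standardize one feature vector; constant columns map to 0.0."""
    return [
        (x - m) / s if s > 0 else 0.0
        for x, m, s in zip(features, means, stds)
    ]


def scale_pairs(pairs):
    """Fit on the feature side, then transform values without splitting pairs."""
    means, stds = fit_scaler([features for _, features in pairs])
    return [(label, scale(features, means, stds)) for label, features in pairs]


# Hypothetical toy data: (label, features) pairs.
data = [(1.0, [1.0, 10.0]), (0.0, [3.0, 10.0]), (1.0, [5.0, 10.0])]
scaled = scale_pairs(data)
```

Because the label travels in the same tuple as its features, the pairing does not depend on how the collection is ordered or partitioned.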
order preservation with RDDs
Hi,

I was taking a look through the MLlib examples in the official Spark documentation and came across the following:
http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2

specifically the lines:

    label = data.map(lambda x: x.label)
    features = data.map(lambda x: x.features)
    ...
    ...
    data1 = label.zip(scaler1.transform(features))

My question: wouldn't it be possible that some labels in the pairs returned by the label.zip(..) operation are not paired with their original features? i.e. are the original orderings of `labels` and `features` preserved after the scaler1.transform(..) and label.zip(..) operations?

This issue was also mentioned in
http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

I would greatly appreciate some clarification on this, as I've run into this issue whilst experimenting with feature extraction for text classification, where (correct me if I'm wrong) there is no built-in mechanism to keep track of document-ids through the HashingTF and IDF fitting and transformations.

Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052.html
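One way to keep document IDs attached through a hashing-TF and IDF pipeline is to carry them as keys and only ever transform the values. The following is a plain-Python sketch of that idea, not MLlib's HashingTF or IDF: the helper names, the bucket count, the CRC32 hash, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import zlib

NUM_FEATURES = 16  # hypothetical bucket count, kept small for illustration


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to a sparse {bucket: count} dict using a stable hash."""
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec


def fit_idf(tf_vectors):
    """Compute a smoothed IDF per bucket: log((n + 1) / (df + 1))."""
    n = len(tf_vectors)
    df = {}
    for vec in tf_vectors:
        for idx in vec:
            df[idx] = df.get(idx, 0) + 1
    return {idx: math.log((n + 1) / (count + 1)) for idx, count in df.items()}


def tfidf_by_id(docs):
    """Take (doc_id, tokens) pairs and return {doc_id: {bucket: weight}}."""
    tf = [(doc_id, hashing_tf(tokens)) for doc_id, tokens in docs]
    idf = fit_idf([vec for _, vec in tf])
    return {
        doc_id: {idx: count * idf[idx] for idx, count in vec.items()}
        for doc_id, vec in tf
    }


# Hypothetical toy corpus keyed by document ID.
docs = [
    ("doc-a", ["spark", "rdd", "zip"]),
    ("doc-b", ["spark", "mllib"]),
    ("doc-c", ["rdd", "rdd", "order"]),
]
weights = tfidf_by_id(docs)
```

Each document ID stays in the same record as its vector at every step, so no zip against a separately transformed collection is needed.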