Re: order preservation with RDDs

2015-03-16 Thread kian.ho
For those still interested, I raised this issue on JIRA and received an
official response:

https://issues.apache.org/jira/browse/SPARK-6340



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052p22088.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: order preservation with RDDs

2015-03-15 Thread Sean Owen
Yes, I don't think this is entirely reliable in general. I would emit
(label, features) pairs and then transform the values.

In practice, this may happen to work fine in simple cases.
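
For anyone following along, the idea of keeping each label attached to its
features can be sketched in plain Python, outside Spark. This is only an
illustration under stated assumptions: the toy `pairs` data and the helpers
`fit_scaler` and `scale` are hypothetical names, the scaler uses the
population standard deviation for simplicity, and none of this is MLlib's
API. On a pair RDD, the per-value step would roughly correspond to
`mapValues`.

```python
import math

# Hypothetical toy data: each label travels with its own feature vector.
pairs = [(1.0, [2.0, 10.0]), (0.0, [4.0, 20.0]), (1.0, [6.0, 30.0])]


def fit_scaler(vectors):
    """Compute per-column mean and population standard deviation."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((v[j] - means[j]) ** 2 for v in vectors) / n)
        for j in range(dims)
    ]
    return means, stds


def scale(vector, means, stds):
    """Standardize one vector; constant columns map to 0.0."""
    return [(x - m) / s if s else 0.0 for x, m, s in zip(vector, means, stds)]


# Fit on the features only, then transform the value of each pair in place.
means, stds = fit_scaler([features for _, features in pairs])
scaled_pairs = [(label, scale(features, means, stds)) for label, features in pairs]

for label, features in scaled_pairs:
    print(label, features)
```

Because the label is never separated from its vector, there is no later
zip step whose correctness depends on element ordering.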

On Sun, Mar 15, 2015 at 3:51 AM, kian.ho <hui.kian.ho+sp...@gmail.com> wrote:
 Hi, I was taking a look through the mllib examples in the official spark
 documentation and came across the following:
 http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2

 specifically the lines:

 label = data.map(lambda x: x.label)
 features = data.map(lambda x: x.features)
 ...
 ...
 data1 = label.zip(scaler1.transform(features))

 My question:
 is it possible that some labels in the pairs returned by the
 label.zip(..) operation are not paired with their original features? That is,
 are the original orderings of `label` and `features` preserved after the
 scaler1.transform(..) and label.zip(..) operations?

 This issue was also mentioned in
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

 I would greatly appreciate some clarification on this, as I've run into this
 issue whilst experimenting with feature extraction for text classification,
 where (correct me if I'm wrong) there is no built-in mechanism to keep track
 of document-ids through the HashingTF and IDF fitting and transformations.

 Thanks.






order preservation with RDDs

2015-03-14 Thread kian.ho
Hi, I was taking a look through the MLlib examples in the official Spark
documentation and came across the following:
http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2

specifically the lines:

label = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
...
...
data1 = label.zip(scaler1.transform(features))

My question:
is it possible that some labels in the pairs returned by the
label.zip(..) operation are not paired with their original features? That is,
are the original orderings of `label` and `features` preserved after the
scaler1.transform(..) and label.zip(..) operations?

This issue was also mentioned in
http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

I would greatly appreciate some clarification on this, as I've run into this
issue whilst experimenting with feature extraction for text classification,
where (correct me if I'm wrong) there is no built-in mechanism to keep track
of document-ids through the HashingTF and IDF fitting and transformations.
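
To make the document-id concern concrete, here is a rough plain-Python
sketch of carrying an id alongside each document through a hashing
term-frequency step and an IDF step. Everything in it is an assumption for
illustration: the corpus, `NUM_BUCKETS`, `bucket`, and `hashed_tf` are
hypothetical, the hash is CRC32 rather than whatever MLlib uses, and this is
not the HashingTF or IDF API.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 16  # hypothetical, kept tiny for illustration

# Hypothetical corpus: (document id, tokens) pairs.
docs = [
    ("doc-a", ["spark", "rdd", "zip"]),
    ("doc-b", ["spark", "mllib"]),
    ("doc-c", ["rdd", "rdd", "order"]),
]


def bucket(term):
    """Map a term to a bucket; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS


def hashed_tf(tokens):
    """Count how many tokens fall into each bucket."""
    return Counter(bucket(t) for t in tokens)


# Term frequencies, with the id kept next to each document's counts.
tf_by_doc = [(doc_id, hashed_tf(tokens)) for doc_id, tokens in docs]

# Document frequency: the number of documents that touch each bucket.
df = Counter()
for _, tf in tf_by_doc:
    df.update(tf.keys())

# Smoothed inverse document frequency per bucket.
n_docs = len(tf_by_doc)
idf = {b: math.log((n_docs + 1) / (count + 1)) for b, count in df.items()}

# TF-IDF weights, still keyed by document id.
tfidf_by_doc = [
    (doc_id, {b: freq * idf[b] for b, freq in tf.items()})
    for doc_id, tf in tf_by_doc
]

for doc_id, weights in tfidf_by_doc:
    print(doc_id, weights)
```

The id stays with its document at every step, so nothing has to be
re-associated by position afterwards.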

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
