Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: Using TF-IDF from MLlib

2015-03-16 Thread Joseph Bradley
vector back to the original data. Why can't Spark MLlib support LabeledPoint? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html

Re: Using TF-IDF from MLlib

2015-03-16 Thread Sean Owen
Why can't Spark MLlib support LabeledPoint?

Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
can't Spark MLlib support LabeledPoint?

Re: Using TF-IDF from MLlib

2014-12-29 Thread Sean Owen

Re: Using TF-IDF from MLlib

2014-12-29 Thread andy petrella
vector back to the original data. Why can't Spark MLlib support LabeledPoint?

Re: Using TF-IDF from MLlib

2014-12-29 Thread Xiangrui Meng
to associate vector back to the original data. Why can't Spark MLlib support LabeledPoint?

Re: Using TF-IDF from MLlib

2014-12-28 Thread Yao

RE: Using TF-IDF from MLlib

2014-11-21 Thread Daniel, Ronald (ELS-SDG)
Thanks for the info, Andy. A big help. One thing: I think you can figure out which document is responsible for which vector without checking in more code. Start with a PairRDD of [doc_id, doc_string] for each document and split that into one RDD for each column. The values in the doc_string RDD
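The split-and-zip idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `tfidfById` and the pre-tokenized `Seq[String]` document type are made up here, not taken from the thread. `RDD.zip` requires both RDDs to have the same number of partitions and the same number of elements per partition, which should hold when both sides are derived from the same parent through map-like transformations, as they are in this sketch.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark 1.x releases
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: docs is an RDD of (doc_id, tokens) pairs.
def tfidfById(docs: RDD[(String, Seq[String])]): RDD[(String, Vector)] = {
  // Split the pair RDD into one RDD per column.
  val ids: RDD[String] = docs.keys
  val tokens: RDD[Seq[String]] = docs.values

  // Term frequencies via the hashing trick; cached because IDF reads it twice.
  val tf: RDD[Vector] = new HashingTF().transform(tokens)
  tf.cache()

  // Fit IDF on the corpus, then rescale each TF vector.
  val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

  // Re-attach the ids; both sides come from `docs` via map-like steps.
  ids.zip(tfidf)
}
```

Whether relying on this ordering is comfortable is a separate question; the sketch only shows the mechanics.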

Re: Using TF-IDF from MLlib

2014-11-21 Thread andy petrella
Yeah, I initially used zip but I was wondering how reliable it is. I mean, is the order guaranteed? What if some node fails, and the data is pulled from different nodes? And even if it can work, I found this implicit semantic quite uncomfortable, don't you? My 0.2c. On Fri, 21 Nov 2014 at 15:26,

Using TF-IDF from MLlib

2014-11-20 Thread Daniel, Ronald (ELS-SDG)
Hi all, I want to try the TF-IDF functionality in MLlib. I can feed it words and generate the tf and idf RDD[Vector]s, using the code below. But how do I get this back to words and their counts and tf-idf values for presentation? val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")

Re: Using TF-IDF from MLlib

2014-11-20 Thread andy petrella
/Someone will correct me if I'm wrong./ Actually, TF-IDF scores terms for a given document, and specifically TF. Internally, these things are holding a Vector (hopefully sparse) representing all the possible words (up to 2²⁰) per document. So each document, after applying TF, will be transformed in
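To make the hashed-vector idea above concrete, here is a minimal plain-Scala sketch of the hashing trick, with no Spark dependency. Everything in it is illustrative: the object and method names are invented for this example, and Scala's built-in `##` hash is used as a stand-in, so the indices it produces should not be assumed to match those of any particular MLlib version. It shows why a vector index cannot be turned back into a word on its own, and how re-hashing a known vocabulary gives a lookup table from index to candidate words.

```scala
object HashingSketch {
  // Assumption: 2^20 buckets, the upper bound mentioned in the thread.
  val numFeatures: Int = 1 << 20

  // Non-negative modulus, so negative hash codes still land in [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Stand-in hash (Scala's `##`); a real library may use a different one.
  def indexOf(term: String): Int = nonNegativeMod(term.##, numFeatures)

  // Sparse term-frequency "vector" for one document, as index -> count.
  def termFrequencies(doc: Seq[String]): Map[Int, Double] =
    doc.groupBy(indexOf).map { case (i, ts) => i -> ts.size.toDouble }

  // Reverse lookup built by re-hashing a known vocabulary.
  // Colliding words share an index, hence a Set per index.
  def indexToTerms(vocab: Iterable[String]): Map[Int, Set[String]] =
    vocab.groupBy(indexOf).map { case (i, ts) => i -> ts.toSet }
}
```

For presentation, one could compute `indexToTerms` over the distinct words of the corpus and join it against the non-zero indices of each document vector; words that collide in the same bucket cannot be told apart.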