HashingTF was not designed to handle your case. You can try CountVectorizer, which keeps the original terms as a vocabulary, so indices can be mapped back to terms for retrieval. Note that CountVectorizer computes a global term-to-index map, which can be expensive for a large corpus and carries a risk of OOM. IDF can accept feature vectors generated by either HashingTF or CountVectorizer. FYI http://spark.apache.org/docs/latest/ml-features.html#tf-idf
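A minimal sketch of the CountVectorizer route, using the DataFrame-based spark.ml API (the column names and the `spark` SparkSession are illustrative, not from your snippet):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// Assumes an existing SparkSession named `spark`.
val docs = spark.createDataFrame(Seq(
  (0, Seq("Mars", "Jupiter")),
  (1, Seq("Venus", "Mars"))
)).toDF("id", "terms")

// Fit the vocabulary; cvModel.vocabulary(i) is the term behind vector index i.
val cvModel = new CountVectorizer()
  .setInputCol("terms")
  .setOutputCol("tf")
  .fit(docs)

val tf = cvModel.transform(docs)
val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(tf)
val tfidf = idfModel.transform(tf)

// The vocabulary array gives you the index -> term mapping you were missing.
println(cvModel.vocabulary.mkString(", "))
```

With this, each index in the "tfidf" vector column maps directly to `cvModel.vocabulary(index)`, so you can recover (term : weight) pairs.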
Thanks
Yanbo

On Thu, Oct 20, 2016 at 10:00 AM, Ciumac Sergiu <ciumac.ser...@gmail.com> wrote:
> Hello everyone,
>
> I'm having a usage issue with the HashingTF class from Spark MLlib.
>
> I'm computing TF.IDF on a set of terms/documents, which I later use to
> identify the most important ones in each of the input documents.
>
> Below is a short code snippet which outlines the example (2 documents with
> 2 words each, executed on Spark 2.0).
>
> val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", "Jupiter"), Seq("Venus", "Mars")))
> val hashingTF = new HashingTF()
> val tf = hashingTF.transform(documentsToEvaluate)
> tf.cache()
> val idf = new IDF().fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
> documentsToEvaluate.zip(tfidf).saveAsTextFile("/tmp/tfidf")
>
> The computation yields the following result:
>
> (List(Mars, Jupiter),(1048576,[593437,962819],[0.4054651081081644,0.0]))
> (List(Venus, Mars),(1048576,[798918,962819],[0.4054651081081644,0.0]))
>
> My concern is that I can't get a mapping between the TF.IDF weights and the
> initial terms (i.e. Mars : 0.0, Jupiter : 0.4, Venus : 0.4; you may notice
> that the weights and term indices do not correspond after zipping the two
> sequences). I can only identify the (hash : weight) mappings (i.e. 593437 : 0.4).
>
> I know HashingTF uses the hashing trick to compute TF. My question is whether
> it is possible to retrieve the term/weight mapping, or whether HashingTF was
> not designed to handle this use case. If the latter, what other implementation
> of TF.IDF would you recommend?
>
> I could continue the computation with the (*hash:weight*) tuples, though
> getting the initial (*term:weight*) pairs would make debugging a lot easier
> later down the pipeline.
>
> Any response will be greatly appreciated!
>
> Regards,
> Sergiu Ciumac
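Also, if you want to stay with the RDD-based HashingTF, one workaround is to recompute each term's index with HashingTF.indexOf and look the weight up in the TF.IDF vector. A sketch, reusing the `hashingTF`, `documentsToEvaluate`, and `tfidf` values from the snippet above (caveat: colliding terms share an index and hence a weight):

```scala
// Pair every term with the weight stored at its hashed index.
// hashingTF must be the same instance (same numFeatures) used to build tf.
val termWeights = documentsToEvaluate.zip(tfidf).map { case (terms, vector) =>
  terms.map(term => term -> vector(hashingTF.indexOf(term)))
}
termWeights.collect().foreach(println)
```

This recovers (term : weight) pairs per document without building a global vocabulary, at the cost of ambiguity under hash collisions.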