HashingTF was not designed to handle your case, because it does not keep
a mapping from the hashed indices back to the terms. You can try
CountVectorizer instead, which keeps the original terms as a vocabulary
that you can use for lookup. Note that CountVectorizer computes a global
term-to-index map, which can be expensive for a large corpus and carries
a risk of OOM. IDF accepts feature vectors generated by either HashingTF
or CountVectorizer.
FYI http://spark.apache.org/docs/latest/ml-features.html#tf-idf
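
As a rough illustration of the CountVectorizer route, here is a minimal
sketch using the DataFrame-based API (spark.ml). The column names
("id", "words", "tf", "tfidf"), the app name, and the helper values
docs, termIndex and perDoc are only assumptions made for this example.
It also assumes the default CountVectorizer settings, so no term is
dropped from the vocabulary.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, IDF}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

// In spark-shell this simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tfidf-terms") // illustrative name
  .getOrCreate()
import spark.implicits._

// Same two documents as in the original question.
val docs = Seq(
  (0, Seq("Mars", "Jupiter")),
  (1, Seq("Venus", "Mars"))
).toDF("id", "words")

// Fit a vocabulary: this is the global term-to-index map.
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("tf")
  .fit(docs)

val tf = docs.transform(cvModel.transform)
val tfidf = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")
  .fit(tf)
  .transform(tf)

// vocabulary(i) is the term stored at vector index i.
val termIndex: Map[String, Int] = cvModel.vocabulary.zipWithIndex.toMap

// For each document, look up the TF-IDF weight of each of its terms.
val perDoc: Array[Seq[(String, Double)]] =
  tfidf.select("words", "tfidf").collect().map { row =>
    val words = row.getSeq[String](0)
    val vec = row.getAs[Vector](1)
    // Terms missing from the vocabulary are skipped rather than failing.
    words.flatMap(w => termIndex.get(w).map(i => w -> vec(i)))
  }

perDoc.foreach(doc => println(doc.mkString(", ")))
```

With the two sample documents this should give Mars a weight of 0.0 and
Jupiter and Venus the same non-zero weight, in line with the output
quoted below, since spark.ml IDF uses the same smoothed formula.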

Thanks
Yanbo

On Thu, Oct 20, 2016 at 10:00 AM, Ciumac Sergiu <ciumac.ser...@gmail.com>
wrote:

> Hello everyone,
>
> I'm having a usage issue with the HashingTF class from Spark MLlib.
>
> I'm computing TF.IDF on a set of terms/documents, which I later use to
> identify the most important terms in each of the input documents.
>
> Below is a short code snippet which outlines the example (2 documents with
> 2 words each, executed on Spark 2.0).
>
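> import org.apache.spark.mllib.feature.{HashingTF, IDF}
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.rdd.RDD
>
> // Two documents with two terms each
> val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", "Jupiter"),
>   Seq("Venus", "Mars")))
> // Hash each term into a fixed-size feature space (default 2^20 features)
> val hashingTF = new HashingTF()
> val tf = hashingTF.transform(documentsToEvaluate)
> tf.cache()
> // Fit IDF on the term frequencies, then rescale them
> val idf = new IDF().fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
> documentsToEvaluate.zip(tfidf).saveAsTextFile("/tmp/tfidf")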
>
> The computation yields the following result:
>
> (List(Mars, Jupiter),(1048576,[593437,962819],[0.4054651081081644,0.0]))
> (List(Venus, Mars),(1048576,[798918,962819],[0.4054651081081644,0.0]))
>
> My concern is that I can't get a mapping from the TF.IDF weights back to
> the initial terms (i.e. Mars : 0.0, Jupiter : 0.4, Venus : 0.4). You may
> notice that the weights and term indices do not line up when the two
> sequences are zipped. I can only identify the hash mappings (i.e.
> 593437 : 0.4).
>
> I know HashingTF uses the hashing trick to compute TF. My question is: is
> it possible to retrieve the term / weight mapping, or was HashingTF not
> designed to handle this use case? If the latter, what other implementation
> of TF.IDF would you recommend?
>
> I could continue the computation with the (*hash:weight*) tuples, though
> getting the initial (*term:weight*) pairs would make debugging much easier
> later down the pipeline.
>
> Any response will be greatly appreciated!
>
> Regards, Sergiu Ciumac
>
