Hi,

I had the same problem and did not find a solution, so I used Word2Vec instead. I am interested in a solution to this problem of how to go back from the TF-IDF hash to the word.

Regards,
Clark
On Tuesday, 4 August 2015 at 13:03, Yanbo Liang <yblia...@gmail.com> wrote:

It cannot translate the number back to the word unless you store them in a map yourself.

2015-07-31 1:45 GMT+08:00 hans ziqiu li <thenewh...@gmail.com>:

Hello spark users!

I am having some trouble with TF-IDF in MLlib and was wondering if anyone could point me in the right direction. The data ingestion and the initial term frequency count code, taken from the example, work fine (I am using the first example from this page: https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html).

Below is my input data:

WrappedArray((Frank, spent, Friday, afternoon, at, labs, test, test, test, test, test, test, test, test, test))
WrappedArray((we, are, testing, the, algorithm, with, us, test, test, test, test, test, test, test, test))
WrappedArray((hello, my, name, is, Hans, and, I, am, testing, TFIDF, test, test, test, test, test))
WrappedArray((TFIDF, is, an, amazing, algorithm, that, is, used, for, spam, filtering, and, search, test, test))
WrappedArray((Accenture, is, doing, great, test, test, test, test, test, test, test, test, test, test, test))

Here’s the output:

(1048576,[1065,1463,33868,34122,34252,337086,420523,603314,717226,767673,839152,876983],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0])
(1048576,[1463,6313,33869,34122,118216,147517,162737,367946,583529,603314,605639,646109,876983,972879],[1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
(1048576,[20311,34122,340246,603314,778861,876983],[1.0,1.0,1.0,10.0,1.0,1.0])
(1048576,[33875,102986,154015,267598,360614,603314,690972,876983],[1.0,1.0,1.0,1.0,1.0,8.0,1.0,1.0])
(1048576,[1588,19537,34494,42230,603314,696550,839152,876983,972879],[1.0,1.0,1.0,1.0,7.0,1.0,1.0,1.0,1.0])

The problem I am having here is that the output from HashingTF is not ordered like the original sentence. I understand that the integer “603314” in the output stands for the word “ test” in the input.
But how would I programmatically translate the number back to the word so I know which words are most common? Please let me know your thoughts!

I am not sure how helpful these are going to be, but here are the things I noticed when I was looking into the source code of TFIDF:

1. def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures) --> This line of code takes the term's hash code (term.##, not its ASCII value) and calculates that hash code modulo numFeatures (which defaults to 2^20).

2. Then def transform(document: Iterable[_]): Vector = { blah blah blah} --> This part of the code does the counting and splits the result into two separate arrays (indices and counts) using Vectors.sparse.

Thanks in advance and I hope to hear from you soon!

Best,
Hans

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TFIDF-Transformation-tp24086.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
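
For anyone finding this thread later: the suggestion above of storing the terms in a map yourself could look roughly like the sketch below. It is a standalone re-implementation of the indexOf line quoted in the thread, so it runs without Spark; the object name ReverseHashingIndex and the helper reverseIndex are made up for illustration and are not part of MLlib. In a real job you would presumably call indexOf on the same HashingTF instance you use for transform instead of re-implementing it.

```scala
// Standalone sketch (assumption: mirrors the indexOf line quoted in the thread).
// ReverseHashingIndex and reverseIndex are hypothetical names, not MLlib APIs.
object ReverseHashingIndex {
  // 2^20 = 1048576, the default number of features mentioned in the thread.
  val numFeatures: Int = 1 << 20

  // Modulo that never returns a negative result, even for negative hash codes.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Bucket index of a term: its Scala hash code (##) modulo numFeatures.
  def indexOf(term: Any): Int = nonNegativeMod(term.##, numFeatures)

  // Map every bucket index back to the distinct terms that hash into it.
  // The value is a Set because different terms can collide in one bucket.
  def reverseIndex(docs: Seq[Seq[String]]): Map[Int, Set[String]] =
    docs.flatten.distinct
      .groupBy(t => indexOf(t))
      .map { case (index, terms) => index -> terms.toSet }

  def main(args: Array[String]): Unit = {
    // Tokens keep their leading space, as in the “ test” quoted in the thread.
    val docs = Seq(
      Seq("Accenture", " is", " doing", " great", " test", " test"),
      Seq("we", " are", " testing", " the", " algorithm", " test")
    )
    val lookup = reverseIndex(docs)
    val bucket = indexOf(" test")
    // Print the bucket index and the term(s) stored under it.
    println(s"$bucket -> ${lookup(bucket)}")
  }
}
```

With this re-implementation, indexOf(" test") (note the leading space) works out to 603314, the same index that appears in the output above, which suggests the tokens there kept their leading space. Because hashing is lossy, one index can map to several terms, hence the Set. Later Spark versions may use a different hash function, so it is safer to build the map with the indexOf of the HashingTF you actually run.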