HashingTF cannot translate the number back to the word, because the hashing is one-way. To recover the words you have to store the number-to-word mapping in a map yourself.
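A minimal sketch of what I mean, in plain Scala without Spark. It assumes the hashing is exactly the `indexOf` line quoted below (`term.##` modulo `numFeatures`, with `numFeatures` = 2^20); the object name `ReverseTermIndex`, the helper `buildReverseIndex` and the sample tokens are made up for illustration and are not part of MLlib. With that formula, `" test"` (with a leading space, as written in the question) lands on 603314, so the tokens appear to carry a leading space.

```scala
object ReverseTermIndex {
  // 2^20 features: the size (1048576) shown in the output vectors of the thread.
  val numFeatures: Int = 1 << 20

  // Non-negative modulo, in the spirit of the Utils.nonNegativeMod call quoted in the thread.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Same formula as the quoted indexOf: hash code of the term, modulo numFeatures.
  def indexOf(term: Any): Int = nonNegativeMod(term.##, numFeatures)

  // Record every distinct term under its index. The values are Sets because
  // two different terms can hash to the same index.
  def buildReverseIndex(docs: Seq[Seq[String]]): Map[Int, Set[String]] =
    docs.flatten.distinct
      .groupBy(term => indexOf(term))
      .map { case (index, terms) => (index, terms.toSet) }

  def main(args: Array[String]): Unit = {
    // Hypothetical tokens; the leading spaces are deliberate (see the note above).
    val docs = Seq(
      Seq("we", " are", " testing", " test", " test"),
      Seq("TFIDF", " is", " an", " algorithm", " test")
    )
    val reverse = buildReverseIndex(docs)
    val index = indexOf(" test") // 603314, the index seen in the thread's output
    println(s"$index -> ${reverse(index).mkString("[", ",", "]")}")
  }
}
```

Build the map from the same tokens you feed to HashingTF, then look up each index of the sparse vector in it. Because of collisions, an index may come back with more than one candidate word.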
2015-07-31 1:45 GMT+08:00 hans ziqiu li <thenewh...@gmail.com>:
> Hello Spark users!
>
> I am having some trouble with the TFIDF in MLlib and was wondering if
> anyone can point me in the right direction.
>
> The data ingestion and the initial term frequency count code taken from the
> example work fine (I am using the first example from this page:
> https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html).
>
> Below is my input data:
>
> WrappedArray((Frank, spent, Friday, afternoon, at, labs, test, test,
> test, test, test, test, test, test, test))
> WrappedArray((we, are, testing, the, algorithm, with, us, test,
> test, test, test, test, test, test, test))
> WrappedArray((hello, my, name, is, Hans, and, I, am, testing,
> TFIDF, test, test, test, test, test))
> WrappedArray((TFIDF, is, an, amazing, algorithm, that, is, used,
> for, spam, filtering, and, search, test, test))
> WrappedArray((Accenture, is, doing, great, test, test, test, test,
> test, test, test, test, test, test, test))
>
> Here’s the output:
>
> (1048576,[1065,1463,33868,34122,34252,337086,420523,603314,717226,767673,839152,876983],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0])
>
> (1048576,[1463,6313,33869,34122,118216,147517,162737,367946,583529,603314,605639,646109,876983,972879],[1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
>
> (1048576,[20311,34122,340246,603314,778861,876983],[1.0,1.0,1.0,10.0,1.0,1.0])
>
> (1048576,[33875,102986,154015,267598,360614,603314,690972,876983],[1.0,1.0,1.0,1.0,1.0,8.0,1.0,1.0])
>
> (1048576,[1588,19537,34494,42230,603314,696550,839152,876983,972879],[1.0,1.0,1.0,1.0,7.0,1.0,1.0,1.0,1.0])
>
> The problem I am having here is that the output from HashingTF is not
> ordered like the original sentence. I understand that the integer “603314”
> in the output stands for the word “ test” in the input. But how would I
> programmatically translate the number back to the word so I know which
> words are most common?
> Please let me know your thoughts!
>
> I am not sure how helpful these are going to be, but here are the things
> I’ve noticed while looking into the source code of TFIDF:
>
> 1. def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
>    --> This line of code hashes the term to its hash code (term.##, not an
>    ASCII value) and calculates that hash modulo numFeatures (which defaults
>    to 2^20).
> 2. Then def transform(document: Iterable[_]): Vector = { blah blah blah }
>    --> This part of the code does the counting and splits the counts into
>    two separate arrays using Vectors.sparse.
>
> Thanks in advance and I hope to hear from you soon!
> Best,
> Hans
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/TFIDF-Transformation-tp24086.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.