Hello Spark users, I hope your week is going fantastic! I am having some trouble with TFIDF in MLlib and was wondering if anyone can point me in the right direction.
The data ingestion and the initial term frequency count code, taken from the example, work fine (I am using the first example from this page: https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html).

Below is my input data:

WrappedArray((Frank, spent, Friday, afternoon, at, labs, test, test, test, test, test, test, test, test, test))
WrappedArray((we, are, testing, the, algorithm, with, us, test, test, test, test, test, test, test, test))
WrappedArray((hello, my, name, is, Hans, and, I, am, testing, TFIDF, test, test, test, test, test))
WrappedArray((TFIDF, is, an, amazing, algorithm, that, is, used, for, spam, filtering, and, search, test, test))
WrappedArray((Accenture, is, doing, great, test, test, test, test, test, test, test, test, test, test, test))

Here's the output:

(1048576,[1065,1463,33868,34122,34252,337086,420523,603314,717226,767673,839152,876983],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0])
(1048576,[1463,6313,33869,34122,118216,147517,162737,367946,583529,603314,605639,646109,876983,972879],[1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
(1048576,[20311,34122,340246,603314,778861,876983],[1.0,1.0,1.0,10.0,1.0,1.0])
(1048576,[33875,102986,154015,267598,360614,603314,690972,876983],[1.0,1.0,1.0,1.0,1.0,8.0,1.0,1.0])
(1048576,[1588,19537,34494,42230,603314,696550,839152,876983,972879],[1.0,1.0,1.0,1.0,7.0,1.0,1.0,1.0,1.0])

The problem I am having is that the output from HashingTF is not ordered like the original sentence. I understand that the integer "603314" in the output stands for the word " test" in the input, but how would I programmatically translate the number back to the word so I know which words are most common? Please let me know your thoughts!

I am not sure how helpful these are going to be, but here are the things I noticed when I was looking into the source code of TFIDF:

1. def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
   ----> This line takes the term's hash code (term.##) and computes it modulo numFeatures (which defaults to 2^20).

2. def transform(document: Iterable[_]): Vector = { blah blah blah }
   ----> This part of the code does the counting and packs the result into a sparse vector using Vectors.sparse, which stores the indices and the counts as two separate arrays.

Thanks in advance and I hope to hear from you soon!

Best,
Hans
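
P.S. Since the hash is one-way, the only workaround I can think of is to hash every known term again and keep a map from index back to term(s). Below is a minimal sketch of that idea, not anything taken from MLlib itself. It assumes the indexOf formula quoted above, re-implements nonNegativeMod locally (the Spark helper is not public), and uses hypothetical names (ReverseHashLookup, reverseIndex) plus a tiny made-up document set. Terms that collide would share an index, which is why the map values are sets.

```scala
object ReverseHashLookup {
  // Assumed default feature count; matches the vector size 1048576 in the output above.
  val numFeatures: Int = 1 << 20

  // Local stand-in for Spark's internal Utils.nonNegativeMod (an assumption, written by hand).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Same formula as the indexOf line quoted from HashingTF.
  def indexOf(term: Any): Int = nonNegativeMod(term.##, numFeatures)

  // Hash each distinct term once and group terms by the index they land on.
  // Colliding terms end up in the same set.
  def reverseIndex(docs: Seq[Seq[String]]): Map[Int, Set[String]] =
    docs.flatten.distinct
      .groupBy(t => indexOf(t))
      .map { case (index, terms) => (index, terms.toSet) }

  def main(args: Array[String]): Unit = {
    // Hypothetical tokens; note the leading space, as in " test" above.
    val docs = Seq(
      Seq("hello", " my", " test", " test"),
      Seq("TFIDF", " is", " test")
    )
    val lookup = reverseIndex(docs)
    println(lookup.getOrElse(indexOf(" test"), Set.empty[String]))
  }
}
```

With these definitions, indexOf(" test") (with the leading space) evaluates to 603314, the same index that shows up in every output vector above, so the lookup map can be applied to the indices of each sparse vector to recover the candidate words.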