TFIDF Transformation

ziqiu.li Wed, 29 Jul 2015 23:39:07 -0700

Hello spark users,

I hope your week is going fantastic! I am having some trouble with the TFIDF 
in MLlib and was wondering if anyone could point me in the right direction.


The data ingestion and the initial term frequency count code taken from the 
example works fine (I am using the first example from this page: 
https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html).

Below is my input data:

WrappedArray((Frank,  spent,  Friday,  afternoon,  at,  labs,  test,  test,  
test,  test,  test,  test,  test,  test,  test))
WrappedArray((we,  are,  testing,  the,  algorithm,  with,  us,  test,  test,  
test,  test,  test,  test,  test,  test))
WrappedArray((hello,  my,  name,  is,  Hans,  and,  I,  am,  testing,  TFIDF,  
test,  test,  test,  test,  test))
WrappedArray((TFIDF,  is,  an,  amazing,  algorithm,  that,  is,  used,  for,  
spam,  filtering,  and,  search,  test,  test))
WrappedArray((Accenture,  is,  doing,  great,  test,  test,  test,  test,  
test,  test,  test,  test,  test,  test,  test))

Here's the output:

(1048576,[1065,1463,33868,34122,34252,337086,420523,603314,717226,767673,839152,876983],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0])
(1048576,[1463,6313,33869,34122,118216,147517,162737,367946,583529,603314,605639,646109,876983,972879],[1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
(1048576,[20311,34122,340246,603314,778861,876983],[1.0,1.0,1.0,10.0,1.0,1.0])
(1048576,[33875,102986,154015,267598,360614,603314,690972,876983],[1.0,1.0,1.0,1.0,1.0,8.0,1.0,1.0])
(1048576,[1588,19537,34494,42230,603314,696550,839152,876983,972879],[1.0,1.0,1.0,1.0,7.0,1.0,1.0,1.0,1.0])

The problem I am having is that the output from HashingTF is not ordered 
like the original sentence. I understand that the integer "603314" in the 
output stands for the word " test" in the input, but how would I 
programmatically translate the number back to the word so I know which words 
are most common? Please let me know your thoughts!
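
One way the reverse lookup might be sketched: hash every distinct term with 
the same function and keep an index-to-term map. This is plain Scala and only 
a sketch. The local indexOf is an assumption that mirrors the HashingTF.indexOf 
line quoted in this message, and ReverseIndexSketch, reverseIndex and topTerms 
are hypothetical names, not part of MLlib. Since different terms can hash to 
the same index, each index maps to a set of terms.

```scala
object ReverseIndexSketch {
  // Assumption: same size as the vectors shown above (1048576 = 2^20).
  val numFeatures: Int = 1 << 20

  // Assumption: mirrors the quoted HashingTF.indexOf line, i.e. the term's
  // hash code reduced to a non-negative value modulo numFeatures.
  def indexOf(term: Any): Int = {
    val rawMod = term.## % numFeatures
    rawMod + (if (rawMod < 0) numFeatures else 0)
  }

  // Map each hashed index back to the distinct terms that produced it.
  // More than one term under an index means a hash collision.
  def reverseIndex(docs: Seq[Seq[String]]): Map[Int, Set[String]] =
    docs.flatten.distinct
      .groupBy(t => indexOf(t))
      .map { case (i, terms) => (i, terms.toSet) }

  // Total occurrences per hashed index, most frequent first,
  // reported with the term(s) behind that index.
  def topTerms(docs: Seq[Seq[String]], n: Int): Seq[(Set[String], Int)] = {
    val lookup = reverseIndex(docs)
    docs.flatten
      .groupBy(t => indexOf(t))
      .toSeq
      .map { case (i, occurrences) => (lookup(i), occurrences.size) }
      .sortBy { case (_, count) => -count }
      .take(n)
  }
}
```

With Spark on the classpath, the indexOf method of the HashingTF instance that 
produced the vectors would presumably replace the local helper, so the lookup 
is guaranteed to agree with the indices in the output.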

I am not sure how helpful these are going to be, but here are the things I 
noticed while looking into the source code of HashingTF:

1. def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures) 
----> This line of code takes the term's hash code (term.##) and calculates 
that hash modulo numFeatures (which defaults to 2^20).
2. Then def transform(document: Iterable[_]): Vector = { blah blah blah} ---> 
This part of the code counts how often each index occurs and returns the 
result through Vectors.sparse as two parallel arrays: indices and counts.
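
The two steps above can be sketched in plain Scala as follows. This is a 
simplified stand-in rather than the MLlib source: HashingSketch is a 
hypothetical name, nonNegativeMod is written out by hand on the assumption 
that it behaves as its name suggests, and a plain tuple of (size, indices, 
values) stands in for the sparse Vector.

```scala
import scala.collection.mutable

object HashingSketch {
  // Non-negative remainder, so negative hash codes still land in [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Step 1: hash the term and fold it into the feature range.
  def indexOf(term: Any, numFeatures: Int): Int =
    nonNegativeMod(term.##, numFeatures)

  // Step 2: count occurrences per index, then emit the parts of a sparse
  // vector: its size, the sorted indices, and the matching counts.
  def transform(
      document: Iterable[Any],
      numFeatures: Int = 1 << 20): (Int, Array[Int], Array[Double]) = {
    val counts = mutable.HashMap.empty[Int, Double]
    document.foreach { term =>
      val i = indexOf(term, numFeatures)
      counts(i) = counts.getOrElse(i, 0.0) + 1.0
    }
    val sorted = counts.toSeq.sortBy(_._1)
    (numFeatures, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
}
```

Sorting by index is what loses the original word order, and the hash itself 
is one-way, which is why the output cannot be read back into words without 
keeping a separate lookup.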


Thanks in advance and I hope to hear from you soon!
Best,
Hans

