Hi, I had the same problem and didn't find a solution. I used Word2Vec
instead.
I am interested in a solution to this problem of how to go back from the
TF-IDF hash to the word.
Regards,
Clark
 


     On Tuesday, 4 August 2015 at 13:03, Yanbo Liang <yblia...@gmail.com> wrote:
   

 The hashing is one-way: it cannot translate the number back to the word
unless you store the word-to-index map yourself.
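
As a rough illustration of "storing the map yourself", here is a minimal sketch in plain Scala with no Spark dependency. It reuses the indexOf arithmetic quoted further down in this thread; `ReverseHashIndex` and `buildReverseIndex` are hypothetical names made up for this example, not part of MLlib. Because different terms can collide on the same index, each index maps to a set of terms rather than a single word.

```scala
// Sketch only: plain Scala, no Spark dependency.
object ReverseHashIndex {
  // Default feature dimension mentioned in the thread (2^20).
  val numFeatures: Int = 1 << 20

  // Modulo that never returns a negative result, as the quoted indexOf requires.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Same arithmetic as the indexOf quoted in the thread: hash code modulo numFeatures.
  def indexOf(term: Any): Int = nonNegativeMod(term.##, numFeatures)

  // Hypothetical helper: maps each hashed index back to every distinct term that lands on it.
  def buildReverseIndex(docs: Seq[Seq[String]]): Map[Int, Set[String]] =
    docs.flatten.distinct
      .groupBy(t => indexOf(t))
      .map { case (index, terms) => index -> terms.toSet }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      Seq("Frank", "spent", "Friday", "afternoon", "at", "labs", "test"),
      Seq("we", "are", "testing", "the", "algorithm", "with", "us", "test"))
    val reverse = buildReverseIndex(docs)
    val idx = indexOf("test")
    // Look up which terms produced this index.
    println(s"$idx -> ${reverse(idx)}")
  }
}
```

With Spark itself, the same idea would presumably call indexOf on the HashingTF instance that produced the vectors, so that the stored indices match the ones in the output.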
2015-07-31 1:45 GMT+08:00 hans ziqiu li <thenewh...@gmail.com>:

Hello spark users!

I am having some trouble with TF-IDF in MLlib and was wondering if
anyone can point me in the right direction.

The data ingestion and the initial term frequency count code taken from the
example works fine (I am using the first example from this page:
https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html).

Below is my input data:

WrappedArray((Frank,  spent,  Friday,  afternoon,  at,  labs,  test,  test,
test,  test,  test,  test,  test,  test,  test))
WrappedArray((we,  are,  testing,  the,  algorithm,  with,  us,  test,
test,  test,  test,  test,  test,  test,  test))
WrappedArray((hello,  my,  name,  is,  Hans,  and,  I,  am,  testing,
TFIDF,  test,  test,  test,  test,  test))
WrappedArray((TFIDF,  is,  an,  amazing,  algorithm,  that,  is,  used,
for,  spam,  filtering,  and,  search,  test,  test))
WrappedArray((Accenture,  is,  doing,  great,  test,  test,  test,  test,
test,  test,  test,  test,  test,  test,  test))

Here’s the output:

(1048576,[1065,1463,33868,34122,34252,337086,420523,603314,717226,767673,839152,876983],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0])
(1048576,[1463,6313,33869,34122,118216,147517,162737,367946,583529,603314,605639,646109,876983,972879],[1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
(1048576,[20311,34122,340246,603314,778861,876983],[1.0,1.0,1.0,10.0,1.0,1.0])
(1048576,[33875,102986,154015,267598,360614,603314,690972,876983],[1.0,1.0,1.0,1.0,1.0,8.0,1.0,1.0])
(1048576,[1588,19537,34494,42230,603314,696550,839152,876983,972879],[1.0,1.0,1.0,1.0,7.0,1.0,1.0,1.0,1.0])

The problem I am having here is that the output from HashingTF is not
ordered like the original sentence. I understand that the integer “603314”
in the output stands for the word “ test” in the input, but how would I
programmatically translate the number back to the word so I know which words
are most common? Please let me know your thoughts!

I am not sure how helpful these are going to be but here are the things I’ve
noticed when I was looking into the source code of TFIDF:

1. def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
   -> This line takes the term's hash code (term.##) and computes that hash
   modulo numFeatures (which defaults to 2^20).
2. Then def transform(document: Iterable[_]): Vector = { ... }
   -> This part of the code does the counting and splits the result into two
   separate arrays (indices and counts) using Vectors.sparse.
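
The two steps above can be sketched in plain Scala without Spark. This is only an approximation of what the quoted source does, under the assumption that the hash is term.## as shown. `HashingTfSketch` is a hypothetical name, and the tuple it returns stands in for the sparse vector. It also shows why the original word order is lost: the output is keyed and sorted by hashed index, not by position in the sentence.

```scala
// Sketch only: mimics the hashing and counting steps described above.
object HashingTfSketch {
  // Modulo that never returns a negative result.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Step 1: hash code of the term, reduced modulo numFeatures.
  def indexOf(term: Any, numFeatures: Int): Int =
    nonNegativeMod(term.##, numFeatures)

  // Step 2: count occurrences per hashed index, then split into two parallel arrays.
  // Returns (size, sorted indices, counts), mirroring the printed output in the thread.
  def transform(document: Iterable[Any],
                numFeatures: Int = 1 << 20): (Int, Array[Int], Array[Double]) = {
    val counts = scala.collection.mutable.HashMap.empty[Int, Double]
    document.foreach { term =>
      val i = indexOf(term, numFeatures)
      counts(i) = counts.getOrElse(i, 0.0) + 1.0
    }
    val sorted = counts.toSeq.sortBy(_._1)
    (numFeatures, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }

  def main(args: Array[String]): Unit = {
    val (size, indices, values) = transform(Seq("we", "are", "testing", "test", "test"))
    println(s"($size,${indices.mkString("[", ",", "]")},${values.mkString("[", ",", "]")})")
  }
}
```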


Thanks in advance and I hope to hear from you soon!
Best,
Hans




