Re: TFIDF Transformation

Yanbo Liang Tue, 04 Aug 2015 04:04:41 -0700

HashingTF cannot translate the number back to the word. To do that, you have
to store the index-to-word map yourself.
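
A minimal sketch of what storing the map could look like, in plain Scala with
no Spark dependency. The names ReverseIndex and buildReverseIndex are made up
for illustration, and indexOf here just copies the formula quoted further down
in this thread; with Spark on the classpath you would presumably call
HashingTF's own indexOf instead. Each index maps to a Set of terms because two
different terms can hash to the same index.

```scala
object ReverseIndex {
  // Default feature dimension mentioned in the thread (2^20).
  val numFeatures: Int = 1 << 20

  // Same idea as Utils.nonNegativeMod: keep the remainder in [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Mirrors the indexOf quoted in the thread: hash code modulo numFeatures.
  def indexOf(term: Any): Int = nonNegativeMod(term.##, numFeatures)

  // Hypothetical helper: index -> every distinct term that hashes to it.
  def buildReverseIndex(docs: Seq[Seq[String]]): Map[Int, Set[String]] =
    docs.flatten.distinct
      .groupBy(t => indexOf(t))
      .map { case (index, terms) => index -> terms.toSet }

  def main(args: Array[String]): Unit = {
    // Tokens keep their leading space, as in the input shown in the thread.
    val docs = Seq(Seq("Frank", " spent", " test", " test"))
    val reverse = buildReverseIndex(docs)
    // Look up which term(s) sit behind the index reported by HashingTF.
    println(reverse.getOrElse(603314, Set.empty[String]))
  }
}
```

Build the map from the same tokenized documents you feed to HashingTF, then
look up the indices of the largest values in each output vector.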


2015-07-31 1:45 GMT+08:00 hans ziqiu li <thenewh...@gmail.com>:

> Hello spark users!
>
> I am having some trouble with TF-IDF in MLlib and was wondering if
> anyone can point me in the right direction.
>
> The data ingestion and the initial term frequency count code taken from the
> example works fine (I am using the first example from this page:
> https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html).
>
> Below is my input data:
>
> WrappedArray((Frank,  spent,  Friday,  afternoon,  at,  labs,  test,  test,
> test,  test,  test,  test,  test,  test,  test))
> WrappedArray((we,  are,  testing,  the,  algorithm,  with,  us,  test,
> test,  test,  test,  test,  test,  test,  test))
> WrappedArray((hello,  my,  name,  is,  Hans,  and,  I,  am,  testing,
> TFIDF,  test,  test,  test,  test,  test))
> WrappedArray((TFIDF,  is,  an,  amazing,  algorithm,  that,  is,  used,
> for,  spam,  filtering,  and,  search,  test,  test))
> WrappedArray((Accenture,  is,  doing,  great,  test,  test,  test,  test,
> test,  test,  test,  test,  test,  test,  test))
>
> Here’s the output:
>
>
> (1048576,[1065,1463,33868,34122,34252,337086,420523,603314,717226,767673,839152,876983],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0])
>
> (1048576,[1463,6313,33869,34122,118216,147517,162737,367946,583529,603314,605639,646109,876983,972879],[1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
>
> (1048576,[20311,34122,340246,603314,778861,876983],[1.0,1.0,1.0,10.0,1.0,1.0])
>
> (1048576,[33875,102986,154015,267598,360614,603314,690972,876983],[1.0,1.0,1.0,1.0,1.0,8.0,1.0,1.0])
>
> (1048576,[1588,19537,34494,42230,603314,696550,839152,876983,972879],[1.0,1.0,1.0,1.0,7.0,1.0,1.0,1.0,1.0])
>
> The problem I am having is that the output from HashingTF is not
> ordered like the original sentence. I understand that the integer “603314”
> in the output stands for the word “ test” in the input, but how would I
> programmatically translate the number back to the word so I know which
> words are most common? Please let me know your thoughts!
>
> I am not sure how helpful these are going to be, but here are the things
> I’ve noticed while looking into the source code of TFIDF:
>
> 1. def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
>    --> This line hashes the term to its hash code (term.##, not an ASCII
>    value) and takes that hash modulo numFeatures (which defaults to 2^20).
> 2. Then def transform(document: Iterable[_]): Vector = { blah blah blah }
>    --> This part of the code does the counting and splits the result into
>    two separate arrays (indices and values) using Vectors.sparse.
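
The two steps above can be sketched in plain Scala roughly as follows. This is
not the MLlib source, only an illustration under the assumption that transform
counts terms per hashed index and returns them sorted by index; the object
name TermFrequencySketch and the tuple return type are invented for the
example (the real method returns a sparse Vector).

```scala
import scala.collection.mutable

object TermFrequencySketch {
  // Default feature dimension mentioned in the thread (2^20).
  val numFeatures: Int = 1 << 20

  // Hash code modulo numFeatures, shifted into [0, numFeatures).
  def indexOf(term: Any): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count occurrences per hashed index, then return the two parallel
  // arrays (indices, values) that a sparse vector would hold.
  def transform(document: Iterable[Any]): (Array[Int], Array[Double]) = {
    val counts = mutable.HashMap.empty[Int, Double]
    document.foreach { term =>
      val i = indexOf(term)
      counts(i) = counts.getOrElse(i, 0.0) + 1.0
    }
    val sorted = counts.toSeq.sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }

  def main(args: Array[String]): Unit = {
    val (indices, values) = transform(Seq(" test", " test", "x"))
    println(indices.mkString("[", ",", "]"))
    println(values.mkString("[", ",", "]"))
  }
}
```

Because the result is sorted by index rather than by position in the
sentence, the original word order is lost, which matches the output shown
above.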
>
>
> Thanks in advance and I hope to hear from you soon!
> Best,
> Hans
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/TFIDF-Transformation-tp24086.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
