RE: TF-IDF Question

Somnath Pandeya Thu, 04 Jun 2015 21:56:58 -0700

Hi,

org.apache.spark.mllib.linalg.Vector = 
(1048576,[35587,884670],[3.458767233,3.458767233])
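This is the sparse vector representation of the terms.
The first element (1048576) is the length of the vector. It is 2^20, the
default number of features in HashingTF, which is why it is always the same.
[35587,884670] are the indices of the words.
[3.458767233,3.458767233] are the tf-idf values of those terms.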
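
To make the layout concrete, here is a minimal plain-Python sketch (no Spark
needed) that unpacks the same triple into (index, value) pairs. The function
name sparse_to_pairs is made up for illustration; in Spark itself the object
is an MLlib SparseVector with size, indices and values fields.

```python
def sparse_to_pairs(size, indices, values):
    """Pair each non-zero index with its value; every other position is 0.0."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if any(i < 0 or i >= size for i in indices):
        raise ValueError("index out of range for a vector of length %d" % size)
    return list(zip(indices, values))


# The vector from this thread: length 1048576 with two non-zero entries.
size = 1048576
pairs = sparse_to_pairs(size, [35587, 884670], [3.458767233, 3.458767233])
for index, weight in pairs:
    print(index, weight)
```

All positions other than 35587 and 884670 are implicitly zero, which is why
only two of the 1048576 slots are stored.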

Thanks
Somnath

From: franco barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, June 04, 2015 11:17 PM
To: user@spark.apache.org
Subject: TF-IDF Question

Hi all,

I have a .txt file where each row is a collection of the terms of a 
document, separated by spaces. For example:

1 "Hola spark"
2 ..

I followed this example from the Spark site, 
https://spark.apache.org/docs/latest/mllib-feature-extraction.html, and I get 
something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector = 
(1048576,[35587,884670],[3.458767233,3.458767233])

I think this:

  1.  First parameter "1048576": I don't know what it is, but it is always the 
same number (maybe the number of terms).
  2.  Second parameter "[35587,884670]": I think these are the terms of the 
first line in my .txt file.
  3.  Third parameter "[3.458767233,3.458767233]": I think these are the tf-idf 
values for my terms.
Does anyone know the exact interpretation of this? And on the second point, if 
these values are the terms, how can I match them with the original terms 
("[35587=>Hola,884670=>spark]")?
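
On that second point: the indices come from hashing each term, so the vector
itself does not store the words. One possible way to get them back is to run
the same index function over the original terms and keep a reverse table. The
sketch below is plain Python and uses a stand-in index_of (a hypothetical
helper based on CRC32, not Spark's hash), so its indices will not match 35587
or 884670; with Spark, the idea would be to call indexOf on the same HashingTF
instance instead.

```python
import zlib
from collections import defaultdict

NUM_FEATURES = 1 << 20  # 1048576, the vector length seen in this thread


def index_of(term, num_features=NUM_FEATURES):
    """Stand-in hash (CRC32), NOT Spark's; swap in hashingTF.indexOf(term)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def build_reverse_index(documents, index_fn=index_of):
    """Map each hashed index to the set of terms that landed on it."""
    reverse = defaultdict(set)
    for doc in documents:
        for term in doc.split(" "):
            reverse[index_fn(term)].add(term)
    return reverse


docs = ["Hola spark"]
reverse = build_reverse_index(docs)
for idx in sorted(reverse):
    print(idx, sorted(reverse[idx]))
```

Each index maps to a set rather than a single word because two different terms
can hash to the same index, in which case the vector alone cannot tell them
apart.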

Regards and thanks in advance.

Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
