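Hi,

org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233])

This is the sparse vector representation of the terms in the document:

1. 1048576 is the length of the vector. It is 2^20, the default number of features in HashingTF, which is why it is always the same number.
2. [35587,884670] are the indices of the words, i.e. the positions in the vector that the terms were hashed to.
3. [3.458767233,3.458767233] are the TF-IDF values of the terms, in the same order as the indices.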
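
To make the (length, indices, values) layout concrete, here is a minimal plain-Python sketch that reads the triple from the output above. It does not use Spark; the helper name sparse_lookup is made up for illustration.

```python
# Plain-Python illustration of the (size, indices, values) sparse layout.
# The numbers are copied from the tfidf.first() output quoted above.
size = 1048576
indices = [35587, 884670]
values = [3.458767233, 3.458767233]

def sparse_lookup(size, indices, values, i):
    """Return the value stored at position i (hypothetical helper).

    Positions that are not listed in `indices` are implicitly 0.0.
    """
    if not 0 <= i < size:
        raise IndexError(i)
    for idx, val in zip(indices, values):
        if idx == i:
            return val
    return 0.0

# Only the listed positions carry a weight; pair them up as index -> value.
nonzero = dict(zip(indices, values))
```

Every position other than 35587 and 884670 is zero, so only two of the 1048576 entries need to be stored.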
Thanks
Somnath

From: franco barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, June 04, 2015 11:17 PM
To: user@spark.apache.org
Subject: TF-IDF Question

Hi all!,

I have a .txt file where each row is a collection of terms of a document, separated by spaces. For example:

1 "Hola spark"
2 ..

I followed this example from the Spark site, https://spark.apache.org/docs/latest/mllib-feature-extraction.html, and I get something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233])

I think this:

1. First parameter "1048576": I don't know what it is, but it's always the same number (maybe the number of terms).
2. Second parameter "[35587,884670]": I think these are the terms of the first line in my .txt file.
3. Third parameter "[3.458767233,3.458767233]": I think these are the TF-IDF values for my terms.

Does anyone know the exact interpretation of this? And on the second point, if these values are the terms, how can I match them with the original terms ("[35587=>Hola,884670=>spark]")?

Regards and thanks in advance.

Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
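
On the second point of the question (matching indices back to the original terms): because the indices come from hashing, the vector itself does not store the terms. One way to recover them is to hash every distinct term yourself and keep an index-to-term table. The plain-Python sketch below shows the idea only. It uses CRC32 as a stand-in hash, which is an assumption and NOT the hash Spark uses, so the indices it produces will not equal 35587 or 884670. The function names are hypothetical. In Spark itself, HashingTF has an indexOf(term) method that returns the index for a term, and that would take the place of bucket_index here.

```python
import zlib

NUM_FEATURES = 1 << 20  # 1048576, the vector length seen in the output

def bucket_index(term, num_features=NUM_FEATURES):
    """Map a term to a vector position (hypothetical helper).

    CRC32 is a stand-in for illustration only; Spark uses its own hash.
    """
    return zlib.crc32(term.encode("utf-8")) % num_features

def index_to_terms(documents, num_features=NUM_FEATURES):
    """Build a reverse table: index -> set of terms hashed to that index.

    A set is used because two different terms can collide on one index.
    """
    mapping = {}
    for doc in documents:
        for term in doc.split():
            idx = bucket_index(term, num_features)
            mapping.setdefault(idx, set()).add(term)
    return mapping

# Rows of the .txt file, one document per row (example row from the question).
docs = ["Hola spark"]
lookup = index_to_terms(docs)
```

With such a table, each index in a TF-IDF vector can be looked up to get the term (or colliding terms) behind it.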