Hey, I worked it out myself :) The "Vector" is actually a "SparseVector", so when it is written to a string, the format is
(size, [coordinates...], [values...]). Simple!

On Sat, Mar 14, 2015 at 6:05 PM Xi Shen <davidshe...@gmail.com> wrote:

> Hi,
>
> I read this document,
> http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and
> tried to build a TF-IDF model of my documents.
>
> I have a list of documents; each word is represented as an Int, and each
> document is listed on one line:
>
> doc_name, int1, int2...
> doc_name, int3, int4...
>
> This is how I load my documents:
>
> val documents: RDD[Seq[Int]] =
>   sc.objectFile[(String, Seq[Int])](s"$sparkStore/documents")
>     .map(_._2)
>     .cache()
>
> Then I did:
>
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> val idf = new IDF().fit(tf)
> val tfidf = idf.transform(tf)
>
> I wrote the tf-idf model to a text file and tried to understand the
> structure:
>
> FileUtils.writeLines(new File("tfidf.out"),
>   tfidf.collect().toList.asJavaCollection)
>
> What I get is something like:
>
> (1048576,[0,4,7,8,10,13,17,21....],[...some float numbers...])
> ...
>
> I think it's a tuple with 3 elements.
>
> - I have no idea what the 1st element is...
> - I think the 2nd element is a list of the words
> - I think the 3rd element is a list of the tf-idf values of the words in
>   the previous list
>
> Please help me understand this structure.
>
>
> Thanks,
> David
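
For anyone else hitting this: the first element is the vector's size, i.e. the feature dimension of HashingTF (its default numFeatures is 2^20 = 1048576, which matches the output above); the second is the array of hashed term indices that have non-zero weight; the third is the array of tf-idf values at those indices. Here is a minimal sketch, deliberately not depending on Spark, that mimics how MLlib's SparseVector renders itself as a string (the SparseVec class here is a hypothetical stand-in, not the real MLlib type):

```scala
// Sketch only: mimics MLlib's SparseVector.toString format
// (size,[indices],[values]) without requiring a Spark dependency.
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  // Render in the same shape seen in tfidf.out
  override def toString: String =
    s"($size,[${indices.mkString(",")}],[${values.mkString(",")}])"
}

object SparseVecDemo {
  def main(args: Array[String]): Unit = {
    // size = hashing dimension, indices = hashed terms, values = tf-idf weights
    val v = SparseVec(1048576, Array(0, 4, 7), Array(0.5, 1.2, 3.4))
    println(v) // (1048576,[0,4,7],[0.5,1.2,3.4])
  }
}
```

Note that because HashingTF hashes terms to indices, you cannot directly map an index back to the original word; you would need to hash your vocabulary yourself and keep a reverse lookup.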