Imagine that 4 documents exist as shown below:

    D1: the cat sat on the mat
    D2: the cat sat on the cat
    D3: the cat sat
    D4: the mat sat

where each word in the vocabulary can be translated to its wordID:

    0 the
    1 cat
    2 sat
    3 on
    4 mat

Now every document can be represented as a sparse vector of (wordID, count) pairs, and the principal components can be computed as follows:

    import org.apache.spark.mllib.linalg.{Matrix, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // One sparse vector per document; the vocabulary has 5 words.
    val data = Array(
      Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0))), // D1
      Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0))),           // D2
      Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0))),                     // D3
      Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0))))                     // D4

    val dataRDD = sc.parallelize(data)
    val mat: RowMatrix = new RowMatrix(dataRDD)
    val pc: Matrix = mat.computePrincipalComponents(4)

What I want to do is read the following dataset and represent each document as a sparse vector like the ones above, in order to compute the principal components. The dataset has the form:

    docID wordID count
    1 2 1
    1 39 1
    1 42 3
    1 77 1
    1 95 1
    1 96 1
    2 105 1
    2 108 1
    3 133 3

However, I am not quite sure how to read the dataset and represent it as sparse vectors. Any help would be much appreciated.
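
One possible approach, as a rough sketch rather than a definitive answer: parse each "docID wordID count" line into a key-value pair, group the pairs by docID, and turn each group into a sparse vector. The file path `docword.txt` and the helper `parseTriple` are hypothetical names chosen for illustration. The sketch assumes whitespace-separated fields, at most one line per (docID, wordID) pair, and a vocabulary size of the largest wordID plus one.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper: parse one "docID wordID count" line
// into (docID, (wordID, count)).
def parseTriple(line: String): (Int, (Int, Double)) = {
  val Array(docId, wordId, count) = line.trim.split("\\s+")
  (docId.toInt, (wordId.toInt, count.toDouble))
}

// "docword.txt" is an assumed path. Blank lines and a header line
// such as "docID wordID count" are skipped by keeping only lines
// that start with a digit.
val triples: RDD[(Int, (Int, Double))] = sc.textFile("docword.txt")
  .filter(l => l.trim.nonEmpty && l.trim.head.isDigit)
  .map(parseTriple)
  .cache()

// Assumption: the vector size is the largest wordID plus one, which
// is large enough for both 0-based and 1-based wordIDs.
val numWords: Int = triples.map(_._2._1).max() + 1

// One sparse vector per document. Each (docID, wordID) pair is
// assumed to appear only once, since Vectors.sparse rejects
// duplicate indices.
val rows: RDD[Vector] = triples
  .groupByKey()
  .values
  .map(entries => Vectors.sparse(numWords, entries.toSeq))

val mat: RowMatrix = new RowMatrix(rows)
val pc: Matrix = mat.computePrincipalComponents(4)
```

Row order does not affect the principal components, so the docIDs are dropped after grouping; keep them alongside the vectors if the projected rows need to be mapped back to documents. Note also that, as far as I know, `computePrincipalComponents` builds the covariance matrix locally on the driver, so it may not be practical for a very large vocabulary.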