Read file and represent rows as Vectors

2017-04-03 Thread Old-School
I have a dataset that contains DocID, WordID and frequency (count) as shown below. Note that the first three numbers represent 1. the number of documents, 2. the number of words in the vocabulary and 3. the total number of words in the collection. 189 1430 12300 1 2 1 1 39 1 1 42 3 1 77 1 1 95 1 1

Represent documents as a sequence of wordID & frequency and perform PCA

2017-04-02 Thread Old-School
Imagine that 4 documents exist as shown below: D1: the cat sat on the mat D2: the cat sat on the cat D3: the cat sat D4: the mat sat where each word in the vocabulary can be translated to its wordID: 0 the 1 cat 2 sat 3 on 4 the 5 mat Now every document, can be represented using sparse vectors

[RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread Old-School
Hi, I want to perform some simple transformations and check the execution time, under various configurations (e.g. number of cores being used, number of partitions etc). Since it is not possible to set the partitions of a dataframe , I guess that I should probably use RDDs. I've got a dataset wi