I have a dataset that contains DocID, WordID and frequency (count) as shown
below. Note that the first three numbers represent 1. the number of
documents, 2. the number of words in the vocabulary and 3. the total number
of words in the collection.
189
1430
12300
1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1
Imagine that 4 documents exist as shown below:
D1: the cat sat on the mat
D2: the cat sat on the cat
D3: the cat sat
D4: the mat sat
where each word in the vocabulary can be translated to its wordID:
0 the
1 cat
2 sat
3 on
4 the
5 mat
Now every document, can be represented using sparse vectors
Hi,
I want to perform some simple transformations and check the execution time,
under various configurations (e.g. number of cores being used, number of
partitions etc). Since it is not possible to set the partitions of a
dataframe , I guess that I should probably use RDDs.
I've got a dataset wi