Hi All,
I have an 8 million row, 500 column data set, derived by reading a text
file and applying filter and flatMap operations to weed out some anomalies.
Now I have a process which has to run through all 500 columns, do a couple of
map, reduce, and foreach operations on the data set, and return some statistics
as output. I have thought of the following approaches.
Approach 1:

i) Read the data set from the text file, do some operations, and get an RDD.
Use toArray or collect on this RDD and broadcast the result.

ii) Do a flatMap over a range of numbers, the range being equal to the number
of columns.

iii) In each flatMap task, perform the required operations on the broadcast
variable to derive that column's stats, and return the array of stats (rough
sketch below).
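
Something like this, where the file path, the CSV-style parse function and the
mean/max stats are just placeholders for my real cleanup and statistics:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Try

val sc = new SparkContext(new SparkConf().setAppName("ColumnStats"))
val numCols = 500

// Placeholder for my real per-line cleanup: keep only lines that parse
// into exactly numCols numeric fields.
val parse = (line: String) =>
  Try(line.split(",").map(_.toDouble)).toOption.filter(_.length == numCols)

// i) read, clean, collect to the driver, broadcast
val rows: RDD[Array[Double]] = sc.textFile("hdfs:///path/to/data.txt")
  .filter(_.nonEmpty)
  .flatMap(line => parse(line))
val bcData = sc.broadcast(rows.collect())

// ii) + iii) one task per column index, each deriving its stats from the
// broadcast copy and returning them
val stats = sc.parallelize(0 until numCols).flatMap { col =>
  val values = bcData.value.map(row => row(col))
  Seq((col, "mean", values.sum / values.length),
      (col, "max", values.max))
}
stats.collect().foreach(println)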

Questions about this approach:

1) Is there a reason to prefer one of toArray and collect over the other?

2) Can I broadcast an RDD directly instead of collecting it first and then
broadcasting it? I tried this, but got a serialization exception.

3) When I use sc.parallelize on the broadcast data set, would it be a problem
if there isn't enough space to store it in memory?


Approach 2:

Instead of reading the text file, doing some operations, and then broadcasting
the result, I was planning to do the read within each of the 500 flatMap tasks
(assuming I have 500 columns).
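
Something like this, reusing sc, numCols and the placeholder parse from the
sketch above, and assuming the file is readable at the same path from every
worker node:

import scala.io.Source

val statsPerColumn = sc.parallelize(0 until numCols).map { col =>
  // each task re-reads and re-parses the whole file itself
  val values = Source.fromFile("/shared/path/to/data.txt")
    .getLines()
    .filter(_.nonEmpty)
    .flatMap(line => parse(line))
    .map(row => row(col))
    .toArray
  (col, values.sum / values.length)   // placeholder stat
}
statsPerColumn.collect().foreach(println)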

Is this better than Approach 1? There I'd read once and broadcast, whereas here
I'd read the file 500 times.


Approach 3:
Do a transpose of the data set and then flatMap over the transposed matrix.
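
Roughly along these lines, reusing the rows RDD and numCols from the first
sketch ("transposing" by exploding each row into (columnIndex, value) pairs and
grouping by column, so each record is one full column; the mean is again a
stand-in for my real stats):

import org.apache.spark.SparkContext._   // pair-RDD implicits for groupByKey

val columns = rows
  .flatMap(row => row.zipWithIndex.map { case (value, col) => (col, value) })
  .groupByKey(numCols)

val statsByColumn = columns.map { case (col, values) =>
  (col, values.sum / values.size)
}
statsByColumn.collect().foreach(println)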

Could someone please point out the best of the approaches above, or suggest a
better way to solve this?
Thank you for the help!
Vinay