Hi all, I have an 8 million row, 500-column data set, derived by reading a text file and applying filter and flatMap operations to weed out some anomalies. I now have a process that must run through all 500 columns, do a couple of map, reduce, and forEach operations on the data set, and return some statistics as output. I have thought of the following approaches.

Approach 1:
i) Read the data set from the text file, do some operations, and get an RDD. Use toArray or collect on this RDD and broadcast the result.
ii) Do a flatMap over a range of numbers, the range being equal to the number of columns.
iii) In each flatMap operation, perform the required operations on the broadcast variable to derive the stats, and return the array of stats.

Questions about this approach:
1) Is there a preference between toArray and collect?
2) Can I not broadcast an RDD directly, instead of first collecting it and then broadcasting the result? I tried this, but I got a serialization exception.
3) When I use sc.parallelize on the broadcast data set, would it be a problem if there isn't enough space to store it in memory?

Approach 2:
Instead of reading the text file, doing some operations, and then broadcasting the result, I was planning to do the read within each of the 500 flatMap steps (assuming I have 500 columns). Is this better than Approach 1? In Approach 1 I'd read once and broadcast, whereas here I'd read 500 times.

Approach 3:
Do a transpose of the data set and then flatMap over the transposed matrix.

Could someone please point out the best approach from the above, or suggest a better way to solve this? Thank you for the help!

Vinay
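To make Approach 1 concrete, here is a minimal sketch of the per-column pass, modeled with plain Python lists standing in for the RDD and broadcast variable (the names `data` and `column_stats`, and the min/max/mean stats, are illustrative, not from the post). In Spark, `data` would come from `rdd.collect()` on the driver and be wrapped in `sc.broadcast(...)`, and the column loop would be `sc.parallelize(range(num_cols)).flatMap(...)`:

```python
# Illustrative row-major dataset: each inner list is one row.
# In Spark this would be rdd.collect() on the driver, then
# bcast = sc.broadcast(data).
data = [
    [1.0, 4.0, 7.0],
    [2.0, 5.0, 8.0],
    [3.0, 6.0, 9.0],
]
num_cols = len(data[0])

def column_stats(rows, col):
    """Compute hypothetical per-column statistics (min, max, mean)."""
    values = [row[col] for row in rows]
    return {"col": col, "min": min(values), "max": max(values),
            "mean": sum(values) / len(values)}

# Spark equivalent of the list comprehension below:
#   stats = sc.parallelize(range(num_cols)) \
#             .flatMap(lambda c: [column_stats(bcast.value, c)]) \
#             .collect()
stats = [column_stats(data, c) for c in range(num_cols)]
```

Each task reads the broadcast copy and computes one column's stats, so the full data set is shipped to each executor once rather than once per column.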
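Approach 3's transpose amounts to keying every value by its column index and grouping, so each downstream task receives one complete column. A sketch with plain lists (variable names are illustrative); the Spark equivalent would be a flatMap emitting (column_index, value) pairs followed by groupByKey, though note that this shuffles the entire 8M-row data set across the cluster:

```python
from collections import defaultdict

# Row-major input (values are illustrative).
rows = [
    [1.0, 4.0],
    [2.0, 5.0],
    [3.0, 6.0],
]

# In Spark: rdd.flatMap(lambda row: list(enumerate(row))).groupByKey()
pairs = [(c, v) for row in rows for c, v in enumerate(row)]
columns = defaultdict(list)
for c, v in pairs:
    columns[c].append(v)

# Each "task" now sees one full column, e.g. to sum it.
col_sums = {c: sum(vs) for c, vs in columns.items()}
```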