Hi,

For the last couple of days I have been trying hard to get around this problem. Please share any insights on solving it.
Problem: There is a huge list of (key, value) pairs. I want to transform it to (key, distinct values) and then eventually to (key, distinct values count).

On a small dataset the following works:

    groupByKey().map(x => (x._1, x._2.distinct))
    ...map(x => (x._1, x._2.distinct.size))

On a large dataset I am getting an OOM (out of memory) error. Is there a way to represent the Seq of values from groupByKey as an RDD and then perform distinct over it?

Thanks,
Vivek
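
For reference, the (key, distinct values count) result described above can probably be computed without building a per-key Seq at all: deduplicate the (key, value) pairs across the whole RDD first, then count the remaining records per key. The following is only a minimal sketch under assumptions that are not in the original post: the object and function names (DistinctCountPerKey, distinctCountPerKey), the local master setting, and the sample data are all made up for illustration.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark; harmless on newer versions
import org.apache.spark.rdd.RDD

// Hypothetical helper, not from the original post.
object DistinctCountPerKey {

  // Counts distinct values per key without grouping all values of a key in memory.
  def distinctCountPerKey[K: ClassTag, V: ClassTag](pairs: RDD[(K, V)]): RDD[(K, Long)] =
    pairs
      .distinct()                       // one record per distinct (key, value) pair
      .map { case (k, _) => (k, 1L) }   // each surviving record is one distinct value of k
      .reduceByKey(_ + _)               // sum per key; partial sums are combined map-side

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for trying this out on a laptop.
    val conf = new SparkConf().setAppName("distinct-count-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: key "a" has values 1, 1, 2; key "b" has value 3.
      val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))
      distinctCountPerKey(pairs).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch the intermediate `pairs.distinct()` RDD already holds the (key, distinct values) information as one record per distinct pair, so the values of a single key never have to fit in one in-memory collection the way they do with groupByKey.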