Hi,
   For the last couple of days I have been trying hard to get around this
problem. Please share any insights on solving it.

Problem :
There is a huge list of (key, value) pairs. I want to transform it to
(key, distinct values) and then eventually to (key, distinct values count).

On a small data set this works:

groupByKey().map(x => (x._1, x._2.distinct)) ...map(x => (x._1,
x._2.distinct.size))
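
To make the intended result concrete, here is a minimal sketch on plain Scala collections (no Spark involved, so it says nothing about the memory behaviour on an RDD). The object name `DistinctPerKey`, the method names, and the sample pairs are all made up for illustration.

```scala
object DistinctPerKey {
  // (key, value) pairs -> (key, distinct values)
  def distinctValues[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).distinct) }

  // (key, value) pairs -> (key, distinct values count)
  def distinctCounts[K, V](pairs: Seq[(K, V)]): Map[K, Int] =
    distinctValues(pairs).map { case (k, vs) => (k, vs.size) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data standing in for the real pairs.
    val pairs = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3))
    println(distinctValues(pairs))
    println(distinctCounts(pairs))
  }
}
```

For the sample above, key "a" should end up with the two distinct values 1 and 2, and key "b" with the single value 3.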

On the large data set I am getting an OOM (out of memory) error.

Is there a way to represent the Seq of values from groupByKey as an RDD and
then perform distinct over it?

Thanks
Vivek
