And got the first cut:
val res = pairs.groupByKey().map(x => (x._1, x._2.size, x._2.toSet.size))
which gives, per key, the total count and the unique count.
The question: is this scalable and memory-efficient at larger volumes? Would appreciate insights.
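One common concern with groupByKey at scale is that it materializes every value for a key in memory on one executor. A sketch of an alternative using aggregateByKey, which combines per-partition before shuffling (assuming `pairs` is an RDD[(String, String)] as in the earlier mail; the names here are illustrative, not from the original code):

```scala
// Sketch only: computes (key, total, uniqueCount) without collecting
// all values for a key into a single Iterable first.
val res = pairs
  .aggregateByKey((0L, Set.empty[String]))(
    // seqOp: fold one value into the per-partition accumulator
    { case ((total, uniq), v) => (total + 1, uniq + v) },
    // combOp: merge accumulators produced on different partitions
    { case ((t1, u1), (t2, u2)) => (t1 + t2, u1 ++ u2) }
  )
  .map { case (k, (total, uniq)) => (k, total, uniq.size) }
```

Caveat: the Set still grows with the number of distinct values per key, so for very high-cardinality keys an approximate approach such as PairRDDFunctions.countApproxDistinctByKey (HyperLogLog-based) is the usual escape hatch for the unique count.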
Cheers
k/
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com
wrote:
Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a csv file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD)
works fine locally, so now I can do groupByKey et al. I am not sure, though,
whether it is scalable and memory-efficient for millions of records.
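For reference, the explicit PairRDDFunctions wrapper is usually unnecessary: importing the SparkContext implicits makes the pair operations resolve directly on a tuple RDD. A sketch of building the pairs from the CSV above (the file path and the choice of keying on the first column are assumptions, not from the original mail):

```scala
// Sketch, assuming the implicit conversion from SparkContext._
import org.apache.spark.SparkContext._

val lines = sc.textFile("data.csv")   // hypothetical path
val pairs = lines.map { line =>
  val Array(d, c, a) = line.split(",")
  (d, (c, a))                         // key on the first column
}
// pairs.groupByKey(), pairs.reduceByKey(...) etc. now work directly,
// with no explicit `new PairRDDFunctions(...)` needed.
```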
Cheers
k/
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote:
Hi,