Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Hi,

Would appreciate insights and wisdom on a problem we are working on:

1. Context:
   - Given a csv file like:
     - d1,c1,a1
     - d1,c1,a2
     - d1,c2,a1
     - d1,c1,a1
     - d2,c1,a3
     - d2,c2,a1
     - d3,c1,a1
     - d3,c3,a1
     - d3,c2,a1
     - d3,c3,a2

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Answered one of my questions (#5): val pairs = new PairRDDFunctions() works fine locally. Now I can do groupByKey et al. I am not sure whether it scales to millions of records or is memory efficient.

Cheers

On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar wrote:
> Hi,
> Would appreciate insights

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
And got the first cut:

val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))

gives the total & unique counts. The question: is it scalable & efficient? Would appreciate insights.

Cheers

On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar wrote:
> Answered one of my questi
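
For anyone following along without a cluster, here is a minimal local sketch of what that first cut computes. It uses plain Scala collections rather than Spark, and the sample rows from the original post. Two things are assumptions and not stated in the thread: that the key is the first column and the value is the remaining two columns, and the names `UniquesSketch`, `rows` and `pairs` as defined here.

```scala
object UniquesSketch {
  // Sample rows copied from the original post.
  val rows: Seq[String] = Seq(
    "d1,c1,a1", "d1,c1,a2", "d1,c2,a1", "d1,c1,a1",
    "d2,c1,a3", "d2,c2,a1",
    "d3,c1,a1", "d3,c3,a1", "d3,c2,a1", "d3,c3,a2"
  )

  // Assumption: key = first column, value = (second, third) column.
  // The thread does not say how `pairs` was actually keyed.
  val pairs: Seq[(String, (String, String))] = rows.map { line =>
    val Array(d, c, a) = line.split(",")
    (d, (c, a))
  }

  // Local analogue of pairs.groupByKey().map(x => (x._1, x._2.size, x._2.toSet.size)):
  // for each key, the total number of values and the number of distinct values.
  val res: Map[String, (Int, Int)] =
    pairs.groupBy(_._1).map { case (key, kvs) =>
      val values = kvs.map(_._2)
      (key, (values.size, values.toSet.size))
    }

  def main(args: Array[String]): Unit =
    res.toSeq.sortBy(_._1).foreach { case (key, (total, unique)) =>
      println(s"$key total=$total unique=$unique")
    }
}
```

With that keying, d1 has 4 rows of which 3 are distinct, d2 has 2 and 2, and d3 has 4 and 4. On the scalability question: the grouped form collects every value for a key before counting, so memory per key grows with the number of values. One commonly suggested alternative on an RDD is to count with reduceByKey over ((key, value), 1) pairs so that only counts are shuffled; whether that helps here depends on the data and is not something the thread confirms.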