Re: Multi-dimensional Uniques over large dataset

2014-06-14 Thread Krishna Sankar
And got the first cut: val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size)) gives the total count and the unique count per key. The question: is it scalable and efficient? Would appreciate insights. Cheers k/ On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com wrote:
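For context, a minimal self-contained sketch of the pipeline described above, assuming the key is the first CSV column and the value is the remaining two; the input path, app name, and local master are illustrative assumptions, not from the original message:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // implicit pair-RDD operations (Spark 1.x style)

    object UniquesPerKey {
      def main(args: Array[String]): Unit = {
        // Illustrative setup; cluster config and input path are assumptions
        val sc = new SparkContext(new SparkConf().setAppName("UniquesPerKey").setMaster("local[*]"))

        // Each line looks like "d1,c1,a1"; key on the first column
        val pairs = sc.textFile("data.csv").map { line =>
          val cols = line.split(",")
          (cols(0), (cols(1), cols(2)))
        }

        // For each key: total number of rows and number of distinct (c, a) combinations
        val res = pairs.groupByKey().map { case (k, vs) => (k, vs.size, vs.toSet.size) }

        res.collect().foreach(println)
        sc.stop()
      }
    }

Note that groupByKey materializes all values for a key in memory on one executor, which is why the scalability question raised above matters for very large keys.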

Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Hi, Would appreciate insights and wisdom on a problem we are working on:
1. Context:
   - Given a csv file like:
     - d1,c1,a1
     - d1,c1,a2
     - d1,c2,a1
     - d1,c1,a1
     - d2,c1,a3
     - d2,c2,a1
     - d3,c1,a1
     - d3,c3,a1
     - d3,c2,a1
     - d3,c3,a2
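As one hedged illustration of what "multi-dimensional uniques" over rows like these could look like, the sketch below counts distinct a values per d and per (d, c); the column roles and input path are assumptions, since the original message is truncated here:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // implicit pair-RDD operations (Spark 1.x style)

    object MultiDimUniques {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MultiDimUniques").setMaster("local[*]"))

        // Rows like "d1,c1,a1" parsed into (d, c, a) tuples; path is an assumption
        val rows = sc.textFile("input.csv").map { line =>
          val Array(d, c, a) = line.split(",")
          (d, c, a)
        }

        // Distinct a per d, e.g. d1 -> 2 in the sample above (a1, a2)
        val uniqueAPerD = rows.map { case (d, _, a) => (d, a) }.distinct().countByKey()

        // Distinct a per (d, c) pair
        val uniqueAPerDC = rows.map { case (d, c, a) => ((d, c), a) }.distinct().countByKey()

        uniqueAPerD.foreach(println)
        uniqueAPerDC.foreach(println)
        sc.stop()
      }
    }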

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD) works fine locally. Now I can do groupByKey et al. Am not sure if it is scalable and memory efficient for millions of records. Cheers k/ On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote: Hi,
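For what it's worth, a small sketch of the two equivalent ways to get pair-RDD operations in Spark 1.x: the explicit wrapper mentioned above, and the usual implicit conversion brought in by importing SparkContext._. The RDD contents here are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // implicit conversion RDD[(K, V)] -> PairRDDFunctions
    import org.apache.spark.rdd.PairRDDFunctions

    object PairOps {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PairOps").setMaster("local[*]"))
        val rdd = sc.parallelize(Seq(("d1", "a1"), ("d1", "a2"), ("d2", "a1")))

        // Explicit wrapping, as in the message above
        val pairs = new PairRDDFunctions(rdd)
        pairs.groupByKey().collect().foreach(println)

        // Implicit conversion: the same operations are available directly on the RDD
        rdd.groupByKey().collect().foreach(println)

        sc.stop()
      }
    }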