And got the first cut: val res = pairs.groupByKey().map(x => (x._1, x._2.size, x._2.toSet.size)) gives the total and unique counts per key.
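Spelled out end to end, that first cut might look like the sketch below (Spark 1.x style, matching the thread's era; the file name and the local master are placeholders, not from the thread). Note that once the RDD's elements are 2-tuples, the pair operations come in through the implicit conversion pulled in by `import org.apache.spark.SparkContext._`, so no explicit `new PairRDDFunctions(...)` wrapping is needed — which is also the likely answer to question #5 below: the earlier `.map(fields => (fields(1), fields(0))` attempt was missing its closing parenthesis, and without the implicit import in scope the pair methods would not appear either.

```scala
// Sketch of the first cut (assumed file name "data.csv"; local master for illustration).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit RDD[(K, V)] -> PairRDDFunctions conversion

object FirstCut {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("uniques"))

    // Each line is "d,c,a"; key on the c_ dimension, value is the d_ id.
    val pairs = sc.textFile("data.csv")
      .map(_.split(","))
      .map(fields => (fields(1), fields(0))) // RDD[(String, String)]

    // (key, total occurrences, unique d_ ids) -- groupByKey pulls each key's
    // full value list into memory, which is the scalability concern raised below.
    val res = pairs.groupByKey()
      .map { case (k, vs) => (k, vs.size, vs.toSet.size) }

    res.collect().foreach(println)
    sc.stop()
  }
}
```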
The question: is it scalable and efficient? Would appreciate insights.
Cheers
<k/>

On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar <ksanka...@gmail.com> wrote:

> Answered one of my questions (#5): val pairs = new
> PairRDDFunctions(<RDD>) works fine locally. Now I can do groupByKey et al.
> Am not sure if it is scalable for millions of records & memory efficient.
> Cheers
> <k/>
>
>
> On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar <ksanka...@gmail.com>
> wrote:
>
>> Hi,
>>    Would appreciate insights and wisdom on a problem we are working on:
>>
>> 1. Context:
>>    - Given a csv file like:
>>        d1,c1,a1
>>        d1,c1,a2
>>        d1,c2,a1
>>        d1,c1,a1
>>        d2,c1,a3
>>        d2,c2,a1
>>        d3,c1,a1
>>        d3,c3,a1
>>        d3,c2,a1
>>        d3,c3,a2
>>        d5,c1,a3
>>        d5,c2,a2
>>        d5,c3,a2
>>    - Want to find uniques and totals (of the d_ across the c_ and a_
>>      dimensions):
>>               Tot  Unique
>>        c1      6     4
>>        c2      4     4
>>        c3      3     2
>>        a1      7     3
>>        a2      4     3
>>        a3      2     2
>>        c1-a1  ...
>>        c1-a2  ...
>>        c1-a3  ...
>>        c2-a1  ...
>>        c2-a2  ...
>>        ...
>>        c3-a3  ...
>>    - Obviously there are millions of records and more
>>      attributes/dimensions, so scalability is key.
>> 2. We think Spark is a good stack for this problem. Have a few questions:
>> 3. From a Spark substrate perspective, what are some of the optimal
>>    transformations, and what should we watch out for?
>> 4. Is PairRDD the best data representation? groupByKey et al. are only
>>    available on PairRDD.
>> 5. On a pragmatic level, file.map().map() results in an RDD. How do I
>>    transform it to a PairRDD?
>>    1. .map(fields => (fields(1), fields(0))) - results in Unit
>>    2. .map(fields => fields(1) -> fields(0)) also is not working
>>    3. Neither results in a PairRDD.
>>    4. Am I missing something fundamental?
>>
>> Cheers & have a nice weekend
>> <k/>
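For the scalability question above, one way to sidestep groupByKey's per-key materialization is to fold a running total and a distinct-id set per key with reduceByKey, so the shuffle carries partially aggregated values instead of every raw record. A sketch under the same assumptions as before (illustrative file name and local master; the flatMap emits the c_, a_, and c-a combination keys the example table asks for):

```scala
// Sketch: totals and uniques per dimension without groupByKey.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit pair-RDD conversions

object UniquesAndTotals {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("uniques"))

    val rows = sc.textFile("data.csv").map(_.split(","))

    // One pair per dimension of interest: c alone, a alone, and the c-a combination.
    val pairs = rows.flatMap { case Array(d, c, a) =>
      Seq((c, d), (a, d), (s"$c-$a", d))
    }

    // Fold (total, distinct d_ set) per key; map-side combining keeps the
    // shuffle small relative to groupByKey.
    val res = pairs
      .map { case (k, d) => (k, (1L, Set(d))) }
      .reduceByKey { case ((n1, s1), (n2, s2)) => (n1 + n2, s1 ++ s2) }
      .mapValues { case (total, ds) => (total, ds.size) }

    res.collect().foreach(println)
    sc.stop()
  }
}
```

The distinct-id sets still grow with the number of unique d_ values per key; if that cardinality is itself in the millions and approximate answers are acceptable, PairRDDFunctions.countApproxDistinctByKey (HyperLogLog-based) bounds the per-key memory at the cost of a small counting error.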