Answered one of my questions (#5): val pairs = new PairRDDFunctions(<RDD>) works fine locally. Now I can do groupByKey et al. I am not sure whether it is scalable and memory-efficient for millions of records, though.
Cheers
<k/>
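P.S. For the archive: the explicit wrapper seems to be unnecessary. With the SparkContext implicits imported, an RDD of 2-tuples picks up the PairRDD operations automatically via the rddToPairRDDFunctions conversion. A minimal sketch, assuming Spark 1.x run locally; the data.csv path and the SparkContext setup are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // brings rddToPairRDDFunctions into scope

    val sc = new SparkContext("local", "pairs")  // hypothetical local setup
    val file = sc.textFile("data.csv")           // assumed input path

    // Mapping each record to a 2-tuple yields an RDD[(String, String)].
    // With the implicit conversion in scope, groupByKey et al. are
    // available directly -- no explicit new PairRDDFunctions(...) needed.
    val pairs = file.map(_.split(","))
                    .map(fields => (fields(1), fields(0)))  // (c_, d_)
    val grouped = pairs.groupByKey()

A sketch addressing the scalability questions (#3/#4) follows the quoted mail below.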
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
> Hi,
> Would appreciate insights and wisdom on a problem we are working on:
>
> 1. Context:
>    - Given a csv file like:
>        d1,c1,a1
>        d1,c1,a2
>        d1,c2,a1
>        d1,c1,a1
>        d2,c1,a3
>        d2,c2,a1
>        d3,c1,a1
>        d3,c3,a1
>        d3,c2,a1
>        d3,c3,a2
>        d5,c1,a3
>        d5,c2,a2
>        d5,c3,a2
>    - We want to find totals and uniques of the d_ values across the c_
>      and a_ dimensions:
>                Tot   Unique
>        c1       6      4
>        c2       4      4
>        c3       2      2
>        a1       7      3
>        a2       4      3
>        a3       2      2
>        c1-a1   ...    ...
>        c1-a2   ...    ...
>        c1-a3   ...    ...
>        c2-a1   ...    ...
>        c2-a2   ...    ...
>        ...
>        c3-a3   ...    ...
>    - Obviously there are millions of records and more
>      attributes/dimensions, so scalability is key.
> 2. We think Spark is a good stack for this problem. We have a few
>    questions:
> 3. From a Spark perspective, what are the optimal transformations, and
>    what should we watch out for?
> 4. Is PairRDD the best data representation? groupByKey et al. are only
>    available on PairRDDs.
> 5. On a pragmatic level, file.map().map() results in an RDD. How do I
>    transform it to a PairRDD?
>    1. .map(fields => (fields(1), fields(0))) results in Unit.
>    2. .map(fields => fields(1) -> fields(0)) also is not working.
>    3. Neither of these results in a PairRDD.
>    4. I am missing something fundamental.
>
> Cheers & have a nice weekend
> <k/>
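On the scalability questions (#3/#4): for pure counting, groupByKey is probably the wrong primitive, since it materializes the full list of values for each key in memory. A rough sketch of one alternative, counting totals with reduceByKey and uniques with distinct + reduceByKey, assuming the three-column (d, c, a) layout above; the file path and context setup are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local", "uniques-totals")  // hypothetical setup
    val records = sc.textFile("data.csv")                 // assumed input path

    // For each record (d, c, a), emit one (key, d) pair per dimension of
    // interest: c alone, a alone, and the c-a combination.
    val keyed = records.map(_.split(",")).flatMap {
      case Array(d, c, a) => Seq((c, d), (a, d), (c + "-" + a, d))
    }

    // Totals: count every occurrence of each key.
    val totals = keyed.mapValues(_ => 1L).reduceByKey(_ + _)

    // Uniques: dedupe the (key, d) pairs first, then count. This shuffles
    // partial counts rather than whole value lists, so it should behave
    // better than groupByKey at millions of records.
    val uniques = keyed.distinct().mapValues(_ => 1L).reduceByKey(_ + _)

    totals.join(uniques).collect().foreach {
      case (key, (tot, uniq)) => println(key + "\t" + tot + "\t" + uniq)
    }

The flatMap fans each record out to one pair per key of interest, so totals and uniques for all the dimensions and combinations come out of a single pass over the data plus two shuffles.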