Answered one of my questions (#5): val pairs = new PairRDDFunctions(<RDD>) works fine locally. Now I can do groupByKey et al. I am not sure whether it is scalable and memory-efficient for millions of records, though.
Cheers
<k/>
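P.S. For the archive: the explicit wrapper seems to be unnecessary. With the SparkContext implicits imported, an RDD of 2-tuples picks up the PairRDD operations automatically via the rddToPairRDDFunctions conversion. A minimal sketch, assuming Spark 1.x run locally; the data.csv path and the SparkContext setup are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // brings rddToPairRDDFunctions into scope

    val sc = new SparkContext("local", "pairs")  // hypothetical local setup
    val file = sc.textFile("data.csv")           // assumed input path

    // Mapping each record to a 2-tuple yields an RDD[(String, String)].
    // With the implicit conversion in scope, groupByKey et al. are
    // available directly -- no explicit new PairRDDFunctions(...) needed.
    val pairs = file.map(_.split(","))
                    .map(fields => (fields(1), fields(0)))  // (c_, d_)
    val grouped = pairs.groupByKey()

A sketch addressing the scalability questions (#3/#4) follows the quoted mail below.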
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
> Hi,
> Would appreciate insights and wisdom on a problem we are working on:
>
> 1. Context:
>    - Given a csv file like:
>        d1,c1,a1
>        d1,c1,a2
>        d1,c2,a1
>        d1,c1,a1
>        d2,c1,a3
>        d2,c2,a1
>        d3,c1,a1
>        d3,c3,a1
>        d3,c2,a1
>        d3,c3,a2
>        d5,c1,a3
>        d5,c2,a2
>        d5,c3,a2
>    - We want to find totals and uniques of the d_ values across the c_
>      and a_ dimensions:
>                Tot   Unique
>        c1       6      4
>        c2       4      4
>        c3       2      2
>        a1       7      3
>        a2       4      3
>        a3       2      2
>        c1-a1   ...    ...
>        c1-a2   ...    ...
>        c1-a3   ...    ...
>        c2-a1   ...    ...
>        c2-a2   ...    ...
>        ...
>        c3-a3   ...    ...
>    - Obviously there are millions of records and more
>      attributes/dimensions, so scalability is key.
> 2. We think Spark is a good stack for this problem. We have a few
>    questions:
> 3. From a Spark perspective, what are the optimal transformations, and
>    what should we watch out for?
> 4. Is PairRDD the best data representation? groupByKey et al. are only
>    available on PairRDDs.
> 5. On a pragmatic level, file.map().map() results in an RDD. How do I
>    transform it to a PairRDD?
>    1. .map(fields => (fields(1), fields(0))) results in Unit.
>    2. .map(fields => fields(1) -> fields(0)) also is not working.
>    3. Neither of these results in a PairRDD.
>    4. I am missing something fundamental.
>
> Cheers & have a nice weekend
> <k/>
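On the scalability questions (#3/#4): for pure counting, groupByKey is probably the wrong primitive, since it materializes the full list of values for each key in memory. A rough sketch of one alternative, counting totals with reduceByKey and uniques with distinct + reduceByKey, assuming the three-column (d, c, a) layout above; the file path and context setup are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local", "uniques-totals")  // hypothetical setup
    val records = sc.textFile("data.csv")                 // assumed input path

    // For each record (d, c, a), emit one (key, d) pair per dimension of
    // interest: c alone, a alone, and the c-a combination.
    val keyed = records.map(_.split(",")).flatMap {
      case Array(d, c, a) => Seq((c, d), (a, d), (c + "-" + a, d))
    }

    // Totals: count every occurrence of each key.
    val totals = keyed.mapValues(_ => 1L).reduceByKey(_ + _)

    // Uniques: dedupe the (key, d) pairs first, then count. This shuffles
    // partial counts rather than whole value lists, so it should behave
    // better than groupByKey at millions of records.
    val uniques = keyed.distinct().mapValues(_ => 1L).reduceByKey(_ + _)

    totals.join(uniques).collect().foreach {
      case (key, (tot, uniq)) => println(key + "\t" + tot + "\t" + uniq)
    }

The flatMap fans each record out to one pair per key of interest, so totals and uniques for all the dimensions and combinations come out of a single pass over the data plus two shuffles.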