Disclaimer: I am new at Spark. I did something similar in a prototype that works, but I have not tested it at scale yet.
val agg = users.aggregateByKey(new CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp)

class CustomAggregation() extends Serializable {
  var count: Long = 0
  var users: Set[String] = Set()
}

object CustomAggregation {
  def sequenceOp(agg: CustomAggregation, user_id: String): CustomAggregation = {
    agg.count += 1
    agg.users += user_id
    agg
  }

  def comboOp(agg: CustomAggregation, agg2: CustomAggregation): CustomAggregation = {
    agg.count += agg2.count
    agg.users ++= agg2.users
    agg
  }
}

That should give you the aggregation in a single pass; the distinct count is the size of the users set.

I hope this helps

Stephane

On Wed, Dec 3, 2014 at 5:47 PM, Arun Luthra <arun.lut...@gmail.com> wrote:

> I'm wondering how to do this kind of SQL query with PairRDDFunctions.
>
> SELECT zip, COUNT(user), COUNT(DISTINCT user)
> FROM users
> GROUP BY zip
>
> In the Spark scala API, I can make an RDD (called "users") of key-value
> pairs where the keys are zip (as in ZIP code) and the values are user id's.
> Then I can compute the count and distinct count like this:
>
> val count = users.mapValues(_ => 1).reduceByKey(_ + _)
> val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)
>
> Then, if I want count and countDistinct in the same table, I have to join
> them on the key.
>
> Is there a way to do this without doing a join (and without using SQL or
> spark SQL)?
>
> Arun
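P.S. The same one-pass idea can also be expressed with a plain tuple accumulator instead of a custom class. This is only a minimal local sketch (the names ZipAgg, seqOp, combOp, and aggregate are illustrative, not from any library): in Spark it would correspond to users.aggregateByKey((0L, Set.empty[String]))(seqOp, combOp), while here the per-key fold and merge are simulated on ordinary Scala collections so the logic can be checked without a cluster.

```scala
// Accumulator: (count of values seen, set of distinct user ids).
object ZipAgg {
  type Acc = (Long, Set[String])
  val zero: Acc = (0L, Set.empty[String])

  // seqOp: fold one user id into the accumulator (count + distinct set).
  def seqOp(acc: Acc, user: String): Acc = (acc._1 + 1, acc._2 + user)

  // combOp: merge two partial accumulators, e.g. from two partitions.
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 ++ b._2)

  // Local stand-in for aggregateByKey on a Seq of (zip, user) pairs.
  def aggregate(pairs: Seq[(String, String)]): Map[String, Acc] =
    pairs.groupBy(_._1).map { case (zip, kvs) =>
      zip -> kvs.map(_._2).foldLeft(zero)(seqOp)
    }
}
```

Per key, COUNT(user) is acc._1 and COUNT(DISTINCT user) is acc._2.size, so no join is needed.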