Disclaimer: I am new to Spark...

I did something similar in a prototype that works, but I have not tested it
at scale yet.

val agg = users.aggregateByKey(new CustomAggregation())(
  CustomAggregation.sequenceOp, CustomAggregation.comboOp)

class CustomAggregation() extends Serializable {
  var count: Long = 0
  var users: Set[String] = Set()
}

object CustomAggregation {

  def sequenceOp(agg: CustomAggregation, user_id: String): CustomAggregation = {
    agg.count += 1
    agg.users += user_id
    agg
  }

  def comboOp(agg: CustomAggregation, agg2: CustomAggregation): CustomAggregation = {
    agg.count += agg2.count
    agg.users ++= agg2.users
    agg
  }

}


That should give you the aggregation; the distinct count is the size of the
users set.
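To show what the two functions compute without a cluster, here is a plain-Scala sketch of the same fold over a small hypothetical in-memory list of (zip, user_id) pairs (the sample data and names are made up; Spark would apply sequenceOp within each partition and comboOp across partitions):

```scala
// Mutable accumulator: row count plus the set of distinct users per key.
case class Agg(var count: Long = 0, var users: Set[String] = Set())

def sequenceOp(agg: Agg, userId: String): Agg = {
  agg.count += 1       // one more row for this zip
  agg.users += userId  // the set keeps only distinct user ids
  agg
}

def comboOp(a: Agg, b: Agg): Agg = {
  a.count += b.count
  a.users ++= b.users
  a
}

// Hypothetical sample data: (zip, user_id) pairs.
val users = List(("94105", "u1"), ("94105", "u2"), ("94105", "u1"), ("10001", "u3"))

// Emulate the per-key aggregation locally.
val perZip = users
  .groupBy(_._1)
  .map { case (zip, rows) =>
    val agg = rows.map(_._2).foldLeft(new Agg())(sequenceOp)
    (zip, agg.count, agg.users.size) // zip, COUNT(user), COUNT(DISTINCT user)
  }

perZip.foreach(println)
```

The point of carrying the set in the accumulator is that count and distinct count come out of one pass, with no second RDD and no join.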

I hope this helps.

Stephane

On Wed, Dec 3, 2014 at 5:47 PM, Arun Luthra <arun.lut...@gmail.com> wrote:

> I'm wondering how to do this kind of SQL query with PairRDDFunctions.
>
> SELECT zip, COUNT(user), COUNT(DISTINCT user)
> FROM users
> GROUP BY zip
>
> In the Spark scala API, I can make an RDD (called "users") of key-value
> pairs where the keys are zip (as in ZIP code) and the values are user id's.
> Then I can compute the count and distinct count like this:
>
> val count = users.mapValues(_ => 1).reduceByKey(_ + _)
> val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)
>
> Then, if I want count and countDistinct in the same table, I have to join
> them on the key.
>
> Is there a way to do this without doing a join (and without using SQL or
> spark SQL)?
>
> Arun
>
