Thanks, I will try this.

On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Oh, sorry. So neither SQL nor Spark SQL works for you. Then you can write
> your own aggregation with aggregateByKey:
>
> // zero value: (running count, set of distinct users seen so far)
> users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =>
>   (count + 1, seen + user)            // seqOp: fold one user into the accumulator
> }, { case ((count0, seen0), (count1, seen1)) =>
>   (count0 + count1, seen0 ++ seen1)   // combOp: merge per-partition accumulators
> }).mapValues { case (count, seen) =>
>   (count, seen.size)                  // (COUNT(user), COUNT(DISTINCT user))
> }
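>
> A minimal end-to-end sketch of this approach, assuming a local SparkContext
> "sc" and made-up sample data (older Spark versions also need
> import org.apache.spark.SparkContext._ for the pair-RDD functions):
>
> val users = sc.parallelize(Seq(
>   ("94110", "alice"), ("94110", "bob"), ("94110", "alice"),
>   ("10001", "carol")))
> val result = users.aggregateByKey((0, Set.empty[String]))(
>   { case ((count, seen), user) => (count + 1, seen + user) },
>   { case ((c0, s0), (c1, s1)) => (c0 + c1, s0 ++ s1) }
> ).mapValues { case (count, seen) => (count, seen.size) }
> result.collect()  // e.g. Array((94110,(3,2)), (10001,(1,1)))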
>
> On 12/5/14 3:47 AM, Arun Luthra wrote:
>
>   Is that Spark SQL? I'm wondering if it's possible without Spark SQL.
>
> On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>>  You may do this:
>>
>> table("users").groupBy('zip)('zip, count('user), countDistinct('user))
>>
>>  On 12/4/14 8:47 AM, Arun Luthra wrote:
>>
>>  I'm wondering how to do this kind of SQL query with PairRDDFunctions.
>>
>>  SELECT zip, COUNT(user), COUNT(DISTINCT user)
>> FROM users
>> GROUP BY zip
>>
>> In the Spark Scala API, I can make an RDD (called "users") of key-value
>> pairs where the keys are zip (as in ZIP code) and the values are user IDs.
>> Then I can compute the count and the distinct count like this:
>>
>> val count = users.mapValues(_ => 1).reduceByKey(_ + _)                    // zip -> COUNT(user)
>> val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _) // zip -> COUNT(DISTINCT user)
>>
>>  Then, if I want count and countDistinct in the same table, I have to
>> join them on the key.
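>>
>> The join I'm trying to avoid would look something like this (just a sketch):
>>
>> // yields an RDD of (zip, (count, distinctCount)) pairs
>> val combined = count.join(countDistinct)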
>>
>>  Is there a way to do this without doing a join (and without using SQL
>> or spark SQL)?
>>
>>  Arun
>>
>
