Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you can write your own aggregation with |aggregateByKey|:

|users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =>
  // seqOp: fold one user into the (rowCount, distinctUsers) accumulator
  (count + 1, seen + user)
}, { case ((count0, seen0), (count1, seen1)) =>
  // combOp: merge accumulators from different partitions
  (count0 + count1, seen0 ++ seen1)
}).mapValues { case (count, seen) =>
  (count, seen.size)
}
|
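If it helps, the semantics of those two functions can be checked without a cluster by folding over a plain Scala collection. The sample zip/user pairs below are made up, and |groupBy| plus |foldLeft| stand in for Spark's per-partition fold; this is just a sketch of what |aggregateByKey| computes, not how it executes.

```scala
// (zip, user) pairs standing in for the `users` RDD (made-up sample data)
val users = Seq(
  "94107" -> "alice", "94107" -> "bob",
  "94107" -> "alice", "10001" -> "carol")

val zero = (0, Set.empty[String])

// seqOp: fold one user into a (rowCount, seenUsers) accumulator
def seqOp(acc: (Int, Set[String]), user: String) =
  (acc._1 + 1, acc._2 + user)

// combOp: merge two partial accumulators (in Spark, from different partitions)
def combOp(a: (Int, Set[String]), b: (Int, Set[String])) =
  (a._1 + b._1, a._2 ++ b._2)

// Per key: fold all users, then keep (COUNT(user), COUNT(DISTINCT user))
val result = users.groupBy(_._1).map { case (zip, pairs) =>
  val (count, seen) = pairs.map(_._2).foldLeft(zero)(seqOp)
  zip -> (count, seen.size)
}

// combOp merging two partial folds agrees with one global fold for a key:
assert(combOp(seqOp(zero, "alice"), seqOp(seqOp(zero, "bob"), "alice"))
  == ((3, Set("alice", "bob"))))

println(result)
```

The key point is that the accumulator carries both the running row count and the set of users seen so far, so one pass yields both aggregates.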

On 12/5/14 3:47 AM, Arun Luthra wrote:

Is that Spark SQL? I'm wondering if it's possible without Spark SQL.

On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian <lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:

    You may do this:

    |table("users").groupBy('zip)('zip, count('user), countDistinct('user))
    |

    On 12/4/14 8:47 AM, Arun Luthra wrote:

    I'm wondering how to do this kind of SQL query with PairRDDFunctions.

    SELECT zip, COUNT(user), COUNT(DISTINCT user)
    FROM users
    GROUP BY zip

    In the Spark scala API, I can make an RDD (called "users") of
    key-value pairs where the keys are zip (as in ZIP code) and the
    values are user id's. Then I can compute the count and distinct
    count like this:

    val count = users.mapValues(_ => 1).reduceByKey(_ + _)
    val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)

    Then, if I want count and countDistinct in the same table, I have
    to join them on the key.
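    For what it's worth, the three-step approach described here (two per-key counts plus a join on zip) can be checked on a plain Scala collection. The sample data below is made up, |groupBy| stands in for |reduceByKey|, and the inner join is emulated with a map lookup:

```scala
// (zip, user) pairs standing in for the `users` RDD (made-up sample data)
val users = Seq("94107" -> "alice", "94107" -> "bob", "94107" -> "alice")

// mapValues(_ => 1).reduceByKey(_ + _): rows per zip
val count = users.groupBy(_._1).map { case (zip, ps) => zip -> ps.size }

// distinct() first, then the same count: distinct users per zip
val countDistinct =
  users.distinct.groupBy(_._1).map { case (zip, ps) => zip -> ps.size }

// the join on zip, producing (zip, (count, distinctCount));
// every zip in `count` also appears in `countDistinct`, so a lookup suffices
val joined = count.map { case (zip, c) => zip -> (c, countDistinct(zip)) }

println(joined)
```

    This makes the cost of the approach visible: three shuffled operations (two aggregations and a join) versus the single |aggregateByKey| pass.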

    Is there a way to do this without doing a join (and without using
    SQL or spark SQL)?

    Arun
