Thanks, I will try this.

On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write
> your own aggregation with aggregateByKey:
>
>     users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =>
>       (count + 1, seen + user)
>     }, { case ((count0, seen0), (count1, seen1)) =>
>       (count0 + count1, seen0 ++ seen1)
>     }).mapValues { case (count, seen) =>
>       (count, seen.size)
>     }
>
> On 12/5/14 3:47 AM, Arun Luthra wrote:
>
> > Is that Spark SQL? I'm wondering if it's possible without Spark SQL.
> >
> > On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
> >
> >> You may do this:
> >>
> >>     table("users").groupBy('zip)('zip, count('user), countDistinct('user))
> >>
> >> On 12/4/14 8:47 AM, Arun Luthra wrote:
> >>
> >> I'm wondering how to do this kind of SQL query with PairRDDFunctions.
> >>
> >>     SELECT zip, COUNT(user), COUNT(DISTINCT user)
> >>     FROM users
> >>     GROUP BY zip
> >>
> >> In the Spark Scala API, I can make an RDD (called "users") of key-value
> >> pairs where the keys are zip (as in ZIP code) and the values are user IDs.
> >> Then I can compute the count and the distinct count like this:
> >>
> >>     val count = users.mapValues(_ => 1).reduceByKey(_ + _)
> >>     val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)
> >>
> >> Then, if I want count and countDistinct in the same table, I have to
> >> join them on the key.
> >>
> >> Is there a way to do this without doing a join (and without using SQL
> >> or Spark SQL)?
> >>
> >> Arun
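The aggregateByKey suggestion in the thread can be checked without a cluster: the seqOp and combOp it passes are just plain functions. Below is a minimal plain-Scala sketch (no SparkContext; `foldLeft` stands in for the per-partition seqOp, and an explicit combOp call stands in for the cross-partition merge). The `ZipCounts` object and the sample data are hypothetical, for illustration only:

```scala
// Plain-Scala simulation of the aggregateByKey logic from the reply above.
// Accumulator: (row count, set of distinct users seen so far).
object ZipCounts {
  type Acc = (Int, Set[String])

  val zero: Acc = (0, Set.empty[String])

  // seqOp: fold one user into an accumulator (runs per element in Spark)
  def seqOp(acc: Acc, user: String): Acc =
    (acc._1 + 1, acc._2 + user)

  // combOp: merge two partial accumulators (runs across partitions in Spark)
  def combOp(a: Acc, b: Acc): Acc =
    (a._1 + b._1, a._2 ++ b._2)

  // For each zip, produce (COUNT(user), COUNT(DISTINCT user)) in one pass,
  // splitting each key's rows in two to exercise combOp as well.
  def aggregate(users: Seq[(String, String)]): Map[String, (Int, Int)] =
    users.groupBy(_._1).map { case (zip, rows) =>
      val (left, right) = rows.map(_._2).splitAt(rows.size / 2)
      val (count, seen) =
        combOp(left.foldLeft(zero)(seqOp), right.foldLeft(zero)(seqOp))
      zip -> (count, seen.size)
    }

  def main(args: Array[String]): Unit = {
    val users = Seq(            // hypothetical (zip, user) pairs
      ("94105", "alice"), ("94105", "bob"),
      ("94105", "alice"), ("10001", "carol"))
    ZipCounts.aggregate(users).toSeq.sortBy(_._1).foreach(println)
    // (10001,(1,1))
    // (94105,(3,2))
  }
}
```

Note that both counts come out of a single shuffle, which is the point of the aggregateByKey approach: no second RDD and no join on the key, unlike the `count`/`countDistinct` pair in the original question.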