Re: SQL query in scala API
Thanks, I will try this. On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian lian.cs@gmail.com wrote: Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write you own aggregation with aggregateByKey: users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) = (count + 1, seen + user) }, { case ((count0, seen0), (count1, seen1)) = (count0 + count1, seen0 ++ seen1) }).mapValues { case (count, seen) = (count, seen.size) } On 12/5/14 3:47 AM, Arun Luthra wrote: Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote: You may do this: table(users).groupBy('zip)('zip, count('user), countDistinct('user)) On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called users) of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I can compute the count and distinct count like this: val count = users.mapValues(_ = 1).reduceByKey(_ + _) val countDistinct = users.distinct().mapValues(_ = 1).reduceByKey(_ + _) Then, if I want count and countDistinct in the same table, I have to join them on the key. Is there a way to do this without doing a join (and without using SQL or spark SQL)? Arun
Re: SQL query in scala API
Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write you own aggregation with |aggregateByKey|: |users.aggregateByKey((0,Set.empty[String]))({case ((count, seen), user) = (count +1, seen + user) }, {case ((count0, seen0), (count1, seen1)) = (count0 + count1, seen0 ++ seen1) }).mapValues {case (count, seen) = (count, seen.size) } | On 12/5/14 3:47 AM, Arun Luthra wrote: Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: You may do this: |table(users).groupBy('zip)('zip, count('user), countDistinct('user)) | On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called users) of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I can compute the count and distinct count like this: val count = users.mapValues(_ = 1).reduceByKey(_ + _) val countDistinct = users.distinct().mapValues(_ = 1).reduceByKey(_ + _) Then, if I want count and countDistinct in the same table, I have to join them on the key. Is there a way to do this without doing a join (and without using SQL or spark SQL)? Arun
Re: SQL query in scala API
Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote: You may do this: table(users).groupBy('zip)('zip, count('user), countDistinct('user)) On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called users) of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I can compute the count and distinct count like this: val count = users.mapValues(_ = 1).reduceByKey(_ + _) val countDistinct = users.distinct().mapValues(_ = 1).reduceByKey(_ + _) Then, if I want count and countDistinct in the same table, I have to join them on the key. Is there a way to do this without doing a join (and without using SQL or spark SQL)? Arun
Re: SQL query in scala API
Disclaimer : I am new at Spark I did something similar in a prototype which works but I that did not test at scale yet val agg =3D users.mapValues(_ =3D 1)..aggregateByKey(new CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboO= p) class CustomAggregation() extends Serializable { var count =3D0: Long val users =3D Set(): Set[String] } object CustomAggregation { def sequenceOp(agg: CustomAggregation, user_id : String ): CustomAggregation =3D { agg.count+=3D1; agg.users+=3Duser_id return agg; } def comboOp(agg: CustomAggregation, agg2: CustomAggregation): CustomAggregation =3D { agg.count+=3D agg2.count agg.users++=3Dagg2.users return agg; } } That should gives you the aggregation , distinct count is the size of users set . I hope this helps Stephane On Wed, Dec 3, 2014 at 5:47 PM, Arun Luthra arun.lut...@gmail.com wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called users) of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I can compute the count and distinct count like this: val count = users.mapValues(_ = 1).reduceByKey(_ + _) val countDistinct = users.distinct().mapValues(_ = 1).reduceByKey(_ + _) Then, if I want count and countDistinct in the same table, I have to join them on the key. Is there a way to do this without doing a join (and without using SQL or spark SQL)? Arun
Re: SQL query in scala API
You may do this: |table(users).groupBy('zip)('zip, count('user), countDistinct('user)) | On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API, I can make an RDD (called users) of key-value pairs where the keys are zip (as in ZIP code) and the values are user id's. Then I can compute the count and distinct count like this: val count = users.mapValues(_ = 1).reduceByKey(_ + _) val countDistinct = users.distinct().mapValues(_ = 1).reduceByKey(_ + _) Then, if I want count and countDistinct in the same table, I have to join them on the key. Is there a way to do this without doing a join (and without using SQL or spark SQL)? Arun