SQL query in scala API

Arun Luthra Wed, 03 Dec 2014 16:49:10 -0800

I'm wondering how to do this kind of SQL query with PairRDDFunctions.

SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip


In the Spark Scala API, I can make an RDD (called "users") of key-value
pairs where the keys are zip (as in ZIP code) and the values are user IDs.
Then I can compute the count and distinct count like this:

val count = users.mapValues(_ => 1).reduceByKey(_ + _)
val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)

Then, if I want count and countDistinct in the same table, I have to join
them on the key.
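
For concreteness, here is a minimal sketch of that join. It assumes "users"
is an RDD[(String, String)] of (zip, user id) pairs; the object name
ZipCounts, the method name countsByJoin, and the sample data are
illustrative only and not from any real dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings the pair-RDD implicits into scope on Spark versions before 1.3;
// it is not needed on later versions.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object ZipCounts {
  // Returns (zip, (count, distinctCount)) by joining the two per-key results.
  // The method name is hypothetical, introduced for this sketch.
  def countsByJoin(users: RDD[(String, String)]): RDD[(String, (Int, Int))] = {
    // COUNT(user): one per record, summed per zip.
    val count = users.mapValues(_ => 1).reduceByKey(_ + _)
    // COUNT(DISTINCT user): drop duplicate (zip, user) pairs first.
    val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)
    // Inner join on the key (zip); every zip appears in both RDDs.
    count.join(countDistinct)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-counts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: (zip, user id).
      val users = sc.parallelize(Seq(
        ("94110", "u1"), ("94110", "u1"), ("94110", "u2"), ("10001", "u3")))
      countsByJoin(users).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This computes the two aggregates in separate passes and shuffles again for
the join, which is the extra step I would like to avoid.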

Is there a way to do this without doing a join (and without using SQL or
Spark SQL)?

Arun
