Re: SQL query in scala API

2014-12-06 Thread Arun Luthra
Thanks, I will try this.

On Fri, Dec 5, 2014 at 1:19 AM, Cheng Lian lian.cs@gmail.com wrote:

  Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write
 your own aggregation with aggregateByKey:

 users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =>
   (count + 1, seen + user)
 }, { case ((count0, seen0), (count1, seen1)) =>
   (count0 + count1, seen0 ++ seen1)
 }).mapValues { case (count, seen) =>
   (count, seen.size)
 }

 On 12/5/14 3:47 AM, Arun Luthra wrote:

   Is that Spark SQL? I'm wondering if it's possible without Spark SQL.

 On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote:

  You may do this:

 table(users).groupBy('zip)('zip, count('user), countDistinct('user))

  On 12/4/14 8:47 AM, Arun Luthra wrote:

  I'm wondering how to do this kind of SQL query with PairRDDFunctions.

  SELECT zip, COUNT(user), COUNT(DISTINCT user)
 FROM users
 GROUP BY zip

  In the Spark Scala API, I can make an RDD (called users) of key-value
 pairs where the keys are zip (as in ZIP code) and the values are user ids.
 Then I can compute the count and distinct count like this:

  val count = users.mapValues(_ => 1).reduceByKey(_ + _)
 val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)

  Then, if I want count and countDistinct in the same table, I have to
 join them on the key.

  Is there a way to do this without doing a join (and without using SQL
 or Spark SQL)?

  Arun




Re: SQL query in scala API

2014-12-05 Thread Cheng Lian
Oh, sorry. So neither SQL nor Spark SQL is preferred. Then you may write
your own aggregation with aggregateByKey:


users.aggregateByKey((0, Set.empty[String]))({ case ((count, seen), user) =>
  (count + 1, seen + user)
}, { case ((count0, seen0), (count1, seen1)) =>
  (count0 + count1, seen0 ++ seen1)
}).mapValues { case (count, seen) =>
  (count, seen.size)
}
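For readers without a cluster handy, the same per-zip (count, distinct count) logic can be sketched on plain Scala collections; the sample data below is made up purely for illustration:

```scala
// Per-zip (count, distinct count), mirroring the RDD aggregation above
// on a local Seq of (zip, user) pairs (hypothetical sample data).
val users = Seq(
  "94110" -> "alice", "94110" -> "bob",
  "94110" -> "alice", "10001" -> "carol"
)

val result = users
  .groupBy { case (zip, _) => zip }
  .map { case (zip, pairs) =>
    val ids = pairs.map { case (_, user) => user }
    // (COUNT(user), COUNT(DISTINCT user))
    zip -> (ids.size, ids.distinct.size)
  }
// result("94110") == (3, 2); result("10001") == (1, 1)
```

The Spark version distributes exactly this shape of work: the seqOp builds each partition's (count, seen-set) pair, and the combOp merges pairs across partitions.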

On 12/5/14 3:47 AM, Arun Luthra wrote:


Is that Spark SQL? I'm wondering if it's possible without Spark SQL.

On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote:


You may do this:

table(users).groupBy('zip)('zip, count('user), countDistinct('user))

On 12/4/14 8:47 AM, Arun Luthra wrote:


I'm wondering how to do this kind of SQL query with PairRDDFunctions.

SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip

In the Spark Scala API, I can make an RDD (called users) of
key-value pairs where the keys are zip (as in ZIP code) and the
values are user ids. Then I can compute the count and distinct
count like this:

val count = users.mapValues(_ => 1).reduceByKey(_ + _)
val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)

Then, if I want count and countDistinct in the same table, I have
to join them on the key.

Is there a way to do this without doing a join (and without using
SQL or Spark SQL)?

Arun



Re: SQL query in scala API

2014-12-04 Thread Arun Luthra
Is that Spark SQL? I'm wondering if it's possible without Spark SQL.

On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote:

  You may do this:

 table(users).groupBy('zip)('zip, count('user), countDistinct('user))

 On 12/4/14 8:47 AM, Arun Luthra wrote:

   I'm wondering how to do this kind of SQL query with PairRDDFunctions.

  SELECT zip, COUNT(user), COUNT(DISTINCT user)
 FROM users
 GROUP BY zip

  In the Spark Scala API, I can make an RDD (called users) of key-value
 pairs where the keys are zip (as in ZIP code) and the values are user ids.
 Then I can compute the count and distinct count like this:

  val count = users.mapValues(_ => 1).reduceByKey(_ + _)
 val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)

  Then, if I want count and countDistinct in the same table, I have to
 join them on the key.

  Is there a way to do this without doing a join (and without using SQL or
 Spark SQL)?

  Arun




Re: SQL query in scala API

2014-12-04 Thread Stéphane Verlet
Disclaimer : I am new at Spark

I did something similar in a prototype which works, but I have not tested it
at scale yet.

 val agg = users.aggregateByKey(new
CustomAggregation())(CustomAggregation.sequenceOp, CustomAggregation.comboOp)

class CustomAggregation() extends Serializable {
  var count: Long = 0
  var users: Set[String] = Set()
}

object CustomAggregation {

  def sequenceOp(agg: CustomAggregation, user_id: String): CustomAggregation = {
    agg.count += 1
    agg.users += user_id
    agg
  }

  def comboOp(agg: CustomAggregation,
              agg2: CustomAggregation): CustomAggregation = {
    agg.count += agg2.count
    agg.users ++= agg2.users
    agg
  }

}


That should give you the aggregation; the distinct count is the size of the
users set.
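The seqOp/comboOp contract can be checked locally with an immutable variant of the same idea; the names and data below are illustrative, not from the thread. Folding simulates per-partition aggregation, and comboOp merges partitions:

```scala
// Immutable sketch of the sequence/combine operations above.
case class Agg(count: Long = 0L, users: Set[String] = Set.empty)

// seqOp: fold one user id into a partition's running aggregate.
def seqOp(agg: Agg, userId: String): Agg =
  Agg(agg.count + 1, agg.users + userId)

// comboOp: merge the aggregates of two partitions.
def comboOp(a: Agg, b: Agg): Agg =
  Agg(a.count + b.count, a.users ++ b.users)

val partition1 = Seq("alice", "bob").foldLeft(Agg())(seqOp)
val partition2 = Seq("alice", "carol").foldLeft(Agg())(seqOp)
val merged = comboOp(partition1, partition2)
// merged.count == 4, merged.users.size == 3 (count vs. distinct count)
```

An immutable case class sidesteps the mutation pitfalls of the prototype above, at the cost of allocating a new object per element; Spark permits mutating and returning the first argument in both operations as an optimization.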

I hope this helps

Stephane

On Wed, Dec 3, 2014 at 5:47 PM, Arun Luthra arun.lut...@gmail.com wrote:

 I'm wondering how to do this kind of SQL query with PairRDDFunctions.

 SELECT zip, COUNT(user), COUNT(DISTINCT user)
 FROM users
 GROUP BY zip

 In the Spark Scala API, I can make an RDD (called users) of key-value
 pairs where the keys are zip (as in ZIP code) and the values are user ids.
 Then I can compute the count and distinct count like this:

 val count = users.mapValues(_ => 1).reduceByKey(_ + _)
 val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)

 Then, if I want count and countDistinct in the same table, I have to join
 them on the key.

 Is there a way to do this without doing a join (and without using SQL or
 Spark SQL)?

 Arun



Re: SQL query in scala API

2014-12-03 Thread Cheng Lian

You may do this:

table(users).groupBy('zip)('zip, count('user), countDistinct('user))

On 12/4/14 8:47 AM, Arun Luthra wrote:


I'm wondering how to do this kind of SQL query with PairRDDFunctions.

SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip

In the Spark Scala API, I can make an RDD (called users) of
key-value pairs where the keys are zip (as in ZIP code) and the values
are user ids. Then I can compute the count and distinct count like this:

val count = users.mapValues(_ => 1).reduceByKey(_ + _)
val countDistinct = users.distinct().mapValues(_ => 1).reduceByKey(_ + _)

Then, if I want count and countDistinct in the same table, I have to 
join them on the key.
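That two-pass-plus-join approach can be sketched on local collections (the sample data is hypothetical; with RDDs the final step would be a join on the zip key):

```scala
// Two separate per-zip counts, then a manual "join" on the zip key.
val users = Seq("94110" -> "alice", "94110" -> "bob", "94110" -> "alice")

val count = users.groupBy(_._1).map { case (zip, ps) => zip -> ps.size }
val countDistinct =
  users.distinct.groupBy(_._1).map { case (zip, ps) => zip -> ps.size }

val joined = count.map { case (zip, c) => zip -> (c, countDistinct(zip)) }
// joined("94110") == (3, 2)
```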


Is there a way to do this without doing a join (and without using SQL
or Spark SQL)?


Arun

