This looks like a job for SparkSQL!

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class MyRecord(country: String, name: String, age: Int, hits: Long)

val data = sc.parallelize(Array(
  MyRecord("USA", "Franklin", 24, 224),
  MyRecord("USA", "Bob", 55, 108),
  MyRecord("France", "Remi", 33, 72)))

data.registerAsTable("MyRecords")

val results = sql("""
  SELECT t.country, AVG(t.age), SUM(t.hits)
  FROM MyRecords t
  GROUP BY t.country""").collect()

Now "results" contains:

Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,332])
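
For a single-pass RDD version without SQL, the aggregateByKey() pattern Doris describes below can be sketched in plain Scala. The groupBy/foldLeft here is only a local stand-in for what aggregateByKey would do per key on a real RDD (the Spark calls are shown in the comment), and the values come from Nick's example records:

```scala
// Accumulator type U from Doris's description: one pass keeps a
// running age sum, record count, and hit total per country.
case class Acc(ageSum: Long, count: Long, hitSum: Long)

// seqOp: fold one (age, hits) value into the accumulator.
def seqOp(acc: Acc, v: (Int, Long)): Acc =
  Acc(acc.ageSum + v._1, acc.count + 1, acc.hitSum + v._2)

// combOp: merge partial accumulators (e.g. from different partitions).
def combOp(a: Acc, b: Acc): Acc =
  Acc(a.ageSum + b.ageSum, a.count + b.count, a.hitSum + b.hitSum)

// On a real RDD this would be:
//   rdd.map(r => (r.country, (r.age, r.hits)))
//      .aggregateByKey(Acc(0, 0, 0))(seqOp, combOp)
//      .mapValues(a => (a.ageSum.toDouble / a.count, a.hitSum))
// Local stand-in for the same single-pass fold:
val records = Seq(
  ("USA", (24, 224L)), ("USA", (55, 108L)), ("France", (33, 72L)))
val results = records.groupBy(_._1).map { case (country, rs) =>
  val acc = rs.map(_._2).foldLeft(Acc(0, 0, 0))(seqOp)
  country -> (acc.ageSum.toDouble / acc.count, acc.hitSum)
}
```

Because combOp is associative, partitions can be folded independently and merged, which is what lets this run in one scan of the data.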


On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:

> Hi Nick,
>
> Instead of using reduceByKey(), you might want to look into using
> aggregateByKey(), which allows you to return a different value type U
> instead of the input value type V for each input tuple (K, V). You can
> define U to be a datatype that holds both the average and total and have
> seqOp update both fields of U in a single pass.
>
> Hope this makes sense,
> Doris
>
>
> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <nicholas.cham...@gmail.com>
> wrote:
>
>> The following is a simplified example of what I am trying to accomplish.
>>
>> Say I have an RDD of objects like this:
>>
>> {
>>     "country": "USA",
>>     "name": "Franklin",
>>     "age": 24,
>>     "hits": 224
>> }
>> {
>>     "country": "USA",
>>     "name": "Bob",
>>     "age": 55,
>>     "hits": 108
>> }
>> {
>>     "country": "France",
>>     "name": "Remi",
>>     "age": 33,
>>     "hits": 72
>> }
>>
>> I want to find the average age and total number of hits per country.
>> Ideally, I would like to scan the data once and perform both aggregations
>> simultaneously.
>>
>> What is a good approach to doing this?
>>
>> I’m thinking that we’d want to keyBy(country), and then somehow
>> reduceByKey(). The problem is, I don’t know how to approach writing a
>> function that can be passed to reduceByKey() and that will track a
>> running average and total simultaneously.
>>
>> Nick
>>
>>
>> ------------------------------
>> View this message in context: Patterns for making multiple aggregations
>> in one pass
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>
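
Nick's original keyBy/reduceByKey idea also works: a running average can't be merged pairwise directly, but a (ageSum, count, hitSum) triple can, and the average falls out at the end. A plain-Scala sketch of that merge, using the same example records (groupBy plays the role of the shuffle; in Spark the same merge function would be passed to reduceByKey, as in the comment):

```scala
// Each record maps to (country, (ageSum, count, hitSum)); averages are not
// directly mergeable, but these three sums are, so reduceByKey can combine them.
val recs = Seq(("USA", 24, 224L), ("USA", 55, 108L), ("France", 33, 72L))

def merge(a: (Long, Long, Long), b: (Long, Long, Long)): (Long, Long, Long) =
  (a._1 + b._1, a._2 + b._2, a._3 + b._3)

// In Spark:
//   rdd.map(r => (r.country, (r.age.toLong, 1L, r.hits)))
//      .reduceByKey(merge)
//      .mapValues { case (s, n, h) => (s.toDouble / n, h) }
// Local stand-in (groupBy stands in for the shuffle):
val answers = recs
  .map { case (country, age, hits) => country -> (age.toLong, 1L, hits) }
  .groupBy(_._1)
  .map { case (country, vs) =>
    val (s, n, h) = vs.map(_._2).reduce(merge)
    country -> (s.toDouble / n, h)
  }
```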
