If your input data is JSON, you can also try out the recently merged
initial JSON support:
https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
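A rough sketch of what that could look like, assuming the jsonRDD method that
commit adds to SQLContext (jsonLines here is a hypothetical RDD[String] with
one JSON object per element):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// infer a schema from the JSON strings and get back a queryable SchemaRDD
val records = sqlContext.jsonRDD(jsonLines)
records.registerAsTable("records")
// same aggregation as in Evan's example below, without hand-writing a case class
sqlContext.sql("SELECT country, AVG(age), SUM(hits) FROM records GROUP BY country").collect()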

On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> That’s pretty neat! So I guess if you start with an RDD of objects, you’d
> first do something like RDD.map(lambda x: Record(x['field_1'], x['field_2'],
> ...)) in order to register it as a table, and from there run your
> aggregates. Very nice.
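In Scala, that mapping step might look roughly like this (rawRecords is a
hypothetical RDD[Map[String, String]] of parsed input; MyRecord and the
sqlContext setup come from Evan's example below):

// requires `import sqlContext._` so the createSchemaRDD implicit is in scope
val rows = rawRecords.map(m =>
  MyRecord(m("country"), m("name"), m("age").toInt, m("hits").toLong))
rows.registerAsTable("MyRecords")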
>
>
>
> On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks <evan.spa...@gmail.com>
> wrote:
>>
>> This looks like a job for SparkSQL!
>>
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext._
>> case class MyRecord(country: String, name: String, age: Int, hits: Long)
>> val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
>> MyRecord("USA", "Bob", 55, 108), MyRecord("France", "Remi", 33, 72)))
>> data.registerAsTable("MyRecords")
>> val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits) FROM
>> MyRecords t GROUP BY t.country""").collect
>>
>> Now "results" contains:
>>
>> Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
>>
>>
>>
>> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
>>>
>>> Hi Nick,
>>>
>>> Instead of reduceByKey(), you might want to look into aggregateByKey(),
>>> which lets you aggregate into a value type U that is different from the
>>> input value type V of each input tuple (K, V). You can define U as a
>>> datatype that holds a running sum and count (for the average) as well as
>>> the running hit total, and have seqOp update all of U's fields in a
>>> single pass.
>>>
>>> Hope this makes sense,
>>> Doris
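A minimal sketch of the aggregateByKey approach Doris describes, assuming the
records are already in Evan's MyRecord case class from above (records is a
hypothetical RDD[MyRecord]). U is the triple (ageSum, count, hitSum); the
average is computed once at the end:

import org.apache.spark.SparkContext._   // pair-RDD functions
val byCountry = records.map(r => (r.country, (r.age, r.hits)))
val totals = byCountry.aggregateByKey((0L, 0L, 0L))(
  // seqOp: fold one (age, hits) pair into the running (ageSum, count, hitSum)
  (u, v) => (u._1 + v._1, u._2 + 1, u._3 + v._2),
  // combOp: merge two partial aggregates from different partitions
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
val perCountry = totals.mapValues { case (ageSum, n, hits) => (ageSum.toDouble / n, hits) }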
>>>
>>>
>>> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> The following is a simplified example of what I am trying to accomplish.
>>>>
>>>> Say I have an RDD of objects like this:
>>>>
>>>> {
>>>>     "country": "USA",
>>>>     "name": "Franklin",
>>>>     "age": 24,
>>>>     "hits": 224
>>>> }
>>>> {
>>>>     "country": "USA",
>>>>     "name": "Bob",
>>>>     "age": 55,
>>>>     "hits": 108
>>>> }
>>>> {
>>>>     "country": "France",
>>>>     "name": "Remi",
>>>>     "age": 33,
>>>>     "hits": 72
>>>> }
>>>>
>>>> I want to find the average age and total number of hits per country.
>>>> Ideally, I would like to scan the data once and perform both aggregations
>>>> simultaneously.
>>>>
>>>> What is a good approach to doing this?
>>>>
>>>> I’m thinking that we’d want to keyBy(country), and then somehow
>>>> reduceByKey(). The problem is, I don’t know how to approach writing a
>>>> function that can be passed to reduceByKey() and that will track a running
>>>> average and total simultaneously.
>>>>
>>>> Nick
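One way to make reduceByKey() work is to carry an (ageSum, count, hitSum)
triple as the value and derive the average only at the end, so the reduce
function stays associative and commutative. A rough sketch, assuming the
records have been parsed into the MyRecord case class from Evan's reply above
(records is a hypothetical RDD[MyRecord]):

import org.apache.spark.SparkContext._   // pair-RDD functions
val keyed = records.map(r => (r.country, (r.age.toLong, 1L, r.hits)))
// sum the three running totals pairwise
val reduced = keyed.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
val perCountry = reduced.mapValues { case (ageSum, n, hits) => (ageSum.toDouble / n, hits) }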
>>>>
>>>>
>>>> ________________________________
>>>> View this message in context: Patterns for making multiple aggregations
>>>> in one pass
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>
