If your input data is JSON, you can also try out the recently merged initial JSON support: https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
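For example, here is a minimal sketch of what that enables (assuming the jsonRDD() method added in that commit, and borrowing the records from Nick's question below):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    // An RDD of JSON strings, one object per element.
    val json = sc.parallelize(Seq(
      """{"country": "USA", "name": "Franklin", "age": 24, "hits": 224}""",
      """{"country": "USA", "name": "Bob", "age": 55, "hits": 108}""",
      """{"country": "France", "name": "Remi", "age": 33, "hits": 72}"""))

    // Infer the schema from the JSON itself and register the result as a table.
    val records = sqlContext.jsonRDD(json)
    records.registerAsTable("records")

    sql("SELECT country, AVG(age), SUM(hits) FROM records GROUP BY country").collect()

No case class is needed here, since the schema is inferred directly from the JSON.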
On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> That’s pretty neat! So I guess if you start with an RDD of objects, you’d
> first do something like RDD.map(lambda x: Record(x['field_1'],
> x['field_2'], ...)) in order to register it as a table, and from there
> run your aggregates. Very nice.
>
>
> On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks <evan.spa...@gmail.com>
> wrote:
>>
>> This looks like a job for SparkSQL!
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext._
>> case class MyRecord(country: String, name: String, age: Int, hits: Long)
>> val data = sc.parallelize(Array(
>>   MyRecord("USA", "Franklin", 24, 234),
>>   MyRecord("USA", "Bob", 55, 108),
>>   MyRecord("France", "Remi", 33, 72)))
>> data.registerAsTable("MyRecords")
>> val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits)
>>   FROM MyRecords t GROUP BY t.country""").collect
>>
>> Now "results" contains:
>>
>> Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
>>
>>
>> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:
>>>
>>> Hi Nick,
>>>
>>> Instead of using reduceByKey(), you might want to look into using
>>> aggregateByKey(), which allows you to return a different value type U
>>> instead of the input value type V for each input tuple (K, V). You can
>>> define U to be a datatype that holds both the average and total and
>>> have seqOp update both fields of U in a single pass. [A sketch of this
>>> approach follows at the end of this thread.]
>>>
>>> Hope this makes sense,
>>> Doris
>>>
>>>
>>> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> The following is a simplified example of what I am trying to
>>>> accomplish.
>>>>
>>>> Say I have an RDD of objects like this:
>>>>
>>>> {
>>>>   "country": "USA",
>>>>   "name": "Franklin",
>>>>   "age": 24,
>>>>   "hits": 224
>>>> }
>>>> {
>>>>   "country": "USA",
>>>>   "name": "Bob",
>>>>   "age": 55,
>>>>   "hits": 108
>>>> }
>>>> {
>>>>   "country": "France",
>>>>   "name": "Remi",
>>>>   "age": 33,
>>>>   "hits": 72
>>>> }
>>>>
>>>> I want to find the average age and total number of hits per country.
>>>> Ideally, I would like to scan the data once and perform both
>>>> aggregations simultaneously.
>>>>
>>>> What is a good approach to doing this?
>>>>
>>>> I’m thinking that we’d want to keyBy(country), and then somehow
>>>> reduceByKey(). The problem is, I don’t know how to approach writing a
>>>> function that can be passed to reduceByKey() and that will track a
>>>> running average and total simultaneously.
>>>>
>>>> Nick
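To make Doris's suggestion concrete, here is a minimal sketch of the
single-pass aggregateByKey() approach (reusing the MyRecord case class and
the data RDD from Evan's snippet above; note the accumulator holds a running
sum and count rather than the average itself, which is computed once at the
end):

    // Key each record by country: RDD[(String, MyRecord)]
    val byCountry = data.map(r => (r.country, r))

    // Accumulator U = (ageSum, count, hitsSum), a different type
    // than the value type V = MyRecord.
    val zero = (0L, 0L, 0L)
    val perCountry = byCountry.aggregateByKey(zero)(
      // seqOp: fold one record into a partition-local accumulator
      (acc, r) => (acc._1 + r.age, acc._2 + 1, acc._3 + r.hits),
      // combOp: merge accumulators from different partitions
      (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

    // Derive the average from the sum and count after the single scan.
    val results = perCountry.mapValues { case (ageSum, count, hitsSum) =>
      (ageSum.toDouble / count, hitsSum)
    }

    results.collect()
    // e.g. Array((USA,(39.5,342)), (France,(33.0,72)))

This scans the data once and produces both aggregates. reduceByKey() alone
can't do it because a running average is not associative, but a (sum, count)
pair is, which is why the accumulator tracks those instead.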