This looks like a job for SparkSQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class MyRecord(country: String, name: String, age: Int, hits: Long)

val data = sc.parallelize(Array(
  MyRecord("USA", "Franklin", 24, 234),
  MyRecord("USA", "Bob", 55, 108),
  MyRecord("France", "Remi", 33, 72)))

data.registerAsTable("MyRecords")

val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits)
                     FROM MyRecords t
                     GROUP BY t.country""").collect

Now "results" contains:

Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])

On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin <doris.s....@gmail.com> wrote:

> Hi Nick,
>
> Instead of using reduceByKey(), you might want to look into using
> aggregateByKey(), which allows you to return a different value type U
> instead of the input value type V for each input tuple (K, V). You can
> define U to be a datatype that holds both the average and total and have
> seqOp update both fields of U in a single pass.
>
> Hope this makes sense,
> Doris
>
>
> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <nicholas.cham...@gmail.com>
> wrote:
>
>> The following is a simplified example of what I am trying to accomplish.
>>
>> Say I have an RDD of objects like this:
>>
>> { "country": "USA", "name": "Franklin", "age": 24, "hits": 224 }
>> { "country": "USA", "name": "Bob", "age": 55, "hits": 108 }
>> { "country": "France", "name": "Remi", "age": 33, "hits": 72 }
>>
>> I want to find the average age and total number of hits per country.
>> Ideally, I would like to scan the data once and perform both aggregations
>> simultaneously.
>>
>> What is a good approach to doing this?
>>
>> I'm thinking that we'd want to keyBy(country), and then somehow
>> reduceByKey(). The problem is, I don't know how to approach writing a
>> function that can be passed to reduceByKey() and that will track a
>> running average and total simultaneously.
>>
>> Nick
>>
>> ------------------------------
>> View this message in context: Patterns for making multiple aggregations
>> in one pass
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
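A minimal sketch of the accumulator approach Doris describes. The names (Acc, seqOp, combOp) are hypothetical, and the fold over plain Scala collections stands in for the RDD so the logic is easy to check; the same zero/seqOp/combOp triple is what you would pass to records.aggregateByKey(zero)(seqOp, combOp) on a keyed RDD of (country, (age, hits)) pairs:

```scala
// Accumulator holding everything needed for both aggregations:
// a record count and age sum (for the average) plus a hit sum (for the total).
case class Acc(count: Long, ageSum: Long, hits: Long)

val zero = Acc(0L, 0L, 0L)

// seqOp: fold one (age, hits) value into an accumulator.
def seqOp(acc: Acc, rec: (Int, Long)): Acc =
  Acc(acc.count + 1, acc.ageSum + rec._1, acc.hits + rec._2)

// combOp: merge two accumulators (aggregateByKey needs this to
// combine the per-partition results).
def combOp(a: Acc, b: Acc): Acc =
  Acc(a.count + b.count, a.ageSum + b.ageSum, a.hits + b.hits)

// Records keyed by country: (country, (age, hits)).
val records = Seq(
  ("USA", (24, 234L)),
  ("USA", (55, 108L)),
  ("France", (33, 72L)))

// Plain-Scala stand-in for records.aggregateByKey(zero)(seqOp, combOp).
val perCountry = records.groupBy(_._1).map { case (country, rs) =>
  (country, rs.map(_._2).foldLeft(zero)(seqOp))
}

// One pass over the data, then a cheap final division per key.
val results = perCountry.map { case (country, acc) =>
  (country, (acc.ageSum.toDouble / acc.count, acc.hits))
}
// results("USA") is (39.5, 342); results("France") is (33.0, 72)
```

Because combOp is associative, the same merge works whether the two accumulators come from different partitions or from a tree of partial merges, which is what makes the single-pass distributed aggregation correct.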