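Hi Nick,

Instead of reduceByKey(), you might want to look into aggregateByKey(), which lets you return a value type U that differs from the input value type V of each input tuple (K, V). An average cannot be merged directly across partitions, so define U as a datatype that holds the running sum of ages, the record count, and the total hits. seqOp then updates all of those fields for each record in a single pass, combOp merges the partial U values from different partitions, and you divide the sum by the count at the end to get the average.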
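Here is a minimal sketch of the aggregateByKey() idea, written in plain Python so it runs without a cluster. It assumes U is a tuple (age_sum, count, hits_total), keeping the sum and count rather than the average itself because sums merge cleanly across partitions. The names ZERO, seq_op, comb_op, finish and local_aggregate_by_key are made up for illustration; local_aggregate_by_key is only a local stand-in for what Spark does, not Spark's implementation.

```python
# Accumulator U = (age_sum, count, hits_total). ZERO is the empty accumulator.
ZERO = (0, 0, 0)


def seq_op(acc, record):
    """Fold one input record (V) into a partition-local accumulator (U)."""
    age_sum, count, hits = acc
    return (age_sum + record["age"], count + 1, hits + record["hits"])


def comb_op(a, b):
    """Merge two accumulators that were built on different partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finish(acc):
    """Convert the final accumulator into (average age, total hits)."""
    age_sum, count, hits = acc
    return (age_sum / count, hits)


def local_aggregate_by_key(partitions, zero, seq, comb):
    """Hypothetical stand-in that mimics aggregateByKey on in-memory lists."""
    merged = {}
    for part in partitions:
        # First aggregate within the partition using seq ...
        local = {}
        for key, value in part:
            local[key] = seq(local.get(key, zero), value)
        # ... then merge the per-partition results using comb.
        for key, acc in local.items():
            merged[key] = comb(merged[key], acc) if key in merged else acc
    return merged


records = [
    {"country": "USA", "name": "Franklin", "age": 24, "hits": 224},
    {"country": "USA", "name": "Bob", "age": 55, "hits": 108},
    {"country": "France", "name": "Remi", "age": 33, "hits": 72},
]

# Two pretend partitions, split so that "USA" spans both of them.
partitions = [
    [(r["country"], r) for r in records[:1]],
    [(r["country"], r) for r in records[1:]],
]

totals = local_aggregate_by_key(partitions, ZERO, seq_op, comb_op)
result = {country: finish(acc) for country, acc in totals.items()}
print(result)
```

With PySpark, the same three functions should plug in along the lines of rdd.keyBy(lambda r: r["country"]).aggregateByKey(ZERO, seq_op, comb_op).mapValues(finish), assuming each element of the RDD is a dict like the ones above.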
Hope this makes sense,
Doris

On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <nicholas.cham...@gmail.com> wrote:

> The following is a simplified example of what I am trying to accomplish.
>
> Say I have an RDD of objects like this:
>
> {
>   "country": "USA",
>   "name": "Franklin",
>   "age": 24,
>   "hits": 224
> }
> {
>   "country": "USA",
>   "name": "Bob",
>   "age": 55,
>   "hits": 108
> }
> {
>   "country": "France",
>   "name": "Remi",
>   "age": 33,
>   "hits": 72
> }
>
> I want to find the average age and total number of hits per country.
> Ideally, I would like to scan the data once and perform both aggregations
> simultaneously.
>
> What is a good approach to doing this?
>
> I’m thinking that we’d want to keyBy(country), and then somehow
> reduceByKey(). The problem is, I don’t know how to approach writing a
> function that can be passed to reduceByKey() and that will track a
> running average and total simultaneously.
>
> Nick