Hi Nick,

Instead of using reduceByKey(), you might want to look into using
aggregateByKey(), which allows you to return a different value type U
instead of the input value type V for each input tuple (K, V). You can
define U to be a datatype that holds a running age sum, a record count,
and a running hits total, and have seqOp update all three fields of U in
a single pass; the average is then just sum / count at the end.
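For instance, here is a rough sketch of that seqOp/combOp logic in plain Python (standing in for the Spark API so it's self-contained; the field names and the aggregate_by_key helper are just illustrations of what aggregateByKey does per partition, not Spark code):

```python
# Sketch of aggregateByKey semantics: U = (age_sum, count, hits_total).
# seqOp folds one record into a per-key accumulator; combOp merges the
# partial accumulators produced independently for each partition.
from collections import defaultdict

zero = (0, 0, 0)  # zeroValue: empty accumulator for each key

def seq_op(u, record):
    """Fold one input record (V) into the per-key accumulator (U)."""
    age_sum, count, hits_total = u
    return (age_sum + record["age"], count + 1, hits_total + record["hits"])

def comb_op(u1, u2):
    """Merge two partial accumulators for the same key."""
    return (u1[0] + u2[0], u1[1] + u2[1], u1[2] + u2[2])

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Local stand-in for RDD.aggregateByKey over partitioned records."""
    merged = {}
    for part in partitions:
        acc = defaultdict(lambda: zero)
        for record in part:  # one pass over the data
            acc[record["country"]] = seq_op(acc[record["country"]], record)
        for key, u in acc.items():
            merged[key] = comb_op(merged.get(key, zero), u)
    return merged

partitions = [
    [{"country": "USA", "name": "Franklin", "age": 24, "hits": 224},
     {"country": "France", "name": "Remi", "age": 33, "hits": 72}],
    [{"country": "USA", "name": "Bob", "age": 55, "hits": 108}],
]

totals = aggregate_by_key(partitions, zero, seq_op, comb_op)
# Compute the average only at the end, once everything is folded in.
summary = {k: {"avg_age": s / n, "total_hits": h}
           for k, (s, n, h) in totals.items()}
```

In actual Spark you'd pass the same three pieces as rdd.aggregateByKey(zeroValue)(seqOp, combOp) (Scala) after keying by country.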

Hope this makes sense,
Doris


On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas <nicholas.cham...@gmail.com>
wrote:

> The following is a simplified example of what I am trying to accomplish.
>
> Say I have an RDD of objects like this:
>
> {
>     "country": "USA",
>     "name": "Franklin",
>     "age": 24,
>     "hits": 224
> }
> {
>     "country": "USA",
>     "name": "Bob",
>     "age": 55,
>     "hits": 108
> }
> {
>     "country": "France",
>     "name": "Remi",
>     "age": 33,
>     "hits": 72
> }
>
> I want to find the average age and total number of hits per country.
> Ideally, I would like to scan the data once and perform both aggregations
> simultaneously.
>
> What is a good approach to doing this?
>
> I’m thinking that we’d want to keyBy(country), and then somehow
> reduceByKey(). The problem is, I don’t know how to approach writing a
> function that can be passed to reduceByKey() and that will track a
> running average and total simultaneously.
>
> Nick
>
>
> ------------------------------
> View this message in context: Patterns for making multiple aggregations
> in one pass
> <http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
