I think your questions revolve around the reduce function here, which is a function of 2 arguments returning 1, whereas in a Hadoop Reducer you implement a many-to-many function.
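To make that concrete, here is a minimal, Spark-free sketch of what a 2-to-1 associative function accomplishes per key. The `reduceByKey` helper below is hypothetical (plain Java collections, not Spark's API), written just to mirror the idea: fold the operation over each key's values until one value remains.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class ReduceByKeySketch {
    // Conceptual model only: for each key, fold the associative 2-to-1
    // operation over all of that key's values until one value remains.
    static <K extends Comparable<K>, V> Map<K, V> reduceByKey(
            List<Map.Entry<K, V>> pairs, BinaryOperator<V> op) {
        Map<K, V> result = new TreeMap<>();  // TreeMap only for stable print order
        for (Map.Entry<K, V> e : pairs) {
            result.merge(e.getKey(), e.getValue(), op);
        }
        return result;
    }

    public static void main(String[] args) {
        // The (word, 1) pairs the map stage produces for "I am who I am".
        List<Map.Entry<String, Integer>> ones = List.of(
                Map.entry("I", 1), Map.entry("am", 1), Map.entry("who", 1),
                Map.entry("I", 1), Map.entry("am", 1));
        // The 2-to-1 function: two integers sum to one.
        Map<String, Integer> counts = reduceByKey(ones, (i1, i2) -> i1 + i2);
        System.out.println(counts);  // {I=2, am=2, who=1}
    }
}
```

The point is that no N-to-1 function is ever needed: repeatedly applying the 2-to-1 function collapses any number of values for a key down to one.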
This API is simpler, if less general. Here you provide an associative operation that can reduce any 2 values down to 1 (e.g. two integers sum to one). This is used to reduce all values for each key to 1. It's not necessary to provide an N-to-1 function, since that can be accomplished with a 2-to-1 function. Here, you can't emit multiple values for one key. The result is one (key, reduced value) pair for each (key, bunch of values).

The Mapper and Reducer in classic Hadoop MapReduce were actually both quite similar (just that one takes a collection of values rather than a single value per key) and let you implement a lot of patterns. In a way that's good; in a way it was wasteful and complex. You can still reproduce what Mappers and Reducers do, but the method in Spark is mapPartitions, possibly paired with groupByKey. These are the most general operations you might consider, and I'm not saying you *should* emulate MapReduce this way in Spark. In fact it's unlikely to be efficient. But it is possible.

On Sat, Aug 2, 2014 at 9:55 AM, Anil Karamchandani <[email protected]> wrote:

> Hi,
>
> I am a complete newbie to Spark and MapReduce frameworks and have a basic
> question on the reduce function. I was working on the word count example
> and was stuck at the reduce stage where the sum happens.
>
> I am trying to understand the working of reduceByKey in Spark, using Java
> as the programming language.
> Say I have a sentence "I am who I am". I break the sentence into words and
> store them as a list: [I, am, who, I, am]
>
> Now this function assigns 1 to each word:
>
>     JavaPairRDD<String, Integer> ones = words.mapToPair(
>         new PairFunction<String, String, Integer>() {
>             @Override
>             public Tuple2<String, Integer> call(String s) {
>                 return new Tuple2<String, Integer>(s, 1);
>             }
>         });
>
> so the output is something like this:
>
> (I,1) (am,1) (who,1) (I,1) (am,1)
>
> Now if I have 3 reducers running, each reducer will get a key and the
> values associated with that key:
>
> reducer 1 : (I,1) (I,1)
> reducer 2 : (am,1) (am,1)
> reducer 3 : (who,1)
>
> I wanted to know:
>
> a. What exactly happens in the code below?
>
> b. What are the parameters of the new Function2?
>
> c. Basically, how is the JavaPairRDD formed?
>
>     JavaPairRDD<String, Integer> counts = ones.reduceByKey(
>         new Function2<Integer, Integer, Integer>() {
>             @Override
>             public Integer call(Integer i1, Integer i2) {
>                 return i1 + i2;
>             }
>         });
>
> thanks!
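As for what happens in that last block: reduceByKey invokes the Function2 repeatedly per key, pairwise, e.g. call(1, 1) for the two (I,1) values. A plain-Java sketch of that call sequence (no Spark; `reduceValues` is a made-up helper for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class PairwiseReduce {
    // Fold a 2-to-1 function over a list of values the way reduceByKey
    // combines one key's values: call(v1, v2), then call(result, v3), ...
    static int reduceValues(List<Integer> values, BinaryOperator<Integer> f) {
        int acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = f.apply(acc, values.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        // The same 2-to-1 function the question passes to reduceByKey.
        BinaryOperator<Integer> sum = (i1, i2) -> i1 + i2;

        // Values the map stage produced for key "I": (I,1) (I,1)
        System.out.println("I   -> " + reduceValues(Arrays.asList(1, 1), sum)); // I   -> 2
        // Values for key "am": (am,1) (am,1)
        System.out.println("am  -> " + reduceValues(Arrays.asList(1, 1), sum)); // am  -> 2
        // A lone value is returned as-is: (who,1)
        System.out.println("who -> " + reduceValues(Arrays.asList(1), sum));    // who -> 1
    }
}
```

The resulting JavaPairRDD just holds one (key, reduced value) pair per key: (I,2) (am,2) (who,1).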
