Whoops, you are right. Sorry for the misinformation. Indeed reduceByKey
just calls combineByKey:
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}
(I think I confused reduceByKey with groupByKey.)
Hi Daniel
Thanks for your reply. I think reduceByKey will also do a map-side combine, so the result should be the same: for each partition, one entry per distinct word. In my case with JavaSerializer, the 240MB dataset yields around 70MB of shuffle data. Only that shuffle
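To illustrate why the map-side combine bounds the shuffle output at one entry per distinct word per partition, here is a minimal plain-Scala sketch (no Spark required; the partition data and helper name are made up for illustration) of the local merge that reduceByKey performs before shuffling:

```scala
// Sketch of a map-side combine: within one partition, (word, 1) records
// are merged locally, so at most one record per distinct word is shuffled.
object MapSideCombineSketch {
  def combinePartition(records: Seq[(String, Int)]): Seq[(String, Int)] =
    records
      .groupBy(_._1)                                  // bucket by word
      .map { case (w, ps) => (w, ps.map(_._2).sum) }  // sum counts locally
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val partition = Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1))
    // 4 input records collapse to 2 shuffle records
    println(combinePartition(partition).mkString(","))
  }
}
```

With this local merge, shuffle size scales with the number of distinct words per partition rather than with the raw record count, which is consistent with a 240MB input producing only ~70MB of shuffle data.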
Hi Patrick
I am just doing a simple word count; the data is generated by the Hadoop random text writer.
This does not seem related to compression: if I turn off shuffle compression, the metrics for the smaller 240MB dataset look like the following.
Executor ID Address