Whoops, you are right. Sorry for the misinformation. Indeed reduceByKey
just calls combineByKey:
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
  combineByKey[V]((v: V) => v, func, func, partitioner)
}
(I think I confused reduceByKey with groupByKey.)
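For contrast, a minimal word-count sketch of the difference between the two (standard pair-RDD API; the master URL and input path are placeholders):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit pair-RDD functions (pre-1.3 style)

// reduceByKey pre-aggregates (word, 1) pairs within each map partition before
// the shuffle; groupByKey ships every single pair across the network.
val sc = new SparkContext("spark://master:7077", "wordcount-sketch")
val pairs = sc.textFile("hdfs:///input").flatMap(_.split("\\s+")).map((_, 1))
val reduced = pairs.reduceByKey(_ + _)             // map-side combine applied
val grouped = pairs.groupByKey().mapValues(_.sum)  // no map-side combine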
Hi Daniel
Thanks for your reply. I think reduceByKey will also do a map-side combine, so the result should be the same: for each partition, one entry per distinct word. In my case, with JavaSerializer, the 240MB dataset yields around 70MB of shuffle data. Only that shuffle
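A minimal local sketch of what that map-side combine amounts to (the data and app name here are illustrative, not the actual job):

import org.apache.spark.SparkContext

// Emulate the map-side combine by hand: aggregate within each partition, so
// at most (#distinct words) records per partition would enter the shuffle.
val sc = new SparkContext("local[2]", "combine-sketch")
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
val preCombined = words.mapPartitions { it =>
  it.foldLeft(Map.empty[String, Int]) { (acc, w) =>
    acc.updated(w, acc.getOrElse(w, 0) + 1)
  }.iterator
}
println(preCombined.count())  // bounded by #partitions x #distinct words, not by input size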
Hi
I am running a simple word count program on a Spark standalone cluster. The cluster is made up of 6 nodes; each runs 4 workers, each worker owns 10G of memory, and each node has 16 cores, for a total of 96 cores and 240G of memory. (Well, it has also been configured as 1 worker with 40G of memory on each node.)
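For reference, the two layouts described would look roughly like this in conf/spark-env.sh (the per-worker core counts are my assumption, chosen to match the stated 96-core total):

# Layout 1: 4 workers per node, 10G each
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_MEMORY=10g
SPARK_WORKER_CORES=4    # assumption: 6 nodes x 4 workers x 4 cores = 96 cores

# Layout 2: 1 worker per node with all of its memory
# SPARK_WORKER_INSTANCES=1
# SPARK_WORKER_MEMORY=40g
# SPARK_WORKER_CORES=16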
Hi Patrick
I am just doing a simple word count; the data is generated by the Hadoop random text writer.
This seems to me not quite related to compression. If I turn off compression for the shuffle, the metrics look something like the following for the smaller 240MB dataset:
Executor ID Address
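(For reference, a sketch of how shuffle compression can be turned off; these are standard Spark config keys, though how they are set depends on the Spark version in use:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("wordcount")
  .set("spark.shuffle.compress", "false")        // whether to compress map output files
  .set("spark.shuffle.spill.compress", "false")  // whether to compress data spilled during shuffles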