Re: Shuffle Spill Issue

2014-04-30 Thread Daniel Darabos
Whoops, you are right. Sorry for the misinformation. Indeed reduceByKey just calls combineByKey: def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = { combineByKey[V]((v: V) => v, func, func, partitioner) } (I think I confused reduceByKey with groupByKey.) On Wed, Apr
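
A minimal sketch of the equivalence described above, assuming a spark-shell session (sc is the shell's SparkContext; the dataset and names are made up for illustration):

    // reduceByKey(func) delegates to combineByKey with createCombiner = identity and
    // mergeValue = mergeCombiners = func, so map-side combine stays enabled.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)), 2)

    val viaReduce  = pairs.reduceByKey(_ + _)
    val viaCombine = pairs.combineByKey[Int](
      (v: Int) => v,                  // createCombiner: identity
      (c: Int, v: Int) => c + v,      // mergeValue
      (c1: Int, c2: Int) => c1 + c2)  // mergeCombiners

    viaReduce.collect().sortBy(_._1)   // Array((a,3), (b,1))
    viaCombine.collect().sortBy(_._1)  // Array((a,3), (b,1))

    // groupByKey, by contrast, disables map-side combine, so every (key, 1) pair
    // crosses the shuffle boundary before being grouped.
    pairs.groupByKey().mapValues(_.sum).collect()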

RE: Shuffle Spill Issue

2014-04-29 Thread Liu, Raymond
Hi Daniel, thanks for your reply. I think reduceByKey will also do a map side combine, so the result is the same: for each partition, one entry per distinct word. In my case with JavaSerializer, the 240MB dataset yields around 70MB of shuffle data. Only that shuffle
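
A rough sketch of the expectation Raymond describes, with made-up data and again assuming a spark-shell session: after map-side combine, each map partition contributes at most one record per distinct word to the shuffle.

    val words = sc.parallelize(Seq("spark", "spark", "shuffle", "spill", "spark", "spill"), 2)
    val pairs = words.map(word => (word, 1))

    // How many combined records each map partition would emit: at most one per distinct word,
    // no matter how often the word occurs in that partition.
    pairs.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, it.map(_._1).toSet.size))
    }.collect()
    // e.g. Array((0,2), (1,2)): four combined records in total, versus six raw (word, 1) pairs.

    pairs.reduceByKey(_ + _).collect()   // Array((spark,3), (shuffle,1), (spill,2)) in some order

How large those combined records end up on the wire then depends mostly on the serializer, which is presumably why the JavaSerializer figure is called out.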

Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi, I am running a simple word count program on a Spark standalone cluster. The cluster is made up of 6 nodes, each running 4 workers; each worker owns 10G of memory and 16 cores, thus 96 cores and 240G of memory in total. (Well, it was also previously configured as 1 worker with 40G of memory on each node.)
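
A sketch of what such a job typically looks like; this is a reconstruction rather than Raymond's actual code, and the master URL, paths, and app name are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // spark://host:7077 is the default standalone master URL; the per-node worker layout
        // described above would normally come from conf/spark-env.sh
        // (e.g. SPARK_WORKER_INSTANCES=4, SPARK_WORKER_MEMORY=10g).
        val conf = new SparkConf()
          .setAppName("word-count")
          .setMaster("spark://master-host:7077")              // placeholder master host
        val sc = new SparkContext(conf)

        val counts = sc.textFile("hdfs:///data/random-text")  // placeholder input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///data/word-counts")     // placeholder output path
        sc.stop()
      }
    }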

RE: Shuffle Spill Issue

2014-04-28 Thread Liu, Raymond
Hi Patrick I am just doing a simple word count; the data is generated by the Hadoop random text writer. This does not seem to me to be related to compression. If I turn off compression on shuffle, the metrics are something like below for the smaller 240MB dataset. Executor ID Address
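
For reference, the shuffle compression settings being toggled here both default to true; turning them off would look roughly like this (a sketch, and the equivalent lines could instead go in spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("word-count-uncompressed-shuffle")
      .set("spark.shuffle.compress", "false")        // compression of shuffle map output files
      .set("spark.shuffle.spill.compress", "false")  // compression of data spilled to disk during shuffles
    val sc = new SparkContext(conf)

With both disabled, the shuffle write figures reported per executor should reflect the raw serialized size of the combined records.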