Even though it may not sound intuitive, reduceByKey expects all of the
values for a particular key within a partition to fit in memory. So once
you increase the number of partitions, the job can run.
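
As a concrete illustration (a minimal sketch; the object name, RDD, and
sizes here are invented for the example, not taken from this thread), the
partition count can be passed directly to reduceByKey:

import org.apache.spark.{SparkConf, SparkContext}

object ReduceByKeyPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-partitions"))

    // Toy pair RDD standing in for the real ~75-million-key dataset.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 75000, 1L))

    // The second argument sets the number of shuffle partitions, so each
    // reduce task's in-memory combiner map covers fewer keys at once.
    val counts = pairs.reduceByKey(_ + _, 400)

    println(counts.count())
    sc.stop()
  }
}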
Ok, so that worked flawlessly after I upped the number of partitions from
40 to 400.
Thanks!
On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung wrote:
I'll try that. As of now I have a small number of partitions, on the order
of 20~40.
It would be great if there were some documentation on the memory
requirements with respect to the number of keys and the number of
partitions per executor (i.e., Spark's internal memory requirement outside
of the user space).
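
(Rough back-of-envelope, not from the thread: if 75 million keys hash
evenly across 40 partitions, each reduce task's combiner map holds on the
order of 1.9 million entries; across 400 partitions, roughly 190,000
entries, plus per-entry JVM object overhead.)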
Have you taken a look at SPARK-11293?
Consider using repartition to increase the number of partitions.
FYI
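
For example (a sketch only; the function name, RDD name, and element type
are placeholders, and 400 is the partition count that ended up working
elsewhere in this thread):

import org.apache.spark.rdd.RDD

// Sketch, assuming `pairs` stands in for the job's pair RDD: spread the
// data over more, smaller partitions before the shuffle-heavy aggregation.
def aggregateWithMorePartitions(pairs: RDD[(Long, Long)]): RDD[(Long, Long)] =
  pairs.repartition(400).reduceByKey(_ + _)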
On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung wrote:
Hello,
I'm using Spark version 1.6.0 and have trouble with memory when trying to
do reduceByKey on a dataset with as many as 75 million keys; I get the
following exception when I run the task.
There are 20 workers in the cluster. It is running in standalone mode with
12 GB assigned