Have you taken a look at SPARK-11293? Consider using repartition to increase the number of partitions.
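
Something along these lines might help. A minimal sketch only, assuming your data is a pair RDD; the input path, the tab-delimited key extraction, and the count of 2000 partitions are placeholders to adapt to your job:

import org.apache.spark.{SparkConf, SparkContext}

object IncreasePartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduceByKey-more-partitions"))

    // Placeholder input standing in for the real 75M-key dataset.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map(line => (line.split('\t')(0), 1L))

    // Passing an explicit partition count to reduceByKey makes each
    // shuffle-read task aggregate a smaller slice of the key space,
    // so it needs less memory at any one time.
    val counts = pairs.reduceByKey(_ + _, 2000)

    // Alternatively, repartition before aggregating (adds an extra shuffle):
    // val counts = pairs.repartition(2000).reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///path/to/output")
    sc.stop()
  }
}

(If the aggregation is actually going through Spark SQL, which the TungstenAggregate frames in your trace suggest, the analogous knob is spark.sql.shuffle.partitions.)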
FYI

On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

> Hello,
>
> I'm using Spark version 1.6.0 and I run into memory trouble when trying to
> do reduceByKey on a dataset with as many as 75 million keys; I get the
> following exception when I run the task.
>
> There are 20 workers in the cluster. It is running in standalone mode with
> 12 GB assigned per executor and 4 cores per worker. spark.memory.fraction
> is set to 0.5 and I'm not using any caching.
>
> What might be the problem here? Since I'm using version 1.6.0, this
> doesn't seem to be related to SPARK-12155. The problem always happens
> during the shuffle read phase.
>
> Is there a minimum amount of executor memory (spark.memory.fraction)
> required for shuffle read?
>
> java.lang.OutOfMemoryError: Unable to acquire 262144 bytes of memory, got 0
>     at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91)
>     at org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:735)
>     at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:197)
>     at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:212)
>     at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.<init>(UnsafeFixedWidthAggregationMap.java:103)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:483)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)