Ok, so that worked flawlessly after I upped the number of partitions to 400 from 40.
Thanks! On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote: > I'll try that, as of now I have a small number of partitions in the order > of 20~40. > > It would be great if there's some documentation on the memory requirement > wrt the number of keys and the number of partitions per executor (i.e., the > Spark's internal memory requirement outside of the user space). > > Otherwise, it's like shooting in the dark. > > On Fri, May 13, 2016 at 7:20 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Have you taken a look at SPARK-11293 ? >> >> Consider using repartition to increase the number of partitions. >> >> FYI >> >> On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung < >> coded...@cs.stanford.edu> wrote: >> >>> Hello, >>> >>> I'm using Spark version 1.6.0 and have trouble with memory when trying >>> to do reducebykey on a dataset with as many as 75 million keys. I.e. I get >>> the following exception when I run the task. >>> >>> There are 20 workers in the cluster. It is running under the standalone >>> mode with 12 GB assigned per executor and 4 cores per worker. The >>> spark.memory.fraction is set to 0.5 and I'm not using any caching. >>> >>> What might be the problem here? Since I'm using the version 1.6.0, this >>> doesn't seem to be related to SPARK-12155. This problem always happens >>> during the shuffle read phase. >>> >>> Is there a minimum amount of memory required for executor >>> (spark.memory.fraction) for shuffle read? >>> >>> java.lang.OutOfMemoryError: Unable to acquire 262144 bytes of memory, got 0 >>> at >>> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91) >>> at >>> org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:735) >>> at >>> org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:197) >>> at >>> org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:212) >>> at >>> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.<init>(UnsafeFixedWidthAggregationMap.java:103) >>> at >>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:483) >>> at >>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) >>> at >>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) >>> at >>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) >>> at >>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) >>> at >>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) >>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) >>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) >>> at >>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) >>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) >>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) >>> at >>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) >>> at >>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) >>> at org.apache.spark.scheduler.Task.run(Task.scala:89) >>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >>> at java.lang.Thread.run(Thread.java:745) >>> >>> >> >