I'll try that. As of now I have a small number of partitions, on the
order of 20-40.
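
For reference, here's roughly what I plan to try, as a minimal sketch
(the input path and the target count of 400 partitions are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("agg"))
    val pairs = sc.textFile("hdfs:///path/to/input")  // illustrative path
      .map(line => (line.split('\t')(0), 1L))
    // Raise the reduce-side partition count so each task handles fewer
    // of the ~75M keys (400 here, up from the current 20-40).
    val reduced = pairs.reduceByKey(_ + _, 400)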

It would be great if there were some documentation on Spark's memory
requirements with respect to the number of keys and the number of
partitions per executor (i.e., Spark's internal memory requirement
outside of user space).

Otherwise, it's like shooting in the dark.

On Fri, May 13, 2016 at 7:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Have you taken a look at SPARK-11293?
>
> Consider using repartition to increase the number of partitions.
>
> FYI
>
> On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Hello,
>>
>> I'm using Spark version 1.6.0 and I'm having memory trouble when doing
>> reduceByKey on a dataset with as many as 75 million keys; I get the
>> following exception when I run the task.
>>
>> There are 20 workers in the cluster. It is running in standalone mode
>> with 12 GB assigned per executor and 4 cores per worker.
>> spark.memory.fraction is set to 0.5 and I'm not using any caching.
>>
>> What might be the problem here? Since I'm using version 1.6.0, this
>> doesn't seem to be related to SPARK-12155. The problem always happens
>> during the shuffle read phase.
>>
>> Is there a minimum amount of executor memory (via
>> spark.memory.fraction) required for shuffle reads?
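>>
>> For concreteness, a sketch of how the settings described above are
>> applied (with spark.memory.fraction=0.5, roughly half of the 12 GB heap
>> minus the ~300 MB reserve, about 5.8 GB, is Spark-managed memory shared
>> by execution and storage):
>>
>>     import org.apache.spark.SparkConf
>>
>>     // Sketch only: the executor heap and unified-memory fraction
>>     // described above; everything else is left at its default.
>>     val conf = new SparkConf()
>>       .set("spark.executor.memory", "12g")
>>       .set("spark.memory.fraction", "0.5")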
>>
>> java.lang.OutOfMemoryError: Unable to acquire 262144 bytes of memory, got 0
>>      at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91)
>>      at org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:735)
>>      at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:197)
>>      at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:212)
>>      at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.<init>(UnsafeFixedWidthAggregationMap.java:103)
>>      at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:483)
>>      at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>>      at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>>      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>      at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>      at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>      at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>      at java.lang.Thread.run(Thread.java:745)
>>
>>
>
