Ok, so that worked flawlessly after I upped the number of partitions to 400
from 40.

Thanks!

On Fri, May 13, 2016 at 7:28 PM, Sung Hwan Chung <coded...@cs.stanford.edu>
wrote:

> I'll try that, as of now I have a small number of partitions in the order
> of 20~40.
>
> It would be great if there's some documentation on the memory requirement
> wrt the number of keys and the number of partitions per executor (i.e., the
> Spark's internal memory requirement outside of the user space).
>
> Otherwise, it's like shooting in the dark.
>
> On Fri, May 13, 2016 at 7:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Have you taken a look at SPARK-11293 ?
>>
>> Consider using repartition to increase the number of partitions.
>>
>> FYI
>>
>> On Fri, May 13, 2016 at 12:14 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> Hello,
>>>
>>> I'm using Spark version 1.6.0 and have trouble with memory when trying
>>> to do reducebykey on a dataset with as many as 75 million keys. I.e. I get
>>> the following exception when I run the task.
>>>
>>> There are 20 workers in the cluster. It is running under the standalone
>>> mode with 12 GB assigned per executor and 4 cores per worker. The
>>> spark.memory.fraction is set to 0.5 and I'm not using any caching.
>>>
>>> What might be the problem here? Since I'm using the version 1.6.0, this
>>> doesn't seem to be related to  SPARK-12155. This problem always happens
>>> during the shuffle read phase.
>>>
>>> Is there a minimum  amount of memory required for executor
>>> (spark.memory.fraction) for shuffle read?
>>>
>>> java.lang.OutOfMemoryError: Unable to acquire 262144 bytes of memory, got 0
>>>     at 
>>> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:91)
>>>     at 
>>> org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:735)
>>>     at 
>>> org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:197)
>>>     at 
>>> org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:212)
>>>     at 
>>> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.<init>(UnsafeFixedWidthAggregationMap.java:103)
>>>     at 
>>> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:483)
>>>     at 
>>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
>>>     at 
>>> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>>>     at 
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>>     at 
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>>     at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>     at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>     at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>     at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>     at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>
>

Reply via email to