Ryan,
Thank you for your reply.

For 2 TB of data, what should the value of
spark.yarn.executor.memoryOverhead be?

With regard to this, I see a Spark issue,
https://issues.apache.org/jira/browse/SPARK-18787, and I am not sure whether
it works in Spark 2.0.1.

Can you elaborate on the spark.memory.fraction setting?

Number of partitions = 674
Cluster: 455 GB total memory, VCores: 288, Nodes: 17
Memory config tried: executor-mem=16g, num-executors=10, executor-cores=6,
driver-mem=4g

spark.default.parallelism=1000
spark.sql.shuffle.partitions=1000
spark.yarn.executor.memoryOverhead=2048
spark.shuffle.io.preferDirectBufs=false
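
For reference, a rough sketch of how the executor settings in this thread translate into YARN container sizes. It assumes the Spark 2.x default overhead of max(384 MB, 10% of executor memory) and that YARN rounds each request up to a multiple of 1024 MB; that increment is a cluster setting, so the value here is an assumption. The function names are made up for illustration.

```python
import math

def default_overhead_mb(executor_mb):
    # Assumed Spark 2.x default: 10% of executor memory, at least 384 MB.
    return max(384, int(executor_mb * 0.10))

def container_mb(executor_mb, overhead_mb=None, yarn_increment_mb=1024):
    # Container request = executor heap + overhead, rounded up to the
    # YARN allocation increment (assumed to be 1024 MB here).
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_mb)
    requested = executor_mb + overhead_mb
    return math.ceil(requested / yarn_increment_mb) * yarn_increment_mb

# First attempt in the thread: executor mem=12g, overhead left at default.
original = container_mb(12 * 1024)

# Current attempt: executor-mem=16g, memoryOverhead=2048, 10 executors.
current = container_mb(16 * 1024, 2048)
total_gb = current * 10 / 1024

print(original / 1024, current / 1024, total_gb)
```

Under these assumptions the first attempt gets a 14 GB container, which matches the "14 GB physical memory" limit in the YARN error, and the current settings ask for 18 GB per executor, or 180 GB of the cluster's 455 GB.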









On Wed, Aug 2, 2017 at 10:43 PM, Ryan Blue <rb...@netflix.com> wrote:

> Chetan,
>
> When you're writing to a partitioned table, you want to use a shuffle to
> avoid the situation where each task has to write to every partition. You
> can do that either by adding a repartition by your table's partition keys,
> or by adding an order by with the partition keys and then columns you
> normally use to filter when reading the table. I generally recommend the
> second approach because it handles skew and prepares the data for more
> efficient reads.
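
A toy model of why the shuffle matters, written in plain Python rather than Spark. It assumes each write task produces one file per table partition it holds rows for. Only the figure of roughly 700 date partitions comes from the thread; the task count, row count, and the modulo stand-in for hash partitioning are made up for illustration.

```python
import random

def count_output_files(task_rows):
    # One file per (task, distinct partition value) pair.
    return sum(len(set(rows)) for rows in task_rows)

random.seed(0)
dates = list(range(700))     # ~700 date partitions, as in the thread
num_tasks = 50               # hypothetical number of write tasks
rows = [random.choice(dates) for _ in range(100_000)]

# No shuffle: rows are spread arbitrarily, so most tasks see most dates.
unshuffled = [rows[i::num_tasks] for i in range(num_tasks)]

# Shuffle by partition key: every row for a date lands in exactly one task.
shuffled = [[] for _ in range(num_tasks)]
for d in rows:
    shuffled[d % num_tasks].append(d)   # stand-in for hash partitioning

files_without = count_output_files(unshuffled)
files_with = count_output_files(shuffled)
print(files_without, files_with)
```

In this model the shuffled write produces one file per date, while the unshuffled write produces many times more, because nearly every task touches nearly every date.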
>
> If that doesn't help, then you should look at your memory settings. When
> you're getting killed by YARN, you should consider setting
> `spark.shuffle.io.preferDirectBufs=false` so you use less off-heap memory
> that the JVM doesn't account for. That is usually an easier fix than
> increasing the memory overhead. Also, when you set executor memory, always
> change spark.memory.fraction to ensure the memory you're adding is used
> where it is needed. If your memory fraction is the default 60%, then 60% of
> the memory will be used for Spark execution, not reserved for whatever is
> consuming it and causing the OOM. (If Spark's memory is too low, you'll see
> other problems like spilling too much to disk.)
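
A small arithmetic sketch of the point about spark.memory.fraction, using the 12g and 16g executor sizes from this thread. It assumes the Spark 2.x unified memory model, where the Spark region is (heap minus a 300 MB reserve) times the fraction; the function names are hypothetical and the real accounting may differ by version.

```python
RESERVED_MB = 300  # assumption: reserved memory in Spark 2.x

def spark_region_mb(heap_mb, fraction=0.6):
    # Memory Spark manages for execution and storage.
    return (heap_mb - RESERVED_MB) * fraction

def fraction_to_keep_region(old_heap_mb, new_heap_mb, old_fraction=0.6):
    # Fraction that keeps Spark's region unchanged after growing the heap,
    # so the added memory goes to whatever else was consuming it.
    return spark_region_mb(old_heap_mb, old_fraction) / (new_heap_mb - RESERVED_MB)

old_heap, new_heap = 12 * 1024, 16 * 1024
new_fraction = fraction_to_keep_region(old_heap, new_heap)

print(spark_region_mb(old_heap))                 # Spark region at 12g, fraction 0.6
print(spark_region_mb(new_heap))                 # Spark region at 16g, fraction 0.6
print(round(new_fraction, 3))                    # fraction that holds the region steady
```

With the fraction left at 0.6, most of the extra 4 GB goes to Spark's own region. Lowering the fraction to roughly 0.45 in this example would leave Spark's region where it was and give the whole increase to the rest of the heap.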
>
> rb
>
> On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri
> <chetan.opensou...@gmail.com> wrote:
>
>> Can anyone please guide me on the above issue?
>>
>>
>> On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Spark Users,
>>>
>>> I am reading from an HBase table and writing to a Hive managed table,
>>> partitioned by a date column. That worked fine, but it generated a large
>>> number of files across almost 700 partitions, so I wanted to use
>>> repartition to reduce file I/O by reducing the number of files inside each
>>> partition.
>>>
>>> *But I ended up with the exception below:*
>>>
>>> ExecutorLostFailure (executor 11 exited caused by one of the running
>>> tasks) Reason: Container killed by YARN for exceeding memory limits. 14.0
>>> GB of 14 GB physical memory used. Consider boosting
>>> spark.yarn.executor.memoryOverhead.
>>>
>>> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8
>>>
>>> Do you think the settings below can help me overcome the above issue?
>>>
>>> spark.default.parallelism=1000
>>> spark.sql.shuffle.partitions=1000
>>>
>>> Because the default maximum number of partitions is 1000.
>>>
>>>
>>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
