The memory overhead depends less on the total amount of data and more on what you end up doing with it (e.g. if you're doing a lot of off-heap processing or using Python, you need to increase it). Honestly, most people find this number for their job "experimentally" (i.e. they try a few different values and see what works).
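If it helps, here is a rough sketch of where those knobs go. The numbers are placeholders to iterate on, not a recommendation for any particular data size, and most people pass the same settings through spark-submit --conf rather than in code:

import org.apache.spark.sql.SparkSession

// Illustrative values only -- raise the overhead in steps (e.g. 2048 ->
// 3072 -> 4096 MB) and re-run until YARN stops killing containers.
val spark = SparkSession.builder()
  .appName("overhead-tuning-sketch")                      // placeholder app name
  .config("spark.yarn.executor.memoryOverhead", "3072")   // MB of off-heap headroom per executor
  .config("spark.shuffle.io.preferDirectBufs", "false")   // prefer on-heap buffers for shuffle transfers
  .getOrCreate()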
On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:

> Ryan,
> Thank you for the reply.
>
> For 2 TB of data, what should the value of spark.yarn.executor.memoryOverhead be?
>
> With regard to this, I see an issue in Spark at
> https://issues.apache.org/jira/browse/SPARK-18787, not sure whether it works or not in Spark 2.0.1!
>
> Can you elaborate more on the spark.memory.fraction setting?
>
> Number of partitions = 674
> Cluster: 455 GB total memory, VCores: 288, Nodes: 17
> Given / tried memory config: executor-mem=16g, num-executors=10, executor-cores=6, driver-mem=4g
>
> spark.default.parallelism=1000
> spark.sql.shuffle.partitions=1000
> spark.yarn.executor.memoryOverhead=2048
> spark.shuffle.io.preferDirectBufs=false
>
> On Wed, Aug 2, 2017 at 10:43 PM, Ryan Blue <rb...@netflix.com> wrote:
>
>> Chetan,
>>
>> When you're writing to a partitioned table, you want to use a shuffle to
>> avoid the situation where each task has to write to every partition. You
>> can do that either by adding a repartition by your table's partition keys,
>> or by adding an order by with the partition keys and then the columns you
>> normally use to filter when reading the table. I generally recommend the
>> second approach because it handles skew and prepares the data for more
>> efficient reads.
>>
>> If that doesn't help, then you should look at your memory settings. When
>> you're getting killed by YARN, you should consider setting
>> `spark.shuffle.io.preferDirectBufs=false` so you use less off-heap memory
>> that the JVM doesn't account for. That is usually an easier fix than
>> increasing the memory overhead. Also, when you set executor memory, always
>> change spark.memory.fraction to ensure the memory you're adding is used
>> where it is needed. If your memory fraction is the default 60%, then 60%
>> of the memory will be used for Spark execution, not reserved for whatever
>> is consuming it and causing the OOM. (If Spark's memory is too low, you'll
>> see other problems like spilling too much to disk.)
>>
>> rb
>>
>> On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>>
>>> Can anyone please guide me with the above issue?
>>>
>>> On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>>>
>>>> Hello Spark Users,
>>>>
>>>> I am reading from an HBase table and writing to a Hive managed table,
>>>> where I applied partitioning by a date column. That worked fine, but it
>>>> generated a large number of files across almost 700 partitions, so I
>>>> wanted to use repartition to reduce file I/O by reducing the number of
>>>> files inside each partition.
>>>>
>>>> *But I ended up with the exception below:*
>>>>
>>>> ExecutorLostFailure (executor 11 exited caused by one of the running
>>>> tasks) Reason: Container killed by YARN for exceeding memory limits.
>>>> 14.0 GB of 14 GB physical memory used. Consider boosting
>>>> spark.yarn.executor.memoryOverhead.
>>>>
>>>> Driver memory=4g, executor mem=12g, num-executors=8, executor cores=8
>>>>
>>>> Do you think the settings below can help me overcome the above issue?
>>>>
>>>> spark.default.parallelism=1000
>>>> spark.sql.shuffle.partitions=1000
>>>>
>>>> Because the default max number of partitions is 1000.
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
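In case a concrete example helps, here is a rough sketch of the shuffle-before-write approach Ryan describes above. The table and column names (source_db.events_raw, target_db.events, date_col, filter_col) are placeholders rather than anything from this thread, and depending on your Hive setup you may also need dynamic partitioning enabled:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("partitioned-write-sketch")   // placeholder app name
  .enableHiveSupport()
  .getOrCreate()

// Depending on your Hive setup you may need:
// spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

val df = spark.table("source_db.events_raw")   // placeholder source table

// Option 1: repartition by the table's partition key, so each task writes
// to only a few partitions (and far fewer small files per partition).
df.repartition(col("date_col"))
  .write
  .insertInto("target_db.events")   // existing Hive table partitioned by date_col

// Option 2 (the approach Ryan recommends): a global sort by the partition
// key plus the columns you normally filter on when reading; it handles skew
// and lays the data out for more efficient reads.
df.orderBy(col("date_col"), col("filter_col"))
  .write
  .insertInto("target_db.events")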