We are testing Dataset/DataFrame jobs instead of RDD jobs, and one thing we keep running into is containers getting killed by YARN for exceeding memory limits. I understand this is related to off-heap memory usage, and the usual suggestion is to increase spark.yarn.executor.memoryOverhead.
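For reference, here is a minimal sketch of the kind of sizing we are talking about (the app name and values are illustrative; in practice we pass the executor settings at submit time via spark-submit --conf, since YARN fixes the container sizes at launch):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sizing only -- normally set via spark-submit --conf on YARN,
// because container sizes cannot be changed after the application starts.
val spark = SparkSession.builder()
  .appName("dataset-job")                                // placeholder name
  .config("spark.executor.memory", "4g")                 // executor JVM heap
  .config("spark.yarn.executor.memoryOverhead", "4096")  // off-heap headroom, in MB
  .getOrCreate()
```

The YARN container limit is roughly the sum of the two, which is what makes the overhead growth below so noticeable.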
At times our memoryOverhead ends up as large as the executor memory itself (say 4G and 4G). Why does the Dataset/DataFrame API use so much off-heap memory? We haven't touched spark.memory.offHeap.enabled, which defaults to false. Should we enable it to get a better handle on this?
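In other words, would something like the following (a hypothetical sketch, not what we run today) actually make the off-heap usage more predictable? My understanding is that spark.memory.offHeap.size must also be set to a positive value when the flag is enabled, and that this memory is allocated outside the JVM heap, so it still counts toward the YARN container's physical memory limit:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical: explicitly enabling Spark-managed off-heap memory.
// spark.memory.offHeap.size must be > 0 when the flag is on; the size is
// allocated outside the JVM heap and still has to fit under the YARN
// container limit alongside spark.yarn.executor.memoryOverhead.
val spark = SparkSession.builder()
  .appName("offheap-experiment")                  // placeholder name
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")      // illustrative size
  .getOrCreate()
```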