I have the simplest job which i'm running against 100TB of data. The job keeps
failing with ExecutorLostFailure's on containers killed by Yarn for exceeding
memory limits
I have varied the executor-memory from 32GB to 96GB, the
spark.yarn.executor.memoryOverhead from 8192 to 36000 and similar changes to
the number of cores, and driver size. It looks like nothing stops this error
(running out of memory) from happening. Looking at metrics reported by Spark
status page, is there anything I can use to configure my job properly? Is
repartitioning more or less going to help at all? The current number of
partitions is around 40,000 currently.
Here's the gist of the code:
val input = sc.textFile(path)
val t0 = input.map(s => s.split("\t").map(a => ((a(0),a(1)), a)))
t0.persist(StorageLevel.DISK_ONLY)
I have changed storagelevel from MEMORY_ONLY to MEMORY_AND_DISK to DISK_ONLY to
no avail.