I have a very simple job that I'm running against 100 TB of data. The job keeps failing with `ExecutorLostFailure`s on containers killed by YARN for exceeding memory limits.
I have varied `executor-memory` from 32 GB to 96 GB and `spark.yarn.executor.memoryOverhead` from 8192 to 36000, and made similar changes to the number of cores and the driver size. Nothing stops this error (running out of memory) from happening.

Looking at the metrics reported by the Spark status page, is there anything I can use to configure my job properly? Would repartitioning help at all? The current number of partitions is around 40,000.

Here's the gist of the code:

```scala
import org.apache.spark.storage.StorageLevel

val input = sc.textFile(path)
val t0 = input.map(s => s.split("\t").map(a => ((a(0), a(1)), a)))
t0.persist(StorageLevel.DISK_ONLY)
```

I have changed the storage level from `MEMORY_ONLY` to `MEMORY_AND_DISK` to `DISK_ONLY`, to no avail.
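For reference, this is the shape of the submit command I've been tweaking. The values shown are just one of the combinations I tried, and the class name, jar name, and input path are placeholders:

```shell
# One combination of the settings I've been varying.
# I've tried --executor-memory from 32g up to 96g and
# spark.yarn.executor.memoryOverhead from 8192 up to 36000,
# always with the same ExecutorLostFailure result.
spark-submit \
  --master yarn \
  --class com.example.MyJob \
  --executor-memory 64g \
  --executor-cores 4 \
  --driver-memory 16g \
  --conf spark.yarn.executor.memoryOverhead=16384 \
  my-job.jar /path/to/input
```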