I am running a large Spark job with 4000 partitions. After about four hours on a 16-node cluster it fails with the message below. The errors come from Spark's own code and seem to point to unreliability at the disk level. Has anyone seen this, and does anyone know what is going on and how to fix it?
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
Task 13 in stage 15.827 failed 4 times, most recent failure: Lost task 13.3 in stage 15.827
(TID 13386, pltrd022.labs.uninett.no): java.io.IOException: failed to read chunk
    org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:348)
    org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
    org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
    .....