I am running a large job using 4000 partitions. After running for four
hours on a 16-node cluster it fails with the message below.
The errors are in Spark code and seem to point to unreliability at the
disk level.
Has anyone seen this, and does anyone know what is going on and how to fix it?


Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 13 in stage 15.827 failed 4 times, most recent
failure: Lost task 13.3 in stage 15.827 (TID 13386, pltrd022.labs.uninett.no):
java.io.IOException: failed to read chunk
        org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:348)
        org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
        org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
        .....
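
In case it helps narrow things down: since the failure surfaces inside the
Snappy decompression path, one diagnostic (a sketch, not a confirmed fix; it
assumes the standard spark.io.compression.codec setting and that lz4 is
available in this Spark build, and "codec-test" is a placeholder app name)
is to switch the shuffle compression codec away from Snappy and see whether
the failures follow the codec or the disks:

    // Diagnostic sketch: swap the compression codec used for shuffle and
    // spill files. If the job then fails with a different (or no)
    // decompression error, corrupt shuffle blocks on disk are more likely
    // the cause than Snappy itself. The stack trace above suggests this
    // Spark version is currently using "snappy".
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("codec-test")                 // placeholder name
      .set("spark.io.compression.codec", "lz4") // was "snappy"
    val sc = new SparkContext(conf)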
