I am using a custom Hadoop InputFormat that works well on smaller files, but it fails on a file of about 4 GB. The format generates about 800 splits, and all the variables in my code are longs. Any suggestions? Is anyone else reading files of this size?
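For context, here is roughly how the read is set up (a simplified sketch: TextInputFormat stands in for my custom InputFormat, the HDFS path is a placeholder, and the 256 MB split cap is only an illustration of the kind of setting I have been experimenting with). Using the resulting RDD is what produces the exception below.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ReadLargeFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("custom-input-format"))

    // Cap the split size so no single partition gets close to 2 GB
    // (the 256 MB value is arbitrary here, not what I actually use).
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

    // TextInputFormat is a stand-in for my custom InputFormat;
    // the key and value types are assumptions for this sketch.
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///path/to/4gb-file",   // placeholder path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    println(s"partitions: ${rdd.partitions.length}")  // roughly matches the ~800 splits
  }
}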
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 113 in stage 0.0 failed 4 times, most recent failure: Lost task 113.3 in stage 0.0 (TID 38, pltrd022.labs.uninett.no): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:104)
        at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:452)
        at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:368)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:552)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)