Hello folks,

I am running Random Forest from spark.ml on Spark 1.6.1 against the Bimbo [1] dataset, with the following driver launch configuration:
"-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" "-Dspark.speculation=true" "-Dspark.speculation.multiplier=2" "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" "-XX:SurvivorRatio=3" "-DnumPartitions=36" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@11.0.0.106:56419" I get following error: 16/10/04 06:55:05 WARN TaskSetManager: Lost task 8.0 in stage 19.0 (TID 194, 11.0.0.106): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503) at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) I have varied number of partitions from 24 to 48. I still get the same error. 
How can this problem be tackled?

Thanks,
Samkit

[1]: https://www.kaggle.com/c/grupo-bimbo-inventory-demand