You can use the .repartition(<integer>) method on the RDD or DataFrame to increase the number of partitions, and .partitions.length to get the current number of partitions (Scala API).
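For example, a minimal sketch (the RDD contents and the target partition count of 200 are illustrative assumptions, not values from this thread):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("RepartitionExample")
val sc = new SparkContext(conf)

// Hypothetical training RDD; in practice this is the random forest input.
val trainingData = sc.parallelize(1 to 36500)

// Current number of partitions (Scala API).
println(s"Current partitions: ${trainingData.partitions.length}")

// Increase the partition count so each task handles a smaller block of data.
// 200 is an arbitrary example; tune it for your data size and cluster.
val repartitioned = trainingData.repartition(200)
println(s"New partitions: ${repartitioned.partitions.length}")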
Andrew

> On Jul 24, 2016, at 4:30 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
> The data set is the training data set for random forest training, about
> 36,500 records. Any idea how to further partition it?
>
> On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich <and...@aehrlich.com> wrote:
> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235,
> which limits the size of the blocks in the file being written to disk to 2GB.
>
> If so, the solution is to tune for smaller tasks. Try increasing the number
> of partitions, using a more space-efficient data structure inside the RDD,
> or increasing the amount of memory available to Spark and caching the data
> in memory. Make sure you are using Kryo serialization.
>
> Andrew
>
>> On Jul 23, 2016, at 9:00 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>>
>> Hi,
>>
>> Please help!
>>
>> My Spark: 1.6.2
>> Java: java8_u40
>>
>> I am trying random forest training and I get "Size exceeds Integer.MAX_VALUE".
>>
>> Any idea how to resolve it?
>>
>> (the log)
>> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 25)
>> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>>   at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>>   at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
>>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
>>   at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>>   at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25,
>> localhost): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>> (stack trace identical to the one above)
>>
>> Regards
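To make the tuning advice quoted above concrete: the underlying limit is that Java NIO's FileChannel.map cannot map a region larger than Integer.MAX_VALUE bytes (about 2GB), which is why keeping each cached block below that size helps. Below is a minimal sketch assuming LIBSVM-format input; the path, memory size, partition count, and RandomForest parameters are all illustrative assumptions, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("RandomForestTuning")
  // Kryo is more compact than the default Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Illustrative executor memory; size this to your cluster.
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)

// Hypothetical input path, in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training.libsvm")

// More partitions mean smaller per-task blocks (each must stay under 2GB);
// persist serialized in memory, spilling to disk, instead of recomputing.
val training = data.repartition(200).persist(StorageLevel.MEMORY_AND_DISK_SER)

// Illustrative RandomForest parameters (MLlib 1.6 API).
val model = RandomForest.trainClassifier(
  training,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)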