You can call .repartition(<integer>) on the RDD or DataFrame to increase the 
number of partitions, and .partitions.length to check the current number of 
partitions (Scala API).

Andrew

> On Jul 24, 2016, at 4:30 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
> 
> The data set is the training set for random forest training, about 36,500 
> records. Any idea how to partition it further?  
> 
> On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich <and...@aehrlich.com> wrote:
> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235, which 
> covers the 2 GB limit on the size of blocks written to disk.
> 
> If so, the solution is to tune for smaller tasks: try increasing the number 
> of partitions, using a more space-efficient data structure inside the RDD, or 
> increasing the amount of memory available to Spark and caching the data in 
> memory. Make sure you are using Kryo serialization. 
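
A rough sketch of that Kryo-plus-caching setup for Spark 1.6 (the registered 
class, memory value, and RDD name are only examples, size them to your job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Enable Kryo and register the classes stored in the RDD.
val conf = new SparkConf()
  .setAppName("RandomForestTraining")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[LabeledPoint]))
  .set("spark.executor.memory", "8g") // example value; size to your cluster
val sc = new SparkContext(conf)

// Cache in serialized form so each cached block stays small.
def cacheForTraining(trainingData: RDD[LabeledPoint]): RDD[LabeledPoint] =
  trainingData.persist(StorageLevel.MEMORY_ONLY_SER)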
> 
> Andrew
> 
>> On Jul 23, 2016, at 9:00 PM, Ascot Moss <ascot.m...@gmail.com> wrote:
>> 
>> 
>> Hi,
>> 
>> Please help!
>> 
>> My spark: 1.6.2
>> Java: java8_u40
>> 
>> I am running random forest training and I got "Size exceeds Integer.MAX_VALUE".
>> 
>> Any idea how to resolve it?
>> 
>> 
>> (the log) 
>> 16/07/24 07:59:49 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 25)
>> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
>> at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 16/07/24 07:59:49 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 25, localhost): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
>> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
>> at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
>> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
>> at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
>> at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
>> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 
>> 
>> Regards
>> 
> 
> 
