Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Michal Čizmazia
Thanks Matei! It worked.

On 9 July 2015 at 19:43, Matei Zaharia matei.zaha...@gmail.com wrote:

 This means that one of your cached RDD partitions is bigger than 2 GB of
 data. You can fix it by having more partitions. If you read data from a
 file system like HDFS or S3, set the number of partitions higher in the
 sc.textFile, hadoopFile, etc. methods (it's an optional second parameter to
 those methods). If you create it through parallelize, or if this particular
 RDD comes from a shuffle, use more tasks in the parallelize or shuffle.

 Matei


Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Michal Čizmazia
Spark version 1.4.0 in standalone mode

2015-07-09 20:12:02 INFO  (sparkDriver-akka.actor.default-dispatcher-3) BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB)
2015-07-09 20:12:02 ERROR (Executor task launch worker-0) Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:509)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:615)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
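
The top frame is the actual constraint here: java.nio's FileChannel.map returns a single MappedByteBuffer, which is Int-indexed, so the JDK rejects any mapping larger than Integer.MAX_VALUE bytes. A minimal sketch of the failure outside Spark (the file path is hypothetical):

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

object MapLimitDemo extends App {
  // Any file larger than Integer.MAX_VALUE bytes (~2 GB) makes map() throw
  // "java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE",
  // the same top frame as in the trace above.
  val channel = new RandomAccessFile("/tmp/large-block.bin", "r").getChannel
  val buffer  = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
  channel.close()
}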





work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Michal Čizmazia
Could anyone please give me pointers to an appropriate SparkConf to work
around "Size exceeds Integer.MAX_VALUE"?

Stacktrace:

2015-07-09 20:12:02 INFO  (sparkDriver-akka.actor.default-dispatcher-3) BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB)
2015-07-09 20:12:02 ERROR (Executor task launch worker-0) Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
    at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
    ...
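
On the SparkConf angle specifically: no SparkConf setting raises the 2 GB block limit itself (it comes from the Int-indexed buffers shown above), but spark.default.parallelism is a standard setting that raises the default partition count used when none is given explicitly. A hedged sketch, with an illustrative value:

import org.apache.spark.SparkConf

// Illustrative value: default partition count for parallelize and for
// shuffle operations that are not given an explicit numPartitions.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.default.parallelism", "2000")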


Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Ted Yu
Which release of Spark are you using?

Can you show the complete stack trace?

getBytes() could be called from:
getBytes(file, 0, file.length)
or:
getBytes(segment.file, segment.offset, segment.length)

Cheers



Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Matei Zaharia
This means that one of your cached RDD partitions is bigger than 2 GB of data.
You can fix it by having more partitions. If you read data from a file system
like HDFS or S3, set the number of partitions higher in the sc.textFile,
hadoopFile, etc. methods (it's an optional second parameter to those methods).
If you create it through parallelize, or if this particular RDD comes from a
shuffle, use more tasks in the parallelize or shuffle.

Matei
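
A minimal sketch of those three options against the standard RDD API (the paths, sizes, and partition counts below are illustrative, not from the thread):

import org.apache.spark.{SparkConf, SparkContext}

object MorePartitions extends App {
  val sc = new SparkContext(new SparkConf().setAppName("more-partitions"))

  // Option 1: files on HDFS/S3 -- the optional second parameter of
  // sc.textFile / sc.hadoopFile is the minimum number of input partitions.
  val lines = sc.textFile("hdfs:///data/events", 2000)

  // Option 2: parallelize -- pass an explicit numSlices.
  val nums = sc.parallelize(1 to 1000000, 2000)

  // Option 3: shuffles -- most shuffle operations accept an explicit
  // numPartitions, keeping every resulting partition well under 2 GB.
  val counts = lines.map(line => (line, 1L)).reduceByKey(_ + _, 2000)

  counts.cache().count()
  sc.stop()
}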
