Hi,

Could you try repartitioning the data with .repartition(<number of cores on the machine>), or supplying a minimum number of partitions while reading, as in sc.textFile(path, <number of cores on the machine>)?
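For illustration, both suggestions might look roughly like the sketch below. This is a minimal, hypothetical example — the path, app name, and core count are placeholders, not values from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder setup for illustration only.
val sc = new SparkContext(new SparkConf().setAppName("repartition-example"))
val numCores = 8 // hypothetical: set to the number of cores on the machine

// Option 1: request a minimum number of partitions at read time.
// The second argument to textFile is minPartitions.
val lines = sc.textFile("hdfs:///path/to/data", numCores)

// Option 2: repartition an existing RDD before the shuffle-heavy aggregation.
// repartition triggers a full shuffle to redistribute the data evenly.
val repartitioned = lines.repartition(numCores)
```

Either way, the goal is the same: more, smaller partitions, so that no single block grows past what a shuffle fetch can handle.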
It may be that the whole data set is stored in a single block. With billions of rows, a single block can grow past the 2 GB limit on indexing into a mapped buffer, which produces the "Size exceeds Integer.MAX_VALUE" error.

Best,
Burak

----- Original Message -----
From: "francisco" <ftanudj...@nextag.com>
To: u...@spark.incubator.apache.org
Sent: Wednesday, September 17, 2014 3:18:29 PM
Subject: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

Hi,

We are running an aggregation on a huge data set (a few billion rows). While running the task we got the following error (see below). Any ideas? We are running Spark 1.1.0 on the CDH distribution.

...
14/09/17 13:33:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 2083 bytes result sent to driver
14/09/17 13:33:30 INFO CoarseGrainedExecutorBackend: Got assigned task 1
14/09/17 13:33:30 INFO Executor: Running task 0.0 in stage 2.0 (TID 1)
14/09/17 13:33:30 INFO TorrentBroadcast: Started reading broadcast variable 2
14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(1428) called with curMem=163719, maxMem=34451478282
14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1428.0 B, free 32.1 GB)
14/09/17 13:33:30 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
14/09/17 13:33:30 INFO TorrentBroadcast: Reading broadcast variable 2 took 0.027374294 s
14/09/17 13:33:30 INFO MemoryStore: ensureFreeSpace(2336) called with curMem=165147, maxMem=34451478282
14/09/17 13:33:30 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.3 KB, free 32.1 GB)
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkdri...@sas-model1.pv.sv.nextag.com:56631/user/MapOutputTracker#794212052]
14/09/17 13:33:30 INFO MapOutputTrackerWorker: Got the output locations
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
14/09/17 13:33:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 8 ms
14/09/17 13:33:30 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Error occurred while fetching local blocks
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:104)
        at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:120)
        at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:358)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:208)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:205)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.getLocalBlocks(BlockFetcherIterator.scala:205)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.initialize(BlockFetcherIterator.scala:240)
        at org.apache.spark.storage.BlockManager.getMultiple(BlockManager.scala:583)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:77)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:41)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
14/09/17 13:33:30 INFO CoarseGrainedExecutorBackend: Got assigned task 2
14/09/17 13:33:30 INFO Executor: Running task 0.0 in stage 1.1 (TID 2)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Size-exceeds-Integer-MAX-VALUE-in-BlockFetcherIterator-tp14483.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
---------------------------------------------------------------------
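[Editor's note on the root cause, with an illustrative sketch.] The top frame, sun.nio.ch.FileChannelImpl.map, memory-maps a file region into a java.nio.MappedByteBuffer. Because ByteBuffer positions and sizes are Java ints, the mapped length is capped at Integer.MAX_VALUE bytes (2^31 - 1, just under 2 GB), and FileChannel.map throws IllegalArgumentException beyond that. Spark's DiskStore in this version maps each on-disk block this way, so any single shuffle block larger than ~2 GB fails exactly as in the trace — which is why repartitioning into more, smaller blocks is the fix. A back-of-the-envelope sketch of the lower bound on partitions (the 500 GB data-set size is a hypothetical placeholder, not a figure from the thread):

```scala
// Hypothetical arithmetic sketch: how many partitions keep each
// block under the ~2 GB MappedByteBuffer limit.
val maxBlockBytes = Int.MaxValue.toLong       // 2147483647 bytes, just under 2 GB
val totalBytes    = 500L * 1024 * 1024 * 1024 // assumed data-set size: 500 GB

// Divide and round up, so every partition stays strictly below the cap.
val minPartitions = ((totalBytes + maxBlockBytes - 1) / maxBlockBytes).toInt
// => 250 partitions for this assumed size

// In practice you would use far more than this lower bound — a common
// rule of thumb is a small multiple of the total core count — so that
// skewed partitions still stay well under the limit.
```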