Thanks Sean and Imran, I'll try splitting the broadcast variable into smaller ones.
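For what it's worth, here is a minimal sketch of the splitting idea, shown on plain Scala maps so it runs standalone; the names (`bigMap`, `numChunks`, `lookup`) are illustrative, and the commented `sc.broadcast` call assumes a live SparkContext:

```scala
// Sketch: partition one oversized lookup map into several smaller
// maps and broadcast each separately, so no single serialized blob
// approaches the ~2GB byte-array limit. Names are illustrative.
val numChunks = 4
val bigMap: Map[Long, String] = (0L until 100L).map(k => k -> s"v$k").toMap

// Bucket keys by modulus into numChunks smaller maps.
val chunks: Vector[Map[Long, String]] =
  Vector.tabulate(numChunks) { i =>
    bigMap.filter { case (k, _) => k % numChunks == i }
  }

// With Spark you would broadcast each chunk:
//   val bcs = chunks.map(sc.broadcast(_))
// and route each lookup to the matching chunk on the executor:
def lookup(key: Long): Option[String] =
  chunks((key % numChunks).toInt).get(key)
```

Each chunk serializes independently, so as long as every chunk stays under the 2GB limit the total lookup table can be larger than a single broadcast could be.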
I had tried a regular join, but it was failing due to high garbage collection overhead during the shuffle. One of the RDDs is very large and has a skewed distribution in which a handful of keys account for 90% of the data. Do you have any pointers on how to handle skewed key distributions during a join?

Soila

On Fri, Feb 13, 2015 at 10:49 AM, Imran Rashid <iras...@cloudera.com> wrote:
> Unfortunately this is a known issue:
> https://issues.apache.org/jira/browse/SPARK-1476
>
> As Sean suggested, you need to think of some other way of doing the same
> thing, even if it's just breaking your one big broadcast variable into a
> few smaller ones.
>
> On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think you've hit the nail on the head. Since the serialization
>> ultimately creates a byte array, and arrays can have at most ~2
>> billion elements in the JVM, the broadcast can be at most ~2GB.
>>
>> At that scale, you might consider whether you really have to broadcast
>> these values, or whether you want to handle them as RDDs and join them.
>>
>> Or consider whether you can break it up into several broadcasts.
>>
>> On Fri, Feb 13, 2015 at 6:24 PM, soila <skavu...@gmail.com> wrote:
>> > I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get
>> > the following exception when the size of the broadcast variable
>> > exceeds 2GB. Any ideas on how I can resolve this issue?
>> >
>> > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>> >     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
>> >     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
>> >     at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
>> >     at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:99)
>> >     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:147)
>> >     at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
>> >     at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
>> >     at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>> >     at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
>> >     at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
>> >     at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
>> >     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>> >     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>> >     at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>> >     at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Size-exceeds-Integer-MAX-VALUE-exception-when-broadcasting-large-variable-tp21648.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
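On the skew question, one common trick is key salting. The sketch below shows the shape on plain Scala collections so it runs standalone; with RDDs the same map/flatMap/join structure spreads a hot key over several reducers instead of one. All names (`left`, `right`, `saltBuckets`) are illustrative, not from the thread.

```scala
// Sketch of key salting for a skewed join. A random salt bucket is
// prefixed onto each key on the skewed side, and the smaller side is
// replicated once per bucket so every salted key still finds a match.
import scala.util.Random

val saltBuckets = 4

// Skewed side: one hot key dominates.
val left  = Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4))
// Smaller side to join against.
val right = Seq(("hot", "h"), ("cold", "c"))

// Prefix each left key with a random salt bucket.
val saltedLeft = left.map { case (k, v) =>
  ((Random.nextInt(saltBuckets), k), v)
}
// Replicate the right side once per bucket.
val saltedRight: Map[(Int, String), String] = right.flatMap { case (k, w) =>
  (0 until saltBuckets).map(i => ((i, k), w))
}.toMap

// Join on the salted key, then strip the salt off the result.
val joined = saltedLeft.flatMap { case ((s, k), v) =>
  saltedRight.get((s, k)).map(w => (k, (v, w)))
}
```

The cost is replicating the smaller side `saltBuckets` times, so this pays off when the replicated side is much smaller than the skewed side; the bucket count is a tuning knob, not a fixed constant.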