Thanks Sean and Imran,

I'll try splitting the broadcast variable into smaller ones.

I had tried a regular join, but it was failing due to high garbage
collection overhead during the shuffle. One of the RDDs is very large
and has a skewed distribution in which a handful of keys account for 90%
of the data. Do you have any pointers on how to handle skewed key
distributions during a join?
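
In case it helps frame the question, the workaround I was considering is
salting the keys so the hot ones spread across partitions -- roughly along
these lines, where largeRdd and smallRdd stand in for my actual pair RDDs
and numSalts is just a guess:

    import scala.util.Random
    import org.apache.spark.SparkContext._  // pair-RDD implicits on 1.2

    val numSalts = 20  // how many ways to split each hot key

    // Add a random salt to the skewed side, so one hot key becomes numSalts keys.
    val saltedLarge = largeRdd.map { case (k, v) =>
      ((k, Random.nextInt(numSalts)), v)
    }

    // Replicate the other side once per salt so every salted key can match.
    val saltedSmall = smallRdd.flatMap { case (k, w) =>
      (0 until numSalts).map(s => ((k, s), w))
    }

    // Join on the salted key, then strip the salt back off.
    val joined = saltedLarge.join(saltedSmall)
      .map { case ((k, _), (v, w)) => (k, (v, w)) }

Is that a reasonable direction?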

Soila

On Fri, Feb 13, 2015 at 10:49 AM, Imran Rashid <iras...@cloudera.com> wrote:
> Unfortunately, this is a known issue:
> https://issues.apache.org/jira/browse/SPARK-1476
>
> As Sean suggested, you need to think of some other way of doing the same
> thing, even if it's just breaking your one big broadcast var into a few
> smaller ones.
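>
> Something like this, sketched very roughly -- assuming for illustration
> that the data is one big Map[String, Int] called bigMap and sc is your
> SparkContext; the chunk count is arbitrary, just keep each piece
> comfortably under 2GB serialized:
>
>     val numChunks = 4
>     val chunkSize = bigMap.size / numChunks + 1
>
>     // several small broadcasts instead of one giant one
>     val bcChunks = bigMap.grouped(chunkSize).map(sc.broadcast(_)).toArray
>
>     // on the executors, probe each chunk in turn for a key
>     def lookup(k: String): Option[Int] =
>       bcChunks.view.flatMap(_.value.get(k)).headOption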
>
> On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think you've hit the nail on the head. Since the serialization
>> ultimately creates a byte array, and arrays can have at most ~2
>> billion elements in the JVM, the broadcast can be at most ~2GB.
>>
>> At that scale, you might consider whether you really have to broadcast
>> these values, or want to handle them as RDDs and join and so on.
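>>
>> For example, a rough sketch (dataRdd and lookupRdd are placeholders, and
>> the objectFile paths are hypothetical -- load the data however it is
>> actually stored):
>>
>>     import org.apache.spark.SparkContext._  // pair-RDD implicits on 1.2
>>
>>     // keep the 5GB of lookup data distributed instead of broadcasting it
>>     val lookupRdd = sc.objectFile[(String, Int)]("hdfs://.../lookup")
>>     val dataRdd = sc.objectFile[(String, String)]("hdfs://.../data")
>>
>>     // replaces broadcast-variable lookups with a distributed join
>>     val result = dataRdd.join(lookupRdd)  // RDD[(String, (String, Int))]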
>>
>> Or consider whether you can break it up into several broadcasts?
>>
>>
>> On Fri, Feb 13, 2015 at 6:24 PM, soila <skavu...@gmail.com> wrote:
>> > I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the
>> > following exception when the size of the broadcast variable exceeds 2GB.
>> > Any ideas on how I can resolve this issue?
>> >
>> > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>> >         at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:829)
>> >         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
>> >         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
>> >         at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:99)
>> >         at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:147)
>> >         at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
>> >         at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
>> >         at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
>> >         at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
>> >         at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
>> >         at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
>> >         at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>> >         at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>> >         at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>> >         at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)
>> >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
