I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the
following exception when the size of the broadcast variable exceeds 2GB. Any
ideas on how I can resolve this issue?
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
        at ...
I think you've hit the nail on the head. Since the serialization
ultimately creates a byte array, and arrays can have at most ~2
billion elements on the JVM, a broadcast can be at most ~2GB.
At that scale, you might consider whether you really have to broadcast
these values at all, or whether you want to handle the data some other
way, such as joining against it as an RDD.
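If it helps to see where you stand before calling broadcast, here is a
minimal sketch that counts the serialized size of a value without
buffering the bytes, so the check itself can't hit the 2GB array limit.
It uses plain Java serialization, which only approximates what Spark's
serializer produces, and the class and method names here are made up
for illustration:

import java.io.{ObjectOutputStream, OutputStream}

// Count serialized bytes without storing them. Plain Java
// serialization is only an approximation of Spark's serializer,
// but anything near 2GB is going to fail either way.
class CountingStream extends OutputStream {
  var count: Long = 0L
  override def write(b: Int): Unit = count += 1
  override def write(b: Array[Byte], off: Int, len: Int): Unit = count += len
}

def serializedSize(value: AnyRef): Long = {
  val counter = new CountingStream
  val out = new ObjectOutputStream(counter)
  out.writeObject(value)  // requires value to be Serializable
  out.close()
  counter.count
}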
Unfortunately this is a known issue:
https://issues.apache.org/jira/browse/SPARK-1476
As Sean suggested, you need to think of some other way of doing the same
thing, even if it's just breaking your one big broadcast variable into a
few smaller ones.
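For what it's worth, here is a rough sketch of that splitting idea,
assuming the 5GB value is a Map and that `sc` is your SparkContext; the
chunk count and element types are invented for illustration:

val bigMap: Map[String, Array[Double]] = ???  // the ~5GB lookup table

val numChunks = 4  // pick so each chunk serializes well under 2GB

// Route each key to a chunk by hash, kept non-negative on purpose
// since hashCode can be negative.
def chunkOf(key: String): Int =
  ((key.hashCode % numChunks) + numChunks) % numChunks

val chunks: IndexedSeq[Map[String, Array[Double]]] =
  (0 until numChunks).map(i => bigMap.filter { case (k, _) => chunkOf(k) == i })

// Several small broadcasts instead of one 5GB broadcast.
val broadcasts = chunks.map(sc.broadcast(_))

// Executor-side lookup: go straight to the chunk the key hashes to.
def lookup(key: String): Option[Array[Double]] =
  broadcasts(chunkOf(key)).value.get(key)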
Thanks Sean and Imran,
I'll try splitting the broadcast variable into smaller ones.
I had tried a regular join, but it was failing due to high garbage
collection overhead during the shuffle. One of the RDDs is very large
and has a skewed distribution, where a handful of keys account for 90%
of the data.
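In case it's useful for the skewed join itself: a common workaround
(not mentioned above, just a standard trick) is to "salt" the keys so
rows sharing a hot key spread across many partitions instead of piling
onto a few. A sketch, where the RDD names, types, and salt factor are
all assumptions:

import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.2
import org.apache.spark.rdd.RDD
import scala.util.Random

// Hypothetical stand-ins for the two sides of the join.
val bigRdd: RDD[(String, String)] = ???    // skewed: a few hot keys
val smallRdd: RDD[(String, String)] = ???  // the other side

val salt = 32  // fan-out for hot keys (an assumed value)

// Replicate every row of the small side once per salt value...
val saltedSmall: RDD[((String, Int), String)] =
  smallRdd.flatMap { case (k, v) => (0 until salt).map(i => ((k, i), v)) }

// ...and tag each big-side row with a random salt, so rows sharing a
// hot key land in different partitions.
val saltedBig: RDD[((String, Int), String)] =
  bigRdd.map { case (k, v) => ((k, Random.nextInt(salt)), v) }

// Join on (key, salt), then drop the salt again.
val joined: RDD[(String, (String, String))] =
  saltedBig.join(saltedSmall).map { case ((k, _), vs) => (k, vs) }

The trade-off is that the small side gets replicated `salt` times, so
this only pays off when that side is much smaller than the skewed one.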