Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread soila
I am trying to broadcast a large (5GB) variable using Spark 1.2.0. I get the following exception when the size of the broadcast variable exceeds 2GB. Any ideas on how I can resolve this issue?

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at
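For context, a minimal sketch of the kind of code that hits this; the lookup-table shape and all names are assumptions for illustration, not from the original post:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-demo"))

    // Hypothetical ~5GB lookup table built on the driver.
    val bigMap: Map[Long, String] = loadLookupTable() // hypothetical loader

    // sc.broadcast serializes the value into a single block on the driver;
    // once the serialized size passes ~2GB, this is the call where
    // "Size exceeds Integer.MAX_VALUE" surfaces.
    val bc = sc.broadcast(bigMap)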

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Sean Owen
I think you've hit the nail on the head. Since the serialization ultimately creates a byte array, and arrays can have at most ~2 billion elements in the JVM, the broadcast can be at most ~2GB. At that scale, you might consider whether you really have to broadcast these values, or want to handle
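To make the limit concrete: JVM arrays are indexed by Int, so a byte array can hold at most Integer.MAX_VALUE elements, which is just under 2GB. A quick illustration in Scala:

    // JVM arrays are indexed by Int, so a byte array tops out at
    // Integer.MAX_VALUE (2147483647) elements -- just under 2GB of bytes.
    val limit: Int = Int.MaxValue
    // new Array[Byte](3L * 1024 * 1024 * 1024) // does not compile: the length must be an Int
    val nearLimit = new Array[Byte](Int.MaxValue - 2) // allocates ~2GB if the heap allows it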

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Imran Rashid
Unfortunately this is a known issue: https://issues.apache.org/jira/browse/SPARK-1476. As Sean suggested, you need to think of some other way of doing the same thing, even if it's just breaking your one big broadcast var into a few smaller ones. On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen
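A rough sketch of that splitting idea, assuming the big value is a single in-memory map on the driver (bigMap, numChunks, and the key/value types are hypothetical):

    import org.apache.spark.broadcast.Broadcast
    import scala.collection.mutable

    // Split one huge map into several smaller maps, each of which should
    // serialize to well under the ~2GB limit, then broadcast each piece.
    val numChunks = 4 // hypothetical; choose so every piece stays well below 2GB
    val pieces = Array.fill(numChunks)(mutable.Map.empty[Long, String])
    bigMap.foreach { case (k, v) =>
      pieces((k.hashCode & Int.MaxValue) % numChunks) += (k -> v)
    }
    val broadcasts: Array[Broadcast[Map[Long, String]]] =
      pieces.map(p => sc.broadcast(p.toMap))

    // Executors route each lookup to the piece that owns the key.
    def lookup(key: Long): Option[String] = {
      val i = (key.hashCode & Int.MaxValue) % numChunks
      broadcasts(i).value.get(key)
    }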

Re: Size exceeds Integer.MAX_VALUE exception when broadcasting large variable

2015-02-13 Thread Soila Pertet Kavulya
Thanks Sean and Imran. I'll try splitting the broadcast variable into smaller ones. I had tried a regular join, but it was failing due to high garbage collection overhead during the shuffle. One of the RDDs is very large and has a skewed distribution where a handful of keys account for 90% of the
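If the large, skewed RDD is the one being joined against, the chunked broadcasts sketched above can also serve as a map-side join so the big RDD is never shuffled. A minimal sketch, reusing the hypothetical broadcasts and numChunks from the previous message:

    // Map-side join: look each key up in the broadcast pieces instead of
    // shuffling the large, skewed RDD; keys with no match are dropped.
    val joined = largeSkewedRdd.mapPartitions { iter =>
      iter.flatMap { case (k, v) =>
        val i = (k.hashCode & Int.MaxValue) % numChunks
        broadcasts(i).value.get(k).map(small => (k, (v, small)))
      }
    }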