Re: OOM exception during Broadcast

2016-03-08 Thread Olivier Girardot
Java's default serialization is not the best or most efficient way of handling ser/deser; did you try switching to Kryo serialization? See https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/ if you need a tutorial. This should help in terms of both CPU
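A minimal sketch of that switch, assuming a Scala SparkConf setup (MyRecord is a placeholder for whatever types your broadcast value contains):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder for the classes inside the broadcast value
    case class MyRecord(id: Long, payload: Array[Byte])

    val conf = new SparkConf()
      .setAppName("broadcast-test")
      // Swap the default Java serializer for Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: registering classes avoids embedding full class names
      // in the serialized stream
      .registerKryoClasses(Array(classOf[MyRecord]))
    val sc = new SparkContext(conf)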

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
The driver memory is set to 40G, and the OOM seems to be happening on the executors. I might try a different broadcast block size (vs. 4m), as Takeshi suggested, to see if it makes a difference.

Re: OOM exception during Broadcast

2016-03-07 Thread Tristan Nixon
Yeah, the Spark engine is pretty clever, and it's best not to prematurely optimize. It would be interesting to profile your join vs. the collect on the smaller dataset. I suspect that the join is faster (even before you broadcast it back out). I'm also curious about the broadcast OOM - did you

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
So I just implemented the logic through a standard join (without collect and broadcast) and it's working great. The idea behind trying the broadcast was that, since the other side of the join is a much larger dataset, the process might be faster through collect and broadcast, because it avoids the
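For context, the working plain-join version presumably looks something like this (RDD names are invented for illustration; the thread does not show the actual code):

    // Both sides are shuffled and joined by key; nothing round-trips
    // through the driver
    val joined = largeRdd.join(smallRdd)   // RDD[(K, (V, W))]
    joined.count()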

Re: OOM exception during Broadcast

2016-03-07 Thread Tristan Nixon
I'm not sure I understand - if it was already distributed over the cluster in an RDD, why would you want to collect it and then re-send it as a broadcast variable? Why not simply use the RDD that is already distributed on the worker nodes?

Re: OOM exception during Broadcast

2016-03-07 Thread Takeshi Yamamuro
Oh, how about increasing the broadcast block size via spark.broadcast.blockSize? The default is `4m`, which I think is too small against ~1GB.
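A sketch of that change (64m is an arbitrary example value, not a recommendation from the thread):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Broadcasts are chunked into blocks of this size; the 4m default
      // means a ~1GB variable becomes roughly 250 blocks
      .set("spark.broadcast.blockSize", "64m")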

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
Hi Tristan, This is not static; I actually collect it from an RDD to the driver. On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon wrote: > Hi Arash, is this static data? Have you considered including it in your jars and de-serializing it from the jar on each worker node?
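The collect-then-broadcast pattern being described is roughly this (a sketch; the RDD names are hypothetical):

    // Materialize the RDD on the driver, then ship it to every executor once
    val local = pairRdd.collectAsMap()   // ~1GB held in driver memory
    val bc = sc.broadcast(local)
    largeRdd.map { case (k, v) => (v, bc.value.get(k)) }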

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
Hi Ankur, For this specific test, I'm only running the few lines of code that are pasted. Nothing else is cached in the cluster. Thanks, Arash

Re: OOM exception during Broadcast

2016-03-07 Thread Ankur Srivastava
Hi, We have a use case where we broadcast ~4GB of data, and we are on m3.2xlarge, so your object size is not an issue. Also, based on your explanation, this does not look like a broadcast issue, as it works when the number of partitions is small. Are you caching any other data? Because broadcast variables use
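On the caching question, one way to check what is currently persisted from the Scala side (a sketch; the web UI's Storage tab shows the same information):

    // Lists RDDs that have been marked persistent in this SparkContext
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"RDD $id (${rdd.name}) at ${rdd.getStorageLevel.description}")
    }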

Re: OOM exception during Broadcast

2016-03-07 Thread Arash
Well, I'm trying to avoid a big shuffle/join. From what I could find online, my understanding was that a 1G broadcast should be doable - is that not accurate?

Re: OOM exception during Broadcast

2016-03-07 Thread Jeff Zhang
Any reason why you broadcast such a large variable? It doesn't make sense to me.

OOM exception during Broadcast

2016-03-07 Thread Arash
Hello all, I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes but haven't been able to make it work so far. It looks like the executors start to run out of memory during deserialization. This behavior only shows itself when the number of partitions is above a few tens; the
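The message is truncated here; the pattern being described is roughly the following (a reconstruction with guessed sizes and names, not the poster's actual code):

    // Build a ~1GB object on the driver and broadcast it
    val big: Array[Byte] = new Array[Byte](1 << 30)
    val bc = sc.broadcast(big)
    // Executors reportedly OOM while deserializing the broadcast once the
    // job runs with more than a few tens of partitions
    sc.parallelize(1 to 1000, 100).map(i => bc.value.length + i).count()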