Java's default serialization is neither the fastest nor the most compact way of
handling ser/deser; did you try switching to Kryo serialization?
c.f.
https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/
if you need a tutorial.
This should help in terms of both CPU
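The switch itself is a small config change; a minimal sketch (the `MyRecord` class is a hypothetical placeholder for your own types):

```scala
import org.apache.spark.SparkConf

// Placeholder record type; substitute the classes you actually serialize.
case class MyRecord(id: Long, value: Double)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a compact numeric ID instead of
  // the full class name alongside every serialized object.
  .registerKryoClasses(Array(classOf[MyRecord]))
```

Registration is optional but usually worth it for large broadcasts, since it shrinks the serialized payload.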
The driver memory is set at 40G and OOM seems to be happening on the
executors. I might try a different broadcast block size (vs 4m) as Takeshi
suggested to see if it makes a difference.
On Mon, Mar 7, 2016 at 6:54 PM, Tristan Nixon wrote:
Yeah, the spark engine is pretty clever and it's best not to prematurely
optimize. It would be interesting to profile your join vs. the collect on the
smaller dataset. I suspect that the join is faster (even before you broadcast
it back out).
I’m also curious about the broadcast OOM - did you
So I just implemented the logic through a standard join (without collect
and broadcast) and it's working great.
The idea behind trying the broadcast was that since the other side of the
join is a much larger dataset, the process might be faster through collect
and broadcast, since it avoids the
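For reference, the two strategies being compared look roughly like this (a sketch only; the RDD names and sample data are made up, and it assumes both sides are pair RDDs keyed the same way):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("join-demo"))

// Hypothetical pair RDDs: a large fact set and a small lookup set.
val largeRDD = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val smallRDD = sc.parallelize(Seq((1, 10.0), (3, 30.0)))

// Plain join: Spark shuffles both sides by key.
val joined = largeRDD.join(smallRDD)

// Collect-and-broadcast (map-side join): pull the small side to the driver,
// ship it once to every executor, and join locally with no shuffle.
val smallMap = sc.broadcast(smallRDD.collectAsMap())
val mapJoined = largeRDD.flatMap { case (k, v) =>
  smallMap.value.get(k).map(w => (k, (v, w)))
}
```

The map-side variant only pays off when the small side comfortably fits in each executor's memory, which is exactly the constraint under discussion here.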
I’m not sure I understand - if it was already distributed over the cluster in
an RDD, why would you want to collect and then re-send it as a broadcast
variable? Why not simply use the RDD that is already distributed on the worker
nodes?
> On Mar 7, 2016, at 7:44 PM, Arash wrote:
Oh, how about increasing the broadcast block size
via spark.broadcast.blockSize?
The default is `4m`, and it is too small against ~1GB, I think.
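Setting it is a one-liner on the conf; a sketch with an arbitrary larger value (`16m` is just an illustration, not a recommendation from this thread):

```scala
import org.apache.spark.SparkConf

// Larger blocks mean fewer pieces for each executor to fetch and reassemble
// when the torrent broadcast is transferred.
val conf = new SparkConf().set("spark.broadcast.blockSize", "16m")
```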
On Tue, Mar 8, 2016 at 10:44 AM, Arash wrote:
Hi Tristan,
This is not static, I actually collect it from an RDD to the driver.
On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon wrote:
> Hi Arash,
>
> is this static data? Have you considered including it in your jars and
> de-serializing it from jar on each worker node?
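Tristan's suggestion would look something like the following sketch, assuming `rdd` is an RDD of lookup keys and that the table was serialized into the jar under a resource path (the path, types, and names are all hypothetical):

```scala
// Read a lookup table bundled in the application jar once per partition,
// instead of broadcasting it from the driver.
val results = rdd.mapPartitions { iter =>
  val in  = getClass.getResourceAsStream("/lookup.ser") // hypothetical resource
  val ois = new java.io.ObjectInputStream(in)
  val table = ois.readObject().asInstanceOf[Map[String, Double]]
  ois.close()
  iter.map(key => table.get(key))
}
```

As the next message notes, this only works when the data is truly static at build time, which turned out not to be the case here.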
Hi Ankur,
For this specific test, I'm only running the few lines of code that are
pasted. Nothing else is cached in the cluster.
Thanks,
Arash
On Mon, Mar 7, 2016 at 4:07 PM, Ankur Srivastava wrote:
Hi,
We have a use case where we broadcast ~4GB of data and we are on m3.2xlarge,
so your object size is not an issue. Also, based on your explanation, this
does not look like a broadcast issue, as it works when your number of
partitions is small.
Are you caching any other data? Because broadcast variables use
Well, I'm trying to avoid a big shuffle/join; from what I could find
online, my understanding was that a 1G broadcast should be doable - is that
not accurate?
On Mon, Mar 7, 2016 at 3:34 PM, Jeff Zhang wrote:
Any reason why you broadcast such a large variable? It doesn't make sense
to me.
On Tue, Mar 8, 2016 at 7:29 AM, Arash wrote:
Hello all,
I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes but
haven't been able to make it work so far.
It looks like the executors start to run out of memory during
deserialization. This behavior only shows itself when the number of
partitions is above a few tens, the
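For context, the pattern being described boils down to something like this sketch, where `bigLookup`, `inputRdd`, and `process` are hypothetical stand-ins for the actual ~1G object and per-record logic:

```scala
// Broadcast the large read-only value once; tasks then read bc.value,
// which each executor fetches and deserializes a single time.
val bc  = sc.broadcast(bigLookup) // bigLookup: ~1G object built on the driver
val out = inputRdd.map(record => process(record, bc.value))
```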