Hello,
My team and I have a 32-core machine, and we would like to use a huge object
- for example, a large dictionary - in a map transformation, sharing this
object among tasks so that all our cores run in parallel.
We broadcast our large dictionary:

dico_br = sc.broadcast(dico)

We then use it in the map transformation through dico_br.value.

Clement
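[For reference, a minimal PySpark sketch of the pattern Clement describes; the
lookup by key is an assumption about what the map actually does:

    # assumes dico maps keys to values and rdd contains keys (illustrative only)
    dico = {"a": 1, "b": 2}
    rdd = sc.parallelize(["a", "b", "a"])
    dico_br = sc.broadcast(dico)
    # each task reads the shared copy through .value instead of
    # capturing dico in the task closure
    result = rdd.map(lambda k: dico_br.value.get(k)).collect()
]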
If the broadcast variable doesn't fit in memory, I think it is not the right
fit for you.
You can think about folding it into an RDD as a tuple with the other data you
are working on.
Say you are working on RDD<X> (rdd in your case): run a map/reduce
to convert it to RDD<Tuple2<X, Y>>, so now each element carries its
dictionary value alongside the original data.
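[A rough PySpark sketch of this join-based alternative; treating the original
elements themselves as the join keys is an assumption, so adapt the key
extraction to your records:

    # turn the dictionary into an RDD of (key, value) pairs
    dico_rdd = sc.parallelize(list(dico.items()))
    # key the data RDD (hypothetical: here each element is its own key)
    keyed = rdd.map(lambda x: (x, x))      # (key, original element)
    # after the join, each element travels with its dictionary value
    joined = keyed.join(dico_rdd)          # (key, (element, dico_value))
]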
In local mode all worker threads run in the driver JVM. Your dictionary
should not be copied 32 times; in fact, it won't be broadcast at all. Have
you tried increasing spark.driver.memory to ensure that the driver uses all
the memory on the machine?
Deenar
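[In local/client mode spark.driver.memory must be set before the driver JVM
starts, so pass it at launch rather than in SparkConf; something like the
following, where the 24g figure and the script name are just assumptions:

    spark-submit --master "local[32]" --driver-memory 24g your_script.py

or equivalently in conf/spark-defaults.conf:

    spark.driver.memory  24g
]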