Clement

In local mode all worker threads run in the driver JVM, so your dictionary
should not be copied 32 times; in fact it won't be broadcast over the
network at all. Have you tried increasing spark.driver.memory to ensure
the driver can use all the memory on the machine?
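
For example, something like the following should work (a sketch only — the
memory value and script name are placeholders, adjust them to your machine;
note that in local mode driver memory must be set before the JVM starts, so
pass it on the command line rather than in SparkConf):

```shell
# Run in local mode with all 32 cores in a single driver JVM,
# giving the driver most of the machine's RAM (48g is a placeholder).
spark-submit \
  --master "local[32]" \
  --driver-memory 48g \
  your_job.py
```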

Deenar

On 22 September 2015 at 19:42, Clément Frison <clement.fri...@gmail.com>
wrote:

> Hello,
>
> My team and I have a 32-core machine and we would like to use a huge
> object - for example a large dictionary - in a map transformation and use
> all our cores in parallel by sharing this object among some tasks.
>
> We broadcast our large dictionary.
>
> dico_br = sc.broadcast(dico)
>
> We use it in a map:
>
> rdd.map(lambda x: (x[0], function(x[1], dico_br)))
>
> where function does a lookup: dico_br.value[x]
>
> Our issue is that our dictionary is loaded 32 times in memory, and it
> doesn't fit. So what we are doing is limiting the number of executors. It
> works fine, but we only have 8 CPUs working in parallel instead of 32.
>
> We would like to take advantage of multicore processing and shared memory,
> as the 32 cores are in the same machine. For example we would like to load
> the dictionary in memory 8 times only and make 4 cores share it. How could
> we achieve that with Spark ?
>
>
> What we have tried - without success :
>
> 1) One driver/worker with 32 cores : local[32]
>
> 2) Standalone with one master and 8 workers - each of them having 4 cores
>
> Thanks a lot for your help, Clement
>
