Using s3 instead of broadcast

Aureliano Buendia Mon, 10 Mar 2014 21:59:31 -0700

Hi,

My Spark app has to broadcast a 5 GB RDD to about 100 workers at the
beginning of each job. Obviously, this takes some time, and that time
increases linearly with the number of workers.


Does it make sense, instead of broadcasting the 5 GB RDD, to ask each worker
to download it from S3? Download speed from S3 is not supposed to decrease
as the number of workers increases.

If having each worker download from S3 makes sense, how would I implement it?
The closure code dispatched to workers cannot access the SparkContext
object.
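
To make the question concrete, here is a rough sketch of what I have in mind: each worker process loads the data lazily the first time a task needs it, so nothing goes through the driver and the closure never touches the SparkContext. This is only an illustration under assumptions. BUCKET, KEY, fetch_from_s3, parse_lookup, get_shared and process_partition are hypothetical names, the data is assumed to be tab-separated key/value lines, and boto3 is assumed to be installed on the workers.

```python
# Sketch only: per-worker lazy loading as an alternative to a driver broadcast.
# All names below are hypothetical placeholders, not part of any Spark API.

BUCKET = "my-bucket"        # assumption: placeholder bucket name
KEY = "shared/lookup.tsv"   # assumption: placeholder object key

_cache = {}  # lives in the worker's Python process, not on the driver


def parse_lookup(text):
    """Parse 'key<TAB>value' lines into a dict, skipping lines without a tab."""
    pairs = (line.split("\t", 1) for line in text.splitlines() if "\t" in line)
    return {k: v for k, v in pairs}


def fetch_from_s3(bucket, key):
    """Download one S3 object and parse it (assumes UTF-8, tab-separated)."""
    import boto3  # imported lazily so only the workers need it installed
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return parse_lookup(body.decode("utf-8"))


def get_shared(bucket, key, fetch=fetch_from_s3):
    """Return the shared data, fetching it at most once per process."""
    if (bucket, key) not in _cache:
        _cache[(bucket, key)] = fetch(bucket, key)
    return _cache[(bucket, key)]


def process_partition(rows, fetch=fetch_from_s3):
    """Runs on a worker and needs no SparkContext."""
    lookup = get_shared(BUCKET, KEY, fetch)
    for row in rows:
        yield row, lookup.get(row)


# On the driver (assuming `rdd` is an existing RDD of keys):
#   result = rdd.mapPartitions(process_partition)
```

For the cache to outlive a single task, I assume these functions would have to live in a module shipped to the workers (for example with sc.addPyFile) rather than be defined inline in the driver script. Even then, the download would happen once per worker process, not once per cluster.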
