1. Not quite everything: your driver program itself, including plain statements like val x = 1, runs on the driver. What runs on the workers are the closures you pass to RDD operations (map, filter, etc.) and the actual data access — e.g. for a text file, each executor reads its own partitions. (link <http://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark>)
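To make the split concrete, here is a minimal sketch in plain Scala (no cluster needed): the function literal stands in for a closure you would pass to rdd.map, which Spark would serialize (capturing x) and ship to executors. The names and values are illustrative only.

```scala
object DriverVsWorker {
  def main(args: Array[String]): Unit = {
    val x = 1 // evaluated once, in the driver JVM

    // Stand-in for the closure in rdd.map(n => n + x):
    // Spark would serialize this function (with x captured) to executors.
    val addX: Int => Int = n => n + x

    val partition = Seq(10, 20, 30)  // stand-in for one RDD partition
    val result = partition.map(addX) // the per-element work that runs worker-side
    println(result)                  // List(11, 21, 31)
  }
}
```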
2. a. Without broadcast: if you have n nodes, you can set the file's HDFS replication factor to n so that every node has a local copy, and then read it once per executor. b. With broadcast: sc.broadcast() should do it. (link <http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables>)

On Mon, Oct 20, 2014 at 1:18 AM, Saurabh Wadhawan <saurabh.wadha...@guavus.com> wrote:

> Any response for this?
>
> 1. How do I know what statements will be executed on the worker side out of
> the spark script in a stage?
> e.g. if I have
> val x = 1 (or any other code)
> in my driver code, will the same statements be executed on the worker
> side in a stage?
>
> 2. How can I do a map-side join in spark:
> a. without broadcast (i.e. by reading a file once in each executor)
> b. with broadcast, but by broadcasting a complete RDD to each executor
>
> Regards
> - Saurabh Wadhawan
>
>
> On 19-Oct-2014, at 1:54 am, Saurabh Wadhawan <saurabh.wadha...@guavus.com> wrote:
>
> Hi,
>
> I have the following questions:
>
> 1. When I write a spark script, how do I know what part runs on the
> driver side and what runs on the worker side?
> So let's say I write code to read a plain text file.
> Will it run on the driver side only, on the worker side only, or on both?
>
> 2. If I want each worker to load a file for, let's say, a join, and the file
> is pretty huge, let's say in GBs, so that I don't want to broadcast it,
> what's the best way to do it?
> Another way to say the same thing: how do I load a data structure for
> fast lookup (and not an RDD) on each worker node in the executor?
>
> Regards
> - Saurabh
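P.S. The broadcast approach in 2b can be sketched as follows, in plain Scala so it runs without a cluster. Here `small` plays the role of the lookup table you would wrap with sc.broadcast(small) and read via smallBc.value inside a closure; the data and names are made up for illustration.

```scala
object MapSideJoin {
  def main(args: Array[String]): Unit = {
    val small: Map[Int, String] = Map(1 -> "a", 2 -> "b") // the broadcast side
    val large = Seq((1, 1.5), (2, 2.5), (3, 3.5))         // stand-in for the big RDD

    // In Spark this would be roughly:
    //   large.flatMap { case (k, v) => smallBc.value.get(k).map(s => (k, v, s)) }
    // so no shuffle is needed -- each executor joins its partitions locally.
    val joined = large.flatMap { case (k, v) =>
      small.get(k).map(s => (k, v, s)) // keys with no match (key 3 here) are dropped
    }
    println(joined) // List((1,1.5,a), (2,2.5,b))
  }
}
```

Note this is an inner join; keeping unmatched keys (a left outer join) would use small.get(k).getOrElse(...) instead of dropping them.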