What about:

http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_U2p3bPVM=-bka18v4s-5-lp...@mail.gmail.com%3E


Regards
- Saurabh Wadhawan



On 20-Oct-2014, at 4:56 pm, Kamal Banga 
<banga.ka...@gmail.com> wrote:

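1.  RDD operations (the functions you pass to map, filter and so on) are 
executed on the workers. So the contents of a text file are read on the 
workers once an action runs; sc.textFile() itself only defines the RDD on 
the driver. A plain statement such as val x = 1 runs on the driver, and x 
reaches a worker only if a function passed to an RDD operation refers to it. 
(link: http://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark)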

2.
a. Without broadcast: Let's say you have 'n' nodes. You can set Hadoop's 
replication factor to n and it will replicate that data across all nodes.
b. With broadcast: using sc.broadcast() should do it. 
(link: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables)
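
For 2b, a minimal sketch of a map-side join with sc.broadcast() might look 
like the following. The sample data, the object name BroadcastJoinSketch and 
the local[2] master are made up for illustration. It assumes the smaller side 
fits in memory on the driver and on every executor; on older Spark 1.x 
releases you may also need import org.apache.spark.SparkContext._ for 
collectAsMap().

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and data are illustrative only.
object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("broadcast-join-sketch")
      .setMaster("local[2]") // assumption: local run for testing
    val sc = new SparkContext(conf)
    try {
      // Small side of the join: (userId, name)
      val small = sc.parallelize(Seq(1 -> "alice", 2 -> "bob"))
      // Large side of the join: (userId, amount)
      val large = sc.parallelize(Seq(1 -> 9.99, 2 -> 5.00, 1 -> 1.25, 3 -> 7.50))

      // Collect the small side to the driver as a map and ship one
      // read-only copy to each executor.
      val lookup = sc.broadcast(small.collectAsMap())

      // The join happens inside the map-side function: each record of the
      // large RDD is looked up locally, so the large RDD is never shuffled.
      // Records without a match (userId 3 here) are dropped, as in an
      // inner join.
      val joined = large.flatMap { case (userId, amount) =>
        lookup.value.get(userId).map(name => (userId, name, amount)).toList
      }

      joined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Broadcasting the collected map rather than the RDD itself is deliberate: an 
RDD is only a handle to distributed data, so the lookup table has to be 
materialised on the driver first.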

On Mon, Oct 20, 2014 at 1:18 AM, Saurabh Wadhawan 
<saurabh.wadha...@guavus.com> wrote:
Any response for this?

1. How do I know which statements of the Spark script will be executed on the 
worker side in a stage?
    e.g. if I have
    val x = 1 (or any other code)
    in my driver code, will the same statements be executed on the worker side 
in a stage?

2. How can I do a map-side join in Spark:
   a. without broadcast (i.e. by reading a file once in each executor)
   b. with broadcast, but by broadcasting the complete RDD to each executor

Regards
- Saurabh Wadhawan



On 19-Oct-2014, at 1:54 am, Saurabh Wadhawan 
<saurabh.wadha...@guavus.com> wrote:

Hi,

 I have the following questions:

1. When I write a Spark script, how do I know what part runs on the driver side 
and what runs on the worker side?
    So let's say I write code to read a plain text file.
    Will it run on the driver side only, on the worker side only, or on 
both sides?

2. If I want each worker to load a file for, let's say, a join, and the file is 
pretty huge (in GBs), so that I don't want to broadcast it, then what's 
the best way to do it?
     Another way to say the same thing would be: how do I load a data structure 
for fast lookup (and not an RDD) on each worker node in the executor?

Regards
- Saurabh



