OK, it's the end of the day and I'm trying to make sure I understand where 
things are actually running. 

I have an application that has to query a bunch of sources, creating some 
RDDs, and then join those RDDs against some other lookup tables. 


YARN has two deploy modes… client and cluster. 

I get that in cluster mode… everything runs on the cluster. 
But in client mode, the driver runs on the edge node while the executors run 
on the cluster.

When I run a Spark SQL query that generates a new RDD, does the result set 
live on the cluster with the executors and just get referenced by the driver, 
or does the result set get pulled back to the driver running on the edge node? 
(I'm pretty sure I know the answer, but it's never safe to assume anything…) 
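
To make the question concrete, here's roughly what I have in mind (a minimal 
sketch, Spark 1.x-style API; the app name, table, and column names are made up): 

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("locality-question"))
  val sqlContext = new SQLContext(sc)

  // My understanding: the query executes on the executors and the resulting
  // partitions stay out on the cluster; the driver only holds the lineage
  // and partition metadata.
  val result = sqlContext.sql("SELECT key, value FROM some_table WHERE value > 10")

  // ...and only an explicit action like collect() would pull the rows back
  // into the driver process (the edge node, in client mode). Is that right?
  result.collect()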

The follow-up questions:

1) If I kill the app running the driver on the edge node… will that cause YARN 
to free up the cluster's resources? (In cluster mode… that doesn't happen.) What 
happens, and how quickly? 
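
(My assumption is that a clean shutdown from the edge node looks like the line 
below, and that it tells YARN to release the executors right away; the question 
is what happens when the driver process just dies instead:) 

  // Assumption: stopping the SparkContext from the driver on the edge node
  // asks YARN to release the application's executors/containers.
  sc.stop()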

1a) If I'm using client mode… can I spin the number of executors on the cluster 
up and down? (I'm assuming that when I kill an executor, any RDD partitions held 
by that executor are gone, but the SparkContext is still alive on the edge node. 
[Again assuming that the SparkContext lives with the driver.]) 
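
I'm guessing the mechanism for this is dynamic allocation, something along 
these lines (untested, just my reading of the config options): 

  import org.apache.spark.SparkConf

  // Dynamic allocation should let YARN add and remove executors while the
  // SparkContext stays alive on the edge node. My understanding is that it
  // also needs the external shuffle service running on the NodeManagers.
  val conf = new SparkConf()
    .setAppName("elastic-executors")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "20")
    .set("spark.shuffle.service.enabled", "true")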

2) Will any I/O between my Spark job and the outside world… (e.g. walking 
through the data set and writing it out to a file) occur on the edge node where 
the driver is located? (This may sound kind of silly, but what happens when you 
want to expose the result set to the world…?) 
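
To be concrete, this is the contrast I'm thinking of (continuing the sketch 
above; the output paths are just examples): 

  // Distributed write: each executor writes its own partitions directly,
  // so the I/O happens out on the cluster, not on the edge node.
  result.rdd.saveAsTextFile("hdfs:///tmp/example_output")

  // Driver-side write: collect() pulls everything back to the driver on the
  // edge node first, and the file I/O happens there.
  import java.io.PrintWriter
  val pw = new PrintWriter("/tmp/example_output.txt")
  result.collect().foreach(row => pw.println(row.mkString(",")))
  pw.close()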

Now for something slightly different… 

Suppose I have a data source… like a couple of Hive tables, and I access the 
tables via beeline (JDBC). In this case… Hive generates a map/reduce job and 
then streams the result set back to the client node, where the RDD would be 
built. I realize that I could run Hive on top of Spark, but that's a separate 
issue. Here the RDD will reside on the client only. (That is, I could in theory 
run this as a single local Spark instance.) 
If I were to run this on the cluster instead… would the result set stream 
through the beeline gateway and then end up back on the cluster, sitting in 
RDD partitions within each executor? 
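
Something like the contrast below is what I have in mind (again just a sketch; 
the connection details are made up, and it reuses the sc from above): 

  import java.sql.DriverManager
  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.sql.hive.HiveContext

  // Option A: pull the rows through the HiveServer2 / beeline gateway with
  // plain JDBC. Everything streams back into this one client process, so the
  // RDD I build from it starts life entirely on the driver.
  val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default")
  val rs = conn.createStatement().executeQuery("SELECT key, value FROM some_table")
  val rows = ArrayBuffer[(String, String)]()
  while (rs.next()) rows += ((rs.getString(1), rs.getString(2)))
  val localRdd = sc.parallelize(rows)   // data originated on the client

  // Option B: let Spark read the Hive table directly. The executors read the
  // underlying files themselves, so the partitions live out on the cluster.
  val hiveContext = new HiveContext(sc)
  val distributed = hiveContext.sql("SELECT key, value FROM some_table")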

I realize that these are silly questions, but I need to make sure that I know 
the flow of the data and where it ultimately resides. There really is a method 
to my madness, and if I could explain it… these questions really would make 
sense. ;-) 

TIA, 

-Mike

