Ok, it's the end of the day and I'm trying to make sure I understand where things are actually running.
I have an application where I have to query a bunch of sources, creating some RDDs, and then I need to join the RDDs against each other and against some other lookup tables.

YARN has two modes: client and cluster. I get that in cluster mode everything is running on the cluster. But in client mode, the driver is running on the edge node while the workers are running on the cluster. When I run a Spark SQL command that generates a new RDD, does the result set live on the cluster with the workers and just get referenced by the driver, or does the result set get migrated to the driver running on the client? (I'm pretty sure I know the answer, but it's never safe to assume anything…)

The follow-up questions:

1) If I kill the app running the driver on the edge node, will that cause YARN to free up the cluster's resources? (In cluster mode that doesn't happen.) What happens, and how quickly?

1a) In client mode, can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor, any portion of the RDDs associated with that executor is gone, but the SparkContext is still alive on the edge node? [Again assuming that the SparkContext lives with the driver.])

2) Will any I/O between my Spark job and the outside world (e.g. walking through a data set and writing it out to a file) occur on the edge node where the driver is located? (This may seem kinda silly, but what happens when you want to expose the result set to the world…?)

Now for something slightly different… Suppose I have a data source, like a couple of Hive tables, and I access the tables via beeline (JDBC). In this case Hive generates a map/reduce job and then streams the result set back to the client node, where the RDD result set would be built. I realize that I could run Hive on top of Spark, but that's a separate issue. Here the RDD will reside on the client only. (That is, I could in theory run this as a single Spark instance.)
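To make the main question concrete, my job does roughly this (table and path names are made up, and `spark` is a SparkSession launched with `--master yarn --deploy-mode client`):

```scala
// Rough shape of my job; names are illustrative, not my real schema.
val df = spark.sql(
  "SELECT user_id, count(*) AS n FROM events GROUP BY user_id")

// My assumption: df's partitions live in the executor JVMs out on the
// cluster, and the driver on the edge node only holds the query plan
// and references to them...
df.write.parquet("hdfs:///tmp/events_agg")  // ...written by the executors?

// ...and only an explicit collect() actually pulls rows back into the
// driver process on the edge node?
val local = df.collect()
```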
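For follow-ups 1 and 1a, this is the kind of thing I have in mind (the jar name and application id are illustrative):

```shell
# Question 1: if I launch in client mode like this...
spark-submit --master yarn --deploy-mode client my-app.jar
# ...and then kill the spark-submit process on the edge node, does YARN
# tear down the executors right away, and how quickly?

# (In cluster mode, as I understand it, I'd have to kill the YARN
# application itself instead:)
yarn application -kill application_1234567890_0001

# Question 1a: is dynamic allocation the right way to spin executors up
# and down while the SparkContext stays alive on the edge node?
spark-submit --master yarn --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my-app.jar
```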
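And the beeline case, as I understand it, would look something like this (URL, driver class, and table name are illustrative):

```scala
// Reading one of the Hive tables through HiveServer2's JDBC endpoint,
// i.e. the same gateway beeline talks to. Connection details made up.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hs2-host:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "my_hive_table")
  .load()
// My assumption: the whole result set streams back through that single
// JDBC connection, rather than being built in parallel by the executors
// reading HDFS directly.
```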
If I were to run this on the cluster, then the result set would stream through the beeline gateway and would end up residing back on the cluster, sitting in RDDs within each executor?

I realize that these are silly questions, but I need to make sure that I know the flow of the data and where it ultimately resides. There really is a method to my madness, and if I could explain it, these questions really would make sense. ;-)

TIA,
-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org