Hi,

We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of Tuple2. At the end of the day, a daily job is launched, which works on the outputs of the hourly jobs.
For data locality and speed, we would like the daily job, when it launches, to find all instances of a given key on a single executor rather than fetching them from other executors during the shuffle. Is it possible to maintain key partitioning across jobs? We can control partitioning within a single job, but how do we send the same keys to executors on the same node manager across jobs? And when saving data to HDFS, are the blocks allocated to the same data node machine as the executor that wrote the partition?
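For reference, this is roughly how we control partitioning within the hourly job today (the paths, the key parsing, and the partition count of 64 are placeholders, not our real values):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.2

object HourlyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hourly-job"))

    // Stand-in for our real RDD of Tuple2 records: (key, record).
    val keyedRdd = sc.textFile("hdfs:///input/hour=00")
      .map(line => (line.split(",")(0), line))

    // Within this single job we can pin each key to a partition with a
    // HashPartitioner, so all records for a key end up in the same
    // output part file.
    val partitioned = keyedRdd.partitionBy(new HashPartitioner(64))

    partitioned.saveAsObjectFile("hdfs:///output/hour=00")

    sc.stop()
  }
}

The question is how to carry this key-to-executor mapping over to the daily job that reads these outputs back.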