This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records corresponding to a particular partition at the end of the first job can end up split across multiple partitions in the second job.
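One common mitigation is to re-apply the same partitioner after reading: you pay one shuffle up front in the daily job, but per-key operations after that stay narrow. A minimal sketch, assuming the hourly jobs wrote an RDD[(String, Long)] with saveAsObjectFile; paths, types, and the partition count are placeholders:

// Illustrative sketch only: paths, key/value types, and partition count are made up.
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on Spark 1.2)

object DailyJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-job"))

    // After reading, rdd.partitioner is None: partitions come from the input splits,
    // not from whatever partitioning the hourly jobs used when they wrote the data.
    val hourly = sc.objectFile[(String, Long)]("hdfs:///data/hourly/2015-04-01/*")

    // Re-establish a known partitioning once (one shuffle) ...
    val partitioned = hourly.partitionBy(new HashPartitioner(48)).persist()

    // ... then same-partitioner operations avoid further shuffles in this job.
    val dailyTotals = partitioned.reduceByKey(_ + _)
    dailyTotals.saveAsObjectFile("hdfs:///data/daily/2015-04-01")

    sc.stop()
  }
}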
-Sandy

On Wed, Apr 1, 2015 at 9:09 PM, kjsingh <kanwaljit.si...@guavus.com> wrote:
> Hi,
>
> We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of
> Tuple2. At the end of the day, a daily job is launched, which works on the
> outputs of the hourly jobs.
>
> For data locality and speed, we would like the daily job, when it launches,
> to find all instances of a given key at a single executor rather than
> fetching them from other executors during the shuffle.
>
> Is it possible to maintain key partitioning across jobs? We can control
> partitioning within one job, but how do we send keys to the executors of the
> same node manager across jobs? And while saving data to HDFS, are the blocks
> allocated to the same data node machine as the executor for a partition?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-across-jobs-tp22351.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
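For completeness, a sketch of the hourly-job side described in the question (input format, key extraction, and paths are invented here). Partitioning before the save keeps each hour's layout consistent by key, although, per the reply above, the daily job cannot assume that layout survives the read:

// Illustrative sketch only: input format, key extraction, and paths are placeholders.
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on Spark 1.2)

object HourlyJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hourly-job"))

    // Build the Tuple2 RDD; here the key is the first tab-separated field.
    val pairs = sc.textFile("hdfs:///data/raw/2015-04-01-20/*")
      .map(line => (line.split('\t')(0), 1L))

    // Use the same HashPartitioner (and partition count) in every hourly job and in
    // the daily job, so a given key always hashes to the same partition index.
    val byKey = pairs.partitionBy(new HashPartitioner(48)).reduceByKey(_ + _)

    byKey.saveAsObjectFile("hdfs:///data/hourly/2015-04-01-20")
    sc.stop()
  }
}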