This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records corresponding to a particular partition at the end of the first job can end up split across multiple partitions in the second job.
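One common mitigation is to re-apply the same partitioner after reading: you pay one shuffle up front in the daily job, but per-key operations after that stay narrow. A minimal sketch, assuming the hourly jobs wrote an RDD[(String, Long)] with saveAsObjectFile; paths, types, and the partition count are placeholders:

// Illustrative sketch only: paths, key/value types, and partition count are made up.
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on Spark 1.2)

object DailyJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-job"))

    // After reading, rdd.partitioner is None: partitions come from the input splits,
    // not from whatever partitioning the hourly jobs used when they wrote the data.
    val hourly = sc.objectFile[(String, Long)]("hdfs:///data/hourly/2015-04-01/*")

    // Re-establish a known partitioning once (one shuffle) ...
    val partitioned = hourly.partitionBy(new HashPartitioner(48)).persist()

    // ... then same-partitioner operations avoid further shuffles in this job.
    val dailyTotals = partitioned.reduceByKey(_ + _)
    dailyTotals.saveAsObjectFile("hdfs:///data/daily/2015-04-01")

    sc.stop()
  }
}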
-Sandy

On Wed, Apr 1, 2015 at 9:09 PM, kjsingh <kanwaljit.si...@guavus.com> wrote:
> Hi,
>
> We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of
> Tuple2. At the end of the day, a daily job is launched, which works on the
> outputs of the hourly jobs.
>
> For data locality and speed, we would like the daily job, when it launches,
> to find all instances of a given key at a single executor rather than
> fetching them from other executors during the shuffle.
>
> Is it possible to maintain key partitioning across jobs? We can control
> partitioning within one job, but how do we send keys to the executors of the
> same node manager across jobs? And while saving data to HDFS, are the blocks
> allocated to the same data node machine as the executor for a partition?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-across-jobs-tp22351.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
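For completeness, a sketch of the hourly-job side described in the question (input format, key extraction, and paths are invented here). Partitioning before the save keeps each hour's layout consistent by key, although, per the reply above, the daily job cannot assume that layout survives the read:

// Illustrative sketch only: input format, key extraction, and paths are placeholders.
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on Spark 1.2)

object HourlyJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hourly-job"))

    // Build the Tuple2 RDD; here the key is the first tab-separated field.
    val pairs = sc.textFile("hdfs:///data/raw/2015-04-01-20/*")
      .map(line => (line.split('\t')(0), 1L))

    // Use the same HashPartitioner (and partition count) in every hourly job and in
    // the daily job, so a given key always hashes to the same partition index.
    val byKey = pairs.partitionBy(new HashPartitioner(48)).reduceByKey(_ + _)

    byKey.saveAsObjectFile("hdfs:///data/hourly/2015-04-01-20")
    sc.stop()
  }
}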