Hi Eric,

If two jobs share the same parent stages, those stages can be skipped for
the second job.

Here's one simple example:

val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
val rdd2 = rdd1.groupByKey()  // introduces a shuffle boundary
rdd2.map(e => e._1).collect().foreach(println)               // job 1: 2 stages
rdd2.map(e => (e._1, e._2.size)).collect().foreach(println)  // job 2

Obviously, there are 2 jobs and both of them have 2 stages. Luckily, these
2 jobs share the same stage (the first stage of each job, the shuffle map
stage produced by groupByKey). Although you don't cache the data
explicitly, once a stage is completed it is marked as available and its
output can be used by other jobs. So the second job only needs to run one
stage.
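
A quick way to confirm the shared dependency from the shell is
toDebugString, which prints the RDD lineage; both actions above hang off
the same ShuffledRDD, which is why the shuffle map stage only has to run
once:

// Prints rdd2's lineage: a single ShuffledRDD fed by the
// map/parallelize stage; both jobs above depend on this same shuffle.
println(rdd2.toDebugString)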
You should also be able to see the skipped stage in the Spark job UI.
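
To your question about avoiding the reuse: this stage-level reuse is keyed
by the shuffle dependency, so one way to force a full recomputation (a
minimal sketch against the toy example above, not a drop-in for your job)
is to rebuild the RDD chain fresh for each job; a new lineage gets a new
shuffle, so nothing is marked as available for it:

// Hypothetical freshGrouped: each call builds a brand-new lineage,
// hence a new shuffle dependency with its own shuffle id, so the
// second job runs both stages instead of skipping one.
def freshGrouped = sc.parallelize(1 to 10).map(e => (e, e)).groupByKey()

freshGrouped.map(_._1).collect().foreach(println)                    // 2 stages
freshGrouped.map(e => (e._1, e._2.size)).collect().foreach(println)  // 2 stages again

Note this only removes Spark's stage reuse; OS-level page caching of the
input, which you also asked about, is a separate matter.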




On Wed, Sep 2, 2015 at 12:53 AM, Eric Walker <eric.wal...@gmail.com> wrote:

> Hi,
>
> I'm noticing that a 30-minute job that was initially IO-bound may no longer
> be so during subsequent runs.  Is there some kind of between-job caching that
> happens in Spark or in Linux that outlives jobs and that might be making
> subsequent runs faster?  If so, is there a way to avoid the caching in
> order to get a better sense of the worst-case scenario?
>
> (It's also possible that I've simply changed something that made things
> faster.)
>
> Eric
>
>


-- 
Best Regards

Jeff Zhang
