Hi Jeff,

I think I see what you're saying.  I was thinking more of a whole Spark
job, where `spark-submit` is run once to completion and then started up
again, rather than a "job" as seen in the Spark UI.  I take it there is no
implicit caching of results between `spark-submit` runs.
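
For my own reference, here is a minimal sketch of how I'd make the reuse
explicit, on the assumption that nothing carries over between runs on its
own (the HDFS paths below are just placeholders):

// Within a single run, persist() keeps a computed RDD around for reuse.
val expensive = sc.textFile("hdfs:///tmp/input").map(_.length)
expensive.persist()  // .cache() is shorthand for persist(MEMORY_ONLY)
println(expensive.count())

// Across spark-submit runs, results survive only if written out
// explicitly, e.g.:
expensive.saveAsObjectFile("hdfs:///tmp/expensive-result")
// ...which a later run could reload with:
// val reloaded = sc.objectFile[Int]("hdfs:///tmp/expensive-result")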

(In the case I was writing about, I think I read too much into the Ganglia
network traffic view.  During the runs that I believed to be IO-bound, I
was carrying out a long-running database transfer on the same network.
After it completed I saw a speedup, and, not realizing where it came from,
wondered whether there had been some kind of shifting of the data.)

Eric


On Tue, Sep 1, 2015 at 9:54 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Hi Eric,
>
> If the 2 jobs share the same parent stages, these stages can be skipped
> for the second job.
>
> Here's one simple example:
>
> // Build a pair RDD and group it; groupByKey introduces a shuffle stage.
> val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
> val rdd2 = rdd1.groupByKey()
> // Two separate actions, hence two jobs sharing the same parent stage:
> rdd2.map(e => e._1).collect().foreach(println)
> rdd2.map(e => (e._1, e._2.size)).collect().foreach(println)
>
> Obviously, there are 2 jobs and both of them have 2 stages. Luckily,
> these 2 jobs share the same stage (the first stage of each job). Although
> you don't cache the data explicitly, once a stage is completed it is
> marked as available and can be used by other jobs, so the second job
> only needs to run one stage.
> You should be able to see the skipped stage in the Spark job UI.
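>
> (As a rough aside, you can also inspect the shared lineage from the
> shell with something like:
>
> println(rdd2.toDebugString)
>
> The printed lineage shows the ShuffledRDD and its parent; once the
> shuffle output for that parent stage exists, the second job can skip it.)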
>
>
>
> [image: Spark UI screenshot showing the second job's skipped stage]
>
> On Wed, Sep 2, 2015 at 12:53 AM, Eric Walker <eric.wal...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm noticing that a 30-minute job that was initially IO-bound may no
>> longer be IO-bound during subsequent runs.  Is there some kind of
>> between-job caching that happens in Spark or in Linux that outlives jobs
>> and might be making subsequent runs faster?  If so, is there a way to
>> avoid the caching in order to get a better sense of the worst-case
>> scenario?
>>
>> (It's also possible that I've simply changed something that made things
>> faster.)
>>
>> Eric
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
