It's not only when the RDD is explicitly cached. Stages are also skipped
when their map outputs have already been materialized into shuffle files
and are still accessible through the map output tracker.  Because of that,
explicitly caching an RDD often gains you little or nothing: even without a
call to cache() or persist(), most of the prior computation is reused, and
the corresponding stages show up as skipped, i.e. they did not need to be
recomputed.
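
To make the idea concrete, here is a rough toy model in plain Python. It is
not Spark code: ToyScheduler, run_job and available_map_outputs are made-up
names, and the real scheduler is far more involved. It only illustrates the
rule described above, assuming a simple linear lineage: walk back from the
final stage, and as soon as a stage's map output is still available, that
stage and everything before it is reported as skipped.

```python
# Toy model only: ToyScheduler and its methods are hypothetical names for
# illustration, not Spark classes or APIs.

class ToyScheduler:
    def __init__(self):
        # Names of map stages whose shuffle output is still available,
        # standing in for what the map output tracker knows about.
        self.available_map_outputs = set()

    def run_job(self, map_stages, result_stage):
        """Run a linear chain of map stages followed by one result stage.

        Returns (ran, skipped): the stage names executed and skipped.
        """
        to_run = [result_stage]  # the final stage always runs
        skipped = []
        # Walk backwards from the result stage toward the start of the lineage.
        for i in range(len(map_stages) - 1, -1, -1):
            stage = map_stages[i]
            if stage in self.available_map_outputs:
                # Output exists: this stage and all of its ancestors are skipped.
                skipped = list(map_stages[: i + 1])
                break
            to_run.append(stage)
        to_run.reverse()
        # Every map stage that ran has now materialized its shuffle output.
        self.available_map_outputs.update(to_run[:-1])
        return to_run, skipped


sched = ToyScheduler()
lineage = ["map-0", "map-1"]
first = sched.run_job(lineage, "count")   # nothing materialized yet
second = sched.run_job(lineage, "count")  # map outputs reused, no cache() involved
print(first)
print(second)
```

In this sketch the second job runs only the final stage and reports both map
stages as skipped, even though nothing was cached. If a map output is
dropped from the set (loosely analogous to shuffle files becoming
unavailable), that stage runs again while its ancestors stay skipped.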

On Tue, Mar 15, 2016 at 5:50 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> If an RDD is cached, it is computed only once, and the stages that
> compute it are skipped in the following jobs.
>
>
> On Wed, Mar 16, 2016 at 8:14 AM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>
>> Hi All,
>>
>>
>> The Completed Jobs section of the Spark UI shows the information below.
>> What does the skipped value shown for Stages and Tasks mean?
>>
>> Job_ID:                                   11
>> Description:                              count
>> Submitted:                                2016/03/14 15:35:32
>> Duration:                                 1.4 min
>> Stages (Succeeded/Total):                 164/164 *(163 skipped)*
>> Tasks (for all stages), Succeeded/Total:  19841/19788 *(41405 skipped)*
>>
>> Thanks,
>> Prabhu Joseph
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
