It seems the root cause of the delay was the sheer size of the DAG for
those jobs, which are towards the end of a long series of jobs.

To reduce it, you can probably try to checkpoint (rdd.checkpoint) some
previous RDDs. That will:
1. save the RDD on disk
2. remove all references to the parents of this RDD

This means that when a job uses that RDD, the DAG stops at that RDD and
does not look at its parents, since it doesn't have them anymore. It is very
similar to saving your RDD and re-loading it as a "fresh" RDD.
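
For example, something along these lines (just a rough sketch of the idea;
the checkpoint directory, RDD, and checkpoint frequency are placeholders,
not taken from your job):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))

  // On a cluster, the checkpoint dir should be on reliable storage (HDFS/S3).
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

  var rdd = sc.parallelize(1 to 1000000).map(_ * 2)

  for (i <- 1 to 30) {
    rdd = rdd.map(_ + 1)        // each iteration grows the lineage
    if (i % 10 == 0) {
      rdd.persist()             // avoid recomputing the RDD when it is checkpointed
      rdd.checkpoint()          // mark it for checkpointing
      rdd.count()               // an action forces the checkpoint, truncating the DAG
    }
  }

Note that checkpoint() only marks the RDD; the data is actually written (and
the lineage truncated) when the next action runs, hence the count() above.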

On Fri, Jun 26, 2015 at 9:14 AM, Thomas Gerber <thomas.ger...@radius.com>
wrote:

> Note that this problem is probably NOT caused directly by GraphX, but
> GraphX reveals it because as you go further down the iterations, you get
> further and further away from a shuffle you can rely on.
>
> On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber <thomas.ger...@radius.com>
> wrote:
>
>> Hello,
>>
>> We run GraphX ConnectedComponents, and we notice a time gap between jobs
>> that becomes larger and larger and is not accounted for.
>>
>> In the screenshot attached, you will notice that each job only takes
>> around 2 1/2 min. At first, the next job/iteration starts immediately after
>> the previous one. But as we go through iterations, there is a gap (the time
>> job N+1 starts minus the time job N finishes) that grows, ultimately
>> reaching 6 minutes around the 30th iteration.
>>
>> I suspect it has to do with DAG computation on the driver, as evidenced
>> by the very large (and growing at every iteration) number of pending stages
>> that are ultimately skipped.
>>
>> So,
>> 1. is there anything obvious we can do to make that "gap" between
>> iterations shorter?
>> 2. would dividing the number of partitions in the input RDD by 2 also
>> divide the gap by 2?
>>
>> I ask because a 3 min gap on average for a job that takes 2 1/2 min means
>> we are "wasting" about 50% of CPU time on the executors.
>>
>> Thanks!
>> Thomas
>>
>
>
