[ 
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430800#comment-13430800
 ] 

Eli Reisman commented on GIRAPH-246:
------------------------------------

I totally agree. Once jaeho's patch is fixed and working, I look forward to 
seeing it go in. For now, I have verified this patch over many runs, some of 
them 60+ minutes long, during the last month and again over the last few days 
since rebasing it, and I have no doubt that it works. I (and others here) need 
to run large-scale jobs all the time right now. When his patch is ready, I say 
replace this one. The template is already set. For now, this is verified and 
helps us move forward with stable code.

As far as configurable timing goes, this patch has that too: originally in 
GiraphJob as a conf option, then, after suggestions on the thread, as a final 
variable, which I agree is probably too rigid for the long term. But this is 
destined for replacement once a verified solution is available.
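
To make the trade-off concrete, here is a rough sketch of the conf-option 
variant next to a fixed default. It is not the code from the patch: the key 
name and default value are made up for illustration, and java.util.Properties 
stands in for the Hadoop Configuration object so the snippet runs on its own.

```java
import java.util.Properties;

final class ProgressIntervalConfig {
  /** Hypothetical key name, chosen for this example only. */
  static final String KEY = "example.progress.interval.msecs";
  /** Hypothetical default, used when the key is missing or invalid. */
  static final long DEFAULT_MSECS = 30_000L;

  /** Returns the configured interval, falling back to the default. */
  static long intervalMsecs(Properties conf) {
    String raw = conf.getProperty(KEY);
    if (raw == null) {
      return DEFAULT_MSECS;
    }
    try {
      long value = Long.parseLong(raw.trim());
      // A zero or negative interval makes no sense, so ignore it.
      return value > 0 ? value : DEFAULT_MSECS;
    } catch (NumberFormatException e) {
      return DEFAULT_MSECS;
    }
  }
}
```

With a final variable only the DEFAULT_MSECS line would exist; the lookup is 
what lets a cluster with a different task timeout tune the interval without a 
rebuild.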

His solution is clean and a good fix for the parts where barriers are used, but 
lots of other progress calls are still going to be peppered through the code if 
we don't allow an alternate thread to call progress() (which I agree should not 
happen). When he has his patch working, has tried it, and knows it does what 
it's supposed to, I will be the first to +1 it. Right now, I need to get jobs 
to run, so I have to patch this in until a cleaner fix is implemented and 
verified. This gives him time to find and tune that more elegant fix.
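
For anyone following the thread, the general idea of calling progress() from 
the waiting thread itself (rather than from an alternate thread) can be 
sketched roughly like this. This is an illustration under assumptions, not the 
patch: the Reporter interface is a hypothetical stand-in for the Hadoop 
context, and a CountDownLatch stands in for the barrier.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ProgressingWait {
  /** Hypothetical stand-in for the object whose progress() is called. */
  interface Reporter {
    void progress();
  }

  /**
   * Blocks until the barrier opens, waking up every intervalMsecs to
   * report progress from the waiting thread itself.
   */
  static void awaitWithProgress(CountDownLatch barrier, Reporter reporter,
      long intervalMsecs) throws InterruptedException {
    // await() returns false when the timeout elapses before the latch opens.
    while (!barrier.await(intervalMsecs, TimeUnit.MILLISECONDS)) {
      reporter.progress(); // still waiting, but alive
    }
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch barrier = new CountDownLatch(1);
    AtomicInteger calls = new AtomicInteger();
    // Simulates a slower peer that reaches the barrier late.
    Thread slowPeer = new Thread(() -> {
      try {
        Thread.sleep(300);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      barrier.countDown();
    });
    slowPeer.start();
    awaitWithProgress(barrier, calls::incrementAndGet, 50);
    System.out.println("progress() calls while waiting: " + calls.get());
  }
}
```

Because the waiting thread does the reporting, a worker whose main thread has 
crashed or hung stops calling progress() and should still hit the normal 
timeout, which is the behavior described in the issue below.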

                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, 
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, 
> GIRAPH-246-6.patch, GIRAPH-246-7.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(). This allows workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but with low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. Timeouts are still possible when a worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but the change prevents 
> unintentional failures when the worker is actually still healthy.
