[jira] [Commented] (GIRAPH-246) Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits

Jaeho Shin (JIRA) Tue, 07 Aug 2012 22:16:16 -0700

    [ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430874#comment-13430874 ]


Jaeho Shin commented on GIRAPH-246:
-----------------------------------

Like Avery, I see no reason to revert the fix for {{waitForever()}}.  
Apart from that, I welcome the confirmed solution of using explicit 
{{waitMsecs()}} calls.  I thought this was identical to my earlier fix, but 
after a second look it seems ZK needs frequent polling to work in some cases.  
In our case, the problem was fixed by GIRAPH-267 and GIRAPH-274, but we are 
now struggling with other problems (perhaps GC and/or netty?).

Eli, how about keeping PredicateLock from GIRAPH-267 but replacing the 
waitForever() lines with your fix?
                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch, 
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, 
> GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-8.patch
>
>
> This simple change adds a command-line configurable option to GiraphJob that 
> controls the time between calls to context.progress(). This lets workers 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed jobs that were large-scale but had low memory overhead to 
> complete even when they would previously time out during runs on a Hadoop 
> cluster. A timeout is still possible when a worker crashes, runs out of 
> memory, or has other legitimate GC or RPC trouble, but the change prevents 
> unintentional crashes when the worker is actually still healthy.
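
The idea discussed above, replacing an unbounded wait with repeated bounded waits and a progress report between them, can be sketched roughly as follows. This is a minimal illustration and not the code from any of the attached patches or from GIRAPH-267: the class BarrierWaitSketch, the local Progressable interface (a stand-in for the Hadoop task context, of which only progress() is used), and the waitWithProgress() helper are all hypothetical names chosen for the example. Only the method names waitMsecs() and progress() come from the discussion.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: a barrier event that is waited on in bounded slices. */
class BarrierWaitSketch {
    /** Stand-in for the Hadoop task context; only progress() matters here. */
    interface Progressable {
        void progress();
    }

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private boolean eventOccurred = false;

    /** Mark the event as having occurred and wake up all waiters. */
    void signal() {
        lock.lock();
        try {
            eventOccurred = true;
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Wait at most msecs milliseconds; returns true if the event occurred. */
    boolean waitMsecs(long msecs) {
        long nanosLeft = TimeUnit.MILLISECONDS.toNanos(msecs);
        lock.lock();
        try {
            while (!eventOccurred) {
                if (nanosLeft <= 0) {
                    return false; // this slice timed out without the event
                }
                // awaitNanos returns an estimate of the time still remaining
                nanosLeft = changed.awaitNanos(nanosLeft);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            lock.unlock();
        }
    }

    /**
     * Replacement for an unbounded wait: wait in slices of intervalMsecs and
     * report progress after every slice that times out, so the framework does
     * not consider a healthy but idle worker dead.
     * Returns the number of progress() calls made.
     */
    int waitWithProgress(Progressable context, long intervalMsecs) {
        int progressCalls = 0;
        while (!waitMsecs(intervalMsecs)) {
            context.progress();
            progressCalls++;
        }
        return progressCalls;
    }
}
```

In this sketch the interval passed to waitWithProgress() plays the role of the configurable option described in the issue: it has to be shorter than the cluster's task timeout for the progress reports to have any effect. A worker that is truly stuck elsewhere (GC, RPC) never reaches this loop, so it can still time out as before.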

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
