[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430874#comment-13430874 ]
Jaeho Shin commented on GIRAPH-246:
-----------------------------------

Like Avery, I see no reason to revert the fix for {{waitForever()}}. Apart from that, I welcome the confirmed solution of using explicit {{waitMsecs()}} calls. I thought this was identical to my fix last time, but after a second look it seems ZK needs frequent polling to work in some cases. In our case, the problem was fixed by GIRAPH-267 and GIRAPH-274, but we are now struggling with other problems (perhaps GC and/or netty?).

Eli, how about keeping PredicateLock from GIRAPH-267 but replacing the {{waitForever()}} lines with your fix?

> Periodic worker calls to context.progress() will prevent timeout on some
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-1.patch, GIRAPH-246-2.patch,
> GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch,
> GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-8.patch
>
>
> This simple change creates a command-line configurable option in GiraphJob to
> control the time between calls to context.progress(). It allows workers to
> avoid timeouts during long data load-ins in which some workers complete their
> input split reads much faster than others, or finish a superstep faster. I
> found this allowed jobs that were large-scale but with low memory overhead to
> complete even when they would previously time out during runs on a Hadoop
> cluster. A timeout is still possible when the worker crashes, runs out of
> memory, or has other legitimate GC or RPC trouble, but the change prevents
> unintentional crashes when the worker is actually still healthy.
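For reference, the idea discussed in this thread (replacing an unbounded wait with bounded waits and a progress call between them) might look roughly like the sketch below. It is not the GIRAPH-246 patch and not Giraph's PredicateLock. The class ProgressingEvent, the method waitWithProgress(), and the Runnable callback standing in for context.progress() are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of a barrier event that reports progress while waiting. */
final class ProgressingEvent {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition signaled = lock.newCondition();
    private boolean eventOccurred = false;

    /** Mark the event as occurred and wake every waiter. */
    void signal() {
        lock.lock();
        try {
            eventOccurred = true;
            signaled.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Block until signal() has been called, but wake up at least every
     * sliceMsecs to run reportProgress (a stand-in for context.progress()).
     * Returns the number of progress calls made. The callback runs while the
     * lock is held, so it should be cheap.
     */
    int waitWithProgress(long sliceMsecs, Runnable reportProgress)
            throws InterruptedException {
        int progressCalls = 0;
        lock.lock();
        try {
            while (!eventOccurred) {
                // Bounded wait: returns on signal, timeout, or spurious wakeup.
                signaled.await(sliceMsecs, TimeUnit.MILLISECONDS);
                if (!eventOccurred) {
                    reportProgress.run();
                    progressCalls++;
                }
            }
        } finally {
            lock.unlock();
        }
        return progressCalls;
    }

    public static void main(String[] args) throws InterruptedException {
        ProgressingEvent barrier = new ProgressingEvent();
        AtomicInteger heartbeats = new AtomicInteger();

        // Simulate a slow peer that reaches the barrier after 200 ms.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            barrier.signal();
        });
        slowWorker.start();

        int calls = barrier.waitWithProgress(20, heartbeats::incrementAndGet);
        slowWorker.join();
        System.out.println("progress calls while waiting: " + calls);
    }
}
```

In a real job the callback would be the Hadoop context's progress() method, and the slice length would come from the configurable option the issue description mentions. A worker that is genuinely stuck outside this loop still stops reporting progress, so the framework timeout remains effective for real failures.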