[ https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436866#comment-13436866 ]
Eli Reisman commented on GIRAPH-246:
------------------------------------

More testing this morning. The 246-NEW-FIX-2.patch calls progress() every 10 seconds regardless of the variable-length timed waits in waitMsecs() as this patch sets them up, or in waitForever() as Jaeho set it up in GIRAPH-267 and as trunk already does. I think this is ready to go.

In other news, while stress testing this and scaling it up, I think I found another place progress needs to be called more often: in the Netty channel pipelines handling send and receive during the input superstep, as collections of vertices are sent to their future homes. I will try to get more instrumented runs in this morning if I can to get more details, but something strange is going on when a worker is not reading a split but does start to receive its partition data over Netty, and it is causing a timeout. I don't know if that timing is coincidental, but a strange timeout during large-scale runs is happening consistently on such worker nodes. Often, when I can get log data on such a timeout, it is not a healthy worker timing out but one where Netty is overwhelmed and the worker has genuinely died. This might be more appropriate in another JIRA, or perhaps Avery is already aware of this and has wrapped it up into his next Netty improvement? Either way, I will try to get more details on what is happening here and reproduce the problem. This is running on today's trunk too, so the GIRAPH-300 improvements were already in when this problem showed up.
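The patch itself is not included in this thread, so as a rough sketch of the idea being discussed (the `Progressable` interface and `waitMsecs` signature below are stand-ins for illustration, not Giraph's actual API): a barrier wait that wakes up on a fixed interval to call progress(), so the Hadoop framework does not kill a worker that is merely waiting, while a genuinely dead worker still times out.

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch only -- not the code from the patch.
// "Progressable" stands in for Hadoop's Mapper.Context#progress().
public class ProgressBarrierWait {

    /** Stand-in for the Hadoop context's progress() call. */
    interface Progressable {
        void progress();
    }

    /**
     * Wait until {@code done} returns true, calling progress() every
     * {@code progressIntervalMsecs} so the task is not declared dead.
     * Gives up after {@code maxWaitMsecs} (a negative value means wait
     * forever, as in a waitForever()-style barrier).
     *
     * @return true if the barrier released, false on timeout
     */
    static boolean waitMsecs(BooleanSupplier done,
                             Progressable progressable,
                             long progressIntervalMsecs,
                             long maxWaitMsecs) throws InterruptedException {
        long waited = 0;
        while (!done.getAsBoolean()) {
            if (maxWaitMsecs >= 0 && waited >= maxWaitMsecs) {
                return false;  // legitimate timeout: barrier never released
            }
            progressable.progress();  // keep the worker looking alive
            Thread.sleep(progressIntervalMsecs);
            waited += progressIntervalMsecs;
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // Barrier "releases" after the third progress call.
        final int[] calls = {0};
        boolean released = waitMsecs(() -> calls[0] >= 3, () -> calls[0]++, 10, 1000);
        System.out.println("released=" + released + ", progressCalls=" + calls[0]);

        // A barrier that never releases still times out eventually.
        final int[] calls2 = {0};
        boolean released2 = waitMsecs(() -> false, () -> calls2[0]++, 10, 30);
        System.out.println("timedOut=" + !released2);
    }
}
```

The interval would come from the configurable option described below (every 10 seconds in the NEW-FIX-2 patch); the short values in main() are only to keep the demo fast.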
> Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch, GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch, GIRAPH-246-NEW-FIX.patch
>
> This simple change creates a command-line configurable option in GiraphJob to control the time between calls to context.progress(). It allows workers to avoid timeouts during long data load-ins, in which some workers complete their input split reads much faster than others, or finish a superstep faster. I found this allowed jobs that were large-scale but with low memory overhead to complete even when they would previously time out during runs on a Hadoop cluster. A timeout is still possible when the worker crashes, runs out of memory, or has other legitimate GC or RPC trouble, but this prevents unintentional crashes when the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira