[ 
https://issues.apache.org/jira/browse/GIRAPH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436866#comment-13436866
 ] 

Eli Reisman commented on GIRAPH-246:
------------------------------------

More testing this morning. The GIRAPH-246-NEW-FIX-2.patch calls progress() every 10 
seconds regardless of the variable-length timed waits in waitMsecs (as this patch 
sets them up) or in waitForever() (as Jaeho set it up in GIRAPH-267, and as trunk 
already does). I think this is ready to go.
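To make the intent concrete, here is a minimal sketch of the kind of timed wait described above: sleep only up to the progress interval at a time, and call progress() on each wakeup so the Hadoop task tracker never sees the worker as idle. The ProgressWait class, the Progressable stand-in, and the parameter names are illustrative, not the actual Giraph code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of a barrier wait that reports progress periodically. */
public class ProgressWait {
  /** Stand-in for Hadoop's Progressable / Mapper.Context#progress(). */
  public interface Progressable {
    void progress();
  }

  private final Lock lock = new ReentrantLock();
  private final Condition eventOccurred = lock.newCondition();
  private boolean done = false;

  /**
   * Wait up to maxMsecs for the event, calling progress() at least once
   * every progressMsecs so the task tracker does not time the worker out.
   * Returns true if the event occurred before the deadline.
   */
  public boolean waitMsecs(long maxMsecs, long progressMsecs,
                           Progressable context) throws InterruptedException {
    long deadline = System.currentTimeMillis() + maxMsecs;
    lock.lock();
    try {
      while (!done) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) {
          return false;  // deadline passed without the event
        }
        // Sleep only up to the progress interval, then report progress.
        eventOccurred.await(Math.min(remaining, progressMsecs),
                            TimeUnit.MILLISECONDS);
        context.progress();
      }
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Called by whoever delivers the event the waiter is blocked on. */
  public void signalDone() {
    lock.lock();
    try {
      done = true;
      eventOccurred.signalAll();
    } finally {
      lock.unlock();
    }
  }
}
```

A waitForever() variant would simply loop without the deadline check, still waking every progressMsecs to call context.progress().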

In other news, while stress testing this and scaling it up, I may have found 
another place progress needs to be called more often: in the Netty channel 
pipelines handling send and receive during the input superstep, as collections 
of vertices are sent to their future homes. I will try to get more instrumented 
runs in this morning to gather details, but something odd is going on when a 
worker that is not reading a split starts receiving its partition data over 
Netty: it hits a timeout. I don't know whether the timing is coincidental, but 
this strange timeout happens consistently on such worker nodes during 
large-scale runs. Often, when I can get log data on such a timeout, it is not a 
healthy worker timing out but one where Netty is overwhelmed and the worker has 
genuinely died. This might be more appropriate in another JIRA, or perhaps 
Avery is already aware of it and has wrapped it into his next Netty 
improvement? Either way, I will try to get more details on what is happening 
here and reproduce the problem. This is running on today's trunk, so the 
GIRAPH-300 improvements were already in when this problem showed up.

                
> Periodic worker calls to context.progress() will prevent timeout on some 
> Hadoop clusters during barrier waits
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-246
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-246
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>              Labels: hadoop, patch
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-246-10.patch, GIRAPH-246-11.patch, 
> GIRAPH-246-1.patch, GIRAPH-246-2.patch, GIRAPH-246-3.patch, 
> GIRAPH-246-4.patch, GIRAPH-246-5.patch, GIRAPH-246-6.patch, 
> GIRAPH-246-7.patch, GIRAPH-246-7_rebase1.patch, GIRAPH-246-7_rebase2.patch, 
> GIRAPH-246-8.patch, GIRAPH-246-9.patch, GIRAPH-246-NEW-FIX-2.patch, 
> GIRAPH-246-NEW-FIX.patch
>
>
> This simple change adds a command-line configurable option in GiraphJob to 
> control the time between calls to context.progress(), allowing workers to 
> avoid timeouts during long data load-ins in which some workers complete their 
> input split reads much faster than others, or finish a superstep faster. I 
> found this allowed large-scale jobs with low memory overhead to complete even 
> when they would previously time out during runs on a Hadoop cluster. Timeout 
> is still possible when a worker crashes, runs out of memory, or has other 
> legitimate GC or RPC trouble, but this prevents unintentional crashes when 
> the worker is actually still healthy.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira