Maja Kabiljo created GIRAPH-1077:
------------------------------------

             Summary: Jobs getting stuck after channel failure
                 Key: GIRAPH-1077
                 URL: https://issues.apache.org/jira/browse/GIRAPH-1077
             Project: Giraph
          Issue Type: Bug
            Reporter: Maja Kabiljo
            Assignee: Maja Kabiljo


When a channel fails, we currently just log the failure. Since we don't wait on 
open requests everywhere, the check for open requests doesn't always get called, 
and we've seen jobs get stuck, for example during the input stage when a 
worker's request to the master for a split to read fails. When we know that a 
channel has failed, we should try to resend the requests from that channel.
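
A minimal sketch of the proposed behaviour, to make the idea concrete. It is not 
Giraph's actual client code: the class OpenRequestTracker, its methods, the 
integer channel ids and the String payloads are all hypothetical names chosen 
for illustration. It assumes each open request is recorded under the channel it 
was sent on, so that a channel-failure callback can move those requests to a 
replacement channel and resend them instead of only logging.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Hypothetical sketch: tracks open requests per channel, resends on failure. */
public class OpenRequestTracker {
  /** Open (unacknowledged) requests: channelId -> (requestId -> payload). */
  private final Map<Integer, Map<Long, String>> openRequests =
      new ConcurrentHashMap<>();

  /** Record a request as open on the channel it was written to. */
  public void requestSent(int channelId, long requestId, String payload) {
    openRequests
        .computeIfAbsent(channelId, id -> new ConcurrentHashMap<>())
        .put(requestId, payload);
  }

  /** A response arrived, so the request is no longer open. */
  public void responseReceived(int channelId, long requestId) {
    Map<Long, String> pending = openRequests.get(channelId);
    if (pending != null) {
      pending.remove(requestId);
    }
  }

  /**
   * Intended to be called from the channel's failure callback. Moves every
   * open request of the failed channel to the replacement channel and hands
   * it to the supplied resender (assumed to do the actual network write).
   *
   * @return number of requests that were resent
   */
  public int channelFailed(int failedChannelId, int replacementChannelId,
      BiConsumer<Long, String> resender) {
    Map<Long, String> pending = openRequests.remove(failedChannelId);
    if (pending == null) {
      return 0;
    }
    for (Map.Entry<Long, String> entry : pending.entrySet()) {
      requestSent(replacementChannelId, entry.getKey(), entry.getValue());
      resender.accept(entry.getKey(), entry.getValue());
    }
    return pending.size();
  }

  /** Total number of requests still waiting for a response. */
  public int openRequestCount() {
    int count = 0;
    for (Map<Long, String> pending : openRequests.values()) {
      count += pending.size();
    }
    return count;
  }
}
```

With this shape, a split request that was in flight when its channel died is 
resent as soon as the failure is observed, rather than waiting for some other 
code path to check open requests. A real implementation would also need to 
handle requests written concurrently with the failure and duplicate delivery on 
the receiving side, which this sketch does not attempt.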



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
