[ https://issues.apache.org/jira/browse/GIRAPH-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342452#comment-15342452 ]
Maja Kabiljo commented on GIRAPH-1077:
--------------------------------------

https://reviews.facebook.net/D59895

> Jobs getting stuck after channel failure
> ----------------------------------------
>
>                 Key: GIRAPH-1077
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1077
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Maja Kabiljo
>            Assignee: Maja Kabiljo
>
> Currently, when a channel fails, we just log the failure. Since we don't wait
> on open requests from every place, the check for open requests doesn't always
> get called, and we've seen jobs stay stuck, for example during the input stage
> when a worker's request to the master for a split to read fails. When we know
> that a channel has failed, we should try to resend the requests from that
> channel.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
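
The idea in the description, resending the open requests of a failed channel instead of only logging the failure, could look roughly like the sketch below. This is not the patch from the review linked above. `OpenRequestTracker`, `Sender`, and all method names are hypothetical and do not come from the Giraph code base, and the sketch assumes the caller is told which replacement channel to use.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: tracks requests that have no reply yet, per channel,
// so they can be resent when that channel is known to have failed.
final class OpenRequestTracker {
  /** Hypothetical transport hook that writes a request to a channel. */
  interface Sender {
    void send(int channelId, long requestId, String payload);
  }

  // channelId -> (requestId -> payload) for requests still awaiting a reply.
  private final Map<Integer, Map<Long, String>> openByChannel =
      new ConcurrentHashMap<>();
  private final Sender sender;

  OpenRequestTracker(Sender sender) {
    this.sender = sender;
  }

  /** Records the request as open, then hands it to the transport. */
  void send(int channelId, long requestId, String payload) {
    openByChannel
        .computeIfAbsent(channelId, id -> new ConcurrentHashMap<>())
        .put(requestId, payload);
    sender.send(channelId, requestId, payload);
  }

  /** A reply arrived, so the request is no longer open. */
  void onReply(int channelId, long requestId) {
    Map<Long, String> open = openByChannel.get(channelId);
    if (open != null) {
      open.remove(requestId);
    }
  }

  /**
   * Meant to be called from the channel-failure callback. Moves every open
   * request of the failed channel to the replacement channel and resends it.
   * Returns the number of requests that were resent.
   */
  int onChannelFailure(int failedChannelId, int replacementChannelId) {
    Map<Long, String> open = openByChannel.remove(failedChannelId);
    if (open == null) {
      return 0;
    }
    // Resend in request-id order so the behavior is deterministic.
    for (Map.Entry<Long, String> e : new TreeMap<>(open).entrySet()) {
      send(replacementChannelId, e.getKey(), e.getValue());
    }
    return open.size();
  }

  /** Number of requests on the channel that still await a reply. */
  int openCount(int channelId) {
    Map<Long, String> open = openByChannel.get(channelId);
    return open == null ? 0 : open.size();
  }
}
```

The point of the design is that the resend is triggered by the failure notification itself, so it no longer depends on some caller happening to wait on open requests. A real implementation would also need to handle replies that race with the failure and cap the number of retries, which this sketch leaves out.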