[ 
https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437607#comment-13437607
 ] 

Avery Ching commented on GIRAPH-306:
------------------------------------

>Yeah, that was the impression I had too. Just to clarify, as of the recent 
>Netty upgrades + this one, we are in no way attempting to handle worker 
>restarts with any grace, right? This is all purely connection reliability for 
>healthy worker nodes?

Yeah, this is purely for reliability of connections and requests, nothing else.

>I am having a lot more trouble scaling out to more workers than I used to. I 
>know you guys had mentioned this, but I have not been testing again until the 
>last few days and it's definitely gotten trickier, not least because I'm 
>having trouble getting logs to see what happened during a failure. I 
>don't have dumps saved from those jobs, but if I see more I will put them 
>here.

Here's a trick you can try.  Add -Dmapred.map.max.attempts=1 to ensure that any 
failure will fail the job.  Then you can see the logs for the failed task and 
try to figure out what the problem is.

>Mostly the logs I get are reconnection logs after reincarnation in which they 
>all fail (of course) and no logs for the failed portion of the run that 
>triggered the worker to reincarnate.

The above should help us narrow down your problem.  =)
                
> Netty requests should be reliable and implement exactly once semantics
> ----------------------------------------------------------------------
>
>                 Key: GIRAPH-306
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Priority: Critical
>         Attachments: GIRAPH-306.patch
>
>
> One of the biggest scalability challenges is getting Giraph to run reliably 
> on a large number of tasks (i.e. more than 200).  Several problems exist:
> 1) If the connection fails after the initial connection was made, the job 
> will die.
> 2) Requests must be completed exactly once.  This is difficult to implement, 
> but required since we cannot have multiple retried requests succeed (e.g. a 
> vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> This patch addresses these issues by re-establishing failed connections and 
> keeping track of every request sent to every worker.  If a request fails or 
> exceeds a timeout, it will be resent.  The server will keep track of requests 
> that succeeded to ensure that the same request won't be processed more than 
> once.  The structure for keeping track of the succeeded requests on the 
> server is efficient for handling increasing request ids (IncreasingBitSet).  
> For handling unresolved addresses, I added retry logic to keep trying to 
> resolve the address.
> This patch also adds several unit tests that use fault injection to simulate 
> a lost response or a closed channel exception on the server.  It also has 
> unit tests for IncreasingBitSet to ensure it is working correctly and 
> efficiently.
> This passes all unit tests (including the new ones).  Additionally, I have 
> some experimental results as well.
> Previously, I was unable to run reliably with more than 200 workers.  With 
> this change I can reliably run 500+ workers.  I also ran with 600 workers 
> successfully.  This is a really big reliability win for us.
> I can see the code working to do reconnections and re-issue requests when 
> necessary.  It's very cool.
> For example:
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Fixing disconnected channel to 
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: 
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
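
The server-side bookkeeping described in the issue summary above (remember 
which request ids succeeded, so a resent request is not processed twice) can 
be sketched roughly as follows.  This is only an illustration of the idea, 
not the IncreasingBitSet from the patch: the class name SeenRequestIds and 
its methods are hypothetical, and it assumes one instance per client with 
request ids that start at 0 and mostly increase.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch (not Giraph's IncreasingBitSet): tracks which request
 * ids from one client have already been processed, assuming ids mostly
 * arrive in increasing order.
 */
final class SeenRequestIds {
    /** Every id strictly below this value has already been processed. */
    private long base = 0;
    /** Bit i set means id (base + i) has already been processed. */
    private BitSet window = new BitSet();

    /**
     * Records a request id.
     *
     * @return true if the id is new and the request should be processed,
     *         false if it is a duplicate (for example a client retry)
     */
    public synchronized boolean markIfNew(long id) {
        if (id < base) {
            return false; // already processed and compacted away
        }
        long offset = id - base;
        if (offset > Integer.MAX_VALUE - 1) {
            throw new IllegalArgumentException("id too far ahead: " + id);
        }
        int bit = (int) offset;
        if (window.get(bit)) {
            return false; // already processed, still inside the window
        }
        window.set(bit);
        // Drop the leading run of processed ids so memory stays small
        // while ids keep increasing.
        int firstClear = window.nextClearBit(0);
        if (firstClear > 0) {
            window = window.get(firstClear, window.length());
            base += firstClear;
        }
        return true;
    }

    /** Lowest id that has not yet been compacted away. */
    public synchronized long base() {
        return base;
    }

    /** Number of bits currently needed to cover out-of-order ids. */
    public synchronized int windowBits() {
        return window.length();
    }
}
```

A request handler would call markIfNew(requestId) before applying a request 
and skip the work (while still acknowledging) when it returns false.  Because 
the leading run of processed ids is dropped after each call, the window only 
has to cover ids that arrived out of order.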

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
