[ https://issues.apache.org/jira/browse/GIRAPH-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439104#comment-13439104 ]
Hudson commented on GIRAPH-306:
-------------------------------

Integrated in Giraph-trunk-Commit #183 (See [https://builds.apache.org/job/Giraph-trunk-Commit/183/])
GIRAPH-306: Netty requests should be reliable and implement exactly once semantics. (aching) (Revision 1375824)

     Result = SUCCESS
aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1375824
Files :
* /giraph/trunk/CHANGELOG
* /giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/ClientRequestId.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java
* /giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
* /giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java
* /giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java
* /giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java


> Netty requests should be reliable and implement exactly once semantics
> ----------------------------------------------------------------------
>
>                 Key: GIRAPH-306
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-306
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>            Priority: Critical
>         Attachments: GIRAPH-306.2.patch, GIRAPH-306.patch
>
>
> One of the biggest scalability challenges is getting Giraph to run reliably
> on a large number of tasks (i.e. > 200). Several problems exist:
> 1) If the connection fails after the initial connection was made, the job
> will die.
> 2) Requests must be completed exactly once. This is difficult to implement,
> but required since we cannot have multiple retried requests succeed (i.e. a
> vertex gets more messages than expected).
> 3) Sometimes there are unresolved addresses, causing failure.
> This patch addresses these issues by re-establishing failed connections and
> keeping track of every request sent to every worker. If a request fails or
> exceeds a timeout, it will be resent. The server will keep track of requests
> that succeeded to ensure that the same request won't be processed more than
> once. The structure for keeping track of the succeeded requests on the
> server is efficient for handling increasing request ids (IncreasingBitSet).
> For handling unresolved addresses, I added retry logic that keeps trying to
> resolve the address.
> This patch also adds several unit tests that use fault injection to simulate
> a lost response or a closed channel exception on the server. It also has
> unit tests for IncreasingBitSet to ensure it is working correctly and
> efficiently.
> This passes all unit tests (including the new ones). Additionally, I have
> some experimental results as well.
> Previously, I was unable to run reliably with more than 200 workers. With
> this change I can reliably run 500+ workers. I also ran with 600 workers
> successfully. This is a really big reliability win for us.
> I can see the code working to do reconnections and re-issue requests when
> necessary. It's very cool.
> For example:
> 2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
> 2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
> 2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Fixing disconnected channel to
> xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
> 2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient:
> checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
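
The description says the server remembers which requests already succeeded, using a structure that is efficient for increasing request ids. The patch itself is not reproduced in this message, so the following is only a minimal sketch of that idea, not Giraph's actual IncreasingBitSet. The class name SeenRequestIds and the methods markSeen and windowBits are hypothetical, and the sketch assumes request ids from one client start at 0 and mostly arrive in increasing order.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of server-side duplicate detection for retried requests.
 * Not Giraph's IncreasingBitSet; names and layout are assumptions.
 */
class SeenRequestIds {
    // Every id strictly below base has already been seen.
    private long base = 0;
    // Bit i set means id (base + i) has been seen.
    private BitSet window = new BitSet();

    /** Returns true the first time an id is offered, false on any repeat. */
    synchronized boolean markSeen(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("request id must be >= 0");
        }
        if (id < base) {
            return false; // already folded into the "all seen" prefix
        }
        long offset = id - base;
        if (offset >= Integer.MAX_VALUE) {
            throw new IllegalStateException("id too far ahead of the seen prefix");
        }
        int i = (int) offset;
        if (window.get(i)) {
            return false; // a retry of a request that already succeeded
        }
        window.set(i);
        // Slide the window past the leading run of seen ids so memory
        // stays small when ids arrive roughly in increasing order.
        int run = window.nextClearBit(0);
        if (run > 0) {
            window = window.get(run, Math.max(run, window.length()));
            base += run;
        }
        return true;
    }

    /** Number of bits currently tracked beyond the fully seen prefix. */
    synchronized int windowBits() {
        return window.length();
    }
}
```

A server handler following this sketch would apply a request only when markSeen returns true and would still acknowledge it otherwise, so a client that resends after a lost response does not deliver the same messages twice.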