-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6687/
-----------------------------------------------------------

Review request for giraph.


Description
-------

One of the biggest scalability challenges is getting Giraph to run reliably on 
a large number of tasks (i.e. > 200). Several problems exist:

1) If the connection fails after the initial connection was made, the job will 
die.
2) Requests must be completed exactly once. This is difficult to implement, but 
required, since a retried request must not succeed more than once (e.g. a 
vertex would get more messages than expected).
3) Sometimes there are unresolved addresses, causing failure.

This patch addresses these issues by re-establishing failed connections and 
keeping track of every request sent to every worker. If a request fails or 
exceeds a timeout, it is resent. The server keeps track of the requests that 
succeeded to ensure that the same request is not processed more than once. The 
structure for keeping track of the succeeded requests on the server 
(IncreasingBitSet) is efficient for handling increasing request ids. For 
handling unresolved addresses, I added retry logic that keeps retrying the 
address resolution.
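
To illustrate the idea of tracking succeeded requests when ids mostly arrive in 
increasing order, here is a minimal sketch. It is not the IncreasingBitSet 
class from the patch; the class name SeenRequestIds and its methods are 
hypothetical, and it assumes one instance per sending worker with request ids 
starting at 0. Everything below a moving base is treated as already seen, so 
the bit set only has to cover the window of ids above the base.

```java
import java.util.BitSet;

/** Hypothetical sketch; not the IncreasingBitSet class from the patch. */
class SeenRequestIds {
  /** Every id below this value has already been seen. */
  private long base = 0;
  /** Bit i represents the id (base + i). */
  private BitSet bits = new BitSet();

  /**
   * Record a request id.
   *
   * @return true if the id is new, false if it was already recorded
   */
  synchronized boolean add(long id) {
    if (id < base) {
      return false; // Seen earlier and already compacted away
    }
    long offset = id - base;
    if (offset > Integer.MAX_VALUE) {
      throw new IllegalArgumentException("id too far ahead of base: " + id);
    }
    int index = (int) offset;
    if (bits.get(index)) {
      return false; // Duplicate, e.g. a retried request
    }
    bits.set(index);
    // Drop the contiguous run of seen ids at the front and advance the base.
    int firstClear = bits.nextClearBit(0);
    if (firstClear > 0) {
      bits = bits.get(firstClear, bits.length());
      base += firstClear;
    }
    return true;
  }

  /** Smallest id that has not yet been folded into the base. */
  synchronized long base() {
    return base;
  }
}
```

A server handler could call add() before processing a request and skip the 
processing (while still acknowledging) when it returns false. This sketch 
copies the remaining bits on every compaction, which a production structure 
would likely avoid.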

This patch also adds several unit tests that use fault injection to simulate a 
lost response or a closed channel exception on the server. It also has unit 
tests for IncreasingBitSet to ensure it is working correctly and efficiently.

This passes all unit tests (including the new ones). Additionally, I have some 
experimental results as well.

Previously, I was unable to run reliably with more than 200 workers. With this 
change I can reliably run 500+ workers. I also ran with 600 workers 
successfully. This is a really big reliability win for us.

I can see the code working to do reconnections and re-issue requests when 
necessary. It's very cool.

For example:

2012-08-18 00:16:52,109 INFO org.apache.giraph.comm.NettyClient: 
checkAndFixChannel: Fixing disconnected channel to 
xxx.xxx.xxx.xxx/xx.xx.xx.xx:30455, open = false, bound = false
2012-08-18 00:16:52,111 INFO org.apache.giraph.comm.NettyClient: 
checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30455!
2012-08-18 00:16:52,123 INFO org.apache.giraph.comm.NettyClient: 
checkAndFixChannel: Fixing disconnected channel to 
xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx, open = false, bound = false
2012-08-18 00:16:52,124 INFO org.apache.giraph.comm.NettyClient: 
checkAndFixChannel: Connected to xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:30117!
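
For the request re-issue side mentioned above, a minimal sketch of client-side 
tracking follows. It is only an illustration under assumptions: the class 
OutstandingRequests and its methods are hypothetical and are not taken from 
NettyClient or RequestInfo in the patch. The caller passes the current time in 
explicitly so the behaviour is easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not the request tracking code from the patch. */
class OutstandingRequests {
  /** How long to wait for a response before a request is resent. */
  private final long timeoutMillis;
  /** Request id mapped to the time it was last sent. */
  private final Map<Long, Long> sentAtMillis = new ConcurrentHashMap<>();

  OutstandingRequests(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Record that a request was sent (or resent) at the given time. */
  void onSend(long requestId, long nowMillis) {
    sentAtMillis.put(requestId, nowMillis);
  }

  /** A response arrived, so the request no longer needs tracking. */
  void onResponse(long requestId) {
    sentAtMillis.remove(requestId);
  }

  /** Ids of requests that have waited at least the timeout for a response. */
  List<Long> timedOut(long nowMillis) {
    List<Long> result = new ArrayList<>();
    for (Map.Entry<Long, Long> entry : sentAtMillis.entrySet()) {
      if (nowMillis - entry.getValue() >= timeoutMillis) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  /** Number of requests still waiting for a response. */
  int size() {
    return sentAtMillis.size();
  }
}
```

A periodic check could resend every id returned by timedOut() and call 
onSend() again with the new time. Because a resent request may reach the 
server twice, this only works together with duplicate detection on the server.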


This addresses bug GIRAPH-306.
    https://issues.apache.org/jira/browse/GIRAPH-306


Diffs
-----

  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/AddressRequestIdGenerator.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ChannelRotater.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/IncreasingBitSet.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyClient.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyServer.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/NettyWorkerClient.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestDecoder.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestInfo.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/RequestServerHandler.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/ResponseClientHandler.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerIdRequestId.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerRequestReservedMap.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/comm/WritableRequest.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/main/java/org/apache/giraph/graph/WorkerInfo.java
 1374192 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/IncreasingBitSetTest.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestFailureTest.java
 PRE-CREATION 
  
http://svn.apache.org/repos/asf/giraph/trunk/src/test/java/org/apache/giraph/comm/RequestTest.java
 1374192 

Diff: https://reviews.apache.org/r/6687/diff/


Testing
-------

mvn clean verify
Many large tests with 500-600 workers using PageRankBenchmark


Thanks,

Avery Ching
