
I had a similar issue; it was caused by long GC pauses. I patched NettyClient so that when a reconnect fails, it sleeps for some time before the next try. The patch is enclosed. Let me know if it works for you. I would also try tuning GC. You can also try the giraph.waitForRequestsConfirmation and giraph.maxNumberOfOpenRequests options.
On 4.4.2014 22:49, Suijian Zhou wrote:
I have a ZooKeeper problem when running a Giraph program. The program is aborted in superstep 2 with:
14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Opening socket connection to server compute-0-18.local/ <>. Will not attempt to authenticate using SASL (unknown error)
14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Socket connection established to compute-0-18.local/ <>, initiating session
14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Session establishment complete on server compute-0-18.local/ <>, sessionid = 0x1452e7c79910009, negotiated timeout = 600000
14/04/04 15:46:08 INFO job.JobProgressTracker: Data from 8 workers - Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64 partitions computed; min free memory on worker 3 - 270.37MB, average 451.21MB
14/04/04 15:46:13 INFO job.JobProgressTracker: Data from 8 workers - Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64 partitions computed; min free memory on worker 6 - 249.25MB, average 404.02MB
14/04/04 15:46:16 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x1452e7c79910009, likely server has closed socket, closing socket connection and attempting reconnect
14/04/04 15:46:17 INFO zookeeper.ClientCnxn: Opening socket connection to server compute-0-18.local/ <>. Will not attempt to authenticate using SASL (unknown error)
14/04/04 15:46:17 WARN zookeeper.ClientCnxn: Session 0x1452e7c79910009 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

Each rerun of the program leads to a different compute node reporting the same error ("Unable to read additional data from server sessionid...").

What superstep 2 does is:
  if (getSuperstep() == 2) {
    for (IntWritable message : messages) {
      for (Edge<IntWritable, IntWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(), message);
        // int abc=0;
      }
    }
  }

I checked that if I replace the line "sendMessage(edge.getTargetVertexId(), message);" with a meaningless line like "int abc=0;", the program finishes successfully. This looks like a ZooKeeper problem, but ZooKeeper seems to come bundled with Giraph, as I did not install it separately. I tried modifying parameters in GiraphConstants.java and recompiling Giraph, but the changes do not seem to take effect: the screen output shows the parameters unchanged. Any hints?

@@ -153,6 +153,10 @@
   private final int maxRequestMilliseconds;
   /** Waiting internal for checking outstanding requests msecs */
   private final int waitingRequestMsecs;
+  /** Fix - wait time when connection failed*/
+  private int sleepDelay = 100;
   /** Timed logger for printing request debugging */
   private final TimedLogger requestLogger = new TimedLogger(15 * 1000, LOG);
   /** Worker executor group */
@@ -403,6 +407,7 @@
     // Wait for all the connections to succeed up to n tries
     int failures = 0;
     int connected = 0;
+    sleepDelay = 100;
     while (failures < maxConnectionFailures) {
       List<ChannelFutureAddress> nextCheckFutures = Lists.newArrayList();
       for (ChannelFutureAddress waitingConnection : waitingConnectionList) {
@@ -453,6 +458,19 @@
           failures + " failures total.");
       if (nextCheckFutures.isEmpty()) {
+      } else {
+        try {
+          LOG.info("FIX: Waiting " + sleepDelay +
+                  " ms for " + nextCheckFutures.size() + " connections");
+          Thread.sleep(sleepDelay);
+          context.getCounter(
+                  "FIX waiting",
+                  "Waiting for " + sleepDelay + "ms").increment(1);
+          int delay = (int) Math.round(sleepDelay * 1.2);
+          sleepDelay = Math.max(1000, delay);
+        } catch (InterruptedException e) {
+          LOG.error("Waiting failed:" + e);
+        }
       waitingConnectionList = nextCheckFutures;
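
For reference, the wait-and-grow idea in the patch can be tried outside Giraph with a small standalone sketch. Everything here (the class BackoffSketch, the methods nextDelay and delays, and the 1000 ms cap) is hypothetical and not part of NettyClient. Note one assumption in particular: the patch as shown uses Math.max(1000, delay), which makes 1000 ms a floor after the first wait, whereas this sketch assumes a cap was intended and uses Math.min instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of a growing, capped retry delay (not Giraph code). */
public class BackoffSketch {
  /** Grow the delay by the given factor, never exceeding maxDelayMs. */
  static int nextDelay(int currentMs, double factor, int maxDelayMs) {
    int grown = (int) Math.round(currentMs * factor);
    // Assumption: the delay should be capped, so Math.min is used here.
    return Math.min(maxDelayMs, grown);
  }

  /** Delays that would be slept before each of the given number of retries. */
  static List<Integer> delays(int initialMs, double factor,
                              int maxDelayMs, int retries) {
    List<Integer> result = new ArrayList<>();
    int delay = initialMs;
    for (int i = 0; i < retries; i++) {
      result.add(delay);
      delay = nextDelay(delay, factor, maxDelayMs);
    }
    return result;
  }

  public static void main(String[] args) throws InterruptedException {
    // Same starting point and growth factor as the patch: 100 ms, x1.2.
    for (int delay : delays(100, 1.2, 1000, 5)) {
      System.out.println("Waiting " + delay + " ms before next connection attempt");
      Thread.sleep(delay);
    }
  }
}
```

With a cap, the total time spent waiting stays bounded by the number of retries times the maximum delay, which may matter if maxConnectionFailures is large.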

