BTW: This patch fixes connection problems between workers, not with
ZooKeeper. But since your problem disappears when you don't send messages,
the ZooKeeper problems may be secondary.
On 5.4.2014 00:12, Lukas Nalezenec wrote:
Hi,
I had a similar issue; it was caused by long GC pauses. I patched
NettyClient so that when a reconnect fails, it sleeps for some time before
the next try. The patch is enclosed. Let me know if it works for you.
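The enclosed patch is not reproduced in this thread. As a rough illustration of the idea only (pause after a failed reconnect instead of retrying immediately), here is a minimal, self-contained Java sketch. The class name, method name and parameters are hypothetical and do not come from NettyClient or from the patch.

```java
import java.util.function.BooleanSupplier;

class ReconnectWithPause {
    /**
     * Calls connectAttempt until it returns true or maxAttempts is reached,
     * sleeping pauseMillis between failed attempts.
     * Returns the number of attempts made, or -1 if every attempt failed
     * or the thread was interrupted while sleeping.
     */
    static int connectWithPause(BooleanSupplier connectAttempt,
                                int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connectAttempt.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    // Give the remote side time to recover, e.g. from a GC pause.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical peer that only accepts the third connection attempt.
        int attempts = connectWithPause(() -> ++calls[0] >= 3, 5, 100L);
        System.out.println("connected after " + attempts + " attempts");
    }
}
```

A fixed pause is the simplest choice; whether the real patch uses a fixed or growing delay is not stated here.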
I would try tuning GC. You can also try using
giraph.waitForRequestsConfirmation and giraph.maxNumberOfOpenRequests.
I hope I am right.
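These two options can be set per job, without recompiling Giraph. The sketch below only assembles the command-line arguments. It assumes the job is launched through GiraphRunner and that your Giraph version accepts custom arguments in the form "-ca name=value" (please check this against your version). The value 5000 is an arbitrary illustration, not a recommendation, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GiraphCustomArgs {
    /** Turns option name/value pairs into repeated "-ca name=value" arguments. */
    static List<String> toCustomArgs(Map<String, String> options) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> option : options.entrySet()) {
            args.add("-ca");
            args.add(option.getKey() + "=" + option.getValue());
        }
        return args;
    }

    public static void main(String[] argv) {
        // LinkedHashMap keeps the options in insertion order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("giraph.waitForRequestsConfirmation", "true");
        options.put("giraph.maxNumberOfOpenRequests", "5000"); // arbitrary illustrative value
        System.out.println(String.join(" ", toCustomArgs(options)));
    }
}
```

The printed arguments would be appended to the usual job launch command.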
Regards
Lukas
On 4.4.2014 22:49, Suijian Zhou wrote:
Hi,
I have a ZooKeeper problem when running a Giraph program: the
program is aborted in superstep 2 with:
14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Opening socket
connection to server compute-0-18.local/10.1.255.236:22181. Will not
attempt to authenticate using SASL (unknown error)
14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Socket connection
established to compute-0-18.local/10.1.255.236:22181, initiating
session
14/04/04 15:44:48 INFO zookeeper.ClientCnxn: Session establishment
complete on server compute-0-18.local/10.1.255.236:22181,
sessionid = 0x1452e7c79910009, negotiated timeout = 600000
......
14/04/04 15:46:08 INFO job.JobProgressTracker: Data from 8 workers -
Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64
partitions computed; min free memory on worker 3 - 270.37MB, average
451.21MB
14/04/04 15:46:13 INFO job.JobProgressTracker: Data from 8 workers -
Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64
partitions computed; min free memory on worker 6 - 249.25MB, average
404.02MB
14/04/04 15:46:16 INFO zookeeper.ClientCnxn: Unable to read
additional data from server sessionid 0x1452e7c79910009, likely
server has closed socket, closing socket connection and attempting
reconnect
14/04/04 15:46:17 INFO zookeeper.ClientCnxn: Opening socket
connection to server compute-0-18.local/10.1.255.236:22181. Will not
attempt to authenticate using SASL (unknown error)
14/04/04 15:46:17 WARN zookeeper.ClientCnxn: Session
0x1452e7c79910009 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
Each rerun of the program leads to a different compute node
reporting the same error ("Unable to read additional data from server
sessionid...").
What superstep 2 does is:
if (getSuperstep() == 2) {
  for (IntWritable message: messages) {
    for (Edge<IntWritable, IntWritable> edge: vertex.getEdges()) {
      sendMessage(edge.getTargetVertexId(), message);
      //int abc=0;
    }
  }
}
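The nested loop above sends one message per (incoming message, out-edge) pair, so the number of messages a vertex sends in superstep 2 is the product of the two counts. The small sketch below only does that arithmetic. The vertex sizes in it are hypothetical, and the thread does not establish whether this message volume is what triggers the failure.

```java
class Superstep2FanOut {
    /** Messages sent by one vertex in superstep 2: one per (message, edge) pair. */
    static long messagesSent(long incomingMessages, long outEdges) {
        return incomingMessages * outEdges;
    }

    public static void main(String[] args) {
        // Hypothetical vertex: 1,000 incoming messages and 1,000 out-edges.
        System.out.println(messagesSent(1_000, 1_000)); // prints 1000000
    }
}
```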
I checked that if I replace the line
"sendMessage(edge.getTargetVertexId(), message);" with a
meaningless line like "int abc=0;", the program finishes
successfully. It looks like a ZooKeeper problem, but ZooKeeper seems to
come with Giraph, as I did not install it separately. I tried to modify
parameters in GiraphConstants.java and recompile Giraph, but it
does not seem to take effect: the screen output shows the
parameters were not changed at all. Any hints?
Best Regards,
Suijian