Hi All,

I have a Giraph 1.0.0 job that failed, but I can't find any detail about what actually happened. The master's log says:

> 2014-10-28 10:28:32,006 ERROR org.apache.giraph.master.BspServiceMaster:
> superstepChosenWorkerAlive: Missing chosen worker
> Worker(hostname=compute-0-0.wright, MRtaskID=1, port=30001) on superstep 4

OK, this seems to say compute-0-0 failed in some way, correct? The Ganglia pages show no noticeable OS differences between the failed node and another identical compute node. In the failed node's log I see two WARNs:

> 2014-10-28 10:28:19,560 WARN org.apache.giraph.bsp.BspService: process:
> Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent
> state:Disconnected type:None path:null
> 2014-10-28 10:28:19,560 WARN org.apache.giraph.worker.InputSplitsHandler:
> process: Problem with zookeeper, got event with path null, state
> Disconnected, event type None

OK, I guess there was a zookeeper issue. In the Zookeeper log I find:

> 2014-10-28 10:28:14,917 WARN org.apache.zookeeper.server.NIOServerCnxn:
> caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x149529c74de0a4d, likely client has closed socket
>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>     at java.lang.Thread.run(Thread.java:745)

OK, so I guess the socket closure was the problem. But why did *that* happen? I could really use your help here!

Thank you,

matt

--
Matthew Cornell | m...@matthewcornell.org