Hi All,

I have a Giraph 1.0.0 job that failed, but I can't find any detail
about what actually happened. The master's log says:

> 2014-10-28 10:28:32,006 ERROR org.apache.giraph.master.BspServiceMaster: 
> superstepChosenWorkerAlive: Missing chosen worker 
> Worker(hostname=compute-0-0.wright, MRtaskID=1, port=30001) on superstep 4

OK, this seems to say compute-0-0 failed in some way, correct? The
Ganglia pages show no noticeable OS differences between the failed
node and another identical compute node. In the failed node's log I
see two WARNs:

> 2014-10-28 10:28:19,560 WARN org.apache.giraph.bsp.BspService: process: 
> Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent 
> state:Disconnected type:None path:null
> 2014-10-28 10:28:19,560 WARN org.apache.giraph.worker.InputSplitsHandler: 
> process: Problem with zookeeper, got event with path null, state 
> Disconnected, event type None

OK, I guess there was a ZooKeeper issue. In the ZooKeeper log I find:

> 2014-10-28 10:28:14,917 WARN org.apache.zookeeper.server.NIOServerCnxn: 
> caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x149529c74de0a4d, likely client has closed socket
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>         at java.lang.Thread.run(Thread.java:745)

OK, so I guess the socket closure was the problem. But why did *that* happen?
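
To make the ordering of events explicit, here is a minimal Python sketch that sorts the three excerpts above by timestamp. The event labels and the `parse` and `timeline` helpers are illustrative only and are not part of Giraph or ZooKeeper. It also assumes the clocks on the master, worker, and ZooKeeper hosts are roughly in sync, which I have not verified:

```python
from datetime import datetime

# Sources and timestamps copied from the log excerpts quoted above.
# The short descriptions are paraphrases, not literal log text.
EVENTS = [
    ("master", "2014-10-28 10:28:32,006", "Missing chosen worker on superstep 4"),
    ("worker", "2014-10-28 10:28:19,560", "Disconnected from ZooKeeper"),
    ("zookeeper", "2014-10-28 10:28:14,917", "EndOfStreamException, session 0x149529c74de0a4d"),
]


def parse(ts):
    """Parse a log4j-style timestamp such as '2014-10-28 10:28:32,006'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def timeline(events):
    """Return (source, seconds since first event, message), oldest first."""
    ordered = sorted(events, key=lambda e: parse(e[1]))
    start = parse(ordered[0][1])
    return [
        (src, (parse(ts) - start).total_seconds(), msg)
        for src, ts, msg in ordered
    ]


for src, offset, msg in timeline(EVENTS):
    print(f"+{offset:7.3f}s  {src:<9}  {msg}")
```

If the clocks can be trusted, the ZooKeeper-side exception comes first, about 4.6 seconds before the worker logs its disconnect and about 17 seconds before the master reports the missing worker.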

I could really use your help here!

Thank you,

matt


-- 
Matthew Cornell | m...@matthewcornell.org
