Hey guys,

I have a problem running my Giraph job on a dataset with 20,000,000 edges and 2,000,000 vertices. All the vertices are Text-based. The job works perfectly on smaller datasets but always fails on larger ones.

My setup is a 3-node cluster, each node with 24 cores and 24 GB of RAM. The cluster has a total of 60 mappers, each with mapred.child.java.opts set to -Xmx1000m.

If I don't use the out-of-core option, the job fails because it runs out of Java heap space. When I use -Dgiraph.useOutOfCoreGraph=true, the master eventually fails because a worker disconnects from ZooKeeper. The worker itself only logs a warning and doesn't actually fail. I've been using the -Dgiraph.checkpointFrequency=1 option, but this doesn't seem to restart the mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem, let me know and I can investigate it as such.
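For scale, here is a back-of-envelope sketch of what the figures above imply. It assumes mappers are spread evenly across the three nodes and that vertices and edges are split evenly across the 50 workers reported in the master log; neither assumption is confirmed by the job configuration, and the function names are only illustrative.

```python
# Figures taken from the post. The even split across nodes and workers
# is an assumption made only for this estimate.
NODES = 3
RAM_PER_NODE_GB = 24
MAPPERS = 60
HEAP_PER_MAPPER_MB = 1000   # from -Xmx1000m
VERTICES = 2_000_000
EDGES = 20_000_000
WORKERS = 50                # from "0 out of 50 workers" in the master log


def heap_per_node_mb(mappers, nodes, heap_mb):
    """Total JVM heap requested on one node if mappers are spread evenly."""
    return (mappers // nodes) * heap_mb


def per_worker(total, workers):
    """Items handled by one worker if the graph is partitioned evenly."""
    return total // workers


print("heap requested per node (MB):",
      heap_per_node_mb(MAPPERS, NODES, HEAP_PER_MAPPER_MB),
      "of", RAM_PER_NODE_GB * 1024, "MB RAM")
print("vertices per worker:", per_worker(VERTICES, WORKERS))
print("edges per worker:", per_worker(EDGES, WORKERS))
```

These are raw counts, not memory estimates; the actual footprint depends on how large the Text objects are once deserialized.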
Below are the options I'm using and the errors I'm currently getting.

Any help or tips are appreciated,
Simon

Options:
-Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
-Dgiraph.checkpointFrequency=1
-Dgiraph.useOutOfCoreGraph=true
-Dgiraph.zkSessionMsecTimeout=600000
-Dgiraph.numComputeThreads=2

Master Log:
2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname= node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 1 took 78.851 seconds ended with state WORKER_FAILURE and is now on superstep 1
2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
    ... 1 more
2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException: restartFromCheckpoint: KeeperException, exiting...
java.lang.IllegalStateException: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
Caused by: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
    ... 1 more

Worker 30 log:
2013-10-17 18:19:07,309 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition edges 1927 to /data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 40183ms for sessionid 0x341c716ad860073, closing socket connection and attempting reconnect
2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process: Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent state:Disconnected type:None path:null
2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem with zookeeper, got event with path null, state Disconnected, event type None
2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /10.10.5.105:2181
2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to node3.mycompany.com/10.10.5.105:2181, initiating session
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x341c716ad860073 has expired, closing socket connection
2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process: Got unknown null path event WatchedEvent state:Expired type:None path:null
2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem with zookeeper, got event with path null, state Expired, event type None
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2013-10-17 18:20:33,546 INFO org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent = 0.0008, MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req MBytes = 0.0111, secs waited = 128.396
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
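
One detail in the worker log may be worth checking: the options request a 600000 ms ZooKeeper session timeout, yet the client reports a timeout after 40183 ms. The sketch below pulls both numbers out of the pasted text so they can be compared across all worker logs. The helper names are hypothetical, and the regular expressions only match the message formats shown above. ZooKeeper servers do negotiate the session timeout within their configured minimum and maximum, but whether that explains the gap here is an assumption these logs do not confirm.

```python
import re

# Matches the ZooKeeper client message seen in the worker log above.
TIMEOUT_RE = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-f]+)"
)
# Matches the Giraph option as it appears on the command line above.
REQUESTED_RE = re.compile(r"-Dgiraph\.zkSessionMsecTimeout=(\d+)")


def observed_timeouts(log_text):
    """Return (session_id, silence_ms) pairs reported by the ZooKeeper client."""
    return [(sid, int(ms)) for ms, sid in TIMEOUT_RE.findall(log_text)]


def requested_timeout(options):
    """Return the requested session timeout in ms, or None if it is not set."""
    match = REQUESTED_RE.search(options)
    return int(match.group(1)) if match else None


# Sample inputs copied from the post.
OPTIONS = "-Dgiraph.useOutOfCoreGraph=true -Dgiraph.zkSessionMsecTimeout=600000"
WORKER_LOG = (
    "2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: "
    "Client session timed out, have not heard from server in 40183ms "
    "for sessionid 0x341c716ad860073, closing socket connection and "
    "attempting reconnect"
)

requested = requested_timeout(OPTIONS)
for session_id, silence_ms in observed_timeouts(WORKER_LOG):
    print(f"session {session_id}: timed out after {silence_ms} ms "
          f"(requested {requested} ms)")
```

If the observed value is consistently far below the requested one, the server-side session timeout limits in the ZooKeeper configuration would be the next place to look.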