Hey guys,

I have a problem running my Giraph job on a dataset with 20,000,000 edges
and 2,000,000 vertices. All the vertices are Text-based. The job works
perfectly on smaller datasets but always fails on larger ones. My setup is
a 3-node cluster; each node has 24 cores and 24 GB of RAM. The cluster has
a total of 60 mappers, each with mapred.child.java.opts set to -Xmx1000m.

If I don't use the out-of-core option, the job fails because it runs out of
Java heap space. When I use -Dgiraph.useOutOfCoreGraph=true, the master
eventually fails because a worker disconnects from ZooKeeper. The worker
just logs a warning and doesn't actually fail. I've been using
-Dgiraph.checkpointFrequency=1, but this doesn't seem to restart the
mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem, let me
know and I can investigate it as such.
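
For a rough sense of scale, here is a back-of-the-envelope sketch of the
memory budget implied by the numbers above. It assumes the 60 mappers are
spread evenly over the 3 nodes and that the graph is split evenly over the
50 workers reported in the master log; neither is something I have
verified, and the variable names are just for illustration.

```python
# Back-of-the-envelope budget from the figures in this post.
NODES = 3
RAM_PER_NODE_MB = 24 * 1024   # 24 GB of RAM per node
TOTAL_MAPPERS = 60
HEAP_PER_MAPPER_MB = 1000     # mapred.child.java.opts = -Xmx1000m

VERTICES = 2_000_000
EDGES = 20_000_000
WORKERS = 50                  # "0 out of 50 workers" in the master log

# Assumption: mappers are distributed evenly across the nodes.
mappers_per_node = TOTAL_MAPPERS // NODES
heap_per_node_mb = mappers_per_node * HEAP_PER_MAPPER_MB

# Assumption: vertices and edges are partitioned evenly across workers.
vertices_per_worker = VERTICES // WORKERS
edges_per_worker = EDGES // WORKERS

print(f"mappers per node:      {mappers_per_node}")
print(f"max heap per node:     {heap_per_node_mb} MB of {RAM_PER_NODE_MB} MB")
print(f"vertices per worker:   {vertices_per_worker}")
print(f"edges per worker:      {edges_per_worker}")
```

So each node can commit most of its RAM to mapper heaps if every slot is
busy, while each worker only gets a 1000 MB heap for its share of the
graph plus messages.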

Below are the options I'm using and the errors I'm currently getting.
Any help or tips are appreciated,
Simon

Options:
-Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
-Dgiraph.checkpointFrequency=1
-Dgiraph.useOutOfCoreGraph=true
-Dgiraph.zkSessionMsecTimeout=600000
-Dgiraph.numComputeThreads=2
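
To make the values easier to read, here is a small sketch that parses the
options above into a dictionary and converts the session timeout to
minutes. The helper name parse_d_options is made up for this example and is
not part of Giraph or Hadoop.

```python
def parse_d_options(lines):
    """Turn '-Dkey=value' strings into a {key: value} dict."""
    opts = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("-D"):
            continue  # ignore anything that is not a -D option
        key, _, value = line[2:].partition("=")
        opts[key] = value
    return opts


# The options exactly as listed in this post.
OPTIONS = [
    "-Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181",
    "-Dgiraph.checkpointFrequency=1",
    "-Dgiraph.useOutOfCoreGraph=true",
    "-Dgiraph.zkSessionMsecTimeout=600000",
    "-Dgiraph.numComputeThreads=2",
]

opts = parse_d_options(OPTIONS)
zk_servers = opts["giraph.zkList"].split(",")
timeout_minutes = int(opts["giraph.zkSessionMsecTimeout"]) / 60_000

print(f"ZooKeeper servers:          {len(zk_servers)}")
print(f"requested session timeout:  {timeout_minutes} minutes")
print(f"checkpoint every N steps:   {opts['giraph.checkpointFrequency']}")
```

In other words, I am asking for a 10 minute ZooKeeper session timeout and
a checkpoint on every superstep.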

Master Log:
2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster:
barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path
/_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster:
superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=
node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 1 took 78.851 seconds ended with
state WORKER_FAILURE and is now on superstep 1
2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread:
masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode for
/_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more
2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper:
uncaughtException: OverrideExceptionHandler on thread
org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException:
restartFromCheckpoint: KeeperException, exiting...
java.lang.IllegalStateException: java.lang.RuntimeException:
restartFromCheckpoint: KeeperException
at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
Caused by: java.lang.RuntimeException: restartFromCheckpoint:
KeeperException
at
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode for
/_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more


Worker 30 log:
2013-10-17 18:19:07,309 INFO
org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
writing partition edges 1927 to
/data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils:
waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 40183ms for sessionid
0x341c716ad860073, closing socket connection and attempting reconnect
2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process:
Disconnected from ZooKeeper (will automatically try to recover)
WatchedEvent state:Disconnected type:None path:null
2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler:
process: Problem with zookeeper, got event with path null, state
Disconnected, event type None
2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server /10.10.5.105:2181
2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to node3.mycompany.com/10.10.5.105:2181, initiating
session
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x341c716ad860073 has expired,
closing socket connection
2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process: Got
unknown null path event WatchedEvent state:Expired type:None path:null
2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler:
process: Problem with zookeeper, got event with path null, state Expired,
event type None
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2013-10-17 18:20:33,546 INFO
org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window
metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent =
0.0008, MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req
MBytes = 0.0111, secs waited = 128.396
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
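
One thing that stands out to me in the logs: the worker's session timed out
after roughly 40 seconds of silence, even though I requested 600000 ms. The
sketch below just pulls those numbers out of the lines quoted above and
puts the worker and master timestamps side by side. My unconfirmed guess is
that the ZooKeeper servers cap the timeout they grant (as far as I know,
maxSessionTimeout defaults to 20 times tickTime), but I have not checked
the server configuration.

```python
import re
from datetime import datetime

REQUESTED_TIMEOUT_MS = 600_000  # -Dgiraph.zkSessionMsecTimeout=600000

# Worker log line, joined back onto one line.
TIMEOUT_LINE = (
    "2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client "
    "session timed out, have not heard from server in 40183ms for sessionid "
    "0x341c716ad860073, closing socket connection and attempting reconnect"
)


def parse_ts(text):
    """Parse a log4j-style timestamp such as '2013-10-17 18:19:46,750'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


# How long the client went without hearing from the server.
silent_ms = int(re.search(r"in (\d+)ms", TIMEOUT_LINE).group(1))

# Timestamps copied from the worker and master logs above.
session_expired = parse_ts("2013-10-17 18:19:46,750")   # worker: session expired
master_noticed = parse_ts("2013-10-17 18:20:52,105")    # master: missing worker
detection_gap_s = (master_noticed - session_expired).total_seconds()

print(f"silence before timeout: {silent_ms} ms")
print(f"requested timeout:      {REQUESTED_TIMEOUT_MS} ms")
print(f"fraction of requested:  {silent_ms / REQUESTED_TIMEOUT_MS:.3f}")
print(f"master noticed after:   {detection_gap_s:.3f} s")
```

The worker kept logging for at least another minute after its session
expired, which matches what I see: it warns but never exits.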
