Thanks, Claudio. Yes, the machines are homogeneous. Unfortunately, I don't have Ganglia installed.

You were right, it is a memory issue. I've reduced the number of partitions kept in memory to 1 with -Dgiraph.maxPartitionsInMemory=1, and now my jobs are failing because they run out of disk space on HDFS. Each HDFS mount has 100 GB of space. I will increase the size of HDFS and order more memory next week.

Is there any way to calculate the memory requirements of a Giraph job? I presume it depends on the algorithm being run.
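As a rough starting point for the memory question, here is a back-of-envelope sketch using the figures from this thread (2,000,000 vertices, 20,000,000 edges, 50 workers, -Xmx1000m per mapper). It is not a Giraph API and not a measured result: the function names are made up for illustration, the per-vertex and per-edge byte costs are placeholder assumptions that would have to be measured for Text-based vertices, and message buffers (which depend on the algorithm) are left out entirely.

```python
# Back-of-envelope heap estimate for the graph data held by one Giraph worker.
# Figures taken from the thread: 2,000,000 vertices, 20,000,000 edges,
# 50 workers, -Xmx1000m per mapper.
# ASSUMPTION: the byte costs below are placeholders, not measured values.

MIB = 1024 * 1024


def per_worker_load(vertices, edges, workers):
    """Average vertices and edges per worker, assuming even partitioning."""
    return vertices / workers, edges / workers


def estimate_graph_mib(vertices, edges, workers, bytes_per_vertex, bytes_per_edge):
    """Estimated MiB of heap for graph data on one worker (messages excluded)."""
    v, e = per_worker_load(vertices, edges, workers)
    return (v * bytes_per_vertex + e * bytes_per_edge) / MIB


VERTICES = 2_000_000
EDGES = 20_000_000
WORKERS = 50
HEAP_MIB = 1000  # from -Xmx1000m

# Hypothetical per-object costs in bytes; replace with measured numbers.
ASSUMED_BYTES_PER_VERTEX = 200
ASSUMED_BYTES_PER_EDGE = 100

graph_mib = estimate_graph_mib(
    VERTICES, EDGES, WORKERS, ASSUMED_BYTES_PER_VERTEX, ASSUMED_BYTES_PER_EDGE
)
print(f"graph data per worker: {graph_mib:.1f} MiB of {HEAP_MIB} MiB heap")
```

Whatever the estimate leaves of the heap still has to cover incoming and outgoing messages, Netty buffers and JVM overhead, so this is only a lower bound. Skewed partitions can also push a single worker well above the average.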

On Thu, Oct 17, 2013 at 6:42 PM, Claudio Martella <claudio.marte...@gmail.com> wrote:

> Try decreasing the number of partitions you keep in memory. You're running
> out of memory. Also, are your nodes homogenous? It could be one particular
> machine swapping or something. If you have ganglia, try investigating the
> usage of memory.
>
> On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin <simonmcgl...@gmail.com> wrote:
>
>> Hey Guys.
>>
>> I have a problem running my giraph job on a dataset with 20,000,000 edges
>> and 2,000,000 vertices. All the vertices are Text based. The giraph job
>> works perfectly on smaller datasets but always fails on larger ones. The
>> setup I have is a 3 node cluster, each with 24 cores and 24 GB of ram. The
>> cluster has a total of 60 mappers each with mapred.child.java.opts set to
>> -Xmx1000m.
>> If I don't use the Out-of-Core option then the job fails due to running
>> out of java heap space. When I use -Dgiraph.useOutOfCoreGraph=true then the
>> master eventually fails due to a worker disconnecting from zookeeper. The
>> worker just throws a warning and doesn't actually fail. I've been using the
>> -Dgiraph.checkpointFrequency=1 option but this doesn't seem to restart the
>> mapper. I'm new to zookeeper too so if this is a zookeeper problem then let
>> me know and I can investigate it as such.
>>
>> Below is the options I'm using and the errors I'm currently getting
>> Any help or tips are appreciated,
>> Simon
>>
>> Options:
>> -Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
>> -Dgiraph.checkpointFrequency=1
>> -Dgiraph.useOutOfCoreGraph=true
>> -Dgiraph.zkSessionMsecTimeout=600000
>> -Dgiraph.numComputeThreads=2
>>
>> Master Log:
>> 2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster:
>> barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path
>> /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
>> 2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster:
>> superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=
>> node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
>> 2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread:
>> masterThread: Coordination of superstep 1 took 78.851 seconds ended with
>> state WORKER_FAILURE and is now on superstep 1
>> 2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread:
>> masterThread: Master algorithm failed with RuntimeException
>> java.lang.RuntimeException: restartFromCheckpoint: KeeperException
>> at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
>> at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for
>> /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
>> at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
>> at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
>> ... 1 more
>> 2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper:
>> uncaughtException: OverrideExceptionHandler on thread
>> org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException:
>> restartFromCheckpoint: KeeperException, exiting...
>> java.lang.IllegalStateException: java.lang.RuntimeException:
>> restartFromCheckpoint: KeeperException
>> at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
>> Caused by: java.lang.RuntimeException: restartFromCheckpoint:
>> KeeperException
>> at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
>> at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for
>> /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
>> at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
>> at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
>> ... 1 more
>>
>> Worker 30 log:
>> 2013-10-17 18:19:07,309 INFO
>> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
>> writing partition edges 1927 to
>> /data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
>> 2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
>> 2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Waiting for
>> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
>> 2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client
>> session timed out, have not heard from server in 40183ms for sessionid
>> 0x341c716ad860073, closing socket connection and attempting reconnect
>> 2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process:
>> Disconnected from ZooKeeper (will automatically try to recover)
>> WatchedEvent state:Disconnected type:None path:null
>> 2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler:
>> process: Problem with zookeeper, got event with path null, state
>> Disconnected, event type None
>> 2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server /10.10.5.105:2181
>> 2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to node3.mycompany.com/10.10.5.105:2181,
>> initiating session
>> 2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to
>> reconnect to ZooKeeper service, session 0x341c716ad860073 has expired,
>> closing socket connection
>> 2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process:
>> Got unknown null path event WatchedEvent state:Expired type:None path:null
>> 2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler:
>> process: Problem with zookeeper, got event with path null, state Expired,
>> event type None
>> 2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread
>> shut down
>> 2013-10-17 18:20:33,546 INFO
>> org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window
>> metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent =
>> 0.0008, MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req
>> MBytes = 0.0111, secs waited = 128.396
>> 2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
>> 2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Waiting for
>> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
>>
>
> --
> Claudio Martella
> claudio.marte...@gmail.com