I will reach out right now.

From: Simon McGloin [mailto:simonmcgl...@gmail.com]
Sent: Friday, October 18, 2013 12:24 PM
To: user@giraph.apache.org
Subject: Re: Master always fails on dataset

Thanks Claudio. Yes, the machines are homogeneous. Unfortunately I don't have Ganglia installed.

You were right, it is a memory issue. I've reduced the number of partitions kept in memory to 1 with -Dgiraph.maxPartitionsInMemory=1, and now my jobs are failing because they run out of disk space on HDFS. Each HDFS mount has 100 GB of space. I will increase the size of HDFS and order more memory next week.

Is there any way to calculate the memory requirements of a Giraph job? I presume it depends on the algorithm being run.

On Thu, Oct 17, 2013 at 6:42 PM, Claudio Martella <claudio.marte...@gmail.com> wrote:

Try decreasing the number of partitions you keep in memory. You're running out of memory. Also, are your nodes homogeneous? It could be one particular machine swapping or something. If you have Ganglia, try using it to investigate memory usage.

On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin <simonmcgl...@gmail.com> wrote:

Hey Guys.

I have a problem running my Giraph job on a dataset with 20,000,000 edges and 2,000,000 vertices. All the vertices are Text-based. The Giraph job works perfectly on smaller datasets but always fails on larger ones.

The setup I have is a 3-node cluster, each node with 24 cores and 24 GB of RAM. The cluster has a total of 60 mappers, each with mapred.child.java.opts set to -Xmx1000m.

If I don't use the out-of-core option, the job fails because it runs out of Java heap space. When I use -Dgiraph.useOutOfCoreGraph=true, the master eventually fails because a worker disconnects from ZooKeeper. The worker just throws a warning and doesn't actually fail. I've been using the -Dgiraph.checkpointFrequency=1 option, but this doesn't seem to restart the mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem then let me know and I can investigate it as such.
Below are the options I'm using and the errors I'm currently getting.

Any help or tips are appreciated,
Simon

Options:

-Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
-Dgiraph.checkpointFrequency=1
-Dgiraph.useOutOfCoreGraph=true
-Dgiraph.zkSessionMsecTimeout=600000
-Dgiraph.numComputeThreads=2

Master Log:

2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 1 took 78.851 seconds ended with state WORKER_FAILURE and is now on superstep 1
2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
    ... 1 more
2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException: restartFromCheckpoint: KeeperException, exiting...
java.lang.IllegalStateException: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
Caused by: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
    ... 1 more

Worker 30 log:

2013-10-17 18:19:07,309 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition edges 1927 to /data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 40183ms for sessionid 0x341c716ad860073, closing socket connection and attempting reconnect
2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process: Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent state:Disconnected type:None path:null
2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem with zookeeper, got event with path null, state Disconnected, event type None
2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /10.10.5.105:2181
2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to node3.mycompany.com/10.10.5.105:2181, initiating session
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x341c716ad860073 has expired, closing socket connection
2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process: Got unknown null path event WatchedEvent state:Expired type:None path:null
2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem with zookeeper, got event with path null, state Expired, event type None
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2013-10-17 18:20:33,546 INFO org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent = 0.0008, MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req MBytes = 0.0111, secs waited = 128.396
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98

--
Claudio Martella
claudio.marte...@gmail.com
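
For reference, the figures quoted in this thread allow a rough heap budget to be worked out. The following is only a back-of-the-envelope sketch in Python, not a Giraph sizing formula. It assumes (none of this is stated in the thread) that the 60 mappers are spread evenly over the 3 nodes, that vertices and edges are spread evenly over the 50 workers reported in the master log, and that 1 GB means 1024 MB. The function name `heap_budget` is made up for illustration.

```python
# Rough heap budget from the numbers quoted in the thread.
# Assumptions (not from the thread): even spread of mappers across nodes,
# even spread of vertices/edges across workers, 1 GB = 1024 MB.

NODES = 3
RAM_PER_NODE_MB = 24 * 1024   # 24 GB of RAM per node
MAPPERS = 60                  # total mappers in the cluster
HEAP_PER_MAPPER_MB = 1000     # mapred.child.java.opts = -Xmx1000m
WORKERS = 50                  # "0 out of 50 workers" in the master log
VERTICES = 2_000_000
EDGES = 20_000_000


def heap_budget():
    """Return per-node and per-worker figures derived from the constants above."""
    mappers_per_node = MAPPERS // NODES
    heap_per_node_mb = mappers_per_node * HEAP_PER_MAPPER_MB
    edges_per_worker = EDGES // WORKERS
    return {
        "mappers_per_node": mappers_per_node,
        "heap_per_node_mb": heap_per_node_mb,
        # RAM left on each node for the OS and the other daemons
        "headroom_per_node_mb": RAM_PER_NODE_MB - heap_per_node_mb,
        "vertices_per_worker": VERTICES // WORKERS,
        "edges_per_worker": edges_per_worker,
        # Heap available per edge, NOT the heap Giraph actually needs per edge
        "heap_bytes_per_edge": HEAP_PER_MAPPER_MB * 1024 * 1024 // edges_per_worker,
    }


if __name__ == "__main__":
    for name, value in heap_budget().items():
        print(f"{name}: {value}")
```

Under these assumptions each node commits 20,000 MB of its 24,576 MB to mapper heaps. The per-edge figure is only what is available, not what the job consumes; actual usage depends on the vertex, edge, and message types and on the algorithm being run, as Simon presumes.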