You're running out of memory, so try decreasing the number of partitions
you keep in memory. Also, are your nodes homogeneous? One particular
machine could be swapping, for example. If you have Ganglia, use it to
check memory usage on each node.
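
As a rough back-of-the-envelope check on the swapping idea, here is a
minimal sketch using the figures from your message (3 nodes, 24 GB each,
60 mappers, -Xmx1000m). Two things in it are assumptions on my part, not
measurements: that the mapper slots are spread evenly over the nodes, and
the per-mapper non-heap overhead of 200 MB, which is only a placeholder.

```python
# Figures taken from the quoted message.
NODES = 3
RAM_PER_NODE_GB = 24
TOTAL_MAPPERS = 60
HEAP_PER_MAPPER_MB = 1000  # from mapred.child.java.opts=-Xmx1000m

# Assumption: mapper slots are distributed evenly across the nodes.
mappers_per_node = TOTAL_MAPPERS // NODES

# Assumption: hypothetical non-heap JVM overhead per mapper (thread stacks,
# direct buffers, etc.). Replace with a value measured on your cluster.
ASSUMED_OVERHEAD_MB = 200


def node_memory_mb(mappers, heap_mb, overhead_mb):
    """Worst-case memory demand on one node if every mapper fills its heap."""
    return mappers * (heap_mb + overhead_mb)


heap_only_mb = node_memory_mb(mappers_per_node, HEAP_PER_MAPPER_MB, 0)
with_overhead_mb = node_memory_mb(
    mappers_per_node, HEAP_PER_MAPPER_MB, ASSUMED_OVERHEAD_MB
)
ram_mb = RAM_PER_NODE_GB * 1024

print(f"mappers per node:        {mappers_per_node}")
print(f"heap only (MB):          {heap_only_mb}")
print(f"heap + overhead (MB):    {with_overhead_mb}")
print(f"physical RAM (MB):       {ram_mb}")
print(f"headroom for OS etc (MB): {ram_mb - with_overhead_mb}")
```

Under those assumptions the heaps alone can claim most of each node's RAM,
which leaves little for the OS, the DataNode/TaskTracker daemons and
ZooKeeper. That would be consistent with a machine swapping, but only the
per-node memory graphs can confirm it.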


On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin <simonmcgl...@gmail.com> wrote:

> Hey Guys.
>
> I have a problem running my giraph job on a dataset with 20,000,000 edges
> and 2,000,000 vertices. All the vertices are Text based. The giraph job
> works perfectly on smaller datasets but always fails on larger ones. The
> setup I have is a 3 node cluster, each with 24 cores and 24 GB of ram. The
> cluster has a total of 60 mappers each with mapred.child.java.opts set to
> -Xmx1000m.
> If I don't use the Out-of-Core option then the job fails due to running
> out of java heap space. When I use -Dgiraph.useOutOfCoreGraph=true then the
> master eventually fails due to a worker disconnecting from zookeeper. The
> worker just throws a warning and doesn't actually fail. I've been using the
> -Dgiraph.checkpointFrequency=1 option but this doesn't seem to restart the
> mapper. I'm new to zookeeper too so if this is a zookeeper problem then let
> me know and I can investigate it as such.
>
> Below are the options I'm using and the errors I'm currently getting.
> Any help or tips are appreciated,
> Simon
>
> Options:
> -Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
> -Dgiraph.checkpointFrequency=1
> -Dgiraph.useOutOfCoreGraph=true
> -Dgiraph.zkSessionMsecTimeout=600000
> -Dgiraph.numComputeThreads=2
>
> Master Log:
> 2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster:
> barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path
> /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
> 2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster:
> superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=
> node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
> 2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread:
> masterThread: Coordination of superstep 1 took 78.851 seconds ended with
> state WORKER_FAILURE and is now on superstep 1
> 2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread:
> masterThread: Master algorithm failed with RuntimeException
> java.lang.RuntimeException: restartFromCheckpoint: KeeperException
> at
> org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
>  at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for
> /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
> at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
>  at
> org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
> ... 1 more
> 2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper:
> uncaughtException: OverrideExceptionHandler on thread
> org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException:
> restartFromCheckpoint: KeeperException, exiting...
> java.lang.IllegalStateException: java.lang.RuntimeException:
> restartFromCheckpoint: KeeperException
> at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
> Caused by: java.lang.RuntimeException: restartFromCheckpoint:
> KeeperException
> at
> org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
>  at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for
> /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
> at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
>  at
> org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
> ... 1 more
>
>
> Worker 30 log:
> 2013-10-17 18:19:07,309 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition edges 1927 to
> /data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
> 2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils:
> waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
> 2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
> waitFor: Waiting for
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
> 2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 40183ms for sessionid
> 0x341c716ad860073, closing socket connection and attempting reconnect
>  2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process:
> Disconnected from ZooKeeper (will automatically try to recover)
> WatchedEvent state:Disconnected type:None path:null
> 2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler:
> process: Problem with zookeeper, got event with path null, state
> Disconnected, event type None
> 2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server /10.10.5.105:2181
> 2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to node3.mycompany.com/10.10.5.105:2181,
> initiating session
> 2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x341c716ad860073 has expired,
> closing socket connection
> 2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process:
> Got unknown null path event WatchedEvent state:Expired type:None path:null
> 2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler:
> process: Problem with zookeeper, got event with path null, state Expired,
> event type None
> 2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2013-10-17 18:20:33,546 INFO
> org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window
> metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent =
> 0.0008, MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req
> MBytes = 0.0111, secs waited = 128.396
> 2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
> waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
> 2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
> waitFor: Waiting for
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
>
>
>


-- 
   Claudio Martella
   claudio.marte...@gmail.com
