Re: Master always fails on dataset

Simon McGloin Fri, 18 Oct 2013 09:24:44 -0700

Thanks Claudio. Yes, the machines are homogeneous. Unfortunately I don't have
Ganglia installed. You were right, it is a memory issue. I've reduced the
number of partitions kept in memory to 1 with -Dgiraph.maxPartitionsInMemory=1
and now my jobs are failing because they run out of disk space on HDFS. Each
HDFS mount has 100 GB of space. I will increase the size of HDFS and order more
memory next week. Is there any way to calculate the memory requirements of a
Giraph job? I presume it depends on the algorithm being run.
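
As a rough starting point for that question, a back-of-envelope estimate can
be sketched as below. This is only an illustration, not something Giraph
provides: the function name, the per-vertex, per-edge and per-message byte
costs, and the overhead factor are all hypothetical placeholders that depend
on the vertex ID, value, edge and message types and on the algorithm. It also
assumes the graph is spread evenly across workers.

```python
# Back-of-envelope heap estimate for a vertex-centric graph job.
# Every byte cost below is a hypothetical placeholder, not a measured
# Giraph figure; replace them with numbers measured for your own types.

def estimate_heap_mb(num_vertices, num_edges, num_workers,
                     bytes_per_vertex, bytes_per_edge,
                     bytes_per_message=0, messages_per_edge=1,
                     overhead_factor=1.5):
    """Estimate heap needed per worker in MB, assuming even partitioning."""
    # Memory for the graph structure itself.
    graph_bytes = num_vertices * bytes_per_vertex + num_edges * bytes_per_edge
    # Memory for messages in flight during one superstep.
    message_bytes = num_edges * messages_per_edge * bytes_per_message
    # Scale by a fudge factor for JVM object overhead and buffers.
    total_bytes = (graph_bytes + message_bytes) * overhead_factor
    return total_bytes / num_workers / (1024 * 1024)


# Graph size and worker count from this thread; the 200/100/50 byte
# costs are guesses for Text-based vertices, not measurements.
per_worker_mb = estimate_heap_mb(2_000_000, 20_000_000, 50,
                                 bytes_per_vertex=200, bytes_per_edge=100,
                                 bytes_per_message=50)
print(f"estimated heap per worker: {per_worker_mb:.0f} MB")
```

Skewed partitions and buffers used while loading input can push real usage
well above a figure like this, so it is better treated as a lower bound to
compare against the -Xmx setting than as a prediction.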



On Thu, Oct 17, 2013 at 6:42 PM, Claudio Martella <
claudio.marte...@gmail.com> wrote:

> Try decreasing the number of partitions you keep in memory. You're running
> out of memory. Also, are your nodes homogeneous? It could be one particular
> machine swapping or something. If you have Ganglia, try investigating the
> usage of memory.
>
>
> On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin <simonmcgl...@gmail.com> wrote:
>
>> Hey Guys.
>>
>> I have a problem running my Giraph job on a dataset with 20,000,000 edges
>> and 2,000,000 vertices. All the vertices are Text based. The Giraph job
>> works perfectly on smaller datasets but always fails on larger ones. The
>> setup I have is a 3-node cluster, each node with 24 cores and 24 GB of RAM.
>> The cluster has a total of 60 mappers, each with mapred.child.java.opts set
>> to -Xmx1000m.
>> If I don't use the out-of-core option, the job fails because it runs
>> out of Java heap space. When I use -Dgiraph.useOutOfCoreGraph=true, the
>> master eventually fails due to a worker disconnecting from ZooKeeper. The
>> worker just logs a warning and doesn't actually fail. I've been using the
>> -Dgiraph.checkpointFrequency=1 option but this doesn't seem to restart the
>> mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem then let
>> me know and I can investigate it as such.
>>
>> Below are the options I'm using and the errors I'm currently getting.
>> Any help or tips are appreciated,
>> Simon
>>
>> Options:
>> -Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
>> -Dgiraph.checkpointFrequency=1
>> -Dgiraph.useOutOfCoreGraph=true
>> -Dgiraph.zkSessionMsecTimeout=600000
>> -Dgiraph.numComputeThreads=2
>>
>> Master Log:
>> 2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster:
>> barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path
>> /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
>> 2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster:
>> superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=
>> node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
>> 2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread:
>> masterThread: Coordination of superstep 1 took 78.851 seconds ended with
>> state WORKER_FAILURE and is now on superstep 1
>> 2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread:
>> masterThread: Master algorithm failed with RuntimeException
>> java.lang.RuntimeException: restartFromCheckpoint: KeeperException
>>     at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
>>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for
>> /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>     at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
>>     at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
>>     at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
>>     ... 1 more
>> 2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper:
>> uncaughtException: OverrideExceptionHandler on thread
>> org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException:
>> restartFromCheckpoint: KeeperException, exiting...
>> java.lang.IllegalStateException: java.lang.RuntimeException:
>> restartFromCheckpoint: KeeperException
>>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
>> Caused by: java.lang.RuntimeException: restartFromCheckpoint:
>> KeeperException
>>     at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
>>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for
>> /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>     at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
>>     at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
>>     at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
>>     ... 1 more
>>
>>
>> Worker 30 log:
>> 2013-10-17 18:19:07,309 INFO
>> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
>> writing partition edges 1927 to
>> /data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
>> 2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
>> 2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Waiting for
>> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
>> 2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client
>> session timed out, have not heard from server in 40183ms for sessionid
>> 0x341c716ad860073, closing socket connection and attempting reconnect
>> 2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process:
>> Disconnected from ZooKeeper (will automatically try to recover)
>> WatchedEvent state:Disconnected type:None path:null
>> 2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler:
>> process: Problem with zookeeper, got event with path null, state
>> Disconnected, event type None
>> 2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server /10.10.5.105:2181
>> 2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to node3.mycompany.com/10.10.5.105:2181,
>> initiating session
>> 2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to
>> reconnect to ZooKeeper service, session 0x341c716ad860073 has expired,
>> closing socket connection
>> 2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process:
>> Got unknown null path event WatchedEvent state:Expired type:None path:null
>> 2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler:
>> process: Problem with zookeeper, got event with path null, state Expired,
>> event type None
>> 2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread
>> shut down
>> 2013-10-17 18:20:33,546 INFO
>> org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window
>> metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent =
>> 0.0008, MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req
>> MBytes = 0.0111, secs waited = 128.396
>> 2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
>> 2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils:
>> waitFor: Waiting for
>> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
>>
>>
>>
>
>
> --
>    Claudio Martella
>    claudio.marte...@gmail.com
>
