I will reach out right now

From: Simon McGloin [mailto:simonmcgl...@gmail.com]
Sent: Friday, October 18, 2013 12:24 PM
To: user@giraph.apache.org
Subject: Re: Master always fails on dataset

Thanks Claudio. Yes, the machines are homogeneous. Unfortunately I don't have 
Ganglia installed. You were right, it is a memory issue. I've reduced the number 
of partitions kept in memory to 1 with -Dgiraph.maxPartitionsInMemory=1, and now 
my jobs are failing because they run out of disk space on HDFS. Each HDFS mount 
has 100 GB of space. I will increase the size of HDFS and order more memory next 
week. Is there any way to calculate the memory requirements of a Giraph job? I 
presume it depends on the algorithm being run.
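
A back-of-envelope estimate for the question above could look like the sketch below. It is not a Giraph API and not a measured model: the function name, the per-vertex, per-edge and per-message byte costs, and the overhead factor are all hypothetical placeholders that would need to be measured for the actual vertex, edge and message classes. It also assumes the graph is split evenly across workers, which real partitioning may not achieve.

```python
def estimate_worker_heap_mb(vertices, edges, workers,
                            bytes_per_vertex, bytes_per_edge,
                            bytes_per_message=0, messages_per_edge=0.0,
                            overhead_factor=1.0):
    """Rough per-worker heap need in MB, assuming an even split of the graph.

    All byte costs are caller-supplied guesses (hypothetical), not values
    taken from Giraph.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    vertices_per_worker = vertices / workers
    edges_per_worker = edges / workers
    # Memory held by the graph structure itself.
    graph_bytes = (vertices_per_worker * bytes_per_vertex
                   + edges_per_worker * bytes_per_edge)
    # Memory held by in-flight messages; this part depends on the algorithm.
    message_bytes = edges_per_worker * messages_per_edge * bytes_per_message
    return (graph_bytes + message_bytes) * overhead_factor / (1024 * 1024)


# Graph size and worker count are from the thread; the byte costs and the
# overhead factor are placeholders chosen only to make the call runnable.
estimate = estimate_worker_heap_mb(
    vertices=2_000_000, edges=20_000_000, workers=50,
    bytes_per_vertex=200, bytes_per_edge=100,
    bytes_per_message=50, messages_per_edge=1.0,
    overhead_factor=2.0,
)
print(f"estimated heap per worker: {estimate:.1f} MB")
```

The message term is the part that depends on the algorithm being run, so an estimate like this is only useful once the byte costs have been replaced with measured values.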

On Thu, Oct 17, 2013 at 6:42 PM, Claudio Martella 
<claudio.marte...@gmail.com> wrote:
Try decreasing the number of partitions you keep in memory. You're running out 
of memory. Also, are your nodes homogeneous? It could be one particular machine 
swapping or something. If you have Ganglia, try using it to investigate memory 
usage.

On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin 
<simonmcgl...@gmail.com> wrote:
Hey Guys.

I have a problem running my Giraph job on a dataset with 20,000,000 edges and 
2,000,000 vertices. All the vertices are Text based. The Giraph job works 
perfectly on smaller datasets but always fails on larger ones. The setup I have 
is a 3-node cluster, each node with 24 cores and 24 GB of RAM. The cluster has a 
total of 60 mappers, each with mapred.child.java.opts set to -Xmx1000m.
If I don't use the Out-of-Core option then the job fails because it runs out of 
Java heap space. When I use -Dgiraph.useOutOfCoreGraph=true then the master 
eventually fails because a worker disconnects from ZooKeeper. The worker just 
throws a warning and doesn't actually fail. I've been using the 
-Dgiraph.checkpointFrequency=1 option but this doesn't seem to restart the 
mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem then let me 
know and I can investigate it as such.
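
The cluster figures above can be turned into a rough heap budget per node. The sketch below assumes the 60 mappers are spread evenly over the 3 nodes and that -Xmx is the only per-task memory setting; the function name is hypothetical, and the headroom it reports still has to cover the operating system, the Hadoop daemons, ZooKeeper and JVM memory outside the heap.

```python
def heap_budget(nodes, ram_per_node_gb, total_mappers, heap_per_mapper_mb):
    """Return per-node mapper count, committed heap, and remaining RAM."""
    # Assumption: mappers are distributed evenly across the nodes.
    mappers_per_node = total_mappers / nodes
    # -Xmx1000m means 1000 MiB, so divide by 1024 to express it in GiB.
    heap_per_node_gb = mappers_per_node * heap_per_mapper_mb / 1024
    return {
        "mappers_per_node": mappers_per_node,
        "heap_per_node_gb": heap_per_node_gb,
        "headroom_gb": ram_per_node_gb - heap_per_node_gb,
    }


# Figures from the thread: 3 nodes, 24 GB each, 60 mappers, -Xmx1000m.
budget = heap_budget(nodes=3, ram_per_node_gb=24,
                     total_mappers=60, heap_per_mapper_mb=1000)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions most of each node's RAM is already committed to mapper heaps, so a larger heap per mapper would only fit with fewer mappers per node.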

Below are the options I'm using and the errors I'm currently getting.
Any help or tips are appreciated,
Simon

Options:
-Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
-Dgiraph.checkpointFrequency=1
-Dgiraph.useOutOfCoreGraph=true
-Dgiraph.zkSessionMsecTimeout=600000
-Dgiraph.numComputeThreads=2
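
One thing worth checking about the session timeout option above: a ZooKeeper server does not have to grant the timeout a client asks for. It clamps the request between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times the server's tickTime. The sketch below illustrates that clamping. The function name is hypothetical, and tickTime=2000 is the value from ZooKeeper's sample configuration, assumed here because the thread does not show the actual server settings.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_ms=None, max_ms=None):
    """Clamp a requested session timeout the way a ZooKeeper server does.

    tick_time_ms=2000 is an assumption (the sample-config value), not a
    setting taken from the thread.
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms    # default minSessionTimeout
    if max_ms is None:
        max_ms = 20 * tick_time_ms   # default maxSessionTimeout
    return max(min_ms, min(requested_ms, max_ms))


# The job requests 600000 ms via -Dgiraph.zkSessionMsecTimeout.
granted = negotiated_session_timeout_ms(600000)
print(f"requested 600000 ms, granted {granted} ms")
```

If the servers in this cluster use those defaults, the 600000 ms request would be reduced to 40000 ms, which may be why the worker log below shows a session expiring after roughly 40 seconds of silence. Raising maxSessionTimeout on the ZooKeeper servers would be the setting to look at in that case.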

Master Log:
2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster: 
barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path 
/_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster: 
superstepChosenWorkerAlive: Missing chosen worker 
Worker(hostname=node1.mycompany.com, MRtaskID=30, 
port=30030) on superstep 1
2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread: 
masterThread: Coordination of superstep 1 took 78.851 seconds ended with state 
WORKER_FAILURE and is now on superstep 1
2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread: 
masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at 
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for 
/_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at 
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more
2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper: 
uncaughtException: OverrideExceptionHandler on thread 
org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException: 
restartFromCheckpoint: KeeperException, exiting...
java.lang.IllegalStateException: java.lang.RuntimeException: 
restartFromCheckpoint: KeeperException
at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
Caused by: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at 
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for 
/_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at 
org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more


Worker 30 log:
2013-10-17 18:19:07,309 INFO 
org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing 
partition edges 1927 to 
/data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils: 
waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils: 
waitFor: Waiting for 
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client session 
timed out, have not heard from server in 40183ms for sessionid 
0x341c716ad860073, closing socket connection and attempting reconnect
2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process: 
Disconnected from ZooKeeper (will automatically try to recover) WatchedEvent 
state:Disconnected type:None path:null
2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler: 
process: Problem with zookeeper, got event with path null, state Disconnected, 
event type None
2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server /10.10.5.105:2181
2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 
node3.mycompany.com/10.10.5.105:2181,
 initiating session
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x341c716ad860073 has expired, closing 
socket connection
2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process: Got 
unknown null path event WatchedEvent state:Expired type:None path:null
2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler: 
process: Problem with zookeeper, got event with path null, state Expired, event 
type None
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2013-10-17 18:20:33,546 INFO 
org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window 
metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent = 0.0008, 
MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req MBytes = 
0.0111, secs waited = 128.396
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: 
waitFor: Future result not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: 
waitFor: Waiting for 
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98





--
   Claudio Martella
   claudio.marte...@gmail.com
