Hi again guys!

I'm running a BFS through Wikipedia (Spanish edition). I converted the dump
(https://dumps.wikimedia.org/eswiki/20160601/) into a file that can be read
with Giraph.

The BFS is searching for paths, and everything runs fine until it gets stuck
at some point during superstep four.
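
In case it helps, the compute logic is roughly like the sketch below. This is a
simplified, distance-only version with placeholder Writable types and a
hard-coded source id, not the actual BusquedaDeCaminosNavegacionalesWikiquote
class (which also tracks the paths themselves); the point is just that the
message volume grows with the frontier on every superstep.

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Simplified BFS sketch: the vertex value is the distance (in hops) from the source.
public class BfsSketchComputation extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {

  private static final long SOURCE_ID = 0L;  // placeholder source vertex id

  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) throws IOException {
    // Shortest distance seen so far: 0 for the source, otherwise the minimum
    // over the incoming messages.
    long minDist = vertex.getId().get() == SOURCE_ID ? 0L : Long.MAX_VALUE;
    for (LongWritable msg : messages) {
      minDist = Math.min(minDist, msg.get());
    }
    if (getSuperstep() == 0 || minDist < vertex.getValue().get()) {
      vertex.setValue(new LongWritable(minDist));
      if (minDist < Long.MAX_VALUE) {
        // Each superstep expands the frontier by one hop; on a graph like
        // Wikipedia the number of messages explodes around supersteps 3-4.
        sendMessageToAllEdges(vertex, new LongWritable(minDist + 1));
      }
    }
    vertex.voteToHalt();
  }
}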

I'm using a cluster of 5 nodes (4 core/slave nodes, 1 master) on AWS. Each node
is an r3.8xlarge EC2 instance. The command for executing the BFS is this one:
/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
-vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
-vip /user/hduser/input/grafo-wikipedia.txt
-vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
-op /user/hduser/output/caminosNavegacionales
-w 4 -yh 120000
-ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug

Each container has (almost) 120 GB. I'm using a limit of 1000M messages for
out-of-core, because I believed that was the problem, but apparently it is not.

These are the master logs (it seems to be waiting for the workers to finish,
but they just don't... and it stays like this forever):

16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
finished worker list = [], size = 0, worker list =
[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000),
Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
size = 4 from
/_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4

16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed
signaled of false
...the same last two lines repeat thirty times...
...
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
finished worker list = [], size = 0, worker list =
[Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000),
Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
size = 4 from
/_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4

And in *all* the workers there is no information about what is happening (I'm
testing this with *giraph.logLevel=Debug* because with Giraph's default log
level I was lost); the workers just say this over and over again:

16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not
ready yet java.util.concurrent.FutureTask@7392f34d
16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82

Before superstep 4 started, the information on each worker was the following:
16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2]
startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep:
addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal,
MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal,
MRtaskID
=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2,
port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4,
port=30004)]
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1
Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2
Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5
Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6
Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9
Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10
Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13
Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14
Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15
Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory
(free/total/max) = 92421.41M / 115000.00M / 115000.00M


I don't know what exactly is failing:
- I know that all containers have memory available; on the datanodes I checked
that each one had around 50 GB free.
- I'm not sure if I'm hitting some sort of limit in the use of out-of-core. I
know that writing messages too fast is dangerous with Giraph 1.1, but if I hit
that limit, I suppose the container would fail, right?
- Maybe the connections for the ZooKeeper clients aren't enough? I read that
the ZooKeeper default of 60 for *maxClientCnxns* may be too small for a context
like AWS, but I'm not fully aware of the relationship between Giraph and
ZooKeeper, so I haven't started changing default configuration values.
- Maybe I have to tune the out-of-core configuration, using
giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true
like someone recommended here (
http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3ccc775449.2c4b%25majakabi...@fb.com%3E)?
See the sketch after this list.
- Should I tune the Netty configuration? I have the default settings, but I
believe that maybe 8 Netty client threads and 8 server threads would be enough,
since I have only a few workers, and maybe too many Netty threads are creating
the overhead that is making the entire application get stuck (also in the
sketch after this list).
- Using giraph.useBigDataIOForMessages=true didn't help either; I know that
each vertex is receiving 100M or more messages and that property should be
helpful, but it didn't make any difference.
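
In case someone wants to reason over concrete values, what I'm thinking of
trying next is something like this -ca list. The request limit and thread
counts are just my guesses, not values I have validated, and I'm assuming
giraph.nettyClientThreads / giraph.nettyServerThreads are the right options for
capping the Netty threads:

-ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.maxNumberOfOpenRequests=5000,giraph.waitForRequestsConfirmation=true,giraph.nettyClientThreads=8,giraph.nettyServerThreads=8,giraph.logLevel=Debug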

As you may be suspecting, I have too many hypotheses; that's why I'm asking for
help, so I can go in the right direction.

Any help would be greatly appreciated.

Bye!
Jose
