Jose Luis Larroque created GIRAPH-1101:
------------------------------------------
Summary: Giraph hangs indefinitely when two or more workers
process the same vertex in the same superstep
Key: GIRAPH-1101
URL: https://issues.apache.org/jira/browse/GIRAPH-1101
Project: Giraph
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Jose Luis Larroque
Priority: Minor
If two or more workers are processing the same vertex in the same superstep
(running multiple BFS traversals at the same time can lead to this, depending on
the data, of course), the entire superstep hangs. Every worker logs something
like this:
16/07/29 22:49:19 INFO utils.ProgressableUtils: waitFor: Future result not
ready yet java.util.concurrent.FutureTask@23a1ef14
16/07/29 22:49:19 INFO utils.ProgressableUtils: waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@5c571c52
16/07/29 22:50:19 INFO utils.ProgressableUtils: waitFor: Future result not
ready yet java.util.concurrent.FutureTask@23a1ef14
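For readers unfamiliar with this log pattern, here is a minimal sketch of a timed polling loop over a Future, which is the general shape the repeated "not ready yet" lines suggest. It is not Giraph's ProgressableUtils code: the class name WaitLoopSketch, the method signature, and the poll limit are assumptions made for illustration. If the awaited task never completes, each poll times out and logs again, so a loop without a poll limit would print the message forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: NOT Giraph's ProgressableUtils, just the same general pattern.
class WaitLoopSketch {

  /**
   * Polls the future up to maxPolls times, waiting pollMillis on each poll.
   * Returns true if the future completed, false if every poll timed out
   * or the waiting thread was interrupted.
   */
  static boolean waitFor(Future<?> future, long pollMillis, int maxPolls) {
    for (int attempt = 1; attempt <= maxPolls; attempt++) {
      try {
        future.get(pollMillis, TimeUnit.MILLISECONDS);
        return true; // the result arrived within this poll
      } catch (TimeoutException e) {
        // Same symptom as in the worker logs: nothing has completed the future yet.
        System.out.println("waitFor: Future result not ready yet (poll " + attempt + ")");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      } catch (ExecutionException e) {
        throw new IllegalStateException("task failed", e);
      }
    }
    return false; // gave up; a loop without maxPolls would keep logging forever
  }

  public static void main(String[] args) {
    // A future that nothing ever completes stands in for the stuck compute task.
    CompletableFuture<String> stuck = new CompletableFuture<>();
    System.out.println("completed = " + waitFor(stuck, 50, 3));

    // An already-completed future returns on the first poll.
    CompletableFuture<String> done = CompletableFuture.completedFuture("ok");
    System.out.println("completed = " + waitFor(done, 50, 3));
  }
}
```

The sketch only reproduces the symptom. It does not explain why the underlying task in Giraph never finishes, which is the question this issue raises.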
And the master says:
16/07/29 21:43:19 INFO yarn.GiraphYarnTask: [STATUS: task-0]
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/07/29 21:43:19 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
finished worker list = [], size = 0, worker list =
[Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal, MRtaskID=1,
port=30001), Worker(hostname=ip-172-31-23-12.sa-east-1.compute.internal,
MRtaskID=2, port=30002),
Worker(hostname=ip-172-31-23-11.sa-east-1.compute.internal, MRtaskID=3,
port=30003), Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal,
MRtaskID=5, port=30005)], size = 4 from
/_hadoopBsp/giraph_yarn_application_1469827475142_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/07/29 21:43:19 INFO yarn.GiraphYarnTask: [STATUS: task-0]
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/07/29 21:43:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:19 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
finished worker list = [], size = 0, worker list =
[Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal, MRtaskID=1,
port=30001), Worker(hostname=ip-172-31-23-12.sa-east-1.compute.internal,
MRtaskID=2, port=30002),
Worker(hostname=ip-172-31-23-11.sa-east-1.compute.internal, MRtaskID=3,
port=30003), Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal,
MRtaskID=5, port=30005)], size = 4 from
/_hadoopBsp/giraph_yarn_application_1469827475142_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/07/29 21:48:19 INFO master.BspServiceMaster: barrierOnWorkerList: 0 out of 4
workers finished on superstep 4 on path
/_hadoopBsp/giraph_yarn_application_1469827475142_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/07/29 21:48:19 INFO master.BspServiceMaster: barrierOnWorkerList: Waiting on
[ip-172-31-23-12.sa-east-1.compute.internal_2,
ip-172-31-23-9.sa-east-1.compute.internal_5,
ip-172-31-23-11.sa-east-1.compute.internal_3,
ip-172-31-23-9.sa-east-1.compute.internal_1]
16/07/29 21:48:19 INFO yarn.GiraphYarnTask: [STATUS: task-0]
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/07/29 21:48:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 22:50:19 INFO utils.ProgressableUtils: waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@5c571c52