Jose Luis Larroque created GIRAPH-1101:
------------------------------------------

             Summary: Giraph hangs indefinitely when two or more workers 
process the same vertex in the same superstep
                 Key: GIRAPH-1101
                 URL: https://issues.apache.org/jira/browse/GIRAPH-1101
             Project: Giraph
          Issue Type: Bug
    Affects Versions: 1.1.0
            Reporter: Jose Luis Larroque
            Priority: Minor


If two or more workers are processing the same vertex in the same superstep 
(for example, when running multiple BFS traversals at the same time, which can 
lead to this depending on the data, of course), the entire superstep hangs. 
Every worker logs something like this:

16/07/29 22:49:19 INFO utils.ProgressableUtils: waitFor: Future result not 
ready yet java.util.concurrent.FutureTask@23a1ef14
16/07/29 22:49:19 INFO utils.ProgressableUtils: waitFor: Waiting for 
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@5c571c52
16/07/29 22:50:19 INFO utils.ProgressableUtils: waitFor: Future result not 
ready yet java.util.concurrent.FutureTask@23a1ef14

And the master says:
16/07/29 21:43:19 INFO yarn.GiraphYarnTask: [STATUS: task-0] 
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/07/29 21:43:19 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got 
finished worker list = [], size = 0, worker list = 
[Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal, MRtaskID=1, 
port=30001), Worker(hostname=ip-172-31-23-12.sa-east-1.compute.internal, 
MRtaskID=2, port=30002), 
Worker(hostname=ip-172-31-23-11.sa-east-1.compute.internal, MRtaskID=3, 
port=30003), Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal, 
MRtaskID=5, port=30005)], size = 4 from 
/_hadoopBsp/giraph_yarn_application_1469827475142_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/07/29 21:43:19 INFO yarn.GiraphYarnTask: [STATUS: task-0] 
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/07/29 21:43:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:43:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:43:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:44:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:44:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:45:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:45:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:46:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:46:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:49 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:49 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:47:59 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:47:59 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:09 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:09 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:19 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:19 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got 
finished worker list = [], size = 0, worker list = 
[Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal, MRtaskID=1, 
port=30001), Worker(hostname=ip-172-31-23-12.sa-east-1.compute.internal, 
MRtaskID=2, port=30002), 
Worker(hostname=ip-172-31-23-11.sa-east-1.compute.internal, MRtaskID=3, 
port=30003), Worker(hostname=ip-172-31-23-9.sa-east-1.compute.internal, 
MRtaskID=5, port=30005)], size = 4 from 
/_hadoopBsp/giraph_yarn_application_1469827475142_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/07/29 21:48:19 INFO master.BspServiceMaster: barrierOnWorkerList: 0 out of 4 
workers finished on superstep 4 on path 
/_hadoopBsp/giraph_yarn_application_1469827475142_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
16/07/29 21:48:19 INFO master.BspServiceMaster: barrierOnWorkerList: Waiting on 
[ip-172-31-23-12.sa-east-1.compute.internal_2, 
ip-172-31-23-9.sa-east-1.compute.internal_5, 
ip-172-31-23-11.sa-east-1.compute.internal_3, 
ip-172-31-23-9.sa-east-1.compute.internal_1]
16/07/29 21:48:19 INFO yarn.GiraphYarnTask: [STATUS: task-0] 
MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
16/07/29 21:48:19 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:29 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:29 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 21:48:39 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
16/07/29 21:48:39 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
16/07/29 22:50:19 INFO utils.ProgressableUtils: waitFor: Waiting for 
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@5c571c52
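
To make the scenario at the top of this report concrete, here is a small toy 
simulation of several BFS traversals advancing in lockstep, one "superstep" at 
a time. It is only a sketch under assumptions: it does not use Giraph, it does 
not reproduce the hang, and it is not the reporter's job. The function names 
(multi_source_bfs, shared_vertices) and the example graph are hypothetical. It 
only shows how, depending on the data, two traversals can reach the same vertex 
in the same superstep.

```python
from collections import defaultdict


def multi_source_bfs(edges, sources):
    """Run one BFS per source in lockstep over an undirected graph.

    Returns {superstep: {vertex: sorted sources that first reached
    that vertex in that superstep}}. Toy model, not Giraph code.
    """
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    visited = {s: {s} for s in sources}   # per-traversal visited set
    frontier = {s: {s} for s in sources}  # per-traversal current frontier
    arrivals = {}
    superstep = 0
    while any(frontier.values()):
        superstep += 1
        reached = defaultdict(list)
        for s in sources:
            next_frontier = set()
            for u in frontier[s]:
                for v in adjacency[u]:
                    if v not in visited[s]:
                        next_frontier.add(v)
            visited[s] |= next_frontier
            frontier[s] = next_frontier
            for v in next_frontier:
                reached[v].append(s)
        if reached:
            arrivals[superstep] = {v: sorted(srcs) for v, srcs in reached.items()}
    return arrivals


def shared_vertices(arrivals):
    """Keep only vertices reached by two or more traversals in one superstep."""
    shared = {}
    for step, hits in arrivals.items():
        multi = {v: srcs for v, srcs in hits.items() if len(srcs) > 1}
        if multi:
            shared[step] = multi
    return shared


if __name__ == "__main__":
    # Hypothetical graph: a - c - b, plus c - d.
    example_edges = [("a", "c"), ("b", "c"), ("c", "d")]
    print(shared_vertices(multi_source_bfs(example_edges, ["a", "b"])))
```

With this example graph, the traversals started from "a" and "b" both reach 
"c" in the first superstep and "d" in the second, which is the kind of overlap 
the description refers to.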



