Hello, I am not currently a contributor to this project, but I have noticed an
issue that I wanted to report here instead of on the users mailing list.

I am using 1.1.0-SNAPSHOT built for PURE YARN and cdh5.0.0.

I have an intermittent problem that, when it occurs, causes the job to stall
after completion (but prior to vertices writing their output). Looking into
the logs (posted below), I see that I go from 7 of 8 workers reporting
completion to 9 of 8. The code at BspServiceMaster:1740 uses
cleanedUpChildrenList.size() == maxTasks inside a while-true loop, so the
job gets stuck there forever and never progresses.

I plan on changing this locally to >= for my own use to work around the problem,
but I don’t know how 9 of 8 comes to be reported or what is really causing it.
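
To make the difference between the two checks concrete, here is a minimal
sketch. It is not the actual BspServiceMaster code: the class and method names
are hypothetical, and the loop is reduced to replaying the child counts that
appear in the log below (1, 2, 5, 6, 9) against a maxTasks of 8.

```java
// Hypothetical stand-in for the wait loop; not the real Giraph code.
class CleanupWaitSketch {

    /**
     * Replays a sequence of observed child counts and returns the index of
     * the first observation at which the wait loop would exit, or -1 if it
     * would never exit (that is, the job would hang).
     *
     * @param observed counts seen on each wake-up of the loop
     * @param maxTasks number of children the master is waiting for
     * @param atLeast  true to use >=, false to use the exact == check
     */
    static int firstExit(int[] observed, int maxTasks, boolean atLeast) {
        for (int i = 0; i < observed.length; i++) {
            boolean done = atLeast
                ? observed[i] >= maxTasks
                : observed[i] == maxTasks;
            if (done) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Counts taken from the log: the value 8 is never observed.
        int[] observed = {1, 2, 5, 6, 9};
        System.out.println(firstExit(observed, 8, false)); // -1: never exits
        System.out.println(firstExit(observed, 8, true));  // 4: exits at 9
    }
}
```

With the exact check the loop never sees a count of 8 and so never exits.
With >= it exits as soon as the count reaches 9. This only works around the
hang; it does not explain where the extra child comes from.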

Thanks for any ideas,
Eric


14/01/30 09:35:37 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 1 of 8 
desired children from 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:37 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for 
the children of 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change 
since only got 1 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged 
signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 2 of 8 
desired children from 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for 
the children of 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change 
since only got 2 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged 
signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 5 of 8 
desired children from 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for 
the children of 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change 
since only got 5 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged 
signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 6 of 8 
desired children from 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for 
the children of 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change 
since only got 6 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged 
signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 9 of 8 
desired children from 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for 
the children of 
/_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change 
since only got 9 nodes.
