Hello, I am not currently a contributor to this project, but I have noticed an issue that I wanted to report here instead of on the users mailing list.
Using 1.1.0-SNAPSHOT built for PURE YARN and cdh5.0.0, I have an intermittent problem that, when it occurs, causes the job to stall after completion (but prior to vertices writing their output).

Looking into the logs (posted below), I see that I go from 7 of 8 workers reporting completion to 9 of 8. The code at BspServiceMaster:1740 uses cleanedUpChildrenList.size() == maxTasks inside of a while (true) loop. Once the count has jumped past maxTasks, that equality can never hold, so the job gets stuck here forever and will never progress again.

I plan on changing this locally to a >= for my own use to prevent this problem, but I don’t know how 9 of 8 is being reported or how this problem is really happening.

Thanks for any ideas,
Eric

14/01/30 09:35:37 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 1 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:37 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 1 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 2 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 2 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 5 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 5 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 6 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 6 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 9 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 9 nodes.
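
To make the difference between the two comparisons concrete, here is a minimal, self-contained sketch. It is not the actual BspServiceMaster code: the class and method names (CleanupWaitSketch, doneStrict, doneTolerant, firstDoneIndex) are hypothetical, and it only replays the child counts seen in the log above (1, 2, 5, 6, 9) against maxTasks = 8.

```java
public class CleanupWaitSketch {
    /** Strict check, mirroring the == comparison described in this report. */
    public static boolean doneStrict(int cleanedUp, int maxTasks) {
        return cleanedUp == maxTasks;
    }

    /** Tolerant check: also ends the wait if the count overshoots maxTasks. */
    public static boolean doneTolerant(int cleanedUp, int maxTasks) {
        return cleanedUp >= maxTasks;
    }

    /**
     * Replays a sequence of observed child counts and returns the index of the
     * first observation at which the wait would end, or -1 if it never ends.
     */
    public static int firstDoneIndex(int[] observed, int maxTasks, boolean tolerant) {
        for (int i = 0; i < observed.length; i++) {
            boolean done = tolerant
                ? doneTolerant(observed[i], maxTasks)
                : doneStrict(observed[i], maxTasks);
            if (done) {
                return i;
            }
        }
        return -1; // in the real loop this would mean waiting forever
    }

    public static void main(String[] args) {
        int[] observed = {1, 2, 5, 6, 9}; // child counts taken from the log
        int maxTasks = 8;
        System.out.println("strict (==):   " + firstDoneIndex(observed, maxTasks, false)); // prints -1
        System.out.println("tolerant (>=): " + firstDoneIndex(observed, maxTasks, true));  // prints 4
    }
}
```

With these counts the strict check never fires, because the count jumps from 6 straight to 9, while the >= check ends the wait at the fifth observation. This only works around the stall; it does not explain where the ninth child comes from.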
