[ 
https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100564#comment-13100564
 ] 

Avery Ching commented on GIRAPH-25:
-----------------------------------

The patch worked nicely.  I added a unit test and tweaked an error message.  
Here's some example output I got (it looks much better).

...
2011-09-08 11:20:35,203 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 0 responses of 32767 needed to start superstep -1.  
Sleeping for 1 msecs and used 0 of 1 attempts.
2011-09-08 11:20:35,203 ERROR org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Did not receive enough processes in time (only 0 of 32767 
required).  This occurs if you do not have enough map tasks available 
simultaneously on your Hadoop instance to fulfill the number of requested 
workers.
2011-09-08 11:20:35,276 INFO org.apache.giraph.graph.BspServiceMaster: 
setJobState: 
{"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on 
superstep -1
2011-09-08 11:20:35,333 FATAL org.apache.giraph.graph.BspServiceMaster: 
failJob: Killing job job_201109080935_0009
2011-09-08 11:20:35,619 INFO org.apache.giraph.graph.BspServiceMaster: cleanup: 
Notifying master its okay to cleanup with 
/_hadoopBsp/job_201109080935_0009/_cleanedUpDir/0_master
2011-09-08 11:20:35,620 INFO org.apache.giraph.graph.BspServiceMaster: 
cleanUpZooKeeper: Node /_hadoopBsp/job_201109080935_0009/_cleanedUpDir already 
exists, no need to create.
2011-09-08 11:20:35,621 INFO org.apache.giraph.graph.BspServiceMaster: 
cleanUpZooKeeper: Got 1 of 32768 desired children from 
/_hadoopBsp/job_201109080935_0009/_cleanedUpDir
2011-09-08 11:20:35,621 INFO org.apache.giraph.graph.BspServiceMaster: 
cleanedUpZooKeeper: Waiting for the children of 
/_hadoopBsp/job_201109080935_0009/_cleanedUpDir to change since only got 1 
nodes.
2011-09-08 11:20:38,182 WARN org.apache.giraph.zk.ZooKeeperManager: 
onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

I'll upload the minor changes and then commit the patch on your behalf.  I ran 
the unit tests in local mode and also on a small Hadoop instance.  Thanks!


> NPE in BspServiceMaster when failing a job
> ------------------------------------------
>
>                 Key: GIRAPH-25
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-25
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Dmitriy V. Ryaboy
>            Priority: Minor
>         Attachments: GIRAPH-25.2.patch, GIRAPH-25.patch
>
>
> When BspServiceMaster times out waiting for all workers to check in, it dies 
> with a NullPointerException.
> This can perhaps be handled a bit more gracefully.
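
For illustration only, here is a minimal sketch of one way a worker check-in 
timeout could be handled without a NullPointerException.  It is not the 
attached patch and does not use the real Giraph classes: WorkerCheckSketch, 
checkWorkers, startSuperstep and their parameters are hypothetical names 
chosen for this example.  The idea is that the polling method returns an empty 
Optional on timeout, so the caller has to handle the "not enough workers" case 
explicitly and can fail the job with a clear message.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Giraph code base.
class WorkerCheckSketch {

    /**
     * Polls for checked-in workers up to maxAttempts times.
     * Returns Optional.empty() on timeout instead of null, so callers
     * cannot dereference a missing worker list by accident.
     */
    static Optional<List<String>> checkWorkers(Supplier<List<String>> poll,
                                               int required,
                                               int maxAttempts,
                                               long sleepMsecs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            List<String> found = poll.get();
            if (found.size() >= required) {
                return Optional.of(found);
            }
            try {
                Thread.sleep(sleepMsecs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and treat it as a timeout.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    /**
     * Returns the resulting job state: "FAILED" when too few workers
     * checked in, "RUNNING" otherwise.
     */
    static String startSuperstep(Supplier<List<String>> poll,
                                 int required,
                                 int maxAttempts,
                                 long sleepMsecs) {
        Optional<List<String>> workers =
            checkWorkers(poll, required, maxAttempts, sleepMsecs);
        if (!workers.isPresent()) {
            System.err.println("Did not receive enough workers in time ("
                + required + " required); failing the job.");
            return "FAILED";
        }
        return "RUNNING";
    }
}
```

With a poll that never finds any workers, startSuperstep returns "FAILED" 
after the configured number of attempts rather than throwing.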

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
