Hi, guys,
I have a question about how Giraph restarts from the last checkpoint after a
worker failure.
I ran an example with 5 workers and 1 master. Two workers were preempted
while the job was running, but I found that the other 3 workers also quit. I
checked the code and found the following in
BspServiceWorker.processEvent(WatchedEvent event):
    if ((ApplicationState.valueOf(jsonObj.getString(JSONOBJ_STATE_KEY)) ==
        ApplicationState.START_SUPERSTEP) &&
        jsonObj.getLong(JSONOBJ_APPLICATION_ATTEMPT_KEY) !=
        getApplicationAttempt()) {
      LOG.fatal("processEvent: Worker will restart " +
          "from command - " + jsonObj.toString());
      System.exit(-1);
    }
Does this mean that all the "good" workers also have to quit, and that the job
has to request resources again? BTW, I am using pure YARN with
Giraph-1.1.0-SNAPSHOT.
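To show how I am reading that check, here is a minimal standalone sketch of the condition. It is not Giraph code: the class name, the method name, and the simplified State enum are made up for illustration, and the worker's own attempt number of 0 is my assumption (the log only shows the attempt number 1 published in the job state).

```java
public class RestartCheckSketch {
    // Simplified stand-in for Giraph's ApplicationState. Only START_SUPERSTEP
    // comes from the quoted snippet; OTHER is a hypothetical placeholder.
    enum State { START_SUPERSTEP, OTHER }

    // Mirrors the quoted condition: the worker exits when the published job
    // state is START_SUPERSTEP and its application attempt differs from the
    // attempt this worker was started with.
    static boolean workerMustExit(State jobState, long jobAttempt, long workerAttempt) {
        return jobState == State.START_SUPERSTEP && jobAttempt != workerAttempt;
    }

    public static void main(String[] args) {
        // Job state values taken from the log: START_SUPERSTEP, attempt 1.
        // The worker attempt of 0 is assumed, not taken from the log.
        System.out.println(workerMustExit(State.START_SUPERSTEP, 1L, 0L));
        // A worker already running under attempt 1 would not match the check.
        System.out.println(workerMustExit(State.START_SUPERSTEP, 1L, 1L));
    }
}
```

If this reading is right, every worker started under the earlier attempt matches the condition once the attempt number in the job state changes, whether or not that worker itself failed.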
The following is the log from one "good" worker:
2014-04-29 21:56:55,284 INFO [main-EventThread] worker.BspServiceWorker
(BspServiceWorker.java:processEvent(1604)) - processEvent: Job state
changed, checking to see if it needs to restart
2014-04-29 21:56:55,285 INFO [main-EventThread] bsp.BspService
(BspService.java:getJobState(695)) - getJobState: Job state already exists
(/_hadoopBsp/giraph_yarn_application_1398826558049_0001/_masterJobState)
2014-04-29 21:56:55,287 FATAL [main-EventThread] worker.BspServiceWorker
(BspServiceWorker.java:processEvent(1619)) - processEvent: Worker will
restart from command -
{"_stateKey":"START_SUPERSTEP","_applicationAttemptKey":1,"_superstepKey":24}
Thanks for your help!