Hi!

I am running Giraph on YARN with checkpointing enabled. However, when a worker
failure occurs, the master node outputs:

18/11/28 12:52:31 INFO master.MasterThread: masterThread: Coordination of 
superstep 3 took 0.094 seconds ended with state WORKER_FAILURE and is now on 
superstep 3
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: 
{"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2} on 
superstep 2
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: 
{"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2}
18/11/28 12:52:31 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ONLY 
checkWorkers: Only found 0 responses of 2 needed to start superstep 2

After a while the job fails, because the timeout expires and no workers are present.

Is it possible to resume automatically from a checkpoint without falling back
from YARN to the MR driver?

Best Regards,
Denis Dudinski
