Hi!
I am running Giraph on YARN with checkpointing enabled. However, when a worker
failure occurs, the master node outputs:
18/11/28 12:52:31 INFO master.MasterThread: masterThread: Coordination of superstep 3 took 0.094 seconds ended with state WORKER_FAILURE and is now on superstep 3
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: {"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2} on superstep 2
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: {"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2}
18/11/28 12:52:31 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ONLY checkWorkers: Only found 0 responses of 2 needed to start superstep 2
After a while the job fails, because the timeout expires and no workers are present.
Is it possible to resume automatically from a checkpoint without falling back
from YARN to the MR driver?
Best Regards,
Denis Dudinski