Hi All,

When we submit Spark jobs on YARN, we see a lot of jobs reporting the error messages below during RM failover.

*2016-01-11 09:41:06,682 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1450676950893_0280_000001*
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1450676950893_0280_000001 State change from FINAL_SAVING to FAILED
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1450676950893_0280 State change from RUNNING to ACCEPTED
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1450676950893_0280_000001 is done. finalState=FAILED
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1450676950893_0280_000002
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1450676950893_0280 requests cleared
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1450676950893_0280_000002 State change from NEW to SUBMITTED
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1450676950893_0280_000001
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1450676950893_0280_000002 to scheduler from user: glenm
2016-01-11 09:41:06,683 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1450676950893_0280_000002 State change from SUBMITTED to SCHEDULED
*2016-01-11 09:41:06,747 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1450676950893_0280_000001*

The ResourceManager has a ConcurrentMap in which it puts the application attempt ID when an application attempt is registered. When a finishApplicationMaster request arrives, it looks up the entry in the ConcurrentMap; if no entry is present, it reports that ERROR message. When an application attempt is unregistered, its entry is removed. So, after the application attempt has been unregistered, many finishApplicationMaster requests still arrive, and each of them causes the ERROR.

We need your help to understand in what scenario the above happens.

Related JIRAs:
https://issues.apache.org/jira/browse/SPARK-1032
https://issues.apache.org/jira/browse/SPARK-3072

Thanks,
Prabhu Joseph
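
P.S. To make the sequence concrete, here is a minimal sketch of the register / unregister / finish behaviour described above. It is a simplified model under my own assumptions, not the actual ResourceManager code: the class name AttemptCache, the use of String keys, and the Boolean placeholder values are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Simplified, hypothetical model of the sequence described in this mail.
// This is NOT the real ApplicationMasterService; names are illustrative only.
class AttemptCache {
    // Keyed by application attempt ID; the value is just a placeholder.
    private final ConcurrentMap<String, Boolean> attempts = new ConcurrentHashMap<>();

    // Models registering an application attempt: the entry is added.
    void registerAppAttempt(String attemptId) {
        attempts.put(attemptId, Boolean.TRUE);
    }

    // Models unregistering an application attempt: the entry is removed.
    void unregisterAttempt(String attemptId) {
        attempts.remove(attemptId);
    }

    // Models a finishApplicationMaster request: fails if the entry is gone.
    void finishApplicationMaster(String attemptId) {
        if (!attempts.containsKey(attemptId)) {
            throw new IllegalStateException(
                "AppAttemptId doesnt exist in cache " + attemptId);
        }
    }
}
```

In this model, a finishApplicationMaster request for attempt 000001 that arrives after the unregister logged at 09:41:06,682 takes the error branch, which would be consistent with the ERROR logged at 09:41:06,747. What it does not explain is why the request is still sent after the attempt has been unregistered.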