[ https://issues.apache.org/jira/browse/HIVE-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861020#comment-15861020 ]
Rui Li commented on HIVE-15860:
-------------------------------

[~xuefuz] - yeah, the monitor loops forever in that case. As far as the monitor is concerned, the job has started, because we have received the JobStarted event. So it goes to this switch branch every time it wakes up:
{code}
case STARTED:
  JobExecutionStatus sparkJobState = sparkJobStatus.getState();
  if (sparkJobState == JobExecutionStatus.RUNNING) {
    Map<String, SparkStageProgress> progressMap = sparkJobStatus.getSparkStageProgress();
    if (!running) {
      perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.SPARK_SUBMIT_TO_RUNNING);
      printAppInfo();
      // print job stages.
      console.printInfo("\nQuery Hive on Spark job[" + sparkJobStatus.getJobId() +
          "] stages: " + Arrays.toString(sparkJobStatus.getStageIds()));

      console.printInfo("\nStatus: Running (Hive on Spark job[" + sparkJobStatus.getJobId() + "])");
      running = true;

      String format = "Job Progress Format\nCurrentTime StageId_StageAttemptId: "
          + "SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount";
      if (!inPlaceUpdate) {
        console.printInfo(format);
      } else {
        console.logInfo(format);
      }
    }

    printStatus(progressMap, lastProgressMap);
    lastProgressMap = progressMap;
  }
  break;
{code}
However, {{sparkJobStatus.getState()}} always returns null because we haven't received the JobSubmitted event, which carries the JobId. At this point we need a way to tell whether the connection has broken or there is just a big gap between JobStarted and JobSubmitted, see HIVE-9370. So I added a check to see whether the client is still alive.

> RemoteSparkJobMonitor may hang when RemoteDriver exits abnormally
> -----------------------------------------------------------------
>
>                 Key: HIVE-15860
>                 URL: https://issues.apache.org/jira/browse/HIVE-15860
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-15860.1.patch
>
>
> It happens when RemoteDriver crashes between {{JobStarted}} and
> {{JobSubmitted}}, e.g. killed by {{kill -9}}. RemoteSparkJobMonitor will
> consider the job started, but it can't get the job info because it
> hasn't received the JobId. Then the monitor will loop forever.
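
For illustration only, the liveness check described in the comment above can be sketched roughly as follows. This is not the code from HIVE-15860.1.patch: {{MonitorSketch}}, {{waitForRunning}}, {{Result}} and the two supplier parameters are hypothetical names, and the bounded poll count stands in for the monitor's sleep/wake cycle. The only idea taken from the comment is that a null job state should be treated as "still waiting" only while the client is alive.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from Hive.
final class MonitorSketch {
    enum Result { RUNNING, CLIENT_DEAD, TIMED_OUT }

    /**
     * Polls the job state up to maxPolls times.
     * jobState returns null until the job id is known, mirroring
     * sparkJobStatus.getState() before JobSubmitted arrives.
     * clientAlive is an assumed liveness probe for the remote driver.
     */
    static Result waitForRunning(Supplier<String> jobState,
                                 BooleanSupplier clientAlive,
                                 int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            String state = jobState.get();
            if ("RUNNING".equals(state)) {
                return Result.RUNNING;
            }
            // No state yet: either JobSubmitted is merely late, or the
            // driver is gone. Only the liveness probe can tell them apart.
            if (state == null && !clientAlive.getAsBoolean()) {
                return Result.CLIENT_DEAD;
            }
        }
        return Result.TIMED_OUT;
    }

    public static void main(String[] args) {
        // Driver killed between JobStarted and JobSubmitted:
        // state never arrives and the client is no longer alive.
        System.out.println(waitForRunning(() -> null, () -> false, 5));
    }
}
```

Without the {{clientAlive}} probe, the first scenario in {{main}} would spin until the poll limit (or forever, in an unbounded loop), which is the hang reported in this issue.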