[ 
https://issues.apache.org/jira/browse/HIVE-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861020#comment-15861020
 ] 

Rui Li commented on HIVE-15860:
-------------------------------

[~xuefuz] - yeah, the monitor loops forever in that case. From the monitor's
point of view, the job has started because we have received the JobStarted
event, so it enters this switch branch every time it wakes up:
{code}
        case STARTED:
          JobExecutionStatus sparkJobState = sparkJobStatus.getState();
          if (sparkJobState == JobExecutionStatus.RUNNING) {
            Map<String, SparkStageProgress> progressMap = sparkJobStatus.getSparkStageProgress();
            if (!running) {
              perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.SPARK_SUBMIT_TO_RUNNING);
              printAppInfo();
              // print job stages.
              console.printInfo("\nQuery Hive on Spark job[" + sparkJobStatus.getJobId() +
                  "] stages: " + Arrays.toString(sparkJobStatus.getStageIds()));

              console.printInfo("\nStatus: Running (Hive on Spark job["
                + sparkJobStatus.getJobId() + "])");
              running = true;

              String format = "Job Progress Format\nCurrentTime StageId_StageAttemptId: "
                  + "SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount";
              if (!inPlaceUpdate) {
                console.printInfo(format);
              } else {
                console.logInfo(format);
              }
            }

            printStatus(progressMap, lastProgressMap);
            lastProgressMap = progressMap;
          }
          break;
{code}
However, {{sparkJobStatus.getState()}} always returns null because we haven't
received the JobSubmitted event, which carries the JobId. At this point, we need
a way to tell whether the connection has broken or there's just a big gap
between JobStarted and JobSubmitted (see HIVE-9370). So I added a check to see
whether the client is still alive.
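
To illustrate the idea, here is a minimal, self-contained sketch of that kind of liveness check. It is not the code in the attached patch: the class {{MonitorLivenessSketch}}, the {{JobState}} enum, the return codes, and the {{clientAlive}} supplier are all hypothetical stand-ins, and a real monitor would sleep between polls.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from Hive's RemoteSparkJobMonitor.
class MonitorLivenessSketch {
    // Stand-in for the job state; null models "no job info yet".
    enum JobState { RUNNING, SUCCEEDED, FAILED }

    // Illustrative return codes (assumptions, not Hive's values).
    static final int RC_OK = 0;
    static final int RC_TIMEOUT = 1;
    static final int RC_JOB_FAILED = 2;
    static final int RC_CLIENT_DEAD = 3;

    static int monitor(Supplier<JobState> jobState, BooleanSupplier clientAlive, int maxPolls) {
        for (int poll = 0; poll < maxPolls; poll++) {
            JobState state = jobState.get();
            if (state == null) {
                // JobStarted was seen but the JobId has not arrived yet.
                // Either the gap is just long, or the remote side is gone.
                if (!clientAlive.getAsBoolean()) {
                    return RC_CLIENT_DEAD; // stop instead of looping forever
                }
            } else if (state == JobState.SUCCEEDED) {
                return RC_OK;
            } else if (state == JobState.FAILED) {
                return RC_JOB_FAILED;
            }
            // RUNNING, or null with a live client: poll again.
        }
        return RC_TIMEOUT;
    }

    public static void main(String[] args) {
        // Remote side died before the JobId arrived: the loop exits on the first poll.
        System.out.println(monitor(() -> null, () -> false, 100));
        // Healthy job that has already finished.
        System.out.println(monitor(() -> JobState.SUCCEEDED, () -> true, 100));
    }
}
```

Without the {{clientAlive}} test, the null branch would simply fall through on every wake-up, which is the hang described in this issue. With it, a slow but healthy submission keeps polling, while a dead client ends the loop.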

> RemoteSparkJobMonitor may hang when RemoteDriver exits abnormally
> -----------------------------------------------------------------
>
>                 Key: HIVE-15860
>                 URL: https://issues.apache.org/jira/browse/HIVE-15860
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-15860.1.patch
>
>
> It happens when RemoteDriver crashes between {{JobStarted}} and 
> {{JobSubmitted}}, e.g. killed by {{kill -9}}. RemoteSparkJobMonitor will 
> consider the job started, but it can't get the job info because it 
> hasn't received the JobId. The monitor then loops forever.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
