[ https://issues.apache.org/jira/browse/GOBBLIN-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286602#comment-16286602 ]

Joel Baranick commented on GOBBLIN-318:
---------------------------------------

Another piece of info: all tasks are marked as completed in the Gobblin DB, 
but when I look at 
https://zookeeper/node?path=/ROOT/CLUSTER/PROPERTYSTORE/TaskRebalancer/JOB_NAME_job_JOB_NAME_1512924480001/Context, 
there are multiple tasks still marked as running:

{code:javascript}
{
  "id":"TaskContext"
  ,"simpleFields":{
    "START_TIME":"1512924491039"
  }
  ,"listFields":{
  }
  ,"mapFields":{
    "0":{
      "ASSIGNED_PARTICIPANT":"worker-1"
      ,"FINISH_TIME":"1512924700877"
      ,"INFO":"completed tasks: 1"
      ,"NUM_ATTEMPTS":"1"
      ,"START_TIME":"1512924491044"
      ,"STATE":"COMPLETED"
      ,"TASK_ID":"124a2e88-90e3-40e8-add6-94b59ee30133"
    }
    ,"1":{
      "ASSIGNED_PARTICIPANT":"worker-2"
      ,"FINISH_TIME":"1512924701120"
      ,"INFO":"completed tasks: 1"
      ,"NUM_ATTEMPTS":"1"
      ,"START_TIME":"1512924491044"
      ,"STATE":"COMPLETED"
      ,"TASK_ID":"9d7c2369-d6d9-4c2f-8bf3-1bcea0a47fdf"
    }
    ,"2":{
      "ASSIGNED_PARTICIPANT":"worker-3"
      ,"FINISH_TIME":"1512924695451"
      ,"INFO":"completed tasks: 1"
      ,"NUM_ATTEMPTS":"1"
      ,"START_TIME":"1512924491044"
      ,"STATE":"COMPLETED"
      ,"TASK_ID":"19545764-e2bf-48b6-9942-361c834790cf"
    }
    ,"3":{
      "ASSIGNED_PARTICIPANT":"worker-4"
      ,"FINISH_TIME":"1512924776614"
      ,"INFO":"completed tasks: 1"
      ,"NUM_ATTEMPTS":"1"
      ,"START_TIME":"1512924491044"
      ,"STATE":"COMPLETED"
      ,"TASK_ID":"3f59431f-2415-477a-8008-26a3eb258129"
    }
    ,"4":{
      "ASSIGNED_PARTICIPANT":"worker-5"
      ,"FINISH_TIME":"1512924731962"
      ,"INFO":"completed tasks: 1"
      ,"NUM_ATTEMPTS":"1"
      ,"START_TIME":"1512924491044"
      ,"STATE":"COMPLETED"
      ,"TASK_ID":"19863633-6ed3-49d4-a07f-2130eec15dd3"
    }
    ,"5":{
      "ASSIGNED_PARTICIPANT":"worker-6"
      ,"INFO":""
      ,"START_TIME":"1512924491044"
      ,"STATE":"RUNNING"
      ,"TASK_ID":"433c0107-0919-428a-b7c5-6e8925df7dac"
    }
    ,"6":{
      "ASSIGNED_PARTICIPANT":"worker-7"
      ,"INFO":""
      ,"START_TIME":"1512924491044"
      ,"STATE":"RUNNING"
      ,"TASK_ID":"89a63cfd-efb4-44ce-a08b-68678d792e25"
    }
    ,"7":{
      "ASSIGNED_PARTICIPANT":"worker-8"
      ,"FINISH_TIME":"1512924524111"
      ,"INFO":"completed tasks: 1"
      ,"NUM_ATTEMPTS":"1"
      ,"START_TIME":"1512924491044"
      ,"STATE":"COMPLETED"
      ,"TASK_ID":"a133db13-3f28-49af-8e3d-1d6fa81f6247"
    }
    ,"8":{
      "ASSIGNED_PARTICIPANT":"worker-9"
      ,"INFO":""
      ,"START_TIME":"1512924491044"
      ,"STATE":"RUNNING"
      ,"TASK_ID":"7bbda2ef-68da-4f11-b217-89c3cd7d7a2e"
    }
    ,"9":{
      "ASSIGNED_PARTICIPANT":"worker-10"
      ,"INFO":""
      ,"START_TIME":"1512924491044"
      ,"STATE":"RUNNING"
      ,"TASK_ID":"8407cb27-4b26-4786-91f2-ad920b1e2343"
    }
  }
}
{code}
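
For what it's worth, the same per-partition state can also be read through Helix's 
task framework rather than by browsing the ZooKeeper PROPERTYSTORE node directly.  
A minimal sketch, assuming Helix's {{TaskDriver}}/{{JobContext}} APIs; the 
{{helixManager}} and {{jobResourceName}} arguments are placeholders:

{code:java}
import org.apache.helix.HelixManager;
import org.apache.helix.task.JobContext;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.TaskPartitionState;

public class JobContextDump {
  // Sketch only: print each partition's state for a Helix job resource such as
  // "JOB_NAME_job_JOB_NAME_1512924480001". helixManager is assumed to be an
  // already-connected HelixManager for the cluster.
  public static void dump(HelixManager helixManager, String jobResourceName) {
    JobContext jobContext = TaskDriver.getJobContext(helixManager, jobResourceName);
    if (jobContext == null) {
      System.out.println("No JobContext found for " + jobResourceName);
      return;
    }
    for (Integer partition : jobContext.getPartitionSet()) {
      TaskPartitionState state = jobContext.getPartitionState(partition);
      System.out.println("partition " + partition
          + " state=" + state
          + " participant=" + jobContext.getAssignedParticipant(partition));
    }
  }
}
{code}

Against the context above, this would report partitions 5, 6, 8 and 9 as RUNNING 
even though every Gobblin task has already been recorded as completed.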


> Gobblin Helix Jobs Hang Indefinitely 
> -------------------------------------
>
>                 Key: GOBBLIN-318
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-318
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Joel Baranick
>            Priority: Critical
>
> In some cases, Gobblin Helix jobs can hang indefinitely.  When coupled with 
> job locks, this can result in a job becoming stuck and not progressing.  The 
> only solution currently is to restart the master node.
> Assume the following is for {{job_myjob_1510884004834}}, which hung at 
> 2017-11-17 02:09:00 UTC and was still hung at 2017-11-17 09:12:00 UTC. 
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} never detects the job as 
> completed. This results in the {{TaskStateCollectorService}} indefinitely 
> searching for more task states, even though it has processed all the task 
> states that are ever going to be produced.  There is no reference to the hung 
> job in Zookeeper at {{/mycluster/CONFIGS/RESOURCE}}.  In the Helix Web Admin, 
> the hung job doesn't exist at {{/clusters/mycluster/jobQueues/jobname}}. 
> There is no record of the job in Zookeeper at 
> {{/mycluster/PROPERTYSTORE/TaskRebalancer/jobname/Context}}.  This means that 
> the {{GobblinHelixJobLauncher.waitForJobCompletion()}} code below never sees a 
> terminal state:
> {code:java}
> private void waitForJobCompletion() throws InterruptedException {
>   while (true) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
> }
> {code}
> The code gets the job state from Zookeeper:
> {code:javascript}
> {
>   "id": "WorkflowContext",
>   "simpleFields": {
>     "START_TIME": "1505159715449",
>     "STATE": "IN_PROGRESS"
>   },
>   "listFields": {},
>   "mapFields": {
>     "JOB_STATES": {
>       "jobname_job_jobname_1507415700001": "COMPLETED",
>       "jobname_job_jobname_1507756800000": "COMPLETED",
>       "jobname_job_jobname_1507959300001": "COMPLETED",
>       "jobname_job_jobname_1509857102910": "COMPLETED",
>       "jobname_job_jobname_1510253708033": "COMPLETED",
>       "jobname_job_jobname_1510271102898": "COMPLETED",
>       "jobname_job_jobname_1510852210668": "COMPLETED",
>       "jobname_job_jobname_1510853133675": "COMPLETED"
>     }
>   }
> }
> {code}
> But the {{JOB_STATES}} map contains no entry for the hung job, so 
> {{workflowContext.getJobState()}} returns null and the loop above never sees 
> a terminal state.
> Also, it is really strange that the job states contained in that JSON blob 
> are so old.  The oldest one is from 2017-10-07 22:35:00 UTC, more than a 
> month ago.
> I'm not sure how the system got into this state, but this isn't the first 
> time we have seen it.  While it would be good to prevent this from happening, 
> it would also be good to allow the system to recover when it does.
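> One possible mitigation, purely as a sketch: bound the wait in 
> {{waitForJobCompletion()}} so the launcher eventually gives up and surfaces an 
> error instead of polling forever.  The timeout value below is hypothetical and 
> not an existing Gobblin configuration:
> {code:java}
> // Sketch only: same polling loop as above, plus a hypothetical upper bound so
> // a job whose Helix context never reaches a terminal state fails the launcher
> // instead of hanging and holding the job lock until the master is restarted.
> private void waitForJobCompletion() throws InterruptedException {
>   final long timeoutMs = 6L * 60 * 60 * 1000; // hypothetical 6 hour bound
>   final long deadline = System.currentTimeMillis() + timeoutMs;
>   while (true) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     if (System.currentTimeMillis() > deadline) {
>       // Give up so the job lock can be released and the job retried, rather
>       // than requiring a restart of the master node.
>       throw new IllegalStateException("Helix job " + this.jobResourceName
>           + " did not reach a terminal state within " + timeoutMs + " ms");
>     }
>     Thread.sleep(1000);
>   }
> }
> {code}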



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
