[ https://issues.apache.org/jira/browse/GOBBLIN-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286602#comment-16286602 ]
Joel Baranick commented on GOBBLIN-318:
---------------------------------------

Another piece of info. All tasks are marked as completed in the Gobblin DB, but when I look at https://zookeeper/node?path=/ROOT/CLUSTER/PROPERTYSTORE/TaskRebalancer/JOB_NAME_job_JOB_NAME_1512924480001/Context, there are multiple tasks still marked as running:
{code:javascript}
{
  "id": "TaskContext",
  "simpleFields": {
    "START_TIME": "1512924491039"
  },
  "listFields": {},
  "mapFields": {
    "0": {
      "ASSIGNED_PARTICIPANT": "worker-1",
      "FINISH_TIME": "1512924700877",
      "INFO": "completed tasks: 1",
      "NUM_ATTEMPTS": "1",
      "START_TIME": "1512924491044",
      "STATE": "COMPLETED",
      "TASK_ID": "124a2e88-90e3-40e8-add6-94b59ee30133"
    },
    "1": {
      "ASSIGNED_PARTICIPANT": "worker-2",
      "FINISH_TIME": "1512924701120",
      "INFO": "completed tasks: 1",
      "NUM_ATTEMPTS": "1",
      "START_TIME": "1512924491044",
      "STATE": "COMPLETED",
      "TASK_ID": "9d7c2369-d6d9-4c2f-8bf3-1bcea0a47fdf"
    },
    "2": {
      "ASSIGNED_PARTICIPANT": "worker-3",
      "FINISH_TIME": "1512924695451",
      "INFO": "completed tasks: 1",
      "NUM_ATTEMPTS": "1",
      "START_TIME": "1512924491044",
      "STATE": "COMPLETED",
      "TASK_ID": "19545764-e2bf-48b6-9942-361c834790cf"
    },
    "3": {
      "ASSIGNED_PARTICIPANT": "worker-4",
      "FINISH_TIME": "1512924776614",
      "INFO": "completed tasks: 1",
      "NUM_ATTEMPTS": "1",
      "START_TIME": "1512924491044",
      "STATE": "COMPLETED",
      "TASK_ID": "3f59431f-2415-477a-8008-26a3eb258129"
    },
    "4": {
      "ASSIGNED_PARTICIPANT": "worker-5",
      "FINISH_TIME": "1512924731962",
      "INFO": "completed tasks: 1",
      "NUM_ATTEMPTS": "1",
      "START_TIME": "1512924491044",
      "STATE": "COMPLETED",
      "TASK_ID": "19863633-6ed3-49d4-a07f-2130eec15dd3"
    },
    "5": {
      "ASSIGNED_PARTICIPANT": "worker-6",
      "INFO": "",
      "START_TIME": "1512924491044",
      "STATE": "RUNNING",
      "TASK_ID": "433c0107-0919-428a-b7c5-6e8925df7dac"
    },
    "6": {
      "ASSIGNED_PARTICIPANT": "worker-7",
      "INFO": "",
      "START_TIME": "1512924491044",
      "STATE": "RUNNING",
      "TASK_ID": "89a63cfd-efb4-44ce-a08b-68678d792e25"
    },
    "7": {
      "ASSIGNED_PARTICIPANT": "worker-8",
      "FINISH_TIME": "1512924524111",
      "INFO": "completed tasks: 1",
      "NUM_ATTEMPTS": "1",
      "START_TIME": "1512924491044",
      "STATE": "COMPLETED",
      "TASK_ID": "a133db13-3f28-49af-8e3d-1d6fa81f6247"
    },
    "8": {
      "ASSIGNED_PARTICIPANT": "worker-9",
      "INFO": "",
      "START_TIME": "1512924491044",
      "STATE": "RUNNING",
      "TASK_ID": "7bbda2ef-68da-4f11-b217-89c3cd7d7a2e"
    },
    "9": {
      "ASSIGNED_PARTICIPANT": "worker-10",
      "INFO": "",
      "START_TIME": "1512924491044",
      "STATE": "RUNNING",
      "TASK_ID": "8407cb27-4b26-4786-91f2-ad920b1e2343"
    }
  }
}
{code}

> Gobblin Helix Jobs Hang Indefinitely
> -------------------------------------
>
>                 Key: GOBBLIN-318
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-318
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Joel Baranick
>            Priority: Critical
>
> In some cases, Gobblin Helix jobs can hang indefinitely. When coupled with job locks, this can result in a job becoming stuck and not progressing. The only solution currently is to restart the master node.
> Assume the following is for {{job_myjob_1510884004834}}, which hung at 2017-11-17 02:09:00 UTC and was still hung at 2017-11-17 09:12:00 UTC.
> {{GobblinHelixJobLauncher.waitForJobCompletion()}} never detects the job as completed. This results in the {{TaskStateCollectorService}} indefinitely searching for more task states, even though it has already processed all the task states that will ever be produced. There is no reference to the hung job in Zookeeper at {{/mycluster/CONFIGS/RESOURCE}}. In the Helix Web Admin, the hung job doesn't exist at {{/clusters/mycluster/jobQueues/jobname}}.
> There is no record of the job in Zookeeper at {{/mycluster/PROPERTYSTORE/TaskRebalancer/jobname/Context}}. This means that the {{GobblinHelixJobLauncher.waitForJobCompletion()}} code never exits:
> {code:java}
> private void waitForJobCompletion() throws InterruptedException {
>   while (true) {
>     WorkflowContext workflowContext =
>         TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
>     if (workflowContext != null) {
>       org.apache.helix.task.TaskState helixJobState =
>           workflowContext.getJobState(this.jobResourceName);
>       if (helixJobState == org.apache.helix.task.TaskState.COMPLETED ||
>           helixJobState == org.apache.helix.task.TaskState.FAILED ||
>           helixJobState == org.apache.helix.task.TaskState.STOPPED) {
>         return;
>       }
>     }
>     Thread.sleep(1000);
>   }
> }
> {code}
> The code gets the job state from Zookeeper:
> {code:javascript}
> {
>   "id": "WorkflowContext",
>   "simpleFields": {
>     "START_TIME": "1505159715449",
>     "STATE": "IN_PROGRESS"
>   },
>   "listFields": {},
>   "mapFields": {
>     "JOB_STATES": {
>       "jobname_job_jobname_1507415700001": "COMPLETED",
>       "jobname_job_jobname_1507756800000": "COMPLETED",
>       "jobname_job_jobname_1507959300001": "COMPLETED",
>       "jobname_job_jobname_1509857102910": "COMPLETED",
>       "jobname_job_jobname_1510253708033": "COMPLETED",
>       "jobname_job_jobname_1510271102898": "COMPLETED",
>       "jobname_job_jobname_1510852210668": "COMPLETED",
>       "jobname_job_jobname_1510853133675": "COMPLETED"
>     }
>   }
> }
> {code}
> But the workflow context contains no entry for the hung job. It is also strange that the job states in that JSON blob are so old; the oldest is from 2017-10-07 10:35:00 PM UTC, more than a month ago.
> I'm not sure how the system got into this state, but this isn't the first time we have seen it. While it would be good to prevent this from happening, it would also be good to allow the system to recover if this state is entered.
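Regarding the last point about letting the system recover: below is a minimal sketch of one possible mitigation, which bounds the polling loop in {{waitForJobCompletion()}} with a timeout so the launcher can fail the job and release the job lock instead of spinning forever. The {{timeoutMs}} parameter and the {{TimeoutException}} are assumptions for illustration only, not the actual Gobblin fix.
{code:java}
// Illustrative sketch only: the same polling loop quoted above, but bounded by a deadline.
// Assumes java.util.concurrent.TimeoutException and a caller-supplied timeoutMs; these are
// not part of the current Gobblin code.
private void waitForJobCompletion(long timeoutMs) throws InterruptedException, TimeoutException {
  final long deadline = System.currentTimeMillis() + timeoutMs;
  while (System.currentTimeMillis() < deadline) {
    WorkflowContext workflowContext =
        TaskDriver.getWorkflowContext(this.helixManager, this.helixQueueName);
    if (workflowContext != null) {
      org.apache.helix.task.TaskState helixJobState =
          workflowContext.getJobState(this.jobResourceName);
      if (helixJobState == org.apache.helix.task.TaskState.COMPLETED
          || helixJobState == org.apache.helix.task.TaskState.FAILED
          || helixJobState == org.apache.helix.task.TaskState.STOPPED) {
        return;
      }
    }
    Thread.sleep(1000);
  }
  // No terminal state was ever reported for this job; surface that to the caller so the
  // job can be failed and cleaned up instead of hanging the launcher thread indefinitely.
  throw new TimeoutException("Helix job " + this.jobResourceName
      + " did not reach a terminal state within " + timeoutMs + " ms");
}
{code}
A caller could catch the timeout, clean up the stale Helix job (for example via the Helix {{TaskDriver}}), release the job lock, and report the run as failed, rather than requiring a restart of the master node.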