[ https://issues.apache.org/jira/browse/OOZIE-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300614#comment-15300614 ]
Purshotam Shah commented on OOZIE-2509:
---------------------------------------

Thanks Rohini for the review. Committed to trunk.

> SLA job status can stuck in running state
> -----------------------------------------
>
>                 Key: OOZIE-2509
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2509
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Purshotam Shah
>            Assignee: Purshotam Shah
>         Attachments: OOZIE-2509-V1.patch, OOZIE-2509-V2.patch,
> OOZIE-2509-V3.patch, OOZIE-2509-V4.patch, OOZIE-2509-V5.patch,
> OOZIE-2509-V6.patch, OOZIE-2509-V7.patch, OOZIE-2509-V8.patch
>
>
> There are a few places where the job status is not updated properly.
> 1. Receiving events out of order.
> Ex. "oozie.service.EventHandlerService.batch.size" is set to 50 and
> "oozie.service.EventHandlerService.worker.threads" is set to 15, which means
> that 15 threads process events in batches of 50.
> It can happen that the 51st event gets processed before the 49th event.
> If the 49th is the job-started event and the 51st is the job-completed event,
> then the job status will get overridden to running.
> 2.
> {code}
> case COORDINATOR_ACTION:
>     CoordinatorActionBean ca = jpaService.execute(new CoordActionGetForSLAJPAExecutor(slaCalc.getId()));
>     if (ca.isTerminalWithFailure()) {
>         isEndMiss = ended = true;
>         slaCalc.setActualEnd(ca.getLastModifiedTime());
>     }
>     if (ca.getExternalId() != null) {
>         wf = jpaService.execute(new WorkflowJobGetForSLAJPAExecutor(ca.getExternalId()));
>         if (wf.getEndTime() != null) {
>             ended = true;
>             if (wf.getEndTime().getTime() > slaCalc.getExpectedEnd().getTime()) {
>                 isEndMiss = true;
>             }
>         }
>         slaCalc.setActualEnd(wf.getEndTime());
>         slaCalc.setActualStart(wf.getStartTime());
>     }
> {code}
> Oozie checks the wf status and updates the SLA status with the coord job status.
> We might have a case where the coord is still running, but the wf has ended.
> 3. HistoryPurgeWorker updates the end time but doesn't update the status.
> 4. There are a few other locking issues.
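
For illustration only, the out-of-order case in item 1 can be sketched as a small guard that refuses to move a job backwards once a later event has been applied. This is not the code from the attached patches: SlaStatusGuard, JobStatus, and onEvent are hypothetical names, and the rank-based ordering is an assumption made for the example.

```java
class SlaStatusGuard {
    /** Hypothetical job states with an ordering rank; not Oozie's actual enum. */
    enum JobStatus {
        PENDING(0), RUNNING(1), SUCCEEDED(2), FAILED(2);

        final int rank;

        JobStatus(int rank) {
            this.rank = rank;
        }

        boolean isTerminal() {
            return rank == 2;
        }
    }

    private JobStatus status = JobStatus.PENDING;

    /**
     * Applies an incoming event unless the job is already terminal or the
     * event would move the status backwards. Returns true if it was applied.
     */
    synchronized boolean onEvent(JobStatus incoming) {
        if (status.isTerminal() || incoming.rank < status.rank) {
            return false; // stale event: keep the current status
        }
        status = incoming;
        return true;
    }

    synchronized JobStatus getStatus() {
        return status;
    }

    public static void main(String[] args) {
        SlaStatusGuard job = new SlaStatusGuard();
        job.onEvent(JobStatus.SUCCEEDED); // the "51st" event is processed first
        job.onEvent(JobStatus.RUNNING);   // the "49th" event arrives late and is ignored
        System.out.println(job.getStatus()); // prints SUCCEEDED
    }
}
```

With a guard like this, a late job-started event cannot override a completed status, whichever worker thread happens to process it first.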

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)