[
https://issues.apache.org/jira/browse/OOZIE-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228488#comment-17228488
]
Frank Top Frank commented on OOZIE-3581:
----------------------------------------
Hello, I also ran into the same problem as you, in a production environment
running the same version. Initially, I modified the following parameters:
#oozie.service.ActionCheckerService.action.check.interval ->10
#oozie.service.ActionCheckerService.action.check.delay ->20
#oozie.service.ActionCheckerService.callable.batch.size ->5
After that, all Oozie tasks were up and running without inconsistent-state
issues. However, after several hours, Oozie again showed a large number of
tasks in the RUNNING state although the corresponding tasks on YARN had already
succeeded. I went through the logic and then, with debug logging enabled, found
from the Oozie logs that the code does not continue past the point marked below:
{code:java}
protected Void execute() throws CommandException {
    // If the action is still in PREP, we probably received a callback before Oozie was able to update from PREP to RUNNING;
    // we'll requeue this command a few times and hope that it switches to RUNNING before giving up
    if (this.wfactionBean.getStatus() == WorkflowActionBean.Status.PREP) {
        int maxEarlyRequeueCount = Services.get().get(CallbackService.class).getEarlyRequeueMaxRetries();
        if (this.earlyRequeueCount < maxEarlyRequeueCount) {
            long delay = getRequeueDelay();
            LOG.warn("Received early callback for action still in PREP state; will wait [{0}]ms and requeue up to [{1}] more"
                    + " times", delay, (maxEarlyRequeueCount - earlyRequeueCount));
            queue(new CompletedActionXCommand(this.actionId, this.externalStatus, null, this.getPriority(),
                    this.earlyRequeueCount + 1), delay);
        } else {
            throw new CommandException(ErrorCode.E0822, actionId);
        }
    } else { // RUNNING
        ActionExecutor executor = Services.get().get(ActionService.class).getExecutor(this.wfactionBean.getType());
        // this is done because oozie notifications (of sub-wfs) is send
        // every status change, not only on completion.
        if (executor.isCompleted(externalStatus)) { // *********** Here!! code does not run into queue() ***********
            queue(new ActionCheckXCommand(this.wfactionBean.getId(), getPriority(), -1));
        }
    }
    return null;
}
{code}
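To narrow down why queue() is skipped, the branches of execute() can be modeled
in isolation. The following is only a minimal sketch, not Oozie code: the class,
enum, and method names are hypothetical, the retry limit of 5 is an arbitrary
example value, and the set of terminal statuses is an assumption about what the
executor's isCompleted() accepts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified, self-contained model of the branches in execute() quoted above.
// Every name in this class is hypothetical; none of it is taken from Oozie.
class CallbackBranchSketch {

    enum Outcome { REQUEUE_CALLBACK, FAIL_E0822, QUEUE_ACTION_CHECK, NO_OP }

    // Assumption: the external statuses that the executor treats as terminal.
    static final Set<String> ASSUMED_FINAL_STATUSES =
            new HashSet<>(Arrays.asList("SUCCEEDED", "KILLED", "FAILED"));

    // Stand-in for executor.isCompleted(externalStatus).
    static boolean isCompleted(String externalStatus) {
        return ASSUMED_FINAL_STATUSES.contains(externalStatus);
    }

    // Mirrors the if/else structure of execute(): the PREP branch requeues the
    // callback or fails, and every other state queues an action check only
    // when the external status counts as completed.
    static Outcome decide(boolean actionInPrep, int earlyRequeueCount,
                          int maxEarlyRequeueCount, String externalStatus) {
        if (actionInPrep) {
            return earlyRequeueCount < maxEarlyRequeueCount
                    ? Outcome.REQUEUE_CALLBACK
                    : Outcome.FAIL_E0822;
        }
        return isCompleted(externalStatus)
                ? Outcome.QUEUE_ACTION_CHECK
                : Outcome.NO_OP;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, 0, 5, "SUCCEEDED")); // QUEUE_ACTION_CHECK
        System.out.println(decide(false, 0, 5, "RUNNING"));   // NO_OP
        System.out.println(decide(true, 0, 5, "SUCCEEDED"));  // REQUEUE_CALLBACK
        System.out.println(decide(true, 5, 5, "SUCCEEDED"));  // FAIL_E0822
    }
}
```

In this model the only path that returns without queuing anything is a non-PREP
action whose external status is not considered completed. If that also holds for
the real executor, logging the externalStatus value and the action type just
before the isCompleted() check may show which input makes the check fail.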
> Callback does not applied in Oozie server, workflows stuk in RUNNING states.
> ----------------------------------------------------------------------------
>
> Key: OOZIE-3581
> URL: https://issues.apache.org/jira/browse/OOZIE-3581
> Project: Oozie
> Issue Type: Bug
> Components: action, workflow
> Affects Versions: 4.3.1
> Reporter: Kotsubinsky Victor
> Priority: Critical
>
> oozie version 4.3.1.3.1.0.0-78
> with the HDP 3.1.0 stack; this release provides Oozie 4.3.1 and the additional
> Apache patches listed here:
> [https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/patch_oozie.html]
> I use a kerberized Hadoop cluster and run MR jobs on YARN through Oozie.
> 1. Oozie runs an MR job via YARN.
> 2. After the YARN MR job completes, it successfully sends a callback
> request to Oozie.
> 3. In the Oozie server logs I can see this request, but Oozie does not apply
> this callback request, so the WF action ID still shows the RUNNING state (until
> the action.check process checks the WF IDs and switches the action ID to the
> SUCCESS state).
> LOGS in YARN:
> 2020-01-27 12:16:39,749 INFO [Thread-78] org.eclipse.jetty.util.log: Job end
> notification trying
> http://hdp3-oozie:11000/oozie/callback?id=0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java&status=SUCCEEDED
> 2020-01-27 12:16:39,772 INFO [Thread-78] org.eclipse.jetty.util.log: Job end
> notification to
> http://hdp3-oozie:11000/oozie/callback?id=0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java&status=SUCCEEDED
> succeeded
> 2020-01-27 12:16:39,772 INFO [Thread-78] org.eclipse.jetty.util.log: Job end
> notification succeeded for job_1579778851579_31505
>
> Oozie logs about this event:
> 2020-01-27 12:16:39,770 DEBUG CallbackServlet:526 - SERVER[hdp3-oo-2] USER[-]
> GROUP[-] TOKEN[-] APP[-] JOB[0005607-200123121357414-oozie-oozi-W]
> ACTION[0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java]
> Received a CallbackServlet.doGet() with query string
> id=0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java&status=SUCCEEDED
> 2020-01-27 12:16:39,776 DEBUG CompletedActionXCommand:526 - SERVER[hdp3-oo-2]
> USER[-] GROUP[-] TOKEN[] APP[-] JOB[0005607-200123121357414-oozie-oozi-W]
> ACTION[0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java]
> Execute command [callback] key [null]
> 2020-01-27 12:16:39,776 DEBUG CompletedActionXCommand:526 - SERVER[hdp3-oo-2]
> USER[-] GROUP[-] TOKEN[] APP[-] JOB[0005607-200123121357414-oozie-oozi-W]
> ACTION[0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java]
> Queuing [1] commands with delay [0]ms