[ https://issues.apache.org/jira/browse/OOZIE-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228488#comment-17228488 ]
Frank Top Frank commented on OOZIE-3581:
----------------------------------------

Hello, I also ran into the same problem in our production environment under the same version.

Initially, I modified the following parameters:

#oozie.service.ActionCheckerService.action.check.interval ->10
#oozie.service.ActionCheckerService.action.check.delay ->20
#oozie.service.ActionCheckerService.callable.batch.size ->5

After that, all Oozie tasks were up and running without inconsistent-state issues. However, after several hours of running, Oozie again showed a large number of tasks in the RUNNING state even though the corresponding tasks on YARN had succeeded.

I combed through the logic. Then, with debug mode enabled, I found from the Oozie logs that the code does not continue past the point marked here:

{code:java}
protected Void execute() throws CommandException {
    // If the action is still in PREP, we probably received a callback before Oozie was able to update from PREP to RUNNING;
    // we'll requeue this command a few times and hope that it switches to RUNNING before giving up
    if (this.wfactionBean.getStatus() == WorkflowActionBean.Status.PREP) {
        int maxEarlyRequeueCount = Services.get().get(CallbackService.class).getEarlyRequeueMaxRetries();
        if (this.earlyRequeueCount < maxEarlyRequeueCount) {
            long delay = getRequeueDelay();
            LOG.warn("Received early callback for action still in PREP state; will wait [{0}]ms and requeue up to [{1}] more"
                    + " times", delay, (maxEarlyRequeueCount - earlyRequeueCount));
            queue(new CompletedActionXCommand(this.actionId, this.externalStatus, null, this.getPriority(),
                    this.earlyRequeueCount + 1), delay);
        } else {
            throw new CommandException(ErrorCode.E0822, actionId);
        }
    } else { // RUNNING
        ActionExecutor executor = Services.get().get(ActionService.class).getExecutor(this.wfactionBean.getType());
        // this is done because oozie notifications (of sub-wfs) is send
        // every status change, not only on completion.
        if (executor.isCompleted(externalStatus)) { // *********** Here!! code does not run into queue() ***********
            queue(new ActionCheckXCommand(this.wfactionBean.getId(), getPriority(), -1));
        }
    }
    return null;
}
{code}

> Callback does not applied in Oozie server, workflows stuk in RUNNING states.
> ----------------------------------------------------------------------------
>
>                 Key: OOZIE-3581
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3581
>             Project: Oozie
>          Issue Type: Bug
>          Components: action, workflow
>    Affects Versions: 4.3.1
>            Reporter: Kotsubinsky Victor
>            Priority: Critical
>
> Oozie version 4.3.1.3.1.0.0-78 with the HDP 3.1.0 stack; the release provides Oozie 4.3.1 and the additional Apache patches listed here:
> [https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/patch_oozie.html]
> I use a Kerberized Hadoop cluster and run YARN MR jobs through Oozie.
> 1. Oozie runs an MR job via YARN.
> 2. After the YARN MR job completes, it successfully sends a callback request to Oozie.
> 3. In the Oozie server logs I can see this request, but Oozie does not apply it, so the workflow action stays in the RUNNING state (until the action check process checks the workflow IDs and switches the action to the SUCCESS state).
> LOGS in YARN:
> 2020-01-27 12:16:39,749 INFO [Thread-78] org.eclipse.jetty.util.log: Job end notification trying http://hdp3-oozie:11000/oozie/callback?id=0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java&status=SUCCEEDED
> 2020-01-27 12:16:39,772 INFO [Thread-78] org.eclipse.jetty.util.log: Job end notification to http://hdp3-oozie:11000/oozie/callback?id=0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java&status=SUCCEEDED succeeded
> 2020-01-27 12:16:39,772 INFO [Thread-78] org.eclipse.jetty.util.log: Job end notification succeeded for job_1579778851579_31505
>
> Oozie logs about this event:
> 2020-01-27 12:16:39,770 DEBUG CallbackServlet:526 - SERVER[hdp3-oo-2] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0005607-200123121357414-oozie-oozi-W] ACTION[0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java] Received a CallbackServlet.doGet() with query string id=0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java&status=SUCCEEDED
> 2020-01-27 12:16:39,776 DEBUG CompletedActionXCommand:526 - SERVER[hdp3-oo-2] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0005607-200123121357414-oozie-oozi-W] ACTION[0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java] Execute command [callback] key [null]
> 2020-01-27 12:16:39,776 DEBUG CompletedActionXCommand:526 - SERVER[hdp3-oo-2] USER[-] GROUP[-] TOKEN[] APP[-] JOB[0005607-200123121357414-oozie-oozi-W] ACTION[0005607-200123121357414-oozie-oozi-W@rdb-full-table-extract-java] Queuing [1] commands with delay [0]ms

--
This message was sent by Atlassian Jira
(v8.3.4#803005)