[ 
https://issues.apache.org/jira/browse/OOZIE-2985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218677#comment-16218677
 ] 

Peter Bacsko commented on OOZIE-2985:
-------------------------------------

The problem is that you don't always know how long you have to wait. On a 
busy cluster, it can take several seconds or even longer for the application 
to start running. Until that happens, the application is stuck in the 
ACCEPTED state.

I can imagine a way to work around this problem: on a best-effort basis, we 
wait a couple of seconds for the application to reach the RUNNING state, but 
we can't block for too long. If it does not get scheduled within, let's say, 
5 seconds, we give up and move on.
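
To make the idea concrete, here is a rough sketch of such a bounded wait. It 
is not Oozie code: the {{AppState}} enum is a simplified stand-in for YARN's 
application states, and the {{stateSource}} supplier is a hypothetical 
placeholder for whatever call would actually fetch the application's state.

```java
import java.time.Duration;
import java.util.function.Supplier;

public class BestEffortWait {

    /** Simplified stand-in for YARN's application states (assumption, not the real enum). */
    public enum AppState { ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    /**
     * Polls stateSource until the application leaves ACCEPTED or the timeout expires.
     * Returns the last state observed; ACCEPTED means we gave up waiting.
     */
    public static AppState waitUntilScheduled(Supplier<AppState> stateSource,
                                              Duration timeout,
                                              Duration pollInterval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        AppState state = stateSource.get();
        // Keep polling only while the application is still waiting to be scheduled.
        while (state == AppState.ACCEPTED && deadline - System.nanoTime() > 0) {
            Thread.sleep(pollInterval.toMillis());
            state = stateSource.get();
        }
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated application that reports RUNNING from the third poll onwards.
        int[] polls = {0};
        Supplier<AppState> fake =
                () -> ++polls[0] >= 3 ? AppState.RUNNING : AppState.ACCEPTED;
        AppState result =
                waitUntilScheduled(fake, Duration.ofSeconds(5), Duration.ofMillis(100));
        System.out.println("State after best-effort wait: " + result);
    }
}
```

If the method returns ACCEPTED, the caller simply moves on and lets the 
regular status check pick up the result later, so the wait never blocks for 
more than roughly the timeout plus one poll interval.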

But this is not the kind of solution I'd really want to implement. The best 
option would be some sort of callback mechanism from YARN, but I don't think 
that is supported.

> If LauncherAM fails, Oozie is not notified in a timely manner
> -------------------------------------------------------------
>
>                 Key: OOZIE-2985
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2985
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Attila Sasvari
>
> I've noticed that if LauncherAM fails, Oozie is notified about the 
> launcher's failure only after a long delay. This gives the impression that 
> the workflow is still running.
> {{oozie job -oozie http://localhost:11000/oozie -config 
> examples/apps/datelist-java-main/job.properties  -info  
> 0000000-170712153835057-oozie-asas-W}}
> {code}
> 0000000-170712153835057-oozie-asas-W@java1                                    
> RUNNING   application_1499866588585_0001RUNNING    -         
> {code}
> I've looked at the YARN logs for the launcher and seen that the launcher 
> failed. For example, in my case, during development, the oozie-sharelib 
> launcher was not found:
> {code}
> Error: Could not find or load main class 
> org.apache.oozie.action.hadoop.LauncherAM
> {code}
> The problem is that only after the specified timeout (10 minutes by 
> default) do we see that the workflow has actually failed/errored.
> {code}
> Created       : 2017-07-12 13:38 GMT
> Started       : 2017-07-12 13:38 GMT
> Last Modified : 2017-07-12 13:49 GMT
> ...
> 0000000-170712153835057-oozie-asas-W@java1                                    
> ERROR     application_1499866588585_0001FAILED/KILLED-         
> {code} 
> The problem might be that in the {{start()}} method of 
> {{JavaActionExecutor}} the check runs too soon after submission.
> {code}
> LOG.debug("Starting action " + action.getId() + " getting Action File System");
> FileSystem actionFs = context.getAppFileSystem();
> LOG.debug("Preparing action Dir through copying " + context.getActionDir());
> prepareActionDir(actionFs, context);
> LOG.debug("Action Dir is ready. Submitting the action ");
> submitLauncher(actionFs, context, action);
> LOG.debug("Action submit completed. Performing check ");
> check(context, action);
> LOG.debug("Action check is done after submission");
> {code}
> There should be some delay after {{submitLauncher()}} before {{check()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
