[ 
https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177872#comment-15177872
 ] 

Jason Lowe commented on YARN-4744:
----------------------------------

Thanks for the patch!

bq. In addition, logging in PrivilegedOperationExecutor includes information 
that isn't necessarily available when the exception is propagated. 

That problem is solved by having the throwing code either encode that 
information in the exception message or add the necessary fields to the 
exception class, allowing the error handler to retrieve them as needed.  If the 
throwing code can create an appropriate log message then it can put that same 
information in the exception.  There's already a custom exception for these 
errors, so it would be easy to add things like the full command line, etc.  I 
still think the code handling the error is the real problem if we're missing 
appropriate logs, but I don't feel strongly enough to block the patch if others 
prefer leaving the log-then-throw logic in place.
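
A minimal sketch of that idea, not the actual Hadoop code: 
CommandFailedException, CommandRunner and SignalHandler are hypothetical names 
standing in for the existing custom exception, the executor and the error 
handler, and the exit code and output are passed in rather than coming from a 
real shell invocation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the existing custom exception. The extra fields
// and accessors are assumptions for illustration, not the real Hadoop API.
class CommandFailedException extends Exception {
  private final int exitCode;
  private final List<String> fullCommand;
  private final String output;

  CommandFailedException(String message, int exitCode,
      List<String> fullCommand, String output) {
    super(message);
    this.exitCode = exitCode;
    // Defensive copy so later changes to the caller's list don't leak in.
    this.fullCommand =
        Collections.unmodifiableList(new ArrayList<>(fullCommand));
    this.output = output;
  }

  int getExitCode() { return exitCode; }
  List<String> getFullCommand() { return fullCommand; }
  String getOutput() { return output; }
}

// Hypothetical executor: it does not log, it only throws with full context.
class CommandRunner {
  static void run(List<String> command, int simulatedExitCode,
      String simulatedOutput) throws CommandFailedException {
    if (simulatedExitCode != 0) {
      throw new CommandFailedException(
          "Shell execution returned exit code: " + simulatedExitCode,
          simulatedExitCode, command, simulatedOutput);
    }
  }
}

// Hypothetical error handler: it decides whether and how to log, using the
// details carried by the exception.
class SignalHandler {
  static String describeFailure(CommandFailedException e) {
    return e.getMessage()
        + ". Output: " + e.getOutput()
        + ". Full command array: " + e.getFullCommand();
  }
}
```

With this shape the handler can choose the log level, or skip logging 
entirely for failures it expects, without losing the command line or output.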

Comments on the patch:

Don't we need to update DockerLinuxContainerRuntime in a similar manner?  I 
think we'll have the same issue there.

PrivilegedOperation should have a constructor that just takes an opType 
parameter and the other constructors should be implemented in terms of it.  
That eliminates the duplicate code maintenance pitfalls and avoids doing odd 
things like passing nulls as standard practice.
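
A rough sketch of the constructor layout I have in mind, using a simplified 
hypothetical Operation class rather than the real PrivilegedOperation (the 
enum values, field names and accessors here are assumptions):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified, hypothetical model of an operation with optional arguments.
class Operation {
  enum OperationType { LAUNCH_CONTAINER, SIGNAL_CONTAINER }

  private final OperationType opType;
  private final List<String> arguments = new ArrayList<>();

  // Base constructor: the only place the common state is initialized.
  Operation(OperationType opType) {
    this.opType = opType;
  }

  // Convenience constructor implemented in terms of the base one.
  Operation(OperationType opType, String arg) {
    this(opType);
    if (arg != null) {
      this.arguments.add(arg);
    }
  }

  // Convenience constructor implemented in terms of the base one.
  Operation(OperationType opType, List<String> args) {
    this(opType);
    if (args != null) {
      this.arguments.addAll(args);
    }
  }

  OperationType getOperationType() { return opType; }

  List<String> getArguments() {
    return Collections.unmodifiableList(arguments);
  }
}
```

Callers with no arguments then write new Operation(opType) instead of passing 
a null, and any future common initialization only has to be added in one place.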


> Too many signal to container failure in case of LCE
> ---------------------------------------------------
>
>                 Key: YARN-4744
>                 URL: https://issues.apache.org/jira/browse/YARN-4744
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.0
>            Reporter: Bibin A Chundatt
>            Assignee: Sidharta Seethana
>         Attachments: YARN-4744.001.patch
>
>
> Install HA cluster in secure mode
> Enable LCE with cgroups
> Start server with dsperf user
> Submit mapreduce application terasort/teragen with user yarn/dsperf 
> Too many signal to container failure 
> Submit with user the exception is thrown
> {noformat}
> 2014-03-02 09:20:38,689 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for testing (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB
> 2014-03-02 09:20:40,158 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Event EventType: KILL_CONTAINER sent to absent container 
> container_e02_1393731146548_0001_01_000013
> 2014-03-02 09:20:43,071 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container container_e02_1393731146548_0001_01_000009 succeeded
> 2014-03-02 09:20:43,072 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_000009 transitioned from 
> RUNNING to EXITED_WITH_SUCCESS
> 2014-03-02 09:20:43,073 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_e02_1393731146548_0001_01_000009
> 2014-03-02 09:20:43,075 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime:
>  Using container runtime: DefaultLinuxContainerRuntime
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  Shell execution returned exit code: 9. Privileged Execution Operation Output:
> main : command provided 2
> main : run as user is yarn
> main : requested yarn user is yarn
> Full command array for failed execution:
> [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor,
>  yarn, yarn, 2, 9370, 15]
> 2014-03-02 09:20:43,081 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime:
>  Signal container failed. Exception:
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=9:
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: ExitCodeException exitCode=9:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
>         at org.apache.hadoop.util.Shell.run(Shell.java:838)
>         at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
>         ... 9 more
> 2014-03-02 09:20:43,113 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=yarn 
> OPERATION=Container Finished - Succeeded        TARGET=ContainerImpl    
> RESULT=SUCCESS  APPID=application_1393731146548_0001    
> CONTAINERID=container_e02_1393731146548_0001_01_000009
> 2014-03-02 09:20:43,115 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_e02_1393731146548_0001_01_000009 transitioned from 
> EXITED_WITH_SUCCESS to DONE
> 2014-03-02 09:20:43,115 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_e02_1393731146548_0001_01_000009 from application 
> application_1393731146548_0001
> {noformat}
> Checked the same scenario in 2.7.2 version (not available)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
