[ 
https://issues.apache.org/jira/browse/STORM-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076694#comment-17076694
 ] 

Aaron Gresch commented on STORM-3606:
-------------------------------------

Users will see workers restart due to an NPE if they upload credentials before 
the TGT renewal thread runs the kinit -R command.
{code:java}
2020-04-01 14:36:53.005 o.a.s.u.Utils TGT Renewer for XXX [ERROR] Received error in thread TGT Renewer for XXX.. terminating server...
java.lang.Error: java.lang.NullPointerException
        at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:694) ~[storm-client-2.2.0.y.jar:2.2.0.y]
        at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:673) ~[storm-client-2.2.0.y.jar:2.2.0.y]
        at org.apache.storm.utils.Utils.lambda$createDefaultUncaughtExceptionHandler$2(Utils.java:1055) ~[storm-client-2.2.0.y.jar:2.2.0.y]
        at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java)
        at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java)
        at java.lang.Thread.dispatchUncaughtException(Thread.java)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:1031) ~[stormjar.jar: ?]
        at java.lang.Thread.run(Thread.java)
2020-04-01 14:36:53.018 o.a.s.u.Utils Thread-23 [INFO] Halting after 3 seconds
2020-04-01 14:36:53.019 o.a.s.d.w.Worker Thread-24 [INFO] Shutting down worker XXX
{code}
Sequence:

1) Hadoop thread grabs the initial TGT:

https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L978

2) Hadoop then sleeps:

https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L992

3) kinit -R runs:

https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L994

The kinit will fail and generate an IOException. Then we get to this line that 
accesses the original TGT:

https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L1014

But since we cleared the credentials when new credentials were uploaded, this 
access causes the NPE, which then restarts the worker.
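
The sequence above can be modelled with a small, self-contained sketch. This is 
not the Hadoop code: FakeTicket, kinitRenew and renewOnce are hypothetical 
names, and the sketch assumes that clearing credentials leaves the renewal 
thread holding its original ticket object with no end time left to read.

```java
import java.io.IOException;
import java.util.Date;

public class TgtRenewalRaceSketch {

    /** Hypothetical stand-in for a Kerberos ticket; clearing it drops its end time. */
    static final class FakeTicket {
        private Date endTime;

        FakeTicket(Date endTime) {
            this.endTime = endTime;
        }

        Date getEndTime() {
            return endTime;
        }

        void clear() {
            endTime = null;
        }
    }

    /** Stand-in for "kinit -R": always fails, as it would without a local ticket cache. */
    static void kinitRenew() throws IOException {
        throw new IOException("kinit: credentials cache file not found");
    }

    /**
     * One pass of a simplified renewal loop. The whileSleeping hook models
     * whatever else happens in the worker while the renewal thread sleeps.
     */
    static String renewOnce(FakeTicket tgt, Runnable whileSleeping) {
        // 1) The thread already holds the initial TGT (passed in as tgt).
        // 2) Sleep until the refresh time; other activity can happen here.
        whileSleeping.run();
        try {
            // 3) Attempt the renewal.
            kinitRenew();
            return "renewed";
        } catch (IOException e) {
            // 4) The failure handler reads the end time of the ORIGINAL ticket.
            //    If the ticket was cleared in step 2, getEndTime() is null and
            //    this line throws a NullPointerException.
            long end = tgt.getEndTime().getTime();
            return "renewal failed, ticket ends at " + end;
        }
    }

    public static void main(String[] args) {
        // No credential upload during the sleep: the failure is handled normally.
        FakeTicket untouched = new FakeTicket(new Date(1000L));
        System.out.println(renewOnce(untouched, () -> { }));

        // Credentials cleared during the sleep: the failure handler throws.
        FakeTicket uploaded = new FakeTicket(new Date(1000L));
        try {
            renewOnce(uploaded, uploaded::clear);
        } catch (NullPointerException npe) {
            System.out.println("NPE escaped the renewal pass");
        }
    }
}
```

In the worker the exception is not caught as it is in main here; per the log 
above it reaches the uncaught exception handler, which halts the worker.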


> AutoTGT shouldn't invoke TGT renewal thread (from 
> UserGroupInformation.loginUserFromSubject)
> --------------------------------------------------------------------------------------------
>
>                 Key: STORM-3606
>                 URL: https://issues.apache.org/jira/browse/STORM-3606
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.2.3, 2.1.0
>            Reporter: Ethan Li
>            Assignee: Aaron Gresch
>            Priority: Minor
>
> When hadoop security is enabled, AutoTGT 
> (https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/security/auth/kerberos/AutoTGT.java#L199-L209)
> will invoke "loginUserFromSubject", which spawns a TGT renewal thread 
> ("TGT Renewer for <username>"): 
> https://github.com/apache/hadoop/blob/branch-2.8.5/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L928-L957
> That thread will eventually invoke the system command "kinit -R" and then 
> fail with the exception
> {code:java}
> org.apache.hadoop.util.Shell$ExitCodeException: kinit: Credentials cache file '/tmp/krb5cc_xxx' not found while renewing credentials
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:1004) ~[stormjar.jar:?]
>       at org.apache.hadoop.util.Shell.run(Shell.java:898) ~[stormjar.jar:?]
>       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) ~[stormjar.jar:?]
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307) ~[stormjar.jar:?]
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289) ~[stormjar.jar:?]
>       at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:1011) [stormjar.jar:?]
>       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
> {code}
> "kinit" will never work from a worker process since Storm doesn't keep the 
> TGT in a local cache. Instead, the TGT is saved in ZooKeeper and in the 
> memory of the worker process. 
> This exception is confusing but not harmful to topologies, and the TGT 
> renewal thread will eventually abort. 
> It would be better to find a real solution for it, but for now we can 
> document what might happen in the AutoTGT code.
> To be clear, we still need loginUserFromSubject or something of the sort, 
> but we don't want to spawn the TGT renewal thread. This was found with 
> hadoop-2.8.5. Other versions are similar, but this can change in future 
> releases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
