[ https://issues.apache.org/jira/browse/HADOOP-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855543#comment-15855543 ]
Duo Zhang commented on HADOOP-13433:
------------------------------------

Any other concerns on the patches for branch-2.8 and branch-2.7? [~xiaochen] [~ste...@apache.org]. Thanks.

> Race in UGI.reloginFromKeytab
> -----------------------------
>
>                 Key: HADOOP-13433
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13433
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>             Fix For: 2.9.0, 3.0.0-alpha3
>
>         Attachments: HADOOP-13433-branch-2.7.patch,
> HADOOP-13433-branch-2.7-v1.patch, HADOOP-13433-branch-2.8.patch,
> HADOOP-13433-branch-2.8.patch, HADOOP-13433-branch-2.patch,
> HADOOP-13433.patch, HADOOP-13433-v1.patch, HADOOP-13433-v2.patch,
> HADOOP-13433-v4.patch, HADOOP-13433-v5.patch, HADOOP-13433-v6.patch,
> HBASE-13433-testcase-v3.patch
>
>
> This is a problem that has troubled us for several years. In our HBase
> cluster, the RS sometimes gets stuck due to
> {noformat}
> 2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
>       at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
>       at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
>       at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
>       at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
>       at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
>       at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
>       at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
>       at org.apache.hadoop.hbase.security.User.call(User.java:607)
>       at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
>       at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
>       at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
>       at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
>       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
>       at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
>       at $Proxy24.replicateLogEntries(Unknown Source)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
> Caused by: GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)
>       at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
>       at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
>       at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
>       at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
>       ... 23 more
> Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
>       at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
>       at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
>       at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
>       at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
>       at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
>       at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:594)
>       ... 26 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>       at sun.security.krb5.internal.KDCRep.init(KDCRep.java:133)
>       at sun.security.krb5.internal.TGSRep.init(TGSRep.java:58)
>       at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:53)
>       at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:46)
>       ... 31 more
> {noformat}
> It rarely happens, but when it does, the regionserver gets stuck and can
> never recover.
> Recently we added a log after a successful re-login which prints the private
> credentials, and finally caught the direct cause. After a successful
> re-login, we have two Kerberos tickets in the credentials: one is the TGT,
> and the other is a service ticket. The strange thing is that the service
> ticket is placed before the TGT. This breaks an assumption of the JDK's
> Kerberos library. See
> http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5InitCredential.java,
> the {{getTgt}} method
> {code:title=Krb5InitCredential}
>             return AccessController.doPrivileged(
>                 new PrivilegedExceptionAction<KerberosTicket>() {
>                 public KerberosTicket run() throws Exception {
>                     // It's OK to use null as serverPrincipal. TGT is almost
>                     // the first ticket for a principal and we use list.
>                     return Krb5Util.getTicket(
>                         realCaller,
>                         clientPrincipal, null, acc);
>                         }});
> {code}
> So here, the library will use the service ticket as the TGT to acquire a
> service ticket, and the KDC will reject the request since the 'TGT' does not
> start with 'krbtgt'. And it can never recover, because in UGI the re-login
> first checks whether there is a valid TGT, and no doubt we have one...
> This usually happens when a secure connection initialization coincides with
> the re-login, and the end time indicates that the service ticket was acquired
> with the previous TGT. Since UGI does not prevent doAs and re-login from
> happening at the same time, we believe that there is a race condition.
> After reading the code, we found a possible race condition.
> See
> http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5Context.java,
> the {{initSecContext}} method: we get the TGT first, then check if there is
> already a service ticket; if not, we acquire a service ticket using the TGT
> and put it into the credentials.
> And in Krb5LoginModule.logout (the Sun version), we remove the Kerberos
> tickets from the credentials first, and then destroy them.
> Here comes the race condition. Let T1 be the secure connection setup thread
> and T2 be the re-login thread.
> T1: get TGT
> T2: remove all tickets from credentials
> T1: check service ticket, none (since all tickets have been removed)
> T1: acquire a new service ticket using TGT and put it into the credentials
> T2: destroy all tickets
> T2: login, i.e., put a new TGT into the credentials.
> It is hard to write a UT to reproduce the problem because the racing code is
> in the JDK, which is not written by us...
> Suggestions are welcome. Thanks.
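
The T1/T2 interleaving described above can be replayed deterministically with a toy model, which may help when reasoning about the ticket order. This is only a sketch under stated assumptions: `Ticket`, `firstTicket`, and `replayRace` are hypothetical names, a plain list stands in for the Subject's private credentials, and no JDK Kerberos class is called. It also assumes that a lookup with a null server principal returns the first ticket in the list, as the quoted {{getTgt}} comment suggests.

```java
import java.util.ArrayList;
import java.util.List;

public class ReloginRaceSketch {

    /** Hypothetical stand-in for a Kerberos ticket; only the server name matters here. */
    static final class Ticket {
        final String server;
        Ticket(String server) { this.server = server; }
        boolean isTgt() { return server.startsWith("krbtgt/"); }
    }

    /** Assumed behavior of a null-server lookup: return the first ticket in the list. */
    static Ticket firstTicket(List<Ticket> creds) {
        return creds.isEmpty() ? null : creds.get(0);
    }

    /** Replays the T1/T2 steps in the order given in the report, on a single thread. */
    static List<Ticket> replayRace() {
        List<Ticket> creds = new ArrayList<>();
        creds.add(new Ticket("krbtgt/EXAMPLE.COM"));        // state before the re-login

        Ticket oldTgt = firstTicket(creds);                 // T1: get TGT

        List<Ticket> removed = new ArrayList<>(creds);      // T2: remove all tickets
        creds.clear();

        boolean hasServiceTicket = false;                   // T1: check service ticket
        for (Ticket t : creds) {
            if (!t.isTgt()) hasServiceTicket = true;
        }
        if (!hasServiceTicket && oldTgt != null) {          // T1: acquire with the old TGT
            creds.add(new Ticket("hbase/rs1.example.com"));
        }

        removed.clear();                                    // T2: destroy the removed tickets
        creds.add(new Ticket("krbtgt/EXAMPLE.COM"));        // T2: login, new TGT goes last
        return creds;
    }

    public static void main(String[] args) {
        List<Ticket> creds = replayRace();
        Ticket picked = firstTicket(creds);
        System.out.println("tickets: " + creds.size());
        System.out.println("first ticket: " + picked.server + ", isTgt=" + picked.isTgt());
    }
}
```

In this model the credentials end up holding the service ticket ahead of the new TGT, so a first-ticket lookup hands back the service ticket, which matches the ordering seen in the logged credentials.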