Daniel Osvath created HDFS-16165:
------------------------------------

             Summary: Backport the Hadoop 3.x Kerberos synchronization fix to 
Hadoop 2.x
                 Key: HDFS-16165
                 URL: https://issues.apache.org/jira/browse/HDFS-16165
             Project: Hadoop HDFS
          Issue Type: Wish
         Environment: Can be reproduced in docker HDFS environment with 
Kerberos 
https://github.com/vdesabou/kafka-docker-playground/blob/93a93de293ad2f9bb22afb244f2d8729a178296e/connect/connect-hdfs2-sink/hdfs2-sink-ha-kerberos-repro-gss-exception.sh
            Reporter: Daniel Osvath


*Problem Description*

For more than a year Apache Kafka Connect users have been running into a 
Kerberos renewal issue that causes our HDFS2 connectors to fail. 

We have been able to consistently reproduce the issue under high load with 40 
connectors (threads) that use the library. When we use an alternate workaround 
that relies on the Kerberos keytab on the system, the connector operates without 
issues.

We identified the root cause to be a race condition bug in the Hadoop 2.x 
library that causes the ticket renewal to fail with the error below: 


{code:java}
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)]
 at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
{code}
We reached the conclusion about the root cause once we tried the same 
environment (40 connectors) with Hadoop 3.x and our HDFS3 connectors, which 
operated without renewal issues. Additionally, identifying that the 
synchronization issue has been fixed in the newer Hadoop 3.x releases 
confirmed our hypothesis about the root cause.

There are many changes in the Hadoop 3.x 
[UserGroupInformation.java|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java]
 related to UGI synchronization, which were done as part of 
https://issues.apache.org/jira/browse/HADOOP-9747. Those changes suggest 
that race conditions were occurring in the older versions, i.e., Hadoop 2.x, 
which would explain why we can reproduce the problem with HDFS2.
For example (among others):
{code:java}
  private void relogin(HadoopLoginContext login, boolean ignoreLastLoginTime)
      throws IOException {
    // ensure the relogin is atomic to avoid leaving credentials in an
    // inconsistent state.  prevents other ugi instances, SASL, and SPNEGO
    // from accessing or altering credentials during the relogin.
    synchronized(login.getSubjectLock()) {
      // another racing thread may have beat us to the relogin.
      if (login == getLogin()) {
        unprotectedRelogin(login, ignoreLastLoginTime);
      }
    }
  }
{code}
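To make the fix concrete, here is a minimal standalone sketch (hypothetical class and field names, not Hadoop's actual API) of the lock-then-re-check pattern shown above: each thread captures the login it observed, and only the first thread to acquire the subject lock performs the relogin, while the racing threads see a newer login and skip.

```java
// Minimal sketch of the Hadoop 3.x "synchronize, then re-check" relogin
// pattern. Names (ReloginSketch, loginGeneration) are illustrative only.
public class ReloginSketch {
    private final Object subjectLock = new Object();
    private volatile long loginGeneration = 0; // stands in for getLogin() identity
    private int reloginCount = 0;              // counts actual credential refreshes

    // Relogin only if no other thread has already replaced the login we observed.
    void relogin(long observedGeneration) {
        synchronized (subjectLock) {
            // another racing thread may have beat us to the relogin
            if (observedGeneration == loginGeneration) {
                loginGeneration++; // install the "new login"
                reloginCount++;    // the actual credential refresh would happen here
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReloginSketch ugi = new ReloginSketch();
        final long observed = ugi.loginGeneration; // all threads see the same stale login
        Thread[] threads = new Thread[40];         // mimic 40 connector threads
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> ugi.relogin(observed));
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        // Exactly one thread relogins; the rest skip after re-checking under the lock.
        System.out.println("relogins=" + ugi.reloginCount);
    }
}
```

Without the `synchronized` block and the re-check, several of the 40 threads could refresh credentials concurrently and leave the Subject in an inconsistent state, which matches the GSSException we observe on Hadoop 2.x.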
None of those changes were backported to Hadoop 2.x (our HDFS2 connector uses 
2.10.1), on which several CDH distributions are based. 

*Request*
We would like to ask for the synchronization fix to be backported to Hadoop 2.x 
so that our users can operate without issues. 

*Impact*
The older Hadoop 2.x version is used by our HDFS connector, which is used in 
production by our community. Currently, the issue causes our HDFS connector to 
fail, as it is unable to recover and renew the ticket at a later point. Having 
the fix backported would allow our users to operate without issues that require 
manual intervention every week (or every few days in some cases). The only 
workarounds available to the community are to run a command manually or restart 
their workers. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
