[ 
https://issues.apache.org/jira/browse/HDFS-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Osvath updated HDFS-16165:
---------------------------------
    Comment: was deleted

(was: This request is on behalf of [Confluent, Inc|http://confluent.io/].)

> Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
> ------------------------------------------------------------------
>
>                 Key: HDFS-16165
>                 URL: https://issues.apache.org/jira/browse/HDFS-16165
>             Project: Hadoop HDFS
>          Issue Type: Wish
>         Environment: Can be reproduced in docker HDFS environment with 
> Kerberos 
> https://github.com/vdesabou/kafka-docker-playground/blob/93a93de293ad2f9bb22afb244f2d8729a178296e/connect/connect-hdfs2-sink/hdfs2-sink-ha-kerberos-repro-gss-exception.sh
>            Reporter: Daniel Osvath
>            Priority: Major
>              Labels: Confluent
>
> *Problem Description*
> For more than a year Apache Kafka Connect users have been running into a 
> Kerberos renewal issue that causes our HDFS2 connectors to fail. 
> We have been able to consistently reproduce the issue under high load with 40 
> connectors (threads) that use the library. When we instead apply an alternate 
> workaround that uses the Kerberos keytab on the system, the connector operates 
> without issues.
> We identified the root cause to be a race condition bug in the Hadoop 2.x 
> library that causes the ticket renewal to fail with the error below: 
> {code:java}
> Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>  at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> {code}
> We reached this conclusion about the root cause after running the same 
> environment (40 connectors) with Hadoop 3.x, where our HDFS3 connectors 
> operated without renewal issues. Finding that the synchronization issue has 
> been fixed in the newer Hadoop 3.x releases further confirmed our hypothesis 
> about the root cause.
> There are many changes in HDFS 3 
> [UserGroupInformation.java|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java]
>  related to UGI synchronization, made as part of 
> https://issues.apache.org/jira/browse/HADOOP-9747. Those changes suggest 
> that race conditions were occurring in older versions, i.e. HDFS 2.x, which 
> would explain why we can reproduce the problem with HDFS2.
> For example (among others):
> {code:java}
>   private void relogin(HadoopLoginContext login, boolean ignoreLastLoginTime)
>       throws IOException {
>     // ensure the relogin is atomic to avoid leaving credentials in an
>     // inconsistent state.  prevents other ugi instances, SASL, and SPNEGO
>     // from accessing or altering credentials during the relogin.
>     synchronized(login.getSubjectLock()) {
>       // another racing thread may have beat us to the relogin.
>       if (login == getLogin()) {
>         unprotectedRelogin(login, ignoreLastLoginTime);
>       }
>     }
>   }
> {code}
> None of those changes were backported to Hadoop 2.x (our HDFS2 connector uses 
> 2.10.1), on which several CDH distributions are based. 
> *Request*
> We would like to ask for the synchronization fix to be backported to Hadoop 
> 2.x so that our users can operate without issues. 
> *Impact*
> The older Hadoop 2.x version is used by our HDFS connector, which is used in 
> production by our community. Currently, the issue causes our HDFS connector 
> to fail, as it is unable to recover and renew the ticket at a later point. 
> Having the backported fix would allow our users to operate without issues 
> that require manual intervention every week (or every few days in some 
> cases). The only workaround available to the community is to run a command 
> or restart their workers. 
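
For readers unfamiliar with the pattern in the quoted relogin excerpt, here is a minimal standalone sketch of it. This is not Hadoop code: the class ReloginSketch, the Login stand-in, and the simulate helper are hypothetical names introduced only for illustration. It assumes 40 threads that all observed the same stale login (mirroring the 40 connectors in the report) and shows how an identity check under a shared lock lets only the first thread perform the relogin.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ReloginSketch {
    /** Hypothetical stand-in for a login context; not a Hadoop class. */
    static final class Login {
        final int generation;
        Login(int generation) { this.generation = generation; }
    }

    // Assumption: a single lock shared by every thread that touches the login.
    private final Object subjectLock = new Object();
    private volatile Login current = new Login(0);
    private final AtomicInteger reloginCount = new AtomicInteger();

    Login getLogin() { return current; }

    int reloginCount() { return reloginCount.get(); }

    /** Replaces the login only if no other thread has already replaced it. */
    void relogin(Login observed) {
        synchronized (subjectLock) {
            // Another racing thread may have beaten us to the relogin.
            if (observed == current) {
                current = new Login(observed.generation + 1);
                reloginCount.incrementAndGet();
            }
        }
    }

    /** Starts the given number of threads that all try to renew the same stale login. */
    static int simulate(int threads) {
        ReloginSketch ugi = new ReloginSketch();
        Login stale = ugi.getLogin();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all threads at once to provoke the race
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                ugi.relogin(stale);
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("worker threads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return ugi.reloginCount();
    }

    public static void main(String[] args) {
        // prints "relogins performed: 1"
        System.out.println("relogins performed: " + simulate(40));
    }
}
```

Without the synchronized block and the identity check, several threads could each replace the login while others are still reading it, which is the kind of inconsistent credential state the issue attributes to Hadoop 2.x.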



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
