[jira] [Updated] (HDFS-16165) Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
[ https://issues.apache.org/jira/browse/HDFS-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Masatake Iwasaki updated HDFS-16165:
------------------------------------
    Target Version/s: 2.10.3  (was: 2.10.2)

> Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
> ------------------------------------------------------------------
>
>                 Key: HDFS-16165
>                 URL: https://issues.apache.org/jira/browse/HDFS-16165
>             Project: Hadoop HDFS
>          Issue Type: Wish
>         Environment: Can be reproduced in docker HDFS environment with Kerberos
> https://github.com/vdesabou/kafka-docker-playground/blob/93a93de293ad2f9bb22afb244f2d8729a178296e/connect/connect-hdfs2-sink/hdfs2-sink-ha-kerberos-repro-gss-exception.sh
>            Reporter: Daniel Osvath
>            Priority: Major
>              Labels: Confluent
>
> *Problem Description*
> For more than a year, Apache Kafka Connect users have been running into a
> Kerberos renewal issue that causes our HDFS2 connectors to fail.
> We have been able to consistently reproduce the issue under high load with 40
> connectors (threads) that use the library. When we instead use a workaround
> that relies on the Kerberos keytab on the system, the connector operates
> without issues.
> We identified the root cause to be a race condition bug in the Hadoop 2.x
> library that causes the ticket renewal to fail with the error below:
> {code:java}
> Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>     at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> {code}
> We reached this conclusion after running the same environment (40 connectors)
> with Hadoop 3.x and our HDFS3 connectors, which operated without renewal
> issues. Additionally, finding that the synchronization issue has been fixed
> in the newer Hadoop 3.x releases confirmed our hypothesis about the root
> cause.
> There are many changes in HDFS 3
> [UserGroupInformation.java|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java]
> related to UGI synchronization, made as part of
> https://issues.apache.org/jira/browse/HADOOP-9747, and those changes suggest
> that race conditions were occurring in older versions, i.e. HDFS 2.x, which
> would explain why we can reproduce the problem with HDFS2.
> For example (among others):
> {code:java}
> private void relogin(HadoopLoginContext login, boolean ignoreLastLoginTime)
>     throws IOException {
>   // ensure the relogin is atomic to avoid leaving credentials in an
>   // inconsistent state. prevents other ugi instances, SASL, and SPNEGO
>   // from accessing or altering credentials during the relogin.
>   synchronized(login.getSubjectLock()) {
>     // another racing thread may have beat us to the relogin.
>     if (login == getLogin()) {
>       unprotectedRelogin(login, ignoreLastLoginTime);
>     }
>   }
> }
> {code}
> None of those changes were backported to Hadoop 2.x (our HDFS2 connector uses
> 2.10.1), on which several CDH distributions are based.
> *Request*
> We would like to ask for the synchronization fix to be backported to Hadoop
> 2.x so that our users can operate without issues.
> *Impact*
> The older 2.x Hadoop version is used by our HDFS connector, which is used in
> production by our community. Currently, the issue causes our HDFS connector
> to fail, as it is unable to recover and renew the ticket at a later point.
> Having the backported fix would allow our users to operate without issues
> that require manual intervention every week (or every few days in some
> cases). The only workaround available to the community is to run a command
> or restart their workers.
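
The locking pattern in the quoted relogin method can be illustrated without Hadoop. The following is only a minimal sketch under stated assumptions: the ReloginSketch class, its Login type, the subjectLock field, and the race helper are hypothetical stand-ins invented for illustration and are not part of the Hadoop API. It shows the same two ideas as the quoted code: the credential swap happens under a shared lock, and a thread that lost the race skips the relogin because the login it captured is no longer the current one.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ReloginSketch {
    /** Hypothetical stand-in for a login context that holds a ticket. */
    static final class Login {
        final int generation;
        Login(int generation) { this.generation = generation; }
    }

    // Hypothetical equivalent of the per-subject lock in the quoted code.
    private final Object subjectLock = new Object();
    private volatile Login login = new Login(0);
    private final AtomicInteger reloginCount = new AtomicInteger();

    Login getLogin() { return login; }

    int reloginCount() { return reloginCount.get(); }

    /** Replaces the login only if 'expected' is still the current one. */
    void relogin(Login expected) {
        synchronized (subjectLock) {
            // Another racing thread may already have performed the relogin.
            if (expected == login) {
                login = new Login(expected.generation + 1);
                reloginCount.incrementAndGet();
            }
        }
    }

    /** Lets 'threads' workers race to renew the same stale login. */
    static int race(int threads) {
        ReloginSketch ugi = new ReloginSketch();
        Login stale = ugi.getLogin();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                ugi.relogin(stale);
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ugi.reloginCount();
    }

    public static void main(String[] args) {
        // 40 workers mirrors the connector count from the report.
        System.out.println("relogins performed: " + race(40));
    }
}
```

Running main prints "relogins performed: 1": all 40 workers hold the same stale login, the first one to take the lock replaces it, and the rest see that the current login has changed and do nothing. Without the lock and the identity check, several workers could each tear down and rebuild the credentials while others are reading them.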
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16165) Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
[ https://issues.apache.org/jira/browse/HDFS-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka updated HDFS-16165:
---------------------------------
    Target Version/s: 2.10.2

> Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
> ------------------------------------------------------------------
>
>                 Key: HDFS-16165
>                 URL: https://issues.apache.org/jira/browse/HDFS-16165
[jira] [Updated] (HDFS-16165) Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
[ https://issues.apache.org/jira/browse/HDFS-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Osvath updated HDFS-16165:
---------------------------------
    Labels: Confluent  (was: )

> Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
> ------------------------------------------------------------------
>
>                 Key: HDFS-16165
>                 URL: https://issues.apache.org/jira/browse/HDFS-16165