[ https://issues.apache.org/jira/browse/HDFS-15960?focusedWorklogId=605505&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-605505 ]
ASF GitHub Bot logged work on HDFS-15960:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Jun/21 20:05
            Start Date: 02/Jun/21 20:05
    Worklog Time Spent: 10m
      Work Description: bolerio commented on a change in pull request #2887:
URL: https://github.com/apache/hadoop/pull/2887#discussion_r644283148

##########
File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/NamenodeHeartbeatService.java
##########
@@ -170,7 +172,20 @@ protected void serviceInit(Configuration configuration) throws Exception {

   @Override
   public void periodicInvoke() {
-    updateState();
+    try {
+      SecurityUtil.doAsCurrentUser(
+          new PrivilegedExceptionAction<Object>() {
+            @Override
+            public Object run() {
+              updateState();
+              return null;
+            }
+          });
+    } catch (IOException e) {
+      // Generic error that we don't know about
+      LOG.error("Unexpected exception while communicating with {}: {}",

Review comment:
Hi @goiri, following up on this. I was able to create a unit test that reproduces the problem and demonstrates that the patch fixes it. However, there is a challenge.

The failure occurs when the router calls the JMX endpoint, which returns stats in addition to the basic alive status; the alive status itself is obtained in a separate RPC call. The failure is soft: the router logs the exception and continues without the information it tried to obtain. That information is needed later during load balancing, which is how the original bug was discovered.

Because the main interface capturing knowledge about a NN on the router side (FederationNamenodeContext) does not contain these stats, there is no way to write a unit test against it. Some unit tests in that area mock this interface, and I modified the mock to include stats, but then I have to downcast to the mock object in the test, which is very ugly.
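
To make the downcast concern concrete, here is a minimal, self-contained sketch. It does not use the Hadoop classes: `NamenodeContext`, `MockNamenodeContext`, and `activeDatanodes` are hypothetical stand-ins invented for illustration, and the single stat shown is only an assumed example of what the mock might carry.

```java
public class DowncastSketch {

  /** Hypothetical stand-in for the router-side namenode interface; it exposes no stats. */
  public interface NamenodeContext {
    String getNamenodeId();
  }

  /** Hypothetical test mock that carries one extra stat the interface lacks. */
  public static final class MockNamenodeContext implements NamenodeContext {
    private final String namenodeId;
    private final int numOfActiveDatanodes;

    public MockNamenodeContext(String namenodeId, int numOfActiveDatanodes) {
      this.namenodeId = namenodeId;
      this.numOfActiveDatanodes = numOfActiveDatanodes;
    }

    @Override
    public String getNamenodeId() {
      return namenodeId;
    }

    public int getNumOfActiveDatanodes() {
      return numOfActiveDatanodes;
    }
  }

  /** A test holding only the interface type must downcast to the mock to read the stat. */
  public static int activeDatanodes(NamenodeContext context) {
    return ((MockNamenodeContext) context).getNumOfActiveDatanodes();
  }

  public static void main(String[] args) {
    NamenodeContext context = new MockNamenodeContext("nn1", 3);
    System.out.println(context.getNamenodeId() + " active datanodes: " + activeDatanodes(context));
  }
}
```

The cast compiles only because the test knows the concrete mock type, and it would throw a `ClassCastException` for any other implementation. If the interface itself declared a stats accessor, the cast would not be needed.
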
So the options are:
(1) accept this ugly downcast,
(2) don't write the test and, if Hadoop eventually has an integration test suite, cover the use case there, or
(3) modify the FederationNamenodeContext to include the stats (see the MembershipState and MembershipStats classes).

My vote would be for (3), as those stats seem essential to the operation of a federated cluster. It would be OK not to make all of the numbers part of the public interface, but the fact that we need stats about resource utilization should be part of the interface.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 605505)
    Time Spent: 1h 10m  (was: 1h)

> Router NamenodeHeartbeatService fails to authenticate with namenode in a
> kerberized envi
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-15960
>                 URL: https://issues.apache.org/jira/browse/HDFS-15960
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Borislav Iordanov
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We use http.hadoop.authentication.type = "kerberos" and when the
> NamenodeHeartbeatService calls the namenode via JMX, it is not providing a
> user security context so the authentication token is not transmitted and it
> fails.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org