Alex Ivanov created HADOOP-13652:
------------------------------------

Summary: ZKDelegationTokenSecretManager doesn't seem to honor ZK connection/session timeouts
Key: HADOOP-13652
URL: https://issues.apache.org/jira/browse/HADOOP-13652
Project: Hadoop Common
Issue Type: Bug
Components: kms
Reporter: Alex Ivanov

Looking at some of the errors I've seen due to ZooKeeper connection issues from KMS, it doesn't seem like the following timeouts are picked up.

{code}
package org.apache.hadoop.security.token.delegation;

public abstract class ZKDelegationTokenSecretManager<TokenIdent extends AbstractDelegationTokenIdentifier>
    extends AbstractDelegationTokenSecretManager<TokenIdent> {

  public static final int ZK_DTSM_ZK_SESSION_TIMEOUT_DEFAULT = 10000;
  public static final int ZK_DTSM_ZK_CONNECTION_TIMEOUT_DEFAULT = 10000;
  ...
}
{code}

Instead, the connection and session timeouts in effect are 15 and 60 seconds respectively, which are the Curator defaults.

{code}
package org.apache.curator.framework;

public class CuratorFrameworkFactory {
  private static final int DEFAULT_SESSION_TIMEOUT_MS = Integer.getInteger("curator-default-session-timeout", 60 * 1000);
  private static final int DEFAULT_CONNECTION_TIMEOUT_MS = Integer.getInteger("curator-default-connection-timeout", 15 * 1000);
  ...
}
{code}

It looks like DelegationTokenAuthenticationFilter is setting the Curator client on ZKDelegationTokenSecretManager, and that may be causing the issue:

{code}
package org.apache.hadoop.security.token.delegation.web;

public class DelegationTokenAuthenticationFilter extends AuthenticationFilter {

  protected void initializeAuthHandler(String authHandlerClassName, FilterConfig filterConfig)
      throws ServletException {
    ZKDelegationTokenSecretManager.setCurator((CuratorFramework)
        filterConfig.getServletContext().getAttribute(ZKSignerSecretProvider.
            ZOOKEEPER_SIGNER_SECRET_PROVIDER_CURATOR_CLIENT_ATTRIBUTE));
    super.initializeAuthHandler(authHandlerClassName, filterConfig);
    ZKDelegationTokenSecretManager.setCurator(null);
  }
{code}

Example errors:

{code}
2016-09-25 01:46:33,053 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
2016-09-25 01:46:33,053 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
2016-09-25 01:46:34,028 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (15976)
2016-09-25 01:46:34,053 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (16001)
2016-09-25 01:46:37,053 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (19001)
2016-09-25 01:46:40,053 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (22001)
2016-09-25 01:46:49,055 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (31003)
2016-09-25 01:46:52,029 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (33977)
2016-09-25 01:47:05,344 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (47292)
2016-09-25 01:47:09,345 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (51292)
2016-09-25 01:47:24,346 WARN ConnectionState - Connection attempt unsuccessful after 66294 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
2016-09-25 01:47:43,740 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
2016-09-25 01:47:43,740 ERROR ConnectionState - Connection timed out for connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
{code}

There are also some connection issues between KMS and ZooKeeper. They are sporadic, which is why I'm still trying to pinpoint them, but essentially KMS can get into a perpetual connect/disconnect cycle, from which it eventually recovers on its own or after a restart. I'm mentioning this in case it is related to this JIRA.
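
For reference, the Curator defaults quoted above are resolved through Integer.getInteger, i.e. from JVM system properties with a hard-coded fallback. The sketch below reproduces only that lookup using the JDK, without Curator on the classpath. The class name CuratorTimeoutDefaults and its helper methods are hypothetical and exist only for illustration; the property names and fallback values are copied from the CuratorFrameworkFactory excerpt. Whether overriding these properties actually changes what KMS ends up using is an assumption I have not verified.

```java
public class CuratorTimeoutDefaults {
    // Property names and fallback values copied from the CuratorFrameworkFactory excerpt.
    static final String SESSION_PROP = "curator-default-session-timeout";
    static final String CONNECTION_PROP = "curator-default-connection-timeout";

    // Integer.getInteger reads a JVM system property and falls back to the
    // supplied default when the property is unset or is not a valid integer.
    static int sessionTimeoutMs() {
        return Integer.getInteger(SESSION_PROP, 60 * 1000);
    }

    static int connectionTimeoutMs() {
        return Integer.getInteger(CONNECTION_PROP, 15 * 1000);
    }

    public static void main(String[] args) {
        System.out.println("session timeout (ms): " + sessionTimeoutMs());
        System.out.println("connection timeout (ms): " + connectionTimeoutMs());

        // Hypothetical override, equivalent to passing
        // -Dcurator-default-connection-timeout=10000 on the JVM command line.
        System.setProperty(CONNECTION_PROP, "10000");
        System.out.println("connection timeout after override (ms): " + connectionTimeoutMs());
    }
}
```

Note that in the Curator excerpt these values are static final fields, so they are read once when the class is initialized. An override would therefore need to be passed as a -D flag at JVM startup rather than set later at runtime, and it would only matter for clients built without explicit timeouts.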