[ https://issues.apache.org/jira/browse/HDFS-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016002#comment-14016002 ]
Yongjun Zhang commented on HDFS-6475:
-------------------------------------

Hi Jing,

Thanks a lot for the info! I took a quick look; the issue is similar, but there seems to be an important difference here. In the HDFS-5322 fix, the method (and its whole caller hierarchy)
{code}
private void saslProcess(RpcSaslProto saslMessage)
    throws WrappedRpcServerException, IOException, InterruptedException {
{code}
is allowed to throw IOException, so your HDFS-5322 solution works well there.

For HDFS-6475, the involved class UserProvider is not allowed to throw IOException. In fact, UserProvider throws only unchecked exceptions, e.g. a SecurityException here, to include the StandbyException info in the message and cause:
{code}
/** Inject user information to http operations. */
@Provider
public class UserProvider
    extends AbstractHttpContextInjectable<UserGroupInformation>
    implements InjectableProvider<Context, Type> {
  @Context HttpServletRequest request;
  @Context ServletContext servletcontext;
  ......
  @Override
  public UserGroupInformation getValue(final HttpContext context) {
    final Configuration conf = (Configuration) servletcontext
        .getAttribute(JspHelper.CURRENT_CONF);
    try {
      return JspHelper.getUGI(servletcontext, request, conf,
          AuthenticationMethod.KERBEROS, false);
    } catch (IOException e) {
      throw new SecurityException(
          "Failed to obtain user group information: " + e, e);
    }
  }
{code}
This means we can't throw StandbyException (which inherits from IOException) from here. So my uploaded patch tries to parse the message string of the SecurityException thrown here.

UserProvider inherits from classes in the jersey package, whose interface spec we won't be able to change. We might be able to change the client/server interface instead: detect this kind of case at the interface, and then, rather than throwing a RemoteException that wraps the SecurityException, throw a RemoteException that wraps the underlying StandbyException cause. I'm not sure whether we should go this route, though.
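The message-parsing approach described above could look roughly like this (a minimal self-contained sketch, not the actual HDFS-6475.001.patch; the class name StandbyDetector and the method name isWrappedStandbyException are made up for illustration):
{code}
/**
 * Sketch of detecting a StandbyException that has been flattened into the
 * message string of a SecurityException. Because WebHDFS transports the
 * server-side exception as JSON, the client only sees the message text, so
 * the detection has to fall back to string matching.
 */
public class StandbyDetector {

  // Hypothetical helper: returns true if the remote message indicates the
  // NN was in standby state, meaning the client should fail over and retry.
  static boolean isWrappedStandbyException(String remoteMessage) {
    return remoteMessage != null
        && remoteMessage.contains("StandbyException");
  }

  public static void main(String[] args) {
    String msg = "Failed to obtain user group information: "
        + "org.apache.hadoop.security.token.SecretManager$InvalidToken: "
        + "StandbyException";
    if (isWrappedStandbyException(msg)) {
      System.out.println("standby NN detected; retry another NN");
    } else {
      System.out.println("non-retriable failure");
    }
  }
}
{code}
The obvious weakness of this route, as noted above, is that it couples the client to the exact wording of the server-side message, which is why changing what the server wraps into the RemoteException may be the cleaner alternative.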
Would you please comment again? Thanks.

> WebHdfs clients fail without retry because of incorrect handling of
> StandbyException
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-6475
>                 URL: https://issues.apache.org/jira/browse/HDFS-6475
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, webhdfs
>    Affects Versions: 2.4.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-6475.001.patch
>
>
> With WebHdfs clients connected to an HA HDFS service, the delegation token is
> previously initialized with the active NN.
> When clients issue a request, the NNs to contact are stored in a map returned
> by DFSUtil.getNNServiceRpcAddresses(conf), and the client contacts the NNs in
> that order, so the first one it runs into is likely the standby NN.
> If the standby NN doesn't have the updated client credential, it will throw a
> SecurityException that wraps a StandbyException.
> The client is expected to retry another NN, but due to the insufficient
> handling of the SecurityException mentioned above, it fails.
> Example message:
> {code}
> {RemoteException={message=Failed to obtain user group information:
> org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException,
> javaClassName=java.lang.SecurityException, exception=SecurityException}}
> org.apache.hadoop.ipc.RemoteException(java.lang.SecurityException): Failed to
> obtain user group information:
> org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException
>   at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:159)
>   at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:325)
>   at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$700(WebHdfsFileSystem.java:107)
>   at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.getResponse(WebHdfsFileSystem.java:635)
>   at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:542)
>   at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.run(WebHdfsFileSystem.java:431)
>   at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:685)
>   at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:696)
>   at kclient1.kclient$1.run(kclient.java:64)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:356)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
>   at kclient1.kclient.main(kclient.java:58)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> {code}

--
This message was sent by Atlassian JIRA (v6.2#6252)