[ https://issues.apache.org/jira/browse/HADOOP-9421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13690034#comment-13690034 ]
Daryn Sharp commented on HADOOP-9421:
-------------------------------------

You seem to be tailoring a design that only considers today's implementation of tokens and kerberos. It seems "easy" when you assume there are only two choices. The optimization becomes more and more complicated, and in many cases impossible, compared to simply doing what the server tells you to do. Once pluggable auth support allows a world of heterogeneous security, such as Knox or Rhino, requiring REINITIATE penalties becomes very expensive. Sorry for the very long read, but these are the topics I intended to address on the call that unfortunately didn't happen today.

+IP failover+

Distinct service principals with IP failover isn't "insane". With a shared principal, services can't be accessed directly because the host doesn't match the shared principal, so a different config with a hardcoded shared principal is needed. Similarly, DNs won't be able to heartbeat directly into HA NNs. I'm sure there are more problems than we've already discovered investigating that route. The root issue is that the client must only use the hostname that appears in the kerberos service principal, which means you can't access the service via all of its interfaces, hostnames, or even pretty CNAMEs. If the server advertises "this is who I am" via the NEGOTIATE, the problem is solved.

+Token selection issues+

Selecting tokens pre-connection, based on the service as a host:port or IP:port tuple, is a problem. Let's take a few examples:
* Using the IP precludes multi-interface host support, for instance if you want a fast/private intra-cluster network and a separate public network. Tokens will contain the public IP, but clients using the private interface (a different IP) can't find them. This isn't contrived; it's something Cloudera has wanted to do.
* You also can't use the IP because changing a service's IP will break clients holding tokens with the old IP. In comes the bane of my creation, use_ip=false, to use the given hostname. But you can't allow non-fully-qualified names because they will resolve differently depending on the DNS search path. There's a raft of reasons why the canonicalization isn't as straightforward as you'd think, which led to a custom NetUtils resolver and complicated path normalization.
* Likewise, any sort of public proxy or NAT-ing between an external client and a cluster service creates a token service that is unusable within the grid.
* HA token logic is unnecessarily convoluted in order to clone tokens from a logical URI into multiple tokens with each failover's service.

_Solution_

A clean solution to all of these problems is for tokens to contain a server-generated opaque id. The server's NEGOTIATE reports this id, and the client looks for a token with that id. Now, no matter what interface/IP/hostname/proxy/NAT is used, the client will always find the token. If you cut out the use of the NEGOTIATE, this ability is gone.
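To make that concrete, here's a rough sketch of the client-side selection I'm describing. It is illustrative only (SaslAuth, TokenStore, and chooseAuth are hypothetical stand-ins, not classes from the patch): the server's NEGOTIATE drives both the choice of auth and the token lookup, keyed by the opaque id instead of whatever host:port the client dialed.

{code:java}
// Illustrative sketch only: SaslAuth, TokenStore and chooseAuth are hypothetical
// stand-ins for whatever the protobuf NEGOTIATE actually carries.
import java.util.List;
import java.util.Map;
import java.util.Optional;

class NegotiateDrivenSelection {

  // One entry per auth the server advertises in its NEGOTIATE.
  record SaslAuth(String method,     // e.g. "TOKEN", "KERBEROS"
                  String mechanism,  // e.g. "DIGEST-MD5", "SCRAM-SHA-1", "GSSAPI"
                  String protocol,   // SASL protocol part for building the server name
                  String serverId) { // opaque server-generated id (or principal host)
  }

  // Tokens keyed by the server's opaque id instead of a host:port "service" string,
  // so every interface/IP/CNAME/NAT path to the same server finds the same token.
  static final class TokenStore {
    private final Map<String, byte[]> tokensByServerId;
    TokenStore(Map<String, byte[]> tokensByServerId) {
      this.tokensByServerId = tokensByServerId;
    }
    Optional<byte[]> lookup(String serverId) {
      return Optional.ofNullable(tokensByServerId.get(serverId));
    }
  }

  // Walk the server's list in its preference order and pick the first auth the
  // client can actually satisfy. No pre-connection guessing, no REINITIATE.
  static SaslAuth chooseAuth(List<SaslAuth> advertised, TokenStore tokens, boolean haveTgt) {
    for (SaslAuth auth : advertised) {
      switch (auth.method()) {
        case "TOKEN":
          // The token is located by the server-advertised id, not by whatever
          // address the client happened to dial.
          if (tokens.lookup(auth.serverId()).isPresent()) {
            return auth;
          }
          break;
        case "KERBEROS":
          // Only attempted if the server says it's kerberized and we hold a TGT;
          // the service principal comes from protocol/serverId, not the dialed host.
          if (haveTgt) {
            return auth;
          }
          break;
        default:
          // A method this client doesn't implement: skip it, don't fail.
          break;
      }
    }
    throw new IllegalStateException("client cannot satisfy any advertised auth");
  }
}
{code}

With that shape, a new auth method or mechanism is just another entry the server can advertise; the client never has to guess before connecting and never eats a REINITIATE for guessing wrong.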
+Supporting new auth methods+

Other new auths in the future may need the protocol/serverId hints from the NEGOTIATE to locate the required credentials; guessing may not be an option. The RPC client shouldn't have to be modified to make a pre-connection guess for every auth method it supports. Why should the client attempt an auth method before it _even knows if the server can do it_? Let's look at some hairy examples. The client tries to do kerberos, so it generates the initial response to take advantage of your "optimization", but the server isn't kerberized. Either the client fails because it has no TGT, which it doesn't even need, or it fails trying to get a ticket for a non-existent service principal. What if the client decides to use an SSO service, but the server doesn't do SSO? Take a REINITIATE penalty every time?

+Supporting new mechanisms+

Let's say we add support for a new mechanism like SCRAM. Just because the client can do it doesn't mean all services across all clusters can do it. The server's NEGOTIATE will tell the client whether it can do DIGEST-MD5, SCRAM, etc. Inter-cluster compatibility and rolling upgrades will introduce scenarios where the required mechanism differs, and penalizing the client with a REINITIATE is not a valid option.

---

None of these scenarios involve complex issues if the NEGOTIATE is used to choose an appropriate auth type. In a world of multiple auths, and multiple mechanisms per auth, requiring REINITIATE penalties is too expensive. Ignoring all the issues I've cited, your optimization doesn't appear to have a positive impact on performance. Even if it did shave a few milliseconds, or even 100ms, would it have a measurable real-world impact? Considering how many RPC requests are performed over a single connection, will the negligible penalty of one extra packet make any difference? I feel like we've spent weeks haggling over an ill-suited premature optimization, time that could have been spent building on this implementation. :(

> Convert SASL to use ProtoBuf and add lengths for non-blocking processing
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-9421
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9421
>             Project: Hadoop Common
>          Issue Type: Sub-task
>    Affects Versions: 2.0.3-alpha
>            Reporter: Sanjay Radia
>            Assignee: Daryn Sharp
>            Priority: Blocker
>         Attachments: HADOOP-9421.patch, HADOOP-9421.patch, HADOOP-9421.patch, HADOOP-9421.patch, HADOOP-9421.patch, HADOOP-9421.patch, HADOOP-9421.patch, HADOOP-9421.patch, HADOOP-9421-v2-demo.patch