All -

As a follow-up to the discussions at Hadoop Summit, I would like to open a 
discussion on the moving parts of a Hadoop SSO/Token Service.
There are a couple of related Jiras that can be referenced and that may or may 
not be updated as a result of this discussion thread.

https://issues.apache.org/jira/browse/HADOOP-9533
https://issues.apache.org/jira/browse/HADOOP-9392

As the first aspect of the discussion, we should probably state the overall 
goals and scoping for this effort:
* An alternative authentication mechanism to Kerberos for user authentication
* A broader capability for integration into enterprise identity and SSO 
solutions
* Possibly the advertisement/negotiation of available authentication mechanisms
* Backward compatibility for the existing use of Kerberos
* No (or minimal) changes to existing Hadoop tokens (delegation, job, block 
access, etc)
* Pluggable authentication mechanisms across: RPC, REST and webui enforcement 
points
* Continued support for existing authorization policy/ACLs, etc
* Keeping more fine-grained authorization policies in mind, such as 
attribute-based access control
        - fine-grained access control is a separate but related effort that we 
must not preclude with this effort
* Cross cluster SSO

In order to tease out the moving parts, here are a couple of high-level, 
simplified descriptions of the SSO interaction flow:
                               +------+
        +------+ credentials 1 | SSO  |
        |CLIENT|-------------->|SERVER|
        +------+  :tokens      +------+
          2 |                    
            | access token
            V :requested resource
        +-------+
        |HADOOP |
        |SERVICE|
        +-------+
        
The above diagram represents the simplest interaction model for an SSO service 
in Hadoop.
1. client authenticates to SSO service and acquires an access token
  a. client presents credentials to an authentication service endpoint exposed 
by the SSO server (AS) and receives a token representing the authentication 
event and verified identity
  b. client then presents the identity token from 1.a. to the token endpoint 
exposed by the SSO server (TGS) to request an access token to a particular 
Hadoop service and receives an access token
2. client presents the Hadoop access token to the Hadoop service for which the 
access token has been granted and requests the desired resource or services
  a. access token is presented as appropriate for the service endpoint protocol 
being used
  b. Hadoop service token validation handler validates the token and verifies 
its integrity and the identity of the issuer
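
To make steps 1.a, 1.b and 2.b concrete, here is a minimal sketch of the flow. 
Everything in it is an assumption made for illustration: the function names, 
the in-memory user table, the token layout, and the use of a shared HMAC secret 
as a stand-in for whatever PKI-based signing we end up choosing.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret; a stand-in for PKI-based signatures.
SECRET = b"demo-secret"
# Hypothetical credential store used only by this sketch.
USERS = {"alice": "wonderland"}


def _sign(claims: dict) -> str:
    """Serialize claims and append an HMAC so tampering can be detected."""
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    mac = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + mac


def _verify(token: str) -> dict:
    """Return the claims if the signature checks out, else raise ValueError."""
    body, _, mac = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("bad signature")
    return json.loads(base64.urlsafe_b64decode(body))


def authenticate(user: str, password: str) -> str:
    """Step 1.a: the AS endpoint trades credentials for an identity token."""
    if USERS.get(user) != password:
        raise PermissionError("invalid credentials")
    return _sign({"type": "identity", "sub": user})


def request_access_token(identity_token: str, service: str) -> str:
    """Step 1.b: the TGS endpoint trades an identity token for an access token."""
    claims = _verify(identity_token)
    if claims["type"] != "identity":
        raise ValueError("identity token required")
    return _sign({"type": "access", "sub": claims["sub"], "aud": service})


def service_accepts(access_token: str, service: str) -> str:
    """Step 2.b: the Hadoop service validates the token and its audience."""
    claims = _verify(access_token)
    if claims["type"] != "access" or claims["aud"] != service:
        raise PermissionError("token not valid for this service")
    return claims["sub"]
```

A client would call authenticate(), pass the result to request_access_token() 
with the name of the target service, and present the returned access token to 
that service. A token issued for one service is rejected by another because 
the audience is part of the signed claims.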
    
    +------+
    |  IdP |
    +------+
    1   ^ credentials
        | :idp_token
        |                      +------+
        +------+  idp_token  2 | SSO  |
        |CLIENT|-------------->|SERVER|
        +------+  :tokens      +------+
          3 |                    
            | access token
            V :requested resource
        +-------+
        |HADOOP |
        |SERVICE|
        +-------+
        

The above diagram represents a slightly more complicated interaction model for 
an SSO service in Hadoop that removes Hadoop from the credential collection 
business.
1. client authenticates to a trusted identity provider within the enterprise 
and acquires an IdP specific token
  a. client presents credentials to an enterprise IdP and receives a token 
representing the authenticated identity
2. client authenticates to SSO service and acquires an access token
  a. client presents idp_token to an authentication service endpoint exposed by 
the SSO server (AS) and receives a token representing the authentication event 
and verified identity
  b. client then presents the identity token from 2.a. to the token endpoint 
exposed by the SSO server (TGS) to request an access token to a particular 
Hadoop service and receives an access token
3. client presents the Hadoop access token to the Hadoop service for which the 
access token has been granted and requests the desired resource or services
  a. access token is presented as appropriate for the service endpoint protocol 
being used
  b. Hadoop service token validation handler validates the token and verifies 
its integrity and the identity of the issuer
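
The part that is new in this second flow is step 2.a, where the SSO server 
accepts an idp_token instead of raw credentials. A rough sketch of that check 
follows. The trust store, the "issuer:user:mac" token layout and the HMAC keys 
are all hypothetical; a real IdP would hand us something like a SAML assertion 
or a signed token in its own format.

```python
import hashlib
import hmac

# Hypothetical trust store: IdP name -> key used to check that IdP's tokens.
TRUSTED_IDPS = {"corp-idp": b"corp-idp-key"}


def idp_issue(issuer: str, key: bytes, user: str) -> str:
    """Stand-in for the enterprise IdP (step 1.a): returns 'issuer:user:mac'."""
    mac = hmac.new(key, f"{issuer}:{user}".encode(), hashlib.sha256).hexdigest()
    return f"{issuer}:{user}:{mac}"


def federate(idp_token: str) -> str:
    """Step 2.a: accept an idp_token only from a trusted issuer.

    Returns the verified user name, which the SSO server would then wrap in
    its own identity token.
    """
    issuer, user, mac = idp_token.split(":")
    key = TRUSTED_IDPS.get(issuer)
    if key is None:
        raise PermissionError("untrusted IdP")
    expected = hmac.new(key, f"{issuer}:{user}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        raise PermissionError("idp_token failed verification")
    return user
```

The point of the sketch is only that Hadoop never sees the user's credentials 
here: the SSO server decides based on which issuers it trusts and whether the 
presented token verifies against that issuer's key.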
        
Considering the above set of goals and high level interaction flow description, 
we can start to discuss the component inventory required to accomplish this 
vision:

1. SSO Server Instance: this component must be able to expose endpoints for 
both authentication of users by collecting and validating credentials and 
federation of identities represented by tokens from trusted IdPs within the 
enterprise. The endpoints should be composable so as to allow for multifactor 
authentication mechanisms. They will also need to return tokens that represent 
the authentication event and verified identity as well as access tokens for 
specific Hadoop services.

2. Authentication Providers: pluggable authentication mechanisms must be easily 
created and configured for use within the SSO server instance. They will 
ideally allow the enterprise to plug in their preferred off-the-shelf 
components as well as provide custom providers. Supporting existing standards 
for such authentication providers should be a top priority. There are a 
number of standard approaches in use in the Java world: JAAS loginmodules, 
servlet filters, JASPIC authmodules, etc. A pluggable provider architecture 
that allows the enterprise to leverage existing investments in these 
technologies and existing skill sets would be ideal.
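
As a language-neutral illustration of what "pluggable" could mean here, the 
sketch below defines a provider contract and a name-based registry so that the 
SSO server can pick a provider from configuration. The class names, the 
register() decorator and the "sso.auth.provider" key are all made up for this 
example; the real thing would more likely sit on top of JAAS or JASPIC.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class AuthenticationProvider(ABC):
    """Hypothetical plug-in contract: map credentials to a verified user name."""

    @abstractmethod
    def authenticate(self, credentials: Dict[str, str]) -> Optional[str]:
        """Return the user name on success, or None if rejected."""


# Registry of provider classes, keyed by the name used in configuration.
PROVIDERS: Dict[str, Type[AuthenticationProvider]] = {}


def register(name: str):
    """Class decorator that makes a provider selectable by name."""
    def wrap(cls):
        PROVIDERS[name] = cls
        return cls
    return wrap


@register("static")
class StaticPasswordProvider(AuthenticationProvider):
    """Toy provider backed by an in-memory table, for illustration only."""

    _users = {"alice": "wonderland"}

    def authenticate(self, credentials):
        user = credentials.get("username")
        if user is not None and self._users.get(user) == credentials.get("password"):
            return user
        return None


def load_provider(config: Dict[str, str]) -> AuthenticationProvider:
    """Instantiate whichever provider the (hypothetical) config key names."""
    return PROVIDERS[config["sso.auth.provider"]]()
```

With this shape, swapping in an LDAP-backed or custom provider is a matter of 
registering another class and changing one configuration value; the SSO server 
endpoints only ever see the AuthenticationProvider contract.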

3. Token Authority: a token authority component would need to have the ability 
to issue, verify and revoke tokens. This authority will need to be trusted by 
all enforcement points that need to verify incoming tokens. Establishing that 
trust will require something like PKI.
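
A small sketch of the issue/verify/revoke surface may help frame the 
discussion. Note that it swaps in an HMAC key for the PKI signatures mentioned 
above, purely to stay within the standard library, and that the class name, 
the "id|subject|expiry|mac" layout and the in-memory revocation set are 
assumptions of the sketch rather than a proposed design.

```python
import hashlib
import hmac
import itertools
import time


class TokenAuthority:
    """Hypothetical authority that can issue, verify and revoke tokens."""

    def __init__(self, key: bytes, clock=time.time):
        self._key = key              # stand-in for a private signing key
        self._clock = clock          # injectable so expiry can be tested
        self._ids = itertools.count(1)
        self._revoked = set()

    def _mac(self, payload: str) -> str:
        return hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()

    def issue(self, subject: str, ttl_seconds: int) -> str:
        """Create a signed token of the form 'id|subject|expiry|mac'."""
        token_id = next(self._ids)
        expires = int(self._clock()) + ttl_seconds
        payload = f"{token_id}|{subject}|{expires}"
        return f"{payload}|{self._mac(payload)}"

    def verify(self, token: str) -> str:
        """Return the subject, or raise ValueError if the token is not valid."""
        payload, _, mac = token.rpartition("|")
        if not hmac.compare_digest(mac, self._mac(payload)):
            raise ValueError("signature mismatch")
        token_id, subject, expires = payload.split("|")
        if int(token_id) in self._revoked:
            raise ValueError("token revoked")
        if self._clock() >= int(expires):
            raise ValueError("token expired")
        return subject

    def revoke(self, token: str) -> None:
        """Mark the token's id as revoked so later verify() calls fail."""
        self._revoked.add(int(token.split("|", 1)[0]))
```

One thing the sketch makes visible is that revocation needs shared state: 
with signature-only verification an enforcement point can validate a token 
offline, but it cannot learn about a revocation without asking the authority 
or receiving a revocation list.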

4. Hadoop SSO Tokens: the exact shape and form of the SSO tokens will need to 
be considered in order to determine the means by which trust and integrity are 
ensured while using them. There may be some abstraction of the underlying 
format provided through interface based design but all token implementations 
will need to have the same attributes and capabilities in terms of validation 
and cryptographic verification.

5. SSO Protocol: the lowest common denominator protocol for SSO server 
interactions across client types would likely be REST. Depending on the REST 
client in use, callers may need to code explicitly to the token flow described 
in the earlier interaction descriptions, or a plugin may be provided for things 
like HTTPClient, curl, etc. RPC clients will have this taken care of for them 
within the SASL layer and will leverage the REST endpoints as well. This likely 
implies that the RPC client must be able to trust the SSO server's identity 
cert that is presented over SSL. 

6. REST Client Agent Plugins: required for encapsulating the interaction with 
the SSO server for the client programming models. We may need these for many 
client types: e.g. Java, JavaScript, .NET, Python, cURL, etc.

7. Server Side Authentication Handlers: the server side of the REST, RPC or 
webui connection will need to be able to validate and verify the incoming 
Hadoop tokens in order to grant or deny access to requested resources.
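
For the REST case, a handler along these lines is roughly what I have in mind. 
The "Authorization: Bearer" header is an assumption of the sketch (nothing 
above commits us to that wire format), as are the function name and the 
convention that the validator raises ValueError for a bad token.

```python
from typing import Callable, Dict, Tuple


def handle_request(
    headers: Dict[str, str],
    validate: Callable[[str], str],
    resource: str,
) -> Tuple[int, str]:
    """Grant or deny access to a resource based on the presented token.

    `validate` is whatever token check the service trusts; it returns the
    user name or raises ValueError.
    """
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer" or not token:
        return 401, "missing token"
    try:
        user = validate(token)
    except ValueError:
        return 401, "invalid token"
    # Existing authorization policy/ACL checks would run here, keyed on `user`.
    return 200, f"{resource} served to {user}"
```

The handler only establishes identity; the existing authorization policy and 
ACL checks would still run afterwards against the user name it yields.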

8. Credential/Trust Management: throughout the system - on client and server 
sides - we will need to manage and provide access to PKI and potentially shared 
secret artifacts in order to establish the required trust relationships to 
replace the mutual authentication that would be otherwise provided by using 
Kerberos everywhere.

So, discussion points:

1. Are there additional components that would be required for a Hadoop SSO 
service?
2. Should any of the above described components be considered not actually 
necessary or poorly described?
3. Should we create a new umbrella Jira to identify each of these as a subtask?
4. Should we just continue to use HADOOP-9533 for the SSO server and add 
additional subtasks?
5. What are the natural seams of separation between these components and any 
dependencies between one and another that affect priority?

Obviously, each component that we identify will more than likely have a Jira 
of its own, so we are only trying to identify the high-level descriptions for 
now.

Can we try to drive this discussion to a close by the end of the week? This 
will allow us to start breaking out into component implementation plans.

thanks,

--larry
