[
https://issues.apache.org/jira/browse/HADOOP-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655097#action_12655097
]
Kan Zhang commented on HADOOP-4359:
-----------------------------------
I plan to introduce an HDFS token, called Access Token, as a vehicle to pass
data access authorization information from NN to DN. One can think of Access
Tokens as capabilities; an Access Token enables its owner to access certain
data blocks. It is issued by NN and used on DN. Access Tokens should be
generated in such a way that their authenticity can be verified by DN.
In general, tokens can be generated in two ways. A) Using a public-key scheme,
where NN chooses a private/public key pair and uses the private key to sign a
token. The signature becomes an integral part of the token. DN is given NN's
public key, which it uses to verify the signature associated with a token.
Since only the NN knows the private key, only the NN can generate a valid
token. B) Using a symmetric-key scheme, where NN and all DNs share a secret
key. For each token, the NN computes a keyed hash (also known as a message
authentication code, or MAC) as the token authenticator. The token
authenticator becomes an integral part of the token. When a DN receives a
token, it uses its copy of the secret key to re-compute the token
authenticator and compares it with the one submitted as part of the token. If
they match, the token is verified as authentic. Since only NN and DNs know the
key (DNs are trusted never to issue tokens; they only use the key to verify
tokens they receive), no third party can forge tokens. Method A has the
advantage that DN doesn't have to store any secret key, and it provides
stronger security in the sense that even if a DN is compromised, the attacker
still can't forge tokens. However, generating and verifying public-key
signatures is expensive compared to symmetric-key operations. I plan to use
method B to generate Access Tokens.
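To make the DN-side check in method B concrete, here is a minimal sketch using
javax.crypto, assuming an HmacSHA1 MAC; the class and method names here are
illustrative, not the actual HDFS API:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;

// Sketch of method B verification on the DN side.
public class MacVerifier {
  private final SecretKeySpec sharedKey; // pushed from NN, shared by all DNs

  public MacVerifier(byte[] keyBytes) {
    this.sharedKey = new SecretKeySpec(keyBytes, "HmacSHA1");
  }

  // Re-compute the MAC over the token ID and compare it with the
  // authenticator submitted as part of the token.
  public boolean verify(byte[] tokenId, byte[] submittedAuthenticator)
      throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(sharedKey);
    byte[] expected = mac.doFinal(tokenId);
    // Constant-time comparison to avoid leaking match length via timing.
    return MessageDigest.isEqual(expected, submittedAuthenticator);
  }
}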
Access Tokens are ideally non-transferable, i.e., only the owner can use them.
This means we don't have to worry about a token being stolen, for example in
transit. One way to make a token non-transferable is to include the owner's id
in the token and require whoever uses the token to authenticate herself as the
owner specified in the token. I plan to simply include the owner's id in the
token for now, without DN verifying it. Authentication and verification of the
owner id can be added later if needed.
Access Tokens are meant to be lightweight and short-lived. There is no need to
renew or revoke an Access Token; when a cached Access Token expires, the
client simply gets a new one. Access Tokens should be cached only in memory
and never written to disk. A typical use case is as follows. An HDFS client
asks NN for the block ids/locations of a file. NN verifies that the client is
authorized to access the file and sends back the block ids/locations along
with an Access Token for each block. Whenever the HDFS client needs to access
a block, it sends the block id along with its associated Access Token to a DN.
The DN verifies the Access Token before allowing access to the block. The HDFS
client may cache Access Tokens received from NN in memory and only get new
tokens from NN when the cached ones expire or when accessing non-cached
blocks.
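A sketch of the client-side cache just described; the AccessToken shape below
(a token id, its authenticator, and an expiration date) is a stand-in for
illustration, not the actual HDFS class:

import java.util.concurrent.ConcurrentHashMap;

// In-memory only, per the design above: tokens are never written to disk.
public class AccessTokenCache {
  public static class AccessToken {
    final byte[] tokenId;
    final byte[] authenticator;
    final long expirationDate; // millis since epoch

    AccessToken(byte[] tokenId, byte[] authenticator, long expirationDate) {
      this.tokenId = tokenId;
      this.authenticator = authenticator;
      this.expirationDate = expirationDate;
    }
  }

  private final ConcurrentHashMap<Long, AccessToken> byBlockId =
      new ConcurrentHashMap<>();

  // Returns a cached, unexpired token, or null when the client must ask NN
  // again (token expired, or a block it hasn't seen before).
  public AccessToken get(long blockId) {
    AccessToken t = byBlockId.get(blockId);
    if (t == null || t.expirationDate <= System.currentTimeMillis()) {
      byBlockId.remove(blockId);
      return null;
    }
    return t;
  }

  public void put(long blockId, AccessToken token) {
    byBlockId.put(blockId, token);
  }
}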
An Access Token will look like the following, where the access modes include
read, write, replicate, etc.
TokenID = {expirationDate, ownerID, blockID, accessModes}
TokenAuthenticator = HMAC(key, TokenID)
Access Token = {TokenID, TokenAuthenticator}
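This layout translates directly into code. Below is an illustrative NN-side
generation routine; the field serialization order and the HmacSHA1 algorithm
are assumptions for the sketch, not a committed wire format:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class AccessTokenFactory {
  private final SecretKeySpec key;

  public AccessTokenFactory(byte[] keyBytes) {
    this.key = new SecretKeySpec(keyBytes, "HmacSHA1");
  }

  // TokenID = {expirationDate, ownerID, blockID, accessModes}
  static byte[] encodeTokenId(long expirationDate, String ownerId,
                              long blockId, int accessModes)
      throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeLong(expirationDate);
    out.writeUTF(ownerId);
    out.writeLong(blockId);
    out.writeInt(accessModes); // bit set: read, write, replicate, ...
    out.flush();
    return bos.toByteArray();
  }

  // TokenAuthenticator = HMAC(key, TokenID)
  byte[] computeAuthenticator(byte[] tokenId) throws Exception {
    Mac mac = Mac.getInstance("HmacSHA1");
    mac.init(key);
    return mac.doFinal(tokenId);
  }
}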
An Access Token is valid on all DNs regardless of where the data block is
actually stored. The secret key used to compute the token authenticator is
randomly chosen by the NN and sent to DNs when they first register with the
NN. There is a key rolling mechanism that updates this key on the NN and
pushes the new key to the DNs at regular intervals.
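One plausible shape for the key rolling mechanism is sketched below. Retaining
the previous key so that tokens minted just before a roll still verify is an
assumption of the sketch, and the push-to-DN transport is omitted:

import java.security.SecureRandom;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// NN-side key rolling sketch.
public class KeyRoller {
  private final SecureRandom random = new SecureRandom();
  private volatile byte[] currentKey = newKey();
  private volatile byte[] previousKey = null;

  private byte[] newKey() {
    byte[] k = new byte[20]; // HmacSHA1 key size
    random.nextBytes(k);
    return k;
  }

  // Roll keys; a real implementation would also push the new key to all
  // registered DNs here.
  synchronized void roll() {
    previousKey = currentKey;
    currentKey = newKey();
  }

  // Keys a DN may verify against: the current key, plus the previous one so
  // that tokens issued just before a roll remain valid until they expire.
  byte[][] verificationKeys() {
    return previousKey == null
        ? new byte[][] { currentKey }
        : new byte[][] { currentKey, previousKey };
  }

  void startRolling(long intervalHours) {
    ScheduledExecutorService ses =
        Executors.newSingleThreadScheduledExecutor();
    ses.scheduleAtFixedRate(this::roll, intervalHours, intervalHours,
        TimeUnit.HOURS);
  }
}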
> Support for data access authorization checking on DataNodes
> -----------------------------------------------------------
>
> Key: HADOOP-4359
> URL: https://issues.apache.org/jira/browse/HADOOP-4359
> Project: Hadoop Core
> Issue Type: New Feature
> Components: dfs
> Reporter: Kan Zhang
> Assignee: Kan Zhang
> Fix For: 0.20.0
>
>
> Currently, DataNodes do not enforce any access control on accesses to their
> data blocks. This makes it possible for an unauthorized client to read a data
> block as long as she can supply its block ID. It's also possible for anyone
> to write arbitrary data blocks to DataNodes.
> When users request file accesses on the NameNode, file permission checking
> takes place. Authorization decisions are made with regard to whether the
> requested accesses to those files (and implicitly, to their corresponding
> data blocks) are permitted. However, when it comes to subsequent data block
> accesses on the DataNodes, those authorization decisions are not made
> available to the DataNodes and consequently, such accesses are not verified.
> DataNodes are not capable of reaching those decisions independently, since
> they have no concept of files, let alone file permissions.
> In order to implement data access policies consistently across HDFS services,
> there is a need for a mechanism by which authorization decisions made on the
> NameNode can be faithfully enforced on the DataNodes and any unauthorized
> access is declined.