[ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443864#comment-13443864 ]

Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

1.      Do I understand correctly that your approach can be used to securely 
store (encrypt) data even on non-secure (security=simple) clusters?   
        
You are right!! If the TaskTracker and Task processes are owned by different 
users, then it is possible to use this approach to encrypt/decrypt data in a 
non-secure cluster. This does not require each task to be run as the job 
owner; a fixed user other than the TT user is sufficient. The cluster private 
key can be made readable/accessible only by the TaskTracker user, so the 
Tasks cannot get hold of it. But it requires the use of LinuxTaskController 
to spawn tasks as a different user, plus some code changes to enable this via 
configuration. A sketch of the relevant configuration is below.
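As a rough sketch (assuming the 1.0-era property name 
mapred.task.tracker.task-controller; the keystore path and permissions are 
illustrative, not from the patch):

// Sketch: run tasks as a different Unix user than the TaskTracker
// (1.0-era property names; keystore path is illustrative).
import org.apache.hadoop.conf.Configuration;

public class TaskIsolationSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Spawn tasks via the setuid LinuxTaskController instead of the
    // DefaultTaskController, so they run as a user other than the TT user.
    conf.set("mapred.task.tracker.task-controller",
             "org.apache.hadoop.mapred.LinuxTaskController");
    // The cluster keystore file would be owned by the TT user with mode
    // 0400, so tasks spawned as another user cannot read the private key.
  }
}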
            
2.      So JobClient uses the current user's credentials to obtain keys from 
the KeyStore, encrypts them with the cluster public key and sends them to the 
cluster along with the user credentials. JobTracker has nothing to do with 
the keys and passes the encrypted blob over to the TaskTrackers scheduled to 
execute the tasks. TT decrypts the user keys using the cluster private key 
and hands them to the local tasks, which is secure as the keys don't travel 
over the wire in the clear. Is it right so far?
        
That is correct. It's a clear and concise explanation of this straightforward 
approach. Please note that though the design is described in terms of 
TaskTrackers and TaskControllers (1.0 terminology), the implementation is 
available for both 1.0 and 2.0. A minimal sketch of the transport flow is 
below.
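For illustration only, here is that flow in plain JCE; the class and variable 
names are mine, not the patch's:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class KeyTransportSketch {
  public static void main(String[] args) throws Exception {
    // Stand-in for the cluster key pair; in the proposal the private half
    // lives only on the TaskTracker nodes.
    KeyPair cluster = KeyPairGenerator.getInstance("RSA").generateKeyPair();

    // JobClient side: a data key obtained from the user's keystore...
    SecretKey dataKey = KeyGenerator.getInstance("AES").generateKey();
    // ...is wrapped with the cluster public key before job submission.
    Cipher wrap = Cipher.getInstance("RSA");
    wrap.init(Cipher.WRAP_MODE, cluster.getPublic());
    byte[] blob = wrap.wrap(dataKey); // travels with the job, opaque to JT

    // TaskTracker side: unwrap with the cluster private key and hand the
    // plaintext key only to local tasks.
    Cipher unwrap = Cipher.getInstance("RSA");
    unwrap.init(Cipher.UNWRAP_MODE, cluster.getPrivate());
    SecretKey recovered =
        (SecretKey) unwrap.unwrap(blob, "AES", Cipher.SECRET_KEY);
    System.out.println(recovered.getAlgorithm());
  }
}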

3.      Should the TT be using the user's credentials to decrypt the blob of 
keys somehow? Or does it authenticate the user and then decrypt if 
authentication passes? I did not find this in your document.
                
This is an important point, as we do not want the TaskTracker to decrypt the 
blob of keys and blindly hand it over to Tasks. The JobClient stores the 
JobId along with the keys as part of the encrypted blob. The TaskTracker 
decrypts the blob and verifies that the JobId inside it matches the JobId of 
the task. The keys are handed over to Tasks only if the JobId verification 
succeeds. This ensures that keys are handed over only to the correct tasks.
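A minimal sketch of that check, assuming a simple blob layout of my own 
invention (JobId first, then the key bytes); the real serialization in the 
patch may differ:

import java.io.*;

public class JobIdCheckSketch {
  // JobClient side: bind the JobId into the payload before encryption.
  static byte[] pack(String jobId, byte[] keyMaterial) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeUTF(jobId);
    out.writeInt(keyMaterial.length);
    out.write(keyMaterial);
    out.flush();
    return bos.toByteArray(); // this is what gets encrypted
  }

  // TaskTracker side: after decryption, verify the JobId before releasing
  // the keys to a task.
  static byte[] unpackFor(String expectedJobId, byte[] decrypted)
      throws IOException {
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(decrypted));
    String jobId = in.readUTF();
    if (!jobId.equals(expectedJobId)) {
      // Refuse to hand keys to a task from a different job.
      throw new SecurityException("JobId mismatch: " + jobId);
    }
    byte[] keys = new byte[in.readInt()];
    in.readFully(keys);
    return keys;
  }
}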

4.      How is the cluster private key delivered to TTs?

The TTs use an implementation of the KeyProvider interface to retrieve keys. 
The implementation can be selected via cluster configuration. The default 
KeyProvider is a Java keystore based provider, in which the private key is 
stored in a Java keystore file on the TT machines. This is the same scheme 
web servers use to store their private keys. More complex key storage 
mechanisms can be plugged in via configuration.
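For example, the default provider would do something along these lines with 
the standard java.security.KeyStore API (the path, alias and passwords here 
are made up):

import java.io.FileInputStream;
import java.security.KeyStore;
import java.security.PrivateKey;

public class KeystoreProviderSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical path/alias/passwords; real values would come from
    // cluster configuration readable only by the TT user.
    KeyStore ks = KeyStore.getInstance("JKS");
    try (FileInputStream in =
             new FileInputStream("/etc/hadoop/cluster.keystore")) {
      ks.load(in, "storepass".toCharArray());
    }
    PrivateKey clusterKey =
        (PrivateKey) ks.getKey("cluster-key", "keypass".toCharArray());
    System.out.println(clusterKey.getAlgorithm());
  }
}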

5. I think the configuration parameter naming needs some changes. The 
parameters should not start with mapreduce.job. Based on your examples you 
can encrypt an HDFS file without spawning any actual jobs, and in that case 
seeing mapreduce.job.* is confusing.
My suggestion is to prefix all parameters with simply 
hadoop.crypto.*; then you can also use, e.g., the full word "keystore" 
instead of "ks".

The distributed utility to encrypt/decrypt an HDFS file actually spawns map 
jobs. Irrespective of that, I think it makes perfect sense to rename the 
configurations under hadoop.crypto, as this approach is useful in 
non-MapReduce situations. I'll change the configuration names.
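With the renamed prefix, client-side setup might then look something like 
this; the property names and the provider class name are placeholders of 
mine, not the final names from the patch:

import org.apache.hadoop.conf.Configuration;

public class CryptoConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical keys following the proposed hadoop.crypto.* convention;
    // the actual names will come out of the renaming in the patch.
    conf.set("hadoop.crypto.keystore.file", "/etc/hadoop/cluster.keystore");
    conf.set("hadoop.crypto.keystore.type", "JKS");
    conf.set("hadoop.crypto.key.provider",
             "org.apache.hadoop.security.crypto.JavaKeyStoreProvider");
  }
}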

I plan to get into reviewing the implementation soon.  

Thanks and please post your comments.
                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted 
> wherever it is stored. A common use case is to pull encrypted data out of a 
> datasource and store it in HDFS for analysis. The keys are stored in an 
> external keystore. 
> The feature adds a customizable framework to integrate different types of 
> keystores, support for the Java KeyStore, reading keys from keystores, and 
> transporting keys from the JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to 
> perform encryption-related steps.
> The design document is attached. It explains the requirements, design and 
> use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this against 1.1 and will upload it soon as 
> initial work for further refinement.
> Update: The patches are uploaded to subtasks. 
