[ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443864#comment-13443864 ]
Benoy Antony commented on MAPREDUCE-4491:
-----------------------------------------

1. Do I understand correctly that your approach can be used to securely store (encrypt) data even on non-secure (security=simple) clusters?

You are right! If the TaskTracker and Task processes are owned by different users, then this approach can be used to encrypt/decrypt data on a non-secure cluster. It does not require each task to run as the job owner; a fixed user other than the TaskTracker user is sufficient. The cluster private key can be made readable/accessible only by the TaskTracker user, so the tasks cannot get hold of it. It does, however, require the LinuxTaskController to spawn tasks as a different user, along with some code changes to enable this via configuration.

2. So JobClient uses the current user's credentials to obtain keys from the KeyStore, encrypts them with the cluster public key and sends them to the cluster along with the user credentials. JobTracker has nothing to do with the keys and passes the encrypted blob over to the TaskTrackers scheduled to execute the tasks. The TT decrypts the user keys using the cluster private key and hands them to the local tasks, which is secure as the keys don't travel over the wire in the clear. Is it right so far?

That is correct. It's a clear and concise explanation of this straightforward approach. Please note that although the design is described in terms of TaskTrackers and TaskControllers (1.0 terminology), the implementation is available for both 1.0 and 2.0.

3. The TT should be using user credentials to decrypt the blob of keys somehow? Or does it authenticate the user and then decrypt if authentication passes? I did not find it in your document.

This is an important point, as we do not want the TaskTracker to decrypt the blob of keys and blindly hand it over to the tasks. The JobClient stores the JobId along with the keys as part of the encrypted blob.
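For illustration, the seal/unseal round trip could look roughly like the sketch below. This is a hypothetical example using plain JCE RSA-OAEP; the class and method names (KeyBlobSketch, sealKeys, unsealKeys) and the wire format are mine, not taken from the attached patches, and a real implementation would need a hybrid scheme for key material larger than one RSA block.

```java
import javax.crypto.Cipher;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.util.Arrays;

public class KeyBlobSketch {

    // JobClient side: serialize the JobId together with the key material,
    // then encrypt the whole payload with the cluster public key.
    static byte[] sealKeys(String jobId, byte[] jobKeys, PublicKey clusterPublicKey)
            throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(jobId);           // the JobId travels inside the ciphertext
        out.writeInt(jobKeys.length);
        out.write(jobKeys);
        out.flush();
        Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
        rsa.init(Cipher.ENCRYPT_MODE, clusterPublicKey);
        // Sketch only: the payload must fit in a single RSA block here.
        return rsa.doFinal(buf.toByteArray());
    }

    // TaskTracker side: decrypt with the cluster private key and release the
    // keys only if the JobId baked into the blob matches the task's JobId.
    static byte[] unsealKeys(byte[] blob, String expectedJobId, PrivateKey clusterPrivateKey)
            throws Exception {
        Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
        rsa.init(Cipher.DECRYPT_MODE, clusterPrivateKey);
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(rsa.doFinal(blob)));
        String jobId = in.readUTF();
        if (!jobId.equals(expectedJobId)) {
            throw new SecurityException("JobId mismatch: blob is not for " + expectedJobId);
        }
        byte[] keys = new byte[in.readInt()];
        in.readFully(keys);
        return keys;
    }

    public static void main(String[] args) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair cluster = kpg.generateKeyPair();   // stands in for the cluster key pair
        byte[] jobKeys = "secret-key-material".getBytes("UTF-8");
        byte[] blob = sealKeys("job_0001", jobKeys, cluster.getPublic());
        byte[] recovered = unsealKeys(blob, "job_0001", cluster.getPrivate());
        System.out.println(Arrays.equals(jobKeys, recovered));
    }
}
```

The essential point the sketch tries to capture is that the JobId is bound inside the ciphertext, so a TaskTracker can refuse to hand keys to a task belonging to a different job even though it can decrypt every blob.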
The TaskTracker decrypts the encrypted blob and verifies that the JobId inside it matches the JobId of the task. The keys are handed over to a task only if the JobId verification succeeds. This ensures that keys are handed over to the correct tasks.

4. How is the cluster private key delivered to the TTs?

The TTs can use an implementation of the KeyProvider interface to retrieve keys; the implementation to use can be selected via the cluster configuration. The default KeyProvider is a Java keystore based provider, in which the private key is stored in a Java keystore file on the TT machines. This is the same scheme web servers use to store their private keys. It is possible to plug in more complex key storage mechanisms via configuration.

5. I think the configuration parameter naming needs some changes. The parameters should not start with mapreduce.job. Based on your examples, you can encrypt an HDFS file without spawning any actual jobs, in which case seeing mapreduce.job.* seems confusing. My suggestion is to prefix all parameters simply with hadoop.crypto.* Then you can use e.g. the full word "keystore" instead of "ks".

The distributed utility to encrypt/decrypt an HDFS file does actually spawn map tasks. Irrespective of that, I think it makes perfect sense to rename the configuration properties to hadoop.crypto.*, as this approach is useful in non-MapReduce situations. I'll change the configuration names.

I plan to get into reviewing the implementation soon. Thanks, and please post your comments.

> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted
> wherever it is stored.
> Common use case is to pull encrypted data out of a datasource and store it
> in HDFS for analysis. The keys are stored in an external keystore.
> The feature adds a customizable framework to integrate different types of
> keystores, support for the Java KeyStore, reading keys from keystores, and
> transporting keys from the JobClient to Tasks.
> The feature adds PGP encryption as a codec, plus additional utilities to
> perform encryption-related steps.
> The design document is attached. It explains the requirements, design and
> use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as initial
> work for further refinement.
> Update: The patches are uploaded to subtasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira