[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2013-01-31 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1356#comment-1356
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

Yes, that makes sense.

 Encryption and Key Protection
 -

 Key: MAPREDUCE-4491
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: documentation, security, task-controller, tasktracker
Reporter: Benoy Antony
Assignee: Benoy Antony
 Attachments: crypto_abstractions.zip, Hadoop_Encryption.pdf, 
 Hadoop_Encryption.pdf


 When dealing with sensitive data, it is required to keep the data encrypted 
 wherever it is stored. A common use case is to pull encrypted data out of a 
 datasource and store it in HDFS for analysis. The keys are stored in an 
 external keystore. 
 The feature adds a customizable framework for integrating different types of 
 keystores, support for the Java KeyStore, reading keys from keystores, and 
 transporting keys from the JobClient to Tasks.
 The feature also adds PGP encryption as a codec and additional utilities to 
 perform encryption-related steps.
 The design document is attached. It explains the requirements, design and use 
 cases.
 Kindly review and comment. Collaboration is very much welcome.
 I have a tested patch for this for 1.1 and will upload it soon as initial 
 work for further refinement.
 Update: The patches are uploaded to the subtasks. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2013-01-28 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564819#comment-13564819
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

I'll continue working on this jira. I'll start by incorporating Jerry's 
framework changes.



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2013-01-28 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565111#comment-13565111
 ] 

Jerry Chen commented on MAPREDUCE-4491:
---

[~benoyantony]
Hi Benoy, I am glad that you can work on this again; I was thinking about this 
too. One thing to consider is that the source code of the encryption feature 
actually spans two Hadoop projects: Hadoop Common and MapReduce. We may need a 
Hadoop Common JIRA entry to submit the framework and codec implementations, 
which sit mostly under the package org.apache.hadoop.io.crypto in 
hadoop-common, and keep the MapReduce-side changes in this JIRA.

Based on these two JIRA entries, we can further split subtasks if needed. What 
do you think?

Jerry



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-10-17 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478230#comment-13478230
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

+1, I agree. A more generic framework is useful for addressing encryption in 
components other than MR. Let us work on it together.



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-10-16 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477634#comment-13477634
 ] 

Jerry Chen commented on MAPREDUCE-4491:
---

Hi Benoy,
I am Haifeng from Intel, and we have been discussing this feature offline. I 
really appreciate your initiation of this work. We also see the importance of 
encryption and decryption in Hadoop when we are dealing with sensitive data. 

Just as you pointed out, the functional requirements are more or less the 
same. For the Hadoop community, we wish to get a high-level abstraction that 
provides a foundation for these requirements in different Hadoop components 
(such as HDFS, MapReduce, HBase) while enabling different implementations, 
such as different encryption algorithms or different ways of key management 
from different parties/companies, so that the concept is not bound to a 
specific implementation. Just as we discussed offline, the driving forces for 
such an abstraction are summarized as follows:

1. Encryption and decryption need to be supported in different components and 
usage models. For example, we may use the HDFS client API and a codec directly 
to encrypt and decrypt an HDFS file; we may use MapReduce to process an 
encrypted file and output an encrypted file; and HBase may need to store its 
files (such as HFiles) in an encrypted way.

2. The community may have different implementations of encryption codecs and 
different ways of providing keys. CompressionCodec gives us a foundation for 
related work, but CompressionCodec alone is not enough for encryption and 
decryption, because a CompressionCodec is assumed to initialize from the 
Hadoop Configuration, while encryption/decryption may need a per-file crypto 
context such as the key. With an abstraction layer for crypto, we can share 
common features, such as providing different keys for different input files of 
a MapReduce job, rather than each implementation going its own way in the 
MapReduce core and eventually turning into a mess.

Based on these driving forces, the work you have done, and our offline 
discussions, we refined our work and would like to propose the following:

1. For Hadoop Common, a new CryptoCodec interface which extends 
CompressionCodec, adding the methods getCryptoContext/setCryptoContext. Just 
like CompressionCodec, it will initialize its global settings from 
Configuration, but CryptoCodec will receive its crypto context (the key, for 
example) through a CryptoContext object set via setCryptoContext. This allows 
different usage models, such as using CryptoCodec directly to encrypt/decrypt 
an HDFS file by providing the CryptoContext (key) explicitly, or the MapReduce 
way of using CryptoCodec, where a CryptoContext (key) is chosen per file based 
on some policy.

Any specific crypto implementation falls under this umbrella and will 
implement CryptoCodec. The PGPCodec fits well as an implementation of 
CryptoCodec, and we are also able to implement our splittable CryptoCodec.
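
A minimal sketch of what this shape could look like, based only on the 
description above; the CryptoContext fields and anything beyond 
getCryptoContext/setCryptoContext are illustrative assumptions, not the 
attached code:

{code:java}
package org.apache.hadoop.io.crypto;

import org.apache.hadoop.io.compress.CompressionCodec;

/** Per-file crypto settings (for example, the key); a placeholder for the attached design. */
class CryptoContext {
  private byte[] key;
  public byte[] getKey() { return key; }
  public void setKey(byte[] key) { this.key = key; }
}

/**
 * Proposed abstraction: global settings still come from Configuration
 * (as with CompressionCodec), while the per-file key/context is supplied
 * through setCryptoContext before streams are created.
 */
public interface CryptoCodec extends CompressionCodec {
  CryptoContext getCryptoContext();
  void setCryptoContext(CryptoContext context);
}
{code}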

2. For MapReduce, use a CryptoContextProvider interface to abstract the 
implementation-specific service, so that the MapReduce core can contain shared 
code that retrieves the CryptoContext of a specific file from a 
CryptoContextProvider and passes it to the CryptoCodec in use. Different 
CryptoContextProvider implementations can implement different ways of deciding 
the CryptoContext and different ways of retrieving keys from different key 
stores. We can provide basic and common implementations of 
CryptoContextProvider, such as one that provides the CryptoContext for a file 
by matching the file path against a regular expression and gets the key from a 
Java KeyStore, while not preventing users from implementing or extending their 
own if the existing implementations do not satisfy their requirements.

CryptoContextProvider configurations are passed through the Hadoop job 
configuration and credentials (credential secret keys), and a 
CryptoContextProvider implementation can choose whether or not to encrypt the 
secret keys stored in the job Credentials.
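
As a rough illustration of the provider side (the interface name comes from 
the proposal; the exact method signatures are assumptions for this sketch, and 
CryptoContext is the type proposed above):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;

/** Resolves the per-file crypto context for the MapReduce framework. */
public interface CryptoContextProvider {
  /** Initialize from the job configuration plus any secrets carried in the job Credentials. */
  void init(Configuration conf, Credentials credentials) throws IOException;

  /** Return the crypto context (key, etc.) to use for the given input or output file. */
  CryptoContext getCryptoContext(Path file) throws IOException;
}
{code}

A regex-based implementation would, for example, map file paths matching a 
configured pattern to a key alias and load that key from a Java KeyStore 
inside getCryptoContext.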

I attached the Java files of these interfaces and basic structures in the 
Attachments section to demonstrate the concepts, and I would like to write a 
design document for these high-level pieces once we have had enough discussion 
and reached an agreement.

Again, thanks for your patience and time. 



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-09-07 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13451050#comment-13451050
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

Key protection is simple to explain.
The JobClient retrieves keys from a configured keystore, encrypts the keys 
along with the jobId using the cluster public key, and submits the encrypted 
blob as part of the job credentials.
The TaskTracker decrypts the blob using the cluster private key during job 
localization and verifies that the jobId inside the blob matches the jobId of 
the task. During task launch, the keys are made available to the child (task) 
process as an environment variable.

Since the jobId is part of the encrypted blob, replay attacks are prevented by 
the jobId verification. It is easy to add integrity protection as well.
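
A simplified sketch of that flow, assuming an RSA cluster key pair; the helper 
names and the jobId-prefix blob layout are illustrative, not the patch itself:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.PrivateKey;
import java.security.PublicKey;
import javax.crypto.Cipher;

public class KeyBlob {

  /** JobClient side: bind the job keys to the jobId and seal with the cluster public key. */
  public static byte[] seal(String jobId, byte[] jobKeys, PublicKey clusterPublicKey)
      throws Exception {
    byte[] id = jobId.getBytes(StandardCharsets.UTF_8);
    byte[] payload = new byte[id.length + jobKeys.length];
    System.arraycopy(id, 0, payload, 0, id.length);
    System.arraycopy(jobKeys, 0, payload, id.length, jobKeys.length);
    Cipher cipher = Cipher.getInstance("RSA");          // fine for small payloads;
    cipher.init(Cipher.ENCRYPT_MODE, clusterPublicKey); // a real scheme would be hybrid
    return cipher.doFinal(payload);
  }

  /** TaskTracker side: decrypt during localization, release keys only if the jobId matches. */
  public static byte[] open(String expectedJobId, byte[] blob, PrivateKey clusterPrivateKey)
      throws Exception {
    Cipher cipher = Cipher.getInstance("RSA");
    cipher.init(Cipher.DECRYPT_MODE, clusterPrivateKey);
    byte[] payload = cipher.doFinal(blob);
    byte[] id = expectedJobId.getBytes(StandardCharsets.UTF_8);
    for (int i = 0; i < id.length; i++) {
      if (i >= payload.length || payload[i] != id[i]) {
        // A blob sealed for another job cannot be replayed against this task.
        throw new SecurityException("JobId mismatch: keys withheld");
      }
    }
    byte[] keys = new byte[payload.length - id.length];
    System.arraycopy(payload, id.length, keys, 0, keys.length);
    return keys;
  }
}
{code}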

Now, the scheme was designed to be used in a secure cluster. It is good to 
explore whether it can be used in a non-secure cluster. 

One issue was with the cluster private key. It should be made accessible only 
to the TaskTracker process. If access is determined by the user's permissions, 
then tasks should run as a different user. But it need not be the job owner; 
it can be a fixed user. 

I believe you are bringing up another issue in this regard: if a rogue task 
can make a TT launch another rogue task with a jobId matching the one inside 
the encrypted blob, then the keys are available to the newly launched rogue 
task.
That's a good point. Basically, the rogue task is acting as a JT/AppMaster. I 
am not sure whether that is possible, and even if it is possible, there should 
be ways to detect it. 







[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-09-05 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13449260#comment-13449260
 ] 

Aaron T. Myers commented on MAPREDUCE-4491:
---

bq. This is an important point as we do not want Tasktracker to decrypt the 
blob of keys and blindly hand over to Tasks. The JobClient stores JobId along 
with keys as part of the encrypted blob. The taskTracker decrypts the encrypted 
blob, verifies that the JobId in the encrypted blob matches JobId of the task. 
The keys are handed over to Tasks only if the JobId verification is successful. 
This ensures that keys are handed over to the correct tasks.

Unless I'm missing something, this seems to be insecure unless secure 
authentication (i.e. Kerberos) is enabled, since someone could connect to the 
TT from a different task and simply report a different JobId. Or do I 
misunderstand somehow?



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-09-04 Thread Plamen Jeliazkov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13447981#comment-13447981
 ] 

Plamen Jeliazkov commented on MAPREDUCE-4491:
-

Great work, Benoy!

This looks like a very neat feature to add. I am all in support. I like the 
similarity with the compressor/decompressor interfaces and how easily the 
implementation can plug in any keystore.

I am in the midst of applying your patches and doing a small test locally and 
will reply back with any results I find.



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-08-29 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13443864#comment-13443864
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

1.  Do I understand correctly that your approach can be used to securely 
store (encrypt) data even on non-secure (security=simple) clusters?   

You are right!! If the TaskTracker and Task processes are owned by different 
users, then it is possible to use this approach to encrypt/decrypt data in a 
non-secure cluster. This does not require each task to run as the job owner; 
instead, a fixed user other than the TT user is sufficient. The cluster 
private key can be made readable/accessible only by the TaskTracker user. In 
this way, the Tasks cannot get hold of the cluster private key. But it 
requires the use of the LinuxTaskController to spawn tasks as a different 
user. It also requires some code changes to enable this via configuration. 

2.  So JobClient uses current user credentials to obtain keys from the 
KeyStore, encrypts them with cluster-public-key and sends to the cluster along 
with the user credentials. JobTracker has nothing to do with the keys and 
passes the encrypted blob over to TaskTrackers scheduled to execute the tasks. 
TT decrypts the user keys using private-cluster-key and handles them to the 
local tasks, which is secure as keys don't travel over the wires. Is it right 
so far?

That is correct. It's a clear and concise explanation of this straightforward 
approach. Please note that though the design is described in terms of 
TaskTrackers and TaskControllers (1.0 terminology), the implementation is 
available for both 1.0 and 2.0.

3.  TT should be using user credentials to decrypt the blob of keys 
somehow? Or does it authenticate the user and then decrypts if authentication 
passes? I did not find it in your document.

This is an important point, as we do not want the TaskTracker to decrypt the 
blob of keys and blindly hand it over to Tasks. The JobClient stores the JobId 
along with the keys as part of the encrypted blob. The TaskTracker decrypts 
the encrypted blob and verifies that the JobId in the blob matches the JobId 
of the task. The keys are handed over to Tasks only if the JobId verification 
is successful. This ensures that keys are handed over to the correct tasks.

4.  How cluster-private-key is delivered to TTs?

The TTs can use an implementation of the KeyProvider interface to retrieve 
keys. The implementation can be selected via the cluster configuration. The 
default KeyProvider is a Java keystore-based provider, in which the private 
key is stored in a Java keystore file on the TT machines. This is the same 
scheme web servers use to store their private keys. It is possible to plug in 
more complex key-storage mechanisms via configuration.
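
A minimal sketch of that pluggable idea; the interface shape, class names and 
constructor arguments are assumptions based on this description, not the 
patch API:

{code:java}
import java.io.FileInputStream;
import java.io.IOException;
import java.security.KeyStore;
import java.security.PrivateKey;

/** Pluggable source of the cluster private key on the TaskTracker side. */
public interface KeyProvider {
  PrivateKey getClusterPrivateKey() throws IOException;
}

/** Default scheme: the cluster private key lives in a Java keystore file on the TT host. */
class JksKeyProvider implements KeyProvider {
  private final String path;
  private final String alias;
  private final char[] password;

  JksKeyProvider(String path, String alias, char[] password) {
    this.path = path;
    this.alias = alias;
    this.password = password;
  }

  @Override
  public PrivateKey getClusterPrivateKey() throws IOException {
    try (FileInputStream in = new FileInputStream(path)) {
      KeyStore ks = KeyStore.getInstance("JKS");
      ks.load(in, password);
      return (PrivateKey) ks.getKey(alias, password);
    } catch (java.security.GeneralSecurityException e) {
      throw new IOException("Unable to load cluster private key", e);
    }
  }
}
{code}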

5. I think configuration parameters naming need some changes. They should not 
start with mapreduce.job. Based on your examples you can just encrypt a HDFS 
file without spawning any actual jobs. In this case seeing mapreduce.job.* 
seems confusing.
My suggestion is to prefix all parameters with simply 
hadoop.crypto.* Then you can use e.g. full word keystore instead of ks.

The distributed utility to encrypt/decrypt an HDFS file actually spawns map 
jobs. Irrespective of that, I think it makes perfect sense to rename the 
configuration parameters to hadoop.crypto.*, as this approach is useful in 
non-MapReduce situations as well. I'll change the configuration names. 

I plan to get into reviewing the implementation soon.  

Thanks and please post your comments.


[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-08-28 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13442982#comment-13442982
 ] 

Konstantin Shvachko commented on MAPREDUCE-4491:


Benoy, I went over your design document. It's a pretty comprehensive 
description. I want to clarify a couple of things. 
# Do I understand correctly that your approach can be used to securely store 
(encrypt) data even on non-secure (security=simple) clusters?
# So the JobClient uses the current user credentials to obtain keys from the 
KeyStore, encrypts them with the cluster-public-key, and sends them to the 
cluster along with the user credentials. The JobTracker has nothing to do with 
the keys and passes the encrypted blob over to the TaskTrackers scheduled to 
execute the tasks. The TT decrypts the user keys using the private-cluster-key 
and hands them to the local tasks, which is secure as the keys don't travel 
over the wires. Is that right so far?
# Should the TT be using user credentials to decrypt the blob of keys somehow? 
Or does it authenticate the user and then decrypt if authentication passes? I 
did not find this in your document.
# How is the cluster-private-key delivered to the TTs?
# I think the configuration parameter naming needs some changes. The 
parameters should not start with {{mapreduce.job}}. Based on your examples, 
you can encrypt an HDFS file without spawning any actual jobs, and in that 
case seeing {{mapreduce.job.*}} seems confusing.
My suggestion is to prefix all parameters simply with {{crypto.*}}. Then you 
can use, e.g., the full word keystore instead of ks.

I plan to get into reviewing the implementation soon.



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-08-28 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13442987#comment-13442987
 ] 

Konstantin Shvachko commented on MAPREDUCE-4491:


Edited previous comment. Was: crypto.* Changed to: hadoop.crypto.*
Similar to hadoop.security



[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-08-13 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13433237#comment-13433237
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

To make reviewing this patch easier, I am dividing it into smaller patches. I 
am opening subtasks under this JIRA issue and attaching the patches to those 
JIRAs.





[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-08-13 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13433401#comment-13433401
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

One of the goals of this feature is to achieve encryption of files in transit 
and at rest (when stored on disk). One way to achieve this goal is to depend 
on software/hardware that provides encryption in the local file system, plus 
rely on HDFS-3637 and MR shuffle encryption.

This JIRA explores an alternative approach to the problem that does not depend 
on special software to do local file system encryption. 

The key advantages of this approach over the local file system encryption 
approach are:

1) A file can be decrypted only if the user provides the correct key. So even 
if someone manages to read the file, they cannot read its contents without the 
key. Possession of the key is required in addition to read permission, so 
there are two levels of protection. 

There could be cases where a user accidentally sets read permissions for 
everyone, or where a superuser reads the file. This scheme still protects the 
data.

2) No dependency on local file system encryption software. This approach 
allows encryption without such a special setup.

3) A file is decrypted/encrypted only during processing and not every time it 
is read, so fewer encryption/decryption operations are performed.


Other key points are:

1) Encrypted and plain-text files can coexist in a normal file system. 

2) Developers can plug in other encryption algorithms/standards - CMS, AES, 
custom encryption - and thus have more flexibility.

3) It allows transporting keys/passwords/tokens from the JobClient to tasks 
for use cases other than encryption, such as connecting to a web service. 
MAPREDUCE-4491 adds key protection, and encryption uses it.

4) Keys can be managed in one central location. The JobClient fetches them on 
behalf of the user like any other application. 

If we look at these two approaches from a higher level, the local file system 
approach is an internal approach to encryption, while the MAPREDUCE-4491 
approach is an external one. The same two choices exist in normal 
(non-distributed) application development, where developers can rely on the 
file system to provide encryption or do the encryption themselves. There are 
tradeoffs and flexibility in both approaches, and we choose based on our use 
cases and needs. So I believe we should provide both alternatives in Hadoop.

In addition, this feature allows key protection in general, which can be used 
for purposes other than encryption. The keys themselves are encrypted when 
stored on disk and decrypted only in memory.






[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-07-30 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425244#comment-13425244
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4491:
---

Benoy, I've done a quick read of the doc. A couple of initial questions:

* If using a compression codec for encryption, do you lose the compression 
capability when encrypting, or will it work as a composition?
* For the keystores, are you proposing to store them in HDFS and use file 
system permissions to protect them? I'm not sure if I understood this part 
correctly. If that is the case, then HDFS-3637 would ensure secure transfer.

I'll read the design doc in more detail later this week.







[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-07-30 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425286#comment-13425286
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

To Rob's questions :

Different encryption keys for different files: At this point, the PGPCodec 
supports only one secret key/key pair for all input files. 
What we need is the ability to specify a secret key/key pair per input file. 
Another enhancement would be to specify a secret key/key pair per phase, such 
as map output and reduce output.
As you mentioned, this mapping has to be specified via configuration.
I'll try to add these two enhancements. 

Decryption/encryption of different columns within the same file: This is left 
to the MapReduce programmer, who has to do the decryption/encryption of the 
fields programmatically. The programmer can choose to use different keys for 
different fields in the MapReduce program. Multiple keys can be retrieved from 
the keystore, and these keys can be accessed in the mapper/reducer using the 
credentials API.  
In a higher-level interface like Hive, it may be possible to add metadata to 
specify the key name. Another reviewer has also recommended adding this 
capability to Hive, so that it can identify an encrypted field and specify the 
key (the name of the key) to be used to decrypt/encrypt it.
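
An illustrative sketch of that per-field pattern, assuming keys have been 
placed in the job credentials under aliases chosen by the job author (the 
aliases and field names here are hypothetical):

{code:java}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.security.Credentials;

public class FieldDecryptingMapper extends Mapper<LongWritable, Text, Text, Text> {

  private byte[] ssnKey;
  private byte[] salaryKey;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // One secret key per sensitive field, looked up by alias from the job credentials.
    Credentials creds = context.getCredentials();
    ssnKey = creds.getSecretKey(new Text("ssn-key"));
    salaryKey = creds.getSecretKey(new Text("salary-key"));
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Field-level decryption with ssnKey/salaryKey would happen here,
    // entirely under the control of the MapReduce programmer.
    context.write(new Text("record"), record);
  }
}
{code}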

Thanks for the review and recommendations, Rob. Please let me know if I have 
not answered the question correctly.





[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-07-30 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425320#comment-13425320
 ] 

Benoy Antony commented on MAPREDUCE-4491:
-

To Alejandro's questions:

1) If using compression codec for encryption, are you losing the compression 
capabilities if doing using encryption or will it work as a composition?
What I have done is to first compress and then encrypt. I have hardcoded the 
compression to ZIP; I can expose this as a configuration with a choice of 
{UNCOMPRESSED, ZIP, ZLIB, BZIP2}. This is an enhancement that I can add.
I have also provided a DistributedSplitter so that files can be split into 
smaller files.
I am not aware of an ability to chain multiple compression codecs, though that 
would be a desirable capability in this case. 
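
A toy illustration of the compress-then-encrypt ordering, using java.util.zip 
and a JCE AES cipher in place of the PGP layer purely for demonstration:

{code:java}
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class CompressThenEncrypt {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    Cipher cipher = Cipher.getInstance("AES");
    cipher.init(Cipher.ENCRYPT_MODE, key);

    // Bytes written here are compressed first; the compressed stream is then
    // encrypted on its way to disk.
    try (OutputStream out =
        new DeflaterOutputStream(
            new CipherOutputStream(new FileOutputStream("part-00000.enc"), cipher))) {
      out.write("plaintext record".getBytes("UTF-8"));
    }
  }
}
{code}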

2) For the keystores, are you proposing to store them in HDFS use file system 
permissions to protect them?

Actually, I am not proposing to store them in HDFS. The keystores themselves 
are encrypted, and a password is required to read keys from them. 

In the use cases that I have encountered, the keystores were external to the 
cluster. They were either on the CLI machine from which the jobs were 
submitted or on a separate machine from which the keys were retrieved based on 
the user's credentials. (Alfredo was used in this regard to fetch keys via a 
web service.)
So there are two schemes that I have supported:
  1) reading keys from a Java keystore
  2) reading keys from a web-service-based keystore (Safe)
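
For scheme (1), a small client-side sketch of pulling a symmetric data key out 
of a local Java keystore; the JCEKS path, alias and passwords are placeholders:

{code:java}
import java.io.FileInputStream;
import java.security.Key;
import java.security.KeyStore;

public class ClientKeystoreRead {
  public static void main(String[] args) throws Exception {
    // JCEKS (unlike plain JKS) can hold symmetric secret keys.
    KeyStore ks = KeyStore.getInstance("JCEKS");
    try (FileInputStream in = new FileInputStream("/home/user/keys.jceks")) {
      ks.load(in, "storepass".toCharArray());
    }
    Key dataKey = ks.getKey("payroll-data-key", "keypass".toCharArray());
    System.out.println("Loaded " + dataKey.getAlgorithm() + " key, "
        + dataKey.getEncoded().length + " bytes");
  }
}
{code}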









[jira] [Commented] (MAPREDUCE-4491) Encryption and Key Protection

2012-07-27 Thread Rob Weltman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424216#comment-13424216
 ] 

Rob Weltman commented on MAPREDUCE-4491:


If you want to use different encryption keys for different files (or even for 
different columns within the same file), how do you identify the right key from 
the Safe or Keystore, i.e. where is the mapping maintained? Would that be an 
additional layer on top of this?

