[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343975#comment-16343975 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

> what's your proposal for letting the client encryption be an optional feature, with key? Config

If s3a.client.encryption.enabled=true, then check for a BEZ: if one exists, encrypt objects; otherwise there is no encryption for the bucket. Or require that the BEZI provider is configured as well, rather than just the flag.

> Is the file length as returned in listings 100% consistent with the amount of data you get to read?

Yes.

> I'm not going to touch this right now as it's at the too-raw stage

That's why I submitted it: for you and everyone else to play with, to evaluate whether this is something we should move forward with. If needed I can go fix the broken S3Guard/committer/byte-comparison tests and have Yetus pass it, but the actual code is going to be about the same.


> Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
> ----------------------------------------------------------------
>
>                 Key: HADOOP-15006
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15006
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/s3, kms
>            Reporter: Steve Moist
>            Priority: Minor
>         Attachments: S3-CSE Proposal.pdf, s3-cse-poc.patch
>
> This is for the proposal to introduce Client Side Encryption to S3 in such a way
> that it can leverage HDFS transparent encryption, use the Hadoop KMS to manage keys,
> use the `hdfs crypto` command line tools to manage encryption zones in the cloud,
> and enable distcp to copy from HDFS to S3 (and vice-versa) with data still encrypted.
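For readers following along, here is a minimal sketch of the flag-plus-BEZ check described in the comment above. Only the property name s3a.client.encryption.enabled comes from the discussion; the class name, the lookup helper, and its behaviour are hypothetical stand-ins, not code from the patch.

{code}
import org.apache.hadoop.conf.Configuration;

/** Sketch only: the BEZ lookup below is a stub standing in for a real BEZI provider query. */
public class CseFlagCheck {

  /** Placeholder for "check for BEZ"; here it pretends no encryption zone exists. */
  static String lookupBucketEncryptionZone(String bucket) {
    return null;
  }

  static boolean clientSideEncryptionEnabled(Configuration conf, String bucket) {
    if (!conf.getBoolean("s3a.client.encryption.enabled", false)) {
      return false;                                     // feature switched off entirely
    }
    return lookupBucketEncryptionZone(bucket) != null;  // encrypt only if the bucket has a BEZ
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("s3a.client.encryption.enabled", true);
    System.out.println(clientSideEncryptionEnabled(conf, "example-bucket")); // false: stub finds no BEZ
  }
}
{code}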
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342920#comment-16342920 ]

Steve Loughran commented on HADOOP-15006:
------------------------------------------

I'm not going to touch this right now as it's at the too-raw stage, but it is progressing. I'll let Yetus be the style police, including rejecting files for lack of the ASF copyright header, line endings, etc. Ignoring that:

* what's your proposal for letting the client encryption be an optional feature, with key? Config
* Once it's configurable, the test would need to use two FS instances, one without encryption, one with.
* Is the file length as returned in listings 100% consistent with the amount of data you get to read?
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341776#comment-16341776 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

OK, fixed it. The block output workflow was writing the data to disk already encrypted and then encrypting it again when it sent it to S3, which broke decryption. So there's still a small issue with that in some cases. However, encryption should now work fine for most things.

It uses a fixed IV and key to do the encryption, so any files written to S3 will be automatically encrypted/decrypted, and we get some free coverage from the unit tests. It's a quick and dirty prototype, so many of the unit tests fail as it doesn't cover all scenarios. I'm able to upload/download files to S3 using the command line without issue. When I view the object in the S3 GUI it shows up encrypted, but it is automatically decrypted when I do an hdfs get from the CLI.

Play around with it and let me know what you think. The CryptoStreams work fine, but the integration to fully flesh this out into a feature is what we need to really look at.
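To make the prototype's approach concrete, here is a rough, self-contained sketch of wrapping an S3A output stream in Hadoop's crypto stream with a fixed key and IV, which is roughly what the comment above describes. The bucket, object path, and all-zero key/IV are placeholders; this is a sketch assuming the usual Hadoop crypto-stream constructors, not the patch itself.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.crypto.CipherSuite;
import org.apache.hadoop.crypto.CryptoCodec;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.crypto.CryptoFSDataOutputStream;

public class FixedKeyCseWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    byte[] key = new byte[16];   // fixed all-zero demo key; a real feature would get an EDEK from the KMS
    byte[] iv = new byte[16];    // fixed all-zero demo IV

    Path dest = new Path("s3a://example-bucket/demo.txt");   // hypothetical bucket/object
    FileSystem fs = FileSystem.get(dest.toUri(), conf);

    // AES/CTR/NoPadding keeps the ciphertext the same length as the plaintext.
    CryptoCodec codec = CryptoCodec.getInstance(conf, CipherSuite.AES_CTR_NOPADDING);

    FSDataOutputStream raw = fs.create(dest);
    try (CryptoFSDataOutputStream out = new CryptoFSDataOutputStream(raw, codec, key, iv)) {
      out.write("hello, encrypted world".getBytes("UTF-8"));  // stored encrypted in S3
    }
  }
}
{code}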
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339855#comment-16339855 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

I've attached a quick and dirty proof of concept. It now uses CryptoFSDataInput/OutputStream to write encrypted data to S3, with a constant key and IV. I've mainly just run the entire unit/integration test suite against it. I get a bunch of failures from tests that compare data byte for byte, and as expected those fail, but the rest of the suite passes with encryption enabled. Play around with it and let me know what you all think.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322852#comment-16322852 ]

Steve Loughran commented on HADOOP-15006:
------------------------------------------

Look at how SequenceFile.Reader() works: it gets the length of the file from getFileStatus() & then uses it downstream. If that size != stream length, this is the code which crashes first :)

Imagine we had
{code}
FSDataInputStream file = openFile(fs, filename, bufSize, len);
len = file.getLength();
...
{code}

Fix that class and Hadoop internally gets robust, and on object stores it actually cuts out a HEAD request (saves $0.005 and 100 ms). Patch ORC & Parquet and you've just moved the core formats onto it too.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320663#comment-16320663 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

{quote}
you are going to have to handle the fact that its only on the final post where things are manifest later, with a separate process/host doing the final POST. that's when any info would have to be persisted to DDB
{quote}
That seems reasonable. I had thought to store the OEMI first and have it act as the lock, to prevent multiple uploads from confusing whose OEMI is whose. By inserting it first, though, it's hard to tell whether the object is still being uploaded or the host/process/JVM/etc. has crashed and abandoned the upload. There's more to discuss here than I originally thought.

{quote}
One thing to consider is that posix seek() lets you do a relative-to-EOF seek. Could we offer that, or at least an API call to get the length of a data source, in our input stream. And then modify the core file formats (Sequence, ORC, Parquet...) to use the length returned in the open call, rather than a previously cached value?
{quote}
Can you give me a sample scenario to help visualize what you're trying to solve with this?
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320399#comment-16320399 ]

Steve Loughran commented on HADOOP-15006:
------------------------------------------

* you are going to have to handle the fact that it's only on the final POST where things are manifest later, with a separate process/host doing the final POST; that's when any info would have to be persisted to DDB
* There is a length HTTP header on encrypted files; these could be scanned to repopulate the S3Guard tables

One thing to consider is that POSIX seek() lets you do a relative-to-EOF seek. Could we offer that, or at least an API call to get the length of a data source, in our input stream? And then modify the core file formats (Sequence, ORC, Parquet...) to use the length returned in the open call, rather than a previously cached value?
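To illustrate the shape of what's being floated here (purely a hypothetical sketch, not an existing Hadoop API): an input stream could expose the real data length, and an EOF-relative seek could be layered on top of the standard absolute Seekable.seek(). Both names below are made up.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

/** Hypothetical: a way for the stream itself to report the true (decrypted) data length. */
interface HasDataLength {
  long getDataLength() throws IOException;
}

class EofRelativeSeek {
  /**
   * Hypothetical EOF-relative seek helper: Seekable.seek() is absolute, so a format reader
   * that knows the stream's real length could seek to "length - offset" instead of relying
   * on a previously cached file-status length.
   */
  static void seekFromEnd(FSDataInputStream in, long dataLength, long offsetFromEof)
      throws IOException {
    in.seek(dataLength - offsetFromEof);
  }
}
{code}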
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317345#comment-16317345 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

{quote}
Do you have any good links for further reading on the crypto algorithms, particularly the NoPadding variant you mention?
{quote}
I've got a few links about general block ciphers and padding; I'll post more as I find them later.
* http://web.cs.ucdavis.edu/~rogaway/papers/modes.pdf is a good (and lengthy) doc on encryption; see page 5 for a summary and page 45 for more on CTR.
* https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Padding Obligatory Wikipedia page
* https://www.cryptosys.net/pki/manpki/pki_paddingschemes.html

{quote}
(How do lengths and byte offsets map from the user data to the encrypted stream?)
{quote}
They should map 1-1. AES works on a fixed block size of 16 bytes, so you would read bytes 0->15, 16->31, etc. It means you can't read 3->18 directly; you'd have to read 0->31 to get 3->18. I'm not sure exactly, but I would imagine HDFS transparent encryption has the same issue and has already solved it. It would just mean we have to get the preceding block as well to properly decrypt. CTR allows for random-access encryption/decryption, so I don't expect this to be a problem; it's just a minor technical point. So far in my testing I haven't hit it, but I also haven't been directly invoking the MultiPartUpload. This is the only issue I see when randomly reading/writing blocks, and it's easily solvable.

{quote}
What are the actual atomicity requirements?
{quote}
This is a good question. The main atomicity requirement I have is that once the S3a stream is closed and the object committed, the OEMI is also committed. I haven't fully worked that out from a specific code perspective yet.

{quote}
Specifically, how do we handle multiple clients racing to create the same path?
{quote}
Using OEMI Storage Option #5: suppose userA uploads objectA with OEMIA to key1, and userB uploads objectB with OEMIB to key1. S3 doesn't guarantee which one is the winner, so it is possible that objectA is stored with OEMIB. This shouldn't happen if the OEMI is stored as object metadata. It could be done such that we create a "lock" on the DynamoDB row so that userA owns that location, preventing the upload of objectB. In HDFS, once the INode is created, it prevents userB from creating that file; perhaps we should do the same for S3?

{quote}
Also since the scope of the encryption zone is the bucket, we could get by with a very low provisioned I/O budget on the Dynamo table and save money, no?
{quote}
Yes, we should be able to; I believe the only requirement is that the table can have things inserted and read. IIRC, each bucket gets its own S3a instance per JVM (or something to that effect), so at least at startup we can cache its EZ information.
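As a small illustration of the block-alignment point above (a sketch only; the class and method names are made up), rounding a requested byte range out to the 16-byte AES block boundaries gives the range that actually has to be fetched and decrypted:

{code}
public class AesBlockAlign {
  static final int AES_BLOCK_SIZE = 16;

  /** Round a requested [start, endInclusive] byte range out to AES block boundaries. */
  static long[] alignedRange(long start, long endInclusive) {
    long alignedStart = (start / AES_BLOCK_SIZE) * AES_BLOCK_SIZE;
    long alignedEnd = ((endInclusive / AES_BLOCK_SIZE) + 1) * AES_BLOCK_SIZE - 1;
    return new long[] { alignedStart, alignedEnd };
  }

  public static void main(String[] args) {
    long[] r = alignedRange(3, 18);
    System.out.println(r[0] + ".." + r[1]);   // prints 0..31, matching the 3->18 example above
  }
}
{code}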
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314171#comment-16314171 ]

Aaron Fabbri commented on HADOOP-15006:
---------------------------------------

Thanks again for writing this up [~moist] -- it is very helpful. I'm in general agreement with the discussion here.

The length / seek issue is interesting. Do you have any good links for further reading on the crypto algorithms, particularly the NoPadding variant you mention? (How do lengths and byte offsets map from the user data to the encrypted stream?)

What are the actual atomicity requirements? Specifically, how do we handle multiple clients racing to create the same path?

Option 5 (store encryption metadata in Dynamo, but in its own separate table) sounds good to me. As we discussed offline, data in S3Guard has a different lifetime (it is not required to be retained, and that policy offers multiple benefits for S3Guard but would cause data loss for CSE). Also, since the scope of the encryption zone is the bucket, we could get by with a very low provisioned I/O budget on the Dynamo table and save money, no?

I'm available any time to give a walkthrough of S3Guard's DynamoDB logic or answer any questions about it. Also thanks [~xiaochen] and Steve for taking time to look over this.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312157#comment-16312157 ]

Xiao Chen commented on HADOOP-15006:
------------------------------------

Had a sync with [~moist]; summarizing some points discussed:
- Using a new hadoop crypto command would give us more freedom and cleaner interfaces.
- We could be fancy and support a hybrid HDFS + S3 setup in the crypto commands, but that perhaps only makes sense if we actually support hybrid Hadoop clusters.
- I'm okay with option 5 to store the BEZI and OEMI, but we should get [~fabbri]'s comments from his past experience.
- No support for messing with things directly in S3.
- Atomicity is delegated to DynamoDB (read after write), and it seems there should not be scenarios where the BEZI and OEMI mismatch with the actual dir / file in DynamoDB. We need to think about this carefully, though, since any miss could become data loss.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310101#comment-16310101 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

{quote}
Before worrying about these, why not conduct some experiments? You could take S3A and modify it to always encrypt client side with the same key, then run as many integration tests as you can against it (Hive, Spark, impala, ...), and see what fails. I think that should be a first step to anything client-side related
{quote}
I wrote a simple proof of concept back in May using the HDFS Crypto Streams to wrap the S3 streams with a fixed AES key and IV. I was able to run the S3 integration tests without issue, run teragen/sort/verify without issue, and write various files (of differing sizes) and compare the checksums. That gave me enough confidence back then to move forward with writing the original proposal. Unfortunately, I seem to have misplaced that work since it's been so long. I'll work on re-creating it in the next few weeks and post it here; I've got a deadline I have to focus on for now. Besides, AES/CTR/NoPadding should generate ciphertext the same size as the plaintext, unlike the AWS SDK's AES/CBC/PKCS5Padding, which is what causes the file size to change.
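A quick way to see the length difference being described: a standalone JCE demo, nothing to do with the patch itself; the all-zero key and IV are throwaway placeholders.

{code}
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CtrVsCbcLength {
  public static void main(String[] args) throws Exception {
    byte[] key = new byte[16];        // throwaway demo key
    byte[] iv = new byte[16];         // throwaway demo IV
    byte[] plaintext = new byte[100]; // 100 bytes of input

    Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
    ctr.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    System.out.println("CTR ciphertext: " + ctr.doFinal(plaintext).length + " bytes"); // 100

    Cipher cbc = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cbc.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    System.out.println("CBC ciphertext: " + cbc.doFinal(plaintext).length + " bytes"); // 112, padded
  }
}
{code}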
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309710#comment-16309710 ]

Steve Loughran commented on HADOOP-15006:
------------------------------------------

As noted before, I really don't like how client-side encryption results in datasets shorter than the listed file length: this breaks so much. I'm currently not confident that you can use any client-side encrypted data as a source for operations.

# Somehow S3Guard-enabled buckets should be better at this (there's a header which you can get in a HEAD or GET which returns the real length, but it doesn't show in a LIST). If S3Guard can check these values on a file open, the shorter value can be cached.
# Maybe the actual length of a file could be provided by an input stream (if the right API is there), and/or seek() could be expanded to explicitly support EOF-relative seeks, so that the standard bits of code which do this (Hadoop's internal formats, Parquet & ORC, presumably) can use it.

Before worrying about these, why not conduct some experiments? You could take S3A and modify it to always encrypt client side with the same key, then run as many integration tests as you can against it (Hive, Spark, Impala, ...), *and see what fails*. I think that should be the first step for anything client-side related.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308621#comment-16308621 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

{quote}
It appears there will be NO CHANGES to the KMS, right?
{quote}
Yes, there are no changes to the KMS, and I expect to be able to do all KMS actions through the existing API calls.

{quote}
Do you intend to add a hadoop crypto wrapper, or is there intention to change hdfs's CryptoAdmin? I'd suggest the former.
{quote}
I'm planning to change CryptoAdmin to support S3, but it would call out into an S3aCryptoAdmin and change the CLI invocation. While this may be a bigger change now, renaming the command can start paving the way for Azure CSE way down the road.

{quote}
About metadata persistence, does option #5 mean to add a new column to the S3G DynamoDB table, to store the edeks along with the other metadata of a file? I feel this is the safest way because we don't have to worry about consistency particular to edeks.
{quote}
No, it is not meant to be part of S3Guard, as S3Guard has the ability to delete the table and refresh it. Doing so would cause loss of EDEKs and therefore data loss.

{quote}
How is the EZ information stored?
{quote}
It is stored as the BEZI. Similarly to the NN, S3a could read the BEZI, or cache it as well.

{quote}
What's the behavior when KMS is slow and we create a file? If the generateEDEK exhausted cache, where does it hang? When this happens does it impact other operation? (This is a pain for NN due to the single namespace lock, may not be an issue for s3a but I'm curious)
{quote}
S3a would have to wait for the KMS to generate an EDEK. I'm not sure how this would affect other operations.

{quote}
I saw raw bytes mentioned. For hdfs, the way to access raw bytes is through a special path: /.reserved/raw/original_path. How is this done in s3?
{quote}
That piece I'm not too sure about yet. It could be a virtual object key (such as bucket-name/.reserved/raw/original_path) that isn't actually created in S3 and is interpreted on a copy or move command to not decrypt the data. I would like to include it, as users could potentially use DistCp to copy encrypted data from an HDFS EZ to an S3 EZ without having to decrypt it, since they share the same EZK. I think that feature would be a great thing to do, but this subject needs to be fleshed out more.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302145#comment-16302145 ]

Xiao Chen commented on HADOOP-15006:
------------------------------------

Hi [~moist],
Thanks for posting the design doc here! I had a quick review and have a few comments. I'm not an expert on S3, so these come from the KMS and HDFS side. You may find some of the comments really shy on S3 knowledge... feel free to point me to the related jira or doc if so. :)

- It appears there will be NO CHANGES to the KMS, right? We are doing the s3a equivalent of hdfs crypto and hdfs clients, and all required KMS actions can be achieved using existing KMS APIs.
- {{hdfs crypto}}: (btw, there is no {{hadoop crypto}} currently, only {{hdfs crypto}}.) I understand that for hdfs r/w operations we can happily update hdfs-site and core-site and then use whatever hadoop fs CLIs. But {{CryptoAdmin}} currently just calls into hdfs. Do you intend to add a hadoop crypto wrapper, or is there intention to change hdfs's CryptoAdmin? I'd suggest the former...
- About metadata persistence, does option #5 mean to add a new column to the S3G DynamoDB table, to store the edeks along with the other metadata of a file? I feel this is the safest way because we don't have to worry about consistency particular to edeks.
- How is the EZ information stored? In HDFS it's part of the root zone INodeDir's xattr, and scanned upon NN start. We'd need a similarly reliable way to make sure all files within an EZ will be encrypted. The architecture graph only shows that the BEZI and OEMI are stored separately.
- What's the behavior when KMS is slow and we create a file? If the generateEDEK exhausted cache, where does it hang? When this happens does it impact other operation? (This is a pain for NN due to the single namespace lock, may not be an issue for s3a but I'm curious)
- I saw raw bytes mentioned. For hdfs, the way to access raw bytes is through a special path: {{/.reserved/raw/original_path}}. How is this done in s3?
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288396#comment-16288396 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

I don't think anyone's started it. I posted the design doc in hopes of others looking at it and critiquing it in the background while I focus on other things, so that once enough people had reviewed it, work could start on it then. The changes to the Hadoop CLI, KMS and other components were what worried me about it; it's bigger in scope than just S3a.

In the proposal I made, we didn't have an issue with the ciphertext length differing from the plaintext length, as we used CTR with no padding vs the CBC with PKCS5Padding that the AWS SDK uses. I wrote a quick prototype using AES/CTR/NoPadding, ran all the integration tests against it without issue, and did diffs on the before/after of upload/download along with TeraSort, with no issues.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287510#comment-16287510 ]

Steve Loughran commented on HADOOP-15006:
------------------------------------------

Not looked, other than to scan through it and conclude "it's complicated". It is certainly not on my TODO list. If others started it? Well, I wouldn't block it, as long as it could be integrated in a way where its integration points were relatively non-intrusive. Both S3Guard and the committers (well, the retry logic really) have radically changed S3AFileSystem, and it's reaching that scale point where I'm starting to find it hard to visualise.

Short term, I'd prefer the S3Guard phase II work to get focus. At the same time, I can see the interest in client-side encryption. It's just that mismatch between file length and list length which worries me.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286869#comment-16286869 ]

Aaron Fabbri commented on HADOOP-15006:
---------------------------------------

Thanks for the reminder. Will try to take a look this week.
[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
[ https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286574#comment-16286574 ]

Steve Moist commented on HADOOP-15006:
--------------------------------------

Hey [~fabbri] and [~steve_l], any traction on this?