[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-29 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343975#comment-16343975
 ] 

Steve Moist commented on HADOOP-15006:
--

>what's your proposal for letting the client encryption be an optional feature, 
>with key? Config

If s3a.client.encryption.enabled=true, check the bucket for a BEZ; if one 
exists, encrypt objects, otherwise don't encrypt for that bucket.  
Alternatively, key off whether the BEZI provider is configured as well, rather 
than just the flag.
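To make the decision concrete, here is a minimal sketch of that logic. The shape of the check and the provider key name are assumptions for illustration; only s3a.client.encryption.enabled appears in the discussion above:

```java
// Hypothetical sketch of the enablement check described above.
// "s3a.client.encryption.bezi.provider" is an assumed, illustrative key name.
import java.util.HashMap;
import java.util.Map;

public class CseEnablement {
    // Decide whether to client-side encrypt objects for a bucket:
    // the flag (or a configured BEZI provider) must be set, and the
    // bucket must actually have a bucket encryption zone (BEZ).
    static boolean encryptionEnabled(Map<String, String> conf, boolean bucketHasBez) {
        boolean flag = Boolean.parseBoolean(
            conf.getOrDefault("s3a.client.encryption.enabled", "false"));
        boolean providerConfigured = conf.containsKey("s3a.client.encryption.bezi.provider");
        return (flag || providerConfigured) && bucketHasBez;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("s3a.client.encryption.enabled", "true");
        System.out.println(encryptionEnabled(conf, true));  // true: flag set, BEZ exists
        System.out.println(encryptionEnabled(conf, false)); // false: no BEZ on the bucket
    }
}
```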

>Is the file length as returned in listings 100% consistent with the amount of 
>data you get to read?

Yes.

>I'm not going to touch this right now as its at the too raw stage

That's why I submitted it: for you and everyone else to play with, to evaluate 
whether this is something we should move forward with.  If needed, I can fix 
the broken S3Guard/committer/byte-comparison tests and have yetus pass it, but 
the actual code is going to be about the same.

 

> Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
> ---
>
> Key: HADOOP-15006
> URL: https://issues.apache.org/jira/browse/HADOOP-15006
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs/s3, kms
>Reporter: Steve Moist
>Priority: Minor
> Attachments: S3-CSE Proposal.pdf, s3-cse-poc.patch
>
>
> This is for the proposal to introduce Client Side Encryption to S3 in such a 
> way that it can leverage HDFS transparent encryption, use the Hadoop KMS to 
> manage keys, use the `hdfs crypto` command line tools to manage encryption 
> zones in the cloud, and enable distcp to copy from HDFS to S3 (and 
> vice-versa) with data still encrypted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342920#comment-16342920
 ] 

Steve Loughran commented on HADOOP-15006:
-

I'm not going to touch this right now as it's at the too-raw stage, but it's 
progressing. I'll let yetus be the style police, including rejecting files for 
lack of ASF copyright headers, line endings, etc.

Ignoring that:
 * what's your proposal for letting the client encryption be an optional 
feature, with key? Config
 * Once it's configurable, the test would need to use two FS instances, one 
without encryption, one with.
 * Is the file length as returned in listings 100% consistent with the amount 
of data you get to read?




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-26 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341776#comment-16341776
 ] 

Steve Moist commented on HADOOP-15006:
--

Ok, fixed it.  The block output workflow was writing the data to disk already 
encrypted and then encrypting it again when sending to S3, so it was 
double-encrypted and wouldn't decrypt correctly; there's still a small issue 
with that in some cases.  However, encryption should now work fine for most 
things.  It uses a fixed IV and key, so any files written to S3 will be 
automatically encrypted/decrypted, which gives us some free coverage from the 
unit tests.  It's a quick and dirty prototype, so many of the unit tests fail 
as it doesn't cover all scenarios.  I'm able to upload/download files to S3 
using the command line without issue.  When I view the object in the S3 GUI it 
shows up encrypted, but it decrypts automatically when I do an hdfs get from 
the CLI.  Play around with it and let me know what you think.  The 
CryptoStreams work fine; what we really need to look at is the integration to 
fully flesh this out into a feature.




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-25 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339855#comment-16339855
 ] 

Steve Moist commented on HADOOP-15006:
--

I've attached a quick and dirty proof of concept.  It now uses 
CryptoFSDataInput/OutputStream to write encrypted data to S3, with a constant 
key and IV.  I've mainly just run the entire unit/integration test suite 
against it.  I get a bunch of failures from tests that compare data byte for 
byte, which is expected, but the rest of the suite passes with encryption 
enabled.  Play around with it and let me know what you all think.




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322852#comment-16322852
 ] 

Steve Loughran commented on HADOOP-15006:
-

Look at how SequenceFile.Reader works: it gets the length of the file from 
getFileStatus() and then uses it downstream. If that size != stream length, 
this is the code which crashes first :)

Imagine we had

{code}
FSDataInputStream file = openFile(fs, filename, bufSize, len);
len = file.getLength();
...
{code}

Fix that class and Hadoop internally gets robust, and on object stores it 
actually cuts out a HEAD request (saves $0.005 and 100 ms).
Patch ORC & Parquet and you've just moved the core formats onto it too.





[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-10 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320663#comment-16320663
 ] 

Steve Moist commented on HADOOP-15006:
--

{quote}
you are going to have to handle the fact that its only on the final post where 
things are manifest later, with a separate process/host doing the final POST. 
that's when any info would have to be persisted to DDB
{quote}
That seems reasonable.  I had thought to store the OEMI first and have it act 
as the lock, to prevent multiple uploads from confusing whose OEMI is whose.  
By inserting it first, though, it's hard to tell whether the object is still 
being uploaded or the host/process/jvm/etc has crashed and abandoned the 
upload.  There's more to discuss than I originally thought.

{quote}
One thing to consider is that posix seek() lets you do a relative-to-EOF seek. 
Could we offer that, or at least an API call to get the length of a data 
source, in our input stream. And then modify the core file formats (Sequence, 
ORC, Parquet...) to use the length returned in the open call, rather than a 
previously cached value?
{quote}

Can you give me a sample scenario to help visualize what you're trying to 
solve with this?




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320399#comment-16320399
 ] 

Steve Loughran commented on HADOOP-15006:
-

* you are going to have to handle the fact that its only on the final post 
where things are manifest later, with a separate process/host doing the final 
POST. that's when any info would have to be persisted to DDB
* There is a length http header on encrypted files; these could be scanned to 
repopulate the s3guard tables

One thing to consider is that posix seek() lets you do a relative-to-EOF seek. 
Could we offer that, or at least an API call to get the length of a data 
source, in our input stream. And then modify the core file formats (Sequence, 
ORC, Parquet...) to use the length returned in the open call, rather than a 
previously cached value?




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-08 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317345#comment-16317345
 ] 

Steve Moist commented on HADOOP-15006:
--

{quote}
Do have any good links for further reading on the crypto algorithms, 
particularly the NoPadding variant you mention? (How do lengths and byte 
offsets map from the user data to the encrypted stream?)
{quote}
I've got a few links about general block ciphers and padding.  I'll post more 
as I find them later.
* http://web.cs.ucdavis.edu/~rogaway/papers/modes.pdf is a good (and lengthy) 
doc on encryption; see page 5 for a summary and page 45 for more on CTR.
* https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Padding 
Obligatory Wikipedia page
* https://www.cryptosys.net/pki/manpki/pki_paddingschemes.html

{quote}
 (How do lengths and byte offsets map from the user data to the encrypted 
stream?)
{quote}
They should map 1:1.  AES works on a fixed block size of 16 bytes, so you read 
bytes 0-15, 16-31, etc.  That means you can't read bytes 3-18 directly; you'd 
have to read 0-31 to get 3-18.  I'm not sure exactly, but I'd imagine HDFS 
transparent encryption has the same issue and has already solved it.  It would 
just mean fetching the preceding block as well to properly decrypt.  CTR 
allows random-access encryption/decryption, so I don't expect this to be a 
performance problem; it's just a minor technical point.  So far in my testing 
I haven't hit it, but I also haven't been directly invoking MultiPartUpload.  
This is the only issue I see when randomly reading/writing blocks, and it's 
easily solvable.
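The block-alignment arithmetic above is simple to sketch; this is an illustrative helper only, not code from the patch:

```java
// Sketch: compute the 16-byte-aligned byte range that must be fetched to
// decrypt an arbitrary requested range. As described above, reading bytes
// 3..18 requires fetching the blocks covering 0..31.
public class BlockAlign {
    static final int BLOCK = 16; // AES block size in bytes

    // Returns {alignedStart, alignedEndExclusive} covering [start, endInclusive].
    static long[] alignedRange(long start, long endInclusive) {
        long alignedStart = (start / BLOCK) * BLOCK;            // round down to block boundary
        long alignedEnd = ((endInclusive / BLOCK) + 1) * BLOCK; // round up to next boundary
        return new long[] { alignedStart, alignedEnd };
    }

    public static void main(String[] args) {
        long[] r = alignedRange(3, 18);
        System.out.println(r[0] + ".." + (r[1] - 1)); // prints "0..31"
    }
}
```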

{quote}
What are the actual atomicity requirements?
{quote}
This is a good question.  The main atomicity requirement I have is that once 
the S3a stream is closed and the object committed, the OEMI is also committed.  
I haven't fully worked that out from a specific code perspective yet.
{quote}
Specifically, how do we handle multiple clients racing to create the same path?
{quote}
Using OEMI storage option #5: suppose userA uploads objectA with OEMI-A to 
key1, and userB uploads objectB with OEMI-B to key1.  S3 doesn't guarantee 
which one wins, so it is possible that objectA is stored with OEMI-B.  This 
shouldn't happen if the OEMI is stored as object metadata.  Alternatively, we 
could create a "lock" on the DynamoDB row so that userA owns that location, 
preventing the upload of objectB.  In HDFS, once the INode is created it 
prevents userB from creating that file; perhaps we should do the same for S3?
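The "lock" idea boils down to a create-if-absent operation (in DynamoDB this would be a conditional put). A sketch modeling it with an in-memory map, purely for illustration:

```java
// Model of the proposed DynamoDB row lock: the first uploader to claim an
// object key wins, so a racing upload to the same key is rejected.
// ConcurrentHashMap stands in for the DynamoDB table in this sketch.
import java.util.concurrent.ConcurrentHashMap;

public class OemiLock {
    private final ConcurrentHashMap<String, String> table = new ConcurrentHashMap<>();

    // Returns true if this uploader claimed the key; false if already held.
    boolean claim(String objectKey, String uploader) {
        return table.putIfAbsent(objectKey, uploader) == null;
    }

    public static void main(String[] args) {
        OemiLock lock = new OemiLock();
        System.out.println(lock.claim("bucket/key1", "userA")); // true: userA wins
        System.out.println(lock.claim("bucket/key1", "userB")); // false: key already claimed
    }
}
```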

{quote}
Also since the scope of the encryption zone is the bucket, we could get by with 
a very low provisioned I/O budget on the Dynamo table and save money, no?
{quote}
Yes, we should be able to; I believe the only requirement is that the table 
supports inserts and reads.  IIRC, each bucket gets its own S3a instance per 
JVM (or something to that effect), so at least at startup we can cache its EZ 
information.




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-05 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314171#comment-16314171
 ] 

Aaron Fabbri commented on HADOOP-15006:
---

Thanks again for writing this up [~moist]--it is very helpful. I'm in general 
agreement with the discussion here.

The length / seek issue is interesting.

Do you have any good links for further reading on the crypto algorithms, 
particularly the NoPadding variant you mention?  (How do lengths and byte 
offsets map from the user data to the encrypted stream?)

What are the actual atomicity requirements? Specifically, how do we handle 
multiple clients racing to create the same path?

Option 5 (store encryption metadata in Dynamo, but in its own separate table) 
sounds good to me. As we discussed offline, data in S3Guard has a different 
lifetime (it is not required to be retained, and that policy offers multiple 
benefits for S3Guard but would cause data loss for CSE). Also since the scope 
of the encryption zone is the bucket, we could get by with a very low 
provisioned I/O budget on the Dynamo table and save money, no?

I'm available any time to give a walkthrough of S3Guard's DynamoDB logic or 
answer any questions about it.

Also thanks [~xiaochen] and Steve for taking time to look over this.




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-04 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312157#comment-16312157
 ] 

Xiao Chen commented on HADOOP-15006:


Had a sync with [~moist], summarizing some points discussed:
- Using a new hadoop crypto command would give us more freedom and cleaner 
interfaces.
- We could get fancy and support a hybrid hdfs + s3 setup in the crypto 
commands, but that perhaps only makes sense if we actually support hybrid 
hadoop clusters.
- I'm okay with option 5 to store the BEZI and OEMI, but we should get 
[~fabbri]'s comments from his past experience.
- No support for messing with things directly in S3.
- Atomicity is delegated to DynamoDB (read-after-write), and there should not 
be scenarios where the BEZI and OEMI mismatch the actual dir / file in 
DynamoDB.  We need to think about this carefully though, since any miss could 
become data loss.




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-03 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310101#comment-16310101
 ] 

Steve Moist commented on HADOOP-15006:
--

{quote}
Before worrying about these, why not conduct some experiments? You could take 
S3A and modify it to always encrypt client side with the same key, then run as 
many integration tests as you can against it (Hive, Spark, impala, ...), and 
see what fails. I think that should be a first step to anything client-side 
related
{quote}

I wrote a simple proof of concept back in May using the HDFS crypto streams 
wrapping the S3 streams with a fixed AES key and IV.  I was able to run the S3 
integration tests without issue, run teragen/sort/verify without issue, and 
write various files of differing sizes and compare the checksums.  That gave 
me enough confidence back then to move forward with writing the original 
proposal.  Unfortunately, I seem to have misplaced the work since it's been so 
long.  I'll work on re-creating it in the next few weeks and post it here; 
I've got a deadline I have to focus on for now.  Besides, AES/CTR/NoPadding 
generates ciphertext the same size as the plaintext, unlike the AWS SDK's 
AES/CBC/PKCS5Padding, which changes the file size.
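The length point is easy to verify with the JDK's own cipher implementations; the all-zero key and IV below are purely for demonstration, never for real use:

```java
// Compare ciphertext lengths for AES/CTR/NoPadding vs AES/CBC/PKCS5Padding.
// CTR is a stream mode, so ciphertext length == plaintext length; CBC with
// PKCS5 padding always rounds up to the next 16-byte boundary.
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CipherLengths {
    static int ciphertextLen(String transformation, int plaintextLen) {
        try {
            byte[] key = new byte[16]; // all-zero demo key, illustration only
            byte[] iv = new byte[16];  // all-zero demo IV, illustration only
            Cipher c = Cipher.getInstance(transformation);
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new IvParameterSpec(iv));
            return c.doFinal(new byte[plaintextLen]).length;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(ciphertextLen("AES/CTR/NoPadding", 100));    // 100
        System.out.println(ciphertextLen("AES/CBC/PKCS5Padding", 100)); // 112
    }
}
```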




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309710#comment-16309710
 ] 

Steve Loughran commented on HADOOP-15006:
-

As noted before, I really don't like how client-side encryption results in 
datasets shorter than the file length: this breaks so much. I'm currently not 
confident that you can use any client-side encrypted data as a source for 
operations.

# Somehow s3guard-enabled buckets should be better at this (there's a header 
which you can get in a HEAD or GET which returns the real length, but it 
doesn't show in a LIST). If s3guard can check these values on file open, the 
shorter value can be cached.
# Maybe the actual length of a file could be provided by an input stream (if 
the right API is there), and/or seek() could be expanded to explicitly support 
EOF-relative seeks, which the standard bits of code that do this (Hadoop's 
internal formats, Parquet & ORC, presumably) could then adopt.

Before worrying about these, why not conduct some experiments? You could take 
S3A and modify it to always encrypt client-side with the same key, then run as 
many integration tests as you can against it (Hive, Spark, Impala, ...), *and 
see what fails*. I think that should be the first step for anything 
client-side related.




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2018-01-02 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308621#comment-16308621
 ] 

Steve Moist commented on HADOOP-15006:
--

{quote}
It appears there will be NO CHANGES to the KMS, right? 
{quote}
Yes, there are no changes to the KMS, and I expect to be able to do all KMS 
actions through the existing API calls.

{quote}
Do you intend to add a hadoop crypto wrapper, or is there intention to change 
hdfs's CryptoAdmin? I'd suggest the former.
{quote}
I'm planning to change CryptoAdmin to support S3, having it call out into an 
S3aCryptoAdmin and changing the CLI invocation.  While this may be a bigger 
change now, renaming the command can start paving the way for Azure CSE way 
down the road.

{quote}
About metadata persistence, does option #5 mean to add a new column to the S3G 
DynamoDB table, to store the edeks along with the other metadata of a file? I 
feel this is the safest way because we don't have to worry about consistency 
particular to edeks.
{quote}
No, it is not meant to be part of S3Guard, since the S3Guard table can be 
deleted and refreshed.  Doing so would lose the EDEKs and therefore cause data 
loss.

{quote}
How is the EZ information stored? 
{quote}
It is stored as the BEZI.  Similarly to how the NN reads EZ info, S3a could 
read the BEZI and cache it as well.

{quote}
What's the behavior when KMS is slow and we create a file? If the generateEDEK 
exhausted cache, where does it hang? When this happens does it impact other 
operation? (This is a pain for NN due to the single namespace lock, may not be 
an issue for s3a but I'm curious)
{quote}
S3a would have to wait for the KMS to generate an EDEK.  I'm not sure how that 
would affect other operations.

{quote}
I saw raw bytes mentioned. For hdfs, the way to access raw bytes is through a 
special path: /.reserved/raw/original_path. How is this done in s3?
{quote}
I'm not too sure about that piece yet.  It could be a virtual object key (such 
as bucket-name/.reserved/raw/original_path) that isn't actually created in S3 
and is interpreted on a copy or move command to not decrypt the data.  I would 
like to include it, since users could then use DistCP to copy encrypted data 
from an HDFS EZ to an S3 EZ without having to decrypt it, as they share the 
same EZK.  I think that feature would be a great thing to do, but this subject 
needs to be fleshed out more.
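One way the virtual-prefix interpretation could look, purely as an assumed sketch (this helper is not in the patch; the prefix convention mirrors HDFS):

```java
// Sketch: recognize a virtual /.reserved/raw/ prefix on an S3 key and strip
// it to recover the real object key, signalling that the read/copy should
// skip decryption. The helper and its behavior are illustrative assumptions.
public class RawPath {
    static final String RAW_PREFIX = "/.reserved/raw";

    static boolean isRaw(String path) {
        return path.equals(RAW_PREFIX) || path.startsWith(RAW_PREFIX + "/");
    }

    // Strip the virtual prefix, leaving the actual object path.
    static String stripRaw(String path) {
        return isRaw(path) ? path.substring(RAW_PREFIX.length()) : path;
    }

    public static void main(String[] args) {
        System.out.println(isRaw("/.reserved/raw/logs/part-0000"));     // true
        System.out.println(stripRaw("/.reserved/raw/logs/part-0000"));  // /logs/part-0000
    }
}
```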




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2017-12-22 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16302145#comment-16302145
 ] 

Xiao Chen commented on HADOOP-15006:


Hi [~moist],

Thanks for posting the design doc here! I had a quick review and have a few 
comments.
I'm not an expert on s3, so the comments come from the kms and hdfs side. Some 
of them may be really shy on s3 knowledge... feel free to point me to the 
related jira or doc if so. :)

- It appears there will be NO CHANGES to the KMS, right? We are doing the s3a 
equivalent of hdfs crypto and hdfs clients, and all required KMS actions can be 
achieved using existing KMS APIs.
- {{hdfs crypto}}: (btw, there is no {{hadoop crypto}} currently, only {{hdfs 
crypto}}.)
I understand that for hdfs r/w operations we can update hdfs-site and 
core-site and then happily use whatever hadoop fs CLIs. But {{CryptoAdmin}} 
currently just calls into hdfs. Do you intend to add a hadoop crypto wrapper, 
or is there intention to change hdfs's CryptoAdmin? I'd suggest the former...
- About metadata persistence, does option #5 mean to add a new column to the 
S3G DynamoDB table, to store the edeks along with the other metadata of a file? 
I feel this is the safest way because we don't have to worry about consistency 
particular to edeks.
- How is the EZ information stored? In HDFS it's part of the root zone 
INodeDir's xattr, and scanned upon NN start. We'd need a similar reliable way 
to make sure all files within an EZ will be encrypted. Architecture graph only 
shows BEZI and OEMI are stored separately.
- What's the behavior when KMS is slow and we create a file? If the 
generateEDEK exhausted cache, where does it hang? When this happens does it 
impact other operation? (This is a pain for NN due to the single namespace 
lock, may not be an issue for s3a but I'm curious)
- I saw raw bytes mentioned. For hdfs, the way to access raw bytes is through a 
special path: {{/.reserved/raw/original_path}}. How is this done in s3?
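The cache-exhaustion question above can be made concrete with a sketch. Hadoop's KMS client keeps a pool of pre-generated EDEKs per key name and refills it in the background, so a create only pays KMS latency when the pool runs dry. The class and method names below are hypothetical illustrations, not Hadoop's actual ValueQueue implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EdekCacheSketch {
    private final BlockingQueue<String> pool = new ArrayBlockingQueue<>(10);
    private final ExecutorService refiller = Executors.newSingleThreadExecutor();

    // Stand-in for a (possibly slow) KMS generateEDEK round trip.
    private String generateEdekFromKms() throws InterruptedException {
        Thread.sleep(5);  // simulated network latency
        return "edek-" + System.nanoTime();
    }

    // Called on file create: the fast path is a cache hit; only an
    // empty pool makes the caller block on the KMS itself.
    public String getEdek() throws InterruptedException {
        String edek = pool.poll();
        if (edek == null) {
            edek = generateEdekFromKms();  // blocking slow path
        }
        refiller.submit(() -> {            // asynchronous refill
            try {
                pool.offer(generateEdekFromKms(), 1, TimeUnit.SECONDS);
            } catch (InterruptedException ignored) { }
        });
        return edek;
    }

    public static void main(String[] args) throws Exception {
        EdekCacheSketch cache = new EdekCacheSketch();
        String first = cache.getEdek();   // pool empty: pays KMS latency
        Thread.sleep(50);                 // let the background refill land
        String second = cache.getEdek();  // served from the pool
        System.out.println(first + " / " + second);
        cache.refiller.shutdown();
    }
}
```

The point for s3a: since there is no single namespace lock, a slow refill would stall only the creating client, not unrelated operations.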

> Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS
> ---
>
> Key: HADOOP-15006
> URL: https://issues.apache.org/jira/browse/HADOOP-15006
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs/s3, kms
>Reporter: Steve Moist
>Priority: Minor
> Attachments: S3-CSE Proposal.pdf
>
>
> This is for the proposal to introduce Client Side Encryption to S3 in such a 
> way that it can leverage HDFS transparent encryption, use the Hadoop KMS to 
> manage keys, use the `hdfs crypto` command line tools to manage encryption 
> zones in the cloud, and enable distcp to copy from HDFS to S3 (and 
> vice-versa) with data still encrypted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2017-12-12 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288396#comment-16288396
 ] 

Steve Moist commented on HADOOP-15006:
--

I don't think anyone's started on it.  I posted the design doc in the hope that 
others would look at it and critique it in the background while I focus on 
other things, so that once enough people had reviewed it, work could begin.  
The changes to the Hadoop CLI, KMS, and other components were what worried me 
about it.  It's bigger in scope than just S3a. 

In the proposal I made, we didn't have a mismatch between ciphertext length and 
plaintext length, because we used CTR with no padding rather than the CBC with 
PKCS5Padding that the AWS SDK uses.  I wrote a quick prototype using 
AES/CTR/NoPadding, ran all the integration tests against it without issue, and 
diffed the before/after of upload/download along with a TerraSort run; no 
issues there either.
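The length argument above is easy to demonstrate with the plain JDK: CTR is a stream mode, so the ciphertext is exactly as long as the plaintext, while CBC with PKCS5 padding always rounds up to the next 16-byte block boundary. A minimal demo (zero IV used only to keep the length comparison short; never reuse an IV like this in real code):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class CtrVsCbcLength {
    public static void main(String[] args) throws Exception {
        byte[] plaintext = new byte[1000];  // deliberately not block-aligned

        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[16];           // all-zero IV: demo only

        Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
        ctr.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ctrOut = ctr.doFinal(plaintext);

        Cipher cbc = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cbc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] cbcOut = cbc.doFinal(plaintext);

        System.out.println("plaintext:            " + plaintext.length); // 1000
        System.out.println("AES/CTR/NoPadding:    " + ctrOut.length);    // 1000
        System.out.println("AES/CBC/PKCS5Padding: " + cbcOut.length);    // 1008
    }
}
```

This is why the CTR-based proposal keeps the S3 object size equal to the file length reported in listings, whereas the AWS SDK's CBC mode does not.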




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2017-12-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287510#comment-16287510
 ] 

Steve Loughran commented on HADOOP-15006:
-

Not looked, other than to scan through it and conclude "it's complicated".

It is certainly not on my TODO list. If others started it? Well, I wouldn't 
block it, as long as it could be integrated in a way where its integration 
points were relatively non-intrusive. Both S3Guard and the committers (well, 
the retry logic really) have radically changed S3AFileSystem, and it's 
reaching that scale point where I'm starting to find it hard to visualise.

Short term, I'd prefer the S3Guard phase II work to get focus.

At the same time, I can see the interest in client-side encryption. It's just 
that mismatch between the actual file length and the length returned in 
listings which worries me. 




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2017-12-11 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286869#comment-16286869
 ] 

Aaron Fabbri commented on HADOOP-15006:
---

Thanks for the reminder; will try to take a look this week.




[jira] [Commented] (HADOOP-15006) Encrypt S3A data client-side with Hadoop libraries & Hadoop KMS

2017-12-11 Thread Steve Moist (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286574#comment-16286574
 ] 

Steve Moist commented on HADOOP-15006:
--

Hey [~fabbri] and [~steve_l], any traction on this?
