[ 
https://issues.apache.org/jira/browse/HADOOP-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946828#comment-13946828
 ] 

Yi Liu commented on HADOOP-10150:
---------------------------------

Thanks [~tucu00] for your comment.
We are less concerned about the internal use of the HDFS client; on the contrary, 
we care more about making encrypted data easy for clients to work with. That said, 
we found that webhdfs should also use DistributedFileSystem to remove the symlink 
issue stated in HDFS-4933 (the issue we found is “Throwing UnresolvedPathException 
when getting HDFS symlink file through HDFS REST API”, and there are no “statistics” 
for the HDFS REST API, which is inconsistent with the behavior of 
DistributedFileSystem; we suppose that JIRA will resolve it).

“Transparent” or “at rest” encryption usually means that the server handles 
encrypting data for persistence, but does not manage keys for particular 
clients or applications, nor require applications to even be aware that 
encryption is in use; hence it can be described as transparent. This type 
of solution distributes secret keys within the secure enclave (not to clients), 
or might employ a two-tier key architecture (data keys wrapped by the cluster 
secret key), with keys typically managed per application, e.g. per table in a 
database system. The goal here is to avoid data leakage from the server by 
universally encrypting data “at rest”.

Other cryptographic application architectures handle use cases where clients or 
applications want to protect data with encryption from other clients or 
applications. For those use cases, encryption and decryption are done on the 
client, and the scope of key sharing should be minimized to where the 
cryptographic operations take place. In this type of solution the server 
becomes an unnecessary central point of compromise for user or application 
keys, so sharing there should be avoided. This isn’t really an “at rest” 
solution: the client may or may not choose to encrypt, and because key 
sharing is minimized, the server cannot (and should not be able to) distinguish 
encrypted data from random bytes, so it cannot guarantee all persisted data is 
encrypted.

Therefore we have two different types of solutions useful for different 
reasons, with different threat models. Combinations of the two must be 
carefully done (or avoided) so as not to end up with something combining the 
worst of both threat models.

HDFS-6134 and HADOOP-10150 are orthogonal and complementary solutions when 
viewed in this light. HDFS-6134, as described at least by the JIRA title, wants 
to introduce transparent encryption within HDFS. In my opinion, it shouldn’t 
attempt “client side encryption on the server” for reasons mentioned above. 
HADOOP-10150 wants to make management of partially encrypted data easy for 
clients, for the client side encryption use cases, by presenting a filtered 
view over base Hadoop filesystems like HDFS.
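
As a rough illustration of that filtered view (the class name and wrapping details 
here are hypothetical, not the actual CFS patch), a decorating filesystem built on 
FilterFileSystem might look like:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of the "filtered view" idea: a FilterFileSystem that
// decorates the underlying filesystem and transparently wraps its streams.
// The real CFS patch differs in details (key lookup, IV handling, create()).
public class SketchCryptoFileSystem extends FilterFileSystem {

  public SketchCryptoFileSystem(FileSystem underlying) {
    super(underlying);
  }

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    FSDataInputStream in = fs.open(f, bufferSize);
    // Placeholder: the real CFS would return a decrypting stream wrapped
    // around 'in', using the file's data key and IV; create() is wrapped
    // similarly with an encrypting stream.
    return in;
  }
}
{code}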

{quote}In the "Storage of IV and data key" section it is stated: "So we implement 
extended information based on INode feature, and use it to store data key and 
IV."{quote}
We assume HDFS-2006 could help; that’s why we posted separate patches. In the 
CFS patch this is decoupled from the underlying filesystem as long as xattrs are 
present. And it could be the end user’s choice whether to store the key alias or 
the data encryption key.
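
For illustration only (the xattr names below are assumptions, and the real patch 
may encode the metadata differently), storing per-file crypto metadata through 
extended attributes could look roughly like:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: keep the per-file crypto metadata in extended
// attributes (the xattr API from the HDFS-2006 work), so CFS stays decoupled
// from the underlying filesystem. The attribute names are made up here.
public class SketchCryptoXAttrs {
  private static final String DATA_KEY_XATTR = "user.cfs.datakey";
  private static final String IV_XATTR = "user.cfs.iv";

  public static void store(FileSystem fs, Path file,
      byte[] wrappedDataKey, byte[] iv) throws IOException {
    fs.setXAttr(file, DATA_KEY_XATTR, wrappedDataKey);
    fs.setXAttr(file, IV_XATTR, iv);
  }

  public static byte[] loadIv(FileSystem fs, Path file) throws IOException {
    return fs.getXAttr(file, IV_XATTR);
  }
}
{code}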

{quote}(Mentioned before), how will flush() operations be handled as the 
encryption block will be cut short? How is this handled on writes? How is this 
handled on reads?{quote}
For hflush/hsync, it's actually very simple. In the cryptographic output stream of 
CFS, we buffer the plaintext and only encrypt once the data size reaches the 
buffer length, to improve performance. So for hflush/hsync, we just flush the 
buffer and encrypt immediately, and then call FSDataOutputStream.hflush/hsync, 
which handles the rest.
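
A minimal sketch of that buffering and flush behaviour (class and method names 
are illustrative, not the actual CFS code):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

// Hypothetical sketch of the buffering described above: plaintext is held in
// a buffer and encrypted in buffer-sized chunks; hflush() forces an early
// encrypt-and-write of whatever is buffered, then delegates to the
// underlying FSDataOutputStream.
class SketchCryptoOutputStream {
  private final FSDataOutputStream out;
  private final byte[] buffer;
  private int bufferedBytes;

  SketchCryptoOutputStream(FSDataOutputStream out, int bufferSize) {
    this.out = out;
    this.buffer = new byte[bufferSize];
  }

  void write(byte[] data, int off, int len) throws IOException {
    while (len > 0) {
      int n = Math.min(len, buffer.length - bufferedBytes);
      System.arraycopy(data, off, buffer, bufferedBytes, n);
      bufferedBytes += n;
      off += n;
      len -= n;
      if (bufferedBytes == buffer.length) {
        encryptAndWrite();           // full buffer: encrypt a whole chunk
      }
    }
  }

  void hflush() throws IOException {
    encryptAndWrite();               // encrypt the partial buffer immediately
    out.hflush();                    // then let HDFS handle the rest
  }

  private void encryptAndWrite() throws IOException {
    if (bufferedBytes == 0) {
      return;
    }
    byte[] ciphertext = encrypt(buffer, 0, bufferedBytes);
    out.write(ciphertext);
    bufferedBytes = 0;
  }

  private byte[] encrypt(byte[] plain, int off, int len) {
    byte[] copy = new byte[len];     // placeholder: real code runs AES-CTR here
    System.arraycopy(plain, off, copy, 0, len);
    return copy;
  }
}
{code}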

{quote}Still, it is not clear how transparency will be achieved for existing 
applications: HDFS URI changes, clients must connect to the Key store to 
retrieve the encryption key (clients will need key store principals). The 
encryption key must be propagated to jobs tasks (i.e. Mapper/Reducer 
processes){quote}
There is no URI change; please see the latest design doc and test case.
We have considered HADOOP-9534 and HADOOP-10141; encryption of key material 
could be handled by the key provider implementation according to the customer's 
environment.

{quote}Use of AES-CTR (instead of an authenticated encryption mode such as 
AES-GCM){quote}
AES-GCM introduces additional CPU cycles for GHASH: about 2.5x additional cycles 
on Sandy Bridge and Ivy Bridge, and 0.6x additional cycles on Haswell. Data 
integrity is already ensured by the underlying filesystem (e.g. HDFS) in this 
scenario, so we decided to use AES-CTR for best performance.
Furthermore, AES-GCM mode is not available as a JCE cipher in Java 6. Java 6 may 
be EOL, but plenty of Hadoopers are still running it. It's not even listed in the 
Java 7 Sun provider document 
(http://docs.oracle.com/javase/7/docs/technotes/guides/security/SunProviders.html),
 though that may be an omission.
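
For reference, a minimal JCE sketch of the AES-CTR choice (the key/IV plumbing is 
assumed, not taken from the patch):

{code:java}
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: AES-CTR through the JCE, which (unlike AES/GCM) is
// available on Java 6. Integrity is left to the underlying filesystem
// (HDFS checksums), as discussed above. CTR encryption and decryption are
// the same keystream XOR, so one method serves both directions.
public class SketchAesCtr {
  public static byte[] crypt(int mode, byte[] key, byte[] iv, byte[] data)
      throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(data);
  }
}

// Usage:
//   byte[] ct = SketchAesCtr.crypt(Cipher.ENCRYPT_MODE, key, iv, plaintext);
//   byte[] pt = SketchAesCtr.crypt(Cipher.DECRYPT_MODE, key, iv, ct);
{code}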

{quote}By looking at the latest design doc of HADOOP-10150 I can see that 
things have been modified a bit (from the original design doc) bringing it a 
bit closer to some of the HDFS-6134 requirements.{quote}
Actually we designed it this way much earlier, before the doc update; just look 
at the patch.

{quote}Definitely, I want to work together with you guys to leverage as much as 
possible. Either by unifying the 2 proposals or by sharing common code if we 
think both approaches have merits and we decide to move forward with 
both.{quote}
I agree.

{quote}Restrictions of move operations for files within an encrypted directory. 
The original design had something about it (not entirely correct), now it is 
gone{quote}
Rename is an atomic operation in Hadoop, so we only allow a move between one 
directory/file and another directory/file if they share the same data key; then 
no decryption is required. Please see my MAR/21 patch.
Actually we did not mention rename in the earlier doc; we only discussed it in 
review comments, since @Steve had the same questions, and we covered this in the 
comments of the discussion with him.
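
A hypothetical sketch of that rename restriction (the metadata lookup helper is 
assumed, not the patch's API):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of the rename restriction: only allow a rename when
// source and destination fall under the same data key, so no decrypt and
// re-encrypt is needed and the rename stays a single atomic operation.
// getDataKeyAlias() is an assumed helper, not the patch's API.
public class SketchRenameGuard {
  public void checkRenameAllowed(Path src, Path dst) throws IOException {
    String srcKey = getDataKeyAlias(src);
    String dstKey = getDataKeyAlias(dst);
    boolean sameKey =
        (srcKey == null) ? (dstKey == null) : srcKey.equals(dstKey);
    if (!sameKey) {
      throw new IOException("Rename across different data keys not allowed: "
          + src + " -> " + dst);
    }
  }

  private String getDataKeyAlias(Path p) {
    return null; // placeholder for the CFS metadata lookup
  }
}
{code}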

{quote}Explicit auditing on encrypted files access does not seem handled{quote}
Auditing could be another topic we need to address, especially when discussing 
client-side encryption. One possible way is to add a pluggable point so that 
customers can route audit events to their existing auditing system. Following the 
conclusion of the discussion on the points above, we think we can address this 
later.
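
One possible shape for such a pluggable point (purely illustrative; nothing like 
this is in the patch yet):

{code:java}
// Hypothetical sketch of a pluggable audit hook: CFS would call this on each
// access to an encrypted file, and deployments could route the event to
// their existing auditing system.
public interface CryptoAuditListener {
  void onEncryptedAccess(String user, String path, String operation,
      long timestampMillis);
}
{code}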


> Hadoop cryptographic file system
> --------------------------------
>
>                 Key: HADOOP-10150
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10150
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: security
>    Affects Versions: 3.0.0
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>              Labels: rhino
>             Fix For: 3.0.0
>
>         Attachments: CryptographicFileSystem.patch, HADOOP cryptographic file 
> system-V2.docx, HADOOP cryptographic file system.pdf, cfs.patch, extended 
> information based on INode feature.patch
>
>
> There is an increasing need for securing data when Hadoop customers use 
> various upper layer applications, such as Map-Reduce, Hive, Pig, HBase and so 
> on.
> HADOOP CFS (HADOOP Cryptographic File System) is used to secure data, based 
> on HADOOP “FilterFileSystem” decorating DFS or other file systems, and 
> transparent to upper layer applications. It’s configurable, scalable and fast.
> High level requirements:
> 1.    Transparent to and no modification required for upper layer 
> applications.
> 2.    “Seek”, “PositionedReadable” are supported for input stream of CFS if 
> the wrapped file system supports them.
> 3.    Very high performance for encryption and decryption, they will not 
> become bottleneck.
> 4.    Can decorate HDFS and all other file systems in Hadoop, and will not 
> modify existing structure of file system, such as namenode and datanode 
> structure if the wrapped file system is HDFS.
> 5.    Admin can configure encryption policies, such as which directory will 
> be encrypted.
> 6.    A robust key management framework.
> 7.    Support Pread and append operations if the wrapped file system supports 
> them.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
