[ https://issues.apache.org/jira/browse/HADOOP-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946828#comment-13946828 ]
Yi Liu commented on HADOOP-10150: --------------------------------- Thanks [~tucu00] for your comment. We are less concerned about internal use by the HDFS client; on the contrary, we care more about making encrypted data easy for clients to work with. That said, we found that webhdfs should use DistributedFileSystem as well, to remove the symlink issue stated in HDFS-4933 (the issue we found is “Throwing UnresolvedPathException when getting HDFS symlink file through HDFS REST API”; there are also no “statistics” for HDFS REST, which is inconsistent with the behavior of DistributedFileSystem, and we assume that JIRA will resolve it).

“Transparent” or “at rest” encryption usually means that the server handles encrypting data for persistence, but does not manage keys for particular clients or applications, nor require applications even to be aware that encryption is in use; hence it can be described as transparent. This type of solution distributes secret keys within the secure enclave (not to clients), or may employ a two-tier key architecture (data keys wrapped by the cluster secret key), with keys typically managed per application, e.g. per table in a database system. The goal here is to avoid data leakage from the server by universally encrypting data “at rest”. Other cryptographic application architectures handle use cases where clients or applications want to protect data with encryption from other clients or applications. For those use cases, encryption and decryption are done on the client, and the scope of key sharing should be minimized to where the cryptographic operations take place. In this type of solution the server becomes an unnecessary central point of compromise for user or application keys, so key sharing there should be avoided.
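The two-tier key architecture mentioned above (per-application data keys wrapped by a cluster secret key) can be sketched with the standard JCE key-wrap cipher. This is an illustrative sketch only, not code from either JIRA; the class and method names are invented for the example, and real key management would live behind a key provider.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch of two-tier ("envelope") key handling: a per-application
// data key is wrapped by a cluster master key, and only the wrapped form is
// persisted alongside the data.
public class KeyWrapDemo {

    // Wrap a data key with the cluster master key.
    // "AESWrap" is the RFC 3394 AES key-wrap algorithm in the JCE.
    public static byte[] wrapDataKey(SecretKey masterKey, SecretKey dataKey) throws Exception {
        Cipher c = Cipher.getInstance("AESWrap");
        c.init(Cipher.WRAP_MODE, masterKey);
        return c.wrap(dataKey);
    }

    // Recover the data key inside the secure enclave; clients never see the master key.
    public static SecretKey unwrapDataKey(SecretKey masterKey, byte[] wrapped) throws Exception {
        Cipher c = Cipher.getInstance("AESWrap");
        c.init(Cipher.UNWRAP_MODE, masterKey);
        return (SecretKey) c.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey master = kg.generateKey();
        SecretKey data = kg.generateKey();
        byte[] wrapped = wrapDataKey(master, data);
        SecretKey recovered = unwrapDataKey(master, wrapped);
        System.out.println(java.util.Arrays.equals(data.getEncoded(), recovered.getEncoded()));
    }
}
```

The point of the sketch is that compromise of stored data plus wrapped keys is useless without the master key, which stays server-side; this is what distinguishes the "at rest" model from the client-side model discussed next.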
This client-side type of solution isn’t really an “at rest” solution: the client may or may not choose to encrypt, and because key sharing is minimized, the server cannot (and should not be able to) distinguish encrypted data from random bytes, so it cannot guarantee that all persisted data is encrypted. We therefore have two different types of solutions, useful for different reasons, with different threat models. Combinations of the two must be done carefully (or avoided) so as not to end up with something combining the worst of both threat models. Viewed in this light, HDFS-6134 and HADOOP-10150 are orthogonal and complementary. HDFS-6134, at least as described by its JIRA title, wants to introduce transparent encryption within HDFS; in my opinion it should not attempt “client side encryption on the server”, for the reasons mentioned above. HADOOP-10150 wants to make management of partially encrypted data easy for clients, for the client-side encryption use cases, by presenting a filtered view over base Hadoop filesystems such as HDFS.

{quote}in the "Storage of IV and data key" is stated "So we implement extended information based on INode feature, and use it to store data key and IV."{quote} We assume HDFS-2006 (extended attributes) could help; that is why we split the work into separate patches. In the CFS patch this is decoupled from the underlying filesystem if xattrs are present, and it can be the end user’s choice whether to store a key alias or the data encryption key itself.

{quote}(Mentioned before), how will flush() operations be handled as the encryption block will be cut short? How is this handled on writes? How is this handled on reads?{quote} For hflush/hsync it is actually quite simple. In the cryptographic output stream of CFS we buffer the plaintext and only encrypt once the buffered data reaches the buffer length, to improve performance. So for hflush/hsync we just flush the buffer, encrypting it immediately, and then call FSDataOutputStream.hflush/hsync, which handles the rest.
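The hflush/hsync behaviour described above can be sketched as follows. This is a hand-written illustration under stated assumptions (the class name, buffer handling, and constructor are invented; it is not the actual CFS CryptoOutputStream): plaintext is buffered, encrypted with AES-CTR when the buffer fills, and flush() forces immediate encryption of whatever is buffered before flushing the wrapped stream.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch: buffer plaintext, encrypt lazily for throughput,
// but encrypt-and-write immediately when the caller flushes (as hflush/hsync would).
public class BufferedCryptoOutputStream extends FilterOutputStream {
    private final byte[] buffer;
    private int count;
    private final Cipher cipher;

    public BufferedCryptoOutputStream(OutputStream out, byte[] key, byte[] iv, int bufSize)
            throws GeneralSecurityException {
        super(out);
        this.buffer = new byte[bufSize];
        this.cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    }

    @Override
    public void write(int b) throws IOException {
        if (count == buffer.length) encryptBuffer(); // buffer full: encrypt and push down
        buffer[count++] = (byte) b;
    }

    // A real hflush/hsync would call this, then delegate to the wrapped stream's hflush/hsync.
    @Override
    public void flush() throws IOException {
        encryptBuffer();
        out.flush();
    }

    private void encryptBuffer() throws IOException {
        if (count == 0) return;
        // CTR is a stream mode, so a partial buffer encrypts without padding
        // and the "cut short" encryption block is not a problem.
        byte[] ct = cipher.update(buffer, 0, count);
        if (ct != null) out.write(ct);
        count = 0;
    }
}
```

Because CTR keeps its counter state in the Cipher across update() calls, repeated flushes at arbitrary offsets still produce a ciphertext stream that decrypts as one contiguous CTR stream.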
{quote}Still, it is not clear how transparency will be achieved for existing applications: HDFS URI changes, clients must connect to the Key store to retrieve the encryption key (clients will need key store principals). The encryption key must be propagated to jobs tasks (i.e. Mapper/Reducer processes){quote} The URI does not change; please see the latest design doc and test case. We have considered HADOOP-9534 and HADOOP-10141; encryption of key material can be handled by the key provider implementations according to the customer’s environment.

{quote}Use of AES-CTR (instead of an authenticated encryption mode such as AES-GCM){quote} AES-GCM introduces additional CPU cycles for GHASH: roughly 2.5x additional cycles on Sandy Bridge and Ivy Bridge, and 0.6x on Haswell. In this scenario data integrity is already ensured by the underlying filesystem, such as HDFS, so we decided to use AES-CTR for best performance. Furthermore, AES-GCM is not available as a JCE cipher in Java 6, which may be EOL but is still run by plenty of Hadoop users. It is not even listed in the Java 7 Sun provider documentation (http://docs.oracle.com/javase/7/docs/technotes/guides/security/SunProviders.html), though that may be an omission.

{quote}By looking at the latest design doc of HADOOP-10150 I can see that things have been modified a bit (from the original design doc) bringing it a bit closer to some of the HDFS-6134 requirements.{quote} Actually we designed it this way well before we updated the doc; just look at the patch.

{quote}Definitely, I want to work together with you guys to leverage as much as possible. Either by unifying the 2 proposals or by sharing common code if we think both approaches have merits and we decide to move forward with both.{quote} I agree.

{quote}Restrictions of move operations for files within an encrypted directory.
The original design had something about it (not entirely correct), now is gone{quote} Rename is an atomic operation in Hadoop, so we only allow a move between one directory/file and another if they share the same data key; then no decryption is required. Please see my MAR/21 patch. We did not actually mention rename in the earlier doc; it came up in the review comments, since @Steve had the same questions, and we covered it in the discussion with him there.

{quote}Explicit auditing on encrypted files access does not seem handled{quote} Auditing could be another topic we need to address, especially when discussing client-side encryption. One possible way is to add a pluggable point so that customers can route audit events to their existing auditing system. Per the conclusion of the points discussed above, we think we can address this point later.

> Hadoop cryptographic file system
> --------------------------------
>
>                 Key: HADOOP-10150
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10150
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: security
>    Affects Versions: 3.0.0
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>              Labels: rhino
>             Fix For: 3.0.0
>
>         Attachments: CryptographicFileSystem.patch, HADOOP cryptographic file system-V2.docx, HADOOP cryptographic file system.pdf, cfs.patch, extended information based on INode feature.patch
>
>
> There is an increasing need for securing data when Hadoop customers use various upper layer applications, such as Map-Reduce, Hive, Pig, HBase and so on.
> HADOOP CFS (HADOOP Cryptographic File System) is used to secure data, based on HADOOP “FilterFileSystem” decorating DFS or other file systems, and transparent to upper layer applications. It’s configurable, scalable and fast.
> High level requirements:
> 1. Transparent to and no modification required for upper layer applications.
> 2. “Seek”, “PositionedReadable” are supported for input stream of CFS if the wrapped file system supports them.
> 3. Very high performance for encryption and decryption, they will not become bottleneck.
> 4. Can decorate HDFS and all other file systems in Hadoop, and will not modify existing structure of file system, such as namenode and datanode structure if the wrapped file system is HDFS.
> 5. Admin can configure encryption policies, such as which directory will be encrypted.
> 6. A robust key management framework.
> 7. Support Pread and append operations if the wrapped file system supports them.

-- This message was sent by Atlassian JIRA (v6.2#6252)