[ https://issues.apache.org/jira/browse/HADOOP-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Charles Lamb updated HADOOP-10603:
----------------------------------
    Attachment: HADOOP-10603.1.patch

[~yi.a.liu] Here are some comments from [~tucu00] and me.

*Encryptor and Decryptor interfaces and implementation*

How about a minimal API, something like this (we've attached a patch with these classes):

{code}
public abstract class Crypto {

  public interface Encryptor {
    public void encrypt(long pos, ByteBuffer in, ByteBuffer out);
  }

  public interface Decryptor {
    public void decrypt(long pos, ByteBuffer in, ByteBuffer out);
  }

  public static Crypto getInstance(Configuration conf) { ... }

  public abstract Encryptor getEncryptor(byte[] key, byte[] iv);

  public abstract Decryptor getDecryptor(byte[] key, byte[] iv);
}
{code}

That way, all cipher initialization/reinitialization is encapsulated within the implementation. The Encryptor and Decryptor implementations encrypt/decrypt using the byte buffer parameters and the absolute position (or, in the case of streams, the stream position). Specifying it like this lets the IV counter portion (in the case of AES/CTR) be easily calculated inside the encryptor/decryptor. When dealing with the offset within an AES block (16 bytes) during encryption/decryption, it can be handled by simply doing a cipher.update() on a null sink before passing the real data to encrypt/decrypt.

The current Encryptor and Decryptor work with byte[]. Working with (direct) ByteBuffers would avoid some array copies when calling through JNI into a native crypto library (i.e. PKCS#11 or OpenSSL).

The AESCTREncryptor/AESCTRDecryptor do update() instead of doFinal() on the cipher. This could lead to incomplete decryption of the buffer in some cipher implementations, since the contract is that a doFinal() must be done.

The ivCalc() could be bubbled up to the corresponding Crypto factory implementation so there is a single implementation.
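To make the position-driven contract concrete, here is a minimal, hypothetical Decryptor sketch for AES/CTR on top of the JDK Cipher. The class and method names (CtrDecryptorSketch, ivAtBlock) are ours for illustration, not from the attached patch: the counter is derived from the absolute position, and the intra-block offset is skipped by doing an update() on a throwaway null-sink buffer, as described above.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical position-aware AES/CTR decryptor, sketching the proposed API. */
class CtrDecryptorSketch {
  private static final int BLOCK_SIZE = 16;
  private final byte[] key;
  private final byte[] initIv;

  CtrDecryptorSketch(byte[] key, byte[] iv) {
    this.key = key.clone();
    this.initIv = iv.clone();
  }

  /** Decrypts in.remaining() bytes into out; pos is the absolute stream position. */
  void decrypt(long pos, ByteBuffer in, ByteBuffer out) throws GeneralSecurityException {
    // For a sketch we re-init per call; a real impl would cache the Cipher.
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    byte[] iv = ivAtBlock(pos / BLOCK_SIZE);
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
        new IvParameterSpec(iv));
    // Null sink: discard keystream bytes for the offset within the block.
    int padding = (int) (pos % BLOCK_SIZE);
    if (padding > 0) {
      cipher.update(new byte[padding]);
    }
    cipher.update(in, out);
  }

  // Treat the last 8 bytes of the IV as a big-endian counter and add the
  // block index. Note: this sketch assumes the counter fits in the low
  // 8 bytes and ignores carry into the upper half of the IV.
  private byte[] ivAtBlock(long blockIndex) {
    byte[] iv = initIv.clone();
    long ctr = ByteBuffer.wrap(iv, 8, 8).getLong() + blockIndex;
    ByteBuffer.wrap(iv, 8, 8).putLong(ctr);
    return iv;
  }
}
```

Because the position alone determines the counter and intra-block offset, random-access reads (seek/pread) need no state beyond the key and initial IV.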
The '& 0xFF' is not needed, and the logic could be made a bit more readable like this:

{code}
private static final int CTR_OFFSET = 8;
...
iv[CTR_OFFSET + 0] = (byte) (l >>> 56);
iv[CTR_OFFSET + 1] = (byte) (l >>> 48);
iv[CTR_OFFSET + 2] = (byte) (l >>> 40);
iv[CTR_OFFSET + 3] = (byte) (l >>> 32);
iv[CTR_OFFSET + 4] = (byte) (l >>> 24);
iv[CTR_OFFSET + 5] = (byte) (l >>> 16);
iv[CTR_OFFSET + 6] = (byte) (l >>> 8);
iv[CTR_OFFSET + 7] = (byte) (l);
{code}

The current implementation assumes the counter portion always starts at zero, right? This may present interoperability issues if somebody is using a different cipher stream that takes an arbitrary IV whose counter portion is not set to zero.

The attached crypto.patch file shows an implementation with the proposed changes; it makes things a bit simpler. The stream implementations then integrate via a single method (encrypt or decrypt).

*Stream implementations*

The crypto streams are implemented as a class (EncryptorStream), a subclass (FSDecryptorStream), and a decorator (CryptographicFSDataInputStream), with an analogous hierarchy on the output stream path. Why not just have a single decorator class for (each of) the input/output streams with all the encryption/decryption logic?

The DecryptorStream seems to decrypt based on the ready buffer size instead of on the DecryptorStream buffer size. It should use the stream buffer size.

The FSDecryptorStream seek() and skip() methods don't seem to propagate the new absolute position to the decryptor (to calculate the IV correctly). Are we missing something obvious?
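One possible way to avoid the zero-counter assumption is to *add* the block counter into the initial IV (big-endian, with carry across all 16 bytes) instead of overwriting the low 8 bytes. A hypothetical sketch (names are ours, not the patch's):

```java
/** Hypothetical IV calculation that adds the block counter to the initial
 *  IV rather than overwriting its counter portion, so an IV with an
 *  arbitrary non-zero initial counter still interoperates. */
class IvCalcSketch {
  static void calculateIV(byte[] initIV, long counter, byte[] iv) {
    int i = iv.length;   // 16 for AES
    int j = 0;           // bytes of the 64-bit counter consumed so far
    int sum = 0;
    while (i-- > 0) {
      // (sum >>> 8) is the carry from the previous (lower-order) byte
      sum = (initIV[i] & 0xff) + (sum >>> 8);
      if (j++ < 8) {
        sum += (int) counter & 0xff;
        counter >>>= 8;
      }
      iv[i] = (byte) sum;
    }
  }
}
```

With counter = 0 this reproduces the initial IV exactly, and a carry out of the low 8 bytes propagates into the upper half of the IV instead of being lost.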
I think the write() & read() logic during en/decryption could be a bit simpler, i.e.:

For encryption:

{code}
void write(byte[] b, int off, int len) {
  IF preEncryptionBuffer has enough space for b[off..len] THEN
    copy b[off..len] to preEncryptionBuffer
  ELSE
    WHILE still data in b[]
      copy as much as possible from b to preEncryptionBuffer
      IF preEncryptionBuffer is full THEN
        encrypt()
      FI
  FI
}
{code}

For decryption:

{code}
int read(byte[] b, int off, int len) {
  IF decryptedBuffer has data THEN
    drain decryptedBuffer to b[] up to MIN(decryptedBuffer.avail, len)
  ELSE
    read encrypted stream into encryptedBuffer
    decrypt()
    drain decryptedBuffer to b[] up to MIN(decryptedBuffer.avail, len)
  FI
  return bytes read
}
{code}

With this logic for write/read, we ensure that the configured buffer sizes are used for encryption/decryption. This captures the speed boost that bigger buffers give us, since there are fewer cipher init() calls and things like memory prefetch can be maximized.

> Crypto input and output streams implementing Hadoop stream interfaces
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-10603
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10603
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: security
>            Reporter: Alejandro Abdelnur
>            Assignee: Yi Liu
>             Fix For: 3.0.0
>
>         Attachments: HADOOP-10603.1.patch, HADOOP-10603.2.patch, HADOOP-10603.patch
>
>
> A common set of Crypto Input/Output streams. They would be used by CryptoFileSystem, HDFS encryption, MapReduce intermediate data and spills. Note we cannot use the JDK Cipher Input/Output streams directly because we need to support the additional interfaces that the Hadoop FileSystem streams implement (Seekable, PositionedReadable, ByteBufferReadable, HasFileDescriptor, CanSetDropBehind, CanSetReadahead, HasEnhancedByteBufferAccess, Syncable, CanSetDropBehind).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
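As an addendum, the drain/refill read() pseudocode in the comment above could be sketched in Java roughly like this. All names (DecryptStreamSketch, the abstract decrypt hook) are hypothetical illustrations, not code from the patch; the point is that decryption always happens on full stream-buffer-sized chunks, and read() merely drains the decrypted buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical sketch of the proposed read() buffering: decrypt whole
 *  buffer-sized chunks, then drain the decrypted buffer to the caller. */
abstract class DecryptStreamSketch extends InputStream {
  private final InputStream in;
  private final ByteBuffer encrypted;
  private final ByteBuffer decrypted;

  DecryptStreamSketch(InputStream in, int bufferSize) {
    this.in = in;
    this.encrypted = ByteBuffer.allocate(bufferSize);
    this.decrypted = ByteBuffer.allocate(bufferSize);
    this.decrypted.limit(0);  // decrypted buffer starts empty
  }

  /** Decrypt all of encrypted into decrypted; supplied by the Decryptor. */
  protected abstract void decrypt(ByteBuffer encrypted, ByteBuffer decrypted)
      throws IOException;

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    if (!decrypted.hasRemaining()) {
      // Refill: read up to a full buffer of ciphertext, then decrypt it all.
      encrypted.clear();
      int n = in.read(encrypted.array(), 0, encrypted.capacity());
      if (n == -1) {
        return -1;
      }
      encrypted.limit(n);
      decrypted.clear();
      decrypt(encrypted, decrypted);
      decrypted.flip();
    }
    // Drain up to MIN(decrypted.remaining(), len) bytes into b[].
    int toCopy = Math.min(decrypted.remaining(), len);
    decrypted.get(b, off, toCopy);
    return toCopy;
  }

  @Override
  public int read() throws IOException {
    byte[] one = new byte[1];
    int n = read(one, 0, 1);
    return n == -1 ? -1 : one[0] & 0xff;
  }
}
```

Since the cipher is only invoked once per buffer refill rather than once per read() call, larger buffers directly translate into fewer cipher init()/update() round trips.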