[ 
https://issues.apache.org/jira/browse/HADOOP-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Lamb updated HADOOP-10603:
----------------------------------

    Attachment: HADOOP-10603.1.patch

[~yi.a.liu] Here are some comments from [~tucu00] and me.

*Encryptor and Decryptor interfaces and implementation*

How about a minimal API, something like (we've attached a patch with these 
classes):

{code}
public abstract class Crypto {

  public interface Encryptor {
    public void encrypt(long pos, ByteBuffer in, ByteBuffer out)
  }

  public interface Decryptor {
    public void decrypt(long pos, ByteBuffer in, ByteBuffer out);
  }

  public static Crypto getInstance(Configuration conf) {
    ...
  }

  public abstract Encryptor getEncryptor(byte[] key, byte[] iv)
  public abstract Decryptor getDecryptor(byte[] key, byte[] iv)
}
{code}

That way all the cipher initialization/reinitialization is
encapsulated within the implementation.

The Encryptor and Decryptor implementations encrypt/decrypt using the
byte buffers parameters and the absolute position (or in the case of
streams, the stream position). Specifying it like this lets the IV
counter portion (in the case of AES/CTR) be easily calculated inside
of the encryptor/decryptor.

When you're dealing with padding during encrypting/decrypting, the
offset within an AES block (16 bytes) could be done by simply doing a
cipher.update() on a NULL-SINK before passing the real data to
encrypt/decrypt.

The Encryptor and Decryptor work with byte[]. Working with ByteBuffers
(direct) would allow avoiding some array copies if doing JNI to some
native crypto library (i.e. PCKS#11 or OpenSSL).

The AESCTREncryptor/AESCTRDecryptor do update() instead of doFinal()
on the cipher. This could lead to incomplete decryption of the buffer
in some cipher implementations since the contract is that a doFinal()
must be done.

The ivCalc() could be bubbled up to the corresponding Crypto factory
implementation and have a single implementation. The '& 0xFF' is not
needed and the logic could be made a bit more readable like this:

{code}
    private static final int CTR_OFFSET = 8;
    ...
    iv[CTR_OFFSET + 0] = (byte) (l >>> 56);
    iv[CTR_OFFSET + 1] = (byte) (l >>> 48);
    iv[CTR_OFFSET + 2] = (byte) (l >>> 40);
    iv[CTR_OFFSET + 3] = (byte) (l >>> 32);
    iv[CTR_OFFSET + 4] = (byte) (l >>> 24);
    iv[CTR_OFFSET + 5] = (byte) (l >>> 16);
    iv[CTR_OFFSET + 6] = (byte) (l >>> 8);
    iv[CTR_OFFSET + 7] = (byte) (l);
{code}

The current implementation assumes the counter portion always starts
with zero, right? This may present some interoperability issues if
somebody is using a different cipher stream that takes an arbitrary IV
with the counter portion not set to zero.

The attached crypto.patch file shows an impl with the proposed changes
and makes things get a bit simpler. Then, for the stream
implementations it is a simple integration via a single method
(encrypt or decrypt).

*Stream Implementations:*

The crypto streams are implemented as a class (EncyrptorStream,
subclass (FSDecryptorStream) and a decorator
(CryptographicFSDataInputStream), with an analogous hierarchy in the
output stream path. Why not just have single decorator class for
(each) input/output streams with all the encryption/decryption logic?

The DecryptorStream seems to decrypt based on the ready buffer size
instead of on the DecryptorStream buffer size. It should do it on
stream buffer size.

The FSDecryptorStream seek() and skip() methods don't seem to
propagate the new absolute position to the decryptor (to calculate the
IV correctly). Are we missing something obvious?

I think the write() & read() logic during de/encryption could be a bit
simpler, i.e.:

For encryption:
{code}
  void write(byte[] b, int off, int len) {
    IF preEncryptionBuffer has enough space for b[off..len] THEN
      copy b[off..len] to preEncryptionBuffer
    ELSE
      WHILE still data in b[]
        copy as much as possible from b to preEncryptionBuffer
        IF preEncryptionBuffer is full THEN
          encrypt()
        FI
    FI
  }  
{code}

For decryption:

{code}
  int read(byte[] b, int off, int len)  {
    IF decryptedBuffer has data THEN
      drain decryptionBuffer to b[] up to MIN(decryptedBuffer.avail, len)
    ELSE
      read encrypted stream into encryptedBuffer.
      decrypt()
      drain decryptionBuffer to b[] up to MIN(decryptedBuffer.avail, len)
    fi
    return bytes read
  }
{code}

With this logic for write/read we are ensuring that we use the
specified buffer sizes for encryption/decryption. This gets the boost
of speed that bigger buffers give us since there are fewer cipher
init() calls and things like memory prefetch can be maximized.


> Crypto input and output streams implementing Hadoop stream interfaces
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-10603
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10603
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: security
>            Reporter: Alejandro Abdelnur
>            Assignee: Yi Liu
>             Fix For: 3.0.0
>
>         Attachments: HADOOP-10603.1.patch, HADOOP-10603.2.patch, 
> HADOOP-10603.patch
>
>
> A common set of Crypto Input/Output streams. They would be used by 
> CryptoFileSystem, HDFS encryption, MapReduce intermediate data and spills. 
> Note we cannot use the JDK Cipher Input/Output streams directly because we 
> need to support the additional interfaces that the Hadoop FileSystem streams 
> implement (Seekable, PositionedReadable, ByteBufferReadable, 
> HasFileDescriptor, CanSetDropBehind, CanSetReadahead, 
> HasEnhancedByteBufferAccess, Syncable, CanSetDropBehind).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to