[ https://issues.apache.org/jira/browse/MAPREDUCE-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041955#comment-14041955 ]
Chris Douglas commented on MAPREDUCE-5890:
------------------------------------------

Thanks for updating the patch, Arun. Adding seeks for serving map output would be regrettable.

A few nits:
* unused, private static field {{counter}} added to {{Fetcher}}
* unit test should use JUnit4 annotations rather than extending {{TestCase}}
* {noformat}
+ InputStream is = input;
+ is = CryptoUtils.wrap(jobConf, iv, is, offset, compressedLength);
{noformat}
is equivalent to {{InputStream is = CryptoUtils.wrap(jobConf, iv, input, offset, compressedLength);}}
* While not terribly expensive, there are a lot of redundant lookups for the encrypted shuffle config parameter.
* There are many counterexamples, but running an MR job is a heavyweight way to test this.
* To be sure I understand the IV logic: it's injected in the stream as a prefix to the segment during a merge, but is part of the index record during a spill. Is that accurate? Adding a few comments calling this out would be appreciated, particularly since it's hard to spot in the merge.
* Has this been tested on spills with intermediate merges? With more than a single reduce? Looking at the patch, it appears to create the stream with the IV but not reset the IV for each segment (apologies, I haven't tried applying it, so I might just be misreading the context).
* Since the IV size is hard-coded in {{CryptoUtils}} to 16 bytes (and part of the {{IndexRecord}} format), it should probably fail if {{CryptoCodec::getAlgorithmBlockSize}} returns anything else.

Much of the logic in here is internal to MapReduce, so it would be unfair to ask that this create better abstractions than what exists, but the IV handling is pretty ad hoc. Other improvements under consideration (particularly native implementations and other frameworks building on the {{ShuffleHandler}}) may rely on this code, as well as older versions of MapReduce that will fail without deploying two versions of the {{ShuffleHandler}}.
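
As a rough illustration of the IV points in the bullets above (the 16-byte size check, and a fresh IV written as a prefix to each segment), here is a minimal sketch. It uses plain JDK {{javax.crypto}} as a stand-in for Hadoop's {{CryptoCodec}}, and the class {{IvPrefixedSegments}} and all of its methods are hypothetical names that do not appear in the patch.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, NOT part of Hadoop or the patch: a JDK-only sketch
// of segments that each carry their own IV as a 16-byte prefix.
final class IvPrefixedSegments {
    // Mirrors the hard-coded 16-byte IV size mentioned in the review.
    static final int IV_LENGTH = 16;
    // Assumption: AES in CTR mode, chosen only for illustration.
    private static final String TRANSFORMATION = "AES/CTR/NoPadding";
    private static final SecureRandom RANDOM = new SecureRandom();

    private IvPrefixedSegments() {}

    /** Fails fast when the cipher block size differs from the stored IV length. */
    static void checkIvLength(int algorithmBlockSize) {
        if (algorithmBlockSize != IV_LENGTH) {
            throw new IllegalStateException("Expected a " + IV_LENGTH
                + "-byte IV, but the cipher block size is " + algorithmBlockSize);
        }
    }

    /** Encrypts one segment under a freshly generated IV; returns IV || ciphertext. */
    static byte[] writeSegment(byte[] key, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            checkIvLength(cipher.getBlockSize());
            byte[] iv = new byte[IV_LENGTH];
            RANDOM.nextBytes(iv); // new IV for every segment, never reused
            cipher.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] ciphertext = cipher.doFinal(plaintext);
            byte[] segment = new byte[IV_LENGTH + ciphertext.length];
            System.arraycopy(iv, 0, segment, 0, IV_LENGTH);
            System.arraycopy(ciphertext, 0, segment, IV_LENGTH, ciphertext.length);
            return segment;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the IV prefix of a segment and decrypts the remaining bytes. */
    static byte[] readSegment(byte[] key, byte[] segment) {
        if (segment.length < IV_LENGTH) {
            throw new IllegalArgumentException("Segment is shorter than the IV prefix");
        }
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            checkIvLength(cipher.getBlockSize());
            cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(key, "AES"),
                new IvParameterSpec(segment, 0, IV_LENGTH));
            return cipher.doFinal(segment, IV_LENGTH, segment.length - IV_LENGTH);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because each segment is self-describing in this sketch, a reader needs nothing beyond the key and the segment bytes, and two segments with identical contents still produce different ciphertext.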

To make it backwards compatible, the IV can be part of each {{IFile}} segment (requiring no changes to {{ShuffleHandler}} or the {{SpillRecord}}/{{IndexRecord}} format), or the IVs can be added to the end of the {{SpillRecord}}. In the latter case, the {{Fetcher}} will need to request the alternate interpretation by including a header; old versions will get the existing interpretation of the {{SpillRecord}}.

> Support for encrypting Intermediate data and spills in local filesystem
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5890
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5890
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: security
>    Affects Versions: 2.4.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Arun Suresh
>              Labels: encryption
>         Attachments: MAPREDUCE-5890.3.patch, MAPREDUCE-5890.4.patch,
> org.apache.hadoop.mapred.TestMRIntermediateDataEncryption-output.txt,
> syslog.tar.gz
>
>
> For some sensitive data, encryption while in flight (network) is not
> sufficient, it is required that while at rest it should be encrypted.
> HADOOP-10150 & HDFS-6134 bring encryption at rest for data in filesystem
> using Hadoop FileSystem API. MapReduce intermediate data and spills should
> also be encrypted while at rest.

--
This message was sent by Atlassian JIRA
(v6.2#6252)