RE: Question related to Decompressor interface

java8964 java8964 Sun, 10 Feb 2013 07:13:59 -0800

Hi, Dave:
Thanks for your reply. I am not sure how the EncryptedWritable would work. Can 
you share more details about it?
For example, say I have a text file as my raw source file, and I need to store 
it in HDFS. If I encrypt the whole file, then there is no good InputFormat or 
RecordReader to process it, unless the whole file is decrypted first at 
runtime and then processed with TextInputFormat, right?
What you suggest is that when I encrypt the file, I store it as a SequenceFile: 
use anything I want as the key, encrypt each line (record) and store it as the 
value, and put each (key, value) pair into the sequence file. Is that right?
Then at runtime, each value read from the sequence file can be decrypted by 
the EncryptedWritable class and is ready for the next step. Is my understanding 
correct?
In this case, of course, I don't need to worry about splits any more, as each 
record is encrypted and decrypted separately.
I think it is a valid option, but the problem is that the data has to be 
encrypted by this EncryptedWritable class. What I was thinking of is allowing 
the data source to encrypt its data any way it wants, as long as the algorithm 
is supported by the Java security package, and then providing only the private 
key to the runtime to decrypt it.
Yong
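
The per-record approach discussed above can be sketched roughly as follows. 
This is a minimal sketch under stated assumptions, not Hadoop's API and not 
code from this thread: the class name EncryptedRecord is hypothetical, it uses 
only JDK classes, and it merely mirrors the write(DataOutput) and 
readFields(DataInput) shape of Hadoop's Writable without implementing that 
interface. The choice of AES-GCM with a fresh IV per record is also an 
assumption, since the thread does not name a cipher mode.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical per-record wrapper: each record is encrypted on write and
// decrypted on read, so records can be processed independently of each other.
public class EncryptedRecord {
    private static final int IV_LEN = 12;     // bytes, usual size for GCM
    private static final int TAG_BITS = 128;  // GCM authentication tag size
    private static final SecureRandom RANDOM = new SecureRandom();

    private final SecretKey key;
    private byte[] plaintext = new byte[0];

    public EncryptedRecord(SecretKey key) {
        this.key = key;
    }

    public void set(byte[] data) {
        plaintext = data.clone();
    }

    public byte[] get() {
        return plaintext.clone();
    }

    // Same shape as Writable.write(DataOutput): IV, ciphertext length, ciphertext.
    public void write(DataOutput out) throws IOException {
        try {
            byte[] iv = new byte[IV_LEN];
            RANDOM.nextBytes(iv); // never reuse an IV with the same key
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] cipherText = cipher.doFinal(plaintext);
            out.write(iv);
            out.writeInt(cipherText.length);
            out.write(cipherText);
        } catch (GeneralSecurityException e) {
            throw new IOException(e);
        }
    }

    // Same shape as Writable.readFields(DataInput): reads back what write() produced.
    public void readFields(DataInput in) throws IOException {
        try {
            byte[] iv = new byte[IV_LEN];
            in.readFully(iv);
            int len = in.readInt();
            if (len < 0) {
                throw new IOException("Corrupt record, negative length: " + len);
            }
            byte[] cipherText = new byte[len];
            in.readFully(cipherText);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            plaintext = cipher.doFinal(cipherText);
        } catch (GeneralSecurityException e) {
            throw new IOException(e);
        }
    }
}
```

In a real job the class would implement org.apache.hadoop.io.Writable and be 
used as the SequenceFile value type, and the key would have to reach the tasks 
through some configuration mechanism that this sketch leaves out.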


From: [email protected]
To: [email protected]
Subject: RE: Question related to Decompressor interface
Date: Sun, 10 Feb 2013 09:36:40 +0700

I can’t answer your question about the Decompressor interface, but I have a 
query for you. Why not just create an EncryptedWritable object? Encrypt/decrypt 
the bytes in the read/write methods; that should be darn near trivial. Then 
stick with good ol’ SequenceFile, which, as you note, is splittable. Otherwise 
you’d have to deal with making the output splittable, and given encrypted data, 
the only solution that I see is basically rolling your own SequenceFile with 
encrypted innards.

Come to think of it, a simple, standardized EncryptedWritable object out of the 
box with Hadoop would be great. Or perhaps better yet, an 
EncryptedWritableWrapper<T extends Writable> so we can convert any existing 
Writable into an encrypted form.

Dave

From: java8964 java8964 [mailto:[email protected]]
Sent: Sunday, February 10, 2013 3:50 AM
To: [email protected]
Subject: Question related to Decompressor interface

Hi,

Currently I am researching options for encrypting data in MapReduce, as we 
plan to use Amazon EMR or EC2 services for our data. I think the compression 
codec is a good place to integrate the encryption logic, and I found that some 
other people have had the same idea. I googled around and found this code:

https://github.com/geisbruch/HadoopCryptoCompressor/

It doesn't seem to be maintained any more, but it gave me a starting point. I 
downloaded the source code and tried some tests with it. It doesn't work out 
of the box; there were some bugs I had to fix to make it work. I believe it 
contains 'AES' as an example algorithm.

But right now I am facing a problem when I try to use it in my test MapReduce 
program. Here is the stack trace I got:

2013-02-08 23:16:47,038 INFO org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length = 512, and offset = 0, length = -132967308
java.lang.IndexOutOfBoundsException
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)
    at org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(CryptoBasicDecompressor.java:100)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:97)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)
    at java.io.InputStream.read(InputStream.java:82)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:458)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

I know the error is thrown from this custom CryptoBasicDecompressor class, but 
my real questions are about the interface it implements: Decompressor. There 
is limited documentation about this interface, for example, when and how the 
setInput() method will be invoked. If I want to write my own Decompressor, 
what do the methods in this interface mean?

In the above case, I enabled some debug output. You can see that the byte[] 
array passed to setInput() has a length of only 512, but the third parameter, 
length, is a negative number: -132967308. That caused the 
IndexOutOfBoundsException. If I check the same method in Hadoop's 
GzipDecompressor class, that code would also throw IndexOutOfBoundsException 
in this case, so this is a RuntimeException case. Why did it happen in my test?

Here is my test case: I have a simple log text file of about 700k. I encrypted 
it with the above code using 'AES'. I can encrypt and decrypt it and get my 
original content back. The file name is foo.log.crypto, and this file extension 
is registered to invoke the CryptoBasicDecompressor in my test Hadoop 
environment, which uses the CDH4.1.2 release (Hadoop 2.0). Everything works as 
I expected. The CryptoBasicDecompressor is invoked when the input file is 
foo.log.crypto, as you can see in the above stack trace. But I don't know why 
the third parameter (length) in setInput() is a negative number at runtime.

In addition, I have further questions about using Compressor/Decompressor to 
handle encrypting and decrypting files.
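
On the negative length reported above, one possible explanation (not confirmed 
anywhere in this thread) is a framing mismatch. BlockDecompressorStream, which 
appears in the stack trace, expects each chunk of input to be preceded by a 
4-byte big-endian length, as written by BlockCompressorStream. If the encrypted 
file was produced without that framing, four bytes of ciphertext would be read 
as a signed length, which can easily be negative. The sketch below only 
illustrates that arithmetic: the class and method names are hypothetical, and 
the four bytes are chosen because they decode to the logged value, not because 
they are known to be in the file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical demo: arbitrary bytes interpreted as a block-length header.
public class BlockLengthDemo {

    // Reads the first four bytes as a big-endian signed int,
    // the way a length-prefixed block format would.
    static int readBlockLength(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readInt();
        }
    }

    // A defensive range check of the kind a setInput(buf, off, len)
    // implementation could run before wrapping the buffer.
    static boolean isValidRange(int bufLength, int off, int len) {
        return off >= 0 && len >= 0 && off <= bufLength - len;
    }

    public static void main(String[] args) throws IOException {
        // 0xF8131474 has its high bit set, so it is negative as a signed int.
        byte[] notAHeader = {(byte) 0xF8, 0x13, 0x14, 0x74};
        int len = readBlockLength(notAHeader);
        System.out.println("length = " + len); // prints "length = -132967308"
        System.out.println("valid = " + isValidRange(512, 0, len)); // prints "valid = false"
    }
}
```

If this is what happened, the file would need to be written through the 
matching compressor stream, or read with a stream that does not expect 
length-prefixed blocks.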

Ideally, I wonder whether encryption/decryption can support file splits. This 
may depend on the algorithm we are using, is that right? If so, what kind of 
algorithm can do that? I am not sure if it is like the compressor case, where 
most codecs do not support file splits. If so, it may not be good for my 
requirements.

If we have a 1G file, encrypted in Amazon S3, then after it is copied to the 
HDFS of Amazon EMR, can each block of the data be decrypted independently by 
each mapper and then passed to the underlying RecordReader, so that the blocks 
are processed fully concurrently? Has anyone done this before? If so, which 
encryption algorithm supports it? Any ideas?

Thanks
Yong
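
On the question of which algorithms allow independent decryption of a split, 
one option is a block cipher in counter (CTR) mode, where the keystream for 
any byte offset can be computed directly from the key, the initial counter, 
and the offset. The following is a minimal sketch using only the JDK. The class 
and method names are hypothetical, nothing here comes from the thread or from 
HadoopCryptoCompressor, and plain CTR provides no integrity protection, so 
treat it as an illustration of random access rather than a complete design.

```java
import java.math.BigInteger;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo: decrypt an arbitrary byte range of AES-CTR ciphertext
// without reading or decrypting anything that comes before it.
public class CtrSplitDemo {
    private static final int BLOCK = 16; // AES block size in bytes

    // Returns the initial counter advanced by blockIndex (128-bit big-endian add).
    static byte[] counterAt(byte[] iv, long blockIndex) {
        byte[] raw = new BigInteger(1, iv).add(BigInteger.valueOf(blockIndex)).toByteArray();
        byte[] out = new byte[BLOCK];
        // Keep the low-order 16 bytes, right-aligned (drops sign byte or overflow).
        int n = Math.min(raw.length, BLOCK);
        System.arraycopy(raw, raw.length - n, out, BLOCK - n, n);
        return out;
    }

    // Encrypts the whole input starting from the initial counter.
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plain) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(plain);
    }

    // Decrypts cipherText[offset, offset + length) on its own.
    static byte[] decryptRange(byte[] key, byte[] iv, byte[] cipherText, int offset, int length)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                new IvParameterSpec(counterAt(iv, offset / BLOCK)));
        // Discard keystream bytes up to the position inside the first block.
        cipher.update(new byte[offset % BLOCK]);
        return cipher.doFinal(cipherText, offset, length);
    }
}
```

A mapper handling a split that starts at byte offset N could call 
decryptRange with that offset and then hand the plaintext to a line-oriented 
reader, which would still need to handle records that straddle split 
boundaries, as TextInputFormat does for unencrypted text.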
