Hi Niels,
I think your assessment of the situation makes sense. Using the
compression hooks would encrypt entire Avro blocks, so you need a way to
change individual fields.
I think it makes sense for you to take advantage of the recently added
logical types. They give you a hook for individual fields, and you can
supply your own code. So you could implement a logical type that looks
like this:
{ "type": "bytes",
"logicalType": "encrypted",
"keyId": "...",
"originalType": "string" }
And you would also supply a converter. Here's an example for UUID:
https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/Conversions.java#L30
I think it would be pretty easy to implement and then you'd also have a
reliable way to label the fields that are encrypted.
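To make the converter idea a bit more concrete, here is a minimal sketch of the encrypt/decrypt step that such a converter could delegate to. It uses only the JDK and does not touch the Avro API. The class name FieldCrypto, the choice of AES-GCM, and the layout of a random IV followed by the ciphertext are all assumptions for illustration, not something Avro defines. Looking up the key from the "keyId" property is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper: encrypts one string field into bytes and back.
class FieldCrypto {
    private static final int IV_BYTES = 12;   // usual IV size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns IV || ciphertext, ready to be stored in a "bytes" field.
    public static ByteBuffer encrypt(String value, SecretKey key) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        RANDOM.nextBytes(iv); // fresh IV for every value
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
        ByteBuffer out = ByteBuffer.allocate(iv.length + ciphertext.length);
        out.put(iv).put(ciphertext);
        out.flip();
        return out;
    }

    // Splits IV || ciphertext again and recovers the original string.
    public static String decrypt(ByteBuffer field, SecretKey key) throws Exception {
        ByteBuffer in = field.duplicate(); // leave the caller's position alone
        byte[] iv = new byte[IV_BYTES];
        in.get(iv);
        byte[] ciphertext = new byte[in.remaining()];
        in.get(ciphertext);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8);
    }

    // Generates a throwaway 128-bit AES key, for experiments only.
    public static SecretKey newKey() throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }
}
```

A converter for the "encrypted" logical type would call encrypt when writing a field and decrypt when reading it. A reader without the key could skip the conversion and would just see opaque bytes.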
rb
On 08/19/2015 05:12 AM, Niels Basjes wrote:
Hi,
I'm working on a project where I plan to put clickstream data into Kafka
serialized using AVRO. In a later stage I want these records persisted into
AVRO files so they can be used by people using PIG.
So far this is no problem at all.
Now some of those fields (not all) are privacy sensitive so I do not want
them to be 'plain text' in the data. I want them to be encrypted so that
they can only be read by the people who need access to these fields.
The only thing I have found so far about encrypting data in AVRO is
https://issues.apache.org/jira/browse/AVRO-1371 which states
Quote:
"Similar to compression and decompression, encryption and decryption
can be implemented with Codecs, a concept that already exists in Avro."
I had a look at that Codecs API and it simply takes the 'entire thing' as a
ByteBuffer and compresses it. So this means the entire record is encrypted
(which is not what I want).
Without storing the data twice (it is too big for that), I want:
- All consumers to be able to read 'most' fields.
- Some consumers to be able to read 'all' fields.
I was considering simply putting the keyid and the encrypted bytes into a
field of the type 'bytes'. That way there is no need to change the
underlying file format.
To keep it simple I would simply have the application code generate the
'encrypted value' and store it in the record. Then at the PIG side I would
simply create a UDF that does the decryption again.
To make using this easier I even thought about extending the IDL language
(keyword 'encrypted') and then generate extra/different utility methods
that wrap/encrypt that field via the setters/builders and put that in a
normal AVRO file as bytes.
But before I start coding:
Has anyone ever thought about what the 'right' approach is to do this in
AVRO?
Has anyone built something I can have a look at?
--
Ryan Blue
Software Engineer
Cloudera, Inc.