Hey everybody, thanks a lot for reading and giving feedback!! I'll try and answer all points that I found going through the thread in this mail, but if I miss something please feel free to let me know! I've added a running number to the discussed topics for ease of reference down the road.
I'll go through the KIP and update it with everything that I have written below after sending this mail.

@Tom:
(1) If I understand your concerns correctly, you feel that this functionality would have a hard time getting approved into Apache Kafka because it can be achieved with custom Serializers in the same way, and that we should maybe develop this outside of Apache Kafka at first.

I feel like it is precisely the fact that this is not part of core Apache Kafka that makes people think twice about doing end-to-end encryption. I may be working in a market (Germany) that is a bit special when compared to the rest of the world where encryption and things like that are concerned, but I've personally sat in multiple meetings where this feature was discussed. It is not necessarily the end-to-end encryption itself, but the at-rest encryption that you get with it.
When people hear that this is not part of Apache Kafka itself, but that they would need to develop something themselves, that more often than not is the end of that discussion. Using something that is not "stock" is quite often simply not an option.

Even if they decide to go forward with it, they'll find Hendrik's blog post from 4 years ago on this, probably the whitepapers from Confluent and Lenses and maybe a few implementations on GitHub - all of which just serve to further muddy the waters. Not because any of these resources are bad or wrong, but just because information and implementations are spread out over a lot of different places. Developing this outside of Apache Kafka would simply add one more item to this list, which would not really matter I'm afraid.

I strongly feel that this is a needed feature in Kafka and that there is a large number of people out there who would want to use it - but I may very well be mistaken, responses to this thread have not exactly been plentiful this last year and a half.

@Mike:
(2) Regarding the encryption of headers, my current idea is to keep this configurable.
I have seen customers use headers for things like account numbers, which under the GDPR are considered personal data that should be encrypted wherever possible. So in some instances it might be useful to encrypt header fields as well.
My current PoC implementation allows specifying a regex for headers that should be encrypted, which would allow having encrypted and unencrypted headers in the same record to hopefully suit most use cases.

(3) Also, my plan is to not change the message format, but to "encrypt-in-place" and add a header field with the necessary information for decryption, which would then be removed by the decrypting consumer. There may be some out-of-date intentions still in the KIP, I'll go through it and update.

@Ryanne:
First off, I fully agree that we should avoid painting ourselves into a corner with an early client-only implementation. I scaled down this KIP from earlier attempts that included things like key rollover and broker-side implementations because I could not get any feedback from the community on those for a long time and felt that maybe there was no appetite for the full-blown solution. So I decided to try with a more limited scope. I am very happy to discuss/go for the fully featured version again :)

(4) Regarding plaintext data in RocksDB instances, I am a bit torn to be honest. On the one hand, I feel like this scenario is not something that we can fully control. Kafka Streams in this case is a client that takes data from Kafka, decrypts it and then puts it somewhere in plaintext. To me this scenario differs only slightly from, for example, someone writing a backup job that reads a topic and writes it to a text file - not much we can do about it.
That being said, Kafka Streams is part of Apache Kafka, so it does merit special consideration. I'll have to dig into how StateStores are used a bit (I am not the world's greatest expert - or any kind of expert - on that) to try and come up with an idea.
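
Coming back to (2) and (3) for a moment, the following is a rough sketch of how regex-based header selection plus an extra metadata header could fit together. It is explicitly not the PoC code. The header name "encryption-info", the JSON layout of its value, and the function names are all assumptions made up for illustration, and the cipher itself is left out and passed in as a callable.

```python
import json
import re
from typing import Callable, List, Tuple

Header = Tuple[str, bytes]

# Hypothetical header name; nothing in the KIP fixes this.
ENCRYPTION_INFO_HEADER = "encryption-info"


def encrypt_headers(
    headers: List[Header],
    pattern: str,
    encrypt: Callable[[bytes], bytes],
    key_id: str,
) -> List[Header]:
    """Encrypt the values of headers whose name fully matches `pattern`.

    Appends one metadata header (assumed JSON layout) that records the
    key id and the names of the encrypted headers for the consumer.
    """
    regex = re.compile(pattern)
    result: List[Header] = []
    encrypted_names: List[str] = []
    for name, value in headers:
        if regex.fullmatch(name):
            result.append((name, encrypt(value)))
            if name not in encrypted_names:
                encrypted_names.append(name)
        else:
            result.append((name, value))
    info = json.dumps({"key_id": key_id, "headers": encrypted_names})
    result.append((ENCRYPTION_INFO_HEADER, info.encode("utf-8")))
    return result


def decrypt_headers(
    headers: List[Header],
    decrypt: Callable[[bytes], bytes],
) -> List[Header]:
    """Reverse encrypt_headers and drop the metadata header again."""
    info = next((v for n, v in headers if n == ENCRYPTION_INFO_HEADER), None)
    if info is None:
        return list(headers)  # no metadata header: record was not encrypted
    encrypted = set(json.loads(info.decode("utf-8"))["headers"])
    result: List[Header] = []
    for name, value in headers:
        if name == ENCRYPTION_INFO_HEADER:
            continue  # removed by the decrypting consumer
        result.append((name, decrypt(value) if name in encrypted else value))
    return result
```

A real implementation would plug an authenticated cipher into `encrypt`/`decrypt` and resolve the key from `key_id`; the sketch only shows that a record can carry encrypted and unencrypted headers side by side without changing the message format.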
(5) On key encryption and hashing, this is definitely an issue that we need a solution for. I currently have key encryption configurable in my implementation. When encryption is enabled, one option would of course be to hash the original key and store the key data together with the value in an encrypted form. Any salt added to the key before hashing could be encrypted along with the data. This would allow all key-based functionality like compaction, joins etc. to keep working without having to know the cleartext key.
I've also considered deterministic encryption, which would keep the encrypted key the same, but I am fairly certain that we will want to allow regular key rotation (more on this in the next paragraph) without re-encrypting older data, and that would then change the encrypted key and break all these things.

Regarding re-encrypting existing keys when a crypto key is compromised, I think we need to be very careful with this if we do it in-place on the broker. If we add functionality along the lines of compaction, which reads, re-encrypts and rewrites segment files, we have to make sure that producers choose partitions based on the cleartext value, otherwise all records starting from the key change may go to a different partition of the topic.

(6) Key rollover would be a cool feature to have. Up until now I was only thinking about supporting regular key rollover functionality that would change keys for all records going forward tbh - mostly for complexity reasons - I think there was actually a sentence in the original KIP to this regard. But if you and others feel this is needed then I am happy to discuss this.
If we implement this on the broker we could use topic compaction for inspiration: read all segment files and check records one by one; if the key used for that record has been "retired/compromised/..." re-encrypt it with the new key and write a new segment file. Lots of things to consider around this regarding performance, how to trigger etc.
but in principle this could work I think. One issue I can see with this: if we use envelope encryption for the keys to address the rogue admin issue, the broker doesn't have access to the actual key encrypting the data, which would make that operation impossible.

I hope I got to all items that were raised, but may very well have overlooked something. Please let me know if I did - and of course your thoughts on what I wrote! I'll update the KIP today as well.

Best regards,
Sönke

On Thu, 7 May 2020 at 19:54, Ryanne Dolan <ryannedo...@gmail.com> wrote:

> Tom, good point, I've done exactly that -- hashing record keys -- but it's
> unclear to me what should happen when the hash key must be rotated. In my
> case the (external) solution involved rainbow tables, versioned keys, and
> custom materializers that were aware of older keys for each record.
>
> In particular I had a pipeline that would re-key records and re-ingest
> them, while opportunistically overwriting records materialized with the old
> key.
>
> For a native solution I think maybe we'd need to carry around any old
> versions of each record key, perhaps as metadata. Then brokers and
> materializers can compact records based on _any_ overlapping key, maybe?
> Not sure.
>
> Ryanne
>
> On Thu, May 7, 2020, 12:05 PM Tom Bentley <tbent...@redhat.com> wrote:
>
> > Hi Rayanne,
> >
> > You raise some good points there.
> >
> > > Similarly, if the whole record is encrypted, it becomes impossible to do
> > > joins, group bys etc, which just need the record key and maybe don't have
> > > access to the encryption key. Maybe only record _values_ should be
> > > encrypted, and maybe Kafka Streams could defer decryption until the actual
> > > value is inspected. That way joins etc are possible without the encryption
> > > key, and RocksDB would not need to decrypt values before materializing to
> > > disk.
> > >
> >
> > It's getting a bit late here, so maybe I overlooked something, but wouldn't
> > the natural thing to do be to make the "encrypted" key a hash of the
> > original key, and let the value of the encrypted value be the cipher text
> > of the (original key, original value) pair. A scheme like this would
> > preserve equality of the key (strictly speaking there's a chance of
> > collision of course). I guess this could also be a solution for the
> > compacted topic issue Sönke mentioned.
> >
> > Cheers,
> >
> > Tom
> >
> >
> >
> > On Thu, May 7, 2020 at 5:17 PM Ryanne Dolan <ryannedo...@gmail.com> wrote:
> >
> > > Thanks Sönke, this is an area in which Kafka is really, really far behind.
> > >
> > > I've built secure systems around Kafka as laid out in the KIP. One issue
> > > that is not addressed in the KIP is re-encryption of records after a key
> > > rotation. When a key is compromised, it's important that any data encrypted
> > > using that key is immediately destroyed or re-encrypted with a new key.
> > > Ideally first-class support for end-to-end encryption in Kafka would make
> > > this possible natively, or else I'm not sure what the point would be. It
> > > seems to me that the brokers would need to be involved in this process, so
> > > perhaps a client-first approach will be painting ourselves into a corner.
> > > Not sure.
> > >
> > > Another issue is whether materialized tables, e.g. in Kafka Streams, would
> > > see unencrypted or encrypted records. If we implemented the KIP as written,
> > > it would still result in a bunch of plain text data in RocksDB everywhere.
> > > Again, I'm not sure what the point would be. Perhaps using custom serdes
> > > would actually be a more holistic approach, since Kafka Streams etc could
> > > leverage these as well.
> > >
> > > Similarly, if the whole record is encrypted, it becomes impossible to do
> > > joins, group bys etc, which just need the record key and maybe don't have
> > > access to the encryption key. Maybe only record _values_ should be
> > > encrypted, and maybe Kafka Streams could defer decryption until the actual
> > > value is inspected. That way joins etc are possible without the encryption
> > > key, and RocksDB would not need to decrypt values before materializing to
> > > disk.
> > >
> > > This is why I've implemented encryption on a per-field basis, not at the
> > > record level, when addressing kafka security in the past. And I've had to
> > > build external pipelines that purge, re-encrypt, and re-ingest records when
> > > keys are compromised.
> > >
> > > This KIP might be a step in the right direction, not sure. But I'm hesitant
> > > to support the idea of end-to-end encryption without a plan to address the
> > > myriad other problems.
> > >
> > > That said, we need this badly and I hope something shakes out.
> > >
> > > Ryanne
> > >
> > > On Tue, Apr 28, 2020, 6:26 PM Sönke Liebau
> > > <soenke.lie...@opencore.com.invalid> wrote:
> > >
> > > > All,
> > > >
> > > > I've asked for comments on this KIP in the past, but since I didn't really
> > > > get any feedback I've decided to reduce the initial scope of the KIP a bit
> > > > and try again.
> > > >
> > > > I have reworked to KIP to provide a limited, but useful set of features for
> > > > this initial KIP and laid out a very rough roadmap of what I'd envision
> > > > this looking like in a final version.
> > > >
> > > > I am aware that the KIP is currently light on implementation details, but
> > > > would like to get some feedback on the general approach before fully
> > > > speccing everything.
> > > >
> > > > The KIP can be found at
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-317%3A+Add+end-to-end+data+encryption+functionality+to+Apache+Kafka
> > > >
> > > > I would very much appreciate any feedback!
> > > >
> > > > Best regards,
> > > > Sönke
> > > >
> > >
> >
>

--
Sönke Liebau
Partner
Tel. +49 179 7940878
OpenCore GmbH & Co. KG - Thomas-Mann-Straße 8 - 22880 Wedel - Germany