Hello, Maksim.

For the implementation, I chose the so-called "in place background re-encryption" design.

The first step is to rotate the key used for writing data; at the moment this only works on an active cluster.
The second step is re-encryption (to remove the previous encryption key). If a node is restarted, re-encryption starts after the metastorage becomes ready for read/write. Each "re-encrypted" partition (including index) has an attribute on the meta page that indicates whether background re-encryption should be continued.

I updated the description in the wiki [1]. Some more details are in Jira [2]. Draft PR [3].

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384
[2] https://issues.apache.org/jira/browse/IGNITE-12843
[3] https://github.com/apache/ignite/pull/7941

Tue, Jul 7, 2020 at 13:49, Maksim Stepachev <maksim.stepac...@gmail.com>:
>
> Hi!
>
> Do you have any updates about this issue? What types of implementations
> have you chosen (in-place, offline, or in the background)? I know that we
> want to add a partition defragmentation function, we can add a hole to
> integrate the re-encryption scheme. Could you update your IEP with your
> plans?
>
> Mon, May 25, 2020 at 12:50, Pavel Pereslegin <xxt...@gmail.com>:
>
> > Nikolay, Alexei,
> >
> > thanks for your suggestions.
> >
> > Offline re-encryption does not seem so simple: we need to read/replace
> > the existing encryption keys on all nodes (therefore, we should be
> > able to read/write metastore/WAL and exchange data between the
> > baseline nodes). Re-encryption in maintenance mode (for example, in a
> > stable read-only cluster) will be simple, but it still looks very
> > inconvenient, at least because users will need to interrupt all
> > operations.
> >
> > The main advantage of online "in place" re-encryption is that we'll
> > support multiple keys for reading, and this procedure does not
> > directly depend on background re-encryption.
> >
> > So, the first step is similar to rotating the master key: when the new
> > key has been set for writing on all nodes, that’s it, the cache group key
> > rotation is complete (this is what PCI DSS requires - encrypt new
> > updates with new keys).
> > The second step is to re-encrypt the existing data. As I said
> > previously, I thought about scanning all partition pages in some
> > background mode (storing progress on the meta page to continue after
> > restart), but the rebalance approach should also work here if I figure out
> > how to automate this process.
> >
> > Mon, May 25, 2020 at 12:22, Alexei Scherbakov <alexey.scherbak...@gmail.com>:
> > >
> > >
> > >
> > > Mon, May 25, 2020 at 12:00, Nikolay Izhikov <nizhi...@apache.org>:
> > >>
> > >> > This will take us to the re-encryption using full rebalancing
> > >>
> > >> Rebalance will require 2x efforts for reencryption
> > >>
> > >> 1. Read and send data from supplier node.
> > >> 2. Reencrypt and write data on demander node.
> > >>
> > >> Instead of
> > >>
> > >> 1. Read, reencrypt and write data on «demander» node.
> > >
> > >
> > > Usually, reading and sending is not a bottleneck. And don't forget we
> > > can run out of WAL history and fall back to full rebalancing with partition
> > > eviction, eliminating all efforts from offline re-encryption.
> > >
> > > On the other side, for a grid having many nodes one-by-one re-encryption
> > > can take a long time.
> > > It should also be possible to re-encrypt all data as fast as possible
> > > if, for example, the load can be switched to another grid; this is where
> > > offline re-encryption will come in handy.
> > >
> > > So, I suggest implementing offline re-encryption and online
> > > re-encryption using rebalancing as a first step.
> > >
> > > The next step can be online in-place re-encryption. It's important to
> > > measure its business impact on an online grid.
> > >
> > >>
> > >>
> > >>
> > >> > On May 25, 2020, at 11:46, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > >> >
> > >> > For me, the one big disadvantage of offline re-encryption is the
> > >> > possibility of running out of WAL history.
> > >> > If a re-encryption takes a long time, we will get full rebalancing with
> > >> > partition eviction.
> > >> > This will take us to the re-encryption using full rebalancing, proposed
> > >> > by me earlier.
> > >> >
> > >> >
> > >> >
> > >> > Mon, May 25, 2020 at 11:27, Nikolay Izhikov <nizhi...@apache.org>:
> > >> >
> > >> >>> And definitely this approach is much simpler to implement
> > >> >>
> > >> >> I agree.
> > >> >>
> > >> >> If we allow taking nodes offline for re-encryption, then we can implement a
> > >> >> fully offline procedure:
> > >> >>
> > >> >> 1. Stop node.
> > >> >> 2. Execute some control.sh command that will reencrypt all data without
> > >> >> starting node
> > >> >> 3. Start node.
> > >> >>
> > >> >> Pavel, can you, please, write it one more time - what are the disadvantages
> > >> >> of the offline procedure?
> > >> >>
> > >> >>> On May 25, 2020, at 11:20, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > >> >>>
> > >> >>> And definitely this approach is much simpler to implement because all
> > >> >>> corner cases are handled by rebalancing code.
> > >> >>>
> > >> >>> Mon, May 25, 2020 at 11:16, Alexei Scherbakov <alexey.scherbak...@gmail.com>:
> > >> >>>
> > >> >>>> I mean: serving supply requests.
> > >> >>>>
> > >> >>>> Mon, May 25, 2020 at 11:15, Alexei Scherbakov <alexey.scherbak...@gmail.com>:
> > >> >>>>
> > >> >>>>> Nikolay,
> > >> >>>>>
> > >> >>>>> Can you explain why such a restriction is necessary?
> > >> >>>>> Most likely having a currently re-encrypting node serving only demand
> > >> >>>>> requests will have the least performance impact on a grid.
> > >> >>>>>
> > >> >>>>> Mon, May 25, 2020 at 11:08, Nikolay Izhikov <nizhi...@apache.org>:
> > >> >>>>>
> > >> >>>>>> Hello, Alexei.
> > >> >>>>>>
> > >> >>>>>> I think we want to implement this feature without node restarts.
> > >> >>>>>> In the ideal scenario all nodes will stay alive and respond to the user
> > >> >>>>>> requests.
> > >> >>>>>>
> > >> >>>>>>> On May 24, 2020, at 15:24, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > >> >>>>>>>
> > >> >>>>>>> Pavel Pereslegin,
> > >> >>>>>>>
> > >> >>>>>>> I see another opportunity.
> > >> >>>>>>> We can use rebalancing to re-encrypt node data with a new key.
> > >> >>>>>>> It's a trivial procedure for me: stop a node, clear database, change a key,
> > >> >>>>>>> start node and wait for rebalancing to complete.
> > >> >>>>>>> Data will be re-encrypted during rebalancing.
> > >> >>>>>>>
> > >> >>>>>>> Did I miss something?
> > >> >>>>>>>
> > >> >>>>>>> Fri, May 22, 2020 at 16:14, Ivan Rakov <ivan.glu...@gmail.com>:
> > >> >>>>>>>
> > >> >>>>>>>> Folks,
> > >> >>>>>>>>
> > >> >>>>>>>> Just keeping you informed: my colleagues and I are highly interested in TDE
> > >> >>>>>>>> in general and key rotation specifically, but we don't have enough time
> > >> >>>>>>>> so far.
> > >> >>>>>>>> We'll dive into this feature and participate in reviews next month.
> > >> >>>>>>>>
> > >> >>>>>>>> --
> > >> >>>>>>>> Best Regards,
> > >> >>>>>>>> Ivan Rakov
> > >> >>>>>>>>
> > >> >>>>>>>> On Sun, May 17, 2020 at 10:51 PM Pavel Pereslegin <xxt...@gmail.com>
> > >> >>>>>>>> wrote:
> > >> >>>>>>>>
> > >> >>>>>>>>> Hello, Alexey.
> > >> >>>>>>>>>
> > >> >>>>>>>>>> is the encryption key for the data the same on all nodes in the
> > >> >>>>>>>>>> cluster?
> > >> >>>>>>>>> Yes, each encrypted cache group has its own encryption key, the key is
> > >> >>>>>>>>> the same on all nodes.
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Clearly, during the re-encryption there will exist pages
> > >> >>>>>>>>>> encrypted with both new and old keys at the same time.
> > >> >>>>>>>>> Yes, there will be pages encrypted with different keys at the same time.
> > >> >>>>>>>>> Currently, we only store one key for one cache group. To rotate a key,
> > >> >>>>>>>>> at a certain point in time it is necessary to support several keys (at
> > >> >>>>>>>>> least for reading the WAL).
> > >> >>>>>>>>> For the "in place" strategy, we'll store the encryption key identifier
> > >> >>>>>>>>> on each encrypted page (we currently have some unused space on
> > >> >>>>>>>>> encrypted page, so I don't expect any memory overhead here). Thus, we
> > >> >>>>>>>>> will have several keys for reading and one key for writing. I assume
> > >> >>>>>>>>> that the old key will be automatically deleted when a specific WAL
> > >> >>>>>>>>> segment is deleted (and re-encryption is finished).
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Will a node continue to re-encrypt the data after it restarts?
> > >> >>>>>>>>> Yes.
> > >> >>>>>>>>>
> > >> >>>>>>>>>> If a node goes down during the re-encryption, but the rest of the
> > >> >>>>>>>>>> cluster finishes re-encryption, will we consider the procedure complete?
> > >> >>>>>>>>> I'm not sure, but it looks like the key rotation is complete when we
> > >> >>>>>>>>> set the new key on all nodes so that the updates will be encrypted
> > >> >>>>>>>>> with the new key (as required by PCI DSS).
> > >> >>>>>>>>> Status of re-encryption can be obtained separately (locally or cluster
> > >> >>>>>>>>> wide).
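A minimal sketch of the multi-key idea from the answers above (a key identifier stored on each encrypted page, several keys for reading, one key for writing). This is not the Ignite implementation: `GroupKeys`, `toy_cipher` and the one-byte key identifier are hypothetical names and choices made for illustration, and the XOR function only stands in for a real cipher.

```python
from dataclasses import dataclass, field
from typing import Dict


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR stand-in for a real cipher (NOT secure); applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


@dataclass
class GroupKeys:
    """Hypothetical key ring of one cache group: several keys for reading, one for writing."""
    keys: Dict[int, bytes] = field(default_factory=dict)
    write_key_id: int = 0

    def rotate(self, new_id: int, new_key: bytes) -> None:
        # After this call all new writes use the new key; old keys stay readable.
        self.keys[new_id] = new_key
        self.write_key_id = new_id

    def encrypt_page(self, plain: bytes) -> bytes:
        # The first stored byte carries the key identifier (ids must fit in 0..255 here).
        key_id = self.write_key_id
        return bytes([key_id]) + toy_cipher(plain, self.keys[key_id])

    def decrypt_page(self, stored: bytes) -> bytes:
        # Pick the read key by the identifier found on the page itself.
        return toy_cipher(stored[1:], self.keys[stored[0]])


ring = GroupKeys()
ring.rotate(1, b"old-key")
old_page = ring.encrypt_page(b"written before rotation")
ring.rotate(2, b"new-key")
new_page = ring.encrypt_page(b"written after rotation")

# Both pages remain readable although they were encrypted with different keys.
print(ring.decrypt_page(old_page), ring.decrypt_page(new_page))
```

In this sketch the old key could only be dropped once no stored page (and no WAL record still needed) carries its identifier, which mirrors the assumption stated in the answer above.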
> > >> >>>>>>>>>
> > >> >>>>>>>>> I forgot to mention that with “in place” re-encryption it will be
> > >> >>>>>>>>> impossible to quickly cancel re-encryption, because by canceling we
> > >> >>>>>>>>> mean re-encryption with the old key.
> > >> >>>>>>>>>
> > >> >>>>>>>>>> How do you see the whole key rotation procedure will work?
> > >> >>>>>>>>> The initial design for re-encryption with "partition copying" is described
> > >> >>>>>>>>> here [1]. I'll prepare a detailed design for "in place" re-encryption if
> > >> >>>>>>>>> we go this way. In short: send the new encryption key cluster-wide,
> > >> >>>>>>>>> each node adds the new key and starts background re-encryption.
> > >> >>>>>>>>>
> > >> >>>>>>>>> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384#TDE.Phase-3.Cachekeyrotation.-Copywithre-encryptiondesign
> > >> >>>>>>>>>
> > >> >>>>>>>>> Sun, May 17, 2020 at 18:35, Alexey Goncharuk <alexey.goncha...@gmail.com>:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Pavel, Anton,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> How do you see the whole key rotation procedure will work? Clearly, during
> > >> >>>>>>>>>> the re-encryption there will exist pages encrypted with both new and old
> > >> >>>>>>>>>> keys at the same time. Will a node continue to re-encrypt the data after it
> > >> >>>>>>>>>> restarts? If a node goes down during the re-encryption, but the rest of the
> > >> >>>>>>>>>> cluster finishes re-encryption, will we consider the procedure complete? By
> > >> >>>>>>>>>> the way, is the encryption key for the data the same on all nodes in the
> > >> >>>>>>>>>> cluster?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Thu, May 14, 2020 at 11:30, Anton Vinogradov <a...@apache.org>:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>> +1 to "In place re-encryption".
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> - It has a simple design.
> > >> >>>>>>>>>>> - Clusters under load may require just load to re-encrypt the data.
> > >> >>>>>>>>>>> (Friendly to load).
> > >> >>>>>>>>>>> - Easy to throttle.
> > >> >>>>>>>>>>> - Easy to continue.
> > >> >>>>>>>>>>> - Design compatible with the multi-key architecture.
> > >> >>>>>>>>>>> - It can be optimized to use own WAL buffer and to re-encrypt pages without
> > >> >>>>>>>>>>> restoring them to on-heap.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> On Thu, May 14, 2020 at 1:54 AM Pavel Pereslegin <xxt...@gmail.com>
> > >> >>>>>>>>>>> wrote:
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>>> Hello Igniters.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> Recently, master key rotation for Apache Ignite Transparent Data
> > >> >>>>>>>>>>>> Encryption was implemented [1], but some security standards (PCI DSS
> > >> >>>>>>>>>>>> at least) require rotation of all encryption keys [2]. Currently,
> > >> >>>>>>>>>>>> encryption occurs when reading/writing pages to disk, cache encryption
> > >> >>>>>>>>>>>> keys are stored in metastore.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> I'm going to contribute cache encryption key rotation and want to
> > >> >>>>>>>>>>>> consult on the best way to re-encrypt existing data. I see two
> > >> >>>>>>>>>>>> different strategies.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> 1. In place re-encryption:
> > >> >>>>>>>>>>>> Using the old key, sequentially read all the pages from the datastore,
> > >> >>>>>>>>>>>> mark as dirty and log them into the WAL.
> > >> >>>>>>>>>>>> After checkpoint, pages will
> > >> >>>>>>>>>>>> be stored to disk encrypted with the new key (as usual, along with
> > >> >>>>>>>>>>>> updates). This strategy requires storing the identifier (number) of the
> > >> >>>>>>>>>>>> encryption key in the encrypted page.
> > >> >>>>>>>>>>>> pros:
> > >> >>>>>>>>>>>> - can work in the background with minimal performance impact (this
> > >> >>>>>>>>>>>> impact can be managed).
> > >> >>>>>>>>>>>> cons:
> > >> >>>>>>>>>>>> - page duplication in the WAL may affect performance and historical
> > >> >>>>>>>>>>>> rebalance.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> 2. Copy partition with re-encryption.
> > >> >>>>>>>>>>>> This strategy is similar to partition snapshotting [3] - create
> > >> >>>>>>>>>>>> partition copy encrypted with the new key and then replace the
> > >> >>>>>>>>>>>> original partition file with the new one (see details [4]).
> > >> >>>>>>>>>>>> pros:
> > >> >>>>>>>>>>>> - should work faster than "in place" re-encryption.
> > >> >>>>>>>>>>>> cons:
> > >> >>>>>>>>>>>> - re-encryption in active cluster (and on unstable topology) can be
> > >> >>>>>>>>>>>> difficult to implement.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> (See more detailed comparison [5])
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> Re-encryption of existing data is a long and rare procedure (It is
> > >> >>>>>>>>>>>> recommended to change the key every 6 months, but at least once every
> > >> >>>>>>>>>>>> 2 years). Thus, re-encryption can be implemented for maintenance mode
> > >> >>>>>>>>>>>> (for example, on a stable topology in a read-only cluster) and in such
> > >> >>>>>>>>>>>> case the approach with partition copying seems simpler and faster.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> So, what do you think - do we need "online" re-encryption and which of
> > >> >>>>>>>>>>>> the proposed options is best suited for this?
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12186
> > >> >>>>>>>>>>>> [2] https://www.pcisecuritystandards.org/documents/PCI_DSS_v3-2-1.pdf
> > >> >>>>>>>>>>>> [3] https://cwiki.apache.org/confluence/display/IGNITE/IEP-43%3A+Cluster+snapshots#IEP-43:Clustersnapshots-Partitionscopystrategy
> > >> >>>>>>>>>>>> [4] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384#TDE.Phase-3.Cachekeyrotation.-Copywithre-encryptiondesign
> > >> >>>>>>>>>>>> [5] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384#TDE.Phase-3.Cachekeyrotation.-Comparison
> > >> >>>>>>>>>>>>
> > >> >>>>>>>
> > >> >>>>>>> --
> > >> >>>>>>>
> > >> >>>>>>> Best regards,
> > >> >>>>>>> Alexei Scherbakov
> > >> >>>>>
> > >> >>>>> --
> > >> >>>>>
> > >> >>>>> Best regards,
> > >> >>>>> Alexei Scherbakov
> > >> >>>>
> > >> >>>> --
> > >> >>>>
> > >> >>>> Best regards,
> > >> >>>> Alexei Scherbakov
> > >> >>>
> > >> >>> --
> > >> >>>
> > >> >>> Best regards,
> > >> >>> Alexei Scherbakov
> > >> >
> > >> > --
> > >> >
> > >> > Best regards,
> > >> > Alexei Scherbakov
> > >
> > > --
> > >
> > > Best regards,
> > > Alexei Scherbakov
> >
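
A minimal sketch of the resumable background re-encryption discussed in this thread: progress is kept in a per-partition attribute (standing in for the meta page attribute), so the scan can continue after a restart, and the batch size gives a simple throttle. All names here (`Partition`, `reencrypt_from`, `reencrypt_step`, `toy_cipher`) are hypothetical, the XOR function is only a stand-in for a real cipher, and the sketch rewrites pages directly, whereas the design above marks pages dirty and lets the checkpoint write them with the new key.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR stand-in for a real cipher (NOT secure); applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


@dataclass
class Partition:
    pages: List[Tuple[int, bytes]]  # (key id, encrypted payload)
    # Stand-in for the meta page attribute: index of the next page to
    # re-encrypt, or None when no re-encryption is pending.
    reencrypt_from: Optional[int] = None


def reencrypt_step(part: Partition, keys: Dict[int, bytes],
                   new_key_id: int, max_pages: int) -> bool:
    """Re-encrypt at most max_pages pages; return True once the partition is done."""
    if part.reencrypt_from is None:
        return True
    start = part.reencrypt_from
    end = min(start + max_pages, len(part.pages))
    for i in range(start, end):
        key_id, payload = part.pages[i]
        if key_id != new_key_id:  # pages already written with the new key are skipped
            plain = toy_cipher(payload, keys[key_id])
            part.pages[i] = (new_key_id, toy_cipher(plain, keys[new_key_id]))
    # Persist progress so that a restarted node continues from here.
    part.reencrypt_from = end if end < len(part.pages) else None
    return part.reencrypt_from is None


keys = {1: b"old-key", 2: b"new-key"}
part = Partition([(1, toy_cipher(f"page-{i}".encode(), keys[1])) for i in range(5)])

part.reencrypt_from = 0                      # key rotated, re-encryption scheduled
reencrypt_step(part, keys, 2, max_pages=2)   # the "node stops" after one batch
resume_point = part.reencrypt_from           # what a restarted node would read back

while not reencrypt_step(part, keys, 2, max_pages=2):
    pass                                     # a real task would pause between batches

print(resume_point, part.reencrypt_from, [key_id for key_id, _ in part.pages])
```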