Re: [DISCUSS] Best way to re-encrypt existing data (TDE cache key rotation).
Hello, Maksim. Thanks for the summary. From my point of view, we should focus on the Phase 3 implementation and then do the refactoring for some specific SPI implementation.

> 8. Question - why we add a lot of system properties?

Can you please list the system properties that should be moved to the configuration?

> 10. Question - CRC is read in two places encryptionFileIO and filePageStore -
> what should we do with this?

filePageStore checks the CRC of the encrypted page. This is required to confirm that the page is not corrupted on disk. encryptionFileIO checks the CRC of the decrypted page (the CRC itself is stored in the encrypted data). This is required to be sure that the decrypted page contains correct data and has not been replaced with some malicious content.

Here is the list of items that are not related to the Phase 3 implementation. Please tell me what you think:

> 2. We should try to run the existing test suites in encryption mode.

We did it during TDE Phase 1 testing.

> 3. SPI requires an additional method such as getKeyDigest
> 4. Recommendation - the encryption processor should be divided into external
> subclasses
> 5. Recommendation - we should not use tuples and triples, because this is a
> marker of a design problem.
> 6. Strict recommendation - please don't put context everywhere

Actually, this is a question of taste and obviously not related to the current discussion.

> On 24 Jul 2020, at 14:27, Maksim Stepachev wrote:
>
> Hello everyone, yesterday we discussed the implementation of TDE over a
> conference call. I added a summary of this call here:
>
> 1. The wiki documentation should be expanded. It should describe the
> steps - how it works under the hood. What are the domain objects in the
> implementation?
> 2. We should try to run the existing test suites in encryption mode.
> Encryption should not affect any PDS or other tests.
> 3. SPI requires an additional method such as getKeyDigest, because the
> current implementation of GridEncryptionManager#masterKeyDigest() looks
> strange. We reset the master key to calculate the digest. This will not
> work well if we want to use VOLT as a key provider implementation.
> 4. Recommendation - the encryption processor should be divided into
> external subclasses, and we should use the OOP decomposition pattern for
> it. Right now, this class has more than 2000 lines and does not support
> SOLID. This is similar to inlining unrelated logic into a single class.
> 5. Recommendation - we should not use tuples and triples, because this
> is a marker of a design problem.
> 6. Strict recommendation - please don't put context everywhere. It
> should only be used in the parent class. You can pass the necessary
> dependencies through the constructor, as in the DI pattern.
> 7. Question - the current implementation does not use the throttling that
> is implemented in PDS. Users should set the throughput, such as 5 MB per
> second, but not the timeout, packet size, or stream size.
> 8. Question - why do we add a lot of system properties? Why didn't we add
> a configuration for them?
> 9. Question - how do we optimize when we can check that this page is
> already encrypted by parallel loading? Maybe we should do this in Phase 4?
> 10. Question - CRC is read in two places, encryptionFileIO and
> filePageStore - what should we do with this?
> 11. We should remember about complicated test scenarios with failover,
> like a node leaving when encryption started and joining after it finished;
> the baseline changing; a node leaving before / after / in the middle of
> this process. Etc.
> 12. How to use a sandbox to protect our cluster from master and user key
> stealing via compute?
> 13. Will re-encryption continue after the cluster is completely stopped?
>
> If I forgot some points, you can add them to the message.
>
> On Tue, 7 Jul 2020 at 17:40, Pavel Pereslegin wrote:
>
>> Hello, Maksim.
>>
>> For implementation, I chose the so-called "in place background
>> re-encryption" design.
>>
>> The first step is to rotate the key for writing data; it only works on
>> the active cluster at the moment.
>> The second step is re-encryption (to remove the previous encryption key).
>> If the node was restarted, re-encryption starts after the metastorage
>> becomes ready for read/write. Each "re-encrypted" partition (including
>> index) has an attribute on the meta page that indicates whether background
>> re-encryption should be continued.
>>
>> I updated the description in the wiki [1].
>> Some more details in jira [2].
>> Draft PR [3].
>>
>> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384
>> [2] https://issues.apache.org/jira/browse/IGNITE-12843
>> [3] https://github.com/apache/ignite/pull/7941
>>
>> On Tue, 7 Jul 2020 at 13:49, Maksim Stepachev wrote:
>>>
>>> Hi!
>>>
>>> Do you have any updates about this issue? What types of implementations
>>> have you chosen (in-place, offline, or in the
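The two-level CRC check described earlier in this thread (outer CRC over the encrypted page in filePageStore, inner CRC inside the encrypted payload in encryptionFileIO) can be sketched as follows. This is an illustrative toy, not Ignite's actual code: the class and method names are made up, and the "cipher" is a placeholder XOR, not real encryption.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Toy sketch of the two-level CRC scheme; names are illustrative, not Ignite's API. */
public class TwoLevelCrc {
    private static final byte KEY = 0x5A; // placeholder "cipher": XOR with a byte, NOT real encryption

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    static byte[] xor(byte[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) out[i] = (byte)(data[i] ^ KEY);
        return out;
    }

    /** Write path: append inner CRC to plaintext, "encrypt", then compute outer CRC over ciphertext. */
    static byte[] writePage(byte[] plain) {
        ByteBuffer inner = ByteBuffer.allocate(plain.length + 8);
        inner.put(plain).putLong(crc(plain));  // inner CRC travels inside the encrypted payload
        byte[] cipher = xor(inner.array());
        ByteBuffer page = ByteBuffer.allocate(cipher.length + 8);
        page.put(cipher).putLong(crc(cipher)); // outer CRC, checked by the page store
        return page.array();
    }

    /** Read path: outer check (disk corruption), then inner check (tampered/garbage plaintext). */
    static byte[] readPage(byte[] page) {
        ByteBuffer buf = ByteBuffer.wrap(page);
        byte[] cipher = new byte[page.length - 8];
        buf.get(cipher);
        if (buf.getLong() != crc(cipher))
            throw new IllegalStateException("Page corrupted on disk (outer CRC mismatch)");
        byte[] inner = xor(cipher); // XOR is its own inverse, so this "decrypts"
        ByteBuffer ib = ByteBuffer.wrap(inner);
        byte[] plain = new byte[inner.length - 8];
        ib.get(plain);
        if (ib.getLong() != crc(plain))
            throw new IllegalStateException("Decrypted content invalid (inner CRC mismatch)");
        return plain;
    }

    public static void main(String[] args) {
        byte[] page = writePage("page-data".getBytes());
        System.out.println(new String(readPage(page))); // prints "page-data"
    }
}
```

The point of the double check is that the outer CRC alone cannot distinguish a valid-but-substituted ciphertext from the original, while the inner CRC is covered by the key.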
Re: [DISCUSS] Best way to re-encrypt existing data (TDE cache key rotation).
Hello everyone, yesterday we discussed the implementation of TDE over a conference call. I added a summary of this call here:

1. The wiki documentation should be expanded. It should describe the steps - how it works under the hood. What are the domain objects in the implementation?
2. We should try to run the existing test suites in encryption mode. Encryption should not affect any PDS or other tests.
3. SPI requires an additional method such as getKeyDigest, because the current implementation of GridEncryptionManager#masterKeyDigest() looks strange. We reset the master key to calculate the digest. This will not work well if we want to use VOLT as a key provider implementation.
4. Recommendation - the encryption processor should be divided into external subclasses, and we should use the OOP decomposition pattern for it. Right now, this class has more than 2000 lines and does not support SOLID. This is similar to inlining unrelated logic into a single class.
5. Recommendation - we should not use tuples and triples, because this is a marker of a design problem.
6. Strict recommendation - please don't put context everywhere. It should only be used in the parent class. You can pass the necessary dependencies through the constructor, as in the DI pattern.
7. Question - the current implementation does not use the throttling that is implemented in PDS. Users should set the throughput, such as 5 MB per second, but not the timeout, packet size, or stream size.
8. Question - why do we add a lot of system properties? Why didn't we add a configuration for them?
9. Question - how do we optimize when we can check that this page is already encrypted by parallel loading? Maybe we should do this in Phase 4?
10. Question - CRC is read in two places, encryptionFileIO and filePageStore - what should we do with this?
11. We should remember about complicated test scenarios with failover, like a node leaving when encryption started and joining after it finished; the baseline changing; a node leaving before / after / in the middle of this process. Etc.
12. How to use a sandbox to protect our cluster from master and user key stealing via compute?
13. Will re-encryption continue after the cluster is completely stopped?

If I forgot some points, you can add them to the message.

On Tue, 7 Jul 2020 at 17:40, Pavel Pereslegin wrote:
> Hello, Maksim.
>
> For implementation, I chose the so-called "in place background
> re-encryption" design.
>
> The first step is to rotate the key for writing data; it only works on
> the active cluster at the moment.
> The second step is re-encryption (to remove the previous encryption key).
> If the node was restarted, re-encryption starts after the metastorage
> becomes ready for read/write. Each "re-encrypted" partition (including
> index) has an attribute on the meta page that indicates whether background
> re-encryption should be continued.
>
> I updated the description in the wiki [1].
> Some more details in jira [2].
> Draft PR [3].
>
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384
> [2] https://issues.apache.org/jira/browse/IGNITE-12843
> [3] https://github.com/apache/ignite/pull/7941
>
> On Tue, 7 Jul 2020 at 13:49, Maksim Stepachev wrote:
> >
> > Hi!
> >
> > Do you have any updates about this issue? What types of implementations
> > have you chosen (in-place, offline, or in the background)? I know that we
> > want to add a partition defragmentation function; we can add a hook to
> > integrate the re-encryption scheme. Could you update your IEP with your
> > plans?
> >
> > On Mon, 25 May 2020 at 12:50, Pavel Pereslegin wrote:
> > >
> > > Nikolay, Alexei,
> > >
> > > thanks for your suggestions.
> > >
> > > Offline re-encryption does not seem so simple; we need to read/replace
> > > the existing encryption keys on all nodes (therefore, we should be
> > > able to read/write the metastore/WAL and exchange data between the
> > > baseline nodes).
> > > Re-encryption in maintenance mode (for example, in a stable read-only
> > > cluster) would be simple, but it still looks very inconvenient, at
> > > least because users would need to interrupt all operations.
> > >
> > > The main advantage of online "in place" re-encryption is that we'll
> > > support multiple keys for reading, and this procedure does not
> > > directly depend on background re-encryption.
> > >
> > > So, the first step is similar to rotating the master key: when the new
> > > key is set for writing on all nodes - that's it, the cache group key
> > > rotation is complete (this is what PCI DSS requires - encrypt new
> > > updates with new keys).
> > > The second step is to re-encrypt the existing data. As I said
> > > previously, I thought about scanning all partition pages in some
> > > background mode (storing progress on the meta page to continue after
> > > restart), but the rebalance approach should also work here if I figure out
> > > how to automate
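The resumable background scheme described above (a per-partition flag plus progress cursor persisted on the meta page, so re-encryption continues after a restart) can be sketched in miniature. All names here are illustrative stand-ins, not Ignite's actual classes; the "re-encryption" is a placeholder transform.

```java
/** Illustrative sketch of resumable in-place re-encryption: each partition persists
 *  a flag and a cursor ("attribute on the meta page") so a background task can
 *  continue after a node restart. Not Ignite's real classes. */
public class ReencryptSketch {
    static class Partition {
        final byte[][] pages;
        boolean reencryptPending = true; // the meta-page attribute from the design above
        int cursor;                      // index of the next page to re-encrypt, also persisted

        Partition(int pageCnt) { pages = new byte[pageCnt][16]; }
    }

    /** Re-encrypt up to 'batch' pages, then yield (mimics background throttling). */
    static void step(Partition p, int batch) {
        for (int i = 0; i < batch && p.cursor < p.pages.length; i++, p.cursor++) {
            // Would be: decrypt with the old key, encrypt with the new key.
            // Here just a placeholder transform on the page bytes.
            p.pages[p.cursor][0] ^= 1;
        }
        if (p.cursor == p.pages.length)
            p.reencryptPending = false; // the old key can be dropped once all partitions finish
    }

    public static void main(String[] args) {
        Partition p = new Partition(10);
        while (p.reencryptPending)
            step(p, 3); // would normally run in a background thread between checkpoints
        System.out.println("pending=" + p.reencryptPending + ", cursor=" + p.cursor);
        // prints "pending=false, cursor=10"
    }
}
```

Because the flag and cursor live with the partition's own persisted metadata, a crash between batches loses at most one batch of progress, which is what makes the "in place background" approach restart-safe.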
Re: Extended logging for rebalance performance analysis
Hi Kirill,

I think it would be helpful to add a few more metrics to the log line

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

namely: avgSpeed=, partitionsNum=N, histPartitionsNum=M, histBytesRcvd=X KB, histDuration=Yms, histAvgSpeed=, fullPartitionsNum=K, fullBytesRcvd=X KB, fullDuration=Yms, fullAvgSpeed=

I believe these additional metrics are good in that a user can easily set up monitoring based on them, and they give a lot of useful info about rebalance performance in general. No need to look for the "Starting rebalance" message; everything is in one place.

If any of these are hard to add, then we can skip them (I suspect histDuration and fullDuration can't be measured in practice, but I don't know for sure). Let's add what we can.

Thanks,
Stan

> On 3 Jul 2020, at 17:21, ткаленко кирилл wrote:
>
> Sorry, forgot.
>
> [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>
> 03.07.2020, 17:20, "ткаленко кирилл":
>> Hi, Stan!
>>
>> I don't understand you yet.
>>
>> Now you can use metrics as it was done in the test [1]. Or can you tell me
>> where to do this, for example when completing rebalancing for all groups?
>>
>> See what is now available and added in the logs:
>>
>> 1) Which group is rebalanced and which type of rebalance:
>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4,
>> minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
>> fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>>
>> 2) Completion of rebalancing from one of the suppliers:
>> Completed rebalancing [grp=grp0,
>> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60,
>> duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4,
>> minorTopVer=0], progress=1/3, rebalanceId=1]
>>
>> 3) Completion of the entire rebalance:
>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400,
>> duration=41ms, bytesRcvd=40,4 KB]
>>
>> These messages have a common parameter, rebalanceId=1.
>>
>> 03.07.2020, 16:48, "Stanislav Lukyanov":
>>> On 3 Jul 2020, at 09:51, ткаленко кирилл wrote:
>>>> To calculate the average value, you can use the existing metrics
>>>> "RebalancingStartTime", "RebalancingLastCancelledTime",
>>>> "RebalancingEndTime", "RebalancingPartitionsLeft",
>>>> "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>>
>>> You can calculate it, and I believe that this is the first thing anyone
>>> would do when reading these logs and metrics.
>>> If that's an essential thing, then maybe it should be available out of the
>>> box?
>>>
>>>> This also works correctly with the historical rebalance. Now we can see
>>>> the rebalance type for each group and for each supplier in the logs. I
>>>> don't think we should duplicate this information.
>>>> [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>>
>>> I'm talking about adding info on how much data has been transferred during
>>> rebalance.
>>> When rebalance completes, I'd like to know how much data has been
>>> transferred, was it historical or full, and what was the average rebalance
>>> speed.
>>>
>>> There are two reasons for having all that.
>>>
>>> First, it helps to analyze issues by searching the logs and looking
>>> for anomalies.
>>>
>>> Second, this allows automating alerts: e.g. if you know your typical
>>> historical rebalance speed, you can trigger an alert if it drops below that.
>>>
>>> 03.07.2020, 02:49, "Stanislav Lukyanov":
>>>> Kirill,
>>>>
>>>> I've looked through the patch.
>>>> Looks good, but it feels like the first thing someone will try to do
>>>> given bytesRcvd and duration is to divide one by the other to get an
>>>> average speed.
>>>> Do you think it's reasonable to also add it to the logs? Maybe even to
>>>> the metrics?
>>>>
>>>> Also, this works with historical rebalance, right? Can we specify how
>>>> much data was transferred via historical or full rebalance from each
>>>> supplier?
>>>> Maybe even estimate transfer speed in entries and bytes for each
>>>> rebalance type?
>>>>
>>>> Thanks,
>>>> Stan
>>>>
>>>>> On 29 Jun 2020, at 11:50, Ivan Rakov wrote:
>>>>>
>>>>> +1 to Alex G.
>>>>>
>>>>> From my experience, the most interesting cases with Ignite rebalancing
>>>>> happen exactly in production. Given that we already have
>>>>> detailed rebalancing logging, adding info about rebalance
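The avgSpeed derivation discussed in this thread (dividing bytes received by the rebalance duration) could look like the following. This is a hypothetical helper, not code from the patch; the method name and units are assumptions.

```java
/** Hypothetical helper deriving the average rebalance speed suggested above
 *  from the existing inputs: bytes received plus start/end timestamps. */
public class RebalanceSpeed {
    /** @return average speed in MB/s (the unit users typically think of for throttling). */
    static double avgSpeedMbPerSec(long bytesRcvd, long startTimeMs, long endTimeMs) {
        long durationMs = endTimeMs - startTimeMs;
        if (durationMs <= 0)
            return 0; // rebalance not finished yet, or clock anomaly: report nothing
        return (bytesRcvd / (1024.0 * 1024.0)) / (durationMs / 1000.0);
    }

    public static void main(String[] args) {
        // Roughly the "Completed rebalance chain" example above: ~40.4 KB in 41 ms.
        System.out.printf("%.2f MB/s%n", avgSpeedMbPerSec(41_370, 0, 41));
    }
}
```

Computing this once at log time and emitting it as both a log field and a metric is what makes threshold-based alerting ("alert if historical rebalance speed drops below X") straightforward to set up.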
Re: Move TcpDiscoveryZookeeperIpFinder to ignite-extensions as part of IEP-36
Ivan,

Sounds like an obsolete compromise partial solution kept just because of the "Zookeeper" word in its name, not because of some profit.

>> So I suggest to move this single class with tests to separate module in
>> ignite-extensions.

+1

On Fri, Jul 24, 2020 at 11:27 AM Ivan Daschinsky wrote:
> Hello, Igniters.
>
> Currently, the ignite-zookeeper module, which contains the full implementation
> of ZookeeperDiscoverySpi, also contains one class that looks like a little
> snippet, and I have some concerns about the presence of this class in the
> module.
>
> 1) This class is a simple snippet-like implementation of TcpDiscoveryIpFinder.
> 2) It creates an EPHEMERAL_SEQUENTIAL ZNode in the root directory with JSON
> containing the IP addresses of the joining node.
> 3) It reads registered child znodes from the root directory and uses these
> addresses to participate in the common TcpDiscovery process.
>
> However:
> 1) This class has nothing in common with ZookeeperDiscovery.
> 2) It brings additional dependencies to the module (the apache-curator
> framework, jackson, and so on).
> 3) None of these dependencies are required by ZookeeperDiscoverySpi.
> 4) Its usefulness is doubtful: we already have a fully functional
> ZookeeperDiscovery, and using a Zookeeper quorum as a simple store for IP
> addresses without utilizing Zookeeper's full power is a questionable decision.
>
> So I suggest moving this single class with its tests to a separate module in
> ignite-extensions.
>
> What do you think?
>
> --
> Sincerely yours, Ivan Daschinskiy
Move TcpDiscoveryZookeeperIpFinder to ignite-extensions as part of IEP-36
Hello, Igniters.

Currently, the ignite-zookeeper module, which contains the full implementation of ZookeeperDiscoverySpi, also contains one class that looks like a little snippet, and I have some concerns about the presence of this class in the module.

1) This class is a simple snippet-like implementation of TcpDiscoveryIpFinder.
2) It creates an EPHEMERAL_SEQUENTIAL ZNode in the root directory with JSON containing the IP addresses of the joining node.
3) It reads registered child znodes from the root directory and uses these addresses to participate in the common TcpDiscovery process.

However:
1) This class has nothing in common with ZookeeperDiscovery.
2) It brings additional dependencies to the module (the apache-curator framework, jackson, and so on).
3) None of these dependencies are required by ZookeeperDiscoverySpi.
4) Its usefulness is doubtful: we already have a fully functional ZookeeperDiscovery, and using a Zookeeper quorum as a simple store for IP addresses without utilizing Zookeeper's full power is a questionable decision.

So I suggest moving this single class with its tests to a separate module in ignite-extensions.

What do you think?

--
Sincerely yours, Ivan Daschinskiy
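For readers unfamiliar with the flow Ivan describes, here is a small in-memory stand-in for it. A real implementation would use the Curator framework against a live ZooKeeper ensemble; this sketch only models EPHEMERAL_SEQUENTIAL znodes with a sorted map and a counter, and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory stand-in for the ZooKeeper-backed IP-finder flow described above.
 *  Models EPHEMERAL_SEQUENTIAL znodes (sorted paths, auto-removed on session end). */
public class IpFinderSketch {
    private final TreeMap<String, String> znodes = new TreeMap<>(); // path -> JSON payload
    private int seq;

    /** Joining node registers its address, like create(EPHEMERAL_SEQUENTIAL, json). */
    String register(String address) {
        String path = String.format("/ip-finder/node-%010d", seq++);
        znodes.put(path, "{\"addr\":\"" + address + "\"}"); // JSON payload with the node address
        return path;
    }

    /** Discovery reads all child znodes and extracts the registered addresses. */
    List<String> addresses() {
        List<String> res = new ArrayList<>();
        for (String json : znodes.values())
            res.add(json.replaceAll(".*\"addr\":\"([^\"]+)\".*", "$1"));
        return res;
    }

    /** Session close removes the ephemeral node, like ZooKeeper does automatically. */
    void leave(String path) {
        znodes.remove(path);
    }

    public static void main(String[] args) {
        IpFinderSketch zk = new IpFinderSketch();
        String p = zk.register("10.0.0.1:47500");
        zk.register("10.0.0.2:47500");
        System.out.println(zk.addresses()); // [10.0.0.1:47500, 10.0.0.2:47500]
        zk.leave(p);
        System.out.println(zk.addresses()); // [10.0.0.2:47500]
    }
}
```

As the sketch shows, ZooKeeper is used here only as a shared address registry for the ordinary TcpDiscovery handshake, which is Ivan's point: none of ZooKeeper's coordination power is actually exercised.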
Re: Re[2]: Apache Ignite 2.9.0 RELEASE [Time, Scope, Manager]
Guys,

I've cherry-picked IGNITE-12438 (one-way client-server connections) and IGNITE-13038 (web console removal) to 2.9.

Since there are no objections, I will move IGNITE-13006 (Spring libs upgrade to version 5.2), IGNITE-12489 (error during purges by expiration) and IGNITE-12553 (public Java metrics API) to the next release.

I will cherry-pick ticket IGNITE-11942 (IGFS and Hadoop removal) after it is reviewed and merged to master.

What about IGNITE-12911 [1] (B+Tree Corrupted exception when using a key extracted from a BinaryObject value object --- and SQL enabled)? Anton Kalashnikov, Ilya Kasnacheev, can you please clarify the ticket status?

[1]: https://issues.apache.org/jira/browse/IGNITE-12911

On Thu, 23 Jul 2020 at 12:08, Alexey Kuznetsov wrote:
> Alex, Denis
>
> The issue with moving the Web Console to a separate repository is merged to
> master.
> See: https://issues.apache.org/jira/browse/IGNITE-13038
>
> Please consider cherry-picking it to ignite-2.9.
>
> ---
> Alexey Kuznetsov
[jira] [Created] (IGNITE-13296) .NET: TransactionImpl finalizer can crash the process
Pavel Tupitsyn created IGNITE-13296:
---------------------------------------

             Summary: .NET: TransactionImpl finalizer can crash the process
                 Key: IGNITE-13296
                 URL: https://issues.apache.org/jira/browse/IGNITE-13296
             Project: Ignite
          Issue Type: Bug
          Components: platforms
            Reporter: Pavel Tupitsyn
            Assignee: Pavel Tupitsyn
             Fix For: 2.10

TransactionImpl's finalizer potentially throws an exception (e.g. when the grid is stopped), causing the entire process to abort.

* Finalizers should not throw exceptions
* A stopped grid is a valid use case

Review all finalizers and make sure they are safe.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
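The defensive pattern the ticket asks for, sketched here in Java terms (the actual fix is in the .NET TransactionImpl; the interface and method names below are hypothetical): cleanup that may run during shutdown must never let an exception escape.

```java
/** Hypothetical sketch: finalizer/cleaner-style cleanup that swallows its own
 *  exceptions instead of letting them escape and take down the process. */
public class SafeCleanup {
    interface Tx {
        void close() throws Exception;
    }

    /** Returns true if cleanup succeeded; never throws, even if the grid is stopped. */
    static boolean closeQuietly(Tx tx) {
        try {
            tx.close();
            return true;
        }
        catch (Exception e) {
            // Log and swallow: a stopped grid is a valid state, not a reason to crash.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = closeQuietly(() -> { throw new IllegalStateException("grid is stopped"); });
        System.out.println("cleanup ok=" + ok); // prints "cleanup ok=false"
    }
}
```

The same principle applies on both runtimes: a finalizer runs at a point where the caller has no way to handle the failure, so the only safe place for the exception is inside the finalizer itself.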