Re: [DISCUSS] Best way to re-encrypt existing data (TDE cache key rotation).

2020-07-24 Thread Nikolay Izhikov
Hello, Maksim.

Thanks for the summary.
From my point of view, we should focus on the Phase 3 implementation and then do the 
refactoring for specific SPI implementations.

> 8. Question - why we add a lot of system properties?

Can you, please, list system properties that should be moved to the 
configuration?

> 10. Question - CRC is read in two places encryptionFileIO and filePageStore - 
> what should we do with this?

filePageStore checks the CRC of the encrypted page. This is required to confirm that the 
page was not corrupted on disk.
encryptionFileIO checks the CRC of the decrypted page (the CRC itself is stored in the 
encrypted data). 
This is required to be sure the decrypted page contains correct data and was not 
replaced with some malicious content.
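As a rough illustration of the two CRC layers (a sketch, not Ignite's actual code: the XOR "cipher", method names and page layout here are stand-ins for the real AES-based implementation):

```java
import java.util.zip.CRC32;

// Illustrative sketch of the two CRC layers described above.
// The XOR "cipher" is a stand-in; real TDE uses AES.
class TwoLayerCrcSketch {
    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    // Toy "encryption": XOR every byte with a key byte.
    static byte[] xor(byte[] data, byte key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++)
            out[i] = (byte)(data[i] ^ key);
        return out;
    }

    /** Writing: compute the plaintext CRC, encrypt, then CRC the ciphertext. */
    static long[] write(byte[] plainPage, byte key) {
        long innerCrc = crc(plainPage);          // checked by encryptionFileIO after decryption
        byte[] encrypted = xor(plainPage, key);
        long outerCrc = crc(encrypted);          // checked by filePageStore before decryption
        return new long[] { innerCrc, outerCrc };
    }

    /** Reading: verify the outer CRC (disk corruption), then the inner one (tampering). */
    static boolean read(byte[] encryptedPage, byte key, long innerCrc, long outerCrc) {
        if (crc(encryptedPage) != outerCrc)
            return false;                        // page corrupted on disk
        byte[] plain = xor(encryptedPage, key);
        return crc(plain) == innerCrc;           // decrypted content was replaced/tampered
    }
}
```

The point of the inner CRC is that, in the real design, it travels inside the ciphertext, so an attacker who swaps the encrypted payload cannot fix it up without the key.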

Here is the list of items that are not related to Phase3 implementation.
Please tell me what you think:

>   2. We should try to run the existing test suites in encryption mode.

We did it during TDE.Phase1 testing.

>  3. SPI requires an additional method such as getKeyDigest
>  4. Recommendation - the encryption processor should be divided into external 
> subclasses
 
> 5. Recommendation - we should not use tuples and triples, because this is a 
> marker of a design problem.
> 6. Strict recommendation - please don't put context everywhere

Actually, this is a matter of taste and obviously not related to the current 
discussion.

> On 24 Jul 2020, at 14:27, Maksim Stepachev  
> wrote:
> 
> Hello everyone, yesterday we discussed the implementation of TDE over a
> conference call. I added a summary of this call here:
> 
>   1. The wiki documentation should be expanded. It should describe the
>   steps - how it works under the hood. What are the domain objects in the
>   implementation?
>   2. We should try to run the existing test suites in encryption mode.
>   Encryption should not affect any PDS or other tests.
>   3. SPI requires an additional method such as getKeyDigest, because the
>   current implementation of GridEncryptionManager#masterKeyDigest() looks
>   strange. We reset the master key to calculate the digest. This will not
>   work well if we want to use VOLT as a key provider implementation.
>   4. Recommendation - the encryption processor should be divided into
>   external subclasses, and we should use the OOP decomposition pattern for
>   it. Right now, this class has more than 2000 lines and does not follow
>   SOLID. This effectively inlines unrelated logic into a single class.
>   5. Recommendation - we should not use tuples and triples, because this
>   is a marker of a design problem.
>   6. Strict recommendation - please don't put context everywhere. It
>   should only be used in the parent class. You can pass the necessary
>   dependencies through the constructor, as in the DI pattern.
>   7. Question - the current implementation does not use the throttling that
>   is implemented in PDS. Users should set the throughput such as 5 MB per
>   second, but not the timeout, packet size, or stream size.
>   8. Question - why do we add a lot of system properties? Why didn't we
>   add a configuration for them?
>   9. Question - How do we optimize when we can check that this page is
>   already encrypted by parallel loading? Maybe we should do this in Phase 4?
>   10. Question - CRC is read in two places encryptionFileIO and
>   filePageStore - what should we do with this?
>   11. We should remember complicated failover test scenarios, e.g. a node
>   leaves when encryption starts and joins after it finishes; the baseline
>   changes; a node leaves before / after / in the middle of the process; etc.
>   12. How can we use a sandbox to protect our cluster from master and user
>   key stealing via compute?
>   13. Will re-encryption continue after the cluster is completely stopped?
> 
> If I forgot some points, you can add them to the message.
> 
> 
> Tue, 7 Jul 2020 at 17:40, Pavel Pereslegin :
> 
>> Hello, Maksim.
>> 
>> For implementation, I chose so-called "in place background
>> re-encryption" design.
>> 
>> The first step is to rotate the key used for writing data; at the moment
>> this only works on an active cluster.
>> The second step is re-encryption (to remove the previous encryption key).
>> If a node was restarted, re-encryption starts after the metastorage becomes
>> ready for read/write. Each "re-encrypted" partition (including the index)
>> has an attribute on the meta page that indicates whether background
>> re-encryption should be continued.
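The resume-after-restart behavior described in this design can be modeled minimally as follows (the names and structure are hypothetical, not Ignite's actual code; in the real design the flag is persisted on the partition meta page):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "in place background re-encryption" resume logic.
class ReencryptionSketch {
    /** Simplified partition: the flag mirrors the attribute on the meta page. */
    static class Partition {
        final int id;
        boolean reencryptPending = true; // "re-encryption should be continued"
        Partition(int id) { this.id = id; }
    }

    /** Scan partitions, skipping those already done. Called on start and after every restart. */
    static List<Integer> resumeReencryption(List<Partition> parts) {
        List<Integer> processed = new ArrayList<>();
        for (Partition p : parts) {
            if (!p.reencryptPending)
                continue;               // already re-encrypted before the restart
            // ... re-encrypt all pages of the partition with the new key ...
            p.reencryptPending = false; // persisted on the meta page in the real design
            processed.add(p.id);
        }
        return processed;
    }
}
```

Because the flag is checked per partition, a restart in the middle of the process only redoes the partitions that were still pending.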
>> 
>> I updated the description in wiki [1].
>> Some more details in jira [2].
>> Draft PR [3].
>> 
>> [1]
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384
>> [2] https://issues.apache.org/jira/browse/IGNITE-12843
>> [3] https://github.com/apache/ignite/pull/7941
>> 
>> Tue, 7 Jul 2020 at 13:49, Maksim Stepachev :
>>> 
>>> Hi!
>>> 
>>> Do you have any updates about this issue? What types of implementations
>>> have you chosen (in-place, offline, or in the 

Re: [DISCUSS] Best way to re-encrypt existing data (TDE cache key rotation).

2020-07-24 Thread Maksim Stepachev
Hello everyone, yesterday we discussed the implementation of TDE over a
conference call. I added a summary of this call here:

   1. The wiki documentation should be expanded. It should describe the
   steps - how it works under the hood. What are the domain objects in the
   implementation?
   2. We should try to run the existing test suites in encryption mode.
   Encryption should not affect any PDS or other tests.
   3. SPI requires an additional method such as getKeyDigest, because the
   current implementation of GridEncryptionManager#masterKeyDigest() looks
   strange. We reset the master key to calculate the digest. This will not
   work well if we want to use VOLT as a key provider implementation.
   4. Recommendation - the encryption processor should be divided into
   external subclasses, and we should use the OOP decomposition pattern for
   it. Right now, this class has more than 2000 lines and does not follow
   SOLID. This effectively inlines unrelated logic into a single class.
   5. Recommendation - we should not use tuples and triples, because this
   is a marker of a design problem.
   6. Strict recommendation - please don't put context everywhere. It
   should only be used in the parent class. You can pass the necessary
   dependencies through the constructor, as in the DI pattern.
   7. Question - the current implementation does not use the throttling that
   is implemented in PDS. Users should set the throughput such as 5 MB per
   second, but not the timeout, packet size, or stream size.
   8. Question - why do we add a lot of system properties? Why didn't we
   add a configuration for them?
   9. Question - How do we optimize when we can check that this page is
   already encrypted by parallel loading? Maybe we should do this in Phase 4?
   10. Question - CRC is read in two places encryptionFileIO and
   filePageStore - what should we do with this?
   11. We should remember complicated failover test scenarios, e.g. a node
   leaves when encryption starts and joins after it finishes; the baseline
   changes; a node leaves before / after / in the middle of the process; etc.
   12. How can we use a sandbox to protect our cluster from master and user
   key stealing via compute?
   13. Will re-encryption continue after the cluster is completely stopped?
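Regarding point 7 above: a throughput-style knob ("at most N bytes per second", with no user-visible timeouts or packet sizes) is commonly implemented with something like a token bucket. A hypothetical sketch, not the PDS throttling code:

```java
// Hypothetical token-bucket throttle: the user configures only bytes/sec,
// not timeouts, packet sizes, or stream sizes (point 7 above).
class ThrottleSketch {
    private final long bytesPerSec;
    private double available;      // token balance in bytes; may go negative (debt)
    private long lastRefillNanos;

    ThrottleSketch(long bytesPerSec, long nowNanos) {
        this.bytesPerSec = bytesPerSec;
        this.available = bytesPerSec; // start with one second's budget
        this.lastRefillNanos = nowNanos;
    }

    /** Returns how many nanoseconds the caller should sleep before sending 'bytes'. */
    long acquire(long bytes, long nowNanos) {
        // Refill proportionally to elapsed time, capped at one second's budget.
        double refill = (nowNanos - lastRefillNanos) / 1e9 * bytesPerSec;
        available = Math.min(bytesPerSec, available + refill);
        lastRefillNanos = nowNanos;

        available -= bytes;
        if (available >= 0)
            return 0;                               // within budget, no delay
        return (long)(-available / bytesPerSec * 1e9); // wait until the debt is repaid
    }
}
```

With this shape, a 5 MB/s setting is a single configuration value, and the packet size and sleep intervals fall out of the arithmetic instead of being exposed to the user.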

If I forgot some points, you can add them to the message.


Tue, 7 Jul 2020 at 17:40, Pavel Pereslegin :

> Hello, Maksim.
>
> For implementation, I chose so-called "in place background
> re-encryption" design.
>
> The first step is to rotate the key used for writing data; at the moment
> this only works on an active cluster.
> The second step is re-encryption (to remove the previous encryption key).
> If a node was restarted, re-encryption starts after the metastorage becomes
> ready for read/write. Each "re-encrypted" partition (including the index)
> has an attribute on the meta page that indicates whether background
> re-encryption should be continued.
>
> I updated the description in wiki [1].
> Some more details in jira [2].
> Draft PR [3].
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652384
> [2] https://issues.apache.org/jira/browse/IGNITE-12843
> [3] https://github.com/apache/ignite/pull/7941
>
> Tue, 7 Jul 2020 at 13:49, Maksim Stepachev :
> >
> > Hi!
> >
> > Do you have any updates about this issue? What type of implementation
> > have you chosen (in-place, offline, or in the background)? I know that we
> > want to add a partition defragmentation function; we can add a hook to
> > integrate the re-encryption scheme. Could you update your IEP with your
> > plans?
> >
> > Mon, 25 May 2020 at 12:50, Pavel Pereslegin :
> >
> > > Nikolay, Alexei,
> > >
> > > thanks for your suggestions.
> > >
> > > Offline re-encryption does not seem so simple: we need to read/replace
> > > the existing encryption keys on all nodes (therefore, we should be
> > > able to read/write the metastore/WAL and exchange data between the
> > > baseline nodes). Re-encryption in maintenance mode (for example, in a
> > > stable read-only cluster) would be simple, but it still looks very
> > > inconvenient, at least because users would need to interrupt all
> > > operations.
> > >
> > > The main advantage of online "in place" re-encryption is that we'll
> > > support multiple keys for reading, and this procedure does not
> > > directly depend on background re-encryption.
> > >
> > > So, the first step is similar to rotating the master key: when the new
> > > key is set for writing on all nodes - that's it, the cache group key
> > > rotation is complete (this is what PCI DSS requires - encrypt new
> > > updates with new keys).
> > > The second step is to re-encrypt the existing data. As I said
> > > previously, I thought about scanning all partition pages in some
> > > background mode (storing progress on the meta page to continue after
> > > restart), but the rebalance approach should also work here if I figure
> > > out how to automate 
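The two-step scheme discussed in this thread (new writes use the rotated key immediately, while old pages stay readable under previous keys until background re-encryption finishes) can be sketched as follows. All names here are hypothetical, not Ignite's API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each page records the id of the key it was encrypted
// with, so multiple keys stay valid for reading during rotation.
class GroupKeyRotationSketch {
    static class Page {
        final int keyId;
        final String payload;
        Page(int keyId, String payload) { this.keyId = keyId; this.payload = payload; }
    }

    final Map<Integer, String> keys = new HashMap<>(); // keyId -> key material
    int writeKeyId;

    GroupKeyRotationSketch(int keyId, String key) {
        keys.put(keyId, key);
        writeKeyId = keyId;
    }

    /** Step 1: new updates use the new key; old keys stay available for reads. */
    void rotate(int newKeyId, String newKey) {
        keys.put(newKeyId, newKey);
        writeKeyId = newKeyId;
    }

    Page write(String payload) { return new Page(writeKeyId, payload); }

    String read(Page p) {
        String key = keys.get(p.keyId);
        if (key == null)
            throw new IllegalStateException("key already removed: " + p.keyId);
        return p.payload; // real code would decrypt with 'key'
    }

    /** Step 2: once background re-encryption completes, the old key can be dropped. */
    void removeOldKey(int keyId) { keys.remove(keyId); }
}
```

This illustrates why the rotation itself is fast (step 1 is a metadata change) and why step 2 can run independently in the background.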

Re: Extended logging for rebalance performance analysis

2020-07-24 Thread Stanislav Lukyanov
Hi Kirill,

I think it would be helpful to add to the log line
Completed rebalancing [grp=grp0, 
supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60, 
duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, 
minorTopVer=0], progress=1/3, rebalanceId=1]
a few more metrics:
avgSpeed=, partitionsNum=N, histPartitionsNum=M, 
histBytesRcvd=X KB, histDuration=Yms, 
histAvgSpeed=, fullPartitionsNum=K, fullBytesRcvd=X 
KB, fullDuration=Yms, fullAvgSpeed=

I believe these additional metrics are good in that a user can easily set up 
monitoring based on them, and they give a lot of useful info about rebalance 
performance in general. No need to look for the "Starting rebalance" message, 
everything is in one place.
If any of these are hard to add then we can skip them (I suspect histDuration 
and fullDuration can't be measured in practice but don't know for sure). Let's 
add what we can.
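For illustration, the suggested avgSpeed is just bytesRcvd divided by duration; a minimal sketch of the computation (the method name and KB/s unit are my assumptions, not the patch's actual code):

```java
// Rough sketch of the avgSpeed computation suggested above.
class AvgSpeedSketch {
    /** Bytes received divided by duration, reported in KB/s. */
    static double avgSpeedKbPerSec(long bytesRcvd, long durationMs) {
        if (durationMs == 0)
            return 0; // avoid division by zero for near-instant rebalances
        return (bytesRcvd / 1024.0) / (durationMs / 1000.0);
    }
}
```

For the sample line above, 5.9 KB in 8 ms works out to roughly 740 KB/s.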

Thanks,
Stan

> On 3 Jul 2020, at 17:21, ткаленко кирилл  wrote:
> 
> Sorry, forget.
> 
> [1] - 
> org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
> 
> 03.07.2020, 17:20, "ткаленко кирилл" :
>> Hi, Stan!
>> 
>> I don't understand you yet.
>> 
>> Now you can use metrics as it was done in the test [1]. Or can you tell me 
>> where to do this, for example when completing rebalancing for all groups?
>> 
>> See what is now available and added in the logs:
>> 1) Which group is rebalanced and which type of rebalance is used.
>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, 
>> minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
>> fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>> 
>> 2) Completion of rebalancing from one of the suppliers.
>> Completed rebalancing [grp=grp0, 
>> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60, 
>> duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, 
>> minorTopVer=0], progress=1/3, rebalanceId=1]
>> 
>> 3) Completion of the entire rebalance.
>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, 
>> duration=41ms, bytesRcvd=40,4 KB]
>> 
>> These messages have a common parameter rebalanceId=1.
>> 
>> 03.07.2020, 16:48, "Stanislav Lukyanov" :
   On 3 Jul 2020, at 09:51, ткаленко кирилл  wrote:
 
   To calculate the average value, you can use the existing metrics 
 "RebalancingStartTime", "RebalancingLastCancelledTime", 
 "RebalancingEndTime", "RebalancingPartitionsLeft", 
 "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>> 
>>>  You can calculate it, and I believe that this is the first thing anyone 
>>> would do when reading these logs and metrics.
>>>  If that's an essential thing then maybe it should be available out of the 
>>> box?
>>> 
   This also works correctly with the historical rebalance.
   Now we can see rebalance type for each group and for each supplier in 
 logs. I don't think we should duplicate this information.
 
   [2020-07-03 09:49:31,481][INFO 
 ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
 routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
 minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710, 
 fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>> 
>>>  I'm talking about adding info on how much data has been transferred during 
>>> rebalance.
>>>  When rebalance completes I'd like to know how much data has been 
>>> transferred, was it historical or full, what was the average rebalance 
>>> speed.
>>> 
>>>  There are two reasons for having all that.
>>> 
>>>  First, it helps to analyze the issues by searching the logs and looking 
>>> for anomalies.
>>> 
>>>  Second, this allows to automate alerts: e.g. if you know your typical 
>>> historical rebalance speed, you can trigger an alert if it drops below that.
>>> 
   03.07.2020, 02:49, "Stanislav Lukyanov" :
>   Kirill,
> 
>   I've looked through the patch.
>   Looks good, but it feels like the first thing someone will try to do 
> given bytesRcvd and duration is to divide one by the other to get an 
> average speed.
>   Do you think it's reasonable to also add it to the logs? Maybe even to 
> the metrics?
> 
>   Also, this works with historical rebalance, right? Can we specify how 
> much data was transferred via historical or full rebalance from each 
> supplier?
>   Maybe even estimate transfer speed in entries and bytes for each 
> rebalance type?
> 
>   Thanks,
>   Stan
> 
>>On 29 Jun 2020, at 11:50, Ivan Rakov  wrote:
>> 
>>+1 to Alex G.
>> 
>>From my experience, the most interesting cases with Ignite rebalancing
>>happen exactly in production. Given that we already have
>>detailed rebalancing logging, adding info about rebalance 

Re: Move TcpDiscoveryZookeeperIpFinder to ignite-extensions as part of IEP-36

2020-07-24 Thread Anton Vinogradov
Ivan,

Sounds like an obsolete compromise solution kept just because of the word
"Zookeeper" in its name, not because of any real benefit.

>> So I suggest to move this single class with tests to separate module in
>> ignite-extensions.
+1

On Fri, Jul 24, 2020 at 11:27 AM Ivan Daschinsky 
wrote:

> Hello, Igniters.
>
> Currently, the module ignite-zookeeper, which contains the full
> implementation of ZookeeperDiscoverySpi, also contains one class that looks
> like a little snippet, and I have some concerns about the presence of this
> class in the module.
>
> 1) This class is a simple snippet-like implementation of
> TcpDiscoveryIpFinder.
> 2) It creates an EPHEMERAL_SEQUENTIAL ZNode in the root directory with JSON
> containing the IP addresses of the joining node.
> 3) It reads the registered child znodes from the root directory and uses
> these addresses to participate in the common TcpDiscovery process.
>
> 1) This class has nothing in common with ZookeeperDiscovery.
> 2) It brings additional dependencies into the module (the Apache Curator
> framework, Jackson and so on).
> 3) None of these dependencies are required by ZookeeperDiscoverySpi.
> 4) Its usefulness is doubtful: we already have a fully functional
> ZookeeperDiscovery, and using a Zookeeper quorum as a simple store for IP
> addresses without utilizing Zookeeper's full power is a questionable
> decision.
>
> So I suggest to move this single class with tests to separate module in
> ignite-extensions.
>
> What do you think?
>
> --
> Sincerely yours, Ivan Daschinskiy
>


Move TcpDiscoveryZookeeperIpFinder to ignite-extensions as part of IEP-36

2020-07-24 Thread Ivan Daschinsky
Hello, Igniters.

Currently, the module ignite-zookeeper, which contains the full
implementation of ZookeeperDiscoverySpi, also contains one class that looks
like a little snippet, and I have some concerns about the presence of this
class in the module.

1) This class is a simple snippet-like implementation of TcpDiscoveryIpFinder.
2) It creates an EPHEMERAL_SEQUENTIAL ZNode in the root directory with JSON
containing the IP addresses of the joining node.
3) It reads the registered child znodes from the root directory and uses these
addresses to participate in the common TcpDiscovery process.
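The mechanism in points 1-3 can be modeled without a real Zookeeper (the Map below stands in for the znode tree; the actual class talks to Zookeeper via the Apache Curator framework, and the znode path format here is made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-in model of the IP finder's znode usage: an EPHEMERAL_SEQUENTIAL
// child per joining node holding its addresses; discovery lists all children.
class ZkIpFinderModel {
    private final Map<String, String> znodes = new TreeMap<>(); // path -> data (JSON in reality)
    private int seq;

    /** Like creating an EPHEMERAL_SEQUENTIAL znode under the root directory. */
    String register(String addressesJson) {
        String path = String.format("/ipfinder/node%010d", seq++);
        znodes.put(path, addressesJson);
        return path;
    }

    /** Like reading all children of the root to collect registered addresses. */
    List<String> discover() {
        return new ArrayList<>(znodes.values());
    }

    /** Ephemeral znodes disappear automatically when the owning session dies. */
    void sessionClosed(String path) {
        znodes.remove(path);
    }
}
```

The sketch makes the criticism below concrete: the class uses Zookeeper purely as a key-value store for addresses, not as a discovery coordinator.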

1) This class has nothing in common with ZookeeperDiscovery.
2) It brings additional dependencies into the module (the Apache Curator
framework, Jackson and so on).
3) None of these dependencies are required by ZookeeperDiscoverySpi.
4) Its usefulness is doubtful: we already have a fully functional
ZookeeperDiscovery, and using a Zookeeper quorum as a simple store for IP
addresses without utilizing Zookeeper's full power is a questionable decision.

So I suggest to move this single class with tests to separate module in
ignite-extensions.

What do you think?

-- 
Sincerely yours, Ivan Daschinskiy


Re: Re[2]: Apache Ignite 2.9.0 RELEASE [Time, Scope, Manager]

2020-07-24 Thread Alex Plehanov
Guys,

I've cherry-picked IGNITE-12438 (one-way client-server connections) and
IGNITE-13038 (Web Console removal) to 2.9.

Since there are no objections, I will move IGNITE-13006 (Spring libs upgrade
to version 5.2), IGNITE-12489 (error during purges by expiration) and
IGNITE-12553 (public Java metrics API) to the next release.

I will cherry-pick IGNITE-11942 (IGFS and Hadoop removal) after it is
reviewed and merged to master.

What about IGNITE-12911 [1] (B+Tree Corrupted exception when using a key
extracted from a BinaryObject value object --- and SQL enabled)? Anton
Kalashnikov, Ilya Kasnacheev, can you please clarify the ticket status?

[1]: https://issues.apache.org/jira/browse/IGNITE-12911

Thu, 23 Jul 2020 at 12:08, Alexey Kuznetsov :

> Alex, Denis
>
> The issue with moving Web Console to a separate repository is merged to
> master.
> See:  https://issues.apache.org/jira/browse/IGNITE-13038
>
> Please consider cherry-picking it to ignite-2.9
>
> ---
> Alexey Kuznetsov
>
>


[jira] [Created] (IGNITE-13296) .NET: TransactionImpl finalizer can crash the process

2020-07-24 Thread Pavel Tupitsyn (Jira)
Pavel Tupitsyn created IGNITE-13296:
---

 Summary: .NET: TransactionImpl finalizer can crash the process
 Key: IGNITE-13296
 URL: https://issues.apache.org/jira/browse/IGNITE-13296
 Project: Ignite
  Issue Type: Bug
  Components: platforms
Reporter: Pavel Tupitsyn
Assignee: Pavel Tupitsyn
 Fix For: 2.10


~TransactionImpl potentially throws an exception (e.g. when the grid is 
stopped), causing the entire process to abort.
* Finalizers should not throw exceptions
* A stopped grid is a valid use case

Review all finalizers and make sure they are safe.
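The same rule, sketched in Java terms (the ticket concerns the .NET ~TransactionImpl finalizer, but the principle is identical; all names below are illustrative, not Ignite's code):

```java
// Illustrative sketch: cleanup invoked from finalizer-style code must never
// let an exception escape, because an escaping exception in a .NET finalizer
// (or an unhandled error on a Java Cleaner thread) can abort the process.
class SafeFinalizeSketch {
    /** Pretend resource release that may fail, e.g. because the grid is already stopped. */
    static void dispose(boolean gridStopped) {
        if (gridStopped)
            throw new IllegalStateException("Grid is stopped");
    }

    /** Safe wrapper: a stopped grid is a valid state, so the failure is swallowed. */
    static boolean safeDispose(boolean gridStopped) {
        try {
            dispose(gridStopped);
            return true;   // resource released normally
        }
        catch (Throwable t) {
            return false;  // never propagate from finalizer-style cleanup
        }
    }

    @Override protected void finalize() {
        safeDispose(true); // any exception is contained inside safeDispose
    }
}
```

In modern Java, `java.lang.ref.Cleaner` is preferred over `finalize()`, but the containment pattern is the same either way.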



--
This message was sent by Atlassian Jira
(v8.3.4#803005)