Igniters,

Currently, from my perspective, Apache Ignite has very raw rebalance process metrics. Moreover, the most interesting metrics are tied to a Cache rather than a CacheGroup, and they require enabling cache statistics, which can affect node performance.

Some of the metrics do not work as expected. For instance, the `EstimatedRebalancingKeys` metric returns `-1` from time to time due to internal issues that require investigation (see [1] for details). Another metric, `rebalanceKeysReceived`, is treated as a CacheMetric but is in fact calculated for the whole cache group; see the comment at [2] (e.g. historical rebalance, see IGNITE-11330 and the code block comment linked below). This confuses Ignite users.

I think the rebalance process metrics must be reworked and some issues fixed, and I invite you to take part in this discussion.

WHAT TO DO

I've posted my thoughts in the description of the issue [3]. Here are some details.

All such metrics (or their analogues) must be available in CacheGroupMetrics, and I'd like to suggest the following steps:

Phase-1

rebalancingPartitionsLeft long metric
rebalancingReceivedKeys long metric
rebalancingReceivedBytes long metric
rebalancingStartTime long metric
rebalancingFinishTime long metric

It is not possible to get the actual number of rebalanced partitions from `LocalNodeMovingPartitionsCount`, because when an empty node joins the cluster we own all the partitions and enable WAL for them simultaneously, at once. Until that moment the partitions have actually been transferred but are not yet owned. That is why the `rebalancingPartitionsLeft` metric is needed, in my view.

Phase-2

rebalancingExpectedKeys long metric
rebalancingExpectedBytes long metric
rebalancingEvictedPartitionsLeft long metric

Investigation is needed into the issues with calculating the estimated rebalancing keys count for full and historical rebalance and the actual partition sizes.
These metrics must be calculated before a new rebalance starts for each cache group on the rebalancing node, so that users can see real values of how many keys will be rebalanced and can estimate the rebalance finish time using whatever monitoring system they use.

Phase-3 (statistics must be enabled)

rebalancingKeysRate HitRate metric
rebalancingBytesRate HitRate metric

Currently, I've observed high CPU consumption (up to 100%) when metrics of this type are calculated. I think this should be investigated too, and these metrics must be disabled by default.

Phase-4

Once the cache group level rebalance metrics are implemented, we need to mark the rebalancing CacheMetrics as deprecated and remove them from the newly introduced metrics framework [4]. Such cache metrics should be implemented in the old-fashioned way (as they were before the metrics framework was added) to keep backwards compatibility, and must be removed in Apache Ignite 3.0.

Any thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-11330?focusedCommentId=16867537&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16867537
[2] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/GridDhtPartitionDemander.java#L1134
[3] https://issues.apache.org/jira/browse/IGNITE-12183
[4] https://issues.apache.org/jira/browse/IGNITE-11848
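
P.S. To make the Phase-2 motivation a bit more concrete, here is a rough sketch of how a monitoring system might estimate the rebalance finish time from the proposed rebalancingStartTime, rebalancingReceivedKeys and rebalancingExpectedKeys values. The class `RebalanceEta` and its method are hypothetical and not part of Apache Ignite. It assumes the metrics are plain long values, that timestamps are in milliseconds, and that keys arrive at a roughly constant rate, which may well not hold in practice.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration only; not an Apache Ignite API. */
final class RebalanceEta {
    private RebalanceEta() {
        // Static helper, no instances.
    }

    /**
     * Linearly extrapolates the rebalance finish timestamp.
     *
     * @param startTime    value of the proposed rebalancingStartTime metric (ms, assumed).
     * @param now          current timestamp (ms).
     * @param receivedKeys value of the proposed rebalancingReceivedKeys metric.
     * @param expectedKeys value of the proposed rebalancingExpectedKeys metric.
     * @return estimated finish timestamp, or empty if no estimate can be made
     *         (e.g. nothing received yet, or the expected count is unknown or -1).
     */
    static OptionalLong estimateFinishTime(long startTime, long now, long receivedKeys, long expectedKeys) {
        long elapsed = now - startTime;

        if (elapsed <= 0 || receivedKeys <= 0 || expectedKeys <= 0)
            return OptionalLong.empty();

        // If more keys were received than expected, treat the rebalance as finishing now.
        long remaining = Math.max(0L, expectedKeys - receivedKeys);

        // Average time spent per key so far; double avoids long overflow on large counts.
        double msPerKey = (double) elapsed / receivedKeys;

        return OptionalLong.of(now + Math.round(remaining * msPerKey));
    }

    public static void main(String[] args) {
        // Made-up sample values, purely for demonstration.
        OptionalLong eta = estimateFinishTime(1_000L, 11_000L, 2_500L, 10_000L);

        System.out.println(eta.isPresent() ? "Estimated finish at " + eta.getAsLong() : "No estimate yet");
    }
}
```

Such an estimate is only as good as rebalancingExpectedKeys itself, which is why the calculation issues mentioned for Phase-2 would need to be resolved first.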