Nikolay,

Many users start using Ignite with a small project and without production-level monitoring. When the proof of concept turns out to be viable, they tend to expand their Ignite usage by growing the cluster and adding the necessary environment (including monitoring systems). The inability to find out such a basic thing as whether the cluster will survive the next node crash may affect the overall product impression. We all want Ignite to be successful and widespread.

> Can you clarify, what do you mean, exactly?

Right now a user can access the metric mentioned by Alex and take the minimum across all cache groups. I want to highlight that not every user understands Ignite and its internals well enough to figure out that exactly this sequence of actions leads to the desired answer.
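
For illustration, this is roughly what a user has to do by hand today. It's a minimal sketch: the "Cache groups" MBean group and the attribute name are assumptions about how CacheGroupMetricsMXBean is registered, and the exact layout may differ between Ignite versions.

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class MinPartitionCopies {
        public static void main(String[] args) throws Exception {
            // Works when run inside (or attached to) an Ignite node's JVM.
            MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

            int min = Integer.MAX_VALUE;

            for (ObjectName name : srv.queryNames(null, null)) {
                // Assumption: cache group metric beans are registered under group="Cache groups".
                String grp = name.getKeyProperty("group");

                if (grp == null || !grp.replace("\"", "").equals("Cache groups"))
                    continue;

                int copies = (Integer)srv.getAttribute(name, "MinimumNumberOfPartitionCopies");

                min = Math.min(min, copies);
            }

            // If any cache group beans were found, (min - 1) nodes can leave without data loss.
            System.out.println("Cluster-wide minimum number of partition copies: " + min);
        }
    }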

> Can you clarify, what do you mean, exactly?
> We have a ticket[1] to support metrics output via visor.sh.
>
> My understanding: we should have an easy way to output metric values for each
> node in the cluster.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-12191
I propose to add an aggregated "getMinimumNumberOfPartitionCopies" metric method and expose it via control.sh. My understanding: its result is critical enough to be accessible via a short path. I've started this topic due to a request from the user list, and I've heard many similar complaints before.
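
Purely for illustration, such an aggregated accessor could have a shape like the following (hypothetical names, nothing like this exists in Ignite today):

    // Hypothetical, illustrative only: a cluster-wide aggregate of the per-group metric.
    public interface ClusterRedundancyMetrics {
        /** Minimum of CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies() over all cache groups. */
        int getMinimumNumberOfPartitionCopies();

        /** Convenience flag: true if at least one node can leave without data loss. */
        boolean isNodeShutdownSafe();
    }

control.sh could then print the same two values as part of its --state output.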

Best Regards,
Ivan Rakov

On 04.10.2019 17:18, Nikolay Izhikov wrote:
Ivan.

> We shouldn't force users to configure external tools and write extra code for
> basic things.
Actually, I don't agree with you.
Having an external monitoring system for any production cluster is a *basic* thing.

Can you, please, define "basic things"?

> single method for the whole cluster
Can you clarify, what do you mean, exactly?
We have a ticket[1] to support metrics output via visor.sh.

My understanding: we should have an easy way to output metric values for each
node in the cluster.

[1] https://issues.apache.org/jira/browse/IGNITE-12191


On Fri, 04/10/2019 at 17:09 +0300, Ivan Rakov wrote:
Max,

What if the user simply doesn't have a monitoring system configured?
Knowing whether the cluster will survive a node shutdown is critical for any
administrator who performs manipulations with the cluster topology.
Essential information should be easily accessible. We shouldn't force
users to configure external tools and write extra code for basic things.

Alex,

Thanks, that's exactly the metric we need.
My point is that we should make it more accessible: via a control.sh
command and a single method for the whole cluster.

Best Regards,
Ivan Rakov

On 04.10.2019 16:34, Alex Plehanov wrote:
Ivan, there is already a metric,
CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows the
current redundancy level for the cache group.
We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without data
loss in this cache group.
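
A worked example of those semantics (illustrative numbers): for a cache group with backups=2 that is fully rebalanced, getMinimumNumberOfPartitionCopies returns 3, so up to 2 nodes may leave without data loss. Right after one node has left, and while rebalancing is still running, the value temporarily drops to 2, so only 1 more node may safely leave.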

On Fri, Oct 4, 2019 at 16:17, Ivan Rakov <ivan.glu...@gmail.com>:

Igniters,

I've seen numerous requests for an easy way to check whether it is
safe to turn off a cluster node. As we know, in Ignite protection from
sudden node shutdown is implemented by keeping several backup
copies of each partition. However, this guarantee can be weakened for a
while if the cluster has recently experienced a node restart and the
rebalancing process is still in progress.
An example scenario is restarting nodes one by one in order to update a
local configuration parameter. The user restarts one node and rebalancing
starts: once it has completed, it will be safe to proceed (backup
count = 1). However, there's no transparent way to determine whether
rebalancing is over.
From my perspective, it would be very helpful to:
1) Add information about rebalancing and the number of free-to-go nodes to
the ./control.sh --state command.
Examples of output:

Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
Cluster tag: new_tag

--------------------------------------------------------------------------------
Cluster is active
All partitions are up-to-date.
3 node(s) can safely leave the cluster without partition loss.

Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
Cluster tag: new_tag

--------------------------------------------------------------------------------
Cluster is active
Rebalancing is in progress.
1 node(s) can safely leave the cluster without partition loss.
2) Provide the same information via ClusterMetrics. For example:
ClusterMetrics#isRebalanceInProgress // boolean
ClusterMetrics#getSafeToLeaveNodesCount // int
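
If these were added, an automated rolling-restart script could check safety in a couple of lines. A sketch against the *proposed* methods (they do not exist today; only ignite.cluster().metrics() is an existing call):

    // Sketch using the proposed, not-yet-existing ClusterMetrics methods.
    ClusterMetrics m = ignite.cluster().metrics();

    if (!m.isRebalanceInProgress() && m.getSafeToLeaveNodesCount() >= 1) {
        // Safe to stop this node for maintenance.
        Ignition.stop(true);
    }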

Here I need to mention that this information can be calculated from the
existing rebalance metrics (see CacheMetrics#*rebalance*). However, I
still think that we need a simpler and more understandable flag for whether the
cluster is in danger of data loss. Another point is that the current metrics
are bound to a specific cache, which makes this information even harder to
analyze.
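
For comparison, deriving just the "is rebalancing still running?" part from the existing per-cache metrics looks roughly like this (a sketch; it assumes cache statistics are enabled, and it still says nothing about how many nodes can safely leave):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.cache.CacheMetrics;

    public class RebalanceCheck {
        /** Sketch: returns true if any cache still reports rebalancing partitions. */
        public static boolean rebalanceInProgress(Ignite ignite) {
            for (String cacheName : ignite.cacheNames()) {
                CacheMetrics m = ignite.cache(cacheName).metrics();

                if (m.getRebalancingPartitionsCount() > 0)
                    return true;
            }

            return false;
        }
    }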

Thoughts?

--
Best Regards,
Ivan Rakov

