It is correct for an operator, but not correct for a readiness probe. The problem isn't your understanding of Ignite metrics; it's your understanding of Kubernetes. Kubernetes rolling update logic assumes all of your service backend nodes are completely independent, but you have chosen a readiness probe that reflects how the nodes interact and depend on each other.
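To make the operator side concrete, here is a rough sketch of the kind of loop I mean: restart one pod, wait for it to rejoin, give rebalancing a moment to begin, then block on the cluster-wide rebalanced state before touching the next pod. Everything named here is hypothetical: `restart_pod`, `node_joined` and `cluster_rebalanced` are hooks you would supply yourself (for example wrapping `kubectl delete pod` and whatever you already use to read cluster.rebalanced), and the timeouts are placeholders, not recommendations.

```python
import time
from typing import Callable, Iterable


def wait_until(condition: Callable[[], bool], timeout_s: float, poll_s: float,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll condition() until it is true; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


def rolling_restart(pods: Iterable[str],
                    restart_pod: Callable[[str], None],     # hypothetical hook
                    node_joined: Callable[[str], bool],     # hypothetical hook
                    cluster_rebalanced: Callable[[], bool], # hypothetical hook
                    timeout_s: float = 600.0,
                    begin_grace_s: float = 30.0,
                    poll_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Restart pods one at a time, gating each step on CLUSTER state."""

    def require(condition: Callable[[], bool], what: str) -> None:
        if not wait_until(condition, timeout_s, poll_s, sleep):
            raise TimeoutError(f"timed out waiting for {what}")

    for pod in pods:
        # Never take a node down while the cluster is still rebalancing.
        require(cluster_rebalanced, "cluster to be rebalanced before restart")
        restart_pod(pod)
        require(lambda: node_joined(pod), f"{pod} to rejoin the cluster")
        # Best effort: give rebalancing a chance to begin so that a stale
        # "rebalanced" reading is not mistaken for "finished".
        wait_until(lambda: not cluster_rebalanced(), begin_grace_s, poll_s, sleep)
        require(cluster_rebalanced, f"rebalancing to finish after {pod}")
```

The loop stops at the first timeout instead of carrying on, so a stuck rebalance never leads to a second node going down. It also only makes sense if Kubernetes is not driving the rollout at the same time, for example a StatefulSet using the OnDelete update strategy.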
Hypothetically: we have bounced one node, it has rejoined the cluster, and the cluster is rebalancing.

If Kubernetes probes this node for readiness, the probe fails because we are rebalancing, and Kubernetes blocks progress of the rolling update. If Kubernetes probes any other node for readiness, that probe also fails because we are rebalancing, and Kubernetes removes that node from any Services. Every node reflects the state of the cluster: rebalancing. No nodes remain in the service backend. If you are using the Kubernetes discovery SPI, the restarted node will find itself unable to discover any peers.

The problem is that Kubernetes interprets the readiness probe as a NODE STATE. The cluster.rebalanced metric is a CLUSTER STATE.

If you had a Kubernetes Job that executes kubectl commands from within the cluster, looping over the pods in a StatefulSet and restarting them, it would make perfect sense to check cluster.rebalanced and block until rebalancing finishes. Kubernetes, however, does something different with readiness probes, based on assumptions about clustering which do not apply to Ignite.

On Thu, Sep 5, 2024 at 11:29 AM Humphrey Lopez <[email protected]> wrote:

> Yes I’m trying to read the cluster.rebalanced metric from the JMX mBean,
> is that the correct one? I’ve built that into the readiness endpoint from
> actuator and let kubernetes wait for the cluster to be ready before move to
> the next pod.
>
> Humphrey
>
> On 5 Sep 2024, at 17:34, Jeremy McMillan <[email protected]> wrote:
>
> I assume you have created your caches/tables with backups>=1.
>
> You should restart one node at a time, and wait until the restarted node
> has rejoined the cluster, then wait for rebalancing to begin, then wait for
> rebalancing to finish before restarting the next node. Kubernetes readiness
> probes aren't sophisticated enough. "Node ready" state isn't the same thing
> as "Cluster ready" state, but Kubernetes scheduler can't distinguish. This
> should be handled by an operator, either human, or a Kubernetes automated
> one.
>
> On Tue, Sep 3, 2024 at 1:13 PM Humphrey <[email protected]> wrote:
>
>> Thanks, I meant Rolling Update of the same version of Ignite (2.16). Not
>> upgrade to a new version. We have our ignite embedded in Spring Boot
>> application, and when changing code we need to deploy new version of the
>> jar.
>>
>> Humphrey
>>
>> On 3 Sep 2024, at 19:24, Gianluca Bonetti <[email protected]>
>> wrote:
>>
>> Hello
>>
>> If you want to upgrade Apache Ignite version, this is not supported by
>> Apache Ignite
>>
>> "Ignite cluster cannot have nodes that run on different Ignite versions.
>> You need to stop the cluster and start it again on the new Ignite version."
>> https://ignite.apache.org/docs/latest/installation/upgrades
>>
>> If you need rolling upgrades you can upgrade to GridGain which bring
>> rolling upgrades together with many other interesting features
>> "Rolling Upgrades is a feature of GridGain Enterprise and Ultimate
>> Edition that allows nodes with different GridGain versions to coexist in a
>> cluster while you roll out a new version. This prevents downtime when
>> performing software upgrades."
>> https://www.gridgain.com/docs/latest/installation-guide/rolling-upgrades
>>
>> Cheers
>> Gianluca Bonetti
>>
>> On Tue, 3 Sept 2024 at 18:15, Humphrey Lopez <[email protected]> wrote:
>>
>>> Hello, we have several pods with ignite caches running in kubernetes. We
>>> only use memory mode (not persistence) and want to perform a rolling
>>> update without losing data. What metric should we monitor to know when
>>> it’s safe to replace the next pod?
>>>
>>> We have tried the Cluster.Rebalanced (1) metric from JMX in a readiness
>>> probe but we still end up losing data from the caches.
>>>
>>> 1)
>>> https://ignite.apache.org/docs/latest/monitoring-metrics/new-metrics#cluster
>>>
>>> Should we use another mechanism or metric for determining the readiness
>>> of the new started pod?
>>>
>>> Humphrey
>>>
>>
