Alexander Lapin created IGNITE-26532:
----------------------------------------
Summary: Design CMG/MG absence handling logic
Key: IGNITE-26532
URL: https://issues.apache.org/jira/browse/IGNITE-26532
Project: Ignite
Issue Type: Task
Reporter: Alexander Lapin
h3. Motivation
In the event of:
# loss of majority in the Metastorage group (*MG*) only
# loss of majority in the Cluster Management Group (*CMG*) only
# loss of majority in both *CMG* and *MG*
User operations should behave adequately: within the specified timeouts, they
should wait for the majority to be restored and, if it is not, fail with a
clear error. At the same time, they should not flood the logs with exceptions
on every internal retry.
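A minimal sketch of the bounded-wait behavior for user operations, in Java. It assumes a synchronous probe that reports whether the system group currently has a majority. {{BoundedMajorityWait}}, {{MajorityUnavailableException}} and the {{hasMajority}} probe are hypothetical names, not existing Ignite APIs. It waits up to the user timeout, logs a single warning per operation instead of one per retry, and fails with a descriptive error if the majority does not return in time.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Hypothetical helper: runs a user operation once the system group has a majority, or fails after a timeout. */
final class BoundedMajorityWait {
    private static final Logger LOG = Logger.getLogger(BoundedMajorityWait.class.getName());

    /** Clear, user-facing error raised when the majority is not restored in time. */
    static final class MajorityUnavailableException extends RuntimeException {
        MajorityUnavailableException(String group, Duration timeout) {
            super("Majority of " + group + " was not restored within " + timeout.toMillis() + " ms");
        }
    }

    static <T> T call(String group, BooleanSupplier hasMajority, Supplier<T> op,
                      Duration timeout, Duration retryInterval) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        boolean warned = false;
        while (!hasMajority.getAsBoolean()) {
            if (!warned) {
                // One warning per user operation, not one per internal retry.
                LOG.warning(group + " has no majority; waiting up to " + timeout.toMillis() + " ms");
                warned = true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                // Timeout elapsed without majority: fail with a clear error.
                throw new MajorityUnavailableException(group, timeout);
            }
            // Sleep for the retry interval, roughly capped by the time left before the deadline.
            Thread.sleep(Math.max(1, Math.min(retryInterval.toMillis(), remainingNanos / 1_000_000)));
        }
        return op.get();
    }
}
```

A real implementation would more likely be asynchronous (for example {{CompletableFuture}}-based) and reuse the existing internal retry machinery; the synchronous loop only illustrates the contract: bounded wait, one log line, one clear error.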
This applies to operations such as:
* Schema changes (e.g., creating a table).
* Transactions of all types (with partially applied transactions being rolled
back).
* Adding nodes.
* Various {{resetPartitions}} operations.
* …
By contrast, user operations such as
* stopping a node, and
* read-only transactions (as was previously the case)
must complete successfully without any exceptions being logged.
Internal _system_ operations must wait indefinitely for the majority to be
restored in the corresponding system groups (either through infinite retries or
reactively), and under no circumstances should they trigger FG (as currently
happens).
A node should log the unavailability of a system group sparingly, rather than
as excessively as it currently does.
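A minimal sketch of the retry-based option for internal system operations, combined with throttled logging. It assumes an attempt that signals a missing majority by throwing a dedicated exception. {{SystemGroupRetrier}} and {{GroupUnavailableException}} are hypothetical names, not Ignite APIs. The loop retries with capped exponential backoff until the attempt succeeds, never invokes any failure handling, and logs at most one warning per {{logInterval}}, reporting how many similar failures were suppressed in between.

```java
import java.time.Duration;
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Hypothetical sketch: retries a system operation until its group regains majority, never escalating. */
final class SystemGroupRetrier {
    /** Hypothetical signal that the target system group (CMG or MG) currently has no majority. */
    static final class GroupUnavailableException extends RuntimeException {
        GroupUnavailableException(String group) {
            super(group + " has no majority");
        }
    }

    private static final Logger LOG = Logger.getLogger(SystemGroupRetrier.class.getName());

    private final Duration initialBackoff;
    private final Duration maxBackoff;
    private final Duration logInterval;
    int warningsLogged; // exposed for illustration only

    SystemGroupRetrier(Duration initialBackoff, Duration maxBackoff, Duration logInterval) {
        this.initialBackoff = initialBackoff;
        this.maxBackoff = maxBackoff;
        this.logInterval = logInterval;
    }

    /** Retries until the attempt succeeds; a missing majority never fails the caller. */
    <T> T retryUntilAvailable(String group, Supplier<T> attempt) throws InterruptedException {
        long backoffMs = Math.max(1, initialBackoff.toMillis());
        long lastLogNanos = 0;
        boolean loggedOnce = false;
        int suppressed = 0;
        while (true) {
            try {
                return attempt.get();
            } catch (GroupUnavailableException e) {
                long now = System.nanoTime();
                if (!loggedOnce || now - lastLogNanos >= logInterval.toNanos()) {
                    // Throttled: the first failure, then at most one warning per logInterval.
                    LOG.warning(e.getMessage() + ", retrying (" + suppressed + " similar failures suppressed)");
                    warningsLogged++;
                    lastLogNanos = now;
                    loggedOnce = true;
                    suppressed = 0;
                } else {
                    suppressed++;
                }
                // Interruption (e.g. node stop) is the only way out besides success.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, Math.max(1, maxBackoff.toMillis()));
            }
        }
    }
}
```

The reactive alternative mentioned above would instead subscribe to a "majority restored" event and resume the operation then; the retry loop is shown only because it is the simpler of the two to illustrate.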
Cancellation operations (rollback, abort, etc.) should, whenever possible, work
even in the absence of CMG/MG. This needs to be verified separately, since it is
unclear whether this can be guaranteed for all of them.
When CMG/MG is restored, the cluster should return to normal operation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)