[jira] [Updated] (IGNITE-26532) Design CMG/MG absence handling logic

Alexander Lapin (Jira) Fri, 26 Sep 2025 04:28:19 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Lapin updated IGNITE-26532:
-------------------------------------
    Description: 
h3. Motivation

In case of
 # loss of majority in *MG* only

 # loss of majority in *CMG* only

 # loss of majority in both *CMG* and *MG*

user operations should behave adequately: within the specified timeouts they 
wait for the majority to be restored, and if it is not restored, they fail with 
a clear error. They must also not flood the logs with exceptions on every 
internal retry.
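
A minimal sketch of that bounded wait, in Java. Every name in it 
({{MajorityAwait}}, {{GroupUnavailableException}}, {{awaitMajority}}, the 
{{majorityAvailable}} probe) is hypothetical and not an existing Ignite API; it 
only illustrates retrying silently until a deadline and then surfacing a single 
clear error.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration only; not part of Ignite. */
final class MajorityAwait {
    /** Hypothetical error type that carries one clear, user-facing message. */
    static final class GroupUnavailableException extends RuntimeException {
        GroupUnavailableException(String message) {
            super(message);
        }
    }

    /**
     * Polls the probe until it reports a majority or the timeout elapses.
     * Internal retries are silent: nothing is logged or thrown per attempt.
     *
     * @return the number of probe attempts that were made
     */
    static int awaitMajority(String group, BooleanSupplier majorityAvailable,
            Duration timeout, Duration retryDelay) {
        long deadline = System.nanoTime() + timeout.toNanos();
        int attempts = 0;
        while (true) {
            attempts++;
            if (majorityAvailable.getAsBoolean()) {
                return attempts;
            }
            // Overflow-safe deadline check.
            if (System.nanoTime() - deadline >= 0) {
                throw new GroupUnavailableException(
                        "Majority of " + group + " was not restored within "
                                + timeout.toMillis() + " ms");
            }
            try {
                Thread.sleep(retryDelay.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new GroupUnavailableException(
                        "Interrupted while waiting for majority of " + group);
            }
        }
    }
}
```

A caller would pass the operation timeout it already has, for example 
{{MajorityAwait.awaitMajority("MG", probe, Duration.ofSeconds(10), Duration.ofMillis(200))}}, 
and treat the single exception as the operation failure.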

We are talking about operations such as:
 * Schema changes (e.g., creating a table).

 * Transactions of all types (with partially applied transactions being rolled 
back).

 * Adding nodes.

 * Various {{resetPartitions}}.

 * …

At the same time, user operations such as
 * stopping a node, and

 * read-only transactions (as in the past)

must complete successfully without exceptions being logged.

Internal _system_ operations must wait indefinitely for the restoration of 
majority in the corresponding system groups (whether via infinite retry or 
reactively), and under no circumstances should they trigger FG (which is what 
happens now).

A node should log sparingly about the unavailability of a system group, rather 
than as excessively as it does now.
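
One possible way to keep such logging bounded is to rate-limit it per system 
group. The Java sketch below is an assumption for illustration, not Ignite 
code: {{ThrottledReporter}} and {{onUnavailable}} are hypothetical names, and 
the clock is injected so the behavior can be checked deterministically.

```java
import java.util.function.LongSupplier;

/** Hypothetical rate limiter for "group unavailable" log lines; not part of Ignite. */
final class ThrottledReporter {
    private final long intervalNanos;
    private final LongSupplier clock;
    private long lastReport;
    private boolean reported;
    private long suppressed;

    ThrottledReporter(long intervalNanos, LongSupplier clock) {
        this.intervalNanos = intervalNanos;
        this.clock = clock;
    }

    /**
     * Called on every failed attempt to reach the group.
     *
     * @return the message to log, or null if this event should stay silent
     */
    synchronized String onUnavailable(String group) {
        long now = clock.getAsLong();
        if (reported && now - lastReport < intervalNanos) {
            suppressed++;
            return null;
        }
        String message = group + " has no majority ("
                + suppressed + " similar events suppressed)";
        reported = true;
        lastReport = now;
        suppressed = 0;
        return message;
    }
}
```

With {{System::nanoTime}} as the clock and an interval of a few seconds, a 
retry loop would emit at most one line per interval while still reporting how 
many events were skipped.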

Cancellation operations (rollback, abort, etc.) should, whenever possible, work 
even in the absence of CMG/MG. This needs to be verified separately, since it’s 
unclear if we can guarantee it for everything.

When CMG/MG is restored, the cluster should return to normal operability.
h3. Definition of Done

A design document that addresses the aforementioned questions is ready.

  was:
h3. Motivation

In case of
 # loss of majority in *MG* only

 # loss of majority in *CMG* only

 # loss of majority in both *CMG* and *MG*

User operations behave adequately: within the specified timeouts they attempt 
to wait for majority restoration, and if it does not happen, they fail with a 
clear error. At the same time, they do not flood the logs with tons of 
exceptions on every internal retry.

We are talking about operations such as:
 * Schema changes (e.g., creating a table).

 * Transactions of all types (with partially applied transactions being rolled 
back).

 * Adding nodes.

 * Various {{resetPartitions}}.

 * …

At the same time, user operations such as
 * stopping a node, and

 * read-only transactions (as in the past)

must complete successfully without exceptions being logged.

Internal _system_ operations must wait indefinitely for the restoration of 
majority in the corresponding system groups (whether via infinite retry or 
reactively), and under no circumstances should they trigger FG (which is what 
happens now).

A node should log reasonably little about the unavailability of a system group, 
not as excessively as it currently does.

Cancellation operations (rollback, abort, etc.) should, whenever possible, work 
even in the absence of CMG/MG. This needs to be verified separately, since it’s 
unclear if we can guarantee it for everything.

When CMG/MG is restored, the cluster should return to normal operability.


> Design CMG/MG absence handling logic
> ------------------------------------
>
>                 Key: IGNITE-26532
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26532
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Alexander Lapin
>            Assignee: Vladislav Pyatkov
>            Priority: Major
>              Labels: ignite-3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)