Re: [PR] IGNITE-26157 Enhance the documentation with information about Mainten… [ignite]

via GitHub Wed, 18 Feb 2026 01:49:07 -0800


alex-plekhanov commented on code in PR #12727:
URL: https://github.com/apache/ignite/pull/12727#discussion_r2821289511



##########
docs/_docs/maintenance-mode.adoc:
##########
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+= Maintenance Mode
+
+== Overview
+
+Maintenance mode is a special state of the node where its functionality is 
limited.
+Nodes in this mode do not join the cluster and remain isolated until 
maintenance mode has been completed.
+
+Nodes can enter maintenance mode during restarts when situations that could 
lead to data corruption or actions required may affect the functioning of the 
cluster while the node remains part of it.
+To enter nodes into emergency mode requires a restart. More details are 
provided below in the section “<<Reasons for Transitioning into Maintenance 
Mode>>”
+
+When a node enters maintenance mode, it becomes isolated from the cluster and 
does not receive data updates.
+Depending on the task at hand, manual intervention by an administrator might 
be necessary, or the node will resolve issues automatically (for example, 
repairing problems with data and indexes).
+
+After all tasks associated with maintenance mode have been completed, the 
administrator must manually restart the node — after which it exits maintenance 
mode.
+The node rejoins the cluster upon next restart.
+
+During operation in maintenance mode, the node is considered offline within 
the cluster topology.
+Before making changes to base topology, ensure that the node is no longer in 
maintenance mode.
+
+== The process of computing tasks in maintenance mode
+
+When a node receives a command to enter maintenance mode, it creates a 
`maintenance_tasks.mntc` file in the working directory.
+If this file exists after a restart, the node automatically enters emergency 
mode and attempts to perform the necessary tasks.
+
+Task list:
+
+[cols="1,4,1",opts="header"]
+|===
+| Task | Maintenance  | Is it performed automatically at startup
+| `defragmentationMaintenanceTask` | Node defragmentation is scheduled | Yes
+
+| `indexRebuildMaintenanceTask` | Data indexes are scheduled to be restored | 
Yes
+|===
+
+Once the tasks are completed, the `maintenance_tasks.mntc` file is removed.
+The node continues operating in maintenance mode until a manual restart occurs.
+
+Additionally, entering maintenance mode can also be initiated manually as 
scheduled.
+For more information about this, see the section titled "<<Scheduled 
Maintenance Mode>>" below.
+
+== Reasons for Transitioning into Maintenance Mode
+
+=== Possible data corruption
+
+If a node with persistence enabled and write-ahead logging disabled terminates 
abnormally during checkpointing, it cannot reliably determine whether any data 
corruption occurred.
+In this case the node detects possible data damage on subsequent startup and 
shuts down.
+Upon the next restart, the node enters maintenance mode and waits for 
administrative action.
+
+To solve the problem:
+
+- Restart the node and it will enter maintenance mode.
+- Use the management script to execute the following command to remove 
potentially corrupted data:
++
+`control.sh --persistence clean corrupted`.
++
+You can also create backups using the following command:
++
+`control.sh --persistence backup corrupted`
++
+Command examples:
++
+[source, shell]
+----
+ control.sh|bat --host {host} --port {port} --persistence backup corrupted
+ control.sh|bat --host {host} --port {port} --persistence clean corrupted
+----
+A node's IP address and port can be found in its logs.
+- After completing the task, restart the node — it will resume the 
checkpointing process.
+
+The node remains in maintenance mode until potentially corrupted data is 
cleared.
+This deletion can be done manually followed by a node restart.
+Afterward, the node will recover lost data from backups stored on other 
cluster nodes through the rebalancing process.
+More detailed information about this procedure can be found in the 
"link:data-rebalancing[Data Rebalancing]" section of the "Application Developer 
Guide."
+
+Following data removal, the node will exit maintenance mode and rejoin the 
cluster upon the next restart.

Review Comment:
   Looks inconsistent. We've already described node start in normal mode and 
rebalancing, and returning back to data removal and maintenance mode.



##########
docs/_docs/tools/control-script.adoc:
##########
@@ -1262,6 +1262,214 @@ TASKS                          SYS       Running 
compute tasks
 Command [SYSTEM-VIEW] finished with code: 0
 
 
+== Working With Persistence Data
+
+[WARNING]
+====
+All `--persistence` commands below function exclusively in 
link:maintenance-mode[Maintenance Mode]
+====
+
+=== Displaying Information About Damaged Caches
+
+Use the `--persistence info` option to display information about potentially 
damaged caches in the local node:
+
+[tabs]
+--
+tab:Unix[]
+[source,shell]
+----
+control.sh --persistence info
+----
+tab:Window[]
+[source,shell]
+----
+control.bat --persistence info
+----
+--
+
+=== Cleaning Up Damaged Caches
+
+Use the `--persistence clean corrupted` option to clear directories containing 
caches with corrupted data files:
+
+[tabs]
+--
+tab:Unix[]
+[source,shell]
+----
+control.sh --persistence clean corrupted
+----
+tab:Window[]
+[source,shell]
+----
+control.bat --persistence clean corrupted
+----
+--
+
+=== Clearing All Caches
+
+Use the `--persistence clean all` option to delete all cache directories:
+
+[tabs]
+--
+tab:Unix[]
+[source,shell]
+----
+control.sh --persistence clean all
+----
+tab:Window[]
+[source,shell]
+----
+control.bat --persistence clean all
+----
+--
+
+=== Clearing Specific Caches
+
+Use the `--persistence clean caches` option to delete specific listed caches:
+
+[tabs]
+--
+tab:Unix[]
+[source,shell]
+----
+control.sh --persistence clean caches cache1,cache2,cache3
+----
+tab:Window[]
+[source,shell]
+----
+control.bat --persistence clean caches cache1,cache2,cache3
+----
+--
+
+where `cache1,cache2,cache3` are comma-separated cache names.
+
+=== Backing Up Damaged Files
+
+Use the `--persistence backup corrupted` option to back up corrupted data 
files:

Review Comment:
   Where these backup files are stored? What can we do with these files? How to 
delete backup files after node returns to normal mode if everething is ok? How 
to use backup files if there is problems with data restoration?
   I think backup-files workflow should be described somewhere.



##########
docs/_docs/maintenance-mode.adoc:
##########
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+= Maintenance Mode
+
+== Overview
+
+Maintenance mode is a special state of the node where its functionality is 
limited.
+Nodes in this mode do not join the cluster and remain isolated until 
maintenance mode has been completed.

Review Comment:
   'until maintenance mode' -> ' until maintenance task'?



##########
docs/_docs/maintenance-mode.adoc:
##########
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+= Maintenance Mode
+
+== Overview
+
+Maintenance mode is a special state of the node where its functionality is 
limited.
+Nodes in this mode do not join the cluster and remain isolated until 
maintenance mode has been completed.
+
+Nodes can enter maintenance mode during restarts when situations that could 
lead to data corruption or actions required may affect the functioning of the 
cluster while the node remains part of it.
+To enter nodes into emergency mode requires a restart. More details are 
provided below in the section “<<Reasons for Transitioning into Maintenance 
Mode>>”
+
+When a node enters maintenance mode, it becomes isolated from the cluster and 
does not receive data updates.
+Depending on the task at hand, manual intervention by an administrator might 
be necessary, or the node will resolve issues automatically (for example, 
repairing problems with data and indexes).
+
+After all tasks associated with maintenance mode have been completed, the 
administrator must manually restart the node — after which it exits maintenance 
mode.
+The node rejoins the cluster upon next restart.
+
+During operation in maintenance mode, the node is considered offline within 
the cluster topology.
+Before making changes to base topology, ensure that the node is no longer in 
maintenance mode.
+
+== The process of computing tasks in maintenance mode
+
+When a node receives a command to enter maintenance mode, it creates a 
`maintenance_tasks.mntc` file in the working directory.
+If this file exists after a restart, the node automatically enters emergency 
mode and attempts to perform the necessary tasks.
+
+Task list:
+
+[cols="1,4,1",opts="header"]
+|===
+| Task | Maintenance  | Is it performed automatically at startup
+| `defragmentationMaintenanceTask` | Node defragmentation is scheduled | Yes
+
+| `indexRebuildMaintenanceTask` | Data indexes are scheduled to be restored | 
Yes
+|===
+
+Once the tasks are completed, the `maintenance_tasks.mntc` file is removed.
+The node continues operating in maintenance mode until a manual restart occurs.
+
+Additionally, entering maintenance mode can also be initiated manually as 
scheduled.
+For more information about this, see the section titled "<<Scheduled 
Maintenance Mode>>" below.
+
+== Reasons for Transitioning into Maintenance Mode
+
+=== Possible data corruption
+
+If a node with persistence enabled and write-ahead logging disabled terminates 
abnormally during checkpointing, it cannot reliably determine whether any data 
corruption occurred.
+In this case the node detects possible data damage on subsequent startup and 
shuts down.
+Upon the next restart, the node enters maintenance mode and waits for 
administrative action.
+
+To solve the problem:
+
+- Restart the node and it will enter maintenance mode.
+- Use the management script to execute the following command to remove 
potentially corrupted data:
++
+`control.sh --persistence clean corrupted`.
++
+You can also create backups using the following command:
++
+`control.sh --persistence backup corrupted`
++
+Command examples:
++
+[source, shell]
+----
+ control.sh|bat --host {host} --port {port} --persistence backup corrupted
+ control.sh|bat --host {host} --port {port} --persistence clean corrupted
+----
+A node's IP address and port can be found in its logs.
+- After completing the task, restart the node — it will resume the 
checkpointing process.
+
+The node remains in maintenance mode until potentially corrupted data is 
cleared.
+This deletion can be done manually followed by a node restart.
+Afterward, the node will recover lost data from backups stored on other 
cluster nodes through the rebalancing process.
+More detailed information about this procedure can be found in the 
"link:data-rebalancing[Data Rebalancing]" section of the "Application Developer 
Guide."

Review Comment:
   'Application Developer Guide'?



##########
docs/_docs/maintenance-mode.adoc:
##########
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+= Maintenance Mode
+
+== Overview
+
+Maintenance mode is a special state of the node where its functionality is 
limited.
+Nodes in this mode do not join the cluster and remain isolated until 
maintenance mode has been completed.
+
+Nodes can enter maintenance mode during restarts when situations that could 
lead to data corruption or actions required may affect the functioning of the 
cluster while the node remains part of it.
+To enter nodes into emergency mode requires a restart. More details are 
provided below in the section “<<Reasons for Transitioning into Maintenance 
Mode>>”
+
+When a node enters maintenance mode, it becomes isolated from the cluster and 
does not receive data updates.
+Depending on the task at hand, manual intervention by an administrator might 
be necessary, or the node will resolve issues automatically (for example, 
repairing problems with data and indexes).
+
+After all tasks associated with maintenance mode have been completed, the 
administrator must manually restart the node — after which it exits maintenance 
mode.
+The node rejoins the cluster upon next restart.
+
+During operation in maintenance mode, the node is considered offline within 
the cluster topology.
+Before making changes to base topology, ensure that the node is no longer in 
maintenance mode.
+
+== The process of computing tasks in maintenance mode
+
+When a node receives a command to enter maintenance mode, it creates a 
`maintenance_tasks.mntc` file in the working directory.
+If this file exists after a restart, the node automatically enters emergency 
mode and attempts to perform the necessary tasks.
+
+Task list:
+
+[cols="1,4,1",opts="header"]
+|===
+| Task | Maintenance  | Is it performed automatically at startup
+| `defragmentationMaintenanceTask` | Node defragmentation is scheduled | Yes
+
+| `indexRebuildMaintenanceTask` | Data indexes are scheduled to be restored | 
Yes
+|===
+
+Once the tasks are completed, the `maintenance_tasks.mntc` file is removed.
+The node continues operating in maintenance mode until a manual restart occurs.
+
+Additionally, entering maintenance mode can also be initiated manually as 
scheduled.
+For more information about this, see the section titled "<<Scheduled 
Maintenance Mode>>" below.
+
+== Reasons for Transitioning into Maintenance Mode
+
+=== Possible data corruption
+
+If a node with persistence enabled and write-ahead logging disabled terminates 
abnormally during checkpointing, it cannot reliably determine whether any data 
corruption occurred.
+In this case the node detects possible data damage on subsequent startup and 
shuts down.
+Upon the next restart, the node enters maintenance mode and waits for 
administrative action.
+
+To solve the problem:
+
+- Restart the node and it will enter maintenance mode.
+- Use the management script to execute the following command to remove 
potentially corrupted data:
++
+`control.sh --persistence clean corrupted`.
++
+You can also create backups using the following command:
++
+`control.sh --persistence backup corrupted`
++
+Command examples:
++
+[source, shell]
+----
+ control.sh|bat --host {host} --port {port} --persistence backup corrupted
+ control.sh|bat --host {host} --port {port} --persistence clean corrupted
+----
+A node's IP address and port can be found in its logs.
+- After completing the task, restart the node — it will resume the 
checkpointing process.
+
+The node remains in maintenance mode until potentially corrupted data is 
cleared.
+This deletion can be done manually followed by a node restart.
+Afterward, the node will recover lost data from backups stored on other 
cluster nodes through the rebalancing process.
+More detailed information about this procedure can be found in the 
"link:data-rebalancing[Data Rebalancing]" section of the "Application Developer 
Guide."
+
+Following data removal, the node will exit maintenance mode and rejoin the 
cluster upon the next restart.
+
+== Scheduled Maintenance Mode
+
+Some tasks require isolating the node so their execution doesn't impact the 
cluster.
+Once the command is executed, the node will enter maintenance mode on the next 
restart and complete the required tasks.
+Another restart will then be needed to bring the node back into the cluster.
+
+Commands that trigger maintenance mode on the next restart:
+
+- `control.sh --defragmentation` - node defragmentation;
+- `control.sh --cache schedule_indexes_rebuild` - schedule rebuilding cache 
data indexes in Maintenance Mode.
+
+More details about these commands can be found in the "Control Script" section 
under subsections "link:tools/control-script#defragmentation[Defragmentation]" 
and "link:tools/control-script#rebuild_index[Rebuild index]".
+
+To exit maintenance mode and return the node to the cluster, restart the node.
+
+== Outdated Caches
+
+A cache is considered outdated if both conditions are met:
+
+- the node left the cluster (e.g., entered maintenance mode);
+- while the node was unavailable, the cache was removed from the cluster.
+
+Outdated caches need to be deleted.
+
+To maintain data consistency, the node marks outdated caches for deletion and 
enters maintenance mode.
+
+While in maintenance mode, the node automatically deletes outdated caches.

Review Comment:
   - Node was in maintenance mode
   - Cache was deleted in the cluster in this time
   - Node returns to the cluster and need to be returned to maintenance mode 
again to delete outdated caches?
   
   I'm not sure, but I think outdated caches will be deleted automatically once 
node exit maintenance mode and join the cluster.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] IGNITE-26157 Enhance the documentation with information about Mainten… [ignite]

Reply via email to