ivandika3 commented on code in PR #8719: URL: https://github.com/apache/ozone/pull/8719#discussion_r2185029887
##########
hadoop-hdds/docs/content/troubleshooting/om-ha-snapshot-installation.md:
##########
@@ -0,0 +1,37 @@
+---
+title: Troubleshooting OM HA snapshot installation issues
+weight: 12
+---
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+When a new Ozone Manager (OM) is added to an existing OM HA cluster, it needs to obtain the latest OM DB snapshot from the leader OM.
+In cases where the OM DB is very large, the new OM may get stuck in a loop trying to download the snapshot.
+This can happen if the leader OM purges the Raft logs associated with the snapshot before the new OM can finish downloading it.
+When this happens, the new OM will have to restart the snapshot download, and the process can repeat indefinitely.
+
+To avoid this issue, you can configure the following properties on the leader OM:
+
+1. Set `ozone.om.ratis.log.purge.preservation.log.num` to a high value (e.g. 1000000).
+   This property controls how many Raft logs are preserved on the leader OM.
+   By setting it to a high value, you can prevent the leader from purging the logs that the new OM needs to catch up.
+
+2. Set `ozone.om.ratis.log.purge.upto.snapshot.index` to `false`.
+   This property prevents the leader OM from purging any logs until all followers have installed the latest snapshot.
+   This ensures that the new OM will have enough time to download and install the snapshot without the logs being purged.

Review Comment:
   A few comments on this:

   1. If `ozone.om.ratis.log.purge.preservation.log.num` is set to a non-zero number, we should keep `ozone.om.ratis.log.purge.upto.snapshot.index` set to `true`; otherwise `ozone.om.ratis.log.purge.upto.snapshot.index` will override the preservation configuration. So we need to ensure the two changes are not applied together.
   2. Let's swap the order of these. Setting `ozone.om.ratis.log.purge.upto.snapshot.index` to `false` is the riskier approach: if an OM follower is down for a long time, the Raft logs can grow indefinitely, which can fill up the OM metadata dir. Setting `ozone.om.ratis.log.purge.preservation.log.num` to a high value (e.g. 1000000) is a more balanced approach: some logs are preserved so that they can be replicated to the slow follower (instead of installing a snapshot), but once the number of logs exceeds this amount, the OM leader purges them to keep the disk from filling up. We can make the latter the recommended approach.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

For additional commands, e-mail: issues-h...@ozone.apache.org
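
For reference, the option recommended in the review comment above could look like the following configuration fragment. This is only a sketch under stated assumptions: placing the properties in `ozone-site.xml` on the OMs is assumed rather than taken from the thread, and 1000000 is the example value quoted from the doc under review, not a tuned recommendation.

```xml
<!-- Sketch only: assumed to live in ozone-site.xml on the OM nodes. -->

<!-- Keep purging up to the snapshot index so the preservation setting below takes effect. -->
<property>
  <name>ozone.om.ratis.log.purge.upto.snapshot.index</name>
  <value>true</value>
</property>

<!-- Retain a bounded number of Raft logs for slow or newly added followers. -->
<!-- 1000000 is the example value from the doc under review. -->
<property>
  <name>ozone.om.ratis.log.purge.preservation.log.num</name>
  <value>1000000</value>
</property>
```

Because the number of preserved logs is bounded, a follower that stays down for a long time should not cause the Raft log directory to grow without limit, which is the risk the comment raises about setting `ozone.om.ratis.log.purge.upto.snapshot.index` to `false`.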