Arafat Khan created HDDS-11780:
----------------------------------

             Summary: Slight Delay in Exiting Safe Mode Due to and Impact on 
Client Writes
                 Key: HDDS-11780
                 URL: https://issues.apache.org/jira/browse/HDDS-11780
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Client, SCM
            Reporter: Arafat Khan


h4. Issue:

In one instance, the SCM took slightly over a minute to exit safe mode. The 
delay is related to the {*}HealthyPipelineSafeModeRule{*}, which depends on 
DataNodes reporting the health of pipelines to the SCM. Slow pipeline reporting 
lengthens the time needed to meet the criteria for exiting safe mode, which in 
turn delays client operations such as writes.
h4. Cause:

The delay was caused by the following factors:
 # {*}DataNode Registration{*}: DataNodes register with the SCM through 
heartbeats, and the heartbeat interval is around 30 seconds. If the leader 
{*}SCM is restarted and regains leadership{*}, all DataNodes and pipelines need 
to re-register, which adds to the delay.
 # {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only when 
all associated DataNodes are active and reporting. This process may take 
additional time due to the stabilisation of pipelines and network conditions.
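
The second factor can be illustrated with a minimal sketch. It is a simplified model, not the SCM implementation: the class {{PipelineHealthSketch}}, its method names, and the DataNode IDs are hypothetical, and a pipeline is assumed to count as healthy only once every member DataNode has reported.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the healthy-pipeline check (not Ozone code).
final class PipelineHealthSketch {

    // A pipeline is treated as healthy only if all of its member DataNodes have reported.
    static boolean isHealthy(Set<String> members, Set<String> reported) {
        return reported.containsAll(members);
    }

    // The rule is satisfied once the number of healthy pipelines reaches the threshold.
    static boolean ruleSatisfied(List<Set<String>> pipelines,
                                 Set<String> reported,
                                 int threshold) {
        long healthy = pipelines.stream()
            .filter(p -> isHealthy(p, reported))
            .count();
        return healthy >= threshold;
    }

    public static void main(String[] args) {
        // One three-node pipeline; the IDs are made up for illustration.
        List<Set<String>> pipelines = List.of(Set.of("dn1", "dn2", "dn3"));

        // Two of three DataNodes have reported: the pipeline is not yet healthy.
        System.out.println(ruleSatisfied(pipelines, Set.of("dn1", "dn2"), 1));

        // The last DataNode reports: the pipeline becomes healthy and the threshold is met.
        System.out.println(ruleSatisfied(pipelines, Set.of("dn1", "dn2", "dn3"), 1));
    }
}
```

In this model a single late DataNode keeps its whole pipeline out of the healthy count, which is why a slow heartbeat can hold up the rule.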

h4. Logs:

From the logs:
{code:java}
2024-10-24 00:45:49,636 INFO [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:46:01,908 INFO [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:47:00,770 INFO [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
{code}
Roughly 71 seconds separate the first and last of these log entries (00:45:49 
to 00:47:00). The *HealthyPipelineSafeModeRule* took that long to validate the 
required healthy pipeline threshold, contributing to the delay.
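
The gap between the first and last log entries can be computed directly from their timestamps. The following is a small sketch using only {{java.time}}; the class name {{SafeModeLogGap}} is hypothetical and the two timestamps are copied from the log excerpt above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for measuring the time between two SCM log entries.
final class SafeModeLogGap {

    // Timestamp layout used in the log lines quoted in this issue.
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static Duration between(String first, String last) {
        return Duration.between(
            LocalDateTime.parse(first, LOG_TS),
            LocalDateTime.parse(last, LOG_TS));
    }

    public static void main(String[] args) {
        Duration gap = between("2024-10-24 00:45:49,636", "2024-10-24 00:47:00,770");
        System.out.println(gap.toMillis() + " ms"); // prints "71134 ms"
    }
}
```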
h4. Impact:

Due to this delay, Ozone client write operations were impacted, as they rely on 
the SCM being fully operational to allocate blocks or create keys.
h4. Resolution:

To prevent write failures during such delays, we propose increasing the 
following configurations in the Ozone client:
 * {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
 * {{BLOCK_ALLOCATION_RETRY_COUNT}}

This will allow the client to keep retrying for slightly over a minute while 
the SCM exits safe mode, so that writes do not fail prematurely in such 
scenarios.
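
As a rough sizing aid, the retry budget can be estimated as retry count times wait time. The sketch below rests on assumptions: it supposes the client sleeps for {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}} before each of {{BLOCK_ALLOCATION_RETRY_COUNT}} retries, which may not match the actual client retry logic, and both the class {{RetryBudget}} and the 3000 ms wait time are hypothetical values chosen for illustration.

```java
// Hypothetical sizing helper; the retry model here is an assumption, not Ozone client code.
final class RetryBudget {

    // Total time spent waiting if the client sleeps waitTimeMs before each retry.
    static long totalWaitMs(int retryCount, long waitTimeMs) {
        return (long) retryCount * waitTimeMs;
    }

    // Smallest retry count whose total wait covers the given delay (ceiling division).
    static int minRetryCount(long delayMs, long waitTimeMs) {
        return (int) ((delayMs + waitTimeMs - 1) / waitTimeMs);
    }

    public static void main(String[] args) {
        long observedDelayMs = 71_134L; // span of the log entries in this issue
        long waitTimeMs = 3_000L;       // illustrative value, not a known default

        int retries = minRetryCount(observedDelayMs, waitTimeMs);
        System.out.println(retries);                          // prints 24
        System.out.println(totalWaitMs(retries, waitTimeMs)); // prints 72000
    }
}
```

Under these assumptions, any combination whose product exceeds the observed safe-mode exit time would let the client ride out a delay like the one reported here.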



