Arafat Khan created HDDS-11780:
----------------------------------
Summary: Slight Delay in Exiting Safe Mode Due to and Impact on
Client Writes
Key: HDDS-11780
URL: https://issues.apache.org/jira/browse/HDDS-11780
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Client, SCM
Reporter: Arafat Khan
h4. Issue:
In one instance, the SCM took slightly over a minute to exit safe mode. The
delay is related to the {*}HealthyPipelineSafeModeRule{*}, which depends on
DataNodes reporting pipeline health to the SCM. Slow pipeline reporting
lengthens the time needed to meet the safe mode exit criteria, which in turn
delays client operations such as writes.
h4. Cause:
The delay was caused by the following factors:
# {*}DataNode Registration{*}: DataNodes register with the SCM on their
heartbeat interval, which is around 30 seconds. If the leader {*}SCM is
restarted and regains leadership{*}, all DataNodes and pipelines need to
re-register, which adds to the delay.
# {*}Pipeline Health Reporting{*}: A pipeline is marked as "healthy" only when
all of its DataNodes are active and reporting. This can take additional time
while pipelines stabilise, depending on network conditions.
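The second factor can be illustrated with a minimal sketch. This is not the actual {{HealthyPipelineSafeModeRule}} code; the class, method names, and DataNode/pipeline IDs below are hypothetical, and the model simply assumes that a pipeline counts as healthy once every member DataNode has reported, and that the rule passes once the healthy count reaches the threshold.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the healthy-pipeline check (not Ozone's implementation).
final class HealthyPipelineCheck {

    // A pipeline is treated as healthy only if all of its DataNodes have reported.
    static boolean isHealthy(List<String> members, Set<String> reported) {
        return reported.containsAll(members);
    }

    // Counts the pipelines whose members have all reported.
    static long healthyCount(Map<String, List<String>> pipelines, Set<String> reported) {
        return pipelines.values().stream()
            .filter(members -> isHealthy(members, reported))
            .count();
    }

    // The rule is satisfied once the healthy count reaches the threshold.
    static boolean ruleSatisfied(Map<String, List<String>> pipelines,
                                 Set<String> reported, int threshold) {
        return healthyCount(pipelines, reported) >= threshold;
    }

    public static void main(String[] args) {
        // Illustrative IDs only.
        Map<String, List<String>> pipelines = Map.of(
            "p1", List.of("dn1", "dn2", "dn3"),
            "p2", List.of("dn2", "dn3", "dn4"));

        // dn3 has not reported yet, so no pipeline is healthy.
        System.out.println(ruleSatisfied(pipelines, Set.of("dn1", "dn2"), 1));
        // Once dn3 reports, p1 becomes healthy and a threshold of 1 is met.
        System.out.println(ruleSatisfied(pipelines, Set.of("dn1", "dn2", "dn3"), 1));
    }
}
```

Under this model, a single late DataNode is enough to hold back every pipeline it belongs to, which is why a slow re-registration can delay the rule even when the threshold is low.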
h4. Logs:
From the logs:
{code:java}
2024-10-24 00:45:49,636 INFO [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:46:01,908 INFO [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:47:00,770 INFO [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
{code}
The entries span about 71 seconds (00:45:49 to 00:47:00), which indicates that
the *HealthyPipelineSafeModeRule* took a significant time to validate the
required healthy pipeline threshold, contributing to the delay.
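The elapsed time between the first and last log entries can be checked with a small standalone snippet. The class and method names are hypothetical; only the two timestamps are taken from the log excerpt, and the formatter pattern is assumed to match the log's timestamp layout.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for measuring the gap between two SCM log timestamps.
final class SafeModeLogSpan {

    // Matches timestamps such as "2024-10-24 00:45:49,636".
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the time between two log timestamps in milliseconds.
    static long elapsedMillis(String first, String last) {
        LocalDateTime start = LocalDateTime.parse(first, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(last, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        long ms = elapsedMillis("2024-10-24 00:45:49,636", "2024-10-24 00:47:00,770");
        System.out.println("Elapsed: " + ms + " ms"); // prints "Elapsed: 71134 ms"
    }
}
```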
h4. Impact:
Due to this delay, Ozone client write operations were impacted, as they rely on
the SCM being fully operational to allocate blocks or create keys.
h4. Resolution:
To prevent write failures during such delays, we propose increasing the values
of the following configurations in the Ozone client:
* {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
* {{BLOCK_ALLOCATION_RETRY_COUNT}}
This will allow the client to wait longer (slightly over a minute) for the SCM
to exit safe mode, so that writes do not fail prematurely in such scenarios.
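As a rough sizing aid, the sketch below checks whether a given retry count and wait time cover an observed safe mode exit delay. It assumes a fixed wait between retries, so the total wait is simply count times wait time; the class name, method names, and the sample values (5 or 25 retries, 3000 ms) are hypothetical and are not the current or proposed defaults.

```java
// Hypothetical sizing helper; assumes a fixed wait between block allocation retries.
final class RetryBudget {

    // Worst-case total time spent waiting across all retries, in milliseconds.
    static long totalWaitMillis(int retryCount, long waitTimeMs) {
        return Math.multiplyExact((long) retryCount, waitTimeMs);
    }

    // True if the retry budget is at least as long as the safe mode exit delay.
    static boolean covers(int retryCount, long waitTimeMs, long safeModeExitMs) {
        return totalWaitMillis(retryCount, waitTimeMs) >= safeModeExitMs;
    }

    public static void main(String[] args) {
        long observedDelayMs = 71_134L; // span of the log entries quoted in this issue

        // Illustrative values only: 5 x 3000 ms = 15 s, too short for the observed delay.
        System.out.println(covers(5, 3_000L, observedDelayMs));
        // 25 x 3000 ms = 75 s, which covers a delay of slightly over a minute.
        System.out.println(covers(25, 3_000L, observedDelayMs));
    }
}
```

Either setting can be raised to reach the same budget; a larger retry count with a shorter wait lets the client notice sooner when the SCM has left safe mode.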