[
https://issues.apache.org/jira/browse/HDDS-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arafat Khan reassigned HDDS-11780:
----------------------------------
Assignee: Arafat Khan
> Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes
> --------------------------------------------------------------------
>
> Key: HDDS-11780
> URL: https://issues.apache.org/jira/browse/HDDS-11780
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Client, SCM
> Reporter: Arafat Khan
> Assignee: Arafat Khan
> Priority: Major
>
> h4. Issue:
> In one instance, the SCM took slightly over a minute to exit safe mode.
> The delay is related to the {*}HealthyPipelineSafeModeRule{*}, which depends
> on DataNodes reporting pipeline health to the SCM. Slow pipeline reports
> postpone the point at which the safe mode exit criteria are met, which in
> turn delays client operations such as writes.
> h4. Cause:
> The delay was caused by the following factors:
> # {*}DataNode Registration{*}: DataNodes register with the SCM through
> heartbeats, which are sent at intervals of around 30 seconds. If the leader
> {*}SCM is restarted and regains leadership{*}, all DataNodes and pipelines
> need to re-register, which adds to the delay.
> # {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only
> when all associated DataNodes are active and reporting. This process may take
> additional time due to the stabilisation of pipelines and network conditions.
> h4. Logs:
> From the logs:
> {code:java}
> 2024-10-24 00:45:49,636 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total
> pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:46:01,908 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:47:00,770 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> {code}
> The log lines above span about 71 seconds (00:45:49 to 00:47:00). The
> *HealthyPipelineSafeModeRule* took that long to validate the required
> healthy pipeline threshold, contributing to the delay.
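The gap between the first and last log lines above can be checked with a few lines of Java. This is a minimal sketch: the class and method names are hypothetical, and the timestamp pattern is an assumption inferred from the quoted log lines rather than taken from the SCM logging configuration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Ozone: measures the time between two log timestamps.
final class SafeModeLogGap {
    // Assumed layout, inferred from the log lines quoted in the issue.
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the elapsed time in milliseconds between two log timestamps.
    static long elapsedMillis(String first, String last) {
        LocalDateTime start = LocalDateTime.parse(first, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(last, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps of the first and last HealthyPipelineSafeModeRule lines.
        long ms = elapsedMillis("2024-10-24 00:45:49,636", "2024-10-24 00:47:00,770");
        System.out.println("Elapsed: " + ms + " ms");
    }
}
```

For the quoted timestamps this comes to 71134 ms, consistent with "slightly over a minute".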
> h4. Impact:
> Due to this delay, Ozone client write operations were impacted, as they rely
> on the SCM being fully operational to allocate blocks or create keys.
> h4. Resolution:
> To prevent write failures during such delays, we propose increasing the
> following Ozone client settings:
> * {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
> * {{BLOCK_ALLOCATION_RETRY_COUNT}}
> This will allow the client to wait longer (slightly over a minute) for the
> SCM to exit safe mode, so that writes do not fail prematurely in such
> scenarios.
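A rough way to size the two settings is to treat the total client wait as retry count times retry wait. The sketch below assumes a fixed wait between attempts, which may not match the actual client retry policy. The class name, method names, and the 3000 ms wait are hypothetical illustration values, not current or proposed defaults.

```java
// Hypothetical sizing helper, not part of Ozone.
final class RetryBudget {
    // Total time spent waiting, assuming a fixed wait between attempts.
    static long totalWaitMs(int retryCount, long retryWaitMs) {
        return (long) retryCount * retryWaitMs;
    }

    // Smallest retry count whose total wait covers the target window (ceiling division).
    static int minRetries(long targetWindowMs, long retryWaitMs) {
        if (retryWaitMs <= 0) {
            throw new IllegalArgumentException("retryWaitMs must be positive");
        }
        return (int) ((targetWindowMs + retryWaitMs - 1) / retryWaitMs);
    }

    public static void main(String[] args) {
        long windowMs = 71_134L; // span of the log lines quoted in the issue
        long waitMs = 3_000L;    // hypothetical wait between retries
        int retries = minRetries(windowMs, waitMs);
        System.out.println("retries=" + retries
            + ", total wait=" + totalWaitMs(retries, waitMs) + " ms");
    }
}
```

Under these assumptions, any combination whose product exceeds the observed safe mode exit time gives the client enough headroom. Raising the wait time rather than the count reduces the number of requests sent to the SCM during the window.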