[ 
https://issues.apache.org/jira/browse/HDDS-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arafat Khan reassigned HDDS-11780:
----------------------------------

    Assignee: Arafat Khan

> Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes
> --------------------------------------------------------------------
>
>                 Key: HDDS-11780
>                 URL: https://issues.apache.org/jira/browse/HDDS-11780
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Client, SCM
>            Reporter: Arafat Khan
>            Assignee: Arafat Khan
>            Priority: Major
>
> h4. Issue:
> A delay was observed in the SCM exiting safe mode; in one instance it took 
> slightly over a minute. The delay is related to the 
> {*}HealthyPipelineSafeModeRule{*}, which depends on DataNodes reporting 
> pipeline health to the SCM. Slow pipeline reporting lengthens the time 
> needed to meet the criteria for exiting safe mode, which in turn delays 
> client operations such as writes.
> h4. Cause:
> The delay was caused by the following factors:
>  # {*}DataNode Registration{*}: DataNodes register with the SCM through 
> heartbeats, which are sent at intervals of around 30 seconds. If the leader 
> {*}SCM is restarted and regains leadership{*}, all DataNodes must 
> re-register and report their pipelines again, which adds delay.
>  # {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only 
> when all associated DataNodes are active and reporting. This process may take 
> additional time due to the stabilisation of pipelines and network conditions.
> h4. Logs:
> From the logs:
> {code:java}
> 2024-10-24 00:45:49,636 INFO 
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total 
> pipeline count is 4, healthy pipeline threshold count is 1  
> 2024-10-24 00:46:01,908 INFO 
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: 
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1  
> 2024-10-24 00:47:00,770 INFO 
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: 
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1 
> {code}
> About 71 seconds elapsed between the first and last of these log entries 
> (00:45:49 to 00:47:00) while the *HealthyPipelineSafeModeRule* was validating 
> the required healthy pipeline threshold, which contributed to the delay.
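
The gap between the quoted log entries can be checked with a few lines of Java. This is only a sketch for reading the timestamps above; the class and method names are made up for illustration and are not part of Ozone.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class SafeModeLogGap {
    // Timestamp layout of the quoted SCM log lines, e.g. "2024-10-24 00:45:49,636".
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds elapsed between two log timestamps (hypothetical helper). */
    static long elapsedMillis(String first, String last) {
        LocalDateTime start = LocalDateTime.parse(first, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(last, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // First and last HealthyPipelineSafeModeRule entries from the issue.
        long ms = elapsedMillis("2024-10-24 00:45:49,636", "2024-10-24 00:47:00,770");
        System.out.println(ms + " ms"); // prints "71134 ms"
    }
}
```
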
> h4. Impact:
> Due to this delay, Ozone client write operations were impacted, as they rely 
> on the SCM being fully operational to allocate blocks or create keys.
> h4. Resolution:
> To prevent write failures during such delays, we propose increasing the 
> values of the following Ozone client configuration settings:
>  * {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
>  * {{BLOCK_ALLOCATION_RETRY_COUNT}}
> This allows the client to wait longer (slightly over a minute) for the SCM 
> to exit safe mode, so that writes do not fail prematurely in such 
> scenarios.
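
To make the sizing concrete, here is a minimal sketch of how a retry count and a per-retry wait combine into a total wait budget. It assumes a simple fixed-delay retry loop; the class, the method names, and the 5000 ms wait are hypothetical and do not reflect the actual Ozone client code or its default values.

```java
import java.util.concurrent.Callable;

public class BlockAllocationRetrySketch {
    /** Longest time a fixed-delay retry loop spends sleeping between attempts. */
    static long maxWaitMillis(int retryCount, long retryWaitMs) {
        return (long) retryCount * retryWaitMs;
    }

    /** Smallest retry count whose total wait covers the given window (ceiling division). */
    static int retriesNeeded(long windowMs, long retryWaitMs) {
        return (int) ((windowMs + retryWaitMs - 1) / retryWaitMs);
    }

    /** Runs op, retrying up to retryCount times with a fixed sleep between attempts. */
    static <T> T callWithRetry(Callable<T> op, int retryCount, long retryWaitMs)
            throws Exception {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // e.g. the SCM is still in safe mode
                if (attempt < retryCount) {
                    Thread.sleep(retryWaitMs);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        long observedDelayMs = 71_134; // gap seen in the quoted logs
        long retryWaitMs = 5_000;      // hypothetical wait, not an Ozone default
        int retries = retriesNeeded(observedDelayMs, retryWaitMs);
        System.out.println(retries + " retries cover "
            + maxWaitMillis(retries, retryWaitMs) + " ms");
    }
}
```

Under these assumptions, the product of the two settings has to exceed the longest safe-mode exit delay that the client is expected to ride out.
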



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
