[jira] [Updated] (HDDS-11780) Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes

ASF GitHub Bot (Jira) Thu, 21 Nov 2024 22:47:04 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HDDS-11780:
----------------------------------
    Labels: pull-request-available  (was: )

> Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes
> --------------------------------------------------------------------
>
>                 Key: HDDS-11780
>                 URL: https://issues.apache.org/jira/browse/HDDS-11780
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Client, SCM
>            Reporter: Arafat Khan
>            Assignee: Arafat Khan
>            Priority: Major
>              Labels: pull-request-available
>
> h4. Issue:
> There was an observed delay in the SCM exiting safe mode, taking slightly 
> over a minute in one instance. This issue is related to the 
> {*}HealthyPipelineSafeModeRule{*}, which depends on DataNodes reporting the 
> health of pipelines to the SCM. Delays in pipeline reporting can impact the 
> time taken to meet the criteria for exiting safe mode, thereby delaying 
> client operations like writes.
> h4. Cause:
> The delay was caused by the following factors:
>  # {*}DataNode Registration{*}: DataNodes register with the SCM using 
> heartbeat intervals, which take around 30 seconds. If the leader {*}SCM is 
> restarted and regains leadership{*}, all DataNodes and pipelines need to 
> re-register, leading to delays.
>  # {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only 
> when all associated DataNodes are active and reporting. This process may take 
> additional time due to the stabilisation of pipelines and network conditions.
> h4. Logs:
> From the logs:
> {code:java}
> 2024-10-24 00:45:49,630 INFO 
> [main]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: containers 
> with one replica threshold count 0
> 2024-10-24 00:45:49,636 INFO 
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total 
> pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:45:49,639 INFO 
> [main]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: 
> Total pipeline count is 4, pipeline's with at least one datanode reported 
> threshold count is 4
> 2024-10-24 00:45:55,452 INFO [main]-org.apache.hadoop.hdds.scm.ha.SCMContext: 
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=false}.
> 2024-10-24 00:46:01,908 INFO 
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
>  Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:46:01,911 INFO 
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule:
>  Refreshed one replica container threshold 0, currentThreshold 0
> 2024-10-24 00:46:01,912 INFO 
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
>  Refreshed Total pipeline count is 4, pipeline's with at least one datanode 
> reported threshold count is 4
> 2024-10-24 00:47:00,770 INFO 
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
>  Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:47:00,770 INFO 
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  ContainerSafeModeRule rule is successfully validated
> 2024-10-24 00:47:00,770 INFO 
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
>  Refreshed Total pipeline count is 4, pipeline's with at least one datanode 
> reported threshold count is 4
> org.apache.hadoop.hdds.scm.exceptions.SCMException: SafeModePrecheck failed 
> for deleteBlocks
> 2024-10-24 00:47:09,206 INFO 
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. 1 DataNodes registered, 1 required.
> 2024-10-24 00:47:09,303 INFO 
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  DataNodeSafeModeRule rule is successfully validated
> 2024-10-24 00:47:09,303 INFO 
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  All SCM safe mode pre check rules have passed
> 2024-10-24 00:47:09,303 INFO 
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
>  Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=true}.
> 2024-10-24 00:47:09,303 INFO 
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator:
>  trigger a one-shot run on node3-RatisPipelineUtilsThread.
> 2024-10-24 00:47:09,207 INFO 
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. Pipelines with at least one datanode reported count is 2, 
> required at least one datanode reported per pipeline count is 4
> 2024-10-24 00:47:09,304 INFO 
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. Pipelines with at least one datanode reported count is 4, 
> required at least one datanode reported per pipeline count is 4
> 2024-10-24 00:47:09,304 INFO 
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  AtleastOneDatanodeReportedRule rule is successfully validated
> 2024-10-24 00:47:09,620 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. Healthy pipelines reported count is 0, required healthy 
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. Healthy pipelines reported count is 0, required healthy 
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. Healthy pipelines reported count is 0, required healthy 
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. Healthy pipelines reported count is 0, required healthy 
> pipeline reported count is 1
> 2024-10-24 00:47:09,935 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM in safe mode. Healthy pipelines reported count is 1, required healthy 
> pipeline reported count is 1
> 2024-10-24 00:47:09,935 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  HealthyPipelineSafeModeRule rule is successfully validated
> 2024-10-24 00:47:09,935 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  ScmSafeModeManager, all rules are successfully validated
> 2024-10-24 00:47:09,935 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
>  SCM exiting safe mode.
> 2024-10-24 00:47:09,936 INFO 
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
>  Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, 
> preCheckPassed=true} to SafeModeStatus{safeModeStatus=false, 
> preCheckPassed=true}.
> {code}
> The *HealthyPipelineSafeModeRule* took a significant time to validate the 
> required healthy pipeline threshold, contributing to the delay.
> h4. Impact:
> Due to this delay, Ozone client write operations were impacted, as they rely 
> on the SCM being fully operational to allocate blocks or create keys.
> h3. Resolution:
> To prevent write failures during such delays, we propose increasing the 
> following configurations in the Ozone client:
>  * {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
>  * {{BLOCK_ALLOCATION_RETRY_COUNT}}
> This will allow the client to wait longer (slightly over a minute) for the 
> SCM to exit safe mode, ensuring that writes are not prematurely failed during 
> such scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (HDDS-11780) Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes

Reply via email to