[
https://issues.apache.org/jira/browse/HDDS-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arafat Khan updated HDDS-11780:
-------------------------------
Description:
h4. Issue:
There was an observed delay in the SCM exiting safe mode, taking slightly over
a minute in one instance. This issue is related to the
{*}HealthyPipelineSafeModeRule{*}, which depends on DataNodes reporting the
health of pipelines to the SCM. Delays in pipeline reporting can impact the
time taken to meet the criteria for exiting safe mode, thereby delaying client
operations like writes.
h4. Cause:
The delay was caused by the following factors:
# {*}DataNode Registration{*}: DataNodes register with the SCM using heartbeat
intervals, which take around 30 seconds. If the leader {*}SCM is restarted and
regains leadership{*}, all DataNodes and pipelines need to re-register, leading
to delays.
# {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only when
all associated DataNodes are active and reporting. This process may take
additional time due to the stabilisation of pipelines and network conditions.
h4. Logs:
>From the logs:
{code:java}
2024-10-24 00:45:49,630 INFO
[main]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: containers
with one replica threshold count 0
2024-10-24 00:45:49,636 INFO
[main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total
pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:45:49,639 INFO
[main]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
Total pipeline count is 4, pipeline's with at least one datanode reported
threshold count is 4
2024-10-24 00:45:55,452 INFO [main]-org.apache.hadoop.hdds.scm.ha.SCMContext:
Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
preCheckPassed=false} to SafeModeStatus{safeModeStatus=true,
preCheckPassed=false}.
2024-10-24 00:46:01,908 INFO
[bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:46:01,911 INFO
[bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule:
Refreshed one replica container threshold 0, currentThreshold 0
2024-10-24 00:46:01,912 INFO
[bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
Refreshed Total pipeline count is 4, pipeline's with at least one datanode
reported threshold count is 4
2024-10-24 00:47:00,770 INFO
[bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:47:00,770 INFO
[bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
ContainerSafeModeRule rule is successfully validated
2024-10-24 00:47:00,770 INFO
[bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
Refreshed Total pipeline count is 4, pipeline's with at least one datanode
reported threshold count is 4
org.apache.hadoop.hdds.scm.exceptions.SCMException: SafeModePrecheck failed for
deleteBlocks
2024-10-24 00:47:09,206 INFO
[node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. 1 DataNodes registered, 1 required.
2024-10-24 00:47:09,303 INFO
[node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
DataNodeSafeModeRule rule is successfully validated
2024-10-24 00:47:09,303 INFO
[node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
All SCM safe mode pre check rules have passed
2024-10-24 00:47:09,303 INFO
[node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
preCheckPassed=false} to SafeModeStatus{safeModeStatus=true,
preCheckPassed=true}.
2024-10-24 00:47:09,303 INFO
[node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator:
trigger a one-shot run on node3-RatisPipelineUtilsThread.
2024-10-24 00:47:09,207 INFO
[node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. Pipelines with at least one datanode reported count is 2,
required at least one datanode reported per pipeline count is 4
2024-10-24 00:47:09,304 INFO
[node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. Pipelines with at least one datanode reported count is 4,
required at least one datanode reported per pipeline count is 4
2024-10-24 00:47:09,304 INFO
[node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
AtleastOneDatanodeReportedRule rule is successfully validated
2024-10-24 00:47:09,620 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. Healthy pipelines reported count is 0, required healthy
pipeline reported count is 1
2024-10-24 00:47:09,620 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. Healthy pipelines reported count is 0, required healthy
pipeline reported count is 1
2024-10-24 00:47:09,620 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. Healthy pipelines reported count is 0, required healthy
pipeline reported count is 1
2024-10-24 00:47:09,620 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. Healthy pipelines reported count is 0, required healthy
pipeline reported count is 1
2024-10-24 00:47:09,935 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM in safe mode. Healthy pipelines reported count is 1, required healthy
pipeline reported count is 1
2024-10-24 00:47:09,935 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
HealthyPipelineSafeModeRule rule is successfully validated
2024-10-24 00:47:09,935 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
ScmSafeModeManager, all rules are successfully validated
2024-10-24 00:47:09,935 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
SCM exiting safe mode.
2024-10-24 00:47:09,936 INFO
[node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
preCheckPassed=true} to SafeModeStatus{safeModeStatus=false,
preCheckPassed=true}.
{code}
The *HealthyPipelineSafeModeRule* took a significant time to validate the
required healthy pipeline threshold, contributing to the delay.
h4. Impact:
Due to this delay, Ozone client write operations were impacted, as they rely on
the SCM being fully operational to allocate blocks or create keys.
h3. Resolution:
To prevent write failures during such delays, we propose increasing the
following configurations in the Ozone client:
* {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
* {{BLOCK_ALLOCATION_RETRY_COUNT}}
This will allow the client to wait longer (slightly over a minute) for the SCM
to exit safe mode, ensuring that writes are not prematurely failed during such
scenarios.
was:
h4. Issue:
There was an observed delay in the SCM exiting safe mode, taking slightly over
a minute in one instance. This issue is related to the
{*}HealthyPipelineSafeModeRule{*}, which depends on DataNodes reporting the
health of pipelines to the SCM. Delays in pipeline reporting can impact the
time taken to meet the criteria for exiting safe mode, thereby delaying client
operations like writes.
h4. Cause:
The delay was caused by the following factors:
# {*}DataNode Registration{*}: DataNodes register with the SCM using heartbeat
intervals, which take around 30 seconds. If the leader {*}SCM is restarted and
regains leadership{*}, all DataNodes and pipelines need to re-register, leading
to delays.
# {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only when
all associated DataNodes are active and reporting. This process may take
additional time due to the stabilisation of pipelines and network conditions.
h4. Logs:
>From the logs:
{code:java}
2024-10-24 00:45:49,636 INFO
[main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total
pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:46:01,908 INFO
[main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
2024-10-24 00:47:00,770 INFO
[main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
{code}
The *HealthyPipelineSafeModeRule* took a significant time to validate the
required healthy pipeline threshold, contributing to the delay.
h4. Impact:
Due to this delay, Ozone client write operations were impacted, as they rely on
the SCM being fully operational to allocate blocks or create keys.
h3. Resolution:
To prevent write failures during such delays, we propose increasing the
following configurations in the Ozone client:
* {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
* {{BLOCK_ALLOCATION_RETRY_COUNT}}
This will allow the client to wait longer (slightly over a minute) for the SCM
to exit safe mode, ensuring that writes are not prematurely failed during such
scenarios.
> Slight Delay in Exiting Safe Mode Due to and Impact on Client Writes
> --------------------------------------------------------------------
>
> Key: HDDS-11780
> URL: https://issues.apache.org/jira/browse/HDDS-11780
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Client, SCM
> Reporter: Arafat Khan
> Assignee: Arafat Khan
> Priority: Major
>
> h4. Issue:
> There was an observed delay in the SCM exiting safe mode, taking slightly
> over a minute in one instance. This issue is related to the
> {*}HealthyPipelineSafeModeRule{*}, which depends on DataNodes reporting the
> health of pipelines to the SCM. Delays in pipeline reporting can impact the
> time taken to meet the criteria for exiting safe mode, thereby delaying
> client operations like writes.
> h4. Cause:
> The delay was caused by the following factors:
> # {*}DataNode Registration{*}: DataNodes register with the SCM using
> heartbeat intervals, which take around 30 seconds. If the leader {*}SCM is
> restarted and regains leadership{*}, all DataNodes and pipelines need to
> re-register, leading to delays.
> # {*}Pipeline Health Reporting{*}: Pipelines are marked as "healthy" only
> when all associated DataNodes are active and reporting. This process may take
> additional time due to the stabilisation of pipelines and network conditions.
> h4. Logs:
> From the logs:
> {code:java}
> 2024-10-24 00:45:49,630 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: containers
> with one replica threshold count 0
> 2024-10-24 00:45:49,636 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total
> pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:45:49,639 INFO
> [main]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
> Total pipeline count is 4, pipeline's with at least one datanode reported
> threshold count is 4
> 2024-10-24 00:45:55,452 INFO [main]-org.apache.hadoop.hdds.scm.ha.SCMContext:
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false}.
> 2024-10-24 00:46:01,908 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:46:01,911 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule:
> Refreshed one replica container threshold 0, currentThreshold 0
> 2024-10-24 00:46:01,912 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
> Refreshed Total pipeline count is 4, pipeline's with at least one datanode
> reported threshold count is 4
> 2024-10-24 00:47:00,770 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule:
> Refreshed total pipeline count is 4, healthy pipeline threshold count is 1
> 2024-10-24 00:47:00,770 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> ContainerSafeModeRule rule is successfully validated
> 2024-10-24 00:47:00,770 INFO
> [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule:
> Refreshed Total pipeline count is 4, pipeline's with at least one datanode
> reported threshold count is 4
> org.apache.hadoop.hdds.scm.exceptions.SCMException: SafeModePrecheck failed
> for deleteBlocks
> 2024-10-24 00:47:09,206 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. 1 DataNodes registered, 1 required.
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> DataNodeSafeModeRule rule is successfully validated
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> All SCM safe mode pre check rules have passed
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
> preCheckPassed=false} to SafeModeStatus{safeModeStatus=true,
> preCheckPassed=true}.
> 2024-10-24 00:47:09,303 INFO
> [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator:
> trigger a one-shot run on node3-RatisPipelineUtilsThread.
> 2024-10-24 00:47:09,207 INFO
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Pipelines with at least one datanode reported count is 2,
> required at least one datanode reported per pipeline count is 4
> 2024-10-24 00:47:09,304 INFO
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Pipelines with at least one datanode reported count is 4,
> required at least one datanode reported per pipeline count is 4
> 2024-10-24 00:47:09,304 INFO
> [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> AtleastOneDatanodeReportedRule rule is successfully validated
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,620 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 0, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM in safe mode. Healthy pipelines reported count is 1, required healthy
> pipeline reported count is 1
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> HealthyPipelineSafeModeRule rule is successfully validated
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> ScmSafeModeManager, all rules are successfully validated
> 2024-10-24 00:47:09,935 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager:
> SCM exiting safe mode.
> 2024-10-24 00:47:09,936 INFO
> [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext:
> Update SafeModeStatus from SafeModeStatus{safeModeStatus=true,
> preCheckPassed=true} to SafeModeStatus{safeModeStatus=false,
> preCheckPassed=true}.
> {code}
> The *HealthyPipelineSafeModeRule* took a significant time to validate the
> required healthy pipeline threshold, contributing to the delay.
> h4. Impact:
> Due to this delay, Ozone client write operations were impacted, as they rely
> on the SCM being fully operational to allocate blocks or create keys.
> h3. Resolution:
> To prevent write failures during such delays, we propose increasing the
> following configurations in the Ozone client:
> * {{BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS}}
> * {{BLOCK_ALLOCATION_RETRY_COUNT}}
> This will allow the client to wait longer (slightly over a minute) for the
> SCM to exit safe mode, ensuring that writes are not prematurely failed during
> such scenarios.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]