[ https://issues.apache.org/jira/browse/HDDS-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz-wo Sze resolved HDDS-12469.
-------------------------------
    Fix Version/s: 1.4.2
       Resolution: Fixed

The pull request is now merged. Thanks, [~sumitagrawl]!

> fail fast for write block stuck
> -------------------------------
>
>                 Key: HDDS-12469
>                 URL: https://issues.apache.org/jira/browse/HDDS-12469
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Datanode
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.4.2
>
>         Attachments: 8022_review.patch
>
>
> On a follower, ContainerStateMachine's write() returns a future, which
> performs the actual block/chunk write.
> As part of the write, it
> * can create the container if it does not exist
> * writes the block chunk to disk
>
> Under a disk-full or low-disk condition, processing the write chunk takes a
> very long time and appears stuck.
> JMX metrics from the DNs show time taken (ns) on the order of 10^14, 10^13,
> ..., that is, 100k seconds, 10k seconds, ..., which shows the process is
> really stuck and unable to recover.
>
> {code:java}
> jmxnode1_p1: "WriteStateMachineDataNsAvgTime" : 1.0438595905348E14
> jmxnode2_p2: "WriteStateMachineDataNsAvgTime" : 2.2966696397828832E13
> jmxnode2_p3: "WriteStateMachineDataNsAvgTime" : 1.4061009948751E13
> jmxnode3_p4: "WriteStateMachineDataNsAvgTime" : 1.0024869351741E13
> ...
> {code}
>
> This might be because a volume has failed; it was later observed that a few
> volume disks had issues.
>
> The Ratis logs show that it keeps tracking the task and prints a
> TimeoutIOException for it every 10 seconds.
> {code:java}
> org.apache.ratis.protocol.exceptions.TimeoutIOException: Timeout 10s:
> WriteLog:115: (t:1, i:115), STATEMACHINELOGENTRY, cmdType: WriteChunk
> traceID: "" containerID: 18446516 datanodeUuid:
> "2834c106-e999-4013-9934-a165fdbe41cf" pipelineID:
> "f1efe128-22fe-4762-a248-7aebcaa07dff"
> ...
> ...
> {code}
> Considering the above scenario:
> * Mark the pipeline unhealthy and trigger pipeline closure if the time taken
> crosses a certain threshold (could be 10 min, as the max time for a 256MB
> write, or less).
> * Stop and fail the current task, and avoid accepting further Raft logs.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
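
The two proposed actions (fail the stuck task once it crosses a threshold, then stop accepting further writes) can be illustrated with a minimal sketch. This is not the merged HDDS-12469 change and does not use Ozone or Ratis APIs: the class `FailFastWriteGuard`, its method names, and the `onStuck` callback are hypothetical, and only `java.util.concurrent` is assumed. Note that `orTimeout` only fails the future; it does not interrupt the thread doing the underlying disk write.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical fail-fast guard around a write future (illustration only). */
final class FailFastWriteGuard {
  private final long timeoutMillis;
  // Hypothetical hook, e.g. "mark pipeline unhealthy and trigger closure".
  private final Runnable onStuck;
  private final AtomicBoolean unhealthy = new AtomicBoolean(false);

  FailFastWriteGuard(Duration timeout, Runnable onStuck) {
    this.timeoutMillis = timeout.toMillis();
    this.onStuck = onStuck;
  }

  boolean isUnhealthy() {
    return unhealthy.get();
  }

  /** Fails the write if it exceeds the timeout; rejects writes once unhealthy. */
  <T> CompletableFuture<T> guard(CompletableFuture<T> write) {
    if (unhealthy.get()) {
      // Avoid accepting further entries after a stuck write was detected.
      return CompletableFuture.failedFuture(
          new IllegalStateException("unhealthy: rejecting further writes"));
    }
    // orTimeout (Java 9+) completes the future with TimeoutException on expiry.
    return write
        .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
        .whenComplete((result, error) -> {
          Throwable cause = (error instanceof CompletionException)
              ? error.getCause() : error;
          // Run the callback once, on the first timeout only.
          if (cause instanceof TimeoutException
              && unhealthy.compareAndSet(false, true)) {
            onStuck.run();
          }
        });
  }
}
```

With the threshold suggested in the issue, a caller might construct the guard as `new FailFastWriteGuard(Duration.ofMinutes(10), closePipeline)`, where `closePipeline` is a placeholder for whatever action closes the pipeline.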