[
https://issues.apache.org/jira/browse/HDDS-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krishna Kumar Asawa reassigned HDDS-9184:
-----------------------------------------
Assignee: Siddhant Sangwan
> Re-replication of UNHEALTHY replica takes a long time in Ratis rack-scatter
> environment
> ---------------------------------------------------------------------------------------
>
> Key: HDDS-9184
> URL: https://issues.apache.org/jira/browse/HDDS-9184
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Jyotirmoy Sinha
> Assignee: Siddhant Sangwan
> Priority: Major
>
> Steps:
> # Create a volume/bucket/key (Ratis) in a rack-scatter environment
> # Restart one datanode hosting a replica of the above key's container
> # Simulate an UNHEALTHY replica on the restarted datanode
> Observation - Re-replication of the container and removal of the UNHEALTHY
> replica take about 1 hour.
> dn-container.log -
> {code:java}
> 2023-08-10 09:23:48,422 | INFO | ID=11011 | Index=0 | BCSID=132 |
> State=CLOSED |
> 2023-08-10 09:28:14,105 | ERROR | ID=11010 | Index=0 | BCSID=129 |
> State=UNHEALTHY | INCONSISTENT_CHUNK_LENGTH for file
> /hadoop-ozone/datanode/data632176/hdds/CID-d7f3b02c-172b-41ad-8944-c3ff90bb7382/current/containerDir21/11010/chunks/111677748019215006.block.
> Message: Inconsistent read for chunk=111677748019215006_chunk_1 expected
> length=4194304 actual length=0 for block conID: 11010 locID:
> 111677748019215006 bcsId: 129 |
> 2023-08-10 09:29:11,631 | INFO | ID=12001 | Index=0 | BCSID=140 |
> State=CLOSED |
> 2023-08-10 10:21:11,361 | INFO | ID=10003 | Index=0 | BCSID=227 |
> State=CLOSED |
> 2023-08-10 10:31:09,934 | INFO | ID=11010 | Index=0 | BCSID=129 |
> State=DELETED | Container force deleted |
> 2023-08-10 10:36:15,827 | INFO | ID=11010 | Index=0 | BCSID=129 |
> State=CLOSED | {code}
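> The INCONSISTENT_CHUNK_LENGTH error above comes from the datanode reading a
> chunk file back and finding fewer bytes than the chunk metadata declares
> (expected length=4194304, actual length=0 after the restart). A minimal
> sketch of that kind of consistency check, with hypothetical names
> ({{verify_chunk_read}} is illustrative, not the actual Ozone code path):

```python
def verify_chunk_read(chunk_name: str, expected_len: int, data: bytes) -> bytes:
    """Reject a chunk read whose length does not match the declared chunk length.

    Models the check behind INCONSISTENT_CHUNK_LENGTH: container metadata says
    the chunk holds `expected_len` bytes, but the file on disk returned fewer
    (here: 0 bytes, as in the log above).
    """
    actual = len(data)
    if actual != expected_len:
        raise ValueError(
            f"Inconsistent read for chunk={chunk_name} "
            f"expected length={expected_len} actual length={actual}"
        )
    return data
```

> When such a read fails, the datanode marks the whole container replica
> UNHEALTHY, which is what triggers the re-replication attempts seen in the
> SCM log below.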
> ozone-scm.log -
> {code:java}
> 2023-08-10 09:29:09,572 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter:
> No satisfied datanode to meet the constraints. Metadatadata size required: 0
> Data size required: 5368709120, scope /rack_h1wbu, excluded nodes
> [c396dcdf-3253-4827-995e-e339ec9527e1(quasar-ssbwak-7.quasar-ssbwak.root.hwx.site/172.27.53.195),
> 00ecbf81-ef0b-4827-b6aa-e818a42ac522(quasar-ssbwak-5.quasar-ssbwak.root.hwx.site/172.27.105.129),
> 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)]
> 2023-08-10 09:29:09,572 WARN [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter:
> Placement policy could not choose the enough nodes from available racks.
> Chosen nodes size from Unique Racks: 0, but required nodes to choose from
> Unique Racks: 1 do not match. Available racks count: 1, Excluded nodes count:
> 1, UsedNodes count: 2
> 2023-08-10 09:29:09,572 ERROR [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor:
> Error processing Health result of class: class
> org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult
> for container ContainerInfo{id=#11010, state=CLOSED,
> stateEnterTime=2023-08-10T09:12:16.522733Z,
> pipelineID=PipelineID=e58f6cad-1824-4701-a26a-1d400e9be503, owner=om124}
> org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
> did not return any nodes. Number of required Nodes 1, Datasize Required:
> 5368709120
> at
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:93)
> at
> org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:284)
> at
> org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:118)
> at
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:773)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:156)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:116)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:165)
> at java.base/java.lang.Thread.run(Thread.java:834)
> 2023-08-10 09:29:09,573 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter:
> Chosen nodes:
> [616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)].
> isPolicySatisfied: true. {code}
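> The placement failure reduces to simple counting: rack-scatter must pick the
> new target from a rack that does not already host a replica, but the only
> such rack left contains exactly the node holding the UNHEALTHY replica,
> which is excluded. A rough model of that constraint (illustrative only; the
> names and the simplified selection loop are assumptions, not the
> SCMContainerPlacementRackScatter implementation):

```python
def choose_from_unique_racks(racks, used, excluded, required):
    """Pick up to `required` target nodes, each from a rack with no replica yet.

    racks:    mapping of rack name -> node ids on that rack
    used:     nodes already holding a healthy replica (their racks are off-limits)
    excluded: nodes that cannot be targets (e.g. the one with the UNHEALTHY replica)
    """
    used_racks = {r for r, nodes in racks.items() if any(n in used for n in nodes)}
    chosen = []
    for rack, nodes in racks.items():
        if rack in used_racks:
            continue  # rack already hosts a replica; not a "unique" rack
        for n in nodes:
            if n not in used and n not in excluded:
                chosen.append(n)
                break  # at most one node per unique rack
        if len(chosen) == required:
            break
    return chosen
```

> In the failing window the only replica-free rack holds just the excluded
> node, so zero nodes are chosen and the handler throws SCMException on every
> pass; once the stale node is declared dead and removed from the topology,
> and a node on another rack becomes eligible, the same selection succeeds.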
> In the SCM logs, re-replication of the UNHEALTHY replica keeps failing with
> the above error for almost an hour. It resolved after about 1 hour, once the
> stale datanode was declared dead and removed from the topology -
> {code:java}
> 2023-08-10 10:28:07,158 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter:
> No satisfied datanode to meet the constraints. Metadatadata size required: 0
> Data size required: 5368709120, scope /rack_6us5i, excluded nodes
> [c396dcdf-3253-4827-995e-e339ec9527e1(quasar-ssbwak-7.quasar-ssbwak.root.hwx.site/172.27.53.195),
> 00ecbf81-ef0b-4827-b6aa-e818a42ac522(quasar-ssbwak-5.quasar-ssbwak.root.hwx.site/172.27.105.129),
> 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)]
> 2023-08-10 10:28:07,158 WARN [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter:
> Placement policy could not choose the enough nodes from available racks.
> Chosen nodes size from Unique Racks: 0, but required nodes to choose from
> Unique Racks: 1 do not match. Available racks count: 1, Excluded nodes count:
> 1, UsedNodes count: 2
> 2023-08-10 10:28:07,158 ERROR [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor:
> Error processing Health result of class: class
> org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult
> for container ContainerInfo{id=#11010, state=CLOSED,
> stateEnterTime=2023-08-10T09:12:16.522Z,
> pipelineID=PipelineID=e58f6cad-1824-4701-a26a-1d400e9be503, owner=om124}
> org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter
> did not return any nodes. Number of required Nodes 1, Datasize Required:
> 5368709120
> at
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:93)
> at
> org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:284)
> at
> org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:118)
> at
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:773)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:156)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:116)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:165)
> at java.base/java.lang.Thread.run(Thread.java:834)
> 2023-08-10 10:28:07,159 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor:
> Processed 0 containers with health state counts {}, failed processing 1,
> deferred due to load 0
> 2023-08-10 10:28:11,717 INFO [Socket Reader #1 for port
> 9860]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
> recon/[email protected]
> (auth:KERBEROS)
> 2023-08-10 10:28:11,723 INFO [Socket Reader #1 for port
> 9860]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
> Authorization successful for
> recon/[email protected]
> (auth:KERBEROS) for protocol=interface
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol
> 2023-08-10 10:28:12,932 INFO
> [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler:
> A dead datanode is detected.
> 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)
> 2023-08-10 10:28:12,934 INFO
> [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider:
> Send pipeline:PipelineID=942c5e9b-c5d6-44e6-ab4e-2b4c4a12b344 close command
> to datanode 616e9d5b-537a-42cf-83f1-8eb267cf119d
> 2023-08-10 10:28:12,944 INFO
> [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl:
> Pipeline Pipeline[ Id: 942c5e9b-c5d6-44e6-ab4e-2b4c4a12b344, Nodes:
> 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1),
> ReplicationConfig: RATIS/ONE, State:CLOSED,
> leaderId:616e9d5b-537a-42cf-83f1-8eb267cf119d,
> CreationTimestamp2023-08-10T10:10:36.905504Z[UTC]] removed.
> 2023-08-10 10:28:12,951 INFO
> [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler:
> Clearing command queue of size 1 for DN
> 616e9d5b-537a-42cf-83f1-8eb267cf119d(quasar-ssbwak-2.quasar-ssbwak.root.hwx.site/172.27.85.1)
> 2023-08-10 10:28:12,951 INFO
> [EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl:
> Removed a node: /rack_6us5i/616e9d5b-537a-42cf-83f1-8eb267cf119d
> 2023-08-10 10:28:16,637 WARN [IPC Server handler 9 on
> 9863]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Cannot find node for
> address 172.27.76.5
> 2023-08-10 10:28:16,817 INFO [Socket Reader #1 for port
> 9860]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
> om/[email protected] (auth:KERBEROS)
> 2023-08-10 10:28:16,822 INFO [Socket Reader #1 for port
> 9860]-SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
> Authorization successful for
> om/[email protected] (auth:KERBEROS)
> for protocol=interface
> org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol
> 2023-08-10 10:28:16,845 WARN [IPC Server handler 11 on
> 9863]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Cannot find node for
> address 172.27.76.5
> 2023-08-10 10:28:37,160 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter:
> Chosen nodes:
> [d6b31a38-2e8e-4ba9-aed9-cfc5b41456aa(quasar-ssbwak-3.quasar-ssbwak.root.hwx.site/172.27.131.7)].
> isPolicySatisfied: true.
> 2023-08-10 10:28:37,160 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.replication.ReplicationManager:
> Sending command [replicateContainerCommand: containerId=11010,
> replicaIndex=0,
> targetNode=d6b31a38-2e8e-4ba9-aed9-cfc5b41456aa(quasar-ssbwak-3.quasar-ssbwak.root.hwx.site/172.27.131.7),
> priority=NORMAL] for container ContainerInfo{id=#11010, state=CLOSED,
> stateEnterTime=2023-08-10T09:12:16.522Z,
> pipelineID=PipelineID=e58f6cad-1824-4701-a26a-1d400e9be503, owner=om124} to
> c396dcdf-3253-4827-995e-e339ec9527e1(quasar-ssbwak-7.quasar-ssbwak.root.hwx.site/172.27.53.195)
> with datanode deadline 1691663887160 and scm deadline 1691663917160
> 2023-08-10 10:28:37,160 INFO [Under Replicated
> Processor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor:
> Processed 1 containers with health state counts {UNDER_REPLICATED=1}, failed
> processing 0, deferred due to load 0 {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]