Pratyush Bhatt created HDDS-10652:
-------------------------------------
Summary: [Upgrade][EC] Reconstruction failing with
"java.io.IOException: None of the block data have checksum"
Key: HDDS-10652
URL: https://issues.apache.org/jira/browse/HDDS-10652
Project: Apache Ozone
Issue Type: Bug
Components: EC, ECOfflineRecovery
Reporter: Pratyush Bhatt
{color:#172b4d}*Upgrade versions:*
Pre upgrade hash:
[https://github.com/apache/ozone/commit/6ee6c357678676661ebb3181a56622c79b487bc1]
Post upgrade Hash:
[https://github.com/apache/ozone/commit/46b6f3def1d84ca769affb4d3f0d84dece6e8567]
{color}{color:#172b4d}*Scenario:*
Write an EC file (5 GB) with the RS-3-2-1024K policy (used in this case) before
the upgrade. After the upgrade, shut down either 2 parity nodes (as in this
case) or 2 data nodes, since the policy tolerates the failure of 2 DNs. Then
check whether reconstruction happens after some time.
*Observed Behavior:*
1. Data was successfully written pre-upgrade using Freon.
File name:
_o3://ozone1711558189/ec-construct-vol/ec-construct-buck/ec-construction/0_
2. Post upgrade, stop two of the DNs, in this case the parity nodes of one of
the containers storing the above file's data.{color}
{code:java}
ozone admin container info 1004 --json
2024-03-27 21:35:15,065|INFO|MainThread|machine.py:232 -
run()||GUID=183f2d10-e3a7-407f-adb5-b87f3e3af53b|Exit Code: 0
2024-03-27 21:35:15,098|INFO|MainThread|ozone.py:723 -
find_ec_data_parity_hosts()|parity hosts: ['ccycloud-4.quasar-ftvxjz.xyz',
'ccycloud-3.quasar-ftvxjz.xyz']
2024-03-27 21:35:15,098|INFO|MainThread|ozone.py:724 -
find_ec_data_parity_hosts()|data hosts: ['ccycloud-8.quasar-ftvxjz.xyz',
'ccycloud-5.quasar-ftvxjz.xyz', 'ccycloud-1.quasar-ftvxjz.xyz'] {code}
{code:java}
2024-03-27 21:35:15,311|INFO|MainThread|cm_apilib.py:1214 -
stopComponent()|Initiating stop of OZONE_DATANODE at host
ccycloud-4.quasar-ftvxjz.xyz
2024-03-27 21:35:15,349|INFO|MainThread|cm_apilib.py:1218 -
stopComponent()|Command name = Stop , ID = 2860
2024-03-27 21:35:15,580|INFO|MainThread|cm_apilib.py:1214 -
stopComponent()|Initiating stop of OZONE_DATANODE at host
ccycloud-3.quasar-ftvxjz.xyz
2024-03-27 21:35:15,609|INFO|MainThread|cm_apilib.py:1218 -
stopComponent()|Command name = Stop , ID = 2862 {code}
{color:#172b4d}Nodes ccycloud-3.quasar-ftvxjz.xyz and
ccycloud-4.quasar-ftvxjz.xyz are stopped.
3. Read the file's data (online reconstruction) and compute the checksum; it
matched.
4. Wait for reconstruction to happen. The test waited for 20 minutes, but only
3 DNs were still present after that:{color}
{code:java}
['ccycloud-5.quasar-ftvxjz.xyz', 'ccycloud-1.quasar-ftvxjz.xyz',
'ccycloud-8.quasar-ftvxjz.xyz']{code}
In fact, even after 10 hours (at the time of writing), there are still only 3 DNs:
{code:java}
date
Thu Mar 28 08:39:16 UTC 2024
ozone admin container info 1004 --json
{
"containerInfo" : {
"state" : "CLOSED",
"stateEnterTime" : "2024-03-27T18:43:51.934Z",
"replicationConfig" : {
"data" : 3,
"parity" : 2,
"ecChunkSize" : 1048576,
"codec" : "RS",
"requiredNodes" : 5,
"replicationType" : "EC"
},
"usedBytes" : 1342177280,
"numberOfKeys" : 5,
"lastUsed" : "2024-03-28T08:39:24.535189Z",
"owner" : "om1",
"containerID" : 1004,
"deleteTransactionId" : 0,
"sequenceId" : 0,
"deleted" : false,
"open" : false
},
"pipeline" : {
"id" : {
"id" : "73532c14-40ac-4924-9353-2f18ab0d63f2"
},
"replicationConfig" : {
"data" : 3,
"parity" : 2,
"ecChunkSize" : 1048576,
"codec" : "RS",
"requiredNodes" : 5,
"replicationType" : "EC"
},
"nodesInOrder" : [ {
"level" : 0,
"cost" : 0,
"uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"ipAddress" : "10.140.37.12",
"hostName" : "ccycloud-5.quasar-ftvxjz.xyz",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -662262523,
"networkLocation" : "/default",
"networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
"numOfLeaves" : 1
}, {
"level" : 0,
"cost" : 0,
"uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"ipAddress" : "10.140.40.9",
"hostName" : "ccycloud-1.quasar-ftvxjz.xyz",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -1387859873,
"networkLocation" : "/default",
"networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"numOfLeaves" : 1
}, {
"level" : 0,
"cost" : 0,
"uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"ipAddress" : "10.140.137.128",
"hostName" : "ccycloud-8.quasar-ftvxjz.xyz",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : 1098159392,
"networkLocation" : "/default",
"networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"numOfLeaves" : 1
} ],
"creationTimestamp" : "2024-03-28T08:39:24.480Z",
"stateEnterTime" : "2024-03-28T08:39:24.545517Z",
"leaderNode" : {
"level" : 0,
"cost" : 0,
"uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"ipAddress" : "10.140.37.12",
"hostName" : "ccycloud-5.quasar-ftvxjz.xyz",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -662262523,
"networkLocation" : "/default",
"networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
"numOfLeaves" : 1
},
"firstNode" : {
"level" : 0,
"cost" : 0,
"uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"ipAddress" : "10.140.37.12",
"hostName" : "ccycloud-5.quasar-ftvxjz.xyz",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -662262523,
"networkLocation" : "/default",
"networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
"numOfLeaves" : 1
},
"closestNode" : {
"level" : 0,
"cost" : 0,
"uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"ipAddress" : "10.140.37.12",
"hostName" : "ccycloud-5.quasar-ftvxjz.xyz",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -662262523,
"networkLocation" : "/default",
"networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
"numOfLeaves" : 1
},
"allocationTimeout" : false,
"healthy" : true,
"pipelineState" : "ALLOCATED",
"nodes" : [ {
"level" : 0,
"cost" : 0,
"uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"ipAddress" : "10.140.37.12",
"hostName" : "ccycloud-5.quasar-ftvxjz.root.comops.site",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -662262523,
"networkLocation" : "/default",
"networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
"numOfLeaves" : 1
}, {
"level" : 0,
"cost" : 0,
"uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"ipAddress" : "10.140.40.9",
"hostName" : "ccycloud-1.quasar-ftvxjz.root.comops.site",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -1387859873,
"networkLocation" : "/default",
"networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"numOfLeaves" : 1
}, {
"level" : 0,
"cost" : 0,
"uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"ipAddress" : "10.140.137.128",
"hostName" : "ccycloud-8.quasar-ftvxjz.root.comops.site",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : 1098159392,
"networkLocation" : "/default",
"networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"numOfLeaves" : 1
} ],
"empty" : false,
"type" : "EC"
},
"replicas" : [ {
"containerID" : 1004,
"state" : "CLOSED",
"datanodeDetails" : {
"level" : 0,
"cost" : 0,
"uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"ipAddress" : "10.140.37.12",
"hostName" : "ccycloud-5.quasar-ftvxjz.xyz",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -662262523,
"networkLocation" : "/default",
"networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880",
"numOfLeaves" : 1
},
"placeOfBirth" : "6179347f-5824-41d4-b722-f1dbc5f14880",
"sequenceId" : 0,
"keyCount" : 5,
"bytesUsed" : 1342177280,
"replicaIndex" : 2
}, {
"containerID" : 1004,
"state" : "CLOSED",
"datanodeDetails" : {
"level" : 0,
"cost" : 0,
"uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"ipAddress" : "10.140.40.9",
"hostName" : "ccycloud-1.quasar-ftvxjz.root.comops.site",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : -1387859873,
"networkLocation" : "/default",
"networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"numOfLeaves" : 1
},
"placeOfBirth" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c",
"sequenceId" : 0,
"keyCount" : 5,
"bytesUsed" : 1342177280,
"replicaIndex" : 3
}, {
"containerID" : 1004,
"state" : "CLOSED",
"datanodeDetails" : {
"level" : 0,
"cost" : 0,
"uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"ipAddress" : "10.140.137.128",
"hostName" : "ccycloud-8.quasar-ftvxjz.root.comops.site",
"ports" : [ {
"name" : "HTTPS",
"value" : 9883
}, {
"name" : "CLIENT_RPC",
"value" : 9864
}, {
"name" : "REPLICATION",
"value" : 9886
}, {
"name" : "RATIS",
"value" : 9858
}, {
"name" : "RATIS_ADMIN",
"value" : 9857
}, {
"name" : "RATIS_SERVER",
"value" : 9856
}, {
"name" : "STANDALONE",
"value" : 9859
} ],
"setupTime" : 0,
"persistedOpState" : "IN_SERVICE",
"persistedOpStateExpiryEpochSec" : 0,
"initialVersion" : 0,
"currentVersion" : 1,
"decommissioned" : false,
"maintenance" : false,
"signature" : 1098159392,
"networkLocation" : "/default",
"networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e",
"numOfLeaves" : 1
},
"placeOfBirth" : "711656cf-a99e-4b2c-8c35-f015ee94889c",
"sequenceId" : 0,
"keyCount" : 5,
"bytesUsed" : 1342177280,
"replicaIndex" : 1
} ]
} {code}
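The replica check above can be scripted rather than read by eye. The following is a minimal sketch, assuming only the JSON layout shown in the output above ("containerInfo.replicationConfig" and "replicas[].replicaIndex"). The function names and the "fetch" callable are hypothetical helpers for illustration and are not part of Ozone or of the test framework.

```python
import json
import time
from typing import Callable, List


def missing_replica_indexes(info: dict) -> List[int]:
    """Return the EC replica indexes for which no replica is reported."""
    cfg = info["containerInfo"]["replicationConfig"]
    # EC replica indexes run from 1 to data + parity (1..5 for RS-3-2).
    expected = set(range(1, cfg["data"] + cfg["parity"] + 1))
    present = {r["replicaIndex"] for r in info.get("replicas", [])}
    return sorted(expected - present)


def wait_for_reconstruction(fetch: Callable[[], str],
                            timeout_s: float,
                            poll_s: float = 30.0) -> List[int]:
    """Poll until every index has a replica or the timeout expires.

    `fetch` is a hypothetical callable that returns the JSON text of
    the container info. The return value is the list of indexes that
    are still missing (empty when reconstruction has completed).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        missing = missing_replica_indexes(json.loads(fetch()))
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(poll_s)
```

In practice "fetch" could wrap a subprocess call to the command shown above (ozone admin container info 1004 --json) and return its standard output. For the output captured in this report, the helper would report indexes 4 and 5 as missing, which matches the missingIndexes that SCM logs.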
Checked the SCM logs; SCM is still sending reconstructECContainersCommand:
{code:java}
2024-03-28 08:36:56,748 INFO [Under Replicated
Processor]-org.apache.hadoop.hdds.scm.container.replication.ReplicationManager:
Sending command [reconstructECContainersCommand: containerID: 1004,
replicationConfig: EC{rs-3-2-1024k}, sources:
[ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(ccycloud-8.quasar-ftvxjz.root.comops.site/10.140.137.128)
replicaIndex: 1,
6179347f-5824-41d4-b722-f1dbc5f14880(ccycloud-5.quasar-ftvxjz.root.comops.site/10.140.37.12)
replicaIndex: 2,
d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(ccycloud-1.quasar-ftvxjz.root.comops.site/10.140.40.9)
replicaIndex: 3], targets:
[572ed33d-a834-4d80-be35-7b1b19c8bd74(ccycloud-7.quasar-ftvxjz.root.comops.site/10.140.234.130),
711656cf-a99e-4b2c-8c35-f015ee94889c(ccycloud-2.quasar-ftvxjz.root.comops.site/10.140.45.129)],
missingIndexes: [4, 5]] for container ContainerInfo{id=#1004, state=CLOSED,
stateEnterTime=2024-03-27T18:43:51.934Z,
pipelineID=PipelineID=53f5587f-9e6c-465d-a0cb-b82d10c227d3, owner=om1} to
572ed33d-a834-4d80-be35-7b1b19c8bd74(ccycloud-7.quasar-ftvxjz.root.comops.site/10.140.234.130)
with datanode deadline 1711615886747 and scm deadline 1711615916747 {code}
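The two deadlines in the SCM log line are epoch milliseconds. As a small aid for reading them, here is a sketch that converts them to UTC timestamps. It assumes the SCM log timestamps are in UTC (the log itself does not state a timezone); the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime (exact, no floats)."""
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the SCM log line above.
datanode_deadline = epoch_ms_to_utc(1711615886747)
scm_deadline = epoch_ms_to_utc(1711615916747)

# Assumption: the log line's own timestamp (08:36:56,748) is UTC.
sent_at = datetime(2024, 3, 28, 8, 36, 56, 748000, tzinfo=timezone.utc)

print("datanode deadline:", datanode_deadline.isoformat())
print("scm deadline:     ", scm_deadline.isoformat())
print("time allowed on datanode:", datanode_deadline - sent_at)
print("scm grace after datanode:", scm_deadline - datanode_deadline)
```

Under that assumption, the datanode deadline falls at 08:51:26 UTC, roughly 14.5 minutes after the command was sent, and the SCM deadline is 30 seconds later. The failure logged on the target DN occurs well inside that window, so the command is failing rather than timing out.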
Checked one of the target DNs, ccycloud-7.quasar-ftvxjz.root.comops.site; it is
throwing the warnings below:
{code:java}
2024-03-28 08:37:14,982 WARN
[ContainerReplicationThread-5]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask:
FAILED reconstructECContainersCommand: containerID=1004,
replication=rs-3-2-1024k, missingIndexes=[4, 5],
sources={1=ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(ccycloud-8.quasar-ftvxjz.xyz/10.140.137.128),
2=6179347f-5824-41d4-b722-f1dbc5f14880(ccycloud-5.quasar-ftvxjz.xyz/10.140.37.12),
3=d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(ccycloud-1.quasar-ftvxjz.xyz/10.140.40.9)},
targets={4=572ed33d-a834-4d80-be35-7b1b19c8bd74(ccycloud-7.quasar-ftvxjz.xyz/10.140.234.130),
5=711656cf-a99e-4b2c-8c35-f015ee94889c(ccycloud-2.quasar-ftvxjz.xyz/10.140.45.129)}
after 10639 ms
java.io.IOException: None of the block data have checksum which means
2(parity)+1 blocks are not present
at
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:156)
at
org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:325)
at
org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:171)
at
org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
at
org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2024-03-28 08:37:14,982 WARN
[ContainerReplicationThread-5]-org.apache.hadoop.ozone.container.replication.ReplicationSupervisor:
Failed FAILED reconstructECContainersCommand: containerID=1004,
replication=rs-3-2-1024k, missingIndexes=[4, 5],
sources={1=ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(ccycloud-8.quasar-ftvxjz.xyz/10.140.137.128),
2=6179347f-5824-41d4-b722-f1dbc5f14880(ccycloud-5.quasar-ftvxjz.xyz/10.140.37.12),
3=d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(ccycloud-1.quasar-ftvxjz.xyz/10.140.40.9)},
targets={4=572ed33d-a834-4d80-be35-7b1b19c8bd74(ccycloud-7.quasar-ftvxjz.xyz/10.140.234.130),
5=711656cf-a99e-4b2c-8c35-f015ee94889c(ccycloud-2.quasar-ftvxjz.xyz/10.140.45.129)}
{code}
*Expected Behavior:* Reconstruction should have happened.
Note: This reproduces fairly consistently.
cc: [~siddhant]