[jira] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759 ] ruiliang deleted comment on HDFS-15759:
- was (Author: ruilaing): [~weichiu] Hello, our production data also suffers from this kind of EC storage data corruption; the problem is described in [https://github.com/apache/orc/issues/1939]. I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240? Our current HDFS version is 3.1.0. Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> --
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, ec, erasure-coding
> Affects Versions: 3.4.0
> Reporter: Toshihiko Uchida
> Assignee: Toshihiko Uchida
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
> Time Spent: 10h 20m
> Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, HDFS-15186 and HDFS-15240. Those issues occur under specific conditions, and the corruption is neither detected nor auto-healed by HDFS. It is obviously hard for users to monitor data integrity by themselves, and even if they find corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and effective way to verify EC reconstruction correctness on DataNode at each reconstruction process.
> It verifies the correctness of outputs decoded from inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from [d1, d2, d3, d4, d5, p1] and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high probability.
> The task will then also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the failure is gone.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
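The two-step check described in HDFS-15759 can be illustrated with a toy erasure code. The sketch below is hypothetical: it substitutes a single XOR parity unit for the RS-6-3 coder that HDFS actually uses, but the shape of the verification is the same — reconstruct the missing unit, then re-decode one of the inputs from the outputs and compare it with the original.

```python
# Toy sketch of the HDFS-15759 verification idea, using one XOR parity unit
# (p = d0 ^ d1 ^ ... ^ d5) instead of a real RS-6-3 coder. All names here
# are illustrative, not HDFS internals.

def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def reconstruct(available):
    # With a single XOR parity, any one missing unit equals the XOR of the rest.
    return xor_bytes(available)

def verify_reconstruction(inputs, output):
    # Step 1: re-decode one of the inputs using the reconstructed output;
    # Step 2: compare it against the original input.
    probe, rest = inputs[0], inputs[1:]
    return reconstruct(rest + [output]) == probe

data = [bytes([i] * 8) for i in range(6)]   # d0..d5
parity = xor_bytes(data)                    # p0
available = data[:3] + data[4:] + [parity]  # lose d3
rebuilt = reconstruct(available)
assert rebuilt == data[3]
assert verify_reconstruction(available, rebuilt)
# A silently corrupted reconstruction is caught by the check:
corrupted = bytes([rebuilt[0] ^ 1]) + rebuilt[1:]
assert not verify_reconstruction(available, corrupted)
```

With a real Reed-Solomon coder the comparison fails with high probability for any corrupted output, which is the property the proposal relies on.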
[jira] [Created] (HDFS-17628) hdfs ec datanode Decommissioning Stuck and some blk Decommissioning to many nodes indefinitely
ruiliang created HDFS-17628:
---
Summary: hdfs ec datanode Decommissioning Stuck and some blk Decommissioning to many nodes indefinitely
Key: HDFS-17628
URL: https://issues.apache.org/jira/browse/HDFS-17628
Project: Hadoop HDFS
Issue Type: Bug
Components: ec, hdfs
Affects Versions: 3.1.1
Reporter: ruiliang
Attachments: image-2024-09-20-10-47-10-051.png

When datanode decommissioning reaches the last few blocks, it gets stuck, and the log shows the same block being transferred to many nodes indefinitely. The md5 values of the physical block on each node are consistent, indicating that block replication is indeed performed. Has this issue been fixed? Is there a patch available for it? Thank you!

!image-2024-09-20-10-47-10-051.png!

log
{code:java}
xx-dn-12-67-49.hiido.host.xx.xx.com is DECOMMISSIONING
grep 9223372036628464382_15347979 xxx-hdfs-datanode.log
2024-09-20 10:13:32,097 INFO datanode.DataNode (DataNode.java:transferBlock(2328)) - DatanodeRegistration(10.12.67.49:1019, datanodeUuid=e73eb2ed-634b-40bd-a110-21ce485b329c, infoPort=1022, infoSecurePort=0, ipcPort=38010, storageInfo=lv=-57;cid=CID-1becf536-8c05-40cb-a1ff-106923139c5c;nsid=848315649;c=1660893388633) Starting thread to transfer BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 to 10.12.65.86:1019
2024-09-20 10:13:32,264 INFO datanode.DataNode (DataNode.java:run(2541)) - DataTransfer, at xx-dn-12-67-49.hiido.host.xx.xx.com:1019: Transmitted BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 (numBytes=83886080) to /10.12.65.86:1019
2024-09-20 10:13:35,096 INFO datanode.DataNode (DataNode.java:transferBlock(2328)) - DatanodeRegistration(10.12.67.49:1019, datanodeUuid=e73eb2ed-634b-40bd-a110-21ce485b329c, infoPort=1022, infoSecurePort=0, ipcPort=38010, storageInfo=lv=-57;cid=CID-1becf536-8c05-40cb-a1ff-106923139c5c;nsid=848315649;c=1660893388633) Starting thread to transfer BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 to 10.12.66.30:1019
2024-09-20 10:13:35,519 INFO datanode.DataNode (DataNode.java:run(2541)) - DataTransfer, at xx-dn-12-67-49.hiido.host.xx.xx.com:1019: Transmitted BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 (numBytes=83886080) to /10.12.66.30:1019
2024-09-20 10:13:38,096 INFO datanode.DataNode (DataNode.java:transferBlock(2328)) - DatanodeRegistration(10.12.67.49:1019, datanodeUuid=e73eb2ed-634b-40bd-a110-21ce485b329c, infoPort=1022, infoSecurePort=0, ipcPort=38010, storageInfo=lv=-57;cid=CID-1becf536-8c05-40cb-a1ff-106923139c5c;nsid=848315649;c=1660893388633) Starting thread to transfer BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 to 10.12.78.39:1019
2024-09-20 10:13:38,510 INFO datanode.DataNode (DataNode.java:run(2541)) - DataTransfer, at xx-dn-12-67-49.hiido.host.xx.xx.com:1019: Transmitted BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 (numBytes=83886080) to /10.12.78.39:1019
2024-09-20 10:13:44,095 INFO datanode.DataNode (DataNode.java:transferBlock(2328)) - DatanodeRegistration(10.12.67.49:1019, datanodeUuid=e73eb2ed-634b-40bd-a110-21ce485b329c, infoPort=1022, infoSecurePort=0, ipcPort=38010, storageInfo=lv=-57;cid=CID-1becf536-8c05-40cb-a1ff-106923139c5c;nsid=848315649;c=1660893388633) Starting thread to transfer BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 to 10.12.66.85:1019
2024-09-20 10:13:44,599 INFO datanode.DataNode (DataNode.java:run(2541)) - DataTransfer, at xx-dn-12-67-49.hiido.host.xx.xx.com:1019: Transmitted BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 (numBytes=83886080) to /10.12.66.85:1019
2024-09-20 10:13:50,097 INFO datanode.DataNode (DataNode.java:transferBlock(2328)) - DatanodeRegistration(10.12.67.49:1019, datanodeUuid=e73eb2ed-634b-40bd-a110-21ce485b329c, infoPort=1022, infoSecurePort=0, ipcPort=38010, storageInfo=lv=-57;cid=CID-1becf536-8c05-40cb-a1ff-106923139c5c;nsid=848315649;c=1660893388633) Starting thread to transfer BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 to 10.12.67.42:1019
2024-09-20 10:13:50,514 INFO datanode.DataNode (DataNode.java:run(2541)) - DataTransfer, at xx-dn-12-67-49.hiido.host.xx.xx.com:1019: Transmitted BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036628464382_15347979 (numBytes=83886080) to /10.12.67.42:1019
2024-09-20 10:13:53,095 INFO datanode.DataNode (DataNode.java:transferBlock(2328)) - DatanodeRegistration(10.12.67.49:1019, datanodeUuid=e73eb2ed-634b-40bd-a110-21ce485b329c, infoPort=1022, infoSecurePort=0, ipcPort=38010, storageInfo=lv=-57;cid=CID-1becf536-8c05-40cb-a1ff-106923139c5c;nsid=848315649;c=1660893388633) Starting thread to transfer BP-1822992414-10.12.65.48-1660893388633:blk_-92
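A quick way to confirm from a DataNode log that one block keeps being re-transferred during decommissioning is to count distinct transfer targets per block. A small sketch (the regex assumes the "Starting thread to transfer" format shown in the log above; the sample lines are abbreviated):

```python
# Count distinct transfer targets per block in a DataNode log, to spot a
# block that decommissioning keeps re-replicating. The regex matches the
# "Starting thread to transfer <pool>:<blk> to <ip:port>" lines above.
import re
from collections import defaultdict

PAT = re.compile(r"Starting thread to transfer \S+:(blk_-?\d+_\d+) to (\S+)")

def transfer_targets(log_lines):
    targets = defaultdict(list)
    for line in log_lines:
        m = PAT.search(line)
        if m:
            targets[m.group(1)].append(m.group(2))
    return targets

log = [
    "... Starting thread to transfer BP-1822992414:blk_-9223372036628464382_15347979 to 10.12.65.86:1019",
    "... Starting thread to transfer BP-1822992414:blk_-9223372036628464382_15347979 to 10.12.66.30:1019",
    "... Starting thread to transfer BP-1822992414:blk_-9223372036628464382_15347979 to 10.12.78.39:1019",
]
targets = transfer_targets(log)
# The same block was pushed to three different nodes within seconds:
assert targets["blk_-9223372036628464382_15347979"] == [
    "10.12.65.86:1019", "10.12.66.30:1019", "10.12.78.39:1019"]
```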
[jira] (HDFS-11242) Add refresh cluster network topology operation to dfs admin
[ https://issues.apache.org/jira/browse/HDFS-11242 ] ruiliang deleted comment on HDFS-11242:
- was (Author: ruilaing): The correct process for adding a new node with the correct mapping is:
1. Update the topology file with the new DN.
2. Issue dfsadmin -refreshNodes to update the new topology mapping in the NN.
3. Start the DN only after (2), so that it picks up the correct mapping and the default mapping is not cached.

> Add refresh cluster network topology operation to dfs admin
> ---
>
> Key: HDFS-11242
> URL: https://issues.apache.org/jira/browse/HDFS-11242
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.0.0-alpha1
> Reporter: Reid Chan
> Priority: Minor
> Labels: pull-request-available
> Attachments: HDFS-11242.002.patch, HDFS-11242.patch
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The network topology and DNS-to-switch mapping are initialized at the start of the namenode.
> If an admin wants to change the topology because new datanodes were added, he has to stop and restart the namenode(s); otherwise those newly added datanodes are squeezed under /default-rack.
> It is a low-frequency operation, but it should be operated appropriately, so dfs admin should take the responsibility.
[jira] [Commented] (HDFS-11242) Add refresh cluster network topology operation to dfs admin
[ https://issues.apache.org/jira/browse/HDFS-11242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868588#comment-17868588 ] ruiliang commented on HDFS-11242:
- The correct process for adding a new node with the correct mapping is:
1. Update the topology file with the new DN.
2. Issue dfsadmin -refreshNodes to update the new topology mapping in the NN.
3. Start the DN only after (2), so that it picks up the correct mapping and the default mapping is not cached.
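The three-step ordering from the comment above can be sketched as follows. This is a hypothetical walk-through: the paths and the rack value are illustrative, and it assumes the rack mapping comes from a table file (net.topology.table.file.name with the TableMapping class); with a mapping script (net.topology.script.file.name) the idea is the same.

```shell
# 1. Add the new DataNode's rack mapping on the NameNode host first
#    (file path and rack name are illustrative):
echo "10.12.99.10  /rack-12" >> /etc/hadoop/conf/topology.table

# 2. Make the NameNode re-read its node lists before the new DN registers:
hdfs dfsadmin -refreshNodes

# 3. Only now start the DataNode, so its registration resolves to the real
#    rack instead of a cached /default-rack mapping:
hdfs --daemon start datanode
```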
[jira] [Updated] (HDFS-17589) hdfs EC data new blk reconstruct old blk not delete
[ https://issues.apache.org/jira/browse/HDFS-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17589:
Description:
Earlier, the cluster was unstable: DataNodes kept losing connection and recovering, which triggered a lot of EC data reconstruction, but many old blocks were not cleaned up correctly. Has this been fixed? Which patch do I need to apply? Thank you.
The following is a detailed check log.
ok: blk_-9223372036371044652 in 10.12.66.225
{color:#de350b}error: blk_-9223372036371044652 in 10.12.66.154(/data3/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044652) {color}
{color:#de350b}Why was it not deleted?{color}
{code:java}
datanode delete data ec blk ?
grep blk_-9223372036371044656 hadoop-hdfs-root-datanode-fs-hiido-dn-12-66-111.hiido.host.xxx.com.log
2024-07-18 17:25:07,879 INFO datanode.DataNode (DataXceiver.java:writeBlock(738)) - Receiving BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 src: /10.12.66.111:25066 dest: /10.12.66.111:1019
2024-07-18 17:25:17,396 INFO datanode.DataNode (StripedBlockReconstructor.java:run(86)) - ok EC reconstruct striped block: BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 blockId: -9223372036371044656
2024-07-18 17:25:17,396 INFO datanode.DataNode (DataXceiver.java:writeBlock(914)) - Received BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 src: /10.12.66.111:25066 dest: /10.12.66.111:1019 of size 193986560
2024-07-18 17:25:25,465 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling blk_-9223372036371044656_1688858793 replica FinalizedReplica, blk_-9223372036371044656_1688858793, FINALIZED getBlockURI() = file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656 for deletion
2024-07-18 17:25:25,746 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1822992414-10.12.65.48-1660893388633 blk_-9223372036371044656_1688858793 URI file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
my config dfs.blockreport.intervalMsec =2160
namenode3 log
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 04:34:39,523 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 04:34:40,131 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 10:34:38,950 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 10:34:39,559 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18 16:34:38,564 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18 16:34:39,190 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 04:34:39,462 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 04:34:40,083 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 10:34:39,686 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-922337203637104465
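One way to cross-check by hand whether a replica left on disk is still among the block's NameNode-known locations (a sketch; the block ID and data directories are taken from the check log above, and your layout may differ):

```shell
# Ask the NameNode which file and locations this block belongs to:
hdfs fsck -blockId blk_-9223372036371044652

# On the suspect DataNode, find the physical replica file:
find /data*/hadoop/dfs/data -name 'blk_-9223372036371044652*'

# If the file exists locally but fsck does not list this DataNode among the
# block's locations, the replica is an orphan left over from reconstruction;
# it should normally be invalidated after the next full block report
# (dfs.blockreport.intervalMsec).
```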
[jira] [Updated] (HDFS-17589) hdfs EC data new blk reconstruct old blk not delete
[ https://issues.apache.org/jira/browse/HDFS-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17589:
Summary: hdfs EC data new blk reconstruct old blk not delete (was: hdfs EC data old blk reconstruct old blk not delete)

> hdfs EC data new blk reconstruct old blk not delete
> --
>
> Key: HDFS-17589
> URL: https://issues.apache.org/jira/browse/HDFS-17589
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.1.1
> Reporter: ruiliang
> Priority: Major
>
> Earlier, the cluster was unstable: DataNodes kept losing connection and recovering, which triggered a lot of EC data reconstruction, but many old blocks were not cleaned up correctly. Has this been fixed? Which patch do I need to apply? Thank you.
[jira] [Updated] (HDFS-17589) hdfs EC data old blk reconstruct old blk not delete
[ https://issues.apache.org/jira/browse/HDFS-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17589:
Description:
Earlier, the cluster was unstable: DataNodes kept losing connection and recovering, which triggered a lot of EC data reconstruction, but many old blocks were not cleaned up correctly. Has this been fixed? Which patch do I need to apply? Thank you.
The following is a detailed check log.
ok: blk_-9223372036371044652 in 10.12.66.225
{color:#de350b}error: blk_-9223372036371044652 in 10.12.66.154(/data3/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044652) {color}
[jira] [Updated] (HDFS-17589) hdfs EC data old blk reconstruct old blk not delete
[ https://issues.apache.org/jira/browse/HDFS-17589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17589:
Description:
Earlier, the cluster was unstable: DataNodes kept losing connection and recovering, which triggered a lot of EC data reconstruction, but many old blocks were not cleaned up correctly. Has this been fixed? Which patch do I need to apply? Thank you.
The following is a detailed check log.
[jira] [Created] (HDFS-17589) hdfs EC data old blk reconstruct old blk not delete
ruiliang created HDFS-17589: --- Summary: hdfs EC data old blk reconstruct old blk not delete Key: HDFS-17589 URL: https://issues.apache.org/jira/browse/HDFS-17589 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.1.1 Reporter: ruiliang The reason is that the cluster was faulty before, and Datanodes kept losing connections and recovering, resulting in a lot of EC data reconstruct, but a lot of old blk failed to clean up correctly. Has this been repaired? What patch do I need to add, thank you The following is a detailed check log {code:java} datanode delete data ec blk ? grep blk_-9223372036371044656 hadoop-hdfs-root-datanode-fs-hiido-dn-12-66-111.hiido.host.xxx.com.log 2024-07-18 17:25:07,879 INFO datanode.DataNode (DataXceiver.java:writeBlock(738)) - Receiving BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 src: /10.12.66.111:25066 dest: /10.12.66.111:1019 2024-07-18 17:25:17,396 INFO datanode.DataNode (StripedBlockReconstructor.java:run(86)) - ok EC reconstruct striped block: BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 blockId: -9223372036371044656 2024-07-18 17:25:17,396 INFO datanode.DataNode (DataXceiver.java:writeBlock(914)) - Received BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 src: /10.12.66.111:25066 dest: /10.12.66.111:1019 of size 193986560 2024-07-18 17:25:25,465 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling blk_-9223372036371044656_1688858793 replica FinalizedReplica, blk_-9223372036371044656_1688858793, FINALIZED getBlockURI() = file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656 for deletion 2024-07-18 17:25:25,746 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1822992414-10.12.65.48-1660893388633 blk_-9223372036371044656_1688858793 URI 
file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656 {code}
My config: dfs.blockreport.intervalMsec = 2160
namenode3 log:
{code:java}
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 04:34:39,523 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 04:34:40,131 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 10:34:38,950 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 10:34:39,559 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18 16:34:38,564 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18 16:34:39,190 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 04:34:39,462 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 04:34:40,083 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 10:34:39,686 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 1
{code}
[jira] [Updated] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
[ https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17535: Description: I have learned that EC does indeed have a file-corruption bug: https://issues.apache.org/jira/browse/HDFS-15759
1: I have confirmed the EC corrupted file. Can this corrupted file be recovered? Important data is affected and is causing a production data-loss problem for us. Is there a way to recover it?
Checking EC block group: blk_-9223372036361352768
Status: ERROR, message: EC compute result not match. The IP is 10.12.66.116 and the block is -9223372036361352765.
2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
HDFS version 3.1.0. Thank you.
Latest findings: It is a machine network problem; the CPU si (soft interrupt) usage is too high, so the NameNode loses the DataNode heartbeat and the NameNode tells the DataNodes to recover and reconstruct. Because the Weaver-Scope service of k8s is installed on the server, conntrack interrupt timeouts are severe, affecting all network usage.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
[ https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17535: Description: 我了解到 EC 确实存在文件损坏的重大错误 https://issues.apache.org/jira/browse/HDFS-15759 1:我已确认 EC 损坏文件,此损坏文件可以恢复吗? 有重要数据导致我们生产数据丢失问题?有办法恢复吗? 检查 EC 块组:blk_-9223372036361352768 状态:错误,消息:EC 计算结果不匹配。:ip 为 10.12.66.116 块为:-9223372036361352765 2:[https://github.com/apache/orc/issues/1939]我想知道如果你选择了你当前的代码(GitHub pull request #2869),我可以跳过与HDFS-14768,HDFS-15186, 和HDFS-15240? hdfs 版本 3.1.0 谢谢 Latest findings: It is a machine network problem, the cpu si(soft interrupt) is too high, nn loses dn heartbeat, nn sends to dn to recover and reconstruct. Because the Weaver-Scope service of k8s is installed on the server, conntrack interruption times out seriously, affecting all network usage. was: I learned that EC does have a major bug with file corrupt https://issues.apache.org/jira/browse/HDFS-15759 1:I have confirmed the EC corrupt file, can this corrupt file be restored? Have important data that is causing us production data loss issues? Is there a way to recover Checking EC block group: blk_-9223372036361352768 Status: ERROR, message: EC compute result not match.:ip is 10.12.66.116 block is : -9223372036361352765 2:[https://github.com/apache/orc/issues/1939] I was wondering if cherry picked your current code (GitHub pull request #2869), Can I skip patches related to HDFS-14768,HDFS-15186, and HDFS-15240? hdfs version 3.1.0 thank you > I have confirmed the EC corrupt file, can this corrupt file be restored? > > > Key: HDFS-17535 > URL: https://issues.apache.org/jira/browse/HDFS-17535 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Blocker > > 我了解到 EC 确实存在文件损坏的重大错误 > https://issues.apache.org/jira/browse/HDFS-15759 > 1:我已确认 EC 损坏文件,此损坏文件可以恢复吗? > 有重要数据导致我们生产数据丢失问题?有办法恢复吗? 
> 检查 EC 块组:blk_-9223372036361352768 > 状态:错误,消息:EC 计算结果不匹配。:ip 为 10.12.66.116 块为:-9223372036361352765 > 2:[https://github.com/apache/orc/issues/1939]我想知道如果你选择了你当前的代码(GitHub pull > request #2869),我可以跳过与HDFS-14768,HDFS-15186, 和HDFS-15240? > hdfs 版本 3.1.0 > 谢谢 > > Latest findings: It is a machine network problem, the cpu si(soft interrupt) > is too high, nn loses dn heartbeat, nn sends to dn to recover and reconstruct. > Because the Weaver-Scope service of k8s is installed on the server, conntrack > interruption times out seriously, affecting all network usage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
[ https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854008#comment-17854008 ] ruiliang edited comment on HDFS-17535 at 6/27/24 1:34 PM: -- [https://github.com/liangrui1988/hadoop-client-op/blob/main/src/main/java/com/yy/bigdata/orc/OpenFileLine.java] After studying this for a long time, I have implemented a bad-block recovery method. For two bad blocks (RS-3-1024), it reads the [orc, txt, txt gzip, parquet] file up to 10 times, each time excluding a different datanode, and then checks whether the resulting [orc, txt, txt gzip, parquet] file is valid. The recovery program works as follows; it includes changes to the source code, and the relevant jar is in the lib/ directory.
{code:java}
1: check the EC file & return the single-block error with its datanode IP info
2: read the EC file, skipping the error datanode, and copy the data to a new dir
3: check that the copy is readable as ORC (verify according to your own file format)
4: if error blocks > 1, read the data from all datanodes
5: if it still cannot be recovered from any datanode, the data is completely unrecoverable{code}
It would be best if the community provided this feature officially.
> I have confirmed the EC corrupt file, can this corrupt file be restored? > > > Key: HDFS-17535 > URL: https://issues.apache.org/jira/browse/HDFS-17535 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Blocker > > I learned that EC does have a major bug with file corrupt > https://issues.apache.org/jira/browse/HDFS-15759 > 1:I have confirmed the EC corrupt file, can this corrupt file be restored? > Have important data that is causing us production data loss issues? Is > there a way to recover > Checking EC block group: blk_-9223372036361352768 > Status: ERROR, message: EC compute result not match.:ip is 10.12.66.116 block > is : -9223372036361352765 > 2:[https://github.com/apache/orc/issues/1939] I was wondering if cherry > picked your current code (GitHub pull request #2869), Can I skip patches > related to HDFS-14768,HDFS-15186, and HDFS-15240? > hdfs version 3.1.0 > thank you
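The exclude-and-retry recovery described in the comment above can be sketched as a small loop. This is a minimal illustration only, not the author's actual tool: `readExcluding` and `isValid` are hypothetical stand-ins for an HDFS read that avoids a given DataNode and for a format check such as ORC footer validation.

```java
import java.util.*;
import java.util.function.*;

public class EcRecoverySketch {
    // Retry the read, excluding one candidate DataNode at a time, until the
    // format-specific validator accepts the bytes (steps 1-3 above).
    static Optional<byte[]> recover(List<String> datanodes,
                                    Function<Set<String>, byte[]> readExcluding,
                                    Predicate<byte[]> isValid) {
        for (String dn : datanodes) {
            byte[] candidate = readExcluding.apply(Set.of(dn));
            if (isValid.test(candidate)) {
                return Optional.of(candidate);   // a combination that decodes cleanly
            }
        }
        return Optional.empty();                 // no single exclusion works: data stays blocked
    }

    public static void main(String[] args) {
        // Toy stand-ins: pretend "dn2" holds the corrupted replica, so any read
        // that excludes it yields valid data.
        List<String> dns = List.of("dn1", "dn2", "dn3");
        Function<Set<String>, byte[]> read =
            excluded -> excluded.contains("dn2") ? "good".getBytes() : "bad!".getBytes();
        Predicate<byte[]> valid = b -> new String(b).equals("good");

        System.out.println(recover(dns, read, valid).isPresent());  // prints "true"
    }
}
```

With more than one bad block (step 4), the same loop would have to iterate over combinations of excluded DataNodes instead of single nodes.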
[jira] [Updated] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
[ https://issues.apache.org/jira/browse/HDFS-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17535: Description: I have learned that EC does have a major file-corruption bug: https://issues.apache.org/jira/browse/HDFS-15759
1: I have confirmed the EC corrupted file. Can this corrupted file be recovered? Important data is affected, causing production data-loss issues for us. Is there a way to recover it?
Checking EC block group: blk_-9223372036361352768
Status: ERROR, message: EC compute result not match. The IP is 10.12.66.116 and the block is -9223372036361352765.
2: [https://github.com/apache/orc/issues/1939] I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
HDFS version 3.1.0. Thank you.
[jira] [Updated] (HDFS-17547) debug verifyEC check error
[ https://issues.apache.org/jira/browse/HDFS-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17547: Attachment: image-2024-06-07-16-02-07-480.png Description: When I validate a block that has been corrupted many times, why does it appear normal?
{code:java}
hdfs debug verifyEC -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK {code}
The ByteBuffer hb contents show all zeros [0..]:
!image-2024-06-07-16-02-07-480.png!
{code:java}
buffers = {ByteBuffer[5]@3270}
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
   hb = {byte[65536]@3511} [0, 0, 0, 0, ... +65,436 more]
buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) = true ?
outputs = {ByteBuffer[2]@3271}
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
   hb = {byte[65536]@3459} [0, 0, 0, 0, ... +65,436 more]{code}
Can this situation be judged as an anomaly?
Checking the ORC file then fails:
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
    at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
    at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
    at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
    at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
    at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
    at org.apache.orc.tools.FileDump.main(FileDump.java:137)
    at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
    at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
    at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
    at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
    at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
    at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
    at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
    at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
    at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
    at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
    at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
    at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
    ... 6 more {code}
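HDFS-15759's verification idea (decode one input back from the reconstructed outputs and compare it with the original) can be illustrated with a toy single-parity XOR code rather than the real Reed-Solomon coder. Note that all-zero buffers pass such a comparison trivially, which is consistent with verifyEC reporting OK on zeroed data. This is an illustrative sketch only, not the HDFS implementation:

```java
import java.util.Arrays;

public class XorVerifyDemo {
    // XOR several equal-length buffers together (single-parity encode/decode).
    static byte[] xor(byte[]... bufs) {
        byte[] out = new byte[bufs[0].length];
        for (byte[] b : bufs)
            for (int i = 0; i < out.length; i++) out[i] ^= b[i];
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3}, d1 = {4, 5, 6}, d2 = {7, 8, 9};
        byte[] parity = xor(d0, d1, d2);               // encode: p = d0 ^ d1 ^ d2

        byte[] rebuiltD1 = xor(d0, d2, parity);        // reconstruct a lost d1
        byte[] recheckD0 = xor(rebuiltD1, d2, parity); // verify: decode d0 back

        System.out.println(Arrays.equals(recheckD0, d0));  // true for a correct rebuild

        // An all-zero stripe also passes the comparison: zeros decode to zeros,
        // so this check alone cannot flag a block that was silently zeroed out.
        byte[] z = new byte[3];
        System.out.println(Arrays.equals(xor(z, z, z), z));  // also true
    }
}
```

The second print shows why a verification that only compares decoded buffers reports OK when every cell of the stripe is zero, even if the on-disk data should not have been zero.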
[jira] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759 ] ruiliang deleted comment on HDFS-15759: - was (Author: ruilaing): When I validate a block that has been corrupted many times, why does it appear normal?
The ByteBuffer hb contents show all zeros [0..]:
{code:java}
buffers = {ByteBuffer[5]@3270}
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
   hb = {byte[65536]@3511} [0, 0, 0, 0, ... +65,436 more]
buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) = true ?
outputs = {ByteBuffer[2]@3271}
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
   hb = {byte[65536]@3459} [0, 0, 0, 0, ... +65,436 more]{code}
Can this situation be judged as an anomaly?
{code:java}
hdfs debug verifyEC -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK {code}
Checking the ORC file then fails:
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
    ... 6 more {code}
[jira] [Created] (HDFS-17547) debug verifyEC check error
ruiliang created HDFS-17547: --- Summary: debug verifyEC check error Key: HDFS-17547 URL: https://issues.apache.org/jira/browse/HDFS-17547 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-common Reporter: ruiliang

When I verify a block whose contents have already been corrupted, why is it still reported as normal?
{code:java}
hdfs debug verifyEC -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
In the debugger, the ByteBuffer hb arrays contain only zeros:
{code:java}
buffers = {ByteBuffer[5]@3270}
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, ..., +65,436 more]

buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) == true ?

outputs = {ByteBuffer[2]@3271}
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, ..., +65,436 more]
{code}
Should this situation be judged an anomaly?
check orc file {code:java} Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file. at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:360) at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879) at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276) at org.apache.orc.tools.FileDump.main(FileDump.java:137) at org.apache.orc.tools.Driver.main(Driver.java:124) Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481) at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528) at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507) at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59) at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333) at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221) at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201) at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943) at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112) at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251) at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290) at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333) at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:355) ... 
6 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
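One plausible reading of the "Status: OK" above: a check of this kind validates that data and parity units agree with each other, not that they match the bytes originally written. The sketch below is a hypothetical toy model (a RAID-5-style XOR single parity standing in for RS-6-3, not the real verifyEC code): if a buggy reconstruction zeroed a data block and the parity was later derived from the corrupted data, the stripe is internally consistent and the check passes.

```python
def xor_parity(data_blocks):
    """Compute a single XOR parity block over equal-length data blocks."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def stripe_is_consistent(data_blocks, parity):
    """verifyEC-style check: recompute parity and compare."""
    return xor_parity(data_blocks) == parity

# Original healthy stripe.
data = [bytes([1, 2, 3]), bytes([4, 5, 6])]
parity = xor_parity(data)
assert stripe_is_consistent(data, parity)

# A buggy reconstruction wrote zeros into block 0, and the parity now on
# disk matches the corrupted data: the stripe is internally consistent,
# so the check reports OK even though the file content is damaged.
data[0] = bytes(3)
parity = xor_parity(data)
print(stripe_is_consistent(data, parity))  # True -> "Status: OK"
```

Under this model, a consistency check cannot distinguish "never corrupted" from "corrupted before the parity was last computed", which would explain an OK verdict over an unreadable ORC file.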
[jira] [Commented] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853065#comment-17853065 ] ruiliang commented on HDFS-15759: - When I verify a block whose contents have already been corrupted, why is it still reported as normal? The ByteBuffer hb arrays show only zeros: !image-2024-06-07-15-52-26-294.png! Should this situation be judged an anomaly?
{code:java}
hdfs debug verifyEC -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
Checking the ORC file:
{code:java}
Structure for skip_ip/_skip_file
File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
 at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:360)
 at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
 at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
 at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
 at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
 at org.apache.orc.tools.FileDump.main(FileDump.java:137)
 at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small.
size = 131072 needed = 7752508 in column 3 kind LENGTH at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481) at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528) at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507) at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59) at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333) at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221) at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201) at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943) at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112) at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251) at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290) at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333) at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:355) ... 6 more {code} > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.2.3 > > Time Spent: 10h 20m > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. 
It is obviously > hard for users to monitor data integrity by themselves, and even if they find > corrupted data, it is difficult or sometimes impossible to recover them. > To prevent further data corruption issues, this feature proposes a simple and > effective way to verify EC reconstruction correctness on DataNode at each > reconstruction process. > It verifies correctness of outputs decoded from inputs as follows: > 1. Decoding an input with the outputs; > 2. Compare the decoded input with the original input. > For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs > [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from > [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0. > When an EC reconstruction task goes wrong, the comparison will fail with high > probability. > Then the task will also fail and be retried by NameNode. > The next reconstruction will succeed if the condition triggered the failure > is gone.
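The two verification steps quoted above can be sketched with a toy single-parity XOR code standing in for RS-6-3 (an illustrative simplification; the real DataNode path uses the RS decoder). Reconstruct a lost block from the surviving units, then decode one of the original inputs back from the freshly produced output and compare:

```python
def xor_blocks(*blocks):
    """XOR equal-length blocks together (toy single-parity 'decoder')."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# Healthy stripe: parity p = d0 ^ d1 ^ d2.
d0, d1, d2 = bytes([1, 1]), bytes([2, 2]), bytes([3, 3])
p = xor_blocks(d0, d1, d2)

# Reconstruction: d1 is lost; decode it from the surviving inputs.
d1_rec = xor_blocks(d0, d2, p)

# Step 1: decode one *input* (d0) using the freshly decoded output.
d0_rec = xor_blocks(d1_rec, d2, p)

# Step 2: compare the decoded input with the original input.
assert d1_rec == d1
assert d0_rec == d0  # verification passes for a correct reconstruction

# Had reconstruction gone wrong (e.g. produced all zeros), step 2 fails:
bad_d1 = bytes(2)
assert xor_blocks(bad_d1, d2, p) != d0  # verification catches the error
```

As the description notes, a wrong output makes the round-trip comparison fail with high probability, so the reconstruction task fails and the NameNode retries it.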
[jira] [Created] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
ruiliang created HDFS-17535: --- Summary: I have confirmed the EC corrupt file, can this corrupt file be restored? Key: HDFS-17535 URL: https://issues.apache.org/jira/browse/HDFS-17535 Project: Hadoop HDFS Issue Type: Bug Components: ec, hdfs Affects Versions: 3.1.0 Reporter: ruiliang

I have learned that EC does have a major file-corruption bug: https://issues.apache.org/jira/browse/HDFS-15759
1: I have confirmed the corrupt EC file; can this corrupt file be restored? It holds important data, and this is causing production data-loss issues for us. Is there a way to recover it? corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]}
2: https://github.com/apache/orc/issues/1939 I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240? Our HDFS version is 3.1.0. Thank you.
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849217#comment-17849217 ] ruiliang commented on HDFS-15186: - I have confirmed the corrupt EC file; can this corrupt file be restored? It holds important data, and this is causing production data-loss issues for us. Is there a way to recover it? corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups \{blk_-xx[blk_-xx]} hdfs version 3.1.0

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> Key: HDFS-15186 URL: https://issues.apache.org/jira/browse/HDFS-15186 Project: Hadoop HDFS Issue Type: Bug Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3 Reporter: Yao Guangdong Assignee: Yao Guangdong Priority: Critical Fix For: 3.3.0
> Attachments: HDFS-15186.001.patch, HDFS-15186.002.patch, HDFS-15186.003.patch, HDFS-15186.004.patch, HDFS-15186.005.patch
>
> I found that some parity blocks' content is all 0 after decommissioning more than one DataNode from a cluster, and the probability is very high (parts per thousand). This is a serious problem: if we read data from a zero parity block, or use a zero parity block to recover another block, we may consume wrong data without even knowing it.
> Some cases are shown below:
> B: busy DataNode, D: decommissioning DataNode, others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode must recover 2 internal blocks in the current code. But from liveIndices we can only find 1 missing block, so the method StripedWriter#initTargetIndices uses 0 as the default recovery index without checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to recover indices [6, 0], so index 0 appears in both the source indices and the target indices. In this case the returned target buffer for index 6 is always 0. I first thought this was the EC algorithm's problem, since it should be more fault tolerant, and tried to fix it there, but that was too hard because there are too many cases. (The second example above is another such case: source indices [1, 2, 3, 4, 5, 7] used to recover indices [0, 6, 0].) So I changed my mind and instead invoke the EC algorithm with correct parameters, which means removing the duplicate target index 0 in this case. That is how I finally fixed it.
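The fix described above, invoking the EC algorithm with correct parameters by dropping target indices that duplicate a live source, can be sketched as follows. The function name and shape are illustrative only, not the actual StripedWriter API:

```python
def sanitize_target_indices(source_indices, target_indices):
    """Keep only targets that are genuinely missing: not already among
    the sources and not repeated within the target list itself."""
    seen = set(source_indices)
    cleaned = []
    for idx in target_indices:
        if idx not in seen:
            cleaned.append(idx)
            seen.add(idx)
    return cleaned

# First case from the description: sources [0..5], buggy targets [6, 0]
# where index 0 defaulted in even though it is already a live source.
print(sanitize_target_indices([0, 1, 2, 3, 4, 5], [6, 0]))      # [6]

# Second case: sources [1, 2, 3, 4, 5, 7], buggy targets [0, 6, 0].
print(sanitize_target_indices([1, 2, 3, 4, 5, 7], [0, 6, 0]))   # [0, 6]
```

With the duplicate index removed, the decoder is only asked for blocks that are actually missing, so it no longer returns an all-zero buffer for a real target.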
[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806 ] ruiliang edited comment on HDFS-15759 at 5/23/24 4:52 AM: -- [~weichiu] Hello, our current production data also has this kind of EC storage data damage problem, about the problem description [https://github.com/apache/orc/issues/1939] I was wondering if cherry picked your current code (GitHub pull request #2869), Can I skip patches related to HDFS-14768,HDFS-15186, and HDFS-15240? The current version of hdfs is 3.1.0. Thank you! was (Author: ruilaing): Hello, our current production data also has this kind of EC storage data damage problem, about the problem description https://github.com/apache/orc/issues/1939 I was wondering if cherry picked your current code (GitHub pull request #2869), Can I skip patches related to HDFS-14768,HDFS-15186, and HDFS-15240? The current version of hdfs is 3.1.0. Thank you! > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.2.3 > > Time Spent: 10h 20m > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. It is obviously > hard for users to monitor data integrity by themselves, and even if they find > corrupted data, it is difficult or sometimes impossible to recover them. > To prevent further data corruption issues, this feature proposes a simple and > effective way to verify EC reconstruction correctness on DataNode at each > reconstruction process. 
> It verifies correctness of outputs decoded from inputs as follows: > 1. Decoding an input with the outputs; > 2. Compare the decoded input with the original input. > For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs > [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from > [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0. > When an EC reconstruction task goes wrong, the comparison will fail with high > probability. > Then the task will also fail and be retried by NameNode. > The next reconstruction will succeed if the condition triggered the failure > is gone.
[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806 ] ruiliang edited comment on HDFS-15759 at 5/23/24 3:52 AM: -- Hello, our current production data also has this kind of EC storage data damage problem, about the problem description https://github.com/apache/orc/issues/1939 I was wondering if cherry picked your current code (GitHub pull request #2869), Can I not repair the patches related to HDFS-14768,HDFS-15186, and HDFS-15240? The current version of hdfs is 3.1.0. Thank you! was (Author: ruilaing): Hello, our current production data also has this kind of EC storage data damage problem, about the problem description https://github.com/apache/orc/issues/1939 I was wondering if cherry picked your current code (GitHub pull request #2869), Can I not repair the patches related to HDFS-14768,HDFS-15186, and HDFS-15240? The current version of hdfs is 3.1.0. Thank you! > EC: Verify EC reconstruction correctness on DataNode > > > Key: HDFS-15759 > URL: https://issues.apache.org/jira/browse/HDFS-15759 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, ec, erasure-coding >Affects Versions: 3.4.0 >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.2.3 > > Time Spent: 10h 20m > Remaining Estimate: 0h > > EC reconstruction on DataNode has caused data corruption: HDFS-14768, > HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and > the corruption is neither detected nor auto-healed by HDFS. It is obviously > hard for users to monitor data integrity by themselves, and even if they find > corrupted data, it is difficult or sometimes impossible to recover them. > To prevent further data corruption issues, this feature proposes a simple and > effective way to verify EC reconstruction correctness on DataNode at each > reconstruction process. 
> It verifies correctness of outputs decoded from inputs as follows: > 1. Decoding an input with the outputs; > 2. Compare the decoded input with the original input. > For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs > [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from > [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0. > When an EC reconstruction task goes wrong, the comparison will fail with high > probability. > Then the task will also fail and be retried by NameNode. > The next reconstruction will succeed if the condition triggered the failure > is gone. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-17407) Exception during image upload
[ https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824329#comment-17824329 ] ruiliang edited comment on HDFS-17407 at 3/7/24 9:29 AM:
-
After analyzing the logs and the source code, the cause is that the two standby NameNodes (SbNNs) initiated a checkpoint at the same time. When the later one checked the file stream, it found that the file had already been updated and threw an exception. Should this really be reported as an exception?
SbNN 1 log
{code:java}
root@cluster06-yynn1:/data/logs/hadoop/hdfs# grep 57258734311 hadoop-hdfs-namenode-cluster06-nn1.xx.com.log
2024-03-07 16:48:00,061 INFO namenode.FSImage (FSImage.java:loadEdits(887)) - Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@4afc4056 expecting start txid #57258734311
2024-03-07 16:48:00,061 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true, http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true maxTxnsToRead = 9223372036854775807
2024-03-07 16:48:00,061 INFO namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 'http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true, http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true' to transaction ID 57258734311
2024-03-07 16:48:00,061 INFO namenode.RedundantEditLogInputStream
(RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 'http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true' to transaction ID 57258734311 2024-03-07 16:48:02,592 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(162)) - Edits file http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true, http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true of size 35380849 edits # 214398 loaded in 2 seconds {code} SbNN 2 log {code:java} root@cluster06-yynn3:/data/logs/hadoop/hdfs# grep 57258734311 hadoop-hdfs-namenode-cluster06-nn3.xx.com.log 2024-03-07 16:48:32,536 INFO namenode.FSImage (FSImage.java:loadEdits(887)) - Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6d0659cd expecting start txid #57258734311 2024-03-07 16:48:32,536 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(158)) - Start loading edits file http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true, http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true maxTxnsToRead = 9223372036854775807 2024-03-07 16:48:32,536 INFO namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 
'http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true, http://fs-nn-party-65-190.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true' to transaction ID 57258734311 2024-03-07 16:48:32,536 INFO namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 'http://fs-nn-party-65-191.xxcom:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inProgressOk=true' to transaction ID 57258734311 2024-03-07 16:48:35,634 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(162)) - Edits file http://fs-nn-party-65-191.xx.com:8480/getJournal?jid=yycluster06&segmentTxId=57258734311&storageInfo=-64%3A848315649%3A1660893388633%3ACID-1becf536-8c05-40cb-a1ff-106923139c5c&inP
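The race described in this comment (two standbys checkpointing concurrently, with the later upload failing because a newer image has already landed) can be modeled with a toy image store that rejects uploads which do not advance the latest checkpoint txid. This is an illustrative sketch only; the class and method names below are hypothetical and do not reflect the actual TransferFsImage or StandbyCheckpointer APIs.

```python
import threading

class ImageStore:
    """Toy model of the active NameNode's image directory: an upload is
    accepted only if it advances the latest checkpoint txid."""
    def __init__(self):
        self._lock = threading.Lock()
        self.latest_txid = 0

    def upload(self, txid):
        with self._lock:
            if txid <= self.latest_txid:
                # The other standby won the race; arguably this is an
                # expected outcome, not a real failure.
                raise IOError(f"image for txid {txid} already covered "
                              f"by txid {self.latest_txid}")
            self.latest_txid = txid

store = ImageStore()
results = {}

def checkpoint(name, txid):
    try:
        store.upload(txid)
        results[name] = "uploaded"
    except IOError:
        results[name] = "skipped (stale)"

# Two standbys checkpoint the same txid at (almost) the same time
t1 = threading.Thread(target=checkpoint, args=("sbnn1", 57258734311))
t2 = threading.Thread(target=checkpoint, args=("sbnn2", 57258734311))
t1.start(); t2.start(); t1.join(); t2.join()

# Exactly one standby wins; the loser's "failure" is benign
assert sorted(results.values()) == ["skipped (stale)", "uploaded"]
```

In this model the losing standby's error carries no data-loss risk, which matches the question raised in the comment about whether the log should treat it as an exception at all.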
[jira] [Updated] (HDFS-17407) Exception during image upload
[ https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17407:
Issue Type: Improvement (was: Bug)
> Exception during image upload
> -
>
> Key: HDFS-17407
> URL: https://issues.apache.org/jira/browse/HDFS-17407
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.1.0
> Environment: hadoop 3.1.0
> linux: ubuntu 16.04
> ambari-hdp: 3.1.1
> Reporter: ruiliang
> Priority: Major
>
> After I added a third HDFS NameNode, the service worked fine. However, the logs of the two standby NameNodes always show exceptions during image upload. Meanwhile, I observe that the image file on the primary node is being updated normally, which indicates that a standby node has merged the image file and uploaded it to the primary node. I don't understand why the two standby NameNodes keep logging such exceptions. Is there a potential risk here?
>
> namenode log
> {code:java}
> 2024-03-01 15:31:46,162 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(394)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 131072 bytes.
> java.io.IOException: Error writing request body to server > at > sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587) > at > sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570) > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320) > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294) > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2024-03-01 15:31:46,630 INFO blockmanagement.BlockManager > (BlockManager.java:enqueue(4923)) - Block report queue is full > 2024-03-01 15:31:46,664 ERROR ha.StandbyCheckpointer > (StandbyCheckpointer.java:doWork(452)) - Exception in doCheckpoint > java.io.IOException: Exception during image upload > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:257) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1500(StandbyCheckpointer.java:62) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:432) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:331) > at > 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:351) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:360) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:347) > Caused by: java.util.concurrent.ExecutionException: java.io.IOException: > Error writing request body to server > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:250) > ... 9 more > Caused by: java.io.IOException: Error writing request body to server > at > sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587) > at > sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570) > at > org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(T
[jira] [Resolved] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space
[ https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang resolved HDFS-16799. - Resolution: Cannot Reproduce > The dn space size is not consistent, and Balancer can not work, resulting in > a very unbalanced space > > > Key: HDFS-16799 > URL: https://issues.apache.org/jira/browse/HDFS-16799 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Blocker > > > {code:java} > echo 'A DFS Used 99.8% to ip' > sorucehost > hdfs --debug balancer -fs hdfs://xxcluster06 -threshold 10 -source -f > sorucehost > > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.243:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.247:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-15-10/10.12.65.214:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-02-08/10.12.14.8:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-13/10.12.15.154:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-04/10.12.65.218:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.143:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-05/10.12.12.200:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.217:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.142:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.246:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.219:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.147:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-15-10/10.12.65.186:1019 > 
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-13/10.12.15.153:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-07/10.12.19.23:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-04-14/10.12.65.119:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.131:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-04/10.12.12.210:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-11/10.12.14.168:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.245:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-02/10.12.17.26:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.241:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-13/10.12.15.152:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.249:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-07-14/10.12.64.71:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-03/10.12.17.35:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.195:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.242:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.248:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.240:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-15-12/10.12.65.196:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-13/10.12.15.150:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.222:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-12-03/10.12.65.145:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.244:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-07/10.12.19.22:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.221:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.136:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.129:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-15/10.12.15.163:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-07-14/10.12.64.72:1019 > 22
[jira] [Updated] (HDFS-17407) Exception during image upload
[ https://issues.apache.org/jira/browse/HDFS-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-17407: Description: After I added a third HDFS NameNode, the service itself ran fine, but the two Standby NameNodes continuously log exceptions during image upload. I can see that the fsimage on the active NameNode is updated normally, which indicates that a standby has merged the image and uploaded it successfully. I don't understand why the two Standby NameNodes keep emitting these exception logs. Is there a potential risk? namenode log
{code:java}
2024-03-01 15:31:46,162 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(394)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 131072 bytes.
java.io.IOException: Error writing request body to server
    at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
    at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2024-03-01 15:31:46,630 INFO blockmanagement.BlockManager (BlockManager.java:enqueue(4923)) - Block report queue is full
2024-03-01 15:31:46,664 ERROR ha.StandbyCheckpointer (StandbyCheckpointer.java:doWork(452)) - Exception in doCheckpoint
java.io.IOException: Exception during image upload
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:257)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1500(StandbyCheckpointer.java:62)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:432)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:331)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:351)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:347)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error writing request body to server
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:250)
    ... 9 more
Caused by: java.io.IOException: Error writing request body to server
    at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
    at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
    at java.util.concurre
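One plausible reading of the symptom (an assumption, not a diagnosis confirmed in this thread) is that the 4.6 GB fsimage PUT is being cut off by the receiving NameNode's transfer limits, so the sending side sees "Error writing request body to server". The image-transfer timeout and throttle are tunable in hdfs-site.xml; a sketch with illustrative values:

```xml
<!-- hdfs-site.xml: image-transfer tuning (values are illustrative) -->
<property>
  <!-- How long an image transfer may take before the receiving side
       aborts it; the default of 60000 ms can be exceeded by a
       multi-GB fsimage on a slow or throttled link. -->
  <name>dfs.image.transfer.timeout</name>
  <value>600000</value>
</property>
<property>
  <!-- Throttle for image transfer in bytes/sec; 0 means unthrottled.
       If a throttle is set, the transfer takes longer, so the timeout
       above should be raised accordingly. -->
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>0</value>
</property>
```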
[jira] [Created] (HDFS-17407) Exception during image upload
ruiliang created HDFS-17407: --- Summary: Exception during image upload Key: HDFS-17407 URL: https://issues.apache.org/jira/browse/HDFS-17407 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.1.0 Environment: hadoop 3.1.0 linux: ubuntu 16.04 ambari-hdp: 3.1.1 Reporter: ruiliang After I added a third HDFS NameNode, the service itself ran fine, but the two Standby NameNodes continuously log exceptions during image upload. I can see that the fsimage on the active NameNode is updated normally, which indicates that a standby has merged the image and uploaded it successfully. I don't understand why the two Standby NameNodes keep emitting these exception logs. Is there a potential risk?
{code:java}
2024-03-01 15:31:46,162 INFO namenode.TransferFsImage (TransferFsImage.java:copyFileToStream(394)) - Sending fileName: /data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 131072 bytes.
{code}
[jira] [Commented] (HDFS-7343) HDFS smart storage management
[ https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691365#comment-17691365 ] ruiliang commented on HDFS-7343: https://github.com/Intel-bigdata/SSM This repository was archived by its owner on Jan 4, 2023 and is now read-only. Is this effort still active? Why is it no longer being developed? Or has something else taken its place? Thank you > HDFS smart storage management > - > > Key: HDFS-7343 > URL: https://issues.apache.org/jira/browse/HDFS-7343 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Kai Zheng >Assignee: Wei Zhou >Priority: Major > Attachments: HDFS-Smart-Storage-Management-update.pdf, > HDFS-Smart-Storage-Management.pdf, > HDFSSmartStorageManagement-General-20170315.pdf, > HDFSSmartStorageManagement-Phase1-20170315.pdf, access_count_tables.jpg, > move.jpg, tables_in_ssm.xlsx > > > As discussed in HDFS-7285, it would be better to have a comprehensive and > flexible storage policy engine considering file attributes, metadata, data > temperature, storage type, EC codec, available hardware capabilities, > user/application preference, etc. > Modified the title for re-purpose. > We'd extend this effort a bit and aim to work on a comprehensive solution > to provide a smart storage management service for convenient, > intelligent and effective utilization of erasure coding or replicas, the HDFS cache > facility, HSM offerings, and all kinds of tools (balancer, mover, disk > balancer and so on) in a large cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space
[ https://issues.apache.org/jira/browse/HDFS-16799 ] ruiliang deleted comment on HDFS-16799: - was (Author: ruilaing): ok > The dn space size is not consistent, and Balancer can not work, resulting in > a very unbalanced space > > > Key: HDFS-16799 > URL: https://issues.apache.org/jira/browse/HDFS-16799 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Blocker > > > {code:java} > echo 'A DFS Used 99.8% to ip' > sorucehost > hdfs --debug balancer -fs hdfs://xxcluster06 -threshold 10 -source -f > sorucehost > > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.243:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.247:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-15-10/10.12.65.214:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-02-08/10.12.14.8:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-13/10.12.15.154:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-04/10.12.65.218:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.143:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-05/10.12.12.200:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.217:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.142:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.246:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.219:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.147:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-15-10/10.12.65.186:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: 
> /4F08-05-13/10.12.15.153:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-07/10.12.19.23:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-04-14/10.12.65.119:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.131:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-04/10.12.12.210:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-11/10.12.14.168:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.245:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-02/10.12.17.26:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.241:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-13/10.12.15.152:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.249:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-07-14/10.12.64.71:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-03/10.12.17.35:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.195:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.242:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.248:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.240:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-15-12/10.12.65.196:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-13/10.12.15.150:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.222:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.145:1019 > 22/10/09 16:43:52 INFO 
net.NetworkTopology: Adding a new node: > /4F08-01-08/10.12.65.244:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-03-07/10.12.19.22:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.221:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.136:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-12-03/10.12.65.129:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-05-15/10.12.15.163:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: > /4F08-07-14/10.12.64.72:1019 > 22/10/09 16:43:52 INFO net.NetworkTopology: Adding a
[jira] [Commented] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space
[ https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688914#comment-17688914 ] ruiliang commented on HDFS-16799: - ok > The dn space size is not consistent, and Balancer can not work, resulting in > a very unbalanced space > > > Key: HDFS-16799 > URL: https://issues.apache.org/jira/browse/HDFS-16799 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Blocker > > > {code:java} > echo 'A DFS Used 99.8% to ip' > sorucehost > hdfs --debug balancer -fs hdfs://xxcluster06 -threshold 10 -source -f > sorucehost
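The `-threshold 10` flag in the quoted Balancer command has a precise meaning: a DataNode counts as over- or under-utilized when its DFS-used percentage deviates from the cluster average by more than the threshold, in percentage points. A minimal sketch of that classification (the utilization numbers below are hypothetical, not taken from this cluster):

```python
def classify(node_used_pct, cluster_avg_pct, threshold=10.0):
    """Classify a DataNode the way the HDFS Balancer's -threshold works:
    'over' if utilization exceeds the cluster average by more than
    `threshold` percentage points, 'under' if it falls short by more
    than that, and 'within' otherwise (no data movement needed)."""
    delta = node_used_pct - cluster_avg_pct
    if delta > threshold:
        return "over"
    if delta < -threshold:
        return "under"
    return "within"

# With a hypothetical cluster average of 70% used, the 99.8%-used node
# from the report is far outside the 10-point band and becomes a source.
print(classify(99.8, 70.0))  # -> over
print(classify(65.0, 70.0))  # -> within
```

This is also why a nearly full node and a lopsided cluster average can leave the Balancer with little it is allowed to move: only the slice above `average + threshold` is eligible to leave a source node.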
[jira] [Resolved] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved
[ https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang resolved HDFS-16806. - Hadoop Flags: Reviewed Resolution: Fixed > ec data balancer block blk_id The index error ,Data cannot be moved > --- > > Key: HDFS-16806 > URL: https://issues.apache.org/jira/browse/HDFS-16806 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Critical > Attachments: image-2022-10-20-11-32-35-833.png > > > ec data balancer block blk_id The index error ,Data cannot be moved > dn->10.12.15.149 use disk 100% > > {code:java} > echo 10.12.15.149>sorucehost > balancer -fs hdfs://xxcluster06 -threshold 10 -source -f sorucehost > 2>>~/balancer.log & {code} > > datanode logs > A lot of this log output > {code:java} > datanode logs > ... > 2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - > fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK > operation src: /10.12.65.216:58214 dst: /10.12.15.149:1019 > org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not > found for > BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:256) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290) > at java.lang.Thread.run(Thread.java:748) > ... 
> > hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 > Connecting to namenode via > http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs&blockId=blk_-9223372036799576592+&path=%2F > FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 > 14:47:15 CST 2022Block Id: blk_-9223372036799576592 > Block belongs to: > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > No. of Expected Replica: 5 > No. of live Replica: 5 > No. of excess Replica: 0 > No. of stale Replica: 5 > No. of decommissioned Replica: 0 > No. of decommissioning Replica: 0 > No. of corrupted Replica: 0 > Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is > HEALTHY > hdfs fsck -fs hdfs://xxcluster06 > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > -files -blocks -locations > Connecting to namenode via > http://xx.com:50070/fsck?ugi=hdfs&files=1&blocks=1&locations=1&path=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz > FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > at Wed Oct 19 14:48:42 CST 2022 > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > 500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s): OK > 0. 
BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 > len=500582412 Live_repl=5 > [blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK], > > blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK], > > blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK], > > blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK], > > blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]] > Status: HEALTHY > Number of data-nodes: 62 > Number of racks: 19 > Total dirs: 0 > Total symlinks: 0Replicated Blocks: > Total size: 0 B > Total files: 0 > Total blocks (validated): 0 > Minimally replicated blocks: 0 > Over-replica
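The fsck output above also illustrates how HDFS numbers the internal blocks of a striped (erasure-coded) block group: each internal block's ID is the group's base ID plus its index, which is why the five listed replicas end in ...592 through ...588 for this RS-3-2 group. A small sketch reproducing that arithmetic (the index-to-role mapping follows the usual HDFS convention of data blocks first, then parity):

```python
# Internal block IDs of an HDFS striped (erasure-coded) block group:
# internal_id = group_base_id + block_index. For the RS-3-2 group in
# the fsck output above, indices 0-2 are data blocks, 3-4 are parity.
GROUP_BASE_ID = -9223372036799576592  # blk_-9223372036799576592 from fsck

def internal_block_ids(base_id, num_data, num_parity):
    """Return the internal block IDs for one striped block group."""
    return [base_id + i for i in range(num_data + num_parity)]

ids = internal_block_ids(GROUP_BASE_ID, num_data=3, num_parity=2)
print(ids[0])  # -9223372036799576592 (index 0, first data block)
print(ids[4])  # -9223372036799576588 (index 4, second parity block)
```

This mapping is what a Balancer or fsck uses to tell internal blocks of the same group apart, and a mismatch between the index implied by the ID and the replica actually stored is exactly the kind of inconsistency this issue (and HDFS-16333) is about.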
[jira] [Commented] (HDFS-16806) ec data balancer block blk_id The index error ,Data cannot be moved
[ https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620704#comment-17620704 ] ruiliang commented on HDFS-16806: - After I pull HDFS-16333, I only update hadoop-hdfs.jar on balancer client service, and the problem is solved. The following figure is a comparison before and after the update. !image-2022-10-20-11-32-35-833.png! > ec data balancer block blk_id The index error ,Data cannot be moved > --- > > Key: HDFS-16806 > URL: https://issues.apache.org/jira/browse/HDFS-16806 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.0 >Reporter: ruiliang >Priority: Critical > Attachments: image-2022-10-20-11-32-35-833.png > > > ec data balancer block blk_id The index error ,Data cannot be moved > dn->10.12.15.149 use disk 100% > > {code:java} > echo 10.12.15.149>sorucehost > balancer -fs hdfs://xxcluster06 -threshold 10 -source -f sorucehost > 2>>~/balancer.log & {code} > > datanode logs > A lot of this log output > {code:java} > datanode logs > ... 
> 2022-10-19 14:43:02,031 ERROR datanode.DataNode (DataXceiver.java:run(321)) - > fs-hiido-dn-12-15-149.xx.com:1019:DataXceiver error processing COPY_BLOCK > operation src: /10.12.65.216:58214 dst: /10.12.15.149:1019 > org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not > found for > BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:492) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:256) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1089) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290) > at java.lang.Thread.run(Thread.java:748) > ... > > hdfs fsck -fs hdfs://xxcluster06 -blockId blk_-9223372036799576592 > Connecting to namenode via > http://fs-hiido-xxcluster06-yynn2.xx.com:50070/fsck?ugi=hdfs&blockId=blk_-9223372036799576592+&path=%2F > FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 at Wed Oct 19 > 14:47:15 CST 2022Block Id: blk_-9223372036799576592 > Block belongs to: > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > No. of Expected Replica: 5 > No. of live Replica: 5 > No. of excess Replica: 0 > No. of stale Replica: 5 > No. of decommissioned Replica: 0 > No. of decommissioning Replica: 0 > No. 
of corrupted Replica: 0 > Block replica on datanode/rack: fs-hiido-dn-12-66-4.xx.com/4F08-01-09 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-65-244.xx.com/4F08-01-08 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-15-149.xx.com/4F08-05-13 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-65-218.xx.com/4F08-12-04 is > HEALTHY > Block replica on datanode/rack: fs-hiido-dn-12-17-35.xx.com/4F08-03-03 is > HEALTHY > hdfs fsck -fs hdfs://xxcluster06 > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > -files -blocks -locations > Connecting to namenode via > http://xx.com:50070/fsck?ugi=hdfs&files=1&blocks=1&locations=1&path=%2Fhive_warehouse%2Fwarehouse_old_snapshots%2Fyy_mbsdkevent_original%2Fdt%3D20210505%2Fpost_202105052129_33.log.gz > FSCK started by hdfs (auth:KERBEROS_SSL) from /10.12.19.4 for path > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > at Wed Oct 19 14:48:42 CST 2022 > /hive_warehouse/warehouse_old_snapshots/yy_mbsdkevent_original/dt=20210505/post_202105052129_33.log.gz > 500582412 bytes, erasure-coded: policy=RS-3-2-1024k, 1 block(s): OK > 0. BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036799576592_4218617 > len=500582412 Live_repl=5 > [blk_-9223372036799576592:DatanodeInfoWithStorage[10.12.17.35:1019,DS-3ccebf8d-5f05-45b5-ac7f-96d1cfb48608,DISK], > > blk_-9223372036799576591:DatanodeInfoWithStorage[10.12.65.218:1019,DS-4f8e3114-7566-4cf1-ad5a-e454c8ea8805,DISK], > > blk_-9223372036799576590:DatanodeInfoWithStorage[10.12.15.149:1019,DS-1dd55c27-8f47-46a6-935b-1d9024ca9188,DISK], > > blk_-9223372036799576589:DatanodeInfoWithStorage[10.12.65.244:1019,DS-a9ffd747-c427-4aaa-8559-04cded7d9d5f,DISK], > > blk_-9223372036799576588:DatanodeInfoWithStorage[10.12.66.4:1019,DS-d88f94db-6db1-4753-a652-780d7cd7f081,DISK]] > Status: HEALTHY > Number of data-nodes: 62 > Number of racks: 19
[jira] [Updated] (HDFS-16806) EC data balancer block blk_id index error, data cannot be moved
[ https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-16806: Attachment: image-2022-10-20-11-32-35-833.png
[jira] [Commented] (HDFS-16806) EC data balancer block blk_id index error, data cannot be moved
[ https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17620065#comment-17620065 ] ruiliang commented on HDFS-16806: - https://issues.apache.org/jira/browse/HDFS-16333 Is that the issue behind this problem? Is it enough to apply the patch on the Balancer client, or does it also need to be pulled onto the NameNode server?
[jira] [Updated] (HDFS-16806) EC data balancer block blk_id index error, data cannot be moved
[ https://issues.apache.org/jira/browse/HDFS-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ruiliang updated HDFS-16806: Description: EC data balancer block blk_id index error, data cannot be moved. dn->10.12.15.149 use disk 100%
[jira] [Created] (HDFS-16806) EC data balancer block blk_id index error, data cannot be moved
ruiliang created HDFS-16806: --- Summary: EC data balancer block blk_id index error, data cannot be moved Key: HDFS-16806 URL: https://issues.apache.org/jira/browse/HDFS-16806 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 3.1.0 Reporter: ruiliang
[jira] [Comment Edited] (HDFS-16799) DN space usage is inconsistent and the Balancer cannot work, resulting in very unbalanced space
[ https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614703#comment-17614703 ] ruiliang edited comment on HDFS-16799 at 10/13/22 2:19 AM: --- It seems that the empty nodes are concentrated on a few racks, so the placement policy cannot select enough distinct racks first. In this case, is the only option to adjust the racks to a reasonable number of nodes each? {code:java} Datanode 10.12.65.241:1019 is not chosen since the rack has too many chosen nodes. 2022-10-09 19:27:18,407 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseLocalRack(637)) - Failed to choose from local rack (location = /4F08-05-15), retry with the rack of the next replica (location = /4F08-12-03) org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException: at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:834) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:629) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:589) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:218) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:94) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:295) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:148) at 
org.apache.hadoop.hdfs.server.blockmanagement.ErasureCodingWork.chooseTargets(ErasureCodingWork.java:60) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1862) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1814) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4655) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4522) at java.lang.Thread.run(Thread.java:748) 2022-10-09 19:27:18,416 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(824)) - [ Node /4F08-01-08/10.12.65.242:1019 [ Datanode 10.12.65.242:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.248:1019 [ Datanode 10.12.65.248:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.195:1019 [ Datanode 10.12.65.195:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.241:1019 [ Datanode 10.12.65.241:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.243:1019 [ Datanode 10.12.65.243:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.244:1019 [ Datanode 10.12.65.244:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.249:1019 [ Datanode 10.12.65.249:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.245:1019 [ Datanode 10.12.65.245:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.240:1019 [ Datanode 10.12.65.240:1019 is not chosen since the rack has too many chosen nodes. Node /4F08-01-08/10.12.65.247:1019 [ Datanode 10.12.65.247:1019 is not chosen since the rack has too many chosen nodes. 
2022-10-09 19:27:18,416 INFO blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(832)) - Not enough replicas was chosen. Reason:{TOO_MANY_NODES_ON_RACK=10} 2022-10-09 19:27:18,417 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseFromNextRack(669)) - Failed to choose from the next rack (location = /4F08-01-08), retry choosing randomly org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException: at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:834) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:722) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:665)
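The TOO_MANY_NODES_ON_RACK rejections above match the comment's reading: BlockPlacementPolicyRackFaultTolerant spreads an EC block group's internal blocks across racks, so each rack may host only a bounded share. The sketch below is a ceiling-division simplification of that bound, not the actual placement code: with RS-3-2 there are 5 internal blocks, and with this cluster's 19 racks the quota works out to one per rack, so once a rack's slot is used every other node on that rack is rejected, even if that rack holds most of the free space.

```java
public class RackQuotaSketch {
    // Simplified rack quota: spread totalBlocks internal blocks over
    // numRacks racks, allowing at most ceil(totalBlocks / numRacks) per rack.
    static int maxPerRack(int totalBlocks, int numRacks) {
        return (totalBlocks + numRacks - 1) / numRacks; // ceiling division
    }

    public static void main(String[] args) {
        // RS-3-2 writes 5 internal blocks. With 19 racks the quota is 1 per
        // rack, so the nodes with free space cannot absorb more than one
        // block per rack.
        System.out.println(maxPerRack(5, 19)); // -> 1
        // If the free-space nodes sat on only 2 racks, 3 per rack would be
        // needed before placement could succeed.
        System.out.println(maxPerRack(5, 2));  // -> 3
    }
}
```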
[jira] [Commented] (HDFS-16799) DN space usage is inconsistent and the Balancer cannot work, resulting in very unbalanced space
[ https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614703#comment-17614703 ] ruiliang commented on HDFS-16799: - It seems that the empty nodes are concentrated on a few racks, so it is not possible to select enough distinct racks first. In this case, is the only option to break up the racks?
[jira] [Updated] (HDFS-16799) DN space usage is inconsistent and the Balancer cannot work, resulting in very unbalanced space
[ https://issues.apache.org/jira/browse/HDFS-16799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ruiliang updated HDFS-16799:
Description:
{code:java}
echo 'A DFS Used 99.8% to ip' > sorucehost
hdfs --debug balancer -fs hdfs://xxcluster06 -threshold 10 -source -f sorucehost
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.243:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.247:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-15-10/10.12.65.214:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-02-08/10.12.14.8:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-13/10.12.15.154:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-04/10.12.65.218:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.143:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-05/10.12.12.200:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.217:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.142:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.246:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.219:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.147:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-15-10/10.12.65.186:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-13/10.12.15.153:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-03-07/10.12.19.23:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-04-14/10.12.65.119:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.131:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-04/10.12.12.210:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-11/10.12.14.168:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.245:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-03-02/10.12.17.26:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.241:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-13/10.12.15.152:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.249:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-07-14/10.12.64.71:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-03-03/10.12.17.35:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.195:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.242:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.248:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.240:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-15-12/10.12.65.196:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-13/10.12.15.150:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.222:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.145:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-01-08/10.12.65.244:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-03-07/10.12.19.22:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.221:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.136:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.129:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-15/10.12.15.163:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-07-14/10.12.64.72:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-13/10.12.15.149:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.130:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.220:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-03-01/10.12.17.27:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-05-15/10.12.15.162:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-12-03/10.12.65.216:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Adding a new node: /4F08-03-07/10.12.19.20:1019
22/10/09 16:43:52 INFO net.NetworkTopology: Add
[jira] [Created] (HDFS-16799) The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space
ruiliang created HDFS-16799:
-------------------------------

Summary: The dn space size is not consistent, and Balancer can not work, resulting in a very unbalanced space
Key: HDFS-16799
URL: https://issues.apache.org/jira/browse/HDFS-16799
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 3.1.0
Reporter: ruiliang

{code:java}
echo 'A DFS Used 99.8% to ip' > sorucehost
hdfs --debug balancer -fs hdfs://xxcluster06 -threshold 10 -source -f sorucehost
{code}
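The report above runs the Balancer with `-threshold 10` against a cluster whose usages range from 8.20% to 98.85%. The Balancer classifies each DataNode by comparing its utilization to the cluster average within that threshold window. A minimal sketch of that classification arithmetic (illustrative only; the function and node names are hypothetical, and this is not the actual Balancer source):

```python
def classify_datanodes(utilizations, threshold):
    """Classify DataNodes the way the Balancer's threshold check works.

    utilizations: {node: percent of DFS capacity used}
    threshold: percent window around the cluster average (e.g. 10)
    """
    avg = sum(utilizations.values()) / len(utilizations)
    over, under, balanced = [], [], []
    for node, used in utilizations.items():
        if used > avg + threshold:
            over.append(node)        # candidate source: data should move off
        elif used < avg - threshold:
            under.append(node)       # candidate target: data should move on
        else:
            balanced.append(node)    # within threshold of the average
    return avg, over, under, balanced

# Hypothetical cluster resembling the reported one (usages 8.20%..99.8%):
usage = {"dn-a": 8.2, "dn-b": 32.44, "dn-c": 45.0, "dn-source": 99.8}
avg, over, under, balanced = classify_datanodes(usage, threshold=10)
```

With these numbers the 99.8%-full node is classified as over-utilized, so a working Balancer should select it as a source; the bug report is that moves nevertheless do not happen.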
[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation
[ https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ruiliang updated HDFS-16788:
Description:

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50 (Decommissioned: 0, In Maintenance: 0)|

I have been running the Balancer in the background to rebalance the data, but when I run distcp the copy fails:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 -Ddfs.balancer.moverThreads=1200 -Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 -threshold 50
{code}
{code:java}
hadoop distcp -Dmapreduce.task.timeout=60 -skipcrccheck -update hdfs://01 hdfs://02xx
syslog
...
2022-09-30 14:22:50,724 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:22:58,389 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #3: failed, blk_-9223372036808890525_3095130
2022-09-30 14:22:58,389 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, block==null
2022-09-30 14:23:21,547 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:23:29,319 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, blk_-9223372036808889612_3095200
2022-09-30 14:23:36,950 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:23:44,822 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, blk_-922337203680572_3095307
2022-09-30 14:23:44,837 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:23:52,306 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, block==null
2022-09-30 14:23:52,321 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:23:59,822 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, block==null
2022-09-30 14:23:59,836 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=3, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:23:59,836 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:24:07,302 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #3: failed, blk_-9223372036808887853_3095387
2022-09-30 14:24:07,303 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, block==null
2022-09-30 14:24:07,317 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:24:15,383 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, block==null
2022-09-30 14:24:15,395 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:24:22,795 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, block==null
2022-09-30 14:24:22,812 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=3, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:24:22,812 WARN [main] org.apache.hadoop.hdfs.DFSOutputStream: Cannot allocate parity block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
2022-09-30 14:24:31,352 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #3: failed, blk_-9223372036808887133_3095476
2022-09-30 14:24:31,352 INFO [main] org.apache.hadoop.hdfs.DFSOutputStream: replacing previously failed streamer #4: failed, block==null
{code}
distcp output:
{code:java}
Error: java.io.IOException: File copy failed: Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hive_warehouse/warehouse_old_snapshots/credit/.di
[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation
[ https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ruiliang updated HDFS-16788:
Description:

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50 (Decommissioned: 0, In Maintenance: 0)|

I have been running the Balancer in the background to rebalance the data, but when I run distcp the copy fails:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 -Ddfs.balancer.moverThreads=1200 -Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 -threshold 50
{code}
{code:java}
// Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2 could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2128)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2706)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
at org.apache.hadoop.ipc.Client.call(Client.java:1443)
at org.apache.hadoop.ipc.Client.call(Client.java:1353)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:510)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1078)
at org.apache.hadoop.hdfs.DFSStripedOutputStream.allocateNewBlock(DFSStripedOutputStream.java:479)
at org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:525)
at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:125)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:111)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:290)
{code}
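For context on the "2 of the 3 required nodes" wording: with RS-3-2-1024k each block group stripes across 3 data and 2 parity blocks, and the NameNode needs at least the 3 data-block targets to allocate a block group at all, even though 5 targets are needed for full redundancy. A hedged sketch of that arithmetic (illustrative only, mirroring the message semantics; this is not HDFS code, and the function name is hypothetical):

```python
def can_allocate_block_group(usable_nodes, data_units=3, parity_units=2):
    """Model the minimum-target check for a striped EC write (e.g. RS-3-2).

    usable_nodes: DataNodes the NameNode can actually choose as targets
    (nearly-full nodes drop out even when they are "running").
    """
    required = data_units                    # minimum targets to write at all
    full_width = data_units + parity_units   # targets for full redundancy
    return {
        "required": required,
        "full_width": full_width,
        "can_write": usable_nodes >= required,
        "fully_redundant": usable_nodes >= full_width,
    }

# 50 DataNodes are live in the report, but if capacity skew leaves only 2
# usable as targets, the write fails exactly as logged:
result = can_allocate_block_group(usable_nodes=2)
```

This would explain how a 50-node cluster still fails the write: "running" nodes at 98%+ usage are not usable placement targets, so the effective count falls below the 3 data units the policy requires.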
[jira] [Created] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation
ruiliang created HDFS-16788:
-------------------------------

Summary: could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation
Key: HDFS-16788
URL: https://issues.apache.org/jira/browse/HDFS-16788
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: ruiliang
Attachments: image-2022-09-30-14-14-29-963.png, image-2022-09-30-14-14-44-164.png

!image-2022-09-30-14-14-44-164.png!
||Configured Capacity:|3.02 PB|
||Configured Remote Capacity:|0 B|
||DFS Used:|1.39 PB (45.96%)|
||Non DFS Used:|0 B|
||DFS Remaining:|1.62 PB (53.67%)|
||Block Pool Used:|1.39 PB (45.96%)|
||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
||[Live Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50 (Decommissioned: 0, In Maintenance: 0)|

I have been running the Balancer in the background to rebalance the data, but when I run distcp the copy fails:
{code:java}
hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 -Ddfs.balancer.moverThreads=1200 -Ddfs.datanode.balance.bandwidthPerSec=1073741824 -fs hdfs://yycluster06 -threshold 50
{code}
{code:java}
// Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2 could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation.
{code}
[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation
[ https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ruiliang updated HDFS-16788:
----------------------------
    Component/s: hdfs

> could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There
> are 50 datanode(s) running and no node(s) are excluded in this operation
> ------------------------------------------------------------------------
>
>                 Key: HDFS-16788
>                 URL: https://issues.apache.org/jira/browse/HDFS-16788
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.1.0
>            Reporter: ruiliang
>            Priority: Major
>         Attachments: image-2022-09-30-14-14-29-963.png, image-2022-09-30-14-14-44-164.png
>
> !image-2022-09-30-14-14-44-164.png!
> ||Configured Capacity:|3.02 PB|
> ||Configured Remote Capacity:|0 B|
> ||DFS Used:|1.39 PB (45.96%)|
> ||Non DFS Used:|0 B|
> ||DFS Remaining:|1.62 PB (53.67%)|
> ||Block Pool Used:|1.39 PB (45.96%)|
> ||DataNodes usages% (Min/Median/Max/stdDev):|8.20% / 32.44% / 98.85% / 37.30%|
> ||[Live Nodes|http://fs-hiido-yycluster06-yynn1.hiido.host.yydevops.com:50070/dfshealth.html#tab-datanode]|50 (Decommissioned: 0, In Maintenance: 0)|
>
> I have been running the balancer in the background to even out the data, but while it runs, my distcp job fails with the error below:
> {code:bash}
> hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=300 \
>   -Ddfs.balancer.moverThreads=1200 \
>   -Ddfs.datanode.balance.bandwidthPerSec=1073741824 \
>   -fs hdfs://yycluster06 -threshold 50{code}
> {code:java}
> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hive_warehouse/warehouse_old_snapshots/credit/.distcp.tmp.attempt_166383067_314191_m_08_2 could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation.
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2128)
> at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2706)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:875)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:561)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
> at org.apache.hadoop.ipc.Client.call(Client.java:1443)
> at org.apache.hadoop.ipc.Client.call(Client.java:1353)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:510)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFS
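The placement failure above is consistent with the reported usage skew (Min 8.20% / Max 98.85%): nearly-full DataNodes drop out of block placement, so even with 50 live nodes the NameNode can run short of eligible targets for an RS-3-2-1024k block group. As an illustrative sketch (not part of Hadoop), the per-node `DFS Used%` figures printed by `hdfs dfsadmin -report` can be summarized into the same Min/Median/Max/stdDev line the NameNode UI shows; the `usage_skew` helper and the sample report text below are hypothetical:

```python
import re
import statistics

def usage_skew(report_text: str) -> dict:
    """Summarize per-DataNode 'DFS Used%' values into min/median/max/stdev."""
    pcts = [float(p) for p in re.findall(r"DFS Used%:\s*([\d.]+)%", report_text)]
    if not pcts:
        raise ValueError("no 'DFS Used%' lines found")
    return {
        "min": min(pcts),
        "median": statistics.median(pcts),
        "max": max(pcts),
        # Population stdev, matching a summary over all live nodes.
        "stdev": statistics.pstdev(pcts),
    }

# Hypothetical excerpt of `hdfs dfsadmin -report` output (three nodes):
sample = """
Name: 10.0.0.1:9866
DFS Used%: 10.00%
Name: 10.0.0.2:9866
DFS Used%: 30.00%
Name: 10.0.0.3:9866
DFS Used%: 98.00%
"""
print(usage_skew(sample))
```

On a real cluster one would feed the full `hdfs dfsadmin -report` output into this script; note the report also prints a cluster-wide `DFS Used%` summary line, which this simple regex would pick up alongside the per-node values.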
[jira] [Updated] (HDFS-16788) could only be written to 2 of the 3 required nodes for RS-3-2-1024k. There are 50 datanode(s) running and no node(s) are excluded in this operation
[ https://issues.apache.org/jira/browse/HDFS-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ruiliang updated HDFS-16788:
----------------------------
    Affects Version/s: 3.1.0