[jira] [Created] (HDFS-16573) Fix test TestDFSStripedInputStreamWithRandomECPolicy
daimin created HDFS-16573:
-
Summary: Fix test TestDFSStripedInputStreamWithRandomECPolicy
Key: HDFS-16573
URL: https://issues.apache.org/jira/browse/HDFS-16573
Project: Hadoop HDFS
Issue Type: Test
Components: test
Affects Versions: 3.3.2
Reporter: daimin
Assignee: daimin

TestDFSStripedInputStreamWithRandomECPolicy fails due to test from HDFS-16520

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532770#comment-17532770 ]

daimin commented on HDFS-13243:
---
[~gzh1992n] [~weichiu] We encountered the same problem in our cluster and fixed it some months ago. As this jira is still unresolved, I would like to continue the work on fixing it. Please let me know if you have any concerns about this, thanks.

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.7.2, 3.2.0
> Reporter: Zephyr Guo
> Assignee: daimin
> Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, HDFS-13243-v6.patch
>
> An HDFS file might get broken because of corrupt block(s) that can be produced by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to COMMITTED, and if a sync request then gets popped from the queue and processed, the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which checks the block length of all COMMITTED blocks. But the block length recorded in NameNode memory already differs from the length reported by the DataNode, and consequently the last block is marked as corrupt because of the inconsistent length.
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW
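The length check described in the issue can be sketched as follows. This is only an illustration of the rule stated in the description and visible in the log ("block is COMMITTED and reported length ... does not match length in block map"), not the NameNode's actual code; the class name, the enum, and the method are hypothetical.

```java
/** Hypothetical sketch of the COMMITTED-block length check from the issue text. */
final class CommittedLengthCheckSketch {
  /** Simplified block states; the names follow the UCState values in the log. */
  enum UcState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

  /**
   * A reported replica is treated as corrupt when the block is COMMITTED
   * and the reported length differs from the length recorded in the block map.
   */
  static boolean markedCorrupt(UcState state, long recordedLength, long reportedLength) {
    return state == UcState.COMMITTED && recordedLength != reportedLength;
  }

  public static void main(String[] args) {
    // Lengths taken from the log: block map says 141232, DataNode reports 2054413.
    System.out.println(markedCorrupt(UcState.COMMITTED, 141232L, 2054413L));
    // While the block is still under construction the mismatch is not an error.
    System.out.println(markedCorrupt(UcState.UNDER_CONSTRUCTION, 141232L, 2054413L));
  }
}
```

With the lengths from the log, a sync that changes the last block length after the block was committed is enough to make every reporting replica fail this check.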
[jira] [Assigned] (HDFS-13243) Get CorruptBlock because of calling close and sync in same time
[ https://issues.apache.org/jira/browse/HDFS-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

daimin reassigned HDFS-13243:
-
Assignee: daimin (was: Zephyr Guo)

> Get CorruptBlock because of calling close and sync in same time
> ---
>
> Key: HDFS-13243
> URL: https://issues.apache.org/jira/browse/HDFS-13243
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.7.2, 3.2.0
> Reporter: Zephyr Guo
> Assignee: daimin
> Priority: Critical
> Attachments: HDFS-13243-v1.patch, HDFS-13243-v2.patch, HDFS-13243-v3.patch, HDFS-13243-v4.patch, HDFS-13243-v5.patch, HDFS-13243-v6.patch
>
> An HDFS file might get broken because of corrupt block(s) that can be produced by calling close and sync at the same time.
> When the close call is not successful, the UCBlock status changes to COMMITTED, and if a sync request then gets popped from the queue and processed, the sync operation changes the last block length.
> After that, the DataNode reports all received blocks to the NameNode, which checks the block length of all COMMITTED blocks. But the block length recorded in NameNode memory already differs from the length reported by the DataNode, and consequently the last block is marked as corrupt because of the inconsistent length.
> > {panel:title=Log in my hdfs} > 2018-03-05 04:05:39,261 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1085498930_11758129\{UCState=UNDER_CONSTRUCTION, > truncateBlock=null, primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > for > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,760 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > for DFSClient_NONMAPREDUCE_1077513762_1 > 2018-03-05 04:05:39,761 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,1519845790686/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com%2C16020%2C1519845790686.default.1520193926515 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.0.0.220:50010 is added to > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > 
replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > size 2054413 > 2018-03-05 04:05:39,761 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.219:50010 by > hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com/10.0.0.219 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:39,762 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1085498930 added as corrupt on > 10.0.0.218:50010 by > hb-j5e517al6xib80rkb-004.hbase.rds.aliyuncs.com/10.0.0.218 because block is > COMMITTED and reported length 2054413 does not match length in block map > 141232 > 2018-03-05 04:05:40,162 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > blk_1085498930_11758129\{UCState=COMMITTED, truncateBlock=null, > primaryNodeIndex=-1, > replicas=[ReplicaUC[[DISK]DS-32c7e479-3845-4a44-adf1-831edec7506b:NORMAL:10.0.0.219:50010|RBW], > > ReplicaUC[[DISK]DS-a9a5d653-c049-463d-8e4a-d1f0dc14409c:NORMAL:10.0.0.220:50010|RBW], > > ReplicaUC[[DISK]DS-f2b7c04a-b724-4c69-abbf-d2e416f70706:NORMAL:10.0.0.218:50010|RBW]]} > is not COMPLETE (ucState = COMMITTED, replication# = 3 >= minimum = 2) in > file > /hbase/WALs/hb-j5e517al6xib80rkb-006.hbase.rds.aliyuncs.com,16020,
[jira] [Created] (HDFS-16546) Fix UT TestOfflineImageViewer#testReverseXmlWithoutSnapshotDiffSection to branch branch-3.2
daimin created HDFS-16546:
-
Summary: Fix UT TestOfflineImageViewer#testReverseXmlWithoutSnapshotDiffSection to branch branch-3.2
Key: HDFS-16546
URL: https://issues.apache.org/jira/browse/HDFS-16546
Project: Hadoop HDFS
Issue Type: Test
Components: test
Affects Versions: 3.2.0
Reporter: daimin
Assignee: daimin

The test fails due to incorrect layoutVersion.
[jira] [Updated] (HDFS-16543) Keep default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec consistent
[ https://issues.apache.org/jira/browse/HDFS-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

daimin updated HDFS-16543:
--
Description:
"WARN datanode.DirectoryScanner (DirectoryScanner.java:(300)) - dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value above 1000 ms/sec. Assuming default value of -1"
A warning like the one above is printed when the datanode starts. The reason is that the default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec is 1000 in hdfs-site.xml, but -1 in DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT. We should keep the two consistent, and -1 looks better than 1000.
The code segment that prints the warning:
{{
if (throttle >= TimeUnit.SECONDS.toMillis(1)) {
  LOG.warn(
      "{} set to value above 1000 ms/sec. Assuming default value of {}",
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY,
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT);
  throttle = DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT;
}
}}

was:
"WARN datanode.DirectoryScanner (DirectoryScanner.java:(300)) - dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value above 1000 ms/sec. Assuming default value of -1"
A warning like the one above is printed when the datanode starts. The reason is that the default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec is 1000 in hdfs-site.xml, but -1 in DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT. We should keep the two consistent, and -1 looks better than 1000.
The code segment that prints the warning:
{{if (throttle >= TimeUnit.SECONDS.toMillis(1)) {
  LOG.warn(
      "{} set to value above 1000 ms/sec. Assuming default value of {}",
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY,
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT);
  throttle = DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT;
}}}

> Keep default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec consistent
> -
>
> Key: HDFS-16543
> URL: https://issues.apache.org/jira/browse/HDFS-16543
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: configuration, datanode
> Affects Versions: 3.3.2
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
[jira] [Updated] (HDFS-16543) Keep default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec consistent
[ https://issues.apache.org/jira/browse/HDFS-16543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

daimin updated HDFS-16543:
--
Description:
"WARN datanode.DirectoryScanner (DirectoryScanner.java:(300)) - dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value above 1000 ms/sec. Assuming default value of -1"
A warning like the one above is printed when the datanode starts. The reason is that the default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec is 1000 in hdfs-site.xml, but -1 in DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT. We should keep the two consistent, and -1 looks better than 1000.
The code segment that prints the warning:
if (throttle >= TimeUnit.SECONDS.toMillis(1)) {
  LOG.warn(
      "{} set to value above 1000 ms/sec. Assuming default value of {}",
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY,
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT);
  throttle = DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT;
}

was:
"WARN datanode.DirectoryScanner (DirectoryScanner.java:(300)) - dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value above 1000 ms/sec. Assuming default value of -1"
A warning like the one above is printed when the datanode starts. The reason is that the default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec is 1000 in hdfs-site.xml, but -1 in DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT. We should keep the two consistent, and -1 looks better than 1000.
The code segment that prints the warning:
{{
if (throttle >= TimeUnit.SECONDS.toMillis(1)) {
  LOG.warn(
      "{} set to value above 1000 ms/sec. Assuming default value of {}",
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY,
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT);
  throttle = DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT;
}
}}

> Keep default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec consistent
> -
>
> Key: HDFS-16543
> URL: https://issues.apache.org/jira/browse/HDFS-16543
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: configuration, datanode
> Affects Versions: 3.3.2
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
[jira] [Created] (HDFS-16543) Keep default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec consistent
daimin created HDFS-16543:
-
Summary: Keep default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec consistent
Key: HDFS-16543
URL: https://issues.apache.org/jira/browse/HDFS-16543
Project: Hadoop HDFS
Issue Type: Improvement
Components: configuration, datanode
Affects Versions: 3.3.2
Reporter: daimin
Assignee: daimin

"WARN datanode.DirectoryScanner (DirectoryScanner.java:(300)) - dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value above 1000 ms/sec. Assuming default value of -1"
A warning like the one above is printed when the datanode starts. The reason is that the default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec is 1000 in hdfs-site.xml, but -1 in DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT. We should keep the two consistent, and -1 looks better than 1000.
The code segment that prints the warning:
{{if (throttle >= TimeUnit.SECONDS.toMillis(1)) {
  LOG.warn(
      "{} set to value above 1000 ms/sec. Assuming default value of {}",
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY,
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT);
  throttle = DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT;
}}}
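To make the mismatch concrete, here is a minimal, self-contained sketch of the quoted check. It is not the DirectoryScanner code itself: the class and method names are hypothetical, and the two constants are simply the values stated in the issue (1000 from the XML file, -1 from the code default). Note that the condition is `>=`, so a configured value of exactly 1000 already triggers the warning even though the message says "above 1000".

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the throttle validation quoted in the issue. */
final class ThrottleDefaultSketch {
  // Values as stated in the issue text (assumptions of this sketch).
  static final int CODE_DEFAULT = -1;   // the *_DEFAULT constant in code
  static final int XML_DEFAULT = 1000;  // the default shipped in the XML file

  /** Mirrors the quoted condition: values >= 1000 ms/sec fall back to the code default. */
  static int effectiveThrottle(int configured) {
    if (configured >= TimeUnit.SECONDS.toMillis(1)) {
      System.out.println("WARN: set to value above 1000 ms/sec. "
          + "Assuming default value of " + CODE_DEFAULT);
      return CODE_DEFAULT;
    }
    return configured;
  }

  public static void main(String[] args) {
    // The XML default itself trips the check and is replaced by the code default.
    System.out.println(effectiveThrottle(XML_DEFAULT));
    // A value below 1000 passes through unchanged and produces no warning.
    System.out.println(effectiveThrottle(500));
  }
}
```

Under these assumptions, aligning the XML default with -1 would make the warning disappear for unconfigured datanodes without changing the effective behavior.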
[jira] [Created] (HDFS-16542) Fix failed unit tests in branch branch-3.2
daimin created HDFS-16542:
-
Summary: Fix failed unit tests in branch branch-3.2
Key: HDFS-16542
URL: https://issues.apache.org/jira/browse/HDFS-16542
Project: Hadoop HDFS
Issue Type: Improvement
Components: test
Affects Versions: 3.2.3
Reporter: daimin
Assignee: daimin

Tests fail in branch branch-3.2:
hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewer
hadoop.hdfs.TestReconstructStripedFileWithValidator
hadoop.hdfs.server.namenode.TestFsck
hadoop.hdfs.server.blockmanagement.TestBlockManager
[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads
[ https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520328#comment-17520328 ]

daimin commented on HDFS-16422:
---
[~tasanuma] I think NativeRSRawDecoder is thread safe after HDFS-16422, and it was not before.

> Fix thread safety of EC decoding during concurrent preads
> -
>
> Key: HDFS-16422
> URL: https://issues.apache.org/jira/browse/HDFS-16422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: dfsclient, ec, erasure-coding
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.3
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Reading data from an erasure-coded file with missing replicas (internal blocks of a block group) causes online reconstruction: read dataUnits parts of the data and decode them into the missing target data. Each DFSStripedInputStream object has a RawErasureDecoder object, and when we do preads concurrently, RawErasureDecoder.decode is invoked concurrently too.
> RawErasureDecoder.decode is not thread safe, and as a result we occasionally get wrong data from pread.
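The problem pattern can be illustrated with a toy decoder. This is a sketch under loud assumptions: ToyDecoder is hypothetical and has nothing to do with the real RawErasureDecoder or ISA-L code; it only shares the property that decode works through internal mutable state. Marking decode as synchronized serializes concurrent callers on the shared object, which is the kind of protection the issue asks for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy illustration of one decoder object shared by concurrent readers. */
final class SharedDecoderSketch {
  /** Hypothetical decoder whose result passes through shared mutable state. */
  static final class ToyDecoder {
    private int scratch; // shared state, standing in for internal decoder buffers

    // Without synchronized, two callers could interleave their updates of scratch
    // and return a mixed-up value; with it, calls run one at a time.
    synchronized int decode(byte[] units) {
      scratch = 0;
      for (byte b : units) {
        scratch ^= b;
      }
      return scratch;
    }
  }

  /** Runs many concurrent decodes on one decoder and checks every result. */
  static boolean allReadsConsistent(int threads, int readsPerThread) {
    ToyDecoder decoder = new ToyDecoder();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final byte seed = (byte) (t + 1);
        results.add(pool.submit(() -> {
          byte[] units = {seed, (byte) (seed * 3), (byte) (seed * 7)};
          int expected = units[0] ^ units[1] ^ units[2];
          for (int i = 0; i < readsPerThread; i++) {
            if (decoder.decode(units) != expected) {
              return false;
            }
          }
          return true;
        }));
      }
      boolean ok = true;
      for (Future<Boolean> f : results) {
        ok &= f.get();
      }
      return ok;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(allReadsConsistent(8, 10_000));
  }
}
```

Removing the synchronized keyword in this toy makes wrong results possible, though not guaranteed on every run, which matches the "occasionally" in the description.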
[jira] [Created] (HDFS-16520) Improve EC pread: avoid potential reading whole block
daimin created HDFS-16520:
-
Summary: Improve EC pread: avoid potential reading whole block
Key: HDFS-16520
URL: https://issues.apache.org/jira/browse/HDFS-16520
Project: Hadoop HDFS
Issue Type: Improvement
Components: dfsclient, ec
Affects Versions: 3.3.2, 3.3.1
Reporter: daimin
Assignee: daimin

HDFS client 'pread' stands for 'positional read'; this kind of read needs only a range of data instead of the whole file/block. Through BlockReaderFactory#setLength, the client tells the datanode how much of the block to read from disk and send to the client. For EC files, the length to read is not set well: by default 'block.getBlockSize() - offsetInBlock' is used for both pread and sread. As a result, the datanode reads and sends much more data than needed, and aborts only when the client closes the connection. This wastes a lot of resources.
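The difference between the two lengths can be shown with a small sketch. It is an assumption-laden illustration, not the patch for this issue: the class and method names are hypothetical, and it ignores EC striping details, looking at a single block only.

```java
/** Hypothetical comparison of the read lengths discussed in the issue. */
final class PreadLengthSketch {
  /** Length requested today per the issue text: everything up to the end of the block. */
  static long lengthToEndOfBlock(long blockSize, long offsetInBlock) {
    return blockSize - offsetInBlock;
  }

  /** Length a positional read actually needs, capped at the end of the block. */
  static long lengthForPread(long blockSize, long offsetInBlock, long requested) {
    return Math.min(requested, blockSize - offsetInBlock);
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024; // example block size, chosen for illustration
    // A 4 KiB pread at offset 0 would otherwise ask the datanode for the whole block.
    System.out.println(lengthToEndOfBlock(blockSize, 0));
    System.out.println(lengthForPread(blockSize, 0, 4096));
  }
}
```

In this sketch, a value like the result of lengthForPread is what would be passed to the reader's length setting for a pread, while a stateful read could keep using the remainder of the block.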
[jira] [Created] (HDFS-16519) Add throttler to EC reconstruction
daimin created HDFS-16519:
-
Summary: Add throttler to EC reconstruction
Key: HDFS-16519
URL: https://issues.apache.org/jira/browse/HDFS-16519
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode, ec
Affects Versions: 3.3.2, 3.3.1
Reporter: daimin
Assignee: daimin

HDFS already has throttlers for data transfer (replication) and the balancer; these throttlers reduce the impact of background procedures on user reads and writes. We should add a throttler to EC background reconstruction too.
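As a rough idea of what such a throttler does, here is a minimal fixed-window sketch. It is not Hadoop's own throttler class and not the change proposed in this issue; the class name, the one-second window, and the method signature are all assumptions made for illustration. The caller passes the current time explicitly so the behavior is easy to follow and to test.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical fixed-window bandwidth throttle for a background task. */
final class ReconstructionThrottleSketch {
  private static final long WINDOW_NANOS = TimeUnit.SECONDS.toNanos(1);

  private final long bytesPerSecond;
  private long windowStartNanos;
  private long bytesInWindow;

  ReconstructionThrottleSketch(long bytesPerSecond, long startNanos) {
    this.bytesPerSecond = bytesPerSecond;
    this.windowStartNanos = startNanos;
  }

  /**
   * Records numBytes transferred at time nowNanos and returns how many
   * milliseconds the caller should pause to stay within the budget.
   */
  synchronized long pauseMillis(long numBytes, long nowNanos) {
    if (nowNanos - windowStartNanos >= WINDOW_NANOS) {
      // A new one-second window begins: reset the byte counter.
      windowStartNanos = nowNanos;
      bytesInWindow = 0;
    }
    bytesInWindow += numBytes;
    if (bytesInWindow <= bytesPerSecond) {
      return 0;
    }
    // Over budget: wait until the current window ends.
    long remainingNanos = WINDOW_NANOS - (nowNanos - windowStartNanos);
    return TimeUnit.NANOSECONDS.toMillis(remainingNanos);
  }

  public static void main(String[] args) {
    ReconstructionThrottleSketch throttle = new ReconstructionThrottleSketch(1000, 0);
    System.out.println(throttle.pauseMillis(600, 0));            // within budget
    System.out.println(throttle.pauseMillis(600, 250_000_000L)); // over budget
  }
}
```

A reconstruction worker would call something like pauseMillis after each chunk it reads or writes and sleep for the returned time, so that user reads and writes keep most of the disk and network bandwidth.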
[jira] [Comment Edited] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads
[ https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511224#comment-17511224 ] daimin edited comment on HDFS-16422 at 3/23/22, 12:30 PM: -- [~jingzhao] I tested this again, and my test steps are: # Setup a cluster with 11 datanodes, and write 4 EC RS-8-2 files: 1g, 2g, 4g, 8g # Stop one datanode # Check md5sum of these files through HDFS FUSE, this is a simple way to create concurrent preads(indirect IO on FUSE) Here is test result: * md5sum check before datanode down: {quote}md5sum /mnt/fuse/*g 5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g 782173623681c129558c09e89251f46d /mnt/fuse/2g e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g adb81da2c34161f249439597c515db1d /mnt/fuse/8g {quote} * md5sum after datanode down, with native(ISA-L) decoder: {quote}md5sum /mnt/fuse/*g 206288b264b92af42563a14a242aa629 /mnt/fuse/1g bc86f9f549912d78c8b3d02ada5621a2 /mnt/fuse/2g c201356b7437e6aac1b574ade08b6ccb /mnt/fuse/4g ef2e6f6b4b6ab96a24e5f734e93bacc3 /mnt/fuse/8g {quote} * md5sum after datanode down, with pure Java decoder: {quote}md5sum /mnt/fuse/*g 5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g 782173623681c129558c09e89251f46d /mnt/fuse/2g e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g adb81da2c34161f249439597c515db1d /mnt/fuse/8g {quote} In conclusion: RSRawDecoder seems to be thread safe, NativeRSRawDecoder is not thread safe, the read/write lock seems unable to protect the native decodeImpl method. And I also tested on md5sum check on same file with native(ISA-L) decoder, the result is different every time. 
{quote}
for i in \{1..5};do md5sum /mnt/fuse/1g;done
2e68ea6738dccb4f248df81b5c55d464 /mnt/fuse/1g
54944120797266fc4e26bd465ae5e67a /mnt/fuse/1g
ef4d099269fb117e357015cf424723a9 /mnt/fuse/1g
6a40dbca2636ae796b6380385ddfbc83 /mnt/fuse/1g
126fc40073dcebb67d413de95571c08b /mnt/fuse/1g
{quote}
IMO, HADOOP-15499 did improve the performance of the decoder, but it broke the correctness of the decode method when invoked concurrently. We should bring synchronized back, and it's ok to keep the read/write lock too, as it protects against the init/release methods. Thanks [~jingzhao] again.

was (Author: cndaimin):
[~jingzhao] I tested this again, and my test steps are:
# Set up a cluster with 11 datanodes, and write 4 EC RS-8-2 files: 1g, 2g, 4g, 8g
# Stop one datanode
# Check md5sum of these files through HDFS FUSE; this is a simple way to create concurrent preads (indirect IO on FUSE)

Here is the test result:
* md5sum check before datanode down:
{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g
782173623681c129558c09e89251f46d /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g
adb81da2c34161f249439597c515db1d /mnt/fuse/8g
{quote}
* md5sum after datanode down, with native(ISA-L) decoder:
{quote}md5sum /mnt/fuse/*g
206288b264b92af42563a14a242aa629 /mnt/fuse/1g
bc86f9f549912d78c8b3d02ada5621a2 /mnt/fuse/2g
c201356b7437e6aac1b574ade08b6ccb /mnt/fuse/4g
ef2e6f6b4b6ab96a24e5f734e93bacc3 /mnt/fuse/8g
{quote}
* md5sum after datanode down, with pure Java decoder:
{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g
782173623681c129558c09e89251f46d /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g
adb81da2c34161f249439597c515db1d /mnt/fuse/8g
{quote}
In conclusion: RSRawDecoder seems to be thread safe, while NativeRSRawDecoder is not; the read/write lock seems unable to protect the native decodeImpl method. I also ran md5sum repeatedly on the same file with the native (ISA-L) decoder, and the result is different every time.
{quote} for i in \{1..5};do md5sum /mnt/fuse/1g;done 2e68ea6738dccb4f248df81b5c55d464 /mnt/fuse/1g 54944120797266fc4e26bd465ae5e67a /mnt/fuse/1g ef4d099269fb117e357015cf424723a9 /mnt/fuse/1g 6a40dbca2636ae796b6380385ddfbc83 /mnt/fuse/1g 126fc40073dcebb67d413de95571c08b /mnt/fuse/1g {quote} IMO, HADOOP-15499 did improve the performance of decoder, however it breaked the correctness of decode method when invoked concurrently. We should take synchronized back, and I will submit a new PR later to do this work. Thanks [~jingzhao] again. > Fix thread safety of EC decoding during concurrent preads > - > > Key: HDFS-16422 > URL: https://issues.apache.org/jira/browse/HDFS-16422 > Project: Hadoop HDFS > Issue Type: Bug > Components: dfsclient, ec, erasure-coding >Affects Versions: 3.3.0, 3.3.1 >Reporter: daimin >Assignee: daimin >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0, 3.2.3, 3.3.3 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > Reading data on an erasure-coded file with missing replicas(internal block of > block grou
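The repeated-md5sum check from the comment can also be expressed in code. The sketch below is an assumption-based illustration, not part of the test that was actually run: the class name and the Supplier-based reader are hypothetical, and a real check would read the file through the HDFS client or the FUSE mount instead of an in-memory array.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical helper: read the same data several times and compare MD5 digests. */
final class RepeatedDigestSketch {
  /** Returns the MD5 digest of data as a lowercase hex string. */
  static String md5Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  /** Reads the data the given number of times and counts the distinct digests seen. */
  static int distinctDigests(Supplier<byte[]> reader, int rounds) {
    Set<String> seen = new HashSet<>();
    for (int i = 0; i < rounds; i++) {
      seen.add(md5Hex(reader.get()));
    }
    return seen.size();
  }

  public static void main(String[] args) {
    byte[] stable = "stable content".getBytes(StandardCharsets.UTF_8);
    // A correct reader returns identical bytes on every round, so one digest is seen.
    System.out.println(distinctDigests(() -> stable, 5));
  }
}
```

A count greater than one for an unchanged file corresponds to the five different md5sum values shown for /mnt/fuse/1g with the native decoder.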
[jira] [Comment Edited] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads
[ https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511224#comment-17511224 ]

daimin edited comment on HDFS-16422 at 3/23/22, 12:26 PM:
----------------------------------------------------------

[~jingzhao] I tested this again. My test steps were:
 # Set up a cluster with 11 datanodes, and write 4 EC RS-8-2 files: 1g, 2g, 4g, 8g
 # Stop one datanode
 # Check the md5sum of these files through HDFS FUSE; this is a simple way to create concurrent preads (indirect IO on FUSE)

Here are the test results:
 * md5sum check before the datanode went down:
{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g
782173623681c129558c09e89251f46d /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g
adb81da2c34161f249439597c515db1d /mnt/fuse/8g
{quote}
 * md5sum check after the datanode went down, with the native (ISA-L) decoder:
{quote}md5sum /mnt/fuse/*g
206288b264b92af42563a14a242aa629 /mnt/fuse/1g
bc86f9f549912d78c8b3d02ada5621a2 /mnt/fuse/2g
c201356b7437e6aac1b574ade08b6ccb /mnt/fuse/4g
ef2e6f6b4b6ab96a24e5f734e93bacc3 /mnt/fuse/8g
{quote}
 * md5sum check after the datanode went down, with the pure Java decoder:
{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g
782173623681c129558c09e89251f46d /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g
adb81da2c34161f249439597c515db1d /mnt/fuse/8g
{quote}

In conclusion: RSRawDecoder seems to be thread safe, while NativeRSRawDecoder is not; the read/write lock seems unable to protect the native decodeImpl method.

I also ran the md5sum check repeatedly on the same file with the native (ISA-L) decoder, and the result was different every time:
{quote}for i in {1..5}; do md5sum /mnt/fuse/1g; done
2e68ea6738dccb4f248df81b5c55d464 /mnt/fuse/1g
54944120797266fc4e26bd465ae5e67a /mnt/fuse/1g
ef4d099269fb117e357015cf424723a9 /mnt/fuse/1g
6a40dbca2636ae796b6380385ddfbc83 /mnt/fuse/1g
126fc40073dcebb67d413de95571c08b /mnt/fuse/1g
{quote}
IMO, HADOOP-15499 did improve the performance of the decoder; however, it broke the correctness of the decode method when it is invoked concurrently. We should bring synchronized back, and I will submit a new PR later to do this work. Thanks again, [~jingzhao].

> Fix thread safety of EC decoding during concurrent preads
> ---------------------------------------------------------
>
> Key: HDFS-16422
> URL: https://issues.apache.org/jira/browse/HDFS-16422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: dfsclient, ec, erasure-coding
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.3
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Reading data from an erasure-coded file with missing replicas (internal blocks of a
> block group) will cause online reconstruction: rea
[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads
[ https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511224#comment-17511224 ]

daimin commented on HDFS-16422:
-------------------------------

[~jingzhao] I tested this again. My test steps were:
 # Set up a cluster with 11 datanodes, and write 4 EC RS-8-2 files: 1g, 2g, 4g, 8g
 # Stop one datanode
 # Check the md5sum of these files through HDFS FUSE; this is a simple way to create concurrent preads (indirect IO on FUSE)

Here are the test results:
 * md5sum check before the datanode went down:
{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g
782173623681c129558c09e89251f46d /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g
adb81da2c34161f249439597c515db1d /mnt/fuse/8g
{quote}
 * md5sum check after the datanode went down, with the native (ISA-L) decoder:
{quote}md5sum /mnt/fuse/*g
206288b264b92af42563a14a242aa629 /mnt/fuse/1g
bc86f9f549912d78c8b3d02ada5621a2 /mnt/fuse/2g
c201356b7437e6aac1b574ade08b6ccb /mnt/fuse/4g
ef2e6f6b4b6ab96a24e5f734e93bacc3 /mnt/fuse/8g
{quote}
 * md5sum check after the datanode went down, with the pure Java decoder:
{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b /mnt/fuse/1g
782173623681c129558c09e89251f46d /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c /mnt/fuse/4g
adb81da2c34161f249439597c515db1d /mnt/fuse/8g
{quote}

In conclusion: RSRawDecoder seems to be thread safe, while NativeRSRawDecoder is not; the read/write lock seems unable to protect the native decodeImpl method.

I also ran the md5sum check repeatedly on the same file with the native (ISA-L) decoder, and the result was different every time:
{quote}for i in {1..5}; do md5sum /mnt/fuse/1g; done
2e68ea6738dccb4f248df81b5c55d464 /mnt/fuse/1g
54944120797266fc4e26bd465ae5e67a /mnt/fuse/1g
ef4d099269fb117e357015cf424723a9 /mnt/fuse/1g
6a40dbca2636ae796b6380385ddfbc83 /mnt/fuse/1g
126fc40073dcebb67d413de95571c08b /mnt/fuse/1g
{quote}
IMO, HADOOP-15499 did improve the performance of the decoder; however, it broke the correctness of the decode method when it is invoked concurrently. We should bring synchronized back, and I will submit a new PR later to do this work. Thanks again, [~jingzhao].

> Fix thread safety of EC decoding during concurrent preads
> ---------------------------------------------------------
>
> Key: HDFS-16422
> URL: https://issues.apache.org/jira/browse/HDFS-16422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: dfsclient, ec, erasure-coding
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.3
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Reading data from an erasure-coded file with missing replicas (internal blocks of a
> block group) will cause online reconstruction: read the dataUnits part of the data
> and decode it into the target missing data. Each DFSStripedInputStream
> object has a RawErasureDecoder object, and when we do preads concurrently,
> RawErasureDecoder.decode will be invoked concurrently too.
> RawErasureDecoder.decode is not thread safe, and as a result we occasionally
> get wrong data from pread.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16510) Fix EC decommission when rack is not enough
daimin created HDFS-16510:
--------------------------

Summary: Fix EC decommission when rack is not enough
Key: HDFS-16510
URL: https://issues.apache.org/jira/browse/HDFS-16510
Project: Hadoop HDFS
Issue Type: Bug
Components: block placement, ec
Affects Versions: 3.3.2, 3.3.1
Reporter: daimin
Assignee: daimin

Decommission always fails when we decommission multiple nodes at once on a cluster that does not have enough racks, for example a cluster with 6 racks deploying RS-6-3. We find that those decommissioning nodes cover at least one rack, so it is actually as if we were decommissioning one or more racks. Rack decommission is not well supported currently, especially for clusters that already do not have enough racks.
[jira] [Created] (HDFS-16509) Fix decommission UnsupportedOperationException: Remove unsupported
daimin created HDFS-16509:
--------------------------

Summary: Fix decommission UnsupportedOperationException: Remove unsupported
Key: HDFS-16509
URL: https://issues.apache.org/jira/browse/HDFS-16509
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.3.2, 3.3.1
Reporter: daimin
Assignee: daimin

We encountered an "UnsupportedOperationException: Remove unsupported" error when some datanodes were being decommissioned. The reason for the exception is that datanode.getBlockIterator() returns an Iterator that does not support remove, yet DatanodeAdminDefaultMonitor#processBlocksInternal invokes it.remove() when a block is not found, e.g. when the file containing the block has been deleted.
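The failure mode described in HDFS-16509 can be reproduced outside Hadoop. The sketch below is only an illustration, not the DatanodeAdminDefaultMonitor code: an unmodifiable JDK list stands in for the iterator returned by datanode.getBlockIterator(), and the names IteratorRemoveSketch, pruneDirectly and pruneByCollecting are hypothetical. The workaround shown is one possible approach and is not claimed to be the committed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the decommission monitor loop; not Hadoop code.
class IteratorRemoveSketch {

  // Mimics the failing pattern: it.remove() on an iterator that may not support it.
  // Returns false when the iterator rejects the removal.
  static boolean pruneDirectly(List<String> blocks, String missing) {
    Iterator<String> it = blocks.iterator();
    try {
      while (it.hasNext()) {
        if (it.next().equals(missing)) {
          it.remove(); // throws UnsupportedOperationException on a read-only iterator
        }
      }
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }

  // Possible workaround (an assumption, not the committed fix): skip the missing
  // entry and build the list of surviving blocks without calling remove().
  static List<String> pruneByCollecting(List<String> blocks, String missing) {
    List<String> kept = new ArrayList<>();
    for (String block : blocks) {
      if (!block.equals(missing)) {
        kept.add(block);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> readOnly =
        Collections.unmodifiableList(Arrays.asList("blk_1", "blk_2", "blk_3"));
    System.out.println("direct removal succeeded: " + pruneDirectly(readOnly, "blk_2"));
    System.out.println("surviving blocks: " + pruneByCollecting(readOnly, "blk_2"));
  }
}
```

The same loop works unchanged on a plain ArrayList, which is why the problem only shows up when the iterator implementation is read-only.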
[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads
[ https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507969#comment-17507969 ]

daimin commented on HDFS-16422:
-------------------------------

I tested both NativeRSRawDecoder and RSRawDecoder before, and neither of them seemed to be thread safe when decoding, so I simply added synchronized to the decode method. In consideration of HADOOP-15499, I will do some more tests to find out what is missing from the original lock protection. Thanks for the reminder, [~jingzhao].

> Fix thread safety of EC decoding during concurrent preads
> ---------------------------------------------------------
>
> Key: HDFS-16422
> URL: https://issues.apache.org/jira/browse/HDFS-16422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: dfsclient, ec, erasure-coding
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.3
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Reading data from an erasure-coded file with missing replicas (internal blocks of a
> block group) will cause online reconstruction: read the dataUnits part of the data
> and decode it into the target missing data. Each DFSStripedInputStream
> object has a RawErasureDecoder object, and when we do preads concurrently,
> RawErasureDecoder.decode will be invoked concurrently too.
> RawErasureDecoder.decode is not thread safe, and as a result we occasionally
> get wrong data from pread.
[jira] [Created] (HDFS-16430) Validate maximum blocks in EC group when adding an EC policy
daimin created HDFS-16430:
--------------------------

Summary: Validate maximum blocks in EC group when adding an EC policy
Key: HDFS-16430
URL: https://issues.apache.org/jira/browse/HDFS-16430
Project: Hadoop HDFS
Issue Type: Improvement
Components: ec, erasure-coding
Affects Versions: 3.3.1, 3.3.0
Reporter: daimin
Assignee: daimin

HDFS EC uses the last 4 bits of the block ID to store the block index within an EC block group. Therefore the maximum number of blocks in an EC block group is 2^4 = 16, which is defined in HdfsServerConstants#MAX_BLOCKS_IN_GROUP. Currently there is no limitation or warning when adding a bad EC policy with numDataUnits + numParityUnits > 16. It only results in read/write errors on EC files that use the bad policy, which is not very straightforward for users.
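The check proposed in HDFS-16430 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch itself: the class EcPolicyCheckSketch and its methods are hypothetical, and the constant is redefined locally to mirror the value 16 that the issue attributes to HdfsServerConstants#MAX_BLOCKS_IN_GROUP.

```java
// Hypothetical validator; none of these names are Hadoop APIs.
class EcPolicyCheckSketch {

  // The issue states that 4 bits of the block ID hold the block index.
  static final int BLOCK_INDEX_BITS = 4;
  // Local mirror of the limit described in the issue: 2^4 = 16.
  static final int MAX_BLOCKS_IN_GROUP = 1 << BLOCK_INDEX_BITS;

  // Rejects a policy whose data + parity units cannot be indexed in 4 bits.
  static void validate(int numDataUnits, int numParityUnits) {
    if (numDataUnits <= 0 || numParityUnits <= 0) {
      throw new IllegalArgumentException("unit counts must be positive");
    }
    int total = numDataUnits + numParityUnits;
    if (total > MAX_BLOCKS_IN_GROUP) {
      throw new IllegalArgumentException(
          "numDataUnits + numParityUnits = " + total
              + " exceeds the maximum of " + MAX_BLOCKS_IN_GROUP);
    }
  }

  // Convenience wrapper that reports the result instead of throwing.
  static boolean isValid(int numDataUnits, int numParityUnits) {
    try {
      validate(numDataUnits, numParityUnits);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("RS-6-3 valid: " + isValid(6, 3));
    System.out.println("RS-14-4 valid: " + isValid(14, 4));
  }
}
```

Failing at policy-creation time gives the user an explicit message, instead of a read/write error that only appears later on files using the policy.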
[jira] [Created] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads
daimin created HDFS-16422:
--------------------------

Summary: Fix thread safety of EC decoding during concurrent preads
Key: HDFS-16422
URL: https://issues.apache.org/jira/browse/HDFS-16422
Project: Hadoop HDFS
Issue Type: Bug
Components: dfsclient, ec, erasure-coding
Affects Versions: 3.3.1, 3.3.0
Reporter: daimin
Assignee: daimin

Reading data from an erasure-coded file with missing replicas (internal blocks of a block group) will cause online reconstruction: read the dataUnits part of the data and decode it into the target missing data. Each DFSStripedInputStream object has a RawErasureDecoder object, and when we do preads concurrently, RawErasureDecoder.decode will be invoked concurrently too. RawErasureDecoder.decode is not thread safe, and as a result we occasionally get wrong data from pread.
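The idea of serializing decode calls, as discussed in HDFS-16422, can be illustrated with a toy example. Everything here is an assumption made for illustration: DecoderSyncSketch is a hypothetical class with shared scratch state that stands in for a non-thread-safe decoder, and it does not reproduce the RawErasureDecoder API or the ISA-L native code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy decoder with shared mutable state; a stand-in, not Hadoop's RawErasureDecoder.
class DecoderSyncSketch {

  // Scratch buffer shared by all calls on this instance.
  private final int[] scratch = new int[1];

  // Not safe for concurrent use: another thread may overwrite scratch
  // between the write and the read below.
  int decodeUnsafe(int input) {
    scratch[0] = input;
    Thread.yield();
    return scratch[0] * 2;
  }

  // Serialized variant: at most one decode runs at a time per decoder instance.
  synchronized int decode(int input) {
    return decodeUnsafe(input);
  }

  // Runs concurrent synchronized decodes on one shared decoder and counts wrong results.
  static int countWrong(int threads, int perThread) {
    DecoderSyncSketch decoder = new DecoderSyncSketch();
    AtomicInteger wrong = new AtomicInteger();
    Thread[] workers = new Thread[threads];
    for (int t = 0; t < threads; t++) {
      final int base = t * perThread;
      workers[t] = new Thread(() -> {
        for (int i = 0; i < perThread; i++) {
          int in = base + i;
          if (decoder.decode(in) != in * 2) {
            wrong.incrementAndGet();
          }
        }
      });
      workers[t].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return wrong.get();
  }

  public static void main(String[] args) {
    System.out.println("wrong results with synchronized decode: " + countWrong(4, 1000));
  }
}
```

The trade-off is the one raised in the comments on this issue: a single lock around decode restores correctness but gives up the concurrency that a read/write lock was meant to provide.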
[jira] [Updated] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

daimin updated HDFS-8791:
-------------------------
Description:
We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout.

The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed.

So, what we have on disk is:
- 256 inodes for the first level directories
- 256 directory blocks for the first level directories
- 256*256 inodes for the second level directories
- 256*256 directory blocks for the second level directories
- Then the inodes and blocks to store the HDFS blocks themselves.

The main problem is the 256*256 directory blocks.

inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general).

Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far.

In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes.

How do things get cold?
- A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster.

Why didn't the previous layout see this?
- It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks which would mean single digit seconds.
- With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted.

> block ID-based DN storage layout can be very slow for datanode
[jira] [Created] (HDFS-16403) Improve FUSE IO performance by supporting FUSE parameter max_background
daimin created HDFS-16403:
--------------------------

Summary: Improve FUSE IO performance by supporting FUSE parameter max_background
Key: HDFS-16403
URL: https://issues.apache.org/jira/browse/HDFS-16403
Project: Hadoop HDFS
Issue Type: Improvement
Components: fuse-dfs
Affects Versions: 3.3.1, 3.3.0
Reporter: daimin
Assignee: daimin

When we examined FUSE IO performance on HDFS, we found that the number of simultaneous IO requests was limited to a fixed number, such as 12. This limitation makes IO performance on the FUSE client quite unacceptable. We did some research on this and, inspired by the article [Performance and Resource Utilization of FUSE User-Space File Systems|https://dl.acm.org/doi/fullHtml/10.1145/3310148], found that the FUSE parameter '{{max_background}}' decides the number of simultaneous IO requests, which is 12 by default. We add 'max_background' to the fuse_dfs mount options; the FUSE kernel applies it when an option value is given.
[jira] [Commented] (HDFS-16286) Debug tool to verify the correctness of erasure coding on file
[ https://issues.apache.org/jira/browse/HDFS-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436476#comment-17436476 ]

daimin commented on HDFS-16286:
-------------------------------

[~sodonnell] Thanks for your reply. I have read the README document and the code implementation, and the EC validator looks good enough to me. In fact, if I had known this work had already been done, I would not have spent time redoing it.

Our initial motivation for building such a tool was that we had data corruption in EC block groups in our production environment, and one big problem was that we could not tell which files were good and which were bad. After we checked all the suspicious files, we thought this tool might also be useful to others who are using EC.

Could you please spend some time reviewing the patch? Thanks a lot.

> Debug tool to verify the correctness of erasure coding on file
> --------------------------------------------------------------
>
> Key: HDFS-16286
> URL: https://issues.apache.org/jira/browse/HDFS-16286
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, tools
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Block data in an erasure-coded block group may become corrupted, and the block meta
> (checksum) is unable to detect the corruption in some cases, such as EC
> reconstruction. Related issues are: HDFS-14768, HDFS-15186, HDFS-15240.
> In addition to HDFS-15759, a tool is needed to check whether any block group of an
> erasure-coded file has data corruption, either under conditions other than EC
> reconstruction or when the HDFS-15759 feature (validation during EC
> reconstruction) is not enabled (it is disabled by default now).
[jira] [Comment Edited] (HDFS-16286) Debug tool to verify the correctness of erasure coding on file
[ https://issues.apache.org/jira/browse/HDFS-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434694#comment-17434694 ]

daimin edited comment on HDFS-16286 at 10/27/21, 6:57 AM:
----------------------------------------------------------

pull request #3593:
URL: [https://github.com/apache/hadoop/pull/3593|https://github.com/apache/hadoop/pull/3593]

was (Author: cndaimin):
pull request #3593:
URL: [https://github.com/apache/hadoop/pull/3593|https://github.com/apache/hadoop/pull/3548]

> Debug tool to verify the correctness of erasure coding on file
> --------------------------------------------------------------
>
> Key: HDFS-16286
> URL: https://issues.apache.org/jira/browse/HDFS-16286
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, tools
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
> Labels: pull-request-available
>
> Block data in an erasure-coded block group may become corrupted, and the block meta
> (checksum) is unable to detect the corruption in some cases, such as EC
> reconstruction. Related issues are: HDFS-14768, HDFS-15186, HDFS-15240.
> In addition to HDFS-15759, a tool is needed to check whether any block group of an
> erasure-coded file has data corruption, either under conditions other than EC
> reconstruction or when the HDFS-15759 feature (validation during EC
> reconstruction) is not enabled (it is disabled by default now).
[jira] [Updated] (HDFS-16286) Debug tool to verify the correctness of erasure coding on file
[ https://issues.apache.org/jira/browse/HDFS-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

daimin updated HDFS-16286:
--------------------------
Labels: pull-request-available (was: )

> Debug tool to verify the correctness of erasure coding on file
> --------------------------------------------------------------
>
> Key: HDFS-16286
> URL: https://issues.apache.org/jira/browse/HDFS-16286
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, tools
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
> Labels: pull-request-available
>
> Block data in an erasure-coded block group may become corrupted, and the block meta
> (checksum) is unable to detect the corruption in some cases, such as EC
> reconstruction. Related issues are: HDFS-14768, HDFS-15186, HDFS-15240.
> In addition to HDFS-15759, a tool is needed to check whether any block group of an
> erasure-coded file has data corruption, either under conditions other than EC
> reconstruction or when the HDFS-15759 feature (validation during EC
> reconstruction) is not enabled (it is disabled by default now).
[jira] [Commented] (HDFS-16286) Debug tool to verify the correctness of erasure coding on file
[ https://issues.apache.org/jira/browse/HDFS-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434694#comment-17434694 ]

daimin commented on HDFS-16286:
-------------------------------

pull request #3593:
URL: [https://github.com/apache/hadoop/pull/3593|https://github.com/apache/hadoop/pull/3548]

> Debug tool to verify the correctness of erasure coding on file
> --------------------------------------------------------------
>
> Key: HDFS-16286
> URL: https://issues.apache.org/jira/browse/HDFS-16286
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, tools
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
>
> Block data in an erasure-coded block group may become corrupted, and the block meta
> (checksum) is unable to detect the corruption in some cases, such as EC
> reconstruction. Related issues are: HDFS-14768, HDFS-15186, HDFS-15240.
> In addition to HDFS-15759, a tool is needed to check whether any block group of an
> erasure-coded file has data corruption, either under conditions other than EC
> reconstruction or when the HDFS-15759 feature (validation during EC
> reconstruction) is not enabled (it is disabled by default now).
[jira] [Updated] (HDFS-16286) Debug tool to verify the correctness of erasure coding on file
[ https://issues.apache.org/jira/browse/HDFS-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

daimin updated HDFS-16286:
--------------------------
Labels: (was: pull)

> Debug tool to verify the correctness of erasure coding on file
> --------------------------------------------------------------
>
> Key: HDFS-16286
> URL: https://issues.apache.org/jira/browse/HDFS-16286
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, tools
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
>
> Block data in an erasure-coded block group may become corrupted, and the block meta
> (checksum) is unable to detect the corruption in some cases, such as EC
> reconstruction. Related issues are: HDFS-14768, HDFS-15186, HDFS-15240.
> In addition to HDFS-15759, a tool is needed to check whether any block group of an
> erasure-coded file has data corruption, either under conditions other than EC
> reconstruction or when the HDFS-15759 feature (validation during EC
> reconstruction) is not enabled (it is disabled by default now).
[jira] [Updated] (HDFS-16286) Debug tool to verify the correctness of erasure coding on file
[ https://issues.apache.org/jira/browse/HDFS-16286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

daimin updated HDFS-16286:
--------------------------
Labels: pull (was: )

> Debug tool to verify the correctness of erasure coding on file
> --------------------------------------------------------------
>
> Key: HDFS-16286
> URL: https://issues.apache.org/jira/browse/HDFS-16286
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding, tools
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Minor
> Labels: pull
>
> Block data in an erasure-coded block group may become corrupted, and the block meta
> (checksum) is unable to detect the corruption in some cases, such as EC
> reconstruction. Related issues are: HDFS-14768, HDFS-15186, HDFS-15240.
> In addition to HDFS-15759, a tool is needed to check whether any block group of an
> erasure-coded file has data corruption, either under conditions other than EC
> reconstruction or when the HDFS-15759 feature (validation during EC
> reconstruction) is not enabled (it is disabled by default now).
[jira] [Created] (HDFS-16286) Debug tool to verify the correctness of erasure coding on file
daimin created HDFS-16286:
--------------------------

Summary: Debug tool to verify the correctness of erasure coding on file
Key: HDFS-16286
URL: https://issues.apache.org/jira/browse/HDFS-16286
Project: Hadoop HDFS
Issue Type: Improvement
Components: erasure-coding, tools
Affects Versions: 3.3.1, 3.3.0
Reporter: daimin
Assignee: daimin

Block data in an erasure-coded block group may become corrupted, and the block meta (checksum) is unable to detect the corruption in some cases, such as EC reconstruction. Related issues are: HDFS-14768, HDFS-15186, HDFS-15240.

In addition to HDFS-15759, a tool is needed to check whether any block group of an erasure-coded file has data corruption, either under conditions other than EC reconstruction or when the HDFS-15759 feature (validation during EC reconstruction) is not enabled (it is disabled by default now).
[jira] [Created] (HDFS-16282) Duplicate generic usage information to hdfs debug command
daimin created HDFS-16282: - Summary: Duplicate generic usage information to hdfs debug command Key: HDFS-16282 URL: https://issues.apache.org/jira/browse/HDFS-16282 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 3.3.1, 3.3.0 Reporter: daimin Assignee: daimin When we type 'hdfs debug' in console, the generic usage information will be repeated 4 times and the target command like 'verifyMeta' or 'recoverLease' seems hard to find. {quote}~ $ hdfs debug Usage: hdfs debug [arguments] These commands are for advanced users only. Incorrect usages may result in data loss. Use at your own risk. verifyMeta -meta [-block ] Generic options supported are: -conf specify an application configuration file -D define a value for a given property -fs specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations. -jt specify a ResourceManager -files specify a comma-separated list of files to be copied to the map reduce cluster -libjars specify a comma-separated list of jar files to be included in the classpath -archives specify a comma-separated list of archives to be unarchived on the compute machines The general command line syntax is: command [genericOptions] [commandOptions] computeMeta -block -out Generic options supported are: -conf specify an application configuration file -D define a value for a given property -fs specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations. 
-jt specify a ResourceManager -files specify a comma-separated list of files to be copied to the map reduce cluster -libjars specify a comma-separated list of jar files to be included in the classpath -archives specify a comma-separated list of archives to be unarchived on the compute machines The general command line syntax is: command [genericOptions] [commandOptions] recoverLease -path [-retries ] Generic options supported are: -conf specify an application configuration file -D define a value for a given property -fs specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations. -jt specify a ResourceManager -files specify a comma-separated list of files to be copied to the map reduce cluster -libjars specify a comma-separated list of jar files to be included in the classpath -archives specify a comma-separated list of archives to be unarchived on the compute machines The general command line syntax is: command [genericOptions] [commandOptions] Generic options supported are: -conf specify an application configuration file -D define a value for a given property -fs specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations. -jt specify a ResourceManager -files specify a comma-separated list of files to be copied to the map reduce cluster -libjars specify a comma-separated list of jar files to be included in the classpath -archives specify a comma-separated list of archives to be unarchived on the compute machines The general command line syntax is: command [genericOptions] [commandOptions] {quote}
[jira] [Created] (HDFS-16272) Int overflow in computing safe length during EC block recovery
daimin created HDFS-16272: - Summary: Int overflow in computing safe length during EC block recovery Key: HDFS-16272 URL: https://issues.apache.org/jira/browse/HDFS-16272 Project: Hadoop HDFS Issue Type: Bug Components: 3.1.1 Affects Versions: 3.3.1, 3.3.0 Environment: Cluster settings: EC RS-8-2-256k, Block Size 1GiB. Reporter: daimin There is an int overflow in StripedBlockUtil#getSafeLength that can produce a negative or zero length: 1. A negative length fails the later >=0 check and crashes the BlockRecoveryWorker thread, so the lease recovery operation cannot finish. 2. A zero length passes the check, and the block is truncated directly to size zero, which leads to data loss.
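To illustrate how both outcomes can arise with the reported settings (RS-8-2-256k, 1GiB blocks), here is a minimal standalone sketch. It is not the Hadoop source: the class and method names are hypothetical, and it assumes the safe length is computed as (index of the last full stripe) * (stripe size) using int arithmetic, where stripe size = data units * cell size.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; it does not use or reproduce Hadoop classes.
final class SafeLengthSketch {

    // Assumed buggy shape: the product of two ints wraps around in 32 bits
    // and is widened to long only afterwards.
    static long safeLengthIntMath(int cellSize, int dataBlkNum, long[] blockLens) {
        int stripeSize = dataBlkNum * cellSize;
        long[] sorted = Arrays.copyOf(blockLens, blockLens.length);
        Arrays.sort(sorted);
        // The dataBlkNum-th largest block length bounds the number of full stripes.
        int lastFullStripeIdx = (int) (sorted[sorted.length - dataBlkNum] / cellSize);
        return lastFullStripeIdx * stripeSize; // int * int, may overflow
    }

    // Same computation carried out in 64-bit arithmetic.
    static long safeLengthLongMath(int cellSize, int dataBlkNum, long[] blockLens) {
        long stripeSize = (long) dataBlkNum * cellSize;
        long[] sorted = Arrays.copyOf(blockLens, blockLens.length);
        Arrays.sort(sorted);
        long lastFullStripeIdx = sorted[sorted.length - dataBlkNum] / cellSize;
        return lastFullStripeIdx * stripeSize;
    }

    public static void main(String[] args) {
        int cellSize = 256 * 1024; // 256k cells
        int dataBlkNum = 8;        // RS-8-2: 8 data units, 2 parity units

        long[] fullBlocks = new long[10];
        Arrays.fill(fullBlocks, 1L << 30);    // every internal block is 1GiB
        long[] quarterBlocks = new long[10];
        Arrays.fill(quarterBlocks, 1L << 28); // every internal block is 256MiB

        // 4096 stripes * 2^21 bytes = 2^33, which wraps to 0 in an int.
        System.out.println(safeLengthIntMath(cellSize, dataBlkNum, fullBlocks));
        // 1024 stripes * 2^21 bytes = 2^31, which wraps to a negative int.
        System.out.println(safeLengthIntMath(cellSize, dataBlkNum, quarterBlocks));
        // The 64-bit variant keeps the full values.
        System.out.println(safeLengthLongMath(cellSize, dataBlkNum, fullBlocks));
        System.out.println(safeLengthLongMath(cellSize, dataBlkNum, quarterBlocks));
    }
}
```

Under these assumptions, widening one operand to long before the multiplication is enough to avoid both the zero and the negative result.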