[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954304#comment-16954304 ]

Hadoop QA commented on HDFS-14768:
----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 58s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 39m 24s | trunk passed |
| +1 | compile | 2m 42s | trunk passed |
| +1 | checkstyle | 2m 9s | trunk passed |
| +1 | mvnsite | 2m 52s | trunk passed |
| +1 | shadedclient | 31m 11s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 36s | trunk passed |
| +1 | javadoc | 2m 40s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 2m 17s | the patch passed |
| +1 | compile | 2m 2s | the patch passed |
| +1 | javac | 2m 2s | the patch passed |
| +1 | checkstyle | 1m 32s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 170 unchanged - 1 fixed = 170 total (was 171) |
| +1 | mvnsite | 2m 17s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 26m 8s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 36s | the patch passed |
| +1 | javadoc | 2m 11s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 156m 8s | hadoop-hdfs in the patch failed. |
| -1 | asflicense | 1m 30s | The patch generated 50 ASF License warnings. |
| | | 284m 48s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestNNHandlesCombinedBlockReport |
| | hadoop.hdfs.tools.TestECAdmin |
| | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics |
| | hadoop.hdfs.TestFileChecksum |
| | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
| | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
| | hadoop.hdfs.TestDFSStripedInputStream |
| | hadoop.hdfs.server.datanode.TestBlockHasMultipleReplicasOnSameDN |
| | hadoop.hdfs.qjournal.server.TestJournalNodeMXBean |
| | hadoop.hdfs.TestDecommission |
| | hadoop.hdfs.TestFileCreation |
| | hadoop.hdfs.server.namenode.TestQuotaWithStripedBlocksWithRandomECPolicy |
| | hadoop.hdfs.server.namenode.TestAddStripedBlocks |
| | hadoop.hdfs.TestPread |
| | hadoop.hdfs.server.datanode.TestDataNodeReconfiguration |
| | hadoop.hdfs.tools.TestStoragePolicySatisfyAdminCommands |
| | hadoop.hdfs.server.datanode.TestDataNodeUUID |
| | hadoop.hdfs.server.namenode.TestDeleteRace |
| | hadoop.hdfs.tools.TestDelegationTokenFetcher |
| | hadoop.hdfs.TestParallelShortCircuitRead |
| | hadoop.hdfs.TestFileCorruption |
| | hadoop.hdfs.TestDatanodeReport |
| | hadoop.hdfs.TestEncryptionZo
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954296#comment-16954296 ]

guojh commented on HDFS-14768:
------------------------------

rebase trunk

> In some cases, erasure blocks are corruption when they are reconstruct.
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14768
>                 URL: https://issues.apache.org/jira/browse/HDFS-14768
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, erasure-coding, hdfs, namenode
>    Affects Versions: 3.0.2
>            Reporter: guojh
>            Assignee: guojh
>            Priority: Major
>              Labels: patch
>             Fix For: 3.3.0
>
>         Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg,
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch,
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch,
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt,
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt,
> zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K; the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the
> datanodes holding indices [3,4] and increase pendingReplicationWithoutTargets
> on the datanode holding index 6 until it is larger than
> replicationStreamsHardLimit (we set 14). Then, after the method
> chooseSourceDatanodes of BlockManager, liveBlockIndices is
> [0,1,2,3,4,5,7,8], and the block counters are Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the NameNode chooses two target datanodes, it assigns an
> erasure-coding task to the target datanode.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices
> and the target length. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short)i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] always keeps its initial value 0.
> The StripedReader always creates readers from the first 6 block indices,
> which are [0,1,2,3,4,5].
> Using indices [0,1,2,3,4,5] to build target indices [6,0] triggers the ISA-L
> bug: the data of block index 6 is corrupted (all zeros).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit =
>     DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
>
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>       .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>       StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>       .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>       (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor =
>       cluster.getNameNode().getNamesystem()
>       .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
>     BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new Block(i),
>         new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<DatanodeInfo>();
>   // add the nodes which will be decommissioned
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode
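
The index arithmetic described in the HDFS-14768 report can be replayed outside Hadoop. The following is a minimal sketch, not the datanode code itself: the class `TargetIndicesSketch` and its method names are hypothetical, the RS-6-3 constants are hard-coded, and every block is assumed to be non-empty. It shows that when the NameNode asks for two targets but only index 6 is absent from the live bit set, the second slot of `targetIndices` is never written and keeps its default value 0.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical stand-alone replay of the loop quoted in the issue (RS-6-3).
public class TargetIndicesSketch {
  static final int DATA_BLK_NUM = 6;
  static final int PARITY_BLK_NUM = 3;

  // Fills targetIndices with the block indices missing from liveBitSet,
  // up to numTargets entries. Assumption: all blocks have length > 0.
  static short[] initTargetIndices(BitSet liveBitSet, int numTargets) {
    short[] targetIndices = new short[numTargets]; // slots default to 0
    int m = 0;
    for (int i = 0; i < DATA_BLK_NUM + PARITY_BLK_NUM; i++) {
      if (!liveBitSet.get(i) && m < numTargets) {
        targetIndices[m++] = (short) i;
      }
    }
    return targetIndices;
  }

  // Scenario from the report: liveBlockIndices = [0,1,2,3,4,5,7,8] and
  // additionalReplRequired = 9 - 7 = 2, so two targets are requested.
  static short[] example() {
    BitSet live = new BitSet(DATA_BLK_NUM + PARITY_BLK_NUM);
    for (int i : new int[] {0, 1, 2, 3, 4, 5, 7, 8}) {
      live.set(i);
    }
    return initTargetIndices(live, 2);
  }

  public static void main(String[] args) {
    // Only index 6 is missing, so the second slot stays at 0.
    System.out.println(Arrays.toString(example())); // prints [6, 0]
  }
}
```

The stray 0 in the second slot is indistinguishable from a genuine request to rebuild block index 0, which is how a live source index ends up in the target list.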
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment: HDFS-14768.006.patch
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954292#comment-16954292 ]

guojh commented on HDFS-14768:
------------------------------

[~hadoopqa] test
[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()
[ https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Wagle updated HDDS-2323:
----------------------------------
    Status: Patch Available  (was: Open)

> Mem allocation: Optimise AuditMessage::build()
> ----------------------------------------------
>
>                 Key: HDDS-2323
>                 URL: https://issues.apache.org/jira/browse/HDDS-2323
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Manager
>            Reporter: Rajesh Balamohan
>            Assignee: Siddharth Wagle
>            Priority: Major
>              Labels: performance
>         Attachments: HDDS-2323.01.patch, Screenshot 2019-10-18 at 8.24.52 AM.png
>
>
> String.format allocates and processes more than OzoneAclUtil.fromProtobuf
> in the write benchmark.
> It would be good to use + concatenation instead of String.format.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
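
The suggestion in HDDS-2323 (use `+` instead of `String.format`) can be illustrated with a small comparison. This is only a sketch under stated assumptions: the class `AuditMessageSketch`, its parameters, and the message layout are hypothetical and are not taken from the Ozone `AuditMessage` source or from the attached patch. The point is that both methods return the same text, while the concatenation variant does not have to parse a format string on every call.

```java
// Hypothetical illustration; the field names and layout are assumptions.
public class AuditMessageSketch {

  // Variant using String.format: the pattern is parsed on each call.
  static String buildWithFormat(String user, String ip, String op,
      String params, String ret) {
    return String.format("user=%s | ip=%s | op=%s %s | ret=%s",
        user, ip, op, params, ret);
  }

  // Variant using + concatenation, which javac compiles to a
  // StringBuilder chain (or an invokedynamic call on newer JDKs).
  static String buildWithConcat(String user, String ip, String op,
      String params, String ret) {
    return "user=" + user + " | ip=" + ip + " | op=" + op + " " + params
        + " | ret=" + ret;
  }

  public static void main(String[] args) {
    String a = buildWithFormat("alice", "10.0.0.1", "CREATE_KEY",
        "{key=k1}", "SUCCESS");
    String b = buildWithConcat("alice", "10.0.0.1", "CREATE_KEY",
        "{key=k1}", "SUCCESS");
    System.out.println(a);
    System.out.println(a.equals(b)); // prints true
  }
}
```

Whether the saving matters in practice depends on how hot the call site is; the issue reports it showing up in a write benchmark profile.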
[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()
[ https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Wagle updated HDDS-2323:
----------------------------------
    Attachment: HDDS-2323.01.patch
[jira] [Assigned] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()
[ https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Wagle reassigned HDDS-2323:
-------------------------------------
    Assignee: Siddharth Wagle
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954278#comment-16954278 ]

Hadoop QA commented on HDFS-14768:
----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 26s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| -1 | mvninstall | 33m 22s | root in trunk failed. |
| -1 | compile | 0m 54s | hadoop-hdfs in trunk failed. |
| +1 | checkstyle | 1m 51s | trunk passed |
| -1 | mvnsite | 0m 40s | hadoop-hdfs in trunk failed. |
| -1 | shadedclient | 3m 52s | branch has errors when building and testing our client artifacts. |
| -1 | findbugs | 0m 51s | hadoop-hdfs in trunk failed. |
| +1 | javadoc | 2m 37s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 2m 21s | the patch passed |
| -1 | compile | 0m 41s | hadoop-hdfs in the patch failed. |
| -1 | javac | 0m 41s | hadoop-hdfs in the patch failed. |
| -0 | checkstyle | 0m 21s | The patch fails to run checkstyle in hadoop-hdfs |
| +1 | mvnsite | 2m 19s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| -1 | shadedclient | 12m 39s | patch has errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 58s | the patch passed |
| +1 | javadoc | 2m 18s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 0m 37s | hadoop-hdfs in the patch failed. |
| -1 | asflicense | 0m 58s | The patch generated 1 ASF License warnings. |
| | | 74m 4s | |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14768 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12983334/HDFS-14768.005.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 5d416d38d52b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 54dc6b7 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| mvninstall | https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-mvninstall-root.txt |
| compile | https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-compile-hadoop-hdfs-project_hadoop-hdfs.txt |
| mvnsite | https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt |
| findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/28109/artifact/out/branch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt |
| compile | https://builds.apache.org/job/PreCommit-HDFS-Build/
[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954273#comment-16954273 ]

Fei Hui commented on HDFS-14852:
--------------------------------

Uploaded the v004 patch.
* Move the test to TestLowRedundancyBlockQueues
* Call decrementBlockStat when a block is successfully removed from the QUEUE_WITH_CORRUPT_BLOCKS level

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> ---------------------------------------------------------------------
>
>                 Key: HDFS-14852
>                 URL: https://issues.apache.org/jira/browse/HDFS-14852
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
>            Reporter: Fei Hui
>            Assignee: Fei Hui
>            Priority: Major
>         Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch,
> HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch
>
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
>     if(priLevel >= 0 && priLevel < LEVEL
>         && priorityQueues.get(priLevel).remove(block)) {
>       NameNode.blockStateChangeLog.debug(
>           "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>               + " from priority queue {}",
>           block, priLevel);
>       decrementBlockStat(block, priLevel, oldExpectedReplicas);
>       return true;
>     } else {
>       // Try to remove the block from all queues if the block was
>       // not found in the queue for the given priority level.
>       for (int i = 0; i < LEVEL; i++) {
>         if (i != priLevel && priorityQueues.get(i).remove(block)) {
>           NameNode.blockStateChangeLog.debug(
>               "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>                   " {} from priority queue {}", block, i);
>           decrementBlockStat(block, i, oldExpectedReplicas);
>           return true;
>         }
>       }
>     }
>     return false;
>   }
> {code}
> The source code is above; its comment reads as follows:
> {quote}
> // Try to remove the block from all queues if the block was
> // not found in the queue for the given priority level.
> {quote}
> However, the function "remove" does NOT remove the block from all queues: it
> returns as soon as the block has been removed from one queue.
> The function add in LowRedundancyBlocks.java is called in several places, so
> one block may be in two or more queues.
> We found that the corrupt block count does not match the corrupt file count
> on the NN web UI. This may be related.
> Uploaded an initial patch.
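
The behaviour the HDFS-14852 comment asks for, removing a block from every queue rather than stopping at the first hit, can be sketched as follows. This is a simplified illustration and not the attached patch: the class `RemoveFromAllQueuesSketch` is hypothetical, blocks are represented by plain strings, the number of levels is an assumed constant, and the statistics bookkeeping done by `decrementBlockStat` is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a set of priority queues.
public class RemoveFromAllQueuesSketch {
  static final int LEVEL = 5; // assumed number of priority levels

  private final List<Set<String>> priorityQueues = new ArrayList<>();

  public RemoveFromAllQueuesSketch() {
    for (int i = 0; i < LEVEL; i++) {
      priorityQueues.add(new HashSet<>());
    }
  }

  // Adds a block to one queue; nothing stops a caller from adding the
  // same block at several levels, which is the situation in the issue.
  public void add(String block, int priLevel) {
    priorityQueues.get(priLevel).add(block);
  }

  // Scans every queue instead of returning on the first successful
  // removal, so no stale copy of the block is left behind.
  public boolean remove(String block) {
    boolean removed = false;
    for (Set<String> queue : priorityQueues) {
      if (queue.remove(block)) {
        removed = true;
      }
    }
    return removed;
  }

  // True if the block is still present at any level.
  public boolean contains(String block) {
    for (Set<String> queue : priorityQueues) {
      if (queue.contains(block)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    RemoveFromAllQueuesSketch queues = new RemoveFromAllQueuesSketch();
    queues.add("blk_1", 0);
    queues.add("blk_1", 4);
    System.out.println(queues.remove("blk_1"));   // prints true
    System.out.println(queues.contains("blk_1")); // prints false
  }
}
```

A real fix also has to adjust the per-level counters for each queue the block is removed from, which is the point of the second bullet in the v004 comment.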
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954261#comment-16954261 ]

Hadoop QA commented on HDFS-14768:
----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 41s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 16m 55s | trunk passed |
| +1 | compile | 1m 2s | trunk passed |
| +1 | checkstyle | 0m 47s | trunk passed |
| +1 | mvnsite | 1m 7s | trunk passed |
| +1 | shadedclient | 13m 36s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 9s | trunk passed |
| +1 | javadoc | 1m 23s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 59s | the patch passed |
| +1 | compile | 0m 55s | the patch passed |
| +1 | javac | 0m 55s | the patch passed |
| +1 | checkstyle | 0m 40s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 170 unchanged - 1 fixed = 170 total (was 171) |
| +1 | mvnsite | 1m 0s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 12s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 13s | the patch passed |
| +1 | javadoc | 1m 17s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 87m 42s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 34s | The patch does not generate ASF License warnings. |
| | | 145m 21s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestBlockManager |
| | hadoop.hdfs.TestMultipleNNPortQOP |
| | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
| | hadoop.hdfs.TestDFSClientRetries |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14768 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12983324/HDFS-14768.006.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 88370772c07f 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 54dc6b7 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28108/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28108/testReport/ |
| Max. process+thread count | 3571 (vs.
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954257#comment-16954257 ]

guojh commented on HDFS-14768:
------------------------------

[~surendrasingh] Sorry for my carelessness, and thanks for your review.

> In some cases, erasure blocks are corruption when they are reconstruct.
> -----------------------------------------------------------------------
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding, hdfs, namenode
> Affects Versions: 3.0.2
> Reporter: guojh
> Assignee: guojh
> Priority: Major
> Labels: patch
> Fix For: 3.3.0
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission the datanodes holding indices [3,4] and increase pendingReplicationWithoutTargets on the datanode holding index 6 until it is larger than replicationStreamsHardLimit (we set 14). Then, after the method chooseSourceDatanodes of BlockManager, liveBlockIndices is [0,1,2,3,4,5,7,8] and the block counter is Live: 7, Decommission: 2.
> In the method scheduleReconstruction of BlockManager, additionalReplRequired is 9 - 7 = 2. After the NameNode chooses two target datanodes, it assigns an erasure coding task to the target datanodes.
> When a datanode gets the task, it builds targetIndices from liveBlockIndices and the target length. The code is below.
> {code:java}
> // one slot per target; Java zero-fills the array
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short)i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] always keeps its initial value 0.
> The StripedReader always creates readers from the first 6 block indices, which are [0,1,2,3,4,5].
> Using indices [0,1,2,3,4,5] to build target indices [6,0] triggers the ISA-L bug: the data of block index 6 is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> private int replicationStreamsHardLimit =
>     DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>       .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>       StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>       .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>       (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor =
>       cluster.getNameNode().getNamesystem()
>       .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
>     BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new Block(i),
>         new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDescriptor);
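The index-filling behaviour described in the HDFS-14768 report above can be reproduced outside Hadoop. The following is a minimal sketch, not the datanode code: the class and method names are hypothetical, the live set and target count are passed as plain parameters, and the block-length check from the quoted loop is left out for brevity. It only shows that an array sized for two targets ends up as [6, 0] when a single index is missing from the live set.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical standalone class; it does not exist in Hadoop.
class TargetIndicesSketch {

  // Same loop shape as the quoted initTargetIndices(), with the datanode
  // state replaced by parameters (an assumption made for illustration).
  static short[] initTargetIndices(BitSet live, int totalBlkNum, int numTargets) {
    short[] targetIndices = new short[numTargets]; // Java zero-fills new arrays
    int m = 0;
    for (int i = 0; i < totalBlkNum; i++) {
      if (!live.get(i) && m < numTargets) {
        targetIndices[m++] = (short) i;
      }
    }
    return targetIndices;
  }

  public static void main(String[] args) {
    // liveBlockIndices from the report: every index except 6.
    BitSet live = new BitSet(9);
    for (int i : new int[] {0, 1, 2, 3, 4, 5, 7, 8}) {
      live.set(i);
    }
    // Two targets were chosen, but only one index is missing.
    short[] targets = initTargetIndices(live, 9, 2);
    System.out.println(Arrays.toString(targets)); // prints [6, 0]
  }
}
```

The second slot is never written, so it keeps the default value 0, which is also a valid (and live) block index. That ambiguity is what the report points at.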
[jira] [Updated] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fei Hui updated HDFS-14852:
---------------------------
    Attachment: HDFS-14852.004.patch

> Remove of LowRedundancyBlocks do NOT remove the block from all queues
> ---------------------------------------------------------------------
>
> Key: HDFS-14852
> URL: https://issues.apache.org/jira/browse/HDFS-14852
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec
> Affects Versions: 3.2.0, 3.0.3, 3.1.2, 3.3.0
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: CorruptBlocksMismatch.png, HDFS-14852.001.patch, HDFS-14852.002.patch, HDFS-14852.003.patch, HDFS-14852.004.patch
>
> LowRedundancyBlocks.java
> {code:java}
> // Some comments here
>   if(priLevel >= 0 && priLevel < LEVEL
>       && priorityQueues.get(priLevel).remove(block)) {
>     NameNode.blockStateChangeLog.debug(
>         "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block {}"
>             + " from priority queue {}",
>         block, priLevel);
>     decrementBlockStat(block, priLevel, oldExpectedReplicas);
>     return true;
>   } else {
>     // Try to remove the block from all queues if the block was
>     // not found in the queue for the given priority level.
>     for (int i = 0; i < LEVEL; i++) {
>       if (i != priLevel && priorityQueues.get(i).remove(block)) {
>         NameNode.blockStateChangeLog.debug(
>             "BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block" +
>                 " {} from priority queue {}", block, i);
>         decrementBlockStat(block, i, oldExpectedReplicas);
>         return true;
>       }
>     }
>   }
>   return false;
> }
> {code}
> The source code is above; the comment reads as follows:
> {quote}
> // Try to remove the block from all queues if the block was
> // not found in the queue for the given priority level.
> {quote}
> The function "remove" does NOT remove the block from all queues.
> The function add in LowRedundancyBlocks.java is used in several places, so one block may end up in two or more queues.
> We found that corrupt blocks do not match corrupt files on the NN web UI. This may be related.
> Uploaded an initial patch.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
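The early-return behaviour reported in HDFS-14852 can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the attached patch: the class, the use of String block names, and the method names removeFirstMatch, removeFromAll and queuesContaining are all hypothetical, and the level count of 5 is picked only for the example. The first method copies the control flow of the quoted code; the second shows one possible way to keep scanning after a hit.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of a set of priority queues; not Hadoop code.
class AllQueuesRemovalSketch {
  // Assumption: five priority levels, chosen only for illustration.
  static final int LEVEL = 5;

  private final List<Set<String>> priorityQueues = new ArrayList<>();

  AllQueuesRemovalSketch() {
    for (int i = 0; i < LEVEL; i++) {
      priorityQueues.add(new HashSet<>());
    }
  }

  void add(String block, int priLevel) {
    priorityQueues.get(priLevel).add(block);
  }

  // Same control flow as the quoted code: return at the first queue that held the block.
  boolean removeFirstMatch(String block, int priLevel) {
    if (priLevel >= 0 && priLevel < LEVEL
        && priorityQueues.get(priLevel).remove(block)) {
      return true;
    }
    for (int i = 0; i < LEVEL; i++) {
      if (i != priLevel && priorityQueues.get(i).remove(block)) {
        return true;
      }
    }
    return false;
  }

  // Hypothetical alternative: visit every queue, report whether any of them held the block.
  boolean removeFromAll(String block) {
    boolean removed = false;
    for (Set<String> queue : priorityQueues) {
      removed |= queue.remove(block); // non-short-circuit, so every queue is visited
    }
    return removed;
  }

  int queuesContaining(String block) {
    int count = 0;
    for (Set<String> queue : priorityQueues) {
      if (queue.contains(block)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    AllQueuesRemovalSketch early = new AllQueuesRemovalSketch();
    early.add("blk_1", 0);
    early.add("blk_1", 2);
    early.removeFirstMatch("blk_1", 0);
    System.out.println(early.queuesContaining("blk_1")); // prints 1

    AllQueuesRemovalSketch all = new AllQueuesRemovalSketch();
    all.add("blk_1", 0);
    all.add("blk_1", 2);
    all.removeFromAll("blk_1");
    System.out.println(all.queuesContaining("blk_1")); // prints 0
  }
}
```

In the real class each removal also updates per-level statistics through decrementBlockStat, which this model omits.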
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment: HDFS-14768.005.patch
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment:     (was: HDFS-14768.006.patch)
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment:     (was: HDFS-14768.005.patch)
[jira] [Commented] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954248#comment-16954248 ]

Ravuri Sushma sree commented on HDFS-14442:
-------------------------------------------

[~xkrogen], Thank you so much for your valuable suggestions. I have uploaded a patch following the first approach. Can you please review it?

> Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
> ---------------------------------------------------------------------------------------
>
> Key: HDFS-14442
> URL: https://issues.apache.org/jira/browse/HDFS-14442
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.3.0
> Reporter: Erik Krogen
> Priority: Major
> Attachments: HDFS-14442.001.patch
>
> While working on HDFS-14245, we noticed a discrepancy in some proxy-handling code.
> The description of {{RpcInvocationHandler.getConnectionId()}} states:
> {code}
> /**
>  * Returns the connection id associated with the InvocationHandler instance.
>  * @return ConnectionId
>  */
> ConnectionId getConnectionId();
> {code}
> It does not make any claims about whether this connection ID will be an active proxy or not. Yet in {{HAUtil}} we have:
> {code}
> /**
>  * Get the internet address of the currently-active NN. This should rarely be
>  * used, since callers of this method who connect directly to the NN using the
>  * resulting InetSocketAddress will not be able to connect to the active NN if
>  * a failover were to occur after this method has been called.
>  *
>  * @param fs the file system to get the active address of.
>  * @return the internet address of the currently-active NN.
>  * @throws IOException if an error occurs while resolving the active NN.
>  */
> public static InetSocketAddress getAddressOfActive(FileSystem fs)
>     throws IOException {
>   if (!(fs instanceof DistributedFileSystem)) {
>     throw new IllegalArgumentException("FileSystem " + fs + " is not a DFS.");
>   }
>   // force client address resolution.
>   fs.exists(new Path("/"));
>   DistributedFileSystem dfs = (DistributedFileSystem) fs;
>   DFSClient dfsClient = dfs.getClient();
>   return RPC.getServerAddress(dfsClient.getNamenode());
> }
> {code}
> Where the call {{RPC.getServerAddress()}} eventually terminates into {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> {{RPC.getConnectionIdForProxy()}} -> {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making an incorrect assumption that {{RpcInvocationHandler}} will necessarily return an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a counter-example to this, since the current connection ID may be pointing at, for example, an Observer NameNode.
[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravuri Sushma sree updated HDFS-14442:
--------------------------------------
    Attachment: HDFS-14442.001.patch
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment: HDFS-14768.006.patch
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment:     (was: HDFS-14768.006.patch)

> In some cases, erasure blocks are corruption when they are reconstruct.
> -----------------------------------------------------------------------
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding, hdfs, namenode
> Affects Versions: 3.0.2
> Reporter: guojh
> Assignee: guojh
> Priority: Major
> Labels: patch
> Fix For: 3.3.0
>
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg,
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch,
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch,
> HDFS-14768.jpg, guojh_UT_after_deomission.txt,
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt,
> zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission
> indices [3,4] and increase pendingReplicationWithoutTargets on the datanode
> holding index 6 until it is larger than replicationStreamsHardLimit (we set
> it to 14). Then, after the chooseSourceDatanodes method of BlockManager runs,
> liveBlockIndices is [0,1,2,3,4,5,7,8] and the block counters are
> Live: 7, Decommission: 2.
> In the scheduleReconstruction method of BlockManager, additionalReplRequired
> is 9 - 7 = 2. After the NameNode chooses two target datanodes, it assigns an
> erasure coding task to the target datanode.
> When the datanode gets the task, it builds targetIndices from
> liveBlockIndices and the target length. The code is below.
> {code:java}
> // Sized by the number of targets chosen by the NameNode;
> // slots that are never filled keep the default value 0.
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     // Index i is not live, so it is a candidate for reconstruction.
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short)i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] always keeps its initial value 0.
> The StripedReader always creates readers from the first 6 block indices,
> which are [0,1,2,3,4,5].
> Using indices [0,1,2,3,4,5] to build target indices [6,0] triggers the ISA-L
> bug: the data of block index 6 is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> private int replicationStreamsHardLimit =
>     DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
>
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>       .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>       StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>       .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>       (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor =
>       cluster.getNameNode().getNamesystem()
>           .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
>     BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new Block(i),
>         new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<DatanodeInfo>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   bm.getDatanodeManager().removeDatanode(datanodeDe
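The arithmetic in the HDFS-14768 report can be checked with a small standalone sketch. It is not the Hadoop code: `TargetIndicesSketch` and its `initTargetIndices` helper are hypothetical stand-ins that mimic only the index-filling loop, and every internal block is assumed to be non-empty. With liveBlockIndices [0,1,2,3,4,5,7,8] and two targets (9 - 7 = 2), only index 6 is missing, so the second slot is never written.

```java
import java.util.Arrays;
import java.util.BitSet;

public class TargetIndicesSketch {
  // Hypothetical, simplified version of the loop quoted in the report.
  // Assumption: every internal block has a length greater than zero.
  static short[] initTargetIndices(BitSet live, int totalBlocks, int numTargets) {
    short[] targetIndices = new short[numTargets]; // all slots start at 0
    int m = 0;
    for (int i = 0; i < totalBlocks; i++) {
      if (!live.get(i) && m < numTargets) {
        targetIndices[m++] = (short) i;
      }
    }
    return targetIndices;
  }

  public static void main(String[] args) {
    // Live indices as reported after chooseSourceDatanodes.
    BitSet live = new BitSet(9);
    for (int i : new int[] {0, 1, 2, 3, 4, 5, 7, 8}) {
      live.set(i);
    }
    // 9 internal blocks expected, 7 counted as live.
    int additionalReplRequired = 9 - 7;
    short[] targets = initTargetIndices(live, 9, additionalReplRequired);
    System.out.println(Arrays.toString(targets)); // prints [6, 0]
  }
}
```

The second entry is indistinguishable from a genuine request to rebuild index 0, which is why the decoder is asked to produce a block that is also one of its inputs.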
[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold
[ https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954245#comment-16954245 ]

guojh commented on HDFS-14699:
------------------------------

[~surendrasingh] The UT is testChooseSrcDatanodesWithDupEC in class TestBlockManager.java. The code:
{code:java}
bm.chooseSourceDatanodes(
    aBlockInfoStriped,
    cntNodes,
    liveNodes,
    numReplicas,
    liveBlockIndices,
    liveBusyBlockIndices,
    LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY);
{code}
change to
{code:java}
bm.chooseSourceDatanodes(
    aBlockInfoStriped,
    cntNodes,
    liveNodes,
    numReplicas,
    liveBlockIndices,
    liveBusyBlockIndices,
    LowRedundancyBlocks.QUEUE_VERY_LOW_REDUNDANCY);
{code}

> Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold
> ---------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-14699
> URL: https://issues.apache.org/jira/browse/HDFS-14699
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec
> Affects Versions: 3.2.0, 3.1.1, 3.3.0
> Reporter: Zhao Yi Ming
> Assignee: Zhao Yi Ming
> Priority: Critical
> Labels: patch
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch,
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch,
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png,
> image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with Hadoop 3.1.1 and hit the
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881.
> Our testing steps follow; we hope they are helpful. (The DNs below hold the
> internal blocks under test.)
> # We customized a new 10-2-1024k policy and used it on a path; now we have 12
> internal blocks (12 live blocks).
> # Decommission one DN. After the decommission completes, we have 13
> internal blocks (12 live blocks and 1 decommission block).
> # Then shut down one DN that does not hold the same block ID as the
> decommission block; now we have 12 internal blocks (11 live blocks and 1
> decommission block).
> # After waiting about 600s (before the heartbeat comes), recommission the
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1
> duplicate block).
> # Then EC does not reconstruct the missing block.
> We think this is a critical issue for using the EC function in a production
> environment. Could you help? Thanks a lot!

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold
[ https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954241#comment-16954241 ] Surendra Singh Lilhore commented on HDFS-14699: --- Where it is failing, can you give the link ?
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954239#comment-16954239 ] Surendra Singh Lilhore commented on HDFS-14768: --- {quote}I don't understand why the code that you point out is not unnecessary? liveBusyBlockIndices should include the condition below: {quote} Pls check my comment again. I just asked you to remove changes where just some space is added, no code change. Will wait for build...
[jira] [Updated] (HDDS-2311) Fix logic of RetryPolicy in OzoneClientSideTranslatorPB
[ https://issues.apache.org/jira/browse/HDDS-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2311: - Labels: pull-request-available (was: ) > Fix logic of RetryPolicy in OzoneClientSideTranslatorPB > --- > > Key: HDDS-2311 > URL: https://issues.apache.org/jira/browse/HDDS-2311 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Bharat Viswanadham >Priority: Blocker > Labels: pull-request-available > > OzoneManagerProtocolClientSideTranslatorPB.java > L251: if (cause instanceof NotLeaderException) { > NotLeaderException notLeaderException = (NotLeaderException) cause; > omFailoverProxyProvider.performFailoverIfRequired( > notLeaderException.getSuggestedLeaderNodeId()); > return getRetryAction(RetryAction.RETRY, retries, failovers); > } > > The suggested leader returned from Server is not used during failOver, as the > cause is a type of RemoteException. So with current code, it does not use > suggested leader for failOver at all and by default with each OM, it tries > max retries. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
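The HDDS-2311 description says the `instanceof NotLeaderException` branch is never taken because the cause arrives as a RemoteException. A minimal sketch of that effect follows. Both exception classes here are hypothetical stand-ins defined inside the example, not the real Hadoop or Ozone types, and matching on the carried class name is only one possible way to recognise the wrapped exception; it is not necessarily what the patch does.

```java
public class RetryCauseSketch {
  // Stand-in for the server-side exception type.
  static class NotLeaderException extends Exception {
    NotLeaderException(String msg) {
      super(msg);
    }
  }

  // Stand-in for an RPC wrapper that carries only the name of the
  // server-side exception class plus its message.
  static class RemoteException extends Exception {
    private final String className;

    RemoteException(String className, String msg) {
      super(msg);
      this.className = className;
    }

    String getClassName() {
      return className;
    }
  }

  // The check quoted in the issue: true only for a real NotLeaderException.
  static boolean matchesByInstanceof(Throwable cause) {
    return cause instanceof NotLeaderException;
  }

  // Alternative check (assumption): compare the class name carried by the wrapper.
  static boolean matchesByClassName(Throwable cause) {
    return cause instanceof RemoteException
        && NotLeaderException.class.getName()
            .equals(((RemoteException) cause).getClassName());
  }

  public static void main(String[] args) {
    Throwable cause =
        new RemoteException(NotLeaderException.class.getName(), "not the leader");
    System.out.println(matchesByInstanceof(cause)); // prints false
    System.out.println(matchesByClassName(cause));  // prints true
  }
}
```

In this sketch the first check fails for the wrapped cause, which matches the reported symptom: the failover branch that would use the suggested leader is skipped.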
[jira] [Work logged] (HDDS-2311) Fix logic of RetryPolicy in OzoneClientSideTranslatorPB
[ https://issues.apache.org/jira/browse/HDDS-2311?focusedWorklogId=330261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330261 ] ASF GitHub Bot logged work on HDDS-2311: Author: ASF GitHub Bot Created on: 18/Oct/19 03:55 Start Date: 18/Oct/19 03:55 Worklog Time Spent: 10m Work Description: hanishakoneru commented on pull request #51: HDDS-2311. Fix logic of RetryPolicy in OzoneClientSideTranslatorPB. URL: https://github.com/apache/hadoop-ozone/pull/51 OzoneManagerProtocolClientSideTranslatorPB.java L251: if (cause instanceof NotLeaderException) { NotLeaderException notLeaderException = (NotLeaderException) cause; omFailoverProxyProvider.performFailoverIfRequired( notLeaderException.getSuggestedLeaderNodeId()); return getRetryAction(RetryAction.RETRY, retries, failovers); } The suggested leader returned from Server is not used during failOver, as the cause is a type of RemoteException. So with current code, it does not use suggested leader for failOver at all and by default with each OM, it tries max retries. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330261) Remaining Estimate: 0h Time Spent: 10m > Fix logic of RetryPolicy in OzoneClientSideTranslatorPB > --- > > Key: HDDS-2311 > URL: https://issues.apache.org/jira/browse/HDDS-2311 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Bharat Viswanadham >Priority: Blocker > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > OzoneManagerProtocolClientSideTranslatorPB.java > L251: if (cause instanceof NotLeaderException) { > NotLeaderException notLeaderException = (NotLeaderException) cause; > omFailoverProxyProvider.performFailoverIfRequired( > notLeaderException.getSuggestedLeaderNodeId()); > return getRetryAction(RetryAction.RETRY, retries, failovers); > } > > The suggested leader returned from Server is not used during failOver, as the > cause is a type of RemoteException. So with current code, it does not use > suggested leader for failOver at all and by default with each OM, it tries > max retries. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()
[ https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954221#comment-16954221 ] Mukul Kumar Singh commented on HDDS-2323: - cc: [~dchitlangia] > Mem allocation: Optimise AuditMessage::build() > -- > > Key: HDDS-2323 > URL: https://issues.apache.org/jira/browse/HDDS-2323 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Priority: Major > Labels: performance > Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png > > > String format allocates/processes more than > OzoneAclUtil.fromProtobuf in write benchmark. > Would be good to use + instead of format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()
[ https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated HDDS-2323: --- Labels: performance (was: ) > Mem allocation: Optimise AuditMessage::build() > -- > > Key: HDDS-2323 > URL: https://issues.apache.org/jira/browse/HDDS-2323 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Priority: Major > Labels: performance > Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png > > > String format allocates/processes more than > OzoneAclUtil.fromProtobuf in write benchmark. > Would be good to use + instead of format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()
[ https://issues.apache.org/jira/browse/HDDS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated HDDS-2323: --- Component/s: Ozone Manager > Mem allocation: Optimise AuditMessage::build() > -- > > Key: HDDS-2323 > URL: https://issues.apache.org/jira/browse/HDDS-2323 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Priority: Major > Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png > > > String format allocates/processes more than > OzoneAclUtil.fromProtobuf in write benchmark. > Would be good to use + instead of format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2323) Mem allocation: Optimise AuditMessage::build()
Rajesh Balamohan created HDDS-2323: -- Summary: Mem allocation: Optimise AuditMessage::build() Key: HDDS-2323 URL: https://issues.apache.org/jira/browse/HDDS-2323 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Rajesh Balamohan Attachments: Screenshot 2019-10-18 at 8.24.52 AM.png String format allocates/processes more than OzoneAclUtil.fromProtobuf in write benchmark. Would be good to use + instead of format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
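The suggestion in HDDS-2323, replacing `String.format` with `+`, could look roughly like the sketch below. The field names and message layout are hypothetical and are not taken from AuditMessage; the point is only that both variants yield the same text, while plain concatenation avoids parsing a format string on every call.

```java
public class AuditMessageSketch {
  // Variant using String.format; the layout is an assumption for illustration.
  static String withFormat(String user, String ip, String op) {
    return String.format("user=%s | ip=%s | op=%s", user, ip, op);
  }

  // Same message assembled with the + operator.
  static String withConcat(String user, String ip, String op) {
    return "user=" + user + " | ip=" + ip + " | op=" + op;
  }

  public static void main(String[] args) {
    String a = withFormat("alice", "10.0.0.1", "CREATE_KEY");
    String b = withConcat("alice", "10.0.0.1", "CREATE_KEY");
    System.out.println(a.equals(b)); // prints true
  }
}
```

Whether the difference matters depends on how often the message is built; the report is based on an allocation profile of the write benchmark.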
[jira] [Updated] (HDDS-2208) Propagate System Exceptions from OM transaction apply phase
[ https://issues.apache.org/jira/browse/HDDS-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Supratim Deka updated HDDS-2208: Status: Patch Available (was: Open) > Propagate System Exceptions from OM transaction apply phase > --- > > Key: HDDS-2208 > URL: https://issues.apache.org/jira/browse/HDDS-2208 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task > Components: Ozone Manager >Reporter: Supratim Deka >Assignee: Supratim Deka >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The change for HDDS-2206 tracks system exceptions during preExecute phase of > OM request handling. > The current jira is to implement exception propagation once the OM request is > submitted to Ratis - when the handler is running validateAndUpdateCache for > the request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guojh updated HDFS-14768: - Attachment: HDFS-14768.006.patch
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guojh updated HDFS-14768: - Attachment: (was: HDFS-14768.006.patch)
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954202#comment-16954202 ]

guojh commented on HDFS-14768:
------------------------------

[~surendrasingh] I have uploaded a new patch (006). Please review it.

> In some cases, erasure blocks are corruption when they are reconstruct.
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14768
>                 URL: https://issues.apache.org/jira/browse/HDFS-14768
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, erasure-coding, hdfs, namenode
>    Affects Versions: 3.0.2
>            Reporter: guojh
>            Assignee: guojh
>            Priority: Major
>              Labels: patch
>             Fix For: 3.3.0
>
>         Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg,
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch,
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch,
> HDFS-14768.006.patch, HDFS-14768.jpg, guojh_UT_after_deomission.txt,
> guojh_UT_before_deomission.txt, zhaoyiming_UT_after_deomission.txt,
> zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission
> indices [3,4] and increase pendingReplicationWithoutTargets on the DataNode
> holding index 6 until it exceeds replicationStreamsHardLimit (we set it
> to 14). Then, after BlockManager's chooseSourceDatanodes method,
> liveBlockIndices is [0,1,2,3,4,5,7,8] and the block counters are Live: 7,
> Decommission: 2.
> In BlockManager's scheduleReconstruction method, additionalReplRequired
> is 9 - 7 = 2. After the NameNode chooses two target DataNodes, it assigns an
> erasure coding task to the target DataNode.
> When the DataNode gets the task, it builds targetIndices from
> liveBlockIndices and the target length. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short)i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] keeps its initial value of 0.
> The StripedReader always creates readers from the first 6 block indices,
> which are [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: the data of block index 6 is corrupted (all zeros).
> I wrote a unit test that reproduces it reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit =
>     DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>       .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>       StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>       .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>       (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor =
>       cluster.getNameNode().getNamesystem()
>       .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
>     BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new Block(i),
>         new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(),
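The index mismatch described in the report can be reproduced in isolation. What follows is a minimal sketch, not the HDFS implementation: `TargetIndicesSketch` and its parameters are hypothetical names, and the per-index block-length check from the quoted code is left out for brevity. It assumes the RS-6-3 layout from the report, with live indices [0,1,2,3,4,5,7,8] and two targets chosen by the NameNode.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical stand-alone sketch; not the actual DataNode code.
class TargetIndicesSketch {

  // Mirrors the loop quoted in the report: every index absent from the
  // live bit set becomes a target index, up to numTargets entries.
  static short[] initTargetIndices(BitSet live, int totalBlocks, int numTargets) {
    short[] targetIndices = new short[numTargets]; // every slot starts at 0
    int m = 0;
    for (int i = 0; i < totalBlocks; i++) {
      if (!live.get(i) && m < numTargets) {
        targetIndices[m++] = (short) i;
      }
    }
    return targetIndices;
  }

  public static void main(String[] args) {
    BitSet live = new BitSet(9);
    for (int i : new int[] {0, 1, 2, 3, 4, 5, 7, 8}) {
      live.set(i);
    }
    // Two targets were requested, but only index 6 is absent from the live set.
    short[] targets = initTargetIndices(live, 9, 2);
    System.out.println(Arrays.toString(targets)); // prints [6, 0]
  }
}
```

Only index 6 is missing from the live set, so the second slot is never written and keeps the default value 0, which is an index that is in fact live.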
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment: HDFS-14768.006.patch
[jira] [Commented] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold
[ https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954193#comment-16954193 ]

guojh commented on HDFS-14699:
------------------------------

[~zhaoyim] [~surendrasingh] In your test case, why do you set the priority to QUEUE_HIGHEST_PRIORITY? If the priority is not equal to 0, your test case does not pass.
{code:java}
bm.chooseSourceDatanodes(
    aBlockInfoStriped,
    cntNodes,
    liveNodes,
    numReplicas,
    liveBlockIndices,
    liveBusyBlockIndices,
    LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY);
// If the priority is set to QUEUE_VERY_LOW_REDUNDANCY, this does not work.
{code}

> Erasure Coding: Storage not considered in live replica when replication
> streams hard limit reached to threshold
> -----------------------------------------------------------------------
>
>                 Key: HDFS-14699
>                 URL: https://issues.apache.org/jira/browse/HDFS-14699
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>    Affects Versions: 3.2.0, 3.1.1, 3.3.0
>            Reporter: Zhao Yi Ming
>            Assignee: Zhao Yi Ming
>            Priority: Critical
>              Labels: patch
>             Fix For: 3.3.0, 3.1.4, 3.2.2
>
>         Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch,
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch,
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png,
> image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with Hadoop 3.1.1 and hit the
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881.
> Our testing steps follow; we hope they are helpful. (The DNs below hold the
> internal blocks under test.)
> # We customized a new 10-2-1024k policy and used it on a path; now we have 12
> internal blocks (12 live blocks).
> # Decommission one DN. After the decommission completes, we have 13
> internal blocks (12 live blocks and 1 decommissioned block).
> # Then shut down one DN that does not hold the same block ID as the
> decommissioned block; now we have 12 internal blocks (11 live blocks and 1
> decommissioned block).
> # After waiting about 600s (before the heartbeat arrives), recommission the
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1
> duplicate block).
> # The EC then does not reconstruct the missing block.
> We think this is a critical issue for using the EC function in a production
> environment. Could you help? Thanks a lot!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
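The priority question raised in the comment above can be illustrated with a reduced model of the busy-source check. This is a sketch under assumptions, not BlockManager code: `skipAsBusy`, the numeric constant values, and the stream limit of 2 are hypothetical. Only the shape of the condition (the highest priority bypasses the per-node stream limit) follows the snippet quoted in the HDFS-14768 thread.

```java
// Hypothetical reduced model of the busy-source check; not BlockManager code.
class SourcePrioritySketch {
  // Assumed values for illustration; the comment above implies highest == 0.
  static final int QUEUE_HIGHEST_PRIORITY = 0;
  static final int QUEUE_VERY_LOW_REDUNDANCY = 1;

  // Returns true when a node would be skipped as a reconstruction source
  // because it already has too many pending replication streams.
  static boolean skipAsBusy(int priority, boolean leavingService,
      int blocksToBeReplicated, int maxReplicationStreams) {
    return priority != QUEUE_HIGHEST_PRIORITY
        && !leavingService
        && blocksToBeReplicated >= maxReplicationStreams;
  }

  public static void main(String[] args) {
    // Same node load, different priorities.
    System.out.println(skipAsBusy(QUEUE_HIGHEST_PRIORITY, false, 5, 2));    // prints false
    System.out.println(skipAsBusy(QUEUE_VERY_LOW_REDUNDANCY, false, 5, 2)); // prints true
  }
}
```

Under this model a test that always passes the highest priority never exercises the busy-node branch, which may be why the outcome changes with the priority argument.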
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment: HDFS-14768.005.patch
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment:     (was: HDFS-14768.004.patch)
[jira] [Updated] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guojh updated HDFS-14768:
-------------------------
    Attachment: HDFS-14768.004.patch
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954183#comment-16954183 ]

guojh commented on HDFS-14768:
------------------------------

[~surendrasingh] Thanks for your review. I don't understand why the code you pointed out is unnecessary. liveBusyBlockIndices should be populated under the condition below:
{code:java}
if (priority != LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY
    && (!node.isDecommissionInProgress() && !node.isEnteringMaintenance())
    && node.getNumberOfBlocksToBeReplicated() >= maxReplicationStreams) {
  if (isStriped && state == StoredReplicaState.LIVE) {
    liveBusyBlockIndices.add(blockIndex);
  }
  continue; // already reached replication limit
}
{code}
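To make the bookkeeping in this discussion concrete, the sketch below classifies the nine internal blocks of the scenario reported for HDFS-14768. It is a simplified model under stated assumptions, not the NameNode code: `BusyIndexSketch`, `classify`, and the pending-stream count of 18 are hypothetical, priority handling is ignored, and a single threshold of 14 stands in for the stream limits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simplified model of source selection; not BlockManager code.
class BusyIndexSketch {
  enum State { LIVE, DECOMMISSIONED }

  static final class Result {
    final List<Integer> liveBlockIndices = new ArrayList<>();
    final List<Integer> liveBusyBlockIndices = new ArrayList<>();
    int liveCount;
  }

  // Counts live replicas and separates usable sources from live-but-busy ones.
  static Result classify(State[] states, int[] pendingStreams, int limit) {
    Result r = new Result();
    for (int idx = 0; idx < states.length; idx++) {
      if (states[idx] == State.LIVE) {
        r.liveCount++;
      }
      if (pendingStreams[idx] >= limit) {
        if (states[idx] == State.LIVE) {
          r.liveBusyBlockIndices.add(idx);
        }
        continue; // too busy to serve as a reconstruction source
      }
      r.liveBlockIndices.add(idx);
    }
    return r;
  }

  public static void main(String[] args) {
    State[] states = new State[9];
    Arrays.fill(states, State.LIVE);
    states[3] = State.DECOMMISSIONED;
    states[4] = State.DECOMMISSIONED;
    int[] pending = new int[9];
    pending[6] = 18; // assumed load above the limit of 14

    Result r = classify(states, pending, 14);
    int additionalReplRequired = states.length - r.liveCount;
    System.out.println(r.liveBlockIndices);     // prints [0, 1, 2, 3, 4, 5, 7, 8]
    System.out.println(r.liveBusyBlockIndices); // prints [6]
    System.out.println(additionalReplRequired); // prints 2
  }
}
```

Two targets are requested, yet only one index is absent from the source list, and that index belongs to a live replica. Tracking the busy index separately is what lets later steps tell "busy" apart from "missing".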
[jira] [Assigned] (HDDS-2322) DoubleBuffer flush termination and OM is shutdown
[ https://issues.apache.org/jira/browse/HDDS-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharat Viswanadham reassigned HDDS-2322:
----------------------------------------

    Assignee: Bharat Viswanadham

> DoubleBuffer flush termination and OM is shutdown
> -------------------------------------------------
>
>                 Key: HDDS-2322
>                 URL: https://issues.apache.org/jira/browse/HDDS-2322
>             Project: Hadoop Distributed Data Store
>          Issue Type: Task
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major
>
> om1_1 | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR - Terminating with exit status 2: OMDoubleBuffer flush threadOMDoubleBufferFlushThreadencountered Throwable error
> om1_1 | java.util.ConcurrentModificationException
> om1_1 |   at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660)
> om1_1 |   at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1 |   at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1 |   at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1 |   at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1 |   at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1 |   at org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65)
> om1_1 |   at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> om1_1 |   at java.base/java.util.Collections$2.tryAdvance(Collections.java:4745)
> om1_1 |   at java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753)
> om1_1 |   at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1 |   at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1 |   at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1 |   at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1 |   at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1 |   at org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362)
> om1_1 |   at org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37)
> om1_1 |   at org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31)
> om1_1 |   at org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
> om1_1 |   at org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
> om1_1 |   at org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58)
> om1_1 |   at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139)
> om1_1 |   at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> om1_1 |   at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137)
> om1_1 |   at java.base/java.lang.Thread.run(Thread.java:834)
[jira] [Updated] (HDDS-2322) DoubleBuffer flush termination and OM shutdown's after that.
[ https://issues.apache.org/jira/browse/HDDS-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharat Viswanadham updated HDDS-2322: - Summary: DoubleBuffer flush termination and OM shutdown's after that. (was: DoubleBuffer flush termination and OM is shutdown) > DoubleBuffer flush termination and OM shutdown's after that. > > > Key: HDDS-2322 > URL: https://issues.apache.org/jira/browse/HDDS-2322 > Project: Hadoop Distributed Data Store > Issue Type: Task >Reporter: Bharat Viswanadham >Assignee: Bharat Viswanadham >Priority: Major
[jira] [Created] (HDDS-2322) DoubleBuffer flush termination and OM is shutdown
Bharat Viswanadham created HDDS-2322:

Summary: DoubleBuffer flush termination and OM is shutdown
Key: HDDS-2322
URL: https://issues.apache.org/jira/browse/HDDS-2322
Project: Hadoop Distributed Data Store
Issue Type: Task
Reporter: Bharat Viswanadham

om1_1 | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR - Terminating with exit status 2: OMDoubleBuffer flush threadOMDoubleBufferFlushThreadencountered Throwable error
om1_1 | java.util.ConcurrentModificationException
om1_1 | at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660)
om1_1 | at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
om1_1 | at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
om1_1 | at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
om1_1 | at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
om1_1 | at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
om1_1 | at org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65)
om1_1 | at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
om1_1 | at java.base/java.util.Collections$2.tryAdvance(Collections.java:4745)
om1_1 | at java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753)
om1_1 | at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
om1_1 | at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
om1_1 | at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
om1_1 | at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
om1_1 | at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
om1_1 | at org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362)
om1_1 | at org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37)
om1_1 | at org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31)
om1_1 | at org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
om1_1 | at org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
om1_1 | at org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58)
om1_1 | at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139)
om1_1 | at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
om1_1 | at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137)
om1_1 | at java.base/java.lang.Thread.run(Thread.java:834)
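
The top frame of the trace is ArrayList's fail-fast check, reached from a stream `collect` inside `OmKeyLocationInfoGroup.getProtobuf`. One possible reading is that the underlying list was structurally modified while the flush thread was still streaming over it. The following is a minimal single-threaded sketch of that failure mode, not the Ozone code: `CmeSketch` and its method names are hypothetical, and the snapshot copy is only one possible mitigation, which in a multi-threaded setting would still need to be taken under whatever lock guards the writers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: CmeSketch and its methods are hypothetical names,
// not Ozone classes.
class CmeSketch {

  // Mutates the source list while a stream over it is being collected.
  // Returns true if ArrayList's fail-fast check throws.
  static boolean mutatingDuringCollectThrows() {
    List<Integer> locations = new ArrayList<>(Arrays.asList(1, 2, 3));
    try {
      locations.stream()
          .map(v -> {
            locations.add(v + 100); // structural change mid-traversal
            return v;
          })
          .collect(Collectors.toList());
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  // Copies the list first, so later changes to the source cannot
  // invalidate the traversal of the copy.
  static List<Integer> collectFromSnapshot(List<Integer> source) {
    List<Integer> snapshot = new ArrayList<>(source);
    return snapshot.stream()
        .map(v -> {
          source.add(v + 100); // the source changes; the snapshot does not
          return v;
        })
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(mutatingDuringCollectThrows());
    System.out.println(collectFromSnapshot(new ArrayList<>(Arrays.asList(1, 2, 3))));
  }
}
```

The copy costs one extra allocation per serialization; whether that is acceptable on the flush path, or whether an immutable or concurrent collection would fit better, is a design question the trace alone does not answer.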
[jira] [Work logged] (HDDS-2131) Optimize replication type and creation time calculation in S3 MPU list call
[ https://issues.apache.org/jira/browse/HDDS-2131?focusedWorklogId=330179&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330179 ] ASF GitHub Bot logged work on HDDS-2131: Author: ASF GitHub Bot Created on: 17/Oct/19 23:03 Start Date: 17/Oct/19 23:03 Worklog Time Spent: 10m Work Description: swagle commented on pull request #50: HDDS-2131. Optimize replication type and creation time calculation in S3 MPU list call. URL: https://github.com/apache/hadoop-ozone/pull/50 ## What changes were proposed in this pull request? Optimize listMultiPartUpload to not read from openKeyTable. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2131 ## How was this patch tested? Ran related unit tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330179) Remaining Estimate: 0h Time Spent: 10m > Optimize replication type and creation time calculation in S3 MPU list call > --- > > Key: HDDS-2131 > URL: https://issues.apache.org/jira/browse/HDDS-2131 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Siddharth Wagle >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Based on the review from [~bharatviswa]: > {code} > > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java > metadataManager.getOpenKeyTable(); > OmKeyInfo omKeyInfo = > openKeyTable.get(upload.getDbKey()); > {code} > {quote}Here we are reading openKeyTable only for getting creation time. If we > can have this information in omMultipartKeyInfo, we could avoid DB calls for > openKeyTable. > To do this, We can set creationTime in OmMultipartKeyInfo during > initiateMultipartUpload . 
In this way, we can get all the required > information from the MultipartKeyInfo table. > And also StorageClass is missing from the returned OmMultipartUpload, as > listMultipartUploads shows StorageClass information. For this, if we can > return replicationType and depending on this value, we can set StorageClass > in the listMultipartUploads Response. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
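
The review suggestion above can be sketched roughly as follows: the creation time and replication type are stored in the multipart record at initiate time, so the listing is built from one table and no open-key lookup is needed. Everything here is a hypothetical stand-in (`MpuListSketch`, `MultipartInfo`, `UploadSummary`, the in-memory map, and the replication-type-to-storage-class mapping are assumptions for illustration, not the real Ozone types or behavior).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; none of these are the real Ozone types.
class MpuListSketch {

  enum ReplicationType { RATIS, STAND_ALONE }

  // Record written once at initiate time; it carries everything the listing needs.
  static final class MultipartInfo {
    final String uploadId;
    final long creationTime;
    final ReplicationType replicationType;

    MultipartInfo(String uploadId, long creationTime, ReplicationType replicationType) {
      this.uploadId = uploadId;
      this.creationTime = creationTime;
      this.replicationType = replicationType;
    }
  }

  // One row of the listing response.
  static final class UploadSummary {
    final String uploadId;
    final long creationTime;
    final String storageClass;

    UploadSummary(String uploadId, long creationTime, String storageClass) {
      this.uploadId = uploadId;
      this.creationTime = creationTime;
      this.storageClass = storageClass;
    }
  }

  // In-memory stand-in for the multipart info table, keyed by DB key.
  private final Map<String, MultipartInfo> multipartInfoTable = new LinkedHashMap<>();

  void initiateMultipartUpload(String dbKey, String uploadId, long now, ReplicationType type) {
    multipartInfoTable.put(dbKey, new MultipartInfo(uploadId, now, type));
  }

  // Assumed mapping from replication type to an S3 storage class label.
  static String storageClassFor(ReplicationType type) {
    return type == ReplicationType.RATIS ? "STANDARD" : "REDUCED_REDUNDANCY";
  }

  // Builds the listing from the multipart table alone; no second table is read.
  List<UploadSummary> listMultipartUploads() {
    List<UploadSummary> result = new ArrayList<>();
    for (MultipartInfo info : multipartInfoTable.values()) {
      result.add(new UploadSummary(
          info.uploadId, info.creationTime, storageClassFor(info.replicationType)));
    }
    return result;
  }
}
```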
[jira] [Updated] (HDDS-2131) Optimize replication type and creation time calculation in S3 MPU list call
[ https://issues.apache.org/jira/browse/HDDS-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2131: - Labels: pull-request-available (was: ) > Optimize replication type and creation time calculation in S3 MPU list call > --- > > Key: HDDS-2131 > URL: https://issues.apache.org/jira/browse/HDDS-2131 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Siddharth Wagle >Priority: Major > Labels: pull-request-available > > Based on the review from [~bharatviswa]: > {code} > > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java > metadataManager.getOpenKeyTable(); > OmKeyInfo omKeyInfo = > openKeyTable.get(upload.getDbKey()); > {code} > {quote}Here we are reading openKeyTable only for getting creation time. If we > can have this information in omMultipartKeyInfo, we could avoid DB calls for > openKeyTable. > To do this, We can set creationTime in OmMultipartKeyInfo during > initiateMultipartUpload . In this way, we can get all the required > information from the MultipartKeyInfo table. > And also StorageClass is missing from the returned OmMultipartUpload, as > listMultipartUploads shows StorageClass information. For this, if we can > return replicationType and depending on this value, we can set StorageClass > in the listMultipartUploads Response. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDDS-2320) Negative value seen for OM NumKeys Metric in JMX.
[ https://issues.apache.org/jira/browse/HDDS-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aravindan Vijayan reassigned HDDS-2320: --- Assignee: Aravindan Vijayan > Negative value seen for OM NumKeys Metric in JMX. > - > > Key: HDDS-2320 > URL: https://issues.apache.org/jira/browse/HDDS-2320 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Aravindan Vijayan >Assignee: Aravindan Vijayan >Priority: Major > Attachments: Screen Shot 2019-10-17 at 11.31.08 AM.png > > > While running teragen/terasort on a cluster and verifying the number of keys > created on Ozone Manager, I noticed that the NumKeys counter metric > had a negative value !Screen Shot 2019-10-17 at 11.31.08 AM.png! .
[jira] [Resolved] (HDDS-2254) Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention
[ https://issues.apache.org/jira/browse/HDDS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anu Engineer resolved HDDS-2254. Fix Version/s: 0.5.0 Resolution: Fixed Committed to the master branch. > Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention > --- > > Key: HDDS-2254 > URL: https://issues.apache.org/jira/browse/HDDS-2254 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: test >Affects Versions: 0.5.0 >Reporter: Siddharth Wagle >Assignee: Aravindan Vijayan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Test always fails with assertion error: > {code} > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.ozone.client.rpc.TestContainerStateMachine.testRatisSnapshotRetention(TestContainerStateMachine.java:188) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2254) Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention
[ https://issues.apache.org/jira/browse/HDDS-2254?focusedWorklogId=330168&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330168 ] ASF GitHub Bot logged work on HDDS-2254: Author: ASF GitHub Bot Created on: 17/Oct/19 22:12 Start Date: 17/Oct/19 22:12 Worklog Time Spent: 10m Work Description: anuengineer commented on pull request #31: HDDS-2254. Fix flaky unit test TestContainerStateMachine#testRatisSn… URL: https://github.com/apache/hadoop-ozone/pull/31 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330168) Time Spent: 1h 50m (was: 1h 40m) > Fix flaky unit testTestContainerStateMachine#testRatisSnapshotRetention > --- > > Key: HDDS-2254 > URL: https://issues.apache.org/jira/browse/HDDS-2254 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: test >Affects Versions: 0.5.0 >Reporter: Siddharth Wagle >Assignee: Aravindan Vijayan >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Test always fails with assertion error: > {code} > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.ozone.client.rpc.TestContainerStateMachine.testRatisSnapshotRetention(TestContainerStateMachine.java:188) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14912) Set dfs.image.string-tables.expanded default to false in branch-2.7
[ https://issues.apache.org/jira/browse/HDFS-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyao Meng updated HDFS-14912: -- Summary: Set dfs.image.string-tables.expanded default to false in branch-2.7 (was: Set dfs.image.string-tables.expanded default config value to false in branch-2.7) > Set dfs.image.string-tables.expanded default to false in branch-2.7 > --- > > Key: HDFS-14912 > URL: https://issues.apache.org/jira/browse/HDFS-14912 > Project: Hadoop HDFS > Issue Type: Task >Affects Versions: 2.7.8 >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Attachments: HDFS-14912.001.branch-2.7.patch > > > In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, > dfs.image.string-tables.expanded is set to true by default: > https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627 > This is different from all other branches, which set it to false by default. > For instance, branch-2.8: > https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629 > Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to > false to make it consistent with other branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14912) Set dfs.image.string-tables.expanded default config value to false in branch-2.7
[ https://issues.apache.org/jira/browse/HDFS-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyao Meng updated HDFS-14912: -- Summary: Set dfs.image.string-tables.expanded default config value to false in branch-2.7 (was: Default dfs.image.string-tables.expanded to false in branch-2.7) > Set dfs.image.string-tables.expanded default config value to false in > branch-2.7 > > > Key: HDFS-14912 > URL: https://issues.apache.org/jira/browse/HDFS-14912 > Project: Hadoop HDFS > Issue Type: Task >Affects Versions: 2.7.8 >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Attachments: HDFS-14912.001.branch-2.7.patch > > > In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, > dfs.image.string-tables.expanded is set to true by default: > https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627 > This is different from all other branches, which set it to false by default. > For instance, branch-2.8: > https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629 > Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to > false to make it consistent with other branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14912) Default dfs.image.string-tables.expanded to false in branch-2.7
[ https://issues.apache.org/jira/browse/HDFS-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyao Meng updated HDFS-14912: -- Attachment: HDFS-14912.001.branch-2.7.patch Status: Patch Available (was: Open) > Default dfs.image.string-tables.expanded to false in branch-2.7 > --- > > Key: HDFS-14912 > URL: https://issues.apache.org/jira/browse/HDFS-14912 > Project: Hadoop HDFS > Issue Type: Task >Affects Versions: 2.7.8 >Reporter: Siyao Meng >Assignee: Siyao Meng >Priority: Major > Attachments: HDFS-14912.001.branch-2.7.patch > > > In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, > dfs.image.string-tables.expanded is set to true by default: > https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627 > This is different from all other branches, which set it to false by default. > For instance, branch-2.8: > https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629 > Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to > false to make it consistent with other branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-14912) Default dfs.image.string-tables.expanded to false in branch-2.7
Siyao Meng created HDFS-14912: - Summary: Default dfs.image.string-tables.expanded to false in branch-2.7 Key: HDFS-14912 URL: https://issues.apache.org/jira/browse/HDFS-14912 Project: Hadoop HDFS Issue Type: Task Affects Versions: 2.7.8 Reporter: Siyao Meng Assignee: Siyao Meng In the branch-2.7 patch for CVE-2018-11768 HDFS FSImage Corruption, dfs.image.string-tables.expanded is set to true by default: https://github.com/apache/hadoop/commit/109d44604ca843212bdf22b50e86a5a41e1d21da#diff-36b19e9d8816002ed9dff8580055d3fbR627 This is different from all other branches, which set it to false by default. For instance, branch-2.8: https://github.com/apache/hadoop/commit/f697f3c4fc0067bb82494e445900d86942685b09#diff-36b19e9d8816002ed9dff8580055d3fbR629 Goal: Flip the dfs.image.string-tables.expanded default in branch-2.7 to false to make it consistent with other branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
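
As a rough illustration of what flipping a default means in practice: a lookup of the form "configured value, else compiled-in default" only changes behavior for deployments that never set the key explicitly. The sketch below models that with a plain map; `DefaultFlipSketch` and `resolve` are hypothetical names, and the map is an assumed stand-in for a site configuration, not Hadoop's `Configuration` class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "explicit setting wins, otherwise the default applies".
class DefaultFlipSketch {

  static final String KEY = "dfs.image.string-tables.expanded";

  // Returns the explicitly configured value if present, else the compiled-in default.
  static boolean resolve(Map<String, String> siteConfig, boolean compiledDefault) {
    String value = siteConfig.get(KEY);
    return value == null ? compiledDefault : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> unset = new HashMap<>();
    Map<String, String> pinned = new HashMap<>();
    pinned.put(KEY, "true");

    // A cluster that never set the key follows whatever the branch default is.
    System.out.println(resolve(unset, true));
    System.out.println(resolve(unset, false));
    // A cluster that pinned the key is unaffected by the default flip.
    System.out.println(resolve(pinned, false));
  }
}
```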
[jira] [Work logged] (HDDS-2310) Add support to add ozone ranger plugin to Ozone Manager classpath
[ https://issues.apache.org/jira/browse/HDDS-2310?focusedWorklogId=330126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330126 ]

ASF GitHub Bot logged work on HDDS-2310:

Author: ASF GitHub Bot
Created on: 17/Oct/19 20:35
Start Date: 17/Oct/19 20:35
Worklog Time Spent: 10m
Work Description: vivekratnavel commented on pull request #49: HDDS-2310. Add support to add ozone ranger plugin to Ozone Manager cl…
URL: https://github.com/apache/hadoop-ozone/pull/49

…asspath.

## What changes were proposed in this pull request?

This PR adds a new hadoop shell profile for ozone manager to be able to extend Ozone Manager's classpath. It is implemented in a generic way and is not exclusive to ranger plugin. Any path can be added to Ozone Manager classpath by setting the env variable `OZONE_MANAGER_CLASSPATH`

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2310

## How was this patch tested?

Tested by running ozone manager locally with the above mentioned env set to some path.

```
export OZONE_MANAGER_CLASSPATH=/tmp/*
./hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/bin/ozone --debug om
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 330126)
Remaining Estimate: 0h
Time Spent: 10m

> Add support to add ozone ranger plugin to Ozone Manager classpath
> -
>
> Key: HDDS-2310
> URL: https://issues.apache.org/jira/browse/HDDS-2310
> Project: Hadoop Distributed Data Store
> Issue Type: Task
> Components: Ozone Manager
> Affects Versions: 0.5.0
> Reporter: Vivek Ratnavel Subramanian
> Assignee: Vivek Ratnavel Subramanian
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, there is no way to add Ozone Ranger plugin to Ozone Manager classpath.
> We should be able to set an environment variable that will be respected by ozone and added to Ozone Manager classpath.
[jira] [Updated] (HDDS-2310) Add support to add ozone ranger plugin to Ozone Manager classpath
[ https://issues.apache.org/jira/browse/HDDS-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2310: - Labels: pull-request-available (was: ) > Add support to add ozone ranger plugin to Ozone Manager classpath > - > > Key: HDDS-2310 > URL: https://issues.apache.org/jira/browse/HDDS-2310 > Project: Hadoop Distributed Data Store > Issue Type: Task > Components: Ozone Manager >Affects Versions: 0.5.0 >Reporter: Vivek Ratnavel Subramanian >Assignee: Vivek Ratnavel Subramanian >Priority: Major > Labels: pull-request-available > > Currently, there is no way to add Ozone Ranger plugin to Ozone Manager > classpath. > We should be able to set an environment variable that will be respected by > ozone and added to Ozone Manager classpath. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2181) Ozone Manager should send correct ACL type in ACL requests to Authorizer
[ https://issues.apache.org/jira/browse/HDDS-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vivek Ratnavel Subramanian updated HDDS-2181: - Resolution: Fixed Status: Resolved (was: Patch Available) > Ozone Manager should send correct ACL type in ACL requests to Authorizer > > > Key: HDDS-2181 > URL: https://issues.apache.org/jira/browse/HDDS-2181 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 >Reporter: Vivek Ratnavel Subramanian >Assignee: Vivek Ratnavel Subramanian >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Currently, Ozone manager sends "WRITE" as ACLType for key create, key delete > and bucket create operation. Fix the acl type in all requests to the > authorizer. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-2320) Negative value seen for OM NumKeys Metric in JMX.
[ https://issues.apache.org/jira/browse/HDDS-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954076#comment-16954076 ] Aravindan Vijayan commented on HDDS-2320: - cc [~hanishakoneru] / [~bharat] > Negative value seen for OM NumKeys Metric in JMX. > - > > Key: HDDS-2320 > URL: https://issues.apache.org/jira/browse/HDDS-2320 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Aravindan Vijayan >Priority: Major > Attachments: Screen Shot 2019-10-17 at 11.31.08 AM.png > > > While running teragen/terasort on a cluster and verifying the number of keys > created on Ozone Manager, I noticed that the NumKeys counter metric > had a negative value !Screen Shot 2019-10-17 at 11.31.08 AM.png! .
[jira] [Commented] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd
[ https://issues.apache.org/jira/browse/HDDS-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954062#comment-16954062 ] Anu Engineer commented on HDDS-2321: Since SCM has the root cert, it might be interesting if it sent a token over; that way these commands are also verified. In the long run, or even the short run, these SCM commands to DNs will go away. > Ozone Block Token verify should not apply to all datanode cmd > - > > Key: HDDS-2321 > URL: https://issues.apache.org/jira/browse/HDDS-2321 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.4.1 >Reporter: Nilotpal Nandi >Assignee: Xiaoyu Yao >Priority: Major > > The DN container protocol has commands sent from SCM or other DNs, which do not bear an OM > block token the way OM client commands do. We should restrict the OM block token check to > those issued from an OM client.
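
The restriction described in the issue can be sketched as a small policy check: only commands that originate from an OM client are required to carry a block token. This is a hedged illustration under stated assumptions, not the datanode's actual dispatcher: `BlockTokenPolicySketch`, the `Origin` enum, and the non-empty-string stand-in for token verification are all hypothetical, and how a datanode would reliably determine a command's origin is not covered by the source.

```java
// Hypothetical policy sketch; names and the origin model are assumptions.
class BlockTokenPolicySketch {

  enum Origin { OM_CLIENT, SCM, DATANODE }

  // Only client-originated commands are expected to bear an OM block token.
  static boolean requiresBlockToken(Origin origin) {
    return origin == Origin.OM_CLIENT;
  }

  // Placeholder check: a real verifier would validate signature, expiry and scope.
  static boolean accept(Origin origin, String token) {
    if (!requiresBlockToken(origin)) {
      return true; // SCM and DN commands do not carry an OM block token
    }
    return token != null && !token.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(accept(Origin.SCM, null));
    System.out.println(accept(Origin.OM_CLIENT, null));
    System.out.println(accept(Origin.OM_CLIENT, "token-bytes"));
  }
}
```

The comment above points at a different direction, in which SCM would send its own token so that these commands are verified as well; the sketch does not model that alternative.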
[jira] [Updated] (HDFS-14911) hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError
[ https://issues.apache.org/jira/browse/HDFS-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

bb updated HDFS-14911:
--
Issue Type: Bug (was: Wish)

> hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError
> --
>
> Key: HDFS-14911
> URL: https://issues.apache.org/jira/browse/HDFS-14911
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs, namenode
> Affects Versions: 2.6.0
> Reporter: bb
> Priority: Major
>
> The hdfs oiv "Delimited" processor is failing as follows:
> {code:java}
> root@hostname [/dirname] $ hdfs oiv --inputFile /dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile /dirname/user1/fsimage.dsv --processor Delimited -delimiter "|"
> Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /tmp/libleveldbjni-64-1-1996775695699915007.8: /tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from shared object: Operation not permitted]
> at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
> at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
> at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48)
> at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235)
> at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306)
> at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409)
> at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56)
> at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210)
> at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138)
> {code}
[jira] [Updated] (HDFS-14911) hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError
[ https://issues.apache.org/jira/browse/HDFS-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bb updated HDFS-14911: -- Description: The hdfs oiv "Delimited" processer is failing as follows: {code:java} root@hostname [/dirname] $ hdfs oiv --inputFile /dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile /dirname/user1/fsimage.dsv --processor Delimited -delimiter "|" Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /tmp/libleveldbjni-64-1-1996775695699915007.8: /tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from shared object: Operation not permitted] at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182) at org.fusesource.hawtjni.runtime.Library.load(Library.java:140) at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138){code} was: The hdfs oiv "Delimited" processer is failing as follows: root@hostname [/dirname] $ hdfs oiv --inputFile /dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile /dirname/user1/fsimage.dsv --processor Delimited -delimiter "|" Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load library. 
Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /tmp/libleveldbjni-64-1-1996775695699915007.8: /tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from shared object: Operation not permitted] at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182) at org.fusesource.hawtjni.runtime.Library.load(Library.java:140) at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138) > hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError > -- > > Key: HDFS-14911 > URL: https://issues.apache.org/jira/browse/HDFS-14911 > Project: Hadoop HDFS > Issue Type: Wish > Components: hdfs, namenode >Affects Versions: 2.6.0 >Reporter: bb >Priority: Major > > The hdfs oiv "Delimited" processer is failing as follows: > {code:java} > root@hostname [/dirname] $ hdfs oiv --inputFile > /dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile > /dirname/user1/fsimage.dsv --processor Delimited -delimiter "|" > Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load > library. 
Reasons: [no leveldbjni64-1.8 in java.library.path, no > leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, > /tmp/libleveldbjni-64-1-1996775695699915007.8: > /tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from > shared object: Operation not permitted] > at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182) > at org.fusesource.hawtjni.runtime.Library.load(Library.java:140) > at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDeli
[jira] [Created] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd
Xiaoyu Yao created HDDS-2321: Summary: Ozone Block Token verify should not apply to all datanode cmd Key: HDDS-2321 URL: https://issues.apache.org/jira/browse/HDDS-2321 Project: Hadoop Distributed Data Store Issue Type: Bug Affects Versions: 0.4.1 Reporter: Nilotpal Nandi Assignee: Xiaoyu Yao The DN container protocol includes commands sent from SCM or from other DNs, which do not carry an OM block token the way OM client requests do. We should restrict the OM block token check to commands issued by an OM client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2181) Ozone Manager should send correct ACL type in ACL requests to Authorizer
[ https://issues.apache.org/jira/browse/HDDS-2181?focusedWorklogId=330077&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330077 ] ASF GitHub Bot logged work on HDDS-2181: Author: ASF GitHub Bot Created on: 17/Oct/19 19:13 Start Date: 17/Oct/19 19:13 Worklog Time Spent: 10m Work Description: xiaoyuyao commented on pull request #43: HDDS-2181. Ozone Manager should send correct ACL type in ACL requests… URL: https://github.com/apache/hadoop-ozone/pull/43 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330077) Time Spent: 11h 10m (was: 11h) > Ozone Manager should send correct ACL type in ACL requests to Authorizer > > > Key: HDDS-2181 > URL: https://issues.apache.org/jira/browse/HDDS-2181 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 >Reporter: Vivek Ratnavel Subramanian >Assignee: Vivek Ratnavel Subramanian >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Currently, Ozone manager sends "WRITE" as ACLType for key create, key delete > and bucket create operation. Fix the acl type in all requests to the > authorizer.
[jira] [Created] (HDDS-2320) Negative value seen for OM NumKeys Metric in JMX.
Aravindan Vijayan created HDDS-2320: --- Summary: Negative value seen for OM NumKeys Metric in JMX. Key: HDDS-2320 URL: https://issues.apache.org/jira/browse/HDDS-2320 Project: Hadoop Distributed Data Store Issue Type: Bug Components: Ozone Manager Reporter: Aravindan Vijayan Attachments: Screen Shot 2019-10-17 at 11.31.08 AM.png While running teragen/terasort on a cluster and verifying the number of keys created on Ozone Manager, I noticed that the value of the NumKeys counter metric was negative. !Screen Shot 2019-10-17 at 11.31.08 AM.png!
[jira] [Created] (HDFS-14911) hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError
bb created HDFS-14911: - Summary: hdfs oiv "Delimited" processer failing with java.lang.UnsatisfiedLinkError Key: HDFS-14911 URL: https://issues.apache.org/jira/browse/HDFS-14911 Project: Hadoop HDFS Issue Type: Wish Components: hdfs, namenode Affects Versions: 2.6.0 Reporter: bb The hdfs oiv "Delimited" processer is failing as follows: root@hostname [/dirname] $ hdfs oiv --inputFile /dirname/user1/fsimage_12908988825 --temp /dirname/user1 --outputFile /dirname/user1/fsimage.dsv --processor Delimited -delimiter "|" Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /tmp/libleveldbjni-64-1-1996775695699915007.8: /tmp/libleveldbjni-64-1-1996775695699915007.8: failed to map segment from shared object: Operation not permitted] at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182) at org.fusesource.hawtjni.runtime.Library.load(Library.java:140) at org.fusesource.leveldbjni.JniDBFactory.(JniDBFactory.java:48) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap$LevelDBStore.(PBImageTextWriter.java:235) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter$LevelDBMetadataMap.(PBImageTextWriter.java:306) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.(PBImageTextWriter.java:409) at org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.(PBImageDelimitedTextWriter.java:56) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:210) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:138)
[jira] [Commented] (HDFS-14768) In some cases, erasure blocks are corruption when they are reconstruct.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953990#comment-16953990 ] Surendra Singh Lilhore commented on HDFS-14768: --- Thanks [~gjhkael] for the patch. Let's handle the other scenario in HDFS-14847. Changes look good to me. Minor comments: # Please fix the checkstyle warning. # There is no change in {{StripedWriter.java}}; please remove it from the patch. # There are some unnecessary changes that are not required. {code:java} -blockIndex = ((BlockInfoStriped) block) -.getStorageBlockIndex(storage); +blockIndex = ((BlockInfoStriped) block). +getStorageBlockIndex(storage); if (state == StoredReplicaState.LIVE) { if (!bitSet.get(blockIndex)) { bitSet.set(blockIndex); - } else { + } else { {code} {code:java} (getSrcNodes()[i].isEnteringMaintenance() && - getSrcNodes()[i].isAlive())) { -srcIndices.add(i); + getSrcNodes()[i].isAlive())) { + srcIndices.add(i); } {code} . > In some cases, erasure blocks are corruption when they are reconstruct. > > > Key: HDFS-14768 > URL: https://issues.apache.org/jira/browse/HDFS-14768 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding, hdfs, namenode >Affects Versions: 3.0.2 >Reporter: guojh >Assignee: guojh >Priority: Major > Labels: patch > Fix For: 3.3.0 > > Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, > HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, > HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.jpg, > guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, > zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt > > > Policy is RS-6-3-1024K, version is hadoop 3.0.2; > Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission > indices [3,4] and increase the index 6 datanode's > pendingReplicationWithoutTargets so that it is larger than > replicationStreamsHardLimit (we set 14). 
Then, after the method > chooseSourceDatanodes of BlockManager, the liveBlockIndices is > [0,1,2,3,4,5,7,8], and the block counter is Live:7, Decommission:2. > In method scheduleReconstruction of BlockManager, the additionalReplRequired > is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an > erasure coding task to the target datanode. > When the datanode gets the task, it builds targetIndices from liveBlockIndices > and the target length. The code is below. > {code:java} > // code placeholder > targetIndices = new short[targets.length]; > private void initTargetIndices() { > BitSet bitset = reconstructor.getLiveBitSet(); > int m = 0; hasValidTargets = false; > for (int i = 0; i < dataBlkNum + parityBlkNum; i++) { > if (!bitset.get(i)) { > if (reconstructor.getBlockLen(i) > 0) { > if (m < targets.length) { > targetIndices[m++] = (short)i; > hasValidTargets = true; > } > } > } > } > } > {code} > targetIndices[0]=6, and targetIndices[1] always keeps its initial value 0. > The StripedReader always creates readers from the first 6 block indices, which are > [0,1,2,3,4,5]. > Using the indices [0,1,2,3,4,5] to build target indices [6,0] triggers the ISA-L > bug: the data of block index 6 is corrupted (all data is zero). > I wrote a unit test that reproduces this reliably. 
> {code:java} > // code placeholder > private int replicationStreamsHardLimit = > DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT; > numDNs = dataBlocks + parityBlocks + 10; > @Test(timeout = 24) > public void testFileDecommission() throws Exception { > LOG.info("Starting test testFileDecommission"); > final Path ecFile = new Path(ecDir, "testFileDecommission"); > int writeBytes = cellSize * dataBlocks; > writeStripedFile(dfs, ecFile, writeBytes); > Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks()); > FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes); > final INodeFile fileNode = cluster.getNamesystem().getFSDirectory() > .getINode4Write(ecFile.toString()).asFile(); > LocatedBlocks locatedBlocks = > StripedFileTestUtil.getLocatedBlocks(ecFile, dfs); > LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0) > .get(0); > DatanodeInfo[] dnLocs = lb.getLocations(); > LocatedStripedBlock lastBlock = > (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock(); > DatanodeInfo[] storageInfos = lastBlock.getLocations(); > // > DatanodeDescriptor datanodeDescriptor = > cluster.getNameNode().getNamesystem() > > .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanode
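The targetIndices behaviour described in the HDFS-14768 report above can be reproduced in isolation. The following is a minimal sketch, not the HDFS code: the class name TargetIndicesSketch and the method signature are assumptions made for illustration, and the block-length check from the quoted loop is left out. With live indices [0,1,2,3,4,5,7,8], nine blocks in the group, and two targets, only index 6 is missing, so the second slot keeps Java's default array value of 0.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative stand-in for the loop quoted in the report; not the HDFS class.
final class TargetIndicesSketch {

  // Fills targetIndices with the indices that are absent from liveBitSet.
  // Slots that are never assigned keep the default value 0.
  static short[] initTargetIndices(BitSet liveBitSet, int totalBlocks, int numTargets) {
    short[] targetIndices = new short[numTargets]; // zero-filled by the JVM
    int m = 0;
    for (int i = 0; i < totalBlocks; i++) {
      if (!liveBitSet.get(i) && m < numTargets) {
        targetIndices[m++] = (short) i;
      }
    }
    return targetIndices;
  }

  public static void main(String[] args) {
    BitSet live = new BitSet(9);
    for (int i : new int[] {0, 1, 2, 3, 4, 5, 7, 8}) {
      live.set(i);
    }
    // Two targets were requested, but only index 6 is missing from the live set.
    short[] targets = initTargetIndices(live, 9, 2);
    System.out.println(Arrays.toString(targets)); // prints [6, 0]
  }
}
```

The second entry is indistinguishable from a real request to rebuild block index 0, which matches the [6,0] target list in the report.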
[jira] [Assigned] (HDDS-2145) Optimize client read path by reading multiple chunks along with block info in a single rpc call.
[ https://issues.apache.org/jira/browse/HDDS-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukul Kumar Singh reassigned HDDS-2145: --- Assignee: Hanisha Koneru (was: Shashikant Banerjee) > Optimize client read path by reading multiple chunks along with block info in > a single rpc call. > > > Key: HDDS-2145 > URL: https://issues.apache.org/jira/browse/HDDS-2145 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Client, Ozone Datanode >Reporter: Shashikant Banerjee >Assignee: Hanisha Koneru >Priority: Major > Fix For: 0.5.0 > > > Currently, the ozone client issues a getBlock call to read the metadata info from > RocksDB on the dn to get the chunkInfo, and then the chunks are read one by one > in separate rpc calls in the read path. This can be optimized by > piggybacking readChunk calls along with getBlock in a single rpc call to dn. > This Jira aims to address this.
[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation
[ https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953972#comment-16953972 ] Hadoop QA commented on HDFS-14854: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 37s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 2s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 40s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 45s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 11 new + 462 unchanged - 5 fixed = 473 total (was 467) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 99m 53s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}170m 32s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestDistributedFileSystem | | | hadoop.hdfs.tools.TestDFSZKFailoverController | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14854 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12983289/HDFS-14854.011.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux a1d0fabd0d16 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 3990ffa | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28104/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | u
[jira] [Commented] (HDFS-14384) When lastLocatedBlock token expire, it will take 1~3s second to refetch it.
[ https://issues.apache.org/jira/browse/HDFS-14384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953971#comment-16953971 ] Surendra Singh Lilhore commented on HDFS-14384: --- Thanks [~vinayakumarb] for the review. Will wait for [~daryn]'s comments. > When lastLocatedBlock token expire, it will take 1~3s second to refetch it. > --- > > Key: HDFS-14384 > URL: https://issues.apache.org/jira/browse/HDFS-14384 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.7.2 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-14384.001.patch, HDFS-14384.002.patch > > > Scenario : > 1. Write a file with one block which is in progress. > 2. Open the input stream and close the output stream. > 3. Wait for block token expiration and read the data. > 4. The last block read takes 1~3 sec.
[jira] [Updated] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles
[ https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2318: - Labels: performance pull-request-available (was: performance) > Avoid proto::tostring in preconditions to save CPU cycles > - > > Key: HDDS-2318 > URL: https://issues.apache.org/jira/browse/HDDS-2318 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Assignee: Bharat Viswanadham >Priority: Major > Labels: performance, pull-request-available > Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png > > > [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117] > > This ends up converting proto toString in precondition checks and burns CPU > cycles. {{request.toString()}} can be added in debug log on need basis.
[jira] [Work logged] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles
[ https://issues.apache.org/jira/browse/HDDS-2318?focusedWorklogId=330026&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330026 ] ASF GitHub Bot logged work on HDDS-2318: Author: ASF GitHub Bot Created on: 17/Oct/19 17:39 Start Date: 17/Oct/19 17:39 Worklog Time Spent: 10m Work Description: bharatviswa504 commented on pull request #48: HDDS-2318. Avoid proto::tostring in preconditions to save CPU cycles. URL: https://github.com/apache/hadoop-ozone/pull/48 ## What changes were proposed in this pull request? Avoiding proto::toString in OzoneManagerServerSideTranslatorPB.java ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2318 ## How was this patch tested? Ran TestOzoneRpcClient which executes this code path. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330026) Remaining Estimate: 0h Time Spent: 10m > Avoid proto::tostring in preconditions to save CPU cycles > - > > Key: HDDS-2318 > URL: https://issues.apache.org/jira/browse/HDDS-2318 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Assignee: Bharat Viswanadham >Priority: Major > Labels: performance, pull-request-available > Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png > > Time Spent: 10m > Remaining Estimate: 0h > > [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117] > > This ends up converting proto toString in precondition checks and burns CPU > cycles. {{request.toString()}} can be added in debug log on need basis. 
[jira] [Updated] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope
[ https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-14909: -- Fix Version/s: 3.2.2 3.1.4 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~elgoiri] and [~brahmareddy] for review. Committed to branch-3.1, branch-3.2 and trunk. > DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage > count for excluded node which is already part of excluded scope > - > > Key: HDFS-14909 > URL: https://issues.apache.org/jira/browse/HDFS-14909 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, > HDFS-14909.003.patch
[jira] [Commented] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope
[ https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953943#comment-16953943 ] Hudson commented on HDFS-14909: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17546 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17546/]) HDFS-14909. DFSNetworkTopology#chooseRandomWithStorageType() should not (surendralilhore: rev 54dc6b7d720851eb6017906d664aa0fda2698225) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DFSNetworkTopology.java * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestMissingBlocksAlert.java > DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage > count for excluded node which is already part of excluded scope > - > > Key: HDFS-14909 > URL: https://issues.apache.org/jira/browse/HDFS-14909 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, > HDFS-14909.003.patch
[jira] [Assigned] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently
[ https://issues.apache.org/jira/browse/HDFS-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang reassigned HDFS-14910: -- Assignee: Wei-Chiu Chuang > TestRenameWithSnapshots#testRename2PreDescendant failing consistently > - > > Key: HDFS-14910 > URL: https://issues.apache.org/jira/browse/HDFS-14910 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Íñigo Goiri >Assignee: Wei-Chiu Chuang >Priority: Major > > TestRenameWithSnapshots#testRename2PreDescendant has been failing > consistently.
[jira] [Commented] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles
[ https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953929#comment-16953929 ] Attila Doroszlai commented on HDDS-2318: Using a supplier with the right {{toString()}} method might help: https://github.com/apache/incubator-ratis/blob/master/ratis-common/src/main/java/org/apache/ratis/util/StringUtils.java#L130-L137 > Avoid proto::tostring in preconditions to save CPU cycles > - > > Key: HDDS-2318 > URL: https://issues.apache.org/jira/browse/HDDS-2318 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Assignee: Bharat Viswanadham >Priority: Major > Labels: performance > Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png > > > [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117] > > This ends up converting proto toString in precondition checks and burns CPU > cycles. {{request.toString()}} can be added in debug log on need basis.
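The supplier idea in the HDDS-2318 comment above can be sketched as follows. This is a hedged illustration only, not the Ratis or Ozone code: the class LazyToStringSketch and the helpers lazy and checkState are hypothetical names, and the precondition helper is a simplified stand-in. The point it shows is that wrapping the expensive string in an object whose toString() delegates to a Supplier defers the conversion until the message is actually formatted, which only happens when the check fails.

```java
import java.util.function.Supplier;

// Hypothetical helper names; a sketch of deferring an expensive toString().
final class LazyToStringSketch {

  // Wraps a supplier so the string is built only when toString() is invoked.
  static Object lazy(Supplier<String> supplier) {
    return new Object() {
      @Override
      public String toString() {
        return supplier.get();
      }
    };
  }

  // Simplified precondition check: the message is formatted only on failure.
  static void checkState(boolean ok, String template, Object arg) {
    if (!ok) {
      throw new IllegalStateException(String.format(template, arg));
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Stand-in for an expensive request.toString() call.
    Object expensive = lazy(() -> {
      calls[0]++;
      return "large request dump";
    });
    checkState(true, "Unexpected request: %s", expensive);
    System.out.println("toString calls after passing check: " + calls[0]); // prints 0
  }
}
```

Passing an eagerly built string instead would pay the conversion cost on every call, including the common case where the check passes.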
[jira] [Work logged] (HDDS-2240) Command line tool for OM Admin
[ https://issues.apache.org/jira/browse/HDDS-2240?focusedWorklogId=330005&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330005 ] ASF GitHub Bot logged work on HDDS-2240: Author: ASF GitHub Bot Created on: 17/Oct/19 17:06 Start Date: 17/Oct/19 17:06 Worklog Time Spent: 10m Work Description: anuengineer commented on pull request #1586: HDDS-2240. Command line tool for OM HA. URL: https://github.com/apache/hadoop/pull/1586#discussion_r336122720 ## File path: hadoop-ozone/common/src/main/proto/OzoneManagerProtocol.proto ## @@ -1097,11 +1097,34 @@ message UpdateGetS3SecretRequest { required string awsSecret = 2; } +message OMServiceId { +required string serviceID = 1; +} + +/** + This proto is used to define the OM node Id and its ratis server state. +*/ +message RoleInfo { +required string omNodeID = 1; +required string ratisServerRole = 2; +} + +/** + This is used to get the Server States of OMs. +*/ +message ServiceState { +repeated RoleInfo roleInfos = 1; +} + /** The OM service that takes care of Ozone namespace. */ service OzoneManagerService { // A client-to-OM RPC to send client requests to OM Ratis server rpc submitRequest(OMRequest) returns(OMResponse); + +// A client-to-OM RPC to get ratis server states of OMs +rpc getServiceState(OMServiceId) + returns(ServiceState); } Review comment: Sorry did not see this comment. How can a client communicate to OM? Does it not need to send the request to the leader ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330005) Time Spent: 3h 20m (was: 3h 10m) > Command line tool for OM Admin > -- > > Key: HDDS-2240 > URL: https://issues.apache.org/jira/browse/HDDS-2240 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru >Priority: Major > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > A command line tool (*ozone omha*) to get information related to OM HA. > This Jira proposes to add the _getServiceState_ option for OM HA which lists > all the OMs in the service and their corresponding Ratis server roles > (LEADER/ FOLLOWER). > We can later add more options to this tool.
[jira] [Updated] (HDDS-2319) CLI command to perform on-demand data scan of a specific container
[ https://issues.apache.org/jira/browse/HDDS-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Doroszlai updated HDDS-2319: --- Description: On-demand data scan for a specific container might be a useful debug tool. Thanks [~aengineer] for the idea. (was: On-demand data scan for a specific container might be a useful debug tool.) > CLI command to perform on-demand data scan of a specific container > -- > > Key: HDDS-2319 > URL: https://issues.apache.org/jira/browse/HDDS-2319 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task > Components: Ozone CLI >Reporter: Attila Doroszlai >Priority: Major > > On-demand data scan for a specific container might be a useful debug tool. > Thanks [~aengineer] for the idea.
[jira] [Created] (HDDS-2319) CLI command to perform on-demand data scan of a specific container
Attila Doroszlai created HDDS-2319: -- Summary: CLI command to perform on-demand data scan of a specific container Key: HDDS-2319 URL: https://issues.apache.org/jira/browse/HDDS-2319 Project: Hadoop Distributed Data Store Issue Type: Sub-task Components: Ozone CLI Reporter: Attila Doroszlai On-demand data scan for a specific container might be a useful debug tool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently
[ https://issues.apache.org/jira/browse/HDFS-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953916#comment-16953916 ]

Ayush Saxena commented on HDFS-14910:
-------------------------------------

{{this.removeFeature(..)}} is getting called twice in {{InodeDirectory.java}}. The only fix I found was something like this in {{InodeDirectory.java}} L843:
{code:java}
-      if (sf.getDiffs().isEmpty() &&
-          !(sf instanceof DirectorySnapshottableFeature)) {
+      if (sf.getDiffs().isEmpty()
+          && !(sf instanceof DirectorySnapshottableFeature)
+          && this.getDirectorySnapshottableFeature() != null) {
{code}
[~weichiu] [~shashikant], if you have any other solution, let me know. Otherwise, I can put up the code above, which adds a check that the feature actually exists before removing it.

> TestRenameWithSnapshots#testRename2PreDescendant failing consistently
> ---------------------------------------------------------------------
>
> Key: HDFS-14910
> URL: https://issues.apache.org/jira/browse/HDFS-14910
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Íñigo Goiri
> Priority: Major
>
> TestRenameWithSnapshots#testRename2PreDescendant has been failing
> consistently.
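The comment above proposes checking that a feature is still attached before removing it, so that a second removal call does not fail. A minimal sketch of that guard pattern follows. It is not the HDFS code: the class FeatureHolderSketch, its nested SnapshottableFeature type, and the strict behaviour of removeFeature are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an inode that carries optional features. */
final class FeatureHolderSketch {
  /** Marker type used only in this sketch; not an HDFS class. */
  static final class SnapshottableFeature { }

  private final List<Object> features = new ArrayList<>();

  void addFeature(Object feature) {
    features.add(feature);
  }

  /** Returns the attached snapshottable feature, or null if there is none. */
  SnapshottableFeature getSnapshottableFeature() {
    for (Object f : features) {
      if (f instanceof SnapshottableFeature) {
        return (SnapshottableFeature) f;
      }
    }
    return null;
  }

  /** Strict removal (assumed): fails when the feature is not attached. */
  void removeFeature(Object feature) {
    if (!features.remove(feature)) {
      throw new IllegalStateException("Feature is not attached: " + feature);
    }
  }

  /** Guarded removal: checks for existence first, so a repeat call is a no-op. */
  boolean removeSnapshottableFeatureIfPresent() {
    SnapshottableFeature f = getSnapshottableFeature();
    if (f == null) {
      return false; // already removed, nothing to do
    }
    removeFeature(f);
    return true;
  }

  public static void main(String[] args) {
    FeatureHolderSketch dir = new FeatureHolderSketch();
    dir.addFeature(new SnapshottableFeature());
    System.out.println(dir.removeSnapshottableFeatureIfPresent());
    System.out.println(dir.removeSnapshottableFeatureIfPresent());
  }
}
```

With the guard in place, the caller does not need to know whether another code path has already removed the feature.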
[jira] [Commented] (HDFS-14810) Review FSNameSystem editlog sync
[ https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953914#comment-16953914 ]

Hudson commented on HDFS-14810:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17545 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17545/])
HDFS-14810. Review FSNameSystem editlog sync. Contributed by Xiaoqiao (ayushsaxena: rev 5527d79adb9b1e2f2779c283f81d6a3d5447babc)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java

> Review FSNameSystem editlog sync
> --------------------------------
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch,
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the type of edit log sync in FSNamesystem, as mentioned
> in HDFS-11246.
[jira] [Commented] (HDFS-14887) RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
[ https://issues.apache.org/jira/browse/HDFS-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953909#comment-16953909 ]

Takanobu Asanuma commented on HDFS-14887:
-----------------------------------------

Thanks for the discussion and for updating the patch, [~hemanthboyina] and [~elgoiri].

I tested with the 008 patch and found that the observer icon is used in the Nameservice Information when there are an ActiveNN and an ObserverNN in the same nameservice. Please see [^14887.008.png].
 * This is because the enum order in {{FederationNamenodeServiceState}} is used for the comparator. I think the order should be ACTIVE > OBSERVER > STANDBY.
 * I want to use the same order (active > observer > standby) in {{FederationNamenodeServiceState#getState()}} and {{augment_namenodes()}} in federationhealth.js.
 * We may also need to add the observer icon to federationhealth-namenode-legend in Nameservice Information.

> RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
> ------------------------------------------------------------------------------
>
> Key: HDFS-14887
> URL: https://issues.apache.org/jira/browse/HDFS-14887
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: hemanthboyina
> Assignee: hemanthboyina
> Priority: Major
> Attachments: 14887.008.png, 14887.after.png, 14887.before.png,
> HDFS-14887.001.patch, HDFS-14887.002.patch, HDFS-14887.003.patch,
> HDFS-14887.004.patch, HDFS-14887.005.patch, HDFS-14887.006.patch,
> HDFS-14887.007.patch, HDFS-14887.008.patch
>
>
> In Router Web UI, Observer Namenode Information displaying as Unavailable.
> We should show a proper icon for them.
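The comment above relies on the fact that a Java enum's natural ordering follows its declaration order, so a comparator built on it picks whichever state is declared first. The sketch below illustrates that with a hypothetical enum; the name State and its constants are assumptions, and the real FederationNamenodeServiceState may declare different constants in a different order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NamenodeStateOrderSketch {
  /** Hypothetical enum: declaration order is the priority order. */
  enum State { ACTIVE, OBSERVER, STANDBY, UNAVAILABLE }

  /** Returns a copy sorted by ordinal, most preferred state first. */
  static List<State> sortByPriority(List<State> states) {
    List<State> sorted = new ArrayList<>(states);
    sorted.sort(Comparator.naturalOrder()); // enums compare by ordinal
    return sorted;
  }

  /** The state shown for a nameservice: the highest-priority one present. */
  static State representative(List<State> states) {
    if (states.isEmpty()) {
      throw new IllegalArgumentException("No namenode states given");
    }
    return sortByPriority(states).get(0);
  }

  public static void main(String[] args) {
    List<State> nameservice = Arrays.asList(State.OBSERVER, State.ACTIVE);
    System.out.println(representative(nameservice));
  }
}
```

If OBSERVER were declared before ACTIVE in this sketch, the same nameservice would be represented by OBSERVER, which matches the symptom reported for the 008 patch.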
[jira] [Commented] (HDFS-14887) RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
[ https://issues.apache.org/jira/browse/HDFS-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953907#comment-16953907 ]

Hadoop QA commented on HDFS-14887:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 9s{color} | {color:red} HDFS-14887 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14887 |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28105/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
> ------------------------------------------------------------------------------
>
> Key: HDFS-14887
> URL: https://issues.apache.org/jira/browse/HDFS-14887
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: hemanthboyina
> Assignee: hemanthboyina
> Priority: Major
> Attachments: 14887.008.png, 14887.after.png, 14887.before.png,
> HDFS-14887.001.patch, HDFS-14887.002.patch, HDFS-14887.003.patch,
> HDFS-14887.004.patch, HDFS-14887.005.patch, HDFS-14887.006.patch,
> HDFS-14887.007.patch, HDFS-14887.008.patch
>
>
> In Router Web UI, Observer Namenode Information displaying as Unavailable.
> We should show a proper icon for them.
[jira] [Commented] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently
[ https://issues.apache.org/jira/browse/HDFS-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953904#comment-16953904 ] Íñigo Goiri commented on HDFS-14910: [~ayushsaxena] narrowed it down to HDFS-14492. [~weichiu], [~shashikant], any insights? > TestRenameWithSnapshots#testRename2PreDescendant failing consistently > - > > Key: HDFS-14910 > URL: https://issues.apache.org/jira/browse/HDFS-14910 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Íñigo Goiri >Priority: Major > > TestRenameWithSnapshots#testRename2PreDescendant has been failing > consistently. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-14910) TestRenameWithSnapshots#testRename2PreDescendant failing consistently
Íñigo Goiri created HDFS-14910: -- Summary: TestRenameWithSnapshots#testRename2PreDescendant failing consistently Key: HDFS-14910 URL: https://issues.apache.org/jira/browse/HDFS-14910 Project: Hadoop HDFS Issue Type: Bug Reporter: Íñigo Goiri TestRenameWithSnapshots#testRename2PreDescendant has been failing consistently. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14887) RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable
[ https://issues.apache.org/jira/browse/HDFS-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takanobu Asanuma updated HDFS-14887: Attachment: 14887.008.png > RBF: In Router Web UI, Observer Namenode Information displaying as Unavailable > -- > > Key: HDFS-14887 > URL: https://issues.apache.org/jira/browse/HDFS-14887 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: 14887.008.png, 14887.after.png, 14887.before.png, > HDFS-14887.001.patch, HDFS-14887.002.patch, HDFS-14887.003.patch, > HDFS-14887.004.patch, HDFS-14887.005.patch, HDFS-14887.006.patch, > HDFS-14887.007.patch, HDFS-14887.008.patch > > > In Router Web UI, Observer Namenode Information displaying as Unavailable. > We should show a proper icon for them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope
[ https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953902#comment-16953902 ] Íñigo Goiri commented on HDFS-14909: I see [~ayushtkn] narrowed it down to HDFS-14492. I'll open a JIRA. > DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage > count for excluded node which is already part of excluded scope > - > > Key: HDFS-14909 > URL: https://issues.apache.org/jira/browse/HDFS-14909 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, > HDFS-14909.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation
[ https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953903#comment-16953903 ]

Stephen O'Donnell commented on HDFS-14854:
------------------------------------------

Let me have a look at that Configuration change too ...

[~weichiu] has told me he plans to review this when he gets some time. As we have had a few review cycles already, it should be in good shape for him to have a thorough look now.

> Create improved decommission monitor implementation
> ---------------------------------------------------
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.3.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch,
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch,
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch,
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch,
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current
> decommission monitor implementation, such as:
> * Blocks are replicated sequentially disk by disk and node by node, and
> hence the load is not spread well across the cluster
> * Adding a node for decommission can cause the namenode write lock to be
> held for a long time.
> * Decommissioning nodes floods the replication queue, and under-replicated
> blocks from a future node or disk failure may wait for a long time before they
> are replicated.
> * Blocks pending replication are checked many times under a write lock
> before they are sufficiently replicated, wasting resources
> In this Jira I propose to create a new implementation of the decommission
> monitor that resolves these issues. As it will be difficult to prove one
> implementation is better than another, the new implementation can be enabled
> or disabled, giving the option of the existing implementation or the new one.
> I will attach a PDF with some more details on the design and then a version 1
> patch shortly.
[jira] [Commented] (HDFS-14909) DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage count for excluded node which is already part of excluded scope
[ https://issues.apache.org/jira/browse/HDFS-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953898#comment-16953898 ] Íñigo Goiri commented on HDFS-14909: +1 on [^HDFS-14909.003.patch]. Does anybody know what's happening with TestRenameWithSnapshots.testRename2PreDescendant? I've seen it failing pretty consistently in the last few days. > DFSNetworkTopology#chooseRandomWithStorageType() should not decrease storage > count for excluded node which is already part of excluded scope > - > > Key: HDFS-14909 > URL: https://issues.apache.org/jira/browse/HDFS-14909 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-14909.001.patch, HDFS-14909.002.patch, > HDFS-14909.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation
[ https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953892#comment-16953892 ]

Íñigo Goiri commented on HDFS-14854:
------------------------------------

Thanks for the changes, [^HDFS-14854.011.patch] looks good to me.
One minor thing: we can use Configuration#getClass() in DatanodeAdminManager#136, as it will take care of getting the class and making it an interface, etc.

Anybody else up for going over this?

> Create improved decommission monitor implementation
> ---------------------------------------------------
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.3.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch,
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch,
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch,
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch,
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current
> decommission monitor implementation, such as:
> * Blocks are replicated sequentially disk by disk and node by node, and
> hence the load is not spread well across the cluster
> * Adding a node for decommission can cause the namenode write lock to be
> held for a long time.
> * Decommissioning nodes floods the replication queue, and under-replicated
> blocks from a future node or disk failure may wait for a long time before they
> are replicated.
> * Blocks pending replication are checked many times under a write lock
> before they are sufficiently replicated, wasting resources
> In this Jira I propose to create a new implementation of the decommission
> monitor that resolves these issues. As it will be difficult to prove one
> implementation is better than another, the new implementation can be enabled
> or disabled, giving the option of the existing implementation or the new one.
> I will attach a PDF with some more details on the design and then a version 1
> patch shortly.
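The review comment above suggests letting a configuration helper resolve a class name and check it against an interface, instead of doing that by hand. The sketch below shows that kind of lookup using only the JDK, so it runs without Hadoop on the classpath. It is an illustration of the idea, not Hadoop's Configuration code: the names ClassLookupSketch, Monitor, DefaultMonitor, and lookup are hypothetical.

```java
final class ClassLookupSketch {
  /** Hypothetical interface standing in for a pluggable monitor. */
  interface Monitor {
    String name();
  }

  /** Hypothetical default implementation used when nothing is configured. */
  static final class DefaultMonitor implements Monitor {
    @Override
    public String name() {
      return "default";
    }
  }

  /**
   * Resolves a configured class name to a class that implements iface.
   * Falls back to defaultClass when no name is configured.
   */
  static <T> Class<? extends T> lookup(String className,
      Class<? extends T> defaultClass, Class<T> iface) {
    if (className == null || className.trim().isEmpty()) {
      return defaultClass;
    }
    try {
      Class<?> loaded = Class.forName(className.trim());
      if (!iface.isAssignableFrom(loaded)) {
        throw new IllegalArgumentException(
            loaded.getName() + " does not implement " + iface.getName());
      }
      return loaded.asSubclass(iface); // type-safe view of the loaded class
    } catch (ClassNotFoundException e) {
      throw new IllegalArgumentException("Class not found: " + className, e);
    }
  }

  public static void main(String[] args) {
    Class<? extends Monitor> chosen =
        lookup(null, DefaultMonitor.class, Monitor.class);
    System.out.println(chosen.getSimpleName());
  }
}
```

Centralising the lookup keeps the default, the class loading, and the interface check in one place, which is the benefit the comment points at.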
[jira] [Updated] (HDFS-14810) Review FSNameSystem editlog sync
[ https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena updated HDFS-14810:
--------------------------------
Fix Version/s: 3.3.0
Hadoop Flags: Reviewed
Resolution: Fixed
Status: Resolved (was: Patch Available)

> Review FSNameSystem editlog sync
> --------------------------------
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch,
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the type of edit log sync in FSNamesystem, as mentioned
> in HDFS-11246.
[jira] [Commented] (HDFS-14810) Review FSNameSystem editlog sync
[ https://issues.apache.org/jira/browse/HDFS-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953891#comment-16953891 ]

Ayush Saxena commented on HDFS-14810:
-------------------------------------

Committed to trunk.
Thanks [~hexiaoqiao] for the contribution!

> Review FSNameSystem editlog sync
> --------------------------------
>
> Key: HDFS-14810
> URL: https://issues.apache.org/jira/browse/HDFS-14810
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Xiaoqiao He
> Assignee: Xiaoqiao He
> Priority: Major
> Attachments: HDFS-14810.001.patch, HDFS-14810.002.patch,
> HDFS-14810.003.patch, HDFS-14810.004.patch
>
>
> Refactor and unify the type of edit log sync in FSNamesystem, as mentioned
> in HDFS-11246.
[jira] [Commented] (HDFS-14852) Remove of LowRedundancyBlocks do NOT remove the block from all queues
[ https://issues.apache.org/jira/browse/HDFS-14852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953862#comment-16953862 ] Hadoop QA commented on HDFS-14852: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 40s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 59s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 47s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}107m 24s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}183m 15s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.blockmanagement.TestLowRedundancyBlockQueues | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.3 Server=19.03.3 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14852 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12983274/HDFS-14852.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux af919f484f4f 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 3990ffa | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28103/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28103/testReport/ | | Max. process+thread count | 2671 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/PreCommi
[jira] [Commented] (HDFS-14854) Create improved decommission monitor implementation
[ https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953828#comment-16953828 ]

Stephen O'Donnell commented on HDFS-14854:
------------------------------------------

[~elgoiri] I have addressed all the latest comments, I think.

moveBlocksToPending is tricky to make much simpler, as it needs to break out of the inner or outer loop on various conditions. However, I pulled some of the logic into a new method and added a few comments, which I think improves it a bit. Let me know what you think.

> Create improved decommission monitor implementation
> ---------------------------------------------------
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.3.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch,
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch,
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch,
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch,
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current
> decommission monitor implementation, such as:
> * Blocks are replicated sequentially disk by disk and node by node, and
> hence the load is not spread well across the cluster
> * Adding a node for decommission can cause the namenode write lock to be
> held for a long time.
> * Decommissioning nodes floods the replication queue, and under-replicated
> blocks from a future node or disk failure may wait for a long time before they
> are replicated.
> * Blocks pending replication are checked many times under a write lock
> before they are sufficiently replicated, wasting resources
> In this Jira I propose to create a new implementation of the decommission
> monitor that resolves these issues. As it will be difficult to prove one
> implementation is better than another, the new implementation can be enabled
> or disabled, giving the option of the existing implementation or the new one.
> I will attach a PDF with some more details on the design and then a version 1
> patch shortly.
[jira] [Updated] (HDFS-14854) Create improved decommission monitor implementation
[ https://issues.apache.org/jira/browse/HDFS-14854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen O'Donnell updated HDFS-14854:
-------------------------------------
Attachment: HDFS-14854.011.patch

> Create improved decommission monitor implementation
> ---------------------------------------------------
>
> Key: HDFS-14854
> URL: https://issues.apache.org/jira/browse/HDFS-14854
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.3.0
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Attachments: Decommission_Monitor_V2_001.pdf, HDFS-14854.001.patch,
> HDFS-14854.002.patch, HDFS-14854.003.patch, HDFS-14854.004.patch,
> HDFS-14854.005.patch, HDFS-14854.006.patch, HDFS-14854.007.patch,
> HDFS-14854.008.patch, HDFS-14854.009.patch, HDFS-14854.010.patch,
> HDFS-14854.011.patch
>
>
> In HDFS-13157, we discovered a series of problems with the current
> decommission monitor implementation, such as:
> * Blocks are replicated sequentially disk by disk and node by node, and
> hence the load is not spread well across the cluster
> * Adding a node for decommission can cause the namenode write lock to be
> held for a long time.
> * Decommissioning nodes floods the replication queue, and under-replicated
> blocks from a future node or disk failure may wait for a long time before they
> are replicated.
> * Blocks pending replication are checked many times under a write lock
> before they are sufficiently replicated, wasting resources
> In this Jira I propose to create a new implementation of the decommission
> monitor that resolves these issues. As it will be difficult to prove one
> implementation is better than another, the new implementation can be enabled
> or disabled, giving the option of the existing implementation or the new one.
> I will attach a PDF with some more details on the design and then a version 1
> patch shortly.
[jira] [Resolved] (HDDS-2221) Monitor datanodes in ozoneperf compose cluster
[ https://issues.apache.org/jira/browse/HDDS-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marton Elek resolved HDDS-2221.
-------------------------------
Fix Version/s: 0.5.0
Resolution: Fixed

> Monitor datanodes in ozoneperf compose cluster
> ----------------------------------------------
>
> Key: HDDS-2221
> URL: https://issues.apache.org/jira/browse/HDDS-2221
> Project: Hadoop Distributed Data Store
> Issue Type: Task
> Components: docker
> Reporter: Marton Elek
> Assignee: Marton Elek
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.0
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> The ozoneperf compose cluster contains a Prometheus instance, but as of now
> it collects data only from scm and om.
> We don't know the exact number of datanodes (the cluster can be scaled up and
> down), therefore it's harder to configure the datanode host names. I suggest
> configuring the first 10 datanodes (which covers most of the use cases).
> How to test?
> {code:java}
> cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozoneperf
> docker-compose up -d
> firefox http://localhost:9090/targets
> {code}
>
[jira] [Work logged] (HDDS-2221) Monitor datanodes in ozoneperf compose cluster
[ https://issues.apache.org/jira/browse/HDDS-2221?focusedWorklogId=329888&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-329888 ]

ASF GitHub Bot logged work on HDDS-2221:
----------------------------------------
Author: ASF GitHub Bot
Created on: 17/Oct/19 14:00
Start Date: 17/Oct/19 14:00
Worklog Time Spent: 10m
Work Description: elek commented on pull request #1: HDDS-2221. Monitor datanodes in ozoneperf compose cluster
URL: https://github.com/apache/hadoop-ozone/pull/1

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 329888)
Time Spent: 1h 20m (was: 1h 10m)

> Monitor datanodes in ozoneperf compose cluster
> ----------------------------------------------
>
> Key: HDDS-2221
> URL: https://issues.apache.org/jira/browse/HDDS-2221
> Project: Hadoop Distributed Data Store
> Issue Type: Task
> Components: docker
> Reporter: Marton Elek
> Assignee: Marton Elek
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> The ozoneperf compose cluster contains a Prometheus instance, but as of now
> it collects data only from scm and om.
> We don't know the exact number of datanodes (the cluster can be scaled up and
> down), therefore it's harder to configure the datanode host names. I suggest
> configuring the first 10 datanodes (which covers most of the use cases).
> How to test?
> {code:java}
> cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozoneperf
> docker-compose up -d
> firefox http://localhost:9090/targets
> {code}
>
[jira] [Updated] (HDDS-1985) Fix listVolumes API
[ https://issues.apache.org/jira/browse/HDDS-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marton Elek updated HDDS-1985:
------------------------------
Fix Version/s: 0.5.0
Resolution: Fixed
Status: Resolved (was: Patch Available)

> Fix listVolumes API
> -------------------
>
> Key: HDDS-1985
> URL: https://issues.apache.org/jira/browse/HDDS-1985
> Project: Hadoop Distributed Data Store
> Issue Type: Sub-task
> Reporter: Bharat Viswanadham
> Assignee: Bharat Viswanadham
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> This Jira is to fix the listVolumes API in the HA code path.
> In HA, we have an in-memory cache: we put the result into the in-memory cache
> and return the response; later it is picked up by the double buffer thread and
> flushed to disk. So now, when we do listVolumes, it should use both the
> in-memory cache and the rocksdb volume table to list volumes for a user.
>
> No fix is required for this, as the information is retrieved from the MPU Key
> table and not through RocksDB Table iteration. (When we use get(), it checks
> the cache first, and then it checks the table.)
>
> Used this Jira to add an integration test to verify the behavior.
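The description above contrasts two read paths: a point lookup with get() that checks the in-memory cache before the table, and a listing that has to combine unflushed cache entries with the persisted table. The sketch below models both with plain JDK collections. It is a simplified illustration under assumed names (CachedTableSketch, listKeys, flush); it does not reflect the actual Ozone Manager classes or RocksDB APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class CachedTableSketch {
  // Persisted rows; stands in for the on-disk table.
  private final TreeMap<String, String> table = new TreeMap<>();
  // Unflushed writes; Optional.empty() marks a pending delete.
  private final TreeMap<String, Optional<String>> cache = new TreeMap<>();

  void put(String key, String value) {
    cache.put(key, Optional.of(value));
  }

  void delete(String key) {
    cache.put(key, Optional.empty());
  }

  /** Point lookup: the cache is checked first, then the table. */
  String get(String key) {
    Optional<String> cached = cache.get(key);
    if (cached != null) {
      return cached.orElse(null);
    }
    return table.get(key);
  }

  /** Simulates the background flush of cached entries to the table. */
  void flush() {
    for (Map.Entry<String, Optional<String>> e : cache.entrySet()) {
      if (e.getValue().isPresent()) {
        table.put(e.getKey(), e.getValue().get());
      } else {
        table.remove(e.getKey());
      }
    }
    cache.clear();
  }

  /** Listing: overlays the cache on the table so unflushed changes count. */
  List<String> listKeys(String prefix) {
    TreeMap<String, Optional<String>> merged = new TreeMap<>();
    for (Map.Entry<String, String> e : table.entrySet()) {
      merged.put(e.getKey(), Optional.of(e.getValue()));
    }
    merged.putAll(cache); // cache entries win over persisted rows
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Optional<String>> e : merged.entrySet()) {
      if (e.getKey().startsWith(prefix) && e.getValue().isPresent()) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    CachedTableSketch volumes = new CachedTableSketch();
    volumes.put("/alice/vol1", "v1");
    volumes.flush();
    volumes.put("/alice/vol2", "v2");
    volumes.delete("/alice/vol1");
    System.out.println(volumes.listKeys("/alice/"));
  }
}
```

A listing that iterated only the table would still report the deleted key and miss the new one until the next flush, which is the gap the description is concerned with.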
[jira] [Work logged] (HDDS-1985) Fix listVolumes API
[ https://issues.apache.org/jira/browse/HDDS-1985?focusedWorklogId=329887&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-329887 ] ASF GitHub Bot logged work on HDDS-1985: Author: ASF GitHub Bot Created on: 17/Oct/19 13:57 Start Date: 17/Oct/19 13:57 Worklog Time Spent: 10m Work Description: elek commented on pull request #33: HDDS-1985. Fix listVolumes API URL: https://github.com/apache/hadoop-ozone/pull/33 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 329887) Time Spent: 0.5h (was: 20m) > Fix listVolumes API > --- > > Key: HDDS-1985 > URL: https://issues.apache.org/jira/browse/HDDS-1985 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task >Reporter: Bharat Viswanadham >Assignee: Bharat Viswanadham >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This Jira is to fix the listVolumes API in the HA code path. > In HA, we have an in-memory cache: we put the result into the cache > and return the response; later it is picked up by the double buffer thread and > flushed to disk. So when we do listVolumes, it should use both the > in-memory cache and the RocksDB volume table to list volumes for a user. > > No fix is required for this, as the information is retrieved from the MPU Key > table and not through RocksDB Table iteration (when > we use get(), it checks the cache first and then the table). > > Used this Jira to add an integration test to verify the behavior.
[jira] [Commented] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles
[ https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953743#comment-16953743 ] Mukul Kumar Singh commented on HDDS-2318: - cc [~arp], [~nanda], [~bharat], [~hanishakoneru] > Avoid proto::tostring in preconditions to save CPU cycles > - > > Key: HDDS-2318 > URL: https://issues.apache.org/jira/browse/HDDS-2318 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Assignee: Bharat Viswanadham >Priority: Major > Labels: performance > Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png > > > [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117] > > This ends up calling toString on the proto in precondition checks, which burns CPU > cycles. {{request.toString()}} can be added to a debug log on an as-needed basis.
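To illustrate the cost pattern described in HDDS-2318, here is a minimal sketch that contrasts an eagerly built precondition message with a lazily supplied one. It uses only the JDK's `Objects.requireNonNull` overloads; `FakeRequest`, `checkEager` and `checkLazy` are hypothetical names, and the sketch is not a copy of the linked Ozone code.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Sketch only: build expensive diagnostic strings only when a check fails. */
final class LazyPreconditionSketch {

  /** Hypothetical stand-in for a protobuf request whose toString() is costly. */
  static final class FakeRequest {
    static int toStringCalls = 0;
    private final String payload;

    FakeRequest(String payload) {
      this.payload = payload;
    }

    @Override
    public String toString() {
      toStringCalls++; // count how often the expensive rendering runs
      return "FakeRequest{" + payload + "}";
    }
  }

  /** Eager variant: the message is built on every call, even when the check passes. */
  static void checkEager(Object handler, FakeRequest request) {
    Objects.requireNonNull(handler, "no handler for " + request.toString());
  }

  /** Lazy variant: the supplier runs only if the check actually fails. */
  static void checkLazy(Object handler, FakeRequest request) {
    Supplier<String> message = () -> "no handler for " + request;
    Objects.requireNonNull(handler, message);
  }

  public static void main(String[] args) {
    FakeRequest request = new FakeRequest("createVolume");
    Object handler = new Object();
    for (int i = 0; i < 1000; i++) {
      checkEager(handler, request);
    }
    System.out.println("eager toString calls: " + FakeRequest.toStringCalls);
    FakeRequest.toStringCalls = 0;
    for (int i = 0; i < 1000; i++) {
      checkLazy(handler, request);
    }
    System.out.println("lazy toString calls: " + FakeRequest.toStringCalls);
  }
}
```

The same idea applies to logging: guarding the call with a debug-level check keeps `request.toString()` off the hot path, as the issue suggests.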
[jira] [Assigned] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles
[ https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukul Kumar Singh reassigned HDDS-2318: --- Assignee: Bharat Viswanadham > Avoid proto::tostring in preconditions to save CPU cycles > - > > Key: HDDS-2318 > URL: https://issues.apache.org/jira/browse/HDDS-2318 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Assignee: Bharat Viswanadham >Priority: Major > Labels: performance > Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png > > > [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117] > > This ends up calling toString on the proto in precondition checks, which burns CPU > cycles. {{request.toString()}} can be added to a debug log on an as-needed basis.
[jira] [Commented] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent
[ https://issues.apache.org/jira/browse/HDDS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953724#comment-16953724 ] Chris Teoh commented on HDDS-1847: -- I'm unclear on this issue. The bottom 2 keys are HDFS-specific, not Ozone; wouldn't changing those keys affect the HDFS project rather than the HDDS project? Can I please get more clarification on what is required? > Datanode Kerberos principal and keytab config key looks inconsistent > > > Key: HDDS-1847 > URL: https://issues.apache.org/jira/browse/HDDS-1847 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.5.0 >Reporter: Eric Yang >Assignee: Chris Teoh >Priority: Major > Labels: newbie > > Ozone Kerberos configuration can be very confusing: > | config name | Description | > | hdds.scm.kerberos.principal | SCM service principal | > | hdds.scm.kerberos.keytab.file | SCM service keytab file | > | ozone.om.kerberos.principal | Ozone Manager service principal | > | ozone.om.kerberos.keytab.file | Ozone Manager keytab file | > | hdds.scm.http.kerberos.principal | SCM service spnego principal | > | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file | > | ozone.om.http.kerberos.principal | Ozone Manager spnego principal | > | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file | > | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file | > | hdds.datanode.http.kerberos.principal | Datanode spnego principal | > | dfs.datanode.kerberos.principal | Datanode service principal | > | dfs.datanode.keytab.file | Datanode service keytab file | > The prefixes are very different for each of the datanode configuration keys. It > would be nice to have some consistency for the datanode.
[jira] [Updated] (HDDS-2301) Write path: Reduce read contention in rocksDB
[ https://issues.apache.org/jira/browse/HDDS-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated HDDS-2301: --- Labels: performance (was: ) > Write path: Reduce read contention in rocksDB > - > > Key: HDDS-2301 > URL: https://issues.apache.org/jira/browse/HDDS-2301 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Affects Versions: 0.5.0 >Reporter: Rajesh Balamohan >Assignee: Nanda kumar >Priority: Major > Labels: performance > Attachments: om_write_profile.png > > > Benchmark: > > A simple benchmark that creates 100s and 1000s of keys (empty directories) in > OM. This is done in a tight loop from multiple client-side threads to put > enough load on the CPU. Note that the intention is to understand the bottlenecks in > OM (intentionally avoiding interactions with SCM & DN). > Observation: > - > On the write path, Ozone calls {{OMFileRequest.verifyFilesInPath}}. This > internally calls {{omMetadataManager.getKeyTable().get(dbKeyName)}} for every > write operation. This turns out to be expensive and chokes the write path. > [https://github.com/apache/hadoop/blob/trunk/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMDirectoryCreateRequest.java#L155] > [https://github.com/apache/hadoop/blob/trunk/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileRequest.java#L63] > In most cases, directory creation would be a fresh entry. In such cases, > it would be good to try {{RocksDB::keyMayExist}}.
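The suggestion in HDDS-2301 is to run a cheap "may exist" probe before the full lookup, since most directory creations target keys that are not present yet. A minimal sketch of that control flow follows. `FakeTable` is a hypothetical in-memory stand-in with a simple bit-set filter; it is not the RocksDB Java API, whose `keyMayExist` signature should be checked against the RocksDB version in use.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Sketch only: skip the expensive lookup when a cheap filter rules the key out. */
final class MayExistSketch {

  /** Hypothetical in-memory table; NOT the RocksDB API. */
  static final class FakeTable {
    private static final int BITS = 1 << 16;
    private final Map<String, String> rows = new HashMap<>();
    private final BitSet filter = new BitSet(BITS);
    int expensiveGets = 0;

    private static int slot(String key) {
      return Math.floorMod(key.hashCode(), BITS);
    }

    void put(String key, String value) {
      rows.put(key, value);
      filter.set(slot(key));
    }

    /** Cheap probabilistic check: false means definitely absent, true means "maybe". */
    boolean keyMayExist(String key) {
      return filter.get(slot(key));
    }

    /** Stand-in for the full lookup that the issue describes as expensive. */
    String get(String key) {
      expensiveGets++;
      return rows.get(key);
    }
  }

  /** Returns true if the key is present, calling get() only when the filter says "maybe". */
  static boolean exists(FakeTable table, String key) {
    if (!table.keyMayExist(key)) {
      return false; // definite miss: no full lookup needed
    }
    return table.get(key) != null; // the filter can give false positives, so confirm
  }

  public static void main(String[] args) {
    FakeTable table = new FakeTable();
    table.put("/vol/bucket/dir1", "dirInfo");
    System.out.println(exists(table, "/vol/bucket/dir1"));
    System.out.println(exists(table, "/vol/bucket/dir2"));
    System.out.println("full lookups: " + table.expensiveGets);
  }
}
```

The confirmation step matters: a positive answer from the filter only means the key might exist, so the full `get()` is still required in that branch.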
[jira] [Updated] (HDDS-2318) Avoid proto::tostring in preconditions to save CPU cycles
[ https://issues.apache.org/jira/browse/HDDS-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated HDDS-2318: --- Labels: performance (was: ) > Avoid proto::tostring in preconditions to save CPU cycles > - > > Key: HDDS-2318 > URL: https://issues.apache.org/jira/browse/HDDS-2318 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Priority: Major > Labels: performance > Attachments: Screenshot 2019-10-17 at 6.10.22 PM.png > > > [https://github.com/apache/hadoop-ozone/blob/61f4aa30f502b34fd778d9b37b1168721abafb2f/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerProtocolServerSideTranslatorPB.java#L117] > > This ends up calling toString on the proto in precondition checks, which burns CPU > cycles. {{request.toString()}} can be added to a debug log on an as-needed basis.
[jira] [Updated] (HDDS-2309) Optimise OzoneManagerDoubleBuffer::flushTransactions to flush in batches
[ https://issues.apache.org/jira/browse/HDDS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated HDDS-2309: --- Labels: performance (was: ) > Optimise OzoneManagerDoubleBuffer::flushTransactions to flush in batches > > > Key: HDDS-2309 > URL: https://issues.apache.org/jira/browse/HDDS-2309 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Reporter: Rajesh Balamohan >Assignee: Bharat Viswanadham >Priority: Major > Labels: performance > Attachments: Screenshot 2019-10-15 at 4.19.13 PM.png > > > When running a write-heavy benchmark, > {{org/apache/hadoop/ozone/om/ratis/OzoneManagerDoubleBuffer.flushTransactions}} > was invoked for pretty much every write. > This forces {{cleanupCache}} to be invoked, which ends up choking in the single-thread > executor. Attaching the profiler information, which gives more details. > Ideally, {{flushTransactions}} should batch up the work to reduce load on > RocksDB. > > [https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerDoubleBuffer.java#L130] > > [https://github.com/apache/hadoop-ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerDoubleBuffer.java#L322]
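The batching idea from HDDS-2309 can be sketched as follows: instead of one flush per write, the flusher drains whatever has accumulated (up to a cap) and commits it in a single step. This is a minimal, single-threaded illustration using a JDK `BlockingQueue`; the class, the `maxBatch` cap and the `commit` counter are assumptions made for the example and do not reflect how `OzoneManagerDoubleBuffer` is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch only: flush pending transactions in batches rather than one by one. */
final class BatchedFlushSketch {
  private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
  int commits = 0;        // number of (hypothetical) store commits performed
  int flushedEntries = 0; // number of transactions written in total

  /** Called on the write path: only enqueues, never touches the store. */
  void add(String transaction) {
    pending.add(transaction);
  }

  /** Drains up to maxBatch queued transactions and commits them in one step. */
  int flushOnce(int maxBatch) {
    List<String> batch = new ArrayList<>();
    pending.drainTo(batch, maxBatch);
    if (batch.isEmpty()) {
      return 0; // nothing to do, so no commit and no cleanup work
    }
    commit(batch);
    return batch.size();
  }

  /** Stand-in for a single batched write to the backing store. */
  private void commit(List<String> batch) {
    commits++;
    flushedEntries += batch.size();
  }

  public static void main(String[] args) {
    BatchedFlushSketch buffer = new BatchedFlushSketch();
    for (int i = 0; i < 250; i++) {
      buffer.add("tx-" + i);
    }
    while (buffer.flushOnce(100) > 0) {
      // keep flushing until the queue is empty
    }
    System.out.println(buffer.flushedEntries + " entries in " + buffer.commits + " commits");
  }
}
```

With a cap of 100, the 250 queued writes are committed in three steps instead of 250, so any per-flush follow-up work (such as the cache cleanup mentioned in the issue) would run far less often.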