[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964627#comment-16964627
 ] 

Surendra Singh Lilhore commented on HDFS-14768:
---

[~gjhkael], we appreciate your hard work and patience. You are reporting good 
quality issues.

> EC : Busy DN replica should be consider in live replica check.
> --
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, 
> HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> The EC policy is RS-6-3-1024k and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission
> indices [3,4] and increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it exceeds
> replicationStreamsHardLimit (we set 14). Then, after
> BlockManager#chooseSourceDatanodes, the liveBlockIndices are
> [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommissioning: 2.
> In BlockManager#scheduleReconstruction, additionalReplRequired is
> 9 - 7 = 2. After the NameNode chooses two target datanodes, it assigns an
> erasure-coding reconstruction task to them.
> When a datanode receives the task, it builds targetIndices from
> liveBlockIndices and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
>
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     // An index is a reconstruction target if it is not live but has data.
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial
> value.
> The StripedReader then always creates readers from the first 6 live block
> indices, i.e. [0,1,2,3,4,5].
> Using source indices [0,1,2,3,4,5] to rebuild target indices [6,0] triggers
> the ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> private int replicationStreamsHardLimit =
>     DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
>
> numDNs = dataBlocks + parityBlocks + 10;
>
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>       .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>       StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>       .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>       (LocatedStripedBlock) locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   // make the datanode holding block index 6 busy with replication work
>   DatanodeDescriptor datanodeDescriptor =
>       cluster.getNameNode().getNamesystem().getBlockManager()
>           .getDatanodeManager()
>           .getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
>     BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor,
>         new Block(i), new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the nodes which will be decommissioned
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
> {code}
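
To make the failure concrete, here is a minimal, self-contained simulation of
the initTargetIndices loop above (plain Java, independent of the Hadoop
classes; the index values are the ones from the scenario described in this
issue):

{code:java}
import java.util.Arrays;
import java.util.BitSet;

public class TargetIndicesDemo {
  public static void main(String[] args) {
    int dataBlkNum = 6;
    int parityBlkNum = 3;
    // Live indices reported by the NameNode: 6 is missing (its DN is busy),
    // while the decommissioning indices 3 and 4 are still counted as live.
    BitSet live = new BitSet();
    for (int i : new int[] {0, 1, 2, 3, 4, 5, 7, 8}) {
      live.set(i);
    }
    // Two reconstruction targets were scheduled (additionalReplRequired = 2).
    short[] targetIndices = new short[2];
    int m = 0;
    for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
      if (!live.get(i) && m < targetIndices.length) {
        targetIndices[m++] = (short) i;
      }
    }
    // Prints [6, 0]: only one index is really missing, so slot 1 keeps its
    // default value 0 -- the bogus second target that leads to corruption.
    System.out.println(Arrays.toString(targetIndices));
  }
}
{code}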

[jira] [Comment Edited] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964625#comment-16964625
 ] 

Surendra Singh Lilhore edited comment on HDFS-14768 at 11/1/19 5:51 AM:


[~gjhkael], why do you think it is not a problem?

This is a major issue and we have to fix it. I will commit this tomorrow if
there are no comments from others.


was (Author: surendrasingh):
[~gjhkael], why do you think it is not a problem?

This is a major issue and we have to fix it. I will commit this tomorrow if
there are no other comments from others.


[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964625#comment-16964625
 ] 

Surendra Singh Lilhore commented on HDFS-14768:
---

[~gjhkael], why do you think it is not a problem?

This is a major issue and we have to fix it. I will commit this tomorrow if
there are no other comments from others.


[jira] [Reopened] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore reopened HDFS-14768:
---


[jira] [Updated] (HDFS-14950) missing libhdfspp libs in dist-package

2019-10-31 Thread Yuan Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan Zhou updated HDFS-14950:
-
Attachment: fix_libhdfspp_lib.patch
Status: Patch Available  (was: Open)

> missing libhdfspp libs in dist-package
> --
>
> Key: HDFS-14950
> URL: https://issues.apache.org/jira/browse/HDFS-14950
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Reporter: Yuan Zhou
>Assignee: Yuan Zhou
>Priority: Major
> Attachments: fix_libhdfspp_lib.patch
>
>
> A Hadoop build such as "mvn package -Pnative" copies the HDFS native libs to
> target/lib/native. Currently it only copies the C client libraries
> (libhdfs.\{a,so}); the C++-based HDFS client libraries (libhdfspp.\{a,so})
> are missing there.






[jira] [Updated] (HDFS-14950) missing libhdfspp libs in dist-package

2019-10-31 Thread Yuan Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan Zhou updated HDFS-14950:
-
Attachment: (was: fix_libhdfspp_lib.patch)







[jira] [Commented] (HDFS-14938) Add check if excludedNodes contain scope in DFSNetworkTopology#chooseRandomWithStorageType()

2019-10-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964617#comment-16964617
 ] 

Hadoop QA commented on HDFS-14938:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 34m 
39s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 43s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
7s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 43s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 3 new + 1 unchanged - 0 fixed = 4 total (was 1) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 46s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 17s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}198m 45s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestMaintenanceState |
|   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14938 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12984565/HDFS-14938.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 22f7ab011716 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f9b99d2 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28217/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28217/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28217/testReport/ |
| Max. process+thread count 

[jira] [Commented] (HDFS-14927) RBF: Add metrics for async callers thread pool

2019-10-31 Thread Leon Gao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964598#comment-16964598
 ] 

Leon Gao commented on HDFS-14927:
-

Thanks [~inigoiri] and [~ayushtkn] for the review ^

> RBF: Add metrics for async callers thread pool
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
> Attachments: HDFS-14927.001.patch, HDFS-14927.002.patch, 
> HDFS-14927.003.patch, HDFS-14927.004.patch, HDFS-14927.005.patch, 
> HDFS-14927.006.patch, HDFS-14927.007.patch, HDFS-14927.008.patch, 
> HDFS-14927.009.patch
>
>
> It would be good to add some monitoring on the async caller thread pool that
> handles fan-out RPC client requests, so we know the utilization and when to
> bump up dfs.federation.router.client.thread-size.
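
As a generic illustration (plain JDK, not the RBF patch itself), these are the
kinds of numbers worth exporting from an async caller pool to judge
utilization and decide when to raise
dfs.federation.router.client.thread-size:

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolMetricsDemo {
  public static void main(String[] args) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    for (int i = 0; i < 10; i++) {
      pool.submit(() -> {
        try {
          Thread.sleep(100);
        } catch (InterruptedException ignored) {
        }
      });
    }
    // A saturated pool shows active == poolSize with a growing queue:
    // the signal that the thread-size setting should be bumped up.
    System.out.println("active threads : " + pool.getActiveCount());
    System.out.println("pool size      : " + pool.getPoolSize());
    System.out.println("queued tasks   : " + pool.getQueue().size());
    System.out.println("completed      : " + pool.getCompletedTaskCount());
    pool.shutdown();
  }
}
{code}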






[jira] [Resolved] (HDDS-2311) Fix logic of RetryPolicy in OzoneClientSideTranslatorPB

2019-10-31 Thread Dinesh Chitlangia (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Chitlangia resolved HDDS-2311.
-
Fix Version/s: 0.5.0
   Resolution: Fixed

[~bharat] Thank you for flagging this issue and reviewing.
[~hanishakoneru] Thank you for the contribution.

The integration test failure was unrelated to the patch, and this has been
committed to master.

> Fix logic of RetryPolicy in OzoneClientSideTranslatorPB
> ---
>
> Key: HDDS-2311
> URL: https://issues.apache.org/jira/browse/HDDS-2311
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Assignee: Hanisha Koneru
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In OzoneManagerProtocolClientSideTranslatorPB.java, at line 251:
> {code:java}
> if (cause instanceof NotLeaderException) {
>   NotLeaderException notLeaderException = (NotLeaderException) cause;
>   omFailoverProxyProvider.performFailoverIfRequired(
>       notLeaderException.getSuggestedLeaderNodeId());
>   return getRetryAction(RetryAction.RETRY, retries, failovers);
> }
> {code}
> The suggested leader returned from the server is not used during failover,
> because the cause is a RemoteException. So with the current code the client
> never uses the suggested leader for failover, and by default it tries the
> maximum number of retries against each OM.
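
To illustrate the failure mode: a server-side exception crosses the RPC
boundary wrapped in a RemoteException, so a direct instanceof check never
matches until the exception is unwrapped. A minimal sketch (the real
NotLeaderException is stood in for by a local class here; only hadoop-common
is required):

{code:java}
import java.io.IOException;
import org.apache.hadoop.ipc.RemoteException;

public class UnwrapDemo {
  // Stand-in for the real NotLeaderException, for illustration only.
  public static class NotLeaderException extends IOException {
    public NotLeaderException(String msg) {
      super(msg);
    }
  }

  public static void main(String[] args) {
    IOException cause = new RemoteException(
        NotLeaderException.class.getName(), "suggested leader: om2");
    // This is the check the description refers to: it is always false,
    // because the cause is still the RemoteException wrapper.
    System.out.println(cause instanceof NotLeaderException); // false
    // Unwrapping first recovers the real type, so the suggested leader
    // carried by the exception can actually be used for failover.
    IOException unwrapped = ((RemoteException) cause)
        .unwrapRemoteException(NotLeaderException.class);
    System.out.println(unwrapped instanceof NotLeaderException); // true
  }
}
{code}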






[jira] [Work logged] (HDDS-2311) Fix logic of RetryPolicy in OzoneClientSideTranslatorPB

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2311?focusedWorklogId=337179=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337179
 ]

ASF GitHub Bot logged work on HDDS-2311:


Author: ASF GitHub Bot
Created on: 01/Nov/19 04:39
Start Date: 01/Nov/19 04:39
Worklog Time Spent: 10m 
  Work Description: dineshchitlangia commented on pull request #51: 
HDDS-2311. Fix logic of RetryPolicy in OzoneClientSideTranslatorPB.
URL: https://github.com/apache/hadoop-ozone/pull/51
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 337179)
Time Spent: 20m  (was: 10m)

> Fix logic of RetryPolicy in OzoneClientSideTranslatorPB
> ---
>
> Key: HDDS-2311
> URL: https://issues.apache.org/jira/browse/HDDS-2311
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Assignee: Hanisha Koneru
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> OzoneManagerProtocolClientSideTranslatorPB.java
> L251: if (cause instanceof NotLeaderException) {
>  NotLeaderException notLeaderException = (NotLeaderException) cause;
>  omFailoverProxyProvider.performFailoverIfRequired(
>  notLeaderException.getSuggestedLeaderNodeId());
>  return getRetryAction(RetryAction.RETRY, retries, failovers);
>  }
>  
> The suggested leader returned from Server is not used during failOver, as the 
> cause is a type of RemoteException. So with current code, it does not use 
> suggested leader for failOver at all and by default with each OM, it tries 
> max retries.
>  






[jira] [Assigned] (HDDS-2397) Fix calling cleanup for few missing tables in OM

2019-10-31 Thread Bharat Viswanadham (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham reassigned HDDS-2397:


Assignee: Bharat Viswanadham

> Fix calling cleanup for few missing tables in OM
> 
>
> Key: HDDS-2397
> URL: https://issues.apache.org/jira/browse/HDDS-2397
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After the DoubleBuffer flushes, we call cleanup on the table caches.
> The cache cleanup is missed for a few tables:
>  # PrefixTable
>  # S3SecretTable
>  # DelegationTable
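
As a schematic illustration of the defect class (hypothetical types, not the
actual Ozone API): a hand-maintained list of caches to clean is easy to leave
incomplete, which is how the three tables above were missed; iterating over
every registered table avoids that.

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only -- the names below are not the Ozone API.
interface TableCache {
  void cleanup(List<Long> flushedEpochs);
}

class DoubleBufferCleanup {
  private final Map<String, TableCache> tables = new LinkedHashMap<>();

  void register(String name, TableCache cache) {
    tables.put(name, cache);
  }

  // Called after a DoubleBuffer flush: clean every registered table cache
  // instead of enumerating table names by hand, so none can be forgotten.
  void afterFlush(List<Long> flushedEpochs) {
    for (TableCache cache : tables.values()) {
      cache.cleanup(flushedEpochs);
    }
  }

  public static void main(String[] args) {
    DoubleBufferCleanup cleanup = new DoubleBufferCleanup();
    cleanup.register("prefixTable", epochs ->
        System.out.println("prefixTable cleaned for epochs " + epochs));
    cleanup.register("s3SecretTable", epochs ->
        System.out.println("s3SecretTable cleaned for epochs " + epochs));
    cleanup.afterFlush(List.of(1L, 2L));
  }
}
{code}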






[jira] [Work logged] (HDDS-2397) Fix calling cleanup for few missing tables in OM

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2397?focusedWorklogId=337178=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337178
 ]

ASF GitHub Bot logged work on HDDS-2397:


Author: ASF GitHub Bot
Created on: 01/Nov/19 04:32
Start Date: 01/Nov/19 04:32
Worklog Time Spent: 10m 
  Work Description: bharatviswa504 commented on pull request #112: 
HDDS-2397. Fix calling cleanup for few missing tables in OM.
URL: https://github.com/apache/hadoop-ozone/pull/112
 
 
   ## What changes were proposed in this pull request?
   
   Fix calling cleanup for a few tables which is missing in the
OzoneManagerDoubleBuffer cleanup cache.
   
   The cache cleanup is missed for a few tables:
   
   PrefixTable
   S3SecretTable
   DelegationTable
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-2397
   
   ## How was this patch tested?
   
   Ran a few integration tests.
   
 



Issue Time Tracking
---

Worklog Id: (was: 337178)
Remaining Estimate: 0h
Time Spent: 10m







[jira] [Updated] (HDDS-2397) Fix calling cleanup for few missing tables in OM

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2397:
-
Labels: pull-request-available  (was: )







[jira] [Created] (HDDS-2397) Fix calling cleanup for few missing tables in OM

2019-10-31 Thread Bharat Viswanadham (Jira)
Bharat Viswanadham created HDDS-2397:


 Summary: Fix calling cleanup for few missing tables in OM
 Key: HDDS-2397
 URL: https://issues.apache.org/jira/browse/HDDS-2397
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Bharat Viswanadham


After the DoubleBuffer flushes, we call cleanup on the table caches.

The cache cleanup is missed for a few tables:
 # PrefixTable
 # S3SecretTable
 # DelegationTable






[jira] [Commented] (HDFS-14937) [SBN read] ObserverReadProxyProvider should throw InterruptException

2019-10-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964591#comment-16964591
 ] 

Hadoop QA commented on HDFS-14937:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
51s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  8s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  3s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m  
2s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 67m 16s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14937 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12984571/HDFS-14937-trunk-002.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ae75a26963b0 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f9b99d2 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28218/testReport/ |
| Max. process+thread count | 309 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-client U: 
hadoop-hdfs-project/hadoop-hdfs-client |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28218/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> [SBN read] 

[jira] [Issue Comment Deleted] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Bharat Viswanadham (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham updated HDDS-2356:
-
Comment: was deleted

(was: Has the above error caused a crash in OM? If so, can you share the stack trace?)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 datanodes on 3 VMs, and 1 OM & 1 SCM on a separate
> VM, say VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone
> to a path on VM0, reading data from VM0's local disk and writing to the
> mount path. The dataset has files of various sizes, from 0 bytes to
> GB-level, and contains about 50,000 files.
> The writing is slow (1 GB in ~10 minutes) and it stops after around 4 GB.
> Looking at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing
> errors related to multipart upload. This error eventually causes the writing
> to terminate and OM to shut down.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
>  at 
> org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
>  at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
>  
> The following errors have been resolved in
> https://issues.apache.org/jira/browse/HDDS-2322.
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with
> exit status 2: OMDoubleBuffer flush thread OMDoubleBufferFlushThread
> encountered Throwable error
>  java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964586#comment-16964586
 ] 

Bharat Viswanadham commented on HDDS-2356:
--

Has the above error caused a crash in OM? If so, can you share the stack trace?


[jira] [Commented] (HDDS-2395) Handle Ozone S3 completeMPU to match with aws s3 behavior.

2019-10-31 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964585#comment-16964585
 ] 

Bharat Viswanadham commented on HDDS-2395:
--

Hi [~timmylicheng]

Exclude List is fixed as part of HDDS-2381. Thanks.

> Handle Ozone S3 completeMPU to match with aws s3 behavior.
> --
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> # When 2 parts are uploaded and the complete request lists only 1 of them,
> there is no error
>  # During complete multipart upload, if a part name/part number does not
> match an uploaded part and part number, return an InvalidPart error
>  # When parts are not specified in sorted order, return InvalidPartOrder
>  # During complete multipart upload, when there are no uploaded parts and we
> specify some parts, also return InvalidPart
>  # With parts 1, 2, 3 uploaded, completing with parts 1 and 3 is allowed (no
> error)
>  # When only part 3 was uploaded, completing with part 3 is allowed
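
A schematic validation sketch derived from the rules above (hypothetical
names and exceptions, not the Ozone implementation):

{code:java}
import java.util.List;
import java.util.Map;

class CompleteMpuValidator {
  // uploadedParts: partNumber -> eTag recorded at upload time.
  // requested: (partNumber, eTag) pairs from the complete-MPU request,
  // iterated in request order.
  static void validate(Map<Integer, String> uploadedParts,
                       List<Map.Entry<Integer, String>> requested) {
    int prev = 0;
    for (Map.Entry<Integer, String> part : requested) {
      int partNumber = part.getKey();
      // Parts must be listed in ascending order -> InvalidPartOrder.
      if (partNumber <= prev) {
        throw new IllegalArgumentException("InvalidPartOrder");
      }
      prev = partNumber;
      // The part must exist and its eTag must match -> InvalidPart.
      String uploadedEtag = uploadedParts.get(partNumber);
      if (uploadedEtag == null || !uploadedEtag.equals(part.getValue())) {
        throw new IllegalArgumentException("InvalidPart");
      }
    }
    // A strict subset of the uploaded parts (e.g. 1 and 3 of 1,2,3) passes,
    // matching rules 5 and 6 above.
  }

  public static void main(String[] args) {
    Map<Integer, String> uploaded = Map.of(1, "e1", 2, "e2", 3, "e3");
    validate(uploaded, List.of(Map.entry(1, "e1"), Map.entry(3, "e3")));
    System.out.println("subset {1,3} accepted");
  }
}
{code}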






[jira] [Updated] (HDFS-12943) Consistent Reads from Standby Node

2019-10-31 Thread Jonathan Hung (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hung updated HDFS-12943:
-
Release Note: 
Observer is a new type of a NameNode in addition to Active and Standby Nodes in 
HA settings. An Observer Node maintains a replica of the namespace, same as a 
Standby Node. It additionally allows execution of clients' read requests.

To ensure read-after-write consistency within a single client, a state ID is 
introduced in RPC headers. The Observer responds to the client request only 
after its own state has caught up with the client’s state ID, which it 
previously received from the Active NameNode.

Clients can explicitly invoke a new client protocol call msync(), which ensures 
that subsequent reads by this client from an Observer are consistent.

A new client-side ObserverReadProxyProvider is introduced to provide automatic 
switching between Active and Observer NameNodes for submitting respectively 
write and read requests.

  was:
Observer is a new type of a NameNode in addition to Active and Standby Nodes in 
HA settings. An Observer Node maintains a replica of the namespace same as a 
Standby Node. It additionally allows execution of clients read requests.
To ensure read-after-write consistency within a single client, a state ID is 
introduced in RPC headers. The Observer responds to the client request only 
after its own state has caught up with the client’s state ID, which it 
previously received from the Active NameNode.
Clients can explicitly invoke a new client protocol call msync(), which ensures 
that subsequent reads by this client from an Observer are consistent.
A new client-side ObserverReadProxyProvider is introduced to provide automatic 
switching between Active and Observer NameNodes for submitting respectively 
write and read requests.


> Consistent Reads from Standby Node
> --
>
> Key: HDFS-12943
> URL: https://issues.apache.org/jira/browse/HDFS-12943
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: ConsistentReadsFromStandbyNode.pdf, 
> ConsistentReadsFromStandbyNode.pdf, HDFS-12943-001.patch, 
> HDFS-12943-002.patch, HDFS-12943-003.patch, HDFS-12943-004.patch, 
> TestPlan-ConsistentReadsFromStandbyNode.pdf
>
>
> StandbyNode in HDFS is a replica of the active NameNode. The states of the 
> NameNodes are coordinated via the journal. It is natural to consider 
> StandbyNode as a read-only replica. As with any replicated distributed system 
> the problem of stale reads should be resolved. Our main goal is to provide 
> reads from standby in a consistent way in order to enable a wide range of 
> existing applications running on top of HDFS.
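
A minimal client-side usage sketch for observer reads (the nameservice name
"mycluster" is a placeholder for an HA nameservice already configured on the
client; the proxy provider and msync() are the ones named in the release note
above):

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ObserverReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Route reads to Observers and writes to the Active NameNode.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider");
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf)) {
      // msync() advances the client's state ID to the Active's, so the
      // next read served by an Observer is read-after-write consistent.
      fs.msync();
      System.out.println(fs.getFileStatus(new Path("/")));
    }
  }
}
{code}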






[jira] [Resolved] (HDDS-2363) Failed to create Ratis container

2019-10-31 Thread Sammi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen resolved HDDS-2363.
--
Resolution: Fixed

> Failed to create Ratis container
> 
>
> Key: HDDS-2363
> URL: https://issues.apache.org/jira/browse/HDDS-2363
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Sammi Chen
>Assignee: Sammi Chen
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Error logs:
> 2019-10-29 10:24:59,553 [pool-7-thread-1] ERROR  - 
> org.rocksdb.RocksDBException Failed init RocksDB, db path : 
> /data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db,
>  exception 
> :/data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db:
>  does not exist (create_if_missing is false)
> CACHED_OPTS is a RocksDB options cache in MetadataStoreBuilder. The cache 
> keeps the old RocksDB options, which are not refreshed with new option values 
> on a new call. 
> The following logs didn't reveal the true cause of the write failure. Will 
> improve these logs too. 
> 2019-10-24 17:43:53,460 [pool-7-thread-1] INFO   - Operation: 
> CreateContainer : Trace ID:  : Message: Container creation failed. : Result: 
> CONTAINER_INTERNAL_ERROR
> 2019-10-24 17:43:53,478 [pool-7-thread-1] INFO   - Operation: WriteChunk 
> : Trace ID:  : Message: ContainerID 402 creation failed : Result: 
> CONTAINER_INTERNAL_ERROR
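
For readers unfamiliar with the failure mode, here is a minimal, self-contained 
sketch of the stale-options-cache pitfall described above. All names are 
hypothetical and do not mirror Ozone's actual MetadataStoreBuilder code:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class OptionsCache {
  static final class DbOptions {
    final boolean createIfMissing;
    DbOptions(boolean createIfMissing) { this.createIfMissing = createIfMissing; }
  }

  private static final Map<String, DbOptions> CACHED_OPTS = new ConcurrentHashMap<>();

  static DbOptions get(String profile, boolean createIfMissing) {
    // BUG: whatever was cached for the first call wins; a later call with
    // createIfMissing=true still gets the old createIfMissing=false options,
    // so opening a brand-new DB fails as in the log above.
    return CACHED_OPTS.computeIfAbsent(profile, p -> new DbOptions(createIfMissing));
  }

  public static void main(String[] args) {
    System.out.println(get("default", false).createIfMissing); // false
    System.out.println(get("default", true).createIfMissing);  // still false
  }
}
{code}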



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2363) Failed to create Ratis container

2019-10-31 Thread Sammi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen updated HDDS-2363:
-
Fix Version/s: 0.5.0

> Failed to create Ratis container
> 
>
> Key: HDDS-2363
> URL: https://issues.apache.org/jira/browse/HDDS-2363
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Sammi Chen
>Assignee: Sammi Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Error logs:
> 2019-10-29 10:24:59,553 [pool-7-thread-1] ERROR  - 
> org.rocksdb.RocksDBException Failed init RocksDB, db path : 
> /data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db,
>  exception 
> :/data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db:
>  does not exist (create_if_missing is false)
> CACHED_OPTS is a RocksDB options cache in MetadataStoreBuilder. The cache 
> keeps the old RocksDB options, which are not refreshed with new option values 
> on a new call. 
> The following logs didn't reveal the true cause of the write failure. Will 
> improve these logs too. 
> 2019-10-24 17:43:53,460 [pool-7-thread-1] INFO   - Operation: 
> CreateContainer : Trace ID:  : Message: Container creation failed. : Result: 
> CONTAINER_INTERNAL_ERROR
> 2019-10-24 17:43:53,478 [pool-7-thread-1] INFO   - Operation: WriteChunk 
> : Trace ID:  : Message: ContainerID 402 creation failed : Result: 
> CONTAINER_INTERNAL_ERROR



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12943) Consistent Reads from Standby Node

2019-10-31 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964579#comment-16964579
 ] 

Hudson commented on HDFS-12943:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17592 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17592/])
Add 2.10.0 release notes for HDFS-12943 (jhung: rev 
ef9d12df24c0db76fd37a95551db7920d27d740c)
* (edit) 
hadoop-common-project/hadoop-common/src/site/markdown/release/2.10.0/RELEASENOTES.2.10.0.md


> Consistent Reads from Standby Node
> --
>
> Key: HDFS-12943
> URL: https://issues.apache.org/jira/browse/HDFS-12943
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: ConsistentReadsFromStandbyNode.pdf, 
> ConsistentReadsFromStandbyNode.pdf, HDFS-12943-001.patch, 
> HDFS-12943-002.patch, HDFS-12943-003.patch, HDFS-12943-004.patch, 
> TestPlan-ConsistentReadsFromStandbyNode.pdf
>
>
> StandbyNode in HDFS is a replica of the active NameNode. The states of the 
> NameNodes are coordinated via the journal. It is natural to consider 
> StandbyNode as a read-only replica. As with any replicated distributed system 
> the problem of stale reads should be resolved. Our main goal is to provide 
> reads from standby in a consistent way in order to enable a wide range of 
> existing applications running on top of HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14950) missing libhdfspp libs in dist-package

2019-10-31 Thread Yuan Zhou (Jira)
Yuan Zhou created HDFS-14950:


 Summary: missing libhdfspp libs in dist-package
 Key: HDFS-14950
 URL: https://issues.apache.org/jira/browse/HDFS-14950
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: build
Reporter: Yuan Zhou
Assignee: Yuan Zhou
 Attachments: fix_libhdfspp_lib.patch

A Hadoop build like "mvn package -Pnative" copies the HDFS native libs to 
target/lib/native. For now it only copies the C client 
libraries (libhdfs.\{a,so}); the C++ based HDFS client libraries 
(libhdfspp.\{a,so}) are missing there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964554#comment-16964554
 ] 

Li Cheng edited comment on HDDS-2356 at 11/1/19 3:45 AM:
-

Also saw a core dump in rocksdb during last night's testing. Please check the 
attachment for the entire log.

 

From first glance, it looks like an STL memory error occurs during memory 
movement while rocksdb is iterating the write_batch to insert into the 
memtable. It might not be related to Ozone, but it would cause the rocksdb 
failure. 

 

Created https://issues.apache.org/jira/browse/HDDS-2396 to track the core dump 
in OM rocksdb.

Below is some part of the stack:

C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
 C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb::ValueType)+0x51c
 C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
 C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
 C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
 C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
 C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
 C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
 J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
 J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
 j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4


was (Author: timmylicheng):
Also saw a core dump in rocksdb during last night's testing. Please check the 
attachment for the entire log.

 

From first glance, it looks like an STL memory error occurs during memory 
movement while rocksdb is iterating the write_batch to insert into the 
memtable. It might not be related to Ozone, but it would cause the rocksdb 
failure. 

Below is some part of the stack:

C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb::ValueType)+0x51c
C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 

[jira] [Commented] (HDDS-2396) OM rocksdb core dump during writing

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964578#comment-16964578
 ] 

Li Cheng commented on HDDS-2396:


Attached the entire log for the core dump. Will try to turn on ulimit and 
reproduce this, but it happens only occasionally. 

> OM rocksdb core dump during writing
> ---
>
> Key: HDDS-2396
> URL: https://issues.apache.org/jira/browse/HDDS-2396
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Li Cheng
>Priority: Major
> Attachments: hs_err_pid9340.log
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone 
> to a path on VM0, while reading data from VM0's local disk and writing to the 
> mount path. The dataset has file sizes ranging from 0 bytes to GB-level, with 
> ~50,000 files in total. 
>  
> A core dump occasionally happens in rocksdb. 
>  
> Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
> space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
> C [librocksdbjni3192271038586903156.so+0x358fec] 
> rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&, rocksdb::ValueType)+0x51c
> C [librocksdbjni3192271038586903156.so+0x359d17] 
> rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&)+0x17
> C [librocksdbjni3192271038586903156.so+0x3513bc] 
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
> C [librocksdbjni3192271038586903156.so+0x354df9] 
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
> C [librocksdbjni3192271038586903156.so+0x29fd79] 
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
> C [librocksdbjni3192271038586903156.so+0x2a0431] 
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C [librocksdbjni3192271038586903156.so+0x1a064c] 
> Java_org_rocksdb_RocksDB_write0+0xcc
> J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
> [0x7f58f1872d00+0xbe]
> J 10093% C1 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V
>  (400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
> j 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
> j java.lang.Thread.run()V+11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2396) OM rocksdb core dump during writing

2019-10-31 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2396:
---
Attachment: hs_err_pid9340.log

> OM rocksdb core dump during writing
> ---
>
> Key: HDDS-2396
> URL: https://issues.apache.org/jira/browse/HDDS-2396
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Li Cheng
>Priority: Major
> Attachments: hs_err_pid9340.log
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone 
> to a path on VM0, while reading data from VM0's local disk and writing to the 
> mount path. The dataset has file sizes ranging from 0 bytes to GB-level, with 
> ~50,000 files in total. 
>  
> A core dump occasionally happens in rocksdb. 
>  
> Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
> space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
> C [librocksdbjni3192271038586903156.so+0x358fec] 
> rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&, rocksdb::ValueType)+0x51c
> C [librocksdbjni3192271038586903156.so+0x359d17] 
> rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&)+0x17
> C [librocksdbjni3192271038586903156.so+0x3513bc] 
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
> C [librocksdbjni3192271038586903156.so+0x354df9] 
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
> C [librocksdbjni3192271038586903156.so+0x29fd79] 
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
> C [librocksdbjni3192271038586903156.so+0x2a0431] 
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C [librocksdbjni3192271038586903156.so+0x1a064c] 
> Java_org_rocksdb_RocksDB_write0+0xcc
> J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
> [0x7f58f1872d00+0xbe]
> J 10093% C1 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V
>  (400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
> j 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
> j java.lang.Thread.run()V+11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2396) OM rocksdb core dump during writing

2019-10-31 Thread Li Cheng (Jira)
Li Cheng created HDDS-2396:
--

 Summary: OM rocksdb core dump during writing
 Key: HDDS-2396
 URL: https://issues.apache.org/jira/browse/HDDS-2396
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Manager
Affects Versions: 0.4.1
Reporter: Li Cheng
 Attachments: hs_err_pid9340.log

Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to 
a path on VM0, while reading data from VM0's local disk and writing to the 
mount path. The dataset has file sizes ranging from 0 bytes to GB-level, with 
~50,000 files in total. 

A core dump occasionally happens in rocksdb. 

 

Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb::ValueType)+0x51c
C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
j java.lang.Thread.run()V+11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2395) Handle Ozone S3 completeMPU to match with aws s3 behavior.

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964573#comment-16964573
 ] 

Li Cheng commented on HDDS-2395:


Also please note the exclude list issue.

 

2019-11-01 11:25:24,047 [qtp1383524016-27648] INFO - Allocating block with 
ExcludeList \{datanodes = [], containerIds = [], pipelineIds = 
[PipelineID=20d1830a-a77d-498e-a4a1-ba656ead3d97 (the same PipelineID 
repeated 75 times in total)]}

> Handle Ozone S3 completeMPU to match with aws s3 behavior.
> --
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> # When 2 parts are uploaded and complete specifies only 1 part, no error
>  # During complete multipart upload, a name/part number that does not match 
> an uploaded part and part number yields an InvalidPart error
>  # When parts are not specified in sorted order, an InvalidPartOrder error
>  # During complete multipart upload when no 

[jira] [Updated] (HDFS-14937) [SBN read] ObserverReadProxyProvider should throw InterruptException

2019-10-31 Thread xuzq (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuzq updated HDFS-14937:

Attachment: HDFS-14937-trunk-002.patch

> [SBN read] ObserverReadProxyProvider should throw InterruptException
> 
>
> Key: HDFS-14937
> URL: https://issues.apache.org/jira/browse/HDFS-14937
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-14937-trunk-001.patch, HDFS-14937-trunk-002.patch
>
>
> ObserverReadProxyProvider should throw InterruptException immediately if one 
> Observer catch InterruptException in invoking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14937) [SBN read] ObserverReadProxyProvider should throw InterruptException

2019-10-31 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964560#comment-16964560
 ] 

xuzq commented on HDFS-14937:
-

Thanks [~xkrogen] [~vagarychen] for the comment.

RetryInvocationHandler uses Thread.currentThread().isInterrupted() to check for 
interruption, so I kept it. 
{code:java}
final long failoverCount = retryInvocationHandler.getFailoverCount();
try {
  return invoke();
} catch (Exception e) {
  if (LOG.isTraceEnabled()) {
LOG.trace(toString(), e);
  }
  if (Thread.currentThread().isInterrupted()) {
// If interrupted, do not retry.
throw e;
  }

  retryInfo = retryInvocationHandler.handleException(
  method, callId, retryPolicy, counters, failoverCount, e);
  return processWaitTimeAndRetryInfo();
}
{code}
Should we use InterruptedException instead?
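
For context, a minimal self-contained illustration of why the flag check alone 
can miss interruption (illustrative only, not the patch): blocking calls such 
as Thread.sleep() clear the interrupt flag when they throw 
InterruptedException.

{code:java}
public class InterruptCheckDemo {
  public static void main(String[] args) throws Exception {
    Thread t = new Thread(() -> {
      try {
        Thread.sleep(1000); // throws InterruptedException and clears the flag
      } catch (Exception e) {
        // The flag check discussed above reports false here, while checking
        // the exception type still detects the interruption.
        System.out.println("isInterrupted = "
            + Thread.currentThread().isInterrupted());
        System.out.println("is InterruptedException = "
            + (e instanceof InterruptedException));
      }
    });
    t.start();
    Thread.sleep(100); // give the worker time to block (timing assumption)
    t.interrupt();
    t.join();
  }
}
{code}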

> [SBN read] ObserverReadProxyProvider should throw InterruptException
> 
>
> Key: HDFS-14937
> URL: https://issues.apache.org/jira/browse/HDFS-14937
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-14937-trunk-001.patch
>
>
> ObserverReadProxyProvider should throw InterruptException immediately if one 
> Observer catch InterruptException in invoking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964554#comment-16964554
 ] 

Li Cheng commented on HDDS-2356:


Also saw a core dump in rocksdb during last night's testing. Please check the 
attachment for the entire log.

 

From first glance, it looks like an STL memory error occurs during memory 
movement while rocksdb is iterating the write_batch to insert into the 
memtable. It might not be related to Ozone, but it would cause the rocksdb 
failure. 

Below is some part of the stack:

C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb::ValueType)+0x51c
C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone 
> to a path on VM0, while reading data from VM0's local disk and writing to the 
> mount path. The dataset has file sizes ranging from 0 bytes to GB-level, with 
> ~50,000 files in total. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and the OM to shut down. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_9278
> MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239 bucket: ozone-test key: 
> 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Attachment: hs_err_pid9340.log

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone 
> to a path on VM0, while reading data from VM0's local disk and writing to the 
> mount path. The dataset has file sizes ranging from 0 bytes to GB-level, with 
> ~50,000 files in total. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and the OM to shut down. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_9278
> MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239 bucket: ozone-test key: 
> 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
>  at 
> org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
>  at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
>  
> The following errors have been resolved in 
> https://issues.apache.org/jira/browse/HDDS-2322. 
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush thread OMDoubleBufferFlushThread 
> encountered Throwable error
>  java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> 

[jira] [Updated] (HDDS-2363) Failed to create Ratis container

2019-10-31 Thread Sammi Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen updated HDDS-2363:
-
Summary: Failed to create Ratis container  (was: Fail to create Ratis 
container)

> Failed to create Ratis container
> 
>
> Key: HDDS-2363
> URL: https://issues.apache.org/jira/browse/HDDS-2363
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Sammi Chen
>Assignee: Sammi Chen
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Error logs:
> 2019-10-29 10:24:59,553 [pool-7-thread-1] ERROR  - 
> org.rocksdb.RocksDBException Failed init RocksDB, db path : 
> /data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db,
>  exception 
> :/data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db:
>  does not exist (create_if_missing is false)
> CACHED_OPTS is a RocksDB options cache in MetadataStoreBuilder. The cache 
> keeps the old RocksDB options, which are not refreshed with new option values 
> on a new call. 
> The following logs didn't reveal the true cause of the write failure. Will 
> improve these logs too. 
> 2019-10-24 17:43:53,460 [pool-7-thread-1] INFO   - Operation: 
> CreateContainer : Trace ID:  : Message: Container creation failed. : Result: 
> CONTAINER_INTERNAL_ERROR
> 2019-10-24 17:43:53,478 [pool-7-thread-1] INFO   - Operation: WriteChunk 
> : Trace ID:  : Message: ContainerID 402 creation failed : Result: 
> CONTAINER_INTERNAL_ERROR



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2363) Fail to create Ratis container

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2363?focusedWorklogId=337145=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337145
 ]

ASF GitHub Bot logged work on HDDS-2363:


Author: ASF GitHub Bot
Created on: 01/Nov/19 02:33
Start Date: 01/Nov/19 02:33
Worklog Time Spent: 10m 
  Work Description: ChenSammi commented on pull request #98: HDDS-2363. 
Fail to create Ratis container.
URL: https://github.com/apache/hadoop-ozone/pull/98
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 337145)
Time Spent: 20m  (was: 10m)

> Fail to create Ratis container
> --
>
> Key: HDDS-2363
> URL: https://issues.apache.org/jira/browse/HDDS-2363
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Sammi Chen
>Assignee: Sammi Chen
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Error logs:
> 2019-10-29 10:24:59,553 [pool-7-thread-1] ERROR  - 
> org.rocksdb.RocksDBException Failed init RocksDB, db path : 
> /data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db,
>  exception 
> :/data2/hdds/efe9f8f3-86be-417c-93cd-24bbeceee86f/current/containerDir2/1126/metadata/1126-dn-container.db:
>  does not exist (create_if_missing is false)
> CACHED_OPTS is a RocksDB options cache in MetadataStoreBuilder. The cache 
> keeps the old RocksDB options, which are not refreshed with new option values 
> on a new call. 
> The following logs didn't reveal the true cause of the write failure. Will 
> improve these logs too. 
> 2019-10-24 17:43:53,460 [pool-7-thread-1] INFO   - Operation: 
> CreateContainer : Trace ID:  : Message: Container creation failed. : Result: 
> CONTAINER_INTERNAL_ERROR
> 2019-10-24 17:43:53,478 [pool-7-thread-1] INFO   - Operation: WriteChunk 
> : Trace ID:  : Message: ContainerID 402 creation failed : Result: 
> CONTAINER_INTERNAL_ERROR



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14942) Change Log Level to warn in JournalNodeSyncer#syncWithJournalAtIndex

2019-10-31 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964535#comment-16964535
 ] 

Lisheng Sun commented on HDFS-14942:


Hi [~weichiu] [~elgoiri] [~ayushtkn], would you mind reviewing this patch? 
Thank you.

> Change Log Level to warn in JournalNodeSyncer#syncWithJournalAtIndex
> 
>
> Key: HDFS-14942
> URL: https://issues.apache.org/jira/browse/HDFS-14942
> Project: Hadoop HDFS
>  Issue Type: Improvement
> Environment: when hadoop 2.x upgrades to hadoop 3.x, 
> InterQJournalProtocol is newly added, so it throws Unknown protocol. 
> The new InterQJournalProtocol is used to synchronize past log segments to 
> JNs that missed them, and an error occurring there does not affect normal 
> service. I think it should not be an ERROR log; logging it as a warn is more 
> reasonable.
> {code:java}
>  private void syncWithJournalAtIndex(int index) {
>   ...
> GetEditLogManifestResponseProto editLogManifest;
> try {
>   editLogManifest = jnProxy.getEditLogManifestFromJournal(jid,
>   nameServiceId, 0, false);
> } catch (IOException e) {
>   LOG.error("Could not sync with Journal at " +
>   otherJNProxies.get(journalNodeIndexForSync), e);
>   return;
> }
> {code}
> {code:java}
> 2019-10-30,15:11:17,388 ERROR 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer: Could not sync with 
> Journal at mos1-hadoop-prc-ct17.ksru/10.85.3.59:11100
> 2019-10-30,15:11:17,388 ERROR 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer: Could not sync with 
> Journal at mos1-hadoop-prc-ct17.ksru/10.85.3.59:11100
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
>  Unknown protocol: 
> org.apache.hadoop.hdfs.qjournal.protocol.InterQJournalProtocol at 
> org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1565) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1511) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1421) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>  at com.sun.proxy.$Proxy16.getEditLogManifestFromJournal(Unknown Source) at 
> org.apache.hadoop.hdfs.qjournal.protocolPB.InterQJournalProtocolTranslatorPB.getEditLogManifestFromJournal(InterQJournalProtocolTranslatorPB.java:75)
>  at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.syncWithJournalAtIndex(JournalNodeSyncer.java:250)
>  at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.syncJournals(JournalNodeSyncer.java:226)
>  at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.lambda$startSyncJournalsDaemon$0(JournalNodeSyncer.java:186)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
>Reporter: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-14942.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14938) Add check if excludedNodes contain scope in DFSNetworkTopology#chooseRandomWithStorageType()

2019-10-31 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964530#comment-16964530
 ] 

Lisheng Sun commented on HDFS-14938:


Thanks [~elgoiri] for your comments.

I added a javadoc and a unit test for this patch, and uploaded the v004 patch. 

Could you review it when you have time? Thank you.

> Add check if excludedNodes contain scope in 
> DFSNetworkTopology#chooseRandomWithStorageType() 
> -
>
> Key: HDFS-14938
> URL: https://issues.apache.org/jira/browse/HDFS-14938
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14938.001.patch, HDFS-14938.002.patch, 
> HDFS-14938.003.patch, HDFS-14938.004.patch
>
>
> Add check if excludedNodes contain scope in 
> DFSNetworkTopology#chooseRandomWithStorageType().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14938) Add check if excludedNodes contain scope in DFSNetworkTopology#chooseRandomWithStorageType()

2019-10-31 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-14938:
---
Attachment: HDFS-14938.004.patch

> Add check if excludedNodes contain scope in 
> DFSNetworkTopology#chooseRandomWithStorageType() 
> -
>
> Key: HDFS-14938
> URL: https://issues.apache.org/jira/browse/HDFS-14938
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14938.001.patch, HDFS-14938.002.patch, 
> HDFS-14938.003.patch, HDFS-14938.004.patch
>
>
> Add check if excludedNodes contain scope in 
> DFSNetworkTopology#chooseRandomWithStorageType().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14927) RBF: Add metrics for async callers thread pool

2019-10-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964502#comment-16964502
 ] 

Íñigo Goiri commented on HDFS-14927:


+1 on  [^HDFS-14927.009.patch].

> RBF: Add metrics for async callers thread pool
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
> Attachments: HDFS-14927.001.patch, HDFS-14927.002.patch, 
> HDFS-14927.003.patch, HDFS-14927.004.patch, HDFS-14927.005.patch, 
> HDFS-14927.006.patch, HDFS-14927.007.patch, HDFS-14927.008.patch, 
> HDFS-14927.009.patch
>
>
> It would be good to add some monitoring on the async caller thread pool that 
> handles fan-out RPC client requests, so we know the utilization and when to 
> bump up dfs.federation.router.client.thread-size
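
A minimal sketch of the kind of utilization numbers such monitoring could 
expose (assumed names, not the actual patch):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class AsyncCallerPoolMetrics {
  // Report how busy a fan-out thread pool is, so an operator can tell when
  // to raise dfs.federation.router.client.thread-size.
  static void report(ThreadPoolExecutor pool) {
    int active = pool.getActiveCount();
    int max = pool.getMaximumPoolSize();
    System.out.printf("active=%d/%d (%.0f%% busy), queued=%d, completed=%d%n",
        active, max, 100.0 * active / max,
        pool.getQueue().size(), pool.getCompletedTaskCount());
  }

  public static void main(String[] args) {
    ThreadPoolExecutor pool =
        (ThreadPoolExecutor) Executors.newFixedThreadPool(8);
    report(pool);
    pool.shutdown();
  }
}
{code}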



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2395) Handle Ozone S3 completeMPU to match with aws s3 behavior.

2019-10-31 Thread Bharat Viswanadham (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham updated HDDS-2395:
-
Issue Type: Bug  (was: Task)

> Handle Ozone S3 completeMPU to match with aws s3 behavior.
> --
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> # When 2 parts are uploaded and complete specifies only 1 part, no error
>  # During complete multipart upload, a name/part number that does not match 
> an uploaded part and part number yields an InvalidPart error
>  # When parts are not specified in sorted order, an InvalidPartOrder error
>  # During complete multipart upload with no uploaded parts, specifying some 
> parts also yields InvalidPart
>  # When parts 1,2,3 are uploaded, complete can be done with parts 1,3 (no 
> error)
>  # When only part 3 is uploaded, complete with part 3 can be done
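
For reference, a minimal AWS SDK for Java (v1) sequence that exercises case 1 
above against an S3-compatible endpoint. The endpoint, bucket, and key are 
assumptions for illustration and credentials come from the default provider 
chain; this is not code from the patch:

{code:java}
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;
import java.io.ByteArrayInputStream;
import java.util.Collections;

public class CompleteMpuExample {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
            "http://localhost:9878", "us-east-1")) // assumed s3g endpoint
        .withPathStyleAccessEnabled(true)
        .build();
    String bucket = "ozone-test", key = "mpu-demo";

    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
    byte[] data = new byte[5 * 1024 * 1024]; // 5 MB minimum part size
    PartETag part1 = null;
    for (int n = 1; n <= 2; n++) { // upload 2 parts
      UploadPartResult r = s3.uploadPart(new UploadPartRequest()
          .withBucketName(bucket).withKey(key).withUploadId(uploadId)
          .withPartNumber(n)
          .withInputStream(new ByteArrayInputStream(data))
          .withPartSize(data.length));
      if (n == 1) part1 = r.getPartETag();
    }
    // Completing with only part 1: AWS S3 accepts a subset of uploaded parts.
    // A wrong ETag/part number or unsorted part numbers should instead yield
    // InvalidPart / InvalidPartOrder errors.
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
        bucket, key, uploadId, Collections.singletonList(part1)));
  }
}
{code}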



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964482#comment-16964482
 ] 

Bharat Viswanadham commented on HDDS-2356:
--

Opened HDDS-2359 to handle CompleteMPU error cases.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone 
> to a path on VM0, while reading data from VM0's local disk and writing to the 
> mount path. The dataset has file sizes ranging from 0 bytes to GB-level, with 
> ~50,000 files in total. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and the OM to shut down. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_9278
> MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239 bucket: ozone-test key: 
> 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
>  at 
> org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
>  at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
>  
> The following errors have been resolved in 
> https://issues.apache.org/jira/browse/HDDS-2322. 
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush thread OMDoubleBufferFlushThread 
> encountered Throwable error
>  java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> 

[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964482#comment-16964482
 ] 

Bharat Viswanadham edited comment on HDDS-2356 at 11/1/19 12:41 AM:


Opened HDDS-2395 to handle CompleteMPU error cases.


was (Author: bharatviswa):
Opened HDDS-2359 to handle CompleteMPU error cases.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone 
> to a path on VM0, while reading data from VM0's local disk and writing to the 
> mount path. The dataset has file sizes ranging from 0 bytes to GB-level, with 
> ~50,000 files in total. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and the OM to shut down. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_9278
> MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239 bucket: ozone-test key: 
> 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
>  at 
> org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
>  at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
>  
> The following error has been resolved in 
> https://issues.apache.org/jira/browse/HDDS-2322. 
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> thread OMDoubleBufferFlushThread encountered Throwable error
>  java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> 
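For context on the trace above: TreeMap.forEach is fail-fast, so it throws ConcurrentModificationException when the map is mutated while getProto() iterates it, for example when a new part commit races with the flush thread serializing the same OmMultipartKeyInfo. Below is a minimal, self-contained sketch of the failure mode and a defensive-copy workaround; it assumes nothing about the actual Ozone classes.

{code:java}
import java.util.Map;
import java.util.TreeMap;

public class TreeMapRaceSketch {
  public static void main(String[] args) throws Exception {
    Map<Integer, String> parts = new TreeMap<>();
    for (int i = 0; i < 100_000; i++) {
      parts.put(i, "part-" + i);
    }

    // Writer thread mutates the map while the main thread iterates it.
    Thread writer = new Thread(() -> {
      for (int i = 100_000; i < 200_000; i++) {
        parts.put(i, "part-" + i);
      }
    });
    writer.start();

    try {
      // Unsynchronized iteration over a map being mutated: very likely
      // throws ConcurrentModificationException, as in the OM trace above.
      parts.forEach((k, v) -> { });
    } catch (java.util.ConcurrentModificationException e) {
      System.out.println("CME reproduced: " + e);
    }
    writer.join();

    // One workaround: serialize from a snapshot taken under whatever lock
    // guards updates (the copy below stands in for that locking).
    Map<Integer, String> snapshot = new TreeMap<>(parts);
    snapshot.forEach((k, v) -> { });  // safe: nobody mutates the snapshot
  }
}
{code}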

[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.

2019-10-31 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964476#comment-16964476
 ] 

Konstantin Shvachko commented on HDFS-14720:


Does it need a unit test?

> DataNode shouldn't report block as bad block if the block length is 
> Long.MAX_VALUE.
> ---
>
> Key: HDFS-14720
> URL: https://issues.apache.org/jira/browse/HDFS-14720
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14720.001.patch
>
>
> {noformat}
> 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Can't replicate block 
> BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because 
> on-disk length 175085 is shorter than NameNode recorded length 
> 9223372036854775807.{noformat}
> If the block length is Long.MAX_VALUE, it means the file this block belongs to 
> was deleted from the namenode and the DN got the command after the file was 
> deleted. In this case the command should be ignored.
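A minimal sketch of the proposed guard; the class, method, and parameter names below are illustrative stand-ins, not the actual DataNode code.

{code:java}
class StaleCommandGuardSketch {
  // Hypothetical helper: decide whether a length mismatch is a genuine
  // bad replica or just a stale replication command for a deleted file.
  boolean shouldReportBadBlock(long onDiskLength, long recordedLength) {
    if (recordedLength == Long.MAX_VALUE) {
      return false;  // file already deleted on the NameNode; ignore command
    }
    return onDiskLength < recordedLength;  // genuine mismatch: bad block
  }
}
{code}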



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12943) Consistent Reads from Standby Node

2019-10-31 Thread Konstantin Shvachko (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-12943:
---
Release Note: 
Observer is a new type of a NameNode in addition to Active and Standby Nodes in 
HA settings. An Observer Node maintains a replica of the namespace, same as a 
Standby Node. It additionally allows execution of clients' read requests.
To ensure read-after-write consistency within a single client, a state ID is 
introduced in RPC headers. The Observer responds to the client request only 
after its own state has caught up with the client’s state ID, which it 
previously received from the Active NameNode.
Clients can explicitly invoke a new client protocol call msync(), which ensures 
that subsequent reads by this client from an Observer are consistent.
A new client-side ObserverReadProxyProvider is introduced to provide automatic 
switching between Active and Observer NameNodes for submitting write and read 
requests, respectively.

  was:
Observer is a new type of a NameNode in addition to Active and Standby in HA 
settings. Observer Node maintains a replica of the namespace, same as a Standby 
Node. It additionally allows execution of clients' read requests.
To ensure read-after-write consistency within a single client, a state ID is 
introduced in RPC headers. The Observer responds to the client request only 
after its own state has caught up with the client’s state ID, which it 
previously received from the Active NameNode.
Clients can explicitly invoke a new client protocol call msync(), which ensures 
that subsequent reads by this client from an Observer are consistent.
A new client-side ObserverReadProxyProvider is introduced to provide automatic 
switching between Active and Observer NameNodes for submitting write and read 
requests, respectively.


> Consistent Reads from Standby Node
> --
>
> Key: HDFS-12943
> URL: https://issues.apache.org/jira/browse/HDFS-12943
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: ConsistentReadsFromStandbyNode.pdf, 
> ConsistentReadsFromStandbyNode.pdf, HDFS-12943-001.patch, 
> HDFS-12943-002.patch, HDFS-12943-003.patch, HDFS-12943-004.patch, 
> TestPlan-ConsistentReadsFromStandbyNode.pdf
>
>
> StandbyNode in HDFS is a replica of the active NameNode. The states of the 
> NameNodes are coordinated via the journal. It is natural to consider 
> StandbyNode as a read-only replica. As with any replicated distributed system 
> the problem of stale reads should be resolved. Our main goal is to provide 
> reads from standby in a consistent way in order to enable a wide range of 
> existing applications running on top of HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12943) Consistent Reads from Standby Node

2019-10-31 Thread Konstantin Shvachko (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-12943:
---
Release Note: 
Observer is a new type of a NameNode in addition to Active and Standby in HA 
settings. Observer Node maintains a replica of the namespace, same as a Standby 
Node. It additionally allows execution of clients' read requests.
To ensure read-after-write consistency within a single client, a state ID is 
introduced in RPC headers. The Observer responds to the client request only 
after its own state has caught up with the client’s state ID, which it 
previously received from the Active NameNode.
Clients can explicitly invoke a new client protocol call msync(), which ensures 
that subsequent reads by this client from an Observer are consistent.
A new client-side ObserverReadProxyProvider is introduced to provide automatic 
switching between Active and Observer NameNodes for submitting write and read 
requests, respectively.

  was:
Observer is a new type of NameNodes in addition to Active and Standby in HA 
settings. Observer Node maintains a replica of the namespace, same as a Standby 
Node. It additionally allows execution of clients' read requests.
To ensure read-after-write consistency within a single client, a state ID is 
introduced in RPC headers. The Observer responds to the client request only 
after its own state has caught up with the client’s state ID, which it 
previously received from the Active NameNode.
Clients can explicitly invoke a new client protocol call msync(), which ensures 
that subsequent reads by this client from an Observer are consistent.
A new client-side ObserverReadProxyProvider is introduced to provide automatic 
switching between Active and Observer NameNodes for submitting write and read 
requests, respectively.


> Consistent Reads from Standby Node
> --
>
> Key: HDFS-12943
> URL: https://issues.apache.org/jira/browse/HDFS-12943
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: ConsistentReadsFromStandbyNode.pdf, 
> ConsistentReadsFromStandbyNode.pdf, HDFS-12943-001.patch, 
> HDFS-12943-002.patch, HDFS-12943-003.patch, HDFS-12943-004.patch, 
> TestPlan-ConsistentReadsFromStandbyNode.pdf
>
>
> StandbyNode in HDFS is a replica of the active NameNode. The states of the 
> NameNodes are coordinated via the journal. It is natural to consider 
> StandbyNode as a read-only replica. As with any replicated distributed system 
> the problem of stale reads should be resolved. Our main goal is to provide 
> reads from standby in a consistent way in order to enable a wide range of 
> existing applications running on top of HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-12943) Consistent Reads from Standby Node

2019-10-31 Thread Konstantin Shvachko (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko resolved HDFS-12943.

Fix Version/s: 2.10.0
   3.2.2
   3.1.4
   3.3.0
 Hadoop Flags: Reviewed
 Release Note: 
Observer is a new type of NameNodes in addition to Active and Standby in HA 
settings. Observer Node maintains a replica of the namespace, same as a Standby 
Node. It additionally allows execution of clients' read requests.
To ensure read-after-write consistency within a single client, a state ID is 
introduced in RPC headers. The Observer responds to the client request only 
after its own state has caught up with the client’s state ID, which it 
previously received from the Active NameNode.
Clients can explicitly invoke a new client protocol call msync(), which ensures 
that subsequent reads by this client from an Observer are consistent.
A new client-side ObserverReadProxyProvider is introduced to provide automatic 
switching between Active and Observer NameNodes for submitting write and read 
requests, respectively.
   Resolution: Fixed

Closing this as Fixed. The feature has been tested, back-ported down to 2.10, 
and released. A few remaining subtasks are being addressed as regular issues.
Added release notes. Please review in case I missed anything.

_Thank you everybody for contributing to this effort._
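As a rough illustration of how a client opts in, here is a hedged sketch assuming a 3.3.0+ client; the nameservice name "mycluster" and the path are made up.

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ObserverReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Route reads to Observers and writes to the Active NameNode.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider");

    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    Path p = new Path("/tmp/consistency-check");
    fs.create(p).close();             // write goes to the Active NameNode

    fs.msync();                       // force Observer reads to catch up to
                                      // this client's last-seen state ID
    System.out.println(fs.exists(p)); // read may now be served by an Observer
  }
}
{code}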

> Consistent Reads from Standby Node
> --
>
> Key: HDFS-12943
> URL: https://issues.apache.org/jira/browse/HDFS-12943
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.0
>
> Attachments: ConsistentReadsFromStandbyNode.pdf, 
> ConsistentReadsFromStandbyNode.pdf, HDFS-12943-001.patch, 
> HDFS-12943-002.patch, HDFS-12943-003.patch, HDFS-12943-004.patch, 
> TestPlan-ConsistentReadsFromStandbyNode.pdf
>
>
> StandbyNode in HDFS is a replica of the active NameNode. The states of the 
> NameNodes are coordinated via the journal. It is natural to consider 
> StandbyNode as a read-only replica. As with any replicated distributed system 
> the problem of stale reads should be resolved. Our main goal is to provide 
> reads from standby in a consistent way in order to enable a wide range of 
> existing applications running on top of HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh resolved HDFS-14768.
--
Resolution: Not A Problem

> EC : Busy DN replica should be consider in live replica check.
> --
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, 
> HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> indices [3,4] and increase the index-6 datanode's 
> pendingReplicationWithoutTargets so that it exceeds 
> replicationStreamsHardLimit (we set 14). Then, after the method 
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is 
> [0,1,2,3,4,5,7,8] and the block counter is Live: 7, Decommission: 2. 
> In the method scheduleReconstruction of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an 
> erasure-coding task to a target datanode.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> 
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 index blocks, i.e. 
> [0,1,2,3,4,5].
> Using the indices [0,1,2,3,4,5] to build the target indices [6,0] triggers the 
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   
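To make the failure mode concrete: a Java short[] is zero-initialized, so any target slot the loop never fills silently points at block index 0. Below is a tiny self-contained sketch; the names mirror the snippet above, but nothing here is the real reconstruction code.

{code:java}
public class TargetIndicesSketch {
  public static void main(String[] args) {
    int dataBlkNum = 6, parityBlkNum = 3;
    // Live indices as in the report: 6 is the only gap, yet the NameNode
    // asked for 2 targets (it counted 7 live plus 2 decommissioning).
    boolean[] live = {true, true, true, true, true, true, false, true, true};
    short[] targetIndices = new short[2];  // zero-initialized!

    int m = 0;
    for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
      if (!live[i] && m < targetIndices.length) {
        targetIndices[m++] = (short) i;
      }
    }
    // Prints [6, 0]: the second slot was never filled and still holds 0,
    // so the decoder is asked to "reconstruct" block 0 from inputs that
    // include block 0 itself, triggering the ISA-L corruption described above.
    System.out.println(java.util.Arrays.toString(targetIndices));
  }
}
{code}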

[jira] [Reopened] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh reopened HDFS-14768:
--

> EC : Busy DN replica should be consider in live replica check.
> --
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, 
> HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> indices [3,4] and increase the index-6 datanode's 
> pendingReplicationWithoutTargets so that it exceeds 
> replicationStreamsHardLimit (we set 14). Then, after the method 
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is 
> [0,1,2,3,4,5,7,8] and the block counter is Live: 7, Decommission: 2. 
> In the method scheduleReconstruction of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an 
> erasure-coding task to a target datanode.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> 
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 index blocks, i.e. 
> [0,1,2,3,4,5].
> Using the indices [0,1,2,3,4,5] to build the target indices [6,0] triggers the 
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   

[jira] [Resolved] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh resolved HDFS-14768.
--
Resolution: Abandoned

> EC : Busy DN replica should be consider in live replica check.
> --
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, 
> HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> indices [3,4] and increase the index-6 datanode's 
> pendingReplicationWithoutTargets so that it exceeds 
> replicationStreamsHardLimit (we set 14). Then, after the method 
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is 
> [0,1,2,3,4,5,7,8] and the block counter is Live: 7, Decommission: 2. 
> In the method scheduleReconstruction of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an 
> erasure-coding task to a target datanode.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> 
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 index blocks, i.e. 
> [0,1,2,3,4,5].
> Using the indices [0,1,2,3,4,5] to build the target indices [6,0] triggers the 
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
>   

[jira] [Updated] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-31 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Status: Open  (was: Patch Available)

> EC : Busy DN replica should be consider in live replica check.
> --
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, 
> HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K and the version is Hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8]. We decommission 
> indices [3,4] and increase the index-6 datanode's 
> pendingReplicationWithoutTargets so that it exceeds 
> replicationStreamsHardLimit (we set 14). Then, after the method 
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is 
> [0,1,2,3,4,5,7,8] and the block counter is Live: 7, Decommission: 2. 
> In the method scheduleReconstruction of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it assigns an 
> erasure-coding task to a target datanode.
> When the datanode gets the task, it builds targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> targetIndices = new short[targets.length];
> 
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0] = 6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 index blocks, i.e. 
> [0,1,2,3,4,5].
> Using the indices [0,1,2,3,4,5] to build the target indices [6,0] triggers the 
> ISA-L bug: block index 6's data is corrupted (all data is zero).
> I wrote a unit test that reproduces this reliably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), 

[jira] [Commented] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd

2019-10-31 Thread Xiaoyu Yao (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964461#comment-16964461
 ] 

Xiaoyu Yao commented on HDDS-2321:
--

{quote}Since SCM has the root cert, it might be interesting if it sends a token 
over; that way these commands are also verified.

In the long run, or even the short run, these SCM commands to DNs will go away.
{quote}
Good point. We will use follow-up JIRAs to add SCM and DN tokens for other 
command types. This one focuses on the OM block token check improvement but 
allows future extension for SCM/DN tokens. 

> Ozone Block Token verify should not apply to all datanode cmd
> -
>
> Key: HDDS-2321
> URL: https://issues.apache.org/jira/browse/HDDS-2321
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Nilotpal Nandi
>Assignee: Xiaoyu Yao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DN container protocol has commands sent from SCM or other DNs, which do 
> not carry an OM block token the way OM client requests do. We should restrict 
> the OM block token check to only those requests issued from OM clients. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd

2019-10-31 Thread Xiaoyu Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoyu Yao updated HDDS-2321:
-
Status: Patch Available  (was: Open)

> Ozone Block Token verify should not apply to all datanode cmd
> -
>
> Key: HDDS-2321
> URL: https://issues.apache.org/jira/browse/HDDS-2321
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Nilotpal Nandi
>Assignee: Xiaoyu Yao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DN container protocol has commands sent from SCM or other DNs, which do 
> not carry an OM block token the way OM client requests do. We should restrict 
> the OM block token check to only those requests issued from OM clients. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2321?focusedWorklogId=337100=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337100
 ]

ASF GitHub Bot logged work on HDDS-2321:


Author: ASF GitHub Bot
Created on: 31/Oct/19 23:47
Start Date: 31/Oct/19 23:47
Worklog Time Spent: 10m 
  Work Description: xiaoyuyao commented on pull request #110: HDDS-2321. 
Ozone Block Token verify should not apply to all datanode …
URL: https://github.com/apache/hadoop-ozone/pull/110
 
 
   
   ## What changes were proposed in this pull request?
   
   * Change the TokenVerifier interface to check the command type and the block 
id.
   * Token verification, based on the token encoded in the command, is done 
inside HddsDispatcher. 
   * Remove the Grpc Client/Server CredentialInterceptor as it cannot fit into 
Ratis commands. 
   * Added more unit test coverage on the TokenVerifier.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-2321
   
   ## How was this patch tested?
   
   Added unit test testBlockTokenVerifier().
   Updated unit tests in TestSecureContainerServer.java. 
   Ran the Ozone secure smoke test.
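   A loose sketch of the "check only commands that carry block tokens" idea follows; the enum values and interface below are simplified stand-ins, not the actual ContainerCommandRequestProto types.

{code:java}
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of the dispatcher-side token check.
class BlockTokenCheckSketch {
  enum CmdType { READ_CHUNK, WRITE_CHUNK, PUT_BLOCK, GET_BLOCK, CLOSE_CONTAINER }

  // Only client-issued data-path commands carry an OM block token.
  static final Set<CmdType> TOKEN_CHECKED =
      EnumSet.of(CmdType.READ_CHUNK, CmdType.WRITE_CHUNK,
                 CmdType.PUT_BLOCK, CmdType.GET_BLOCK);

  interface TokenVerifier {
    void verify(CmdType cmd, String encodedToken, long blockId);
  }

  static void dispatch(CmdType cmd, String encodedToken, long blockId,
                       TokenVerifier verifier) {
    if (TOKEN_CHECKED.contains(cmd)) {
      verifier.verify(cmd, encodedToken, blockId);  // reject if token invalid
    }
    // SCM/DN-originated commands (e.g. CLOSE_CONTAINER) skip the OM token check.
  }
}
{code}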
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 337100)
Remaining Estimate: 0h
Time Spent: 10m

> Ozone Block Token verify should not apply to all datanode cmd
> -
>
> Key: HDDS-2321
> URL: https://issues.apache.org/jira/browse/HDDS-2321
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Nilotpal Nandi
>Assignee: Xiaoyu Yao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DN container protocol has commands sent from SCM or other DNs, which do 
> not carry an OM block token the way OM client requests do. We should restrict 
> the OM block token check to only those requests issued from OM clients. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2321) Ozone Block Token verify should not apply to all datanode cmd

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2321:
-
Labels: pull-request-available  (was: )

> Ozone Block Token verify should not apply to all datanode cmd
> -
>
> Key: HDDS-2321
> URL: https://issues.apache.org/jira/browse/HDDS-2321
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.1
>Reporter: Nilotpal Nandi
>Assignee: Xiaoyu Yao
>Priority: Major
>  Labels: pull-request-available
>
> The DN container protocol has commands sent from SCM or other DNs, which do 
> not carry an OM block token the way OM client requests do. We should restrict 
> the OM block token check to only those requests issued from OM clients. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1847?focusedWorklogId=337099=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337099
 ]

ASF GitHub Bot logged work on HDDS-1847:


Author: ASF GitHub Bot
Created on: 31/Oct/19 23:45
Start Date: 31/Oct/19 23:45
Worklog Time Spent: 10m 
  Work Description: xiaoyuyao commented on issue #1678: HDDS-1847: Datanode 
Kerberos principal and keytab config key looks inconsistent
URL: https://github.com/apache/hadoop/pull/1678#issuecomment-548612304
 
 
   The order of initialization in StorageContainerManagerHttpServer causes an 
NPE after this change, which failed the secure acceptance tests. We can use 
HDDS-2393 to track the fix. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 337099)
Time Spent: 1h 20m  (was: 1h 10m)

> Datanode Kerberos principal and keytab config key looks inconsistent
> 
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Eric Yang
>Assignee: Chris Teoh
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file |
> | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file |
> | hdds.datanode.http.kerberos.principal | Datanode spnego principal |
> | dfs.datanode.kerberos.principal | Datanode service principal |
> | dfs.datanode.keytab.file | Datanode service keytab file |
> The prefixes are very different for each of the datanode configurations. It 
> would be nice to have some consistency for the datanode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14794) [SBN read] reportBadBlock is rejected by Observer.

2019-10-31 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964455#comment-16964455
 ] 

Wei-Chiu Chuang commented on HDFS-14794:


reportBadBlock is initiated by a client or DataNode when it detects that a block 
is bad.
It is then up to the active NameNode to schedule replication or invalidation for 
the bad block.

There would be a period of time where the observer thinks a bad block is still 
good, but hopefully the duration is short; the client would retry other replicas, 
IIRC. It might be okay for the observer to process reportBadBlock and mark a 
block replica corrupt, but it should not try to schedule block replication or 
invalidation. 
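For readers unfamiliar with why the Observer rejects the call: NameNode RPCs are gated by an operation-category check, roughly as in this simplified sketch (the real logic lives in the HA state classes and throws StandbyException).

{code:java}
// Simplified model of the HA state check that produces the
// "Operation category WRITE is not supported in state observer" error.
class HaStateSketch {
  enum OperationCategory { READ, WRITE }
  enum HAState { ACTIVE, STANDBY, OBSERVER }

  static void checkOperation(HAState state, OperationCategory op) {
    boolean allowed =
        state == HAState.ACTIVE  // active serves both categories
        || (state == HAState.OBSERVER && op == OperationCategory.READ);
    if (!allowed) {
      // Stands in for StandbyException in the real code.
      throw new IllegalStateException("Operation category " + op
          + " is not supported in state " + state);
    }
  }
}
{code}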

> [SBN read] reportBadBlock is rejected by Observer.
> --
>
> Key: HDFS-14794
> URL: https://issues.apache.org/jira/browse/HDFS-14794
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
>
> {{reportBadBlock}} is rejected by Observer via StandbyException
> {code}StandbyException: Operation category WRITE is not supported in state 
> observer{code}
> We should investigate what are the consequences of this and if we should 
> treat {{reportBadBlock}} as IBRs. Note that {{reportBadBlock}} is a part of 
> both {{ClientProtocol}} and {{DatanodeProtocol}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14794) [SBN read] reportBadBlock is rejected by Observer.

2019-10-31 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964455#comment-16964455
 ] 

Wei-Chiu Chuang edited comment on HDFS-14794 at 10/31/19 11:41 PM:
---

reportBadBlock is initiated by a client or DataNode when it detects that a block 
is bad.
It is then up to the active NameNode to schedule replication or invalidation for 
the bad block.

There would be a period of time where the observer thinks a bad block is still 
good, but hopefully the duration is short; the client would retry other replicas, 
IIRC. It might be okay for the observer to process reportBadBlock and mark a 
block replica corrupt, but it should not try to schedule block replication or 
invalidation. 

In summary, I don't think it affects correctness. Maybe a little drop in 
availability.


was (Author: jojochuang):
reportBadBlock is initiated by a client or DataNode when it detects that a block 
is bad.
It is then up to the active NameNode to schedule replication or invalidation for 
the bad block.

There would be a period of time where the observer thinks a bad block is still 
good, but hopefully the duration is short; the client would retry other replicas, 
IIRC. It might be okay for the observer to process reportBadBlock and mark a 
block replica corrupt, but it should not try to schedule block replication or 
invalidation. 

> [SBN read] reportBadBlock is rejected by Observer.
> --
>
> Key: HDFS-14794
> URL: https://issues.apache.org/jira/browse/HDFS-14794
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
>
> {{reportBadBlock}} is rejected by Observer via StandbyException
> {code}StandbyException: Operation category WRITE is not supported in state 
> observer{code}
> We should investigate what are the consequences of this and if we should 
> treat {{reportBadBlock}} as IBRs. Note that {{reportBadBlock}} is a part of 
> both {{ClientProtocol}} and {{DatanodeProtocol}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14927) RBF: Add metrics for async callers thread pool

2019-10-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964451#comment-16964451
 ] 

Hadoop QA commented on HDFS-14927:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
54s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 26s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 54s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  7m  
3s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 64m 44s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14927 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12984556/HDFS-14927.009.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 68e07677bdb1 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f9b99d2 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28216/testReport/ |
| Max. process+thread count | 2738 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28216/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RBF: Add metrics for async callers thread pool
> --
>
> Key: HDFS-14927
> URL: 

[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.

2019-10-31 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964450#comment-16964450
 ] 

Wei-Chiu Chuang commented on HDFS-14720:


BTW I think the fix is correct. +1 from me.

> DataNode shouldn't report block as bad block if the block length is 
> Long.MAX_VALUE.
> ---
>
> Key: HDFS-14720
> URL: https://issues.apache.org/jira/browse/HDFS-14720
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14720.001.patch
>
>
> {noformat}
> 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Can't replicate block 
> BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because 
> on-disk length 175085 is shorter than NameNode recorded length 
> 9223372036854775807.{noformat}
> If the block length is Long.MAX_VALUE, it means the file this block belongs to 
> was deleted from the namenode and the DN got the command after the file was 
> deleted. In this case the command should be ignored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.

2019-10-31 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964443#comment-16964443
 ] 

Wei-Chiu Chuang commented on HDFS-14720:


I don't think this fix is relevant to HDFS-14794. This one was meant to solve a 
corner case.

> DataNode shouldn't report block as bad block if the block length is 
> Long.MAX_VALUE.
> ---
>
> Key: HDFS-14720
> URL: https://issues.apache.org/jira/browse/HDFS-14720
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14720.001.patch
>
>
> {noformat}
> 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Can't replicate block 
> BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because 
> on-disk length 175085 is shorter than NameNode recorded length 
> 9223372036854775807.{noformat}
> If the block length is Long.MAX_VALUE, it means the file this block belongs to 
> was deleted from the namenode and the DN got the command after the file was 
> deleted. In this case the command should be ignored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.

2019-10-31 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964438#comment-16964438
 ] 

Konstantin Shvachko commented on HDFS-14720:


Hey guys, could you explain how this fixes the {{reportBadBlock()}} issue from 
HDFS-14794?

> DataNode shouldn't report block as bad block if the block length is 
> Long.MAX_VALUE.
> ---
>
> Key: HDFS-14720
> URL: https://issues.apache.org/jira/browse/HDFS-14720
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14720.001.patch
>
>
> {noformat}
> 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Can't replicate block 
> BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because 
> on-disk length 175085 is shorter than NameNode recorded length 
> 9223372036854775807.{noformat}
> If the block length is Long.MAX_VALUE, it means the file this block belongs to 
> was deleted from the namenode and the DN got the command after the file was 
> deleted. In this case the command should be ignored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2395) Handle Ozone S3 completeMPU to match with aws s3 behavior.

2019-10-31 Thread Bharat Viswanadham (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham updated HDDS-2395:
-
Summary: Handle Ozone S3 completeMPU to match with aws s3 behavior.  (was: 
Handle completeMPU scenarios to match with aws s3 behavior.)

> Handle Ozone S3 completeMPU to match with aws s3 behavior.
> --
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> # When 2 parts are uploaded and the complete request lists only 1 part, there 
> is no error.
>  # During complete multipart upload, if a part name/part number does not match 
> an uploaded part and part number, an InvalidPart error is returned.
>  # When parts are not specified in sorted order, InvalidPartOrder is returned.
>  # During complete multipart upload, when there are no uploaded parts and we 
> specify some parts, InvalidPart is also returned.
>  # With parts 1,2,3 uploaded, completing with parts 1,3 succeeds (no error).
>  # When only part 3 is uploaded, completing with part 3 succeeds.
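A rough sketch of part-list validation matching the behaviors listed above; the error names follow S3, everything else is a stand-in for the real OM request handling.

{code:java}
import java.util.List;
import java.util.Map;

// Hypothetical sketch of complete-MPU part-list validation.
class CompleteMpuValidationSketch {
  static void validate(Map<Integer, String> uploadedEtagsByPartNumber,
                       List<Integer> requestedParts,
                       List<String> requestedEtags) {
    int prev = 0;
    for (int i = 0; i < requestedParts.size(); i++) {
      int partNumber = requestedParts.get(i);
      if (partNumber <= prev) {
        throw new IllegalArgumentException("InvalidPartOrder");  // not ascending
      }
      prev = partNumber;
      String uploadedEtag = uploadedEtagsByPartNumber.get(partNumber);
      if (uploadedEtag == null || !uploadedEtag.equals(requestedEtags.get(i))) {
        throw new IllegalArgumentException("InvalidPart");  // unknown or mismatched
      }
    }
    // Parts that were uploaded but omitted from the request are simply dropped,
    // matching behaviors 1, 5, and 6 above (completing a subset is not an error).
  }
}
{code}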



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2388) Teragen test failure due to OM exception

2019-10-31 Thread Aravindan Vijayan (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964434#comment-16964434
 ] 

Aravindan Vijayan commented on HDDS-2388:
-

[~shashikant] Is the OM crashing due to this error? This is on a different 
thread than the OM read/write path. 

> Teragen test failure due to OM exception
> 
>
> Key: HDDS-2388
> URL: https://issues.apache.org/jira/browse/HDDS-2388
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
>
> Ran into below exception while running teragen:
> {code:java}
> Unable to get delta updates since sequenceNumber 79932 
> org.rocksdb.RocksDBException: Requested sequence not yet written in the db
>   at org.rocksdb.RocksDB.getUpdatesSince(Native Method)
>   at org.rocksdb.RocksDB.getUpdatesSince(RocksDB.java:3587)
>   at 
> org.apache.hadoop.hdds.utils.db.RDBStore.getUpdatesSince(RDBStore.java:338)
>   at 
> org.apache.hadoop.ozone.om.OzoneManager.getDBUpdates(OzoneManager.java:3283)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOMDBUpdates(OzoneManagerRequestHandler.java:404)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handle(OzoneManagerRequestHandler.java:314)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:219)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:134)
>   at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:102)
>   at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2364) Add an OM metric to find the false positive rate for keyMayExist

2019-10-31 Thread Aravindan Vijayan (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964433#comment-16964433
 ] 

Aravindan Vijayan commented on HDDS-2364:
-

[~msingh] Thanks for the review. I will raise follow-up JIRAs for metrics that 
are not already exposed through RocksDB. 

> Add an OM metric to find the false positive rate for keyMayExist
> 
>
> Key: HDDS-2364
> URL: https://issues.apache.org/jira/browse/HDDS-2364
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.5.0
>Reporter: Mukul Kumar Singh
>Assignee: Aravindan Vijayan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Add an OM metric to find the false positive rate for keyMayExist.
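
For illustration, a hedged sketch of how such a metric could be computed; the 
class and counter names are assumptions, not the actual OM metrics code:
{code:java}
// Hypothetical sketch, not the actual OMMetrics code. keyMayExist() can
// answer "maybe" for keys that a subsequent lookup misses; the ratio of those
// misses to all "maybe" answers is the false positive rate to expose.
import java.util.concurrent.atomic.LongAdder;

public class KeyMayExistMetrics {
  private final LongAdder mayExistPositives = new LongAdder(); // keyMayExist == true
  private final LongAdder falsePositives = new LongAdder();    // ...but lookup missed

  public void recordMayExist(boolean keyActuallyExists) {
    mayExistPositives.increment();
    if (!keyActuallyExists) {
      falsePositives.increment();
    }
  }

  public double falsePositiveRate() {
    long positives = mayExistPositives.sum();
    return positives == 0 ? 0.0 : (double) falsePositives.sum() / positives;
  }
}
{code}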



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14927) RBF: Add metrics for async callers thread pool

2019-10-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964431#comment-16964431
 ] 

Hadoop QA commented on HDFS-14927:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
52s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 38s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 19s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  7m 
36s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 64m  6s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14927 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12984553/HDFS-14927.008.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5950634d174e 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f9b99d2 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28215/testReport/ |
| Max. process+thread count | 2751 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28215/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> RBF: Add metrics for async callers thread pool
> --
>
> Key: HDFS-14927
> URL: 

[jira] [Work logged] (HDDS-2395) Handle completeMPU scenarios to match with aws s3 behavior.

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2395?focusedWorklogId=337084=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337084
 ]

ASF GitHub Bot logged work on HDDS-2395:


Author: ASF GitHub Bot
Created on: 31/Oct/19 22:45
Start Date: 31/Oct/19 22:45
Worklog Time Spent: 10m 
  Work Description: bharatviswa504 commented on pull request #109: 
HDDS-2395. Handle completeMPU scenarios to match with aws s3 behavior.
URL: https://github.com/apache/hadoop-ozone/pull/109
 
 
   ## What changes were proposed in this pull request?
   
   Fix a few cases that were missed during complete multipart upload.
   
   Uploading 2 parts and then completing the upload with only 1 part succeeds 
with no error.
   During complete multipart upload, a part whose name/part number does not 
match an uploaded part and part number returns an InvalidPart error.
   Parts not specified in sorted order return an InvalidPartOrder error.
   Completing a multipart upload with no uploaded parts while specifying some 
parts also returns InvalidPart.
   After uploading parts 1, 2, 3, completing with only parts 1 and 3 succeeds 
(no error).
   When only part 3 was uploaded, completing with part 3 succeeds.
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-2395
   
   ## How was this patch tested?
   Ran S3 smoke tests and also added smoke tests.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 337084)
Remaining Estimate: 0h
Time Spent: 10m

> Handle completeMPU scenarios to match with aws s3 behavior.
> ---
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> # Uploading 2 parts and then completing the upload with only 1 part succeeds 
> with no error.
>  # During complete multipart upload, a part whose name/part number does not 
> match an uploaded part and part number returns an InvalidPart error.
>  # Parts not specified in sorted order return an InvalidPartOrder error.
>  # Completing a multipart upload with no uploaded parts while specifying some 
> parts also returns InvalidPart.
>  # After uploading parts 1, 2, 3, completing with only parts 1 and 3 succeeds 
> (no error).
>  # When only part 3 was uploaded, completing with part 3 succeeds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2395) Handle completeMPU scenarios to match with aws s3 behavior.

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2395:
-
Labels: pull-request-available  (was: )

> Handle completeMPU scenarios to match with aws s3 behavior.
> ---
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>
> # Uploading 2 parts and then completing the upload with only 1 part succeeds 
> with no error.
>  # During complete multipart upload, a part whose name/part number does not 
> match an uploaded part and part number returns an InvalidPart error.
>  # Parts not specified in sorted order return an InvalidPartOrder error.
>  # Completing a multipart upload with no uploaded parts while specifying some 
> parts also returns InvalidPart.
>  # After uploading parts 1, 2, 3, completing with only parts 1 and 3 succeeds 
> (no error).
>  # When only part 3 was uploaded, completing with part 3 succeeds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2395) Handle completeMPU scenarios to match with aws s3 behavior.

2019-10-31 Thread Bharat Viswanadham (Jira)
Bharat Viswanadham created HDDS-2395:


 Summary: Handle completeMPU scenarios to match with aws s3 
behavior.
 Key: HDDS-2395
 URL: https://issues.apache.org/jira/browse/HDDS-2395
 Project: Hadoop Distributed Data Store
  Issue Type: Task
Reporter: Bharat Viswanadham


# Uploading 2 parts and then completing the upload with only 1 part succeeds 
with no error.
 # During complete multipart upload, a part whose name/part number does not 
match an uploaded part and part number returns an InvalidPart error.
 # Parts not specified in sorted order return an InvalidPartOrder error.
 # Completing a multipart upload with no uploaded parts while specifying some 
parts also returns InvalidPart.
 # After uploading parts 1, 2, 3, completing with only parts 1 and 3 succeeds 
(no error).
 # When only part 3 was uploaded, completing with part 3 succeeds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2395) Handle completeMPU scenarios to match with aws s3 behavior.

2019-10-31 Thread Bharat Viswanadham (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharat Viswanadham reassigned HDDS-2395:


Assignee: Bharat Viswanadham

> Handle completeMPU scenarios to match with aws s3 behavior.
> ---
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>
> # Uploading 2 parts and then completing the upload with only 1 part succeeds 
> with no error.
>  # During complete multipart upload, a part whose name/part number does not 
> match an uploaded part and part number returns an InvalidPart error.
>  # Parts not specified in sorted order return an InvalidPartOrder error.
>  # Completing a multipart upload with no uploaded parts while specifying some 
> parts also returns InvalidPart.
>  # After uploading parts 1, 2, 3, completing with only parts 1 and 3 succeeds 
> (no error).
>  # When only part 3 was uploaded, completing with part 3 succeeds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14927) RBF: Add metrics for async callers thread pool

2019-10-31 Thread Leon Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leon Gao updated HDFS-14927:

Attachment: HDFS-14927.009.patch

> RBF: Add metrics for async callers thread pool
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
> Attachments: HDFS-14927.001.patch, HDFS-14927.002.patch, 
> HDFS-14927.003.patch, HDFS-14927.004.patch, HDFS-14927.005.patch, 
> HDFS-14927.006.patch, HDFS-14927.007.patch, HDFS-14927.008.patch, 
> HDFS-14927.009.patch
>
>
> It would be good to add some monitoring on the async caller thread pool that 
> handles fan-out RPC client requests, so we know the utilization and when to 
> bump up dfs.federation.router.client.thread-size.
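
For illustration, a hedged sketch of the kind of utilization gauge meant here, 
assuming the pool is a plain ThreadPoolExecutor; the class and metric names are 
illustrative, not the actual Router code:
{code:java}
// Hypothetical sketch: expose how busy the async caller pool is so operators
// know when to raise dfs.federation.router.client.thread-size.
import java.util.concurrent.ThreadPoolExecutor;

public class AsyncCallerPoolMetrics {
  private final ThreadPoolExecutor pool;

  public AsyncCallerPoolMetrics(ThreadPoolExecutor pool) {
    this.pool = pool;
  }

  /** Fraction of the configured threads currently running tasks (0.0-1.0). */
  public double utilization() {
    return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
  }

  /** Calls waiting because every thread is busy. */
  public int queuedCalls() {
    return pool.getQueue().size();
  }
}
{code}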



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2394) Ozone allows bucket name with underscore to be created but throws an error during put key operation

2019-10-31 Thread Vivek Ratnavel Subramanian (Jira)
Vivek Ratnavel Subramanian created HDDS-2394:


 Summary: Ozone allows bucket name with underscore to be created 
but throws an error during put key operation
 Key: HDDS-2394
 URL: https://issues.apache.org/jira/browse/HDDS-2394
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Manager
Affects Versions: 0.4.1
Reporter: Vivek Ratnavel Subramanian
Assignee: Vivek Ratnavel Subramanian


Steps to reproduce:
aws s3api --endpoint http://localhost:9878 create-bucket --bucket ozone_test

aws s3api --endpoint http://localhost:9878 put-object --bucket ozone_test --key 
ozone-site.xml --body /etc/hadoop/conf/ozone-site.xml

S3 gateway throws a warning:
{code:java}
javax.servlet.ServletException: javax.servlet.ServletException: 
java.lang.IllegalArgumentException: Bucket or Volume name has an unsupported 
character : _
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:139)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:539)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.servlet.ServletException: java.lang.IllegalArgumentException: 
Bucket or Volume name has an unsupported character : _
at 
org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:432)
at 
org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:370)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:389)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:342)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:229)
at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:840)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1780)
at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1628)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:513)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
... 13 more
{code}
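
The fix this suggests is to reject the name at create-bucket time rather than 
letting put-key fail later. A minimal sketch, assuming simplified S3 naming 
rules; the class and method names are hypothetical, not the actual Ozone code:
{code:java}
// Hypothetical sketch: "ozone_test" would be rejected here, at bucket
// creation, instead of at the first put-object.
public class BucketNameCheck {
  static void verifyBucketName(String name) {
    if (name == null || name.length() < 3 || name.length() > 63) {
      throw new IllegalArgumentException("Bucket name length must be 3-63");
    }
    for (char c : name.toCharArray()) {
      boolean ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
          || c == '-' || c == '.';
      if (!ok) {  // in particular '_' fails here, matching the put-key check
        throw new IllegalArgumentException(
            "Bucket or Volume name has an unsupported character : " + c);
      }
    }
  }
}
{code}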



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2393) HDDS-1847 broke some unit tests

2019-10-31 Thread Chris Teoh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Teoh reassigned HDDS-2393:


   Assignee: Chris Teoh
Description: 
Siyao Meng commented on HDDS-1847:
--

Looks like this commit breaks {{TestKeyManagerImpl}} in {{setUp()}} and 
{{cleanup()}}. Run {{TestKeyManagerImpl#testListStatus()}} to reproduce it 
reliably. I believe other tests could be broken by this as well.

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.getSpnegoPrincipal(StorageContainerManagerHttpServer.java:74)
at 
org.apache.hadoop.hdds.server.BaseHttpServer.<init>(BaseHttpServer.java:81)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.<init>(StorageContainerManagerHttpServer.java:36)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:330)
at org.apache.hadoop.hdds.scm.TestUtils.getScm(TestUtils.java:544)
at 
org.apache.hadoop.ozone.om.TestKeyManagerImpl.setUp(TestKeyManagerImpl.java:150)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{code}

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.ozone.om.TestKeyManagerImpl.cleanup(TestKeyManagerImpl.java:176)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{code}

  was:
Looks like this commit breaks {{TestKeyManagerImpl}} in {{setUp()}} and 
{{cleanup()}}. Run {{TestKeyManagerImpl#testListStatus()}} to steadily repro. I 
believe there could be other tests that are broken by this.

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.getSpnegoPrincipal(StorageContainerManagerHttpServer.java:74)
at 
org.apache.hadoop.hdds.server.BaseHttpServer.<init>(BaseHttpServer.java:81)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.<init>(StorageContainerManagerHttpServer.java:36)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:330)
at org.apache.hadoop.hdds.scm.TestUtils.getScm(TestUtils.java:544)
at 
org.apache.hadoop.ozone.om.TestKeyManagerImpl.setUp(TestKeyManagerImpl.java:150)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 

[jira] [Assigned] (HDFS-13689) NameNodeRpcServer getEditsFromTxid assumes it is run on active NameNode

2019-10-31 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen reassigned HDFS-13689:
--

Assignee: (was: Erik Krogen)

> NameNodeRpcServer getEditsFromTxid assumes it is run on active NameNode
> ---
>
> Key: HDFS-13689
> URL: https://issues.apache.org/jira/browse/HDFS-13689
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs, namenode
>Reporter: Erik Krogen
>Priority: Major
>
> {{NameNodeRpcServer#getEditsFromTxid}} currently decides which transactions 
> are able to be served, i.e. which transactions are durable, using the 
> following logic:
> {code}
> long syncTxid = log.getSyncTxId();
> // If we haven't synced anything yet, we can only read finalized
> // segments since we can't reliably determine which txns in in-progress
> // segments have actually been committed (e.g. written to a quorum of 
> JNs).
> // If we have synced txns, we can definitely read up to syncTxid since
> // syncTxid is only updated after a transaction is committed to all
> // journals. (In-progress segments written by old writers are already
> // discarded for us, so if we read any in-progress segments they are
> // guaranteed to have been written by this NameNode.)
> boolean readInProgress = syncTxid > 0;
> {code}
> This assumes that the NameNode serving this request is the current 
> writer/active NameNode, which may not be true in the ObserverNode situation. 
> Since {{selectInputStreams}} now has an {{onlyDurableTxns}} flag which, if 
> enabled, returns only durable/committed transactions, we can leverage it to 
> provide the same functionality. We should utilize this to avoid consistency 
> issues when serving this request from the ObserverNode.
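
For illustration, a hedged sketch of what leveraging that flag could look like; 
the exact {{selectInputStreams}} overload and surrounding plumbing are 
assumptions, not verified against trunk:
{code:java}
// Hypothetical sketch, not the actual NameNodeRpcServer code.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hdfs.server.namenode.EditLogInputStream;
import org.apache.hadoop.hdfs.server.namenode.FSEditLog;

class DurableEditsSketch {
  /** Collect edit streams from fromTxid that are durable on a quorum of JNs. */
  static List<EditLogInputStream> durableStreams(FSEditLog log, long fromTxid)
      throws IOException {
    List<EditLogInputStream> streams = new ArrayList<>();
    // onlyDurableTxns=true replaces the syncTxid heuristic, so the result is
    // correct on an Observer as well as on the active NameNode.
    log.selectInputStreams(streams, fromTxid,
        /* inProgressOk */ true, /* onlyDurableTxns */ true);
    return streams;
  }
}
{code}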



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2393) HDDS-1847 broke some unit tests

2019-10-31 Thread Chris Teoh (Jira)
Chris Teoh created HDDS-2393:


 Summary: HDDS-1847 broke some unit tests
 Key: HDDS-2393
 URL: https://issues.apache.org/jira/browse/HDDS-2393
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Chris Teoh


Looks like this commit breaks {{TestKeyManagerImpl}} in {{setUp()}} and 
{{cleanup()}}. Run {{TestKeyManagerImpl#testListStatus()}} to reproduce it 
reliably. I believe other tests could be broken by this as well.

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.getSpnegoPrincipal(StorageContainerManagerHttpServer.java:74)
at 
org.apache.hadoop.hdds.server.BaseHttpServer.<init>(BaseHttpServer.java:81)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.<init>(StorageContainerManagerHttpServer.java:36)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:330)
at org.apache.hadoop.hdds.scm.TestUtils.getScm(TestUtils.java:544)
at 
org.apache.hadoop.ozone.om.TestKeyManagerImpl.setUp(TestKeyManagerImpl.java:150)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{code}

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.ozone.om.TestKeyManagerImpl.cleanup(TestKeyManagerImpl.java:176)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14927) RBF: Add metrics for async callers thread pool

2019-10-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964407#comment-16964407
 ] 

Íñigo Goiri commented on HDFS-14927:


I don't think you need to catch the exception and then throw it.
Just having the finally should be enough.
The exception will just surface.
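
In code terms, the suggestion is roughly the following; the names here are 
hypothetical stand-ins, not the actual patch:
{code:java}
// No catch-and-rethrow is needed: the finally block still runs and any
// exception from call.run() surfaces to the caller unchanged.
interface CallerMetrics {
  void incrementActiveCallers();
  void decrementActiveCallers();
}

class FanoutCall {
  static void invokeWithMetrics(CallerMetrics metrics, Runnable call) {
    metrics.incrementActiveCallers();
    try {
      call.run();
    } finally {
      metrics.decrementActiveCallers();
    }
  }
}
{code}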

> RBF: Add metrics for async callers thread pool
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
> Attachments: HDFS-14927.001.patch, HDFS-14927.002.patch, 
> HDFS-14927.003.patch, HDFS-14927.004.patch, HDFS-14927.005.patch, 
> HDFS-14927.006.patch, HDFS-14927.007.patch, HDFS-14927.008.patch
>
>
> It would be good to add some monitoring on the async caller thread pool that 
> handles fan-out RPC client requests, so we know the utilization and when to 
> bump up dfs.federation.router.client.thread-size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-14443) Throwing RemoteException in the time of Read Operation

2019-10-31 Thread Konstantin Shvachko (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko resolved HDFS-14443.

Resolution: Not A Problem

Resolving as not a problem. Please reopen if it is still an issue.

> Throwing RemoteException in the time of Read Operation
> --
>
> Key: HDFS-14443
> URL: https://issues.apache.org/jira/browse/HDFS-14443
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ranith Sardar
>Priority: Major
>
> 2019-04-19 20:54:59,178 DEBUG 
> org.apache.hadoop.io.retry.RetryInvocationHandler: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state observer. Visit 
> [https://s.apache.org/sbnn-error]
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1990)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1443)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.msync(NameNodeRpcServer.java:1372)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.msync(ClientNamenodeProtocolServerSideTranslatorPB.java:1929)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:531)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:862)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2791)
>  , while invoking $Proxy5.getFileInfo over 
> [host-*-*-*-*/*.*.*.*:6*5,host-*-*-*-*/*.*.*.*:**,host-*-*-*-*/*.*.*.*:6**5]. 
> Trying to failover immediately.
>  
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state observer. Visit 
> [https://s.apache.org/sbnn-error]
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-14020) Emulate Observer node falling far behind the Active

2019-10-31 Thread Konstantin Shvachko (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko resolved HDFS-14020.

Resolution: Duplicate

Resolving as duplicate since HDFS-13873 introduced {{testObserverFallBehind()}} 
in {{TestMultiObserverNode}}, which serves the purpose. This has also already 
been tested on live clusters.

> Emulate Observer node falling far behind the Active
> ---
>
> Key: HDFS-14020
> URL: https://issues.apache.org/jira/browse/HDFS-14020
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Sherwood Zheng
>Assignee: Sherwood Zheng
>Priority: Major
>
> Emulate Observer node falling far behind the Active. Ensure readers switch 
> over
> to another Observer instead of waiting for the lagging Observer to catch up. 
> If
> there is only a single Observer, it should fall back to the Active.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14927) RBF: Add metrics for async callers thread pool

2019-10-31 Thread Leon Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leon Gao updated HDFS-14927:

Attachment: HDFS-14927.008.patch

> RBF: Add metrics for async callers thread pool
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Minor
> Attachments: HDFS-14927.001.patch, HDFS-14927.002.patch, 
> HDFS-14927.003.patch, HDFS-14927.004.patch, HDFS-14927.005.patch, 
> HDFS-14927.006.patch, HDFS-14927.007.patch, HDFS-14927.008.patch
>
>
> It would be good to add some monitoring on the async caller thread pool that 
> handles fan-out RPC client requests, so we know the utilization and when to 
> bump up dfs.federation.router.client.thread-size.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent

2019-10-31 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964390#comment-16964390
 ] 

Anu Engineer commented on HDDS-1847:


Interesting. [~chris.t...@gmail.com], can you please take a look when you get a 
chance?

> Datanode Kerberos principal and keytab config key looks inconsistent
> 
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Eric Yang
>Assignee: Chris Teoh
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file |
> | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file |
> | hdds.datanode.http.kerberos.principal | Datanode spnego principal |
> | dfs.datanode.kerberos.principal | Datanode service principal |
> | dfs.datanode.keytab.file | Datanode service keytab file |
> The prefix are very different for each of the datanode configuration.  It 
> would be nice to have some consistency for datanode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent

2019-10-31 Thread Siyao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964374#comment-16964374
 ] 

Siyao Meng commented on HDDS-1847:
--

Looks like this commit breaks {{TestKeyManagerImpl}} in {{setUp()}} and 
{{cleanup()}}. Run {{TestKeyManagerImpl#testListStatus()}} to reproduce it 
reliably. I believe other tests could be broken by this as well.

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.getSpnegoPrincipal(StorageContainerManagerHttpServer.java:74)
at 
org.apache.hadoop.hdds.server.BaseHttpServer.<init>(BaseHttpServer.java:81)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerHttpServer.<init>(StorageContainerManagerHttpServer.java:36)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:330)
at org.apache.hadoop.hdds.scm.TestUtils.getScm(TestUtils.java:544)
at 
org.apache.hadoop.ozone.om.TestKeyManagerImpl.setUp(TestKeyManagerImpl.java:150)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{code}

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.ozone.om.TestKeyManagerImpl.cleanup(TestKeyManagerImpl.java:176)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
{code}

> Datanode Kerberos principal and keytab config key looks inconsistent
> 
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Eric Yang
>Assignee: Chris Teoh
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone 

[jira] [Updated] (HDDS-2392) Fix TestScmSafeMode#testSCMSafeModeRestrictedOp

2019-10-31 Thread Hanisha Koneru (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hanisha Koneru updated HDDS-2392:
-
Description: 
After the ratis upgrade (HDDS-2340), TestScmSafeMode#testSCMSafeModeRestrictedOp 
fails because the DNs fail to restart XceiverServerRatis.

RaftServer#start() fails with the following exception:
{code:java}
java.io.IOException: java.lang.IllegalStateException: Not started
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
at 
org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
at 
org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:421)
at 
org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215)
at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110)
at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Not started
at 
org.apache.ratis.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:504)
at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.getPort(ServerImpl.java:176)
at 
org.apache.ratis.grpc.server.GrpcService.lambda$new$2(GrpcService.java:143)
at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
at 
org.apache.ratis.grpc.server.GrpcService.getInetSocketAddress(GrpcService.java:182)
at 
org.apache.ratis.server.impl.RaftServerImpl.lambda$new$0(RaftServerImpl.java:84)
at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
at 
org.apache.ratis.server.impl.RaftServerImpl.getPeer(RaftServerImpl.java:136)
at 
org.apache.ratis.server.impl.RaftServerMetrics.<init>(RaftServerMetrics.java:70)
at 
org.apache.ratis.server.impl.RaftServerMetrics.getRaftServerMetrics(RaftServerMetrics.java:62)
at 
org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:119)
at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
{code}

  was:
After ratis upgrade (HDDS-2340), TestScmSafeMode#testSCMSafeModeRestrictedOp 
fails as the DNs fail to restart XceiverServerRatis. 
RaftServer#start() fails with following exception:
{code:java}
java.io.IOException: java.lang.IllegalStateException: Not 
startedjava.io.IOException: java.lang.IllegalStateException: Not started at 
org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54) at 
org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61) at 
org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70) at 
org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284) 
at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296) 
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:421)
 at 
org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215)
 at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110)
 at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)Caused by: 
java.lang.IllegalStateException: Not started at 
org.apache.ratis.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:504)
 at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.getPort(ServerImpl.java:176)
 at org.apache.ratis.grpc.server.GrpcService.lambda$new$2(GrpcService.java:143) 
at 

[jira] [Created] (HDDS-2392) Fix TestScmSafeMode#testSCMSafeModeRestrictedOp

2019-10-31 Thread Hanisha Koneru (Jira)
Hanisha Koneru created HDDS-2392:


 Summary: Fix TestScmSafeMode#testSCMSafeModeRestrictedOp
 Key: HDDS-2392
 URL: https://issues.apache.org/jira/browse/HDDS-2392
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Hanisha Koneru


After the ratis upgrade (HDDS-2340), TestScmSafeMode#testSCMSafeModeRestrictedOp 
fails because the DNs fail to restart XceiverServerRatis. 
RaftServer#start() fails with the following exception:
{code:java}
java.io.IOException: java.lang.IllegalStateException: Not started
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:421)
at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215)
at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110)
at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Not started
at org.apache.ratis.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:504)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.getPort(ServerImpl.java:176)
at org.apache.ratis.grpc.server.GrpcService.lambda$new$2(GrpcService.java:143)
at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
at org.apache.ratis.grpc.server.GrpcService.getInetSocketAddress(GrpcService.java:182)
at org.apache.ratis.server.impl.RaftServerImpl.lambda$new$0(RaftServerImpl.java:84)
at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
at org.apache.ratis.server.impl.RaftServerImpl.getPeer(RaftServerImpl.java:136)
at org.apache.ratis.server.impl.RaftServerMetrics.<init>(RaftServerMetrics.java:70)
at org.apache.ratis.server.impl.RaftServerMetrics.getRaftServerMetrics(RaftServerMetrics.java:62)
at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:119)
at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14948) Improve HttpFS Server

2019-10-31 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964354#comment-16964354
 ] 

Wei-Chiu Chuang commented on HDFS-14948:


[~smeng] and other contributors added a number of REST APIs that were missing 
compared to WebHDFS.
But I'm pretty sure they mostly landed only in trunk or the 3.x lines.

> Improve HttpFS Server
> -
>
> Key: HDFS-14948
> URL: https://issues.apache.org/jira/browse/HDFS-14948
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs
>Reporter: Kihwal Lee
>Assignee: Ahmed Hussein
>Priority: Major
>
> We see increasing use of HttpFS as a compatibility bridge and also as a 
> bridge between different security domains. As it gains more users, people are 
> finding missing pieces and bugs in it. There are already efforts to tackle 
> some of these issues, and this jira aims to make ongoing and future work more 
> coherent. I do not really want to make it an umbrella jira, but rather a 
> place for recording all related work. That way, one can easily figure out 
> what is missing in their version of HttpFS and what to backport, if necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2384) Large chunks during write can have memory pressure on DN with multiple clients

2019-10-31 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964348#comment-16964348
 ] 

Anu Engineer commented on HDDS-2384:


Thank you for flagging this issue. I think this is a hard problem to solve in 
the current architecture. I would like to explore some possibilities for how we 
can solve it.

1. We add support for a buffer pool inside the datanode. A buffer pool would be 
a large chunk of memory that the datanode pins and internally treats as a set 
of buffers usable for I/O. When we read or write data, we would also go through 
this buffer pool. That way, we can cap the maximum committed memory used on the 
data path. 

2. To do that, we would need the ability to read data not in 16 MB chunks but 
in smaller sizes, perhaps 8 KB (assuming the page size in the buffer pool is 
going to be 8 KB).

3. The advantage of such an approach is that we read only as much data as we 
have memory for, though the network layer might still have to buffer this data.

4. This also lets us push back against a client that is sending, or trying to 
read, too much data from the datanode at any given time. 

Question: Do you think such a change would address this issue? If you have 
other suggestions, I would love to hear them. Once more, thank you for flagging 
this issue.
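
For illustration, one possible shape of point 1 above; a minimal, hedged sketch 
assuming 8 KB pages, with all names illustrative rather than actual datanode 
code:
{code:java}
// Hypothetical sketch of a pinned buffer pool. A blocking acquire() doubles
// as the push-back mechanism described in point 4.
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IoBufferPool {
  private final BlockingQueue<ByteBuffer> pages;

  public IoBufferPool(int numPages, int pageSize) {
    pages = new ArrayBlockingQueue<>(numPages);
    for (int i = 0; i < numPages; i++) {
      pages.add(ByteBuffer.allocateDirect(pageSize)); // pinned once, reused
    }
  }

  /** Blocks when the pool is exhausted: natural push-back on greedy clients. */
  public ByteBuffer acquire() throws InterruptedException {
    return pages.take();
  }

  public void release(ByteBuffer page) {
    page.clear();
    pages.offer(page);
  }
}
{code}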


> Large chunks during write can have memory pressure on DN with multiple clients
> --
>
> Key: HDDS-2384
> URL: https://issues.apache.org/jira/browse/HDDS-2384
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Anu Engineer
>Priority: Major
>  Labels: performance
>
> During large file writes, it ends up writing {{16 MB}} chunks.  
> https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L691
> In large clusters, hundreds of clients may connect to a DN. In such cases, 
> depending on the incoming write workload, memory load on the DN can increase 
> significantly. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14884) Add sanity check that zone key equals feinfo key while setting Xattrs

2019-10-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964345#comment-16964345
 ] 

Hadoop QA commented on HDFS-14884:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 22m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 
 6s{color} | {color:green} branch-2 passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
15s{color} | {color:red} hadoop-hdfs in branch-2 failed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green} branch-2 passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
29s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
55s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
3s{color} | {color:green} branch-2 passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
16s{color} | {color:red} hadoop-hdfs in branch-2 failed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} branch-2 passed with JDK v1.8.0_222 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
12s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_95. 
{color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 12s{color} 
| {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
8s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
13s{color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_95. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 32s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}116m 23s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.balancer.TestBalancerRPCDelay |
|   | hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:f555aa740b5 |
| JIRA Issue | HDFS-14884 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12984512/HDFS-14884-branch-2.001.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 4ca513ad11a9 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-2 / a36dbe6 |
| maven | 

[jira] [Assigned] (HDDS-2384) Large chunks during write can have memory pressure on DN with multiple clients

2019-10-31 Thread Anu Engineer (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer reassigned HDDS-2384:
--

Assignee: Anu Engineer

> Large chunks during write can have memory pressure on DN with multiple clients
> --
>
> Key: HDDS-2384
> URL: https://issues.apache.org/jira/browse/HDDS-2384
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Anu Engineer
>Priority: Major
>  Labels: performance
>
> During large file writes, data ends up being written in {{16 MB}} chunks.  
> https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L691
> In large clusters, hundreds of clients may connect to a DN. In such cases, 
> depending on the incoming write workload, memory load on the DN can increase 
> significantly. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-31 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964315#comment-16964315
 ] 

Hudson commented on HDFS-14920:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17590 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17590/])
HDFS-14920. Erasure Coding: Decommission may hang If one or more (ayushsaxena: 
rev 9d25ae7669eed1a047578b574f42bd121b445a3c)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingReconstructionBlocks.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/NumberReplicas.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommissionWithStriped.java


> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch, 
> HDFS-14920.003.patch, HDFS-14920.004.patch, HDFS-14920.005.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and cluster logs, we guess it happens in 
> the following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is from 
> datanode dn0, b1 is from datanode dn1, etc.
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out 
> of service, so we need to reconstruct and create an ErasureCodingWork to do 
> it; in the ErasureCodingWork, additionalReplRequired is 4.
> # Because hasAllInternalBlocks is false, it will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send a 
> BlockECReconstructionInfo task to the Datanode.
> # The DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> There is a problem, as follows, in BlockManager.java#scheduleReconstruction:
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should happen first, and then replication for decommissioning. 
> Because numReplicas.decommissioning() is 4 and additionalReplRequired is 4, 
> that's wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled 
> as expected. After that, decommission goes on.
>  
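
For reference, a minimal arithmetic sketch of the accounting described in the 
quoted report, using hypothetical local variables rather than the actual 
BlockManager fields:
{code:java}
// Scenario above: RS-6-3, so 9 internal blocks are expected.
int expected = 9;             // data + parity blocks
int live = 5;                 // live replicas (b5..b8 plus the re-replicated b0)
int decommissioning = 4;      // b0..b3, but b0 also has a live replica
int decommissioningOnly = 3;  // decommissioning replicas with no live copy

int additionalReplRequired = expected - live;  // 4
// Buggy check: 4 - 4 - 0 is not > 0, so the subtraction is skipped, 4
// targets are scheduled, and the DataNode rejects the task (4 > 3 parity).
// Fixed accounting: count only replicas that exist solely on
// decommissioning nodes.
if (additionalReplRequired - decommissioningOnly > 0) {
  additionalReplRequired -= decommissioningOnly;  // 4 - 3 = 1, as expected
}
{code}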



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org

[jira] [Updated] (HDFS-14936) Add getNumOfChildren() for interface InnerNode

2019-10-31 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-14936:

Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Add getNumOfChildren() for interface InnerNode
> --
>
> Key: HDFS-14936
> URL: https://issues.apache.org/jira/browse/HDFS-14936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: HDFS-14936.001.patch, HDFS-14936.002.patch, 
> HDFS-14936.003.patch
>
>
> In the current code, the InnerNode subclasses InnerNodeImpl and 
> DFSTopologyNodeImpl both have getNumOfChildren().
> So add getNumOfChildren() to the interface InnerNode and remove the 
> unnecessary getNumOfChildren() in DFSTopologyNodeImpl.
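
A schematic of the change, with simplified signatures rather than the full 
Hadoop interfaces:
{code:java}
interface InnerNode {
  // Declared once on the interface so all implementations share the contract.
  int getNumOfChildren();
}

class InnerNodeImpl implements InnerNode {
  private final java.util.List<Object> children = new java.util.ArrayList<>();

  @Override
  public int getNumOfChildren() {
    return children.size();
  }
}

// DFSTopologyNodeImpl can now simply inherit getNumOfChildren() from
// InnerNodeImpl instead of redeclaring it.
class DFSTopologyNodeImpl extends InnerNodeImpl {
}
{code}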



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent

2019-10-31 Thread Anu Engineer (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer resolved HDDS-1847.

Fix Version/s: 0.5.0
   Resolution: Fixed

I have committed this patch to the master branch. 

> Datanode Kerberos principal and keytab config key looks inconsistent
> 
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Eric Yang
>Assignee: Chris Teoh
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file |
> | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file |
> | hdds.datanode.http.kerberos.principal | Datanode spnego principal |
> | dfs.datanode.kerberos.principal | Datanode service principal |
> | dfs.datanode.keytab.file | Datanode service keytab file |
> The prefixes are very different for each of the datanode configurations.  It 
> would be nice to have some consistency for the datanode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1847?focusedWorklogId=336990=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-336990
 ]

ASF GitHub Bot logged work on HDDS-1847:


Author: ASF GitHub Bot
Created on: 31/Oct/19 18:32
Start Date: 31/Oct/19 18:32
Worklog Time Spent: 10m 
  Work Description: anuengineer commented on issue #1678: HDDS-1847: 
Datanode Kerberos principal and keytab config key looks inconsistent
URL: https://github.com/apache/hadoop/pull/1678#issuecomment-548510969
 
 
   I have committed this patch into the hadoop-ozone branch, not apache:trunk. 
   
   Here is the commit info. 
   
   commit 8527a9d9ceb0e1b2ba3bfc8ebc06e7589135f7f3 (HEAD -> master, 
origin/master, origin/HEAD)
   Author: Anu Engineer 
   Date:   Thu Oct 31 11:19:54 2019 -0700
   
   HDDS-1847: Datanode Kerberos principal and keytab config key looks 
inconsistent
   Contributed by christeoh.
   
   
   @christeoh  Thank you for the contribution. @macroadster  Thanks for the 
comments and for filing the JIRA.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 336990)
Time Spent: 1h  (was: 50m)

> Datanode Kerberos principal and keytab config key looks inconsistent
> 
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Eric Yang
>Assignee: Chris Teoh
>Priority: Major
>  Labels: newbie, pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file |
> | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file |
> | hdds.datanode.http.kerberos.principal | Datanode spnego principal |
> | dfs.datanode.kerberos.principal | Datanode service principal |
> | dfs.datanode.keytab.file | Datanode service keytab file |
> The prefixes are very different for each of the datanode configurations.  It 
> would be nice to have some consistency for the datanode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1847?focusedWorklogId=336991=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-336991
 ]

ASF GitHub Bot logged work on HDDS-1847:


Author: ASF GitHub Bot
Created on: 31/Oct/19 18:32
Start Date: 31/Oct/19 18:32
Worklog Time Spent: 10m 
  Work Description: anuengineer commented on pull request #1678: HDDS-1847: 
Datanode Kerberos principal and keytab config key looks inconsistent
URL: https://github.com/apache/hadoop/pull/1678
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 336991)
Time Spent: 1h 10m  (was: 1h)

> Datanode Kerberos principal and keytab config key looks inconsistent
> 
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Eric Yang
>Assignee: Chris Teoh
>Priority: Major
>  Labels: newbie, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file |
> | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file |
> | hdds.datanode.http.kerberos.principal | Datanode spnego principal |
> | dfs.datanode.kerberos.principal | Datanode service principal |
> | dfs.datanode.keytab.file | Datanode service keytab file |
> The prefixes are very different for each of the datanode configurations.  It 
> would be nice to have some consistency for the datanode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-31 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-14920:

Fix Version/s: 3.2.2
   3.1.4
   3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk.
Thanx [~ferhui] for the contribution!!!

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch, 
> HDFS-14920.003.patch, HDFS-14920.004.patch, HDFS-14920.005.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and cluster logs, we guess it happens in 
> the following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is from 
> datanode dn0, b1 is from datanode dn1, etc.
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out 
> of service, so we need to reconstruct and create an ErasureCodingWork to do 
> it; in the ErasureCodingWork, additionalReplRequired is 4.
> # Because hasAllInternalBlocks is false, it will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send a 
> BlockECReconstructionInfo task to the Datanode.
> # The DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> There is a problem, as follows, in BlockManager.java#scheduleReconstruction:
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should happen first, and then replication for decommissioning. 
> Because numReplicas.decommissioning() is 4 and additionalReplRequired is 4, 
> that's wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled 
> as expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2375) Refactor BlockOutputStream to allow flexible buffering

2019-10-31 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated HDDS-2375:
-
Description: 
In HDDS-2331, we found that the Ozone client allocates a ByteBuffer with chunk 
size (e.g. 16 MB) to store data, regardless of the actual data size.  The 
ByteBuffer will create a byte[] of chunk size.  When the ByteBuffer is wrapped 
into a ByteString, the byte[] remains in the ByteString.

As a result, when the actual data size is small (e.g. 1 MB), a lot of memory 
space (15 MB) is wasted.

In this JIRA, we refactor BlockOutputStream so that the buffering becomes more 
flexible.  In a later JIRA (HDDS-2386), we implement a chunk buffer using a 
list of smaller buffers which are allocated only if needed.

  was:
In HDDS-2331, we found that the Ozone client allocates a ByteBuffer with chunk 
size (e.g. 16 MB) to store data, regardless of the actual data size.  The 
ByteBuffer will create a byte[] of chunk size.  When the ByteBuffer is wrapped 
into a ByteString, the byte[] remains in the ByteString.

As a result, when the actual data size is small (e.g. 1 MB), a lot of memory 
space (15 MB) is wasted.

In this JIRA, we refactor BlockOutputStream so that the buffering becomes more 
flexible.  In a later JIRA, we could implement a chunk buffer using a list of 
smaller buffers which are allocated only if needed.


> Refactor BlockOutputStream to allow flexible buffering
> --
>
> Key: HDDS-2375
> URL: https://issues.apache.org/jira/browse/HDDS-2375
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In HDDS-2331, we found that the Ozone client allocates a ByteBuffer with 
> chunk size (e.g. 16 MB) to store data, regardless of the actual data size.  
> The ByteBuffer will create a byte[] of chunk size.  When the ByteBuffer is 
> wrapped into a ByteString, the byte[] remains in the ByteString.
> As a result, when the actual data size is small (e.g. 1 MB), a lot of memory 
> space (15 MB) is wasted.
> In this JIRA, we refactor BlockOutputStream so that the buffering becomes 
> more flexible.  In a later JIRA (HDDS-2386), we implement a chunk buffer 
> using a list of smaller buffers which are allocated only if needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14936) Add getNumOfChildren() for interface InnerNode

2019-10-31 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964301#comment-16964301
 ] 

Hudson commented on HDFS-14936:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17589 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17589/])
HDFS-14936. Add getNumOfChildren() for interface InnerNode. Contributed 
(ayushsaxena: rev d9fbedc4ae41d3dc688cf6b697f0fb46a28b47c5)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/net/TestNetworkTopology.java
* (edit) 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/InnerNodeImpl.java
* (edit) 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/InnerNode.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/net/DFSTopologyNodeImpl.java


> Add getNumOfChildren() for interface InnerNode
> --
>
> Key: HDFS-14936
> URL: https://issues.apache.org/jira/browse/HDFS-14936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-14936.001.patch, HDFS-14936.002.patch, 
> HDFS-14936.003.patch
>
>
> In the current code, the InnerNode subclasses InnerNodeImpl and 
> DFSTopologyNodeImpl both have getNumOfChildren().
> So add getNumOfChildren() to the interface InnerNode and remove the 
> unnecessary getNumOfChildren() in DFSTopologyNodeImpl.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2386) Implement incremental ChunkBuffer

2019-10-31 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964298#comment-16964298
 ] 

Tsz-wo Sze commented on HDDS-2386:
--

o2386_20191031b.patch: fixes test failures.

> Implement incremental ChunkBuffer
> -
>
> Key: HDDS-2386
> URL: https://issues.apache.org/jira/browse/HDDS-2386
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: o2386_20191030.patch, o2386_20191031b.patch
>
>
> HDDS-2375 introduces a ChunkBuffer for flexible buffering. In this JIRA, we 
> implement ChunkBuffer with incremental buffering so that memory is allocated 
> incrementally.
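
As a rough illustration of the direction described above (not the actual 
patch), a chunk buffer can be backed by a list of small buffers that are only 
allocated when data actually arrives; the names and sizes below are 
assumptions:
{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class IncrementalChunkBuffer {
  private final int increment;  // e.g. 1 MB per slice
  private final int limit;      // the chunk size, e.g. 16 MB
  private final List<ByteBuffer> buffers = new ArrayList<>();

  public IncrementalChunkBuffer(int increment, int limit) {
    this.increment = increment;
    this.limit = limit;
  }

  public void put(byte b) {
    ByteBuffer last = buffers.isEmpty()
        ? null : buffers.get(buffers.size() - 1);
    if (last == null || !last.hasRemaining()) {
      if (buffers.size() * increment >= limit) {
        throw new IllegalStateException("chunk is full");
      }
      last = ByteBuffer.allocate(increment);  // allocated only on demand
      buffers.add(last);
    }
    last.put(b);
  }

  /** Memory actually committed, as opposed to the full chunk size. */
  public int allocatedBytes() {
    return buffers.size() * increment;
  }
}
{code}
With {{increment}} = 1 MB and {{limit}} = 16 MB, writing 1 MB of data commits 
only 1 MB instead of the full 16 MB chunk.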



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2388) Teragen test failure due to OM exception

2019-10-31 Thread Bharat Viswanadham (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964297#comment-16964297
 ] 

Bharat Viswanadham commented on HDDS-2388:
--

If the WAL has been cleared, then we get this error. Not sure whether it is a 
critical error for Recon.

Tagging [~avijayan].

Shashi, if this is causing a test failure, can you provide more info? (As far 
as I can see, this API is only used by Recon.)
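
For what it is worth, a defensive sketch around the call shown in the stack 
trace quoted below. The RocksDB API ({{getUpdatesSince}} throwing 
{{RocksDBException}}) is real; the fall-back behavior is only an assumption 
for illustration, not the actual Recon logic:
{code:java}
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.TransactionLogIterator;

public final class DeltaUpdates {
  private DeltaUpdates() { }

  /**
   * Returns null when the WAL no longer holds seq; the caller could then
   * fall back to a full snapshot instead of treating this as fatal.
   */
  public static TransactionLogIterator tryGetUpdatesSince(RocksDB db,
      long seq) {
    try {
      return db.getUpdatesSince(seq);
    } catch (RocksDBException e) {
      return null;
    }
  }
}
{code}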

> Teragen test failure due to OM exception
> 
>
> Key: HDDS-2388
> URL: https://issues.apache.org/jira/browse/HDDS-2388
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
>
> Ran into below exception while running teragen:
> {code:java}
> Unable to get delta updates since sequenceNumber 79932 
> org.rocksdb.RocksDBException: Requested sequence not yet written in the db
>   at org.rocksdb.RocksDB.getUpdatesSince(Native Method)
>   at org.rocksdb.RocksDB.getUpdatesSince(RocksDB.java:3587)
>   at 
> org.apache.hadoop.hdds.utils.db.RDBStore.getUpdatesSince(RDBStore.java:338)
>   at 
> org.apache.hadoop.ozone.om.OzoneManager.getDBUpdates(OzoneManager.java:3283)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.getOMDBUpdates(OzoneManagerRequestHandler.java:404)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handle(OzoneManagerRequestHandler.java:314)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:219)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:134)
>   at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>   at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:102)
>   at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2370) Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 'ozoneplugin' does not exist anymore

2019-10-31 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964296#comment-16964296
 ] 

Anu Engineer commented on HDDS-2370:


Yeap, I am fine with removing it. I don't think we are testing or running the 
plugin inside HDFS any more.

> Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 
> 'ozoneplugin' does not exist anymore
> --
>
> Key: HDDS-2370
> URL: https://issues.apache.org/jira/browse/HDDS-2370
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: documentation
>Reporter: luhuachao
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDDS-2370.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In RunningWithHDFS.md 
> {code:java}
> export 
> HADOOP_CLASSPATH=/opt/ozone/share/hadoop/ozoneplugin/hadoop-ozone-datanode-plugin.jar{code}
> ozone-hdfs/docker-compose.yaml
>  
> {code:java}
>   environment:
>  HADOOP_CLASSPATH: /opt/ozone/share/hadoop/ozoneplugin/*.jar
> {code}
> When I run HddsDatanodeService as a plugin in the HDFS datanode, it fails 
> with the error below; there is no constructor without parameters.
>  
> {code:java}
> 2019-10-21 21:38:56,391 ERROR datanode.DataNode 
> (DataNode.java:startPlugins(972)) - Unable to load DataNode plugins. 
> Specified list of plugins: org.apache.hadoop.ozone.HddsDatanodeService
> java.lang.RuntimeException: java.lang.NoSuchMethodException: 
> org.apache.hadoop.ozone.HddsDatanodeService.()
> {code}
> What I suspect is that ozone-0.5 does not support running as a plugin in the 
> HDFS datanode any more. If so, 
> why don't we remove the doc RunningWithHDFS.md? 
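
The quoted failure is the usual reflective-instantiation problem: the DataNode 
loads its plugins reflectively through a no-argument constructor. A standalone 
snippet (illustrative class names only) that reproduces the same 
{{NoSuchMethodException}}:
{code:java}
public class PluginLoadDemo {
  static class NoDefaultCtor {
    NoDefaultCtor(String arg) { }
  }

  public static void main(String[] args) throws Exception {
    // Throws java.lang.NoSuchMethodException: ...NoDefaultCtor.<init>()
    // because the class only declares a (String) constructor.
    NoDefaultCtor plugin =
        NoDefaultCtor.class.getDeclaredConstructor().newInstance();
  }
}
{code}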



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2386) Implement incremental ChunkBuffer

2019-10-31 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated HDDS-2386:
-
Attachment: o2386_20191031b.patch

> Implement incremental ChunkBuffer
> -
>
> Key: HDDS-2386
> URL: https://issues.apache.org/jira/browse/HDDS-2386
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: o2386_20191030.patch, o2386_20191031b.patch
>
>
> HDDS-2375 introduces a ChunkBuffer for flexible buffering. In this JIRA, we 
> implement ChunkBuffer with incremental buffering so that memory is allocated 
> incrementally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-31 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964286#comment-16964286
 ] 

Ayush Saxena commented on HDFS-14920:
-

Thanx [~ferhui] for the patch.
v005 LGTM +1

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch, 
> HDFS-14920.003.patch, HDFS-14920.004.patch, HDFS-14920.005.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and cluster logs, we guess it happens in 
> the following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is from 
> datanode dn0, b1 is from datanode dn1, etc.
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out 
> of service, so we need to reconstruct and create an ErasureCodingWork to do 
> it; in the ErasureCodingWork, additionalReplRequired is 4.
> # Because hasAllInternalBlocks is false, it will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send a 
> BlockECReconstructionInfo task to the Datanode.
> # The DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> There is a problem, as follows, in BlockManager.java#scheduleReconstruction:
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should happen first, and then replication for decommissioning. 
> Because numReplicas.decommissioning() is 4 and additionalReplRequired is 4, 
> that's wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled 
> as expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14936) Add getNumOfChildren() for interface InnerNode

2019-10-31 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964283#comment-16964283
 ] 

Ayush Saxena commented on HDFS-14936:
-

Committed to trunk.
Thanx [~leosun08] for the contribution, and [~elgoiri] and [~smeng] for the 
reviews!!!

> Add getNumOfChildren() for interface InnerNode
> --
>
> Key: HDFS-14936
> URL: https://issues.apache.org/jira/browse/HDFS-14936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-14936.001.patch, HDFS-14936.002.patch, 
> HDFS-14936.003.patch
>
>
> In the current code, the InnerNode subclasses InnerNodeImpl and 
> DFSTopologyNodeImpl both have getNumOfChildren().
> So add getNumOfChildren() to the interface InnerNode and remove the 
> unnecessary getNumOfChildren() in DFSTopologyNodeImpl.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14284) RBF: Log Router identifier when reporting exceptions

2019-10-31 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964271#comment-16964271
 ] 

Ayush Saxena commented on HDFS-14284:
-

Thanx [~hemanthboyina] for the patch.
As I said before, please check the RouterIOException by extracting it from the 
RemoteException. You can get the RemoteException as:
{code:java}
RemoteException re = LambdaTestUtils.intercept(RemoteException.class,
    "Cannot locate a registered namenode for ns0 from "
        + routerContext.getRouter().getRouterId(),
    () -> routerProtocol.addBlock(testPath, clientName, newBlock, null, 1,
        null, null));
RouterIOException rioe = (RouterIOException)
    re.unwrapRemoteException(RouterIOException.class);
rioe.getMessage(); // Have assertion checks for this and similarly for routerID
{code}

You can do something like this. To manually unwrap, you need to have a 
constructor with just {{String}} as a parameter; else it shall throw 
{{NoSuchMethodException}}. You can create one, set the message and RouterID 
into it, and then try.
I had a quick rough try and it worked. Give it a try; if you face issues, let 
me know and I will try to help write it.
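
A minimal sketch of the constructor mentioned above, assuming 
{{RouterIOException}} extends {{IOException}} as in the patch under review; 
how the router ID is stored alongside the message is left to the patch:
{code:java}
import java.io.IOException;

public class RouterIOException extends IOException {
  // unwrapRemoteException() instantiates the target class reflectively
  // through a (String) constructor, so one like this must exist.
  public RouterIOException(String message) {
    super(message);
  }
}
{code}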

> RBF: Log Router identifier when reporting exceptions
> 
>
> Key: HDFS-14284
> URL: https://issues.apache.org/jira/browse/HDFS-14284
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14284.001.patch, HDFS-14284.002.patch, 
> HDFS-14284.003.patch, HDFS-14284.004.patch, HDFS-14284.005.patch, 
> HDFS-14284.006.patch, HDFS-14284.007.patch, HDFS-14284.008.patch
>
>
> The typical setup is to use multiple Routers through 
> ConfiguredFailoverProxyProvider.
> In a regular HA Namenode setup, it is easy to know which NN was used.
> However, in RBF, any Router can be the one reporting the exception and it is 
> hard to know which was the one.
> We should have a way to identify which Router/Namenode was the one triggering 
> the exception.
> This would also apply with Observer Namenodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


