[jira] [Work logged] (HDFS-16345) Fix test cases fail in TestBlockStoragePolicy

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16345?focusedWorklogId=690461&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690461
 ]

ASF GitHub Bot logged work on HDFS-16345:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 07:36
Start Date: 04/Dec/21 07:36
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3696:
URL: https://github.com/apache/hadoop/pull/3696#discussion_r762397861



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestBlockStoragePolicy.java
##
@@ -1291,6 +1291,9 @@ public void testChooseTargetWithTopology() throws 
Exception {
 new HashSet(), 0, policy2, null);
 System.out.println(Arrays.asList(targets));
 Assert.assertEquals(3, targets.length);
+if (namenode != null) {
+  namenode.stop();
+}

Review comment:
   This should be in a finally block, e.g.:

   NameNode namenode = new NameNode(conf);
   try {
     // do something
   } finally {
     if (namenode != null) {
       namenode.stop();
     }
   }

   The reason being: if the test fails after creation of the namenode and the 
stop call isn't in a finally block, it won't be executed.
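The reviewer's point can be demonstrated with a stdlib-only sketch; `FakeNameNode` here is a hypothetical stand-in for the Hadoop `NameNode`. The `finally` block runs whether or not the guarded code throws, so `stop()` is reached even when the test body fails:

```java
// Minimal sketch of the try/finally cleanup pattern from the review comment.
// FakeNameNode is a hypothetical stand-in for the real NameNode class.
public class FinallyCleanupDemo {
    static class FakeNameNode {
        boolean stopped = false;
        void stop() { stopped = true; }
    }

    // Returns true if stop() ran, even when the body threw.
    static boolean runWithCleanup(boolean throwInBody) {
        FakeNameNode namenode = new FakeNameNode();
        try {
            if (throwInBody) {
                throw new RuntimeException("simulated test failure");
            }
        } catch (RuntimeException e) {
            // a failing assertion would surface here in a real test
        } finally {
            if (namenode != null) {
                namenode.stop();  // always executed, even on failure
            }
        }
        return namenode.stopped;
    }

    public static void main(String[] args) {
        System.out.println(runWithCleanup(true));   // cleanup ran despite the exception
    }
}
```

With the `stop()` call only inline after the assertions, as in the patch under review, a failed assertion would skip it and leak the NameNode's ports to later tests.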
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 690461)
Time Spent: 1h 10m  (was: 1h)

> Fix test cases fail in TestBlockStoragePolicy
> -
>
> Key: HDFS-16345
> URL: https://issues.apache.org/jira/browse/HDFS-16345
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> test class `TestBlockStoragePolicy` fails frequently with `BindException`, 
> which blocks all normal source code builds. We can improve it.
> [ERROR] Tests run: 26, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 49.295 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestBlockStoragePolicy 
> [ERROR] 
> testChooseTargetWithTopology(org.apache.hadoop.hdfs.TestBlockStoragePolicy) 
> Time elapsed: 0.551 s <<< ERROR! java.net.BindException: Problem binding to 
> [localhost:43947] java.net.BindException: Address already in use; For more 
> details see: http://wiki.apache.org/hadoop/BindException at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931) at 
> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:827) at 
> org.apache.hadoop.ipc.Server.bind(Server.java:657) at 
> org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:1352) at 
> org.apache.hadoop.ipc.Server.<init>(Server.java:3252) at 
> org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1062) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server.<init>(ProtobufRpcEngine2.java:468)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2.getServer(ProtobufRpcEngine2.java:371)
>  at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:853) at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.<init>(NameNodeRpcServer.java:466)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:860)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:766) 
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1017) 
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992) 
> at 
> org.apache.hadoop.hdfs.TestBlockStoragePolicy.testChooseTargetWithTopology(TestBlockStoragePolicy.java:1275)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  at 
> 

[jira] [Work logged] (HDFS-16338) Fix error configuration message in FSImage

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16338?focusedWorklogId=690460&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690460
 ]

ASF GitHub Bot logged work on HDFS-16338:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 07:30
Start Date: 04/Dec/21 07:30
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3684:
URL: https://github.com/apache/hadoop/pull/3684#discussion_r762397398



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSImage.java
##
@@ -275,6 +276,29 @@ public void testSaveAndLoadStripedINodeFile() throws 
IOException{
 }
   }
 
+  @Test
+  public void testImportCheckpoint() {
+Configuration conf = new Configuration();
+conf.set(DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_EDITS_DIR_KEY, "");
+MiniDFSCluster cluster = null;
+try {

Review comment:
   Can use try-with-resources for the cluster
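A stdlib-only sketch of the try-with-resources shape the reviewer suggests. `FakeCluster` is a hypothetical stand-in; the suggestion implies `MiniDFSCluster` is `AutoCloseable`, so its shutdown runs automatically even if the test body throws, replacing the manual `cluster = null; try { ... } finally { shutdown }` dance:

```java
// Stdlib-only sketch of the try-with-resources pattern from the review comment.
// FakeCluster is a hypothetical stand-in for MiniDFSCluster.
public class TryWithResourcesDemo {
    static final StringBuilder LOG = new StringBuilder();

    static class FakeCluster implements AutoCloseable {
        FakeCluster() { LOG.append("start;"); }
        void waitActive() { LOG.append("active;"); }
        @Override
        public void close() { LOG.append("shutdown;"); }  // runs automatically
    }

    static String run() {
        LOG.setLength(0);
        // The resource is closed when the block exits, normally or exceptionally.
        try (FakeCluster cluster = new FakeCluster()) {
            cluster.waitActive();
        }
        return LOG.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());  // start;active;shutdown;
    }
}
```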

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSImage.java
##
@@ -275,6 +276,29 @@ public void testSaveAndLoadStripedINodeFile() throws 
IOException{
 }
   }
 
+  @Test
+  public void testImportCheckpoint() {
+Configuration conf = new Configuration();
+conf.set(DFSConfigKeys.DFS_NAMENODE_CHECKPOINT_EDITS_DIR_KEY, "");
+MiniDFSCluster cluster = null;
+try {
+  cluster = new MiniDFSCluster.Builder(conf).build();
+  cluster.waitActive();
+  FSNamesystem fsn = cluster.getNamesystem();
+  FSImage fsImage= new FSImage(conf);
+  fsImage.doImportCheckpoint(fsn);
+  fail("Expect to throw IOException.");
+} catch (IOException e) {
+  GenericTestUtils.assertExceptionContains(
+  "Cannot import image from a checkpoint. "
+  + "\"dfs.namenode.checkpoint.edits.dir\" is not set.", 
e);

Review comment:
   Use LambdaTestUtils instead of try-catch-assert
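Hadoop's `LambdaTestUtils.intercept(clazz, text, callable)` runs the callable and fails unless it throws the expected exception type whose message contains the expected text. A simplified stdlib-only stand-in (not the Hadoop implementation) shows the shape the reviewer is asking for:

```java
import java.util.concurrent.Callable;

// Simplified stand-in for org.apache.hadoop.test.LambdaTestUtils.intercept:
// run the callable and assert it throws the expected type with matching text.
public class InterceptDemo {
    static <E extends Throwable> E intercept(Class<E> clazz, String contained,
                                             Callable<?> eval) throws Exception {
        try {
            eval.call();
        } catch (Throwable t) {
            if (clazz.isInstance(t) && t.getMessage() != null
                && t.getMessage().contains(contained)) {
                return clazz.cast(t);  // expected failure: return for further checks
            }
            throw new AssertionError("wrong exception: " + t, t);
        }
        throw new AssertionError("expected " + clazz.getSimpleName() + " was not thrown");
    }

    // Demonstrates replacing a try-catch-assert with a single intercept call.
    static boolean works() {
        try {
            intercept(IllegalStateException.class, "not set",
                () -> {
                    throw new IllegalStateException(
                        "\"dfs.namenode.checkpoint.edits.dir\" is not set.");
                });
            return true;   // intercept matched the expected exception
        } catch (Throwable t) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(works());
    }
}
```

This collapses the `fail(...)` plus `catch` plus `assertExceptionContains(...)` sequence into one declarative call.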






Issue Time Tracking
---

Worklog Id: (was: 690460)
Time Spent: 2h 20m  (was: 2h 10m)

> Fix error configuration message in FSImage
> --
>
> Key: HDFS-16338
> URL: https://issues.apache.org/jira/browse/HDFS-16338
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> `dfs.namenode.checkpoint.edits.dir` may be different from 
> `dfs.namenode.checkpoint.dir`; if `checkpointEditsDirs` is null or empty, the 
> error message should point at the edits dir configuration. We can fix it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16324) fix error log in BlockManagerSafeMode

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16324?focusedWorklogId=690459&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690459
 ]

ASF GitHub Bot logged work on HDFS-16324:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 07:26
Start Date: 04/Dec/21 07:26
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3661:
URL: https://github.com/apache/hadoop/pull/3661#discussion_r762397173



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockManagerSafeMode.java
##
@@ -41,7 +42,6 @@
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
-

Review comment:
   nit: revert this change






Issue Time Tracking
---

Worklog Id: (was: 690459)
Time Spent: 2h 50m  (was: 2h 40m)

> fix error log in BlockManagerSafeMode
> -
>
> Key: HDFS-16324
> URL: https://issues.apache.org/jira/browse/HDFS-16324
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> If `recheckInterval` is set to an invalid value, a warning is logged, but the 
> message does not seem quite right; we can improve it.






[jira] [Commented] (HDFS-13947) Review of DirectoryScanner Class

2021-12-03 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453298#comment-17453298
 ] 

Ayush Saxena commented on HDFS-13947:
-

Hey Folks,

Observed this while checking HDFS-16347. This patch changed the default value 
in DFSConfigKeys but not in hdfs-default.xml:

{code:java}
   DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY =
   "dfs.datanode.directoryscan.throttle.limit.ms.per.sec";
   public static final int
-  DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT = 1000;
+  DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT = -1;
{code}

Was that a miss, or was the change accidental? If we changed the default value, 
we should have put that in the release notes for others to know.
Let me know if it was intentional; if so, we can get HDFS-16347 in and update 
the release notes there.


> Review of DirectoryScanner Class
> 
>
> Key: HDFS-13947
> URL: https://issues.apache.org/jira/browse/HDFS-13947
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-13947.1.patch, HDFS-13947.2.patch, 
> HDFS-13947.3.patch, HDFS-13947.4.patch, HDFS-13947.5.patch
>
>
> Review of Directory Scanner.   Replaced a lot of code with Guava MultiMap.  
> Some general house cleaning and improved logging.  For performance, using 
> {{ArrayList}} instead of {{LinkedList}} where possible, especially since 
> these lists can be quite large a LinkedList will consume a lot of memory and 
> be slow to sort/iterate over.
> https://stackoverflow.com/questions/322715/when-to-use-linkedlist-over-arraylist-in-java






[jira] [Work logged] (HDFS-16347) Fix directory scan throttle default value

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16347?focusedWorklogId=690453&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690453
 ]

ASF GitHub Bot logged work on HDFS-16347:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 06:35
Start Date: 04/Dec/21 06:35
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3703:
URL: https://github.com/apache/hadoop/pull/3703#discussion_r76239



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -874,7 +874,7 @@
 
 
   dfs.datanode.directoryscan.throttle.limit.ms.per.sec
-  1000
+  -1

Review comment:
   This is not just a doc change; it will indeed change the default. Let me 
confirm on the original jira as well.






Issue Time Tracking
---

Worklog Id: (was: 690453)
Time Spent: 1h 10m  (was: 1h)

> Fix directory scan throttle default value
> -
>
> Key: HDFS-16347
> URL: https://issues.apache.org/jira/browse/HDFS-16347
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> `dfs.datanode.directoryscan.throttle.limit.ms.per.sec` was changed from 
> `1000` to `-1` by default in HDFS-13947; we can improve the doc.






[jira] [Work logged] (HDFS-16347) Fix directory scan throttle default value

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16347?focusedWorklogId=690452&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690452
 ]

ASF GitHub Bot logged work on HDFS-16347:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 06:34
Start Date: 04/Dec/21 06:34
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3703:
URL: https://github.com/apache/hadoop/pull/3703#discussion_r76239



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
##
@@ -874,7 +874,7 @@
 
 
   dfs.datanode.directoryscan.throttle.limit.ms.per.sec
-  1000
+  -1

Review comment:
   This is not just a doc change; it will indeed change the default.






Issue Time Tracking
---

Worklog Id: (was: 690452)
Time Spent: 1h  (was: 50m)

> Fix directory scan throttle default value
> -
>
> Key: HDFS-16347
> URL: https://issues.apache.org/jira/browse/HDFS-16347
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> `dfs.datanode.directoryscan.throttle.limit.ms.per.sec` was changed from 
> `1000` to `-1` by default in HDFS-13947; we can improve the doc.






[jira] [Work logged] (HDFS-16322) The NameNode implementation of ClientProtocol.truncate(...) can cause data loss.

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16322?focusedWorklogId=690449&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690449
 ]

ASF GitHub Bot logged work on HDFS-16322:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 06:28
Start Date: 04/Dec/21 06:28
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3705:
URL: https://github.com/apache/hadoop/pull/3705#discussion_r762392267



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
##
@@ -1100,20 +1100,29 @@ public void rename2(String src, String dst, 
Options.Rename... options)
   }
 
   @Override // ClientProtocol
-  public boolean truncate(String src, long newLength, String clientName)
-  throws IOException {
+  public boolean truncate(String src, long newLength, String clientName) 
throws IOException {

Review comment:
   nit: this is only a formatting change and is unrelated; please avoid it.

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
##
@@ -1100,20 +1100,29 @@ public void rename2(String src, String dst, 
Options.Rename... options)
   }
 
   @Override // ClientProtocol
-  public boolean truncate(String src, long newLength, String clientName)
-  throws IOException {
+  public boolean truncate(String src, long newLength, String clientName) 
throws IOException {
 checkNNStartup();
-stateChangeLog
-.debug("*DIR* NameNode.truncate: " + src + " to " + newLength);
+if(stateChangeLog.isDebugEnabled()) {
+  stateChangeLog.debug("*DIR* NameNode.truncate: " + src + " to " +
+  newLength);
+}
+CacheEntryWithPayload cacheEntry = 
RetryCache.waitForCompletion(retryCache, null);
+if (cacheEntry != null && cacheEntry.isSuccess()) {
+  return (boolean)cacheEntry.getPayload();
+}
+
 String clientMachine = getClientMachine();
+boolean ret = false;
 try {
-  return namesystem.truncate(
+  ret = namesystem.truncate(
   src, newLength, clientName, clientMachine, now());
 } finally {
+  RetryCache.setState(cacheEntry, true, ret);

Review comment:
   The finally block is executed in case of an exception as well, so we cannot 
hard-code `true` here. 
   Check other code, like `append`, for some reference & idea.
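The success-flag pattern the reviewer points to can be sketched stdlib-only (names here are hypothetical; `lastRecorded` stands in for the success bit passed to `RetryCache.setState`). The flag is set only after the operation completes, so the `finally` block records failure when an exception unwinds through it:

```java
// Sketch of the reviewer's point: finally runs on exceptions too, so the
// success value recorded there must come from a flag, not a hard-coded true.
public class RetryFlagDemo {
    static String lastRecorded;   // hypothetical stand-in for the retry-cache state

    static boolean truncate(boolean fail) {
        boolean success = false;
        boolean ret = false;
        try {
            if (fail) {
                throw new RuntimeException("truncate failed");
            }
            ret = true;
            success = true;       // only reached when the operation completed
            return ret;
        } catch (RuntimeException e) {
            return false;
        } finally {
            // Hard-coding `true` here would record a failed call as successful,
            // and a retried RPC would then wrongly skip re-execution.
            lastRecorded = success ? "success" : "failure";
        }
    }

    public static void main(String[] args) {
        truncate(true);
        System.out.println(lastRecorded);   // failure
        truncate(false);
        System.out.println(lastRecorded);   // success
    }
}
```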

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
##
@@ -1100,20 +1100,29 @@ public void rename2(String src, String dst, 
Options.Rename... options)
   }
 
   @Override // ClientProtocol
-  public boolean truncate(String src, long newLength, String clientName)
-  throws IOException {
+  public boolean truncate(String src, long newLength, String clientName) 
throws IOException {
 checkNNStartup();
-stateChangeLog
-.debug("*DIR* NameNode.truncate: " + src + " to " + newLength);
+if(stateChangeLog.isDebugEnabled()) {
+  stateChangeLog.debug("*DIR* NameNode.truncate: " + src + " to " +
+  newLength);
+}
+CacheEntryWithPayload cacheEntry = 
RetryCache.waitForCompletion(retryCache, null);

Review comment:
   I think we should have 
   `namesystem.checkOperation(OperationCategory.WRITE);`
   above this, like the other calls, and remove the check before the lock in 
FSNamesystem.

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java
##
@@ -1100,20 +1100,29 @@ public void rename2(String src, String dst, 
Options.Rename... options)
   }
 
   @Override // ClientProtocol
-  public boolean truncate(String src, long newLength, String clientName)
-  throws IOException {
+  public boolean truncate(String src, long newLength, String clientName) 
throws IOException {
 checkNNStartup();
-stateChangeLog
-.debug("*DIR* NameNode.truncate: " + src + " to " + newLength);
+if(stateChangeLog.isDebugEnabled()) {
+  stateChangeLog.debug("*DIR* NameNode.truncate: " + src + " to " +
+  newLength);
+}
+CacheEntryWithPayload cacheEntry = 
RetryCache.waitForCompletion(retryCache, null);
+if (cacheEntry != null && cacheEntry.isSuccess()) {
+  return (boolean)cacheEntry.getPayload();
+}
+
 String clientMachine = getClientMachine();
+boolean ret = false;
 try {
-  return namesystem.truncate(
+  ret = namesystem.truncate(
   src, newLength, clientName, clientMachine, now());
 } finally {
+  RetryCache.setState(cacheEntry, true, ret);
   metrics.incrFilesTruncated();
 }
+return ret;
   }
-
+  

Review comment:
   nit:
   unrelated change, revert!!





[jira] [Work logged] (HDFS-16370) Fix assert message for BlockInfo

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16370?focusedWorklogId=690448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690448
 ]

ASF GitHub Bot logged work on HDFS-16370:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 05:53
Start Date: 04/Dec/21 05:53
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3747:
URL: https://github.com/apache/hadoop/pull/3747#discussion_r762390295



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfo.java
##
@@ -146,7 +146,7 @@ BlockInfo getNext(int index) {
 BlockInfo info = (BlockInfo)triplets[index*3+2];
 assert info == null || info.getClass().getName().startsWith(
 BlockInfo.class.getName()) :
-"BlockInfo is expected at " + index*3;
+"BlockInfo is expected at " + (index*3+2);

Review comment:
   Same as above; can you pad some space between the values?

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfo.java
##
@@ -136,7 +136,7 @@ BlockInfo getPrevious(int index) {
 BlockInfo info = (BlockInfo)triplets[index*3+1];
 assert info == null ||
 info.getClass().getName().startsWith(BlockInfo.class.getName()) :
-"BlockInfo is expected at " + index*3;
+"BlockInfo is expected at " + (index*3+1);

Review comment:
   nit: Better to have some space around the values
   ```suggestion
   "BlockInfo is expected at " + (index * 3 + 1);
   ```
   






Issue Time Tracking
---

Worklog Id: (was: 690448)
Time Spent: 0.5h  (was: 20m)

> Fix assert message for BlockInfo
> 
>
> Key: HDFS-16370
> URL: https://issues.apache.org/jira/browse/HDFS-16370
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In both methods BlockInfo#getPrevious and BlockInfo#getNext, the assert 
> message is wrong. This may cause some misunderstanding and needs to be fixed.






[jira] [Work logged] (HDFS-16351) add path exception information in FSNamesystem

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16351?focusedWorklogId=690447&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690447
 ]

ASF GitHub Bot logged work on HDFS-16351:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 05:49
Start Date: 04/Dec/21 05:49
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3713:
URL: https://github.com/apache/hadoop/pull/3713#discussion_r762390099



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSNamesystem.java
##
@@ -120,6 +123,23 @@ public void testStartupSafemode() throws IOException {
   + "isInSafeMode still returned false",  fsn.isInSafeMode());
   }
 
+  @Test
+  public void testCheckAccess() throws IOException {
+Configuration conf = new Configuration();
+FSImage fsImage = Mockito.mock(FSImage.class);

Review comment:
   Remove this; one test is enough, we need not mock and try.

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSPermission.java
##
@@ -30,6 +30,7 @@
 import java.util.Map;
 import java.util.Random;
 
+import org.apache.hadoop.test.GenericTestUtils;

Review comment:
   The import order seems wrong;
   the import should be in the org.apache.hadoop block with the others.

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSPermission.java
##
@@ -260,6 +261,33 @@ private void createAndCheckPermission(OpType op, Path 
name, short umask,
 checkPermission(name, expectedPermission, delete);
   }
 
+  @Test
+  public void testFSNamesystemCheckAccess() throws Exception {
+Path testValidDir = new Path("/test1");
+Path testValidFile = new Path("/test1/file1");
+Path testInvalidPath = new Path("/test2");
+fs = FileSystem.get(conf);
+
+fs.mkdirs(testValidDir);
+fs.create(testValidFile);
+
+fs.access(testValidDir, FsAction.READ);
+fs.access(testValidFile, FsAction.READ);
+
+assertTrue(fs.exists(testValidDir));
+assertTrue(fs.exists(testValidFile));
+
+try {
+  fs.access(testInvalidPath, FsAction.READ);
+  fail("Failed to get expected FileNotFoundException");
+} catch (FileNotFoundException e) {
+  GenericTestUtils.assertExceptionContains(
+  "Path not found: " + testInvalidPath, e);
+} finally {
+  fs.delete(testValidDir, true);
+}
+  }
+

Review comment:
   This also tests normal fs.access, which isn't required; we only changed the 
exception, so we can test just that.
   Can use LambdaTestUtils for that rather than the present try-catch. 
Something like this should do:
   ```
 @Test
 public void testFSNamesystemCheckAccess() throws Exception {
   Path testInvalidPath = new Path("/test2");
   fs = FileSystem.get(conf);
   
   LambdaTestUtils.intercept(FileNotFoundException.class,
   "Path not found: " + testInvalidPath,
   () -> fs.access(testInvalidPath, FsAction.READ));
 }
   ```

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSPermission.java
##
@@ -289,7 +317,7 @@ public void testImmutableFsPermission() throws IOException {
 fs.setPermission(new Path("/"),
 FsPermission.createImmutable((short)0777));
   }
-  
+

Review comment:
   unrelated, revert






Issue Time Tracking
---

Worklog Id: (was: 690447)
Time Spent: 2h 10m  (was: 2h)

> add path exception information in FSNamesystem
> --
>
> Key: HDFS-16351
> URL: https://issues.apache.org/jira/browse/HDFS-16351
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> add path information in exception message to make message more clear in 
> FSNamesystem






[jira] [Work logged] (HDFS-16369) RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16369?focusedWorklogId=690442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690442
 ]

ASF GitHub Bot logged work on HDFS-16369:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 05:25
Start Date: 04/Dec/21 05:25
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on pull request #3745:
URL: https://github.com/apache/hadoop/pull/3745#issuecomment-985971627


   Merged, Thanx @goiri and @tomscut for the review!!!




Issue Time Tracking
---

Worklog Id: (was: 690442)
Time Spent: 1.5h  (was: 1h 20m)

> RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs
> ---
>
> Key: HDFS-16369
> URL: https://issues.apache.org/jira/browse/HDFS-16369
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> As of now, invokeAtAvailableNs retries only once if the default or the first 
> namespace is not available, despite having other namespaces available.
> Optimise to retry on all namespaces.






[jira] [Resolved] (HDFS-16369) RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs

2021-12-03 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HDFS-16369.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs
> ---
>
> Key: HDFS-16369
> URL: https://issues.apache.org/jira/browse/HDFS-16369
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> As of now, invokeAtAvailableNs retries only once if the default or the first 
> namespace is not available, despite having other namespaces available.
> Optimise to retry on all namespaces.






[jira] [Commented] (HDFS-16369) RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs

2021-12-03 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453290#comment-17453290
 ] 

Ayush Saxena commented on HDFS-16369:
-

Committed to trunk.

Thanx Everyone for the review!!!

> RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs
> ---
>
> Key: HDFS-16369
> URL: https://issues.apache.org/jira/browse/HDFS-16369
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> As of now, invokeAtAvailableNs retries only once if the default or the first 
> namespace is not available, despite having other namespaces available.
> Optimise to retry on all namespaces.






[jira] [Work logged] (HDFS-16369) RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16369?focusedWorklogId=690441&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690441
 ]

ASF GitHub Bot logged work on HDFS-16369:
-

Author: ASF GitHub Bot
Created on: 04/Dec/21 05:24
Start Date: 04/Dec/21 05:24
Worklog Time Spent: 10m 
  Work Description: ayushtkn merged pull request #3745:
URL: https://github.com/apache/hadoop/pull/3745


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 690441)
Time Spent: 1h 20m  (was: 1h 10m)

> RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs
> ---
>
> Key: HDFS-16369
> URL: https://issues.apache.org/jira/browse/HDFS-16369
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> As of now, invokeAtAvailableNs retries only once if the default or the first 
> namespace is not available, despite other namespaces being available.
> Optimise it to retry on all namespaces.






[jira] [Created] (HDFS-16371) Exclude slow disks when choosing volume

2021-12-03 Thread tomscut (Jira)
tomscut created HDFS-16371:
--

 Summary: Exclude slow disks when choosing volume
 Key: HDFS-16371
 URL: https://issues.apache.org/jira/browse/HDFS-16371
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: tomscut
Assignee: tomscut


Currently, the datanode can detect slow disks. When choosing a volume, we can 
exclude these slow disks according to some rules. This will prevent some slow 
disks from affecting the throughput of the whole datanode.
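One possible shape for such a rule, sketched with plain collections (the class and method names are invented for illustration and are not the actual datanode volume-choosing API): filter the currently flagged slow disks out of the candidate volumes, falling back to the full list if every disk is flagged.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SlowDiskAwareVolumeChooser {

    /** Drops volumes on slow disks; if all volumes are slow, keeps all of them. */
    public static List<String> excludeSlowVolumes(List<String> volumes,
                                                  Set<String> slowDisks) {
        List<String> healthy = volumes.stream()
                .filter(v -> !slowDisks.contains(v))
                .collect(Collectors.toList());
        // Never return an empty candidate list: a node whose disks are all
        // flagged slow must still be able to accept writes.
        return healthy.isEmpty() ? volumes : healthy;
    }

    public static void main(String[] args) {
        List<String> vols = List.of("/data1", "/data2", "/data3");
        System.out.println(excludeSlowVolumes(vols, Set.of("/data2")));
        // prints [/data1, /data3]
    }
}
```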






[jira] [Work logged] (HDFS-16370) Fix assert message for BlockInfo

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16370?focusedWorklogId=690265=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690265
 ]

ASF GitHub Bot logged work on HDFS-16370:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 19:07
Start Date: 03/Dec/21 19:07
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3747:
URL: https://github.com/apache/hadoop/pull/3747#issuecomment-985760522


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  1s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  35m 22s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 28s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   0m 57s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 27s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 31s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 23s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  25m 40s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 14s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 52s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m 39s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 331m 40s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 40s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 440m  7s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3747/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3747 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux aac445805a8d 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 
06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 8cc00d6045598c7dfee290975fa04ecf6438d371 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3747/1/testReport/ |
   | Max. process+thread count | 2118 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3747/1/console |
   | versions | 

[jira] [Commented] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453162#comment-17453162
 ] 

Hadoop QA commented on HDFS-16293:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
42s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
1s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  2m  
6s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
41s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m 
41s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m  
3s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
10s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
20s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
22m 53s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
39s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
10s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 32m 
19s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  5m 
38s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
27s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
 2s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m 
20s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  5m 20s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/748/artifact/out/diff-compile-javac-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt{color}
 | {color:red} hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 generated 5 new + 646 unchanged 
- 0 fixed = 651 total (was 646) {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m  
1s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  5m  1s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/748/artifact/out/diff-compile-javac-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt{color}
 | {color:red} 
hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 with 
JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 generated 5 new + 623 
unchanged - 0 fixed = 628 total (was 623) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m  5s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/748/artifact/out/diff-checkstyle-hadoop-hdfs-project.txt{color}
 | {color:orange} 

[jira] [Work logged] (HDFS-16369) RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16369?focusedWorklogId=690203=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690203
 ]

ASF GitHub Bot logged work on HDFS-16369:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 17:30
Start Date: 03/Dec/21 17:30
Worklog Time Spent: 10m 
  Work Description: goiri commented on a change in pull request #3745:
URL: https://github.com/apache/hadoop/pull/3745#discussion_r762121422



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/router/TestRouterRPCMultipleDestinationMountTableResolver.java
##
@@ -668,14 +674,16 @@ public void testInvokeAtAvailableNs() throws IOException {
 // Make one subcluster unavailable.
 MiniDFSCluster dfsCluster = cluster.getCluster();
 dfsCluster.shutdownNameNode(0);
+dfsCluster.shutdownNameNode(1);
 try {
   // Verify that #invokeAtAvailableNs works by calling #getServerDefaults.
   RemoteMethod method = new RemoteMethod("getServerDefaults");
   FsServerDefaults serverDefaults =
   rpcServer.invokeAtAvailableNs(method, FsServerDefaults.class);
   assertNotNull(serverDefaults);

Review comment:
   Yes, the flakiness is not ideal.
   Let's go with this.






Issue Time Tracking
---

Worklog Id: (was: 690203)
Time Spent: 1h 10m  (was: 1h)

> RBF: Fix the retry logic of RouterRpcServer#invokeAtAvailableNs
> ---
>
> Key: HDFS-16369
> URL: https://issues.apache.org/jira/browse/HDFS-16369
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> As of now, invokeAtAvailableNs retries only once if the default or the first 
> namespace is not available, despite other namespaces being available.
> Optimise it to retry on all namespaces.






[jira] [Resolved] (HDFS-16314) Support to make dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable

2021-12-03 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved HDFS-16314.
--
Fix Version/s: 3.4.0
   3.3.3
   Resolution: Fixed

Committed to trunk and branch-3.3. Thanks [~haiyang Hu] for your contribution!

> Support to make 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable
> -
>
> Key: HDFS-16314
> URL: https://issues.apache.org/jira/browse/HDFS-16314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Consider making 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable 
> to allow rapid rollback in case this feature (HDFS-16076) causes unexpected 
> problems in a production environment.






[jira] [Updated] (HDFS-16287) Support to make dfs.namenode.avoid.read.slow.datanode reconfigurable

2021-12-03 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-16287:
-
Fix Version/s: 3.3.3

Backported to branch-3.3 to backport HDFS-16314.

> Support to make dfs.namenode.avoid.read.slow.datanode  reconfigurable
> -
>
> Key: HDFS-16287
> URL: https://issues.apache.org/jira/browse/HDFS-16287
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 11.5h
>  Remaining Estimate: 0h
>
> 1. Consider making dfs.namenode.avoid.read.slow.datanode reconfigurable to 
> allow rapid rollback in case this feature 
> [HDFS-16076|https://issues.apache.org/jira/browse/HDFS-16076] causes 
> unexpected problems in a production environment.
> 2. Control DatanodeManager#startSlowPeerCollector via the parameter 
> 'dfs.datanode.peer.stats.enabled'.






[jira] [Work logged] (HDFS-16314) Support to make dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16314?focusedWorklogId=690195=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690195
 ]

ASF GitHub Bot logged work on HDFS-16314:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 17:20
Start Date: 03/Dec/21 17:20
Worklog Time Spent: 10m 
  Work Description: aajisaka commented on pull request #3664:
URL: https://github.com/apache/hadoop/pull/3664#issuecomment-985693281


   Merged. Thank you @haiyang1987 for your contribution and thank you @ferhui 
@tomscut for your review.




Issue Time Tracking
---

Worklog Id: (was: 690195)
Time Spent: 3h 50m  (was: 3h 40m)

> Support to make 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable
> -
>
> Key: HDFS-16314
> URL: https://issues.apache.org/jira/browse/HDFS-16314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Consider making 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable 
> to allow rapid rollback in case this feature (HDFS-16076) causes unexpected 
> problems in a production environment.






[jira] [Work logged] (HDFS-16314) Support to make dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16314?focusedWorklogId=690194=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690194
 ]

ASF GitHub Bot logged work on HDFS-16314:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 17:19
Start Date: 03/Dec/21 17:19
Worklog Time Spent: 10m 
  Work Description: aajisaka merged pull request #3664:
URL: https://github.com/apache/hadoop/pull/3664


   




Issue Time Tracking
---

Worklog Id: (was: 690194)
Time Spent: 3h 40m  (was: 3.5h)

> Support to make 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable
> -
>
> Key: HDFS-16314
> URL: https://issues.apache.org/jira/browse/HDFS-16314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Consider making 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable 
> to allow rapid rollback in case this feature (HDFS-16076) causes unexpected 
> problems in a production environment.






[jira] [Work logged] (HDFS-16314) Support to make dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16314?focusedWorklogId=690193=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690193
 ]

ASF GitHub Bot logged work on HDFS-16314:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 17:17
Start Date: 03/Dec/21 17:17
Worklog Time Spent: 10m 
  Work Description: aajisaka commented on a change in pull request #3664:
URL: https://github.com/apache/hadoop/pull/3664#discussion_r762113040



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java
##
@@ -261,4 +261,16 @@ protected String getRack(final DatanodeInfo datanode) {
   }
 }
   }
+
+  /**
+   * Updates the value used for excludeSlowNodesEnabled, which is set by
+   * {@code 
DFSConfigKeys.DFS_NAMENODE_BLOCKPLACEMENTPOLICY_EXCLUDE_SLOW_NODES_ENABLED_KEY}
+   * initially.
+   *
+   * @param enable true, we will filter out slow nodes
+   * when choosing targets for blocks, otherwise false not filter.
+   */
+  public abstract void setExcludeSlowNodesEnabled(boolean enable);
+
+  public abstract boolean getExcludeSlowNodesEnabled();

Review comment:
   This interface is marked as `@Private`, so adding abstract methods is 
okay.
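A concrete subclass would typically back the abstract getter/setter pair above with a volatile field, so that an update made by the reconfiguration thread (e.g. via dfsadmin -reconfig) is immediately visible to threads choosing block placements. A standalone sketch with an invented class name, not tied to any actual Hadoop class:

```java
// Back the getter/setter pair with a volatile flag so a reconfiguration
// thread's update is visible to threads choosing block placements.
public class ReconfigurableExcludeSlowNodesFlag {

    private volatile boolean excludeSlowNodesEnabled;

    public void setExcludeSlowNodesEnabled(boolean enable) {
        this.excludeSlowNodesEnabled = enable;
    }

    public boolean getExcludeSlowNodesEnabled() {
        return excludeSlowNodesEnabled;
    }

    public static void main(String[] args) {
        ReconfigurableExcludeSlowNodesFlag flag = new ReconfigurableExcludeSlowNodesFlag();
        flag.setExcludeSlowNodesEnabled(true);  // e.g. triggered at runtime
        System.out.println(flag.getExcludeSlowNodesEnabled()); // prints true
    }
}
```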






Issue Time Tracking
---

Worklog Id: (was: 690193)
Time Spent: 3.5h  (was: 3h 20m)

> Support to make 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable
> -
>
> Key: HDFS-16314
> URL: https://issues.apache.org/jira/browse/HDFS-16314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Consider making 
> dfs.namenode.block-placement-policy.exclude-slow-nodes.enabled reconfigurable 
> to allow rapid rollback in case this feature (HDFS-16076) causes unexpected 
> problems in a production environment.






[jira] [Commented] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453142#comment-17453142
 ] 

Hadoop QA commented on HDFS-16293:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
47s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  2m 
13s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
26s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m  
8s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 
37s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
41s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
42s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
26m 31s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
58s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
26s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 37m 
32s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  6m 
40s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
28s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
23s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 
50s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  6m 50s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/747/artifact/out/diff-compile-javac-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt{color}
 | {color:red} hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 generated 5 new + 647 unchanged 
- 0 fixed = 652 total (was 647) {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 
19s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  6m 19s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/747/artifact/out/diff-compile-javac-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt{color}
 | {color:red} 
hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 with 
JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 generated 5 new + 624 
unchanged - 0 fixed = 629 total (was 624) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
21s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
46s{color} | 

[jira] [Work logged] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=690153=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690153
 ]

ASF GitHub Bot logged work on HDFS-16303:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:28
Start Date: 03/Dec/21 16:28
Worklog Time Spent: 10m 
  Work Description: KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762078547



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java
##
@@ -189,6 +190,30 @@ public void run() {
  * node will be removed from tracking by the pending cancel.
  */
 processCancelledNodes();
+
+// Having more nodes decommissioning than can be tracked will impact 
decommissioning
+// performance due to queueing delay
+int numTrackedNodes = outOfServiceNodeBlocks.size();
+int numQueuedNodes = getPendingNodes().size();
+int numDecommissioningNodes = numTrackedNodes + numQueuedNodes;
+if (numDecommissioningNodes > maxConcurrentTrackedNodes) {
+  LOG.warn(
+  "There are {} nodes decommissioning but only {} nodes will be 
tracked at a time. "
+  + "{} nodes are currently queued waiting to be 
decommissioned.",
+  numDecommissioningNodes, maxConcurrentTrackedNodes, 
numQueuedNodes);
+
+  // Re-queue unhealthy nodes to make space for decommissioning 
healthy nodes
+  final List<DatanodeDescriptor> unhealthyDns = 
outOfServiceNodeBlocks.keySet().stream()
+  .filter(dn -> 
!blockManager.isNodeHealthyForDecommissionOrMaintenance(dn))
+  .collect(Collectors.toList());
+  final List<DatanodeDescriptor> toRequeue =
+  identifyUnhealthyNodesToRequeue(unhealthyDns, 
numDecommissioningNodes);
+  for (DatanodeDescriptor dn : toRequeue) {
+getPendingNodes().add(dn);
+outOfServiceNodeBlocks.remove(dn);

Review comment:
   I think I may also need to remove from "pendingRep" here






Issue Time Tracking
---

Worklog Id: (was: 690153)
Time Spent: 6h 50m  (was: 6h 40m)

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For
> example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X
> of those datanodes remain in state decommissioning forever without making any
> forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for
> decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes
>  Default Value = 100
> The maximum number of decommission-in-progress datanodes that will be
> tracked at one time by the namenode. Tracking a decommission-in-progress
> datanode consumes additional NN memory proportional to the number of blocks
> on the datanode. Having a conservative limit reduces the potential impact of
> decommissioning a large number of nodes at once. A value of 0 means no limit
> will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning
> at any given time, so as to avoid Namenode memory pressure.
> Looking into the "DatanodeAdminManager" code:
>  * a datanode is only removed from the "tracked.nodes" set when it finishes
> decommissioning
>  * a new datanode is only added to the "tracked.nodes" set if there are fewer
> than 100 datanodes being tracked
> So in the event that there are more than 100 datanodes being decommissioned
> at a given time, some of those datanodes will not be in the "tracked.nodes"
> set until 1 or more datanodes in "tracked.nodes" finish decommissioning.
> This is generally not a problem because the datanodes in "tracked.nodes"
> will eventually finish decommissioning, but there is an edge case where this
> logic prevents the namenode from making any forward progress towards
> decommissioning.
> If all 100 datanodes in "tracked.nodes" are unable to finish
> decommissioning, then other datanodes (which may be able to be
> decommissioned) will never get added to "tracked.nodes" and therefore will
> never get the opportunity to be decommissioned.
> This can occur due to the following issue:
> {quote}2021-10-21 12:39:24,048 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager
> (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In
> Progress. Cannot be safely decommissioned or be in maintenance since there is
> risk of reduced data durability or data loss. Either restart the failed node
> or force decommissioning or maintenance by removing, calling refreshNodes,
> then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying
> hardware fails or is lost), then it will remain in state decommissioning
> forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop
> cluster lifetime, then this is enough to completely fill up the
> "tracked.nodes" set. With the entire "tracked.nodes" set filled with
> datanodes that can never finish decommissioning, any datanodes which could
> otherwise be decommissioned will never get the opportunity.

[jira] [Work logged] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=690149=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690149
 ]

ASF GitHub Bot logged work on HDFS-16303:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:27
Start Date: 03/Dec/21 16:27
Worklog Time Spent: 10m 
  Work Description: KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762077696



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##
@@ -1654,4 +1658,139 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 12)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Allow 3 datanodes to be decommissioned at a time
+    getConf().setInt(DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES, 3);
+    // Disable the normal monitor runs
+    getConf()
+        .setInt(MiniDFSCluster.DFS_NAMENODE_DECOMMISSION_INTERVAL_TESTING_KEY, Integer.MAX_VALUE);
+
+    // Start cluster with 6 datanodes
+    startCluster(1, 6);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+    assertEquals(6, getCluster().getDataNodes().size());
+
+    // 3 datanodes will be "live" datanodes that are expected to be decommissioned eventually
+    final List<DatanodeDescriptor> liveNodes = getCluster().getDataNodes().subList(3, 6).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, liveNodes.size());
+
+    // 3 datanodes will be "dead" datanodes that are expected to never be decommissioned
+    final List<DatanodeDescriptor> deadNodes = getCluster().getDataNodes().subList(0, 3).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, deadNodes.size());
+
+    // Need to create some data or "isNodeHealthyForDecommissionOrMaintenance"
+    // may unexpectedly return true for a dead node
+    writeFile(getCluster().getFileSystem(), new Path("/tmp/test1"), 1, 100);
+
+    // Cause the 3 "dead" nodes to be lost while in state decommissioning
+    // and fill the tracked nodes set with those 3 "dead" nodes
+    ArrayList decommissionedNodes = Lists.newArrayList();
+    int expectedNumTracked = 0;
+    for (final DatanodeDescriptor deadNode : deadNodes) {

Review comment:
   should put the "waitFor" after the for loop such that the nodes can be stopped in parallel; this will improve the runtime of the test
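
   The suggested pattern (trigger all the stops inside the loop, then do a single wait afterwards) can be sketched generically; `stopNode` and `allDown` below are hypothetical stand-ins for the MiniDFSCluster stop/wait calls, not real Hadoop APIs:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: kick off all shutdowns first, then wait once for all of them,
// instead of a stop-and-wait round trip per node inside the loop.
class ParallelStopSketch {
    static final Set<String> down = ConcurrentHashMap.newKeySet();

    static void stopNode(String node) {            // hypothetical async stop
        new Thread(() -> down.add(node)).start();
    }

    static boolean allDown(List<String> nodes) {   // hypothetical health probe
        return down.containsAll(nodes);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> nodes = List.of("dn1", "dn2", "dn3");
        for (String n : nodes) {
            stopNode(n);                           // no per-node wait here
        }
        long deadline = System.currentTimeMillis() + 5_000;
        while (!allDown(nodes) && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);                      // single waitFor after the loop
        }
        System.out.println("all down: " + allDown(nodes));
    }
}
```

   With N nodes this turns N sequential stop-then-wait cycles into one wait bounded by the slowest node.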






Issue Time Tracking
---

Worklog Id: (was: 690149)
Time Spent: 6h 40m  (was: 6.5h)

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=690147=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690147
 ]

ASF GitHub Bot logged work on HDFS-16303:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:26
Start Date: 03/Dec/21 16:26
Worklog Time Spent: 10m 
  Work Description: KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762076676



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##
@@ -1654,4 +1658,139 @@ public Boolean get() {
+    // Need to create some data or "isNodeHealthyForDecommissionOrMaintenance"
+    // may unexpectedly return true for a dead node
+    writeFile(getCluster().getFileSystem(), new Path("/tmp/test1"), 1, 100);
Review comment:
   should use a larger replication factor here to ensure there are LowRedundancy blocks
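
   The rationale can be made concrete with a toy predicate (a simplified model, not the BlockManager's actual replica accounting): with replication factor 1 on 6 nodes, killing 3 of them may leave every block's single replica intact, whereas a factor larger than the number of surviving nodes guarantees low-redundancy blocks:

```java
// Toy check: a block is "low redundancy" when live replicas < replication factor.
class LowRedundancySketch {
    static boolean isLowRedundancy(int liveReplicas, int replicationFactor) {
        return liveReplicas < replicationFactor;
    }

    public static void main(String[] args) {
        // replication factor 1: the lone replica may survive on a live node
        System.out.println(isLowRedundancy(1, 1));   // prints "false"
        // replication factor 6 with only 3 live nodes: at most 3 replicas remain
        System.out.println(isLowRedundancy(3, 6));   // prints "true"
    }
}
```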






Issue Time Tracking
---

Worklog Id: (was: 690147)
Time Spent: 6.5h  (was: 6h 20m)

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=690146=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690146
 ]

ASF GitHub Bot logged work on HDFS-16303:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:25
Start Date: 03/Dec/21 16:25
Worklog Time Spent: 10m 
  Work Description: KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762076295



##
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##
@@ -1654,4 +1658,139 @@ public Boolean get() {
+    // Start cluster with 6 datanodes
+    startCluster(1, 6);

Review comment:
   can probably reduce the number of nodes in this test






Issue Time Tracking
---

Worklog Id: (was: 690146)
Time Spent: 6h 20m  (was: 6h 10m)

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=690140=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690140
 ]

ASF GitHub Bot logged work on HDFS-16303:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:20
Start Date: 03/Dec/21 16:20
Worklog Time Spent: 10m 
  Work Description: KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-985650485


   I would also add that if you look at the implementation of the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager: https://github.com/apache/hadoop/pull/3746/files
   
   It is not any less complex than this change, due to the aforementioned caveats that need to be dealt with




Issue Time Tracking
---

Worklog Id: (was: 690140)
Time Spent: 6h 10m  (was: 6h)

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=690136=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690136
 ]

ASF GitHub Bot logged work on HDFS-16303:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:16
Start Date: 03/Dec/21 16:16
Worklog Time Spent: 10m 
  Work Description: KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-985647896


   @sodonnel The existing test 
"TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be 
problematic for the proposed alternative of removing a dead 
DECOMMISSION_INPROGRESS node from the DatanodeAdminManager: 
https://github.com/apache/hadoop/pull/3746/
   
   As previously stated, removing the dead DECOMMISSION_INPROGRESS node from 
the DatanodeAdminManager means that when there are no LowRedundancy blocks the 
dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to 
DECOMMISSIONED
   
   This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks
   
   ```
   "Delete the under-replicated file, which should let the 
DECOMMISSION_IN_PROGRESS node become DECOMMISSIONED"
   ```
   
   
https://github.com/apache/hadoop/blob/6342d5e523941622a140fd877f06e9b59f48c48b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestDecommissioningStatus.java#L451
   
   Therefore, I think this is a good argument to remain in favor of the original proposed change




Issue Time Tracking
---

Worklog Id: (was: 690136)
Time Spent: 6h  (was: 5h 50m)

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=690134=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690134
 ]

ASF GitHub Bot logged work on HDFS-16303:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:15
Start Date: 03/Dec/21 16:15
Worklog Time Spent: 10m 
  Work Description: KevinWikant commented on pull request #3746:
URL: https://github.com/apache/hadoop/pull/3746#issuecomment-985646803


   @sodonnel The existing test 
"TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be 
problematic for this change
   
   As previously stated, removing the dead DECOMMISSION_INPROGRESS node from 
the DatanodeAdminManager means that when there are no LowRedundancy blocks the 
dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to 
DECOMMISSIONED
   
   This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks
   
   ```
   "Delete the under-replicated file, which should let the 
DECOMMISSION_IN_PROGRESS node become DECOMMISSIONED"
   ```
   
   
https://github.com/apache/hadoop/blob/6342d5e523941622a140fd877f06e9b59f48c48b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestDecommissioningStatus.java#L451
   
   Therefore, I think this is a good argument to remain in favor of the original proposed change: https://github.com/apache/hadoop/pull/3675




Issue Time Tracking
---

Worklog Id: (was: 690134)
Time Spent: 5h 50m  (was: 5h 40m)

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>

[jira] [Commented] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453122#comment-17453122
 ] 

Hadoop QA commented on HDFS-16293:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 14m 19s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 50s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 34s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m  7s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  5m 40s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 16s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 36s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 25m 16s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 49s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 15s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 35m 34s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  6m 16s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 29s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 24s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 20s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  7m 20s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/746/artifact/out/diff-compile-javac-hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt{color} | {color:red} hadoop-hdfs-project-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 generated 5 new + 646 unchanged - 0 fixed = 651 total (was 646) {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  6m 34s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  6m 34s{color} | {color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/746/artifact/out/diff-compile-javac-hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt{color} | {color:red} hadoop-hdfs-project-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 generated 5 new + 624 unchanged - 0 fixed = 629 total (was 624) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 18s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 28s{color} | 

[jira] [Work logged] (HDFS-16357) Fix log format in DFSUtilClient

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16357?focusedWorklogId=690133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690133
 ]

ASF GitHub Bot logged work on HDFS-16357:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 16:14
Start Date: 03/Dec/21 16:14
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on a change in pull request #3729:
URL: https://github.com/apache/hadoop/pull/3729#discussion_r762068062



##
File path: hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSUtilClient.java
##
@@ -733,13 +733,13 @@ public static boolean isLocalAddress(InetSocketAddress targetAddr)
 InetAddress addr = targetAddr.getAddress();
 Boolean cached = localAddrMap.get(addr.getHostAddress());
 if (cached != null) {
-  LOG.trace("Address {} is {} local", targetAddr, (cached ? "" : "not"));
+  LOG.trace("Address " + targetAddr + (cached ? " is local" : " is not local"));

Review comment:
   The present change looks good to me. What is the problem with it?
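For context, the extra-space issue being discussed can be reproduced outside HDFS. The sketch below mimics SLF4J's `{}` placeholder substitution with a tiny helper; the `fmt` helper and the address value are illustrative stand-ins, not part of Hadoop or SLF4J:

```java
public class LogFormatDemo {
    // Mimics SLF4J "{}" placeholder substitution, one argument per placeholder.
    static String fmt(String pattern, Object... args) {
        String out = pattern;
        for (Object a : args) {
            out = out.replaceFirst("\\{\\}", String.valueOf(a));
        }
        return out;
    }

    public static void main(String[] args) {
        String targetAddr = "/127.0.0.1:9866";  // hypothetical address
        boolean cached = true;

        // Original form: substituting "" into "{} local" leaves a double space.
        String before = fmt("Address {} is {} local", targetAddr, cached ? "" : "not");
        // Patched form: the whole phrase is chosen conditionally, no stray space.
        String after = "Address " + targetAddr + (cached ? " is local" : " is not local");

        System.out.println("[" + before + "]");  // double space before "local"
        System.out.println("[" + after + "]");
    }
}
```

With `cached == true`, the placeholder form yields "Address /127.0.0.1:9866 is  local" (two spaces), which is the cosmetic problem the patch's concatenation avoids.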




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 690133)
Time Spent: 1h  (was: 50m)

> Fix log format in DFSUtilClient
> ---
>
> Key: HDFS-16357
> URL: https://issues.apache.org/jira/browse/HDFS-16357
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.1
>Reporter: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> If the address is local, there will be an additional space in the log. We can
> improve it to look proper.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16332) Expired block token causes slow read due to missing handling in sasl handshake

2021-12-03 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-16332:
-
Fix Version/s: 3.4.0
   3.2.4
   3.3.3
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk, branch-3.3, and branch-3.2. Thank you [~lineyshinya] for 
your contribution!

> Expired block token causes slow read due to missing handling in sasl handshake
> --
>
> Key: HDFS-16332
> URL: https://issues.apache.org/jira/browse/HDFS-16332
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, dfs, dfsclient
>Affects Versions: 2.8.5, 3.3.1
>Reporter: Shinya Yoshida
>Assignee: Shinya Yoshida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.3
>
> Attachments: Screenshot from 2021-11-18 12-11-34.png, Screenshot from 
> 2021-11-18 12-14-29.png, Screenshot from 2021-11-18 13-31-35.png
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> We're operating an HBase 1.4.x cluster on Hadoop 2.8.5.
> We have recently been evaluating a Kerberos-secured HBase and Hadoop cluster
> under production load, and we observed HBase responses slowing by several
> seconds or more, and by several minutes in the worst case (about one to
> three times a month).
> The following image is a scatter plot of HBase's slow responses; each circle
> is one slow-response log entry.
> The X-axis is the time the log occurred; the Y-axis is the response slowdown.
>  !Screenshot from 2021-11-18 12-14-29.png! 
> We could reproduce this issue by reducing "dfs.block.access.token.lifetime",
> which let us figure out the cause.
> (We used dfs.block.access.token.lifetime=60, i.e. 1 hour.)
> With hedged read enabled:
>  !Screenshot from 2021-11-18 12-11-34.png! 
> With hedged read disabled:
>  !Screenshot from 2021-11-18 13-31-35.png! 
> As you can see, it is worst when hedged read is enabled, but it happens
> whether hedged read is enabled or not.
> This impacts our 99th-percentile response time.
> This happens when the block token has expired; the root cause is the wrong
> handling of the InvalidToken exception in the sasl handshake in
> SaslDataTransferServer.
> I propose to add a new response code to DataTransferEncryptorStatus to
> request that the client update the block token, as DataTransferProtos does.
> The test code and patch are available in
> https://github.com/apache/hadoop/pull/3677
> We could reproduce this issue with the following test code on the 2.8.5
> branch and on trunk, as I tested:
> {code:java}
> // HDFS is configured as a secure cluster
> try (FileSystem fs = newFileSystem();
>      FSDataInputStream in = fs.open(PATH)) {
>   waitBlockTokenExpired(in);
>   in.read(0, bytes, 0, bytes.length);
> }
> 
> private void waitBlockTokenExpired(FSDataInputStream in1) throws Exception {
>   DFSInputStream innerStream = (DFSInputStream) in1.getWrappedStream();
>   for (LocatedBlock block : innerStream.getAllBlocks()) {
>     while (!SecurityTestUtil.isBlockTokenExpired(block.getBlockToken())) {
>       Thread.sleep(100);
>     }
>   }
> }
> {code}
> Here is the log we got; we added a custom log line before and after the
> block token refresh:
> https://github.com/bitterfox/hadoop/commit/173a9f876f2264b76af01d658f624197936fd79c
> {code}
> 2021-11-16 09:40:20,330 WARN  [hedgedRead-247] impl.BlockReaderFactory: I/O 
> error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
> at 
> 
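Zooming out from the truncated stack trace above: the failure mode in this issue is a client whose cached block token has expired being rejected during the SASL handshake. One general client-side mitigation, independent of the protocol change proposed in the issue, is to rebuild the reader (and thereby refetch locations and fresh tokens) when a stale-token error surfaces. The sketch below is purely illustrative; `BlockReader`, `InvalidTokenException`, and `readWithRefresh` are hypothetical stand-ins, not HDFS APIs:

```java
import java.util.function.Supplier;

public class TokenRefreshRetry {
    /** Thrown by the (hypothetical) transport when the access token is stale. */
    static class InvalidTokenException extends RuntimeException {
        InvalidTokenException(String m) { super(m); }
    }

    /** Hypothetical stand-in for a block reader built from current locations. */
    interface BlockReader { byte[] read(); }

    /**
     * Reads a block, building a fresh reader (which would refetch block
     * locations and tokens) once if the cached token is rejected.
     */
    static byte[] readWithRefresh(Supplier<BlockReader> newReader) {
        try {
            return newReader.get().read();
        } catch (InvalidTokenException stale) {
            // Cached token expired mid-stream: refresh and retry exactly once.
            return newReader.get().read();
        }
    }
}
```

The single bounded retry matters: without a cap, a server that keeps rejecting tokens would loop forever, which is the sort of stall the issue's scatter plots show.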

[jira] [Work logged] (HDFS-16332) Expired block token causes slow read due to missing handling in sasl handshake

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16332?focusedWorklogId=690047&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690047
 ]

ASF GitHub Bot logged work on HDFS-16332:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 14:30
Start Date: 03/Dec/21 14:30
Worklog Time Spent: 10m 
  Work Description: aajisaka commented on pull request #3677:
URL: https://github.com/apache/hadoop/pull/3677#issuecomment-985565777


   Merged. Thank you @bitterfox for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 690047)
Time Spent: 5h 40m  (was: 5.5h)

> Expired block token causes slow read due to missing handling in sasl handshake
> --
>
> Key: HDFS-16332
> URL: https://issues.apache.org/jira/browse/HDFS-16332
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, dfs, dfsclient
>Affects Versions: 2.8.5, 3.3.1
>Reporter: Shinya Yoshida
>Assignee: Shinya Yoshida
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot from 2021-11-18 12-11-34.png, Screenshot from 
> 2021-11-18 12-14-29.png, Screenshot from 2021-11-18 13-31-35.png
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> We're operating an HBase 1.4.x cluster on Hadoop 2.8.5.
> We have recently been evaluating a Kerberos-secured HBase and Hadoop cluster
> under production load, and we observed HBase responses slowing by several
> seconds or more, and by several minutes in the worst case (about one to
> three times a month).
> The following image is a scatter plot of HBase's slow responses; each circle
> is one slow-response log entry.
> The X-axis is the time the log occurred; the Y-axis is the response slowdown.
>  !Screenshot from 2021-11-18 12-14-29.png! 
> We could reproduce this issue by reducing "dfs.block.access.token.lifetime",
> which let us figure out the cause.
> (We used dfs.block.access.token.lifetime=60, i.e. 1 hour.)
> With hedged read enabled:
>  !Screenshot from 2021-11-18 12-11-34.png! 
> With hedged read disabled:
>  !Screenshot from 2021-11-18 13-31-35.png! 
> As you can see, it is worst when hedged read is enabled, but it happens
> whether hedged read is enabled or not.
> This impacts our 99th-percentile response time.
> This happens when the block token has expired; the root cause is the wrong
> handling of the InvalidToken exception in the sasl handshake in
> SaslDataTransferServer.
> I propose to add a new response code to DataTransferEncryptorStatus to
> request that the client update the block token, as DataTransferProtos does.
> The test code and patch are available in
> https://github.com/apache/hadoop/pull/3677
> We could reproduce this issue with the following test code on the 2.8.5
> branch and on trunk, as I tested:
> {code:java}
> // HDFS is configured as a secure cluster
> try (FileSystem fs = newFileSystem();
>      FSDataInputStream in = fs.open(PATH)) {
>   waitBlockTokenExpired(in);
>   in.read(0, bytes, 0, bytes.length);
> }
> 
> private void waitBlockTokenExpired(FSDataInputStream in1) throws Exception {
>   DFSInputStream innerStream = (DFSInputStream) in1.getWrappedStream();
>   for (LocatedBlock block : innerStream.getAllBlocks()) {
>     while (!SecurityTestUtil.isBlockTokenExpired(block.getBlockToken())) {
>       Thread.sleep(100);
>     }
>   }
> }
> {code}
> Here is the log we got; we added a custom log line before and after the
> block token refresh:
> https://github.com/bitterfox/hadoop/commit/173a9f876f2264b76af01d658f624197936fd79c
> {code}
> 2021-11-16 09:40:20,330 WARN  [hedgedRead-247] impl.BlockReaderFactory: I/O 
> error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> 

[jira] [Work logged] (HDFS-16332) Expired block token causes slow read due to missing handling in sasl handshake

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16332?focusedWorklogId=690046&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-690046
 ]

ASF GitHub Bot logged work on HDFS-16332:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 14:30
Start Date: 03/Dec/21 14:30
Worklog Time Spent: 10m 
  Work Description: aajisaka merged pull request #3677:
URL: https://github.com/apache/hadoop/pull/3677


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 690046)
Time Spent: 5.5h  (was: 5h 20m)

> Expired block token causes slow read due to missing handling in sasl handshake
> --
>
> Key: HDFS-16332
> URL: https://issues.apache.org/jira/browse/HDFS-16332
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, dfs, dfsclient
>Affects Versions: 2.8.5, 3.3.1
>Reporter: Shinya Yoshida
>Assignee: Shinya Yoshida
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot from 2021-11-18 12-11-34.png, Screenshot from 
> 2021-11-18 12-14-29.png, Screenshot from 2021-11-18 13-31-35.png
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> We're operating an HBase 1.4.x cluster on Hadoop 2.8.5.
> We have recently been evaluating a Kerberos-secured HBase and Hadoop cluster
> under production load, and we observed HBase responses slowing by several
> seconds or more, and by several minutes in the worst case (about one to
> three times a month).
> The following image is a scatter plot of HBase's slow responses; each circle
> is one slow-response log entry.
> The X-axis is the time the log occurred; the Y-axis is the response slowdown.
>  !Screenshot from 2021-11-18 12-14-29.png! 
> We could reproduce this issue by reducing "dfs.block.access.token.lifetime",
> which let us figure out the cause.
> (We used dfs.block.access.token.lifetime=60, i.e. 1 hour.)
> With hedged read enabled:
>  !Screenshot from 2021-11-18 12-11-34.png! 
> With hedged read disabled:
>  !Screenshot from 2021-11-18 13-31-35.png! 
> As you can see, it is worst when hedged read is enabled, but it happens
> whether hedged read is enabled or not.
> This impacts our 99th-percentile response time.
> This happens when the block token has expired; the root cause is the wrong
> handling of the InvalidToken exception in the sasl handshake in
> SaslDataTransferServer.
> I propose to add a new response code to DataTransferEncryptorStatus to
> request that the client update the block token, as DataTransferProtos does.
> The test code and patch are available in
> https://github.com/apache/hadoop/pull/3677
> We could reproduce this issue with the following test code on the 2.8.5
> branch and on trunk, as I tested:
> {code:java}
> // HDFS is configured as a secure cluster
> try (FileSystem fs = newFileSystem();
>      FSDataInputStream in = fs.open(PATH)) {
>   waitBlockTokenExpired(in);
>   in.read(0, bytes, 0, bytes.length);
> }
> 
> private void waitBlockTokenExpired(FSDataInputStream in1) throws Exception {
>   DFSInputStream innerStream = (DFSInputStream) in1.getWrappedStream();
>   for (LocatedBlock block : innerStream.getAllBlocks()) {
>     while (!SecurityTestUtil.isBlockTokenExpired(block.getBlockToken())) {
>       Thread.sleep(100);
>     }
>   }
> }
> {code}
> Here is the log we got; we added a custom log line before and after the
> block token refresh:
> https://github.com/bitterfox/hadoop/commit/173a9f876f2264b76af01d658f624197936fd79c
> {code}
> 2021-11-16 09:40:20,330 WARN  [hedgedRead-247] impl.BlockReaderFactory: I/O 
> error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> 

[jira] [Assigned] (HDFS-16332) Expired block token causes slow read due to missing handling in sasl handshake

2021-12-03 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned HDFS-16332:


Assignee: Shinya Yoshida

> Expired block token causes slow read due to missing handling in sasl handshake
> --
>
> Key: HDFS-16332
> URL: https://issues.apache.org/jira/browse/HDFS-16332
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, dfs, dfsclient
>Affects Versions: 2.8.5, 3.3.1
>Reporter: Shinya Yoshida
>Assignee: Shinya Yoshida
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot from 2021-11-18 12-11-34.png, Screenshot from 
> 2021-11-18 12-14-29.png, Screenshot from 2021-11-18 13-31-35.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> We're operating an HBase 1.4.x cluster on Hadoop 2.8.5.
> We have recently been evaluating a Kerberos-secured HBase and Hadoop cluster
> under production load, and we observed HBase responses slowing by several
> seconds or more, and by several minutes in the worst case (about one to
> three times a month).
> The following image is a scatter plot of HBase's slow responses; each circle
> is one slow-response log entry.
> The X-axis is the time the log occurred; the Y-axis is the response slowdown.
>  !Screenshot from 2021-11-18 12-14-29.png! 
> We could reproduce this issue by reducing "dfs.block.access.token.lifetime",
> which let us figure out the cause.
> (We used dfs.block.access.token.lifetime=60, i.e. 1 hour.)
> With hedged read enabled:
>  !Screenshot from 2021-11-18 12-11-34.png! 
> With hedged read disabled:
>  !Screenshot from 2021-11-18 13-31-35.png! 
> As you can see, it is worst when hedged read is enabled, but it happens
> whether hedged read is enabled or not.
> This impacts our 99th-percentile response time.
> This happens when the block token has expired; the root cause is the wrong
> handling of the InvalidToken exception in the sasl handshake in
> SaslDataTransferServer.
> I propose to add a new response code to DataTransferEncryptorStatus to
> request that the client update the block token, as DataTransferProtos does.
> The test code and patch are available in
> https://github.com/apache/hadoop/pull/3677
> We could reproduce this issue with the following test code on the 2.8.5
> branch and on trunk, as I tested:
> {code:java}
> // HDFS is configured as a secure cluster
> try (FileSystem fs = newFileSystem();
>      FSDataInputStream in = fs.open(PATH)) {
>   waitBlockTokenExpired(in);
>   in.read(0, bytes, 0, bytes.length);
> }
> 
> private void waitBlockTokenExpired(FSDataInputStream in1) throws Exception {
>   DFSInputStream innerStream = (DFSInputStream) in1.getWrappedStream();
>   for (LocatedBlock block : innerStream.getAllBlocks()) {
>     while (!SecurityTestUtil.isBlockTokenExpired(block.getBlockToken())) {
>       Thread.sleep(100);
>     }
>   }
> }
> {code}
> Here is the log we got; we added a custom log line before and after the
> block token refresh:
> https://github.com/bitterfox/hadoop/commit/173a9f876f2264b76af01d658f624197936fd79c
> {code}
> 2021-11-16 09:40:20,330 WARN  [hedgedRead-247] impl.BlockReaderFactory: I/O 
> error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697)
> at 
> 

[jira] [Commented] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452973#comment-17452973
 ] 

Yuanxin Zhu commented on HDFS-16293:


[~tasanuma] Thanks for your review. I added some comments to the unit test. 
Could you check them?

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch, HDFS-16293.06.patch, HDFS-16293.07.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I enabled ECN and used Terasort (500G data, 8 DataNodes, 76 vcores/DN)
> for testing, the DataNodes became congested (HDFS-8008). The client enters a
> sleep state after repeatedly receiving ACKs, but does not release the
> 'dataQueue' lock. The ResponseProcessor thread needs 'dataQueue' to execute
> 'ackQueue.getFirst()', so the ResponseProcessor waits for the client to
> release 'dataQueue', which effectively puts the ResponseProcessor thread to
> sleep as well, resulting in ACK delays. MapReduce tasks can be delayed by
> tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue.getFirst()',
> release 'dataQueue', and then decide whether to execute 'backOffIfNecessary()'
> based on 'one.isHeartbeatPacket()'.
>  
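The lock-narrowing pattern this issue describes can be sketched in isolation. The classes below are hypothetical stand-ins for DataStreamer's structures (only the name `dataQueue` follows the issue text); `sleepQuietly(50)` stands in for `backOffIfNecessary()`:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LockNarrowingDemo {
    private final Deque<String> dataQueue = new ArrayDeque<>();

    void enqueue(String packet) {
        synchronized (dataQueue) { dataQueue.addLast(packet); }
    }

    private static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Problematic shape: the backoff sleep runs while holding the dataQueue
    // lock, so any thread that needs dataQueue (e.g. the ACK processor) stalls.
    String takeHoldingLock(boolean congested) {
        synchronized (dataQueue) {
            if (congested) {
                sleepQuietly(50);  // stands in for backOffIfNecessary()
            }
            return dataQueue.peekFirst();
        }
    }

    // Proposed shape: read the head packet under the lock, release the lock,
    // then decide whether to back off based on the packet itself.
    String takeWithNarrowLock(boolean congested) {
        String one;
        synchronized (dataQueue) {
            one = dataQueue.peekFirst();  // 'one = dataQueue.getFirst()'
        }
        if (congested && one != null && !"heartbeat".equals(one)) {
            sleepQuietly(50);  // backoff no longer blocks other dataQueue users
        }
        return one;
    }
}
```

In the narrow-lock variant, a congested pipeline still throttles the sender, but threads that only need a brief look at the queue are no longer blocked for the entire backoff interval.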



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanxin Zhu updated HDFS-16293:
---
Attachment: HDFS-16293.07.patch

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch, HDFS-16293.06.patch, HDFS-16293.07.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I enabled ECN and used Terasort (500G data, 8 DataNodes, 76 vcores/DN)
> for testing, the DataNodes became congested (HDFS-8008). The client enters a
> sleep state after repeatedly receiving ACKs, but does not release the
> 'dataQueue' lock. The ResponseProcessor thread needs 'dataQueue' to execute
> 'ackQueue.getFirst()', so the ResponseProcessor waits for the client to
> release 'dataQueue', which effectively puts the ResponseProcessor thread to
> sleep as well, resulting in ACK delays. MapReduce tasks can be delayed by
> tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue.getFirst()',
> release 'dataQueue', and then decide whether to execute 'backOffIfNecessary()'
> based on 'one.isHeartbeatPacket()'.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16332) Expired block token causes slow read due to missing handling in sasl handshake

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16332?focusedWorklogId=689956&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689956
 ]

ASF GitHub Bot logged work on HDFS-16332:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 12:10
Start Date: 03/Dec/21 12:10
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3677:
URL: https://github.com/apache/hadoop/pull/3677#issuecomment-985468663


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 38s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  buf  |   0m  1s |  |  buf was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 48s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  21m 23s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   5m 18s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   4m 55s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m 12s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 25s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 46s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   2m 16s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   5m 40s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 12s |  |  branch has no errors when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 28s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  1s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m  7s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  cc  |   5m  7s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   5m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   4m 49s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  cc  |   4m 49s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   4m 49s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   2m 11s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 57s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   5m 43s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  6s |  |  patch has no errors when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 23s |  |  hadoop-hdfs-client in the patch passed.  |
   | +1 :green_heart: |  unit  | 223m 39s |  |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 48s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 352m 24s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3677/12/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3677 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell cc buflint bufcompat |
   | uname | Linux 8c3a896c9e3b 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 5fe1d32e150a50598c0760a7b3848b0cee87ffe4 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | 

[jira] [Work logged] (HDFS-16370) Fix assert message for BlockInfo

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16370?focusedWorklogId=689935&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689935
 ]

ASF GitHub Bot logged work on HDFS-16370:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 11:46
Start Date: 03/Dec/21 11:46
Worklog Time Spent: 10m 
  Work Description: tomscut opened a new pull request #3747:
URL: https://github.com/apache/hadoop/pull/3747


   JIRA: [HDFS-16370](https://issues.apache.org/jira/browse/HDFS-16370).
   
   In both methods BlockInfo#getPrevious and BlockInfo#getNext, the assert 
message is wrong. This may cause some misunderstanding and needs to be fixed.
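
A minimal, runnable illustration of the class of bug being fixed: an index-bounds assert whose failure message names the wrong accessor, so a failure in `getPrevious` reads as if it came from `getNext`. All names below are simplified stand-ins (the real `BlockInfo` keeps its previous/next links in a `triplets` array); this is not the actual Hadoop code.

```java
/**
 * Stand-in for a doubly linked block list with misleading assert messages.
 * links[2*i] holds the "previous" value, links[2*i + 1] the "next" value.
 */
public class AssertMessageDemo {
  private final long[] links;

  AssertMessageDemo(int blocks) {
    this.links = new long[2 * blocks];
  }

  long getPrevious(int index) {
    // Before this kind of fix, the message below would mistakenly say
    // "getNext", which is misleading when the assert fires in getPrevious.
    assert index >= 0 && 2 * index < links.length
        : "Index " + index + " out of bound in getPrevious";
    return links[2 * index];
  }

  long getNext(int index) {
    assert index >= 0 && 2 * index + 1 < links.length
        : "Index " + index + " out of bound in getNext";
    return links[2 * index + 1];
  }

  public static void main(String[] args) {
    AssertMessageDemo d = new AssertMessageDemo(2);
    System.out.println(d.getPrevious(1)); // prints 0 (default long value)
    System.out.println(d.getNext(1));     // prints 0
  }
}
```

Such asserts only fire with `-ea` enabled, which is why a wrong message can survive in test-only paths for a long time.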


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 689935)
Remaining Estimate: 0h
Time Spent: 10m

> Fix assert message for BlockInfo
> 
>
> Key: HDFS-16370
> URL: https://issues.apache.org/jira/browse/HDFS-16370
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In both methods BlockInfo#getPrevious and BlockInfo#getNext, the assert 
> message is wrong. This may cause some misunderstanding and needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16370) Fix assert message for BlockInfo

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16370:
--
Labels: pull-request-available  (was: )

> Fix assert message for BlockInfo
> 
>
> Key: HDFS-16370
> URL: https://issues.apache.org/jira/browse/HDFS-16370
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In both methods BlockInfo#getPrevious and BlockInfo#getNext, the assert 
> message is wrong. This may cause some misunderstanding and needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16370) Fix assert message for BlockInfo

2021-12-03 Thread tomscut (Jira)
tomscut created HDFS-16370:
--

 Summary: Fix assert message for BlockInfo
 Key: HDFS-16370
 URL: https://issues.apache.org/jira/browse/HDFS-16370
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: tomscut
Assignee: tomscut


In both methods BlockInfo#getPrevious and BlockInfo#getNext, the assert message 
is wrong. This may cause some misunderstanding and needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16364) Remove unnecessary brackets in NameNodeRpcServer#L453

2021-12-03 Thread Brahma Reddy Battula (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452922#comment-17452922
 ] 

Brahma Reddy Battula commented on HDFS-16364:
-

Committed to trunk. [~wangzhaohui] thanks for the contribution.

> Remove unnecessary brackets in NameNodeRpcServer#L453
> -
>
> Key: HDFS-16364
> URL: https://issues.apache.org/jira/browse/HDFS-16364
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16364) Remove unnecessary brackets in NameNodeRpcServer#L453

2021-12-03 Thread Brahma Reddy Battula (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula resolved HDFS-16364.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove unnecessary brackets in NameNodeRpcServer#L453
> -
>
> Key: HDFS-16364
> URL: https://issues.apache.org/jira/browse/HDFS-16364
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16364) Remove unnecessary brackets in NameNodeRpcServer#L453

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16364?focusedWorklogId=689921&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689921
 ]

ASF GitHub Bot logged work on HDFS-16364:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 11:21
Start Date: 03/Dec/21 11:21
Worklog Time Spent: 10m 
  Work Description: brahmareddybattula merged pull request #3742:
URL: https://github.com/apache/hadoop/pull/3742


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 689921)
Time Spent: 40m  (was: 0.5h)

> Remove unnecessary brackets in NameNodeRpcServer#L453
> -
>
> Key: HDFS-16364
> URL: https://issues.apache.org/jira/browse/HDFS-16364
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16364) Remove unnecessary brackets in NameNodeRpcServer#L453

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16364?focusedWorklogId=689920&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689920
 ]

ASF GitHub Bot logged work on HDFS-16364:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 11:20
Start Date: 03/Dec/21 11:20
Worklog Time Spent: 10m 
  Work Description: brahmareddybattula commented on pull request #3742:
URL: https://github.com/apache/hadoop/pull/3742#issuecomment-985438559


   lgtm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 689920)
Time Spent: 0.5h  (was: 20m)

> Remove unnecessary brackets in NameNodeRpcServer#L453
> -
>
> Key: HDFS-16364
> URL: https://issues.apache.org/jira/browse/HDFS-16364
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: wangzhaohui
>Assignee: wangzhaohui
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452908#comment-17452908
 ] 

Takanobu Asanuma commented on HDFS-16293:
-

[~Yuanxin Zhu] Thanks for your explanation and for updating the patch. It seems 
the unit test becomes stable, and [^HDFS-16293.06.patch] mostly looks good to 
me. Some minor comments:
 * Could you add a timeout to the unit test?  @Test(timeout=6)
 * Please provide more comments to the unit tests about the purpose of each 
thread, and why it verifies that congestedNodes.size() is greater than 1, and 
so on.
 * How about adding a comment like "// streamer has to release dataQueue before 
calling backoff" before calling backOffIfNecessary()?

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch, HDFS-16293.06.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I open the ECN and use Terasort(500G data,8 DataNodes,76 vcores/DN) for 
> testing, DataNodes are congested(HDFS-8008). The client enters the sleep 
> state after receiving the ACK for many times, but does not release the 
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute 
> 'ackQueue.getFirst()', so the ResponseProcessor will wait for the client to 
> release the 'dataQueue', which is equivalent to that the ResponseProcessor 
> thread also enters sleep, resulting in ACK delay.MapReduce tasks can be 
> delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue. getFirst()', 
> release 'dataQueue', and then judge whether to execute 'backOffIfNecessary()' 
> according to 'one.isHeartbeatPacket()'
>  
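
The change described in the quoted report, taking the first packet while holding the `dataQueue` monitor, releasing the monitor, and only then deciding whether to back off, can be sketched in a self-contained demo. Names such as `dataQueue`, `backOffIfNecessary`, and the heartbeat check mimic the HDFS client, but this is a simplified illustration of the locking pattern, not the actual patch.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Demonstrates that the back-off sleep no longer holds the dataQueue monitor. */
public class BackoffOutsideLockDemo {
  static final Queue<String> dataQueue = new ArrayDeque<>();
  static final AtomicBoolean responderRanDuringBackoff = new AtomicBoolean(false);
  static volatile boolean backingOff = false;

  static void backOffIfNecessary() throws InterruptedException {
    backingOff = true;
    Thread.sleep(500); // simulated congestion back-off
    backingOff = false;
  }

  public static void main(String[] args) throws Exception {
    dataQueue.add("packet-1");

    // Stand-in for the ResponseProcessor thread, which needs the dataQueue
    // monitor to process ACKs (ackQueue.getFirst() in the real code).
    Thread responder = new Thread(() -> {
      while (!backingOff) { /* spin until the streamer starts backing off */ }
      synchronized (dataQueue) { // succeeds immediately: the monitor is free
        responderRanDuringBackoff.set(true);
      }
    });
    responder.setDaemon(true);
    responder.start();

    String one;
    synchronized (dataQueue) {
      one = dataQueue.peek(); // dataQueue.getFirst() in DataStreamer
    }
    // The monitor is released here, so the back-off below no longer blocks
    // the responder thread the way the pre-patch code did.
    if (!one.startsWith("heartbeat")) { // stand-in for one.isHeartbeatPacket()
      backOffIfNecessary();
    }
    responder.join(1000);
    System.out.println("responder ran during backoff: "
        + responderRanDuringBackoff.get());
  }
}
```

With the back-off moved outside the `synchronized` block, the simulated ResponseProcessor acquires `dataQueue` while the streamer sleeps, which is the ACK-delay behavior the patch is meant to restore.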



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452879#comment-17452879
 ] 

Yuanxin Zhu commented on HDFS-16293:


[~tasanuma] In HDFS-16293.06.patch, the program will definitely finish. Could 
you check it?

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch, HDFS-16293.06.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I open the ECN and use Terasort(500G data,8 DataNodes,76 vcores/DN) for 
> testing, DataNodes are congested(HDFS-8008). The client enters the sleep 
> state after receiving the ACK for many times, but does not release the 
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute 
> 'ackQueue.getFirst()', so the ResponseProcessor will wait for the client to 
> release the 'dataQueue', which is equivalent to that the ResponseProcessor 
> thread also enters sleep, resulting in ACK delay.MapReduce tasks can be 
> delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue. getFirst()', 
> release 'dataQueue', and then judge whether to execute 'backOffIfNecessary()' 
> according to 'one.isHeartbeatPacket()'
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanxin Zhu updated HDFS-16293:
---
Attachment: HDFS-16293.06.patch

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch, HDFS-16293.06.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I open the ECN and use Terasort(500G data,8 DataNodes,76 vcores/DN) for 
> testing, DataNodes are congested(HDFS-8008). The client enters the sleep 
> state after receiving the ACK for many times, but does not release the 
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute 
> 'ackQueue.getFirst()', so the ResponseProcessor will wait for the client to 
> release the 'dataQueue', which is equivalent to that the ResponseProcessor 
> thread also enters sleep, resulting in ACK delay.MapReduce tasks can be 
> delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue. getFirst()', 
> release 'dataQueue', and then judge whether to execute 'backOffIfNecessary()' 
> according to 'one.isHeartbeatPacket()'
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452859#comment-17452859
 ] 

Yuanxin Zhu edited comment on HDFS-16293 at 12/3/21, 10:07 AM:
---

[~tasanuma] Thanks for your feedback. What I'm worried about is that the unit test could fail because of thread-timing problems.

I think there are two situations:
 * Without the DataStreamer fix, the congestedNodes thread may run one step ahead of the dataQueue thread, making the size of congestedNodes greater than 1. This can be solved by increasing the sleep time of the congestedNodes thread.
 * With the DataStreamer fix, the previous unit test exits after the dataQueue thread ends in order to save time, which may cause the program to exit early while the size of congestedNodes is not yet greater than 1. This can be solved by increasing the number of congestedNodes thread runs and putting the exit code in the congestedNodes thread, but that increases the running time of the unit test without the DataStreamer fix.

If the program occasionally cannot finish, we can increase the number of times the dataQueue thread runs, to prevent the DataStreamer from waiting on an empty dataQueue, or add a packet again before the congestedNodes thread ends.

Could you check it?


was (Author: yuanxin zhu):
[~tasanuma] Thanks for your feedback. What I'm worried about is that the unit 
test went wrong because of threading problems

I think there are two situations:
 * Without fixing DataStreamer, the congestedNodes thread may run one step 
ahead of the dataQueue thread, resulting in the size of the congestedNodes 
greater than 1, it can be solved by increasing the sleep time of the 
congestedNodes thread.
 * With fixing DataStreamer, in order to save time, the previous unit test 
program exits after the dataQueue thread ends, which may cause the program to 
exit in advance when the size of the congestedNodes is not greater than 1. It 
can be solved by increasing the number of the congestedNodes thread runs and 
putting the program exit code in the congestedNodes thread, but it will affect 
the running time of the unit test Without fixing DataStreamer. 

If the program can't finish occasionally, we can increase the number of times 
the dataQueue thread runs, so as to prevent the DataStreamer from waiting 
because the dataQueue is empty, or add a packet again before the congestedNodes 
thread ends.

Could you check it?

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I open the ECN and use Terasort(500G data,8 DataNodes,76 vcores/DN) for 
> testing, DataNodes are congested(HDFS-8008). The client enters the sleep 
> state after receiving the ACK for many times, but does not release the 
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute 
> 'ackQueue.getFirst()', so the ResponseProcessor will wait for the client to 
> release the 'dataQueue', which is equivalent to that the ResponseProcessor 
> thread also enters sleep, resulting in ACK delay.MapReduce tasks can be 
> delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue. getFirst()', 
> release 'dataQueue', and then judge whether to execute 'backOffIfNecessary()' 
> according to 'one.isHeartbeatPacket()'
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452859#comment-17452859
 ] 

Yuanxin Zhu edited comment on HDFS-16293 at 12/3/21, 10:05 AM:
---

[~tasanuma] Thanks for your feedback. What I'm worried about is that the unit 
test went wrong because of threading problems

I think there are two situations:
 * Without fixing DataStreamer, the congestedNodes thread may run one step 
ahead of the dataQueue thread, resulting in the size of the congestedNodes 
greater than 1, it can be solved by increasing the sleep time of the 
congestedNodes thread.
 * With fixing DataStreamer, in order to save time, the previous unit test 
program exits after the dataQueue thread ends, which may cause the program to 
exit in advance when the size of the congestedNodes is not greater than 1. It 
can be solved by increasing the number of the congestedNodes thread runs and 
putting the program exit code in the congestedNodes thread, but it will affect 
the running time of the unit test Without fixing DataStreamer. 

If the program can't finish occasionally, we can increase the number of times 
the dataQueue thread runs, so as to prevent the DataStreamer from waiting 
because the dataQueue is empty, or add a packet again before the congestedNodes 
thread ends.

Could you check it?


was (Author: yuanxin zhu):
[~tasanuma] Thanks for your feedback. It's also what I'm worried about.

I think there are two situations:
 * Without fixing DataStreamer, the congestedNodes thread may run one step 
ahead of the dataQueue thread, resulting in the size of the congestedNodes 
greater than 1, it can be solved by increasing the sleep time of the 
congestedNodes thread.
 * With fixing DataStreamer, in order to save time, the previous unit test 
program exits after the dataQueue thread ends, which may cause the program to 
exit in advance when the size of the congestedNodes is not greater than 1. It 
can be solved by increasing the number of the congestedNodes thread runs and 
putting the program exit code in the congestedNodes thread, but it will affect 
the running time of the unit test Without fixing DataStreamer. 

Could you check it?

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I open the ECN and use Terasort(500G data,8 DataNodes,76 vcores/DN) for 
> testing, DataNodes are congested(HDFS-8008). The client enters the sleep 
> state after receiving the ACK for many times, but does not release the 
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute 
> 'ackQueue.getFirst()', so the ResponseProcessor will wait for the client to 
> release the 'dataQueue', which is equivalent to that the ResponseProcessor 
> thread also enters sleep, resulting in ACK delay.MapReduce tasks can be 
> delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue. getFirst()', 
> release 'dataQueue', and then judge whether to execute 'backOffIfNecessary()' 
> according to 'one.isHeartbeatPacket()'
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16331) Make dfs.blockreport.intervalMsec reconfigurable

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16331?focusedWorklogId=689854&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689854
 ]

ASF GitHub Bot logged work on HDFS-16331:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 09:42
Start Date: 03/Dec/21 09:42
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3676:
URL: https://github.com/apache/hadoop/pull/3676#issuecomment-985371174


   Thanks @tasanuma for the review and the merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 689854)
Time Spent: 5h 10m  (was: 5h)

> Make dfs.blockreport.intervalMsec reconfigurable
> 
>
> Key: HDFS-16331
> URL: https://issues.apache.org/jira/browse/HDFS-16331
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2021-11-18-09-33-24-236.png, 
> image-2021-11-18-09-35-35-400.png
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> We have a cold data cluster, which stores as EC policy. There are 24 fast 
> disks on each node and each disk is 7 TB. 
> Recently, many nodes have more than 10 million blocks, and the interval of 
> FBR is 6h as default. Frequent FBR caused great pressure on NN.
> !image-2021-11-18-09-35-35-400.png|width=334,height=229!
> !image-2021-11-18-09-33-24-236.png|width=566,height=159!
> We want to increase the interval of FBR, but have to rolling restart the DNs, 
> this operation is very heavy. In this scenario, it is necessary to make 
> _dfs.blockreport.intervalMsec_ reconfigurable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452859#comment-17452859
 ] 

Yuanxin Zhu commented on HDFS-16293:


[~tasanuma] Thanks for your feedback. It's also what I'm worried about.

I think there are two situations:
 * Without fixing DataStreamer, the congestedNodes thread may run one step 
ahead of the dataQueue thread, resulting in the size of the congestedNodes 
greater than 1, it can be solved by increasing the sleep time of the 
congestedNodes thread.
 * With fixing DataStreamer, in order to save time, the previous unit test 
program exits after the dataQueue thread ends, which may cause the program to 
exit in advance when the size of the congestedNodes is not greater than 1. It 
can be solved by increasing the number of the congestedNodes thread runs and 
putting the program exit code in the congestedNodes thread, but it will affect 
the running time of the unit test Without fixing DataStreamer. 

Could you check it?

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I open the ECN and use Terasort(500G data,8 DataNodes,76 vcores/DN) for 
> testing, DataNodes are congested(HDFS-8008). The client enters the sleep 
> state after receiving the ACK for many times, but does not release the 
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute 
> 'ackQueue.getFirst()', so the ResponseProcessor will wait for the client to 
> release the 'dataQueue', which is equivalent to that the ResponseProcessor 
> thread also enters sleep, resulting in ACK delay.MapReduce tasks can be 
> delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue. getFirst()', 
> release 'dataQueue', and then judge whether to execute 'backOffIfNecessary()' 
> according to 'one.isHeartbeatPacket()'
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16293) Client sleeps and holds 'dataQueue' when DataNodes are congested

2021-12-03 Thread Yuanxin Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanxin Zhu updated HDFS-16293:
---
Attachment: HDFS-16293.05.patch

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> 
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.2.2, 3.3.1, 3.2.3
>Reporter: Yuanxin Zhu
>Assignee: Yuanxin Zhu
>Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch, 
> HDFS-16293.05.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I open the ECN and use Terasort(500G data,8 DataNodes,76 vcores/DN) for 
> testing, DataNodes are congested(HDFS-8008). The client enters the sleep 
> state after receiving the ACK for many times, but does not release the 
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute 
> 'ackQueue.getFirst()', so the ResponseProcessor will wait for the client to 
> release the 'dataQueue', which is equivalent to that the ResponseProcessor 
> thread also enters sleep, resulting in ACK delay.MapReduce tasks can be 
> delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue. getFirst()', 
> release 'dataQueue', and then judge whether to execute 'backOffIfNecessary()' 
> according to 'one.isHeartbeatPacket()'
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16332) Expired block token causes slow read due to missing handling in sasl handshake

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16332?focusedWorklogId=689835&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689835
 ]

ASF GitHub Bot logged work on HDFS-16332:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 09:13
Start Date: 03/Dec/21 09:13
Worklog Time Spent: 10m 
  Work Description: aajisaka commented on a change in pull request #3677:
URL: https://github.com/apache/hadoop/pull/3677#discussion_r761763311



##
File path: hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferClient.java
##
@@ -603,7 +603,20 @@ private IOStreamPair doSaslHandshake(InetAddress addr,
   conf, cipherOption, underlyingOut, underlyingIn, false) :
   sasl.createStreamPair(out, in);
 } catch (IOException ioe) {
-  sendGenericSaslErrorMessage(out, ioe.getMessage());
+  String message = ioe.getMessage();
+  try {
+    sendGenericSaslErrorMessage(out, message);
+  } catch (Exception e) {
+    // If ioe is caused by error response from server, server will close peer connection.
+    // So sendGenericSaslErrorMessage might cause IOException due to "Broken pipe".
+    // We suppress IOException from sendGenericSaslErrorMessage
+    // and always throw `ioe` as top level.
+    // `ioe` can be InvalidEncryptionKeyException or InvalidBlockTokenException
+    // that indicates refresh key or token and are important for caller.
+    LOG.debug("Failed to send generic sasl error (server: {}, message: {}), suppress exception",
+        addr.toString(), message, e);

Review comment:
   Thanks!
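
The diff in this review suppresses the secondary "Broken pipe" failure so the original exception still reaches the caller. The same pattern can be shown standalone; names below are stand-ins, and this sketch uses `Throwable.addSuppressed` to retain the secondary error, whereas the real patch only logs it at debug level.

```java
import java.io.IOException;

/** Never let a failure while *reporting* an error mask the original error. */
public class SuppressSecondaryFailureDemo {

  static void sendErrorToPeer(String message) throws IOException {
    // Simulates the peer having already closed the connection.
    throw new IOException("Broken pipe");
  }

  static void handshake() throws IOException {
    try {
      // Primary failure: the exception the caller actually needs to see
      // (InvalidEncryptionKeyException / InvalidBlockTokenException in HDFS).
      throw new IOException("Block token is expired");
    } catch (IOException ioe) {
      try {
        sendErrorToPeer(ioe.getMessage());
      } catch (Exception e) {
        // Suppress the secondary failure; keep it attached for diagnostics.
        ioe.addSuppressed(e);
      }
      throw ioe; // always rethrow the meaningful, primary exception
    }
  }

  public static void main(String[] args) {
    try {
      handshake();
    } catch (IOException ioe) {
      System.out.println("primary: " + ioe.getMessage());
      System.out.println("suppressed: " + ioe.getSuppressed()[0].getMessage());
    }
  }
}
```

Without this pattern, the caller would see only the uninformative "Broken pipe" and could not tell that it needs to refresh its key or token.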




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 689835)
Time Spent: 5h 10m  (was: 5h)

> Expired block token causes slow read due to missing handling in sasl handshake
> --
>
> Key: HDFS-16332
> URL: https://issues.apache.org/jira/browse/HDFS-16332
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, dfs, dfsclient
>Affects Versions: 2.8.5, 3.3.1
>Reporter: Shinya Yoshida
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot from 2021-11-18 12-11-34.png, Screenshot from 
> 2021-11-18 12-14-29.png, Screenshot from 2021-11-18 13-31-35.png
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> We operate an HBase 1.4.x cluster on Hadoop 2.8.5.
> We recently evaluated a Kerberos-secured HBase and Hadoop cluster under 
> production load and observed HBase responses slowing by several seconds, 
> and by several minutes in the worst case (about one to three times a month).
> The following image is a scatter plot of the HBase slow responses; each 
> circle is one slow-response log entry.
> The X-axis is the time the log occurred; the Y-axis is the response delay.
>  !Screenshot from 2021-11-18 12-14-29.png! 
> We could reproduce this issue by reducing "dfs.block.access.token.lifetime", 
> which let us figure out the cause
> (we used dfs.block.access.token.lifetime=60, i.e. 1 hour).
> When hedged read enabled:
>  !Screenshot from 2021-11-18 12-11-34.png! 
> When hedged read disabled:
>  !Screenshot from 2021-11-18 13-31-35.png! 
> As you can see, it is worst when hedged read is enabled; however, it happens 
> whether hedged read is enabled or not.
> This impacts our 99th-percentile response time.
> This happens when the block token has expired; the root cause is incorrect 
> handling of the InvalidToken exception during the SASL handshake in 
> SaslDataTransferServer.
> I propose to add a new response code for DataTransferEncryptorStatus to 
> request the client to update the block token like DataTransferProtos does.
> The test code and patch are available at 
> https://github.com/apache/hadoop/pull/3677
> We could reproduce this issue with the following test code on the 2.8.5 
> branch and on trunk, as tested:
> {code:java}
> // HDFS is configured as a secure cluster
> try (FileSystem fs = newFileSystem();
>  FSDataInputStream in = fs.open(PATH)) {
>   waitBlockTokenExpired(in);
>   in.read(0, bytes, 0, bytes.length);
> }
> private void waitBlockTokenExpired(FSDataInputStream in1) throws Exception {
> DFSInputStream innerStream = (DFSInputStream) in1.getWrappedStream();
> for (LocatedBlock block : innerStream.getAllBlocks()) {
> while 

[jira] [Work logged] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format

2021-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15987?focusedWorklogId=689814&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689814
 ]

ASF GitHub Bot logged work on HDFS-15987:
-

Author: ASF GitHub Bot
Created on: 03/Dec/21 08:20
Start Date: 03/Dec/21 08:20
Worklog Time Spent: 10m 
  Work Description: whbing commented on a change in pull request #2918:
URL: https://github.com/apache/hadoop/pull/2918#discussion_r76172



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/PBImageTextWriter.java
##
@@ -651,14 +683,123 @@ private void output(Configuration conf, FileSummary summary,
 is = FSImageUtil.wrapInputStreamForCompression(conf,
 summary.getCodec(), new BufferedInputStream(new LimitInputStream(
 fin, section.getLength())));
-outputINodes(is);
+INodeSection s = INodeSection.parseDelimitedFrom(is);
+LOG.info("Found {} INodes in the INode section", s.getNumInodes());
+int count = outputINodes(is, out);
+LOG.info("Outputted {} INodes.", count);
   }
 }
 afterOutput();
 long timeTaken = Time.monotonicNow() - startTime;
 LOG.debug("Time to output inodes: {}ms", timeTaken);
   }
 
+  /**
+   * STEP 1: Process the sub-sections with multiple threads.
+   * Given n (n > 1) threads to process k (k >= n) sections,
+   * e.g. 10 sections and 4 threads, grouped as follows:
+   * |-------------------------------------------|
+   * |  (0 1 2)   (3 4 5)    (6 7)     (8 9)     |
+   * | thread[0] thread[1] thread[2] thread[3]   |
+   * |-------------------------------------------|
+   *
+   * STEP 2: Merge the output files.
+   */
+  private void outputInParallel(Configuration conf, FileSummary summary,
+  ArrayList<FileSummary.Section> subSections)
+  throws IOException {
+int nThreads = Integer.min(numThreads, subSections.size());
+LOG.info("Outputting in parallel with {} sub-sections" +
+" using {} threads", subSections.size(), nThreads);
+final CopyOnWriteArrayList<IOException> exceptions =
+new CopyOnWriteArrayList<>();
+Thread[] threads = new Thread[nThreads];
+String[] paths = new String[nThreads];
+for (int i = 0; i < paths.length; i++) {
+  paths[i] = parallelOut + ".tmp." + i;
+}
+AtomicLong expectedINodes = new AtomicLong(0);
+AtomicLong totalParsed = new AtomicLong(0);
+String codec = summary.getCodec();
+
+int mark = 0;
+for (int i = 0; i < nThreads; i++) {
+  // Each thread processes different ordered sub-sections
+  // and outputs to different paths
+  int step = subSections.size() / nThreads +
+  (i < subSections.size() % nThreads ? 1 : 0);
+  int start = mark;
+  int end = start + step;
+  ArrayList<FileSummary.Section> subList = new ArrayList<>(
+  subSections.subList(start, end));
+  mark = end;
+  String path = paths[i];
+
+  threads[i] = new Thread(() -> {

Review comment:
   > Maybe thread pool is better here?
   
   @symious  Thanks! I will try this suggestion in next commit.
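The grouping rule in the loop above (each thread gets k/n sections, plus one extra when its index is below k % n) and the reviewer's thread-pool suggestion can be sketched together. This is a minimal standalone illustration under those assumptions, not the actual PBImageTextWriter code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SectionPartition {
    // Thread i receives k / n sections, plus one extra when i < k % n,
    // taken in order. For k = 10, n = 4 this yields (0 1 2)(3 4 5)(6 7)(8 9),
    // matching the diagram in the javadoc of the patch.
    static List<List<Integer>> partition(int k, int n) {
        List<List<Integer>> groups = new ArrayList<>();
        int mark = 0;
        for (int i = 0; i < n; i++) {
            int step = k / n + (i < k % n ? 1 : 0);
            List<Integer> group = new ArrayList<>();
            for (int j = mark; j < mark + step; j++) {
                group.add(j);
            }
            mark += step;
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<Integer>> groups = partition(10, 4);
        System.out.println(groups); // [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]

        // Submit each group to a fixed-size pool instead of creating raw
        // Threads, along the lines of the review suggestion.
        ExecutorService pool = Executors.newFixedThreadPool(groups.size());
        for (List<Integer> group : groups) {
            pool.submit(() -> System.out.println("processing sections " + group));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

A pool keeps worker creation, shutdown, and failure propagation in one place, which is the usual motivation for preferring `ExecutorService` over manually managed `Thread[]` arrays.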
   






Issue Time Tracking
---

Worklog Id: (was: 689814)
Time Spent: 4h 20m  (was: 4h 10m)

> Improve oiv tool to parse fsimage file in parallel with delimited format
> 
>
> Key: HDFS-15987
> URL: https://issues.apache.org/jira/browse/HDFS-15987
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Major
>  Labels: pull-request-available
> Attachments: Improve_oiv_tool_001.pdf
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> The purpose of this Jira is to improve the oiv tool to parse an fsimage with 
> sub-sections (see -HDFS-14617-) in parallel with the delimited format. 
> 1. Serial parsing is time-consuming
> The time to serially parse a large fsimage with delimited format (e.g. `hdfs 
> oiv -p Delimited -t  ...`) is as follows: 
> {code:java}
> 1) Loading string table:                 -> Not time consuming.
> 2) Loading inode references:             -> Not time consuming.
> 3) Loading directories in INode section: -> Slightly time consuming (3%).
> 4) Loading INode directory section:      -> A bit time consuming (11%).
> 5) Output:                               -> Very time consuming (86%).{code}
> Therefore, output is the