[jira] [Commented] (HDFS-17141) Optimize the default parameters for the FileUtil.isRegularFile() method

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752637#comment-17752637
 ] 

ASF GitHub Bot commented on HDFS-17141:
---

2005hithlj commented on PR #5925:
URL: https://github.com/apache/hadoop/pull/5925#issuecomment-1672599391

   @Hexiaoqiao  @slfan1989   Because the current default parameters of this 
method are not optimal, my understanding is that the variant with default 
parameters should be the one most other methods can call directly.
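
   For readers outside the PR thread, the change being discussed follows the 
usual overload-with-defaults pattern sketched below; the wrapper class, the 
allowLinks parameter, and the chosen default value are illustrative 
assumptions and may not match the actual FileUtil code or this patch.

   ```
   import java.io.File;
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.LinkOption;
   import java.nio.file.attribute.BasicFileAttributes;

   final class FileUtilSketch {
     // Default-parameter overload: most call sites can invoke this directly
     // instead of spelling out the extra argument themselves.
     static boolean isRegularFile(File file) {
       return isRegularFile(file, true);
     }

     // Parameterized variant; allowLinks == false inspects the file itself
     // rather than following a symbolic link to its target.
     static boolean isRegularFile(File file, boolean allowLinks) {
       try {
         BasicFileAttributes attrs = allowLinks
             ? Files.readAttributes(file.toPath(), BasicFileAttributes.class)
             : Files.readAttributes(file.toPath(), BasicFileAttributes.class,
                 LinkOption.NOFOLLOW_LINKS);
         return attrs.isRegularFile();
       } catch (IOException e) {
         return false;
       }
     }
   }
   ```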




> Optimize the default parameters for the FileUtil.isRegularFile() method
> ---
>
> Key: HDFS-17141
> URL: https://issues.apache.org/jira/browse/HDFS-17141
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Liangjun He
>Assignee: Liangjun He
>Priority: Minor
>  Labels: pull-request-available
>
> Optimize the default parameters of the FileUtil.isRegularFile() method to 
> facilitate direct invocation by more methods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17140) Optimize the BPOfferService.reportBadBlocks() method

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752636#comment-17752636
 ] 

ASF GitHub Bot commented on HDFS-17140:
---

2005hithlj commented on code in PR #5924:
URL: https://github.com/apache/hadoop/pull/5924#discussion_r1289578811


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java:
##
@@ -291,9 +291,8 @@ public String toString() {
   void reportBadBlocks(ExtendedBlock block,
String storageUuid, StorageType storageType) {
 checkBlock(block);
+ReportBadBlockAction rbbAction = new ReportBadBlockAction(block, 
storageUuid, storageType);

Review Comment:
   @slfan1989  This improvement may not have much effect, but it does avoid 
creating redundant, useless temporary objects. From a code-style perspective, 
the current implementation is also not elegant.
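
   For context, combining this hunk with the original loop quoted later in 
this thread, the reshaped method would look roughly like the sketch below 
(not the verbatim patch):

   ```
   void reportBadBlocks(ExtendedBlock block,
       String storageUuid, StorageType storageType) {
     checkBlock(block);
     // Build the action once; the same instance is enqueued to every actor,
     // instead of allocating a new ReportBadBlockAction per loop iteration.
     ReportBadBlockAction rbbAction =
         new ReportBadBlockAction(block, storageUuid, storageType);
     for (BPServiceActor actor : bpServices) {
       actor.bpThreadEnqueue(rbbAction);
     }
   }
   ```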





> Optimize the BPOfferService.reportBadBlocks() method
> 
>
> Key: HDFS-17140
> URL: https://issues.apache.org/jira/browse/HDFS-17140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Liangjun He
>Assignee: Liangjun He
>Priority: Minor
>  Labels: pull-request-available
>
> The current BPOfferService.reportBadBlocks() method can be optimized by 
> moving the creation of the rbbAction object outside the loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17140) Optimize the BPOfferService.reportBadBlocks() method

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752635#comment-17752635
 ] 

ASF GitHub Bot commented on HDFS-17140:
---

2005hithlj commented on code in PR #5924:
URL: https://github.com/apache/hadoop/pull/5924#discussion_r1285727589


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java:
##
@@ -291,10 +291,11 @@ public String toString() {
   void reportBadBlocks(ExtendedBlock block,
String storageUuid, StorageType storageType) {
 checkBlock(block);
-for (BPServiceActor actor : bpServices) {
-  ReportBadBlockAction rbbAction = new ReportBadBlockAction
-  (block, storageUuid, storageType);
-  actor.bpThreadEnqueue(rbbAction);
+if (!bpServices.isEmpty()) {

Review Comment:
   @Hexiaoqiao sir, bpServices will definitely not be empty here; I will remove 
this check.



##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java:
##
@@ -291,9 +291,8 @@ public String toString() {
   void reportBadBlocks(ExtendedBlock block,
String storageUuid, StorageType storageType) {
 checkBlock(block);
+ReportBadBlockAction rbbAction = new ReportBadBlockAction(block, 
storageUuid, storageType);

Review Comment:
   @slfan1989  This improvement may not have much effect, but it does avoid 
creating redundant, useless temporary objects. From a code-style perspective, 
the current implementation is also not elegant.





> Optimize the BPOfferService.reportBadBlocks() method
> 
>
> Key: HDFS-17140
> URL: https://issues.apache.org/jira/browse/HDFS-17140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Liangjun He
>Assignee: Liangjun He
>Priority: Minor
>  Labels: pull-request-available
>
> The current BPOfferService.reportBadBlocks() method can be optimized by 
> moving the creation of the rbbAction object outside the loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752631#comment-17752631
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

haiyang1987 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289579716


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, 
INodesInPath iip,
 lastBlock.getBlockType());
   }
 
-  if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+  int minLocationsNum = 1;
+  if (lastBlock.isStriped()) {
+minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+  }
+  if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   I think the write path of EC files does call getAdditionalBlock and does set 
ReplicaUnderConstruction[] replicas. Is it only for a standby NameNode after an 
HA failover that BlockUnderConstructionFeature#replicas is populated entirely 
from block reports (IBR or FBR)?





> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752630#comment-17752630
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289579189


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, 
INodesInPath iip,
 lastBlock.getBlockType());
   }
 
-  if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+  int minLocationsNum = 1;
+  if (lastBlock.isStriped()) {
+minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+  }
+  if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   @zhangshuyan0 Thanks a lot. I ignored the failover condition.





> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752626#comment-17752626
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

haiyang1987 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289559249


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, 
INodesInPath iip,
 lastBlock.getBlockType());
   }
 
-  if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+  int minLocationsNum = 1;
+  if (lastBlock.isStriped()) {
+minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+  }
+  if (uc.getNumExpectedLocations() < minLocationsNum &&
+  lastBlock.getNumBytes() == 0) {
 // There is no datanode reported to this block.
 // may be client have crashed before writing data to pipeline.
 // This blocks doesn't need any recovery.
 // We can remove this block and close the file.
 pendingFile.removeLastBlock(lastBlock);
 finalizeINodeFileUnderConstruction(src, pendingFile,
 iip.getLatestSnapshotId(), false);
-NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-+ "Removed empty last block and closed file " + src);
+if (uc.getNumExpectedLocations() == 0) {

Review Comment:
   How about updating this log logic as follows?
   ```
   if (lastBlock.isStriped()) {
     NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
         + "Removed last unrecoverable block group and closed file " + src);
   } else {
     NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
         + "Removed empty last block and closed file " + src);
   }
   ```





> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752625#comment-17752625
 ] 

ASF GitHub Bot commented on HDFS-17148:
---

slfan1989 commented on PR #5936:
URL: https://github.com/apache/hadoop/pull/5936#issuecomment-1672563962

   @goiri @simbadzina  Could you help review this PR? Thank you very much! From 
my point of view, this PR looks good.




> RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL
> ---
>
> Key: HDFS-17148
> URL: https://issues.apache.org/jira/browse/HDFS-17148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Hector Sandoval Chaverri
>Priority: Major
>  Labels: pull-request-available
>
> The SQLDelegationTokenSecretManager fetches tokens from SQL and stores them 
> temporarily in a memory cache with a short TTL. The ExpiredTokenRemover in 
> AbstractDelegationTokenSecretManager runs periodically to cleanup any expired 
> tokens from the cache, but most tokens have been evicted automatically per 
> the TTL configuration. This leads to many expired tokens in the SQL database 
> that should be cleaned up.
> The SQLDelegationTokenSecretManager should find expired tokens in SQL instead 
> of in the memory cache when running the periodic cleanup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752622#comment-17752622
 ] 

ASF GitHub Bot commented on HDFS-17148:
---

hadoop-yetus commented on PR #5936:
URL: https://github.com/apache/hadoop/pull/5936#issuecomment-167274

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 54s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  16m 49s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  36m 30s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  18m 29s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m 59s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   4m 42s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 33s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 58s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 27s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   4m  6s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 16s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 29s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |  17m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 59s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |  16m 59s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   4m 32s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   2m 27s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 53s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   4m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 13s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 53s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  |  22m 20s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 59s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 281m 58s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5936/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5936 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux e66ca09926f6 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 0c721a2df50771f821585ab1a47657081aaac84d |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5936/2/testReport/ |
   | Max. process+thread count | 2652 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdfs-rbf U: . |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5936/2/console |
   | 

[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752615#comment-17752615
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289529656


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, 
INodesInPath iip,
 lastBlock.getBlockType());
   }
 
-  if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+  int minLocationsNum = 1;
+  if (lastBlock.isStriped()) {
+minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+  }
+  if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   First, the writing process of EC files never calls `getAdditionalBlock`. 
Second, after a NameNode failover, the content of 
`BlockUnderConstructionFeature#replicas` depends entirely on block reports 
(IBR or FBR).





> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752612#comment-17752612
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289519764


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, 
INodesInPath iip,
 lastBlock.getBlockType());
   }
 
-  if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+  int minLocationsNum = 1;
+  if (lastBlock.isStriped()) {
+minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+  }
+  if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   Hi @zhangshuyan0, I also have a question here: is 
`uc.getNumExpectedLocations()` always `>=` minLocationsNum? Because when the 
getAdditionalBlock RPC is invoked, the replicas field in class 
BlockUnderConstructionFeature has already been set to the targets.





> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752611#comment-17752611
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289517665


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, 
INodesInPath iip,
 lastBlock.getBlockType());
   }
 
-  if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+  int minLocationsNum = 1;
+  if (lastBlock.isStriped()) {
+minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+  }
+  if (uc.getNumExpectedLocations() < minLocationsNum &&
+  lastBlock.getNumBytes() == 0) {

Review Comment:
   Thanks for your review. This log message is not quite suitable; I will make 
some changes.





> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752606#comment-17752606
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289509777


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, 
INodesInPath iip,
 lastBlock.getBlockType());
   }
 
-  if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+  int minLocationsNum = 1;
+  if (lastBlock.isStriped()) {
+minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+  }
+  if (uc.getNumExpectedLocations() < minLocationsNum &&
+  lastBlock.getNumBytes() == 0) {

Review Comment:
   @zhangshuyan0 Should we log this special case or not?  
https://github.com/apache/hadoop/blob/4e7cd881a92b7deb7d8d188b8c1dc85ca6e8ee5f/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L3819
   





> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752602#comment-17752602
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1672500893

   @zhangshuyan0 Thanks a lot for reporting this bug. This phenomenon also 
happens on our clusters. I will leave a few questions; I hope to receive your 
reply. Thanks.




> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17150:
--
Labels: pull-request-available  (was: )

> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752600#comment-17752600
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 opened a new pull request, #5937:
URL: https://github.com/apache/hadoop/pull/5937

   EC: Fix the bug of failed lease recovery.
   
   If the client crashes without writing the minimum number of internal blocks 
required by the EC policy, the lease recovery process for the corresponding 
unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
timeline is as follows:
   1. The client writes some data to only 5 datanodes;
   2. Client crashes;
   3. NN fails over;
   4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
block report, and there are 5 datanodes reporting internal blocks;
   5. When the lease expires hard limit, NN issues a block recovery command;
   6. The datanode checks the command and finds that the number of internal 
blocks is insufficient, resulting in an exception and recovery failure;
   
https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java#L534-L540
   7. The lease expires hard limit again, and NN issues a block recovery 
command again, but the recovery fails again..
   
   When the number of internal blocks written by the client is less than 6, the 
block group is actually unrecoverable. We should equate this situation to the 
case where the number of replicas is 0 when processing replica files, i.e., 
directly remove the last block group and close the file.
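
   Condensed from the diff hunks quoted in the review comments above, the core 
of the change in FSNamesystem#internalReleaseLease is roughly the following 
sketch (the exact warning message was still being discussed in the review):

   ```
   int minLocationsNum = 1;
   if (lastBlock.isStriped()) {
     // For an EC block group, fewer expected locations than the real number
     // of data blocks means the group can never be recovered.
     minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
   }
   if (uc.getNumExpectedLocations() < minLocationsNum &&
       lastBlock.getNumBytes() == 0) {
     // Treat this like the replicated case with zero expected locations:
     // drop the last (empty, unrecoverable) block group and close the file.
     pendingFile.removeLastBlock(lastBlock);
     finalizeINodeFileUnderConstruction(src, pendingFile,
         iip.getLatestSnapshotId(), false);
     NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
         + "Removed last unrecoverable block group and closed file " + src);
   }
   ```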
   
   




> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Priority: Major
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease expires hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease expires hard limit again, and NN issues a block recovery command 
> again, but the recovery fails again..
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread Shuyan Zhang (Jira)
Shuyan Zhang created HDFS-17150:
---

 Summary: EC: Fix the bug of failed lease recovery.
 Key: HDFS-17150
 URL: https://issues.apache.org/jira/browse/HDFS-17150
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Shuyan Zhang


If the client crashes without writing the minimum number of internal blocks 
required by the EC policy, the lease recovery process for the corresponding 
unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
timeline is as follows:
1. The client writes some data to only 5 datanodes;
2. Client crashes;
3. NN fails over;
4. Now the result of `uc.getNumExpectedLocations()` completely depends on block 
report, and there are 5 datanodes reporting internal blocks;
5. When the lease expires hard limit, NN issues a block recovery command;
6. The datanode checks the command and finds that the number of internal blocks 
is insufficient, resulting in an error and recovery failure;

7. The lease expires hard limit again, and NN issues a block recovery command 
again, but the recovery fails again..

When the number of internal blocks written by the client is less than 6, the 
block group is actually unrecoverable. We should equate this situation to the 
case where the number of replicas is 0 when processing replica files, i.e., 
directly remove the last block group and close the file.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752539#comment-17752539
 ] 

ASF GitHub Bot commented on HDFS-17148:
---

hadoop-yetus commented on PR #5936:
URL: https://github.com/apache/hadoop/pull/5936#issuecomment-1672270862

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  18m  9s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  36m  3s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  19m 16s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |  17m 18s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   4m 41s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 33s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 59s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   4m  3s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 27s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 31s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 39s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |  17m 39s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 14s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |  17m 14s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5936/1/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  checkstyle  |   4m 33s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   2m 26s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 54s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | -1 :x: |  spotbugs  |   1m 44s | 
[/new-spotbugs-hadoop-hdfs-project_hadoop-hdfs-rbf.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5936/1/artifact/out/new-spotbugs-hadoop-hdfs-project_hadoop-hdfs-rbf.html)
 |  hadoop-hdfs-project/hadoop-hdfs-rbf generated 1 new + 0 unchanged - 0 fixed 
= 1 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  39m 33s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 53s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  |  22m 24s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  1s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 299m  5s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | SpotBugs | module:hadoop-hdfs-project/hadoop-hdfs-rbf |
   |  |  
org.apache.hadoop.hdfs.server.federation.router.security.token.SQLDelegationTokenSecretManagerImpl.lambda$selectTokenInfos$4(long,
 int) may fail to clean up java.sql.ResultSet  Obligation to clean up resource 
created at SQLDelegationTokenSecretManagerImpl.java:clean up java.sql.ResultSet 
 Obligation to clean up resource created at 
SQLDelegationTokenSecretManagerImpl.java:[line 165] is not discharged |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5936/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5936 |
   | Optional Tests | dupname asflicense compile javac 

[jira] [Commented] (HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752534#comment-17752534
 ] 

ASF GitHub Bot commented on HDFS-17148:
---

slfan1989 commented on PR #5936:
URL: https://github.com/apache/hadoop/pull/5936#issuecomment-1672259214

   Good catch!




> RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL
> ---
>
> Key: HDFS-17148
> URL: https://issues.apache.org/jira/browse/HDFS-17148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Hector Sandoval Chaverri
>Priority: Major
>  Labels: pull-request-available
>
> The SQLDelegationTokenSecretManager fetches tokens from SQL and stores them 
> temporarily in a memory cache with a short TTL. The ExpiredTokenRemover in 
> AbstractDelegationTokenSecretManager runs periodically to cleanup any expired 
> tokens from the cache, but most tokens have been evicted automatically per 
> the TTL configuration. This leads to many expired tokens in the SQL database 
> that should be cleaned up.
> The SQLDelegationTokenSecretManager should find expired tokens in SQL instead 
> of in the memory cache when running the periodic cleanup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL

2023-08-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17148:
--
Labels: pull-request-available  (was: )

> RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL
> ---
>
> Key: HDFS-17148
> URL: https://issues.apache.org/jira/browse/HDFS-17148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Hector Sandoval Chaverri
>Priority: Major
>  Labels: pull-request-available
>
> The SQLDelegationTokenSecretManager fetches tokens from SQL and stores them 
> temporarily in a memory cache with a short TTL. The ExpiredTokenRemover in 
> AbstractDelegationTokenSecretManager runs periodically to cleanup any expired 
> tokens from the cache, but most tokens have been evicted automatically per 
> the TTL configuration. This leads to many expired tokens in the SQL database 
> that should be cleaned up.
> The SQLDelegationTokenSecretManager should find expired tokens in SQL instead 
> of in the memory cache when running the periodic cleanup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17148) RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752489#comment-17752489
 ] 

ASF GitHub Bot commented on HDFS-17148:
---

hchaverri opened a new pull request, #5936:
URL: https://github.com/apache/hadoop/pull/5936

   
   
   ### Description of PR
   JIRA: HDFS-17148. RBF: SQLDelegationTokenSecretManager must cleanup expired 
tokens in SQL
   
   These changes update the SQLDelegationTokenSecretManager to cleanup expired 
tokens found in SQL. Currently, AbstractDelegationTokenSecretManagers only 
cleanup tokens in its memory cache. The SQLDelegationTokenSecretManager was 
recently updated to use a LoadingCache with a short TTL, so most expired tokens 
won't be present in memory.
   
   During token cleanup, the SQLDelegationTokenSecretManager will query SQL for 
a list of tokens that have not been updated recently, based on the modifiedTime 
column. We will limit the amount of results returned to prevent performance 
impact on SQL. Once the list is returned, the ExpiredTokenRemover will evaluate 
if the tokens are actually expired and delete them from SQL if so.
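
   As a rough illustration of that flow only (the table name, column names, 
and LIMIT syntax below are assumptions, not the actual token schema or the 
SQLDelegationTokenSecretManager API):

   ```
   import java.sql.Connection;
   import java.sql.PreparedStatement;
   import java.sql.ResultSet;
   import java.sql.SQLException;
   import java.util.ArrayList;
   import java.util.List;

   final class StaleTokenQuerySketch {
     // Fetch candidate tokens whose modifiedTime is older than a cutoff,
     // capped at maxResults to limit the load on the SQL server. The caller
     // (the periodic cleanup) still checks real expiration before deleting.
     static List<byte[]> selectStaleTokenInfos(Connection conn,
         long cutoffMillis, int maxResults) throws SQLException {
       String sql = "SELECT tokenInfo FROM Tokens"
           + " WHERE modifiedTime < ? ORDER BY modifiedTime LIMIT ?";
       try (PreparedStatement stmt = conn.prepareStatement(sql)) {
         stmt.setLong(1, cutoffMillis);
         stmt.setInt(2, maxResults);
         // try-with-resources closes the ResultSet: the kind of cleanup the
         // SpotBugs finding in the Yetus report earlier in this digest flags.
         try (ResultSet rs = stmt.executeQuery()) {
           List<byte[]> tokenInfos = new ArrayList<>();
           while (rs.next()) {
             tokenInfos.add(rs.getBytes("tokenInfo"));
           }
           return tokenInfos;
         }
       }
     }
   }
   ```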
   
   ### How was this patch tested?
   Added unit test for different token cleanup scenarios:
   1. Having an expired token in SQL. which should be deleted
   2. Having a token with a long renewal time, which should not be deleted
   3. Having a token recently renewed, which should not be deleted
   
   ### For code changes:
   
   - [Y] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [Y] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [Y] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [Y] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> RBF: SQLDelegationTokenSecretManager must cleanup expired tokens in SQL
> ---
>
> Key: HDFS-17148
> URL: https://issues.apache.org/jira/browse/HDFS-17148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Hector Sandoval Chaverri
>Priority: Major
>
> The SQLDelegationTokenSecretManager fetches tokens from SQL and stores them 
> temporarily in a memory cache with a short TTL. The ExpiredTokenRemover in 
> AbstractDelegationTokenSecretManager runs periodically to cleanup any expired 
> tokens from the cache, but most tokens have been evicted automatically per 
> the TTL configuration. This leads to many expired tokens in the SQL database 
> that should be cleaned up.
> The SQLDelegationTokenSecretManager should find expired tokens in SQL instead 
> of in the memory cache when running the periodic cleanup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17149) getBlockLocations RPC should use actual client ip to compute network distance when using RBF.

2023-08-09 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752303#comment-17752303
 ] 

farmmamba commented on HDFS-17149:
--

[~hexiaoqiao] Sir, thanks a lot for pointing that out. I have read the comments 
in HDFS-15079, and I think the current issue is different from HDFS-15079.

The clientMachine parameter in the FSNamesystem#sortLocatedBlocks method is the 
router's IP address, but I think it should be the actual client IP.
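
Purely as an illustration of the direction, a minimal sketch is shown below. 
The "clientIp:" key and the assumption that the Router forwards the real 
client address in the RPC CallerContext are hypothetical here; the actual 
mechanism is what HDFS-15079 and related work discuss.

```
// Hypothetical helper: prefer a forwarded real-client address over the RPC
// connection address (which, behind RBF, is the Router's own IP).
static String resolveClientMachine(String connectionAddress,
    String callerContext) {
  if (callerContext != null) {
    for (String field : callerContext.split(",")) {
      if (field.startsWith("clientIp:")) {
        return field.substring("clientIp:".length());
      }
    }
  }
  // No forwarded address available: fall back to the direct peer.
  return connectionAddress;
}
```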

> getBlockLocations RPC should use actual client ip to compute network distance 
> when using RBF.
> -
>
> Key: HDFS-17149
> URL: https://issues.apache.org/jira/browse/HDFS-17149
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Please correct me if i understand wrongly. Thanks.
> Currently, when a getBlockLocations RPC forwards to namenode via router.  
> NameNode will use router ip address as client machine to compute network 
> distance against block's locations. See FSNamesystem#sortLocatedBlocks method 
> for more detailed information.  
> I think this compute method is not correct and should use actual client ip.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17149) getBlockLocations RPC should use actual client ip to compute network distance when using RBF.

2023-08-09 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17149:
-
Description: 
Please correct me if i understand wrongly. Thanks.

Currently, when a getBlockLocations RPC forwards to namenode via router.  
NameNode will use router ip address as client machine to compute network 
distance against block's locations. See FSNamesystem#sortLocatedBlocks method 
for more detailed information.  

I think this compute method is not correct and should use actual client ip.

 

  was:
Please correct me if i understand wrongly. Thanks.

Currently, when a getBlockLocations RPC forwards to namenode via router.  
NameNode will use router ip address as client machine to compute network 
distance against block's locations. See FSNamesystem#sortLocatedBlocksMore 
method for more detailed information.  

I think this compute method is not correct and should use actual client ip.

 


> getBlockLocations RPC should use actual client ip to compute network distance 
> when using RBF.
> -
>
> Key: HDFS-17149
> URL: https://issues.apache.org/jira/browse/HDFS-17149
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Please correct me if i understand wrongly. Thanks.
> Currently, when a getBlockLocations RPC forwards to namenode via router.  
> NameNode will use router ip address as client machine to compute network 
> distance against block's locations. See FSNamesystem#sortLocatedBlocks method 
> for more detailed information.  
> I think this compute method is not correct and should use actual client ip.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17149) getBlockLocations RPC should use actual client ip to compute network distance when using RBF.

2023-08-09 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752284#comment-17752284
 ] 

Xiaoqiao He commented on HDFS-17149:


Hi [~zhanghaobo], Please check if HDFS-15079 can solve this issue.

> getBlockLocations RPC should use actual client ip to compute network distance 
> when using RBF.
> -
>
> Key: HDFS-17149
> URL: https://issues.apache.org/jira/browse/HDFS-17149
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>
> Please correct me if I understand this wrongly. Thanks.
> Currently, when a getBlockLocations RPC is forwarded to the NameNode via a 
> Router, the NameNode uses the Router's IP address as the client machine to 
> compute the network distance against the block's locations. See the 
> FSNamesystem#sortLocatedBlocksMore method for more details.
> I think this computation is not correct; it should use the actual client IP.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752282#comment-17752282
 ] 

ASF GitHub Bot commented on HDFS-17030:
---

xinglin commented on PR #5878:
URL: https://github.com/apache/hadoop/pull/5878#issuecomment-1670751349

   > > Hi @goiri,
   > > could you take a look at this backport PR for branch-3.3 as well? thanks,
   > 
   > You'd have to put a separate PR together I'd say.
   
   I am confused: this is a separate PR, right?




> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When NameNode HA is enabled and a standby NN is not responsive, we have 
> observed that it can take a long time to serve a request, even though a 
> healthy observer or active NN is available. 
> Basically, when a standby is down, the RPC client will (re)try to create a 
> socket connection to that standby for _ipc.client.connect.timeout_ * 
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a 
> heap dump at a standby, the NN still accepts the socket connection but does 
> not respond to these RPC requests, so we time out after 
> _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at 
> LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, so a request 
> takes more than 2 minutes to complete when we take a heap dump at a standby. 
> This has been causing user job failures. 
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending 
> getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we 
> would still use the original value from the config). However, that would 
> double the number of socket connections between clients and the NN, which is 
> a deal-breaker. 
> The proposal is to add a timeout on getHAServiceState() calls in 
> ObserverReaderProxy: we wait only up to the timeout for an NN to report its 
> HA state. Once we pass that timeout, we move on to probe the next NN. 
>  
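
As a rough illustration of the proposed pattern (plain Java; the pool, method, and field names below are invented and are not the actual ObserverReadProxyProvider members), the HA-state probe can be bounded with Future.get(timeout) instead of lowering the client-wide rpc timeout:

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: bound a potentially slow probe without touching ipc.client.rpc-timeout.ms. */
public class BoundedProbeSketch {

  private static final ExecutorService PROBE_POOL = Executors.newCachedThreadPool();

  /** Stand-in for the HA-state probe; pretend the NN is busy (e.g. taking a heap dump). */
  static String probeHAState() throws InterruptedException {
    Thread.sleep(5_000);
    return "OBSERVER";
  }

  /** Returns the probed state, or null if the NN did not answer within timeoutMs. */
  static String probeWithTimeout(long timeoutMs) {
    Callable<String> probe = BoundedProbeSketch::probeHAState;
    Future<String> task = PROBE_POOL.submit(probe);
    try {
      return timeoutMs > 0 ? task.get(timeoutMs, TimeUnit.MILLISECONDS) : task.get();
    } catch (TimeoutException e) {
      task.cancel(true);  // give up on this NN; the caller can probe the next one
      return null;
    } catch (InterruptedException | ExecutionException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // Prints "state = null" after ~1 second instead of blocking for the full probe.
    System.out.println("state = " + probeWithTimeout(1_000));
    PROBE_POOL.shutdownNow();
  }
}
{code}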



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752281#comment-17752281
 ] 

ASF GitHub Bot commented on HDFS-17030:
---

xinglin commented on code in PR #5878:
URL: https://github.com/apache/hadoop/pull/5878#discussion_r1288010220


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ObserverReadProxyProvider.java:
##
@@ -285,13 +323,67 @@ private synchronized NNProxyInfo<T> changeProxy(NNProxyInfo<T> initial) {
     }
     currentIndex = (currentIndex + 1) % nameNodeProxies.size();
     currentProxy = createProxyIfNeeded(nameNodeProxies.get(currentIndex));
-    currentProxy.setCachedState(getHAServiceState(currentProxy));
+    currentProxy.setCachedState(getHAServiceStateWithTimeout(currentProxy));
     LOG.debug("Changed current proxy from {} to {}",
         initial == null ? "none" : initial.proxyInfo,
         currentProxy.proxyInfo);
     return currentProxy;
   }
 
+  /**
+   * Execute getHAServiceState() call with a timeout, to avoid a long wait when
+   * an NN becomes irresponsive to rpc requests
+   * (when a thread/heap dump is being taken, e.g.).
+   *
+   * For each getHAServiceState() call, a task is created and submitted to a
+   * threadpool for execution. We will wait for a response up to
+   * namenodeHAStateProbeTimeoutSec and cancel these requests if they time out.
+   *
+   * The implementation is split into two functions so that we can unit test
+   * the second function.
+   */
+  HAServiceState getHAServiceStateWithTimeout(final NNProxyInfo<T> proxyInfo) {
+    Callable<HAServiceState> getHAServiceStateTask = () -> getHAServiceState(proxyInfo);
+
+    try {
+      Future<HAServiceState> task =
+          nnProbingThreadPool.submit(getHAServiceStateTask);

Review Comment:
   Fixed. It fits on one line within 100 characters, so I did not bother splitting it into two lines.





> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When NameNode HA is enabled and a standby NN is not responsive, we have 
> observed that it can take a long time to serve a request, even though a 
> healthy observer or active NN is available. 
> Basically, when a standby is down, the RPC client will (re)try to create a 
> socket connection to that standby for _ipc.client.connect.timeout_ * 
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a 
> heap dump at a standby, the NN still accepts the socket connection but does 
> not respond to these RPC requests, so we time out after 
> _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at 
> LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, so a request 
> takes more than 2 minutes to complete when we take a heap dump at a standby. 
> This has been causing user job failures. 
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending 
> getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we 
> would still use the original value from the config). However, that would 
> double the number of socket connections between clients and the NN, which is 
> a deal-breaker. 
> The proposal is to add a timeout on getHAServiceState() calls in 
> ObserverReaderProxy: we wait only up to the timeout for an NN to report its 
> HA state. Once we pass that timeout, we move on to probe the next NN. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752280#comment-17752280
 ] 

ASF GitHub Bot commented on HDFS-17030:
---

xinglin commented on code in PR #5878:
URL: https://github.com/apache/hadoop/pull/5878#discussion_r1288009483


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/ObserverReadProxyProvider.java:
##
@@ -285,13 +323,67 @@ private synchronized NNProxyInfo<T> changeProxy(NNProxyInfo<T> initial) {
     }
     currentIndex = (currentIndex + 1) % nameNodeProxies.size();
     currentProxy = createProxyIfNeeded(nameNodeProxies.get(currentIndex));
-    currentProxy.setCachedState(getHAServiceState(currentProxy));
+    currentProxy.setCachedState(getHAServiceStateWithTimeout(currentProxy));
     LOG.debug("Changed current proxy from {} to {}",
         initial == null ? "none" : initial.proxyInfo,
         currentProxy.proxyInfo);
     return currentProxy;
   }
 
+  /**
+   * Execute getHAServiceState() call with a timeout, to avoid a long wait when
+   * an NN becomes irresponsive to rpc requests
+   * (when a thread/heap dump is being taken, e.g.).
+   *
+   * For each getHAServiceState() call, a task is created and submitted to a
+   * threadpool for execution. We will wait for a response up to
+   * namenodeHAStateProbeTimeoutSec and cancel these requests if they time out.
+   *
+   * The implementation is split into two functions so that we can unit test
+   * the second function.
+   */
+  HAServiceState getHAServiceStateWithTimeout(final NNProxyInfo<T> proxyInfo) {
+    Callable<HAServiceState> getHAServiceStateTask = () -> getHAServiceState(proxyInfo);
+
+    try {
+      Future<HAServiceState> task =
+          nnProbingThreadPool.submit(getHAServiceStateTask);
+      return getHAServiceStateWithTimeout(proxyInfo, task);
+    } catch (RejectedExecutionException e) {
+      LOG.warn("Run out of threads to submit the request to query HA state. "
+          + "Ok to return null and we will fallback to use active NN to serve "
+          + "this request.");
+      return null;
+    }
+  }
+
+  HAServiceState getHAServiceStateWithTimeout(final NNProxyInfo<T> proxyInfo,
+      Future<HAServiceState> task) {
+    HAServiceState state = null;
+    try {
+      if (namenodeHAStateProbeTimeoutMs > 0) {
+        state = task.get(namenodeHAStateProbeTimeoutMs, TimeUnit.MILLISECONDS);
+      } else {
+        // Disable timeout by waiting indefinitely when namenodeHAStateProbeTimeoutSec
+        // is set to 0 or a negative value.
+        state = task.get();
+      }
+      LOG.debug("HA State for {} is {}", proxyInfo.proxyInfo, state);
+    } catch (TimeoutException e) {
+      // Cancel the task on timeout
+      String msg = String.format("Cancel NN probe task due to timeout for %s", proxyInfo.proxyInfo);
+      LOG.warn(msg, e);
+      if (task != null) {

Review Comment:
   removed.
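
Since the javadoc above notes that the method is split in two so the second half can be unit tested, here is a minimal, self-contained sketch (plain Java, not the actual test or class; stateWithTimeout is a simplified stand-in for the Future-taking overload) of how a Future that never completes exercises the timeout path without any RPC or thread pool:

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: drive the timeout branch directly by injecting the Future. */
public class TimeoutPathTestSketch {

  /** Simplified stand-in for the overload that takes a Future. */
  static String stateWithTimeout(Future<String> task, long timeoutMs) {
    try {
      return task.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      task.cancel(true);
      return null;  // caller falls back to probing the next NN
    } catch (Exception e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // A future that never completes simulates an unresponsive NameNode.
    Future<String> neverCompletes = new CompletableFuture<>();
    String result = stateWithTimeout(neverCompletes, 100);
    System.out.println("result = " + result);  // prints "result = null": timeout path taken
  }
}
{code}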





> Limit wait time for getHAServiceState in ObserverReaderProxy
> 
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When NameNode HA is enabled and a standby NN is not responsive, we have 
> observed that it can take a long time to serve a request, even though a 
> healthy observer or active NN is available. 
> Basically, when a standby is down, the RPC client will (re)try to create a 
> socket connection to that standby for _ipc.client.connect.timeout_ * 
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a 
> heap dump at a standby, the NN still accepts the socket connection but does 
> not respond to these RPC requests, so we time out after 
> _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters at 
> LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, so a request 
> takes more than 2 minutes to complete when we take a heap dump at a standby. 
> This has been causing user job failures. 
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending 
> getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we 
> would still use the original value from the config). However, that would 
> double the number of socket connections between clients and the NN, which is 
> a deal-breaker. 
> The proposal is to add a timeout on getHAServiceState() calls in 
> ObserverReaderProxy: we wait only up to the timeout for an NN to report its 
> HA state. Once we pass that timeout, we move on to probe the next NN. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)