[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=750731&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-750731
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 31/Mar/22 04:33
Start Date: 31/Mar/22 04:33
Worklog Time Spent: 10m 
  Work Description: lfxy opened a new pull request #4126:
URL: https://github.com/apache/hadoop/pull/4126


   In the scenario below, decommission fails with the TOO_MANY_NODES_ON_RACK reason:
   1. An EC policy such as RS-6-3-1024k is enabled.
   2. The number of racks in the cluster is equal to or less than the replication number (9 for RS-6-3).
   3. A rack has only one DN, and that DN is decommissioned.
   The root cause is in BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which computes the limit maxNodesPerRack used when choosing targets. In this scenario maxNodesPerRack is 1, which means only one datanode can be chosen from each rack:
   ```java
   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
     ...
     // If more replicas than racks, evenly spread the replicas.
     // This calculation rounds up.
     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
     return new int[] {numOfReplicas, maxNodesPerRack};
   }
   ```
   The line `int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;` is evaluated with totalNumOfReplicas = 9 and numOfRacks = 9, so maxNodesPerRack = (9 - 1) / 9 + 1 = 1.
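   For concreteness, here is a tiny standalone sketch (illustrative only, not part of the patch) showing how that rounded-up division behaves with and without the soon-to-be-empty rack counted:
```java
public class MaxNodesPerRackDemo {
  // Same rounded-up division as in getMaxNodesPerRack().
  static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
    return (totalNumOfReplicas - 1) / numOfRacks + 1;
  }

  public static void main(String[] args) {
    // RS-6-3: 9 blocks over 9 racks -> each rack may supply only 1 node.
    System.out.println(maxNodesPerRack(9, 9)); // prints 1
    // If the rack emptied by decommission were excluded (8 usable racks),
    // the per-rack limit would relax to 2 and placement could succeed.
    System.out.println(maxNodesPerRack(9, 8)); // prints 2
  }
}
```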
   
   When we decommission a DN that is the only node in its rack, chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws NotEnoughReplicasException, but the exception is not caught, so the code never falls back to chooseEvenlyFromRemainingRacks().
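   The shape of the needed change is the same try/catch fallback the method already uses elsewhere. Below is a minimal, self-contained sketch of that control flow; all names, parameters, and bodies are illustrative stand-ins, not the real Hadoop signatures:
```java
import java.util.ArrayList;
import java.util.List;

class FallbackPlacementSketch {
  static class NotEnoughReplicasException extends Exception {}

  // Strict pass: honor the per-rack limit; throw when a target cannot be found.
  static void chooseOnce(List<String> results, int needed, int maxNodesPerRack)
      throws NotEnoughReplicasException {
    if (needed > maxNodesPerRack * 8) {  // pretend only 8 usable racks remain
      throw new NotEnoughReplicasException();
    }
    for (int i = 0; i < needed; i++) results.add("dn-" + i);
  }

  // Relaxed pass: spread the remainder over the racks that still have nodes.
  static void chooseEvenlyFromRemainingRacks(List<String> results, int needed) {
    for (int i = results.size(); i < needed; i++) results.add("dn-retry-" + i);
  }

  static List<String> chooseTargets(int needed, int maxNodesPerRack) {
    List<String> results = new ArrayList<>();
    try {
      chooseOnce(results, needed, maxNodesPerRack);      // first attempt
    } catch (NotEnoughReplicasException e) {
      chooseEvenlyFromRemainingRacks(results, needed);   // proposed fallback
    }
    return results;
  }

  public static void main(String[] args) {
    System.out.println(chooseTargets(9, 1)); // triggers the fallback path
  }
}
```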
   
   During decommission, after targets are chosen, verifyBlockPlacement() returns a total rack count that still includes the invalid (now empty) rack, so BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which also makes the decommission fail:
   ```java
   public boolean isPlacementPolicySatisfied() {
     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
   }
   ```
   For example, with requiredRacks = 9 and currentRacks = 8 but totalRacks = 9 (the empty rack still counted), both clauses are false.
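   A quick worked check of that predicate with the numbers from this scenario (illustrative values only):
```java
public class PlacementStatusDemo {
  // Same predicate as BlockPlacementStatusDefault.isPlacementPolicySatisfied().
  static boolean satisfied(int requiredRacks, int currentRacks, int totalRacks) {
    return requiredRacks <= currentRacks || currentRacks >= totalRacks;
  }

  public static void main(String[] args) {
    // RS-6-3: 9 blocks over 9 racks, one rack emptied by decommission.
    System.out.println(satisfied(9, 8, 9)); // false -> decommission stalls
    // Excluding the empty rack from the total fixes the verdict.
    System.out.println(satisfied(9, 8, 8)); // true
  }
}
```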
   According to the above description, we should make the following changes to fix it (a sketch of the first idea follows this list):
   1. In startDecommission() or stopDecommission(), we should also update the numOfRacks in class NetworkTopology. Otherwise choosing targets may fail because maxNodesPerRack is too small; and even when choosing targets succeeds, isPlacementPolicySatisfied() will still return false and make the decommission fail.
   2. In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first chooseOnce() call should also be wrapped in try...catch; otherwise it will not fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
   3. In verifyBlockPlacement(), we need to remove invalid (empty) racks from the total numOfRacks; otherwise isPlacementPolicySatisfied() will return false and data reconstruction will fail.
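   A minimal sketch of the first idea, where the topology keeps a rack-to-live-nodes map so racks emptied by decommission can be excluded from the rack count. All names here, including getNumOfNonEmptyRacks(), are illustrative stand-ins rather than the merged patch:
```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RackOccupancySketch {
  private final Map<String, Set<String>> rackMap = new HashMap<>();
  private int numOfEmptyRacks = 0;

  void addNode(String rack, String node) {
    rackMap.computeIfAbsent(rack, r -> new HashSet<>()).add(node);
  }

  // Called from startDecommission(): removing the last live node of a rack
  // makes the rack "empty" for placement purposes.
  void startDecommission(String rack, String node) {
    Set<String> nodes = rackMap.get(rack);
    if (nodes != null && nodes.remove(node) && nodes.isEmpty()) {
      numOfEmptyRacks++;
    }
  }

  // Called from stopDecommission(): the first node coming back makes the
  // rack usable again.
  void stopDecommission(String rack, String node) {
    Set<String> nodes = rackMap.get(rack);
    if (nodes != null) {
      if (nodes.isEmpty()) {
        numOfEmptyRacks--;
      }
      nodes.add(node);
    }
  }

  // The count that getMaxNodesPerRack() and isPlacementPolicySatisfied()
  // would use instead of the raw rack total.
  int getNumOfNonEmptyRacks() {
    return rackMap.size() - numOfEmptyRacks;
  }

  public static void main(String[] args) {
    RackOccupancySketch topo = new RackOccupancySketch();
    topo.addNode("/RACK0", "dn0");
    topo.addNode("/RACK1", "dn1");
    topo.startDecommission("/RACK1", "dn1");
    System.out.println(topo.getNumOfNonEmptyRacks()); // prints 1
  }
}
```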




Issue Time Tracking
---

Worklog Id: (was: 750731)
Remaining Estimate: 0h
Time Spent: 10m


[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=750739&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-750739
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 31/Mar/22 04:42
Start Date: 31/Mar/22 04:42
Worklog Time Spent: 10m 
  Work Description: lfxy commented on pull request #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1084067934


   @tasanuma Please review, thank you.




Issue Time Tracking
---

Worklog Id: (was: 750739)
Time Spent: 20m  (was: 10m)


[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-03-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=751042&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-751042
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 31/Mar/22 15:42
Start Date: 31/Mar/22 15:42
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1084764646


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  1s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  11m 38s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m 13s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  23m  2s |  |  trunk passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  20m 15s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   3m 37s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 25s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 29s |  |  trunk passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m  1s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m 33s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 39s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  24m 19s |  |  the patch passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  24m 19s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  21m 51s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |  21m 51s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   3m 33s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/1/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 1 new + 155 unchanged - 0 fixed = 156 total (was 
155)  |
   | +1 :green_heart: |  mvnsite  |   3m 21s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 25s |  |  the patch passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 55s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 10s |  |  hadoop-common in the patch 
passed.  |
   | -1 :x: |  unit  | 431m 53s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 12s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 668m  1s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
   |   | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4126 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 03872bc8785a 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 
17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=751563&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-751563
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 01/Apr/22 13:45
Start Date: 01/Apr/22 13:45
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1085920723


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  0s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 34s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m 15s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  22m 54s |  |  trunk passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  20m 10s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   3m 38s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 25s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 31s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   5m 58s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 33s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m 15s |  |  the patch passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  22m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  20m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |  20m  2s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   3m 36s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 19s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 27s |  |  the patch passed with JDK 
Ubuntu-11.0.14+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 36s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  24m 11s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 49s |  |  hadoop-common in the patch 
passed.  |
   | -1 :x: |  unit  | 420m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 11s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 650m 45s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | 
hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
   |   | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4126 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 9a71fa74327b 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 
17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 372cf01e09848e8f3372b3eca599f7f109d59c2f |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.1

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=751575&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-751575
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 01/Apr/22 13:57
Start Date: 01/Apr/22 13:57
Worklog Time Spent: 10m 
  Work Description: lfxy commented on pull request #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1085934796


   @tasanuma  The failed UTs don't seem to be related to this patch; please help check.




Issue Time Tracking
---

Worklog Id: (was: 751575)
Time Spent: 50m  (was: 40m)


[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=753952&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-753952
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 07/Apr/22 10:22
Start Date: 07/Apr/22 10:22
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on code in PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#discussion_r844954667


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestBlockPlacementPolicyRackFaultTolerant.java:
##
@@ -167,6 +179,107 @@ private void doTestChooseTargetSpecialCase() throws Exception {
     }
   }
 
+  /**
+   * Verify decommission a dn which is an only node in its rack.
+   */
+  public void testPlacementWithOnlyOneNodeInRackDecommission() throws Exception {

Review Comment:
   Instead of adding `testPlacementWithOnlyOneNodeInRackDecommission()` to `testChooseTarget()`, it would be better to make `testPlacementWithOnlyOneNodeInRackDecommission()` a standalone unit test method.
   ```suggestion
  @Test
  public void testPlacementWithOnlyOneNodeInRackDecommission() throws Exception {
   ```



##
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java:
##
@@ -101,6 +101,13 @@ protected NetworkTopology init(InnerNode.Factory factory) {
   private int depthOfAllLeaves = -1;
   /** rack counter */
   protected int numOfRacks = 0;
+  /** empty rack map, rackname->nodenumber. */
+  private HashMap<String, Set<String>> emptyRackMap =

Review Comment:
   This is a rack map used to calculate the number of empty racks, so I prefer the name `rackMap`.
   ```suggestion
  private HashMap<String, Set<String>> rackMap =
   ```



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestBlockPlacementPolicyRackFaultTolerant.java:
##
@@ -167,6 +179,107 @@ private void doTestChooseTargetSpecialCase() throws Exception {
     }
   }
 
+  /**
+   * Verify decommission a dn which is an only node in its rack.
+   */
+  public void testPlacementWithOnlyOneNodeInRackDecommission() throws Exception {
+    Configuration conf = new HdfsConfiguration();
+    final String[] racks = {"/RACK0", "/RACK0", "/RACK2", "/RACK3", "/RACK4", "/RACK5", "/RACK2"};
+    final String[] hosts = {"/host0", "/host1", "/host2", "/host3", "/host4", "/host5", "/host6"};
+
+    // enables DFSNetworkTopology
+    conf.setClass(DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+        BlockPlacementPolicyRackFaultTolerant.class,
+        BlockPlacementPolicy.class);
+    conf.setBoolean(DFSConfigKeys.DFS_USE_DFS_NETWORK_TOPOLOGY_KEY, true);
+    conf.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, DEFAULT_BLOCK_SIZE);
+    conf.setInt(DFSConfigKeys.DFS_BYTES_PER_CHECKSUM_KEY,
+        DEFAULT_BLOCK_SIZE / 2);
+
+    if (cluster != null) {
+      cluster.shutdown();
+    }
+    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(7).racks(racks)
+        .hosts(hosts).build();
+    cluster.waitActive();
+    nameNodeRpc = cluster.getNameNodeRpc();
+    namesystem = cluster.getNamesystem();
+    DistributedFileSystem fs = cluster.getFileSystem();
+    fs.enableErasureCodingPolicy("RS-3-2-1024k");
+    fs.setErasureCodingPolicy(new Path("/"), "RS-3-2-1024k");
+
+    final BlockManager bm = cluster.getNamesystem().getBlockManager();
+    final DatanodeManager dm = bm.getDatanodeManager();
+    assertTrue(dm.getNetworkTopology() instanceof DFSNetworkTopology);
+
+    String clientMachine = "/host4";
+    String clientRack = "/RACK4";
+    String src = "/test";
+
+    final DatanodeManager dnm = namesystem.getBlockManager().getDatanodeManager();
+    DatanodeDescriptor dnd4 = dnm.getDatanode(cluster.getDataNodes().get(4).getDatanodeId());
+    assertEquals(dnd4.getNetworkLocation(), clientRack);
+    dnm.getDatanodeAdminManager().startDecommission(dnd4);
+    short replication = 5;
+    short additionalReplication = 1;
+
+    try {
+      // Create the file with client machine
+      HdfsFileStatus fileStatus = namesystem.startFile(src, perm,
+          clientMachine, clientMachine, EnumSet.of(CreateFlag.CREATE), true,
+          replication, DEFAULT_BLOCK_SIZE * 1024 * 10, null, null, null, false);
+
+      // test chooseTarget for new file
+      LocatedBlock locatedBlock = nameNodeRpc.addBlock(src, clientMachine,
+          null, null, fileStatus.getFileId(), null, null);
+      HashMap<String, Integer> racksCount = new HashMap<>();
+      doTestLocatedBlockRacks(racksCount, replication, 4, locatedBlock);
+
+      // test chooseTarget for existing file.
+      LocatedBlock additionalLocatedBlock =
+          nameNodeRpc.getAdditionalDatanode(src, fileStatus.getFileId(),
+              locatedBlock.getBlock(), locatedBlock.getLocations(),
+              locatedBlock.getStorageIDs(), DatanodeInfo.EMPTY_ARRAY,

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=754157&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-754157
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 07/Apr/22 15:01
Start Date: 07/Apr/22 15:01
Worklog Time Spent: 10m 
  Work Description: lfxy commented on PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1091849794

   @tasanuma  Yes, I think clusterMap.getNumOfRacks() in BlockPlacementPolicyDefault should also be updated, because only non-empty racks are meaningful there.
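
   To make that concrete, here is a sketch of the third proposed fix applied to the verifyBlockPlacement() method quoted in the issue description below. The accessor name getNumOfNonEmptyRacks() is an assumption for illustration, echoing the empty-rack tracking sketched earlier in the thread, and is not necessarily the merged API:
```java
public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
    int numberOfReplicas) {
  if (locs == null) {
    locs = DatanodeDescriptor.EMPTY_ARRAY;
  }
  if (!clusterMap.hasClusterEverBeenMultiRack()) {
    // only one rack
    return new BlockPlacementStatusDefault(1, 1, 1);
  }
  // Count locations on different racks.
  Set<String> racks = new HashSet<>();
  for (DatanodeInfo dn : locs) {
    racks.add(dn.getNetworkLocation());
  }
  // Changed line: count only racks that can still hold a replica, so a rack
  // emptied by decommission no longer makes isPlacementPolicySatisfied()
  // return false (getNumOfNonEmptyRacks() is an assumed accessor).
  return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
      clusterMap.getNumOfNonEmptyRacks());
}
```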




Issue Time Tracking
---

Worklog Id: (was: 754157)
Time Spent: 1h 10m  (was: 1h)

> EC: Decommission a rack with only one DN will fail when the rack number is
> equal to the replication factor
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In the scenario below, decommission fails with the TOO_MANY_NODES_ON_RACK reason:
>  # An EC policy such as RS-6-3-1024k is enabled.
>  # The number of racks in the cluster is equal to or less than the replication number (9 for RS-6-3).
>  # A rack has only one DN, and that DN is decommissioned.
> The root cause is in BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack(), which computes the limit maxNodesPerRack used when choosing targets. In this scenario maxNodesPerRack is 1, which means only one datanode can be chosen from each rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>     ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> The line int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is evaluated here with totalNumOfReplicas=9 and numOfRacks=9, so maxNodesPerRack is 1.
> When we decommission a DN that is the only node in its rack, chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() throws NotEnoughReplicasException, but the exception is not caught, so the code never falls back to chooseEvenlyFromRemainingRacks().
> During decommission, after targets are chosen, verifyBlockPlacement() returns a total rack count that still includes the invalid (now empty) rack, so BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, which also makes the decommission fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the following changes to fix it:
>  # In startDecommission() or stopDecommission(), we should also update the numOfRacks in class NetworkTopology. Otherwise choosing targets may fail because maxNodesPerRack is too small; and even when choosing targets succeeds, isPlacementPolicySatisfied() will still return false and make the decommission fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first chooseOnce() call should also be wrapped in try...catch; otherwise it will not fall back to chooseEvenlyFromRemainingRacks() when an exception is thrown.
>  # In verifyBlockPlacement(), we need to remove invalid (empty) racks from the total numOfRacks; otherwise isPlacementPolicySatisfied() will return false and data reconstruction will fail.

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=754366&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-754366
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 07/Apr/22 23:17
Start Date: 07/Apr/22 23:17
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1092293764

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  12m 32s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  16m  3s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  26m 14s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  26m 33s |  |  trunk passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |  22m 52s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   3m 56s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 25s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 27s |  |  trunk passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 35s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m 19s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m 36s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 28s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m 59s |  |  the patch passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |  22m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  21m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |  21m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   3m 40s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 15s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 18s |  |  the patch passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 27s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   7m  4s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m 24s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 10s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  | 242m 29s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  6s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 501m  6s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4126 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 33711d5ec556 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 
23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 662323a9701a50f4f95edb4632b6d0bf0d4b8d9c |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/4/testReport/ |
   | Max. process+thread count | 3325 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdf

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=754395&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-754395
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 08/Apr/22 01:47
Start Date: 08/Apr/22 01:47
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1092361843

   @surendralilhore Please comment if you have any concerns.




Issue Time Tracking
---

Worklog Id: (was: 754395)
Time Spent: 1.5h  (was: 1h 20m)


[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=754399&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-754399
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 08/Apr/22 02:13
Start Date: 08/Apr/22 02:13
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1092373868

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  18m 38s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m 55s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  26m 29s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  26m 43s |  |  trunk passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |  23m  9s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   3m 45s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 38s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 43s |  |  trunk passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 26s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m 45s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m 45s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  23m 30s |  |  the patch passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |  23m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |  22m 23s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   3m 40s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 23s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 21s |  |  the patch passed with JDK 
Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   3m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   6m 40s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m  0s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 27s |  |  hadoop-common in the patch 
passed.  |
   | -1 :x: |  unit  | 429m 35s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 15s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 697m  5s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4126/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4126 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 97de28519e31 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 
17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 662323a9701a50f4f95edb4632b6d0bf0d4b8d9c |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=756897&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-756897
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 14/Apr/22 09:42
Start Date: 14/Apr/22 09:42
Worklog Time Spent: 10m 
  Work Description: tasanuma merged PR #4126:
URL: https://github.com/apache/hadoop/pull/4126




Issue Time Tracking
---

Worklog Id: (was: 756897)
Time Spent: 1h 50m  (was: 1h 40m)



[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication factor

2022-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=756899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-756899
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 14/Apr/22 09:43
Start Date: 14/Apr/22 09:43
Worklog Time Spent: 10m 
  Work Description: tasanuma commented on PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1098943763

   Merged. Thanks for your contribution, @lfxy!




Issue Time Tracking
---

Worklog Id: (was: 756899)
Time Spent: 2h  (was: 1h 50m)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason:
>  # Enable EC policy, such as RS-6-3-1024k.
>  # The rack number in this cluster is equal with or less than the replication 
> number(9)
>  # A rack only has one DN, and decommission this DN.
> The root cause is in 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will 
> give a limit parameter maxNodesPerRack for choose targets. In this scenario, 
> the maxNodesPerRack is 1, which means each rack can only be chosen one 
> datanode.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> here will be called, where totalNumOfReplicas=9 and  numOfRacks=9  
> When we decommission one dn which is only one node in its rack, the 
> chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() 
> will throw NotEnoughReplicasException, but the exception will not be caught 
> and fail to fallback to chooseEvenlyFromRemainingRacks() function.
> When decommission, after choose targets, verifyBlockPlacement() function will 
> return the total rack number contains the invalid rack, and 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false 
> and it will also cause decommission fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
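> Plugging in the failing case makes the check concrete (the numbers below are illustrative): the topology still reports totalRacks = 9, the block wants requiredRacks = 9, but the chosen locations span only currentRacks = 8 once the decommissioning-only rack is excluded:
> {code:java}
> public class PlacementCheckDemo {
>   public static void main(String[] args) {
>     int requiredRacks = 9, currentRacks = 8, totalRacks = 9;
>     // Same expression as isPlacementPolicySatisfied() above.
>     boolean satisfied = requiredRacks <= currentRacks || currentRacks >= totalRacks;
>     System.out.println(satisfied); // false: 9 <= 8 fails and 8 >= 9 fails
>   }
> } {code}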
> Based on the above, the following changes are needed to fix it:
>  # In startDecommission() and stopDecommission(), also update numOfRacks in the NetworkTopology class. Otherwise choosing targets may fail because maxNodesPerRack is too small, and even if choosing targets succeeds, isPlacementPolicySatisfied() will still return false and the decommission will fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first chooseOnce() call should also be placed inside try...catch; otherwise the code never falls back to chooseEvenlyFromRemainingRacks() when the exception is thrown (see the sketch after this list).
>  # In verifyBlockPlacement(), invalid racks must be removed from the total numOfRacks; otherwise isPlacementPolicySatisfied() returns false and data reconstruction fails.
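> A hedged, self-contained sketch of point 2 (control flow only; the methods below are simplified stand-ins, not the actual Hadoop signatures):
> {code:java}
> class PlacementFallbackSketch {
>   static class NotEnoughReplicasException extends Exception {}
>
>   void chooseTargetInOrder(int numOfReplicas) {
>     try {
>       // First attempt: spread targets across racks under maxNodesPerRack.
>       chooseOnce(numOfReplicas);
>     } catch (NotEnoughReplicasException e) {
>       // A rack could not supply a target (e.g. its only DN is
>       // decommissioning); retry over the remaining racks instead of aborting.
>       chooseEvenlyFromRemainingRacks(numOfReplicas);
>     }
>   }
>
>   void chooseOnce(int n) throws NotEnoughReplicasException {
>     throw new NotEnoughReplicasException(); // simulate the failing rack
>   }
>
>   void chooseEvenlyFromRemainingRacks(int n) {
>     System.out.println("fallback placement for " + n + " targets");
>   }
> } {code}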



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-04-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=757579&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-757579
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 16/Apr/22 12:29
Start Date: 16/Apr/22 12:29
Worklog Time Spent: 10m 
  Work Description: lfxy commented on PR #4126:
URL: https://github.com/apache/hadoop/pull/4126#issuecomment-1100653197

   @tasanuma Thank you for the review and for the many useful suggestions.




Issue Time Tracking
---

Worklog Id: (was: 757579)
Time Spent: 2h 10m  (was: 2h)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-05-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=769424&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-769424
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 12/May/22 05:25
Start Date: 12/May/22 05:25
Worklog Time Spent: 10m 
  Work Description: jojochuang opened a new pull request, #4304:
URL: https://github.com/apache/hadoop/pull/4304

   
   
   (cherry picked from commit cee8c62498f55794f911ce62edfd4be9e88a7361)
   
Conflicts:

hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java
   
   Change-Id: I5da7693d6176315f126f1b5107de2bb100ee9778
   
   
   
   ### Description of PR
   (HDFS-16456 was merged in trunk. It has a conflict in branch-3.3 because of HDFS-15263. It looks like HDFS-15263 could potentially change the behavior and surprise users, so we decided to backport without HDFS-15263.)
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
   
   




Issue Time Tracking
---

Worklog Id: (was: 769424)
Time Spent: 2h 20m  (was: 2h 10m)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=769465&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-769465
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 12/May/22 08:14
Start Date: 12/May/22 08:14
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4304:
URL: https://github.com/apache/hadoop/pull/4304#issuecomment-1124671782

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  11m  3s |  |  Docker mode activated.  |
   |||| _ Prechecks _ ||
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ ||
   | +0 :ok: |  mvndep  |  15m 42s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m 44s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |  21m 44s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   3m 47s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   4m 25s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   4m 33s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   6m 51s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  28m 26s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ ||
   | +0 :ok: |  mvndep  |   0m 32s |  |  Maven dependency ordering for patch  |
   | -1 :x: |  mvninstall  |   0m 49s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch failed.  |
   | -1 :x: |  compile  |   2m 50s | [/patch-compile-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-compile-root.txt) |  root in the patch failed.  |
   | -1 :x: |  javac  |   2m 50s | [/patch-compile-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-compile-root.txt) |  root in the patch failed.  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   2m 29s |  |  the patch passed  |
   | -1 :x: |  mvnsite  |   0m 54s | [/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  javadoc  |   3m 13s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   0m 50s | [/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch failed.  |
   | -1 :x: |  shadedclient  |  12m 40s |  |  patch has errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ ||
   | +1 :green_heart: |  unit  |  17m 52s |  |  hadoop-common in the patch passed.  |
   | -1 :x: |  unit  |   0m 50s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 167m 55s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4304 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 8d3f7941228c 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / c387e506e8d0fb76d000ec507f9b44bcaee4fd69 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/1/testReport/ |
   | Max. process+thread count | 1267 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-05-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=773102&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-773102
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 21/May/22 06:01
Start Date: 21/May/22 06:01
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4304:
URL: https://github.com/apache/hadoop/pull/4304#issuecomment-1133542869

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  10m 53s |  |  Docker mode activated.  |
   |||| _ Prechecks _ ||
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  1s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ ||
   | +0 :ok: |  mvndep  |  14m 14s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  23m 42s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |  18m  3s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   3m 18s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   4m 14s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   4m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   7m  0s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  28m  7s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ ||
   | +0 :ok: |  mvndep  |   0m 57s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 28s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 17s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |  17m 17s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   4m 12s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   4m 23s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   6m 58s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  28m 24s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ ||
   | +1 :green_heart: |  unit  |  18m  3s |  |  hadoop-common in the patch passed.  |
   | -1 :x: |  unit  | 239m 44s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 51s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 440m 42s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4304 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 50c5cd921baa 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 509b0f013f0797c99026e276966cf678c0fe576d |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/2/testReport/ |
   | Max. process+thread count | 2650 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4304/2/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 773102)
Time Spent: 2h 40m  (was: 2.5h)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> 

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-05-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=775021&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-775021
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 26/May/22 12:16
Start Date: 26/May/22 12:16
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4358:
URL: https://github.com/apache/hadoop/pull/4358#issuecomment-1138510420

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  11m 50s |  |  Docker mode activated.  |
   |||| _ Prechecks _ ||
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ branch-3.2 Compile Tests _ ||
   | +0 :ok: |  mvndep  |   4m 53s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  27m 40s |  |  branch-3.2 passed  |
   | -1 :x: |  compile  |   9m 52s | [/branch-compile-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/1/artifact/out/branch-compile-root.txt) |  root in branch-3.2 failed.  |
   | +1 :green_heart: |  checkstyle  |   2m 46s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  mvnsite  |   2m 41s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  javadoc  |   2m 19s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  spotbugs  |   5m 29s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  shadedclient  |  17m 59s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ ||
   | +0 :ok: |  mvndep  |   0m 26s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 14s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 37s |  |  the patch passed  |
   | -1 :x: |  javac  |  16m 37s | [/results-compile-javac-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/1/artifact/out/results-compile-javac-root.txt) |  root generated 203 new + 1465 unchanged - 0 fixed = 1668 total (was 1465)  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   3m  5s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 16s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 44s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   6m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 19s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ ||
   | +1 :green_heart: |  unit  |  16m 23s |  |  hadoop-common in the patch passed.  |
   | -1 :x: |  unit  | 213m  5s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 24s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 368m  4s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.namenode.TestFsck |
   |   | hadoop.hdfs.TestRollingUpgrade |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4358 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux ad5610eb3128 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.2 / 7ebbaa393860a115aa1f51edc02f640507ced9e0 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/1/testReport/ |
   | Max. process+thread count | 2093 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/1/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-05-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=775142&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-775142
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 26/May/22 16:50
Start Date: 26/May/22 16:50
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on PR #4304:
URL: https://github.com/apache/hadoop/pull/4304#issuecomment-1138778927

   Thanks for the review!




Issue Time Tracking
---

Worklog Id: (was: 775142)
Time Spent: 3h 10m  (was: 3h)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-05-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=775141&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-775141
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 26/May/22 16:50
Start Date: 26/May/22 16:50
Worklog Time Spent: 10m 
  Work Description: jojochuang merged PR #4304:
URL: https://github.com/apache/hadoop/pull/4304




Issue Time Tracking
---

Worklog Id: (was: 775141)
Time Spent: 3h  (was: 2h 50m)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-05-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=775293&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-775293
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 26/May/22 23:38
Start Date: 26/May/22 23:38
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4358:
URL: https://github.com/apache/hadoop/pull/4358#issuecomment-1139145684

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 52s |  |  Docker mode activated.  |
   |||| _ Prechecks _ ||
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ branch-3.2 Compile Tests _ ||
   | +0 :ok: |  mvndep  |   4m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  27m 50s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  compile  |  16m  8s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  checkstyle  |   3m 11s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  mvnsite  |   3m 15s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  javadoc  |   2m 51s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  spotbugs  |   6m  4s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  shadedclient  |  18m  5s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ ||
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  8s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  15m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |  15m 18s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   3m  1s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 43s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   6m 10s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 13s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ ||
   | +1 :green_heart: |  unit  |  16m 30s |  |  hadoop-common in the patch passed.  |
   | -1 :x: |  unit  | 212m 31s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 23s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 363m 29s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.tools.TestDFSZKFailoverController |
   |   | hadoop.hdfs.TestDecommissionWithStriped |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4358 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 3ff3eb2d0a7c 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.2 / 7ebbaa393860a115aa1f51edc02f640507ced9e0 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/testReport/ |
   | Max. process+thread count | 1913 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: . |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 775293)
Time Spent: 3h 20m  (was: 3h 10m)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> --

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=87&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-87
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 02/Jun/22 19:01
Start Date: 02/Jun/22 19:01
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on PR #4358:
URL: https://github.com/apache/hadoop/pull/4358#issuecomment-1145210866

   The failed tests do not fail in my local tree. I'll merge the change later.




Issue Time Tracking
---

Worklog Id: (was: 87)
Time Spent: 3.5h  (was: 3h 20m)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

2022-06-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=778348&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778348
 ]

ASF GitHub Bot logged work on HDFS-16456:
-

Author: ASF GitHub Bot
Created on: 03/Jun/22 22:56
Start Date: 03/Jun/22 22:56
Worklog Time Spent: 10m 
  Work Description: jojochuang merged PR #4358:
URL: https://github.com/apache/hadoop/pull/4358




Issue Time Tracking
---

Worklog Id: (was: 778348)
Time Spent: 3h 40m  (was: 3.5h)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Assignee: caozhiqiang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org