[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

ASF GitHub Bot (Jira) Thu, 26 May 2022 16:39:06 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-16456?focusedWorklogId=775293&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-775293
 ]


ASF GitHub Bot logged work on HDFS-16456:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/May/22 23:38
            Start Date: 26/May/22 23:38
    Worklog Time Spent: 10m 
      Work Description: hadoop-yetus commented on PR #4358:
URL: https://github.com/apache/hadoop/pull/4358#issuecomment-1139145684

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 52s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
   |||| _ branch-3.2 Compile Tests _ |
   | +0 :ok: |  mvndep  |   4m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  27m 50s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  compile  |  16m  8s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  checkstyle  |   3m 11s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  mvnsite  |   3m 15s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  javadoc  |   2m 51s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  spotbugs  |   6m  4s |  |  branch-3.2 passed  |
   | +1 :green_heart: |  shadedclient  |  18m  5s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  8s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  15m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |  15m 18s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   3m  1s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 43s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   6m 10s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 13s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  16m 30s |  |  hadoop-common in the patch 
passed.  |
   | -1 :x: |  unit  | 212m 31s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 23s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 363m 29s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.tools.TestDFSZKFailoverController |
   |   | hadoop.hdfs.TestDecommissionWithStriped |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4358 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 3ff3eb2d0a7c 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 
17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.2 / 7ebbaa393860a115aa1f51edc02f640507ced9e0 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/testReport/ |
   | Max. process+thread count | 1913 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdfs U: . |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4358/2/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 775293)
    Time Spent: 3h 20m  (was: 3h 10m)

> EC: Decommission a rack with only on dn will fail when the rack number is 
> equal with replication
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16456
>                 URL: https://issues.apache.org/jira/browse/HDFS-16456
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec, namenode
>    Affects Versions: 3.4.0
>            Reporter: caozhiqiang
>            Assignee: caozhiqiang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.4
>
>         Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch, 
> HDFS-16456.003.patch, HDFS-16456.004.patch, HDFS-16456.005.patch, 
> HDFS-16456.006.patch, HDFS-16456.007.patch, HDFS-16456.008.patch, 
> HDFS-16456.009.patch, HDFS-16456.010.patch
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> In below scenario, decommission will fail by TOO_MANY_NODES_ON_RACK reason:
>  # Enable EC policy, such as RS-6-3-1024k.
>  # The rack number in this cluster is equal with or less than the replication 
> number(9)
>  # A rack only has one DN, and decommission this DN.
> The root cause is in 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, it will 
> give a limit parameter maxNodesPerRack for choose targets. In this scenario, 
> the maxNodesPerRack is 1, which means each rack can only be chosen one 
> datanode.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>    ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
> here will be called, where totalNumOfReplicas=9 and  numOfRacks=9  
> When we decommission one dn which is only one node in its rack, the 
> chooseOnce() in BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() 
> will throw NotEnoughReplicasException, but the exception will not be caught 
> and fail to fallback to chooseEvenlyFromRemainingRacks() function.
> When decommission, after choose targets, verifyBlockPlacement() function will 
> return the total rack number contains the invalid rack, and 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() will return false 
> and it will also cause decommission fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the below modify to fix it:
>  # In startDecommission() or stopDecommission(), we should also change the 
> numOfRacks in class NetworkTopology. Or choose targets may fail for the 
> maxNodesPerRack is too small. And even choose targets success, 
> isPlacementPolicySatisfied will also return false cause decommission fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() function should also be put in try..catch..., or it will not 
> fallback to call chooseEvenlyFromRemainingRacks() when throw exception.
>  # In verifyBlockPlacement, we need to remove invalid racks from total 
> numOfRacks, or isPlacementPolicySatisfied() will return false and cause fail 
> to reconstruct data.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Work logged] (HDFS-16456) EC: Decommission a rack with only on dn will fail when the rack number is equal with replication

Reply via email to