[ https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823916#comment-17823916 ]
ASF GitHub Bot commented on HDFS-17299: --------------------------------------- hadoop-yetus commented on PR #6566: URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1980368491 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |:----:|----------:|--------:|:--------:|:-------:| | +0 :ok: | reexec | 0m 47s | | Docker mode activated. | |||| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. | |||| _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 18s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 37m 2s | | trunk passed | | +1 :green_heart: | compile | 6m 6s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 5m 54s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 34s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 17s | | trunk passed | | +1 :green_heart: | javadoc | 1m 51s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 2m 21s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | -1 :x: | spotbugs | 2m 40s | [/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html) | hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs warnings. | | +1 :green_heart: | shadedclient | 41m 8s | | branch has no errors when building and testing our client artifacts. | | -0 :warning: | patch | 41m 30s | | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. | |||| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 31s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 59s | | the patch passed | | +1 :green_heart: | compile | 5m 55s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 5m 55s | | the patch passed | | +1 :green_heart: | compile | 5m 42s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | javac | 5m 42s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 22s | | hadoop-hdfs-project: The patch generated 0 new + 243 unchanged - 3 fixed = 243 total (was 246) | | +1 :green_heart: | mvnsite | 2m 3s | | the patch passed | | +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 2m 5s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 58s | | the patch passed | | +1 :green_heart: | shadedclient | 41m 11s | | patch has no errors when building and testing our client artifacts. | |||| _ Other Tests _ | | +1 :green_heart: | unit | 2m 22s | | hadoop-hdfs-client in the patch passed. | | -1 :x: | unit | 257m 8s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 45s | | The patch does not generate ASF License warnings. | | | | 449m 3s | | | | Reason | Tests | |-------:|:------| | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs | | | hadoop.hdfs.tools.TestDFSAdmin | | | hadoop.hdfs.server.datanode.TestLargeBlockReport | | Subsystem | Report/Notes | |----------:|:-------------| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6566 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 5d10c83ee64d 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / f34203b9952bc2db4b4c2de675e81a4001971134 | | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/testReport/ | | Max. process+thread count | 3044 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/21/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > HDFS is not rack failure tolerant while creating a new file. > ------------------------------------------------------------ > > Key: HDFS-17299 > URL: https://issues.apache.org/jira/browse/HDFS-17299 > Project: Hadoop HDFS > Issue Type: Bug > Affects Versions: 2.10.1 > Reporter: Rushabh Shah > Assignee: Ritesh > Priority: Critical > Labels: pull-request-available > Attachments: repro.patch > > > Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ. > Our configuration: > 1. We use 3 Availability Zones (AZs) for fault tolerance. > 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy. > 3. We use the following configuration parameters: > dfs.namenode.heartbeat.recheck-interval: 600000 > dfs.heartbeat.interval: 3 > So it will take 1230000 ms (20.5mins) to detect that datanode is dead. > > Steps to reproduce: > # Bring down 1 AZ. > # HBase (HDFS client) tries to create a file (WAL file) and then calls > hflush on the newly created file. > # DataStreamer is not able to find blocks locations that satisfies the rack > placement policy (one copy in each rack which essentially means one copy in > each AZ) > # Since all the datanodes in that AZ are down but still alive to namenode, > the client gets different datanodes but still all of them are in the same AZ. > See logs below. > # HBase is not able to create a WAL file and it aborts the region server. > > Relevant logs from hdfs client and namenode > > {noformat} > 2023-12-16 17:17:43,818 INFO [on default port 9000] FSNamesystem.audit - > allowed=true ugi=hbase/<rs-name> (auth:KERBEROS) ip=<rs-IP> > cmd=create src=/hbase/WALs/<WAL-file> dst=null > 2023-12-16 17:17:43,978 INFO [on default port 9000] hdfs.StateChange - > BLOCK* allocate blk_1214652565_140946716, replicas=<AZ-1-dn-1>:50010, > <AZ-2-dn-1>:50010, <AZ-3-dn-1>:50010 for /hbase/WALs/<WAL-file> > 2023-12-16 17:17:44,061 INFO [Thread-39087] hdfs.DataStreamer - Exception in > createBlockOutputStream > java.io.IOException: Got error, status=ERROR, status message , ack with > firstBadLink as <AZ-2-dn-1>:50010 > at > org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) > at > org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) > 2023-12-16 17:17:44,061 WARN [Thread-39087] hdfs.DataStreamer - Abandoning > BP-179318874-<NN-IP>-1594838129323:blk_1214652565_140946716 > 2023-12-16 17:17:44,179 WARN [Thread-39087] hdfs.DataStreamer - Excluding > datanode > DatanodeInfoWithStorage[<AZ-2-dn-1>:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK] > 2023-12-16 17:17:44,339 INFO [on default port 9000] hdfs.StateChange - > BLOCK* allocate blk_1214652580_140946764, replicas=<AZ-1-dn-2>:50010, > <AZ-3-dn-2>:50010, <AZ-2-dn-2>:50010 for /hbase/WALs/<WAL-file> > 2023-12-16 17:17:44,369 INFO [Thread-39087] hdfs.DataStreamer - Exception in > createBlockOutputStream > java.io.IOException: Got error, status=ERROR, status message , ack with > firstBadLink as <AZ-2-dn-2>:50010 > at > org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) > at > org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) > 2023-12-16 17:17:44,369 WARN [Thread-39087] hdfs.DataStreamer - Abandoning > BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764 > 2023-12-16 17:17:44,454 WARN [Thread-39087] hdfs.DataStreamer - Excluding > datanode > DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK] > 2023-12-16 17:17:44,522 INFO [on default port 9000] hdfs.StateChange - > BLOCK* allocate blk_1214652594_140946796, replicas=<AZ-1-dn-2>:50010, > <AZ-2-dn-3>:50010, <AZ-3-dn-3>:50010 for /hbase/WALs/<WAL-file> > 2023-12-16 17:17:44,712 INFO [Thread-39087] hdfs.DataStreamer - Exception in > createBlockOutputStream > java.io.IOException: Got error, status=ERROR, status message , ack with > firstBadLink as <AZ-2-dn-3>:50010 > at > org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) > at > org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) > 2023-12-16 17:17:44,712 WARN [Thread-39087] hdfs.DataStreamer - Abandoning > BP-179318874-NN-IP-1594838129323:blk_1214652594_140946796 > 2023-12-16 17:17:44,732 WARN [Thread-39087] hdfs.DataStreamer - Excluding > datanode > DatanodeInfoWithStorage[<AZ-2-dn-3>:50010,DS-86b77463-a26f-4f42-ae1b-21b75c407203,DISK] > 2023-12-16 17:17:44,855 INFO [on default port 9000] hdfs.StateChange - > BLOCK* allocate blk_1214652607_140946850, replicas=<AZ-1-dn-4>:50010, > <AZ-2-dn-4>:50010, <AZ-3-dn-4>:50010 for /hbase/WALs/<WAL-file> > 2023-12-16 17:17:44,867 INFO [Thread-39087] hdfs.DataStreamer - Exception in > createBlockOutputStream > java.io.IOException: Got error, status=ERROR, status message , ack with > firstBadLink as <AZ-2-dn-4>:50010 > at > org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113) > at > org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747) > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) > 2023-12-16 17:17:44,988 WARN [Thread-39087] hdfs.DataStreamer - DataStreamer > Exception > java.io.IOException: Unable to create new block. > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1665) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715) > 2023-12-16 17:17:44,988 WARN [Thread-39087] hdfs.DataStreamer - Could not > get block locations. Source file "/hbase/WALs/<WAL-file>" - Aborting... > {noformat} > > *Proposed fix:* > Client always correctly identifies the bad datanode in the pipeline. > The number of retries dfs client makes is controlled by > dfs.client.block.write.retries (defaults to 3). So in total it tries 4 times > to create the pipeline. > So on the 3rd or 4th attempt, if we see all the excluded nodes in the > pipeline belongs to the same rack, we can pass the hint to namenode to > exclude that rack for the next attempt. > Once that rack is back online, Replication monitor will handle to replicate > that block to that rack. > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org