[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease
[ https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697710#comment-17697710 ] ASF GitHub Bot commented on HDFS-16942: --- hadoop-yetus commented on PR #5460: URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1459286807 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 58s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 38m 41s | | trunk passed | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | checkstyle | 1m 7s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 28s | | trunk passed | | +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 34s | | trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | spotbugs | 3m 25s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 34s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 15s | | the patch passed | | +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javac | 1m 19s | | the patch passed | | +1 :green_heart: | compile | 1m 11s | | the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | javac | 1m 11s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 51s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 79 unchanged - 0 fixed = 80 total (was 79) | | +1 :green_heart: | mvnsite | 1m 22s | | the patch passed | | +1 :green_heart: | javadoc | 0m 50s | | the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 22s | | the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | spotbugs | 3m 17s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 25s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 238m 22s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 51s | | The patch does not generate ASF License warnings. 
| | | | 345m 2s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5460 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 1c427e3c3acb 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 15851db5be391e6f36c263739f55723ab2abb7af | | Default Java | Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/2/testReport/ | | Max. process+thread count | 3164 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console o
[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease
[ https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697686#comment-17697686 ] ASF GitHub Bot commented on HDFS-16942: --- Hexiaoqiao commented on code in PR #5460: URL: https://github.com/apache/hadoop/pull/5460#discussion_r1128883822 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java: ## @@ -791,6 +792,9 @@ private void offerService() throws Exception { shouldServiceRun = false; return; } +if (InvalidBlockReportLeaseException.class.getName().equals(reClass)) { + fullBlockReportLeaseId = 0; Review Comment: @sodonnel Thanks for your work here. Do we also need to set `forceFullBlockReport` to true here? Otherwise, the datanode will not send another block report until the next 6-hour default interval, right? > Send error to datanode if FBR is rejected due to bad lease > -- > > Key: HDFS-16942 > URL: https://issues.apache.org/jira/browse/HDFS-16942 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > > When a datanode sends an FBR to the namenode, it requires a lease to send it. > On a couple of busy clusters, we have seen an issue where the DN is somehow > delayed in sending the FBR after requesting the lease. The NN then rejects > the FBR and logs a message to that effect, but from the Datanode's point of > view, the report appears successful, so it does not try to send another > report until the 6 hour default interval has passed. > If this happens to a few DNs, there can be missing and under-replicated > blocks, further adding to the cluster load. Even worse, I have seen DNs > join the cluster with zero blocks, so it is not obvious that the > under-replication is caused by a lost FBR, as all DNs appear to be up and running. 
> I believe we should propagate an error back to the DN if the FBR is rejected, > that way, the DN can request a new lease and try again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
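The datanode-side handling under review could be sketched roughly as follows. This is a hedged illustration only: `fullBlockReportLeaseId` and `forceFullBlockReport` are names from the patch discussion, but the class, method, and surrounding scaffolding here are hypothetical, not the actual `BPServiceActor` code.

```java
// Hedged sketch of the reaction discussed in the review comment: when the
// NameNode rejects an FBR because of a bad lease, the DN resets its cached
// lease id (and, per Hexiaoqiao's suggestion, forces a new full report) so
// it retries instead of waiting out the 6-hour interval.
// All scaffolding here is hypothetical, not the Hadoop implementation.
public class FbrLeaseHandling {
    static final String INVALID_LEASE =
        "org.apache.hadoop.hdfs.server.protocol.InvalidBlockReportLeaseException";

    long fullBlockReportLeaseId = 99L; // non-zero: a lease the DN believed valid
    boolean forceFullBlockReport = false;

    /** On a remote exception from the NN, drop the lease so it is re-requested. */
    void handleRemoteException(String reClass) {
        if (INVALID_LEASE.equals(reClass)) {
            fullBlockReportLeaseId = 0;  // re-request a lease on the next heartbeat
            forceFullBlockReport = true; // retry the FBR now, not in 6 hours
        }
    }

    public static void main(String[] args) {
        FbrLeaseHandling actor = new FbrLeaseHandling();
        actor.handleRemoteException(INVALID_LEASE);
        System.out.println(actor.fullBlockReportLeaseId); // 0
        System.out.println(actor.forceFullBlockReport);   // true
    }
}
```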
[jira] [Assigned] (HDFS-16943) RBF: Implement MySQL based StateStoreDriver
[ https://issues.apache.org/jira/browse/HDFS-16943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simbarashe Dzinamarira reassigned HDFS-16943: - Assignee: Simbarashe Dzinamarira > RBF: Implement MySQL based StateStoreDriver > --- > > Key: HDFS-16943 > URL: https://issues.apache.org/jira/browse/HDFS-16943 > Project: Hadoop HDFS > Issue Type: Task > Components: hdfs, rbf >Reporter: Simbarashe Dzinamarira >Assignee: Simbarashe Dzinamarira >Priority: Major > > RBF supports two types of StateStoreDrivers > # StateStoreFileImpl > # StateStoreZooKeeperImpl > I propose implementing a third driver that is backed by MySQL. > * StateStoreZooKeeperImpl requires an additional Zookeeper cluster. > * StateStoreFileImpl can use one of the namenodes in the HDFS cluster, but > that namenode becomes a single point of failure, introducing coupling between > the federated clusters. > HADOOP-18535 implemented a MySQL token store. When tokens are stored in > MySQL, using MySQL for the StateStore as well reduces the number of external > dependencies for routers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
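As a rough illustration of the proposal, a MySQL-backed driver would boil down to a small key/value table plus an idempotent upsert per state-store record. The table name, column names, and SQL below are hypothetical sketches, not the actual StateStoreDriver API or schema.

```java
// Rough sketch only: the sort of table and upsert a MySQL-backed
// StateStoreDriver could use. Names are hypothetical; the real driver would
// implement the RBF StateStoreDriver interface and handle record
// serialization, which is not shown here.
public class MysqlStateStoreSketch {
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS router_state ("
        + " record_class VARCHAR(255) NOT NULL,"
        + " record_key   VARCHAR(255) NOT NULL,"
        + " record_value TEXT NOT NULL,"
        + " PRIMARY KEY (record_class, record_key))";

    /** Builds an idempotent MySQL upsert for one state-store record. */
    static String upsertSql() {
        return "INSERT INTO router_state (record_class, record_key, record_value)"
            + " VALUES (?, ?, ?)"
            + " ON DUPLICATE KEY UPDATE record_value = VALUES(record_value)";
    }

    public static void main(String[] args) {
        System.out.println(CREATE_TABLE);
        System.out.println(upsertSql());
    }
}
```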
[jira] [Updated] (HDFS-16943) RBF: Implement MySQL based StateStoreDriver
[ https://issues.apache.org/jira/browse/HDFS-16943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simbarashe Dzinamarira updated HDFS-16943: -- Description: RBF supports two types of StateStoreDrivers # StateStoreFileImpl # StateStoreZooKeeperImpl I propose implementing a third driver that is backed by MySQL. * StateStoreZooKeeperImpl requires an additional Zookeeper cluster. * StateStoreFileImpl can use one of the namenodes in the HDFS cluster, but that namenode becomes a single point of failure, introducing coupling between the federated clusters. HADOOP-18535 implemented a MySQL token store. When tokens are stored in MySQL, using MySQL for the StateStore as well reduces the number of external dependencies for routers. was: RBF supports two types of StateStoreDrivers # StateStoreFileImpl # StateStoreZooKeeperImpl I propose implementing a third driver that is backed by MySQL. HADOOP-18535 implemented a MySQL token store. When tokens are stored in MySQL, using MySQL for the StateStore as well reduces the number of external dependencies for routers.
[jira] [Created] (HDFS-16943) RBF: Implement MySQL based StateStoreDriver
Simbarashe Dzinamarira created HDFS-16943: - Summary: RBF: Implement MySQL based StateStoreDriver Key: HDFS-16943 URL: https://issues.apache.org/jira/browse/HDFS-16943 Project: Hadoop HDFS Issue Type: Task Components: hdfs, rbf Reporter: Simbarashe Dzinamarira RBF supports two types of StateStoreDrivers # StateStoreFileImpl # StateStoreZooKeeperImpl I propose implementing a third driver that is backed by MySQL. HADOOP-18535 implemented a MySQL token store. When tokens are stored in MySQL, using MySQL for the StateStore as well reduces the number of external dependencies for routers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease
[ https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697630#comment-17697630 ] ASF GitHub Bot commented on HDFS-16942: --- sodonnel commented on PR #5460: URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1458896438 Not sure what to do about this checkstyle error:
```
./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/InvalidBlockReportLeaseException.java:1:/**: Missing package-info.java file. [JavadocPackage]
```
If I add the file:
```
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/package-info.java
```
I get an enforcer error:
```
[INFO] ---< org.apache.hadoop:hadoop-client-check-test-invariants > [INFO] Building Apache Hadoop Client Packaging Invariants for Test 3.4.0-SNAPSHOT [106/113] [INFO] [ pom ]- [INFO] [INFO] -
```
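For reference, the `package-info.java` file that the JavadocPackage check asks for is just a package declaration preceded by package-level javadoc, along the lines of the sketch below (the javadoc wording is illustrative, not the actual file content; it is the shaded-client enforcer reaction to the new file, not the file itself, that is the open question above):

```java
/**
 * Server-side protocol classes shared between the NameNode and DataNode.
 * (Illustrative javadoc only.)
 */
package org.apache.hadoop.hdfs.server.protocol;
```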
[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease
[ https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697620#comment-17697620 ] ASF GitHub Bot commented on HDFS-16942: --- hadoop-yetus commented on PR #5460: URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1458829144 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 40m 48s | | trunk passed | | +1 :green_heart: | compile | 1m 30s | | trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | compile | 1m 21s | | trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | checkstyle | 1m 5s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 33s | | trunk passed | | +1 :green_heart: | javadoc | 1m 6s | | trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 30s | | trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | spotbugs | 3m 41s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 57s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 18s | | the patch passed | | +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javac | 1m 20s | | the patch passed | | +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | javac | 1m 15s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 51s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 79 unchanged - 0 fixed = 81 total (was 79) | | +1 :green_heart: | mvnsite | 1m 25s | | the patch passed | | +1 :green_heart: | javadoc | 0m 53s | | the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 31s | | the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | +1 :green_heart: | spotbugs | 3m 26s | | the patch passed | | +1 :green_heart: | shadedclient | 23m 53s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 243m 39s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 51s | | The patch does not generate ASF License warnings. 
| | | | 354m 37s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestRollingUpgrade | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5460 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux e8a6e148e1c3 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / b20165566caec49a059a287131608138782e5d48 | | Default Java | Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20
[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing
[ https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697607#comment-17697607 ] Arpit Agarwal commented on HDFS-16849: -- I see. It is strange that the SNN couldn't recover on retries; that needs further investigation. > Terminate SNN when failing to perform EditLogTailing > > > Key: HDFS-16849 > URL: https://issues.apache.org/jira/browse/HDFS-16849 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Karthik Palanisamy >Priority: Major > > We should terminate the SNN if EditLogTailing fails for a sufficient number of JNs. We found > this after a Kerberos error. > {code:java} > 2022-10-14 10:53:16,796 INFO > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms > (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions > so far: [:8485: DestHost:destPort :8485 , LocalHost:localPort > /:0. Failed on local exception: > org.apache.hadoop.security.KerberosAuthException: Login failure for user: > hdfs/ javax.security.auth.login.LoginException: Client not found in > Kerberos database (6)] > 2022-10-14 10:53:30,796 WARN > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input > streams from QJM to [:8485, :8485, :8485]. Skipping. > java.io.IOException: Timed out waiting 2ms for a quorum of nodes to > respond. 
> at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431) > at java.base/java.security.AccessController.doPrivileged(Native > Method) > at java.base/javax.security.auth.Subject.doAs(Subject.java:361) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427) > {code} > > We have no check on whether a sufficient number of JNs responded: > [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280] > So we should implement a check similar to this one: > [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To 
unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
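The missing check the report describes amounts to a majority-quorum test over the responding JournalNodes. A minimal standalone sketch, assuming the usual majority rule for QJM (the real fix would live in JournalSet/QuorumJournalManager; the class and method names below are hypothetical):

```java
// Hedged sketch of the kind of check HDFS-16849 asks for: before treating
// edit-log tailing as healthy, verify that a majority of the configured
// JournalNodes actually responded. Names are illustrative, not Hadoop API.
public class JournalQuorumCheck {
    /** Majority quorum: more than half of the configured JournalNodes. */
    static int requiredQuorum(int journalCount) {
        return journalCount / 2 + 1;
    }

    static boolean hasSufficientResponses(int responded, int journalCount) {
        return responded >= requiredQuorum(journalCount);
    }

    public static void main(String[] args) {
        // With 3 JNs, 2 responses form a quorum; 1 does not.
        System.out.println(hasSufficientResponses(2, 3)); // true
        System.out.println(hasSufficientResponses(1, 3)); // false
    }
}
```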
[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing
[ https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697606#comment-17697606 ] Karthik Palanisamy commented on HDFS-16849: --- No misconfiguration [~arp]. As I remember, the SNN failed to recover after a KDC connection issue or KDC restart, but the ANN was able to recover once the KDC connection became good again.
[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing
[ https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697590#comment-17697590 ] Arpit Agarwal commented on HDFS-16849: -- [~kpalanisamy] what causes the login failure? This particular error doesn't seem recoverable. Was it a cluster misconfiguration?
[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing
[ https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697584#comment-17697584 ] Karthik Palanisamy commented on HDFS-16849: --- Sure [~arp]. Auto recovering is the good solution instead of termination. > Terminate SNN when failing to perform EditLogTailing > > > Key: HDFS-16849 > URL: https://issues.apache.org/jira/browse/HDFS-16849 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Karthik Palanisamy >Priority: Major > > We should terminate SNN if we fail LogTrailing for sufficient JN. We found > this after Kerberos error. > {code:java} > 2022-10-14 10:53:16,796 INFO > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms > (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions > so far: [:8485: DestHost:destPort :8485 , LocalHost:localPort > /:0. Failed on local exception: > org.apache.hadoop.security.KerberosAuthException: Login failure for user: > hdfs/ javax.security.auth.login.LoginException: Client not found in > Kerberos database (6)] > 2022-10-14 10:53:30,796 WARN > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input > streams from QJM to [:8485, :8485, :8485]. Skipping. > java.io.IOException: Timed out waiting 2ms for a quorum of nodes to > respond. 
> at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431) > at java.base/java.security.AccessController.doPrivileged(Native > Method) > at java.base/javax.security.auth.Subject.doAs(Subject.java:361) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427) > {code} > > We have no check whether sufficient JN met: > [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280] > So we should implement a similar check this, > [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To 
unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
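A minimal sketch of the quorum check the issue proposes. This is illustrative only: the names (QuorumCheck, requiredQuorum, checkStreamsQuorum) are hypothetical and not the actual JournalSet API, which tracks logger objects rather than plain counts. The idea is that selectInputStreams should fail loudly, so the tailer can terminate or retry, instead of silently skipping when fewer than a majority of JournalNodes respond.

```java
// Hypothetical sketch, not Hadoop's real implementation.
class QuorumCheck {
    /** Majority quorum for n JournalNodes: floor(n/2) + 1. */
    static int requiredQuorum(int journalCount) {
        return journalCount / 2 + 1;
    }

    /** True if enough JournalNodes supplied edit log streams. */
    static boolean hasQuorum(int responded, int journalCount) {
        return responded >= requiredQuorum(journalCount);
    }

    /**
     * Proposed behaviour: throw (instead of logging "Skipping.")
     * when the quorum of responses is not met, so the caller can
     * react rather than proceed with a stale view of the edits.
     */
    static void checkStreamsQuorum(int responded, int journalCount)
            throws java.io.IOException {
        if (!hasQuorum(responded, journalCount)) {
            throw new java.io.IOException(
                "Got " + responded + " of " + journalCount
                + " journal responses; need " + requiredQuorum(journalCount));
        }
    }
}
```

For the three-JournalNode cluster in the log above, requiredQuorum(3) is 2, so the Kerberos failure against all three JNs (zero responses) would surface as an IOException rather than a skipped tailing round.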
[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing
[ https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697582#comment-17697582 ] Karthik Palanisamy commented on HDFS-16849: --- Yes Arpit, the SNN keeps retrying but always fails until we reboot the NameNode: local exception: org.apache.hadoop.security.KerberosAuthException: Login failure for user: hdfs/ javax.security.auth.login.LoginException: Client not found in Kerberos database (6)] The problem is that the checkpoint never ran. Customers assume checkpoints are fine because the SNN is up, but in reality the SNN is in a dead state.
[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing
[ https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697525#comment-17697525 ] Arpit Agarwal commented on HDFS-16849: -- It has been years since I looked at this code so this may be a dumb question. What is the benefit of self-terminating the SNN? Also what does the SNN do today - does it keep retrying? If this is a recoverable/potentially transient error then retrying may be the right thing to do.
[jira] [Updated] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease
[ https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16942: -- Labels: pull-request-available (was: )

> Send error to datanode if FBR is rejected due to bad lease
>
> Key: HDFS-16942
> URL: https://issues.apache.org/jira/browse/HDFS-16942
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Reporter: Stephen O'Donnell
> Assignee: Stephen O'Donnell
> Priority: Major
> Labels: pull-request-available
>
> When a datanode sends a full block report (FBR) to the namenode, it requires a lease to send it. On a couple of busy clusters, we have seen an issue where the DN is somehow delayed in sending the FBR after requesting the lease. The NN then rejects the FBR and logs a message to that effect, but from the datanode's point of view the report was successful, so it does not try to send another report until the 6-hour default interval has passed.
> If this happens to a few DNs, there can be missing and under-replicated blocks, further adding to the cluster load. Even worse, I have seen DNs join the cluster with zero blocks, so it is not obvious that the under-replication is caused by a lost FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected; that way, the DN can request a new lease and try again.
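The DN-side handling proposed here (request a fresh lease and resend the report when the NN rejects the FBR) can be sketched as follows. This is an illustrative model only: InvalidLeaseException, NameNodeStub, blockReport and requestLease are hypothetical stand-ins, not the actual DatanodeProtocol API or the names used in the patch.

```java
// Hypothetical sketch of retry-on-rejected-lease; names are illustrative.
class FbrRetrySketch {
    static class InvalidLeaseException extends Exception {}

    /** Stand-in for the subset of NN RPCs this sketch needs. */
    interface NameNodeStub {
        void blockReport(long leaseId) throws InvalidLeaseException;
        long requestLease();
    }

    /**
     * Sends the full block report; on lease rejection, requests a new
     * lease and retries instead of silently waiting for the next
     * 6-hour reporting interval. Returns the attempt that succeeded.
     */
    static int reportWithRetry(NameNodeStub nn, int maxAttempts)
            throws InvalidLeaseException {
        long leaseId = nn.requestLease();
        for (int attempt = 1; ; attempt++) {
            try {
                nn.blockReport(leaseId);
                return attempt;
            } catch (InvalidLeaseException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after a bounded number of tries
                }
                leaseId = nn.requestLease(); // reset lease ID and retry
            }
        }
    }
}
```

Bounding the retries matters: if the NN keeps rejecting leases (e.g. it is overloaded), an unbounded retry loop from many DNs would make the load worse, so the DN should eventually fall back to the normal reporting schedule.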
[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease
[ https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697456#comment-17697456 ] ASF GitHub Bot commented on HDFS-16942: --- sodonnel opened a new pull request, #5460: URL: https://github.com/apache/hadoop/pull/5460

### Description of PR
(Same as the Jira description above.)

### How was this patch tested?
Modified a test and added a new test. Also validated manually by changing the code to always throw the InvalidLease exception, to ensure the DN handled it, reset the lease ID and retried.
[jira] [Created] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease
Stephen O'Donnell created HDFS-16942:

Summary: Send error to datanode if FBR is rejected due to bad lease
Key: HDFS-16942
URL: https://issues.apache.org/jira/browse/HDFS-16942
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode, namenode
Reporter: Stephen O'Donnell
Assignee: Stephen O'Donnell
[jira] [Updated] (HDFS-16929) Interactive option with yes/no for intentional/accidental data deletion with -skipTrash
[ https://issues.apache.org/jira/browse/HDFS-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Navin Kumar updated HDFS-16929: ---

Description: To avoid data loss when an unintentional/accidental rm command is executed with -skipTrash: although we already have features such as snapshots and protected directories, an interactive yes/no option at deletion time would provide a double confirmation, with a message that the data may be deleted permanently.

was: To avoid data loss when an unintentional/accidental rm command is executed with -skipTrash: although we already have features such as snapshots and protected directories, an interactive yes/no option at deletion time would provide a double confirmation.

> Interactive option with yes/no for intentional/accidental data deletion with -skipTrash
>
> Key: HDFS-16929
> URL: https://issues.apache.org/jira/browse/HDFS-16929
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: dfs, hdfs
> Reporter: Navin Kumar
> Priority: Major
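The interactive double confirmation proposed for -skipTrash could look roughly like the sketch below. It is a hypothetical illustration, not the FsShell Delete command's actual code: the class and method names (SkipTrashPrompt, confirmed, promptForDelete) are invented for this example.

```java
// Hypothetical sketch of a yes/no confirmation before a permanent delete.
class SkipTrashPrompt {
    /** Only an explicit "yes" (case-insensitive, trimmed) confirms. */
    static boolean confirmed(String answer) {
        return answer != null && answer.trim().equalsIgnoreCase("yes");
    }

    /**
     * Warns that -skipTrash bypasses trash recovery and asks the user
     * to type "yes" before proceeding. Reading from a Scanner keeps
     * the prompt testable with canned input.
     */
    static boolean promptForDelete(String path, java.util.Scanner in) {
        System.out.print("rm -skipTrash will PERMANENTLY delete " + path
            + " (no trash recovery). Proceed? (yes/no): ");
        return in.hasNextLine() && confirmed(in.nextLine());
    }
}
```

Requiring the full word "yes" rather than a single "y" is a deliberate choice here: it makes an accidental keystroke much less likely to confirm a permanent deletion.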