[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697710#comment-17697710
 ] 

ASF GitHub Bot commented on HDFS-16942:
---

hadoop-yetus commented on PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1459286807

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 58s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  38m 41s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  trunk passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  checkstyle  |   1m  7s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 28s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 25s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 34s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  javac  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 51s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 79 unchanged - 
0 fixed = 80 total (was 79)  |
   | +1 :green_heart: |  mvnsite  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 22s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 17s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 25s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 238m 22s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 51s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 345m  2s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5460 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 1c427e3c3acb 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 15851db5be391e6f36c263739f55723ab2abb7af |
   | Default Java | Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
 /usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/2/testReport/ |
   | Max. process+thread count | 3164 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console o

[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697686#comment-17697686
 ] 

ASF GitHub Bot commented on HDFS-16942:
---

Hexiaoqiao commented on code in PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#discussion_r1128883822


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java:
##
@@ -791,6 +792,9 @@ private void offerService() throws Exception {
   shouldServiceRun = false;
   return;
 }
+if (InvalidBlockReportLeaseException.class.getName().equals(reClass)) {
+  fullBlockReportLeaseId = 0;

Review Comment:
   @sodonnel Thanks for your work here. Do we also need to set 
`forceFullBlockReport` to true here? Otherwise, the datanode will not send a 
block report until the next default 6-hour interval, right?
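   A minimal sketch of the DataNode-side handling being discussed, assuming 
illustrative field names (`fullBlockReportLeaseId`, `forceFullBlockReport`) rather 
than the actual patch: on an `InvalidBlockReportLeaseException` from the NameNode, 
drop the stale lease id and force a new FBR instead of waiting for the next interval.
   
   ```java
   import java.util.concurrent.atomic.AtomicBoolean;
   
   public class FbrRetrySketch {
     // Stand-in state for the BPServiceActor fields referenced in the diff above.
     private long fullBlockReportLeaseId = 12345L;
     private final AtomicBoolean forceFullBlockReport = new AtomicBoolean(false);
   
     void handleRemoteException(String reClass) {
       if ("org.apache.hadoop.hdfs.server.protocol.InvalidBlockReportLeaseException"
           .equals(reClass)) {
         fullBlockReportLeaseId = 0;      // request a fresh lease on the next heartbeat
         forceFullBlockReport.set(true);  // reviewer's suggestion: resend the FBR promptly
       }
     }
   }
   ```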





> Send error to datanode if FBR is rejected due to bad lease
> --
>
> Key: HDFS-16942
> URL: https://issues.apache.org/jira/browse/HDFS-16942
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
>
> When a datanode sends a FBR to the namenode, it requires a lease to send it. 
> On a couple of busy clusters, we have seen an issue where the DN is somehow 
> delayed in sending the FBR after requesting the lease. The NN then rejects 
> the FBR and logs a message to that effect, but from the datanode's point of 
> view the report was successful, so it does not try to send another 
> report until the default 6 hour interval has passed.
> If this happens to a few DNs, there can be missing and under-replicated 
> blocks, further adding to the cluster load. Even worse, I have seen DNs 
> join the cluster with zero blocks, so it is not obvious that the 
> under-replication is caused by a lost FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected; 
> that way, the DN can request a new lease and try again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16943) RBF: Implement MySQL based StateStoreDriver

2023-03-07 Thread Simbarashe Dzinamarira (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simbarashe Dzinamarira reassigned HDFS-16943:
-

Assignee: Simbarashe Dzinamarira

> RBF: Implement MySQL based StateStoreDriver
> ---
>
> Key: HDFS-16943
> URL: https://issues.apache.org/jira/browse/HDFS-16943
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: hdfs, rbf
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>
> RBF supports two types of StateStoreDrivers
>  # StateStoreFileImpl
>  # StateStoreZooKeeperImpl
> I propose implementing a third driver that is backed by MySQL.
>  * StateStoreZooKeeperImpl requires an additional Zookeeper cluster.
>  * StateStoreFileImpl can use one of the namenodes in the HDFS cluster, but 
> that namenode becomes a single point of failure, introducing coupling between 
> the federated clusters.
>  HADOOP-18535 implemented a MySQL token store. When tokens are stored in 
> MySQL, using MySQL for the StateStore as well reduces the number of external 
> dependencies for routers.
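A hedged sketch of how a router could be pointed at the proposed driver once it 
exists. The property key is the existing RBF driver selector; the 
StateStoreMySQLImpl class name is hypothetical, since the driver is only being 
proposed here.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class RouterStateStoreConfigSketch {
  public static Configuration mysqlStateStoreConf() {
    Configuration conf = new Configuration();
    // Select the Router State Store driver implementation (class name is hypothetical).
    conf.set("dfs.federation.router.store.driver.class",
        "org.apache.hadoop.hdfs.server.federation.store.driver.impl.StateStoreMySQLImpl");
    return conf;
  }
}
{code}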



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16943) RBF: Implement MySQL based StateStoreDriver

2023-03-07 Thread Simbarashe Dzinamarira (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simbarashe Dzinamarira updated HDFS-16943:
--
Description: 
RBF supports two types of StateStoreDrivers
 # StateStoreFileImpl
 # StateStoreZooKeeperImpl

I propose implementing a third driver that is backed by MySQL.
 * StateStoreZooKeeperImpl requires an additional Zookeeper cluster.
 * StateStoreFileImpl can use one of the namenodes in the HDFS cluster, but 
that namenode becomes a single point of failure, introducing coupling between 
the federated clusters.

 HADOOP-18535 implemented a MySQL token store. When tokens are stored in MySQL, 
using MySQL for the StateStore as well reduces the number of external 
dependencies for routers.

  was:
RBF supports two types of StateStoreDrivers
 # StateStoreFileImpl
 # StateStoreZooKeeperImpl

I propose implementing a third driver that is backed by MySQL.

HADOOP-18535 implemented a MySQL token store. When tokens are stored in MySQL, 
using MySQL for the StateStore as well reduces the number of external 
dependencies for routers.


> RBF: Implement MySQL based StateStoreDriver
> ---
>
> Key: HDFS-16943
> URL: https://issues.apache.org/jira/browse/HDFS-16943
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: hdfs, rbf
>Reporter: Simbarashe Dzinamarira
>Priority: Major
>
> RBF supports two types of StateStoreDrivers
>  # StateStoreFileImpl
>  # StateStoreZooKeeperImpl
> I propose implementing a third driver that is backed by MySQL.
>  * StateStoreZooKeeperImpl requires an additional Zookeeper cluster.
>  * StateStoreFileImpl can use one of the namenodes in the HDFS cluster, but 
> that namenode becomes a single point of failure, introducing coupling between 
> the federated clusters.
>  HADOOP-18535 implemented a MySQL token store. When tokens are stored in 
> MySQL, using MySQL for the StateStore as well reduces the number of external 
> dependencies for routers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16943) RBF: Implement MySQL based StateStoreDriver

2023-03-07 Thread Simbarashe Dzinamarira (Jira)
Simbarashe Dzinamarira created HDFS-16943:
-

 Summary: RBF: Implement MySQL based StateStoreDriver
 Key: HDFS-16943
 URL: https://issues.apache.org/jira/browse/HDFS-16943
 Project: Hadoop HDFS
  Issue Type: Task
  Components: hdfs, rbf
Reporter: Simbarashe Dzinamarira


RBF supports two types of StateStoreDrivers
 # StateStoreFileImpl
 # StateStoreZooKeeperImpl

I propose implementing a third driver that is backed by MySQL.

HADOOP-18535 implemented a MySQL token store. When tokens are stored in MySQL, 
using MySQL for the StateStore as well reduces the number of external 
dependencies for routers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697630#comment-17697630
 ] 

ASF GitHub Bot commented on HDFS-16942:
---

sodonnel commented on PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1458896438

   Not sure what to do about this checkstyle error:
   
   ```
   
./hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/InvalidBlockReportLeaseException.java:1:/**:
 Missing package-info.java file. [JavadocPackage]
   ```
   If I add the file:
   
   ```
   
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/package-info.java
   ```
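   For reference, a minimal sketch of what that package-info.java could contain; 
the Javadoc wording here is illustrative, not from the patch:
   
   ```java
   /**
    * Server protocol classes shared between the NameNode and DataNodes.
    */
   package org.apache.hadoop.hdfs.server.protocol;
   ```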
   
   I get an enforcer error:
   
   ```
   [INFO] ---< org.apache.hadoop:hadoop-client-check-test-invariants 
>
   [INFO] Building Apache Hadoop Client Packaging Invariants for Test 
3.4.0-SNAPSHOT [106/113]
   [INFO] [ pom 
]-
   [INFO] 
   [INFO] -

> Send error to datanode if FBR is rejected due to bad lease
> --
>
> Key: HDFS-16942
> URL: https://issues.apache.org/jira/browse/HDFS-16942
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
>
> When a datanode sends a FBR to the namenode, it requires a lease to send it. 
> On a couple of busy clusters, we have seen an issue where the DN is somehow 
> delayed in sending the FBR after requesting the lease. The NN then rejects 
> the FBR and logs a message to that effect, but from the datanode's point of 
> view the report was successful, so it does not try to send another 
> report until the default 6 hour interval has passed.
> If this happens to a few DNs, there can be missing and under-replicated 
> blocks, further adding to the cluster load. Even worse, I have seen DNs 
> join the cluster with zero blocks, so it is not obvious that the 
> under-replication is caused by a lost FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected; 
> that way, the DN can request a new lease and try again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697620#comment-17697620
 ] 

ASF GitHub Bot commented on HDFS-16942:
---

hadoop-yetus commented on PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#issuecomment-1458829144

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 57s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  40m 48s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  checkstyle  |   1m  5s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 33s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  6s |  |  trunk passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 41s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 57s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  the patch passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 51s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 79 unchanged - 
0 fixed = 81 total (was 79)  |
   | +1 :green_heart: |  mvnsite  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  the patch passed with JDK 
Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 53s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 243m 39s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 51s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 354m 37s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5460/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5460 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux e8a6e148e1c3 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / b20165566caec49a059a287131608138782e5d48 |
   | Default Java | Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
 /usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u362-ga-0ubuntu1~20

[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing

2023-03-07 Thread Arpit Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697607#comment-17697607
 ] 

Arpit Agarwal commented on HDFS-16849:
--

I see, it is strange that the SNN couldn't recover on retries. That needs 
further investigation.

> Terminate SNN when failing to perform EditLogTailing
> 
>
> Key: HDFS-16849
> URL: https://issues.apache.org/jira/browse/HDFS-16849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Karthik Palanisamy
>Priority: Major
>
> We should terminate the SNN if edit log tailing fails for a sufficient number 
> of JNs. We found this after a Kerberos error. 
> {code:java}
> 2022-10-14 10:53:16,796 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms 
> (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions 
> so far: [:8485:  DestHost:destPort :8485 , LocalHost:localPort 
> /:0. Failed on local exception: 
> org.apache.hadoop.security.KerberosAuthException: Login failure for user: 
> hdfs/  javax.security.auth.login.LoginException: Client not found in 
> Kerberos database (6)]
> 2022-10-14 10:53:30,796 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input 
> streams from QJM to [:8485, :8485, :8485]. Skipping.
> java.io.IOException: Timed out waiting 2ms for a quorum of nodes to 
> respond.
>         at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427)
>  {code}
>  
> There is no check on whether a sufficient number of JNs responded: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280]
> So we should implement a check similar to this one:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395]
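A minimal sketch of the kind of quorum check the description asks for, using 
hypothetical method and variable names rather than JournalSet's actual code: fail 
edit log tailing loudly when fewer than a majority of JournalNodes supplied streams.
{code:java}
import java.io.IOException;

public class EditTailQuorumCheckSketch {
  /**
   * @param total     number of configured JournalNodes
   * @param responded number of JournalNodes that returned edit log input streams
   */
  public static void checkQuorum(int total, int responded) throws IOException {
    int required = total / 2 + 1; // simple majority quorum
    if (responded < required) {
      // The issue proposes surfacing this as a fatal condition (terminating the
      // SNN); the comments also weigh retrying/recovering instead. Throwing here
      // stands in for whichever policy is chosen.
      throw new IOException("Only " + responded + " of " + total
          + " JournalNodes provided edit streams; quorum of " + required + " required");
    }
  }
}
{code}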



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing

2023-03-07 Thread Karthik Palanisamy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697606#comment-17697606
 ] 

Karthik Palanisamy commented on HDFS-16849:
---

No misconfiguration [~arp]. As I recall, the SNN failed to recover when there was 
a KDC connection issue or a KDC restart, but the ANN was able to recover once the 
KDC connection became good again. 

> Terminate SNN when failing to perform EditLogTailing
> 
>
> Key: HDFS-16849
> URL: https://issues.apache.org/jira/browse/HDFS-16849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Karthik Palanisamy
>Priority: Major
>
> We should terminate the SNN if edit log tailing fails for a sufficient number 
> of JNs. We found this after a Kerberos error. 
> {code:java}
> 2022-10-14 10:53:16,796 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms 
> (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions 
> so far: [:8485:  DestHost:destPort :8485 , LocalHost:localPort 
> /:0. Failed on local exception: 
> org.apache.hadoop.security.KerberosAuthException: Login failure for user: 
> hdfs/  javax.security.auth.login.LoginException: Client not found in 
> Kerberos database (6)]
> 2022-10-14 10:53:30,796 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input 
> streams from QJM to [:8485, :8485, :8485]. Skipping.
> java.io.IOException: Timed out waiting 2ms for a quorum of nodes to 
> respond.
>         at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427)
>  {code}
>  
> There is no check on whether a sufficient number of JNs responded: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280]
> So we should implement a check similar to this one:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing

2023-03-07 Thread Arpit Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697590#comment-17697590
 ] 

Arpit Agarwal commented on HDFS-16849:
--

[~kpalanisamy] what causes the login failure? This particular error doesn't 
seem recoverable. Was it a cluster misconfiguration?

> Terminate SNN when failing to perform EditLogTailing
> 
>
> Key: HDFS-16849
> URL: https://issues.apache.org/jira/browse/HDFS-16849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Karthik Palanisamy
>Priority: Major
>
> We should terminate the SNN if edit log tailing fails for a sufficient number 
> of JNs. We found this after a Kerberos error. 
> {code:java}
> 2022-10-14 10:53:16,796 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms 
> (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions 
> so far: [:8485:  DestHost:destPort :8485 , LocalHost:localPort 
> /:0. Failed on local exception: 
> org.apache.hadoop.security.KerberosAuthException: Login failure for user: 
> hdfs/  javax.security.auth.login.LoginException: Client not found in 
> Kerberos database (6)]
> 2022-10-14 10:53:30,796 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input 
> streams from QJM to [:8485, :8485, :8485]. Skipping.
> java.io.IOException: Timed out waiting 2ms for a quorum of nodes to 
> respond.
>         at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427)
>  {code}
>  
> There is no check on whether a sufficient number of JNs responded: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280]
> So we should implement a check similar to this one:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing

2023-03-07 Thread Karthik Palanisamy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697584#comment-17697584
 ] 

Karthik Palanisamy commented on HDFS-16849:
---

Sure [~arp]. Auto-recovery would be a better solution than termination. 

> Terminate SNN when failing to perform EditLogTailing
> 
>
> Key: HDFS-16849
> URL: https://issues.apache.org/jira/browse/HDFS-16849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Karthik Palanisamy
>Priority: Major
>
> We should terminate the SNN if edit log tailing fails for a sufficient number 
> of JNs. We found this after a Kerberos error. 
> {code:java}
> 2022-10-14 10:53:16,796 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms 
> (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions 
> so far: [:8485:  DestHost:destPort :8485 , LocalHost:localPort 
> /:0. Failed on local exception: 
> org.apache.hadoop.security.KerberosAuthException: Login failure for user: 
> hdfs/  javax.security.auth.login.LoginException: Client not found in 
> Kerberos database (6)]
> 2022-10-14 10:53:30,796 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input 
> streams from QJM to [:8485, :8485, :8485]. Skipping.
> java.io.IOException: Timed out waiting 2ms for a quorum of nodes to 
> respond.
>         at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427)
>  {code}
>  
> There is no check on whether a sufficient number of JNs responded: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280]
> So we should implement a check similar to this one:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing

2023-03-07 Thread Karthik Palanisamy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697582#comment-17697582
 ] 

Karthik Palanisamy commented on HDFS-16849:
---

Yes Arpit, the SNN keeps retrying but always fails until we reboot the namenode:
local exception: org.apache.hadoop.security.KerberosAuthException: Login 
failure for user: hdfs/  javax.security.auth.login.LoginException: Client 
not found in Kerberos database (6)]
The problem is that our checkpoint didn't run. 

Customers think the checkpoint was doing fine since the SNN was up, but in 
reality the SNN is in a dead state. 

> Terminate SNN when failing to perform EditLogTailing
> 
>
> Key: HDFS-16849
> URL: https://issues.apache.org/jira/browse/HDFS-16849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Karthik Palanisamy
>Priority: Major
>
> We should terminate the SNN if edit log tailing fails for a sufficient number 
> of JNs. We found this after a Kerberos error. 
> {code:java}
> 2022-10-14 10:53:16,796 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms 
> (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions 
> so far: [:8485:  DestHost:destPort :8485 , LocalHost:localPort 
> /:0. Failed on local exception: 
> org.apache.hadoop.security.KerberosAuthException: Login failure for user: 
> hdfs/  javax.security.auth.login.LoginException: Client not found in 
> Kerberos database (6)]
> 2022-10-14 10:53:30,796 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input 
> streams from QJM to [:8485, :8485, :8485]. Skipping.
> java.io.IOException: Timed out waiting 2ms for a quorum of nodes to 
> respond.
>         at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427)
>  {code}
>  
> There is no check on whether a sufficient number of JNs responded: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280]
> So we should implement a check similar to this one:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16849) Terminate SNN when failing to perform EditLogTailing

2023-03-07 Thread Arpit Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697525#comment-17697525
 ] 

Arpit Agarwal commented on HDFS-16849:
--

It has been years since I looked at this code so this may be a dumb question. 
What is the benefit of self-terminating the SNN? Also what does the SNN do 
today - does it keep retrying? If this is a recoverable/potentially transient 
error then retrying may be the right thing to do.

> Terminate SNN when failing to perform EditLogTailing
> 
>
> Key: HDFS-16849
> URL: https://issues.apache.org/jira/browse/HDFS-16849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Karthik Palanisamy
>Priority: Major
>
> We should terminate the SNN if edit log tailing fails for a sufficient number 
> of JNs. We found this after a Kerberos error. 
> {code:java}
> 2022-10-14 10:53:16,796 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms 
> (timeout=2 ms) for a response for selectStreamingInputStreams. Exceptions 
> so far: [:8485:  DestHost:destPort :8485 , LocalHost:localPort 
> /:0. Failed on local exception: 
> org.apache.hadoop.security.KerberosAuthException: Login failure for user: 
> hdfs/  javax.security.auth.login.LoginException: Client not found in 
> Kerberos database (6)]
> 2022-10-14 10:53:30,796 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input 
> streams from QJM to [:8485, :8485, :8485]. Skipping.
> java.io.IOException: Timed out waiting 2ms for a quorum of nodes to 
> respond.
>         at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605)
>         at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523)
>         at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>         at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427)
>  {code}
>  
> There is no check on whether a sufficient number of JNs responded: 
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280]
> So we should implement a check similar to this one:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease

2023-03-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16942:
--
Labels: pull-request-available  (was: )

> Send error to datanode if FBR is rejected due to bad lease
> --
>
> Key: HDFS-16942
> URL: https://issues.apache.org/jira/browse/HDFS-16942
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
>
> When a datanode sends a FBR to the namenode, it requires a lease to send it. 
> On a couple of busy clusters, we have seen an issue where the DN is somehow 
> delayed in sending the FBR after requesting the lease. The NN then rejects 
> the FBR and logs a message to that effect, but from the datanode's point of 
> view the report was successful, so it does not try to send another 
> report until the default 6 hour interval has passed.
> If this happens to a few DNs, there can be missing and under-replicated 
> blocks, further adding to the cluster load. Even worse, I have seen DNs 
> join the cluster with zero blocks, so it is not obvious that the 
> under-replication is caused by a lost FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected; 
> that way, the DN can request a new lease and try again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697456#comment-17697456
 ] 

ASF GitHub Bot commented on HDFS-16942:
---

sodonnel opened a new pull request, #5460:
URL: https://github.com/apache/hadoop/pull/5460

   ### Description of PR
   
   When a datanode sends a FBR to the namenode, it requires a lease to send it. 
On a couple of busy clusters, we have seen an issue where the DN is somehow 
delayed in sending the FBR after requesting the lease. The NN then rejects the 
FBR and logs a message to that effect, but from the datanode's point of view the 
report was successful, so it does not try to send another report until 
the default 6 hour interval has passed.
   
   If this happens to a few DNs, there can be missing and under-replicated 
blocks, further adding to the cluster load. Even worse, I have seen DNs join 
the cluster with zero blocks, so it is not obvious that the under-replication is 
caused by a lost FBR, as all DNs appear to be up and running.
   
   I believe we should propagate an error back to the DN if the FBR is 
rejected; that way, the DN can request a new lease and try again.
   
   ### How was this patch tested?
   
   Modified a test and added a new test. Also validated manually by changing 
the code to always throw the InvalidLease exception to ensure the DN handled 
it, reset the lease ID and retried.




> Send error to datanode if FBR is rejected due to bad lease
> --
>
> Key: HDFS-16942
> URL: https://issues.apache.org/jira/browse/HDFS-16942
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>
> When a datanode sends a FBR to the namenode, it requires a lease to send it. 
> On a couple of busy clusters, we have seen an issue where the DN is somehow 
> delayed in sending the FBR after requesting the lease. The NN then rejects 
> the FBR and logs a message to that effect, but from the datanode's point of 
> view the report was successful, so it does not try to send another 
> report until the default 6 hour interval has passed.
> If this happens to a few DNs, there can be missing and under-replicated 
> blocks, further adding to the cluster load. Even worse, I have seen DNs 
> join the cluster with zero blocks, so it is not obvious that the 
> under-replication is caused by a lost FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected; 
> that way, the DN can request a new lease and try again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16942) Send error to datanode if FBR is rejected due to bad lease

2023-03-07 Thread Stephen O'Donnell (Jira)
Stephen O'Donnell created HDFS-16942:


 Summary: Send error to datanode if FBR is rejected due to bad lease
 Key: HDFS-16942
 URL: https://issues.apache.org/jira/browse/HDFS-16942
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, namenode
Reporter: Stephen O'Donnell
Assignee: Stephen O'Donnell


When a datanode sends a FBR to the namenode, it requires a lease to send it. On 
a couple of busy clusters, we have seen an issue where the DN is somehow 
delayed in sending the FBR after requesting the lease. The NN then rejects the 
FBR and logs a message to that effect, but from the datanode's point of view the 
report was successful, so it does not try to send another report until 
the default 6 hour interval has passed.

If this happens to a few DNs, there can be missing and under-replicated blocks, 
further adding to the cluster load. Even worse, I have seen DNs join the 
cluster with zero blocks, so it is not obvious that the under-replication is caused 
by a lost FBR, as all DNs appear to be up and running.

I believe we should propagate an error back to the DN if the FBR is rejected; 
that way, the DN can request a new lease and try again.
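A minimal sketch of the NameNode-side behaviour this proposes, with illustrative 
names standing in for the eventual patch: when the FBR lease check fails, raise an 
error back to the DataNode instead of only logging and dropping the report.
{code:java}
import java.io.IOException;

public class FbrLeaseCheckSketch {
  /** Hypothetical stand-in for the exception a patch would introduce. */
  public static class InvalidBlockReportLeaseException extends IOException {
    public InvalidBlockReportLeaseException(String msg) { super(msg); }
  }

  public static void requireValidLease(boolean leaseValid, long leaseId)
      throws InvalidBlockReportLeaseException {
    if (!leaseValid) {
      // Propagating the error lets the DataNode request a new lease and retry,
      // rather than silently waiting for the next 6 hour block report interval.
      throw new InvalidBlockReportLeaseException(
          "Rejecting full block report: invalid lease 0x" + Long.toHexString(leaseId));
    }
  }
}
{code}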



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16929) Interactive option with yes/no for intentional/accidental data deletion with -skipTrash

2023-03-07 Thread Navin Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navin Kumar updated HDFS-16929:
---
Description: 
To avoid data loss when executing an unintentional/accidental rm command with 
-skipTrash. Although we have many features like snapshots and protected 
directories, an interactive "yes/no" option during deletion could provide a 
double confirmation, with a message that the data might be deleted 
permanently.



  was:
To avoid dataloss while executing some unintentional/accidental rm command with 
-skipTrash . Although we have many features like snapshot, protected 
directories the interactive option with "yes/no" while deletion could help in 
double confirmation while deletion.




> Interactive option with yes/no for intentional/accidental data deletion with 
> -skipTrash
> ---
>
> Key: HDFS-16929
> URL: https://issues.apache.org/jira/browse/HDFS-16929
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfs, hdfs
>Reporter: Navin Kumar
>Priority: Major
>
> To avoid data loss when executing an unintentional/accidental rm command 
> with -skipTrash. Although we have many features like snapshots and protected 
> directories, an interactive "yes/no" option during deletion could provide a 
> double confirmation, with a message that the data might be deleted 
> permanently.
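A minimal sketch of the kind of confirmation being proposed; this is a hypothetical 
standalone prompt, not the FsShell implementation.
{code:java}
import java.util.Locale;
import java.util.Scanner;

public class SkipTrashPromptSketch {
  /** Returns true only if the user explicitly confirms the permanent delete. */
  public static boolean confirm(String path) {
    System.out.printf(
        "rm -skipTrash will delete %s permanently (bypassing trash). Proceed? (yes/no): ",
        path);
    String answer = new Scanner(System.in).nextLine().trim().toLowerCase(Locale.ROOT);
    return answer.equals("yes") || answer.equals("y");
  }
}
{code}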



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org