[jira] [Commented] (HDFS-16891) Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678101#comment-17678101
 ] 

ASF GitHub Bot commented on HDFS-16891:
---

hadoop-yetus commented on PR #5300:
URL: https://github.com/apache/hadoop/pull/5300#issuecomment-1386602477

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 40s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m 49s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  7s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 31s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  7s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 23s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  25m 37s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 17s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 10s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m 10s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 51s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 49s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   1m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 15s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m 36s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 306m 56s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 47s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 424m 25s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5300/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5300 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 2e2dd3ae5836 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / f205782b8ee24449b5dcf91db00f726398eb7ca0 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5300/2/testReport/ |
   | Max. process+thread count | 3066 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5300/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was 

[jira] [Commented] (HDFS-16791) New API for enclosing root path for a file

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678068#comment-17678068
 ] 

ASF GitHub Bot commented on HDFS-16791:
---

hadoop-yetus commented on PR #4967:
URL: https://github.com/apache/hadoop/pull/4967#issuecomment-1386503291

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 38s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  
|
   | +0 :ok: |  buf  |   0m  0s |  |  buf was not available.  |
   | +0 :ok: |  buf  |   0m  0s |  |  buf was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 11 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m  1s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m 17s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  23m  8s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |  20m 26s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   3m 48s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   5m 29s |  |  trunk passed  |
   | -1 :x: |  javadoc  |   1m 13s | 
[/branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4967/6/artifact/out/branch-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt)
 |  hadoop-common in trunk failed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.  |
   | +1 :green_heart: |  javadoc  |   4m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |  10m 18s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m 31s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 28s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   4m  5s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m 30s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  cc  |  22m 30s |  |  the patch passed  |
   | -1 :x: |  javac  |  22m 30s | 
[/results-compile-javac-root-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4967/6/artifact/out/results-compile-javac-root-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt)
 |  root-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 generated 8 new + 2821 unchanged - 0 
fixed = 2829 total (was 2821)  |
   | +1 :green_heart: |  compile  |  20m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  cc  |  20m 26s |  |  the patch passed  |
   | -1 :x: |  javac  |  20m 26s | 
[/results-compile-javac-root-jdkPrivateBuild-1.8.0_352-8u352-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4967/6/artifact/out/results-compile-javac-root-jdkPrivateBuild-1.8.0_352-8u352-ga-1~20.04-b08.txt)
 |  root-jdkPrivateBuild-1.8.0_352-8u352-ga-1~20.04-b08 with JDK Private 
Build-1.8.0_352-8u352-ga-1~20.04-b08 generated 8 new + 2621 unchanged - 0 fixed 
= 2629 total (was 2621)  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4967/6/artifact/out/blanks-eol.txt)
 |  The patch has 11 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | -0 :warning: |  checkstyle  |   3m 38s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4967/6/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 33 new + 707 unchanged - 15 fixed = 740 total 
(was 722)  |
   | +1 :green_heart: |  mvnsite  |   5m 28s |  |  the patch passed  |
   | -1 :x: |  javadoc  |   1m  4s | 

[jira] [Commented] (HDFS-16892) Fix method name of RPC.Builder#setnumReaders

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678064#comment-17678064
 ] 

ASF GitHub Bot commented on HDFS-16892:
---

ayushtkn commented on code in PR #5301:
URL: https://github.com/apache/hadoop/pull/5301#discussion_r1073077149


##
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java:
##
@@ -901,7 +901,7 @@ public Builder setNumHandlers(int numHandlers) {
  * @return Default: -1.
  * @param numReaders input numReaders.
  */
-public Builder setnumReaders(int numReaders) {
+public Builder setNumReaders(int numReaders) {

Review Comment:
   Is this ok from compatibility point of view, it is a public method in a 
public class. if any downstream project is using it, they will get compilation 
failures?
   this method seems to be there since 2012, so pretty old as well...





> Fix method name  of RPC.Builder#setnumReaders
> -
>
> Key: HDFS-16892
> URL: https://issues.apache.org/jira/browse/HDFS-16892
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16891) Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678031#comment-17678031
 ] 

ASF GitHub Bot commented on HDFS-16891:
---

virajjasani commented on PR #5300:
URL: https://github.com/apache/hadoop/pull/5300#issuecomment-1386270795

   Thanks for the reviews @cnauroth @sodonnel.
   
   > which will result in the image failing to load and the NN aborting, so its 
an exception that we really don't expect to happen.
   
   That is correct. As such this is going to lead to failure eventually. The 
only reason I came across this sometime back was due to profiling of a 
purposeful failure asserting test. We would like to use this parallelism of 
inodes loading with hadoop 3 upgrades (still running hadoop 2 for majority 
clusters), and hence running some tests around this.
   
   
   > Can the code be simplifed to this?
   > final List exceptions = Collections.synchronizedList(new 
ArrayList<>());
   
   > Using `Collections.synchronizedList` does seem simpler or synchronizing on 
the exceptions object rather than having a separate lock object probably makes 
sense to simplify this change further.
   
   Sounds good, thanks.




> Avoid the overhead of copy-on-write exception list while loading inodes sub 
> sections in parallel
> 
>
> Key: HDFS-16891
> URL: https://issues.apache.org/jira/browse/HDFS-16891
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> If we enable parallel loading and persisting of inodes from/to fs image, we 
> get the benefit of improved performance. However, while loading sub-sections 
> INODE_DIR_SUB and INODE_SUB, if we encounter any errors, we use copy-on-write 
> list to maintain the list of exceptions. Since our usecase is not to iterate 
> over this list while executor threads are adding new elements to the list, 
> using copy-on-write is bit of an overhead for this usecase.
> It would be better to synchronize adding new elements to the list rather than 
> having the list copy all elements over every time new element is added to the 
> list.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16891) Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678009#comment-17678009
 ] 

ASF GitHub Bot commented on HDFS-16891:
---

sodonnel commented on PR #5300:
URL: https://github.com/apache/hadoop/pull/5300#issuecomment-1386198922

   I don't recall my reason for using a copyOnWrite list, but the list is only 
used in the case of an exception, which will result in the image failing to 
load and the NN aborting, so its an exception that we really don't expect to 
happen. Therefore as it stands, the CopyOnWrite list has basically zero 
overhead. Even if there are exceptions, the total number of entries is equal to 
the parallel loading threads, so low tens of entries at the most.
   
   Using `Collections.synchronizedList` does seem simpler or synchronizing on 
the exceptions object rather than having a separate lock object probably makes 
sense to simplify this change further.




> Avoid the overhead of copy-on-write exception list while loading inodes sub 
> sections in parallel
> 
>
> Key: HDFS-16891
> URL: https://issues.apache.org/jira/browse/HDFS-16891
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> If we enable parallel loading and persisting of inodes from/to fs image, we 
> get the benefit of improved performance. However, while loading sub-sections 
> INODE_DIR_SUB and INODE_SUB, if we encounter any errors, we use copy-on-write 
> list to maintain the list of exceptions. Since our usecase is not to iterate 
> over this list while executor threads are adding new elements to the list, 
> using copy-on-write is bit of an overhead for this usecase.
> It would be better to synchronize adding new elements to the list rather than 
> having the list copy all elements over every time new element is added to the 
> list.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16891) Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678005#comment-17678005
 ] 

ASF GitHub Bot commented on HDFS-16891:
---

cnauroth commented on PR #5300:
URL: https://github.com/apache/hadoop/pull/5300#issuecomment-1386189316

   CC @sodonnel and @jojochuang who originally authored/reviewed this code in 
[HDFS-14617](https://issues.apache.org/jira/browse/HDFS-14617) / #1028 , just 
in case they had some reason I haven't considered to prefer 
`CopyOnWriteArrayList` for this.




> Avoid the overhead of copy-on-write exception list while loading inodes sub 
> sections in parallel
> 
>
> Key: HDFS-16891
> URL: https://issues.apache.org/jira/browse/HDFS-16891
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> If we enable parallel loading and persisting of inodes from/to fs image, we 
> get the benefit of improved performance. However, while loading sub-sections 
> INODE_DIR_SUB and INODE_SUB, if we encounter any errors, we use copy-on-write 
> list to maintain the list of exceptions. Since our usecase is not to iterate 
> over this list while executor threads are adding new elements to the list, 
> using copy-on-write is bit of an overhead for this usecase.
> It would be better to synchronize adding new elements to the list rather than 
> having the list copy all elements over every time new element is added to the 
> list.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16791) New API for enclosing root path for a file

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677925#comment-17677925
 ] 

ASF GitHub Bot commented on HDFS-16791:
---

mccormickt12 commented on PR #4967:
URL: https://github.com/apache/hadoop/pull/4967#issuecomment-1385965245

   > > I see AbstractFSContract but that seems pretty bare, im not sure if 
thats what you are referring to or something else.
   > 
   > look at all the subclasses of `AbstractFSContractTestBase` ... they are 
the main test suites implemented by all the connectors with the docs and hdfs 
the reference implementation
   
   @steveloughran I think I found it. I just updated the PR with the contract 
tests. It looks pretty comparable to other contract test suites. lmk if there 
is still something missing. 




> New API for enclosing root path for a file
> --
>
> Key: HDFS-16791
> URL: https://issues.apache.org/jira/browse/HDFS-16791
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Tom McCormick
>Assignee: Tom McCormick
>Priority: Major
>  Labels: pull-request-available
>
> At LinkedIn we run many HDFS volumes that are federated by either 
> ViewFilesystem or Router Based Federation. As our number of hdfs volumes 
> grows, we have a growing need to migrate data seemlessly across volumes.
> Many frameworks have a notion of staging or temp directories, but those 
> directories often live in random locations. We want an API getEnclosingRoot, 
> which provides the root path a file or dataset. 
> In ViewFilesystem / Router Based Federation, the enclosingRoot will be the 
> mount point.
> We will also take into account other restrictions for renames like 
> encryptions zones.
> If there are several paths (a mount point and an encryption zone), we will 
> return the longer path



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16689) Standby NameNode crashes when transitioning to Active with in-progress tailer

2023-01-17 Thread Xing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-16689:

Description: 
Standby NameNode crashes when transitioning to Active with a in-progress 
tailer. And the error message like blew:
{code:java}
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when 
there is a stream available for read: ByteStringEditLog[X, Y], 
ByteStringEditLog[X, 0]
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
... 36 more
{code}
After tracing and found there is a critical bug in 
*EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* 
is true. Because *catchupDuringFailover()* try to replay all missed edits from 
JournalNodes with {*}onlyDurableTxns=true{*}. It may cannot replay any edits 
when they are some abnormal JournalNodes.

Reproduce method, suppose:
 - There are 2 namenode, namely NN0 and NN1, and the status of echo namenode is 
Active, Standby respectively. And there are 3 JournalNodes, namely JN0, JN1 and 
JN2.
 - NN0 try to sync 3 edits to JNs with started txid 3, but only successfully 
synced them to JN1 and JN2 {-}JN3{-}. And JN0 is abnormal, such as GC, bad 
network or restarted.
 - NN1's lastAppliedTxId is 2, and at the moment, we are trying failover active 
from NN0 to NN1.
 - NN1 only got two responses from JN0 and JN1 when it try to selecting 
inputStreams with *fromTxnId=3* and {*}onlyDurableTxns=true{*}, and the count 
txid of response is 0, 3 respectively. JN2 is abnormal, such as GC, bad network 
or restarted.
 - NN1 will cannot replay any Edits with *fromTxnId=3* from JournalNodes 
because the *maxAllowedTxns* is 0.

So I think Standby NameNode should *catchupDuringFailover()* with 
*onlyDurableTxns=false* , so that it can replay all missed edits from 
JournalNode.

  was:
Standby NameNode crashes when transitioning to Active with a in-progress 
tailer. And the error message like blew:


{code:java}
Caused by: java.lang.IllegalStateException: Cannot start writing at txid X when 
there is a stream available for read: ByteStringEditLog[X, Y], 
ByteStringEditLog[X, 0]
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:344)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.openForWrite(FSEditLogAsync.java:113)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1423)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:2132)
... 36 more
{code}

After tracing and found there is a critical bug in 
*EditlogTailer#catchupDuringFailover()* when *DFS_HA_TAILEDITS_INPROGRESS_KEY* 
is true. Because *catchupDuringFailover()* try to replay all missed edits from 
JournalNodes with *onlyDurableTxns=true*. It may cannot replay any edits when 
they are some abnormal JournalNodes. 

Reproduce method, suppose:
- There are 2 namenode, namely NN0 and NN1, and the status of echo namenode is 
Active, Standby respectively. And there are 3 JournalNodes, namely JN0, JN1 and 
JN2. 
- NN0 try to sync 3 edits to JNs with started txid 3, but only successfully 
synced them to JN1 and JN3. And JN0 is abnormal, such as GC, bad network or 
restarted.
- NN1's lastAppliedTxId is 2, and at the moment, we are trying failover active 
from NN0 to NN1. 
- NN1 only got two responses from JN0 and JN1 when it try to selecting 
inputStreams with *fromTxnId=3*  and *onlyDurableTxns=true*, and the count txid 
of response is 0, 3 respectively. JN2 is abnormal, such as GC,  bad network or 
restarted.
- NN1 will cannot replay any Edits with *fromTxnId=3* from JournalNodes because 
the *maxAllowedTxns* is 0.


So I think Standby NameNode should *catchupDuringFailover()* with 
*onlyDurableTxns=false* , so that it can replay all missed edits from 
JournalNode.


> Standby NameNode crashes when transitioning to Active with in-progress tailer
> -
>
> Key: HDFS-16689
> URL: https://issues.apache.org/jira/browse/HDFS-16689
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Standby NameNode crashes when transitioning to Active with a in-progress 
> tailer. And the error message like blew:
> 

[jira] [Commented] (HDFS-16791) New API for enclosing root path for a file

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677907#comment-17677907
 ] 

ASF GitHub Bot commented on HDFS-16791:
---

steveloughran commented on PR #4967:
URL: https://github.com/apache/hadoop/pull/4967#issuecomment-1385922725

   >  I see AbstractFSContract but that seems pretty bare, im not sure if thats 
what you are referring to or something else.
   
   look at all the subclasses of `AbstractFSContractTestBase` ... they are the 
main test suites implemented by all the connectors with the docs and hdfs the 
reference implementation




> New API for enclosing root path for a file
> --
>
> Key: HDFS-16791
> URL: https://issues.apache.org/jira/browse/HDFS-16791
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Tom McCormick
>Assignee: Tom McCormick
>Priority: Major
>  Labels: pull-request-available
>
> At LinkedIn we run many HDFS volumes that are federated by either 
> ViewFilesystem or Router Based Federation. As our number of hdfs volumes 
> grows, we have a growing need to migrate data seemlessly across volumes.
> Many frameworks have a notion of staging or temp directories, but those 
> directories often live in random locations. We want an API getEnclosingRoot, 
> which provides the root path a file or dataset. 
> In ViewFilesystem / Router Based Federation, the enclosingRoot will be the 
> mount point.
> We will also take into account other restrictions for renames like 
> encryptions zones.
> If there are several paths (a mount point and an encryption zone), we will 
> return the longer path



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16891) Avoid the overhead of copy-on-write exception list while loading inodes sub sections in parallel

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677888#comment-17677888
 ] 

ASF GitHub Bot commented on HDFS-16891:
---

virajjasani commented on PR #5300:
URL: https://github.com/apache/hadoop/pull/5300#issuecomment-1385849500

   @jojochuang @sodonnel could you please review this PR?




> Avoid the overhead of copy-on-write exception list while loading inodes sub 
> sections in parallel
> 
>
> Key: HDFS-16891
> URL: https://issues.apache.org/jira/browse/HDFS-16891
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> If we enable parallel loading and persisting of inodes from/to fs image, we 
> get the benefit of improved performance. However, while loading sub-sections 
> INODE_DIR_SUB and INODE_SUB, if we encounter any errors, we use copy-on-write 
> list to maintain the list of exceptions. Since our usecase is not to iterate 
> over this list while executor threads are adding new elements to the list, 
> using copy-on-write is bit of an overhead for this usecase.
> It would be better to synchronize adding new elements to the list rather than 
> having the list copy all elements over every time new element is added to the 
> list.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16864) HDFS advisory caching should drop cache behind block when block closed

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677870#comment-17677870
 ] 

ASF GitHub Bot commented on HDFS-16864:
---

dlmarion commented on PR #5204:
URL: https://github.com/apache/hadoop/pull/5204#issuecomment-1385796995

   Not sure what to make of these test failures. I ran 'mvn clean package 
-Dtest=TestBalancer*` locally and they all passed.
   
   ```
   [INFO] -

> HDFS advisory caching should drop cache behind block when block closed
> --
>
> Key: HDFS-16864
> URL: https://issues.apache.org/jira/browse/HDFS-16864
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.4
>Reporter: Dave Marion
>Priority: Minor
>  Labels: pull-request-available
>
> One of the comments in HDFS-4817 describes the behavior in 
> BlockReceiver.manageWriterOsCache:
> "The general idea is that there isn't much point in calling 
> {{sync_file_pages}} twice on the same offsets, since the sync process has 
> presumably already begun. On the other hand, calling 
> {{fadvise(FADV_DONTNEED)}} again and again will tend to purge more and more 
> bytes from the cache. The reason is because dirty pages (those containing 
> un-written-out-data) cannot be purged using {{{}FADV_DONTNEED{}}}. And we 
> can't know exactly when the pages we wrote will be flushed to disk. But we do 
> know that calling {{FADV_DONTNEED}} on very recently written bytes is a 
> waste, since they will almost certainly not have been written out to disk. 
> That is why it purges between 0 and {{{}lastCacheManagementOffset - 
> CACHE_WINDOW_SIZE{}}}, rather than simply 0 to pos."
> Looking at the code, I'm wondering if at least the last 8MB (size of 
> CACHE_WINDOW_SIZE) of a block might be left without an associated 
> FADVISE_DONT_NEED call. We're having a 
> [discussion|https://the-asf.slack.com/archives/CERNB8NDC/p1669399302264189] 
> in #accumulo about the file caching feature and I found some interesting 
> [results|https://gist.github.com/dlmarion/1835f387b0fa8fb9dbf849a0c87b6d04] 
> in a test that we wrote. Specifically, that for a multi-block file using 
> setDropBehind with either hsync or CreateFlag.SYNC_BLOCK, parts of each block 
> remained in the cache instead of parts of the last block.
> I'm wondering if there is a reason not to call fadvise(FADV_DONTNEED) on the 
> entire block in close 
> [here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java#L371]
>  when dropCacheBehindWrites is true.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16890) RBF: Add period state refresh to keep router state near active namenode's

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677866#comment-17677866
 ] 

ASF GitHub Bot commented on HDFS-16890:
---

goiri commented on code in PR #5298:
URL: https://github.com/apache/hadoop/pull/5298#discussion_r1072549013


##
hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/router/TestObserverWithRouter.java:
##
@@ -639,4 +643,36 @@ public void testRouterStateIdContextCleanup() throws 
Exception {
 assertEquals("ns0", namespace1.get(0));
 assertTrue(namespace2.isEmpty());
   }
+
+  @Test
+  @Tag(SKIP_BEFORE_EACH_CLUSTER_STARTUP)
+  public void testPeriodicStateRefreshUsingActiveNamenode() throws Exception {
+Path rootPath = new Path("/");
+
+Configuration confOverride = new Configuration(false);
+
confOverride.set(RBFConfigKeys.DFS_ROUTER_OBSERVER_STATE_ID_REFRESH_PERIOD_KEY, 
"500ms");
+confOverride.set(DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_KEY, "3s");
+startUpCluster(1, confOverride);
+
+fileSystem  = routerContext.getFileSystem(getConfToEnableObserverReads());
+fileSystem.listStatus(rootPath);
+int initialLengthOfRootListing = fileSystem.listStatus(rootPath).length;
+
+DFSClient activeClient = cluster.getNamenodes("ns0")
+.stream()
+.filter(nnContext -> nnContext.getNamenode().isActiveState())
+.findFirst().orElseThrow(() -> new IllegalStateException("No active 
namenode."))
+.getClient();
+
+for (int i = 0; i < 10; i++) {
+  activeClient.mkdirs("/dir" + i, null, false);
+}
+activeClient.close();
+
+// Wait long enough for state in router to be considered stale.
+Thread.sleep(700);

Review Comment:
   GenericTestUtils#waitFor





> RBF: Add period state refresh to keep router state near active namenode's
> -
>
> Key: HDFS-16890
> URL: https://issues.apache.org/jira/browse/HDFS-16890
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
>
> When using the ObserverReadProxyProvider, clients can set 
> *dfs.client.failover.observer.auto-msync-period...* to periodically get the 
> Active namenode's state. When using routers without the 
> ObserverReadProxyProvider, this periodic update is lost.
> In a busy cluster, the Router constantly gets updated with the active 
> namenode's state when
>  # There is a write operation.
>  # There is an operation (read/write) from a new clients.
> However, in the scenario when there are no new clients and no write 
> operations, the state kept in the router can lag behind the active's. The 
> router does update its state with responses from the Observer, but the 
> observer may be lagging behind too.
> We should have a periodic refresh in the router to serve a similar role as 
> *dfs.client.failover.observer.auto-msync-period*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16890) RBF: Add period state refresh to keep router state near active namenode's

2023-01-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-16890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-16890:
---
Summary: RBF: Add period state refresh to keep router state near active 
namenode's  (was: Add period state refresh to keep router state near active 
namenode's)

> RBF: Add period state refresh to keep router state near active namenode's
> -
>
> Key: HDFS-16890
> URL: https://issues.apache.org/jira/browse/HDFS-16890
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
>
> When using the ObserverReadProxyProvider, clients can set 
> *dfs.client.failover.observer.auto-msync-period...* to periodically get the 
> Active namenode's state. When using routers without the 
> ObserverReadProxyProvider, this periodic update is lost.
> In a busy cluster, the Router constantly gets updated with the active 
> namenode's state when
>  # There is a write operation.
>  # There is an operation (read/write) from a new clients.
> However, in the scenario when there are no new clients and no write 
> operations, the state kept in the router can lag behind the active's. The 
> router does update its state with responses from the Observer, but the 
> observer may be lagging behind too.
> We should have a periodic refresh in the router to serve a similar role as 
> *dfs.client.failover.observer.auto-msync-period*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16853) The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because HADOOP-18324

2023-01-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677849#comment-17677849
 ] 

Steve Loughran commented on HDFS-16853:
---

marking asa a blocker for 3.3.5. [~omalley] could you look at this?

> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> because HADOOP-18324
> ---
>
> Key: HDFS-16853
> URL: https://issues.apache.org/jira/browse/HDFS-16853
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.5
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Blocker
>  Labels: pull-request-available
>
> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> with error message: Waiting for cluster to become active. And the blocking 
> jstack as bellows:
> {code:java}
> "BP-1618793397-192.168.3.4-1669198559828 heartbeating to 
> localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x
> 7fc1108fa000 nid=0x19303 waiting on condition [0x700017884000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0007430a9ec0> (a 
> java.util.concurrent.SynchronousQueue$TransferQueue)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695)
>         at 
> java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)
>         at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1482)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>         at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClient
> SideTranslatorPB.java:168)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915)
>         at java.lang.Thread.run(Thread.java:748)  {code}
> After looking into the code and found that this bug is imported by 
> HADOOP-18324. Because RpcRequestSender exited without cleaning up the 
> rpcRequestQueue, then caused BPServiceActor was blocked in sending request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16853) The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because HADOOP-18324

2023-01-17 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HDFS-16853:
--
Affects Version/s: 3.3.5

> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> because HADOOP-18324
> ---
>
> Key: HDFS-16853
> URL: https://issues.apache.org/jira/browse/HDFS-16853
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.5
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Blocker
>  Labels: pull-request-available
>
> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> with error message: Waiting for cluster to become active. And the blocking 
> jstack as bellows:
> {code:java}
> "BP-1618793397-192.168.3.4-1669198559828 heartbeating to 
> localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x
> 7fc1108fa000 nid=0x19303 waiting on condition [0x700017884000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0007430a9ec0> (a 
> java.util.concurrent.SynchronousQueue$TransferQueue)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695)
>         at 
> java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)
>         at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1482)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>         at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClient
> SideTranslatorPB.java:168)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915)
>         at java.lang.Thread.run(Thread.java:748)  {code}
> After looking into the code and found that this bug is imported by 
> HADOOP-18324. Because RpcRequestSender exited without cleaning up the 
> rpcRequestQueue, then caused BPServiceActor was blocked in sending request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16853) The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because HADOOP-18324

2023-01-17 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HDFS-16853:
--
Priority: Blocker  (was: Major)

> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> because HADOOP-18324
> ---
>
> Key: HDFS-16853
> URL: https://issues.apache.org/jira/browse/HDFS-16853
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Blocker
>  Labels: pull-request-available
>
> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> with error message: Waiting for cluster to become active. And the blocking 
> jstack as bellows:
> {code:java}
> "BP-1618793397-192.168.3.4-1669198559828 heartbeating to 
> localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x
> 7fc1108fa000 nid=0x19303 waiting on condition [0x700017884000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0007430a9ec0> (a 
> java.util.concurrent.SynchronousQueue$TransferQueue)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695)
>         at 
> java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)
>         at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1482)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>         at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClient
> SideTranslatorPB.java:168)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915)
>         at java.lang.Thread.run(Thread.java:748)  {code}
> After looking into the code and found that this bug is imported by 
> HADOOP-18324. Because RpcRequestSender exited without cleaning up the 
> rpcRequestQueue, then caused BPServiceActor was blocked in sending request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16893) Standardize the usage of DFSClient debug log

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677828#comment-17677828
 ] 

ASF GitHub Bot commented on HDFS-16893:
---

hadoop-yetus commented on PR #5303:
URL: https://github.com/apache/hadoop/pull/5303#issuecomment-1385679680

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m 33s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m  1s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 55s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 35s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 59s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   2m 39s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m 53s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 56s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 53s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 53s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 46s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 19s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 49s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   2m 32s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  24m 46s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 25s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 111m 15s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5303/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5303 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 87c7a4c95305 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / f1b703d1d4ea68cf8a6e89536f20055a20e66d7b |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5303/3/testReport/ |
   | Max. process+thread count | 570 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client U: 
hadoop-hdfs-project/hadoop-hdfs-client |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5303/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> Standardize the usage of DFSClient debug log
> 

[jira] [Resolved] (HDFS-16764) ObserverNamenode handles addBlock rpc and throws a FileNotFoundException

2023-01-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16764.

Resolution: Fixed

> ObserverNamenode handles addBlock rpc and throws a FileNotFoundException 
> -
>
> Key: HDFS-16764
> URL: https://issues.apache.org/jira/browse/HDFS-16764
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> ObserverNameNode currently can handle the addBlockLocation RPC, but it may 
> throw a FileNotFoundException when it contains stale txid.
>  * AddBlock is not a coordinated method, so Observer will not check the 
> statId.
>  * AddBlock does the validation with checkOperation(OperationCategory.READ)
> So the observer can handle the addBlock rpc. If this observer cannot replay 
> the edit of create file, it will throw a FileNotFoundException during doing 
> validation.
> The related code as follows:
> {code:java}
> checkOperation(OperationCategory.READ);
> final FSPermissionChecker pc = getPermissionChecker();
> FSPermissionChecker.setOperationType(operationName);
> readLock();
> try {
>   checkOperation(OperationCategory.READ);
>   r = FSDirWriteFileOp.validateAddBlock(this, pc, src, fileId, clientName,
> previous, onRetryBlock);
> } finally {
>   readUnlock(operationName);
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16764) ObserverNamenode handles addBlock rpc and throws a FileNotFoundException

2023-01-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16764:
---
Fix Version/s: 3.3.9
   (was: 3.3.6)

> ObserverNamenode handles addBlock rpc and throws a FileNotFoundException 
> -
>
> Key: HDFS-16764
> URL: https://issues.apache.org/jira/browse/HDFS-16764
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> ObserverNameNode currently can handle the addBlockLocation RPC, but it may 
> throw a FileNotFoundException when it contains stale txid.
>  * AddBlock is not a coordinated method, so Observer will not check the 
> statId.
>  * AddBlock does the validation with checkOperation(OperationCategory.READ)
> So the observer can handle the addBlock rpc. If this observer cannot replay 
> the edit of create file, it will throw a FileNotFoundException during doing 
> validation.
> The related code as follows:
> {code:java}
> checkOperation(OperationCategory.READ);
> final FSPermissionChecker pc = getPermissionChecker();
> FSPermissionChecker.setOperationType(operationName);
> readLock();
> try {
>   checkOperation(OperationCategory.READ);
>   r = FSDirWriteFileOp.validateAddBlock(this, pc, src, fileId, clientName,
> previous, onRetryBlock);
> } finally {
>   readUnlock(operationName);
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members

2023-01-17 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen updated HDFS-16872:
---
Fix Version/s: 3.3.9
   (was: 3.3.6)

> Fix log throttling by declaring LogThrottlingHelper as static members
> -
>
> Key: HDFS-16872
> URL: https://issues.apache.org/jira/browse/HDFS-16872
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.4
>Reporter: Chengbing Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.9
>
>
> In our production cluster with Observer NameNode enabled, we have plenty of 
> logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The 
> {{LogThrottlingHelper}} doesn't seem to work.
> {noformat}
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 
> 17686250688]' to transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to 
> transaction ID 17686250688
> 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 
> 17686250688], ByteStringEditLog[17686250688, 17686250688], 
> ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 
> 1.0, total load time 0.0 ms
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Start loading edits file ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693] maxTxnsToRead = 9223372036854775807
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 
> 17686250693]' to transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
> Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to 
> transaction ID 17686250689
> 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 
> 17686250693], ByteStringEditLog[17686250689, 17686250693], 
> ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 
> 5.0, total load time 1.0 ms
> {noformat}
> After some digging, I found the cause is that {{LogThrottlingHelper}}'s are 
> declared as instance variables of all the enclosing classes, including 
> {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. 
> Therefore the logging frequency will not be limited across different 
> instances. For classes with only limited number of instances, such as 
> {{FSImage}}, this is fine. For others whose instances are created frequently, 
> such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will 
> result in plenty of logs.
> This can be fixed by declaring {{LogThrottlingHelper}}'s as static members.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16893) Standardize the usage of DFSClient debug log

2023-01-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677674#comment-17677674
 ] 

ASF GitHub Bot commented on HDFS-16893:
---

hadoop-yetus commented on PR #5303:
URL: https://github.com/apache/hadoop/pull/5303#issuecomment-1385028185

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m 50s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m  3s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   0m 53s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 35s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m  1s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   2m 41s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  25m 19s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 55s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 52s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   0m 52s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 46s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 19s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 49s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   2m 32s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  26m 13s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 28s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 113m 45s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5303/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5303 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux d494d3b8015b 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 4a1f5cb7c437f4750d813f2214b5837e80851eca |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5303/2/testReport/ |
   | Max. process+thread count | 738 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client U: 
hadoop-hdfs-project/hadoop-hdfs-client |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5303/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |