[jira] [Commented] (HDFS-14646) Standby NameNode should not upload fsimage to an inappropriate NameNode.

2023-09-18 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766559#comment-17766559
 ] 

Matthew Sharp commented on HDFS-14646:
--

We hit this running Hadoop 3.3.0 as well, with 3 NNs.  It stayed quiet for 
years until we saw enough write activity.  The transactions-since-last-checkpoint 
metric would show a really bad lag, and eventually one of the SNNs got far 
enough behind that it would crash because it couldn't find the older txn id. 
Shutting down one SNN over the last few weeks seems to have resolved this, so 
it does appear to be an active issue when running 3 NNs.  It may be worth 
noting that fsimage size did not seem to be much of a factor: we see these 
exceptions logged in a dev cluster whose fsimage is only hundreds of MBs.

> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> 
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
>  Labels: multi-sbnn
> Attachments: HDFS-14646.000.patch, HDFS-14646.001.patch, 
> HDFS-14646.002.patch
>
>
> *Problem Description:*
>  In the multi-NameNode scenario, when an SNN uploads an FsImage, it puts 
> the image to all other NNs (whether the peer NN is an ANN or not). Even if 
> the peer NN immediately replies with an error (such as 
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, 
> TransferResult.OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN does not 
> terminate the put immediately; it uploads the FsImage to the peer NN in 
> full and does not read the peer NN's reply until the put has completed.
> Depending on the version of Jetty, this behavior leads to different 
> consequences:
> *1. Under Hadoop 2.7.2 (with Jetty 6.1.26)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying 
> TCP connection stays established, and the data the SNN sends is drained by 
> the Jetty framework itself on the peer NN side, so the SNN keeps 
> pointlessly streaming the FsImage to the peer NN, wasting time and 
> bandwidth. In a relatively large HDFS cluster, the FsImage can reach about 
> 30 GB, so this is indeed a big waste.
> *2. Under the newest release-3.2.0-RC1 (with Jetty 9.3.24) and trunk (with 
> Jetty 9.3.27)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying 
> TCP connection is closed automatically, and the SNN then directly gets an 
> "Error writing request body to server" exception, as below. Note this test 
> needs a relatively big FsImage (e.g. at the 10 MB level):
> {code:java}
> 2019-08-17 03:59:25,413 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 524288 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:314)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  2019-08-17 03:59:25,422 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 851968 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> 
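For context on the Jetty 9 failure mode above: a minimal JDK-only sketch 
(not Hadoop's actual TransferFsImage code) of a chunked PUT that fails fast 
once the server has closed the connection, producing the same "Error writing 
request body to server" shape:

{code:java}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedImagePut {
  // Streams a local file to `target` with a chunked PUT. With Jetty 9 on the
  // server side, sendError() closes the TCP connection, so the write() below
  // fails after a few chunks instead of after the whole file has been sent.
  static void putFile(String localPath, URL target) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) target.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("PUT");
    conn.setChunkedStreamingMode(4096); // stream 4 KB chunks, no buffering
    try (OutputStream out = conn.getOutputStream();
         FileInputStream in = new FileInputStream(localPath)) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n); // throws IOException once the peer has closed
      }
    }
    int status = conn.getResponseCode(); // reached only if the upload finished
    if (status != HttpURLConnection.HTTP_OK) {
      throw new IOException("upload rejected with HTTP " + status);
    }
  }
}
{code}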

[jira] [Resolved] (HDFS-17138) RBF: We changed the hadoop.security.auth_to_local configuration of one router, the other routers stopped working

2023-09-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri resolved HDFS-17138.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: We changed the hadoop.security.auth_to_local configuration of one 
> router, the other routers stopped working
> 
>
> Key: HDFS-17138
> URL: https://issues.apache.org/jira/browse/HDFS-17138
> Project: Hadoop HDFS
>  Issue Type: Bug
> Environment: hadoop 3.3.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2023-08-02-16-20-34-454.png, 
> image-2023-08-03-10-32-03-457.png
>
>
> Error log from the other routers:
> !image-2023-08-02-16-20-34-454.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17192) Add block info when constructing remote block reader meets IOException

2023-09-18 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17192.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add block info when constructing remote block reader meets IOException
> -
>
> Key: HDFS-17192
> URL: https://issues.apache.org/jira/browse/HDFS-17192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Currently, when constructing a remote block reader meets an IOException, 
> the block info is not logged. We should add it to make troubleshooting 
> easier.
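
As a hedged sketch of what such logging could look like (openReader and the 
datanode/block parameters here are illustrative stand-ins, not the actual 
hdfs-client types):

{code:java}
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class BlockReaderLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(BlockReaderLoggingSketch.class);

  // Wrap reader construction so the failing block is named in the log,
  // as the issue proposes; openReader() is a placeholder.
  static AutoCloseable openWithLogging(String datanode, String block)
      throws IOException {
    try {
      return openReader(datanode, block);
    } catch (IOException e) {
      LOG.warn("Failed to construct remote block reader for {} from {}",
          block, datanode, e);
      throw e;
    }
  }

  private static AutoCloseable openReader(String dn, String blk)
      throws IOException {
    throw new IOException("connection refused"); // placeholder failure
  }
}
{code}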



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17194) Enhance the log message for striped block recovery

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766314#comment-17766314
 ] 

ASF GitHub Bot commented on HDFS-17194:
---

haiyang1987 commented on PR #6094:
URL: https://github.com/apache/hadoop/pull/6094#issuecomment-1723136576

   The failed unit test seems unrelated to the change.
   
   Hi @Hexiaoqiao @ayushtkn @ZanderXu, could you please help review this 
minor change when you have free time? Thanks a lot~




> Enhance the log message for striped block recovery
> --
>
> Key: HDFS-17194
> URL: https://issues.apache.org/jira/browse/HDFS-17194
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> To make troubleshooting more convenient, consider adding internalBlk 
> information to the RecoveryTaskStriped#recover log message and optimizing 
> some of the log output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17198) RBF: fix bug of getRepresentativeQuorum when records have same dateModified

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766307#comment-17766307
 ] 

ASF GitHub Bot commented on HDFS-17198:
---

KeeProMise commented on PR #6096:
URL: https://github.com/apache/hadoop/pull/6096#issuecomment-1723086155

   @goiri Do you have time to help review?




> RBF: fix bug of getRepresentativeQuorum when records have same dateModified
> ---
>
> Key: HDFS-17198
> URL: https://issues.apache.org/jira/browse/HDFS-17198
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17198.v001.patch
>
>
> h2. *Bug description*
> In the original implementation, when each router reports nn status at 
> different times, the nn status is the status reported by the majority of 
> routers, for example:
> router1 -> nn0:active dateModified:1
> router2 -> nn0:active dateModified:2
> router3 -> nn0:active dateModified:3
> router0 -> nn0:standby dateModified:4
> Then the status of nn0 is active, because the majority of routers report 
> that nn0 is active.
> If the majority of routers report nn status at the same time, for example:
> (record1) router1 -> nn0:active dateModified:1
> (record2) router2 -> nn0:active dateModified:1
> (record3) router3 -> nn0:active dateModified:1
> (record4) router0 -> nn0:standby dateModified:2
> then the state of nn0 is standby, but we expect the status of nn0 to be 
> active.
> This bug occurs because the above records are put into a TreeSet in the 
> method getRepresentativeQuorum. Since record1, record2, and record3 have 
> the same dateModified, only one of them survives in the method's final 
> TreeSet, so the method concludes that this nn is standby, because record4 
> is newer.
> h2. *How to reproduce*
> Run my unit test testRegistrationMajorityQuorumEqDateModified against the 
> original code.
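
The TreeSet behavior behind this bug is easy to demonstrate in isolation; a 
self-contained sketch (illustrative only, not the actual MembershipStore 
comparator):

{code:java}
import java.util.Comparator;
import java.util.TreeSet;

public class QuorumDedupDemo {
  record Report(String router, String state, long dateModified) {}

  public static void main(String[] args) {
    // The comparator looks only at dateModified, mirroring the bug above:
    // TreeSet treats compare() == 0 as "duplicate" and keeps one element.
    TreeSet<Report> reports =
        new TreeSet<>(Comparator.comparingLong(Report::dateModified));
    reports.add(new Report("router1", "active", 1));
    reports.add(new Report("router2", "active", 1)); // silently dropped
    reports.add(new Report("router3", "active", 1)); // silently dropped
    reports.add(new Report("router0", "standby", 2));
    System.out.println(reports.size()); // prints 2, not 4: quorum is lost
  }
}
{code}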



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16000) HDFS : Rename performance optimization

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766304#comment-17766304
 ] 

ASF GitHub Bot commented on HDFS-16000:
---

zhuxiangyi commented on PR #2964:
URL: https://github.com/apache/hadoop/pull/2964#issuecomment-1723082950

   
   > Thanks @zhuxiangyi for your works. It is great idea and improvement. 
Almost LGTM. Leave some comments inline. Will give my +1 once correct. Thanks.
   @Hexiaoqiao 
   Thank you very much for your review. I have fixed the problem and 
resubmitted the code.




> HDFS : Rename performance optimization
> --
>
> Key: HDFS-16000
> URL: https://issues.apache.org/jira/browse/HDFS-16000
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.1.4, 3.3.1
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210428-143238.svg, 20210428-171635-lambda.svg, 
> HDFS-16000.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming a large directory takes a long time. For example, moving a 
> directory with 10 million (1000W) entries takes about 40 seconds. When a 
> large amount of data is deleted into the trash, this large-directory move 
> occurs when the trash creates its checkpoint. A user may also trigger a 
> large-directory move directly, which can hold the NameNode lock so long 
> that the NameNode is killed by ZKFC. A flame graph shows that most of the 
> time is spent creating EnumCounters objects.
>  
> h3. Rename logic optimization:
>  * Today, regardless of how the source and target directories are 
> configured, a rename computes the quota count three times: first to check 
> whether the moved directory exceeds the target directory's quota, second to 
> compute the moved directory's usage so it can be subtracted from the source 
> directory's quota, and third to compute the moved directory's usage so it 
> can be added to the target directory's quota.
>  * Some of these three quota calculations are unnecessary. For example, if 
> no parent directory of the source or the target has a quota configured, 
> there is no need to compute the quotaCount at all. Even when both the 
> source and the target use quotas, the quota does not need to be computed 
> three times: the calculation logic for the first and third times is the 
> same, so it only needs to be done once.
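
A toy sketch of the proposed short-circuit (illustrative only; Dir and 
Counts are stand-ins, not HDFS types):

{code:java}
import java.util.List;

class RenameQuotaSketch {
  record Counts(long namespace, long storage) {}

  interface Dir {
    boolean hasQuota();
    Counts computeSubtreeUsage(); // the expensive walk the flame graph blames
    List<Dir> ancestors();
  }

  // Skip the subtree walk when no ancestor of src or dst enforces a quota,
  // and otherwise walk the subtree once, reusing the result both for the
  // source-side subtraction and the target-side addition.
  static Counts quotaDeltaForRename(Dir src, Dir dstParent) {
    boolean anyQuota =
        src.ancestors().stream().anyMatch(Dir::hasQuota)
            || dstParent.ancestors().stream().anyMatch(Dir::hasQuota);
    if (!anyQuota) {
      return null; // no quota on either path: no computation needed at all
    }
    return src.computeSubtreeUsage(); // one walk instead of three
  }
}
{code}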



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17198) RBF: fix bug of getRepresentativeQuorum when records have same dateModified

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766303#comment-17766303
 ] 

ASF GitHub Bot commented on HDFS-17198:
---

hadoop-yetus commented on PR #6096:
URL: https://github.com/apache/hadoop/pull/6096#issuecomment-1723080068

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  48m 27s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 44s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 43s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   0m 37s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 48s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 48s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 28s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  33m 41s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 21s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  33m 46s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  21m 17s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 154m 12s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6096/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6096 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 8eb649cabf3c 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / a934576064a1ba30acfabaeb9627faead1f3b2e1 |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6096/3/testReport/ |
   | Max. process+thread count | 2509 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6096/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> RBF: fix bug of getRepresentativeQuorum when records have same 

[jira] [Commented] (HDFS-16000) HDFS : Rename performance optimization

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766299#comment-17766299
 ] 

ASF GitHub Bot commented on HDFS-16000:
---

zhuxiangyi commented on code in PR #2964:
URL: https://github.com/apache/hadoop/pull/2964#discussion_r1328469427


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirRenameOp.java:
##
@@ -470,17 +475,53 @@ static RenameResult unprotectedRenameTo(FSDirectory fsd,
   }
 } finally {
   if (undoRemoveSrc) {
-tx.restoreSource();
+tx.restoreSource(srcStoragePolicyCounts);
   }
   if (undoRemoveDst) { // Rename failed - restore dst
-tx.restoreDst(bsps);
+tx.restoreDst(bsps, dstStoragePolicyCounts);
   }
 }
 NameNode.stateChangeLog.warn("DIR* FSDirectory.unprotectedRenameTo: " +
 "failed to rename " + src + " to " + dst);
 throw new IOException("rename from " + src + " to " + dst + " failed.");
   }
 
+  /*
+   * Calculate QuotaCounts based on parent directory and storage policy
+   * 1. If the storage policy of src and dst are different,
+   *  calculate the QuotaCounts of src and dst respectively.
+   * 2. If all parent nodes of src and dst are not set with Quota,
+   *  there is no need to calculate QuotaCount.
+   * 3. if parent nodes of src and dst have Quota configured,
+   *  the QuotaCount is calculated once using the storage policy of src.
+   * */
+  private static void computeQuotaCounts(
+  QuotaCounts srcStoragePolicyCounts,
+  QuotaCounts dstStoragePolicyCounts,
+  INodesInPath srcIIP,
+  INodesInPath dstIIP,
+  BlockStoragePolicySuite bsps,
+  RenameOperation tx) {
+INode dstParent = dstIIP.getINode(-2);
+INode srcParentNode = FSDirectory.
+getFirstSetQuotaParentNode(srcIIP);
+INode srcInode = srcIIP.getLastINode();
+INode dstParentNode = FSDirectory.
+getFirstSetQuotaParentNode(dstIIP);
+byte srcStoragePolicyID = FSDirectory.getStoragePolicyId(srcInode);
+byte dstStoragePolicyID = FSDirectory.getStoragePolicyId(dstParent);
+if (srcStoragePolicyID != dstStoragePolicyID) {
+  srcStoragePolicyCounts.add(srcIIP.getLastINode().
+  computeQuotaUsage(bsps));
+  dstStoragePolicyCounts.add(srcIIP.getLastINode()
+  .computeQuotaUsage(bsps, dstParent.getStoragePolicyID(), false,
+  Snapshot.CURRENT_STATE_ID));
+} else if (srcParentNode != dstParentNode || tx.withCount != null) {
+  
srcStoragePolicyCounts.add(srcIIP.getLastINode().computeQuotaUsage(bsps));
+  dstStoragePolicyCounts.add(srcStoragePolicyCounts);
+}

Review Comment:
   If this is the case, it can be understood that src and dst have a quota 
configured, or that src is in a snapshot (isSrcInSnapshot).
   





> HDFS : Rename performance optimization
> --
>
> Key: HDFS-16000
> URL: https://issues.apache.org/jira/browse/HDFS-16000
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.1.4, 3.3.1
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210428-143238.svg, 20210428-171635-lambda.svg, 
> HDFS-16000.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming a large directory takes a long time. For example, moving a 
> directory with 10 million (1000W) entries takes about 40 seconds. When a 
> large amount of data is deleted into the trash, this large-directory move 
> occurs when the trash creates its checkpoint. A user may also trigger a 
> large-directory move directly, which can hold the NameNode lock so long 
> that the NameNode is killed by ZKFC. A flame graph shows that most of the 
> time is spent creating EnumCounters objects.
>  
> h3. Rename logic optimization:
>  * Today, regardless of how the source and target directories are 
> configured, a rename computes the quota count three times: first to check 
> whether the moved directory exceeds the target directory's quota, second to 
> compute the moved directory's usage so it can be subtracted from the source 
> directory's quota, and third to compute the moved directory's usage so it 
> can be added to the target directory's quota.
>  * Some of these three quota calculations are unnecessary. For example, if 
> no parent directory of the source or the target has a quota configured, 
> there is no need to compute the quotaCount at all. Even when both the 
> source and the target use quotas, the quota does not need to be computed 
> three times: the calculation logic for the first and third times is the 
> same, so it only needs to be done once.

[jira] [Commented] (HDFS-16000) HDFS : Rename performance optimization

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766298#comment-17766298
 ] 

ASF GitHub Bot commented on HDFS-16000:
---

zhuxiangyi commented on code in PR #2964:
URL: https://github.com/apache/hadoop/pull/2964#discussion_r1328465474


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java:
##
@@ -1468,6 +1475,30 @@ static Collection<String> 
normalizePaths(Collection<String> paths,
 return normalized;
   }
 
+  /**
+   * Get the first Node that sets Quota.
+   */
+  static INode getFirstSetQuotaParentNode(INodesInPath iip) {
+for (int i = iip.length() - 1; i > 0; i--) {
+  INode currNode = iip.getINode(i);
+  if (currNode == null) {

Review Comment:
   There should be no null expected here.





> HDFS : Rename performance optimization
> --
>
> Key: HDFS-16000
> URL: https://issues.apache.org/jira/browse/HDFS-16000
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.1.4, 3.3.1
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210428-143238.svg, 20210428-171635-lambda.svg, 
> HDFS-16000.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming a large directory takes a long time. For example, moving a 
> directory with 10 million (1000W) entries takes about 40 seconds. When a 
> large amount of data is deleted into the trash, this large-directory move 
> occurs when the trash creates its checkpoint. A user may also trigger a 
> large-directory move directly, which can hold the NameNode lock so long 
> that the NameNode is killed by ZKFC. A flame graph shows that most of the 
> time is spent creating EnumCounters objects.
>  
> h3. Rename logic optimization:
>  * Today, regardless of how the source and target directories are 
> configured, a rename computes the quota count three times: first to check 
> whether the moved directory exceeds the target directory's quota, second to 
> compute the moved directory's usage so it can be subtracted from the source 
> directory's quota, and third to compute the moved directory's usage so it 
> can be added to the target directory's quota.
>  * Some of these three quota calculations are unnecessary. For example, if 
> no parent directory of the source or the target has a quota configured, 
> there is no need to compute the quotaCount at all. Even when both the 
> source and the target use quotas, the quota does not need to be computed 
> three times: the calculation logic for the first and third times is the 
> same, so it only needs to be done once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16000) HDFS : Rename performance optimization

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766297#comment-17766297
 ] 

ASF GitHub Bot commented on HDFS-16000:
---

zhuxiangyi commented on code in PR #2964:
URL: https://github.com/apache/hadoop/pull/2964#discussion_r1328464878


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java:
##
@@ -1468,6 +1475,30 @@ static Collection<String> 
normalizePaths(Collection<String> paths,
 return normalized;
   }
 
+  /**
+   * Get the first Node that sets Quota.
+   */
+  static INode getFirstSetQuotaParentNode(INodesInPath iip) {
+for (int i = iip.length() - 1; i > 0; i--) {
+  INode currNode = iip.getINode(i);
+  if (currNode == null) {

Review Comment:
   Here we traverse from the last INode up to, but excluding, the root node.





> HDFS : Rename performance optimization
> --
>
> Key: HDFS-16000
> URL: https://issues.apache.org/jira/browse/HDFS-16000
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.1.4, 3.3.1
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210428-143238.svg, 20210428-171635-lambda.svg, 
> HDFS-16000.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming a large directory takes a long time. For example, moving a 
> directory with 10 million (1000W) entries takes about 40 seconds. When a 
> large amount of data is deleted into the trash, this large-directory move 
> occurs when the trash creates its checkpoint. A user may also trigger a 
> large-directory move directly, which can hold the NameNode lock so long 
> that the NameNode is killed by ZKFC. A flame graph shows that most of the 
> time is spent creating EnumCounters objects.
>  
> h3. Rename logic optimization:
>  * Today, regardless of how the source and target directories are 
> configured, a rename computes the quota count three times: first to check 
> whether the moved directory exceeds the target directory's quota, second to 
> compute the moved directory's usage so it can be subtracted from the source 
> directory's quota, and third to compute the moved directory's usage so it 
> can be added to the target directory's quota.
>  * Some of these three quota calculations are unnecessary. For example, if 
> no parent directory of the source or the target has a quota configured, 
> there is no need to compute the quotaCount at all. Even when both the 
> source and the target use quotas, the quota does not need to be computed 
> three times: the calculation logic for the first and third times is the 
> same, so it only needs to be done once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16000) HDFS : Rename performance optimization

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766295#comment-17766295
 ] 

ASF GitHub Bot commented on HDFS-16000:
---

zhuxiangyi commented on code in PR #2964:
URL: https://github.com/apache/hadoop/pull/2964#discussion_r1328454449


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirRenameOp.java:
##
@@ -470,17 +475,53 @@ static RenameResult unprotectedRenameTo(FSDirectory fsd,
   }
 } finally {
   if (undoRemoveSrc) {
-tx.restoreSource();
+tx.restoreSource(srcStoragePolicyCounts);
   }
   if (undoRemoveDst) { // Rename failed - restore dst
-tx.restoreDst(bsps);
+tx.restoreDst(bsps, dstStoragePolicyCounts);
   }
 }
 NameNode.stateChangeLog.warn("DIR* FSDirectory.unprotectedRenameTo: " +
 "failed to rename " + src + " to " + dst);
 throw new IOException("rename from " + src + " to " + dst + " failed.");
   }
 
+  /*
+   * Calculate QuotaCounts based on parent directory and storage policy
+   * 1. If the storage policy of src and dst are different,
+   *  calculate the QuotaCounts of src and dst respectively.
+   * 2. If all parent nodes of src and dst are not set with Quota,
+   *  there is no need to calculate QuotaCount.
+   * 3. if parent nodes of src and dst have Quota configured,
+   *  the QuotaCount is calculated once using the storage policy of src.
+   * */
+  private static void computeQuotaCounts(
+  QuotaCounts srcStoragePolicyCounts,
+  QuotaCounts dstStoragePolicyCounts,
+  INodesInPath srcIIP,
+  INodesInPath dstIIP,
+  BlockStoragePolicySuite bsps,
+  RenameOperation tx) {
+INode dstParent = dstIIP.getINode(-2);
+INode srcParentNode = FSDirectory.
+getFirstSetQuotaParentNode(srcIIP);
+INode srcInode = srcIIP.getLastINode();
+INode dstParentNode = FSDirectory.
+getFirstSetQuotaParentNode(dstIIP);
+byte srcStoragePolicyID = FSDirectory.getStoragePolicyId(srcInode);
+byte dstStoragePolicyID = FSDirectory.getStoragePolicyId(dstParent);
+if (srcStoragePolicyID != dstStoragePolicyID) {
+  srcStoragePolicyCounts.add(srcIIP.getLastINode().
+  computeQuotaUsage(bsps));
+  dstStoragePolicyCounts.add(srcIIP.getLastINode()
+  .computeQuotaUsage(bsps, dstParent.getStoragePolicyID(), false,
+  Snapshot.CURRENT_STATE_ID));
+} else if (srcParentNode != dstParentNode || tx.withCount != null) {

Review Comment:
   This is to determine whether the inode is in a snapshot (isSrcInSnapshot). 
If it is, we will calculate the quotaCount. I will change this to use 
isSrcInSnapshot for the check.





> HDFS : Rename performance optimization
> --
>
> Key: HDFS-16000
> URL: https://issues.apache.org/jira/browse/HDFS-16000
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.1.4, 3.3.1
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210428-143238.svg, 20210428-171635-lambda.svg, 
> HDFS-16000.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming a large directory takes a long time. For example, moving a 
> directory with 10 million (1000W) entries takes about 40 seconds. When a 
> large amount of data is deleted into the trash, this large-directory move 
> occurs when the trash creates its checkpoint. A user may also trigger a 
> large-directory move directly, which can hold the NameNode lock so long 
> that the NameNode is killed by ZKFC. A flame graph shows that most of the 
> time is spent creating EnumCounters objects.
>  
> h3. Rename logic optimization:
>  * Today, regardless of how the source and target directories are 
> configured, a rename computes the quota count three times: first to check 
> whether the moved directory exceeds the target directory's quota, second to 
> compute the moved directory's usage so it can be subtracted from the source 
> directory's quota, and third to compute the moved directory's usage so it 
> can be added to the target directory's quota.
>  * Some of these three quota calculations are unnecessary. For example, if 
> no parent directory of the source or the target has a quota configured, 
> there is no need to compute the quotaCount at all. Even when both the 
> source and the target use quotas, the quota does not need to be computed 
> three times: the calculation logic for the first and third times is the 
> same, so it only needs to be done once.



--
This message was sent by Atlassian 

[jira] [Commented] (HDFS-16000) HDFS : Rename performance optimization

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766292#comment-17766292
 ] 

ASF GitHub Bot commented on HDFS-16000:
---

zhuxiangyi commented on code in PR #2964:
URL: https://github.com/apache/hadoop/pull/2964#discussion_r1328445332


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirRenameOp.java:
##
@@ -470,17 +475,53 @@ static RenameResult unprotectedRenameTo(FSDirectory fsd,
   }
 } finally {
   if (undoRemoveSrc) {
-tx.restoreSource();
+tx.restoreSource(srcStoragePolicyCounts);
   }
   if (undoRemoveDst) { // Rename failed - restore dst
-tx.restoreDst(bsps);
+tx.restoreDst(bsps, dstStoragePolicyCounts);
   }
 }
 NameNode.stateChangeLog.warn("DIR* FSDirectory.unprotectedRenameTo: " +
 "failed to rename " + src + " to " + dst);
 throw new IOException("rename from " + src + " to " + dst + " failed.");
   }
 
+  /*
+   * Calculate QuotaCounts based on parent directory and storage policy
+   * 1. If the storage policy of src and dst are different,
+   *  calculate the QuotaCounts of src and dst respectively.
+   * 2. If all parent nodes of src and dst are not set with Quota,
+   *  there is no need to calculate QuotaCount.
+   * 3. if parent nodes of src and dst have Quota configured,
+   *  the QuotaCount is calculated once using the storage policy of src.
+   * */
+  private static void computeQuotaCounts(
+  QuotaCounts srcStoragePolicyCounts,
+  QuotaCounts dstStoragePolicyCounts,
+  INodesInPath srcIIP,
+  INodesInPath dstIIP,
+  BlockStoragePolicySuite bsps,
+  RenameOperation tx) {
+INode dstParent = dstIIP.getINode(-2);
+INode srcParentNode = FSDirectory.
+getFirstSetQuotaParentNode(srcIIP);
+INode srcInode = srcIIP.getLastINode();
+INode dstParentNode = FSDirectory.
+getFirstSetQuotaParentNode(dstIIP);
+byte srcStoragePolicyID = FSDirectory.getStoragePolicyId(srcInode);
+byte dstStoragePolicyID = FSDirectory.getStoragePolicyId(dstParent);
+if (srcStoragePolicyID != dstStoragePolicyID) {
+  srcStoragePolicyCounts.add(srcIIP.getLastINode().
+  computeQuotaUsage(bsps));
+  dstStoragePolicyCounts.add(srcIIP.getLastINode()

Review Comment:
   Thanks for finding this problem. If the inode sets its own StoragePolicy, 
we should compute using the inode's StoragePolicy. I will fix it.





> HDFS : Rename performance optimization
> --
>
> Key: HDFS-16000
> URL: https://issues.apache.org/jira/browse/HDFS-16000
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.1.4, 3.3.1
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210428-143238.svg, 20210428-171635-lambda.svg, 
> HDFS-16000.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Renaming a large directory takes a long time. For example, moving a 
> directory with 10 million (1000W) entries takes about 40 seconds. When a 
> large amount of data is deleted into the trash, this large-directory move 
> occurs when the trash creates its checkpoint. A user may also trigger a 
> large-directory move directly, which can hold the NameNode lock so long 
> that the NameNode is killed by ZKFC. A flame graph shows that most of the 
> time is spent creating EnumCounters objects.
>  
> h3. Rename logic optimization:
>  * Today, regardless of how the source and target directories are 
> configured, a rename computes the quota count three times: first to check 
> whether the moved directory exceeds the target directory's quota, second to 
> compute the moved directory's usage so it can be subtracted from the source 
> directory's quota, and third to compute the moved directory's usage so it 
> can be added to the target directory's quota.
>  * Some of these three quota calculations are unnecessary. For example, if 
> no parent directory of the source or the target has a quota configured, 
> there is no need to compute the quotaCount at all. Even when both the 
> source and the target use quotas, the quota does not need to be computed 
> three times: the calculation logic for the first and third times is the 
> same, so it only needs to be done once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17198) RBF: fix bug of getRepresentativeQuorum when records have same dateModified

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766290#comment-17766290
 ] 

ASF GitHub Bot commented on HDFS-17198:
---

hadoop-yetus commented on PR #6096:
URL: https://github.com/apache/hadoop/pull/6096#issuecomment-1723023243

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  0s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  45m 30s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 45s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   0m 33s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 49s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 46s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 26s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 27s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 37s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 21s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  35m 52s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  24m 12s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 40s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 162m 56s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6096/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6096 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux f64a7086e8a5 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / a934576064a1ba30acfabaeb9627faead1f3b2e1 |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6096/2/testReport/ |
   | Max. process+thread count | 3087 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6096/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> RBF: fix bug of getRepresentativeQuorum when records have same 

[jira] [Commented] (HDFS-17199) LightWeightResizableGSet#values() seems not thread-safe.

2023-09-18 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766283#comment-17766283
 ] 

farmmamba commented on HDFS-17199:
--

I found two replicas methods in class ReplicaMap. One is thread-safe and 
the other is not.

The non-thread-safe one is used in readReplicasFromCache. Although there is 
no race condition problem there, I think we'd better drop the non-thread-safe 
replicas method and keep only the thread-safe one.
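
For illustration, a generic sketch of the two patterns (not the actual 
ReplicaMap code): returning the live collection versus copying a snapshot 
under a read lock:

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReplicaMapSketch<K, V> {
  private final Map<K, V> map = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Not thread-safe: callers iterate the live map while writers may
  // concurrently modify or resize it.
  Collection<V> replicas() {
    return map.values();
  }

  // Thread-safe variant: copy under the read lock so iteration never races
  // with concurrent modification.
  Collection<V> replicasSnapshot() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(map.values());
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}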

> LightWeightResizableGSet#values() seems not thread-safe.
> 
>
> Key: HDFS-17199
> URL: https://issues.apache.org/jira/browse/HDFS-17199
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17199) LightWeightResizableGSet#values() seems not thread-safe.

2023-09-18 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba reassigned HDFS-17199:


Assignee: farmmamba

> LightWeightResizableGSet#values() seems not thread-safe.
> 
>
> Key: HDFS-17199
> URL: https://issues.apache.org/jira/browse/HDFS-17199
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17197) Show file replication when listing corrupt files.

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766275#comment-17766275
 ] 

ASF GitHub Bot commented on HDFS-17197:
---

zhangshuyan0 commented on code in PR #6095:
URL: https://github.com/apache/hadoop/pull/6095#discussion_r1328403016


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -6195,7 +6197,11 @@ Collection<CorruptFileBlockInfo> 
listCorruptFileBlocks(String path,
 if (inode != null) {
   String src = inode.getFullPathName();
   if (isParentEntry(src, path)) {
-corruptFiles.add(new CorruptFileBlockInfo(src, blk));
+int repl = 0;
+if (inode.isFile()) {
+  repl = inode.asFile().getFileReplication();

Review Comment:
   Oh, I forgot to take this situation into account. How about printing -1 in 
`CorruptFileBlockInfo` when this is an EC file?
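
A hypothetical sketch of that suggestion (not the merged patch; FileNode is 
an illustrative stand-in for the INodeFile API):

{code:java}
// Erasure-coded files have no replication factor, so report -1 for them.
class CorruptFileReplicationSketch {
  interface FileNode {
    boolean isStriped();        // true for erasure-coded (striped) files
    short getFileReplication(); // meaningful only for replicated files
  }

  static int displayedReplication(FileNode file) {
    return file.isStriped() ? -1 : file.getFileReplication();
  }
}
{code}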





> Show file replication when listing corrupt files.
> -
>
> Key: HDFS-17197
> URL: https://issues.apache.org/jira/browse/HDFS-17197
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> Files with different replication have different reliability guarantees. We 
> need to pay attention to corrupted files with a specified replication greater 
> than or equal to 3. So, when listing corrupt files, it would be useful to 
> display the corresponding replication of the files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17197) Show file replication when listing corrupt files.

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766264#comment-17766264
 ] 

ASF GitHub Bot commented on HDFS-17197:
---

hadoop-yetus commented on PR #6095:
URL: https://github.com/apache/hadoop/pull/6095#issuecomment-1722953186

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 38s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  45m 35s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  trunk passed with JDK 
Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   1m 15s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 12s |  |  trunk passed with JDK 
Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 42s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   3m 25s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  36m 12s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK 
Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  0s |  |  
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 110 unchanged - 1 
fixed = 110 total (was 111)  |
   | +1 :green_heart: |  mvnsite  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  the patch passed with JDK 
Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  35m 44s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 220m 45s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 56s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 364m  8s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6095/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6095 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 548b75eba011 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 16e9ef6d27294c8c02121e9209a705fc56e57e31 |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20.1+1-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6095/2/testReport/ |
   | Max. process+thread count | 2933 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6095/2/console |
   | versions | git=2.25.1 maven=3.6.3 

[jira] [Created] (HDFS-17199) LightWeightResizableGSet#values() seems not thread-safe.

2023-09-18 Thread farmmamba (Jira)
farmmamba created HDFS-17199:


 Summary: LightWeightResizableGSet#values() seems not thread-safe.
 Key: HDFS-17199
 URL: https://issues.apache.org/jira/browse/HDFS-17199
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17198) RBF: fix bug of getRepresentativeQuorum when records have same dateModified

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766239#comment-17766239
 ] 

ASF GitHub Bot commented on HDFS-17198:
---

KeeProMise opened a new pull request, #6096:
URL: https://github.com/apache/hadoop/pull/6096

   
   
   ### Description of PR
In the original implementation, when each router reports nn status at 
different times, the nn status is the status reported by the majority of 
routers, for example:
   router1 -> nn0:active dateModified:1
   
   router2 -> nn0:active dateModified:2
   
   router3 -> nn0:active dateModified:3
   
   router0 -> nn0:standby dateModified:4
   
   Then the status of nn0 is active, because the majority of routers report 
that nn0 is active.
   
   If the majority of routers report nn status at the same time, for example:
   (record1) router1 -> nn0:active dateModified:1
   
   (record2) router2 -> nn0:active dateModified:1
   
   (record3) router3 -> nn0:active dateModified:1
   
   (record4) router0 -> nn0:standby dateModified:2
   
   then the state of nn0 is standby, but we expect the status of nn0 to be 
active.
   
   This bug occurs because the above records are put into a TreeSet in the 
method getRepresentativeQuorum. Since record1, record2, and record3 have the 
same dateModified, only one of them survives in the method's final TreeSet, 
so the method concludes that this nn is standby, because record4 is newer.
   
   see: https://issues.apache.org/jira/browse/HDFS-17198
   
   ### How was this patch tested?
   my unit test testRegistrationMajorityQuorumEqDateModified
   
   ### For code changes:
   
   - [x] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> RBF: fix bug of getRepresentativeQuorum when records have same dateModified
> ---
>
> Key: HDFS-17198
> URL: https://issues.apache.org/jira/browse/HDFS-17198
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17198.v001.patch
>
>
> h2. *Bug description*
> In the original implementation, when each router reports nn status at 
> different times, the nn status is the status reported by the majority of 
> routers, for example:
> router1 -> nn0:active dateModified:1
> router2 -> nn0:active dateModified:2
> router3 -> nn0:active dateModified:3
> router0 -> nn0:standby dateModified:4
> Then the status of nn0 is active, because the majority of routers report 
> that nn0 is active.
> If the majority of routers report nn status at the same time, for example:
> (record1) router1 -> nn0:active dateModified:1
> (record2) router2 -> nn0:active dateModified:1
> (record3) router3 -> nn0:active dateModified:1
> (record4) router0 -> nn0:standby dateModified:2
> then the state of nn0 is standby, but we expect the status of nn0 to be 
> active.
> This bug occurs because the above records are put into a TreeSet in the 
> method getRepresentativeQuorum. Since record1, record2, and record3 have 
> the same dateModified, only one of them survives in the method's final 
> TreeSet, so the method concludes that this nn is standby, because record4 
> is newer.
> h2. *How to reproduce*
> Run my unit test testRegistrationMajorityQuorumEqDateModified against the 
> original code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17198) RBF: fix bug of getRepresentativeQuorum when records have same dateModified

2023-09-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766238#comment-17766238
 ] 

ASF GitHub Bot commented on HDFS-17198:
---

KeeProMise closed pull request #6096: HDFS-17198. RBF: fix bug of 
getRepresentativeQuorum when records have same dateModified
URL: https://github.com/apache/hadoop/pull/6096




> RBF: fix bug of getRepresentativeQuorum when records have same dateModified
> ---
>
> Key: HDFS-17198
> URL: https://issues.apache.org/jira/browse/HDFS-17198
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
> Attachments: HDFS-17198.v001.patch
>
>
> h2. *Bug description*
> In the original implementation, when each router reports nn status at 
> different times, the nn status is the status reported by the majority of 
> routers, for example:
> router1 -> nn0:active dateModified:1
> router2 -> nn0:active dateModified:2
> router3 -> nn0:active dateModified:3
> router0 -> nn0:standby dateModified:4
> Then the status of nn0 is active, because the majority of routers report 
> that nn0 is active.
> If the majority of routers report nn status at the same time, for example:
> (record1) router1 -> nn0:active dateModified:1
> (record2) router2 -> nn0:active dateModified:1
> (record3) router3 -> nn0:active dateModified:1
> (record4) router0 -> nn0:standby dateModified:2
> then the state of nn0 is standby, but we expect the status of nn0 to be 
> active.
> This bug occurs because the above records are put into a TreeSet in the 
> method getRepresentativeQuorum. Since record1, record2, and record3 have 
> the same dateModified, only one of them survives in the method's final 
> TreeSet, so the method concludes that this nn is standby, because record4 
> is newer.
> h2. *How to reproduce*
> Run my unit test testRegistrationMajorityQuorumEqDateModified against the 
> original code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17198) RBF: fix bug of getRepresentativeQuorum when records have same dateModified

2023-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17198:
--
Labels: pull-request-available  (was: )

> RBF: fix bug of getRepresentativeQuorum when records have same dateModified
> ---
>
> Key: HDFS-17198
> URL: https://issues.apache.org/jira/browse/HDFS-17198
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17198.v001.patch
>
>
> h2. *Bug description*
> In the original implementation, when each router reports nn status at 
> different times, the nn status is the status reported by the majority of 
> routers, for example:
> router1 -> nn0:active dateModified:1
> router2 -> nn0:active dateModified:2
> router3 -> nn0:active dateModified:3
> router0 -> nn0:standby dateModified:4
> Then the status of nn0 is active, because the majority of routers report 
> that nn0 is active.
> If the majority of routers report nn status at the same time, for example:
> (record1) router1 -> nn0:active dateModified:1
> (record2) router2 -> nn0:active dateModified:1
> (record3) router3 -> nn0:active dateModified:1
> (record4) router0 -> nn0:standby dateModified:2
> then the state of nn0 is standby, but we expect the status of nn0 to be 
> active.
> This bug occurs because the above records are put into a TreeSet in the 
> method getRepresentativeQuorum. Since record1, record2, and record3 have 
> the same dateModified, only one of them survives in the method's final 
> TreeSet, so the method concludes that this nn is standby, because record4 
> is newer.
> h2. *How to reproduce*
> Run my unit test testRegistrationMajorityQuorumEqDateModified against the 
> original code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org