[jira] [Created] (HDFS-17407) Exception during image upload

2024-02-29 Thread ruiliang (Jira)
ruiliang created HDFS-17407:
---

 Summary: Exception during image upload
 Key: HDFS-17407
 URL: https://issues.apache.org/jira/browse/HDFS-17407
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.1.0
 Environment: hadoop 3.1.0 

linux:ubuntu 16.04

ambari-hdp:3.1.1
Reporter: ruiliang


After I added a third HDFS NameNode, the service itself ran fine. However, the two 
standby NameNodes' service logs keep showing exceptions during image upload. 
I can see that the fsimage on the active node is being updated normally, which 
indicates that a standby has merged the image file and uploaded it to the active 
node. So why do the two standby NameNodes keep emitting these exception logs? 
Is there a potential risk here?
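One plausible mechanism (an assumption, not confirmed): with more than one standby, 
both checkpointers upload their fsimage to the active NameNode, the active accepts 
only one sufficiently recent image and closes the redundant connection, and the 
losing standby's chunked HTTP PUT then fails mid-stream with "Error writing request 
body to server", as in the log below. A minimal sketch of that client-side failure 
mode (the endpoint URL and sizes are hypothetical, and this is not Hadoop code):

{code:java}
// Sketch only (hypothetical endpoint): a chunked HTTP PUT fails on the sender
// with "Error writing request body to server" when the receiver closes the
// connection before the body has been fully written -- the same symptom
// TransferFsImage.copyFileToStream reports in the log below.
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChunkedPutSketch {
  public static void main(String[] args) throws IOException {
    URL url = new URL("http://active-nn:50070/imagetransfer"); // hypothetical
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    conn.setChunkedStreamingMode(131072); // 128 KB segments, matching the log
    byte[] segment = new byte[131072];
    try (OutputStream out = conn.getOutputStream()) {
      for (long sent = 0; sent < 4_626_167_848L; sent += segment.length) {
        out.write(segment); // throws IOException once the server hangs up
      }
    }
    System.out.println("response: " + conn.getResponseCode());
  }
}
{code}

If the uploads are instead genuinely timing out, dfs.image.transfer.timeout and 
dfs.image.transfer.bandwidthPerSec are the standard knobs to check.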

 

 
{code:java}
2024-03-01 15:31:46,162 INFO  namenode.TransferFsImage 
(TransferFsImage.java:copyFileToStream(394)) - Sending fileName: 
/data/hadoop/hdfs/namenode/current/fsimage_55689095810, fileSize: 
4626167848. Sent total: 1703936 bytes. Size of last segment intended to send: 
131072 bytes.
java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:236)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:231)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2024-03-01 15:31:46,630 INFO  blockmanagement.BlockManager 
(BlockManager.java:enqueue(4923)) - Block report queue is full
2024-03-01 15:31:46,664 ERROR ha.StandbyCheckpointer 
(StandbyCheckpointer.java:doWork(452)) - Exception in doCheckpoint
java.io.IOException: Exception during image upload
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:257)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1500(StandbyCheckpointer.java:62)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:432)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$600(StandbyCheckpointer.java:331)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:351)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1710)
        at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:347)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error 
writing request body to server
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:250)
        ... 9 more
Caused by: java.io.IOException: Error writing request body to server
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
        at 
sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:376)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:320)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:294)
        at 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:229)
        at 
org.apache.h

[jira] [Updated] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17406:
--
Labels: pull-request-available  (was: )

> Suppress UnresolvedPathException in hdfs router log
> ---
>
> Key: HDFS-17406
> URL: https://issues.apache.org/jira/browse/HDFS-17406
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> UnresolvedPathException is thrown as part of the normal symlink-resolution process, 
> so the router server doesn't need to log it at all.
> We should optimize this and reduce the log output.
> {code:java}
> 2024-03-01 14:51:25,084 INFO  ipc.Server (Server.java:logException(3417)) 
> [IPC Server on default port ] - IPC Server 965 on default port , call 
> Call#1313293 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xxx
> org.apache.hadoop.hdfs.protocol.UnresolvedPathException: /xxx/path
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
> 467)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnresolvedPathException):
>  /xxx/path
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
> 467)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
>

[jira] [Commented] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822432#comment-17822432
 ] 

ASF GitHub Bot commented on HDFS-17406:
---

haiyang1987 opened a new pull request, #6603:
URL: https://github.com/apache/hadoop/pull/6603

   
   
   ### Description of PR
   https://issues.apache.org/jira/browse/HDFS-17406
   UnresolvedPathException is thrown as part of the normal symlink-resolution process, 
so the router server doesn't need to log it at all.
   We should optimize this and reduce the log output.
   
   ```
   2024-03-01 14:51:25,084 INFO  ipc.Server (Server.java:logException(3417)) 
[IPC Server on default port ] - IPC Server 965 on default port , call 
Call#1313293 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xxx
   org.apache.hadoop.hdfs.protocol.UnresolvedPathException: /xxx/path
   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
   at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
   467)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)
   Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnresolvedPathException):
 /xxx/path
   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
   at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
   at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
   467)
   at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
   at java.security.AccessController.doPrivileged(Nat

[jira] [Updated] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17406:
--
Description: 
UnresolvedPathException is thrown as part of the normal symlink-resolution process, so 
the router server doesn't need to log it at all.
We should optimize this and reduce the log output.


{code:java}
2024-03-01 14:51:25,084 INFO  ipc.Server (Server.java:logException(3417)) [IPC 
Server on default port ] - IPC Server 965 on default port , call 
Call#1313293 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xxx
org.apache.hadoop.hdfs.protocol.UnresolvedPathException: /xxx/path
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
467)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnresolvedPathException):
 /xxx/path
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
467)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
at org.apache.hadoop.ipc.Client.ca

[jira] [Commented] (HDFS-17387) [FGL] Abstract the configurable locking mode

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822424#comment-17822424
 ] 

ASF GitHub Bot commented on HDFS-17387:
---

hadoop-yetus commented on PR #6572:
URL: https://github.com/apache/hadoop/pull/6572#issuecomment-1972669698

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ HDFS-17384 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 49s |  |  HDFS-17384 passed  |
   | +1 :green_heart: |  compile  |   0m 44s |  |  HDFS-17384 passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 40s |  |  HDFS-17384 passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 41s |  |  HDFS-17384 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  HDFS-17384 passed  |
   | +1 :green_heart: |  javadoc  |   0m 43s |  |  HDFS-17384 passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  6s |  |  HDFS-17384 passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 43s |  |  HDFS-17384 passed  |
   | +1 :green_heart: |  shadedclient  |  20m 34s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 37s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 30s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6572/8/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 47 new + 312 unchanged 
- 8 fixed = 359 total (was 320)  |
   | +1 :green_heart: |  mvnsite  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 43s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 42s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 205m 58s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6572/8/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 31s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 294m 33s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6572/8/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6572 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
   | uname | Linux 8969e8042f4e 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revisio

[jira] [Updated] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17406:
--
Description: 
UnresolvedPathException is thrown as part of the normal symlink-resolution process, so 
the router server doesn't need to log it at all.
We should optimize this and reduce the log output.


{code:java}
2024-03-01 14:51:25,084 INFO  ipc.Server (Server.java:logException(3417)) [IPC 
Server on default port ] - IPC Server 965 on default port , call 
Call#1313293 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xxx
org.apache.hadoop.hdfs.protocol.UnresolvedPathException: /xxx/path
_type=local/grass_region=ID/is_cb_shop=0/compact-20730-de870ead-9bdf-49b2-b722-1081781b5bf1.c000.zstd.parquet
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
467)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnresolvedPathException):
 /xxx/path
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
467)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)

 

[jira] [Updated] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17406:
--
Description: 
UnresolvedPathException is thrown as part of the normal symlink-resolution process, so 
the router server doesn't need to log it at all.
We should optimize this and reduce the log output.


{code:java}
2024-03-01 14:51:25,084 INFO  ipc.Server (Server.java:logException(3417)) [IPC 
Server on default port ] - IPC Server 965 on default port , call 
Call#1313293 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from xxx

Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnresolvedPathException):
 /xxx/path
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
467)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
at org.apache.hadoop.ipc.Client.call(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:335)
at sun.reflect.GeneratedMethodAccessor106.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:776)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:596)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:1132)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:1077)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getBlockLocations(RouterClientProtocol.java:308)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getBlockLocations(RouterRpcServer.java:624)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
465)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:623)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:591

[jira] [Updated] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17406:
--
Description: 
UnresolvedPathException is thrown as part of the normal symlink-resolution process, so 
the router server doesn't need to log it at all.
We should optimize this and reduce the log output.


{code:java}
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnresolvedPathException):
 /xxx/path
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
467)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
at org.apache.hadoop.ipc.Client.call(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:335)
at sun.reflect.GeneratedMethodAccessor106.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:776)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:596)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:1132)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:1077)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getBlockLocations(RouterClientProtocol.java:308)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getBlockLocations(RouterRpcServer.java:624)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
465)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:623)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:591)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:575)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1230)
  

[jira] [Updated] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17406:
--
Description: 
UnresolvedPathException is thrown as part of the normal symlink-resolution process, so 
the router server doesn't need to log it at all.


{code:java}
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnresolvedPathException):
 /xxx/path
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkNotSymlink(FSPermissionChecker.java:762)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:721)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:1997)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkTraverse(FSDirectory.java:2015)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resolvePath(FSDirectory.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:164)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2265)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2250)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:914)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
467)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
at org.apache.hadoop.ipc.Client.call(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:335)
at sun.reflect.GeneratedMethodAccessor106.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:776)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:596)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:1132)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:1077)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getBlockLocations(RouterClientProtocol.java:308)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getBlockLocations(RouterRpcServer.java:624)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:
465)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:623)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:591)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:575)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1230)
at org.apache.hadoop.ipc.Server$RpcCa

[jira] [Created] (HDFS-17406) Suppress UnresolvedPathException in hdfs router log

2024-02-29 Thread Haiyang Hu (Jira)
Haiyang Hu created HDFS-17406:
-

 Summary: Suppress UnresolvedPathException in hdfs router log
 Key: HDFS-17406
 URL: https://issues.apache.org/jira/browse/HDFS-17406
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Haiyang Hu
Assignee: Haiyang Hu


UnresolvedPathException is thrown as part of the normal symlink-resolution process, so 
the router server doesn't need to log it at all.
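The usual way to keep an expected exception out of the IPC server log is Hadoop's 
terse-exceptions list; a minimal sketch of that approach (an assumption about how 
the fix could look, not the actual patch):

{code:java}
// Sketch only (assumption, not the actual HDFS-17406 patch): exceptions
// registered as "terse" are logged by org.apache.hadoop.ipc.Server as a single
// line without a stack trace, which suits expected control-flow exceptions
// such as UnresolvedPathException raised while resolving symlinks.
import org.apache.hadoop.hdfs.protocol.UnresolvedPathException;
import org.apache.hadoop.ipc.RPC;

public class TerseExceptionSketch {
  static void suppressSymlinkNoise(RPC.Server rpcServer) {
    rpcServer.addTerseExceptions(UnresolvedPathException.class);
  }
}
{code}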








[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822394#comment-17822394
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1972572072

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m  0s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  39m 53s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m 15s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   6m  3s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 23s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 12s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 54s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/12/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  44m 31s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  44m 57s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   1m  5s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m 59s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   6m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m 19s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   6m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 28s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/12/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 3 new + 244 unchanged - 2 fixed = 
247 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m 23s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 43s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  9s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m 50s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  44m  5s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 28s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 266m 49s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/12/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 44s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 472m  0s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-65

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822364#comment-17822364
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

hadoop-yetus commented on PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#issuecomment-1972468090

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 47s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 4 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 43s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  37m  8s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m  4s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   5m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 18s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 20s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |   2m 38s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/11/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html)
 |  hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs 
warnings.  |
   | +1 :green_heart: |  shadedclient  |  40m 58s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  41m 19s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m 12s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   6m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   6m  0s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   6m  0s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 22s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/11/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 3 new + 244 unchanged - 2 fixed = 
247 total (was 246)  |
   | +1 :green_heart: |  mvnsite  |   2m  7s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  3s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  42m 37s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 27s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 262m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6566/11/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 56s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 456m 47s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   |   | hadoop.hdfs.protocol.TestBlockListAsLongs |
   |   | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-m

[jira] [Commented] (HDFS-17387) [FGL] Abstract the configurable locking mode

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822340#comment-17822340
 ] 

ASF GitHub Bot commented on HDFS-17387:
---

ZanderXu commented on PR #6572:
URL: https://github.com/apache/hadoop/pull/6572#issuecomment-1972341796

   > @ZanderXu @ferhui Thank you for your work! I tried to upgrade the 
maven-surefire-plugin (#6537) on trunk, but it was unsuccessful, which may 
cause some unit tests to fail to run. I rolled back this pr (#6578), and I 
cherry-picked this pr (#6578) to the 
[HDFS-17384](https://issues.apache.org/jira/browse/HDFS-17384) branch.
   
   @slfan1989 Thanks so much. I will rebase this PR again.




> [FGL] Abstract the configurable locking mode
> 
>
> Key: HDFS-17387
> URL: https://issues.apache.org/jira/browse/HDFS-17387
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> Abstract a lock mode that covers both the current global lock and the new 
> fine-grained locks (a global FS lock and a global BM lock).
> End users can select the lock mode through configuration (see the sketch after 
> the list below).
> The possible lock modes after this patch are as follows:
>  * GLOBAL Lock
>  * FS Lock
>  * BM Lock
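A minimal sketch of the configurable-mode idea (an illustration only; the config 
key name is hypothetical and this is not the actual HDFS-17387 patch):

{code:java}
// Illustration only: a lock abstraction whose mode is selected via
// configuration. GLOBAL preserves today's single-global-lock semantics by
// routing every caller to one lock; FS/BM split namespace and block-manager
// operations onto separate locks.
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ConfigurableLockSketch {
  enum LockMode { GLOBAL, FS, BM }

  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
  private final ReentrantReadWriteLock bmLock = new ReentrantReadWriteLock(true);
  private final LockMode mode;

  ConfigurableLockSketch(String configuredMode) {
    // e.g. conf.get("dfs.namenode.lock.mode", "GLOBAL") -- hypothetical key
    this.mode = LockMode.valueOf(configuredMode.toUpperCase());
  }

  /** Returns the lock to take for the requested subsystem. */
  ReentrantReadWriteLock lockFor(LockMode requested) {
    if (mode == LockMode.GLOBAL) {
      return fsLock; // every caller shares one global lock
    }
    return requested == LockMode.BM ? bmLock : fsLock;
  }
}
{code}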






[jira] [Commented] (HDFS-17397) Choose another DN as soon as possible, when encountering network issues

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822331#comment-17822331
 ] 

ASF GitHub Bot commented on HDFS-17397:
---

xleoken commented on code in PR #6591:
URL: https://github.com/apache/hadoop/pull/6591#discussion_r1508372843


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1182,10 +1182,12 @@ public void run() {
 if (begin != null) {
   long duration = Time.monotonicNowNanos() - begin;
  if (TimeUnit.NANOSECONDS.toMillis(duration) > dfsclientSlowLogThresholdMs) {
-LOG.info("Slow ReadProcessor read fields for block " + block
+final String msg = "Slow ReadProcessor read fields for block " + block
 + " took " + TimeUnit.NANOSECONDS.toMillis(duration) + "ms (threshold="
 + dfsclientSlowLogThresholdMs + "ms); ack: " + ack
-+ ", targets: " + Arrays.asList(targets));
++ ", targets: " + Arrays.asList(targets);
+LOG.warn(msg);
+throw new IOException(msg);

Review Comment:
   Welcome @ZanderXu 
   
   > How to identify this case
   
   When the client takes longer than `dfsclientSlowLogThresholdMs` to read the ack.
   
   > Which datanode should be marked as a bad or slow DN
   
   When some datanodes are in a poor network environment.
   
   > Maybe DataStreamer can identify this case and recover it through 
PipelineRecovery
   
   The core issue is that the response time between the client and the DN is 
greater than `dfsclientSlowLogThresholdMs`, but the current code only prints a 
log without taking any action. We should print the log and also throw an 
`IOException`.
   
   > but I don't think your modification is a good solution.
   
   Maybe you're right, but this may be the simplest modification. After 
applying this patch, we solved the slow DN problem in our production 
environment.
   
   1. After applying the patch, once `dfsclientSlowLogThresholdMs` is 
exceeded, the client immediately chooses a new DN to complete the write, which 
keeps client writes from hanging while interacting with certain slow DNs.
   2. These slow nodes also show up in the HDFS JMX metrics; once monitoring 
detects them, the operations team can follow up with a remediation plan.
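
To make the proposed behaviour concrete, here is a minimal, self-contained 
sketch (hypothetical simplified names, not the actual DataStreamer code) of 
the pattern in the diff above: a slow ack is escalated from a log line to an 
IOException, so the streamer's error handling can mark the node and retry with 
a replacement DN.

{code:java}
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Simplified illustration of the change proposed in PR #6591.
final class SlowAckCheck {
  private final long slowLogThresholdMs; // stands in for dfsclientSlowLogThresholdMs

  SlowAckCheck(long slowLogThresholdMs) {
    this.slowLogThresholdMs = slowLogThresholdMs;
  }

  // Called after an ack has been read; beginNanos marks when the read started.
  void check(long beginNanos, String ackDescription) throws IOException {
    long durationMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - beginNanos);
    if (durationMs > slowLogThresholdMs) {
      String msg = "Slow ReadProcessor read fields took " + durationMs
          + "ms (threshold=" + slowLogThresholdMs + "ms); ack: " + ackDescription;
      // Before the patch this was only logged; throwing lets the caller
      // treat the node as bad and set up a new pipeline.
      throw new IOException(msg);
    }
  }
}
{code}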





> Choose another DN as soon as possible, when encountering network issues
> ---
>
> Key: HDFS-17397
> URL: https://issues.apache.org/jira/browse/HDFS-17397
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xleoken
>Priority: Minor
>  Labels: pull-request-available
> Attachments: hadoop.png
>
>
> Choose another DN as soon as possible, when encountering network issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17397) Choose another DN as soon as possible, when encountering network issues

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822330#comment-17822330
 ] 

ASF GitHub Bot commented on HDFS-17397:
---

xleoken commented on code in PR #6591:
URL: https://github.com/apache/hadoop/pull/6591#discussion_r1508372843


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1182,10 +1182,12 @@ public void run() {
 if (begin != null) {
   long duration = Time.monotonicNowNanos() - begin;
   if (TimeUnit.NANOSECONDS.toMillis(duration) > 
dfsclientSlowLogThresholdMs) {
-LOG.info("Slow ReadProcessor read fields for block " + block
+final String msg = "Slow ReadProcessor read fields for block " 
+ block
 + " took " + TimeUnit.NANOSECONDS.toMillis(duration) + "ms 
(threshold="
 + dfsclientSlowLogThresholdMs + "ms); ack: " + ack
-+ ", targets: " + Arrays.asList(targets));
++ ", targets: " + Arrays.asList(targets);
+LOG.warn(msg);
+throw new IOException(msg);

Review Comment:
   @ZanderXu 
   
   > How to identify this case
   
   When the client takes longer than `dfsclientSlowLogThresholdMs` to read the 
ack.
   
   > Which datanode should be marked as a bad or slow DN
   
   When some datanodes are in a poor network environment.
   
   > Maybe DataStreamer can identify this case and recover it through 
PipelineRecovery
   
   The core issue is that the response time between the client and the DN is 
greater than `dfsclientSlowLogThresholdMs`, but the current code only prints a 
log without taking any action. We should print the log and also throw an 
`IOException`.
   
   > but I don't think your modification is a good solution.
   
   Maybe you're right, but this may be the simplest modification. After 
applying this patch, we solved the slow DN problem in our production 
environment.








> Choose another DN as soon as possible, when encountering network issues
> ---
>
> Key: HDFS-17397
> URL: https://issues.apache.org/jira/browse/HDFS-17397
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xleoken
>Priority: Minor
>  Labels: pull-request-available
> Attachments: hadoop.png
>
>
> Choose another DN as soon as possible, when encountering network issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822316#comment-17822316
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508255658


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,154 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);
+}
+  }
+
+  @Test
+  public void 
testSingleRackFailureDuringPipelineSetupMinReplicationImpossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 3);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  boolean threw = false;
+  try {
+DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  } catch (IOException e) {
+// success
+threw = true;
+  } finally {
+cluster.shutdown(true);
+  }
+  assertTrue("Failed to throw IOE when creating a file with less "
+  + "DNs than required for min replication", threw);
+}
+  }
+
+  @Test
+  public void 
testMultipleRackFailureDuringPipelineSetupMinReplicationPossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 1);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill 2 DN, so only 1 racks stays with active DN
+  cluster.stopDataNode(0);
+  cluster.stopDataNode(1);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);

Review Comment:
   Try-with-resources handles that. We don't need cluster.shutdown here. See 
[AutoCloseable](https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html)
 and 
[close](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java#L3564).
 I removed it from the tests that I added.
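
For reference, a minimal sketch of the try-with-resources pattern recommended 
here, reusing the surrounding test's conf and imports; since MiniDFSCluster 
implements AutoCloseable, close() runs even when the body throws:

{code:java}
// try-with-resources closes the cluster automatically, even on exceptions,
// so an explicit cluster.shutdown(true) at the end of the block is redundant.
try (MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(3)
    .racks(new String[] {"/rack1", "/rack2", "/rack3"})
    .build()) {
  cluster.waitClusterUp();
  DistributedFileSystem fs = cluster.getFileSystem();
  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 1024L);
} // cluster.close() is invoked here, shutting the cluster down
{code}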





> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822317#comment-17822317
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508255776


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1618,33 +1625,47 @@ private void setupPipelineForAppendOrRecovery() throws 
IOException {
   LOG.warn(msg);
   lastException.set(new IOException(msg));
   streamerClosed = true;
-  return;
+  return false;
 }
-setupPipelineInternal(nodes, storageTypes, storageIDs);
+return setupPipelineInternal(nodes, storageTypes, storageIDs);
   }
 
-  protected void setupPipelineInternal(DatanodeInfo[] datanodes,
+  protected boolean setupPipelineInternal(DatanodeInfo[] datanodes,
   StorageType[] nodeStorageTypes, String[] nodeStorageIDs)
   throws IOException {
 boolean success = false;
 long newGS = 0L;
+boolean isCreateStage = BlockConstructionStage.PIPELINE_SETUP_CREATE == 
stage;
 while (!success && !streamerClosed && dfsClient.clientRunning) {
   if (!handleRestartingDatanode()) {
-return;
+return false;
   }
 
-  final boolean isRecovery = errorState.hasInternalError();
+  final boolean isRecovery = errorState.hasInternalError() && 
!isCreateStage;
+
+
   if (!handleBadDatanode()) {
-return;
+return false;
   }
 
   handleDatanodeReplacement();
 
+  // During create stage, if we remove a node (nodes.length - 1)

Review Comment:
   Updated





> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack
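
For reference, the 20.5-minute figure in the description above follows the 
NameNode's dead-node heuristic, 2 * dfs.namenode.heartbeat.recheck-interval + 
10 * dfs.heartbeat.interval. A minimal sketch of the arithmetic, assuming a 
600000 ms recheck interval and a 3 s heartbeat (values consistent with the 
quoted result):

{code:java}
public final class DeadNodeTimeout {
  public static void main(String[] args) {
    // Heuristic used by the NameNode to declare a DataNode dead:
    //   2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
    // The interval values below are assumptions matching the 20.5-minute figure.
    long recheckIntervalMs = 600_000L; // dfs.namenode.heartbeat.recheck-interval
    long heartbeatIntervalMs = 3_000L; // dfs.heartbeat.interval = 3 s
    long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    System.out.println(timeoutMs + " ms"); // 1230000 ms = 20.5 minutes
  }
}
{code}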

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822314#comment-17822314
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

ritegarg commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508253750


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,154 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);

Review Comment:
   Try-with-resources handles that. We don't need cluster.shutdown here. See 
[AutoCloseable](https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html)
 and 
[close](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java#L3564)





> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Assignee: Ritesh
>Priority: Critical
>  Labels: pull-request-available
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822312#comment-17822312
 ] 

ASF GitHub Bot commented on HDFS-17299:
---

shahrs87 commented on code in PR #6566:
URL: https://github.com/apache/hadoop/pull/6566#discussion_r1508229688


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1618,33 +1625,47 @@ private void setupPipelineForAppendOrRecovery() throws 
IOException {
   LOG.warn(msg);
   lastException.set(new IOException(msg));
   streamerClosed = true;
-  return;
+  return false;
 }
-setupPipelineInternal(nodes, storageTypes, storageIDs);
+return setupPipelineInternal(nodes, storageTypes, storageIDs);
   }
 
-  protected void setupPipelineInternal(DatanodeInfo[] datanodes,
+  protected boolean setupPipelineInternal(DatanodeInfo[] datanodes,
   StorageType[] nodeStorageTypes, String[] nodeStorageIDs)
   throws IOException {
 boolean success = false;
 long newGS = 0L;
+boolean isCreateStage = BlockConstructionStage.PIPELINE_SETUP_CREATE == 
stage;
 while (!success && !streamerClosed && dfsClient.clientRunning) {
   if (!handleRestartingDatanode()) {
-return;
+return false;
   }
 
-  final boolean isRecovery = errorState.hasInternalError();
+  final boolean isRecovery = errorState.hasInternalError() && 
!isCreateStage;
+
+
   if (!handleBadDatanode()) {
-return;
+return false;
   }
 
   handleDatanodeReplacement();
 
+  // During create stage, if we remove a node (nodes.length - 1)

Review Comment:
   I think this comment needs to be updated.



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDistributedFileSystem.java:
##
@@ -2651,5 +2653,154 @@ public void 
testNameNodeCreateSnapshotTrashRootOnStartup()
 }
   }
 
+  @Test
+  public void testSingleRackFailureDuringPipelineSetupMinReplicationPossible() 
throws Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 2);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  // create a file with replication 3, for rack fault tolerant BPP,
+  // it should allocate nodes in all 3 racks.
+  DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  cluster.shutdown(true);
+}
+  }
+
+  @Test
+  public void 
testSingleRackFailureDuringPipelineSetupMinReplicationImpossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);
+conf.setInt(HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.
+MIN_REPLICATION, 3);
+// 3 racks & 3 nodes. 1 per rack
+try (MiniDFSCluster cluster = new 
MiniDFSCluster.Builder(conf).numDataNodes(3)
+.racks(new String[] {"/rack1", "/rack2", "/rack3"}).build()) {
+  cluster.waitClusterUp();
+  DistributedFileSystem fs = cluster.getFileSystem();
+  // kill one DN, so only 2 racks stays with active DN
+  cluster.stopDataNode(0);
+  boolean threw = false;
+  try {
+DFSTestUtil.createFile(fs, new Path("/testFile"), 1024L, (short) 3, 
1024L);
+  } catch (IOException e) {
+// success
+threw = true;
+  } finally {
+cluster.shutdown(true);
+  }
+  assertTrue("Failed to throw IOE when creating a file with less "
+  + "DNs than required for min replication", threw);
+}
+  }
+
+  @Test
+  public void 
testMultipleRackFailureDuringPipelineSetupMinReplicationPossible() throws 
Exception {
+Configuration conf = getTestConfiguration();
+conf.setClass(
+DFSConfigKeys.DFS_BLOCK_REPLICATOR_CLASSNAME_KEY,
+BlockPlacementPolicyRackFaultTolerant.class,
+BlockPlacementPolicy.class);
+conf.setBoolean(
+HdfsClientConfigKeys.BlockWrite.ReplaceDatanodeOnFailure.ENABLE_KEY,
+false);

[jira] [Commented] (HDFS-17387) [FGL] Abstract the configurable locking mode

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822301#comment-17822301
 ] 

ASF GitHub Bot commented on HDFS-17387:
---

slfan1989 commented on PR #6572:
URL: https://github.com/apache/hadoop/pull/6572#issuecomment-1971955704

   @ZanderXu @ferhui Thank you for your work! I tried to upgrade the 
maven-surefire-plugin (#6537) on trunk, but it was unsuccessful, which may 
cause some unit tests to fail to run. I rolled back this PR (#6578), and I 
cherry-picked this PR (#6578) to the HDFS-17384 branch. 




> [FGL] Abstract the configurable locking mode
> 
>
> Key: HDFS-17387
> URL: https://issues.apache.org/jira/browse/HDFS-17387
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> Abstract a lock mode to cover the current global lock and the new 
> fine-grained locks (global FS lock and global BM lock).
> End users can select the lock mode through configuration.
> The possible lock modes after this patch are as follows:
>  * GLOBAL Lock
>  * FS Lock
>  * BM Lock



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17382) Add Apache Log4j Extras Library to Hadoop 3.3 for Enhanced Log Rolling Capabilities

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822300#comment-17822300
 ] 

ASF GitHub Bot commented on HDFS-17382:
---

slfan1989 commented on PR #6584:
URL: https://github.com/apache/hadoop/pull/6584#issuecomment-1971946182

   @dntjr8096 Thanks for the contribution! @dineshchitlangia Thanks for the 
review! 
   
   I am not sure if we still need to include content from the log4j 1.x 
version. I believe we should not include it.
   
   cc: @ayushtkn @steveloughran 




> Add Apache Log4j Extras Library to Hadoop 3.3 for Enhanced Log Rolling 
> Capabilities
> ---
>
> Key: HDFS-17382
> URL: https://issues.apache.org/jira/browse/HDFS-17382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.4.0
>Reporter: woosuk.ro
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2024-02-25-00-46-08-841.png
>
>
> In the current Hadoop 3.4 version, the system relies on Log4j 1.x for logging 
> purposes. This dependency limits the logging functionality, especially when 
> it comes to rolling log files such as 'hdfs-audit.log'. Rolling log files is 
> crucial for managing log size and ensuring logs are rotated out over time to 
> prevent excessive disk space usage. However, the Log4j 1.x version integrated 
> within Hadoop lacks the necessary capabilities to efficiently handle log 
> rolling.
> The Apache Log4j Extras library extends logging capabilities, including more 
> flexible and configurable log rolling features. By deploying this library, we can enable 
> advanced rolling strategies such as time-based rolling, size-based rolling, 
> and compression of rolled logs, which are not supported by the default Log4j 
> 1.x setup in Hadoop.
> The integration of Apache Log4j Extras into Hadoop will significantly improve 
> log management by allowing for more sophisticated and configurable log 
> rotation policies. This enhancement is particularly important for maintaining 
> system performance and reliability, as well as for compliance with log 
> retention policies.
> Although there are plans to upgrade to Log4j 2 in the forthcoming Hadoop 3.5 
> version, which will inherently solve these issues by providing enhanced 
> logging features, there is an immediate need to enable advanced log rolling 
> capabilities in the current and previous versions of Hadoop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17403) jenkins build failing for HDFS 3.5.0-SNAPSHOT

2024-02-29 Thread Shilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822299#comment-17822299
 ] 

Shilun Fan commented on HDFS-17403:
---

[~szetszwo] This is due to a failed compilation of the Maven plugin.

 

We can check the following link for more details: 
[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6549/2/artifact/out/branch-mvninstall-root.txt].

 
{code:java}
[INFO] Apache Hadoop Maven Plugins  FAILURE [ 40.643 s]
.
[ERROR] unable to create new native thread -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError {code}
 

If we trigger a recompilation, the issue should be resolved. Whether it occurs 
may depend on the state of the container at runtime, but in most cases the 
compilation succeeds.

 

# PR-6595 compile report

[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6595/1/artifact/out/branch-mvninstall-root.txt]

# PR-6594 compile report

https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6594/2/artifact/out/branch-mvninstall-root.txt

> jenkins build failing for HDFS 3.5.0-SNAPSHOT
> -
>
> Key: HDFS-17403
> URL: https://issues.apache.org/jira/browse/HDFS-17403
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Reporter: Tsz-wo Sze
>Priority: Major
>
> See 
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6549/2/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt
> {code}
> [INFO] ---< org.apache.hadoop:hadoop-hdfs 
> >
> [INFO] Building Apache Hadoop HDFS 3.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [WARNING] The POM for 
> org.apache.hadoop:hadoop-maven-plugins:jar:3.5.0-SNAPSHOT is missing, no 
> dependency information available
> [INFO] 
> 
> [INFO] BUILD FAILURE
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17403) jenkins build failing for HDFS 3.5.0-SNAPSHOT

2024-02-29 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822275#comment-17822275
 ] 

Tsz-wo Sze commented on HDFS-17403:
---

[~slfan1989], the jenkins builds are failing after updating the version to 
3.5.0-SNAPSHOT.  Any ideas?

> jenkins build failing for HDFS 3.5.0-SNAPSHOT
> -
>
> Key: HDFS-17403
> URL: https://issues.apache.org/jira/browse/HDFS-17403
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Reporter: Tsz-wo Sze
>Priority: Major
>
> See 
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6549/2/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt
> {code}
> [INFO] ---< org.apache.hadoop:hadoop-hdfs 
> >
> [INFO] Building Apache Hadoop HDFS 3.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [WARNING] The POM for 
> org.apache.hadoop:hadoop-maven-plugins:jar:3.5.0-SNAPSHOT is missing, no 
> dependency information available
> [INFO] 
> 
> [INFO] BUILD FAILURE
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17358) EC: infinite lease recovery caused by the length of RWR equals to zero.

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822089#comment-17822089
 ] 

ASF GitHub Bot commented on HDFS-17358:
---

tomscut commented on PR #6509:
URL: https://github.com/apache/hadoop/pull/6509#issuecomment-1970911161

   @zhangshuyan0 Hi, I closed the issue HDFS-17358.




> EC: infinite lease recovery caused by the length of RWR equals to zero.
> ---
>
> Key: HDFS-17358
> URL: https://issues.apache.org/jira/browse/HDFS-17358
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Recently, a strange case happened on our EC production cluster.
> The phenomenon is as described below: the NameNode performs infinite lease 
> recovery on some EC files (~80K+), and those files can never be closed.
>  
> After digging into the logs and related code, we found the root cause is the 
> code below in method `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
> {code:java}
>           // we met info.getNumBytes==0 here! 
>   if (info != null &&
>               info.getGenerationStamp() >= block.getGenerationStamp() &&
>               info.getNumBytes() > 0) {
>             final BlockRecord existing = syncBlocks.get(blockId);
>             if (existing == null ||
>                 info.getNumBytes() > existing.rInfo.getNumBytes()) {
>               // if we have >1 replicas for the same internal block, we
>               // simply choose the one with larger length.
>               // TODO: better usage of redundant replicas
>               syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
>             }
>           }
>   // throw exception here!
>           checkLocations(syncBlocks.size());
> {code}
> The related logs are as below:
> {code:java}
> java.io.IOException: 
> BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 
> has no enough internal blocks, unable to start recovery. Locations=[...] 
> {code}
> {code:java}
> 2024-01-23 12:48:16,171 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365, 
> replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR 
> getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() = 
> /data25/hadoop/hdfs/datanode getBlockURI() = 
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
>  recoveryId=27529675 original=ReplicaWaitingToBeRecovered, 
> blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0 
> getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode 
> getBlockURI() = 
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
> {code}
> Because the length of the RWR is zero, the length of the object returned by 
> the code below is also zero, so we can't put it into syncBlocks, and the 
> checkLocations method then throws the exception.
> {code:java}
>           ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
>               new RecoveringBlock(internalBlk, null, recoveryId)); {code}
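
To make the failure mode concrete, here is a minimal sketch (hypothetical 
simplified types, not the actual recovery code) of the logic described above: 
zero-length RWR replicas never enter syncBlocks, so the location check fails 
on every attempt and the lease recovery loops forever.

{code:java}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Simplified reproduction of the filter in RecoveryTaskStriped#recover.
final class StripedRecoverySketch {
  static void recover(long[] replicaLengths, int minLocations) throws IOException {
    Map<Integer, Long> syncBlocks = new HashMap<>();
    for (int blockId = 0; blockId < replicaLengths.length; blockId++) {
      long numBytes = replicaLengths[blockId];
      // Mirrors the `info.getNumBytes() > 0` condition: a zero-length RWR
      // replica is skipped and never added to syncBlocks.
      if (numBytes > 0) {
        // With >1 replicas for the same internal block, keep the larger length.
        syncBlocks.merge(blockId, numBytes, Math::max);
      }
    }
    if (syncBlocks.size() < minLocations) {
      // Mirrors checkLocations(): recovery aborts here, the NameNode retries,
      // and with all-zero RWR lengths every retry fails the same way.
      throw new IOException("has no enough internal blocks, unable to start recovery");
    }
  }
}
{code}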



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17358) EC: infinite lease recovery caused by the length of RWR equals to zero.

2024-02-29 Thread Tao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li resolved HDFS-17358.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> EC: infinite lease recovery caused by the length of RWR equals to zero.
> ---
>
> Key: HDFS-17358
> URL: https://issues.apache.org/jira/browse/HDFS-17358
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Recently, a strange case happened on our EC production cluster.
> The phenomenon is as described below: the NameNode performs infinite lease 
> recovery on some EC files (~80K+), and those files can never be closed.
>  
> After digging into the logs and related code, we found the root cause is the 
> code below in method `BlockRecoveryWorker$RecoveryTaskStriped#recover`:
> {code:java}
>           // we met info.getNumBytes==0 here! 
>   if (info != null &&
>               info.getGenerationStamp() >= block.getGenerationStamp() &&
>               info.getNumBytes() > 0) {
>             final BlockRecord existing = syncBlocks.get(blockId);
>             if (existing == null ||
>                 info.getNumBytes() > existing.rInfo.getNumBytes()) {
>               // if we have >1 replicas for the same internal block, we
>               // simply choose the one with larger length.
>               // TODO: better usage of redundant replicas
>               syncBlocks.put(blockId, new BlockRecord(id, proxyDN, info));
>             }
>           }
>   // throw exception here!
>           checkLocations(syncBlocks.size());
> {code}
> The related logs are as below:
> {code:java}
> java.io.IOException: 
> BP-1157541496-10.104.10.198-1702548776421:blk_-9223372036808032688_2938828 
> has no enough internal blocks, unable to start recovery. Locations=[...] 
> {code}
> {code:java}
> 2024-01-23 12:48:16,171 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> initReplicaRecovery: blk_-9223372036808032686_2938828, recoveryId=27615365, 
> replica=ReplicaUnderRecovery, blk_-9223372036808032686_2938828, RUR 
> getNumBytes() = 0 getBytesOnDisk() = 0 getVisibleLength()= -1 getVolume() = 
> /data25/hadoop/hdfs/datanode getBlockURI() = 
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-x.x.x.x-1702548776421/current/rbw/blk_-9223372036808032686
>  recoveryId=27529675 original=ReplicaWaitingToBeRecovered, 
> blk_-9223372036808032686_2938828, RWR getNumBytes() = 0 getBytesOnDisk() = 0 
> getVisibleLength()= -1 getVolume() = /data25/hadoop/hdfs/datanode 
> getBlockURI() = 
> file:/data25/hadoop/hdfs/datanode/current/BP-1157541496-10.104.10.198-1702548776421/current/rbw/blk_-9223372036808032686
> {code}
> Because the length of the RWR is zero, the length of the object returned by 
> the code below is also zero, so we can't put it into syncBlocks, and the 
> checkLocations method then throws the exception.
> {code:java}
>           ReplicaRecoveryInfo info = callInitReplicaRecovery(proxyDN,
>               new RecoveringBlock(internalBlk, null, recoveryId)); {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17397) Choose another DN as soon as possible, when encountering network issues

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822053#comment-17822053
 ] 

ASF GitHub Bot commented on HDFS-17397:
---

ZanderXu commented on code in PR #6591:
URL: https://github.com/apache/hadoop/pull/6591#discussion_r1507313404


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java:
##
@@ -1182,10 +1182,12 @@ public void run() {
 if (begin != null) {
   long duration = Time.monotonicNowNanos() - begin;
   if (TimeUnit.NANOSECONDS.toMillis(duration) > 
dfsclientSlowLogThresholdMs) {
-LOG.info("Slow ReadProcessor read fields for block " + block
+final String msg = "Slow ReadProcessor read fields for block " 
+ block
 + " took " + TimeUnit.NANOSECONDS.toMillis(duration) + "ms 
(threshold="
 + dfsclientSlowLogThresholdMs + "ms); ack: " + ack
-+ ", targets: " + Arrays.asList(targets));
++ ", targets: " + Arrays.asList(targets);
+LOG.warn(msg);
+throw new IOException(msg);

Review Comment:
   Thanks @xleoken for involving me.
   
   The problem you reported should be fixed, but I don't think your 
modification is a good solution.
   
   Maybe DataStreamer can identify this case and recover it through 
PipelineRecovery. But there are two questions that should be confirmed:
   
   - How to identify this case?
   - Which datanode should be marked as a bad or slow DN?





> Choose another DN as soon as possible, when encountering network issues
> ---
>
> Key: HDFS-17397
> URL: https://issues.apache.org/jira/browse/HDFS-17397
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xleoken
>Priority: Minor
>  Labels: pull-request-available
> Attachments: hadoop.png
>
>
> Choose another DN as soon as possible, when encountering network issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17405) [FGL] Using different metric name to trace FGL and Global lock

2024-02-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822030#comment-17822030
 ] 

ASF GitHub Bot commented on HDFS-17405:
---

xleoken commented on PR #6600:
URL: https://github.com/apache/hadoop/pull/6600#issuecomment-1970732881

   hi @ZanderXu, please take a look when you are free, thanks: 
https://github.com/apache/hadoop/pull/6591




> [FGL] Using different metric name to trace FGL and Global lock
> --
>
> Key: HDFS-17405
> URL: https://issues.apache.org/jira/browse/HDFS-17405
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> Currently, fine-grained locks and the global lock use the same metric name 
> to trace their performance, so we need to differentiate them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org