[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828589#comment-17828589 ]

ASF GitHub Bot commented on HDFS-17430:
---

dineshchitlangia commented on PR #6635:
URL: https://github.com/apache/hadoop/pull/6635#issuecomment-2008619571

@ZanderXu since you posted the first set of suggestions, could you confirm whether they have been addressed? We can merge once we have your +1.

> RecoveringBlock will skip no live replicas when get block recovery command.
> ---
>
> Key: HDFS-17430
> URL: https://issues.apache.org/jira/browse/HDFS-17430
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
> Labels: pull-request-available
>
> When building the block recovery command, RecoveringBlock may fail to skip replicas that are no longer live.
>
> *Issue:*
> Currently, the following scenario can cause the datanode's BlockRecoveryWorker to fail, leaving the file unclosed for a long time.
> *t1.* Block blk_xxx_xxx has two replicas [dn1, dn2]; the dn1 machine shuts down and the node becomes dead, while dn2 stays live.
> *t2.* Block recovery starts.
> related logs:
> {code:java}
> 2024-03-13 21:58:00.651 WARN hdfs.StateChange DIR* NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease recovery is in progress. RecoveryId = 28577373754 for block blk_xxx_xxx
> {code}
> *t3.* dn2 is chosen for block recovery.
> dn1 is marked as stale (it is actually dead) at this point, so recoveryLocations has size 1. Under the logic below, both dn1 and dn2 are then chosen to participate in block recovery.
> DatanodeManager#getBlockRecoveryCommand
> {code:java}
>     // Skip stale nodes during recovery
>     final List<DatanodeStorageInfo> recoveryLocations =
>         new ArrayList<>(storages.length);
>     final List<Integer> storageIdx = new ArrayList<>(storages.length);
>     for (int i = 0; i < storages.length; ++i) {
>       if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) {
>         recoveryLocations.add(storages[i]);
>         storageIdx.add(i);
>       }
>     }
>     ...
>     // If we only get 1 replica after eliminating stale nodes, choose all
>     // replicas for recovery and let the primary data node handle failures.
>     DatanodeInfo[] recoveryInfos;
>     if (recoveryLocations.size() > 1) {
>       if (recoveryLocations.size() != storages.length) {
>         LOG.info("Skipped stale nodes for recovery : "
>             + (storages.length - recoveryLocations.size()));
>       }
>       recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
>     } else {
>       // If too many replicas are stale, then choose all replicas to
>       // participate in block recovery.
>       recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
>     }
> {code}
> {code:java}
> 2024-03-13 21:58:01,425 INFO datanode.DataNode (BlockRecoveryWorker.java:logRecoverBlock(563)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - BlockRecoveryWorker: NameNode at xxx:8040 calls recoverBlock(BP-xxx:blk_xxx_xxx, targets=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]], newGenerationStamp=28577373754, newBlock=null, isStriped=false)
> {code}
> *t4.* When dn2 executes BlockRecoveryWorker#recover, it calls initReplicaRecovery on dn1. However, since the dn1 machine is down at this point, the call takes a very long time to time out: the default number of retries when establishing a server connection is 45.
> related logs:
> {code:java}
> 2024-03-13 21:59:31,518 INFO ipc.Client (Client.java:handleConnectionTimeout(904)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - Retrying connect to server: dn1:8010. Already tried 0 time(s); maxRetries=45
> ...
> 2024-03-13 23:05:35,295 INFO ipc.Client (Client.java:handleConnectionTimeout(904)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - Retrying connect to server: dn2:8010. Already tried 44 time(s); maxRetries=45
> 2024-03-13 23:07:05,392 WARN protocol.InterDatanodeProtocol (BlockRecoveryWorker.java:recover(170)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - Failed to recover block (block=BP-xxx:blk_xxx_xxx, datanode=DatanodeInfoWithStorage[dn1:50010,null,null])
> org.apache.hadoop.net.ConnectTimeoutException: Call From dn2 to dn1:8010 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 9 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel
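The stale-replica fallback described in t3 above can be illustrated with a small, self-contained sketch. This is a simplified model of the quoted DatanodeManager logic, not the actual Hadoop code: replicas are reduced to a staleness flag per index, and the method returns which indices end up as recovery targets. It shows how the "too many stale replicas" branch re-adds the dead dn1.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the replica selection in DatanodeManager#getBlockRecoveryCommand. */
public class RecoverySelection {

    /**
     * Given one staleness flag per stored replica, returns the indices chosen
     * for block recovery, mirroring the fallback in the quoted snippet.
     */
    public static List<Integer> chooseRecoveryTargets(boolean[] stale) {
        // Skip stale nodes during recovery.
        List<Integer> nonStale = new ArrayList<>();
        for (int i = 0; i < stale.length; i++) {
            if (!stale[i]) {
                nonStale.add(i);
            }
        }
        if (nonStale.size() > 1) {
            return nonStale; // enough fresh replicas: stale ones really are skipped
        }
        // Fallback: if at most one replica is non-stale, use ALL replicas.
        // This is the branch that re-adds the dead dn1 from the scenario above.
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < stale.length; i++) {
            all.add(i);
        }
        return all;
    }

    public static void main(String[] args) {
        // t3 scenario: dn1 (index 0) is stale/dead, dn2 (index 1) is live.
        // Only one non-stale replica remains, so both indices are chosen.
        System.out.println(chooseRecoveryTargets(new boolean[] {true, false}));
    }
}
```

With `[stale, live]` the fallback fires and both replicas are returned, so the primary (dn2) is later forced to contact the dead dn1 and sits through the 45 connect retries described in t4.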
[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828586#comment-17828586 ]

ASF GitHub Bot commented on HDFS-17430:
---

haiyang1987 commented on PR #6635:
URL: https://github.com/apache/hadoop/pull/6635#issuecomment-2008615230

Hi @dineshchitlangia @ayushtkn, would you mind reviewing this PR when you have free time? Thank you so much.
[jira] [Resolved] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dinesh Chitlangia resolved HDFS-17431.
--
Fix Version/s: 3.4.1
Resolution: Fixed

Thanks [~haiyang Hu] for the improvement.

> Fix log format for BlockRecoveryWorker#recoverBlocks
> ---
>
> Key: HDFS-17431
> URL: https://issues.apache.org/jira/browse/HDFS-17431
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.1
>
> Fix log format for BlockRecoveryWorker#recoverBlocks
>
> As seen in PR [https://github.com/apache/hadoop/pull/6635] the additional {} is moot.
>
> 2024-03-13 23:07:05,401 WARN datanode.DataNode (BlockRecoveryWorker.java:run(623)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - recover Block: RecoveringBlock\{BP-xxx:blk_xxx_xxx; getBlockSize()=0; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]]; cachedLocs=[]}
> FAILED:
> *{}*
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): The recovery id 28577373754 does not match current recovery id 28578772548 for block BP-xxx:blk_xxx_xxx
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4129)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:1184)
> at

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Chitlangia updated HDFS-17431: - Description: Fix log format for BlockRecoveryWorker#recoverBlocks As seen in PR [https://github.com/apache/hadoop/pull/6635] the additional {} is moot. 2024-03-13 23:07:05,401 WARN datanode.DataNode (BlockRecoveryWorker.java:run(623)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - recover Block: RecoveringBlock\{BP-xxx:blk_xxx_xxx; getBlockSize()=0; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]]; cachedLocs=[]} FAILED: *{}* org.apache.hadoop.ipc.RemoteException(java.io.IOException): The recovery id 28577373754 does not match current recovery id 28578772548 for block BP-xxx:blk_xxx_xxx at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4129) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:1184) at was:Fix log format for BlockRecoveryWorker#recoverBlocks > Fix log format for BlockRecoveryWorker#recoverBlocks > > > Key: HDFS-17431 > URL: https://issues.apache.org/jira/browse/HDFS-17431 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > > Fix log format for BlockRecoveryWorker#recoverBlocks > > As seen in PR [https://github.com/apache/hadoop/pull/6635] the additional {} > is moot. 
[jira] [Commented] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828583#comment-17828583 ]

ASF GitHub Bot commented on HDFS-17431:
---

haiyang1987 commented on PR #6643:
URL: https://github.com/apache/hadoop/pull/6643#issuecomment-2008608210

Thanks @dineshchitlangia @ayushtkn @wzk784533 for your review and merge~
[jira] [Commented] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828581#comment-17828581 ]

ASF GitHub Bot commented on HDFS-17431:
---

dineshchitlangia merged PR #6643:
URL: https://github.com/apache/hadoop/pull/6643
[jira] [Commented] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828580#comment-17828580 ]

ASF GitHub Bot commented on HDFS-17431:
---

haiyang1987 commented on PR #6643:
URL: https://github.com/apache/hadoop/pull/6643#issuecomment-2008603273

Thanks @wzk784533 @ayushtkn @dineshchitlangia for your review.

I found that this log format causes problems, such as the one mentioned in https://github.com/apache/hadoop/pull/6635:
```
2024-03-13 23:07:05,401 WARN datanode.DataNode (BlockRecoveryWorker.java:run(623)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - recover Block: RecoveringBlock{BP-xxx:blk_xxx_xxx; getBlockSize()=0; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]]; cachedLocs=[]} FAILED: {}
org.apache.hadoop.ipc.RemoteException(java.io.IOException): The recovery id 28577373754 does not match current recovery id 28578772548 for block BP-xxx:blk_xxx_xxx
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4129)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:1184)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:310)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:34391)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
    at org.apache.hadoop.ipc.Client.call(Client.java:1511)
    at org.apache.hadoop.ipc.Client.call(Client.java:1402)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:268)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:142)
    at com.sun.proxy.$Proxy17.commitBlockSynchronization(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolClientSideTranslatorPB.java:342)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:334)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:189)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:620)
    at java.lang.Thread.run(Thread.java:748)
```

In `LOG.warn("recover Block: {} FAILED: {}", b, e);` the call passes `e` as the last argument, which prints the entire stack trace, so the second placeholder is meaningless. I think we should either remove the second placeholder or change `e` to `e.toString()`.

Hi @ayushtkn @dineshchitlangia @wzk784533, what do you think? Thanks~
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828579#comment-17828579 ]

ASF GitHub Bot commented on HDFS-17433:
---

hfutatzhanghb commented on PR #6644:
URL: https://github.com/apache/hadoop/pull/6644#issuecomment-2008595490

> +1 LGTM, pending CI @hfutatzhanghb thanks for finding this issue and contributing the fix.

Sir, thanks a lot for reviewing.

> metrics sumOfActorCommandQueueLength should only record valid commands
> ---
>
> Key: HDFS-17433
> URL: https://issues.apache.org/jira/browse/HDFS-17433
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.4.0
> Reporter: farmmamba
> Assignee: farmmamba
> Priority: Minor
> Labels: pull-request-available
[jira] [Updated] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-17433:
--
Labels: pull-request-available (was: )
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828574#comment-17828574 ]

ASF GitHub Bot commented on HDFS-17433:
---

hfutatzhanghb opened a new pull request, #6644:
URL: https://github.com/apache/hadoop/pull/6644

### Description of PR

We have a phone alarm on the sumOfActorCommandQueueLength metric that fires when it goes beyond 3000. Recently the alarm fired, and we found that `DatanodeCommand[] cmds` arrays with length 0 were still being put into the queue and counted via incrActorCmdQueueLength. When processedCommandsOpAvgTime is high, these empty cmds are enqueued on every heartbeat interval. sumOfActorCommandQueueLength should only record valid (non-empty) commands.
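The counting rule proposed in the PR description can be sketched as follows. This is a hypothetical, self-contained model (class and method names are invented for illustration, not the actual BPServiceActor/metrics code): empty command arrays are neither enqueued nor counted, so they stop inflating the metric.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical sketch of the proposed fix: the metric is only bumped for
 * non-empty command arrays, so empty heartbeat responses no longer inflate
 * sumOfActorCommandQueueLength.
 */
public class ActorCommandQueue {
    private final Queue<String[]> queue = new ArrayDeque<>();
    private long sumOfActorCommandQueueLength = 0;

    /** Returns true if the commands were enqueued and counted. */
    public boolean enqueue(String[] cmds) {
        if (cmds == null || cmds.length == 0) {
            // Invalid/empty batch: skip both the queue and the metric.
            return false;
        }
        queue.add(cmds);
        sumOfActorCommandQueueLength++;
        return true;
    }

    public long metricValue() {
        return sumOfActorCommandQueueLength;
    }
}
```

Before the fix, the `cmds.length == 0` guard would be absent and every heartbeat response, however empty, would bump the counter; with it, only batches that actually carry commands are recorded.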
[jira] [Created] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
farmmamba created HDFS-17433:
---
Summary: metrics sumOfActorCommandQueueLength should only record valid commands
Key: HDFS-17433
URL: https://issues.apache.org/jira/browse/HDFS-17433
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba
[jira] [Commented] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828490#comment-17828490 ]

ASF GitHub Bot commented on HDFS-17431:
---

ayushtkn commented on code in PR #6643:
URL: https://github.com/apache/hadoop/pull/6643#discussion_r1530927655

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java:
```
@@ -628,7 +628,7 @@ public void run() {
           new RecoveryTaskContiguous(b).recover();
         }
       } catch (IOException e) {
-        LOG.warn("recover Block: {} FAILED: {}", b, e);
+        LOG.warn("recover Block: {} FAILED: ", b, e);
```

Review Comment:
What's wrong here? The number of placeholders is correct; for the second one it will invoke e.toString(), and now you are changing it to print the entire trace. I don't think it is broken; it looks like it was intentional.
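For reference on the behavior being debated: in SLF4J 1.6.0 and later, a trailing Throwable argument is treated as the log event's exception rather than as placeholder data, so a leftover `{}` stays literal in the message (which matches the `FAILED: {}` plus full stack trace seen in the quoted log). Below is a toy re-implementation of that substitution rule, a simulation for illustration only, not SLF4J's actual MessageFormatter:

```java
/** Minimal simulation of SLF4J's (>= 1.6.0) trailing-throwable rule; not the real formatter. */
public class Slf4jStyleFormat {

    public static String format(String pattern, Object... args) {
        Object[] fillers = args;
        if (args.length > 0 && args[args.length - 1] instanceof Throwable) {
            // A trailing Throwable is consumed as the log's exception and is
            // never substituted into a placeholder; any leftover "{}" stays literal.
            fillers = new Object[args.length - 1];
            System.arraycopy(args, 0, fillers, 0, fillers.length);
        }
        StringBuilder out = new StringBuilder();
        int argIdx = 0;
        int i = 0;
        while (i < pattern.length()) {
            int brace = pattern.indexOf("{}", i);
            if (brace < 0 || argIdx >= fillers.length) {
                out.append(pattern.substring(i)); // no more placeholders or args
                break;
            }
            out.append(pattern, i, brace).append(fillers[argIdx++]);
            i = brace + 2;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Throwable e = new RuntimeException("recovery id mismatch");
        // Old pattern: the throwable fills neither placeholder, so "{}" survives.
        System.out.println(format("recover Block: {} FAILED: {}", "blk_x", e)); // recover Block: blk_x FAILED: {}
        // Patched pattern: trailing throwable, no dangling placeholder.
        System.out.println(format("recover Block: {} FAILED: ", "blk_x", e));   // recover Block: blk_x FAILED:
    }
}
```

Under this rule, the patched pattern simply drops the dangling `{}` while the exception is still attached to the event, which is the cleanup the PR performs.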
[jira] [Commented] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828479#comment-17828479 ]

ASF GitHub Bot commented on HDFS-17431:
---

hadoop-yetus commented on PR #6643:
URL: https://github.com/apache/hadoop/pull/6643#issuecomment-2007841730

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 35s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 47m 17s | | trunk passed |
| +1 :green_heart: | compile | 1m 29s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 1m 20s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 18s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 29s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 8s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 46s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 30s | | trunk passed |
| +1 :green_heart: | shadedclient | 38m 29s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 17s | | the patch passed |
| +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 1m 15s | | the patch passed |
| +1 :green_heart: | compile | 1m 13s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 1m 13s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 4s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 18s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 38s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 25s | | the patch passed |
| +1 :green_heart: | shadedclient | 38m 41s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 228m 41s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 45s | | The patch does not generate ASF License warnings. |
| | | 379m 37s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6643/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6643 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 28104aeadaba 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 0ce2a9e09116ee8807a24c37e87595b52f3713da |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6643/1/testReport/ |
| Max. process+thread count | 4053 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6643/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
[jira] [Commented] (HDFS-17413) [FGL] CacheReplicationMonitor supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828425#comment-17828425 ]

ASF GitHub Bot commented on HDFS-17413:
---

hadoop-yetus commented on PR #6641:
URL: https://github.com/apache/hadoop/pull/6641#issuecomment-2007547095

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 12m 26s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ HDFS-17384 Compile Tests _ |
| +1 :green_heart: | mvninstall | 43m 55s | | HDFS-17384 passed |
| +1 :green_heart: | compile | 1m 21s | | HDFS-17384 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 1m 16s | | HDFS-17384 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 1m 13s | | HDFS-17384 passed |
| +1 :green_heart: | mvnsite | 1m 24s | | HDFS-17384 passed |
| +1 :green_heart: | javadoc | 1m 11s | | HDFS-17384 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 41s | | HDFS-17384 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 3m 17s | | HDFS-17384 passed |
| +1 :green_heart: | shadedclient | 35m 33s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 12s | | the patch passed |
| +1 :green_heart: | compile | 1m 11s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 1m 11s | | the patch passed |
| +1 :green_heart: | compile | 1m 8s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | javac | 1m 8s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 0s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 10s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 52s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 3m 15s | | the patch passed |
| +1 :green_heart: | shadedclient | 35m 11s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 229m 58s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6641/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. |
| | | 381m 28s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.datanode.TestLargeBlockReport |
| | hadoop.hdfs.tools.TestDFSAdmin |
| | hadoop.hdfs.protocol.TestBlockListAsLongs |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6641/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6641 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux dbd98b8aab75 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | HDFS-17384 / 901fff7cbf4ac90b8be0b4799ea19426eff89a20 |
| Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b
[jira] [Commented] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828325#comment-17828325 ]

ASF GitHub Bot commented on HDFS-17431:
---------------------------------------

wzk784533 commented on PR #6643:
URL: https://github.com/apache/hadoop/pull/6643#issuecomment-2007159204

   LGTM

> Fix log format for BlockRecoveryWorker#recoverBlocks
> ----------------------------------------------------
>
>                 Key: HDFS-17431
>                 URL: https://issues.apache.org/jira/browse/HDFS-17431
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> Fix log format for BlockRecoveryWorker#recoverBlocks

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828293#comment-17828293 ]

ASF GitHub Bot commented on HDFS-17431:
---------------------------------------

haiyang1987 opened a new pull request, #6643:
URL: https://github.com/apache/hadoop/pull/6643

   ### Description of PR
   https://issues.apache.org/jira/browse/HDFS-17431
   Fix log format for BlockRecoveryWorker#recoverBlocks
[jira] [Updated] (HDFS-17431) Fix log format for BlockRecoveryWorker#recoverBlocks
[ https://issues.apache.org/jira/browse/HDFS-17431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-17431:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Updated] (HDFS-17413) [FGL] CacheReplicationMonitor supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-17413:
----------------------------------
    Labels: pull-request-available  (was: )

> [FGL] CacheReplicationMonitor supports fine-grained lock
> --------------------------------------------------------
>
>                 Key: HDFS-17413
>                 URL: https://issues.apache.org/jira/browse/HDFS-17413
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>
> * addCacheDirective
> * modifyCacheDirective
> * removeCacheDirective
> * listCacheDirectives
> * addCachePool
> * modifyCachePool
> * removeCachePool
> * listCachePools
> * cacheReport
> * CacheManager
> * CacheReplicationMonitor
[jira] [Commented] (HDFS-17413) [FGL] CacheReplicationMonitor supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828241#comment-17828241 ]

ASF GitHub Bot commented on HDFS-17413:
---------------------------------------

ZanderXu opened a new pull request, #6641:
URL: https://github.com/apache/hadoop/pull/6641

   Use the FSLock to make cache-pool and cache-directive state thread safe, since clients will access or modify this information, and it has nothing to do with blocks.

   Use the BMLock to make cachedBlock state thread safe, since the related logic will access block information and modify the cache-related information of a DataNode.
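The two-lock split described in the PR comment above can be sketched roughly as follows. This is only an illustrative model, not Hadoop's actual implementation: the class name `CacheState`, the fields `fsLock`/`bmLock`, and all methods are hypothetical stand-ins for the FSLock/BMLock separation, using plain `ReentrantReadWriteLock`s.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the fine-grained locking split: cache-pool and
// cache-directive metadata is guarded by an FS-level lock, while cached-block
// state (which touches block and DataNode information) is guarded by a
// separate BM-level lock, so the two kinds of operations no longer serialize
// on a single global lock.
public class CacheState {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock bmLock = new ReentrantReadWriteLock();

  // Pool name -> owner; guarded by fsLock.
  private final Map<String, String> cachePools = new HashMap<>();
  // Cached block ids; guarded by bmLock.
  private final List<Long> cachedBlocks = new ArrayList<>();

  // Client RPCs that only touch pool/directive metadata take the FS lock.
  public void addCachePool(String name, String owner) {
    fsLock.writeLock().lock();
    try {
      cachePools.put(name, owner);
    } finally {
      fsLock.writeLock().unlock();
    }
  }

  public List<String> listCachePools() {
    fsLock.readLock().lock();
    try {
      return new ArrayList<>(cachePools.keySet());
    } finally {
      fsLock.readLock().unlock();
    }
  }

  // Paths that touch block / per-DataNode cache state take the BM lock
  // instead, so a long block-cache scan does not block pool operations.
  public void cacheBlock(long blockId) {
    bmLock.writeLock().lock();
    try {
      cachedBlocks.add(blockId);
    } finally {
      bmLock.writeLock().unlock();
    }
  }

  public int cachedBlockCount() {
    bmLock.readLock().lock();
    try {
      return cachedBlocks.size();
    } finally {
      bmLock.readLock().unlock();
    }
  }
}
```

The point of the split is that a reader of one structure (e.g. `listCachePools`) never contends with a writer of the other (e.g. `cacheBlock`), which is the benefit the FGL work is after.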
[jira] [Updated] (HDFS-17413) [FGL] CacheReplicationMonitor supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZanderXu updated HDFS-17413:
----------------------------
    Description: 
* addCacheDirective
* modifyCacheDirective
* removeCacheDirective
* listCacheDirectives
* addCachePool
* modifyCachePool
* removeCachePool
* listCachePools
* cacheReport
* CacheManager
* CacheReplicationMonitor

  was:
Client RPCs involving Cache supports fine-grained lock.
* addCacheDirective
* modifyCacheDirective
* removeCacheDirective
* listCacheDirectives
* addCachePool
* modifyCachePool
* removeCachePool
* listCachePools
[jira] [Updated] (HDFS-17413) [FGL] CacheReplicationMonitor supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZanderXu updated HDFS-17413:
----------------------------
    Summary: [FGL] CacheReplicationMonitor supports fine-grained lock  (was: [FGL] Client RPCs involving Cache supports fine-grained lock)